
AI-Generated Synthetic Data


Explained in the best possible way: with cats!

Towards Data Science

Why is AI-generated synthetic data all the rage lately? In this article, I’ll explain it my favorite way: with cats!

Let’s say I need to train a cat-not-cat classifier from scratch, but I only have one photo to work with:

The author’s cat, Huxley.

(Everything that follows is an analogy for what people do with tabular data and text data, so it applies beyond image data.)

Ideally, I’d want a dataset consisting of thousands of cat and not-cat photos. If I have a camera and plentiful access to cats, I can take a bunch of photos just like the one I already have, ensuring that I get exactly the dataset I designed:

A photograph I took in a park in Istanbul.

But what if I don’t have a camera and I live catless on the moon? I could get the photos I need from a vendor, though I’d have to be careful, since inherited data is riskier than primary data.

Thanks, Pixabay, for being a superb (free) vendor of cat photos.

But what if there’s no vendor who’ll sell me some cat photos? (Yes, running out of cat photos on the web is a situation that’s more sci-fi than living on the moon, but bear with me.)

Well, if I can’t collect them and I can’t buy them, then I’ll need to make them myself. Behold, my creation:

Your author is a veritable Michelangelo.

No good? Yeah, drawing was never my strong suit. Another way to make fake data is to duplicate existing datapoints, but this won’t be much use for providing variety to learn from.

This approach fools nobody. I’ve still only effectively got one datapoint.
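If you’d like to see that in numbers, here’s a minimal sketch. It assumes a made-up stand-in for the photo (a tuple of 16 random pixel values I’m calling `huxley`, plus a hypothetical helper `per_pixel_variance`), so nothing here comes from a real image or a particular library. It copies the datapoint 30,000 times, then checks how many distinct examples there are and how much the pixels vary across the dataset.

```python
import random

# Hypothetical stand-in for one photo: 16 pixel intensities (not a real image).
random.seed(0)
huxley = tuple(random.randint(0, 255) for _ in range(16))

# "Grow" the dataset by repeating the same datapoint 30,000 times.
dataset = [huxley] * 30_000


def per_pixel_variance(rows):
    """Population variance of each pixel position across all rows."""
    n = len(rows)
    variances = []
    for column in zip(*rows):  # one column per pixel position
        mean = sum(column) / n
        variances.append(sum((value - mean) ** 2 for value in column) / n)
    return variances


n_rows = len(dataset)
n_unique = len(set(dataset))  # distinct datapoints after removing duplicates
max_variance = max(per_pixel_variance(dataset))

print(f"rows: {n_rows}")
print(f"unique datapoints: {n_unique}")
print(f"largest per-pixel variance: {max_variance}")
```

The row count goes up, but the number of distinct examples stays at one and no pixel ever varies, so a model has nothing new to learn from the extra copies.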

It’ll be like teaching a human student by giving them the same example over and over again, so all they learn is that one thing. If my dataset is 30,000 copies of this Huxley photo…

