A guide to the various species of fake data: Part 2

If you need to work with data, what are your options? Here's the coarsest possible answer: you could get hold of real data, or you could get hold of fake data.
In my previous article, we made friends with the concept of synthetic data and discussed the thought process around creating it. We compared real data, noisy data, and handcrafted data. Let's dig into the species of synthetic data that are fancier than asking a human to pick a number, any number…
(Note: the links in this post take you to explainers by the same author.)
Duplicated data
Maybe you measured 10,000 real human heights but you'd like 20,000 datapoints. One approach is to assume that your existing dataset already represents your population fairly well. (Assumptions are always dangerous; proceed with caution.) Then you could simply duplicate the dataset, or some portion of it, using ye olde copy-paste. Ta-da! More data! But is it good and useful data? That always depends on what you need it for. For many situations, the answer would be no. But hey, there are reasons you were born with a head, and those reasons are to chew and to use your best judgment.
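Here's a minimal sketch in Python (the heights are made up with NumPy, standing in for real measurements):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(170, 10, size=10_000)  # made-up stand-in for 10,000 real heights, in cm

# Duplicated data: ye olde copy-paste, doubling the dataset to 20,000 points.
doubled = np.concatenate([heights, heights])
print(len(doubled))  # 20000
```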
Resampled data
Speaking of duplicating only a portion of your data, there's a way to inject a bit of randomness to help you determine which portion to pick. You can use a random number generator to help you choose which height to draw from your existing list of heights. You can do this "without replacement", meaning that you make at most one copy of each existing height, but…
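In Python, that's a one-liner once your data is in an array (again with made-up heights standing in for real ones):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(170, 10, size=10_000)  # made-up heights in cm

# Resampling WITHOUT replacement: each existing height is copied at most once.
sample = rng.choice(heights, size=5_000, replace=False)
```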
Bootstrapped data
You'll more often see people doing this "with replacement", meaning that each time you randomly pick a height to copy, you immediately forget you did so, which lets the same height make its way into your dataset as a second, third, fourth, etc. copy. Perhaps if there's enough interest in the comments, I'll explain why this is a powerful and effective technique (yes, it sounds like witchcraft at first; I thought so too) for population inference.
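A minimal bootstrap sketch, again with made-up heights; the classic move is to resample with replacement many times and watch how a statistic wobbles:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(170, 10, size=10_000)  # made-up heights in cm

# Bootstrapping: resample WITH replacement, over and over, to see how much
# a statistic (here, the mean) varies across hypothetical versions of the data.
boot_means = [
    rng.choice(heights, size=len(heights), replace=True).mean()
    for _ in range(1_000)
]
print(np.percentile(boot_means, [2.5, 97.5]))  # a rough 95% interval for the mean
```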
Augmented data
Augmented data might sound fancy, and there *are* fancy ways to augment data, but usually when you see this term, it means you took your resampled data and added some random noise to it. In other words, you generated a random number from a statistical distribution and, typically, you just added it to the resampled datapoint. That's it. That's the augmentation.
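For example (the noise scale below is my arbitrary choice, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(170, 10, size=10_000)  # made-up heights in cm

# Vanilla augmentation: resample, then add a little random noise to each copy.
resampled = rng.choice(heights, size=10_000, replace=True)
augmented = resampled + rng.normal(0, 1, size=resampled.shape)  # noise scale is a guess
```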
Oversampled data
Speaking of duplicating only a portion of your data, there's also a way to be intentional about boosting certain characteristics over others. Maybe you took your measurements at a typical AI conference, so female heights are underrepresented in your data (sad but true these days). That's called the problem of imbalanced data. There are techniques for rebalancing the representation of those characteristics, such as SMOTE (Synthetic Minority Oversampling TEchnique), which is pretty much what it sounds like. The most naive way to smite the problem is to simply limit your resampling to the minority datapoints, ignoring the others. So in our example, you'd just resample the female heights while ignoring the rest of the data. You could also consider more sophisticated augmentation, still limiting your efforts to the female heights.
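Here's the naive version as a sketch, with made-up numbers (SMOTE proper interpolates between neighboring minority points rather than copying them outright; the imbalanced-learn library implements it if you want the real thing):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up imbalanced dataset: lots of male heights, few female heights (cm).
male_heights = rng.normal(178, 8, size=900)
female_heights = rng.normal(165, 7, size=100)

# Naive rebalancing: resample ONLY the minority group, with replacement,
# until the two groups are the same size; the majority is left untouched.
extra = rng.choice(female_heights, size=len(male_heights) - len(female_heights), replace=True)
balanced_female = np.concatenate([female_heights, extra])
```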
If you wanted to get even fancier, you'd look up techniques like ADASYN (Adaptive Synthetic Sampling) and follow the breadcrumbs down a trail that's out of scope for a quick intro to this topic.
Edge case data
You can also make up (handcrafted) data that's totally unlike anything you (or anyone) has ever seen. This would be a very silly thing to do if you were trying to use it to create models of the real world, but it's clever if you're using it to, for example, test your system's ability to handle weird things. To get a sense of whether your model/theory/system chokes when it meets an outlier, you might make synthetic outliers on purpose. Go ahead, put in a height of 3 meters and see what explodes. Kind of like a fire drill at work. (Don't leave an actual fire in the building or an actual monster outlier in your dataset.)
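A tiny fire-drill sketch; the sanity check below is a hypothetical one I made up (272 cm is roughly the tallest human height ever recorded):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
heights = rng.normal(170, 10, size=10_000)  # made-up heights in cm

# Fire drill: slip in a deliberately absurd 3-meter height and see what explodes.
drilled = np.append(heights, 300.0)

def mean_height(h):
    # A hypothetical sanity check a height pipeline might enforce.
    if not ((h > 0) & (h < 272)).all():
        raise ValueError("implausible height detected")
    return h.mean()

mean_height(drilled)  # should raise; if it doesn't, you've learned something
```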
Simulated data
Once you're getting cozy with the idea of making data up according to your specifications, you might want to go a step further and create a recipe that describes the underlying nature of the kind of data you'd like in your dataset. If there's a random component, then what you're actually doing is simulating from a statistical distribution that lets you specify the core principles, as described by a model (which is just a fancy way of saying "a formula you're going to use as a recipe") along with a rule for how the random bits work. Instead of adding random noise to an existing datapoint the way vanilla data augmentation techniques do, you can add noise to a set of rules you came up with, either by meditating or by doing some statistical inference with a related dataset. Learn more about that here.
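A minimal simulation sketch; the recipe below (group means and spread) is an assumption I invented for illustration, not something inferred from data:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# The "recipe": heights cluster around a group-specific mean (the core
# principle), and the random bits follow a Gaussian rule.
means_cm = {"female": 165.0, "male": 178.0}  # assumed, not inferred
spread_cm = 7.0

def simulate_height(group):
    return means_cm[group] + rng.normal(0.0, spread_cm)

simulated = [simulate_height(rng.choice(["female", "male"])) for _ in range(20_000)]
```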
Heights? Wait, you're asking me for a dataset of nothing but one height at a time? How boring! How… floppy disk era of us. We call this univariate data, and it's rare to see it collected in the wild these days.
Now that we have incredible storage capacity, data can come in far more interesting and sophisticated forms. It's very cheap to grab some extra characteristics along with the heights while we're at it. We could, for example, record hairstyle, making our dataset bivariate. But why stop there? How about age too, so our data's multivariate? How fun!
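For instance, a made-up multivariate table (the column names and categories are mine, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 20_000

# One row per person: height plus two extra made-up characteristics.
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=n),
    "hairstyle": rng.choice(["buzz", "bob", "ponytail", "bald"], size=n),
    "age": rng.integers(18, 80, size=n),
})
```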
But these days we can go wild and combine all that with image data (take a photo during the height measurement) and text data (that essay they wrote about how unnecessarily boring their statistics class was). We call this multimodal data, and we can synthesize that too! If you'd like to learn more about that, let me know in the comments.
Why might someone want to make synthetic data? There are good reasons to love it and some solid reasons to avoid it like the plague (article coming soon), but if you're a data science professional, head over to this article to find out which one I think should be your favorite reason to use it often.
If you had a good time here and you're looking for a complete applied AI course designed to be fun for beginners and experts alike, here's the one I made for your amusement:
P.S. Have you ever tried hitting the clap button here on Medium more than once to see what happens? ❤️