
Machine Learning Made Intuitive


ML: all you need to know, with none of the overcomplicated math

What you might think ML is… (Photo Taken by Justin Cheigh in Billund, Denmark)

What’s Machine Learning?

Sure, the actual theory behind models like ChatGPT is admittedly very difficult, but the underlying intuition behind Machine Learning (ML) is, well, intuitive! So, what is ML?

Machine Learning allows computers to learn using data.

But what does this mean? How do computers use data? What does it mean for a computer to learn? And first of all, who cares? Let's start with the last question.

Nowadays, data is all around us, so it's increasingly important to use tools like ML, which can find meaningful patterns in data without ever being explicitly programmed to do so! In other words, by using ML we can successfully apply generic algorithms to a wide variety of problems.

There are a few main categories of Machine Learning, some of the main types being supervised learning (SL), unsupervised learning (UL), and reinforcement learning (RL). Today I'll just be describing supervised learning, though in subsequent posts I hope to elaborate more on unsupervised learning and reinforcement learning.

1 Minute SL Speedrun

Look, I get that you might not want to read this whole article. In this section I'll teach you the very basics (which for a lot of people is all you need to know!) before going into more depth in the later sections.

Supervised learning involves learning how to predict some label using different features.

Imagine you are trying to figure out a way to predict the price of diamonds using features like carat, cut, clarity, and more. Here, the goal is to learn a function that takes as input the features of a particular diamond and outputs the associated price.

Just as humans learn by example, in this case computers will do the same. To be able to learn a prediction rule, this ML agent needs "labeled examples" of diamonds, including both their features and their price. The supervision comes from the fact that you are given the label (price). In reality, it's important to consider whether your labeled examples are actually true, since it's an assumption of supervised learning that the labeled examples are "ground truth".
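To make that concrete, here is a minimal sketch of what such a setup could look like in code. The numbers and feature choices below are made up purely for illustration, and a real model would need far more data:

```python
# A minimal supervised learning sketch on made-up diamond data.
from sklearn.linear_model import LinearRegression

# Features: [carat, clarity score]; label: price in dollars (all values hypothetical)
X = [[0.5, 3], [1.0, 5], [1.5, 4], [2.0, 7]]
y = [1500, 5000, 6500, 12000]

model = LinearRegression()
model.fit(X, y)                    # learn a function from features to price
print(model.predict([[1.2, 6]]))   # predict the price of an unseen diamond
```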

Okay, now that we've gone over the most fundamental basics, we can get a bit more in depth about the whole data science/ML pipeline.

Problem Setup

Let's use an extremely relatable example, inspired by this textbook. Imagine you're stranded on an island, where the only food is a rare fruit known as "Justin-Melon". Even though you've never eaten Justin-Melon specifically, you've eaten plenty of other fruits, and you know you don't want to eat fruit that has gone bad. You also know that you can sometimes tell whether a fruit has gone bad by its color and firmness, so you extrapolate and assume this holds for Justin-Melon as well.

In ML terms, you used prior knowledge to determine two features (color, firmness) that you think will accurately predict the label (whether or not the Justin-Melon has gone bad).

But how do you know what color and what firmness correspond to the fruit being bad? Who knows? You just have to try it out. In ML terms, we need data. More specifically, we need a labeled dataset consisting of real Justin-Melons and their associated label.

Data Collection/Processing

So you spend the next couple of days eating melons and recording the color, firmness, and whether or not the melon was bad. After a few painful days of continually eating melons that have gone bad, you have the following labeled dataset:

[Table: the raw labeled melon dataset, with categorical Color, Firmness, and Label columns (code by Justin Cheigh)]

Each row is a particular melon, and each column is the value of the feature/label for the corresponding melon. But notice we have words, since the features are categorical rather than numerical.

Really, we need numbers for our computer to process. There are many techniques to convert categorical features into numerical features, ranging from one-hot encoding to embeddings and beyond.

The simplest thing we can do is turn the column "Label" into a column "Good", which is 1 if the melon is good and 0 if it's bad. For now, assume there is some methodology to put color and firmness on a scale from -10 to 10, in a way that is sensible. For bonus points, think about the assumptions of putting a categorical feature like color on such a scale. After this preprocessing, our dataset might look something like this:

[Table: the preprocessed dataset, with numeric Color and Firmness columns and a 0/1 Good column (code by Justin Cheigh)]
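As a rough sketch, the preprocessing might look something like this in pandas. The melon values and the -10 to 10 mappings below are invented for illustration:

```python
import pandas as pd

# Hypothetical raw dataset (values invented for illustration)
df = pd.DataFrame({
    "Color":    ["Green", "Yellow", "Brown", "Green"],
    "Firmness": ["Soft", "Medium", "Hard", "Medium"],
    "Label":    ["Bad", "Good", "Bad", "Good"],
})

# Turn the categorical label into a 0/1 column called "Good"
df["Good"] = (df["Label"] == "Good").astype(int)

# One possible hand-picked mapping of categories onto a -10 to 10 scale
color_scale    = {"Brown": -8, "Yellow": 0, "Green": 7}
firmness_scale = {"Soft": -9, "Medium": 1, "Hard": 8}
df["Color"]    = df["Color"].map(color_scale)
df["Firmness"] = df["Firmness"].map(firmness_scale)

print(df[["Color", "Firmness", "Good"]])
```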

We now have a labeled dataset, which means we can employ a supervised learning algorithm. Our algorithm must be a classification algorithm, since we're predicting a category: good (1) or bad (0). Classification is in contrast to regression algorithms, which predict a continuous value like the price of a diamond.

Exploratory Data Analysis

But which algorithm? There are many supervised classification algorithms, ranging in complexity from basic logistic regression to some hardcore deep learning algorithms. Well, let's first take a look at our data by doing some exploratory data analysis (EDA):

[Plot: the melons in feature space, colored by label (code by Justin Cheigh)]

The above image is a plot of the feature space; we have two features, and we're simply putting each example onto a plot with the two axes being the two features. Moreover, we make the point purple if the associated melon was good, and yellow if it was bad. Clearly, with just a little bit of EDA, there's an obvious answer!
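If you wanted to reproduce a plot like this yourself, a rough matplotlib sketch (assuming the preprocessed DataFrame from the earlier sketch) could look like:

```python
import matplotlib.pyplot as plt

# Assumes df has numeric "Color" and "Firmness" columns and a 0/1 "Good" column
point_colors = df["Good"].map({1: "purple", 0: "yellow"})

plt.scatter(df["Color"], df["Firmness"], c=point_colors, edgecolors="black")
plt.xlabel("Color")
plt.ylabel("Firmness")
plt.title("Justin-Melon feature space")
plt.show()
```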

[Plot: the same feature space, with a red circle drawn around the good melons (code by Justin Cheigh)]

We should probably classify all points inside the red circle as good melons, while ones outside of the circle should be classified as bad melons. Intuitively, this makes sense! For instance, you don't want a melon that's rock solid, but you also don't want it to be absurdly squishy. Rather, you want something in between, and the same is probably true about color as well.

We determined that we would want a decision boundary that is a circle, but this was just based off of preliminary data visualization. How would we systematically determine this? This is especially relevant in larger problems, where the answer is not so simple. Imagine hundreds of features. There is no way to visualize a 100-dimensional feature space in any reasonable way.

What are we learning?

The first step is to define your model. There are tons of classification models. Since each has its own set of assumptions, it's important to try to make a good choice. To emphasize this, I'll start by making a really bad choice.

One intuitive idea is to make a prediction by weighing each of the factors:

Good = w1 * Color + w2 * Firmness

For instance, suppose our parameters w1 and w2 are 2 and 1, respectively. Also assume our input Justin-Melon is one with Color = 4 and Firmness = 6. Then our prediction is Good = (2 x 4) + (1 x 6) = 14.

Our classification (14) is not even one of the valid options (0 or 1). That's because this is actually a regression algorithm. In fact, it's a simple case of the simplest regression algorithm: linear regression.

So, let's turn this into a classification algorithm. One simple way would be this: use linear regression and classify as 1 if the output is greater than a bias term b. In fact, we can simplify by adding a constant term to our model, in such a way that we classify as 1 if the output is greater than 0.

In math, let PRED = w1 * Color + w2 * Firmness + b. Then we get:

Good = 1 if PRED ≥ 0, and Good = 0 otherwise

This is certainly better, as we're at least performing a classification, but let's make a plot with PRED on the x-axis and our classification on the y-axis:

[Plot: classification (0 or 1) as a step function of PRED (code by Justin Cheigh)]

This is a bit extreme. A slight change in PRED could change the classification entirely. One solution is to have the output of our model represent the probability that the Justin-Melon is good, which we can do by smoothing out the curve:

[Plot: a smooth sigmoid curve of probability against PRED (code by Justin Cheigh)]

This is a sigmoid curve (or a logistic curve). So, instead of taking PRED and applying the piecewise activation (Good if PRED ≥ 0), we can apply the sigmoid activation function to get a smoothed-out curve like the one above. Overall, our logistic model looks like this:

P(Good) = σ(w1 * Color + w2 * Firmness + b)

Here, σ represents the sigmoid activation function. Great, so we have our model, and we just need to figure out which weights and biases are best! This process is known as training.
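In code, the whole model is only a few lines. Here's a minimal NumPy sketch with arbitrary, untrained parameters, just to show the mechanics:

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def predict_proba(color, firmness, w1, w2, b):
    # the logistic model: probability that the melon is good
    pred = w1 * color + w2 * firmness + b
    return sigmoid(pred)

# Arbitrary (untrained) parameters, purely for illustration
print(predict_proba(color=4, firmness=6, w1=2.0, w2=1.0, b=-10.0))
```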

Training the Model

Great, so all we need to do is figure out which weights and biases are best! But this is far easier said than done. There are an infinite number of possibilities, and what does "best" even mean?

We start with the latter question: what is best? Here's one simple yet powerful answer: the optimal weights are the ones that get the highest accuracy on our training set.

So, we just need to figure out an algorithm that maximizes accuracy. However, mathematically it's easier to minimize something. In other words, rather than defining a value function, where a higher value is "better", we prefer to define a loss function, where lower loss is better. Although people typically use something like binary cross entropy for (binary) classification loss, we'll just use a simple example: minimize the number of points classified incorrectly.
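For the curious, here is a sketch of both losses on a small batch of hypothetical predictions; none of the numbers are specific to our melons:

```python
import numpy as np

def zero_one_loss(y_true, y_prob, threshold=0.5):
    # our simple loss: the number of points classified incorrectly
    y_pred = (y_prob >= threshold).astype(int)
    return np.sum(y_pred != y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # the loss people typically use for binary classification
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])            # hypothetical labels
y_prob = np.array([0.9, 0.2, 0.4, 0.8])    # hypothetical model outputs
print(zero_one_loss(y_true, y_prob), binary_cross_entropy(y_true, y_prob))
```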

To do this, we use an algorithm known as gradient descent. At a very high level, gradient descent works like a nearsighted skier trying to get down a mountain. An important property of a good loss function (and one that our crude loss function actually lacks) is smoothness. If you were to plot our parameter space (parameter values and associated loss on the same plot), the plot would look like a mountain.

So, we first start with random parameters, and therefore we likely start with a bad loss. Like a skier trying to go down the mountain as fast as possible, the algorithm looks in every direction, trying to find the steepest way to go (i.e. how to change the parameters in order to lower the loss the most). But the skier is nearsighted, so they only look a little bit in each direction. We iterate this process until we end up at the bottom (keen-eyed readers may notice we might actually end up at a local minimum). At this point, the parameters we end up with are our trained parameters.
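Here is a bare-bones sketch of that loop for our logistic model, using binary cross entropy (since, as noted, our count-the-mistakes loss isn't smooth enough to take gradients of). The data, learning rate, and iteration count are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Placeholder data: rows are [color, firmness], y holds the 0/1 labels
X = np.array([[7.0, 1.0], [-8.0, -9.0], [0.0, 1.0], [6.0, 8.0]])
y = np.array([1, 0, 1, 0])

w = np.zeros(2)   # start from some initial parameters
b = 0.0
lr = 0.1          # learning rate: how far the nearsighted skier steps

for _ in range(1000):
    p = sigmoid(X @ w + b)            # current predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient of binary cross entropy w.r.t. w
    grad_b = np.mean(p - y)           # gradient w.r.t. b
    w -= lr * grad_w                  # step downhill
    b -= lr * grad_b

print(w, b)   # the trained parameters
```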

Once you train your logistic regression model, you realize your performance is still really bad, and that your accuracy is only around 60% (barely better than guessing!). This is because we're violating one of the model assumptions. Logistic regression can mathematically only produce a linear decision boundary, but we knew from our EDA that the decision boundary should be circular!

With this in mind, you try different, more complex models, and you get one that achieves 95% accuracy! You now have a fully trained classifier capable of differentiating between good Justin-Melons and bad Justin-Melons, and you can finally eat all the tasty fruit you want!
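One way to "try a more complex model" in practice is to give logistic regression features that can express a circle (for example, squared terms), or to switch to something like an RBF-kernel SVM. Here's a rough scikit-learn sketch, assuming the preprocessed DataFrame from the earlier sketches:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = df[["Color", "Firmness"]].values   # assumes the preprocessed DataFrame from above
y = df["Good"].values

# Plain logistic regression: only a linear boundary, so it struggles here
linear_model = LogisticRegression().fit(X, y)

# Adding squared features lets the boundary be a circle or ellipse
circle_model = make_pipeline(
    PolynomialFeatures(degree=2), StandardScaler(), LogisticRegression()
).fit(X, y)

# An RBF-kernel SVM can also learn curved boundaries
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)

print(linear_model.score(X, y), circle_model.score(X, y), svm_model.score(X, y))
```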

Conclusion

Let's take a step back. In around 10 minutes, you learned a lot about machine learning, including what is essentially the entire supervised learning pipeline. So, what's next?

Well, that's for you to decide! For some, this article was enough to get a high-level picture of what ML actually is. For others, this article may leave a lot of questions unanswered. That's great! Perhaps this curiosity will lead you to explore the topic further.

For instance, in the data collection step we assumed that you would just eat a ton of melons for a few days, without really taking any specific features into account. That doesn't quite make sense. If you ate a green, mushy Justin-Melon and it made you violently ill, you would probably stay away from similar melons. In reality, you'd learn through experience, updating your beliefs as you go. This framework is more similar to reinforcement learning.

And what if you knew that one bad Justin-Melon could kill you instantly, and that it was too dangerous to ever try one without being sure? Without these labels, you couldn't perform supervised learning. But maybe there's still a way to gain insight without labels. This framework is more similar to unsupervised learning.

In future blog posts, I hope to expand on reinforcement learning and unsupervised learning in a similar way.

Thanks for Reading!
