The Math Behind “The Curse of Dimensionality”
What’s the curse of dimensionality?
The notion of distance in high dimensions
The maths: the n-ball
Curse of dimensionality, overfitting, and Occam’s Razor
Blessing of dimensionality?
Conclusion

Dive into the “Curse of Dimensionality” concept and understand the mathematics behind all of the surprising phenomena that arise in high dimensions.

Towards Data Science
Image from Dall-E

Within the realm of machine learning, handling high-dimensional vectors isn’t just common; it’s essential. This is illustrated by the architecture of popular models like Transformers. For instance, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes and to better capture complex patterns in the data. Given that our brains struggle to visualize anything beyond 3 dimensions, the use of 768-dimensional vectors is quite mind-blowing!

While some Machine and Deep Learning models excel in these high-dimensional scenarios, they also present many challenges. In this article, we will explore the concept of the “curse of dimensionality”, explain some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss their general implications for your Machine Learning models.

Note that detailed mathematical proofs related to this article are available on my website as a supplementary extension.

People often assume that geometric concepts familiar in three dimensions behave similarly in higher-dimensional spaces. This is not the case. As dimension increases, many interesting and counterintuitive phenomena arise. The “Curse of Dimensionality” is a term coined by Richard Bellman (a famous mathematician) that refers to all these surprising effects.

What is so special about high dimensions is how the “volume” of the space (we will explore that in more detail soon) grows exponentially. Take a graduated line (in one dimension) from 1 to 10. There are 10 integers on this line. Extend that to 2 dimensions: it is now a square with 10 × 10 = 100 points with integer coordinates. Now consider “only” 80 dimensions: you would already have 10⁸⁰ points, which is roughly the number of atoms in the observable universe.

In other words, as the dimension increases, the volume of the space grows exponentially, leading to data becoming increasingly sparse.

High-dimensional spaces are “empty”

Consider this other example. We want to calculate the farthest distance between two points in a unit hypercube (where each side has a length of 1):

  • In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
  • In 2 dimensions (the hypercube forms a square), the maximum distance is the distance between the opposite corners [0,0] and [1,1], which is √2, calculated using the Pythagorean theorem.
  • Extending this idea to n dimensions, the distance between the points [0,0,…,0] and [1,1,…,1] is √n. This formula arises because each additional dimension adds 1² to the sum under the square root (again by the Pythagorean theorem).

Interestingly, as the number of dimensions n increases, the largest distance within the hypercube grows only at an O(√n) rate. This illustrates a diminishing-returns effect, where adding dimensions yields proportionally smaller gains in spatial distance. More details on this effect and its implications will be explored in the following sections of this article.
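As a quick sanity check, this growth rate can be verified directly with NumPy (a minimal sketch; the chosen dimensions are arbitrary):

import numpy as np

# Corner-to-corner distance of the unit hypercube [0, 1]^n, which equals sqrt(n).
for n in (1, 2, 3, 10, 100, 768):
    diagonal = np.linalg.norm(np.ones(n) - np.zeros(n))
    print(f"n={n:4d}  diagonal={diagonal:.3f}  sqrt(n)={np.sqrt(n):.3f}")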

Let’s dive deeper into the notion of distance we started exploring in the previous section.

We had our first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this really mean, and can we mathematically visualize this phenomenon?

Let’s consider an experiment, using the same n-dimensional unit hypercube we defined before. First, we generate a dataset by randomly sampling many points in this cube: we effectively simulate a multivariate uniform distribution. Then, we sample another point (a “query” point) from that distribution and observe the distance to its nearest and farthest neighbors in our dataset.

Here is the corresponding Python code.

import numpy as np

def generate_data(dimension, num_points):
    '''Generate random data points within [0, 1] for each coordinate in the given dimension.'''
    data = np.random.rand(num_points, dimension)
    return data

def neighbors(data, query_point):
    '''Return the distances to the nearest and farthest points in data from query_point.'''
    nearest_distance = float('inf')
    farthest_distance = 0
    for point in data:
        distance = np.linalg.norm(point - query_point)
        if distance < nearest_distance:
            nearest_distance = distance
        if distance > farthest_distance:
            farthest_distance = distance
    return nearest_distance, farthest_distance

We can also plot these distances:
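Here is a minimal sketch of the experiment loop behind the figure (the list of dimensions, the number of sampled points, and the use of matplotlib are illustrative choices, not the exact setup of the original experiment):

import matplotlib.pyplot as plt

dimensions = [2, 10, 50, 100, 500, 1000]
nearest, farthest = [], []
for d in dimensions:
    data = generate_data(d, num_points=1000)
    query_point = np.random.rand(d)
    near, far = neighbors(data, query_point)
    nearest.append(near)
    farthest.append(far)

# Log-log plot of the nearest and farthest distances as the dimension grows.
plt.plot(dimensions, nearest, label="nearest neighbor")
plt.plot(dimensions, farthest, label="farthest neighbor")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("dimension n")
plt.ylabel("distance")
plt.legend()
plt.show()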

Distances to the nearest and farthest points as n increases (Image by the author)

Using a log scale, we observe that the relative difference between the distance to the nearest and farthest neighbors tends to decrease as the dimension increases.

This is a very unintuitive behavior: as explained in the previous section, points are very far from each other because of the exponentially increasing volume of the space, but at the same time, the relative differences between the distances of points become smaller.
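This matches a classical way of stating the phenomenon (summarized here as a known result, not proven in this article): for points sampled i.i.d. from such a distribution, the relative contrast between the farthest and nearest distances to a query point vanishes as the dimension grows,

$$\frac{d_{\max} - d_{\min}}{d_{\min}} \;\longrightarrow\; 0 \quad \text{as } n \to \infty \text{ (in probability, under mild assumptions on the distribution)},$$

where d_min and d_max denote the distances from the query point to its nearest and farthest neighbors.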

The notion of nearest neighbors vanishes

This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, this poses problems for Machine Learning algorithms that rely solely on distances, such as kNN.

We will now discuss some other interesting phenomena. For this, we’ll need the n-ball. An n-ball is the generalization of a ball to n dimensions. The n-ball of radius R is the collection of points at a distance of at most R from the center of the space, 0.

Let’s consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk bounded by the unit circle, whose equation is x² + y² ≤ 1. The 3-ball (what we usually call a “ball”) has the equation x² + y² + z² ≤ 1. As you can see, we can extend that definition to any dimension: the n-ball of radius R is the set of points (x₁, …, xₙ) such that x₁² + x₂² + ⋯ + xₙ² ≤ R².

The question now is: what is the volume of this ball? This is not an easy question and requires quite a lot of maths, which I won’t detail here. However, you can find all the details on my website, in my post about the volume of the n-ball.

After a lot of fun (integral calculus), you can prove that the volume of the n-ball can be expressed as follows, where Γ denotes the Gamma function.
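Writing V_n(R) for the volume of the n-ball of radius R, the standard closed-form result is

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2} + 1\right)}\, R^n.$$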

For example, with n = 2, the volume is πR²/Γ(2) = πR², since Γ(2) = 1. This is indeed the “volume” of the 2-ball (also known as the “area” of a disk in this case).

However, beyond being an interesting mathematical challenge, the volume of the n-ball also has some very surprising properties.

As the dimension n increases, the volume of the n-ball converges to 0.

This is true for every radius, but let’s visualize this phenomenon with a few values of R.

Volume of the n-ball for different radii as the dimension increases (Image by the author)

As you can see, the volume does not simply converge to 0: it first increases and then decreases towards 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n that reaches the maximum shifts to the right as R increases.

Here are the first values of the volume of the unit n-ball, up to n = 10.

Volume of the unit n-ball for different values of n (Image by the author)
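To reproduce these values, here is a minimal sketch based on the closed-form formula above (the use of SciPy’s gamma function is an implementation choice, not something prescribed by the article):

import numpy as np
from scipy.special import gamma

def n_ball_volume(n, radius=1.0):
    '''Volume of the n-ball of the given radius, using the closed-form formula.'''
    return (np.pi ** (n / 2) / gamma(n / 2 + 1)) * radius ** n

# First values of the volume of the unit n-ball, up to n = 10.
for n in range(1, 11):
    print(f"n={n:2d}  volume={n_ball_volume(n):.4f}")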

The volume of a high-dimensional unit ball is concentrated near its surface.

For small dimensions, the volume of a ball looks quite “homogeneous”: this is not the case in high dimensions.

A spherical shell

Let’s consider an n-ball with radius R and another with radius R − dR, where dR is very small. The portion of the n-ball between these 2 balls is called a “shell” and corresponds to the portion of the ball near its surface (see the visualization above in 3D). We can compute the ratio of the volume of the inner ball (radius R − dR) to the volume of the entire ball.
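Since the volume formula scales as Rⁿ, this ratio has a simple closed form:

$$\frac{V_n(R - dR)}{V_n(R)} = \left(\frac{R - dR}{R}\right)^n = \left(1 - \frac{dR}{R}\right)^n \;\longrightarrow\; 0 \quad \text{as } n \to \infty.$$

For R = 1 and dR = 0.05, this gives 0.95⁵⁰ ≈ 0.077 at n = 50, which matches the 92.3% figure quoted just below.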

Ratio of the inner ball volume to the total volume as n increases (Image by the author)

As we can see, it converges very quickly to 0: almost all of the volume is near the surface in high-dimensional spaces. For instance, for R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This is again related to the distortion of the concept of distance we saw earlier.

Note that the volume of the hypercube [−1, 1]ⁿ, which circumscribes the unit ball, is 2ⁿ. The unit ball is essentially “empty” in very high dimensions, while this hypercube keeps exponentially more volume, concentrated in its corners. Again, this shows how the concept of a “nearest neighbor” of a point loses its effectiveness: there is almost no point within a distance R of a query point q when n is large.
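To make this concrete, the fraction of the cube [−1, 1]ⁿ occupied by the inscribed unit ball is

$$\frac{V_n(1)}{2^n} = \frac{\pi^{n/2}}{2^n \,\Gamma\!\left(\frac{n}{2} + 1\right)},$$

which is about 52% for n = 3 but already below 0.25% for n = 10: a point sampled uniformly in the cube almost never falls inside the ball.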

The curse of dimensionality is closely related to the overfitting principle. Because of the exponential growth of the volume of the space with the dimension, we need very large datasets to adequately capture and model high-dimensional patterns. Even worse: the number of samples we need grows exponentially with the dimension to overcome this limitation. This scenario, characterized by many features yet relatively few data points, is particularly prone to overfitting.

Occam’s Razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality plays a role) since it encourages reducing model complexity.

Applying Occam’s Razor in high-dimensional scenarios can mean reducing the dimensionality of the problem itself (via methods like PCA, feature selection, etc.), thereby mitigating some effects of the curse of dimensionality. Simplifying the model’s structure or the feature space helps in managing the sparse data distribution and in making distance metrics more meaningful again. For instance, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANN (Approximate Nearest Neighbors) search, have also emerged as a way to deal with high-dimensional scenarios.
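As a minimal sketch of this workflow (the synthetic dataset, the number of PCA components, and the number of neighbors below are arbitrary illustrative choices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Synthetic high-dimensional data: 1,000 points in 768 dimensions, with labels
# driven by only a few of those dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))
y = (X[:, :10].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Reduce to 20 dimensions with PCA before running kNN on the compressed representation.
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))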

Image by Dall-E

While we’ve outlined the challenges of high-dimensional settings in machine learning, there are also some benefits!

  • High dimensions can enhance linear separability, making techniques like kernel methods more effective.
  • Moreover, deep learning architectures are particularly adept at navigating and extracting complex patterns from high-dimensional spaces.

As always with Machine Learning, this is a trade-off: leveraging these advantages involves balancing the increased computational demands with potential gains in model performance.

Hopefully, this gives you an idea of how “weird” geometry can be in high dimensions and of the many challenges it poses for machine learning model development. We saw how, in high-dimensional spaces, data is very sparse but also tends to be concentrated in the corners, and how distances lose their usefulness. For a deeper dive into the n-ball and the mathematical proofs, I encourage you to visit the extended version of this article on my website.

While the “curse of dimensionality” outlines significant limitations in high-dimensional spaces, it’s exciting to see how modern deep learning models are increasingly adept at navigating these complexities. Consider embedding models or the latest LLMs, for example, which use very high-dimensional vectors to discern and model textual patterns more effectively.
