Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features
Consider a dataset, the product of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.
When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large; this may happen, for instance, if you are creating lots of features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow larger than what you would expect.
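As a quick aside, here is a minimal sketch of that blow-up (the columns and values are invented purely for illustration, and it assumes a recent scikit-learn that accepts the sparse_output argument): three categorical columns turn into eleven binary features.

# a rough sketch: a few categorical columns become many binary features after one-hot encoding
# (the data below is invented for illustration)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "country": ["US", "DE", "JP", "BR"],
    "browser": ["chrome", "firefox", "safari", "edge"],
    "device": ["phone", "tablet", "desktop", "phone"],
})
encoded = OneHotEncoder(sparse_output=False).fit_transform(df)
print(df.shape[1], "->", encoded.shape[1])  # 3 -> 11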
The phenomenon is real. As the number of dimensions N grows, with everything else staying the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points really does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced by these changes.
Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient: you need more trials to cover the same intervals along any axis. A subtler effect impacts any model based on distance or vicinity: as the number of dimensions grows to very large values, if you consider any point in your observations, all the other points appear to be far away and somehow nearly equidistant; since these models rely on distance to do their job, the leveling out of distance differences makes their job much harder. E.g. clustering does not work as well if all points appear to be nearly equidistant.
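To put a number on the "nearly equidistant" effect, here is a small sketch (not part of the main experiment) that draws random points in d dimensions and compares the nearest and farthest distance from a reference point; the relative gap shrinks as d grows.

# a small sketch of distance concentration: the relative spread between the
# nearest and the farthest point shrinks as the number of dimensions grows
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))
    ref = rng.random(d)
    dist = np.linalg.norm(points - ref, axis=1)
    print(d, round((dist.max() - dist.min()) / dist.min(), 3))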
For all these reasons, and more, techniques such as PCA, LDA, etc. have been created in an effort to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.
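For context, the reduction step itself is usually a one-liner; a rough sketch with scikit-learn's PCA, where X is just placeholder data standing in for any numeric feature matrix:

# a sketch: distill a wide feature matrix X down to a handful of components
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 200)          # placeholder data: 500 observations, 200 features
X_reduced = PCA(n_components=10).fit_transform(X)
print(X_reduced.shape)                # (500, 10)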
It is hard to get an intuitive sense of the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely difficult to visualize, so let's do some simple 2D visualizations to help our intuition. There is a geometric basis for why dimensionality can become a problem, and that is what we will visualize here. If you have not seen this before, the results may be surprising: the geometry of high-dimensional spaces is far more complex than typical intuition is likely to suggest.
Consider a square of size 1, centered at the origin. Within the square, you inscribe a circle.
That’s the setup in 2 dimensions. Now think of the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we’ll refer to these objects as “sphere” and “cube”, no matter how many dimensions they have.
The volume of the cube is fixed: it’s always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?
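For reference, in 2 dimensions the inscribed circle has area π · 0.5² ≈ 0.785, and in 3 dimensions the inscribed sphere has volume (4/3) · π · 0.5³ ≈ 0.524, so the fraction of the cube filled by the sphere is already shrinking.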
Let’s answer the question experimentally, using the Monte Carlo method. We’ll generate a very large number of points, distributed uniformly but randomly within the cube. For each point we calculate its distance to the origin; if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.
If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better as the total number of points grows.
In other words, the ratio
inside_points / total_points
will approximate the volume of the sphere.
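To make the idea concrete, here is the simplest 2D version of the estimate as a NumPy sketch (separate from the main code below); the result should land near the circle’s area of roughly 0.785.

# a minimal 2D version: the fraction of random points inside the inscribed circle
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1_000_000, 2)) - 0.5            # uniform points in the square, centered at the origin
inside = np.sqrt((points ** 2).sum(axis=1)) <= 0.5   # distance to the origin at most the radius
print(inside.mean())                                 # close to 0.785, the area of the circle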
The code is quite straightforward. Since we need many points, explicit loops must be avoided. We could use NumPy, but it’s CPU-only and single-threaded, so it would be slow. Potential alternatives: CuPy (GPU), Jax (CPU or GPU), PyTorch (CPU or GPU), etc. We’ll use PyTorch, but the NumPy code would look almost identical.
If you follow the nested torch
statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio
array will end up containing the volume of the sphere for different numbers of dimensions.
The tunable parameters are set for a GPU with 24 GB of memory; adjust them if your hardware is different.
import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)
for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    # fraction of points whose distance to the origin is at most 0.5 (inside the sphere)
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )
    # clean up memory
    torch.cuda.empty_cache()
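As an optional sanity check (not part of the original experiment), the exact volume of a d-dimensional sphere of radius r is π^(d/2) · r^d / Γ(d/2 + 1), so the estimates can be compared against SciPy’s gamma function:

# optional check: exact d-sphere volume (radius 0.5) vs. the Monte Carlo estimates
from scipy.special import gamma

exact = np.array([np.pi ** (d / 2) * 0.5 ** d / gamma(d / 2 + 1) for d in range(1, d_max + 1)])
print(np.abs(exact - ratio).max())   # should be a small number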
Let’s visualize the results: