

A Comprehensive Overview of Gaussian Splatting
Table of contents:
TL;DR
Representing a 3D world
Image formation model & rendering
Optimization
View-dependent colors with SH
Limitations
Where to play with it
Acknowledgments
References

Moreover, Gaussian splatting doesn’t involve any neural network at all. There isn’t even a small MLP, nothing “neural”; a scene is essentially just a set of points in space. This in itself is already an attention grabber. It is quite refreshing to see such a method gain popularity in our AI-obsessed world, with research labs chasing models made of ever more billions of parameters. The idea stems from “Surface Splatting”³ (2001), so it sets a cool example that classic computer vision approaches can still inspire relevant solutions. Its simple and explicit representation makes Gaussian splatting particularly interpretable, a good reason to choose it over NeRFs for some applications.

Representing a 3D world

As mentioned earlier, in Gaussian splatting a 3D world is represented with a set of 3D points, in fact, millions of them, in the ballpark of 0.5–5 million. Each point is a 3D Gaussian with its own unique parameters that are fitted per scene so that renders of this scene match the known dataset images as closely as possible. The optimization and rendering processes will be discussed later, so let’s focus for a moment on the necessary parameters.

Figure 2: Centers of Gaussians (means) [Source: taken from Dynamic 3D Gaussians⁴]

Each 3D Gaussian is parametrized by:

  • Mean μ interpretable as location x, y, z;
  • Covariance Σ;
  • Opacity σ(𝛼), where a sigmoid function maps the raw parameter 𝛼 to the [0, 1] interval;
  • Color parameters, either 3 values for (R, G, B) or spherical harmonics (SH) coefficients.
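To make this parametrization concrete, here is a minimal sketch of how such a point set could be stored as optimizable tensors (the names and shapes are illustrative, not the official implementation’s):

```python
import torch
import torch.nn as nn

class GaussianCloud(nn.Module):
    """A minimal, illustrative container for N 3D Gaussians."""
    def __init__(self, num_points: int, sh_degree: int = 3):
        super().__init__()
        n_sh = (sh_degree + 1) ** 2  # number of SH coefficients per color channel
        self.means = nn.Parameter(torch.randn(num_points, 3))           # μ: x, y, z
        self.log_scales = nn.Parameter(torch.zeros(num_points, 3))      # diagonal of S, kept in log-space
        self.quaternions = nn.Parameter(
            torch.tensor([1.0, 0.0, 0.0, 0.0]).repeat(num_points, 1))   # rotation R as a quaternion
        self.opacity_logits = nn.Parameter(torch.zeros(num_points, 1))  # raw 𝛼 before the sigmoid
        self.sh_coeffs = nn.Parameter(torch.zeros(num_points, 3, n_sh)) # view-dependent color

    def opacities(self) -> torch.Tensor:
        return torch.sigmoid(self.opacity_logits)  # maps the raw parameter to [0, 1]
```

Keeping scales in log-space and opacities as logits leaves the raw parameters unconstrained, which plays well with gradient descent.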

Two groups of parameters need further discussion here: the covariance matrix and SH. There’s a separate section dedicated to the latter. As for the covariance, it’s chosen to be anisotropic by design, that is, not isotropic. In practice, it means that a 3D point can be an ellipsoid rotated and stretched along any direction in space. This could have required 9 parameters; however, they can’t be optimized directly because a covariance matrix has a physical meaning only if it is positive semi-definite. Using gradient descent for optimization makes it hard to pose such constraints on the matrix directly, which is why it is instead factorized as follows:

Σ = R S Sᵀ Rᵀ

Such a factorization is known as the eigendecomposition of a covariance matrix and can be understood as a configuration of an ellipsoid where:

  • S is a diagonal scaling matrix with 3 parameters for scale;
  • R is a 3×3 rotation matrix analytically expressed from a quaternion (4 parameters).
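As a sketch of what this factorization looks like in code (assuming unit-norm quaternions in (w, x, y, z) order; the function names are illustrative):

```python
import torch

def quaternion_to_rotation(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternions (N, 4), ordered (w, x, y, z), into rotation matrices (N, 3, 3)."""
    q = q / q.norm(dim=-1, keepdim=True)
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

def covariance_matrices(log_scales: torch.Tensor, quaternions: torch.Tensor) -> torch.Tensor:
    """Σ = R S Sᵀ Rᵀ — positive semi-definite by construction, whatever the raw parameters are."""
    R = quaternion_to_rotation(quaternions)      # (N, 3, 3)
    S = torch.diag_embed(torch.exp(log_scales))  # (N, 3, 3) diagonal scaling
    M = R @ S
    return M @ M.transpose(-1, -2)
```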

The beauty of using Gaussians lies in the two-fold impact of each point. On one hand, each point effectively represents a limited area in space close to its mean, according to its covariance. On the other hand, it has a theoretically infinite extent, meaning that each Gaussian is defined over the whole 3D space and can be evaluated for any point. This is great because during optimization it allows gradients to flow from long distances.⁴

The impact of a 3D Gaussian i on an arbitrary 3D point p in 3D is defined as follows:

Figure 3: The influence of a 3D Gaussian i on a point p in 3D [Source: Image by the author]
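Written out, the influence is the point’s opacity times an unnormalized Gaussian kernel:

f_i(p) = σ(𝛼_i) · exp( −½ (p − μ_i)ᵀ Σ_i⁻¹ (p − μ_i) )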

This equation looks almost like the probability density function of a multivariate normal distribution, except that the normalization term with the determinant of the covariance is dropped and the result is weighted by the opacity instead.
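As a tiny, self-contained sketch of this evaluation for a single Gaussian (illustrative names):

```python
import torch

def gaussian_influence(p, mean, cov, opacity_logit):
    """f_i(p): opacity-weighted, unnormalized Gaussian evaluated at a 3D point p."""
    d = p - mean                          # (3,)
    maha = d @ torch.linalg.inv(cov) @ d  # squared Mahalanobis distance
    return torch.sigmoid(opacity_logit) * torch.exp(-0.5 * maha)

# The influence decays smoothly with distance from the mean but never reaches exactly zero:
print(gaussian_influence(torch.tensor([0.5, 0.0, 0.0]),
                         torch.zeros(3), torch.eye(3), torch.tensor(2.0)))
```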

Image formation model

Given a set of 3D points, possibly the most interesting part is to see how they can be used for rendering. You might already be familiar with the point-wise 𝛼-blending used in NeRF. It turns out that NeRFs and Gaussian splatting share the same image formation model. To see this, let’s take a little detour and revisit the volumetric rendering formula given in NeRF² and many of its follow-up works (1). We will also rewrite it using simple transitions (2):
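C(p) = Σ_{i=1..N} T_i · (1 − exp(−σ_i δ_i)) · c_i,    T_i = exp( −Σ_{j<i} σ_j δ_j )        (1)

Denoting 𝛼_i = 1 − exp(−σ_i δ_i), the same sum can be rewritten as

C(p) = Σ_{i=1..N} T_i · 𝛼_i · c_i,    T_i = Π_{j<i} (1 − 𝛼_j)        (2)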

You can refer to the NeRF paper for the definitions of σ and δ, but conceptually this can be read as follows: the color of an image pixel p is approximated by integrating over samples along the ray going through this pixel. The final color is a weighted sum of the colors of the 3D points sampled along this ray, down-weighted by transmittance. With this in mind, let’s finally look at the image formation model of Gaussian splatting:
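C(p) = Σ_{i∈N} c_i · f_i^{2D}(p) · Π_{j<i} (1 − f_j^{2D}(p))        (3)

where N is the set of Gaussians overlapping the pixel p, sorted by depth.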

Indeed, formulas (2) and (3) are almost identical. The only difference is how 𝛼 is computed in the two. Nevertheless, this small discrepancy turns out to be extremely significant in practice and leads to drastically different rendering speeds. In fact, it is the foundation of the real-time performance of Gaussian splatting.

To understand why this is the case, we need to understand what f^{2D} means and what computational demands it poses. This function is simply a projection of the f(p) we saw in the previous section into 2D, i.e. onto the image plane of the camera being rendered. Both a 3D point and its projection are multivariate Gaussians, so the impact of a projected 2D Gaussian on a pixel can be computed using the same formula as the impact of a 3D Gaussian on other points in 3D (see Figure 3). The only difference is that the mean μ and covariance Σ must be projected into 2D, which is done using the derivations from EWA splatting⁵.

Means in 2D can be trivially obtained by projecting the vector μ in homogeneous coordinates (with an extra 1 coordinate) onto the image plane using an intrinsic camera matrix K and an extrinsic camera matrix W = [R|t]:

This can also be written in a single line as follows:
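μ^{2D} = ( K · W · [μ, 1]ᵀ )_z        (4)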

Here the “z” subscript stands for z-normalization. The covariance in 2D is defined using the Jacobian of (4), J:
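Σ^{2D} = J · W · Σ · Wᵀ · Jᵀ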

The whole process remains differentiable, which is of course crucial for optimization.
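A rough sketch of both projections for a single Gaussian, written with torch so that gradients can flow through it (f_x, f_y are the focal lengths taken from K; the paper’s low-pass filter and clipping details are omitted):

```python
import torch

def project_gaussian(mean, cov, R, t, K):
    """Project a 3D Gaussian (mean, cov) into 2D for a camera with rotation R, translation t, intrinsics K."""
    # Mean: move to camera space, apply the intrinsics, then the perspective divide ("z-normalization").
    m_cam = R @ mean + t
    uv_hom = K @ m_cam
    mean_2d = uv_hom[:2] / uv_hom[2]

    # Covariance: Σ2D = J W Σ Wᵀ Jᵀ, where W acts through its rotation part R
    # and J is the Jacobian of the perspective projection at the point's location.
    fx, fy = K[0, 0], K[1, 1]
    x, y, z = m_cam
    zero = torch.zeros_like(z)
    J = torch.stack([
        torch.stack([fx / z, zero, -fx * x / z**2]),
        torch.stack([zero, fy / z, -fy * y / z**2]),
    ])
    cov_2d = J @ R @ cov @ R.T @ J.T  # (2, 2)
    return mean_2d, cov_2d
```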

Rendering

Formula (3) tells us how to get the color of a single pixel. To render a whole image, it is still necessary to traverse all H×W rays, just like in NeRF; however, the process is much more lightweight because:

  • For a given camera, f(p) of each 3D point can be projected into 2D in advance, before iterating over pixels. This way, when a Gaussian is blended for a few nearby pixels, we don’t have to re-project it over and over again.
  • There is no MLP to be inferenced H·W·P times for a single image; 2D Gaussians are blended onto the image directly.
  • There is no ambiguity about which 3D points to evaluate along the ray and no need to choose a ray sampling strategy. The set of 3D points overlapping the ray of each pixel (see N in (3)) is discrete and fixed after optimization.
  • A pre-processing sorting stage is done once per frame, on a GPU, using a custom implementation of differentiable CUDA kernels.

The conceptual difference can be seen in Figure 4:

Figure 4: A conceptual difference between NeRF and GS. Left: query a continuous MLP along the ray. Right: blend a discrete set of Gaussians relevant to the given ray. [Source: Image by the author]

The sorting algorithm mentioned above is one of the contributions of the paper. Its purpose is to prepare for color rendering with formula (3): sorting the 3D points by depth (proximity to the image plane) and grouping them by tiles. The first is required to compute transmittance, and the latter allows the weighted sum for each pixel to be limited to 𝛼-blending of only the relevant 3D points (or their 2D projections, to be more specific). The grouping is achieved using simple 16×16 pixel tiles and is implemented such that a Gaussian can land in a few tiles if it overlaps more than a single view frustum. Thanks to the sorting, the rendering of each pixel can be reduced to 𝛼-blending of pre-ordered points from the tile the pixel belongs to.

Figure 5: View frustums, each corresponding to a 16×16 image tile. Colors have no special meaning. The result of the sorting algorithm is a subset of 3D points within each tile, sorted by depth. [Source: Based on the plots from here]
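With the points of a tile already sorted by depth, the per-pixel work boils down to front-to-back 𝛼-blending, essentially formula (3) evaluated on a short list. A simplified sketch, ignoring the tiling machinery:

```python
import torch

def blend_pixel(colors, alphas):
    """Front-to-back 𝛼-blending of depth-sorted 2D Gaussians covering a single pixel.
    colors: (N, 3) RGB per Gaussian; alphas: (N,) opacity-weighted 2D Gaussian values at this pixel."""
    pixel = torch.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):              # points arrive closest-first thanks to the sort
        pixel = pixel + transmittance * a * c     # T_i · 𝛼_i · c_i
        transmittance = transmittance * (1.0 - a)
        if transmittance < 1e-4:                  # the pixel is effectively opaque, stop early
            break
    return pixel
```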

Optimization

A naive question might come to mind: how is it even possible to get a decent-looking image from a bunch of blobs in space? And indeed, it is true that if the Gaussians are not optimized properly, you will get all kinds of pointy artifacts in the renders. In Figure 6 you can observe an example of such artifacts; they look quite literally like ellipsoids. The key to getting good renders is a combination of 3 components: good initialization, differentiable optimization, and adaptive densification.

Figure 6: An example of renders of an under-optimized scene [Source: Image by the author]

The initialization refers to the parameters of the 3D points set at the start of training. For point locations (means), the authors propose to use a point cloud produced by SfM (Structure from Motion), see Figure 7. The logic is that for any 3D reconstruction, be it with GS, NeRF, or something more classic, you need to know the camera matrices, so you will probably run SfM anyway to obtain those. Since SfM produces a sparse point cloud as a by-product, why not use it for initialization? So that is what the paper suggests. When a point cloud is not available for whatever reason, a random initialization can be used instead, at the risk of a possible loss of final reconstruction quality.

Figure 7: A sparse 3D point cloud produced by SfM, used for initialization of the means [Source: Taken from here]

Covariances are initialized to be isotropic; in other words, the 3D points start as spheres. The radii are set based on the mean distances to neighboring points so that the 3D world is nicely covered and has no “holes”.
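A possible way to compute such initial radii from the SfM point cloud, as a toy O(N²) version (the official code uses a custom kNN CUDA kernel instead):

```python
import torch

def initial_log_scales(points: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Isotropic initialization: each point's radius is the mean distance to its k nearest neighbors."""
    d = torch.cdist(points, points)            # (N, N) pairwise distances
    knn, _ = d.topk(k + 1, largest=False)      # the closest "neighbor" is the point itself
    mean_dist = knn[:, 1:].mean(dim=1).clamp_min(1e-7)
    return mean_dist.log()[:, None].expand(-1, 3)  # same log-scale for x, y and z
```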

After initialization, plain Stochastic Gradient Descent is used to fit everything properly. The scene is optimized with a loss function that is a combination of L1 and D-SSIM (structural dissimilarity index measure) between a ground truth view and the current render.
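With λ = 0.2 as in the paper, the loss could be written like this (dssim stands for any D-SSIM implementation, e.g. 1 − SSIM from an off-the-shelf library):

```python
import torch

def gs_loss(render: torch.Tensor, gt: torch.Tensor, dssim, lam: float = 0.2) -> torch.Tensor:
    """L = (1 − λ) · L1 + λ · D-SSIM between the current render and the ground truth view."""
    l1 = (render - gt).abs().mean()
    return (1.0 - lam) * l1 + lam * dssim(render, gt)
```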

However, that is not all; another crucial part remains, and that is adaptive densification. It is launched once in a while during training, say, every 100 SGD steps, and its purpose is to handle under- and over-reconstruction. It is important to emphasize that SGD on its own can only adjust the existing points. It would struggle to find good parameters in areas that lack points altogether or have too many of them. That is where adaptive densification comes in, splitting points with large gradients (Figure 8) and removing points that have converged to very low values of 𝛼 (if a point is that transparent, why keep it?).

Figure 8: Adaptive densification. A toy example of fitting a bean shape that we would like to render with a few points. [Source: Taken from [1]]
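In pseudocode, one densification step could look roughly like this. The container methods (mean_gradient_norms, clone_points, split_points, remove_points) are hypothetical helpers, and the thresholds echo the paper’s defaults; the real logic has more details:

```python
import torch

def densify_and_prune(gaussians, grad_threshold=0.0002, min_opacity=0.005, size_threshold=0.01):
    """One simplified adaptive-densification step: grow where gradients are large, prune what is transparent."""
    grads = gaussians.mean_gradient_norms()      # assumed helper: accumulated positional gradients per point
    large_grad = grads > grad_threshold

    small = gaussians.scales().max(dim=1).values < size_threshold
    gaussians.clone_points(large_grad & small)   # under-reconstruction: duplicate small Gaussians
    gaussians.split_points(large_grad & ~small)  # over-reconstruction: split big Gaussians in two

    transparent = gaussians.opacities().squeeze(-1) < min_opacity
    gaussians.remove_points(transparent)         # if a point is that transparent, why keep it?
```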

View-dependent colors with SH

Spherical harmonics, SH for short, play a significant role in computer graphics and were first proposed as a way to learn a view-dependent color of discrete 3D voxels in Plenoxels⁶. View dependence is a nice-to-have property that improves the quality of renders, since it allows the model to represent non-Lambertian effects, e.g. specularities of metallic surfaces. However, it is definitely not a must, since it is possible to make a simplification, choose to represent color with 3 RGB values, and still use Gaussian splatting, as was done in [4]. That is why we are reviewing this representation detail separately, after the whole method has been laid out.

SH are special functions defined on the surface of a sphere. In other words, you can evaluate such a function for any point on the sphere and get a value. All of these functions are derived from a single formula by choosing a non-negative integer ℓ and an integer m with −ℓ ≤ m ≤ ℓ, one (ℓ, m) pair per SH:
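Y_ℓ^m(θ, φ) = √( (2ℓ + 1)/(4π) · (ℓ − m)!/(ℓ + m)! ) · P_ℓ^m(cos θ) · e^{imφ}

where P_ℓ^m are the associated Legendre polynomials; in practice the real-valued combinations of these functions are used for color.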

While a bit intimidating at first, for small values of ℓ this formula simplifies significantly. In fact, for ℓ = 0, Y ≈ 0.282, just a constant over the whole sphere. On the contrary, higher values of ℓ produce more complex surfaces. The theory tells us that spherical harmonics form an orthonormal basis, so every function defined on a sphere can be expressed through SH.

That is why the idea of expressing view-dependent color goes like this: let’s limit ourselves to a certain degree of freedom ℓ_max and say that each color (red, green, and blue) is a linear combination of the SH functions of degree up to ℓ_max. For every 3D Gaussian, we want to learn the right coefficients so that when we look at this 3D point from a certain direction it conveys a color as close as possible to the ground truth one. The whole process of obtaining a view-dependent color can be seen in Figure 9.

Figure 9: The process of obtaining the view-dependent color (red component) of a point with ℓ_max = 2 and 9 learned coefficients. A sigmoid function maps the value into the [0, 1] interval. Oftentimes, clipping is used instead. [Source: Image by the author]
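As a sketch of this lookup, with the real SH basis hard-coded up to ℓ = 1 only for brevity (the constants follow one common convention and signs differ between implementations; the full ℓ_max = 2 case from Figure 9 would use 9 basis values per channel):

```python
import torch

def sh_color(coeffs: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """View-dependent RGB from SH coefficients, truncated at degree ℓ = 1.
    coeffs: (3, 4) learned coefficients per channel; direction: (3,) viewing direction."""
    x, y, z = direction / direction.norm()
    basis = torch.stack([
        torch.full_like(x, 0.2820948),  # ℓ = 0: a constant over the whole sphere
        0.4886025 * y,                  # ℓ = 1 terms
        0.4886025 * z,
        0.4886025 * x,
    ])
    return torch.sigmoid(coeffs @ basis)  # sigmoid maps each channel into [0, 1]
```

In the full model the same idea simply uses more basis functions, and the direction is the vector from the camera to the Gaussian’s mean.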

Limitations

Despite the overall great results and the impressive rendering speed, the simplicity of the representation comes at a price. The most significant consideration is the set of regularization heuristics that are introduced during optimization to guard the model against “broken” Gaussians: points that are too big, too long, redundant, etc. This part is crucial, and the mentioned issues can be further amplified in tasks beyond novel view rendering.

The choice to step away from a continuous representation in favor of a discrete one means that the inductive bias of MLPs is lost. In NeRFs, an MLP performs an implicit interpolation and smooths out possible inconsistencies between the given views, while 3D Gaussians are more sensitive to them, leading back to the issue described above.

Additionally, Gaussian splatting is not free from some well-known artifacts present in NeRFs, which both methods inherit from the shared image formation model: lower quality in less-seen or unseen regions, floaters near the image plane, etc.

The file size of a checkpoint is another property to keep in mind, even though novel view rendering is far from being deployed to edge devices. Considering the ballpark number of 3D points and the MLP architectures of popular NeRFs, both take the same order of magnitude of disk space, with GS being just a few times heavier on average.

Where to play with it

No blog post can do a method justice as well as just running it and seeing the results for yourself. Here is where you can play around:

  • gaussian-splatting — the official implementation with custom CUDA kernels;
  • nerfstudio — yes, Gaussian splatting in nerfstudio. This is a framework originally dedicated to NeRF-like models, but since December ’23 it also supports GS;
  • threestudio-3dgs — an extension for threestudio, another cross-model framework. You should use this one if you are interested in generating 3D models from a prompt rather than learning from an existing set of images;
  • UnityGaussianSplatting — if Unity is your thing, you can port a trained model into this plugin for visualization;
  • gsplat — a library for CUDA-accelerated rasterization of Gaussians that branched out of nerfstudio. It can be used in independent torch-based projects as a differentiable module for splatting.

Have a good time!

Acknowledgments

This blog post is based on a group meeting in the lab of Dr. Tali Dekel. Special thanks go to Michal Geyer for the discussions of the paper and to the authors of [4] for their coherent summary of Gaussian splatting.

References

  1. Kerbl, B., Kopanas, G., Leimkühler, T., & Drettakis, G. (2023). 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH 2023.
  2. Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2020). NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. ECCV 2020.
  3. Zwicker, M., Pfister, H., van Baar, J., & Gross, M. (2001). Surface Splatting. SIGGRAPH 2001.
  4. Luiten, J., Kopanas, G., Leibe, B., & Ramanan, D. (2023). Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. International Conference on 3D Vision.
  5. Zwicker, M., Pfister, H., van Baar, J., & Gross, M. (2001). EWA Volume Splatting. IEEE Visualization 2001.
  6. Yu, A., Fridovich-Keil, S., Tancik, M., Chen, Q., Recht, B., & Kanazawa, A. (2022). Plenoxels: Radiance Fields without Neural Networks. CVPR 2022.
