## A large breakthrough in scene representation

As we have seen with methods like DeepSDF [2] and SRNs [4], encoding 3D objects and scenes inside the weights of a feed-forward neural network yields a memory-efficient, implicit representation of 3D data that is both accurate and high-resolution. However, the approaches we have seen so far are not quite capable of capturing realistic and complex scenes with sufficient fidelity. Rather, discrete representations (e.g., triangle meshes or voxel grids) produce a more accurate representation, assuming a sufficient allocation of memory.

This changed with the proposal of Neural Radiance Fields (NeRFs) [1], which use a feed-forward neural network to model a continuous representation of scenes and objects. The representation used by NeRFs, called a radiance field, is a bit different from prior proposals. Specifically, NeRFs map a five-dimensional coordinate (i.e., spatial location and viewing direction) to a volume density and view-dependent RGB color. By accumulating this density and appearance information across different viewpoints and locations, we can render photorealistic, novel views of a scene.

Like SRNs [4], NeRFs can be trained using only a set of images (along with their associated camera poses) of an underlying scene. Compared to prior approaches, NeRF renderings are better both qualitatively and quantitatively. Notably, NeRFs can even capture complex effects such as view-dependent reflections on an object’s surface. By modeling scenes implicitly within the weights of a feed-forward neural network, *we match the accuracy of discrete scene representations without prohibitive memory costs*.

**why is this paper important?** This post is part of my series on deep learning for 3D shapes and scenes. NeRFs were a revolutionary proposal in this area, as they allow incredibly accurate 3D reconstructions of a scene from arbitrary viewpoints. The quality of scene representations produced by NeRFs is incredible, as we will see throughout the rest of this post.

Many of the background concepts needed to grasp NeRFs have been covered in prior posts on this topic, including:

- Feed-forward neural networks
- Representing 3D objects
- Problems with discrete representations

We only need to cover a few more background concepts before going over how NeRFs work.

Instead of directly using `[x, y, z]` coordinates as input to a neural network, NeRFs convert each of these coordinates into higher-dimensional positional embeddings. We’ve discussed positional embeddings in previous posts on the transformer architecture, as positional embeddings are needed to provide a notion of token ordering and position to self-attention modules.

Put simply, positional embeddings take a scalar number as input (e.g., a coordinate value or an index representing position in a sequence) and produce a higher-dimensional vector as output. We can either learn these embeddings during training or use a fixed function to generate them. For NeRFs, we use the function shown above, which takes a scalar `p` as input and produces a `2L`-dimensional position encoding as output.
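To make this concrete, the encoding function from [1] maps a scalar `p` to the vector `[sin(2^0 πp), cos(2^0 πp), …, sin(2^{L-1} πp), cos(2^{L-1} πp)]`. A minimal NumPy sketch of this function might look like:

```python
import numpy as np

def positional_encoding(p, L=10):
    """Map a scalar coordinate p to a 2L-dimensional encoding,
    following the gamma(p) function from the NeRF paper."""
    freqs = (2.0 ** np.arange(L)) * np.pi  # frequencies: 2^0 * pi, ..., 2^(L-1) * pi
    return np.concatenate([np.sin(freqs * p), np.cos(freqs * p)])

# Encode a single coordinate value with L = 10 frequency bands
encoding = positional_encoding(0.5, L=10)
print(encoding.shape)  # (20,) -- i.e., 2L dimensions
```

In [1], each of the three spatial coordinates uses `L = 10` and each viewing-direction coordinate uses `L = 4`.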

There are a few other (possibly) unfamiliar terms that we may encounter in this overview. Let’s quickly clarify them now.

**end-to-end training.** If we say that a neural architecture can be learned “end-to-end”, this just means that all components of the system are differentiable. As a result, when we compute the output for some data and apply our loss function, we can differentiate through the entire system (i.e., end-to-end) and train it with gradient descent!

Not all systems can be trained end-to-end. For instance, if we’re modeling tabular data, we might perform a feature extraction process (e.g., one-hot encoding), then train a machine learning model on top of those features. Because the feature extraction process is hand-crafted and not differentiable, we cannot train the system end-to-end!

**Lambertian reflectance.** This term was completely unfamiliar to me prior to reading about NeRFs. Lambertian reflectance refers to how reflective an object’s surface is. If an object has a matte surface whose appearance doesn’t change when viewed from different angles, we say this object is Lambertian. On the other hand, a “shiny” object that reflects light differently based on the angle from which it’s viewed would be called non-Lambertian.

The high-level process for rendering scene viewpoints with NeRFs proceeds as follows:

- Generate samples of 3D points and viewing directions for a scene using a ray marching approach.
- Provide the points and viewing directions as input to a feed-forward neural network to produce color and density output.
- Perform volume rendering to accumulate colors and densities from the scene into a 2D image.
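As a concrete sketch of the first step, here is a simplified NumPy version (my own illustration, not the paper’s code) of the stratified sampling NeRFs use to pick 3D points along a camera ray between near and far bounds:

```python
import numpy as np

def sample_points_along_ray(origin, direction, near=2.0, far=6.0, n_samples=64):
    """Stratified sampling: partition [near, far] into evenly spaced bins
    and draw one random depth from each bin, then march along the ray."""
    bins = np.linspace(near, far, n_samples + 1)
    lower, upper = bins[:-1], bins[1:]
    depths = lower + (upper - lower) * np.random.rand(n_samples)
    # Each 3D point is origin + depth * direction
    points = origin[None, :] + depths[:, None] * direction[None, :]
    return points, depths

origin = np.zeros(3)                   # camera center
direction = np.array([0.0, 0.0, 1.0])  # unit viewing direction
points, depths = sample_points_along_ray(origin, direction)
print(points.shape)  # (64, 3)
```

The randomness within each bin means the network is evaluated at different continuous depths across training iterations, rather than at a fixed grid.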

We’ll now explain each component of this process in more detail.

**radiance fields.** As mentioned before, NeRFs model a 5D, vector-valued (i.e., the function outputs multiple values) function called a radiance field. The input to this function is an `[x, y, z]` spatial location and a 2D viewing direction. The viewing direction has two dimensions, corresponding to the two angles that can be used to represent a direction in 3D space; see below.

In practice, the viewing direction is just represented as a 3D cartesian unit vector.
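For example, a pair of viewing angles (a polar angle `theta` and azimuthal angle `phi`; the naming convention here is mine, the paper simply calls the unit vector `d`) can be converted to a Cartesian unit vector as follows:

```python
import numpy as np

def angles_to_unit_vector(theta, phi):
    """Convert two viewing angles (polar theta, azimuthal phi)
    into a 3D Cartesian unit vector."""
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

d = angles_to_unit_vector(np.pi / 4, 0.0)
print(np.linalg.norm(d))  # ~1.0 (always a unit vector)
```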

The output of this function has two components: volume density and color. The color is just an RGB value. However, this value is view-dependent, meaning that the color output might change given a different viewing direction as input! This property allows NeRFs to capture reflections and other view-dependent appearance effects. In contrast, volume density depends only on spatial location and captures opacity (i.e., how much light accumulates as it passes through that position).

**the neural network.** In [1], we model radiance fields with a feed-forward neural network, which takes a 5D input and is trained to produce the corresponding color and volume density as output; see above. Recall, however, that color is view-dependent and volume density is not. To account for this, we first pass the input 3D coordinate through several feed-forward layers, which produce both the volume density and a feature vector as output. This feature vector is then concatenated with the viewing direction and passed through an additional feed-forward layer to predict the view-dependent RGB color; see below.
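The two-branch structure can be sketched as follows. This is an illustrative, untrained toy network in NumPy with made-up layer widths, not the exact 8-layer architecture from [1]; the input sizes (60 and 24) correspond to positionally encoded coordinates with `L = 10` and `L = 4`:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Randomly initialized weights, for illustration only (an untrained network).
W1 = rng.normal(size=(60, 128))         # input: positionally encoded [x, y, z]
W2 = rng.normal(size=(128, 128))
W_sigma = rng.normal(size=(128, 1))     # density head: depends on position only
W_rgb = rng.normal(size=(128 + 24, 3))  # color head: features + encoded direction

def radiance_field(encoded_xyz, encoded_dir):
    """Position branch predicts density plus a feature vector; the viewing
    direction is only concatenated in for the final color prediction."""
    h = relu(relu(encoded_xyz @ W1) @ W2)
    sigma = relu(h @ W_sigma)  # volume density, constrained >= 0
    rgb = 1.0 / (1.0 + np.exp(-np.concatenate([h, encoded_dir]) @ W_rgb))  # in [0, 1]
    return sigma, rgb

sigma, rgb = radiance_field(rng.normal(size=60), rng.normal(size=24))
print(sigma.shape, rgb.shape)  # (1,) (3,)
```

Because density is computed before the viewing direction is ever seen, the network cannot "cheat" by making opacity view-dependent; only color can vary with the viewpoint.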

**volume rendering (TL;DR).** Volume rendering is too complex a subject to cover in depth here, but we should know the following:

- It can produce an image of an underlying scene from samples of discrete data (e.g., color and density values).
- It’s differentiable.

For those interested in more details on volume rendering, check out the explanation here and Section 4 of [1].
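Concretely, the discrete volume rendering equation in Section 4 of [1] sums per-sample colors weighted by opacity and accumulated transmittance. A minimal NumPy sketch for a single ray:

```python
import numpy as np

def render_ray(sigmas, rgbs, depths):
    """Discrete volume rendering, following Eq. 3 of the NeRF paper:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, where
    T_i = exp(-sum_{j<i} sigma_j * delta_j) is the transmittance."""
    deltas = np.diff(depths, append=1e10)    # distances between adjacent samples
    alphas = 1.0 - np.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance: probability that light reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * rgbs).sum(axis=0)  # final RGB for this ray

sigmas = np.array([0.0, 5.0, 0.1])  # mostly-opaque middle sample
rgbs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
depths = np.array([2.0, 3.0, 4.0])
print(render_ray(sigmas, rgbs, depths))  # dominated by the green sample
```

Note that every operation here (`exp`, `cumprod`, weighted sums) is differentiable, which is exactly why the whole pipeline can be trained with gradient descent.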

**the big picture.** NeRFs use the feed-forward network to generate relevant information about a scene’s geometry and appearance along many different camera rays (i.e., lines in 3D space moving from a particular camera viewpoint out into the scene along a certain direction), then use volume rendering to aggregate this information into a 2D image.

Notably, each of these components is differentiable, which means we can train the whole system end-to-end! Given a set of images with corresponding camera poses, we can train a NeRF to generate novel scene viewpoints by simply generating/rendering known viewpoints and using (stochastic) gradient descent to minimize the error between the NeRF’s output and the actual image; see below.

**a few extra details.** We now understand most of the components of a NeRF. However, the approach that we’ve described so far is actually shown in [1] to be inefficient and generally bad at representing scenes. To improve the model, we can:

- Replace spatial coordinates (for both the spatial location and the viewing direction) with positional embeddings.
- Adopt a hierarchical sampling approach for volume rendering.

By using positional embeddings, we map the feed-forward network’s inputs (i.e., the spatial location and viewing direction coordinates) to a higher dimension. Prior work showed that such an approach, as opposed to using spatial or directional coordinates as input directly, allows neural networks to better model high-frequency (i.e., quickly changing) features of a scene [5]. This makes the quality of the NeRF’s output much better; see below.

The hierarchical sampling approach used by NeRFs makes the rendering process more efficient by only sampling (and passing through the feed-forward neural network) locations and viewing directions that are likely to impact the final rendering result. This way, we only evaluate the neural network where needed and avoid wasting computation on empty or occluded areas.
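A simplified version of this idea treats the rendering weights produced by a first, "coarse" pass as a probability distribution over depth, then draws extra "fine" samples where that distribution is concentrated. The sketch below is my own simplification (nearest-bin inverse-transform sampling rather than the paper’s exact piecewise-linear version):

```python
import numpy as np

def sample_fine_depths(coarse_depths, weights, n_fine=128, rng=None):
    """Treat normalized coarse rendering weights as a probability
    distribution over depth bins and draw more samples where the
    weight (i.e., visible scene content) is concentrated."""
    rng = rng if rng is not None else np.random.default_rng(0)
    pdf = weights / weights.sum()
    cdf = np.cumsum(pdf)
    u = rng.random(n_fine)            # uniform samples in [0, 1)
    idx = np.searchsorted(cdf, u)     # invert the CDF
    return coarse_depths[np.clip(idx, 0, len(coarse_depths) - 1)]

coarse_depths = np.linspace(2.0, 6.0, 64)
weights = np.exp(-((coarse_depths - 4.0) ** 2))  # pretend the surface sits near depth 4
fine = sample_fine_depths(coarse_depths, weights)
print(fine.min(), fine.max())  # samples cluster around depth 4.0
```

In [1] this yields two networks trained jointly: the coarse network decides where to look, and the fine network is evaluated densely only around visible surfaces.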

NeRFs are trained to represent only a single scene at a time and are evaluated across several datasets with synthetic and real objects.

As shown in the table above, NeRFs outperform alternatives like SRNs [4] and LLFF [6] by a large, quantitative margin. Beyond quantitative results, it’s really informative to look visually at the outputs of a NeRF compared to alternatives. First, we can immediately tell that using positional encodings and modeling colors in a view-dependent manner is really important; see below.

One improvement we’ll immediately notice is that NeRFs, because they model colors in a view-dependent fashion, can capture complex reflections (i.e., non-Lambertian effects) and view-dependent patterns in a scene. Plus, NeRFs are capable of modeling intricate features of underlying geometries with surprising precision; see below.

The quality of NeRF scene representations is most evident when they are viewed as a video. As can be seen in the video below, NeRFs model the underlying scene with impressive accuracy and consistency between different viewpoints.

For more examples of the photorealistic scene viewpoints that can be generated with NeRFs, I highly recommend checking out the project website linked here!

As we can see in the evaluation, NeRFs were a massive breakthrough in scene representation quality. As a result, the technique gained a lot of popularity within the artificial intelligence and computer vision research communities. The potential applications of NeRFs (e.g., virtual reality, robotics, etc.) are nearly countless due to the quality of their scene representations. The main takeaways are listed below.

**NeRFs capture complex details.** With NeRFs, we’re able to capture fine-grained details within a scene, such as the rigging material on a ship; see above. Beyond geometric details, NeRFs can also handle non-Lambertian effects (i.e., reflections and changes in color based on viewpoint) due to their modeling of color in a view-dependent manner.

**we need smart sampling.** All of the approaches to modeling 3D scenes that we have seen so far use neural networks to model a function over 3D space. These neural networks are typically evaluated at every spatial location and orientation within the volume of space being considered, which can be quite expensive if not handled properly. For NeRFs, we use a hierarchical sampling approach that only evaluates regions likely to impact the final, rendered image, which drastically improves sample efficiency. Similar approaches are adopted by prior work; e.g., ONets [3] use an octree-based hierarchical sampling approach to extract object representations more efficiently.

**positional embeddings are great.** So far, most of the scene representation methods we have seen pass coordinate values directly as input to feed-forward neural networks. With NeRFs, we see that positionally embedding these coordinates works much better. Specifically, mapping coordinates to a higher dimension seems to allow the neural network to capture high-frequency variations in scene geometry and appearance, which makes the resulting scene renderings much more accurate and consistent across views.

**still saving memory.** NeRFs implicitly model a continuous representation of the underlying scene. This representation can be evaluated at arbitrary precision and has a fixed memory cost: we just need to store the parameters of the neural network! As a result, NeRFs yield accurate, high-resolution scene representations without using a ton of memory.

“Crucially, our method overcomes the prohibitive storage costs of discretized voxel grids when modeling complex scenes at high-resolutions.”

— from [1]

**limitations.** Despite significantly advancing the state of the art, NeRFs are not perfect; there is still room for improvement in representation quality. The main limitation of NeRFs, however, is that they only model a single scene at a time and are expensive to train (i.e., roughly two days on a single GPU for each scene). It will be interesting to see how future advances in this area can find more efficient methods of generating NeRF-quality scene representations.

Thanks so much for reading this article. I’m Cameron R. Wolfe, Director of AI at Rebuy and PhD student at Rice University. I study the empirical and theoretical foundations of deep learning. You can also check out my other writings on medium! If you liked this post, please follow me on twitter or subscribe to my Deep (Learning) Focus newsletter, where I help readers build a deeper understanding of topics in deep learning research via understandable overviews of popular papers.

[1] Mildenhall, Ben, et al. “Nerf: Representing scenes as neural radiance fields for view synthesis.” *Communications of the ACM* 65.1 (2021): 99–106.

[2] Park, Jeong Joon, et al. “Deepsdf: Learning continuous signed distance functions for shape representation.” *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*. 2019.

[3] Mescheder, Lars, et al. “Occupancy networks: Learning 3d reconstruction in function space.” *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*. 2019.

[4] Sitzmann, Vincent, Michael Zollhöfer, and Gordon Wetzstein. “Scene representation networks: Continuous 3d-structure-aware neural scene representations.” *Advances in Neural Information Processing Systems* 32 (2019).

[5] Rahaman, Nasim, et al. “On the spectral bias of neural networks.” *International Conference on Machine Learning*. PMLR, 2019.

[6] Mildenhall, Ben, et al. “Local light field fusion: Practical view synthesis with prescriptive sampling guidelines.” *ACM Transactions on Graphics (TOG)* 38.4 (2019): 1–14.