2. .dvc files
.dvc files use the YAML 1.2 file format, a human-friendly data serialization format for all programming languages.
As I mentioned above, DVC creates one lightweight .dvc file for every file or folder tracked with DVC.
If you take a peek at the contents of images.dvc, you will notice several entries.
The most interesting part is md5. MD5 is a popular hashing function. It takes a file of arbitrary size and uses its contents to produce a fixed-length string of characters (32 characters in our case).
These characters may seem random, but they will always be the same no matter how many times you rerun the hashing function on the file. However, if even a single bit changes in the file, the resulting hash will be completely different.
DVC uses these hashes (also called checksums) to determine whether two files are identical, completely different, or different versions of the same file.
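The properties described above can be checked with Python's standard hashlib module. This is only a minimal sketch of MD5 itself, not of how DVC computes or stores its checksums; the helper name md5_hex and the sample bytes are made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the contents of a tracked file.
original = b"pretend these bytes are an image file"

# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0b00000001]) + original[1:]

first = md5_hex(original)
second = md5_hex(original)  # same content, hashed a second time
changed = md5_hex(modified)

print(len(first))        # 32 hexadecimal characters
print(first == second)   # True: same content, same hash
print(first == changed)  # False: one flipped bit, different hash
```

The digest length stays at 32 characters regardless of input size, which is why a small .dvc file can stand in for an arbitrarily large asset.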
For example, if I add a new fake image to the images folder, the resulting MD5 hash inside images.dvc will be different.
As mentioned earlier, you should track all .dvc files with Git so that changes to large assets become part of your Git commits and history.
$ git add images.dvc
Learn more about how .dvc files work from this page of the DVC user guide.
3. DVC cache
When you call dvc add on a large asset, it gets copied into a special directory called the DVC cache, located under .dvc/cache.
The cache is where DVC keeps a pristine record of your data and models at different versions. The .dvc files in the current working directory may point to the latest or any other version of the large assets, but the cache contains all the different states the assets have been in since you started tracking them with DVC.
For example, let’s say you added a 1 GB data.csv file to DVC. By default, the file will exist both in your workspace and inside the .dvc/cache folder, taking up twice as much space (2 GB).
Any subsequent changes tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.
So, you might already be asking: isn’t this highly inefficient? And the answer would be yes! At least for single files, but we will see how to mitigate this problem in the next section.
As for folders, it’s a bit different.
When you track different versions of folders with dvc add dirname, DVC is smart enough to detect only the files that changed inside that directory. This means that unless you update every file in the directory, DVC will cache only the new versions of the changed files, which won’t take up much space.
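To see why unchanged files in a folder cost no extra space, here is a toy content-addressed store in Python. It is a simplified in-memory model under my own assumptions, not DVC's actual implementation: the ToyCache class, its methods, and the file names are all hypothetical.

```python
import hashlib
from typing import Dict


class ToyCache:
    """In-memory stand-in for a content-addressed cache (hypothetical, not DVC's code)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}  # hash -> file content

    def add_file(self, content: bytes) -> str:
        """Store content under its MD5 hash and return the hash."""
        digest = hashlib.md5(content).hexdigest()
        # Content already stored under this hash is not stored again.
        self.objects.setdefault(digest, content)
        return digest

    def add_dir(self, files: Dict[str, bytes]) -> Dict[str, str]:
        """Store every file of a folder and return a path -> hash listing."""
        return {path: self.add_file(data) for path, data in sorted(files.items())}

    def size(self) -> int:
        """Total number of bytes held by the cache."""
        return sum(len(data) for data in self.objects.values())


cache = ToyCache()

# Version 1 of a folder: three files of 1,000 bytes each.
images_v1 = {"a.png": b"a" * 1000, "b.png": b"b" * 1000, "c.png": b"c" * 1000}
listing_v1 = cache.add_dir(images_v1)
size_after_v1 = cache.size()

# Version 2 of the same folder: only b.png changes.
images_v2 = {**images_v1, "b.png": b"B" * 1000}
listing_v2 = cache.add_dir(images_v2)
size_after_v2 = cache.size()

print(size_after_v1)  # 3000
print(size_after_v2)  # 4000: only the changed file was stored again
```

In this model, a single large file behaves like a folder with one entry: any change produces a new hash, so the whole file is stored again, which matches the data.csv example above.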
In summary, think of the DVC cache as a counterpart to Git’s staging area.
Learn more about the cache and internal DVC files from this user guide section.