Data Version Control for the Modern Data Scientist: 7 DVC Concepts You Can’t Ignore

2. .dvc files

.dvc files use the YAML 1.2 file format, which is a human-friendly data serialization format for all programming languages.

As I mentioned above, DVC creates one lightweight .dvc file for every file or folder tracked with DVC.

If you take a peek inside images.dvc, you will see the following entries:

Image by me: the contents of images.dvc

The most interesting part is md5. MD5 is a popular hashing function. It takes a file of arbitrary size and uses its contents to produce a string of fixed length (32 hexadecimal characters in our case).

These characters may look random, but they will always be the same no matter how many times you rerun the hashing function on the same file. However, if even a single bit changes in the file, the resulting hash will be completely different.

DVC uses these hashes (also called checksums) to tell whether two files are identical or different. When the hash recorded for a tracked path changes, DVC knows it is dealing with a different version of that file.

For example, if I add a new fake image to the images folder, the resulting MD5 hash inside images.dvc will be different:
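The two properties described above can be checked with Python's standard hashlib module. This is only a sketch of how MD5 behaves, not DVC's own code; the byte string and the md5_hex helper are made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents standing in for a real image.
original = b"pixel data of a hypothetical image file"

# Make a copy that differs from the original by exactly one bit.
modified = bytearray(original)
modified[0] ^= 0b00000001

first = md5_hex(original)
again = md5_hex(original)
changed = md5_hex(bytes(modified))

print(len(first))        # 32
print(first == again)    # True: same content, same hash
print(first == changed)  # False: one flipped bit, different hash
```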

Image by me: images.dvc after adding the new image

As mentioned earlier, you should track all .dvc files with Git so that changes to large assets become part of your Git commits and history.

$ git add images.dvc

Discover more about how .dvc files work from this page of the DVC user guide.

3. DVC cache

When you call dvc add on a large asset, it gets copied into a special directory called the DVC cache, located under .dvc/cache.

The cache is where DVC keeps a pristine record of your data and models at different versions. The .dvc files in the current working directory may point to the latest or some other version of the large assets, but the cache will contain all the different states the assets have been in since you started tracking them with DVC.

For example, let’s say you added a 1 GB data.csv file to DVC. By default, the file will be both in your workspace and inside the .dvc/cache folder, taking up twice as much space (2 GB).

Image by me

Any subsequent changes tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.

So, you might already be asking: isn’t this highly inefficient? And the answer is yes! At least for single files, but we will see ways to mitigate this problem in the next section.
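To see why each new version costs a full extra copy, here is a toy content-addressed cache in Python. It is a simplified sketch under stated assumptions, not DVC's implementation: the add_to_cache and cache_size helpers, the flat toy_cache directory, and the 1000-byte file standing in for 1 GB are all invented for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def add_to_cache(file_path: Path, cache_dir: Path) -> str:
    """Copy `file_path` into `cache_dir` under its MD5 hash and return the hash."""
    data = file_path.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    target = cache_dir / digest
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data)
    return digest


def cache_size(cache_dir: Path) -> int:
    """Total size in bytes of all objects in the toy cache."""
    return sum(p.stat().st_size for p in cache_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / "toy_cache"  # hypothetical stand-in for .dvc/cache
    cache.mkdir()
    data_file = workspace / "data.csv"

    # Version 1: 1000 bytes stand in for the 1 GB file.
    data_file.write_bytes(b"a" * 1000)
    v1 = add_to_cache(data_file, cache)
    size_after_v1 = cache_size(cache)

    # Version 2: a single byte differs, yet the whole file is stored again.
    data_file.write_bytes(b"a" * 999 + b"b")
    v2 = add_to_cache(data_file, cache)
    size_after_v2 = cache_size(cache)

print(v1 != v2)        # True
print(size_after_v1)   # 1000
print(size_after_v2)   # 2000
```

Because the cache key is the hash of the whole file, even a one-byte edit yields a new key and a second full copy.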

As for folders, it’s a bit different.

When you track different versions of folders with dvc add dirname, DVC is smart enough to detect only the files that changed inside that directory. This means that unless you update every file in the directory, DVC will cache only the new versions of the changed files, which won’t take up much space.
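The folder behavior can be sketched the same way. In the toy example below, every file in a directory is cached under its own hash, so re-adding the directory stores only content the cache has not seen before. The add_dir_to_cache helper, the file names, and the flat cache layout are assumptions made for illustration and do not reflect DVC's actual internals.

```python
import hashlib
import tempfile
from pathlib import Path


def add_dir_to_cache(dir_path: Path, cache_dir: Path) -> int:
    """Cache every file in `dir_path` by MD5 hash; return how many new objects were stored."""
    new_objects = 0
    for file_path in sorted(dir_path.iterdir()):
        data = file_path.read_bytes()
        target = cache_dir / hashlib.md5(data).hexdigest()
        if not target.exists():  # skip content that is already cached
            target.write_bytes(data)
            new_objects += 1
    return new_objects


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    images = root / "images"
    cache = root / "toy_cache"  # hypothetical stand-in for .dvc/cache
    images.mkdir()
    cache.mkdir()

    # Five fake "images" with distinct contents.
    for i in range(5):
        (images / f"img_{i}.bin").write_bytes(f"fake image {i}".encode())

    first_add = add_dir_to_cache(images, cache)   # all five files are new

    # Add one more fake image and track the folder again.
    (images / "img_5.bin").write_bytes(b"a brand new fake image")
    second_add = add_dir_to_cache(images, cache)  # only the new file is stored

print(first_add, second_add)  # 5 1
```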

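In summary, think of the DVC cache as a loose counterpart to Git’s staging area: dvc add places content there much as git add stages a file. Because the cache keeps every tracked version, it also plays a role similar to Git’s object store.

Learn more about the cache and internal DVC files from this user guide section.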
