bramathon 2 days ago

I've used DVC for most of my projects for the past five years. The good thing is that it works a lot like git. If your scientists understand branches, commits and diffs, they should be able to understand DVC. The bad thing is that it works like git. Scientists often do not, in fact, understand or use branches, commits and diffs. The best thing is that it essentially forces you to follow the Ten Simple Rules for Reproducible Computational Research [1]. Reproducibility has been a huge challenge on teams I've worked on.

[1] https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

dpleban a day ago

Great to see DVC being discussed here! As a tool, it’s done a lot to simplify version control for data and models, and it’s been a game-changer for many in the MLOps space.

Specifically, it's a genius way to store large files from git repos directly on any object storage, without custom application servers like git-lfs requires and without rewriting git from scratch...
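
For the curious: git itself only ever tracks a small pointer file, while the content lives in the DVC cache and on object storage. A .dvc pointer is just a few lines of YAML, roughly like this (hash, size and path are made up for illustration):

    outs:
    - md5: 1a2b3c4d5e6f...
      size: 1839204
      path: data/train.csv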

At DagsHub [0], we've integrated directly with DVC for a looong time, so teams can use it with added features like visualizing and labeling datasets, managing models, running experiments collaboratively, and tracking everything (code, data, models, etc.) all in one place.

Just wanted to share that for those already using or considering DVC—there are some options to use it as a building block in a more end-to-end toolchain.

[0] https://dagshub.com

dmpetrov 2 days ago

hi there! Maintainer and author here. Excited to see DVC on the front page!

Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain that does data versioning with slightly different assumptions: no file copies and built-in data transformations.

  • johanneskanybal 21 hours ago

    I mostly consult as a data engineer, not MLOps, but I’m interested in some aspects of this. We have 10 years of parquet files from 300+ different kafka topics and we’re currently migrating to Apache Iceberg. We’ll backfill on a need-only basis, and it would be nice to track that with git. Would this be a good fit for that?

    Another potential aspect would be tracking schema evolution in a nicer way than we currently do.

    thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).

  • ajoseps 2 days ago

    if the data files are all just text files, what are the differences between DVC and using plain git?

    • miki123211 2 days ago

      DVC does a lot more than git.

      It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.

      There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.

      It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.
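
      To make that concrete, a minimal dvc.yaml might look like this (stage names, scripts and paths are made up):

        stages:
          prepare:
            cmd: python prepare.py
            deps:
            - prepare.py
            - data/raw
            outs:
            - data/clean
          train:
            cmd: python train.py
            deps:
            - train.py
            - data/clean
            outs:
            - model.pkl

      Running dvc repro then re-executes only the stages whose deps changed since the last run.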

      In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.

      • amelius a day ago

        Sounds like it is more a framework than a tool.

        Not everybody wants a framework.

        • stochastastic 21 hours ago

          It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple years and it has worked great.

        • JadeNB a day ago

          > Sounds like it is more a framework than a tool.

          > Not everybody wants a framework.

          The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.

          • amelius a day ago

            Just saying that what makes Git so appealing is that it does one thing well, and from this view DVC seems to be in an entirely different category.

    • dmpetrov 2 days ago

      In these cases, you need DVC if:

      1. Files are too large for Git and Git LFS.

      2. You prefer using S3/GCS/Azure as storage.

      3. You need to track transformations/pipelines on the files: clean up a text file, train a model, etc.

      Otherwise, vanilla Git may be sufficient.
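
      For cases 1 and 2, the minimal workflow is something like this (bucket and paths are placeholders):

        dvc remote add -d storage s3://my-bucket/dvc
        dvc add data/images
        git add data/images.dvc .gitignore
        git commit -m "track images with DVC"
        dvc push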

    • agile-gift0262 a day ago

      It's not just for managing file versioning. You can define a pipeline with different stages, the dependencies and outputs of each stage, and DVC will figure out which stages need re-running depending on which dependencies have changed. Stages can also output metrics and plots, and DVC has utilities to expose, explore and compare those.
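
      For example, a stage can declare metrics and plots outputs (file names here are illustrative), and dvc metrics diff / dvc plots diff then compare them across commits:

        stages:
          evaluate:
            cmd: python evaluate.py
            deps:
            - model.pkl
            metrics:
            - metrics.json:
                cache: false
            plots:
            - plots/precision_recall.csv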

  • stochastastic 20 hours ago

    Thanks for making and sharing DVC! It’s been a big help.

    Is there any support that would be helpful? I’ll look at the project page too.

    • dmpetrov 20 hours ago

      Thank you!

      Just shoot an email to support and mention HN. I’ll read and reply.

jiangplus a day ago

How does it compare to Oxen?

https://github.com/Oxen-AI/Oxen

  • gregschoeninger a day ago

    Maintainer of Oxen here. We initially built Oxen because DVC was pretty painfully slow to work with and had a lot of extra bells and whistles that we didn’t need. Under the hood we optimized the merkle tree structure, hashing algorithms, network protocols, etc to make it speedy when it came to large datasets. We have a pretty nice front end at https://oxen.ai for viewing and querying the data as well.

    Happy to answer any thoughts or questions!
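
    If it helps, the CLI is intentionally git-shaped; a first session looks roughly like this (paths and remote are placeholders):

      oxen init
      oxen add images/
      oxen commit -m "add training images"
      oxen push origin main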

  • bagavi 21 hours ago

    Can this be used with GitHub? If yes, I would shift from dvc immediately

  • jFriedensreich a day ago

    never heard of oxen but it looks like a super interesting alternative. would love to hear from someone who has experience with both.

    my first impression: dvc is made to be used with git, where arbitrary folders INSIDE your git repo are handled by dvc, whereas oxen is an alternative in the form of a separate data repo. also oxen has lots of integration with dataframes and tabular, ai training and inference data that dvc is missing. on the other hand dvc has a full DAG pipeline engine integrated, as well as import/export and pluggable backends.

ulnarkressty a day ago

We were actually considering DVC; however, for our particular use case (huge video files which don't change much) the git paradigm was not that useful: you need at least one copy of the data on the origin and another on each system doing the training. So in the end we just went with files and folders on a NAS, which seemed to work well enough.

A hybrid solution of keeping dataset metadata under DVC and then versioning that could work. This was many years ago though, and I would be curious if there are any other on-prem data versioning solutions; when I last searched, all of them seemed geared towards the cloud.
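
If anyone wants to try that hybrid route: DVC does accept a plain directory path (e.g. an NFS mount) as a remote, so a NAS-backed setup would look something like this (paths are placeholders):

    dvc remote add -d nas /mnt/nas/dvcstore
    dvc push

though you still pay for the extra copy between the local cache and the remote, which was exactly our problem.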

jerednel 2 days ago

It's not super clear to me how this interacts with data. If I am using ADLS to store delta tables, and I cannot pull prod to my local machine, can I still use this? Is there a point if I can just look at the delta log to switch between past versions?

  • riedel 2 days ago

    DVC is (at least as I use it) pretty much just git LFS with multiple backends (I guess actually a simpler git-annex). It also has some rather MLOps-specific stuff. It's handy if you do versioned model training with changing data on S3.

    • haensi 2 days ago

      There’s another thread from October 2022 on that topic.

      https://news.ycombinator.com/item?id=33047634

      What makes DVC especially useful for MLOps? Aren’t MLflow or W&B solving that in a way that’s open source (the former) or massively increases speed and scale (the latter)?

      Disclaimer: I work at W&B.

    • matrss 2 days ago

      Speaking of git-annex, there is another project called DataLad (https://www.datalad.org/), which has some overlap with DVC. It uses git-annex under the hood and is domain-agnostic, compared to the ML focus that DVC has.

    • starkparker 2 days ago

      I've used it for storing rasters alongside georeferencing data in small GIS projects, as an alternative to git LFS. It not only works like git but can integrate with git repos through commit and push/pull hooks, storing DVC pointers and managing .gitignore files while retaining the directory structure of the DVC-managed files. It's neat, even if the initial learning curve was a little steep.

      We used Google Drive as a storage backend and had to grow out of it to a WebDAV backend, and it was nearly trivial to swap them out and migrate.
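
      The swap itself was basically just pointing DVC at the new remote and pushing (the URL here is a placeholder):

        dvc remote add -d webdav webdavs://example.com/dav/dvcstore
        dvc push -r webdav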

shicholas 2 days ago

What are the benefits of DVC over Apache Iceberg? If anyone used both, I'd be curious about your take. Thanks!

  • andrew_lettuce 2 days ago

    I don't see any real benefits, as it feels like using the tool you already know even though it's not quite right. Iceberg is maybe geared towards slower changing models than this approach?

    • foobarbecue 2 days ago

      username checks out

      • dmd 2 days ago

        You must be into Apache Ignite.

  • dijksterhuis a day ago

    head is a bit discombobulated today, but i’ll give this a shot

    when i say ‘blob’ data, a good example to think of is a set of really long 1080p video files.

    tl;dr version

    * throw data into dvc when unstructured ‘blob’ data.

    * throw it into iceberg when you’ve got structured data.

    benefits of dvc over iceberg:

    * not forcing ‘blob’ data into a tabular format and all the “fun” (read: annoying) processing steps that come with doing that

    * don’t have to run some processing step to extract ‘blob’ data out of what is basically a parquet file; dvc pull (?) will just download each file as is.

    * edit files locally then run three-ish commands to commit changes (sketched at the end of this comment), without needing to run a data ingestion pipeline to force ‘blob’ data into a table

    * completely schemaless, so don’t have to worry about ‘blob’ data being the wrong type, just shove it in the repo and commit it

    * roll back through all of the commit history, not just to the last vacuum/checkpoint

    basically, tabular data formats and ‘blob’ data shoved into them is a recipe for pain.

    shoving ‘blobs’ into a git like repo is much faster and easier.

    especially if you need full version history, branches for different outcomes etc.

    trying to have different branches in Iceberg for your set of really long 1080p video files, where you have applied different ffmpeg filters in different branches and want people to be able to access all of them and their history, sounds nightmarish.

    in dvc, that’s ^ easy.

    basically, it’s like creating a data lake which won’t turn into a data swamp because everything is version controlled.
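
    the three-ish commands from earlier, for the record (paths made up):

      dvc add data/videos
      git commit -am "apply new ffmpeg filter"
      dvc push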

notrealyme123 a day ago

I had a lot of problems when using it with a dataset of many JPG files.

Every dvc status run took many minutes to index and check every file. Caching did not work.

Sadly I had to let go of it.

  • woodson a day ago

    Yes, its performance is rather poor and there can be a lot of headaches with caching (especially if you're using a file system that doesn't support reflinks). For large sharded datasets (e.g. WebDataset), you're better off with other solutions, especially when your ML pipeline can stream them directly from object storage.

    • dmpetrov a day ago

      Right, DVC caches data for consistency and reproducibility.

      If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.

      WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...

      • notrealyme123 9 hours ago

        Thank you! That's news to me. I will absolutely give it a try

sohooo a day ago

I also heard about lakeFS for data versioning on S3 object stores. Is DVC a contender in this area?

causal 2 days ago

Is this useful for large binaries?

  • mkbehbehani 2 days ago

    Yes, I’ve been using it for about a year to populate databases with a reference DB dump. The current file is about 18GB. I use cloudflare R2 as the backing store so even though it’s being pulled very frequently the cloudflare bill is a few bucks per month.
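
    For anyone wanting to replicate this: R2 speaks the S3 API, so the remote is just an s3 remote with a custom endpoint (bucket and account ID are placeholders):

      dvc remote add -d r2 s3://my-bucket/dvc
      dvc remote modify r2 endpointurl https://<account-id>.r2.cloudflarestorage.com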

  • kbumsik a day ago

    Large files are good, but it may have performance issues with many (millions of) small files.

  • dmpetrov 2 days ago

    Yes. Especially if you track transformations of the binaries or ML training.

  • natsucks 2 days ago

    Would appreciate a good answer to this question. I deal with large medical imaging data (DICOM) and I cannot tell whether it's worth it and/or feasible.

    • thangngoc89 a day ago

      It's very much feasible. I'm currently using DVC for DICOM; the repo has grown to about 5TB of small dcm files (less than 100KB each). We use an NFS-mounted NAS for development, but DVC's cache needs to be on NVMe, otherwise performance is terrible.
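
      Relocating the cache is a one-time config, roughly (paths are placeholders for ours):

        dvc cache dir /nvme/dvc-cache
        dvc config cache.type symlink

      The symlink cache type keeps the workspace from duplicating every file out of the cache.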

    • tomnicholas1 a day ago

      You should look at Icechunk. Your imaging data is structured (it's a multidimensional array), so it should be possible to represent it as "Virtual Zarr". Then you could commit it to an Icechunk store.

      https://earthmover.io/blog/icechunk

tomtom1337 a day ago

The animated ripple across the «what’s new» button is infuriating. It keeps drawing my attention away from reading what this is.

  • FergusArgyll a day ago

    I bound Ctrl-Shift-Z to uBlock Origin's Zapper mode. It's really helpful