Thank you for mentioning Data Diff! Founder of Datafold here. We built Data Diff to solve problems we ran into ourselves as data engineers: (A) testing SQL code changes by diffing the output of the production and dev versions of a SQL query, and (B) validating that data stays consistent when replicating it between databases.

Data Diff implements two algorithms: one for diffing within the same database and one for diffing across databases. The former is based on a JOIN; the latter uses checksumming with binary search, which keeps network IO and database workload overhead to a minimum.
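
To make the cross-database algorithm concrete, here is a minimal sketch of the idea (not the actual Data Diff implementation, which pushes the checksumming down into SQL on each database): checksum a key range on both sides, and only when the checksums disagree split the range and recurse, so full rows are only ever pulled for small, mismatching segments. Two in-memory dicts stand in for the two tables.

    # Minimal sketch: checksum key ranges, recurse only into mismatching halves.
    # Two Python dicts stand in for the two databases' tables.
    import hashlib

    def checksum(table, keys):
        """Hash the rows for the given keys into one digest."""
        h = hashlib.md5()
        for k in keys:
            h.update(repr((k, table.get(k))).encode())
        return h.hexdigest()

    def diff_range(src, dst, keys, threshold=4):
        """Yield keys whose rows differ, fetching rows only for small ranges."""
        if checksum(src, keys) == checksum(dst, keys):
            return  # identical segment: nothing to transfer
        if len(keys) <= threshold:
            for k in keys:  # small enough to compare row by row
                if src.get(k) != dst.get(k):
                    yield k
            return
        mid = len(keys) // 2
        yield from diff_range(src, dst, keys[:mid], threshold)
        yield from diff_range(src, dst, keys[mid:], threshold)

    # Example: one row drifted during replication.
    prod = {i: ("user", i, i * 10) for i in range(1, 1001)}
    replica = dict(prod)
    replica[537] = ("user", 537, 0)
    keys = sorted(prod.keys() | replica.keys())
    print(list(diff_range(prod, replica, keys)))  # -> [537]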


For a data engineer, validating SQL code as you develop pipelines is hard. It requires running the code to materialize a "development" dataset, then auditing the resulting dataset to ensure your code produces the correct output. That tedious process usually involves writing lots of ad-hoc SQL scripts.

We built an integration with dbt, a popular framework for orchestrating SQL pipelines, that lets developers quickly profile their development datasets and diff them against production to audit their work as they code. It's free for individual developers. (We make money by selling CI/CD validation to teams.)
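
For a rough idea of what "diffing against production" boils down to, here is a hedged, self-contained sketch using SQLite and invented table/column names; in practice the comparison runs as a warehouse query (or via Data Diff) rather than like this.

    # Hypothetical illustration of comparing a dev table against production
    # with plain SQL. Table and column names are invented for this example.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
    CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
    INSERT INTO orders_prod VALUES (1, 10.0, 'paid'), (2, 20.0, 'paid'), (3, 5.0, 'refunded');
    INSERT INTO orders_dev  VALUES (1, 10.0, 'paid'), (2, 21.5, 'paid'), (3, 5.0, 'refunded');
    """)

    # Rows present in prod but not (or not identical) in dev, and vice versa.
    only_prod = conn.execute(
        "SELECT * FROM orders_prod EXCEPT SELECT * FROM orders_dev").fetchall()
    only_dev = conn.execute(
        "SELECT * FROM orders_dev EXCEPT SELECT * FROM orders_prod").fetchall()
    print("prod side of the diff:", only_prod)  # [(2, 20.0, 'paid')]
    print("dev side of the diff:", only_dev)    # [(2, 21.5, 'paid')]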


I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

If so:

1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]”, as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages built on top of dbt tests [2]. The problem with assertion testing in general, though, is that for a moderately complex data pipeline it’s infeasible to achieve test coverage for most possible failure scenarios.

2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding a dbt integration soon that will make diffing as part of the dbt development workflow one command away [3]
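
For reference, this is roughly what using the open-source package from Python looks like, going by its README; the connection strings, table name, and key column below are placeholders, and the exact API may differ between versions, so treat it as a sketch.

    # Rough sketch of diffing the same logical table across two databases with
    # the open-source data-diff package. URIs, table name, and key column are
    # placeholders; check the repo's README for the current API.
    from data_diff import connect_to_table, diff_tables

    prod = connect_to_table("postgresql://user:pass@prod-host/analytics", "orders", "id")
    dev = connect_to_table("postgresql://user:pass@dev-host/analytics", "orders", "id")

    for sign, row in diff_tables(prod, dev):
        # sign is "-" for rows only in the first table, "+" for rows only in the second
        print(sign, row)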

We make money by selling a Cloud solution for teams that integrates data diff into GitHub/GitLab CI and automatically diffs every pull request to tell you how a change to SQL affects the target table you changed, downstream tables, and dependent BI tools (video demo: [4]).

I’ve also written about why reliable change management is so important for data engineering and what the key best practices are to implement it [5].

[1] https://docs.getdbt.com/docs/build/tests

[2] https://github.com/calogica/dbt-expectations

[3] https://github.com/datafold/data-diff/pull/364

[4] https://www.datafold.com/dbt

[5] https://www.datafold.com/blog/the-day-you-stopped-breaking-y...


As a data engineer myself, I totally agree about the abuse of the word “data”. When naming the tool, we strove to make the name as self-describing as possible. Since the tool does one thing (diff datasets), we could have named it “dataset diff”, but that seemed clunkier. “Table diff” wouldn’t work, since we’re working on adding APIs (e.g. Stripe) as a data source to make it possible to validate API-to-database syncs, and that goes beyond just tables. There is always the option of a nondescriptive or metaphorical name, but we were concerned that would make the tool far less discoverable by potential users.


Datafold automates data pipeline testing for data engineers. With Datafold, data engineers can deal with data quality issues in the pull request by seeing how a change to source code impacts data produced throughout the entire data pipeline/DAG. Datafold is used by data teams at Patreon, Thumbtack, Substack, Angellist, among others, and raised $22M from YC, NEA & Amplify Partners. Our founding story and Launch HN: [https://news.ycombinator.com/item?id=24071955]

Roles:

* Frontend Engineer: $180K - $250K + equity [https://bit.ly/3tW4zk7]

* Backend Engineer: $180K - $250K + equity [https://bit.ly/3J2mmdz]

* Data Solutions Engineer: $130K - $200K + equity [https://bit.ly/3iPsNpY]

Salary ranges are for SF Bay Area (adjusted based on location factor), Intermediate to Staff levels.

Location: REMOTE (PST ±5)

US visa sponsorship: yes

Stack: Python, Rust, FastAPI, PostgreSQL, Neo4j, ClickHouse | Typescript, React, Redux

Here are some projects you might work on:

* Static code analysis to compute column-level data lineage graph

* ML-based anomaly detection in multidimensional time series

* Data diff tool that finds discrepancies between 1B+ row datasets across databases

Contact: careers+hn@datafold.com


We actually built a tool exactly for that: https://www.datafold.com/data-diff

The core use case is helping data engineers who build analytical pipelines test changes to their code before deploying, but it can also work for any query: you can diff tables or just SQL queries directly.


Datafold (YC S20) | https://www.datafold.com

Datafold automates data pipeline testing for data engineers. With Datafold, data engineers can deal with data quality issues in the pull request by seeing how a change to source code impacts data produced throughout the entire data pipeline/DAG. Datafold is used by data teams at Patreon, Thumbtack, Substack, Angellist, among others, and raised $22M from YC, NEA & Amplify Partners. Our founding story and Launch HN: [https://news.ycombinator.com/item?id=24071955]

Roles:

* Frontend Engineer: $180K - $250K + equity [https://bit.ly/3tW4zk7]

* Backend Engineer: $180K - $250K + equity [https://bit.ly/3J2mmdz]

* Data Solutions Engineer: $130K - $200K + equity [https://bit.ly/3iPsNpY]

Salary ranges are for SF Bay Area (adjusted based on location factor), Intermediate to Staff levels.

Location: REMOTE (PST ±5)

US visa sponsorship: yes

Stack: Python, FastAPI, PostgreSQL, Neo4j, ClickHouse | Typescript, React, Redux

Here are some projects you might work on:

* Static code analysis to compute column-level data lineage graph

* ML-based anomaly detection in multidimensional time series

* Data diff tool that finds discrepancies between 1B+ row datasets across databases

Contact: careers+hn@datafold.com


We've been using Tella for product demos. A huge upgrade from hacking together QuickTime recordings!


Thank you for sharing!

I assume we are talking about analytical, not transactional data:

> Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

It depends on the use case: if the goal is to assess the impact of a source code change on the resulting dataset (an extremely common need in ETL development workflows, in my experience), then isn't a diff the natural solution? Of course, it depends on how the results are presented. A row-by-row output for a billion-row dataset is useless. That's why we provide diff stats across columns/rows and data distribution comparisons, while still allowing the user to drill down to a value-level diff if needed.
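
As a toy illustration of what "diff stats across columns" means (a sketch with invented row structure, not Datafold's actual implementation), summarizing a diff per column rather than per row can look like this:

    # Toy sketch: summarize a diff per column instead of row by row.
    def column_diff_stats(prod_rows, dev_rows, key="id"):
        prod_by_key = {r[key]: r for r in prod_rows}
        dev_by_key = {r[key]: r for r in dev_rows}
        common = prod_by_key.keys() & dev_by_key.keys()
        columns = {c for r in prod_rows for c in r if c != key}
        return {
            col: sum(prod_by_key[k].get(col) != dev_by_key[k].get(col) for k in common)
                 / max(len(common), 1)
            for col in columns
        }

    prod = [{"id": 1, "status": "paid", "amount": 10}, {"id": 2, "status": "paid", "amount": 20}]
    dev  = [{"id": 1, "status": "paid", "amount": 10}, {"id": 2, "status": "open", "amount": 20}]
    print(column_diff_stats(prod, dev))  # e.g. {'status': 0.5, 'amount': 0.0}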

> Diffs are relatively slow

In general, yes; that's why we've implemented configurable sampling. In the majority of cases, the developer is looking to assess the magnitude of the difference and spot certain patterns, for which you don't need a large sample size. Our customers typically use ~1/10000 of the target table's row count as a sample size.
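
One detail worth making explicit (a general sketch, not necessarily how Datafold implements sampling): the sample has to pick the same rows on both sides of the diff, which you can get by hashing the primary key instead of sampling each table independently at random.

    # Consistent ~1/10000 sampling: hash the primary key so both sides of the
    # diff select the same rows. A general sketch, not Datafold's implementation.
    import hashlib

    def in_sample(primary_key, one_in=10_000):
        digest = hashlib.md5(str(primary_key).encode()).hexdigest()
        return int(digest, 16) % one_in == 0

    sampled = [pk for pk in range(1_000_000) if in_sample(pk)]
    print(len(sampled))  # roughly 100 of 1,000,000 keys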

> when they fail you get a blizzard of errors

We try to fail gracefully :)

> I don't see how this helps with schema migration or performance issues.

For schema migration, you can verify whether anything has changed in your dataset besides the intended schema changes (which certainly happened on my watch).

> or performance issues

We certainly don't claim to solve all DBA issues with diff, but here's a real example from one of our customers: they are optimizing their ETL jobs in BigQuery to lower their GCP bill by reducing query runtime. After refactoring the code, they diff the output of production vs. the new code to make sure the data produced hasn't been affected.

> If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data.

Possibly, but

> They're also a pain to code

...which is often a prohibitive pain point if you have 300+ analytical datasets with 50+ columns each (a common layout at companies of 250+ people).

And there is another problem: the more test cases you have, the more failures you get on every run. Unlike unit testing of application code, you can't expect the cases to stay relevant, since the data changes constantly. Those "unit testing" suites require constant maintenance, and as soon as you stop updating them, their value drops to zero.

I think that diffing and "unit testing" are complementary approaches, and neither is a panacea. So my recommendation has been to use both: 1) specific test cases to validate the most important assumptions about the data, and 2) a diff tool for regression testing.


I'm unconvinced your approach works beyond a narrow range of use cases. The weakness is the "is this a problem" issue. You have a diff. Is it really significant? If it's significant, how did it arise? You can spend an inordinate amount of time answering those two questions, and you may have to do it again with every run. Diffs are cheap to implement but costly to use over time. That inversion of costs means users may end up bogged down maintaining the existing mechanism and unable to invest in other approaches.

If I were going after the same problem I would try to do a couple of things.

1. Reframe the QA problem to make it smaller. Reducing the number and size of pipelines is a good start. That has a bunch of knock-on benefits beyond correctness.

2. Look at data cleaning technologies. QA on datasets is a variation on this problem. For example, if you can develop predicates that check for common safety conditions on data, like detecting bad addresses or SSANs, you give users immediately usable quality information. There's a lot more you can do here.
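
A minimal sketch of what such predicate checks could look like, with hypothetical column names and rules:

    # Hypothetical predicate-style safety checks: each predicate flags rows
    # violating a condition. Column names and rules are invented.
    import re

    PREDICATES = {
        "missing_address": lambda row: not row.get("address"),
        "bad_ssn_format": lambda row: not re.fullmatch(r"\d{3}-\d{2}-\d{4}", row.get("ssn", "")),
        "negative_amount": lambda row: row.get("amount", 0) < 0,
    }

    def check(rows):
        """Count, per predicate, how many rows violate it."""
        return {name: sum(1 for r in rows if pred(r)) for name, pred in PREDICATES.items()}

    rows = [
        {"address": "1 Main St", "ssn": "123-45-6789", "amount": 10},
        {"address": "", "ssn": "123456789", "amount": -5},
    ]
    print(check(rows))  # {'missing_address': 1, 'bad_ssn_format': 1, 'negative_amount': 1}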

Assuming you are working on this project, I wish you good luck. You can contact me at rhodges at altinity dot com if you want to discuss further. I've been dealing with QA problems on data for a long time.


P.S. To expand on #2: if you can "discover" useful safety conditions on data, you change the economics of testing, much as #1 does.

