
April 13, 2022 by Anh Hoang

Build Data Factories, Not Data Warehouses


Data Quality Control Priorities

Jeremy Stanley

Jeremy Stanley is the co-founder and CTO of Anomalo, where he helps companies improve the quality and reliability of their data. Most recently, he was the VP of data science at Instacart, where he focused on machine learning for logistics and marketplace discovery. Prior to that, he led data science and engineering at Sailthru, building personalization tools for e-commerce and publishing companies. Before Sailthru, he was responsible for creating advertising optimization and bidding technology at Collective. His early experience with data, machine learning and strategy began at EY.

To establish data quality control in our metaphorical factory, we could test at four points:

The raw materials that arrive in our factory.
The machine performance at each step in the line.
The work-in-progress material that lands between transformation steps.
The final products we ship to internal or external customers.

These testing points are not equally important. For a factory operator, the most critical quality test is at the end of the line. Factories have dedicated teams that sample finished products and ensure they meet rigorous quality standards.

The same holds for data. We don’t know if the data we produce is high quality until we have tested the finished product. For example:

Did a join introduce duplicate rows?
Did a malformed column cause missing values?
Are timestamps inconsistently recorded?
Has a change in query logic affected business metrics?
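The four questions above can be sketched as simple checks on a finished table. This is a minimal illustration, not a description of any particular tool: it assumes the table is available as a list of dictionaries, and the function names and the column names in the usage note (order_id, amount, created_at) are hypothetical.

```python
from collections import Counter
from datetime import datetime


def find_duplicate_keys(rows, key):
    """Return key values that appear more than once (e.g. after a fan-out join)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def null_rate(rows, column):
    """Return the fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def non_iso_timestamps(rows, column):
    """Return non-null values that do not parse as ISO 8601 timestamps."""
    bad = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        try:
            datetime.fromisoformat(value)
        except (TypeError, ValueError):
            bad.append(value)
    return bad


def relative_change(previous, current):
    """Return the relative change of a business metric between two runs."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous
```

For example, find_duplicate_keys(rows, "order_id") would list order IDs duplicated by a join, and relative_change(yesterday_revenue, today_revenue) would flag a metric shift worth investigating after a change in query logic. What counts as an acceptable null rate or metric change is a judgment the data consumer has to make.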

After validating the quality of our final product, we should ensure we are consuming high-quality raw materials. Identifying defects in raw data arriving into the factory will save us time and effort in root-causing issues later.

Insufficient Investments

Unfortunately, to date, most investment in testing our data factories has been the equivalent of evaluating machine performance or visualizing floor plans:

We monitor data infrastructure for uptime and responsiveness.
We monitor Airflow tasks for exceptions and run times.
We apply rule-based tests with dbt to check the logic of transformations.
We analyze data lineage to build complex maps of data factory floors.

These activities are helpful, but we have put the cart before the horse! We should first ensure that our factory produces and ingests high-quality data. From conversations I have had with hundreds of data teams, I believe we have failed to do so for three reasons:

1. We use the tools we have at hand.
Engineering teams have robust tools and best practices for monitoring the operations of web and backend applications. We can use these existing tools to monitor the infrastructure and orchestration for our data factory. However, these tools are incapable of monitoring the data itself.

2. We have tasked machine operators with quality control.
The burden of data quality often falls on the data and analytics engineers operating the machines in the factory. They are experts in the tools and logic used to transform the data. They may write tests to ensure their transformations are correct, but they can overlook issues that arise upstream or downstream of their own processing.

3. Testing data well is difficult.
Our data factories produce thousands of incredibly diverse data tables with hundreds of meaningful columns and segments. The data in these tables constantly changes for reasons that range from “expected” to “entirely out of our control.” Simplistic testing strategies frequently miss real issues, and complex strategies are hard to maintain. Poorly calibrated tests can spam users with false-positive alerts, leading to alert fatigue.

Data Quality Control Needs

We need purpose-built tools to monitor and assess the quality of data arriving into or exiting our data factories.

We should place these tools into the hands of data consumers — the subject matter experts who deeply care about the quality of the data they use. These consumers should be able to quickly test their data and monitor their key metrics, with or without code.

Our data quality tools must scale to cover thousands of tables, with billions of rows, across hundreds of teams, in daily batch processes or real-time flows.

The algorithms used should be flexible enough to handle data from diverse applications and industries. They should gracefully adapt to different tabular structures, data granularity and table update mechanics. We should automate testing to avoid burdening data consumers with busy work.
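One way to picture automated testing that adapts to different table structures is to profile each table and compare new batches against an earlier profile, rather than hand-writing rules per column. The sketch below is only an assumption about how this could look: the profile and compare_profiles functions, the list-of-dictionaries input and the 5% null-rate tolerance are all hypothetical.

```python
def profile(rows):
    """Summarize each column: observed value types and null rate."""
    columns = {name for row in rows for name in row}
    summary = {}
    for name in sorted(columns):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary[name] = {
            "types": {type(v).__name__ for v in present},
            "null_rate": 1 - len(present) / len(values),
        }
    return summary


def compare_profiles(baseline, current, max_null_increase=0.05):
    """List differences between a baseline profile and a new batch's profile."""
    issues = []
    for name, base in baseline.items():
        if name not in current:
            issues.append(f"{name}: column missing")
            continue
        cur = current[name]
        new_types = cur["types"] - base["types"]
        if new_types:
            issues.append(f"{name}: new types {sorted(new_types)}")
        if cur["null_rate"] - base["null_rate"] > max_null_increase:
            issues.append(
                f"{name}: null rate rose from "
                f"{base['null_rate']:.2f} to {cur['null_rate']:.2f}"
            )
    for name in sorted(current.keys() - baseline.keys()):
        issues.append(f"{name}: unexpected new column")
    return issues
```

Because the checks are derived from the data rather than written by hand, the same two functions apply to any table shape. A real system would need far richer profiles (distributions, segments, update mechanics) than this sketch covers.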

We should avoid creating alert fatigue by minimizing false positives through notification controls, feedback loops and robust predictive models. When issues arise, we should visually explain them by leveraging context in the data and upstream data generation processes.
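As a small illustration of reducing false positives, a check can compare today's value with a robust baseline built from recent history instead of a fixed threshold. The sketch below uses the modified z-score based on the median and the median absolute deviation (MAD), a standard statistical technique swapped in here for illustration; it is far simpler than the predictive models described above. The function names and the 3.5 cutoff are assumptions.

```python
from statistics import median


def robust_z_score(history, value):
    """Score `value` against `history` using the median and the MAD."""
    center = median(history)
    mad = median([abs(x - center) for x in history])
    if mad == 0:
        # All history values are (nearly) identical: any deviation is extreme.
        if value == center:
            return 0.0
        return float("inf") if value > center else float("-inf")
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normally distributed.
    return 0.6745 * (value - center) / mad


def is_anomalous(history, value, threshold=3.5):
    """Flag `value` only when it deviates strongly from recent history."""
    return abs(robust_z_score(history, value)) > threshold
```

With a week of daily row counts near 1,000, a count of 1,015 would pass quietly while a count of 400 would raise an alert. Because the median and MAD are barely affected by a single past outlier, one bad day in the history does not loosen or tighten the check for the days that follow.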

The Future of Data Quality

Organizations today can capture, store and query a remarkable breadth of data relevant to their business. They can democratize access to this data so that analyses, processes or products can depend on it.

Data teams operate complex data factories to service the data needs of their organization. But they are often unable to control the quality of data produced. Data teams risk losing trust and becoming sidelined if they do not catch and address data quality issues before downstream users encounter them.

Data leaders must take responsibility for data quality by defining and enforcing quality control standards. They need tools and processes that test data in ways that scale, both with the data itself and the people involved in producing and consuming it.

These are complex challenges, but a tremendous amount of innovation is happening in the data community to address them. I look forward to a future where our data factories are transparent, fast, inexpensive, and produce data of outstanding quality!

I’d like to thank Anthony Goldbloom, Chris Riccomini, Dan Siroker, D.J. Patil, John Joo, Kris Kendall, Monica Rogati, Pete Soderling, Taly Kanfi and Vicky Andonova for their feedback and suggestions.


