Home
>
Data Science
>
Update Part 2: The Evolution of Data Pipeline Architecture

March 22, 2022 by Anh Hoang

Update Part 2: The Evolution of Data Pipeline Architecture

Main Contents:

Part 2: The Evolution of Data Pipeline Architecture is an article under the topic Data Science Many of you are most interested in today !! Today, let’s InApps.net learn Part 2: The Evolution of Data Pipeline Architecture in today’s post !

A New Paradigm

Certain things occurred in the past five years that have created an environment where data pipelines can be re-invented.

There are three main changes that we are interested in.

The Separation of Computing and Storage

Kostas Pardalis

Head of product at RudderStack. CEO at Blendo (acquired). Kostas is always an entrepreneur and engineer.

I’m not sure if Snowflake was the first to come up with this paradigm, but they definitely are the company that managed to make it mainstream and convince the market of its value.

By separating computing from storage, some amazing things happened.

We can scale storage almost indefinitely. We do not worry anymore about the size of our cluster and the downtime of resizing. Also, the cost for storing our data is associated only with storing and is thus much lower.
We can also scale our processing while we access only the data we need. In this way, processing also scales while costs remain low.

The impact of the above is tremendous on how we deal with data. We do not have to worry about the data we store anymore, and we have the flexibility to perform complex transformations on huge amounts of historical data when — and only when — we need it.

The Transformation of ETL to EL(T)

The separation of computing and storage in data warehouses had, as a consequence, some changes to how ETL (extract, transform, load) is perceived.

Smart people quickly identified that now we don’t have to implement complicated ETL pipelines, where heavy transformations are applied to the data. Instead, we can simplify a complex engineering mechanism by removing the transformation and offloading it to the data warehouse.

Why is this important?

ETL, although it might sound like copying data around, is not as simple of an engineering problem as it sounds. Especially if you want to do it at scale and in distributed environments. By removing processing from ETL, we can focus on building super resilient and consistent data pipelines and solve that problem once and for all.
Now it’s possible to create a single source of truth of our data and perform as many and as destructive transformations as we want, without losing the original data. We can always go back and change everything again.

We have the original data and the resources to do it inexpensively and at scale.

Data Platforms Are More Expressive Than Ever

I’m sure you have heard of DBT (data build tool) and how data engineers and analysts love it. What about Materialize?

These tools are only a few of a new set of tools that data engineers, analysts, and data scientists have access to today.

These tools are also a testament to what data platforms are becoming. Data platforms started with simple SQL query engines. That’s not the case anymore. They have matured and become complex processing engines capable of deploying and executing extremely complex processing logic by conforming to all the best practices that engineering has developed for delivering products and practices like versioning and testing.

Putting All the Pieces of the Puzzle Together

All the above points are signaling that it’s time to rethink how we develop and implement data pipelines and, most importantly, how we want to manage the state that data pipelines are responsible for replicating.

The paradigm of keeping only the latest state of data should be abandoned and we should instead follow something closer to a pattern known in software engineering as “Event Sourcing.”

By doing this, instead of replicating the state between the source system and the data warehouse or data lake, the pipeline stores a sequence or stream of state-changing events.

This is not a new concept in data warehousing. In data warehouse architectures, it was always understood that keeping track of how data changes is important. People who have experience in data warehouses will be aware of the term Slowly Changing Dimension (SCD).

Sponsor Note

sponsor logo

RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter with enriched data from your warehouse for identity stitching and other advanced use cases. Start building smarter customer data pipelines today with RudderStack.

There are different SCD strategies that have been invented. What we were describing earlier as the first architecture of Data Pipelines is actually an example of a Type 1 SCD strategy. This particular strategy dictates that old data is overwritten by the new data.

If you ever had to work with Salesforce, you might be aware of a companion object called “History” that every Salesforce object comes with. This is another example of SCD in action. This time it’s type 4 SCD that we see in practice. We keep the main object where we overwrite old data with new, and at the same time we keep a history table with all the changes of records. In Salesforce, as in most cases, there’s also a retention period for the historical data after which we delete older data and keep only the new data.

Small Changes with Big Impact

The architecture of data pipelines doesn’t have to change a lot, but these small changes simplify their logic and have a big impact on how we work with the data.

All we have to do is to remove the logic the data pipelines had to keep track of — what is on System B already — and update it with any updates that have been observed on System A.

In this way, the pipelines become even simpler, and as we saw with ETL to ELT, we can focus even more on their resilience and consistency. This approach can also reduce latency a lot, as the complicated logic of deduplicating the data does not have to exist anymore.

At the same time, the separation of computing and processing on the destination storage systems allows us to keep an indefinite sequence of changes that our data has gone through. If we are smart with how we process the data, we will probably never have to delete old data.

On top of this sequence of state-changing events, we can use tools like DBT and Materialize to reconstruct the current final state of our data or transform our data completely and build data applications on top of it.

If we want to implement different SDC strategies, now we can do it. We can even change the strategies we implement, if we want to in the future, without losing any data.

We don’t have to deal with two different data models anymore. Instead, now everything can be handled as a stream of events. We can use processing systems to always reconstruct the final state and have it always available, but at the same time, we can use streaming processing frameworks to add even more use cases that are supported by our data infrastructure.

Most importantly, now we have access to all the data we need at any point in time and we can support all the possible use cases we have. Our data analysts and data scientists will always be happy, and we can support them at scale and with predictable costs.

Final Thoughts

There are small changes that we can apply to our software architectures that can have a tremendous impact on what we can do and how we can do it.

This is also the case with the Data Pipeline as one of the pillars of modern data infrastructure. Just changing how we replicate the state from one system to the other, we can have an amazing impact on the performance and the capabilities of the systems we design.

Of course, for this to happen huge changes had to happen in the industry. As we saw, we wouldn’t be able to rethink our architecture so easily without some major changes in the way we store and process data.

Not everything is solved though. The complexity of the sources of data we have to deal with is increasing and although we have solutions for some of them (e.g. CDC for databases), how we can interact effectively with other systems, including cloud applications, is still a work in progress.

Feature image via Pixabay.

Source: InApps.net

Rate this post

Anh Hoang

Anh Hoang is Head of SEO Optimization at InApps Technology, ensuring that the message and research of InApps Technology reach the most people possible while adhering to our strict journalistic standards of excellence and integrity.

Let’s create the next big thing together!

Coming together is a beginning. Keeping together is progress. Working together is success.

Let’s talk

Recommended

Tech News

April 10, 2026 by Anh Hoang

Update Part 2: The Evolution of Data Pipeline Architecture

Read more about Part 2: The Evolution of Data Pipeline Architecture at Wikipedia

A New Paradigm

The Separation of Computing and Storage

The Transformation of ETL to EL(T)

Data Platforms Are More Expressive Than Ever

Putting All the Pieces of the Puzzle Together

Small Changes with Big Impact

Final Thoughts

Best Angular Projects for Beginners in 2026

Offshore Product Development and How It Differs?

Is It Too Late to Switch Into Tech? What Reddit Career Changers Say

Are Developers Becoming Too Dependent on AI Tools?

Is Being a Self-Taught Developer Still Viable in 2026?

Imposter Syndrome in Tech: Why So Many Developers Feel Like Frauds

Too Many Tools, Too Little Time: How Developers Deal With Stack Fatigue

Why AI Productivity Is Making Developers Feel More Stressed, Not Faster

How to Stay Relevant in Tech Without Learning Everything

Why So Many Developers Feel Burned Out (And What Actually Helps)

Blog post

9 Practical Tips to Choose a Mobile App Development Company for 2025

Hire Offshore Angular Developers: The Right Development Team In Vietnam

What Is ODC (Offshore Development Center)? Understand Offshore Development Center In 3 Seconds

Hire Full-Stack Developers From Software Outsourcing Companies in 2026

Locations

Read more about Part 2: The Evolution of Data Pipeline Architecture at Wikipedia

A New Paradigm

The Separation of Computing and Storage

The Transformation of ETL to EL(T)

Data Platforms Are More Expressive Than Ever

Putting All the Pieces of the Puzzle Together

Small Changes with Big Impact

Final Thoughts

Get a custom Proposal

You need to enter your email to download

Blog post

Locations