Building and maintaining a data pipeline is still one of the most challenging parts of analytics and data science. Pitfalls abound: pipelines get stuck, scheduled tasks fail, data transformations — which are needed to make raw data analytics-ready — take forever, security errors occur, and components run out of sequence. When data pipelines break, everyone wants a hero to fix immediate and recurring issues.
Being such a hero takes a combination of experience, talent, and the right tools to maximize productivity. Among the many vendors vying for that vaunted position in data engineers’ toolsets is Fivetran, which offers a cloud-based, fully managed data integration service for ELT (extract-load-transform). Also popular in the data pipeline arena is dbt Labs’ dbt Core, an open source framework for SQL-based data engineering. In 2020, Fivetran announced native integration with dbt Core, pursuing it to help users manage their transformation schedules, minimize time spent on monitoring, and accelerate time to insight via reusable models based on dbt packages.
In January, the plot thickened: Fivetran announced extensions to this integration, adding scheduling and data lineage graphs. The goal of the enhanced integration is to maximize data freshness and control compute costs, all through pipeline automation.
The vehicle for this enhanced integration is Fivetran Transformations, an automated orchestration tool. When data arrives in a data lake or data warehouse, Fivetran Transformations automatically kicks off the transformation process via pre-built data models, normalizing the data and putting it into a schema that business users find intuitive.
The pre-built models, now implemented as dbt Core SQL script packages designed to work with popular data source connectors, help users generate new reports quickly, perform basic transformations, and get a head start on the data pipelining process. Fivetran Transformations does this by automating tasks that would otherwise have to be done manually when working with the kind of data the models address. For example, analysts working with sales data can see various sales aggregations immediately after the raw data arrives, courtesy of the transformations provided in the model.
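To illustrate, a dbt Core model is essentially a templated SQL SELECT statement. Below is a minimal sketch of what a sales-aggregation model might look like; the model, source, and column names here are hypothetical and not taken from any actual Fivetran package:

```sql
-- models/sales_by_month.sql (hypothetical model name)
-- Aggregates raw order rows, as loaded by a connector, into monthly totals.
-- source() points dbt at the raw table the connector landed in the warehouse.
select
    date_trunc('month', order_date) as order_month,
    count(*)                        as order_count,
    sum(order_total)                as total_revenue
from {{ source('sales', 'orders') }}
group by 1
```

At run time, dbt compiles the Jinja `{{ source(...) }}` call into the fully qualified warehouse table name and materializes the result as a view or table, which is why analysts can query the aggregate as soon as a sync completes.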
Currently on offer are pre-built models for 40 different data sources, including CRM, ERP, and accounting systems. All of the models are open source. Some of the packages even encompass multiple systems: the social media “roll-up” package, for example, is schema-savvy for data coming from platforms like Facebook, LinkedIn, Twitter, and even TikTok, while the ad roll-up package provides analogous functionality across multiple online ad platforms.
The enhanced integration also includes scheduling capabilities, which let users run their dbt Core packages automatically after a connector sync completes. In addition, data lineage graphs, a feature of dbt Core, are integrated and provide an end-to-end visualization of a data pipeline, showing all of the data models and connectors used within it.
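dbt Core derives these lineage graphs from the models themselves: each `ref()` call inside a model declares a dependency edge in the pipeline’s DAG. A minimal two-model sketch (model and column names are hypothetical):

```sql
-- models/stg_orders.sql: staging model over the raw connector table
select id as order_id, amount, status
from {{ source('shop', 'orders') }}

-- models/completed_orders.sql: the ref() call below makes stg_orders an
-- upstream node of this model in the lineage graph
select order_id, amount
from {{ ref('stg_orders') }}
where status = 'completed'
```

Because dependencies are declared in the SQL itself rather than in a separate configuration, the lineage visualization stays in sync with the transformation code as it evolves.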
Alexander Lovell, Fivetran’s Head of Product, explained to InApps Technology that the motivation for automating so many of these tasks is derived from observed customer needs. “We are following our customers,” Lovell commented. “We are following what customers value and what they need out of these systems. Fivetran is and remains a data pipelines company. Data connectors are our lifeblood and the data analyst, that’s our hero.”
Initially, Fivetran as a company focused almost exclusively on data movement, i.e. extracting and loading data — the “E” and the “L” in “ELT.” Now, its integration with dbt Core lets it add the “T” — data transformation. Referring to Fivetran’s customers, Lovell said “they need data moved and analysis-ready.” Adding scheduling and lineage visualizations makes all of it even easier.
While Fivetran now provides a lot of built-in capabilities, it also allows users to integrate third-party tools via REST APIs. For example, users can leverage Apache Airflow for intelligent scheduling if they have investments in that technology. But such integration is optional since Fivetran Transformations now provides the intelligent scheduling capabilities mentioned previously.
Coordinated or Fragmented?
Nowadays we see a divergence in the industry, where the steps of automated data science are either tightly integrated or over-componentized. The latter imposes a non-trivial burden of effort and expense on customers, who must select and integrate an array of technologies to build a solution. When vendors take back a big chunk of that responsibility, especially in a way that keeps integrations with other products feasible, that’s a big deal and, arguably, a feature even more valuable than scheduling and lineage graphs.
New companies are often pressured by their investors to maintain a narrow focus, in order to get traction and become known for their niche. But as those companies mature, they need to provide more integrated functionality to their customers. That’s exactly what Fivetran is doing, some seven years after nabbing its first big customer.
More separation of steps brings more complexity for customers and increases the risk of failure. On the other hand, an integrated approach makes some customers worry about vendor lock-in. With its Transformations announcement, Fivetran is working to allay concerns on both sides, providing advanced capabilities in its own platform while allowing integration of other technologies for customers who prefer it.