For all the buzz around observability these days, most of it focuses on infrastructure operations, leaving data engineers to plug gaps in their knowledge of how data pipelines work, or don't.

Application performance management (APM) offerings and typical monitoring tools don’t provide the kind of insight that data engineers need into their pipelines, according to Josh Benamram, CEO at Databand.

Data teams will have their cloud, Docker and Kubernetes layers running, and on top of that, more specialized tools for processing data: different tools for streaming and for batch processes, such as Spark, Presto or Apache Airflow, he said.

“You can think about a common data pipeline in a business as something like: every single day at 12 p.m., I run this process, if I’m a financial services company,” he explained.

“I run this process which takes in enormous amounts of data from 20 different exchanges, stock exchanges like NYSE, NASDAQ, whatever, pulls that data into my system every day … runs an hours-long process to aggregate and extract features from the data, cleanse the data, and pull it together into a single location that my data scientists use. The underlying infrastructure that powers something like that would be your cloud environment, maybe Kubernetes, Apache Airflow for scheduling your run every day at 12 p.m., Spark for ingesting the data at scale and doing large-scale processing, and then something like Snowflake, BigQuery or Redshift for delivering the data into some data lake that other teams can use. So it’s just another order of complexity than what software engineers normally work with.”
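To make that concrete, here is a minimal Airflow sketch of the kind of daily pipeline Benamram describes. All DAG, task and function names are hypothetical, and the ingestion and processing steps are stubbed out; a real implementation would hand the heavy lifting to Spark and a warehouse loader.

```python
# A minimal sketch of the daily 12 p.m. pipeline described above.
# DAG, task and function names are hypothetical; the steps are stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_exchange_data():
    """Pull the day's raw feeds from each exchange (stubbed out here)."""


def aggregate_and_cleanse():
    """An hours-long Spark job in practice; a placeholder callable here."""


def load_to_warehouse():
    """Deliver the cleansed dataset to Snowflake, BigQuery or Redshift."""


with DAG(
    dag_id="daily_exchange_feed",      # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 12 * * *",    # every day at 12 p.m., as in the example
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_exchange_data)
    transform = PythonOperator(task_id="aggregate_cleanse", python_callable=aggregate_and_cleanse)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_to_warehouse)

    ingest >> transform >> load
```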


He explains in a blog post that typical APM tools focus on metrics, logs and traces, while data pipeline monitoring also requires insight into data flows (are there issues in data quality?), the schedules on which batch processes run, and the internal and external dependencies that link pipelines together.

“If you were running a data pipeline every day at 12 p.m., there’s the ‘ephemeral-ness,’ if that’s a word, of that process. The fact that it runs as a long-running batch process creates nuances around how you monitor it relative to a normal microservice or application, which is supposed to run 24/7, all the time, no downtime ever.

“If I’m running a batch process, it’s totally normal for a batch process to fail five, six, seven or 10 times before it kicks on and successfully runs. And it’s very normal for these batch processes to have a really complicated web of dependencies,” he said.

“I might have one pipeline that delivers data into one location, and then another pipeline that reads the data from that place and delivers it to another place. You can just imagine this big web growing. And when you think about all those nuances, it really creates a huge need for a dedicated tool that understands this stuff.”
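Those two nuances, retry tolerance and cross-pipeline dependencies, map onto familiar scheduler primitives. Here is a hedged Airflow sketch, with hypothetical DAG and task IDs, of one pipeline waiting on the (equally hypothetical) `daily_exchange_feed` pipeline above and retrying a flaky batch step:

```python
# Sketch of the retry and dependency patterns Benamram describes.
# All IDs are hypothetical; the processing step is a stub.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="downstream_reader",            # reads what daily_exchange_feed wrote
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 13 * * *",        # an hour after the upstream pipeline
    catchup=False,
) as dag:
    # Block until the upstream pipeline's load task has finished for the day.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_exchange_feed",
        external_dag_id="daily_exchange_feed",
        external_task_id="load_warehouse",
        execution_delta=timedelta(hours=1),  # upstream runs at 12, this at 13
    )

    # It is normal for a batch step like this to fail repeatedly before it
    # succeeds, so the scheduler is told to retry rather than page anyone.
    process = PythonOperator(
        task_id="process_downstream",
        python_callable=lambda: None,        # stub
        retries=10,
        retry_delay=timedelta(minutes=5),
    )

    wait_for_upstream >> process
```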

Tracking Data Pipelines

Databand co-founders Evgeny Shulman, Benamram and Victor Shafran met at tech community events in Tel Aviv. They launched the company in 2018. It’s based in New York, with engineering and R&D still centered in Tel Aviv.

They have released DBND, an open source framework for building and tracking data pipelines. It includes a Python library, a set of APIs and a CLI that can be used for data ingestion, preparation, machine learning model training and production. DBND can be used as an orchestrator for systems such as Airflow, providing deep tracking of pipeline metadata and decoupling code from the underlying compute and data systems. DBND requires Python 2.x or 3.x and supports Windows, macOS and Linux.
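As a rough illustration of the decorator-style Python API the DBND library exposes (exact names and signatures may vary across versions), a tracked pipeline can look something like this; the task bodies and metric names are invented:

```python
# A minimal sketch of pipeline code tracked with DBND; the task bodies and
# metric names are hypothetical, and the API surface may differ by version.
import pandas as pd
from dbnd import log_metric, pipeline, task


@task
def prepare_data(raw: pd.DataFrame) -> pd.DataFrame:
    cleaned = raw.dropna()
    # Metrics logged here become pipeline metadata that can be tracked.
    log_metric("rows_dropped", len(raw) - len(cleaned))
    return cleaned


@task
def train_model(data: pd.DataFrame) -> str:
    log_metric("training_rows", len(data))
    return "model-v1"  # placeholder artifact reference


@pipeline
def training_pipeline(raw: pd.DataFrame):
    cleaned = prepare_data(raw)
    return train_model(cleaned)
```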


The Databand offering is billed as an observability solution plugged into the open source data ecosystem, providing deeper understanding of infrastructure performance, how much it’s costing, and how accurate the data is.

“Databand was founded by experienced data scientists and software developers who vividly appreciate the pain points of data science project management. Their backgrounds make them uniquely positioned to help data engineers and data scientists be more productive and more effective in using data in enterprise research and production environments,” David Magerman, managing partner and chief technology officer at Differential Ventures, one of Databand’s investors, said of the company in an email.

Among the issues Shulman points out in a post is that because many data processes are long-running, a failure toward the end can be costly, as jobs must be restarted from the beginning.

The main competition to Databand is the in-house work that organizations cobble together to connect their data pipeline infrastructure and some standard monitoring solution, Benamram said.

Databand is designed specifically for data, with integrations for Airflow, Databricks, Spark, Kubernetes, MLflow and other tools.

“The second element that makes us different is we collect different kinds of information from these processes. Examples would be collecting a lot more metadata about the scheduler that you’re using to run your pipelines, collecting more metadata from the engine, like Spark, and collecting more metadata about the data itself, the actual structure of the data that you’re operating on within your pipelines. And if you wanted to meet the same kind of monitoring with a standard monitoring tool, there would just be a lot of middleware and logging that you would need to build as a company to get to the same place that we do out of the box,” he said.
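As a hedged sketch of that last kind of metadata, the structure of the data itself, a pipeline step might log dataset shape and schema with DBND’s logging helpers; the cleansing rule and column name here are invented:

```python
# Illustrative only: log dataset-level metadata from inside a pipeline step
# using DBND's helpers (names per the open source library; signatures may
# vary across versions). The column and filter rule are hypothetical.
import pandas as pd
from dbnd import log_dataframe, log_metric


def transform_step(df: pd.DataFrame) -> pd.DataFrame:
    result = df[df["price"] > 0]   # hypothetical cleansing rule
    # Capture the structure and volume of the data this task operates on,
    # so monitoring sees the dataset itself, not just logs and traces.
    log_dataframe("cleansed_prices", result)
    log_metric("filtered_rows", len(df) - len(result))
    return result
```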

The third differentiator is that monitoring information is presented in the context of the pipeline itself, he said.


“Within our system, you see your data pipeline, you see the different nodes of the data transformation tasks that you’re running. And you open up those nodes to understand what the data lineage looks like and how your data structure might be changing between runs of this process, and whether there’s problems in the data set or problems in the code that you’re executing within the pipeline,” he said.
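A toy, entirely hypothetical example of the kind of between-run schema comparison such a lineage view surfaces automatically:

```python
# Hypothetical sketch: detect how a dataset's structure changed between runs.
import pandas as pd


def schema_of(df: pd.DataFrame) -> dict:
    """Map each column name to its dtype string."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}


def schema_diff(previous: dict, current: dict) -> dict:
    """Report columns added, removed or retyped between two runs."""
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "retyped": {
            col: (previous[col], current[col])
            for col in set(previous) & set(current)
            if previous[col] != current[col]
        },
    }
```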

The world is still talking a lot about machine learning and AI, he said, and Databand is seeing those use cases.

“The most advanced teams in the world are building up systems now to do automated maintenance of their machine learning models through retraining processes. And we will help them to monitor and observe those kinds of systems,” he said.

“But we also see a lot of just standard ETL cases … A lot of the world is still maturing those classic ETL cases where you just need to make sure that you’re taking in data from as many sources as you need, transforming the data in the right way, and delivering it to the people that need it and making sure that all this system is held to the right quality standard.”
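A bare-bones sketch of what holding an ETL output “to the right quality standard” can mean in code; the thresholds and column name are made up:

```python
# Hypothetical quality gate run before delivering an ETL output.
import pandas as pd


def enforce_quality(df: pd.DataFrame, min_rows: int = 1000) -> pd.DataFrame:
    """Fail the run loudly instead of delivering a suspect dataset."""
    if len(df) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(df)}")
    null_ratio = df["symbol"].isna().mean()   # hypothetical key column
    if null_ratio > 0.01:
        raise ValueError(f"too many null symbols: {null_ratio:.2%}")
    return df
```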

Image by JuraHeep from Pixabay.

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.