March 30, 2022 by Phu Nguyen

Update Pachyderm Challenges Hadoop with Containerized Data Lakes

What’s New

The two components Pachyderm developed for the stack are a file system and a pipeline system.

Pachyderm Pipelines is a system for stringing containers together and doing data analysis with them. You create a containerized program with the tools of your choice that reads and writes to the local filesystem. Pachyderm uses a FUSE volume to inject data into the container, then automatically replicates the container, showing each copy a different chunk of data. This technique enables Pachyderm to scale any code you write to process massive data sets in parallel, according to Zwicker. It doesn’t require using Java at all: If it fits in a container, you can use it for data analysis.
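The pattern described above can be sketched locally in Python. This is only an illustration under stated assumptions, not Pachyderm's API: `count_words` stands in for the containerized program, `run_sharded` stands in for the replication step, and plain copied directories stand in for the FUSE volume. All function and directory names here are hypothetical.

```python
import os
import shutil
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(input_dir: str, output_dir: str) -> None:
    """The 'containerized program': it only reads and writes local files."""
    os.makedirs(output_dir, exist_ok=True)
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name), encoding="utf-8") as src:
            counts = Counter(src.read().split())
        out_path = os.path.join(output_dir, name + ".counts")
        with open(out_path, "w", encoding="utf-8") as dst:
            for word, n in sorted(counts.items()):
                dst.write(f"{word}\t{n}\n")


def run_sharded(input_dir: str, output_dir: str, workers: int = 2) -> None:
    """Run several copies of the program, each seeing a different chunk of files.

    Assumption: each worker's view is simulated by copying its chunk into a
    scratch directory; a real system would mount the data instead.
    """
    names = sorted(os.listdir(input_dir))
    chunks = [names[i::workers] for i in range(workers)]  # round-robin split
    with tempfile.TemporaryDirectory() as scratch, \
            ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = []
        for index, chunk in enumerate(chunks):
            view = os.path.join(scratch, f"worker-{index}")
            os.makedirs(view)
            for name in chunk:
                shutil.copy(os.path.join(input_dir, name), view)
            jobs.append(pool.submit(count_words, view, output_dir))
        for job in jobs:
            job.result()  # re-raise any worker error
```

The program itself knows nothing about parallelism; it just processes whatever files appear in its input directory, which is what lets the surrounding system scale it by changing how the data is split.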

Pachyderm File System is a distributed file system that draws inspiration from git, providing version control over all the data. It’s the core data layer that delivers data to containers. The data is stored in generic object storage such as Amazon’s S3, Google Cloud Storage or the open source Ceph file system. And like Apple’s Time Machine, it provides historical snapshots of how your data looked at different points in time.
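To make the git-like idea concrete, here is a minimal in-memory sketch of versioned data in Python. It is a toy model under stated assumptions, not Pachyderm's implementation or API: the `ToyRepo` class and its methods are hypothetical, a dictionary stands in for object storage, and merging is left out for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    id: str
    parent: Optional[str]
    files: Dict[str, str]  # path -> content hash


class ToyRepo:
    """Hypothetical versioned data store: every commit is a full snapshot."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}      # stand-in for object storage
        self.commits: Dict[str, Commit] = {}
        self.branches: Dict[str, Optional[str]] = {"master": None}

    def commit(self, branch: str, changes: Dict[str, bytes]) -> str:
        parent = self.branches[branch]
        # Start from the parent's snapshot, then apply the changed files.
        files = dict(self.commits[parent].files) if parent else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs[digest] = data  # identical content is stored once
            files[path] = digest
        raw = repr((parent, sorted(files.items()))).encode()
        commit_id = hashlib.sha256(raw).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, parent, files)
        self.branches[branch] = commit_id
        return commit_id

    def fork(self, source: str, new_branch: str) -> None:
        """Start a new branch at the source branch's current commit."""
        self.branches[new_branch] = self.branches[source]

    def read(self, commit_id: str, path: str) -> bytes:
        """Read a file as it looked at a given point in history."""
        return self.blobs[self.commits[commit_id].files[path]]

    def history(self, branch: str) -> List[str]:
        ids, current = [], self.branches[branch]
        while current is not None:
            ids.append(current)
            current = self.commits[current].parent
        return ids
```

Because old commits are never modified, reading a file at an earlier commit ID gives the historical snapshot, and a fork costs nothing beyond a new branch pointer until it diverges.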

“It lets you see how things have changed; it lets people work together,” Zwicker said. “It allows people to not only collaborate on code but on data. One data scientist can build a data set, and another can fork it and build off of it, then merge the results back with the original one. This is something that has been completely missing from the data science tools out there.”

There’s no shortage of technologies (Spark, Pig, Hive and others) considered alternatives to MapReduce, the processing layer in Hadoop.

“We think the existence of all those tools is an indication that MapReduce was the wrong idea to begin with. It was an overly constraining way of analyzing,” Zwicker said.

“What Hadoop found was that MapReduce could do a bunch of stuff, but they needed to invent other things on top of it, like Pig and Hive and those things,” Zwicker said. “Hadoop has something kind of like what we do, which is called Hadoop Streaming, but it’s a very second-class citizen that’s added afterward rather than us having our containerized workload be the core layer that everybody uses.”

Doliner adds that Spark, Hive and other tools are all still built on top of core pieces of the Hadoop infrastructure, such as Zookeeper, YARN and HDFS, which are among Hadoop’s weaknesses.

Docker Was ‘Aha’ Moment

Doliner and Zwicker founded the San Francisco-based company in 2014 and participated in Y Combinator in early 2015. It has raised $2 million from Data Collective, Blumberg Capital, Foundation Capital, and others.

It might appear nakedly ambitious to state outright one’s plans to replace Hadoop, but the founders contend they have the only company building something totally new.

“If you look at what [the others] are building, all of it is still the same Hadoop primitives repackaged in some way. We’ve believed from very early on that the problem isn’t that Hadoop isn’t packaged in the right way, but that Hadoop has inherent flaws,” Doliner said.

The company started out before Docker was released. The founders knew from the start that they wanted to build a replacement for Hadoop, and then they saw an early demo of Docker at their former employer, RethinkDB.

“That was the ‘aha’ moment,” Zwicker said. “We knew Hadoop was going to be replaced and saw containers are the perfect tool to do it. We knew they were going to create this whole ecosystem we could use to replace it. When we put all of that together, that is when things really started working for us.”

Adds Doliner: “We’re not just saying, ‘Hey containers are a hot new technology. Let’s take everything and shove it in a container’ and all of a sudden that’s a new product.”

Because the company was early to the container movement, its technology and the container ecosystem have been evolving together, he said.

One of the key benefits of Pachyderm, they say, is that being productive with it doesn’t take the large team with specific expertise that Hadoop requires. That was an attractive feature for its customer Fogger, according to CEO Kamil Kozak.

Fogger makes a software platform for processing sensor data from industrial installations such as solar farms and wind turbines. Its Fog Computing platform allows data processing on small Linux boxes close to the machines and pushes the data over a peer-to-peer network to a central cloud hub. It uses Pachyderm for local data processing on its way to the cloud.

“At Fogger, we believe that containers are redefining infrastructure and that they will be used in all types of deployments,” Kozak said.

“Pachyderm has a very well-designed technological stack. We love the idea of map/reduce pipelines built with containers and a simple Git-like triggering system.

“We were evaluating having to build our own solution in-house or using something like Hadoop/Spark when I stumbled across Pachyderm. We chose Pachyderm because the learning curve and infrastructure overhead for Hadoop/Spark was significantly harder than Pachyderm; it just fit seamlessly into our containerized stack,” he said.

Containers allow Fogger to build data-processing algorithms in any programming language, “which simplifies our lives drastically as we don’t have to learn any new technology other than Pachyderm CLI itself,” Kozak said.

Feature image via Pixabay, licensed under CC0.

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.

Source: InApps.net
