A Different Table Design for Big Data – InApps 2022


Handling Partitioning

One of Iceberg’s selling points is that users don’t have to know much about partitioning.

“In the old model, the columns that were used to produce those directories, those were just normal columns, and they had no association to other columns,” Blue said. “So if you wanted to store data by day, you would probably derive that date from a timestamp. But the system had no way of saying, ‘Oh, I know that you’re looking for this timestamp range.’ You had to add both the timestamps you’re looking for and the days that you’re looking for, which was just very, very error-prone. So we started keeping track of those relationships so that we can take queries on timestamp and bake those down into queries on the date ranges and automatically figure out what files you need.”

Iceberg users don’t have to maintain partition columns or even understand the physical table layout to get accurate query results, an IBM blog post explains. Iceberg handles all the details of partitioning and querying, and keeps track of the relationship between a column value and its partition without requiring additional columns.
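
The relationship Blue describes, between a timestamp column and the day partition derived from it, can be illustrated with a small sketch. This is not Iceberg’s implementation or API: the `day_transform` function, the `DataFile` class, the `event_day` field and the file names are all hypothetical, and the planner is reduced to a single inclusive timestamp range.

```python
from dataclasses import dataclass
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Derive the partition value (a calendar day) from a timestamp."""
    return ts.date()


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical file name
    event_day: date  # partition value the table metadata records for this file


def plan_files(files, ts_start: datetime, ts_end: datetime):
    """Turn a query on the timestamp range [ts_start, ts_end] into a day range.

    The user only supplies timestamps; the day bounds are derived here, so
    there is no separate date predicate to remember or get wrong.
    """
    first_day = day_transform(ts_start)
    last_day = day_transform(ts_end)
    return [f for f in files if first_day <= f.event_day <= last_day]
```

Given one file per day for March 1 through March 3, 2021, a query for the morning to the evening of March 2 would select only the March 2 file, without the user ever mentioning the day column.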

In addition to addressing the reliability and correctness issues, the project focused on improving performance, using file metadata to skip more files and satisfy queries faster, and on in-place table evolution, so a table can change as business needs change.
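
One common way file metadata lets an engine skip files is to keep the minimum and maximum value of a column for each file and discard any file whose range cannot overlap the query. The sketch below shows that idea only; the `FileStats` class, its field names and the file names are assumptions made for illustration, not Iceberg’s metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str   # hypothetical file name
    lower: int  # smallest value of the column stored in this file
    upper: int  # largest value of the column stored in this file


def may_contain(stats: FileStats, lo: int, hi: int) -> bool:
    """True if the file's [lower, upper] range overlaps the query range [lo, hi]."""
    return stats.upper >= lo and stats.lower <= hi


def prune(files, lo: int, hi: int):
    """Return the paths of files that still have to be read for the query."""
    return [f.path for f in files if may_contain(f, lo, hi)]
```

A file that passes the check may still contain no matching rows, so the row-level filter continues to run on whatever is read; the metadata only rules files out.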

“Now we’re working on all of the new things that we can do, given this fundamentally better design for tables,” Blue said. These include row-level deletes and upserts. The project just committed “merge into” and “delete from” as high-level SQL operations and will add “update.”

“So those operations are a lot less targeted at knowing how your table is stored and laid out, and much more focused on what do you want to do to individual rows in your table? And that’s where we want our data engineers to be focused,” he said. The system can make things fast and efficient because it can figure out exactly which data files need to be updated, and then go rewrite those data files.
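
The step of working out which data files hold affected rows and rewriting only those files can be sketched as a toy row-level delete. This is a simplified model under stated assumptions, not the project’s code: the table is a plain dictionary from hypothetical file names to lists of rows, and the `.rewritten` suffix is invented for the example.

```python
def delete_where(table, predicate):
    """Delete rows matching `predicate`, rewriting only the files that hold them.

    `table` maps a file name to its list of rows (dicts). Returns the new
    file-to-rows mapping and the list of original files that were rewritten.
    """
    new_table = {}
    rewritten = []
    for path, rows in table.items():
        kept = [row for row in rows if not predicate(row)]
        if len(kept) == len(rows):
            # No matching rows: the existing file is reused untouched.
            new_table[path] = rows
        else:
            rewritten.append(path)
            if kept:
                # Surviving rows are written to a replacement file.
                new_table[path + ".rewritten"] = kept
    return new_table, rewritten
```

The caller states only what should happen to individual rows, for example `delete_where(table, lambda row: row["user"] == "bob")`, and never names a file or partition.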

Blue said he’s excited about the capabilities that come with row-level deletes and about the ability to build data services that operate on tables without requiring users to think quite so much about the details or physical layout.

Decoupling Compute and Data

Iceberg is part of the trend toward decoupling compute and data, according to Tomer Shiran, co-founder of Dremio, and offers an alternative to the tradeoffs between data lake and data warehouse.

“We want the data tier to support things like transactions and data mutations and time travel. And it needs to be open source and accessible to all these different engines, that’s the whole value of a modern, loosely coupled architecture. So Iceberg is a perfect fit for that,” he said.

Dremio, which aims to eliminate the middle layers and the work involved between the user and the data stores, has announced plans to integrate its platform with Iceberg this year. It has two projects related to Iceberg:

Project Nessie provides Git-like semantics for data lakes. It enables users to experiment with branches of data or prepare data without affecting the live view of the data.
Arrow Flight 3.0 lets Apache Arrow-enabled systems exchange data with one another simultaneously at speeds that are orders of magnitude faster than was possible before.
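
The Git-like branching described for Project Nessie can be pictured with a toy model in which each branch is a separate mapping from table name to snapshot identifier. This is an illustrative sketch only and not Nessie’s API: the `ToyCatalog` class, its method names and the snapshot identifiers are all hypothetical.

```python
class ToyCatalog:
    """Toy model: every branch maps table names to a snapshot identifier."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's table pointers.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        # Point `table` at a new snapshot, on this branch only.
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # Publish the source branch's table pointers onto the target branch.
        self.branches[target].update(self.branches[source])
```

Work committed on an experimental branch stays invisible to readers of `main` until `merge` is called, which is the “prepare data without affecting the live view” behavior in miniature.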

He sees two competing standards in the space: Delta Lake, created by Databricks, and Iceberg.

One of the problems with Delta Lake, he said, is that you can only do inserts and transactions from Spark, while Iceberg allows transactions, updates and time travel from any system, including Dremio, Spark and Presto.

“It comes to how data is stored,” Shiran said. “People are always going to choose the more open approach. We saw that with Parquet, when it came to file formats, right? There were competing standards at the time and some only worked with one engine like Hive and others work across the board. Parquet obviously won, and I think it’s kind of very similar situation.”

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Dremio.

Image by enriquelopezgarre from Pixabay.

Source: InApps.net


Phu Nguyen
