Home
>
Data Science
>
Update Apache Spark 1.6: Strong Typing, Faster Throughput

March 30, 2022 by Phu Nguyen

Update Apache Spark 1.6: Strong Typing, Faster Throughput

Main Contents:

Apache Spark 1.6: Strong Typing, Faster Throughput is an article under the topic Data Science Many of you are most interested in today !! Today, let’s InApps.net learn Apache Spark 1.6: Strong Typing, Faster Throughput in today’s post !

Read more about Apache Spark 1.6: Strong Typing, Faster Throughput at Wikipedia

You can find content about Apache Spark 1.6: Strong Typing, Faster Throughput from the Wikipedia website

The release of Spark 1.6 continues the evolution of the data analysis platform toward greater performance and usability, according to Reynold Xin, co-founder of Spark sponsoring company Databricks. He noted that the number of project contributors has topped 1,000, a 50 percent increase in the past year.

He points to automatic memory management among the ways the new release makes life simpler for users.

“Now, instead of users having to tune memory settings, it figures it out for you. Most users don’t understand tuning,” he said.

“In the past, Spark separated memory into two different regions. One was caching where Spark temporarily persists user data and scans it again quickly. The other is the execution memory region, which is used for execution of the algorithm,” he explained.

“If Spark aggregates some data or joins some data together, it needs memory to perform those. The size of these two types of memory was statically divided in the past,” he explained. “Users don’t understand how to tune this, and leaving it static generally leads to underutilization of one and overutilization of the other and this can slow things down. So we removed that setting and you can just let Spark figure out what’s the best one. You can let Spark understand the workload and set the best value possible.”

As Spark is much simpler to use than Hadoop, Spark has been outgrowing the traditional Hadoop audience, he said. Increasingly, its users are data scientists and data analysts – people who are less technical. That highlights the need for even greater simplicity. A survey done last summer, however, found performance ranked as the chief concern among users.

Version 1.6 includes a new Parquet reader that grew out of Project Tungsten, an effort to rewrite a lot of the Spark execution back end to bypass a lot of inefficiencies at the JVM. Parquet is one of the most commonly used data formats in Spark. In looking at how customers used Spark, they found many were spending most of their time in the format spec library parquet-mr on a step called record assembly.

“What record assembly does is turns a bunch of raw bytes into rows. So we rewrote the entire record assembly using a technique called vectorized read, which leverages a lot of the more modern CPU capabilities. Now we can use, for example, ‘send to’ instructions to do multiple data-processing steps in a single CPU instruction. It turned out this had a huge benefit to it,” Xin said.

This new reader bypasses parquert-mr’s record assembly and uses a more optimized code path for flat schemas. It increases the scan throughput for 5 columns from 2.9 million rows per second to 4.5 million rows per second, an almost 50 percent improvement, the company explains in a blog post.

That project’s not complete, according to Xin. The next version of Spark, due in late March or early April, will have a fully implemented vectorized read, and there could be two to three times improvement in speed in reading Parquet, he says.

It also touts a 10-fold speedup for streaming state management with a new mapWithState API that scales linearly to the number of updates rather than the total number of records.

“One of the most widely used features of Spark streaming is to track sessions — how many users in a particular period of time. It’s tracking a lot of different states in memory,” he explained.

“In the past, every few seconds it tried to understand the current state of the universe. It would go through every single item in the state management process. We implemented it to use a new algorithm that only walks through every time there’s an update – and we only walk through the updated values. So now we’re tracking session deltas. This essentially cuts down the amount of data we have to scan every few seconds from all the data to just the data that’s being updated.”

Spark 1.6 also introduces an extension of the DataFrames API called Datasets to address compile-time type safety.

“When using Java or Scala, it enforces the type information for you. This is great for data science, but if you’re serious about building very large engineering pipelines, type safety, enforced by the compiler is a very important feature. So we took [our customer] feedback and built a dataset API to add type safety on top of DataFrames. So the idea is you can now add the Datasets API to get stronger type guarantees enforced by the Java or Scala compiler,” he said.

The new release also includes a number of new data science functionalities that are largely ad-hoc, Xin said.

“One of the more interesting ones is the way that Spark users can persist. They can save and read back machine learning pipelines and models they train as part of the pipeline,” he said.

“Spark people are used to training models where you have the data and can build predictive models and tie the model to some live datasets. In the past, the last step was actually frustratingly difficult. You’d have to build a lot of machinery yourself to actually save this model. Then you have to load it back into your production application.

“The pipeline persistence functionality for machine learning actually automates this whole thing, so now it’s just a save function in the pipeline you built. Then in your production application, just load back the model that’s built from this pipeline with a single API call. This is a crucial step in building end-to-end predictive applications. It makes users’ lives much simpler.”

Feature Image: “sparkling light” by james j8246, licensed under CC BY-SA 2.0.

Source: InApps.net

Rate this post

Phu Nguyen

As a Senior Tech Enthusiast, I bring a decade of experience to the realm of tech writing, blending deep industry knowledge with a passion for storytelling. With expertise in software development to emerging tech trends like AI and IoT—my articles not only inform but also inspire. My journey in tech writing has been marked by a commitment to accuracy, clarity, and engaging storytelling, making me a trusted voice in the tech community.

Let’s create the next big thing together!

Coming together is a beginning. Keeping together is progress. Working together is success.

Let’s talk

Recommended

Tech News

May 29, 2025 by Anh Hoang

Update Apache Spark 1.6: Strong Typing, Faster Throughput

Read more about Apache Spark 1.6: Strong Typing, Faster Throughput at Wikipedia

AI Automation for Business in 2025: A Step-by-Step Guide

FITNESS APP DEVELOPMENT

ONLINE COURSE APP

EVE HR – WEB DESIGN

AIRGOGO WEBSITE

WALLET APP DEVELOPMENT

Ho Chi Minh City Launches Digital Traffic App 2017

Why Your Business Needs a Mobile App Rather Than a Website

7 Questions To Ask Yourself Before You ‘App’ | Entrepreneur

Homestays Marketplace Application Development

Blog post

9 Practical Tips to Choose a Mobile App Development Company for 2023

AI Automation for Business in 2025: A Step-by-Step Guide

Top 10 Offshore Development Companies (ODCs) in 2025

How can businesses effectively integrate AI into their operations?

Locations

Read more about Apache Spark 1.6: Strong Typing, Faster Throughput at Wikipedia

Get a custom Proposal

You need to enter your email to download

Blog post

Locations