The release of Spark 1.6 continues the evolution of the data analysis platform toward greater performance and usability, according to Reynold Xin, co-founder of Spark sponsoring company Databricks. He noted that the number of project contributors has topped 1,000, a 50 percent increase in the past year.
He points to automatic memory management among the ways the new release makes life simpler for users.
“Now, instead of users having to tune memory settings, it figures it out for you. Most users don’t understand tuning,” he said.
“In the past, Spark separated memory into two different regions. One was caching where Spark temporarily persists user data and scans it again quickly. The other is the execution memory region, which is used for execution of the algorithm,” he explained.
“If Spark aggregates some data or joins some data together, it needs memory to perform those. The size of these two types of memory was statically divided in the past,” he explained. “Users don’t understand how to tune this, and leaving it static generally leads to underutilization of one and overutilization of the other and this can slow things down. So we removed that setting and you can just let Spark figure out what’s the best one. You can let Spark understand the workload and set the best value possible.”
As Spark is much simpler to use than Hadoop, Spark has been outgrowing the traditional Hadoop audience, he said. Increasingly, its users are data scientists and data analysts – people who are less technical. That highlights the need for even greater simplicity. A survey done last summer, however, found performance ranked as the chief concern among users.
Version 1.6 includes a new Parquet reader that grew out of Project Tungsten, an effort to rewrite a lot of the Spark execution back end to bypass a lot of inefficiencies at the JVM. Parquet is one of the most commonly used data formats in Spark. In looking at how customers used Spark, they found many were spending most of their time in the format spec library parquet-mr on a step called record assembly.
“What record assembly does is turns a bunch of raw bytes into rows. So we rewrote the entire record assembly using a technique called vectorized read, which leverages a lot of the more modern CPU capabilities. Now we can use, for example, ‘send to’ instructions to do multiple data-processing steps in a single CPU instruction. It turned out this had a huge benefit to it,” Xin said.
This new reader bypasses parquert-mr’s record assembly and uses a more optimized code path for flat schemas. It increases the scan throughput for 5 columns from 2.9 million rows per second to 4.5 million rows per second, an almost 50 percent improvement, the company explains in a blog post.
That project’s not complete, according to Xin. The next version of Spark, due in late March or early April, will have a fully implemented vectorized read, and there could be two to three times improvement in speed in reading Parquet, he says.
It also touts a 10-fold speedup for streaming state management with a new mapWithState API that scales linearly to the number of updates rather than the total number of records.
“One of the most widely used features of Spark streaming is to track sessions — how many users in a particular period of time. It’s tracking a lot of different states in memory,” he explained.
“In the past, every few seconds it tried to understand the current state of the universe. It would go through every single item in the state management process. We implemented it to use a new algorithm that only walks through every time there’s an update – and we only walk through the updated values. So now we’re tracking session deltas. This essentially cuts down the amount of data we have to scan every few seconds from all the data to just the data that’s being updated.”
Spark 1.6 also introduces an extension of the DataFrames API called Datasets to address compile-time type safety.
“When using Java or Scala, it enforces the type information for you. This is great for data science, but if you’re serious about building very large engineering pipelines, type safety, enforced by the compiler is a very important feature. So we took [our customer] feedback and built a dataset API to add type safety on top of DataFrames. So the idea is you can now add the Datasets API to get stronger type guarantees enforced by the Java or Scala compiler,” he said.
The new release also includes a number of new data science functionalities that are largely ad-hoc, Xin said.
“One of the more interesting ones is the way that Spark users can persist. They can save and read back machine learning pipelines and models they train as part of the pipeline,” he said.
“Spark people are used to training models where you have the data and can build predictive models and tie the model to some live datasets. In the past, the last step was actually frustratingly difficult. You’d have to build a lot of machinery yourself to actually save this model. Then you have to load it back into your production application.
“The pipeline persistence functionality for machine learning actually automates this whole thing, so now it’s just a save function in the pipeline you built. Then in your production application, just load back the model that’s built from this pipeline with a single API call. This is a crucial step in building end-to-end predictive applications. It makes users’ lives much simpler.”
Feature Image: “sparkling light” by james j8246, licensed under CC BY-SA 2.0.