Redis Labs has released a connector that allows the Spark data processing platform to use the Redis in-memory data store.

Using Redis for Spark will allow users to “store a huge amount of data without paying a significant amount of money for infrastructure,” explained Yiftach Shoolman, co-founder and Chief Technology Officer of Redis Labs, noting that Redis can be a lower cost alternative to a full-fledged in-memory database system. “Today we want the big data performance to be as close to real-time as possible. That is what we try to do.”

Specifically, the open source Spark-Redis connector package provides an easy way to run SparkSQL queries against data stored on Redis.
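
To make the idea concrete, here is a minimal Python sketch of what running a SQL query over hash-shaped records looks like. It is not the connector's API: a plain dict stands in for Redis hashes, the standard-library sqlite3 module stands in for SparkSQL, and the key names (`user:1` and so on) and the `query_hashes` helper are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for Redis hashes: key -> {field: value}.
# Redis stores hash fields as strings, so ages are strings here too.
REDIS_HASHES = {
    "user:1": {"name": "ada", "age": "36"},
    "user:2": {"name": "linus", "age": "29"},
    "user:3": {"name": "grace", "age": "45"},
}


def query_hashes(hashes, sql):
    """Load each hash as one table row, then run a SQL query over the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (key TEXT, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(key, h["name"], int(h["age"])) for key, h in hashes.items()],
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


older = query_hashes(
    REDIS_HASHES, "SELECT name FROM users WHERE age > 30 ORDER BY name"
)
print(older)
```

The point of the sketch is only the shape of the workflow: each hash becomes a row, each field a column, and the query engine sees a table.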

According to benchmarks from Redis Labs, running Spark against a Redis data store can be 135 times faster than using HDFS (the Hadoop Distributed File System) and 45 times faster than using the Tachyon in-memory data store.


Redis Labs is eager to make Redis the de facto data store for Spark, Shoolman asserted.

The package is a library for writing to and reading from a Redis cluster. It exposes all of Redis' data structures (strings, hashes, lists, sets, sorted sets, bitmaps and HyperLogLogs) as Spark RDDs (Resilient Distributed Datasets) or through the Spark Dataset API.
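
As a rough illustration of what exposing data structures as record collections can mean, the Python sketch below flattens each structure type into a list of records, which is the kind of shape a distributed dataset would carry. This is a toy model under stated assumptions, not the connector's code: the `to_records` helper, the type tags and the sample values are all hypothetical, and plain Python containers stand in for Redis.

```python
def to_records(kind, value):
    """Flatten one Redis-style value into a list of records.

    `kind` is a hypothetical type tag; the record shapes are an
    assumption chosen for illustration.
    """
    if kind == "string":
        # Bitmaps and HyperLogLogs are stored as strings in Redis,
        # so they would take this branch as opaque values.
        return [value]
    if kind == "hash":
        return sorted(value.items())          # (field, value) pairs
    if kind == "list":
        return list(value)                    # elements in list order
    if kind == "set":
        return sorted(value)                  # members, sorted for stable output
    if kind == "zset":
        # (member, score) pairs ordered by ascending score
        return sorted(value.items(), key=lambda pair: pair[1])
    raise ValueError(f"unknown structure type: {kind}")


hash_records = to_records("hash", {"name": "ada", "lang": "en"})
zset_records = to_records("zset", {"bob": 1.5, "amy": 3.0})
print(hash_records)
print(zset_records)
```

Once every structure is reduced to a flat sequence of records like this, the usual map, filter and reduce operations apply to it regardless of which Redis type it came from.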

The library minimizes the overhead that occurs with serialization and deserialization of large amounts of data.

Spark itself has emerged as the chief successor to the Hadoop data processing platform thanks in no small part to an ability to process data in near-real time, rather than the batch processing of ‘big data’ that Hadoop originally offered.

“Apache Spark is becoming a default in-memory engine for high-performance data integration and analytics,” said Matt Aslett, research director, data platforms and analytics at 451 Research, in a statement. “The combination of Redis and Spark should enable high-performance, real-time analytics with extremely large and variable datasets.”