Chad Smykay
Chad has an extensive background in operations from his time at USAA, as well as from helping build many shared-services solutions at Rackspace, a world-class support organization. He has helped implement many production big data/data lake solutions. As an early adopter of Kubernetes coupled with data analytics use cases, he brings a breadth of experience in application modernization for business use cases.

If you’ve ever been stuck in the mud in your vehicle, it’s an unsettling experience! You feel trapped. This past summer it happened to me, but luckily a simple solution exists: Add traction.

Migrating from a Spark on YARN (SoY) implementation to a Spark Operator running on Kubernetes can sometimes feel the same way: like you're stuck in the mud. So, how do you gradually introduce the Kubernetes experience to your team and run successfully in a post-SoY world? What can you do to gain traction?

I support more than 20 different Fortune 100 customers making the migration to a post-SoY world, which lets me see common factors across their migrations. By breaking the work down into a few key questions, you can set yourself up for a successful journey to a more modern implementation of Spark:

  1. What workloads have the simplest job and YARN container requirements?
  2. What workloads have the least amount of data connectivity needs?
  3. What workloads need strict compute and storage latency?

Workloads with the Simplest Job and YARN Requirements

Of course, the "low-hanging fruit" is to move the workloads with the least-complex YARN configuration first. There are many articles, blog posts and even custom calculators on how best to calculate YARN container configurations for your workload. My favorite and go-to is Princeton Research Computing's guide on tuning Spark applications. Ignoring the SLURM-specific requirements, their explanations are the simplest to follow when tuning your Spark applications on YARN.

Figure 1. Calculating your YARN container configuration, Princeton Research Computing
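The arithmetic behind calculators like this can be sketched in a few lines of Python. This is a rough rule-of-thumb sizing, not the Princeton calculator itself; the constants (five cores per executor, one core and 1 GB reserved for the OS, roughly 7% off-heap overhead) are common defaults you should adjust for your own cluster:

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5,   # common rule of thumb for HDFS throughput
                   os_reserved_cores=1, os_reserved_mem_gb=1,
                   overhead_fraction=0.07):
    """Rule-of-thumb YARN executor sizing for a homogeneous cluster."""
    execs_per_node = (cores_per_node - os_reserved_cores) // cores_per_executor
    # Leave one executor slot free for the YARN ApplicationMaster / driver.
    num_executors = nodes * execs_per_node - 1
    raw_mem = (mem_gb_per_node - os_reserved_mem_gb) / execs_per_node
    # Shave off the off-heap overhead so the container fits on the node.
    executor_memory_gb = int(raw_mem * (1 - overhead_fraction))
    return {"num_executors": num_executors,
            "executor_cores": cores_per_executor,
            "executor_memory_gb": executor_memory_gb}

# Example: 10 worker nodes, each with 16 cores and 64 GB of memory.
print(size_executors(10, 16, 64))
```

Workloads whose containers come straight out of arithmetic like this are your first movers; anything that needed hand-tuning beyond it deserves a closer look before migrating.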

In general, you can start from these two YARN configuration buckets:

  • Simple container definitions
  • Complex scheduler definitions

Move your simplest YARN container definitions first, as those translate most easily to Kubernetes resource assignments (CPU count, memory, etc.). If you have more complex YARN scheduler definitions, such as those used with the fair scheduler or capacity scheduler, move those last, after you have considered how your Kubernetes resource assignments will be defined. It is worth noting that a YARN implementation using the capacity scheduler translates more easily into shared resources within a single Kubernetes cluster running multiple workloads.
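To make that translation concrete, here is a minimal sketch of mapping a simple YARN container definition onto Spark-on-Kubernetes configuration properties. The `spark.kubernetes.executor.request.cores` and `limit.cores` keys are standard Spark properties; the conservative 1:1 CPU request/limit mapping shown here is an assumption you may relax once you understand your workload:

```python
def yarn_to_k8s_conf(executor_cores, executor_memory_gb, num_executors):
    """Translate a simple YARN container definition into spark-submit
    properties for Spark on Kubernetes."""
    return {
        "spark.executor.instances": str(num_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": f"{executor_memory_gb}g",
        # Pod-level CPU request/limit; requesting 1:1 with executor cores
        # is the conservative starting point (no oversubscription).
        "spark.kubernetes.executor.request.cores": str(executor_cores),
        "spark.kubernetes.executor.limit.cores": str(executor_cores),
    }

conf = yarn_to_k8s_conf(executor_cores=5, executor_memory_gb=19, num_executors=29)
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

Scheduler-level constructs (fair-scheduler pools, capacity-scheduler queues) have no one-line equivalent; they typically map to Kubernetes namespaces with resource quotas, which is why those workloads should move last.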

Verifying Your Data Connectivity Needs

Part of moving to a post-SoY implementation is more freedom of choice on connecting to either your current or new data sources that Spark can use. Some common methods I see are:

  • Connecting to existing HDFS clusters.
  • Connecting to S3 API enabled storage.
  • Connecting to Cloud Object Storage providers.
  • Connecting to other filesystems using Kubernetes CSI.

Most of my customers are taking this time to update their standard data-access patterns, meaning they are defining which type of data should be stored in which type of data system or object store. They are taking the time to define, for each business use case or data type, where that data should live. For example, financial ticker data from stock trades might be stored in Parquet format on an S3 API system, while data science machine learning notebooks are stored on a Kubernetes-compliant CSI filesystem. The most common pattern is storing all data on S3 API-enabled storage, such as HPE Ezmeral Data Fabric or a cloud provider's object store.
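For the common S3 API case, the connection details usually end up as Hadoop S3A settings passed through `spark.hadoop.*` properties. A minimal sketch follows; the endpoint and credentials are placeholders, and in a real cluster you would mount them from a Kubernetes Secret rather than hardcoding them:

```python
def s3a_conf(endpoint, access_key, secret_key, path_style=True):
    """Build the spark.hadoop.* properties for the Hadoop S3A connector.

    endpoint/access_key/secret_key are placeholders; in production,
    pull them from a Kubernetes Secret instead of literals.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Many on-prem S3-compatible stores require path-style addressing.
        "spark.hadoop.fs.s3a.path.style.access": str(path_style).lower(),
    }

conf = s3a_conf("https://s3.example.internal:9000", "ACCESS_KEY", "SECRET_KEY")
for key, value in conf.items():
    print(f"{key}={value}")
```

With these set on the SparkSession, a call such as `spark.read.parquet("s3a://bucket/ticker-data/")` would read the Parquet ticker data from the example above.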


Keep in mind that Kubernetes gives you greater flexibility in connecting to new and interesting data sources, and those should be accounted for in your data governance policies.

Compute and Storage Latency Needs

One of the benefits of Hadoop-era workloads was the powerful combination of having your storage "next door" to your compute. Sure, in the early MapReduce days you had some issues with your workload's shuffle tasks, but you could control them if needed. Part of the benefit of SoY is keeping that combination of compute and storage, which means that for most workloads, data transfers are reduced. When you migrate to Spark on Kubernetes, you must keep this fact in mind.

A couple of questions to ask on your SoY workloads:

  1. Do my Spark jobs read in large files or data sets?
  2. Do my Spark jobs read in a large number of files or data sets?
  3. If I introduce additional read or write latency to my Spark jobs, will that affect job time or performance?

It is important to run a sample job on your new Spark implementation, being careful to note your RDD read and write times. One way to get a "level set" of base performance on your current implementation versus your new one is to turn off all "MEMORY_ONLY" settings on your RDDs. Why? Because if you can get a baseline of your "DISK_ONLY" performance, your memory-enabled RDDs' performance should be like for like, assuming you assign the same resources in Kubernetes.
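A small timing harness makes the before-and-after comparison repeatable. The harness itself is plain Python; the commented usage sketch assumes the standard PySpark `StorageLevel.DISK_ONLY` API and a hypothetical input path:

```python
import statistics
import time

def median_runtime(run_job, runs=3):
    """Run a job callable several times and return the median wall-clock
    seconds, smoothing out one-off variance between runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_job()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Sketch of use against a Spark RDD. Run this on both the SoY cluster
# and the Kubernetes cluster with the same resource assignments:
#
#   from pyspark import StorageLevel
#   rdd = spark.sparkContext.textFile("s3a://bucket/input/") \
#              .persist(StorageLevel.DISK_ONLY)
#   print(median_runtime(lambda: rdd.count()))
```

Comparing the two medians tells you directly how much latency the new separation of compute and storage has introduced, before memory caching masks it.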


It is also important to note that moving to a post-SoY world means revisiting your security policies and monitoring implementation to properly secure and monitor Spark on Kubernetes resources. Luckily, HPE Ezmeral offers a single container platform for analytics that can support you on this central security and monitoring journey to your new workload.

Recap

With these simple steps, you can create the traction you need to move to a post-SoY implementation using Kubernetes:

  • Migrate your simplest YARN configurations first, being careful to spend time on complex YARN scheduler definitions and transition those to Kubernetes resource definitions as needed.
  • Verify any new data connectivity needs in your K8s cluster as well as the security implications around them.
  • Run test workloads after separation of compute and storage to ensure you do not introduce any new latency into your jobs.

If you or your organization are struggling to start your journey on a post-SoY implementation, HPE is here to help. Check out the HPE AMP Assessment Program, a proven best practices migration methodology, to learn how HPE can help you avoid getting stuck in the mud and start you on your migration journey.

Featured image via Pixabay.