Kubernetes offers some powerful benefits as a resource manager for Big Data applications, but comes with its own complexities.

Speaking at ApacheCon North America recently, Christopher Crosbie, product manager for open data and analytics at Google, noted that while Google Cloud Platform (GCP) offers managed versions of open source Big Data stacks, including Apache Beam and TensorFlow for machine learning, Google is at the same time working with the open source community to make open source Big Data software more cloud-friendly.

“What folks tend to do, when they move from on-prem to the cloud with these Big Data stacks, is they start to piece up all the different workloads, to run those on an appropriate size cluster — or appropriate size and shape really,” he explained.

“So you might have a lot of BI or reporting applications that will try to stick onto a memory-heavy cluster, or you’ll have a bunch of machine learning jobs, you’ll stick onto these compute-heavy clusters. But piecing all that up and figuring out which jobs align with each other — that can be a pretty difficult task.”

That’s why Google, with the open source community, has been experimenting with Kubernetes as an alternative to YARN for scheduling Apache Spark.


Crosbie works on Google’s Cloud Dataproc team, which offers managed Hadoop and Spark. These distributed systems require a cluster-management system to handle tasks such as checking node health and scheduling jobs. With Apache Spark, you can run it with a scheduler such as YARN or Mesos, in standalone mode, or now with Kubernetes, which is still experimental, Crosbie said.
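Scheduling Spark on Kubernetes boils down to pointing spark-submit at the cluster’s API server instead of a YARN resource manager. A minimal sketch, assuming a Spark 2.4-era build with Kubernetes support; the API server address, container image, and jar path are placeholders:

```shell
# Submit a Spark job with Kubernetes as the cluster manager.
# The k8s:// prefix tells spark-submit to schedule driver and executor
# pods on Kubernetes rather than YARN or Mesos.
# All host, registry, and jar values below are illustrative.
spark-submit \
  --master k8s://https://my-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The `local://` scheme refers to a path inside the container image, which is why the Spark distribution is baked into the image in the first place.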

Google is using custom resource definitions (CRDs) and operators as a means to extend the Kubernetes API. So far, it has open sourced operators for Spark and Apache Flink, and is working on more.
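With the Spark operator installed, a job becomes a first-class Kubernetes object rather than a one-off spark-submit invocation. A hedged sketch of a SparkApplication custom resource in the style of the open-sourced spark-on-k8s-operator; the apiVersion, image, and resource figures vary by operator release and are illustrative here:

```shell
# Declare a Spark job as a SparkApplication custom resource; the
# operator watches for these objects and translates each one into
# driver and executor pods. Names and versions are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: registry.example.com/spark:2.4.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  sparkVersion: "2.4.0"
  driver:
    cores: 1
    memory: 512m
  executor:
    instances: 2
    cores: 1
    memory: 512m
EOF
```

Because the job is now declarative, it can be versioned, diffed, and managed with `kubectl` like any other Kubernetes resource.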

Users who don’t want to run these applications on Google Cloud can download a Helm chart and run the operators on Kubernetes clusters in other clouds or on-premises.
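Installing the operator that way is a standard Helm workflow. A sketch using Helm 3 syntax and the GoogleCloudPlatform chart repository; the repository URL and chart name have changed across releases, so treat both as placeholders:

```shell
# Register the chart repository for the Spark operator and install it
# into its own namespace (repo URL and chart name vary by release).
helm repo add spark-operator \
  https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-spark-operator spark-operator/sparkoperator \
  --namespace spark-operator \
  --create-namespace
```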

He pointed to three primary benefits to using Kubernetes as a resource manager:

  • Unified management — getting away from two cluster-management interfaces if your organization is already using Kubernetes elsewhere.
  • Ability to isolate jobs — you can move models and ETL pipelines from dev to production without the headaches of dependency management.
  • Resilient infrastructure — you don’t have to worry about sizing and building the cluster, manipulating Docker files or Kubernetes networking configurations.

But there are tradeoffs, he said, outlining what he called “the Yin and Yang of going from YARN to Kubernetes”:

“It provides a unified interface if you are already moving to this Kubernetes world, but if not, this might just be like yet another cluster type to manage if you’re not already investing in that ecosystem.

Kubernetes will enable your data scientists and developers to tap into a lot of resources. If your servers are busy during the day, you can run Big Data jobs at night when they’re less busy. But if you’ve been trying to do that already with YARN, everything you’ve done with YARN will be thrown out because Kubernetes has a different way to manage resources.

Developers are going to love Kubernetes because they can start to put in all these custom configurations. But you’ll definitely want to track what they’re doing. Most companies know how to do that with YARN: what to look for, what to alert on.”


“With Kubernetes, you definitely have logging, but you’re going to have to rethink what those logs actually look like,” he said.

If everybody is on an older version of Spark that’s production tested, but one data scientist really wants a new feature in the latest version of Spark, they can package that version as a container and run it on the same infrastructure with Kubernetes, and the jobs don’t conflict.

But for a lot of use cases, developers might find themselves dealing with something they didn’t expect. One issue that often comes up is a Kubernetes network configuration needed to reach a data source that wasn’t part of the standard setup. That’s the kind of thing Google has been trying to address with operators.

With Kubernetes, you can go from thinking at the cluster level to thinking about a particular job with its own assigned memory, CPU and other resources. You can really isolate those containers. But there are times you want to share data between jobs, and that can be a little more difficult in this more isolated world.
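That per-job isolation is expressed as ordinary Spark configuration that Kubernetes maps onto container resource requests and limits. A sketch of the relevant properties from the Spark-on-Kubernetes documentation of that era; the values, API server address, and image are illustrative:

```shell
# Pin memory and CPU for a single job; the Kubernetes backend turns
# these into resource requests/limits on the driver and executor pods.
# All addresses and values are placeholders.
spark-submit \
  --master k8s://https://my-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=4g \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
  local:///opt/spark/examples/jars/spark-examples.jar
```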

Kubernetes has a lot of really cool features, especially around security, such as Secrets management. But security also can get more complicated, he said.
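Spark’s Kubernetes backend can mount those Secrets directly into driver and executor pods via the `spark.kubernetes.*.secrets` properties. A hedged sketch; the secret name, credential file, and mount path are placeholders:

```shell
# Store a credential as a Kubernetes Secret, then mount it into the
# Spark driver and executor pods at /mnt/secrets.
# All names and paths below are illustrative.
kubectl create secret generic job-creds --from-file=key.json=./key.json

spark-submit \
  --master k8s://https://my-apiserver.example.com:6443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.driver.secrets.job-creds=/mnt/secrets \
  --conf spark.kubernetes.executor.secrets.job-creds=/mnt/secrets \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
  local:///opt/spark/examples/jars/spark-examples.jar
```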

“It reminds me of one of those Russian dolls, where you have an account within an account within an account — where you have a VM running as a service account, then within that there’s actually a Kubernetes service account, and inside of that you have Kerberos principals,” he said, adding that tracking through all that can sometimes be a problem.

InApps Technology is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.

Feature image by Gerd Altmann from Pixabay.