Update Databases — Finally — Get Containerized

March 29, 2022 by Phu Nguyen


New Approaches to Containerized Data

Over the course of the past year, there have been several projects to adapt databases to work in containerized environments.

For example, companies using Cassandra, a popular NoSQL database, have been collaborating on K8ssandra, an open source project from DataStax.

DataStax first unveiled K8ssandra (pronounced “Kate Sandra”) last November, together with the tooling and dashboards required to run the database in a Kubernetes cluster.

It was built on top of a simpler Kubernetes operator for Cassandra released in the spring of 2020.

K8ssandra is based on DataStax’s own experience with running Astra, its managed cloud data service.

Other databases are also being ported to containers. Cockroach Labs, for example, has been working on bringing its distributed SQL database CockroachDB to Kubernetes.

Meanwhile, PlanetScale uses the open source Vitess project to horizontally scale MySQL, and also has an operator that lets it work on Kubernetes. The Vitess scaling technology was originally developed at YouTube and is now used by Square, Slack, HubSpot and other large internet sites.

The trick is to provide developers with a data fabric that just works, without forcing them to struggle with security, auditability or scalability, said Sam Ramji, chief strategy officer at DataStax.

That’s true even if developers are only building small-scale applications, with, say, just three nodes.

“You want them [applications] to become popular,” he said. “But once [an application] becomes popular, you don’t want to re-architect the whole thing. You can’t shut it down, so you end up trying to build two parallel systems.”

Companies can avoid that problem by picking a platform that can scale well from the start.

Cassandra, for example, started out at Facebook to power its inbox search feature. It was released as an open source project in 2008. Other companies using it include Instagram, GoDaddy, eBay, Spotify and Netflix. But the single largest deployment is probably at Apple, which is heavily invested in Cassandra. Apple has over three times as many openings for Cassandra-related jobs as it does for HBase, Couchbase and MongoDB combined.

“Apple is reported to run a 200,000 node Cassandra cluster that powers data services on iCloud, including iMessage and many others,” said Ramji.

Cassandra works by automatically sending inbound requests to the least loaded server, he said.

“You can create database clusters that span multiple geographic regions,” he added. “The Facebook inbox, for example, had to be geographically available everywhere.”

Cassandra can default to full copies of the data in every instance, or companies can use intelligent replication and specify which data can go where. Intelligent replication is particularly useful, Ramji said, when there are regulatory requirements about moving sensitive data out of certain regions.
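As a rough illustration of what specifying "which data can go where" can look like, the sketch below builds a CQL `CREATE KEYSPACE` statement that uses Cassandra's `NetworkTopologyStrategy`, which takes a replica count per data center. The keyspace name `eu_customers`, the data center name `eu_west` and the helper `keyspace_cql` are hypothetical, and this is not necessarily the mechanism Ramji had in mind.

```python
from typing import Dict


def keyspace_cql(name: str, replicas_per_dc: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement with per-data-center replica counts.

    Data centers that are not listed receive no replicas of this keyspace,
    which is one way to keep a data set inside a given region.
    """
    if not replicas_per_dc:
        raise ValueError("at least one data center is required")
    # NetworkTopologyStrategy accepts one 'dc_name': replica_count pair per data center.
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        f"WITH replication = {{{', '.join(parts)}}};"
    )


if __name__ == "__main__":
    # Hypothetical names: keep three replicas in one European data center only.
    print(keyspace_cql("eu_customers", {"eu_west": 3}))
```

The resulting statement can be run through `cqlsh` or any Cassandra driver. The data center names must match the names the cluster itself reports.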

“It’s a good fit for Kubernetes because Cassandra knows how to scale itself horizontally,” he said. “No matter how widely you scale your Kubernetes cluster, you can add Cassandra nodes fluidly. But the challenge is to make Cassandra Kubernetes-native.”

That has taken a few years. “Kubernetes is a very hostile environment for databases,” he said.

To start with, Kubernetes is all about stateless applications.

“With Kubernetes, you can stop and start the service at any moment and then pop [it] up somewhere else,” he said. “And you have no memory of the previous service.”

As a result, developers typically keep their data outside the Kubernetes world.

To move data into Kubernetes, the platform first needed to support stateful applications. The solution, StatefulSets, became generally available with Kubernetes 1.9 in December 2017, after entering beta in release 1.5 in 2016.

The approach quickly became popular. Today, 55% of companies use stateful applications in containers in production, according to the CNCF survey.

“StatefulSets lets you tell Kubernetes, ‘I’m actually a database, so be cool,’” said Ramji.

That means that containers have to be extra careful when shutting down, he said. A database has to take its in-memory writes and commit them to permanent storage.
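A minimal sketch of how these ideas map onto a StatefulSet manifest follows. It is built as a Python dictionary and printed as JSON, which `kubectl apply -f` accepts. The image tag `cassandra:4.0`, the mount path, the sizes and the helper name `cassandra_statefulset` are assumptions for illustration. A `preStop` hook running `nodetool drain` is one common way to flush in-memory writes before shutdown, and this is not a description of how K8ssandra itself is configured.

```python
import json
from typing import Any, Dict


def cassandra_statefulset(name: str, replicas: int, storage: str,
                          grace_seconds: int) -> Dict[str, Any]:
    """Return a StatefulSet manifest with stable storage and a graceful stop."""
    labels = {"app": name}
    container = {
        "name": "cassandra",
        "image": "cassandra:4.0",  # assumed image tag
        # Before the container is stopped, flush memtables to disk.
        "lifecycle": {"preStop": {"exec": {"command": ["nodetool", "drain"]}}},
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/cassandra"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service that gives pods stable DNS names
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Time Kubernetes waits after the stop signal before killing the pod.
                    "terminationGracePeriodSeconds": grace_seconds,
                    "containers": [container],
                },
            },
            # Each pod gets its own PersistentVolumeClaim that survives restarts.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": storage}},
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(cassandra_statefulset("cassandra", 3, "100Gi", 300), indent=2))
```

The parts that differ from a stateless Deployment are the `volumeClaimTemplates` entry, the stable `serviceName`, and the longer termination grace period.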

Then there are the issues of synchronization and coordination. The nodes in a Cassandra cluster normally communicate with one another using a gossip protocol. That had to change with a move to a containerized environment.

“Cassandra had to stop gossiping among its own nodes and learn to use a protocol in the Kubernetes control plane,” said Ramji.

Finally, running Cassandra traditionally requires some manual management and control functions.

“In order to scale, repair itself, restore and work in the Kubernetes control plane, it had to get radically automated in a way that Cassandra had never been,” he said.

When K8ssandra was first released in November, it was ready to work on Kubernetes. This month, the project is adding out-of-the-box support for all the major cloud providers so it can work with particular flavors of Kubernetes without any extra configuration required.

“We expect that people will run it on Amazon, Google or Red Hat OpenShift,” he said. “We’ve also been able to fix some bugs and dependencies and make the configuration smarter.”

Storage Containers

The most important thing about running databases in containers is finding a way to store the data, said Dan Yasny, principal field engineer at MayaData, another company working on deploying Cassandra on Kubernetes.

Storage-area networks (SANs) are one approach, but they are expensive, he said.

“A typical SAN project is six figures out of pocket right then and there,” he said. “And in five years it will be end-of-life, and you end up having to buy a new one.”

Then there are the costs associated with managing the platform, he added. “When you’re spending six figures, you need someone who knows Hyperchannel. It’s not simple.”

With Kubernetes, companies can use local attached storage and can scale by adding more nodes with more disks.

That’s for private cloud deployments. Clouds have their version of local storage as well.

“On Amazon, GCP and Azure, you have instance types with local non-volatile memory,” he said. “A single disk can provide 100,000 operations per second, which is insane. A typical SCSI disk will give you 150 at best. So when you’re in those clouds, and you’re using those instances, you have 60 terabytes on a single virtual machine you can provision. It’s huge and it’s insanely fast.”

The downside, of course, is that it’s ephemeral.

“If you stop a virtual machine and start it again, the disks will be empty,” he said. “Running a database like that sounds insane. But think about the database having multiple nodes, with multiple replicas, and if a single node goes down, it comes back up again.”
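The reasoning behind that claim can be sketched with a toy model: if every key is stored on several nodes, wiping the disks of any one node still leaves other copies to rebuild from. The ring placement and the byte-sum hash below are deliberate simplifications for illustration, not Cassandra's actual partitioner, and all function and node names are hypothetical.

```python
from typing import Dict, List


def place_replicas(keys: List[str], nodes: List[str], rf: int) -> Dict[str, List[str]]:
    """Assign each key to rf consecutive nodes on a ring (toy placement)."""
    if not 1 <= rf <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    placement = {}
    for key in keys:
        start = sum(key.encode()) % len(nodes)  # deterministic toy hash
        placement[key] = [nodes[(start + i) % len(nodes)] for i in range(rf)]
    return placement


def min_copies_after_failure(placement: Dict[str, List[str]], failed: str) -> int:
    """Smallest number of surviving copies of any key once one node is wiped."""
    return min(
        sum(1 for node in owners if node != failed)
        for owners in placement.values()
    )


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4", "n5"]
    keys = [f"user:{i}" for i in range(20)]
    for rf in (1, 3):
        placement = place_replicas(keys, nodes, rf)
        worst = min(min_copies_after_failure(placement, n) for n in nodes)
        print(f"rf={rf}: worst-case surviving copies after one node loss = {worst}")
```

With a replication factor of one, losing the wrong node loses data outright. With three replicas, every key keeps at least two copies, which is what makes ephemeral local disks a workable option.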

With a self-replicating Kubernetes stack, companies don’t have to worry about setting up a separate storage functionality for their application because the databases themselves can take care of things being replicated.

And the new container-friendly databases have their own backup solutions, he added.

“You take a snapshot and ship off your current state,” he said. “You can backup just the increments, or the whole thing every time — there are so many possibilities.”

MayaData provides backup and orchestration capabilities for Kubernetes-based databases through OpenEBS Mayastor.

OpenEBS is an open source project backed by MayaData that lets stateful Kubernetes applications access dynamic local persistent volumes or replicated persistent volumes.

It takes care of another piece of the container data puzzle — managing data that is spread across multiple Kubernetes storage environments.

“Sometimes you have separate nodes that have the disks, and the workload is on other nodes,” said Yasny.

Previous solutions could offer replication, snapshots and other features, but suffered when it came to performance, he said.

OpenEBS is a storage orchestrator that can connect to both local- and network-attached storage volumes, he said. It’s the most popular open source storage implementation on Kubernetes and has been around for a few years. Mayastor extends that ability across containers.

“In good lab conditions, we got to just a single digit of percent overhead,” he said. “And without too much tuning or working too hard, we can get to 15 percent overhead.”

In March, MayaData released a benchmarking report in conjunction with Intel about its performance tests.

OpenEBS Mayastor is currently in beta.

The official release date will be determined by the broader community and will be based on criteria such as code stability, test coverage and test results, said Evan Powell, chairman and CEO at MayaData. That could be a few more releases, he said, which would suggest that the project will exit beta within a few months.

Indian ecommerce giant Flipkart is currently moving Cassandra workloads to Kubernetes, using different flavors of OpenEBS.

“They will become one of the largest users of Kubernetes as they scale,” said Powell. “It is an honor to be partnering with them.”

Containers Fuel Agility in a Time of Change

Target has been using the Cassandra database since around 2014. In 2018, the company rolled out individual Cassandra clusters in all its stores and needed those clusters to run in Kubernetes.

That was before there was a K8ssandra project, and Target built this infrastructure from scratch.

According to Daniel Parker, Target’s director of engineering, the first challenge was that when new nodes started up, they had to find other nodes to connect to, and if several new nodes were coming online at once in the same cluster, they had to be able to find each other and cluster together.

Then there were issues with setting up backups that don’t get wiped when a container restarts, and with setting up automated monitoring and alerts.

“We had a lot of hurdles to overcome in deploying Cassandra clusters to all Target stores,” Parker wrote.

But this investment likely paid off, said Patrick McFadin, vice president of developer relations at DataStax.

When the pandemic hit, retail stores around the world had to switch to delivery or curbside pickup.

That meant companies needed to have technology infrastructure in place that let them quickly switch business processes.

“Companies that did not have this in place struggled to adapt. Just look at the changes in retail recently including Gap, JCPenney and Sears,” said McFadin.

Other organizations that need a high degree of scalability or agility include entertainment firms, healthcare and finance companies, SaaS vendors, industries with heavy seasonal fluctuation such as retail and logistics, and companies deploying 5G, edge computing, new AI applications or automation.

Today, technology and business agility is a matter of survival, he added. And it’s not just the pandemic. Companies are under extreme market stress in multiple areas.

If it’s not the pandemic, it’s something else. A new startup arrives. A competing company decides to aggressively expand into your territory. There’s disruption in supplies or in market demand. Or, with increasing frequency, Amazon decides to enter a new niche and threatens to put all the incumbents there out of business overnight.

“It’s a matter of survival,” said McFadin. “If you don’t move, you are [in] Chapter 11.”

The containerized, agile approach allows for quick upgrades to applications and fast expansion of capacity.

“We can’t go back to the old traditional waterfall methods,” McFadin said.

Avoid Cloud Lock-in

Adapting databases to work natively on Kubernetes also creates an additional benefit: enterprises are no longer locked into their cloud providers.

According to Gartner, once a company deploys an application on a particular cloud platform, it tends to stay there. And once it’s there, it tends to attract other applications and services, a concept often referred to as data gravity.

“This is due to data lakes being hard — and expensive — to port, and therefore end up acting as centers of gravity,” said Gartner analyst Marco Meinardi in a report last fall.

“Look at what the large clouds are trying to do,” said McFadin. “If cloud providers can convince you to use their proprietary database in their cloud, you may never leave. It’s like taking the blue pill. You’re done. You don’t want to be on your cloud provider anymore? Go ahead, move your data, I dare you.”

But the ability to switch providers is what allows enterprises to shop around for the best deals.

“Commoditization is key,” said McFadin. “Commoditization is how they can negotiate prices and get long-term savings. Clouds are not producing a lot of commodity right now, but Kubernetes is forcing them to become a commodity.”

With portable containers, companies can create virtual data centers across multiple public clouds and optimize for price or performance. “And if I’m not getting a very good price, then I can pick it up and move it somewhere else,” McFadin said.

Having support for data on Kubernetes is the last piece of the portability puzzle that companies have been missing, he said.

InApps Technology is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: PlanetScale, Docker.


Source: InApps.net

