David Jones-Gilardi
David is a developer advocate at DataStax who is a total nerd for distributed databases. He has over 25 years of relevant experience in programming, database administration, cloud, server/network monitoring and analytics.

Migrating a legacy SQL-based application to NoSQL can pose a real challenge; it might feel like a sliding puzzle, requiring a clear vision of the order of operations to get it right. But if scale and velocity become increasingly important, moving to a NoSQL database is a path worth traveling.

Here, you’ll learn how to migrate your existing SQL-based applications to NoSQL, the practical implications to your data model, and how you can flip the switch between the two once your new system is in place.

We walk through the process using the PetClinic reference application and the open source NoSQL database, Apache Cassandra.

Cédrick Lunven
Cédrick leads the developer advocate team at Datastax. A speaker and trainer, his first — but not only — passion is Java. In 2013 he created the open source feature toggle library called FF4J, which he has been actively maintaining.

Cassandra is the only distributed NoSQL database that delivers the always-on availability, fast read-write performance, and unlimited linear scalability needed to meet the demands of successful modern applications.

One of the easiest ways to spin up a fully-realized Cassandra database is through DataStax Astra DB, a multicloud database-as-a-service built on Cassandra. Astra DB is free of charge up to 80GB storage and 20 million monthly operations; no credit card commitment required.

The three approaches we’ll cover here include:

1) Offline migration
2) Zero-downtime migration with shadow writes
3) Minimal-downtime migration with dual reads.

Why Switch from SQL to NoSQL?

Artem Chebotko
Artem is a developer advocate at DataStax. His core expertise is in data modeling, data management, data warehousing, data mining and data analytics. For over 15 years, he has been leading and participating in research and development projects on NoSQL, relational, graph, XML and provenance databases.

For most of you, you probably learned about relational databases first. Relational databases are very powerful because they ensure consistency and availability at the same time. They are super effective and easy as long as you have your databases running on the same machine.

Read More:   Update Piston Cloud’s Smart Move to Build a Fungible Platform

But if you need to run more transactions or need more space to store your data, there are upper limits to how far you can vertically scale on a single machine. The problem arises: relational databases aren’t able to scale efficiently.

The solution to scale out is to split the data among multiple machines, creating a distributed system. In fact, NoSQL databases were invented to cope with these new requirements of volume (capacity), velocity (throughput) and variety (format) of big data.

With the rise of Big Tech, the global data sphere skyrocketed 15-fold in the last decade. And relational databases simply weren’t ready to cope with the new data volume or new performance requirements. Huge global operations like Google, Facebook or LinkedIn create and use NoSQL databases to scale efficiently, go global, and achieve zero downtime.

Cassandra, like many other NoSQL databases, is governed by the CAP Theorem. The theorem states that a distributed database system can only guarantee two out of these three characteristics in case of a network failure scenario: Consistency, Availability, and Partition Tolerance.

Cassandra is usually described as an “AP” system, meaning it errs on the side of ensuring data availability even if this means sacrificing consistency. But that’s not the whole picture. Cassandra is configurably consistent: you can set the Consistency Level you require and tune it to be more AP or CP according to your use case.

This YouTube video provides a more in-depth breakdown of SQL and NoSQL databases.

Defining the Migration Problem

Migrating from SQL to NoSQL isn’t a one-step process, and there’s no absolute path or pattern. Although there are multiple options for migration, our goal is to help you puzzle out a complete picture.

In an ideal world, your applications on relational databases don’t run into any issues and there’s no need to switch them over to NoSQL. But if you run into scalability issues, or if you ever need to go global, then you have some great reasons to migrate. Cassandra partitions over distributed architecture to handle petabytes of data for linear scalability and replicates data on multiple centers around the world, keeping up with your data needs.

Imagine how easy it would be if you can just take the old application, rewrite parts of it, and migrate your schema and data to get your new application. But that’s not how it works in reality because a relational database contains hundreds of tables, with different applications using them for different business functions.

Migrating the whole application can take a long time, during which application and data requirements might change. As the application and relational database evolves, your migration has to pick up the evolution, becoming way too complex.

A more realistic approach is to take a small piece of a larger application, say a feature or business function that only uses 20 or 50 tables in a relational database, and migrate it to NoSQL in a reasonable amount of time. To understand this, let’s look at the four layers in data architecture:

  1. Presentation layer displays information to and collects information from users.
  2. Business logic layer acts as an intermediary for data exchange between the presentation layer and the data access layer. It handles the business rules that determine how data is created, stored, and changed.
  3. Data access layer facilitates access between applications and any type of persistent storage, in this case, SQL and NoSQL databases.
  4. Data persistence layer is the actual database consisting of schemas and tables.

When we migrate a business function, it is likely business logic is not going to change except in rare cases. There are certain features in the relational database that do not map exactly to a NoSQL database: for example, asset transactions.

But most of the time, what will change is data access objects, which now have to interact with Cassandra. You will also need to migrate data persistence layers, or schema and tables, from SQL to NoSQL.

Approach 1: Offline Migration

Figure 3. The offline migration approach

In the offline migration approach, you turn off your old SQL system completely while you build a migration to NoSQL. It’s considered the simplest and safest method as there’s less chance of data loss. This is the data migration workflow:

  1. Migrate data model.
  2. Migrate function and data access objects.
  3. Extract, transform, load (ETL).

Afterwards, you can start the new NoSQL function and if you tested everything correctly, you can serve reads and writes from the real application. But there’s no guarantee that the new NoSQL application will work in every case, and most applications can’t afford downtime, which leads to the second approach.

Approach 2: Zero-Downtime Migration with Shadow Writes

Figure 4. The zero-downtime migration with shadow writes approach

To achieve zero downtime, you need to run both functions: the old and new functions simultaneously in parallel. Give enough time to test it well before we decide to switch completely and shut down the old function. The first two steps of the data migration workflow for this approach are the same:

  1. Migrate data model.
  2. Migrate function and data access objects.

But now, you can start using the new function as if we’ve already completed the migration and run shadow-writes, allowing the writes from the real application to reach both the relational database and the NoSQL database.

Read More:   The Linux Kernel as a Case Study on Rapid Development for Complex Software – InApps 2022

Because we are doing shadow writes, we will need a proxy to send data to a relational database and the same data will be shadowed to Cassandra. At this point, we are heavily testing writes and ensuring our tables and our data models are designed correctly.

If you made a mistake during this process, you can always switch back to the relational database because you have replica data there. Although we are using Cassandra as an example, you can do this with any other NoSQL migration.

The next step is to extract, transform and load historical data to Cassandra. After that, you keep running both databases and make sure the reads work as you expected. Once you’re ready, turn off or replace your old solution with the new one. This video segment offers a more detailed explanation.

Approach 3: Minimal-Downtime Migration with Dual Reads 

Figure 5. The minimal-downtime migration with dual reads approach

The interesting thing about this approach is that you will do dual-reads, which means running both relational and Cassandra databases at the same time with no intention to turn off the relational one completely. Again, the first two steps are the same:

  1. Migrate data model.
  2. Migrate function and data access objects.

In the third step, you can use your new function to do the writes to Cassandra, but the historical data is still in the relational system. When you do the reads, you need to read both from Cassandra and the relational system which you’ll need a proxy for. Once you’re ready, you ETL the historical data from a relational database to Cassandra. Then, switch the feature off and use it on Cassandra.

There are certain use cases when this approach is favorable:

  • If the historical data that you have in a relational database is not that important and you read it infrequently, there’s no point to move your data to do the ETL and spend your resources on storing that data in Cassandra.
  • If the historical data uses relational features that are not available in Cassandra or are not easy to implement in your application.
  • If you have already prepaid your service cloud SQL database service and you want to keep using it while you have the resources.

This video segment offers a more in-depth explanation.

Conclusion

All three approaches above have one thing in common: you’ll always need to migrate data model, function, and data. But, as we’ve shown here, you can pick and choose the approach that can satisfy the needs of your organization and the features of your applications.

Photo by Ian Parker on Unsplash