High-performance database specialist Percona, which among other things provides its own distribution of the MongoDB document-oriented database, is excited for the MongoDB.live event next week, the user conference held by MongoDB, the company that manages the formerly open source database system. Ahead of its own event on Thursday, which dives into potential upcoming features, the Percona team spoke with InApps about what it can glean from the current MongoDB code and the MongoDB Jira issue tracker.

In May, MongoDB published the release candidates for version 5.0 for the purpose of testing and catching bugs before the General Availability (GA) release on July 13, the first day of MongoDB’s live event. The Percona team highlighted features that they believe will be in the 5.0 release, some of which overlap with these release candidates, while others do not.

Note that since MongoDB moved in the last couple of years from an open source license to a source-available license, the self-named company behind the technology has published less of its roadmaps and release plans, so this piece is conjecture about some potentially important features for the platform. However, the MongoDB team has confirmed that next week’s 5.0 will continue to evolve MongoDB for more workloads, including time-series capabilities, serverless instances and integrated analytics, as well as expand product capabilities for search and mobile.

Certainly, InApps will be covering the official GA release, so check back next week, but this piece is just a fun glimpse at the potential of one of the world’s most popular NoSQL database offerings.

MongoDB Resharding

Certainly, the most popular potential feature update tackles the very challenging case of resharding. Two tickets, “create reshardCollection command on config server” and “create class for ReshardingCoordinatorStateMachine with function stubs,” were closed about a year ago but are not included in the 5.0 release notes. Additional code changes and work around resharding have continued during the last month. This is not surprising, since resharding is a complex task that touches many different parts of MongoDB’s core functionality.

One of MongoDB’s most important features is its ability to scale horizontally. It does this by adding hosts and then distributing the data and the load across those hosts and utilizing their additional resources. This is known as sharding, which occurs at the collection or table level.

In order to shard, a shard key is chosen and is used to distribute a collection’s documents across the shards or nodes. The data is partitioned into chunks with each chunk containing a part of the sharded data and thus a part of the documents in a collection. The chunks are then distributed evenly across the shards. Shard key selection has a large impact on your database scalability and performance.
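The routing described above can be sketched in a few lines of Python. This is a simplified model, not MongoDB's implementation: the chunk boundaries, the shard names and the helper functions `chunk_for` and `shard_for` are all hypothetical, and a real cluster balances chunks dynamically rather than dealing them out round-robin.

```python
import bisect

# Hypothetical chunk boundaries on the shard key: chunk i covers the keys
# below CHUNK_UPPER_BOUNDS[i] and at or above the previous bound.
CHUNK_UPPER_BOUNDS = [100, 200, 300, 400]
SHARDS = ["shardA", "shardB"]

def chunk_for(key):
    """Return the index of the chunk whose range contains the shard key value."""
    idx = bisect.bisect_right(CHUNK_UPPER_BOUNDS, key)
    if idx >= len(CHUNK_UPPER_BOUNDS):
        raise ValueError(f"key {key} is outside the defined chunk ranges")
    return idx

def shard_for(key):
    """Chunks are dealt out round-robin so each shard owns the same number."""
    return SHARDS[chunk_for(key) % len(SHARDS)]

for key in (50, 150, 250, 350):
    print(key, "-> chunk", chunk_for(key), "on", shard_for(key))
```

Because the shard key alone decides which chunk a document falls into, a key whose values cluster in one range will pile documents into one chunk no matter how evenly the chunks themselves are spread.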

Back in 2015, MongoDB wrote about the importance of selecting a shard key: “If you pick the wrong shard key, you can totally trash the performance of your cluster.”

As Kimberly Wilkins, MongoDB technical lead at Percona, put it: “Many people don’t realize how important it is to spend the time and effort upfront to pick the best shard key. They don’t think about all of their access patterns or what their data looks like today versus what it will look like tomorrow.”

She said certain important characteristics of the data aren’t always considered when selecting a shard key, including:

  • Cardinality of data
  • Frequency and distinctiveness of data
  • Monotonicity of data

“They are often in a hurry or think they can change it later. And this causes problems for them down the road,” Wilkins said, reflecting on her career supporting production databases in both Oracle and MongoDB. Since starting work with MongoDB databases in 2014, she has worked with customers, specializing in helping them pick the best shard keys for their workloads. She calls this ability to reshard collections and mitigate negative performance impacts “very exciting.”
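The three characteristics Wilkins lists can be checked on sample data before committing to a key. The following Python sketch is only a rough heuristic: the function name `profile_shard_key`, the sample documents and the three metrics are assumptions made for illustration, not something MongoDB computes for you.

```python
from collections import Counter

def profile_shard_key(docs, field):
    """Summarize a candidate shard key field over sample docs (in insertion order)."""
    values = [d[field] for d in docs]
    counts = Counter(values)
    # Cardinality: number of distinct key values.
    cardinality = len(counts)
    # Frequency: share of documents held by the single most common value.
    top_share = max(counts.values()) / len(values)
    # Monotonicity: fraction of consecutive inserts whose key did not decrease.
    pairs = list(zip(values, values[1:]))
    monotonic_share = sum(1 for a, b in pairs if b >= a) / len(pairs) if pairs else 0.0
    return {
        "cardinality": cardinality,
        "top_share": top_share,
        "monotonic_share": monotonic_share,
    }

docs = [
    {"source": "tiktok", "ts": 1},
    {"source": "twitter", "ts": 2},
    {"source": "tiktok", "ts": 3},
    {"source": "tiktok", "ts": 4},
]
print(profile_shard_key(docs, "source"))
print(profile_shard_key(docs, "ts"))
```

In this toy sample, `source` has low cardinality and one dominant value, while `ts` is unique per document but always increasing, so each has a different weakness as a shard key.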

Until last year’s version 4.4, you were stuck with your shard key forever. Even MongoDB referred to this choice of five different key considerations as “more of an art than a science.” MongoDB 4.4 introduced refinable shard keys, which let you extend the current shard key with additional fields for more divisibility and finer-grained data distribution.

Wilkins gave the example of an application capturing social media data. You select the shard key based on which social media app the data was coming from — for example, Twitter, Instagram, Facebook and TikTok. But then suddenly TikTok blows up. This sudden, exponential increase in data would overload the shard holding the TikTok data, causing real scale and performance problems across your whole application.
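A small simulation makes the imbalance in this example visible. It is a sketch under stated assumptions: the document counts, the `user_id` field and the hashed compound key are invented for illustration, and a hash modulo the shard count stands in for real chunk placement.

```python
from collections import Counter
import hashlib

SHARD_COUNT = 4
APPS = ["twitter", "instagram", "facebook", "tiktok"]

def shard_by_app(doc):
    """Low-cardinality key: every document from one app lands on the same shard."""
    return APPS.index(doc["app"]) % SHARD_COUNT

def shard_by_app_and_user(doc):
    """Hypothetical alternative: hash the app together with a user id."""
    digest = hashlib.sha256(f'{doc["app"]}:{doc["user_id"]}'.encode()).hexdigest()
    return int(digest, 16) % SHARD_COUNT

# 100 documents per app, then TikTok "blows up" with 10,000 more.
docs = [{"app": app, "user_id": i} for app in APPS for i in range(100)]
docs += [{"app": "tiktok", "user_id": i} for i in range(100, 10_100)]

by_app = Counter(shard_by_app(d) for d in docs)
by_hash = Counter(shard_by_app_and_user(d) for d in docs)
print("by app:          ", dict(sorted(by_app.items())))
print("by app+user hash:", dict(sorted(by_hash.items())))
```

With the app name alone as the key, one shard ends up holding almost all of the data, while the hashed compound key spreads the same documents roughly evenly.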

Resharding will allow you to completely change the shard key on existing collections and redistribute the data more efficiently across your shards. By selecting a new shard key that better matches your application needs and data characteristics, you’ll be able to overcome the negative performance impacts that the previous shard key caused. Wilkins said this is a feature years in the making because, until now, manually implementing any change to the shard key has been very complex. It basically involves cloning an existing collection, then rebalancing chunks and data for that cloned collection across the shards based on the new shard key. Then you have to remove the original collection and update the metadata in the config database.
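The manual steps described above can be modeled with plain Python data structures. This is a toy sketch only: the `reshard` function, the list-of-dicts cluster and the metadata layout are hypothetical, a modulo on an integer key stands in for real chunk balancing, and the sketch ignores the hard part, which is doing all of this while the application keeps reading and writing.

```python
def reshard(cluster, metadata, name, new_key):
    """Re-partition collection `name` across `cluster` by `new_key` (clone and swap)."""
    temp = name + ".tmp"
    # 1. Clone: gather every document of the existing collection from all shards.
    all_docs = [doc for shard in cluster for doc in shard.get(name, [])]
    # 2. Redistribute the clone according to the new shard key.
    #    Modulo on an integer key stands in for real chunk balancing.
    for shard in cluster:
        shard[temp] = []
    for doc in all_docs:
        cluster[doc[new_key] % len(cluster)][temp].append(doc)
    # 3. Remove the original collection and promote the clone in its place.
    for shard in cluster:
        shard.pop(name, None)
        shard[name] = shard.pop(temp)
    # 4. Record the new shard key in the metadata (the config database's role).
    metadata[name] = {"shard_key": new_key}

# Everything starts on the first shard because the old key ("app") had one value.
cluster = [
    {"events": [{"app": "tiktok", "user_id": i} for i in range(6)]},
    {"events": []},
]
metadata = {"events": {"shard_key": "app"}}
reshard(cluster, metadata, "events", "user_id")
print([len(shard["events"]) for shard in cluster])
```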

All of this has to be done without impacting the underlying performance, which is why, until now, most teams have hired an external consultant.

MongoDB Simultaneous and Resumable Indexing

Indexes get a whole section of the 5.0 release notes. One key update points to a closed ticket about improvements to simultaneous index builds across replica set members. That work involves the consensus protocol for two-phase index builds, specifically allowing them to run with a commit quorum and requiring the index build to finish on a majority of the secondaries before it is marked as complete.
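The commit quorum idea can be pictured as a simple vote count. The Python sketch below is an illustrative model, not the server's logic: the function name `can_commit_index_build` and the member names are hypothetical, and only a "majority" or a fixed member count is modeled here.

```python
def can_commit_index_build(member_ready, commit_quorum):
    """Decide whether a two-phase index build may be committed.

    member_ready maps each data-bearing member to True once it has finished
    its local build and voted to commit. commit_quorum is either the string
    "majority" or an integer count of members that must be ready.
    """
    ready = sum(1 for done in member_ready.values() if done)
    if commit_quorum == "majority":
        needed = len(member_ready) // 2 + 1
    else:
        needed = int(commit_quorum)
    return ready >= needed

votes = {"primary": True, "secondary1": True, "secondary2": False}
print(can_commit_index_build(votes, "majority"))  # two of three members are ready
print(can_commit_index_build(votes, 3))           # all three are required
```

Until the quorum is met, the build stays uncommitted on every member, which is what keeps a lagging secondary from being left without the index.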

Akira Kurogane, MongoDB product owner at Percona, told InApps that this simultaneous indexing “shows that MongoDB is going to get more serious about maintaining index consistency across replica sets. Before, you could build an index on the primary but you might not have a complete index build happen across all members of your replica set. If an election or other disaster recovery event occurred and a secondary that is missing the new index was elected as the new primary, then performance goes all to hell.”

He added that, previously, forcing all nodes to build the index over time carried only a relatively small risk, though it slowed your database down. That risk, however, grows with today’s ever-larger datasets and collection sizes.

He said that this update would continue a trend from versions 4.0 and 4.2 that focuses on strengthening guarantees across distributed systems. It’s also a feature aimed more at database administrators than at app developers.

Kurogane said another set of important updates will be around resumable index builds, which allow your work to resume after shutdowns, reducing the pain of that build.

“The loss of the node accidentally or because you’re doing maintenance restarts will no longer cause you to lose the work that has already been completed with your index build process. The increase in the size of hard disk means many, many people are putting many terabytes in each node,” he said. “If you’re reading and indexing your whole collection, that really takes time.”

With this new resumable index build, you can just stop and start the index build process, resuming when you need to.
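The general idea of resuming a build from saved progress can be sketched as follows. This is a toy model under stated assumptions, not how MongoDB persists index build state: the `build_index` function, the JSON checkpoint file and the `stop_after` parameter (used here to simulate a shutdown) are all hypothetical.

```python
import json
import os
import tempfile

def build_index(docs, field, checkpoint_path, stop_after=None):
    """Build {value: [positions]} for `field`, saving progress so a restart can resume."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # resume from the saved position
    else:
        state = {"next": 0, "index": {}}  # fresh build
    processed = 0
    for pos in range(state["next"], len(docs)):
        if stop_after is not None and processed >= stop_after:
            break  # simulate a shutdown part-way through the build
        state["index"].setdefault(str(docs[pos][field]), []).append(pos)
        state["next"] = pos + 1
        processed += 1
    with open(checkpoint_path, "w") as f:
        json.dump(state, f)               # persist progress before stopping
    return state

docs = [{"city": c} for c in ["Oslo", "Lima", "Oslo", "Rome", "Lima"]]
path = os.path.join(tempfile.mkdtemp(), "index_build.json")
first = build_index(docs, "city", path, stop_after=2)  # interrupted after two docs
second = build_index(docs, "city", path)               # picks up at position 2
print(first["next"], second["next"])
print(second["index"])
```

The second call never rereads the first two documents, which is the saving that matters when the collection is many terabytes rather than five rows.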

Wilkins said that, besides sharding, one of the main benefits of MongoDB has always been its improved performance. That performance comes from the “lack of unions and joins” that are inherent to relational databases. MongoDB uses indexes, but she said that building those indexes can impact performance and cause your applications to suffer during the build process.

She said, “Imagine if you have just spent three, four or ten hours building an index across a multi-terabyte collection and something happens before the index build completes and you have to start all over again from scratch. Resumable index builds overcome that problem. This is another long-term request that will prevent negatively impacting instances when adding indexes meant to improve performance in the first place.”

This is why the Percona team picked out these two indexing updates as particularly impactful and important to the community.

MongoDB Time Series Collections

Kurogane has worked with MongoDB for over a decade and he describes the addition of time series collections as a “really big change.” After all, time series deals with a one-to-one relation of time and values, which is essential for analyzing anything from transactional history to disease diagnosis to monitoring and observability.

Currently, in MongoDB, you are able to specify a date field and all the data will then be rearranged in buckets. This update makes it simple to rearrange data so that all values are put into one column.

“This is really inverting the table relationship between WiredTiger label and what users see in MongoDB,” Kurogane said.
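To make the bucket idea concrete, here is a minimal Python sketch that groups row-style measurements into time windows and stores each field as a column within its bucket. The function name `bucket_measurements`, the `ts` and `temp` fields and the 60-second window are assumptions for illustration; this is not MongoDB's actual storage format.

```python
from collections import defaultdict

def bucket_measurements(measurements, time_field, bucket_seconds):
    """Group row-style measurements into per-window buckets stored column by column."""
    buckets = defaultdict(lambda: defaultdict(list))
    for m in sorted(measurements, key=lambda m: m[time_field]):
        # Every timestamp maps to the start of the window it falls in.
        window_start = m[time_field] - m[time_field] % bucket_seconds
        for field, value in m.items():
            buckets[window_start][field].append(value)
    return {start: dict(cols) for start, cols in buckets.items()}

readings = [
    {"ts": 0, "temp": 20.5},
    {"ts": 30, "temp": 20.7},
    {"ts": 60, "temp": 21.0},
]
buckets = bucket_measurements(readings, "ts", 60)
print(buckets)
```

Users still write and read one document per measurement, while the stored form keeps all values of a field for a time window side by side.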

Kurogane predicted that this change won’t perform as well as databases that were designed for time series from the get-go. However, it would make any time series operations within MongoDB much faster, a very welcome addition to the MongoDB feature set.

This is another of those requests that Percona observes coming down the roadmap: it was also completed about a year ago but is not in the 5.0 release notes. In this case, it should be noted that several aspects of time series collections still show as being worked on, including a currently pending related bug. However, this feature has been confirmed to be a part of 5.0.

In the end, we won’t know for certain what MongoDB 5.0 will bring until it is released, but we are looking forward to diving into the GA release next week. In the meantime, we will see you July 13 and 14 at MongoDB.live — registration is still open.

Feature image by Jean Didier from Pixabay.