Developing enterprise software is far from simple. Designing a platform to serve hundreds of thousands of users, devices, or data streams (sometimes all at once) is a Herculean task. But that doesn't mean the design can't be approached in a way that encourages scalability down the road.
Scalability is one of the most important considerations when building a new software solution. Without it, the software cannot support user growth without crippling the user experience, which in turn inhibits sales. Making a scalable software platform is challenging because it's nearly impossible to know beforehand every factor, option and problem the vendor needs to take into consideration, so companies instead have to iterate along the way.
That was the issue Forward Networks faced when developing its network management solution. The platform creates a digital twin model of enterprise networks to assist with network management, verification and operations. But these networks contain thousands of devices, across hundreds of device types, from dozens of different vendors, spanning multiple locations, with millions of varying configurations. Building an algorithm that could accurately identify and model behavior for these networks was a challenge.
The platform started from a sample of just 16 devices; today it has been scaled to model up to 45,000 devices and process up to 10^30 (a one followed by 30 zeros) traffic flows. But the team's scaling efforts won't stop there, as enterprise network complexity is only going to increase in the future. Data from Enterprise Strategy Group's research uncovered that 66% of organizations view their IT environments as more complex than they were two years ago, and 46% plan to continue upgrading and expanding their network infrastructure.
Software scalability is naturally only going to become more important as these platforms need to contend with growing networks. Now let’s get into some steps to address how to build software from the ground up to support scaling rapidly.
Don’t Always Trust Open Source
At first, leveraging off-the-shelf platforms and tools like Elasticsearch and Apache Spark can seem like a great way to save money and time. But when trying to scale software, it quickly becomes clear that while these platforms are generic and applicable to a wide range of applications, they aren’t the best fit when major customizations for a specific platform or use case are needed.
This was a problem the Forward Networks team ran into in its early years. Initially the team relied heavily on Elasticsearch to compute, index and store all of the platform’s end-to-end network behavior calculations. It slowly became obvious that this wasn’t a long-term solution. Pre-computing all that data in Elasticsearch was becoming computationally infeasible and increasingly expensive to store. Reliance on these open-source tools was starting to become a problem, so the team decided to create a homegrown distributed compute and search platform instead.
When designing the platform in-house, it's smart to adopt a lazy computation approach where possible. Pre-computing just enough data to keep quick processes fast, and deferring the computation specific to a user query until it's entered, leads to major improvements. The massive reduction in computing power and storage needed allows for immediately improved performance and better scaling of the platform in the future.
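The lazy approach can be sketched in a few lines of Java. This is a minimal illustration, not Forward Networks' actual implementation: results are computed only the first time a query asks for them, then cached for reuse.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of lazy, query-driven computation: nothing is
// pre-computed for every possible query; each answer is computed on
// first access and cached for subsequent requests.
class LazyQueryCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> compute; // the expensive, query-specific work

    LazyQueryCache(Function<K, V> compute) {
        this.compute = compute;
    }

    // Computes on first access; returns the cached value afterwards.
    V get(K query) {
        return cache.computeIfAbsent(query, compute);
    }
}
```

Compared with pre-computing every answer into a store like Elasticsearch, only the queries users actually run consume compute and memory.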
Design with Minimalism in Mind
Always plan for the software platform to run on minimal on-premises hardware. While it is easy to provision an instance with 1TB+ of RAM in AWS, Azure or another cloud provider, most customers will operate a vendor's platform with the smallest amount of RAM they can. Especially when a potential customer is testing out the software, they don't want to provision a lot of resources for a platform they aren't sure of. It's not uncommon to have to work with a mere 128GB or 256GB of RAM.
The simple fact is that when software needs a lot of internal computing resources, there can be a lot of bureaucratic red tape to get through. On the flip side, it's important that potential customers start using the software quickly and finish the proof-of-concept period, because until then they haven't seen the value of operating the platform in their environment, which is absolutely key to converting a sale.
Even when the platform needs to scale 1,000-fold, one can't simply use a cluster with 1,000 nodes. Even in the rare instances where that is technically possible, it isn't a realistic approach for the customer. Software vendors have to do the hard engineering work to achieve the same scaling with minimal resources. Here are some specific approaches that can help teams design the software platform with minimalism in mind:
- Avoid repeated computation.
- Dedupe data structures in memory and on disk.
- Use lazy computation: delay processing until it is actually needed.
- Make core data structures as compact as possible, with very low serialization and access overheads.
- Use fastutil for fast, memory-efficient collections in Java.
- Profile to detect and optimize the actual bottlenecks.
Ideally, a platform's resource requirements should be so low that developers can run the entire stack on their laptops. This is critical for enabling fast debugging and quick iterations. Building software platforms to operate on minimal hardware in this way can speed up adoption, and will also ultimately save customers money and improve margins.
Always be Gathering Data
Even for the best-designed software, there are likely environmental considerations or data patterns that simply won't be anticipated. Over time, the computation core of the software may need to be rewritten or significantly changed multiple times to adapt to new problems, constraints or inefficiencies that couldn't have been foreseen. The larger the dataset available to test the platform, the sooner teams can identify these bottlenecks and limitations.
But this isn't exactly an easy proposition for a new startup. Why should a large enterprise spend the time to install a platform from a new vendor, configure their security policies to allow the software to connect to their entire network to pull their configurations, and send the data to a small company that doesn't have a proven product yet? It's a tough sell. Instead, most developers need to take the long way around. This means gathering whatever relevant data is available from customers and pilot programs to continually build and expand internal datasets for scale testing and improvements. Even software trials that don't lead to a new customer can offer new data with invaluable insight for improving the platform. Then, as the software is made better, faster and more scalable, the vendor can advance to even larger customer environments, get even bigger datasets, and find the next set of platform bottlenecks to work on.
Another complication is that many enterprises have strict security and privacy policies. They won't share their data and information directly with a software vendor. Such companies necessitate spending the time to build data obfuscation capabilities into the platform, so performance bottlenecks can be analyzed without ever exposing the customers' real data.
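One common way to implement this kind of obfuscation (sketched below as an assumption, not a description of Forward Networks' actual mechanism) is deterministic hashing: every identifier is replaced by a salted hash token, so the structure and repetition patterns of the data, and therefore its performance characteristics, are preserved while the real names never leave the customer's environment.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of deterministic data obfuscation: the same
// identifier always maps to the same opaque token, so dataset shape
// is preserved for scale testing, but real hostnames are hidden.
class Obfuscator {
    private final byte[] salt; // per-customer secret, never shared

    Obfuscator(String salt) {
        this.salt = salt.getBytes(StandardCharsets.UTF_8);
    }

    // Replaces an identifier (hostname, interface name, etc.) with a
    // short salted-SHA-256 token such as "dev-a1b2c3d4e5f6".
    String obfuscate(String identifier) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt);
            byte[] hash = md.digest(identifier.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("dev-");
            for (int i = 0; i < 6; i++) {
                sb.append(String.format("%02x", hash[i]));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the mapping is consistent, a bottleneck that shows up on the obfuscated dataset corresponds to the same bottleneck on the original.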
Invest in Internal Testing
Naturally, no vendor wants their platform to "break." It's not a good look for the company and could potentially cost customers millions of dollars in time spent trying to find a workaround, or even worse, a replacement. For software platform vendors that want to move fast without breaking things, it's vital to invest in sophisticated testing.
For reference, Forward Networks releases one major update for its software platform every month. Each of these releases comprises over 900 git commits, and that number is only going to increase as the platform scales even further. The team also runs periodic jobs that, every few hours, execute more expensive tests against the latest merged changes to ensure everything is operating as planned. This level of testing is what's needed to ensure there aren't any major bugs or regressions in product releases. While it's not feasible to run extensive testing on every single change, vendors should invest in as much internal quality assurance as possible.
There are also various tools available to engineers and developers that can significantly help in this process. Git commits can be verified by Jenkins jobs that run thousands of unit and integration tests to ensure there are no regressions; test failures from the verification then prevent the problematic change from getting merged. Additionally, Error Prone can be used to detect and avoid common bugs, and Checkstyle can enforce a consistent coding style throughout the process. Tracking each test's performance in SignalFx can also showcase changes over time, so development teams can see when problematic issues were introduced.
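The performance-tracking idea boils down to a simple gate. The sketch below is a hedged illustration of that pattern, with an assumed stored baseline and an assumed tolerance threshold rather than any tool's real configuration:

```java
// Hypothetical sketch of a performance-regression gate: a freshly
// measured test duration is compared against a recorded baseline, and
// the change is flagged if it exceeds an allowed tolerance. The
// baseline store and the 20% threshold used below are assumptions.
class RegressionGate {
    // Returns true if currentMillis is more than `tolerance` slower
    // than the baseline (e.g. tolerance = 0.20 allows a 20% slowdown).
    static boolean isRegression(double baselineMillis, double currentMillis, double tolerance) {
        return currentMillis > baselineMillis * (1.0 + tolerance);
    }
}
```

Wiring a check like this into the merge pipeline turns a slow creep in test times into a hard, visible failure at the commit that introduced it.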
By hiring enough engineers to run consistent and thorough tests to check for correctness and performance regressions, vendors can make their software platforms as reliable and resilient as possible.
Crossing the Finish Line?
The honest truth is that these processes are a constant cycle. No vendor is ever truly “finished” improving their software platform. There will always be the need to further develop software to work better, faster, and more efficiently — and also to become compatible with new technologies and platforms appearing on the market. And all of this applies even more so when building scalable enterprise software.
As more and more devices become smart and connected, the need for scalable solutions will increase exponentially. Designing software from the ground up for scalability also allows for improvements in performance, computing and storage efficiency, and the customer experience. It's a smart choice that can keep vendors ahead of their competitors, and set them up for long-term success in an increasingly connected world.