NGINX sponsored this post.
For the past few years, engineers working in the rapidly emerging realm of service mesh have rolled their eyes at media references to the “service mesh wars.” While seemingly hyperbole, a battle does exist. For better or worse, service meshes aren’t diplomatic. Just getting two different service meshes to talk to each other remains a nightmare. While the Service Mesh Interface (SMI) has pushed the industry toward the open standards required for true interoperability, much work remains for driving those standards to the point where service meshes can become a universal communication and management layer.
A particular case in point: Workload data is currently poorly standardized, which makes it hard to build standardized management tools for service meshes. Different meshes also have conflicting views on observability and telemetry. Those disparate views mean that getting the same observability plane set up on different meshes requires serious tuning each time a mesh is installed.
So, how do we end this battle and help service meshes reach interoperability? The key is neutrality. With a core group of neutral standards, service meshes can stop fighting and instead act like Switzerland, a country known for neutrality. Being like the Promised Land of fondue and chocolates isn’t impossible. Going beyond SMI, the following wish list aims to standardize service meshes, not by defining the standards themselves but by describing how those standards should behave and deliver in practice.
1. Fast Installation Standard
A significant barrier to service mesh adoption is fear of installation. Understandably so. Service meshes can be devilishly hard to install and deploy. To address this, as a design principle, we can institute a “fast installation” standard and even benchmark installation time on top of a standard Kubernetes cluster.
Installation time is also a good indication of how well a service mesh can handle complexity. It demonstrates the ability to deliver a good user experience despite complex activities under the hood. Better yet, it demonstrates intelligent prioritization by limiting the mesh to core capabilities. That said, the general goal should be an “opt-in” experience with less complexity rather than an “opt-out” nightmare.
2. Fast Removal Standard
Fast, hitless removal, wherever possible, is the flip side of fast installation. Any hard-to-remove service mesh will reduce the likelihood of adoption and make it harder for application teams to trust the mesh for critical tasks. Rollback to running any application or service without a mesh, particularly internally, will be table stakes as teams design for Kubernetes environments that are fluid and afford them complete control.
Granted, fast removal mileage may vary. For example, if an application team has created a mesh environment with numerous customized CRDs and advanced functionality, then ripping out the mesh may take longer. Even so, setting a standard and benchmark for removal is a tractable problem on which agreement should be easy to reach.
3. Core Observability Standard
If you can’t observe it, you can’t manage or understand it. Kubernetes and service meshes present some novel observability challenges because much of their focus has been on the networking layer rather than the application layer where user-facing transactions occur.
OpenTracing was a fantastic start, and the community around it built a powerful vision to give all service meshes observability with a common API. Zipkin and Jaeger have their strengths as well, with Zipkin being an all-in-one tracing solution. Other projects, such as OpenCensus, attacked tracing and observability in other ways. However, having too many competing projects led to a lack of accepted standards for tracing context.
Fortunately, in an effort to unify observability into a single standard, the incompatible OpenTracing and OpenCensus projects began merging in the spring of 2019 to form OpenTelemetry. This was a huge step forward, combining tracing and supported language libraries into a broader vision of cloud native telemetry. OpenTelemetry also embraced W3C Trace Context as the standardized trace-propagation mechanism. Still, there is work to do. While the most popular coding languages are supported, dozens remain unsupported. In addition, not all observability backends support OpenTelemetry equally well. It’s a work in progress, but a very promising one.
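To make the trace-propagation point concrete, here is a minimal sketch of reading a W3C Trace Context `traceparent` header, the value a mesh sidecar would need to pass along unchanged for a trace to survive a hop. The function and type names are illustrative assumptions, not part of any mesh or OpenTelemetry API, and the parser is deliberately strict: it accepts only the four-field layout defined for version 00.

```python
import re
from typing import NamedTuple, Optional

# Four-field layout from W3C Trace Context (version 00):
#   version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):  # hypothetical helper type for this sketch
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version "ff" and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of trace-flags is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

Because every participant agrees on this one header format, a request can cross meshes and still be stitched into a single trace by whichever backend receives the spans.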
4. Workload Management Standard
Knowing the requirements of a workload should affect the way a service mesh treats that workload. For instance, a workload that is a financial transaction should require encryption and mTLS for all spans of activity that contain account or personally identifiable data. There is no easy way to label workload types and set different standards for how those workloads should be treated in Kubernetes and service meshes. Currently, the closest thing is setting up rules and policies — such as retries, timeouts and terminations — for each service. While that might work in an environment with a handful of microservices, it quickly becomes more complicated as you add more microservices. Service meshes need a standardized method to label workload types while assigning requirements and rules to them. That way, workload management will be easier, automatic and precise at scale.
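As a rough illustration of what labeling workload types could look like, the sketch below maps a workload-class label to a set of traffic rules, so that rules are written once per class rather than once per service. Everything here is hypothetical: the label key, the class names and the policy fields are assumptions made for the example, not an existing Kubernetes or service mesh API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrafficPolicy:  # hypothetical policy shape
    require_mtls: bool
    max_retries: int
    timeout_seconds: float


CLASS_LABEL = "example.com/workload-class"  # hypothetical label key

# Hypothetical workload classes; a real standard would define its own names.
POLICIES_BY_CLASS = {
    "financial-transaction": TrafficPolicy(
        require_mtls=True, max_retries=0, timeout_seconds=5.0
    ),
    "best-effort": TrafficPolicy(
        require_mtls=False, max_retries=3, timeout_seconds=30.0
    ),
}

# Unlabeled or unknown workloads fall back to the strictest policy.
DEFAULT_POLICY = TrafficPolicy(require_mtls=True, max_retries=0, timeout_seconds=5.0)


def policy_for(labels: Mapping[str, str]) -> TrafficPolicy:
    """Look up the policy for a workload from its labels."""
    workload_class = labels.get(CLASS_LABEL, "")
    return POLICIES_BY_CLASS.get(workload_class, DEFAULT_POLICY)
```

With a scheme like this, adding the hundredth microservice means attaching one label rather than writing another set of per-service rules.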
5. Data Management Standard
Managing data in an ephemeral and ever-changing environment is challenging. In the early days, Kubernetes users avoided running critical databases in their clusters and linked service meshes to outside data stores to ensure that their data was safe and sound. Data requires special care due to regulations like GDPR, FIPS and CCPA, which cover how you handle data, how you make it accessible to customers and the data’s physical location. Data handling in Kubernetes and service meshes remains complex and largely an afterthought.
This is similar to the non-Kubernetes world, where most developers dump data for prototypes and early stage application projects into a simple SQL store, S3 bucket or MongoDB, delaying the data management decision-making process. In Kubernetes and service mesh, kicking that can down the road injects far more complexity as microservices break down data transactions into more discrete tasks and usages. This, in turn, requires more detailed management. Similar to workloads, data management standards would simplify life for operators and development teams alike by allowing them to define the data requirements for each service using a standard language or standardized structure.
6. Protocol Support Standard
Today, there is a wide variance in protocol support among service meshes, which complicates swapping one service mesh for another. Additionally, there is not yet full agreement on which protocols should be supported and what that support should look like. Arriving at an agreement is crucial because protocols are often critical in application design. For example, some applications use gRPC for specific communications needs, but not all service meshes support full observability of gRPC traffic. Thus, consistent protocol support with consistent support functionality is essential for mesh interoperability and portability.
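One way to picture the portability problem is as a gap check between what an application needs and what a candidate mesh offers for each protocol. The sketch below assumes a hypothetical capability matrix; the protocol and feature names are made up for illustration and do not describe any particular mesh.

```python
from typing import Dict, Set

# Hypothetical capability matrix: protocol -> features supported for it.
Capabilities = Dict[str, Set[str]]


def missing_support(required: Capabilities, offered: Capabilities) -> Capabilities:
    """Return the features the application needs that the mesh lacks."""
    gaps: Capabilities = {}
    for protocol, features in required.items():
        lacking = features - offered.get(protocol, set())
        if lacking:
            gaps[protocol] = lacking
    return gaps


# Illustrative data only.
app_needs = {"grpc": {"routing", "observability"}, "http": {"routing"}}
target_mesh = {"grpc": {"routing"}, "http": {"routing", "observability"}}
gaps = missing_support(app_needs, target_mesh)
```

Here `gaps` contains only the gRPC observability feature, which is the kind of mismatch that makes a mesh swap risky. A shared protocol support standard would make such a matrix uniform across meshes and ideally leave the gap empty.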
Switzerland Is Not That Far Away
In the skirmish of technology, creating neutral, Switzerland-like standards is both achievable and necessary. While it does require a lot of coordination, we know it’s doable, even at the broadest possible scale. Browser vendors and the W3C have proven this by creating evolving standards that deliver near-perfect interoperability across all browsers and the web applications running on top of them.
At F5, NGINX has laid foundations for these standards by adopting “simplicity” as one of our core tenets for NGINX Service Mesh. Simplicity trends towards efficiency, and we’re proud to support a fast and easy mesh. Installation and removal can take mere minutes. We’re also always looking to provide value with expanding protocol support, both down the stack with L4 needs like UDP, and up the stack with feature parity for gRPC and other L7+ protocols as they develop. Finally, workload and data classifications are an innovative goal we hope to realize for the future.
With these standards, neutrality can be found, and the fondue and chocolates are within reach. Building in this base level of standardization, so that architects, developers, security and platform ops teams can mix and match meshes as needed, will be the best way to drive adoption and make service meshes just like browsers: ubiquitous, understood and trusted.
Featured image via Getty Images.