Stateful Service Design Considerations for the Kubernetes Stack

Key Takeaways

Most developers building applications on top of Kubernetes still rely mainly on stateless protocols and design. The problem is that focusing exclusively on a stateless design ignores the hardest part of distributed systems: managing state, which is to say your data.
The challenge is not designing and implementing the services themselves, but managing the space in between the services: data consistency guarantees, reliable communication, data replication and failover, component failure detection and recovery, sharding, routing, consensus algorithms and so on.
Kubernetes and Akka work well together because each is responsible for a different layer and function in the application stack. Kubernetes allows for coarse-grained container-level management of resilience and scalability. Akka allows for fine-grained entity-level management of resilience and scalability.
Akka has been growing rapidly in the last two years. Today it has around 5 million downloads a month, compared with 500,000 downloads a month two years ago.

At this summer’s QCon in New York, Jonas Bonér delivered one of the most popular talks of the conference with his focus on Designing Events-First Microservices. In this InfoQ Q&A, we asked Bonér to explain how “bringing bad habits from monolithic design” is a road to nowhere for service design, and where he sees his Akka framework fitting in the cloud-native stack.

InfoQ: At QCon you said, “when you start with a microservices journey you should take care not to end up with ‘microliths’ because you may bring bad habits from monolithic design to microservices, which creates a strong coupling between services.” Explain.

Jonas Bonér: In the cloud-native world of application development, I’m still seeing a strong reliance on stateless, and often synchronous, protocols.

Most of the developers building applications on top of Kubernetes are still mainly relying on stateless protocols and design. They embrace containers but too often hold on to old architecture, design, habits, patterns, practices, and tools made for a world of monolithic single-node systems running on top of the almighty RDBMS.

The problem is that focusing exclusively on a stateless design ignores the hardest part in distributed systems: managing state—your data.

It might sound like a good idea to ignore the hardest part and push its responsibility out of the application layer, and sometimes it is. However, as applications today are becoming increasingly data-centric and data-driven, taking ownership of your data by having an efficient, performant, and reliable way of managing, processing, transforming, and enriching data close to the application itself is more important than ever.

Many applications can’t afford the round-trip to the database for each data access or storage and need to continuously process data in close to real-time, mining knowledge from never-ending streams of data. This data also often needs to be processed in a distributed way—for scalability, low-latency, and throughput—before it is ready to be stored.

InfoQ: Stateful services have been long-cited as the toughest obstacle for mainstream container adoption. Tell us a little bit more about why this is such a tricky area?

Bonér: The strategy of treating containers as logically identical units that can be replaced, spun up, and moved around without much thought works really well for stateless services but is the opposite of how you want to manage distributed stateful services and databases. First, stateful instances are not trivially replaceable since each one has its own state which needs to be taken into account. Second, deployment of stateful replicas often requires coordination among replicas—things like bootstrap dependency order, version upgrades, schema changes, and more. Third, replication takes time, and the machines which the replication is done from will be under a heavier load than usual, so if you spin up a new replica under load, you may actually bring down the entire database or service.

One way around this problem, which has its own problems, is to delegate the state management to a cloud service or database outside of your Kubernetes cluster. That said, if you want to manage all of your infrastructure in a uniform fashion using Kubernetes, then what do you do?

At this time, the Kubernetes answer to the problem of running stateful services that are not cloud-native is the concept of a StatefulSet, which ensures that each pod is given a stable identity and dedicated disk that is maintained across restarts (even after it’s been rescheduled to another physical machine). As a result, it is now possible—albeit still quite challenging—to deploy distributed databases, streaming data pipelines, and other stateful services on Kubernetes.
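To make "stable identity and dedicated disk" concrete, here is a small sketch of the naming scheme a StatefulSet follows: pods are numbered by ordinal, each gets a DNS name under the governing headless service, and each volume claim is named after the claim template plus the pod. The function name, the example names, the `data` claim template, and the default `cluster.local` domain are assumptions made for illustration; this only computes names and does not talk to a cluster.

```python
def stable_identities(statefulset, service, namespace, replicas,
                      claim_template="data", cluster_domain="cluster.local"):
    """Return the names each replica keeps across restarts and rescheduling."""
    identities = []
    for ordinal in range(replicas):
        pod = f"{statefulset}-{ordinal}"  # <statefulset name>-<ordinal>
        identities.append({
            "pod": pod,
            # Stable network identity via the governing headless service.
            "dns": f"{pod}.{service}.{namespace}.svc.{cluster_domain}",
            # One claim per pod: <claim template name>-<pod name>.
            "pvc": f"{claim_template}-{pod}",
        })
    return identities


if __name__ == "__main__":
    # Hypothetical three-replica database.
    for identity in stable_identities("db", "db-headless", "prod", 3):
        print(identity)
```

Because the names depend only on the ordinal, a replica that is rescheduled to another machine comes back under the same pod name and reattaches to the same claim, which is what lets a database node find its own data again.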

What is needed is a new generation of tools that allow developers to build truly cloud-native stateful services whose infrastructure requirements do not go beyond what Kubernetes already gives to stateless services. This is not an argument against the use of low-level infrastructure tools like Kubernetes and Istio, which clearly bring a ton of value, but a call for closer collaboration between the infrastructure and application layers in maintaining holistic correctness and safety guarantees.

InfoQ: Tell us about where Akka fits into this set of stateful service requirements for the Kubernetes stack.

Bonér: Really, the hard part is not designing and implementing the services themselves, but managing the space in between the services. Here is where all the hard things enter the picture: data consistency guarantees, reliable communication, data replication and failover, component failure detection and recovery, sharding, routing, consensus algorithms, and much more. Stitching all that together yourself, and maintaining it over time, is very hard.

End-to-end correctness, consistency, and safety mean different things for different services, are completely dependent on the use case, and can’t be outsourced completely to the infrastructure. What we need is a programming model for the cloud, paired with a runtime that can do the heavy lifting, so that we can focus on building business value instead of messing around with the intricacies of network programming and failure modes. I believe that Akka paired with Kubernetes can be that solution.

Akka is an open source project created in 2009, designed to be a fabric and programming model for distributed systems, for the cloud. Akka is cloud-native in the truest sense: it was built to run natively in the cloud before the term “cloud-native” was coined.

Akka is based on the Actor Model and built on the principles outlined in the Reactive Manifesto, which defines Reactive Systems as a set of architectural design principles that are geared toward meeting the demands that systems face—today and tomorrow.

In Akka the unit of work and state is called an actor and can be seen as a stateful, fault-tolerant, isolated, and autonomous component or entity. These actors/entities are extremely lightweight in terms of resources (you can easily run millions of them concurrently on a single machine) and communicate using asynchronous messaging. They have built-in mechanisms for automatic self-healing and are distributable and location transparent by default. This means that they can be scaled, replicated, and moved around in the cluster on demand, reacting to how the application is being used, in a way that is transparent to the user of the actor/entity.
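The actor idea can be illustrated with a toy sketch in plain Python. This is not Akka's API: `CounterActor`, `tell`, and the message names are hypothetical, and the sketch spends one thread per actor for simplicity, whereas a real actor runtime multiplexes many actors over a small thread pool, which is what makes them lightweight. What it does show is private state, a mailbox, and asynchronous, one-at-a-time message handling.

```python
import queue
import threading


class CounterActor:
    """A toy actor: private state, a mailbox, and one thread draining it."""

    _STOP = object()  # sentinel message used to shut the actor down

    def __init__(self):
        self._count = 0                  # state owned by this actor only
        self._mailbox = queue.Queue()    # messages wait here until processed
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message, reply_to=None):
        """Fire-and-forget send: enqueue the message and return immediately."""
        self._mailbox.put((message, reply_to))

    def stop(self):
        self._mailbox.put((self._STOP, None))
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, so the state needs no locks.
        while True:
            message, reply_to = self._mailbox.get()
            if message is self._STOP:
                return
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)


def demo():
    actor = CounterActor()
    for _ in range(1000):
        actor.tell("increment")
    replies = queue.Queue()
    actor.tell("get", reply_to=replies)  # the answer arrives as a message too
    total = replies.get(timeout=5)
    actor.stop()
    return total
```

Callers never touch `_count` directly; they can only send messages, so the actor could in principle be moved to another process or machine without its users noticing.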

InfoQ: So where is the separation of concerns between Kubernetes and Akka when they are used together?

Bonér: One way to look at it is that Kubernetes is great in managing and orchestrating “boxes” of software (the containers) but managing these boxes only gets you halfway there. Equally important is what you put inside the boxes, and this is what Akka can help with.

Kubernetes and Akka compose very well, each being responsible for a different layer and function in the application stack. Akka is the programming model to write the application in, together with its supporting runtime. It helps manage business logic; data consistency and integrity; operational semantics; distributed and local workflow and communication; integration with other systems, etc. Kubernetes is the tool for operations to manage large numbers of container instances in a uniform fashion. It helps manage the container life-cycle; versioning and grouping of containers; routing communication between containers; managing security, authentication, and authorization between containers, etc.

In essence, Kubernetes’ job is to give your application enough compute resources, get external traffic to a node in your application, and manage things like access control, while Akka’s job is to decide how to distribute that work across all the computing resources that have been given to it.
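One common way to distribute entity-level work over a changing set of nodes is to hash each entity ID to a shard and then assign shards to whatever nodes are available. The sketch below illustrates that general technique in plain Python; it is not Akka's implementation, and the shard count, pod names, and function names are assumptions chosen for the example.

```python
import hashlib

NUM_SHARDS = 16  # assumed fixed shard count, chosen arbitrarily for the sketch


def shard_for(entity_id: str) -> int:
    """Stable hash, so every node computes the same shard for an entity."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def assign_shards(nodes):
    """Spread shards round-robin over whatever nodes the platform provides."""
    ordered = sorted(nodes)
    return {shard: ordered[shard % len(ordered)] for shard in range(NUM_SHARDS)}


def node_for(entity_id, assignment):
    """Look up which node currently hosts the entity's shard."""
    return assignment[shard_for(entity_id)]


if __name__ == "__main__":
    # Hypothetical pods; the platform decides how many exist.
    two_nodes = assign_shards(["pod-a", "pod-b"])
    three_nodes = assign_shards(["pod-a", "pod-b", "pod-c"])
    print(node_for("cart-42", two_nodes), node_for("cart-42", three_nodes))
```

The division of labor mirrors the text: the platform decides how many pods exist, and the application layer decides which entity lives where. An entity's shard never changes, only the shard's host, so scaling out means moving whole shards rather than rehashing every entity.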

InfoQ: Akka embraces a “let it crash” approach: losing one actor shouldn’t matter because another will pick up the work. Can you explain how this works in a container environment? Does an admin need to intervene?

Bonér: Traditional thread-based programming models only give a single thread of control, so if this thread crashes with an exception, you are in trouble. This means that you need to make all error handling explicit within this single thread. Exceptions do not propagate between threads, or across the network, so there is no way of even finding out that something has failed. But losing the thread or, in the worst case, the whole container is very expensive. To make things worse, the use of synchronous protocols can cause these failures to cascade across the whole application. We can do better than this.

In Akka you design your application in so-called “supervisor hierarchies” where the actors watch out for each other’s health and manage each other’s failures. If an actor fails, its error is isolated and contained, reified as a message that is sent asynchronously (across the network if needed) to its supervising actor, which can handle the failure in a safe, healthy context and restart the failed actor automatically according to declaratively defined rules. This naturally yields a non-defensive way of programming and a fail-fast (and recover) approach that is also called “let it crash.”
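The supervision idea can be sketched without any framework. The example below is a simplified, single-threaded illustration in Python, not Akka's API: `Worker`, `Supervisor`, `deliver`, and the `max_restarts` rule are hypothetical names. It shows the two key moves: a failure is turned into a message instead of unwinding the caller, and the supervisor reacts to it by replacing the failed child with a fresh instance.

```python
import queue


class Worker:
    """Hypothetical child actor that fails on bad input."""

    def __init__(self):
        self.processed = []

    def handle(self, message):
        if not isinstance(message, int):
            raise ValueError(f"cannot process {message!r}")
        self.processed.append(message * 2)


class Supervisor:
    """Owns one child and restarts it according to a declared rule."""

    def __init__(self, child_factory, max_restarts=3):
        self._child_factory = child_factory
        self._max_restarts = max_restarts   # the declarative restart rule
        self._failures = queue.Queue()      # failures arrive here as messages
        self.restarts = 0
        self.child = child_factory()

    def deliver(self, message):
        """Run the child's handler; turn any exception into a failure message."""
        try:
            self.child.handle(message)
        except Exception as error:
            self._failures.put((message, error))  # failure reified as data
        self._handle_failures()

    def _handle_failures(self):
        # The supervisor reacts outside the failed handler, in a healthy context.
        while not self._failures.empty():
            _message, error = self._failures.get()
            if self.restarts >= self._max_restarts:
                raise RuntimeError("restart limit reached; escalating") from error
            self.restarts += 1
            self.child = self._child_factory()    # fresh instance, clean state


def demo():
    supervisor = Supervisor(Worker, max_restarts=3)
    for message in [1, 2, "oops", 3]:
        supervisor.deliver(message)
    return supervisor.restarts, supervisor.child.processed
```

Note that the worker contains no defensive error handling at all; it simply fails, and recovery policy lives in one place, the supervisor. In this sketch a restart discards the child's in-memory state, so anything that must survive a crash would need to be persisted or replicated elsewhere.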

This might sound like it is overlapping with the roles of Kubernetes, and it’s true that both Kubernetes and Akka help to manage resilience and scalability, but at distinct granularity levels in the application stack. You can also look at the two technologies in terms of fine-grained versus coarse-grained resilience and scalability.

Kubernetes allows for coarse-grained container-level management of resilience and scalability, where the container is replicated, restarted, or scaled out/in as a whole. Akka allows for fine-grained entity-level management of resilience and scalability—working closely with the application—where each service in itself is a cluster of entity replicas that are replicated, restarted, and scaled up and down as needed, automatically managed by the Akka runtime, without the operator or Kubernetes having to intervene.

InfoQ: Could you give us an idea of Akka adoption? How many monthly downloads do you currently see?

Bonér: Akka has been growing rapidly in the last two years. Today it has around 5 million downloads a month, compared with 500,000 downloads a month two years ago. If you’re interested in some of the earliest history and milestones around the project, this infographic has some cool data points.

Press | November 15, 2018
