Lightbend Podcast: What Is CloudState?
Jonas Bonér and Viktor Klang describe how the newest OSS technology from Lightbend tackles distributed state management
In this Lightbend podcast for August 30th 2019, Jonas Bonér (CTO) and Viktor Klang (Deputy CTO) talk about CloudState, Lightbend's recently-announced OSS project that provides application state handling in a world of stateless infrastructure. As Kubernetes continues to dominate this area of cloud-native container orchestration, this is just the right time for addressing one of the trickiest areas in distributed computing: managing state. In this conversation, we review the WHY behind CloudState, as well as a description of what's going on under the hood, and how you can support this ground-breaking project with your own contributions.
Note: this transcript has been edited for clarity and flow, though none of the sentiment or details have been changed. Some early banter has been left for your viewing pleasure :-)
Host: Jonas, Viktor, it's great to have you on the podcast. Before we get into it, how would you describe what you do as a job to a non-technical person, i.e. at a dinner party, and without using CTO in the description?
Jonas: That's hard, as I do so many different things. I guess it's sort of a mixture between doing some outward marketing, being sort of the public face of the company in the technical sense. I'm trying to be a bridge between the technology and the business, both by looking forward and thinking ahead a little bit and try to predict the future–even though it's extremely hard and often I'm wrong. I also talk to customers about what they are struggling with today, and try to bring that back to the teams.
Host: Alright, so some fortune-telling and getting the word out, I like that. Viktor how do you describe what you do?
Viktor: Yeah, it's a tricky question. I think, the most succinct response is that I try to bridge product and technology and customer. Essentially what I do is research on potential opportunities and what could work and what could not work or doesn't work. And how that fits into our current technologies and what could be our future technologies. It's a very dynamic job, where you do a lot of different tasks and are involved in different projects. So every day is a new day.
Jonas: You do what I tell you to do.
Viktor: Yeah, that's true.
Jonas: I'm just kidding.
Viktor: I try my very best, hahaha...
Host: Alright, so let's talk about CloudState. How would you describe CloudState in 30 seconds, Jonas.
Jonas: Okay, 30 seconds, when can I start?
Host: You're already five seconds in.
Jonas: Oh, sorry.
Viktor: You've probably wasted 10 now.
Jonas: CloudState is an initiative to define what is “Serverless 2.0”—or what's the next generation of serverless is all about–by adding a concept, model, and implementation around managing state. Anyone who has built a cloud application, a distributed system, knows that state is actually the hardest thing to deal with, while serverless up to this point more or less ignores that. So we're trying to add that to the mix with essentially two things. First is a standardization effort defining a specification, a protocol and a TCK for implementing these things. Second is a reference implementation, implementing the spec. And we can talk about, more about the details about this later.
CloudState is an initiative to define what is “Serverless 2.0” by adding a concept, model, and implementation around managing state.
Host: Viktor, as Jonas mentioned, one of the hardest parts of distributed computing is managing state. Can you explain to people who don't know much about this topic, why is that so hard?
Viktor: I think it boils down to everybody wanting to be able to keep track of what has happened. And the only way to do that is to store it somewhere—most systems need to store data somewhere. Most systems are not just pure functions or transformations between different points of data. They need to keep some sort of memory, and that memory needs to live somewhere. Something needs to manage that memory, and that memory needs to be disseminated to the different parties that are interested in recalling it.
That doesn't really come for free, because when you store something somewhere, you need to be able to access it. If you don't store it in the same location that you access it from, it adds coordination and communication costs to the equation as well. Since a lot of development requires keeping memory, I think having a solution for writing functions as a service (FaaS), or more the pay-as-you-go kind of development, requires a solution to the repeating problem of state management.
CloudState tries to address that by facilitating the storing and recalling of memory, what we call state here. So by narrowing down the problem space of it, the access patterns and the usage patterns of the state, CloudState can facilitate that part of the concern leave the business code cleaner for the developer. So you don't need to add state management code within your business code, and since CloudState facilitates that, we can keep track of latencies and sort out where is the current bottleneck. Are we bottlenecking in latency, or are we bottlenecking in compute, performing the actual business code? So, it actually provides a really nice separation of IO versus compute as an end result, which we think is a very interesting thing because you might want to scale your storage completely separate from scaling the business application.
Host: Jonas, you mentioned getting a TCK ready, having a reference implementation and all of these things, is what we're working on with CloudState. Was this a similar process to the Reactive Streams initiative that we spearheaded along with an excellent group of technologists, and if so, how was getting this prepared for general consumption different or similar to Reactive Streams?
Jonas: [Like Reactive Streams] it all started with us scratching our own itch. We've been hearing it from customers building their applications with Akka that they are very intrigued with the idea of serverless; however, they feel that in a real world, non-trivial application, serverless would be just a small part of the application with a few use cases that fit the model as it is today in FaaS, meaning stateless functions just consuming and emitting events. This means that either they can't utilize serverless at all, or they have to split up, run some sort of hybrid approach with certain things running in the cloud with Lambda or hosted on their own computers.
So this CloudState initiative started with essentially the bold goal, the grand goal of, “What will it take to be able to write any application in this new experience?” The way I look at it, serverless is not the same thing as FaaS, even though some people might confuse the two. FaaS is just a first implementation of serverless that sort of paves the way and shows what's possible, but abstracting away all these things that FaaS is able to do now is really only the first step.
What I would like to do is be able to write any application, any use case, in this new experience in the cloud. That’s not possible today. So what we were thinking through is “What is the missing piece here? What is it that people need support for in order to do this?” And it all came down to one thing, and that's state. Distributed state management. That means both having a scalable and available way of persisting your domain logic, the thing that you’d normally put in SQL databases and just serve a record or whatever need you wanna call it. That's usually your business value.
What is the missing piece here? What is it that people need support for in order to do this? It all came down to one thing, and that's state. Distributed state management.
It also means having a good way of managing a state that flies by, state in motion. That's often found in streaming use cases where services need to be somehow stateful and durable because it's just too costly to redo things over and over again. This also includes simple things like coordination of state, and making sure that your different user functions or services can coordinate and be in sync with each other. Really the only way of doing that is if you have a stateless layer–and this holds true both with traditional three-tier stateless architecture as well as FaaS. The only way you can do that now is through storing it in all the states, even coordination states, in a database to make it fully available and fully reliable. That of course adds a ton of latency if you must go over this durable medium somewhere else.
When you want to communicate perhaps on the same machine with other services, there's a lot complexity that we need to deal with here. The interesting thing is that we have already solved this over the last 10 years using Akka (Akka Cluster and Akka Persistence), but we have always delivered this solution to on premise infrastructure. So, what we try to do now is just take all these good ideas, many actually stolen from Erlang and other languages, and implement them in a solid platform around this idea. We try to bring them to the serverless experience, bring them to the cloud in this dressed up new model, and automate as much as possible. As Viktor talked about, the only way you can really do this is by constraining yourself and the data that you work with.
As a concrete example, FaaS is excellent at abstracting away communication. It's only event-in and event-out. So the question we asked was “Can we solve state in the same way?” so that it's always state-in, state-out, through a very well-defined protocol. It actually turns out that we're able to solve both domain state, as well as coordination state, through this well-defined, very constrained protocol.
By having a more constrained protocol, we can do exactly what Viktor talked about: having the runtime optimize things because if less things start happening in the business code, the runtime and the infrastructure can take more assumptions. Then it can actually start optimizing things like latency versus throughput, start sorting, and scaling out things in a much better way than if you have the hurdle of state inside the function as it is today.
Host: That was a good explanation of several topics. I wanted to take a quick step back on something here. We're talking about the current limitations of FaaS, for example Amazon Lambda. What are some of these limitations and how would those limitations look if we mapped that onto let's say a mobile app experience? I don't know, Twitter, Spotify something like that...what would it look like if we were using FaaS for an application or service that many of us use every day, and how would we see those limitations manifest themselves?
Jonas: Do you want to start with that Viktor or should I continue ranting? Hehe. I can start. I think first what FaaS today is really good at, is it is great for and I think it was intended for. It's not fair to say that it's bad at something else. It really shines at these processing intensive workloads, when you parallelize out a ton of work. Ideally, in what we call in computer science “embarrassingly parallel”, when there's really no data coupling. That's really where it shines. It's excellent as sort of data fabric, where you're moving data from A to B, or from one endpoint to another endpoint, and before that some sort of data enrichment or transformation.
What is lacking is the tools and mechanisms and abstractions and models to build general purpose applications. We need support for any type of microservice, not just a stateless processing type of microservice, streaming pipelines, and all the things with low latency real-time streaming that we see a lot of use cases for–like building up and serving machine learning models in memory while still having them durable so you don't lose them if your node goes down. An example might be taking your recommendation and using it for predicting things. Or a classic use case like just having an easy, reliable, but still very high throughput and low latency way of deploying your shopping cart.
That's really hard–and not just hard, but inefficient too if you always have to store it a database somewhere else, e.g. DynamoDB or Apache Cassandra. There are no other tools to do it in a more efficient way.
So there are many interesting use cases that are possible in Lambda and FaaS, but just not in an optimal way. Sometimes things are just too slow in this new world of real-time, data-centric applications where it’s all revolving around data and getting intelligence, or getting inside the data as fast as possible. This latency of always having to store and fetch things from somewhere else, whatever that might be, is just too high.
Host: So it's not only that performance would probably suffer and we would start to deal with really slow services. Viktor, you mentioned the communication overhead would eventually get completely disproportionate to the value that it's providing. Why is that?
Viktor: First and foremost, the timeliness of providing value is the most important feature of a feature. If you can't provide the value within the time boundaries when it's needed, then what good is it? It doesn't matter if you're correct if you are not able to supply value within a reasonable amount of time.
I think one of the challenges here, to expand a bit on Jonas' explanation, is if a function (i.e. FaaS) needs to keep memory, then it needs to coordinate with some data storage facility. If that coordination is done within that function, then nothing outside of it knows what that function is going to do. So there is no way for the runtime to really optimize things.
So let's say that a function decides to call out to fetch or retrieve some memory that it stored before. Now, if there are multiple instances of the same function because we want to sort of scale this out or make it, as Jonas called it, “embarrassingly parallel”, we can just pin up new instances and feed them data. You might now have multiple functions activated at the same time, which coordinate against the same memory and the same state. Then you need to do state management: either you have some fancy transactional protocol that you need to figure out how to work, or you need to push that responsibility to the data store in the form of something like transactions in a relational database. Nonetheless, whenever you have contention over the same state, you're actually losing performance because you need to do that coordination in the database.
Whenever you have contention over the same state, you're actually losing performance because you need to do that coordination in the database.
Host: It's like shifting the bottleneck further down the chain.
Viktor: Yeah. You're pushing the bottleneck further down to the storage; but also, when you have a distribution of data or requests coming in that is non-optimal in the sense that it's trying to access and change the same data, then you're getting more slowed down because the database has much more to do and tries to figure out who should go first, and who should go after that, etc. When you have lots of traffic, you also have the worst performance, because of all this traffic that is trying to coordinate against the same data.
Host: So right when you need as much performance as possible, when you're a retailer having a seasonal campaign or a flash sale, that's exactly when performance starts to die down?
Viktor: Exactly. That's mainly because if the runtime doesn't know anything about how the function is doing its data access, or what data it's going to access, then it can't really help out either. By moving that responsibility to the platform or the runtime, we know exactly what data is going to be used by the CloudState “function”—it's more of an entity, because it has state. Since the runtime knows about the data, then the runtime can also handle that data access. The runtime will either cache if it's been used before on that node, or it will fetch that data before it invokes that entity or function. Nothing will get started before the data is available. Which means that you can do all kinds of decisions around that. You might want to serve other traffic while you're waiting to get data in.
By moving more information and knowledge into the runtime, you can get more value as a user, without having to sort of pay for it with extra work, complexity, maintenance, etc. When we started with this idea, the question was “could we solve this problem?” We didn't really know whether we we're going to be able to solve it, and it was really interesting to get to the state, no pun intended, where we could say that, “Hey, this actually works.” By just having the solution, look what other things that we get as well. It's been really interesting to see both the solution to a problem, and the benefits of solving that problem. There are tons of additional benefits that are sort of falling out of this solution.
Jonas: One benefit of solving it this way, by abstracting over state-in and state-out, is that you can have multiple different types of implementations depending on what type of state you have and what kind of guarantees you need. It's not one size fits all: all transactions are lined up, waiting. Regardless if you need that level of guarantee or not. You can actually have more knobs to turn, so to speak, when it comes to consistency, reliability, availability, latency and throughput.
Up to this point, talking of concrete things, we have implemented support for two different models, with a third on the way. And there's really nothing preventing us from having even more, as long as they fit the model of having this smaller surface area when it comes to the state in and state out of the function. What we've implemented so far is event sourcing, which is a great model for event-driven services, which all FaaS elements are. A function is receiving and making events, and it fits this event sourcing model very, very well as a service of record.
The other one is a little less known: conflict free replicated data types (CRDTs), which is quite new research, around 10 years or so. We’ve had them in Akka for about five years or so (Akka Distributed Data). CRDTs are a data type that suits very well this type of coordination when you want to have really good availability and little contention. The third one that we are thinking about implementing is key value store, which also has a bit smaller surface area, but a well defined model that fits quite well.
Host: Why don't we change gears a little bit and talk about some of the specifics around CloudState. Maybe let's talk a little bit about the general design and architecture. I understand that it's polyglot, but what else is going on?
Jonas: Yeah. I think that's an extremely important point that it's polyglot, even though it's implemented on the JVM–some people might be worried about start up times with the JVM so we can get more into that in a second. So it's written on the JVM, but only exposing this to JVM developers and the Java community is of course sad, when we're really trying to change the way serverless is managed in a general sense.
Viktor: One of the most interesting aspects of this is really bringing a lot of the value that we've been coming up with over the past decade and making it available in essentially any language. It's going to be very interesting to see whether somebody will implement a PHP version of this where PHP gets the benefits of having a scalable, very reliable way of writing essentially serverless stateful functions.
Relying on widely-adopted industry standards like gRPC means that you can interoperate with everything else as well. GRPC is typically running over HTTP/2 and Protobuf, and one thing that we added fairly recently was auto-generation of HTTP plus JSON endpoints for all your endpoints as well, so now you can interoperate or call your stateful functions from any language that speaks HTTP plus JSON, which is practically everything. Having this fit in to an existing ecosystem of the web is extremely exciting and though we'll see where we can take it from here, just having what we have right now is extremely powerful.
One thing that we added fairly recently was auto-generation of HTTP plus JSON endpoints for all your endpoints as well, so now you can interoperate or call your stateful functions from any language that speaks HTTP plus JSON, which is practically everything.
Jonas: This is about the general architecture only from a user's perspective. The way it works there is the gRPC Protocol on the back-end is served by a proxy or a side-car, and that runs Akka with Akka Cluster and Akka Persistence. Akka Cluster and Akka Persistence together do most of the heavy lifting when it comes to managing state. Both your domain logic and service record using things like event sourcing as well as CRDTs are all coming from Akka Cluster, plus cluster sharding.
We haven't talked about but one of the things that's really key: CloudState provides co-location of data and processing, which is extremely important, of course. Then it runs everything on Kubernetes, which is the de facto standard when it comes to these things. We have looked into running it on top of Knative as well, but we've run into some problems when it comes to the autoscaler. We’re having discussions with the Knative team to revisit some of those points.
Another interesting angle though, if you want to talk more about it, Viktor, is the presence of GraalVM. I hinted that the JVM might not be the optimal platform for running serverless with its long start-up times and high memory consumption, but that is something that GraalVM can fix.
Viktor: Right. GraalVM has a tool called Native Image, which can take a JVM application and creating a native executable out of that. This takes a bit of work, but once you get it working you actually end up with a boot-up/start up time which is extremely fast. You need that because if you are providing essentially all the value in the sidecar, then the sidecar needs to be able to boot up rapidly. The user's function is not sitting around waiting for the sidecar to boot up. By making the sidecar into a native executable, we can get the boot-up times extremely fast.
Host: Alternatively, without GraalVM, you wouldn't be able to have a native executable or a native package. For people who may not know too much about that, what does that mean? What's the disadvantage?
Viktor: Well, then you will have to run a JVM, and a JVM typically has a significant start-up time. Let's say a couple of seconds. But instead we can boot up in the order of milliseconds with GraalVM. It's almost instantaneous.
Host: So, when you're doing this millions of times per minute perhaps...
Jonas: Exactly. One thing that's important to understand, of course, is it's generally not a problem. Our customers have been very successful running Akka and everything on the JVM on-prem for years. But why it matters so much when it comes to serverless is that the whole serverless model is a pay-as-you-go model. When you have idle work, then it's extremely important that you can scale down very, very rapidly. And, if you scale down ideally to zero or there's a few instances, when load goes up again, it better start up extremely quickly or else you're sitting there and your requests are queuing up and you have a big problem. That's why it matters so much in the serverless world.
Viktor: Yeah, the JVM is fantastic once it's warms up and it's been extremely successful with long-running server applications. The real difference here is the on-demand processing that you want to do. This enables us to essentially take a lot of the value on the long-running server side of things and bring that to the on-demand, short-running things.
Host: Great, that makes a lot of sense. Viktor, is Reactive Streams, let's say, inside of CloudState? Is it supported?
Host: Let's talk a little bit about what it's like to use CloudState. I know that our colleague, James Roper, has some good videos out there. Viktor, could you describe what it looks like for developers starting out with CloudState for the first time?
Viktor: CloudState is a contract-first approach, which means that the developer first creates gRPC definition, a proto-file describing the external interface of the stateful function or the entity. And then, depending on the language and depending on what kind of state that they want to manage... Jonas mentioned before that we have both event sourcing and CRDTs, and we're planning to have more in the future like a key value storage or something akin to CRUD, but it won't be CRUD. It will be sort of a saner version of CRUD.
Host: CRUD 2.0?
Viktor: Yeah, or removing one or two of the letters.
Viktor: Haha, or “UD”. I don't know. Depending on the language and the state management strategy that you want to use for your entity, the end-user API will be different. And, what's really good about that is that every single language is able to create an end-user API that makes sense for that language. If you want to have a functional Java API, then you can have that. And, if you want to do a functional PHP one, or if you want to have Rust or C, you can make a C or Rust idiomatic API. The possibilities are endless. After writing this initial external service definition, you implement the service using the state strategy that you've chosen for your entity and you implement that.
What happens then is that after you have packaged this thing–and we recommend to wrap them up as Docker images because then you can deploy that to Kubernetes–our Kubernetes operator will see that “Hey, this is a CloudState entity.” And, we will add the side-car, the proxy, to that depending on what kind of state management strategy and kind of storage you want to use.
When the function starts, it will actually boot up an internal gRPC server and the proxy will call that server and say, “Hey, who are you?” The user's function will respond with the protocol, the external interface that it wants to present to the world, and the proxy will actually proxy that interface.
That flexibility of not tying the external interface to the internal API is also very powerful for having potentially different APIs for different styles for every language.
To the outside world, the proxy looks like the service. This means that there can be a huge difference in the external interface and the internal implementation or internal API.
Host: Does that also improve the security of these entities?
Viktor: It certainly can. With the proxy, we can implement access control, and we can do all kinds of things because the user function is not accessible from the outside, it's only accessible to the sidecar with the proxy, so it allows you to have a buffer between the outside world and the function itself. But of course, this is sort of a moving target so there's tons of stuff that we could add but the question is, what we should delegate to other services that are running? Authentication and authorization might be one of the best things to have there.
Host: That's good news to a lot of people like that. When it comes to getting a CloudState entity, your new function out into the world, what does it look like from the observability deployment, DevOps, sort of perspective? What's the deployment process look like?
Jonas: It's very rudimentary at this point. As with everything you have to say we are just piggybacking on Docker and Kubernetes. We don't have anything for observability at this point. That said, as far this vision there should be a great UI and because the whole idea is you should consume this basically only through UI and a command line. The ecosystem around Kubernetes for deployment and these type of things is extremely advanced and it's heaven compared to a few years ago, so I think it's quite decent even though we haven't done anything really around it.
Viktor: I think it's a strategic choice. You definitely want to decide where you want to innovate and where you want to essentially lean back and let others solve the problems. So for this specific thing, it just looks like another Kubernetes service. It's essentially just a
kubectl apply command and if you have tooling that works with that, then this just looks like anything else. So that also means that it inter-operates really well with other things in that ecosystem. So it's actually a conscious decision not to do anything fancy or specific here.
Host: Or reinvent anything that might be out there or emergent...
Jonas: One really important point here as Viktor said, by just going with the standards, that means that other tools that we even haven't thought of that people might want in their deployment pipelines or processes will just work because we don't do anything special or specific.
Host: No magic annotations haha...
Jonas: We are just trying to be good citizens.
Host: Guys, this has been great. I guess before we end for today, this is an open source project, and obviously contributions are welcome. Where would you suggest someone listening to this to get started with helping contribute or reviewing? What would you suggest?
Jonas: Go to cloudstate.io first. From there you can navigate out to the documentation and now the documentation currently lives on GitHub. It's in one big readme, but it has a table of contents so it should be fairly easy to browse. That's a good starting point. Read up on the background to the project, the problem we're trying to solve, the general design approach. If you disagree, or have ideas, or are thinking ahead about what we can do, let us know and tell us what you think. Or if you want to roll up your sleeves and contribute, you should join the mailing list, or Gitter channel, or find us on Twitter at @cloudstateIO.
I want to stress that this is not meant as the closed project, not even a general open source project where we “accept” contributions. We actively encourage contributions. I actually don't think we can get where we want without other people and companies helping out, and this is why we have the angle of having CloudState as more of a standardization project with reference implementation. There can absolutely be other reference implementations, and you can help out by adding more client libraries, if you get excited about these things.
CloudState is not meant as the closed project, not even a general open source project where we “accept” contributions. We actively encourage contributions. I actually don't think we can get where we want without other people and companies helping out.
Viktor: Absolutely. As you mentioned earlier in this interview Oliver, James [Roper]'s new video demonstrating how it works are very informative as well, if you prefer to watch videos. Of course, nothing here is set in stone at all, so we definitely want help in formalizing the spec. We already have a rudimentary TCK to validate the implementations that also makes it easier to write new language implementations and we can automate the testing of those. So do you have a favorite language that you would like to use with this? Then that would be amazing to see more languages have implementations for this. And it's a network effect, so whatever language is added can be used by any other of the languages that are already available. So the more the merrier.
Host: Well, Jonas, Viktor it's been excellent speaking with you today. I have to say I'm a lot more excited about CloudState after listening to you two describe it than before this podcast started. Distributed state management for Cloud native applications and stuff like that sounds like just what the industry needs. So, for our listeners out there if you'd like to give us any feedback, feel free to ping us on Twitter and be sure to visit cloudstate.io the website, and fill your brain with interesting ideas there. So gentlemen, thanks once again and we'll say goodbye.
Jonas: Thanks a lot, Oliver. Always enjoy being on your podcast.
Viktor: Thanks Oliver. Have a good day.