Akka and Play Framework at Groupon: A Look Inside A Real-Life Production System
Q/A On Groupon’s Reactive System Built With Akka And Play Framework
Groupon is building “The daily habit in local commerce”, offering a vast mobile and online marketplace where people discover and save on amazing things to do, see, eat and buy. By enabling real-time commerce across local businesses, travel destinations, consumer products and live events, shoppers can find the best a city has to offer.
Lightbend recently teamed with Groupon to tell the story of their adoption of Akka and Play Framework to solve issues of throughput, scalability, and resilience against downtime in their most critical systems. Read about this part of Groupon’s journey in How Groupon Scales Personalized Offers To 48 Million Customers On Time.
The story above describes the WHAT and HOW of Groupon’s modernization with Lightbend. For a more in-depth look at WHY Lightbend, we sat down with Aditya Athalye, Senior Software Engineer at Groupon, for a Q/A interview.
Q: Tell us a little about your development teams and technology stack. What Lightbend technologies are you using in production?
AA: There are many developers at Groupon using Lightbend technologies, namely Play Framework and Akka, with some Scala users in there as well. As part of our overall legacy infrastructure modernization goals, we are heavily invested in Play Framework for many of the microservices in Groupon, and we are looking at upgrading one of our existing services to the latest versions of Play and Akka.
Specifically, Play Framework powers Groupon’s aggregator service, which handles email and ad serving, schedules jobs, and pulls data from other services. Play interacts with Akka Actors, which are used to call n services that supply customer email content featuring deals, promos, visual banners and other marketing promotion assets across Groupon’s purchase funnel.
Q: Why wasn’t your legacy solution able to meet your needs?
AA: Like most e-commerce companies that started out in the previous decade, Groupon’s original platform was largely monolithic in nature. While this enabled the company to get off the blocks in record time with expansion in North America and many other countries within the first 3 years of operation, it became increasingly difficult to keep up with the ever-increasing demand from the business to incorporate more and more features into the platform.
The biggest issue was handling scale. The first generation platform had been built using Ruby on Rails, which made it very difficult to scale to meet spikes in demand or longer periods of high use, such as during events like the Super Bowl and Black Friday sales. We needed to continue creating ever-more sophisticated HTTP caching layers, but this wasn’t a long-term solution.
Another challenge was time-to-market for new features and services on the platform. As you can imagine, extending a monolith with new features, even something as simple as adding a new email type, meant running lots of expensive and time-consuming tests and taking the entire system down for deployment.
Though the Marketing engineering stack was largely re-built using the JVM, including technologies like Java 8, Postgres/MySQL for RDBMS, Redis/Cassandra/ElasticSearch for NoSQL, and Apache Kafka, it wasn’t just about adopting modern day solutions. We knew that we had to solve this scalability and extensibility challenge quickly, and that meant breaking down the monolith into more granular microservices. This required a constant balancing act between concurrency and parallelism to keep things lightweight and efficient, which was getting difficult and was prone to programming errors. This is when we discovered Play Framework and Akka.
Q: How did Groupon decide to select Lightbend technologies?
AA: Since our platform was already running on the JVM, it was helpful that Play Framework and Akka supported Java as well as Scala. After doing a lot of research, the team concluded that Lightbend technologies were the best fit for various reasons (this was before my time at Groupon).
First, it came down to the popularity and stability of these frameworks. Lightbend’s technologies have a very active developer/user community, which is growing by the day. Documentation for Play and Akka is extremely elaborate, and discussion forums are abuzz with activity all the time, which provides easy access to developers worldwide (including the active contributors to these frameworks) for any queries, and support on issues.
Next, these technologies were ideally suited for the microservices architecture required for our use cases. Our microservices had to be largely stateless, as well as fully asynchronous and non-blocking (aka Reactive), to achieve the level of scalability required for our campaigns. To keep the system performant, we needed to move away from the traditional thread-per-request model, which often results in underutilization of precious CPU resources.
Finally, Lightbend is the creator of the Reactive Manifesto and a leader in the Reactive systems that modern businesses need to meet evolving needs. Groupon needs to be able to handle peak loads during holiday seasons, or flash sales, and Reactive technologies provided by Lightbend go a long way in accomplishing that.
Q: Do you have any Play and Akka performance metrics to share?
AA: Actually, we do! Unfortunately I don’t have metrics from the previous legacy stack, which was before my time at Groupon, but there are some throughput and latency statistics that I think are interesting.
While the email/push notification infrastructure also uses messaging systems like RabbitMQ and Kafka for distributed/fire-and-forget communication, and NoSQL solutions based on tools like Cassandra, Redis, etc., it is Play Framework that allows our services to support very high throughput and stringent SLA requirements.
For example, the graph below shows how the Lightbend stack handles 10X peaks in traffic, which we see every day during our timed campaigns.
Our delivery services ecosystem written in Play Framework can rapidly scale to handle sudden peaks in traffic arriving at more than 10,000 requests/sec, spread across multiple VMs deployed on very few physical servers. By using a powerful yet simple asynchronous HTTP programming model, our developers can write code to orchestrate highly complex workflows required by the aggregator services in the email delivery pipeline, which is where Akka comes into the picture.
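To make the asynchronous programming model a little more concrete, here is a minimal sketch of a non-blocking aggregator. It uses only the JDK's CompletableFuture rather than Play's own API, and the EmailAggregator class, the fetch stub, and the service names are assumptions made for illustration, not Groupon's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical aggregator: class, method, and service names are illustrative only.
class EmailAggregator {

    // Stand-in for a non-blocking call to a remote content service.
    static CompletableFuture<String> fetch(String service, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> service + "-content", pool);
    }

    // Fan out to every service, then combine the results once all have completed.
    // The join() calls inside thenApply run only after allOf has completed,
    // so they return immediately instead of parking a thread.
    static CompletableFuture<String> assemble(List<String> services, ExecutorService pool) {
        List<CompletableFuture<String>> calls = services.stream()
                .map(service -> fetch(service, pool))
                .collect(Collectors.toList());
        return CompletableFuture.allOf(calls.toArray(new CompletableFuture[0]))
                .thenApply(done -> calls.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.joining(" | ")));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // The demo blocks once at the very end, only to print the result.
            String email = assemble(Arrays.asList("deals", "promos", "banners"), pool).join();
            System.out.println(email);
        } finally {
            pool.shutdown();
        }
    }
}
```

Three downstream calls are served by a pool of two threads here. That is the general idea behind moving away from one thread per request, although a real service would use non-blocking HTTP clients instead of the stub.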
Marketing campaigns are targeted towards specific users (audiences) which can run into millions. These audiences (meeting certain criteria) need to be pulled from our Hadoop stack exposed via microservices and uploaded into our Cassandra DB for fast runtime access.
We use Akka Actors to produce and consume messages from queueing systems, create schedules for updating local caches, assemble/aggregate data for building email content, as well as publish over SMTP. It is the Akka Actor model that enables Groupon services to transfer hundreds of gigabytes of data each day from the source to its destination.
Akka Actors are extremely lightweight entities, so we can create a dedicated actor for every batch of data records to be uploaded. By using Actors instead of native threads–as was the case earlier–we saw faster data transfer times and greater resource efficiency:
| Metric | Size (number of records) | Native Threading Model | Akka Actor Model |
|---|---|---|---|
| Upload Time | 10 million | ~25 min | ~10 min |
| CPU Utilization | 10 million | ~50% | ~40% |
Our observation was that Akka actors are not only faster, but can also handle the same workload with less CPU. This means that for the same cost, Akka can serve a larger audience and use threads more efficiently than if those threads were used directly.
Furthermore, the native threaded model required blocking queues and futures, with client code having to poll completion queues to find out the status of a batch of data, which made the client code more complex. Actors took most of this complexity away through simple message passing between the caller and the actor. Synchronized blocks of code went away with the Actor model, which simplified the code and, from an operational perspective, made the overall process of creating marketing campaigns more efficient.
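The shift from polling and synchronized blocks to message passing can be sketched in plain Java. The code below is not the Akka API: MiniActor and BatchUploadDemo are hypothetical names, and the "upload" only counts records. It is meant to show, under those assumptions, how one lightweight actor per batch can report back to a collector through messages while a small shared thread pool does the work.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Minimal actor-style sketch in plain Java. This is NOT the Akka API.
class MiniActor<M> {
    private final Queue<M> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final Executor pool;
    private final Consumer<M> behavior;

    MiniActor(Executor pool, Consumer<M> behavior) {
        this.pool = pool;
        this.behavior = behavior;
    }

    // Fire-and-forget send: enqueue the message and make sure a drain is scheduled.
    void tell(M message) {
        mailbox.add(message);
        trySchedule();
    }

    private void trySchedule() {
        // At most one drain runs at a time, so the behavior never runs concurrently with itself.
        if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
            pool.execute(this::drain);
        }
    }

    private void drain() {
        try {
            M message;
            while ((message = mailbox.poll()) != null) {
                behavior.accept(message);
            }
        } finally {
            scheduled.set(false);
            trySchedule(); // pick up anything that arrived while this drain was finishing
        }
    }
}

class BatchUploadDemo {
    // Sends each batch to its own actor and returns the total number of records reported.
    static int uploadAll(List<List<String>> batches, Executor pool) {
        if (batches.isEmpty()) {
            return 0;
        }
        CompletableFuture<Integer> done = new CompletableFuture<>();
        int[] state = {0, 0}; // {records uploaded, batches finished}; touched only by the collector
        MiniActor<Integer> collector = new MiniActor<>(pool, count -> {
            state[0] += count;
            if (++state[1] == batches.size()) {
                done.complete(state[0]);
            }
        });
        for (List<String> batch : batches) {
            // One dedicated actor per batch; the "upload" is a stand-in that only counts records.
            MiniActor<List<String>> uploader =
                    new MiniActor<>(pool, records -> collector.tell(records.size()));
            uploader.tell(batch);
        }
        return done.join();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<List<String>> batches = Arrays.asList(
                    Arrays.asList("r1", "r2", "r3"),
                    Arrays.asList("r4", "r5"),
                    Arrays.asList("r6"));
            System.out.println("records uploaded: " + uploadAll(batches, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

The collector's counters are plain fields with no locks, because its messages are handled one at a time. Akka adds supervision, routing, and much more on top of this basic idea.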
Q: What would you tell a system architect or VP of engineering about Lightbend technologies?
AA: I can try to break the benefits into categories. There are two areas of focus that bear mentioning: the human benefits and the technical benefits. Both are important, naturally. Let’s start with the human side of things...
Productivity - Developer productivity has gone up at Groupon and time to market for new features to reach out to Groupon customers/merchants has reduced. Building a new endpoint is a much easier task than was the case with yesteryear frameworks. Writing services using the aforementioned Lightbend technologies has improved developer productivity immensely. It has also allowed the engineering teams to push code live on a daily basis if needed with independently deployable units.
Stability (Uptime) - Nobody likes system downtime; this occurred during a Black Friday event some years ago, but since implementing Play and Akka, we haven’t had any outages or failures in several years. It’s been a powerful psychological benefit to our teams to be using Reactive technologies like Akka and Play with support for Reactive Streams to allow sophisticated flow control using concepts like Back Pressure. This makes sure that our team doesn’t run into producer/consumer bottlenecks, and our systems are automatically protected from failures during surges in traffic.
Correctness - Writing correct multithreaded code for large systems has always been a challenge, especially when it comes to testing its correctness. With async paradigms supported by Play and the hassle-free concurrency provided by Akka, our developers didn’t have to spend sleepless nights debugging notorious race conditions or deadlocks. They were able to focus on the business logic and let Akka worry about handling threads safely based on concurrency, parallelism, and self-healing.
All of these elements are important for maintaining a positive attitude towards the work we do: no more brittle monolith falling apart at the seams. It makes people more excited to come to work knowing that they are using cutting edge technologies that self-heal and scale to meet whatever demand we throw at it.
Now we can talk about some of the technical benefits we saw from adopting Lightbend technologies.
Scalability - Play Framework and Akka allowed us to easily scale our microservices horizontally and seamlessly handle failures while continuing to operate at the scale demanded by the business. From meeting the surges in traffic for daily campaigns to the really intense events like Black Friday and the Super Bowl, Lightbend technologies made sure that scaling was never a stressful experience.
Efficiency - Our resource consumption is much lower with Play and Akka, since threads are no longer blocked while serving requests, especially for I/O intensive tasks like making calls to remote services, databases, NoSQL stores, etc. We were able to reduce compute resource consumption by around 25% with Akka, which allows the email delivery system to send millions of emails and push notifications on a daily basis with optimum utilization of hardware. Lightbend provides all the right tools to build high performance microservices that communicate asynchronously and consume far fewer system resources compared to traditional thread-oriented models.
Integration - We were able to integrate Play and Akka with our existing CI/CD infrastructure with ease. This also helped us manage conflicting dependencies better with a simpler path to upgrading libraries than before, and led to much more predictable deployment cycles.
Bonus Q: What’s next for Groupon’s platform?
Groupon is currently working on a business critical use case where personalized promotion eligibility of a customer is shown on a deal. This will be shown as a Legal Disclosure, so that a potential customer knows whether any promotion can be used on the deal or not.
Built with Akka and Play, this will be part of the highest throughput microservice operation in the Groupon Purchase funnel (500,000 requests per minute daily peak in North America and 1.2-1.5 million rpm during special events and holidays). Latency SLAs are very strict for this service—the 99th percentile latency is 20 ms, which means the system has to be able to compute a customer’s eligibility for a deal across various promotional campaigns in just 20 ms.
We are employing most of the Reactive principles to achieve this: a high degree of parallelism through Actors, fully asynchronous and non-blocking I/O calls to other microservices and NoSQL DBs, and fair load distribution to achieve elasticity and resilience.
We are also using the interesting Akka Routing Strategy “ScatterGatherFirstCompleted” to send a response (if successful) without having to wait for all the myriad campaign evaluations to complete.
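The idea behind that strategy can be sketched with the JDK alone. The code below is not Akka's router: FirstCompleted and firstSuccessful are hypothetical names, and skipping failed evaluations while waiting for a successful one is an assumption made here to match the "if successful" wording. Akka's own router forwards the first reply it receives and fails if none arrives within a configured timeout, which this sketch leaves out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK sketch of the scatter-gather-first-completed idea (not Akka's router).
class FirstCompleted {

    // Completes with the first successful result; fails only if every candidate fails.
    static <T> CompletableFuture<T> firstSuccessful(List<CompletableFuture<T>> candidates) {
        CompletableFuture<T> winner = new CompletableFuture<>();
        if (candidates.isEmpty()) {
            winner.completeExceptionally(new IllegalArgumentException("no candidates"));
            return winner;
        }
        AtomicInteger failuresLeft = new AtomicInteger(candidates.size());
        for (CompletableFuture<T> candidate : candidates) {
            candidate.whenComplete((value, error) -> {
                if (error == null) {
                    winner.complete(value); // no-op if another candidate already won
                } else if (failuresLeft.decrementAndGet() == 0) {
                    winner.completeExceptionally(error); // every candidate failed
                }
            });
        }
        return winner;
    }

    public static void main(String[] args) {
        // Three hypothetical campaign evaluations, completed by hand to keep the demo deterministic.
        CompletableFuture<String> slow = new CompletableFuture<>();
        CompletableFuture<String> failed = new CompletableFuture<>();
        CompletableFuture<String> fast = new CompletableFuture<>();
        CompletableFuture<String> result = firstSuccessful(Arrays.asList(slow, failed, fast));

        failed.completeExceptionally(new IllegalStateException("campaign service unavailable"));
        fast.complete("eligible: campaign-B");

        // 'slow' never completes, yet the answer is already available.
        System.out.println(result.join());
    }
}
```

With a 20 ms budget, answering from the first successful evaluation means the slowest campaign lookup no longer sets the response time.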