Executive Highlight - The Results (TL;DR)
- Improved system performance by over 800%
- Reduced infrastructure footprint by 75%
- Protected from the risk of expensive downtime
- Able to scale elastically when needed
Flipkart first opened for business selling books in 2007, and has grown rapidly since then to now employing over 33,000 people serving a marketplace of over 100 million consumers. Now with a market valuation of over $11 billion and over $2 billion in revenues, Flipkart has become a significant market power in India’s massive e-commerce industry–offering everything from books and electronics to furniture and washing machines.
One of the untold stories behind the amazing growth of Flipkart is how it continues to scale its infrastructure to serve millions of customers every day. Highly successful “Big Billion Days” campaigns in fact up the game while also placing additional demands on the infrastructure. As an example, in October 2015, the “Big Billion Days” campaign drove tens of millions of users to Flipkart who bought $300 million in merchandise in just five days. This broke all previous records and helped Flipkart close 2015 with nearly $1.5 billion in revenue.
These campaigns, and others like them, are a reflection of India’s dynamic marketplace, where companies need to be able to target different segments easily and quickly, testing and running different campaigns continually to drive high volumes of website/app traffic.
At Flipkart, the customer engagement team responsible for these efforts modernized its legacy platform by going Reactive, and using Akka Streams, Akka HTTP and Apache Kafka to build, deploy and manage a new generation of resilient, elastic microservices, to meet Flipkart’s needs.
The Challenge Of Poor Scalability And Unpredictable Performance
While 2015 was record-breaking for Flipkart, in the past their platform was faced by scalability and performance challenges as their customer base continued to grow rapidly. To keep up the pace, a special team led by software engineers Durga Prasana Sahoo and Kinshuk Bairagi was tasked with learning about the next generation of Reactive solutions to rebuild their mission critical customer engagement platforms, which interact with end-users via push messages through various channels (mobile push-notifications, email, sms, etc).
For us, request messages are first-class entities and it’s critical to know how they accrue to overall deliverability figures. Akka Streams gives us the insights we need to categorize message processing failures and counteract them at the very core of the platform, rather than looking to throw this down the DevOps pipeline for debugging later.Durga Prasana Sahoo Software Engineer in the Customer Engagement Team at Flipkart
Motivated by the limitations in their existing solution, Flipkart’s Customer Engagement Team set out to re-write their ecosystem of legacy microservices with a new platform called Connekt. The problems facing the existing legacy platform stack were varied, but a lack of cohesion, insulation and long-term maintenance requirements were major issues, among others:
- Each platform was built with contrastingly different technologies
- Each platform suffered from throughput limitations and regular bottlenecks
- There was no common methodology in place for failure handling or understanding deliverability metrics of messages sent
- Building a reliable ETL pipe was either not feasible for any of the platforms, or brought too much additional maintenance overhead
- None of the platforms ensured that clients’ individual SLAs are insulated from each other
- Lack of flexible authentication and authorization control over the APIs
One of the biggest challenges for Flipkart lay in the Java thread model itself, which often suffers from points of contention (i.e. hot spots) that leads overall processing throughput to tumble drastically. With uneven rates of processing for various stages, this leads to unpredictable over/under-utilization of resources across different components. In turn, this was leading to observable production bottlenecks that provided a laggy and inconsistent user experience.
The Connekt platform team began searching for the right technologies that would enable them to better scale for throughput without facing the bottlenecks they were experiencing with the synchronous and blocking Java thread model. Some of the specific goals of the Connekt team were to:
- Never again face unexpected downtime. Even with capacity planning to handle 20x peak, a flood of consumer traffic in the past still resulted in unexpected downtime, scalability and performance issues. In order to deliver on the company’s vision, this could not happen again.
- Scale elastically in response to any level of demand. Maintaining responsiveness when demand peaks is one challenge, but what happens when that demand decreases and the need for additional resources is not high? By architecting a system that can scale in and out, Flipkart needed to meet demand when traffic peaks and free up system resources and infrastructure when traffic is slower.
- Utilize resources as efficiently as possible. With the traditional Java thread model, it’s not easy to ensure high CPU utilization while keeping the core load as low as possible for other operations. Facing CPU utilization rates of as low as 30%, Flipkart needed to improve the resource utilization of their hardware and infrastructure.
- Reduce overall infrastructure footprint. With less-than-satisfactory resource utilization, Flipkart was forced to running tens of additional servers just to handle normal capacity. Reducing the infrastructure footprint to a lightweight model was a requirement to keep infrastructure expense low.
After discovering Scala and Spray (now Akka HTTP) and Apache Kafka, the team began exploring Akka, and quickly discovered Akka Streams as the answer to their workflow modeling requirements and their strategic goals, resulting in a best-than-ever Big Billion Days sale in 2016.
To The Rescue: Back-Pressured Throughput With Akka Streams
Akka Streams allowed us to design our input message schema to be decoupled from the performance and latency differences between each mobile platform we support. Our platform templates offering a consistent API interface to our clients, so that when operating at this web-scale, a mediation pattern along with Akka Streams helps us insulate our clients from cloud providers' latencies.Kinshuk Bairagi Software Engineer in the Customer Engagement Team at Flipkart
To solve the problem of unpredictable messaging throughput, Akka Streams enables back-pressured communication, so that neither the producer or consumer end of the workflow can overwhelm the other. Working hand in hand with Akka HTTP and Apache Kafka for high volume message processing, the Connekt team was able to remove bottlenecks and points of contention from their new platform, and consolidate functionality into two major microservices:
- Receptor: An Akka HTTP web-service which exposes all client facing APIs for submitting requests for various actions. These include, but are not limited to: receiving push messages requests, handling the feed of pending messages, platform health and diagnostic recording, client SLA management and reporting.
- “Busybees”: This dispatcher is built using Akka Streams Graph DSL and Akka HTTP to fetch the stream of enqueued messages from clients & processes them as batches / discrete messages. Processing includes transformation through various wired stages, which finally produce enriched user messages that are transmitted to respective cloud services.
Powered by Akka’s supervised Actor Model to ensure automated self-healing and extreme elasticity for meeting peaks in demand, Akka Streams gives developers the power of the Actor model, with the addition of workflow modeling and back-pressure for an additional level of resilience at scale. To keep resource utilization as efficient as possible, Akka Streams enables elastic spawning of actor instances in response to varying load, which then elastically scale back when not needed.
With Akka Streams, developers are able to segregate logical processing blocks and ensuring type-safety over stage boundaries, which saves precious time and headaches from runtime debugging. Akka Streams provides a graph DSL which allows the Connekt team to wire together different stages, and develop what could be a snapshot view of the entire working topology of multiple conceptual blocks.
Powered by Akka Streams, Akka HTTP and Apache Kafka, the Connekt team created a new platform that significantly outperforms their legacy solution, and does it with 75% less infrastructure and cloud resources than before.
Results: Orders Of Magnitude Performance Increases On 75% Less Infrastructure
In all critical areas, our new platform with Akka Streams outperforms our legacy solution by orders of magnitude, and does it with far less resources needed than before.Connekt Team at Flipkart
For the Connekt team, one of the key performance metrics that the Connekt platform team focused on was scaling to handle far more requests per second for both outbound HTTP and web server API connections than the previous system. In this table, the Connekt team has graciously shared the most meaningful metrics behind the shift to Reactive:
|Metric||Legacy platform||Connekt with Akka|
|HTTP per instance outbound requests/second||300 - 800||4,500 - 5,000|
|HTTP web server API requests/second||500||10,000|
|CPU utilization rate||30 - 40%||80% +|
|Number of active servers required||40||10|
Improved System Performance By Over 800%
While the legacy solution was capable of processing between 300 - 800 outbound HTTP requests per second, the new Connekt platform with Akka HTTP outperforms that by over 8x, handling between 4500 - 5000 requests per second. And for HTTP web server API performance, the new Connekt platform can handle 10,000 requests per second, compared to 500 request per second for the legacy platform.
Reduced Infrastructure Footprint By 75%
With Akka Streams providing back-pressured modelling of stages, the reduction in points of contention and blocking I/O resulted in improved CPU utilization, going from 30-40% to upwards of 80% and more. This allowed Flipkart to effectively reduce the amount of on-premise and cloud infrastructure needed to do the same work, enabling the Connekt team to cut back from 40 to 10 active servers at any given time.
Protected From The Risk Of Expensive Downtime
In order to prevent past issues with downtime, scalability and performance from repeating, Flipkart made extreme resilience against problems in the future a major priority. The resilient nature of Akka has allowed Flipkart to create a truly isolated, self-healing platform that prevents cascading failures from affecting the entire system topology. When an error occurs, Akka takes recovery measures in an asynchronous fashion so that the affected stages will be handled separately from processing the next message.
Able To Scale Elastically When Needed
Flipkart is currently in the process of migrating to their own cloud platform, which offers a cloud API for instance / cluster management. While this is still evolving, they’ve found that the Connekt platform is already capable of nearly horizontally scaling with additional instances, indicating an out-of-the-box per node scalability improvement of 20x and higher compared to the legacy platform.
By going Reactive with Lightbend technologies, Flipkart’s CE team was able to meet and exceed their strategic goals of eliminating downtime, scale beyond their wildest dreams, and drastically increase performance throughput while reducing the amount of resources needed to serve their customers.
In addition to leveraging Akka, the Connekt team reports that they have been able to learn a lot about the principles of Reactive system design from their previous system decisions, achieving amazing results with plenty of headroom for future extensibility. To learn more, you can follow the Connekt team’s progress and achievements on GitHub: https://github.com/Flipkart/connekt
Inspired by this story? Contact us to learn more about what Lightbend can do for your organization.