How Credit Karma uses Akka to manage big data at scale
January 12, 2017
[Note: this article by Matt Asay originally appeared on TechRepublic.com]
Fintech is one of the fastest-growing investment categories for venture capital. Since 2010, more than 1,200 VC firms have wired more than $22 billion into the bank accounts of over 1,800 startups, according to Pitchbook. J.P. Morgan CEO Jamie Dimon, who recently removed himself from consideration as Treasury Secretary in the Trump administration, warned his peers that "Silicon Valley is coming."
One of the brightest stars in fintech is San Francisco-based Credit Karma, which was founded in 2007 and has raised more than $368 million in venture funding since then. Today the company, which gives out free credit scores and makes money by matching its users with personalized product recommendations, has more than 60 million users.
Critical to its success is figuring out how to deliver a more personalized user experience. I recently spoke to Dustin Lyons, engineering manager at Credit Karma, to learn more about how the consumer platform handles big data at this scale and volume.
A problem of serious scale
TechRepublic: Tell me a little more about the scale of data managed by Credit Karma.
Lyons: Our data infrastructure regularly handles more than 750,000 events per minute, processing 600-700MB a minute. Our data pipeline was originally built using PHP, but we saw that it wasn't going to scale to meet the demands of our growth. We were using Gearman to parallelize importing data across machines, but as the data ingest grew, we began to investigate other languages and tools with better support for managing concurrent workloads.
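To put those figures in per-second terms, here is a quick back-of-the-envelope calculation. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and treats "more than 750,000" as exactly 750,000, so the results are rough estimates rather than numbers reported by Credit Karma.

```python
# Figures quoted in the interview.
EVENTS_PER_MINUTE = 750_000      # "more than 750,000 events per minute"
MB_PER_MINUTE = (600, 700)       # "600-700MB a minute"
BYTES_PER_MB = 1_000_000         # assumption: decimal megabytes


def per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


def avg_event_bytes(mb_per_minute, events_per_minute=EVENTS_PER_MINUTE):
    """Average payload size per event, in bytes, at the given data rate."""
    return mb_per_minute * BYTES_PER_MB / events_per_minute


events_per_second = per_second(EVENTS_PER_MINUTE)
low, high = (avg_event_bytes(mb) for mb in MB_PER_MINUTE)

print(f"{events_per_second:,.0f} events per second")
print(f"{low:.0f}-{high:.0f} bytes per event on average")
```

Under those assumptions the pipeline sees roughly 12,500 events per second, each a little under a kilobyte on average.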
A parallelized approach
TechRepublic: Sounds like a good problem to have. What did you do?
Lyons: We needed a new architecture, not tooling around the edges: We were on a rocket ship of growth. Some of our engineers had been testing Scala and Akka, and we knew we wanted to move away from dynamic types and the lack of first-class concurrency in PHP.
So, we started with Scala and Akka from Lightbend, using Apache Kafka, and began to build out an Akka Actor System that could help us parallelize work across threads and boxes. We found success in this approach: Scala gives us type safety and the blazing performance of the JVM, and Akka abstracts away the complexity of multi-threaded, distributed systems. This was much better suited for the scale of data we were seeing.
Even so, our initial versions in production hit a roadblock: They constantly exceeded heap space and consumed massive amounts of memory. We realized we were missing something fundamental in trying to process literally tens of millions of events.
Under (back) pressure
TechRepublic: How'd you get around this?
Lyons: We basically stumbled into a very important concept that we had been ignoring: back pressure. That concept, in turn, introduced us to the ideas of the Reactive Manifesto. Back pressure, which is well defined in the Manifesto, is a feedback mechanism that lets a system control the rate at which downstream actors receive messages. This prevents slower components in your system from getting overwhelmed by faster components. We had developed actors pushing data around without evaluating the throughput of work between them.
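The effect Lyons describes can be seen in a toy simulation. The sketch below is not Credit Karma's code and does not use Akka; the `simulate` function, its parameters, and the rates are all hypothetical, chosen only to show how a buffer between a fast producer and a slow consumer behaves with and without a limit.

```python
from collections import deque


def simulate(total_events, produce_per_tick, consume_per_tick, capacity=None):
    """Return the peak number of events buffered between producer and consumer.

    capacity=None models an unbounded mailbox (no back pressure).
    An integer capacity models a producer that only emits into free space.
    Assumes consume_per_tick >= 1 and, when given, capacity >= 1.
    """
    buffer = deque()
    produced = 0
    consumed = 0
    peak = 0
    while consumed < total_events:
        # Producer step: emit as much as allowed this tick.
        want = min(produce_per_tick, total_events - produced)
        if capacity is not None:
            # Back pressure: never emit more than the buffer can hold.
            want = min(want, capacity - len(buffer))
        for _ in range(want):
            buffer.append(produced)
            produced += 1
        peak = max(peak, len(buffer))

        # Consumer step: the slower side drains what it can.
        for _ in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
            consumed += 1
    return peak


unbounded_peak = simulate(10_000, produce_per_tick=100, consume_per_tick=10)
bounded_peak = simulate(10_000, produce_per_tick=100, consume_per_tick=10, capacity=50)
print(unbounded_peak, bounded_peak)
```

With these made-up rates, the unbounded run piles up thousands of events in memory before the consumer catches up, while the bounded run never holds more than 50. Both runs process every event; only the peak memory footprint differs.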
Thankfully, our friends at Lightbend provide a tool to solve this: Akka Streams. It provides a streaming API that introduces the concept of demand between publishers and subscribers. Instead of actors pushing messages through the system, Akka Streams moves to a model where actors ask for and pull work down the stream. In this way, slower actors can request work as resources become free. Without that demand signal, we had been seeing heap space errors in production.
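The demand model described above can be sketched in a few lines. This is a simplified, synchronous illustration and not the Akka Streams API: the `Publisher` and `Subscriber` classes, the `request` method, and the demand of 16 are hypothetical names and values, loosely modeled on the request-n idea from Reactive Streams.

```python
import itertools


class Publisher:
    """Hypothetical source that hands out events only when asked."""

    def __init__(self, events):
        self._events = iter(events)   # consumed lazily, never loaded up front
        self.max_batch_sent = 0       # largest batch ever handed downstream

    def request(self, n):
        """Return at most n events; an empty list once the source is drained."""
        batch = list(itertools.islice(self._events, n))
        self.max_batch_sent = max(self.max_batch_sent, len(batch))
        return batch


class Subscriber:
    """Hypothetical slow consumer that pulls work when it has capacity."""

    def __init__(self, publisher, demand):
        self._publisher = publisher
        self._demand = demand         # how many events it can take at once
        self.processed = []

    def run(self):
        while True:
            # Signal demand only after the previous batch is finished.
            batch = self._publisher.request(self._demand)
            if not batch:
                break
            for event in batch:
                self.processed.append(event * 2)  # stand-in for real work


publisher = Publisher(range(1_000))
subscriber = Subscriber(publisher, demand=16)
subscriber.run()
print(len(subscriber.processed), publisher.max_batch_sent)
```

Because the subscriber sets the pace, the number of events in flight is capped by its stated demand no matter how large the source is. A real streaming library does this asynchronously and across stages, which this sketch does not attempt.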
After rewriting the core logic to use Akka Streams, these errors disappeared and memory usage became much more predictable. By switching to this pull-based system instead of front-loading a lot of work to queue up in memory, we were able to cut heap usage by 300%. Now we use Scala, Akka, and Akka Streams across many of our critical fast data pipelining scenarios.