Why your company's big data is not nearly fast enough
September 09, 2015
Published originally on TechRepublic
Back in the olden days of 2014, "big" used to be enough for big data. Just a year later, however, data also has to be fast. Really, really fast.
"Big data," of course, has never truly been a matter of data volumes. At least, not exclusively. It has also fixated on data velocity (and variety), as Gartner's 3 V's popularized.
But knowing that data needs real-time processing and actually doing it are two very different things.
To better understand the industry shift toward fast, stream-based data processing, I reached out to Kevin Webber, developer advocate for Typesafe. Last year, his team led the rollout of Walmart Canada's new e-commerce platform built on a Scala/Akka/Play stack. I wanted to get more detail on the whitepaper that Dean Wampler authored.
TechRepublic: What's the big idea behind fast data? Typesafe recently put out an interesting whitepaper on the topic, "Fast Data: Big Data Evolved." How does it relate to the big data everyone talks about?
Webber: Rather than acting on data at rest, modern software increasingly operates on data in near real time. So, big data is less important than fast data.
As our computing systems embrace data in motion, traditional batch architectures are evolving to stream-based architectures. In these systems, live data is captured, processed, and used to modify behavior with response times of seconds or less. There is major business value in sub-second response times to changing information.
Think about 10 years ago when you didn't have a critical piece of market information until the day after it happened—now, think about getting that critical piece of information as it's happening. This is going to create a lot of value for companies.
After all, fast data is critical to fast knowledge, and businesses want knowledge as quickly as possible.
TechRepublic: Typesafe is heavily involved in an industry initiative around so-called Reactive Streams that is all about fast data. What is that?
Webber: Reactive Streams is a specification for toolmakers who are building the next generation of fast data frameworks and libraries for developers. Reactive Streams started as an initiative in late 2013 among engineers at Netflix, Pivotal, and Typesafe.
The goal of Reactive Streams is to bring the level of abstraction up a notch for data streaming use cases. Instead of developers worrying about the low-level plumbing of handling streams, the spec defines those problems, and it's up to the library developers to solve them.
That way, toolmakers can focus on the most challenging aspects of stream processing, while developers focus on their business, which is what they know best. Even a few years ago, building a stream processing system was simply too complex for most businesses to tackle. The goal of Reactive Streams is to change that and to become the de facto standard for processing data in motion.
TechRepublic: What's an example of a problem that Reactive Streams addresses?
Webber: A major objective of the spec is to define a model for something we call back pressure. In lay terms, it's a way to ensure that a fast publisher of real-time data in your system doesn't overwhelm a slower subscriber of that data within your system.
If you think of real-time data as water flowing down a river, at some point, heavy rain may cause the water to overflow and flood the surrounding area. Back pressure is a way to completely avoid flooding.
Imagine that a river could simply send a command upstream to slow the flow of water. That's back pressure. It's flow control: a mechanism that provides resilience by ensuring that every participant in a stream-based system takes part in regulating the flow. The result is a steady state of operation, and graceful degradation if the stream flows faster than the slowest component can cope with.
Reactive Streams takes this up a notch and allows developers to connect different rivers together, so to speak, built with different compliant libraries. Back pressure works across all those libraries. It literally allows a developer to design systems that are immune to flooding, which, in the world of streams, results in out-of-memory errors and other types of catastrophic crashes.
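To make the flooding picture concrete, here is a minimal Python sketch of a fast publisher feeding a slow subscriber through a buffer. It is a toy simulation, not the Reactive Streams API; the function name `simulate` and its parameters are assumptions made up for this illustration. With no limit, the buffer grows for as long as the run lasts. With a limit, the publisher simply stops emitting while the buffer is full.

```python
from collections import deque


def simulate(ticks, produce_per_tick, consume_per_tick, max_buffer=None):
    """Return the peak buffer size seen during the run (hypothetical helper).

    max_buffer=None models a publisher that ignores the subscriber.
    A number models a publisher that only emits while the buffer has room.
    """
    buffer = deque()
    peak = 0
    next_item = 0
    for _tick in range(ticks):
        # Publisher side: try to emit up to produce_per_tick items.
        for _attempt in range(produce_per_tick):
            if max_buffer is not None and len(buffer) >= max_buffer:
                break  # back pressure: slow down instead of flooding
            buffer.append(next_item)
            next_item += 1
        peak = max(peak, len(buffer))
        # Subscriber side: drain up to consume_per_tick items.
        for _taken in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
    return peak


# A publisher ten times faster than its subscriber, with and without a limit.
unbounded_peak = simulate(ticks=1000, produce_per_tick=10, consume_per_tick=1)
bounded_peak = simulate(ticks=1000, produce_per_tick=10, consume_per_tick=1,
                        max_buffer=16)
print(unbounded_peak, bounded_peak)
```

In this run the unbounded buffer peaks at 9,001 items and would keep climbing with more ticks, which is the path to an out-of-memory error. The bounded run never holds more than 16.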
What Reactive Streams provides is a bi-directional flow of data—elements emitted downstream from publisher to subscriber and a signal for demand emitted upstream from subscriber to publisher.
If the subscriber is placed in charge of signaling demand, the publisher is free to safely push up to the number of elements demanded. This also prevents wasting resources. Because demand is signaled asynchronously, a subscriber can send many requests for more work before actually receiving any.
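The demand signal described here can be sketched in a few lines of Python. This is a simplified, synchronous model and not the real specification, which is defined as JVM interfaces. The `request`, `on_next`, and `on_complete` names loosely mirror the spec's method names, while the `drain` method and the `CollectingSubscriber` class are hypothetical stand-ins for what a library would do asynchronously.

```python
class Subscription:
    """Toy link between one publisher and one subscriber."""

    def __init__(self, items, subscriber):
        self._items = iter(items)
        self._subscriber = subscriber
        self._demand = 0      # elements requested but not yet delivered
        self._done = False

    def request(self, n):
        # Upstream signal: the subscriber asks for n more elements.
        if n <= 0:
            raise ValueError("demand must be positive")
        self._demand += n

    def drain(self):
        # Downstream flow: deliver at most as many elements as were demanded.
        # (Hypothetical: a real library would schedule this asynchronously.)
        while self._demand > 0 and not self._done:
            try:
                item = next(self._items)
            except StopIteration:
                self._done = True
                self._subscriber.on_complete()
                return
            self._demand -= 1
            self._subscriber.on_next(item)


class CollectingSubscriber:
    """Hypothetical subscriber that just records what it receives."""

    def __init__(self):
        self.received = []
        self.completed = False

    def on_next(self, item):
        self.received.append(item)

    def on_complete(self):
        self.completed = True


subscriber = CollectingSubscriber()
subscription = Subscription(range(100), subscriber)

subscription.drain()                  # no demand yet, so nothing is delivered
after_no_demand = list(subscriber.received)

subscription.request(2)               # several demand signals can be sent...
subscription.request(3)               # ...before any element has arrived
subscription.drain()
after_five = list(subscriber.received)
print(after_no_demand, after_five)
```

The publisher has 100 elements available, but the subscriber only ever sees the five it asked for. The two `request` calls add up before any delivery happens, which is the point about signaling demand ahead of receiving work.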
This is dynamic and changes in real time. When the subscriber is slower, it works like a pull-based system. When the subscriber is faster, it works like a push-based system.
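One way to see that switch is to count, tick by tick, which side is waiting on the other. The Python sketch below is an illustrative model under stated assumptions: `classify_ticks` and its `window` parameter (a cap on outstanding demand) are invented for this example and are not part of the spec.

```python
def classify_ticks(ticks, produce_per_tick, consume_per_tick, window=8):
    """Return (push_ticks, pull_ticks) for a simulated stream (hypothetical)."""
    demand = 0
    push_ticks = 0
    pull_ticks = 0
    for _tick in range(ticks):
        # The subscriber tops up demand as it frees capacity, up to the window.
        demand = min(window, demand + consume_per_tick)
        # The publisher emits what it can produce, never more than demanded.
        emitted = min(produce_per_tick, demand)
        demand -= emitted
        if demand > 0:
            # Demand is still outstanding: elements leave as soon as they
            # are produced, so the stream behaves like push.
            push_ticks += 1
        else:
            # Demand is exhausted: the publisher must wait for the next
            # request, so the stream behaves like pull.
            pull_ticks += 1
    return push_ticks, pull_ticks


slow_subscriber = classify_ticks(100, produce_per_tick=10, consume_per_tick=1)
fast_subscriber = classify_ticks(100, produce_per_tick=1, consume_per_tick=10)
print(slow_subscriber, fast_subscriber)
```

With the slow subscriber every tick is pull-like, and with the fast subscriber every tick is push-like. Nothing in the code switches modes explicitly; the behavior falls out of who runs out first, data or demand.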