PayPal is a ubiquitous presence in the world of online commerce, providing payment services for millions of merchants and individuals across the web. To anticipate the needs of its wide variety of customers, the company must continuously monitor the digital economy as it evolves.
A key element in this analytics effort is PayPal’s big data platform, which is continuously updated with new data obtained from millions of webpages by a web crawler. This crawler needs to be able to operate efficiently at the massive scale of the modern web.
Crawling webpages can be a demanding task. Each page the crawler visits typically links to multiple additional URLs, which the crawler may then also need to visit. As the crawl depth increases, the number of URLs that need to be processed rises exponentially, and the only way to deliver results in a timely manner is to build a crawler that can process large numbers of pages concurrently.
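To make the growth concrete, here is a minimal sketch in Python. The branching factor of 10 links per page is a hypothetical figure chosen for illustration, not a number from PayPal, and the model assumes that every link leads to a page not seen before.

```python
def urls_at_depth(branching_factor: int, depth: int) -> int:
    """URLs found exactly `depth` hops from a single seed page.

    Assumes every page links to `branching_factor` previously unseen pages.
    """
    return branching_factor ** depth


def total_urls(branching_factor: int, max_depth: int) -> int:
    """All URLs visited from the seed (depth 0) down to `max_depth`."""
    return sum(urls_at_depth(branching_factor, d) for d in range(max_depth + 1))


# Hypothetical branching factor of 10 links per page.
for depth in range(5):
    print(depth, urls_at_depth(10, depth), total_urls(10, depth))
```

Under these assumptions, each extra level of depth multiplies the new work by the branching factor, which is why a crawler that processes pages one at a time quickly falls behind.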
This was a problem for one of the early versions of PayPal’s crawler, which was built using the Java Spring framework. In order to scale, the crawler had to use a large number of threads, which was inefficient.
Moreover, each crawling job involves a complex sequence of actions: loading URLs from the database, validating them, checking a cache to find out when each URL was last crawled, downloading new data, processing it, and saving any new URLs back into the database.
In the original crawler, the Java code specifying this sequence was complex and difficult to understand, which made development cycles slow and changes risky.
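The sequence of steps in a crawling job can be sketched as follows. This is an illustrative Python outline only, not PayPal's code: the in-memory `database` set, the `cache` dictionary, the injected `fetch` function and the one-hour `RECRAWL_AFTER_SECONDS` window are all hypothetical stand-ins.

```python
import re
from urllib.parse import urlparse

RECRAWL_AFTER_SECONDS = 3600  # hypothetical freshness window


def is_valid(url):
    """Accept only absolute http(s) URLs."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_fresh(url, cache, now):
    """True if the URL was crawled recently enough to skip."""
    last_crawled = cache.get(url)
    return last_crawled is not None and now - last_crawled < RECRAWL_AFTER_SECONDS


def extract_links(html):
    """Very rough link extraction, enough for the sketch."""
    return re.findall(r'href="(https?://[^"]+)"', html)


def run_crawl_job(database, cache, fetch, now):
    pending = list(database)                                   # 1. load URLs
    valid = [u for u in pending if is_valid(u)]                # 2. validate
    due = [u for u in valid if not is_fresh(u, cache, now)]    # 3. check cache
    discovered = set()
    for url in due:
        html = fetch(url)                                      # 4. download
        discovered.update(extract_links(html))                 # 5. process
        cache[url] = now
    new_urls = discovered - database
    database.update(new_urls)                                  # 6. save new URLs
    return new_urls
```

Even in this simplified form, the job touches a database, a cache and the network, which gives a sense of how tangled the logic can become once concurrency and error handling are layered on top.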
PayPal decided to rewrite the crawler in Scala, using Akka Platform from Lightbend to design a much more responsive, resilient and elastic architecture for orchestrating concurrent crawling jobs.
The new architecture uses Apache Kafka as a buffer and load distributor for the crawling jobs, streaming URLs into the processing pipeline, which is built using Akka Platform and runs on a cluster of servers. The Akka solution provides back-pressure: it pulls URLs from Kafka only when a worker is available to process them, which avoids any risk of overwhelming the processing pipeline.
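The idea of back-pressure can be illustrated with a small bounded queue. The sketch below uses Python's asyncio rather than Akka or Kafka, so it shows the principle and not the actual mechanism; the function names, buffer size and simulated download delay are assumptions made for the example.

```python
import asyncio


async def producer(queue, urls, num_workers):
    """Stands in for the URL source. `put` suspends while the queue is full,
    so URLs are handed over only as fast as workers consume them."""
    for url in urls:
        await queue.put(url)
    for _ in range(num_workers):
        await queue.put(None)  # one stop signal per worker


async def worker(queue, results):
    """Takes the next URL only when it has finished the previous one."""
    while True:
        url = await queue.get()
        if url is None:
            return
        await asyncio.sleep(0.01)  # simulated download and processing
        results.append(url)


async def crawl(urls, num_workers=2, buffer_size=2):
    queue = asyncio.Queue(maxsize=buffer_size)  # bounded buffer
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(num_workers)]
    await producer(queue, urls, num_workers)
    await asyncio.gather(*workers)
    return results


processed = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(10)]))
print(len(processed))
```

Because the buffer is bounded, a burst of incoming URLs waits upstream instead of piling up in memory inside the pipeline.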
Akka Platform also provides an asynchronous, non-blocking HTTP client that enables a high degree of parallelism while keeping resource requirements to a minimum. The whole system is architected to ensure that each HTTP request is handled independently, so there is no risk of one slow response or large webpage delaying the processing of smaller, faster jobs in the same batch.
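A short sketch shows why independent, non-blocking requests keep fast jobs from waiting on slow ones. It uses Python's asyncio with simulated latencies instead of real HTTP calls or the Akka client; the host names and delays are made up for the example.

```python
import asyncio


async def fetch(url, latency_seconds):
    """Simulated request: the wait yields control instead of blocking a thread."""
    await asyncio.sleep(latency_seconds)
    return url


async def crawl_batch(latencies):
    """Start every request at once and collect results as each one finishes."""
    tasks = [asyncio.create_task(fetch(url, latency))
             for url, latency in latencies.items()]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


# Hypothetical batch: the slow request is submitted first.
order = asyncio.run(crawl_batch({
    "slow.example": 0.2,
    "fast.example": 0.01,
    "medium.example": 0.05,
}))
print(order)
```

Although the slow request is submitted first, the faster responses are collected as soon as they arrive, and all three requests share a single thread.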
Akara Sucharitakul, Principal Member of Technical Staff at PayPal, comments: “Akka Platform helps our systems stay responsive even at 90% CPU utilization, which is very uncharacteristic for our older architectures and provides for transaction densities never seen before.”
Crucially for development teams, the code of the new crawler is also much easier to read and maintain, thanks to the use of a domain-specific language that allows PayPal to express its stream architecture in terms of a simple graph. Just by glancing at the code, engineers can see how data flows through the crawler, and can make changes quickly and easily.
PayPal’s technical team has been delighted by the improvements to the crawler that Scala and Akka Platform have enabled. Compared to the old Java Spring implementation, performance has increased tenfold, and batch processing runs much more efficiently, using far fewer threads and achieving a CPU utilization rate of around 90%. Meanwhile, the expressiveness of Scala and the powerful abstractions provided by Akka Platform have enabled an 80% reduction in the size of the codebase that the team needs to maintain.
“Batches or micro-batches do their jobs in one-tenth of the time it took before,” says Akara Sucharitakul. “With wider adoption, we will see this kind of technology being able to reduce cost and support organizational growth without requiring corresponding growth in our compute infrastructure.”
In fact, PayPal was so keen to drive adoption of Akka Platform that its technical team developed and open-sourced a project named squbs. By adding standardized monitoring, logging and security features, squbs makes it even easier for PayPal and other large organizations to deploy Akka Platform across complex development, testing and production environments.
Akara Sucharitakul concludes: “Powered by Akka Platform and Scala, squbs has already provided very high-scale results with a low infrastructure footprint: our applications are able to serve over a billion hits a day with as little as eight VMs and two virtual CPUs each.”
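To put those figures in perspective, the quoted numbers can be turned into average request rates. This back-of-the-envelope calculation assumes exactly one billion hits spread evenly across the day; real traffic would peak above these averages.

```python
HITS_PER_DAY = 1_000_000_000      # "over a billion", so this is a lower bound
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
VMS = 8
VCPUS_PER_VM = 2

avg_hits_per_second = HITS_PER_DAY / SECONDS_PER_DAY   # about 11,574
per_vm = avg_hits_per_second / VMS                     # about 1,447
per_vcpu = per_vm / VCPUS_PER_VM                       # about 723

print(round(avg_hits_per_second), round(per_vm), round(per_vcpu))
```

On average, that works out to more than 700 requests per second for each virtual CPU.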
PayPal allows any business or individual with an email address to securely, conveniently and cost-effectively send and receive payments online. PayPal’s network builds on the existing financial infrastructure of bank accounts and credit cards to create a global, real-time payment solution that is ideally suited for small businesses, online merchants, individuals and others currently under-served by traditional payment mechanisms.