Anyone who has worked with Play Framework will tell you that it's fast. Play, which was recently ranked in a RebelLabs report as the 6th most-used web framework among Java developers, does all it can to wring every last drop of performance out of the JVM. The big question is: how, and why, is Play able to be so fast?
The short answer: Play is a stateless, asynchronous, and non-blocking framework that uses an underlying fork-join thread pool to do work stealing for network operations, and can leverage Akka for user-level operations.
While this is accurate, it's not especially helpful to non-experts. So let's break it down and explain why Play is designed the way it is, and how those decisions make Play fast.
Every web framework does essentially the same job of processing an HTTP request. The web framework gets the bytes handed to it by the OS, turns them into the logical HTTP representation (GET, POST, etc.), makes some calls of its own (to a database, to disk, or to a REST API), and returns the result to the browser, wrapped in an HTTP response. To do this, the web framework needs to make use of a thread, which contains the bookkeeping information needed for a CPU core to execute instructions.
At some points in processing the HTTP request, the system has made a call for remote data, and is waiting for the result. Typically, the thread blocks while waiting for the result. During this time, the CPU is idle.
This is a model that most programmers are familiar with, but it's still too low level. Fundamentally, an HTTP request is a demand for work to be done, and the CPU core is the thing that's doing the work.
So imagine that instead of an HTTP request, there's a book that contains pages. Imagine instead of a CPU core, there's a paper shredder, in another room. And imagine that instead of a thread, there's a person whose job is to rip some pages out of the book, carry them into the shredder room, and feed them into the paper shredder one page at a time.
One person ripping pages out and feeding pages into a shredder is doing as much work as possible, but the IO -- the cost of going backwards and forwards between different rooms -- means that the shredder is idle while the person is bringing more pages to it. Even worse, sometimes the door to the book room is locked, and the person has to wait (blocking) until the door opens again.
You can shred more pages if you add more people. Each person takes some pages out of a book and joins an orderly line to the shredder. If you calculate it right (using Little's Law, which says that the average number of people in the system equals the rate at which they arrive multiplied by the average time each one spends there), the shredder always has pages to shred.
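To make the Little's Law calculation concrete, here is a minimal sketch. The class name, method name, and the traffic numbers are made up for illustration; the only real ingredient is the formula L = λW, read here as "threads needed = requests per second × seconds each request spends in the system".

```java
class LittlesLaw {
    // Little's Law: L = lambda * W.
    // requestsPerSecond is lambda; latencyMillis is W expressed in milliseconds.
    // The result is rounded up, since you cannot hire a fraction of a person.
    static long threadsNeeded(long requestsPerSecond, long latencyMillis) {
        long requestMillisPerSecond = requestsPerSecond * latencyMillis;
        return (requestMillisPerSecond + 999) / 1000;
    }

    public static void main(String[] args) {
        System.out.println(threadsNeeded(200, 50));  // 10
        System.out.println(threadsNeeded(200, 500)); // 100
        System.out.println(threadsNeeded(30, 250));  // 8 (7.5 rounded up)
    }
}
```

Notice what happens when the walk between rooms gets ten times slower: the same 200 requests per second go from needing 10 people to needing 100, even though the shredder is doing no more work than before.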
So far so good. But there's a wrinkle: there's never one single HTTP request. Instead, there are multiple HTTP requests. So in our imaginary model, there are multiple books, each in their own room, and a number of people that take pages out of those books and line up behind the shredder.
There's another wrinkle: all the HTTP requests have to be served at roughly the same rate. This means that there has to be some fairness, so that people don't all go to the same room and rip all the pages out of one book, while neglecting all the others.
The Java servlet model was invented in 1998, and ensured fairness through a "thread per request" model–every book had a person assigned to it, and everyone lined up in front of a single shredder. This was a huge advance at the time, as the alternatives (CGI, thread per connection) were horribly inefficient.
Java EE application servers are all built on top of servlets, and so inherit the underlying assumptions of the threading model. In particular, Java EE applications typically make heavy use of ThreadLocal as a way to ensure an implicit context for the request–this is like putting a sticky note on the person handling the book, and it means that even when the person you're looking at isn't in the book room, you can still identify them.
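The sticky note can be sketched in a few lines of plain Java. This is an illustration rather than code from any application server: the class and the CURRENT_USER variable are hypothetical, and only java.lang.ThreadLocal itself is real. The point is that the note travels with the person (the thread), so any other thread that picks up the work finds no note at all.

```java
class StickyNote {
    // Hypothetical per-request context, stored on whichever thread handles the request.
    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    // Thread-per-request style: set the note, do the work, peel the note off.
    static String handle(String user) {
        CURRENT_USER.set(user);
        try {
            return greet(); // no user parameter: the context is implicit
        } finally {
            CURRENT_USER.remove();
        }
    }

    static String greet() {
        return "hello " + CURRENT_USER.get();
    }

    // What a different thread sees if the work hops threads mid-request.
    static String seenByAnotherThread(String user) {
        CURRENT_USER.set(user);
        String[] seen = new String[1];
        Thread other = new Thread(() -> seen[0] = CURRENT_USER.get());
        try {
            other.start();
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            CURRENT_USER.remove();
        }
        return seen[0]; // null: the other thread never had the note
    }

    public static void main(String[] args) {
        System.out.println(handle("alice"));              // hello alice
        System.out.println(seenByAnotherThread("alice")); // null
    }
}
```

This works as long as one thread stays with one request from start to finish, which is exactly the assumption the servlet model makes.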
This worked at the time, when there was a single CPU with a single core. But computers added multiple CPUs and cores, and CPU cores got faster than network and disk access did. Now, not only are there more shredders, but they also chew through paper faster. The people that feed the paper to the shredders must move as fast as possible in order to meet demand. This is where the "thread per request" model has a problem. If there is any blocking in the request–if the door is locked–then the person is out of action and cannot feed paper.
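Here is a small sketch of the locked door, using a fixed pool of two threads from java.util.concurrent. The class and variable names are invented, and a CountDownLatch stands in for a slow database or network call. Two slow requests occupy both threads, so a third request that needs no IO at all cannot even start.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class LockedDoor {
    static boolean fastRequestIsStuck() {
        ExecutorService people = Executors.newFixedThreadPool(2);
        CountDownLatch bothWaiting = new CountDownLatch(2);
        CountDownLatch door = new CountDownLatch(1); // stands in for a slow remote call

        Callable<String> slowRequest = () -> {
            bothWaiting.countDown();
            door.await(); // the thread blocks here, holding its place in the pool
            return "slow";
        };

        try {
            Future<String> a = people.submit(slowRequest);
            Future<String> b = people.submit(slowRequest);
            bothWaiting.await(); // both threads are now inside a slow request

            Future<String> fast = people.submit(() -> "fast");
            boolean stuck = !fast.isDone(); // queued, with nobody free to run it

            door.countDown(); // unlock the door so everything can finish
            a.get();
            b.get();
            fast.get();
            return stuck;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            people.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fastRequestIsStuck()); // true
    }
}
```

The CPU has nothing to do during the wait, yet the fast request still sits in the queue. Adding more threads hides the problem for a while, but each blocked thread is still a person standing at a locked door.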
Java 1.6 and 1.7 made great strides in thread management and non-blocking IO. Play takes full advantage of these new APIs, building directly on a thread pool that enables work stealing. Because Play is non-blocking, threads are not blocked when the network is slow, but are free to serve another HTTP request–and because Play is stateless, there is no session information tied to the thread that would confuse it.
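The non-blocking style can be sketched with the JDK alone. To be clear about the assumptions: this is not Play's API or its internals, the class and variable names are invented, and CompletableFuture arrived in Java 8, later than the releases mentioned above (ForkJoinPool itself dates from Java 7). Two worker threads handle two requests that are waiting on remote data, but instead of blocking, each request registers a callback to run when the data arrives. That leaves both threads free for a third request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;

class OpenDoor {
    static List<String> serve() {
        ForkJoinPool pool = new ForkJoinPool(2); // a work-stealing pool with two workers
        List<String> finished = new ArrayList<>(); // only touched by the calling thread
        try {
            // Pending remote responses: nothing has arrived yet.
            CompletableFuture<String> remoteA = new CompletableFuture<>();
            CompletableFuture<String> remoteB = new CompletableFuture<>();

            // Register what to do when the data shows up. No thread waits here.
            CompletableFuture<String> slowA =
                remoteA.thenApplyAsync(body -> "slow:" + body, pool);
            CompletableFuture<String> slowB =
                remoteB.thenApplyAsync(body -> "slow:" + body, pool);

            // A third request runs right away because both workers are idle.
            // join() blocks only this demo's calling thread, not a pool worker.
            finished.add(CompletableFuture.supplyAsync(() -> "fast", pool).join());

            // The remote data finally arrives; the callbacks run on the pool.
            remoteA.complete("a");
            remoteB.complete("b");
            finished.add(slowA.join());
            finished.add(slowB.join());
            return finished;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(serve()); // [fast, slow:a, slow:b]
    }
}
```

The fast request finishes first even though the two slow requests were accepted before it, and the pool never needed more than two threads.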
In shredder terms, if a person is taking pages from a book, and goes back to find the door locked, that person can try another door and take pages from a different book -- all while maintaining levels of fairness between the books being shredded and the lines for the various shredders.
So that's one reason why Play is fast: Play uses CPUs more efficiently. In fact, many customers find that Play works so well that they are able to retire servers after moving to Play.