Support
akka

Q&A with Caoyuan Deng: Akka at Wandoujia

Today we're excited to post a Q&A with Caoyuan Deng, Platform Architect at Wandoujia, about Akka's role at one of China’s leading mobile entertainment search engines. 

In our interview, Caoyuan discusses how Akka was used to implement a real-time reactive platform, sings the praises of the Actor Model, and shares benefits and obstacles experienced along the way. He also shares feedback on future iniatives and improvements he'd like to see from Typesafe. 

--

Typesafe: Tell us a little bit about yourself.

Caoyuan: I’ve been coding for 25 years. I’m the author of the Scala plugin for NetBeans, and the author of a distributed, parallel, real-time stock trading platform based on Scala and Akka’s actor, which corresponds to the philosophy of the Reactive Platform. Before Scala and Akka, I wrote lots of Erlang and an Erlang plugin for NetBeans (but without updating for years). I joined Wandoujia one year ago.

Typesafe: So what is Wandoujia?

Caoyuan: First conceived in December 2009, Wandoujia (http://wandoujia.com) is China’s leading search engine for mobile entertainment, creating innovations with widespread influence over China’s mobile internet industry. Wandoujia has been installed more than 400 million times, with new installations each day exceeding 800,000. Its search index provides over one million choices in multiple content verticals: apps, videos, and ebooks, as well as supplying wallpapers, music, and themes from more than 130 content providers. Wandoujia also produces SnapPea (http://snappea.com), a line of international products for Android users beyond China.`

Typesafe: What is the role of Akka at Wandoujia?

Caoyuan: 10 million mobile devices connect to our system daily. We hope to serve them better via a real-time reactive platform, anywhere, anytime. In the Internet of Things age, The Network is shaping up to be a virtual world that reflects everyone’s daily life, or, The Network will map everyone’s life itself rather than just being part of their life as in the web age. We believe, in the long term, that reactive platforms will lead to such a virtual world.

To implement a real-time reactive platform, we created a persistent channel to connect each device to our backend, The System. That System thereby receives and reacts instantly to tons of events from these devices. That’s what Akka could stand for.

I think that the Actor Model is perfect for the minimal or basic granularity of parallel computing and incremental computing, which reacts to coming events or changes, shifts states, fires outgoing events or changes, respectively. The parallel behavior is the entirety of the individual behavior of many many actors. From the view of such an actor model, every device is virtually an actor, and every user is virtually an actor, etc. The virtual world is running in parallel, driven by event streams, just as good as the real world. Akka provides the solution for this vision.

Typesafe: Can you tell us more about the use case of Akka in Wandoujia?

Caoyuan: As the first step toward a fully reactive platform in Wandoujia, we implemented a connect and status server cluster (base on our open-source project https://github.com/wandoulabs/spray-socketio). The cluster could be described as

  • The stateless transport layer
  • The stateful status layer
  • The event streaming interface for business logic

 

Long-Connections.png

The figure above shows the architecture of our system. The transport layer keeps the persistent connection for each device, one actor per device. The stateful layer corresponds to keep the status of each connected device. Distributed Region and Distributed Mediator are the interfaces for the business logic here, the Region is the resolver for status actors according to their identity, and all messages that are published to the Mediator are then transformed to a RxScala event stream that can be subscribed by the business layer later.

With the above architecture, the device can share / inquire status, push events and messages bi-directionally, get real-time notifications, and connect to each other virtually.

Typesafe: What benefits have you seen so far from Akka?

Caoyuan: We've seen great benefits. Below are a few highlights:

  • Development efficiency: Before we implemented spray-socketio, we finished spray-websocket (https://github.com/wandoulabs/spray-websocket), an Akka / Spray based WebSocket protocol implementation, which passes all Autobahn WebSocket test cases and is qualified for production uses. With Akka’s actor model, and the goodies from spray.io, we coded spray-websocket and spray-socketio in 3 months between two engineers, my colleague, Xingrun Chen (cowboy129 at github), and I. Chen also contributed the consumer group pattern for the DistributedPubSubMediator during this period.
  • Stable and consistent cluster: Since actor has the perfect granularity for parallel and distributed computing, it’s easy to deploy the features from a single server up to a cluster, a perfect cluster for scaling out, online-rebalancing, and online-upgrading.
  • Performance: The performance is good enough for our use case so far, and the cluster is scaling out. Our running cluster now serves 1 millions persistent connections with4 transport nodes (250k per node), and, with 3 status nodes, serves 50k/s device heartbeats, plus 20k/s status requests. According to our benchmark evaluation, the cluster has the capacity to serve 1 millions connections per transport node, and 15k status requests per status node. We are expecting 10+ million persistent connected devices—going possibly up to billions—and to process several hundred thousands messages per second in the near future with more nodes gradually appended to cluster.

Typesafe: Have you encountered any obstacles along the way?

Caoyuan: The lack of Akka’s best practice guide in real-life. We developed the whole system in 3 months, but spent another 2 months tuning it to stability. At the early launching period, we encountered an unreachable node issue frequently. Chen and I were hit by alarms from our monitoring system every day. There are various kinds of interval parameters in Akka cluster, the heartbeat interval for the failure detector between cluster node monitor, the heartbeat interval for the remote transport failure detector, the heartbeat interval for the remote watch-failure-detector, and the cluster gossip-interval etc. We had to narrow down the cause step by step. And we finally found the issue was coming from the remote watching [editor’s note: remote actor lifecycle monitoring]. At the early design, the actors on status nodes would remote watch the actors on transport nodes one by one, which resulted in millions of remote watch heartbeats per second—too much, isn’t it? After adopting an alternate approach that removes this remote watching, the cluster runs stable and consistent everyday now. The lesson we learned is to be careful of massive remote watching between nodes. And, more in general, a peer to peer based, non-centralized society (what Akka cluster looks like), will fall into disorder suddenly if the communication cost is too expensive, and therefore loses resilience from disturbance. But after being fine-tuned, it can keep running smoothly and beautifully.

Typesafe: What other initiatives would you like to see from Typesafe and the Akka team?

Caoyuan: The most exciting part is the stateful sharding cluster, which was released by the Akka team in March. It’s so powerful in that it helps resolve so many cases in distributed computing. For example, it could be eventually an alternative to Redis in my eyes and a good starting point for an implementation of AMQP, per queue per actor.

For the Redis alternative, we’ve done some preliminary implementation, which could distributedly store AVRO based records, as an in-memory, full-range store. We also implemented an XPath-like querying mechanism named AvPath, which can select, insert, update or delete items according to the path expression. These all benefit from Akka’s sharding cluster feature, with online rebalancing and scaling out. We are going to open source it later this year.

We are also tracking the Akka Reactive Streams implementation, we’ll migrate the spray-websocket and spray-socketio to Akka Streams when it is released.

Typesafe: What improvements would you like to see from Typesafe and the Akka team?

Caoyuan: We very much appreciate the progression and approach of Akka, which has successfully derived a great deal from academic research and the best practices of distributed computing. For the coming releases, we’d like to see:

  1. The runtime performance improvement. There is room for runtime performance tuning. We’ve seen parboiled2 in the coming Akka HTTP module, which should bring better performance. We’d like to see the Akka community do more profiling to gain an understanding of the performance bottlenecks in real-life.
  2. Best practices for the persistence layer. We are trying various persistence technologies, including HBase and Cassandra, to see the best cases for journal store vs. snapshot store. We’d like to share and see more best practices from the community here.
  3. Deployment and monitoring tools. We’d like a consistent way to deploy the cluster nodes without too many pre-defined configuration parameters. We’re not sure if there is some way better than Zookeeper etc. Also, we want to easily locate the bottleneck of the cluster. We tried kamon, but failed to make it work on the Akka Cluster so far.

--

Thanks Caoyuan!

Share