Regardless of what you may think of Facebook as a platform, they run a massive operation, and when you reach their level of scale you have to get more creative in how you handle every aspect of your computing environment.
Engineers quickly reach the limits of human ability to track information, to the point that checking logs and analytics becomes impractical and unwieldy on a system running thousands of services. This is a perfect scenario to implement machine learning, and that is precisely what Facebook has done.
The company published a blog post today about a self-tuning system they have dubbed Spiral. This is pretty nifty, and what it does is essentially flip the idea of system tuning on its head. Instead of looking at some data and coding what you want the system to do, you teach the system the right way to do it and it does it for you, using the massive stream of data to continually teach the machine learning models how to push the systems to be ever better.
In the blog post, the Spiral team described it this way: “Instead of looking at charts and logs produced by the system to verify correct and efficient operation, engineers now express what it means for a system to operate correctly and efficiently in code. Today, rather than specify how to compute correct responses to requests, our engineers encode the means of providing feedback to a self-tuning system.”
They say that coding in this way is akin to declarative code, like using SQL statements to tell the database what you want it to do with the data, but the act of applying that concept to systems is not a simple matter.
“Spiral uses machine learning to create data-driven and reactive heuristics for resource-constrained real-time services. The system allows for much faster development and hands-free maintenance of those services, compared with the hand-coded alternative,” the Spiral team wrote in the blog post.
If you consider the sheer number of services running on Facebook, and the number of users trying to interact with those services at any given time, it required sophisticated automation, and that is what Spiral is providing.
The system takes the log data and processes it through Spiral, which is connected with just a few lines of code. It then sends commands back to the server based on the declarative coding statements written by the team. To ensure those commands are always being fine-tuned, at the same time, the data gets sent from the server to a model for further adjustment in a lovely virtuous cycle. This process can be applied locally or globally.
The tool was developed by the team operating in Boston, and is only available internally inside Facebook. It took lots of engineering to make it happen, the kind of scope that only Facebook could apply to a problem like this (mostly because Facebook is one of the few companies that would actually have a problem like this).