Some of Socialcast’s largest clients rely heavily on our On-Premise virtual appliance to communicate and collaborate in their day-to-day work. The appliance was designed to be scalable and high-performing, and we are always looking for ways to further optimize it. Recently, while exploring improvements, we discovered the need for more realistic performance tests for the application. These tests needed to integrate well with our fast-moving master branch and continuous deployment while allowing us to test and tune variously sized virtual clusters.
We looked into existing load testing tools, but found them somewhat lacking. Some tools, like apachebench, supported nothing beyond very simple HTTP requests. Other tools required complex load testing procedures to be captured and maintained. Finally, none of the tools we surveyed were capable of testing HTTP long-polling or reacting to the data being returned by the service under test.
After discussing these challenges, we decided to implement a simple load generator, Crusher (http://github.com/socialcast/crusher), during a Socialcast hack-fest. We chose an agent-based approach to encourage flexibility and reusability in the final tool. Rather than providing a means of running pre-recorded tests, Crusher provides a framework for simulating the behavior of individual users and a domain-specific language to define different load scenarios. We also decided to separate the tool from the service under test by introducing a simple rake-inspired configuration DSL to allow highly extensible agent instantiation and configuration.
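To make the idea concrete, here is a minimal sketch of what a rake-inspired scenario DSL can look like. The names below (`scenario`, `agents`, the behavior symbols) are illustrative placeholders, not Crusher’s actual API — see the repository for the real configuration format.

```ruby
# A toy rake-style DSL for declaring load scenarios. Each scenario
# describes groups of simulated users: how many agents, which behavior
# they run, and per-behavior options. All names here are hypothetical.

class Scenario
  attr_reader :name, :agent_groups

  def initialize(name)
    @name = name
    @agent_groups = []
  end

  # Declare a group of simulated users.
  def agents(count, behavior, options = {})
    @agent_groups << { count: count, behavior: behavior, options: options }
  end
end

module ScenarioDSL
  def self.scenarios
    @scenarios ||= {}
  end

  # rake-style top-level declaration: scenario(:peak_load) { |s| ... }
  def self.scenario(name)
    s = Scenario.new(name)
    yield s
    scenarios[name] = s
  end
end

# Example configuration file contents:
ScenarioDSL.scenario(:average_load) do |s|
  s.agents 100, :reader, poll_interval: 30   # mostly long-polling readers
  s.agents 10,  :poster, messages_per_hour: 4
end
```

The payoff of this style is that a new load scenario is just a few declarative lines, while the agent behaviors themselves remain ordinary, reusable Ruby classes.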
Due to the highly parallel nature of the system, we decided to tackle some of the agent modeling in a prototype with a simple, thread-based concurrency model, and then move on to a more robust approach. After exploring the space and observing that our threaded prototype began choking at around 150 agents per Ruby process, we chose to base the project on EventMachine because of its ability to inexpensively parallelize thousands of concurrent IO operations. This characteristic made EventMachine an ideal candidate for our testing tool, as individual agents do little more than make HTTP requests and wait for responses. One issue we found was that the HTTP clients in the current version of EventMachine didn’t support SSL/TLS connections. Luckily, the underlying framework supported establishing secure connections, so adding support for this was trivial.
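The scaling difference comes down to what an agent costs. In the threaded prototype each agent held a full thread stack while mostly sleeping; in an evented model each agent is just a callback scheduled on one reactor loop. The following toy, dependency-free sketch (a stand-in for EventMachine’s reactor, not Crusher’s actual code) shows a thousand polling agents sharing a single thread:

```ruby
# A toy single-threaded scheduler standing in for EventMachine's
# reactor, to illustrate why evented agents are cheap: each agent is
# only an entry in a timer queue, not a whole thread.

class ToyReactor
  Task = Struct.new(:run_at, :callback)

  def initialize
    @tasks = []
    @now = 0.0
  end

  def add_timer(delay, &callback)
    @tasks << Task.new(@now + delay, callback)
  end

  # Run until no tasks remain, advancing simulated time.
  def run
    until @tasks.empty?
      task = @tasks.min_by(&:run_at)
      @tasks.delete(task)
      @now = task.run_at
      task.callback.call
    end
  end
end

# An agent that "polls" on an interval; a real Crusher agent would
# issue an HTTP long-poll here instead of just counting.
class PollingAgent
  attr_reader :polls

  def initialize(reactor, interval, iterations)
    @reactor, @interval = reactor, interval
    @polls = 0
    @remaining = iterations
    schedule
  end

  def schedule
    @reactor.add_timer(@interval) do
      @polls += 1
      @remaining -= 1
      schedule if @remaining > 0
    end
  end
end

reactor = ToyReactor.new
agents = Array.new(1000) { |i| PollingAgent.new(reactor, 1 + i % 5, 3) }
reactor.run
```

EventMachine provides the same shape of API (timers, deferred callbacks, and non-blocking connections) backed by an efficient C reactor, which is what lets a single process drive thousands of concurrent agents.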
Once the Crusher library was in place, we set out to model the behavior of a Socialcast user. The first step was to collaborate with Socialcast’s data scientist to gather anonymous statistics about typical user behavior based on traffic in select production SaaS environments. After some exploration, we came up with a set of average and peak load scenarios based on community size and type.
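One simple way to turn such statistics into a simulation is to express them as a weighted behavior mix and assign agents proportionally. The categories and weights below are made-up placeholders for illustration, not Socialcast’s actual numbers:

```ruby
# Hypothetical behavior mix derived from anonymized usage statistics.
# The category names and fractions are illustrative only.
BEHAVIOR_MIX = {
  lurker:      0.70,  # mostly reads the stream via long-polling
  commenter:   0.25,  # reads plus occasional comments and likes
  contributor: 0.05   # actively posts messages
}

# Deterministically assign a behavior to each simulated user so that a
# community of any size reproduces the same proportions.
def assign_behaviors(user_count, mix = BEHAVIOR_MIX)
  behaviors = []
  mix.each do |behavior, fraction|
    behaviors.concat([behavior] * (user_count * fraction).round)
  end
  # Rounding can leave the list a user or two short (or long);
  # pad or trim using the most common behavior.
  behaviors << mix.max_by { |_, f| f }.first while behaviors.size < user_count
  behaviors.pop while behaviors.size > user_count
  behaviors
end
```

Each behavior symbol would then map to an agent class in the scenario DSL, so scaling a scenario from a 100-user community to a 10,000-user one is a single parameter change.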
We added several layers of instrumentation to capture and visualize performance data. First, like our production clusters, the performance cluster was added to our NewRelic RPM account. This provided application-server response time characteristics and a great deal of transparency into both the Ruby code and the underlying SQL queries which consume the majority of the time in our application. We also added Munin, which provides longer-term, lower-level system monitoring. Out of the box, Munin monitors general aspects of a Unix system (like CPU utilization, disk IO, and memory utilization), and we authored a variety of system- and application-specific plugins to provide an even better view of our application’s performance.
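Custom Munin plugins are pleasantly simple: Munin invokes the plugin with the single argument `config` to learn how to draw the graph, and with no argument to fetch current values as `field.value N` lines. A sketch of an application-level plugin in that style (the metric and its source are hypothetical; a real plugin would query the application rather than return a constant):

```ruby
#!/usr/bin/env ruby
# Sketch of a custom Munin plugin for an application-level metric.
# Munin's plugin protocol: "plugin config" prints graph metadata,
# "plugin" (no args) prints current values as "field.value N".

def current_queue_depth
  # A real plugin would query the app's database or a status endpoint;
  # a fixed value keeps this sketch self-contained and runnable.
  42
end

def config_output
  ['graph_title Background job queue depth',
   'graph_category app',
   'graph_vlabel jobs',
   'jobs.label pending jobs']
end

def fetch_output
  ["jobs.value #{current_queue_depth}"]
end

puts(ARGV.first == 'config' ? config_output : fetch_output)
```

Dropping a script like this into Munin’s plugin directory is all it takes to get the metric graphed alongside the stock CPU, disk, and memory graphs.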
To ease integration with our agile development style, we established a virtual performance evaluation cluster and a scheduled performance test. Our goal was to provide a place where features which raised eyebrows with regard to overall application performance could be quickly integrated and tested without additional effort on the part of the engineering or operations teams. To accomplish this, we created a simple means of deploying an arbitrary branch in our centralized git repository to the cluster. This made it painless to mix and match topic branches and compare results to our existing baselines.
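The mechanics of branch deployment can be as simple as a parameterized remote checkout. A sketch of that kind of helper follows; the host name, deploy path, and restart convention are all made up for illustration, and the helper builds the command rather than running it, so the sketch stays side-effect free:

```ruby
# Hypothetical helper for "deploy branch X to the perf cluster".
# Host, path, and restart mechanism below are illustrative placeholders.

PERF_HOST  = 'perf-cluster.example.com'  # hypothetical host
DEPLOY_DIR = '/opt/app'                  # hypothetical deploy path

# Build the remote command that checks out the requested branch from
# the centralized repository and restarts the application server.
def deploy_command(branch)
  remote = [
    "cd #{DEPLOY_DIR}",
    'git fetch origin',
    "git checkout #{branch}",
    "git reset --hard origin/#{branch}",
    'touch tmp/restart.txt'  # Passenger-style restart signal
  ].join(' && ')
  "ssh deploy@#{PERF_HOST} '#{remote}'"
end
```

With a wrapper like this behind a rake task, anyone on the team can point the scheduled performance run at a topic branch without involving operations.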
These tools and techniques helped us build a much more complete picture of how our application scales under a variety of workloads. We were able to establish a firm benchmark of Socialcast performance and meet our goal of providing recommended cluster configurations for our virtual appliance customers, while identifying and addressing several key performance bottlenecks along the way.