If you want to skip straight to the explanation, jump down to The Solution. Otherwise, here is a little background.
A Little Background
On January 2nd, we started to run into some serious performance problems with Dating DNA, so we began working out what our problems were and how to fix them. At the beginning, the database was the bottleneck. It would get flooded with requests, and because it could not handle them all, every query would slow down. We were using a tool called Jet Profiler (which I will post about in more detail later), and here is a graph it would output before we started our optimizations:
The light blue is total threads connected. The dark blue is threads running a query. The red is threads that are taking 2 seconds or longer to run, which are slow queries. The red is bad, very bad. So we were in bad shape. We lovingly nicknamed it “The Red Zone” while we were working on optimizations. Now, granted, we weren’t in “The Red Zone” the whole time, but when things got busy, things would slow down.
But optimization after optimization, we started to get things more and more under control:
After about three days of optimizations, we got back down to a manageable load:
The Problem
However, when the website was having high traffic, we noticed an anomaly that looked like “blue waves.” We lovingly nicknamed them “blue meanies,” after the Beatles’ Yellow Submarine cartoon. Here is what Jet Profiler would report:
These big blue waves of connections reporting as “sleeping” would get worse and worse. At first we didn’t think it would be a problem. However, whenever we had “blue meanies,” the site and iPhone app felt really slow. I won’t cover all the things I tested and tried that didn’t work, but here is how we ultimately figured out our problem.
The Solution
At first, I thought it was an issue with garbage collection in PHP, with connections being left open after the scripts were done with them. So I set the wait_timeout on MySQL to something really low, like 5 seconds. We then started to get errors all over the website, so we knew that these were legitimate connections still in use by PHP. The only thing that made sense was that PHP & Apache had now become the bottleneck: MySQL was answering queries so quickly that its threads were almost always sleeping, waiting for PHP to finish. We slowly started to disable different features on the website, trying to narrow down whether a particular part of the website was causing it. After a few hours, we figured out the feature: the ChatWalls. So we started to investigate why turning off the ChatWalls would make MySQL run faster, since we had moved the ChatWalls completely off MySQL and onto Redis.
What we found is that one particular function had a typo that would cause PHP to iterate over an array not 10 to 20 times, but 1,000 to 2,000 times or more. This function was also called a lot by several Ajax calls. So I fixed the typo, and the blue waves went away.
What was happening is that Apache & PHP were spending so much time processing the buggy function that the rest of the web requests slowed down greatly. Each slow request held its MySQL connection open for longer, so way too many connections stayed open at once, causing the blue waves and slowing down the website even more.
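A rough way to see why slower PHP means more open connections is Little's Law: the average number of connections held open is about the request rate times how long each request holds its connection. The sketch below is only an illustration. The request rate and durations are made-up numbers, not measurements from Dating DNA, and the function name is hypothetical.

```python
def open_connections(requests_per_second, seconds_per_request):
    """Little's Law: average connections held = arrival rate * hold time."""
    return requests_per_second * seconds_per_request


# Hypothetical numbers, for illustration only.
RATE = 200  # web requests per second

# A healthy request finishes in about 50 ms and releases its connection.
healthy = open_connections(RATE, 0.05)

# A request stuck in a slow PHP loop holds its connection for 1.5 s,
# even though MySQL answered its queries long ago.
buggy = open_connections(RATE, 1.5)

print(f"healthy: ~{healthy:.0f} connections open")
print(f"buggy:   ~{buggy:.0f} connections open")
```

With the same traffic, the only thing that changed is how long PHP takes per request, yet the number of connections sitting open grows in proportion. Most of that extra time is spent outside MySQL, which is why the connections show up as sleeping.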
So in reality, the blue waves were a symptom of the problem, not the cause. It is the whole Correlation vs Causation situation (which I probably should blog about in more detail when it comes to finding performance issues).
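One quick way to check whether you are seeing the same symptom is to look at what the connections are actually doing. MySQL's SHOW PROCESSLIST reports a Command column for each connection, with values such as Sleep and Query. The sketch below assumes you have already fetched those rows into a list of dictionaries. The sample rows, the function names, and the 80% threshold are all made up for illustration, not something we used at the time.

```python
from collections import Counter


def summarize_processlist(rows):
    """Count connections by their Command value (e.g. 'Sleep', 'Query')."""
    return Counter(row["Command"] for row in rows)


def looks_like_app_bottleneck(rows, sleep_ratio=0.8):
    """Heuristic: if most connections are idle, MySQL is probably waiting
    on the application rather than the other way around.
    The 0.8 threshold is an arbitrary assumption."""
    counts = summarize_processlist(rows)
    total = sum(counts.values())
    return total > 0 and counts["Sleep"] / total >= sleep_ratio


# Made-up sample: nine sleeping connections and one running a query.
sample = [{"Id": i, "Command": "Sleep", "Time": 3} for i in range(9)]
sample.append({"Id": 9, "Command": "Query", "Time": 0})

print(summarize_processlist(sample))
print(looks_like_app_bottleneck(sample))
```

A high share of sleeping connections does not prove anything on its own, but combined with fast queries it is a good hint to start profiling the application side.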
So if you have a lot of sleeping connections but MySQL is performing well, most likely it is PHP or Apache slowing things down. I hope this can help those having a similar problem. As for our database, it's working well now. There are a few more problems to iron out, but it is running really fast. The few red spikes are from the score generation system, which does bulk inserts, and they do not slow down the end user experience: