*** Update: I gave a presentation at the Utah PHP Usergroup about memcached that went really well.
Delayed, yet again, at the airport. Time to get this article written once and for all.
This is a rough draft; I still need to go through and proofread this article. However, several friends were anxious to read it, so here it is, rough for now. 😛
When your website or project grows, demands on your architecture and infrastructure can dramatically increase. You can then run into “bottlenecks”, or parts of your project that cap out their abilities and cause the rest of your application to slow down. One of the more common parts of your architecture to reach its limits is your database. There is a reason for this: it’s called ACID. While I won’t get into the details, basically databases are awesome because of their “ACID compliance.” You can store information and get information easily. However, these requirements of being a good database can also require a lot of legwork from your server. So when you have hundreds, thousands, or even millions of queries executing on your server, it can require a lot of CPU, memory, and I/O to do all the work.
This is where memcached comes in. I’ve implemented it with great success in the past, and you can too. This article is not a step-by-step how-to for setting up memcached. There are plenty of articles that show you how here and here (to name a few). You also have the PHP documentation for memcache + PHP. Instead, we’re going to discuss the theory behind creating an effective cache. Our memcache servers at Dating DNA run at about 99.9% efficiency (meaning 99.9% of all requests to our cache find a valid entry and don’t hit our database). We’ll cover a few basic concepts, and then talk about the two types of caching methods.
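To make the efficiency figure concrete, here is a minimal sketch of how a hit rate can be computed. It assumes you already have the server's counters in an array; `get_hits` and `get_misses` are the counter names memcached reports in its stats, while the function name `cacheHitRate` and the sample numbers are made up for illustration.

```php
<?php
// Hypothetical helper: hit rate = hits / (hits + misses).
// 'get_hits' and 'get_misses' are counters memcached reports in its stats output.
function cacheHitRate(array $stats): float
{
    $hits = (int) ($stats['get_hits'] ?? 0);
    $misses = (int) ($stats['get_misses'] ?? 0);
    $total = $hits + $misses;
    // Avoid dividing by zero on a server that has not answered any gets yet
    return $total === 0 ? 0.0 : $hits / $total;
}

// Made-up numbers, for illustration only
$stats = ['get_hits' => 999000, 'get_misses' => 1000];
printf("%.1f%%\n", cacheHitRate($stats) * 100);
```

A rate that drifts downward over time is usually a hint that keys are expiring too early or that the wrong things are being cached.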
What is a cache? To quote Wikipedia:
In computer science, a cache (pronounced /kæʃ/) is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data is expensive to fetch (owing to longer access time) or to compute, compared to the cost of reading the cache. In other words, a cache is a temporary storage area where frequently accessed data can be stored for rapid access. Once the data is stored in the cache, it can be used in the future by accessing the cached copy rather than re-fetching or recomputing the original data.
So basically it is a collection of data that sits between your code and server. You typically check the cache first to see if it has a valid entry. If so, you use the information in the cache. If not, you generate the information manually. After generating the information manually, you put it in the cache. Ideally you want your cache to be full of good information so your database doesn’t have to do the same work again.
Why cache? In a perfect world where servers have no limitations you wouldn’t need a cache. Your database would be able to handle trillions of queries without ever running into locking issues, slow responses, expensive joins, etc. However, we live in the real world, where our databases have limitations. So we implement caches to help alleviate those limitations.
Databases aren’t the only place to look when deciding what to cache. RSS feeds, web service responses, XML files, etc. can also be sources of load for your server. While I will reference databases a great deal throughout this article, keep in mind they aren’t the only source of data and load for your application.
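The check, generate, store cycle described above can be sketched in a few lines. This is only a sketch under assumptions: `ArrayCache` is an in-memory stand-in for a memcached client so the example runs anywhere, and `getOrGenerate` is a hypothetical helper name. The stand-in returns false on a miss and takes `set()` arguments in the order key, value, flag, expire.

```php
<?php
// In-memory stand-in for a memcached client (an assumption, so this runs anywhere).
class ArrayCache
{
    private $items = [];

    public function get($key)
    {
        if (!isset($this->items[$key])) {
            return false;                       // miss
        }
        list($value, $expiresAt) = $this->items[$key];
        if ($expiresAt !== 0 && $expiresAt <= time()) {
            unset($this->items[$key]);          // entry has timed out
            return false;
        }
        return $value;
    }

    public function set($key, $value, $flag = 0, $expire = 0)
    {
        $this->items[$key] = [$value, $expire > 0 ? time() + $expire : 0];
        return true;
    }
}

// Hypothetical helper: check the cache first, fall back to $generate on a miss.
function getOrGenerate(ArrayCache $cache, $key, callable $generate, $expire)
{
    $value = $cache->get($key);
    if ($value !== false) {
        return $value;                          // valid entry: use it
    }
    $value = $generate();                       // no entry: do the expensive work
    $cache->set($key, $value, 0, $expire);      // save it for the next request
    return $value;
}

$cache = new ArrayCache();
$calls = 0;
$loadUser = function () use (&$calls) {
    $calls++;                                   // stands in for a database query
    return ['id' => 5, 'name' => 'Example User'];
};

$first  = getOrGenerate($cache, 'user_5', $loadUser, 3600);
$second = getOrGenerate($cache, 'user_5', $loadUser, 3600);
echo "generator ran {$calls} time(s)\n";
```

The second call never reaches the generator, which is the whole point: the expensive work happens once per cache lifetime instead of once per request.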
Identifying What To Cache
So how do we do this? We analyze. There are many techniques you can use to find good places to cache. The key is to look for information that has some of the following characteristics:
- High Demand – If you have something that is used on every single page of your website, most likely you’ll be able to find a way to cache that information and gain performance.
- Expensive – Some information is faster or easier than others to retrieve. If you find one particular query that takes longer, or requires more work, those are the queries to target first.
- Large – The query might be relatively quick, but it also might have a lot of data. Transferring that data takes time, and while usually quick, being large can also compound the problem when added to these other characteristics.
- Common – This information is common throughout the site and is not unique to a particular scenario.
The more of these characteristics a particular piece of information has, the greater the performance boost you will get when you implement an effective cache for it. Let’s look at a good and a bad example of something to cache.
Good Example – Profile Information
I’ll use this example later on as well, but profile information is always something good to cache. Typically this information is spread across several tables in your database and can have a lot to it. It is used often, and if you have a lot of users it can make those user tables very, very busy.
Bad Example – User Message
While your messaging system for your website might be used a lot, caching individual messages between users won’t be very effective. An individual message is viewed only a handful of times. So you would fill up your cache with tons of data that would be retrieved very little. This is an example of a piece of data that appears like it would be good to cache, but it’s unique, so it’s not needed often.
Techniques for Finding Potential Caches
To know what to cache requires you to know your application. There are several ways you can do this:
- Monitor Queries – Knowing what queries run when and how often is half the battle. There are several ways you can do this; my one recommendation is a tool called “Jet Profiler”. They have a free version and a paid version. The free version should give you most of the information you need. The paid version is more advanced. My rule of thumb: if your site is popular enough and you’re running into locking issues or other advanced problems, the paid version will pay for itself in a few hours.
- Output Queries – If you use some form of Database classes to handle your queries, most likely there is a way to log your queries and at the end of the page spew out the log. Only do this for developers in a development environment. But being able to see this information in context will help. Also, if you can list the time it takes for that query to execute next to it, that will help identify problems as well.
- Monitor Page Loads – Keep an eye on which pages take longer to load. The longer a page loads, the more likely the information on that page could benefit from caching.
- Monitor Web Analytics – Not only are the pages that take longer to load important, but also the most viewed pages.
- Brainstorming – As time passes, as a developer, you should have a “feel” for your application. Brainstorm with your team about the different parts of your application that could be cached to decrease load.
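The “Output Queries” idea from the list above can be sketched roughly like this. Everything here is hypothetical: `QueryLog` is not part of any library, and the `$run` callable stands in for whatever your own database class uses to execute a query. The fake runner just sleeps so the example runs without a database.

```php
<?php
// Hypothetical query log: records each SQL string and how long it took.
class QueryLog
{
    private static $entries = [];

    // $run is whatever actually executes the query (an assumption here).
    public static function timed($sql, callable $run)
    {
        $start = microtime(true);
        $result = $run($sql);
        self::$entries[] = ['sql' => $sql, 'ms' => (microtime(true) - $start) * 1000];
        return $result;
    }

    public static function entries()
    {
        return self::$entries;
    }

    // Slowest first, ready to print at the bottom of a page in development only
    public static function report()
    {
        $entries = self::$entries;
        usort($entries, function ($a, $b) {
            return $b['ms'] <=> $a['ms'];
        });
        $lines = [];
        foreach ($entries as $entry) {
            $lines[] = sprintf('%8.2f ms  %s', $entry['ms'], $entry['sql']);
        }
        return implode("\n", $lines);
    }
}

// Fake runner that sleeps instead of talking to a database
$fakeRun = function ($sql) {
    usleep(strpos($sql, 'JOIN') !== false ? 20000 : 1000);
    return [];
};

QueryLog::timed('SELECT * FROM users WHERE id = 5', $fakeRun);
QueryLog::timed('SELECT * FROM users JOIN pictures ON pictures.user_id = users.id', $fakeRun);
echo QueryLog::report(), "\n";
```

Sorting by time puts the expensive queries at the top, which are usually the first candidates for caching.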
Once you find parts to implement caching, there are two basic methods of caching you can implement. While there are other ways to implement caching, I feel like these are the two most common.
Timeout Cache / Output Caching
It is characterized by having information that is queried frequently, but also changes frequently. Typically it’s a summary of some sort. Here is an example of this type of cache:
This screenshot is from WordPress.com’s front page. It lists a series of blog posts on the WordPress.com network. It’s a list of popular blog posts, either by very important people or on hot topics. While I’m not positive, I’m pretty sure they use some form of algorithm to generate that list. WordPress.com has about 200,000 posts a day. Let’s just guess and say that the WordPress.com home page is viewed 1,000 times a minute. That’s a lot of traffic, and imagine if on each page view their PHP script had to query 200,000 posts to determine what should be on that front page. That is a LOT of work. So how about every 15 to 30 minutes you re-generate that list and then cache it. Instead of generating that piece of information 30,000 times in half an hour, you generate it maybe once or twice.
The reason it’s called an Output Cache is that you’re typically storing the raw output instead of the information used to create the output. It is also an example of a Timeout Cache because the majority of the time what causes the cache to expire and re-generate is time driven: every minute, hour, day, etc. In the example above, while a new post may be written by a VIP, or become a hot topic, it’s okay if it doesn’t show up on the list immediately. If it takes 30 minutes, that’s okay.
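A rough sketch of a timeout / output cache for a list like that might look as follows. The assumptions are loud ones: `ArrayCache` is an in-memory stand-in for a memcached client, `renderPopularPosts` and `popularPostsHtml` are hypothetical names, the “score” ranking is invented, and nothing here reflects how WordPress.com actually builds its front page.

```php
<?php
// In-memory stand-in for a memcached client (an assumption so this runs anywhere).
// set() takes its arguments in the order key, value, flag, expire.
class ArrayCache
{
    private $items = [];

    public function get($key)
    {
        if (!isset($this->items[$key])) {
            return false;                       // miss
        }
        list($value, $expiresAt) = $this->items[$key];
        if ($expiresAt !== 0 && $expiresAt <= time()) {
            unset($this->items[$key]);          // timed out
            return false;
        }
        return $value;
    }

    public function set($key, $value, $flag = 0, $expire = 0)
    {
        $this->items[$key] = [$value, $expire > 0 ? time() + $expire : 0];
        return true;
    }
}

// Hypothetical renderer: imagine this scanning a day's worth of posts.
function renderPopularPosts(array $posts)
{
    usort($posts, function ($a, $b) {
        return $b['score'] <=> $a['score'];
    });
    $html = "<ul>\n";
    foreach (array_slice($posts, 0, 2) as $post) {
        $html .= '  <li>' . htmlspecialchars($post['title']) . "</li>\n";
    }
    return $html . "</ul>\n";
}

// Output cache: store the finished HTML and rebuild it at most every 30 minutes.
function popularPostsHtml(ArrayCache $cache, array $posts, &$renders)
{
    $html = $cache->get('frontpage_popular_posts');
    if ($html === false) {
        $renders++;
        $html = renderPopularPosts($posts);
        $cache->set('frontpage_popular_posts', $html, 0, 30 * 60);
    }
    return $html;
}

$cache = new ArrayCache();
$posts = [
    ['title' => 'Quiet post', 'score' => 3],
    ['title' => 'Hot topic', 'score' => 90],
    ['title' => 'VIP post', 'score' => 75],
];
$renders = 0;
for ($i = 0; $i < 1000; $i++) {                 // 1,000 page views
    $html = popularPostsHtml($cache, $posts, $renders);
}
echo $html, "list was generated {$renders} time(s)\n";
```

Note that the cached value is the rendered HTML, not the rows behind it, so a page view that hits the cache skips both the query and the templating work.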
When To Use a Timeout / Output Based Cache
- Summaries – If you have a summary list of products, news items, or new members that is viewed a lot, and it is even slightly expensive to generate, caching it saves your database work, because summaries typically have to read a lot of rows before cutting down the information.
- Not Time Critical – If the information is safe to be a little out of date, or not reflect new information immediately, then it’s a good idea to rely on a timeout for refreshing the data.
Event Driven Cache / Object Caching
Event driven caches and object caches differ from the previous scenario of caching. Information in this type of cache is queried frequently, but it isn’t changed frequently. On many websites, certain things are more complex. One website I work on is Dating DNA, a social networking dating website. A “person” on the website is a collection of rows in many tables. They have their general user information, things they display on their profile, pictures, address information, their answers on our dating survey, etc. An image from xkcd can show how “complicated” things can get:
Before implementing caching, many times I would have to join multiple tables together to get the information I needed. In contrast to the previous example with WordPress.com’s 200,000 posts a day, a single user on our website seldom changes their information. Maybe once a day, but more likely they change several things once every week or so. So this information, while used on almost every page of the website, changes rarely by website standards.
What a developer can do is create an object called ProfileInformation. This object would contain everything about a person. Then, you create a factory method to retrieve the ProfileInformation instance for a user. If the cached data is missing, invalid, or expired, the class will run the SQL queries to gather all of that information, store it in the class’s variables, and store the object in the cache. If a valid entry is found, it returns the valid entry instead of doing all the hard work of putting it together.
This cache is similar to the one above, with the exception that we can set the timeout to something high, like 7 days. If the user changes any of their information, after the UPDATE command finishes executing, we call ProfileInformation::DestroyCache($user_id). This will then mark the cache as invalid, and the next time it’s requested the data will be regenerated. That is why I call this method of caching an “Object Cache” or “Event Cache”: you store a complex object whose primary method of expiration is event driven.
Why event driven? It doesn’t change very often; however, when it does change it is very important that it expire promptly. Imagine if a user changes their email address, but when they go back to view their account summary it still shows their old information? They won’t think “Oh, they are probably caching this information to reduce load on their servers.” They will think “What the heck? This stupid site is broken.”
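The flow described above, a long timeout plus a DestroyCache call after every update, can be sketched as follows. This is an illustration under assumptions, not the real Dating DNA code: `ArrayCache` is an in-memory stand-in that serializes values much like a real cache client would, the `$db` array stands in for the user tables, and `ForUser` is a hypothetical name for the factory method.

```php
<?php
// In-memory stand-in for a memcached client (assumption). Values are serialized
// on the way in, the way objects are when they go to a real cache server.
class ArrayCache
{
    private $items = [];

    public function get($key)
    {
        return isset($this->items[$key]) ? unserialize($this->items[$key]) : false;
    }

    public function set($key, $value, $flag = 0, $expire = 0)
    {
        $this->items[$key] = serialize($value); // expiry is ignored in this stand-in
        return true;
    }

    public function delete($key)
    {
        unset($this->items[$key]);
        return true;
    }
}

class ProfileInformation
{
    public static $cache;          // the cache client (the stand-in here)
    public static $db = [];        // fake "users table": id => row
    public static $queries = 0;    // counts simulated database reads

    public $userId;
    public $email;

    private static function key($userId)
    {
        return 'profile_' . (int) $userId;
    }

    // Factory: return the cached object, or build and cache it on a miss
    public static function ForUser($userId)
    {
        $profile = self::$cache->get(self::key($userId));
        if ($profile instanceof self) {
            return $profile;
        }
        self::$queries++;          // the expensive multi-table work would go here
        $profile = new self();
        $profile->userId = $userId;
        $profile->email = self::$db[$userId]['email'];
        self::$cache->set(self::key($userId), $profile, 0, 7 * 24 * 60 * 60);
        return $profile;
    }

    // Called right after any UPDATE that touches this user's data
    public static function DestroyCache($userId)
    {
        self::$cache->delete(self::key($userId));
    }
}

ProfileInformation::$cache = new ArrayCache();
ProfileInformation::$db[5] = ['email' => 'old@example.com'];

$before = ProfileInformation::ForUser(5)->email;  // miss: reads the "database"
ProfileInformation::ForUser(5);                    // hit: no database read

// The event: the user changes their email, so the update is followed by a destroy
ProfileInformation::$db[5]['email'] = 'new@example.com';
ProfileInformation::DestroyCache(5);

$after = ProfileInformation::ForUser(5)->email;   // miss again: fresh data
echo "{$before} -> {$after}, database reads: " . ProfileInformation::$queries . "\n";
```

Without the DestroyCache call, the third lookup would keep returning the old email address until the 7 day timeout ran out.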
When to use Event Driven / Object Caching
- Infrequent Updates – If this information updates infrequently, then having it expire based on events will allow you to keep that cached data longer and put less load on your servers.
- Inconsistent Updates – If the information updates inconsistently then use events to expire data. Example: one day it might be updated 5 or 6 times, but then go an entire week without being updated. You could safely make your cache timeout set to several days without problems.
- Time Critical – If having out-of-date information for a short period of time is not an option, then having each event that changes the information clear the cache is your best option.
Common Pitfalls w/ Caching
While caching can be a great solution to improve performance, if not engineered properly it can cause major headaches.
I’ve seen a lot of examples with caching where a developer caches a query. They will do something like this:
// Execute and get the result of the query in an array
$result = Database::FetchArray($sql);
// set the timeout to 4 hours
$mem->set(md5($sql), $result, 0, 4 * 60 * 60);
While it works, there is a problem. There isn’t an easy way to expire the cache. That can cause you headaches down the road. If you want to make sure you can clear the cache from anywhere in your code, wrap the cache in a class. I highly recommend you wrap every cache in a class. Here is an example:
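class UserCache
{
    // GetMemcache() and GetKey($user_id) are small private helpers (not shown):
    // one returns the connected Memcache object, the other the key for a user
    // Getting the cache
    static public function GetCache($user_id)
    {
        $mem = self::GetMemcache();
        $key = self::GetKey($user_id);
        $data = $mem->get($key);
        // If the cache didn't have a valid entry lets make one
        if ($data === false)
        {
            // structure the query (cast to int so the id can't inject SQL)
            $sql = "SELECT * FROM users WHERE id = " . (int) $user_id;
            // execute the query and get the result
            $data = Database::FetchArray($sql);
            // store the result in the cache
            $mem->set($key, $data, 0, 4 * 60 * 60);
        }
        // return the cache with the correct data
        return $data;
    }
    // Deleting/Clearing the cache
    static public function DestroyCache($user_id)
    {
        // Get our memcache object
        $mem = self::GetMemcache();
        // Get the key for this user
        $key = self::GetKey($user_id);
        // delete it from the cache
        $mem->delete($key);
    }
}
/** somewhere else in your code **/
// Get the user's information!
$data = UserCache::GetCache(5);
/** the function that updates the user record **/
// .. update the user
UserCache::DestroyCache($user_id);
// .. continue execution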
Over Caching / Only Solution
Once you start caching, you can go a little overboard if you’re not careful. Caching can also be used as a band-aid to cover up inefficient code that will cause problems down the road even with caching. You also have to face the fact that your memcached servers might go offline. Memcached should be used to speed up your website, but not be the only thing holding it up. If your caching servers go down, or you need to flush them, your website needs to be able to function without them. I know of people who have fried their database server after their caching solution went offline. Memcached is a great solution, but not a band-aid for sloppy design.
Here is a Jing recording of the Dating DNA website and the several places we use caching. This isn’t a full list of all the places we cache, just a few. You can view the full size video here.
Speed up your website with caching. While it’s not the solution for everything, it can decrease database load and page load times. If you have any questions or comments, please feel free to leave them below.
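To see what “function without them” means in practice, here is a small sketch of a page helper running against a cache that is down. `OfflineCache` is a stand-in that mimics how the PHP Memcache client reports failure (get() and set() return false), and `loadUser` is a hypothetical helper, not code from this site.

```php
<?php
// Stand-in for a cache whose servers are unreachable. It mimics how the PHP
// Memcache client reports failure: get() and set() simply return false.
class OfflineCache
{
    public function get($key)
    {
        return false;
    }

    public function set($key, $value, $flag = 0, $expire = 0)
    {
        return false;
    }
}

// Hypothetical page helper: the cache is an optimization, never a requirement.
function loadUser($cache, $userId, callable $queryDatabase)
{
    $key = 'user_' . (int) $userId;
    $data = $cache->get($key);
    if ($data !== false) {
        return $data;
    }
    $data = $queryDatabase($userId);            // still works with the cache offline
    $cache->set($key, $data, 0, 4 * 60 * 60);   // a failed set is ignored on purpose
    return $data;
}

$dbHits = 0;
$queryDatabase = function ($userId) use (&$dbHits) {
    $dbHits++;                                  // stands in for a real query
    return ['id' => $userId, 'name' => 'Example User'];
};

$cache = new OfflineCache();
for ($i = 0; $i < 100; $i++) {
    $user = loadUser($cache, 5, $queryDatabase);
}
echo "pages served: 100, database queries: {$dbHits}\n";
```

Every page still renders, but every request now reaches the database. That is the load your queries and indexes have to survive on their own, which is why the cache should sit on top of efficient code rather than hide inefficient code.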