Archive for the ‘squid’ Category

Advanced Squid Caching in Scribd: Cache Invalidation Techniques

May 29th, 2010

Having a reverse-proxy web cache as one of the major infrastructure elements brings many benefits for large web applications: it reduces the load on your application servers, reduces average response times on your site, etc. But there is one problem every developer runs into when working with such a cache: cached content invalidation.

It is a complex problem that usually consists of two smaller ones: invalidation of individual cache elements (you need to keep an eye on your data and invalidate cached pages when the related data changes) and full cache purges (sometimes your site layout or page templates change and you need to purge all cached pages to make sure users get the new layout). In this post I’d like to look at a few techniques we use at Scribd to solve cache invalidation problems.


So, the first problem: ongoing cache invalidation when content changes. This is actually a pretty simple task in squid: you just use the HTCP protocol and send CLR requests to your caching farm (we didn’t find any existing HTCP implementations, so we implemented our own simple client that supports just that one command).

Since we use haproxy to balance our traffic in the cluster, it is hard to predict where we should send a purge request. So we fan those out to all cache servers.

To make sure cache purging won’t slow the site down, especially considering we need to do more than just a simple cache purge (submit documents to search indexes, etc.), we just spool a “document changed” request to a queue and then have a set of asynchronous processes that do all the work in the background.
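A minimal sketch of that spool-and-fan-out flow, not our production code. It assumes an in-process queue and threads in place of a real job queue and worker processes; the server addresses are made up, and `send_purge` is a hypothetical callable standing in for the HTCP CLR client.

```python
import queue
import threading

# Hypothetical cache server addresses (4827 is the conventional HTCP port).
CACHE_SERVERS = ["10.0.0.1:4827", "10.0.0.2:4827", "10.0.0.3:4827"]


def purge_worker(jobs, servers, send_purge):
    """Drain "document changed" jobs and fan each purge out to every cache."""
    while True:
        url = jobs.get()
        if url is None:            # sentinel: stop this worker
            jobs.task_done()
            return
        for server in servers:     # we cannot know which box holds the page
            try:
                send_purge(server, url)
            except OSError:
                pass               # a real worker would log and retry here
        jobs.task_done()


def start_workers(jobs, servers, send_purge, count=2):
    """Start `count` background workers that consume the purge queue."""
    workers = [
        threading.Thread(
            target=purge_worker, args=(jobs, servers, send_purge), daemon=True
        )
        for _ in range(count)
    ]
    for worker in workers:
        worker.start()
    return workers
```

The request path only calls `jobs.put(url)` and returns immediately; all network work happens in the workers. Putting one `None` per worker on the queue shuts them down cleanly.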

Next, The Hard Problem – handling full cache purges w/o killing our backend servers with 5x-10x traffic (our normal hit ratio is ~90-95%).

We’ve spent a lot of time thinking about this problem, and the first idea we came up with was to have a loop process somewhere that would iterate over all the documents we have cached and purge them one by one… but that does not seem to be a practical solution when you have tens of millions of documents (and a few page versions per document), and obviously the solution would not scale with a constantly growing document corpus.

So we kept brainstorming and finally got one idea that works just perfectly for us: what if we could take our traffic and define a function f(t) that returns the percentage of the traffic that should be purged at any moment in time? So we did it: we implemented an nginx module that versions our cache by assigning every cached page a revision (using custom HTTP headers + Vary-caching) and is able to slowly migrate the cache from one revision to another over a pre-defined period of time.
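To make the idea concrete, here is a small sketch of one way f(t) and the revision choice could work. It is an illustration under assumptions, not the module itself (which is an nginx module): the linear ramp, the per-URL hash bucketing, and all function names are hypothetical.

```python
import hashlib


def purged_fraction(now, start, duration):
    """f(t): share of traffic that should already use the new revision."""
    if duration <= 0:
        return 1.0 if now >= start else 0.0
    # Assumed shape: a linear ramp from 0 to 1 over the purge window.
    return min(1.0, max(0.0, (now - start) / duration))


def url_bucket(url):
    """Map a URL to a stable value in [0, 1)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def revision_for(url, now, start, duration, old_rev, new_rev):
    """Pick the cache revision to request for this URL at time `now`."""
    if url_bucket(url) < purged_fraction(now, start, duration):
        return new_rev
    return old_rev
```

Because each URL's bucket is fixed and f(t) only grows, a page that has moved to the new revision never flips back. The chosen revision would then travel in the custom header that the cache varies on, so old and new copies of a page are stored as separate cache entries.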

With that module we are able to do so-called “slow” cache purges that can take anywhere from a few minutes (which still helps to reduce the load spike generated by the hottest content) up to many hours (this is what we normally use) or days (we have never used this option, but it is definitely possible).

Here is an example 100% cache purge over an 8 hour interval:

  1. [figure: daily hit ratio graph]
  2. [figure: weekly hit ratio graph]

As you can see, during those slow purges our cached pages are slowly updated without putting too much pressure on the backend. The cache hit ratio slowly degrades and then slowly gets back to its normal levels, but with our normal (6-8 hour) purges the hit ratio never gets lower than 65-70%, which lets us save huge amounts of money by not keeping 90% spare capacity just for cache purge load surges (we used to have lots of spare application cluster capacity before introducing this approach).
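The effect on the backend can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that assumes backend load is proportional to the cache miss rate; the function names are illustrative.

```python
def backend_share(hit_ratio):
    """Fraction of requests that miss the cache and reach the backend."""
    return 1.0 - hit_ratio


def load_multiplier(normal_hit_ratio, purge_hit_ratio):
    """Backend load during a purge relative to normal operation."""
    return backend_share(purge_hit_ratio) / backend_share(normal_hit_ratio)
```

Under that assumption, dropping from a 90% hit ratio to 70% means roughly 3x the normal backend traffic (`load_multiplier(0.90, 0.70)`), and dropping from 95% to 65% means roughly 7x (`load_multiplier(0.95, 0.65)`).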




Advanced Squid Caching in Scribd: Hardware + Software Used

August 4th, 2009

After the previous post in this caching-related series I received many questions about the hardware and software configuration of our servers, so in this post I’ll describe our servers’ configs and the motivation behind them.

Hardware Configuration

Since the Squid server in our setup uses a single-process model (with asynchronous request processing), there was no point in ordering multi-core CPUs for our boxes, and since we have lots of pages on the site and the cache is pretty huge, all the servers ended up being highly I/O bound. Considering these facts, we decided to use the following hardware specs for the servers:

CPU: One pretty cheap dual-core Intel Xeon 5148 (no need for multiple cores or really high frequencies: even these CPUs have ~1% avg load)
RAM: 8 GB (basically to reduce I/O pressure by caching hot content in RAM)
Disks: 4 x small SAS 15k drives in JBOD mode (no RAID: we tried all kinds of RAID configs and they did not help with I/O performance)

So, once again: nothing is as important in a squid box as I/O throughput.

Here is a sample CPU load graph from one of the boxes:

[figure: squid CPU load graph]

Software Configuration

This could be a long story, but in a few words our experience with different squid versions was the following.

First, when I started working on this caching project I just installed squid using Debian’s apt-get install squid command. As a result we got some ancient squid 2.6 release that for some reason (still unclear to me) was painfully slow in I/O operations and leaked file descriptors, so after a few hours under production load the box would simply stop processing requests.

When the first approach failed, I decided to go to the squid web site, download the latest production release and install it from source (yes, we do this all the time when an OS vendor ships releases that are too old or buggy). The result: a freaking fast and stable squid 3.0 which worked flawlessly for about 5 months.

A few months ago we found out about the stale-* extensions available in squid 2.7, and I started wondering if we should change our perfectly stable 3.0 setup to 2.7. Some time later I decided to use the Vary HTTP header in our caching architecture, and then I found out that vary-caching is correctly implemented only in 2.7: since 3.0 is a complete rewrite of the 2.X branch, vary-caching is not yet implemented there (or not in the way we’d want it to be implemented).

So, the final result: at this moment we’re using a custom-built Squid 2.7.STABLE6 and are really happy with it. It is a stable, fast and feature-rich caching proxy server.

Caching Cluster Configuration

Obviously we have more than one squid server at Scribd, and this makes those servers a bit harder to use (compared to a single box, where you would send all requests to one IP:port pair). We tried round-robin balancing for the squid boxes plus ICP-based neighbor checks, but it added latency to our responses, so we decided to put a haproxy load balancer between nginx and the squid farm and set up URL-hash-based balancing to distribute requests evenly among the squid backends.

This scheme worked pretty well, but we had one serious problem with it: if a squid box went down, haproxy would quickly detect the problem and remove it from the pool. And here comes the problem: removing a server from the pool completely changes the hash key space, so requests are routed to different boxes and all cached content effectively becomes invalid. To solve this problem we developed an nginx balancer module that performs consistent hashing of URLs, and we are now testing this module in production. What is really good about this module is that it removes one hop from the chain of HTTP proxies between the site and the user.
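The difference between the two balancing schemes can be shown with a short sketch. This is an illustration only, not the nginx module: the MD5-based hash, the 100 virtual nodes per server, and all names are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash of a string."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def modulo_pick(url, servers):
    """Plain hash balancing: the result depends on the pool size."""
    return servers[_hash(url) % len(servers)]


class HashRing:
    """Consistent hashing with virtual nodes placed on a ring."""

    def __init__(self, servers, replicas=100):
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def pick(self, url):
        # First ring point clockwise from the URL's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._keys)
        return self._points[idx][1]


def moved_fraction(urls, pick_before, pick_after):
    """Share of URLs that land on a different server after a pool change."""
    moved = sum(1 for url in urls if pick_before(url) != pick_after(url))
    return moved / len(urls)
```

With four servers and one removed, modulo hashing remaps most URLs (about three quarters in expectation when the last server in the list is dropped), while the ring only remaps the URLs that lived on the removed server, roughly a quarter. Every other URL keeps hitting the box that already has it cached.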

So, this was a short description of what hardware we use for our caching cluster and why we use it. In the next posts of this series we’ll talk about cache control and object invalidation.


WebStack 1.5 — Your (L)AMP Stack

July 30th, 2009

Sun's LAMP support is assembled from two pieces: the L is from our Linux/GNU Support, and the AMP is from GlassFish WebStack, which in its latest incarnation includes Apache HTTP Server, lighttpd, memcached, MySQL, PHP, Python, Ruby, Squid, Tomcat, GlassFish (v2.1) and Hudson.

The inclusion of Hudson is a bit of an opportunistic move (more on that in a bit); the rest comprises a well-tested, integrated, optimized, and extended component stack for your new and old Web Apps.

The WebStack can be downloaded here; the bundle includes the WebStack Enterprise Manager, which, unlike the other components, is not free right-to-use but rather is available with an eval license; this is a model like that of the GlassFish Enterprise Manager. The current release supports RHEL, Solaris and OpenSolaris (it is bundled in OpenSolaris); for additional details, check out the Documentation and Discussion Forum.


Check out these posts from the WebStack team:

• CVR's Announcement and Overview.
• CVR's note on two key properties: Fully Relocatable, and Updatable.
• Sriram on Installing AMP stack within GlassFish Web Stack 1.5.
• Irfan on the Enterprise Manager's Navigation Panel.
• Jeff on Installing via IPS tools.