The Difficulty of High Volume Servers

Well, it has been far too long since I’ve posted anything. My apologies. For the last several months, I’ve been working on some interesting projects in the IPTV (Internet Protocol Television) world. In particular I’ve been working on a Real-Time Encryption Server (RTES) for broadcast quality mpeg streams. A typical broadcast stream will run about 3.5-4.0 Mbps. That’s a good chunk of data. In order to encrypt it real-time, you need to have a fairly lean pipeline. Again, no surprises. The trick is getting a server to handle several of these streams concurrently. When I started on this project, the current RTES was capable of handling 20 streams, or channels. For the typical TV service, such as your local cable company, this means that between three and ten servers are needed at the head-end to encrypt all of the channels. This becomes very expensive once you take into account the need for redundant server hardware, giga-bit routers to manage everything, and so forth. Besides, when you look at the numbers, 20 channels is really only 80Mb of data per second. A lot, certainly, but nowhere near the limit for giga-bit, and really only approaching the limit for fast-ethernet.

So why could we only handle 20 channels? Well, it turns out that the previous server was starving CPU resources due to the fact that between 2 and 4 threads were being allocated for each channel. The server was falling under the crushing burden of context switches between all of these threads.

Well, here was a nice chance to dive in and see what we could do. There happens to be a nice little system call to handle querying multiple sockets for data, select(). Sure it has its problems, but for the most part, select is quite good at polling a group of sockets to see if there’s data waiting on them.

After a bit of rewriting, we got RTES to use select to query each socket and notify an object that the given socket was ready for reading. At that point a small thread-pool was used to execute the encryption pipeline on that socket. This yielded a 300% gain. We can now handle 60 channels. An impressive gain, however the story doesn’t end there. With this new design, RTES is now very sensitive to buffer overruns in the network driver. When that happens, flow control is triggered and all threads block on IO, or return a “wouldblock” error in our case. Once flow control is triggered, no sockets may be read from until all IO in the single current thread that’s handing the flow finishes ALL IO. By the time that’s done, everything has fallen so hopelessly behind that flow control is triggered on the next socket read, and the next, and so on. I describe it as the kid on a skateboard holding on to the bumper of the bus. Once he lets go, he can’t ever catch up, and he falls on his face. But I digress.

How do we avoid the flow control problem? Since this is all functionality in the network driver, we can’t really tweak too much, however we can wildly increase buffer sizes so flow control is triggered far less often. This seems to help quite a bit. But it really just pushes the problem farther away, not a real resolution.

Well, I went digging for other designs that could be used. A buddy here at work pointed me at a paper written by Matt Welsh, David Culler, and Eric Brewer at UC Berkely. They describe a modular architecture that they call SEDA (Staged Event Driven Architecture). It’s a fairly interesting read, and strangely enough, right around the time they came up with this back in 2001, a few friends of mine and I were working on a generic data processing engine, for an outfit called Create-A-Check, using a very similar architecture.

So what is SEDA? Well, basically it is a collection of objects, called stages, that each handle a small component of processing. A web server stage list might look like this.

  1. Listen for connections
  2. Read request
  3. return cached static page
  4. fetch static page and add to cache
  5. fetch dynamic page

Each stage has a small queue at the front that holds incoming events. Each stage then acts on the event, then forwards it to the next stage. The execution would look like tree traversal, stage 1 to stage 2 to stage 3, stage 1 to stage 2 to stage 5, etc.

So what are the advantages to this modular architecture? Well, for one, you can dynamically adjust your pipeline based on things like server load. For example, if a server is getting slammed, it can start to reject dynamic page requests, and just handle static requests. Perhaps it can redirect requests that would require a cache-refresh, or insert a new stage that just returns a global “Service is Down” page for all requests. The business rules could be applied in a much more fine-grained fashion, and it would be easier to change them.

So now we’re looking at ways that SEDA could possibly help us solve our problems with high-volume multicast problems that RTES is facing. We are also looking at applying this model to our key request/certificate generation/verification servers. SEDA seems like a perfect fit for them.

In all of this, I’ve learned a couple of things. When it comes to high performace servers, it’s not the first 90% of capacity that’s the hard part. It’s that last 10%. Everything gets a bit blurry, and things can degrade very quickly, and unexpectedly. Handling load gracefully is a fun trick, and there’s no real silver bullet. As computing progresses, and high volume servers become more and more prevelant, new techniques will surely develop. It should be a fun ride.


Comments are closed.