Naughty Cloud Servers, Naughty
Those of you following me on Twitter or Facebook will have seen that the past week or so has been a bit stressful, mainly due to naughty servers misbehaving. At work we have a set of redundant systems in production, so failures don't have too much impact (you still have to fix them of course, or your redundancy vanishes). It's in the staging and test areas where we have the most naughty-server fun, as we're trying out new features, newer versions of software or pushing systems to their limits to get a feel for how they will behave under load. However, on our test setup we don't have the redundancy, because they aren't critical systems and there's no great loss if they fall over.
This is just as well really, as we've had some interesting issues with Amazon's EC2 services, which provide the cloud computing infrastructure for some of our systems. It was Amazon who used to host Wikileaks, and who has since been under 'cyber-attack' by groups trying to take revenge for Wikileaks being kicked off.
The problem can be summed up with this little spot-the-difference game:
Both charts are of data traffic into and out of our systems: the green area is incoming data, blue is outgoing to our data processing cluster.
Chart #1 shows what the traffic is supposed to look like: a gentle sinusoidal wave which rises and falls with demand. That's our production server. The outgoing data wobbles if we restart systems further down the processing chain - there's a trough when they stop consuming data, and a peak when the buffered data is sent on.
Chart #2 shows the main interface to our test servers: they tap into the live data stream but are isolated by a 'siphon' script, which usually provides an identical copy of our data, but disconnects in the event of any problems, so as to protect the live systems from any errant behaviour in testing. Thus, the green part of both graphs should be broadly the same shape if everything works correctly (the axes are different, as the blue line varies depending on the number of data consumers we have running).
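For the curious, the siphon idea can be sketched in a few lines of Python. This is only an illustration and not our actual script: the `Siphon` class, the `test_sink` callable and the `max_seconds` threshold are all made-up names, and I'm assuming "any problems" means either an error from the test side or a copy that takes too long.

```python
import time
from typing import Callable, Iterable, Iterator


class Siphon:
    """Pass live records through untouched, copying each one to a test sink.

    Hypothetical sketch: the first error or slow write from the test side
    disconnects the copy, and the live stream carries on regardless.
    """

    def __init__(self, test_sink: Callable[[bytes], None], max_seconds: float = 0.5):
        self.test_sink = test_sink
        self.max_seconds = max_seconds  # assumed threshold for "too slow"
        self.connected = True

    def tap(self, live_stream: Iterable[bytes]) -> Iterator[bytes]:
        for record in live_stream:
            if self.connected:
                started = time.monotonic()
                try:
                    self.test_sink(record)
                except Exception:
                    # Test side misbehaved: stop copying, protect live.
                    self.connected = False
                else:
                    if time.monotonic() - started > self.max_seconds:
                        # Test side is dragging its feet: stop copying.
                        self.connected = False
            # The live consumer always gets the record, whatever happened above.
            yield record


# Usage: a test sink that falls over on the third record.
received = []

def flaky_sink(record: bytes) -> None:
    if record == b"c":
        raise ConnectionError("test box unreachable")
    received.append(record)

siphon = Siphon(flaky_sink)
live = list(siphon.tap([b"a", b"b", b"c", b"d"]))
```

The point of the design is that the live path never waits on, or fails because of, the test path: once the copy is cut, production keeps flowing and someone has to reconnect the siphon by hand.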
Clearly, we had a problem. It looks like a two-year-old child has scribbled on it with crayons (to be fair, my drawing isn't much better). There are gaps, peaks and troughs all over the shop. Gaps indicate that our monitoring systems were unable to contact the server. Because the server has been out of contact or slow for random periods, the incoming data ceases to be a smooth line: the system 'catches up' with the data stream, bursting stored data over to the test system.
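If you'd rather not eyeball the chart, the gaps can be picked out of the raw poll times. Here's a rough sketch, assuming the monitoring system polls at a fixed interval and records a timestamp for each successful poll; `find_gaps`, `expected_interval` and `tolerance` are names I've invented for the example, and the poll times are made up.

```python
from typing import List, Tuple


def find_gaps(
    sample_times: List[float],
    expected_interval: float,
    tolerance: float = 1.5,
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive polls are too far apart.

    A pair is reported when the spacing exceeds expected_interval * tolerance,
    i.e. at least one poll appears to have been missed.
    """
    gaps = []
    for earlier, later in zip(sample_times, sample_times[1:]):
        if later - earlier > expected_interval * tolerance:
            gaps.append((earlier, later))
    return gaps


# Illustrative poll times in seconds, for a monitor that polls once a minute.
polls = [0, 60, 120, 300, 360, 600]
gaps = find_gaps(polls, expected_interval=60)
```

With those made-up numbers, the stretches between 120 and 300 seconds and between 360 and 600 seconds are flagged, which is exactly where a chart would show white space.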
It all points to some pretty horrible latency on the network and coincides with attempts to bring Amazon EC2 down as a result of them removing Wikileaks.
Ouch.
Spinning up a new EC2 box has given us a nice waveform again, so hopefully the new one is in a better neighbourhood.
