The last couple of weeks have certainly been a bit of a rollercoaster ride here, although it’s finally ending on a very positive note.
For several weeks now we’ve had a significant problem when restarting physical servers: they took a long time to come back online because of the initialisation they needed to do. This was manageable when we only needed to restart one server, but when we needed to restart several it became rather frustrating.
This has compounded the other problems we have had in the last couple of weeks, where on several occasions (for an upgrade, a power outage and then a switch replacement) the entire platform needed to be re-initialised (either bit by bit, which we can do without service interruption, or completely). Until yesterday this process could unfortunately take 7 hours or more. I’m therefore very pleased to let you know that this problem has now been completely fixed, thanks to some innovative and rather clever work by our engineers, and the initialisation of a server now takes 30 seconds.
This should ensure that, if we do have any problems in the future (fingers crossed, but sod’s law is fairly hard to avoid!), we can recover from them very quickly.
So, on to yesterday’s problem.
There was a very brief power outage (a few seconds) at the main datacentre we use for FlexiScale, caused by human error. We have been reassured that it won’t happen again, as the process that was being carried out at the time is being modified to prevent this.
The outage caused a spike to hit some of our equipment, and although the vast majority (some 100 servers) came back OK, we started to see some intermittent issues with our core FlexiScale switches.
I should point out at this point that the switches were in a redundant configuration, and we did have an arrangement to obtain additional switches within a matter of hours should one fail. We didn’t consider both failing at the same time a realistic risk; now we know better.
The switches were still functioning to a degree, so we left them running whilst we got the two replacement switches delivered (which involved yours truly being the courier for them to speed up the process!). These then needed to be installed, configured and patched into the network, which duly happened, and then the platform was brought back online.
Needless to say, we have learnt a lot from these last few days. Here are a few of the things we have achieved or are going to be changing:
- We’ve upgraded the software running the system to a newer version, which has a lot of improvements in the stability of individual servers.
- We’ve fixed the problem with initialising servers, which will help enormously in the long run.
- We will be investigating powering parts of our cage from different sides of the datacentre to ensure maximum redundancy (including the switches being on completely separate feeds!).
- We will be working out a better plan for coverage of key equipment (even in cases where it is in a redundant configuration) to ensure multiple-failure situations can be dealt with more effectively.
Overall, I’d like to say thank you for the support we’ve received from customers during this time. We look forward to continuing to bring you more innovative features and a highly reliable service in the future. We have some very exciting features being released over the next few months, and look forward to showing them to you.
Tony Lucas
Chief Executive Officer