A long answer to a very good question

Here’s a really interesting question posted in our forum about Minimum CPU Level. I felt that it needed a thorough answer, so you can read both of them here.

Julez says, “I currently have a dedicated server and traffic is very steady and relatively low, so the flexiscale’s pay-as-you-go pricing works for me. I also like the added hardware reliability of flexiscale – not being reliant on any one hardware component.

However, I don’t like the idea of my server’s performance dropping below an acceptable level due to high levels of activity on other servers. This was why I moved off shared servers & onto dedicated in the first place.

What concerns me about flexiscale is that it will attract other clients with wildly fluctuating traffic with occasional very high peaks such as Cheddarvision. These are exactly the kind of rack neighbours I want to avoid.

Any server is impacted by high network traffic to another server in the same rack, but at least a dedicated server guarantees 100% CPU availability.

I realise that stressed servers can be switched to receive more resources and this reduces the risk of impact on other servers.

However, what I need to know is that even in a WORST CASE scenario, my servers will get a minimum level of CPU.

If this is a planned feature, is there an ETA?”

My reply:

This is a very good question indeed. If there is one question that has occupied us internally more than any other, it is this one. QoS in a virtualised environment is kind of the Holy Grail really. How can we guarantee a certain level of performance to our customers? Unfortunately, due to the complexity of the challenge, this will be a long forum post. You had better go and get a coffee and make yourself comfortable.

Before I go into an in-depth explanation, let me introduce myself. I am the COO and acting CTO of XCalibre. My team and I have been working on this platform for the last 18 months. Besides my operational responsibilities, I have been working on this new platform day and night. I haven’t started dreaming in FlexiScale API calls …yet, but I think it is just a matter of time.

Background
Let’s go back to the very beginning. You could claim that virtualisation is nothing other than a very advanced version of shared hosting. There are a number of key differences though:

Each Virtual Dedicated Server runs in its own user space, which allows you to mix Windows Server with Linux and various versions of Apache, MySQL and PHP. This was not possible with Shared Hosting and was a problem if a customer wanted slightly different LAMP stack versions than the average user.
The Xen engine provides increasingly good control over the CPU an individual guest system can consume. We will be doing an upgrade to a new Xen engine on October 6th that will give us massively improved CPU control compared to what we have today. Again, Shared Hosting didn’t provide any control over how much an individual customer was using, making the platform unpredictable.

So what are the key contention factors in a platform like FlexiScale? There are four:
1. Memory
2. CPU
3. Network Bandwidth
4. Storage Bandwidth

Memory
While Memory is a big factor for the provider, it isn’t really the same for the customer. Once the Memory is allocated to you, it is yours and yours only. It is not shared in any shape or form.

CPU
Things are a bit different for CPU. The total of available CPU resources in a physical server is shared among all customers that are currently ‘sitting’ on that box. It is correct that other customers running on the same physical box as you can have an impact on your performance.

As mentioned before though, Xen provides us with a very good level of control over CPU. We can control how much of the overall CPU of the physical server a single guest system can consume (from 1% to 100%). This already gives us a good way to stop a customer ‘stealing’ your CPU cycles. Besides that, Xen has a built-in CPU scheduler that lets you play with priorities and multiple virtual CPUs at guest level. This again will stop someone dominating the CPU consumption. We currently give all customers the same level of priority, but we will be launching a detailed study in November that will look into the impact of giving customers different levels of priority and numbers of virtual CPUs.
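
To make the cap and priority idea a little more concrete, here is a minimal Python sketch of proportional sharing with per-guest caps. It is an illustration only and not Xen's actual scheduling algorithm. The function name `allocate_cpu`, the guest names, and the weight and cap values are all hypothetical, and it assumes every guest is busy and wants as much CPU as it can get.

```python
def allocate_cpu(guests, total=100.0):
    """Split `total` percent of host CPU among busy guests.

    guests: dict mapping guest name -> (weight, cap), where weight is a
    positive relative priority and cap is the maximum percent of host CPU
    the guest may consume. All values here are hypothetical.
    """
    allocation = {}
    active = dict(guests)
    remaining = total
    while active:
        total_weight = sum(weight for weight, _ in active.values())
        # Tentative share of what is left, proportional to weight.
        share = {name: remaining * weight / total_weight
                 for name, (weight, _) in active.items()}
        over = [name for name, (_, cap) in active.items() if share[name] > cap]
        if not over:
            allocation.update(share)
            break
        # Pin guests that would exceed their cap, then re-share the rest.
        for name in over:
            _, cap = active.pop(name)
            allocation[name] = cap
            remaining -= cap
    return allocation


if __name__ == "__main__":
    # One capped guest, one normal guest, one guest with double priority.
    print(allocate_cpu({"web": (1, 10), "db": (1, 100), "batch": (2, 100)}))
```

In this sketch a guest capped at 10% cannot take more than that however busy it is, and the CPU it leaves behind is split among the others according to their weights. With equal weights and no effective caps, four busy guests end up with 25% each, which is the kind of worst-case floor the original question is about.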

Network Bandwidth
Network Bandwidth is shared as well. However, in our view, this should never be a concern. In total, our physical servers have multi-Gigabit Network Bandwidth available. Different types of traffic are totally separated: user, management, inter-server traffic and storage access.

Beyond the Bandwidth for an individual server, we will always ensure that we have plenty of free Bandwidth in our switched/routed network architecture and in our IP transits into the internet. This is not FlexiScale-specific; it is something we have been doing for the last 10 years.

Therefore, within reason, you will never even notice if another customer, whether in the same server, in another server or in another rack, pushes a very large amount of traffic (e.g. the Cheddarvision example, which did not affect any other customer adversely).

Storage Capacity & Bandwidth
Available Storage Bandwidth is determined by the network (physically separate from the other bandwidth) connecting the physical servers to the storage backend and by the capability of the storage controller to handle I/O requests. Storage Capacity is simply dependent on how many hard disks you bought, as simple as that.

FlexiScale is built upon a top-quality, high-end storage product from a Tier 1 manufacturer that can scale up in terms of Storage Capacity and Storage Bandwidth on demand. We closely monitor our storage backend, which provides us with comprehensive management tools, and will always add more resources before our customers notice.

The missing link
The key problem is that there is currently no applicable performance benchmark available for providers like us. We are in talks with Amazon (EC2 group) and others to move things forward.

Obviously, we are closely monitoring our competitors and what kind of information they provide to their customers. Notably, Amazon EC2 provides some form of performance figure when they quote that each virtual machine has the equivalent of a 1.7GHz x86 processor. I am personally not really sure whether this is meant as a relative or an absolute performance measure. Check http://en.wikipedia.org/wiki/X86 and figure it out yourself. x86 stands for an instruction set and does not refer to a specific type of processor, e.g. Pentium or Xeon.

Not surprisingly, Amazon very recently got into trouble with their performance numbers:
http://developer.amazonwebservices.c…6912&tstart=60,
http://developer.amazonwebservices.c…2297&tstart=60

We don’t want to repeat that mistake.

My CEO pushed me hard to release a number, too, but the core team came to the conclusion that while we would love to do so, we do not feel comfortable doing it. However, we haven’t ignored the challenge and have been very active in tackling it in our own way with short-term and strategic long-term solutions.

What are we doing to address the issue today
There are a number of things we are already doing to provide customers with consistent performance. These are:

Since we are providing a high-end virtualisation platform, we are using a relatively low level of server contention (number of VDSs per CPU). This is really the most important factor when looking into predictable performance. If you pile too many guest systems into a single piece of HW, even the best monitoring system and management platform cannot provide you with good performance.
We collect detailed information in real-time about all key performance factors of our platform – available memory, CPU, network bandwidth and storage capacity/bandwidth. We can look at the general trends and purchase new equipment accordingly. This is really important – we will NEVER run out of free capacity. The more customers buy from us, the more of the profits will be directly re-invested to scale the platform up. Just imagine the face of my sales guys if I tell them that they have to stop selling to all these hungry customers because I would rather retain all profits for me!
Our management platform monitors the server load of each individual server in real-time and should one physical server move into amber/red, it will automatically select one or more VDSs that will move (LiveMigrate) to another server with lower CPU load. This process is fully transparent to the end-user. Xen guest systems are agnostic of physical HW and can be moved to any physical server in the same cluster.
Looking beyond a single individual server, it could well make sense to buy two smaller FlexiScale servers (2x 512MB instead of 1x 1GB) and load-balance them. With linear memory costs, this could be a more effective solution. Furthermore, it allows you to do your own application upgrades without taking the service down (not true for all architectures, needs to be reviewed individually) and it spreads the risk.
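
The automatic migration step described above can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not our management platform's real logic: the 70% amber threshold, the function name `plan_migrations`, the host and VDS names, and the "move the busiest VDS to the least loaded server" rule are all hypothetical.

```python
AMBER = 70.0  # hypothetical threshold, percent CPU load on a physical server


def plan_migrations(hosts, threshold=AMBER):
    """Return a list of (vds, source, destination) moves.

    hosts: dict mapping host name -> dict of VDS name -> CPU load percent.
    A host above `threshold` sheds its busiest VDS to the least loaded
    other host, provided the destination stays at or below the threshold.
    """
    placement = {host: dict(vdss) for host, vdss in hosts.items()}
    load = {host: sum(vdss.values()) for host, vdss in placement.items()}
    moves = []
    # Deal with the most heavily loaded hosts first.
    for src in sorted(placement, key=lambda host: -load[host]):
        while load[src] > threshold:
            dst = min((host for host in placement if host != src),
                      key=lambda host: load[host], default=None)
            if dst is None:
                break  # only one host in the cluster
            # Try the busiest VDS first, then smaller ones.
            candidates = sorted(placement[src].items(), key=lambda kv: -kv[1])
            for vds, cpu in candidates:
                if load[dst] + cpu <= threshold:
                    del placement[src][vds]
                    placement[dst][vds] = cpu
                    load[src] -= cpu
                    load[dst] += cpu
                    moves.append((vds, src, dst))
                    break
            else:
                break  # nothing fits anywhere, leave the host as it is
    return moves


if __name__ == "__main__":
    cluster = {
        "host-a": {"vds-1": 40.0, "vds-2": 35.0, "vds-3": 15.0},
        "host-b": {"vds-4": 20.0},
        "host-c": {"vds-5": 40.0},
    }
    print(plan_migrations(cluster))
```

In this example host-a starts at 90% and sheds vds-1 to host-b, after which every host is below the threshold. A real system would also need to check memory on the destination and avoid moving the same VDS back and forth, which this sketch ignores.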

There are a few things we are in the process of doing. These are:
We will kick off a detailed study of how the new Xen Resource Scheduler works in November 2007. It lets you play with VDS CPU priority and the number of virtual CPUs within the VDS. We need to understand how these two parameters affect a fully loaded server and the individual VDSs. It is our intention to release more granular per-hour pricing as soon as we are confident enough in our understanding of how these two parameters interact and can therefore justify different pricing.
We have launched a major research project with a prestigious UK university that in a nutshell will eventually allow us to provide QoS. Details will be released closer to a launch date.

OK, I have finished rambling. The answer to your question, “I need to know that even in a WORST CASE scenario, my servers will get a minimum level of CPU”, is that you will always get a good level of performance from FlexiScale. We offer a high-quality product and good performance is key to that. Our reputation hinges on it.

Totally predictable VDS performance (QoS) will become reality once we release the first results from our research project. Additionally, we will soon be more specific about our performance once a formal body publishes an officially recognised benchmark.

Watch this space.

Phil Huber, CTO XCalibre Communications Ltd.
