November, 2008
by Michael Ley
A few weeks ago I was phoned by a salesman who was trying to sell a "performance measurement tool". It was a bad time to call: I was under pressure to deliver some work, so I tried my usual brush-off / delaying tactics ~ "now is not the right time, phone me back in a few months". However, this salesman was more persistent than most and he continued on about how the tool could manage "system capacity".
Thoroughly confused about what the tool was attempting to do, I asked him what I thought was a simple question, "what do you understand to be the difference between capacity and performance?" ...and I got blubber.
Many, many words came across the phone line, mainly a re-iteration of the latest management jargon ~ "allows you to pro-actively measure performance", "dynamically identifies capacity bottlenecks", and "optimises platform efficiency". The one thing that was missing was a clear definition of the difference between the meanings of the terms "capacity" and "performance".
Later I thought, 'if this is what specialist companies in the area of capacity and performance are presenting to senior managers, then no wonder people are confused about the role of capacity / performance management and the tools that are required'.
So to bring a little clarity to the area let me propose the following definition of the terms, capacity and performance:
Thus, performance is about response times, and performance measurement tools are tools that measure response times. Capacity is about some activity over time, e.g. bytes moved over a period, from which utilisation can be derived. However, in the real world there may not always be a clean divide between the two, and several tools collect both capacity and performance data.
As phrased, these definitions may appear a little arcane. They certainly are not readily explainable to a manager. To address this I have developed the following highway analogy in the hope of making the difference between the two terms more readily understood.
Performance is the time it takes me to drive from London back to the family home in Wales ~ a distance of about 200 miles.
Capacity is the number of cars per hour the M4 motorway* can handle.
By extension, utilisation is the number of cars per hour travelling on the M4, divided by the maximum number of cars per hour that the M4 can handle.**
And high utilisation on the M4 certainly may lead to delays because of queuing, which in turn leads to elongated travelling time, i.e. reduced performance.
"*" The M4 is the direct motorway ( = Interstate highway) from London to Wales and the West of England.
"**" The procedures for estimating highway capacity may be found at http://www.fhwa.dot.gov/ohim/hpmsmanl/appn2.cfm ~ source Barry Sokolik
This analogy highlights that the terms performance and capacity, while interdependent, are not the same thing, nor are they interchangeable.
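To put rough numbers on the analogy, the sketch below computes utilisation as flow divided by capacity, then feeds it into the textbook single-queue (M/M/1) formula R = S / (1 - utilisation) to show how journey time might stretch as the road fills. The capacity and free-flow journey time are invented figures, and a real motorway does not behave like a single M/M/1 queue, so the output only illustrates the shape of the relationship.

```python
def utilisation(flow_per_hour: float, capacity_per_hour: float) -> float:
    """Fraction of capacity in use over the measurement interval."""
    if capacity_per_hour <= 0:
        raise ValueError("capacity must be positive")
    return flow_per_hour / capacity_per_hour


def mm1_response_time(service_time: float, rho: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time / (1.0 - rho)


# Hypothetical figures, chosen only for illustration.
CAPACITY = 6000.0       # cars per hour the road can carry (assumed)
FREE_FLOW_HOURS = 3.0   # journey time on an empty road (assumed)

for flow in (1500, 3000, 4500, 5400):
    rho = utilisation(flow, CAPACITY)
    hours = mm1_response_time(FREE_FLOW_HOURS, rho)
    print(f"{flow:>5} cars/h  utilisation {rho:4.0%}  journey ~{hours:4.1f} h")
```

Capacity stays fixed in every row. Only the utilisation changes, and the performance figure changes with it.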
In the real world, it is perfectly possible to have a system that is a "poor performer" even though the utilisation is low. For example, a service that only processes one transaction per minute may have a low CPU utilisation yet, because the transaction path length is long (i.e. the transaction hasn't been tuned) the time taken to execute the transaction is long and performance may be regarded as poor.
In the highway analogy, one can see that on an empty M4 a highly tuned car, such as a Ferrari, will take less time to complete the journey than my little car.
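The low-utilisation, poor-performance case can be worked through with simple arithmetic. The figures below are assumptions made up for the example (one transaction per minute, 12 seconds of CPU per transaction). They are not measurements from any real system.

```python
# Hypothetical workload: one transaction per minute, each needing 12 s of CPU.
ARRIVALS_PER_MINUTE = 1
CPU_SECONDS_PER_TXN = 12.0   # long path length (assumed figure)
INTERVAL_SECONDS = 60.0

# Capacity view: how busy is the CPU over the interval?
busy_seconds = ARRIVALS_PER_MINUTE * CPU_SECONDS_PER_TXN
cpu_utilisation = busy_seconds / INTERVAL_SECONDS

# Performance view: even with no queuing at all, the user waits this long.
best_case_response = CPU_SECONDS_PER_TXN

print(f"CPU utilisation: {cpu_utilisation:.0%}")
print(f"Response time with no queuing: {best_case_response:.0f} s")
```

Under these assumptions the CPU is only 20% busy, yet no user sees a response faster than 12 seconds. Adding capacity would not help; shortening the path length would.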
Of course, in the real-world environment the time difference getting to Wales between a Ferrari and my car is much less significant. Traffic jams (queuing, to capacity / performance people) mean that the faster car is unable to use its speed advantage all the time, and this brings home a major truth
The highway analogy can be extended to explain how priority systems work. It is a hard thing to say but a priority system is a rationing mechanism, in that it shares out a fixed resource by favouring or limiting some groups at the expense of others. On certain parts of the M4 there is a dedicated bus lane, which means that normal traffic can only use the outer two lanes. Thus, the normal three lanes of traffic are pushed into two and one lane is left almost empty. The result is that buses get to their destination quicker than I (or the Ferrari) can, even though my car has the ability to perform better than a bus (just).
Priority systems are better accepted, however, if the performance of the group that is being discriminated against still meets user expectations. If not, then user dissatisfaction will arise (as some government minister found out with regard to the M4 bus lane).
Defining user expectations beforehand, by means of an SLA, is therefore almost essential to making a priority system work, and understanding the difference between capacity and performance is immensely important in writing that SLA.
Many Service Level Agreements (SLA) or Operational Level Agreements (OLA) use utilisation as a target measure; however, utilisation is a measure of work and not of user experience. How a particular level of utilisation relates to user experience can depend upon many things. High utilisation need not always slow performance, because in some circumstances you can run systems at high utilisation with very little queuing. Thus, an SLA or OLA that defines a utilisation level merely limits the work that can be done, and this can be wasteful.
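One circumstance in which high utilisation causes little queuing is when work arrives at regular intervals rather than in bursts. The sketch below is a minimal single-server simulation using the standard Lindley recursion. It compares evenly spaced arrivals with random (exponentially distributed) arrivals at the same 90% utilisation. The service time, arrival rate and random seed are all assumptions chosen for illustration.

```python
import random


def mean_wait(interarrivals, service_time):
    """Mean queuing delay at a single server (Lindley recursion).

    Each job's wait is the previous job's wait plus its service time,
    minus the gap before the next arrival, floored at zero.
    """
    wait = 0.0
    total = 0.0
    for gap in interarrivals:
        wait = max(0.0, wait + service_time - gap)
        total += wait
    return total / len(interarrivals)


N = 100_000
SERVICE = 0.9    # seconds of work per job (assumed)
MEAN_GAP = 1.0   # mean seconds between arrivals, so utilisation is 90%

rng = random.Random(42)  # fixed seed so the run is repeatable
evenly_spaced = [MEAN_GAP] * N
random_gaps = [rng.expovariate(1.0 / MEAN_GAP) for _ in range(N)]

print(f"utilisation: {SERVICE / MEAN_GAP:.0%}")
print(f"evenly spaced arrivals, mean wait: {mean_wait(evenly_spaced, SERVICE):.2f} s")
print(f"random arrivals, mean wait:        {mean_wait(random_gaps, SERVICE):.2f} s")
```

Both workloads keep the server 90% busy, but only the random one builds a queue. A utilisation target alone cannot tell the two apart, whereas a response-time target can.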
As a final point, in the highway analogy the measure of capacity does not have to be "cars per hour"; it could equally be cars per day. This brings out the point that it is not meaningful just to say that utilisation is 60%, since utilisation is a measure of work over a period of time and the time interval should always be specified, for example: "CPU utilisation in the peak hour is 60%". Without this qualification the measure is meaningless, as there is a significant difference between a system that has a peak hour utilisation of 60% and one that has a peak day utilisation of 60%, or even 60% utilisation over a week.
Similarly, the timing of any car journey is important. Thus, when writing SLA definitions it is important to specify the period they relate to ~ peak hour, peak day. I could easily get my average journey time to Wales to be less than 4 hours if I always travelled at 2 a.m.
Having worked in this area of capacity and performance for some time, I would strongly recommend that you remember and apply these definitions, as they may well save you much money and / or pain. Apart from the saving in time dealing with some salesmen, you may avoid the fate of one company who bought a capacity tool and a performance tool only to find they had bought two tools that collected much the same data, without any discernible extra benefit.
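The effect of the measurement interval can be seen by summarising the same day of data in two ways. The hourly samples below are invented for the example, and the variable names are purely illustrative.

```python
# Hypothetical hourly CPU utilisation for one day (24 values between 0.0 and 1.0).
hourly = (
    [0.05] * 8                                                # overnight
    + [0.30, 0.60, 0.55, 0.40, 0.35, 0.45, 0.50, 0.40, 0.30]  # working day
    + [0.10] * 7                                              # evening
)
assert len(hourly) == 24

peak_hour = max(hourly)                    # utilisation in the busiest hour
daily_average = sum(hourly) / len(hourly)  # utilisation over the whole day

print(f"peak hour utilisation: {peak_hour:.0%}")
print(f"whole-day utilisation: {daily_average:.0%}")
```

With these figures the same system is "60% utilised" in its peak hour and about 21% utilised over the day, so a bare percentage without its interval says very little.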