Below is an edited version of the speech I gave at the ITWeb Software Development Management Conference 2015 about organisational clock speed.
How do we build capacity? What do you say when your boss comes to you and says, “Fred, we need to double the amount we delivered last year”? What’s normally our first thought? “How many more people do I need in order to double my capacity?” Logically, the answer would be to double the people.
Of course it’s not as simple as just doubling the number of staff, is it? Unless you have a huge amount of spare time, you’re going to have to delegate a lot of tasks in order to keep the team well managed, so you probably need a team lead kind of person for your current team, and another one for the whole new team you’re creating. Okay, so a bit more than doubling, and that’s just the start. When you bring in more people they all come with their own needs and wants. The bigger things get, the more complicated they become.
The Mythical Man-Month by Fred Brooks explores why adding software developers to a project doesn’t increase overall productivity in proportion: having more developers adds communication and synchronisation overhead. From this comes what’s called Brooks’ Law: adding software developers to a late project makes it later. What this means is that you’ll need to more than double your number of staff in order to achieve a doubling of output.
Gordon Moore, of Intel, observed that the number of transistors in a dense integrated circuit like a CPU doubles roughly every 2 years. Put simply, computers have roughly doubled their performance every 2 years. But lately they’ve had to do some pretty clever tricks to achieve that performance, and currently the state of the art is multi-core CPUs. The problem is that CPUs also suffer from a variant of Brooks’ Law. Adding a second CPU to a single CPU does not double performance. Synchronisation needs to be done and locks are needed for shared resources, and all this puts in place an overhead that gets worse as you increase the number of CPUs.
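To put rough numbers on that, here is a minimal Python sketch using Amdahl's law. This is a standard model that I did not name in the talk, and it only captures the part of the work that cannot be parallelised; real synchronisation and locking costs make things worse still. The 90% parallel fraction is purely an assumption for illustration.

```python
def amdahl_speedup(cpus: int, parallel_fraction: float) -> float:
    """Theoretical best-case speedup from Amdahl's law for a given CPU count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


# ASSUMPTION: 90% of the work can run in parallel (illustrative only).
PARALLEL_FRACTION = 0.9

for cpus in (1, 2, 4, 8, 64):
    speedup = amdahl_speedup(cpus, PARALLEL_FRACTION)
    print(f"{cpus:>2} CPUs -> {speedup:.2f}x faster than 1 CPU")
```

Under that assumption a second CPU gives about 1.8 times the performance rather than 2, and even 64 CPUs stay below 10 times.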
If you think about this in human terms: if you have one developer and you need to double capacity, you get a second. They sit next to each other, they share information, and they probably manage themselves. Now if you have 20 developers, suddenly you need 20 more. The logistics become more difficult to manage, as does communication. Who sits where? Who is working on what? Who is managing these developers? Now if you have 100 developers…you get the idea.
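One way to see why communication gets harder so quickly is to count the possible pairs of people who might need to talk to each other, which Brooks gives as n(n - 1)/2 for a team of n. The short Python sketch below does that count; it assumes, unrealistically, that every pair needs to communicate, so treat it as an upper bound rather than a measurement.

```python
def communication_channels(team_size: int) -> int:
    """Number of distinct pairs in a team of the given size: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2


# Team sizes from the examples in the text, before and after doubling.
for team_size in (1, 2, 20, 40, 100, 200):
    channels = communication_channels(team_size)
    print(f"{team_size:>3} developers -> {channels:>5} possible channels")
```

Doubling the team from 20 to 40 roughly quadruples the channels, from 190 to 780.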
Currently there is a lot of talk about scale. How do we scale out? We add virtual machines, lots of them, each with an operating system and monitoring programs and network cards. We explode the communication and management overheads. We add Redis and Hadoop and Varnish, and a hundred other systems. We shard our data onto multiple machines. We do all this to spread the load for the flood of users we expect.
HighScalability.com recently had an interesting story about a system serving 10,000 daily customers. They had 130 VMs. That’s 77 users per server and they weren’t handling the load, which is just ridiculous. Plus it’s 2 or 4 CPUs per server. On average that’s roughly 1 CPU for every 25 users.
And what do you think all this computing power is doing? Calculating orbital trajectories? Bringing about world peace? Maybe creating a universe from scratch? Nope, what you have is almost 400 CPUs being thrown at this massive problem, a website! Just a website with a few simple pages.
Obviously this is absurd but what had gone wrong? They had made the faulty assumption that more machines equals more capability. That assumption is just wrong. To fix it, one of the things they did was cut the number of machines from 130 down to…
One, one machine. This one machine serves all 10,000 of their customers with excellent performance. Okay, I’m sure this one machine has got a fair bit of grunt, but even if it’s got 64 CPUs we’re still talking 156 users per CPU, and I think 64 would be overkill.
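The arithmetic behind those figures can be checked with a few lines of Python. The customer, VM and CPU counts come from the story above; the only assumption is how many CPUs each VM had, so the sketch shows the whole range from 2 to 4.

```python
DAILY_CUSTOMERS = 10_000
VM_COUNT = 130
SINGLE_MACHINE_CPUS = 64  # the deliberately generous guess for the one machine


def users_per_cpu(users: int, cpus: int) -> float:
    """Average number of daily users served by each CPU."""
    return users / cpus


print(f"Users per VM: {DAILY_CUSTOMERS / VM_COUNT:.1f}")

# ASSUMPTION: each VM had between 2 and 4 CPUs; 3 is taken as the average.
for cpus_per_vm in (2, 3, 4):
    total_cpus = VM_COUNT * cpus_per_vm
    ratio = users_per_cpu(DAILY_CUSTOMERS, total_cpus)
    print(f"{cpus_per_vm} CPUs per VM: {total_cpus} CPUs, {ratio:.1f} users per CPU")

after = users_per_cpu(DAILY_CUSTOMERS, SINGLE_MACHINE_CPUS)
print(f"One {SINGLE_MACHINE_CPUS}-CPU machine: {after:.2f} users per CPU")
```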
Now consider how much less this solution must be costing the company. We’ve gone from 130 VMs, with all those licensing and support costs to one. Imagine how many fewer support technicians they need, imagine how much quicker and easier it is to track down defects. Imagine how much simpler it is to develop against and deploy changes to. Imagine these improvements as a wave rolling through the organisation.
I was working on a huge project for a Palantir client: huge costs, huge required performance, huge team. Everything was huge. We were taking software designed to handle a large brokerage and then we were scaling it to handle lots of large brokerages. To accomplish this we created multiple databases, each talking to multiple servers, all with complex integration patterns between them.
The teams working on the project were huge and as such became disjointed. Requirements were misunderstood; communication and decision making were major bottlenecks. More team members were added and as a result more overheads were added. More servers and environments were added, which meant more system, management and support overheads. As a result, people were working more overtime to maintain everything. But despite all those challenges we managed to get to a point where we could see the end in sight.
However, the complexity of the solution and the massive hardware required meant that the system operationally would cost more than the mainframe it was meant to replace. So what happened? They killed the project, and wrote off many, many millions of Rands. An abject failure by any standard. But it could have been worse. We could have worked for an extra year to complete the system, and worse yet, taken it live. The support for this beast would have crippled the client. Killing the project was the right decision; my regret is that it took so long to come to that decision.
Throwing capacity at the problem, at both a systemic and a team level, had made the problems worse, not better.
Another client I worked with had some serious problems. That’s my company’s specialty: helping to turn around struggling teams. Anyway, a lot of their problems appeared to relate to a lack of capacity. People were too busy, servers were overloaded, and project deliveries were few and far between. What was different at this client, and what inspired me to come and give this talk, was their approach to the problem.
They had a new CIO, who had brought in some counterintuitive ideas about capacity. First and foremost was the concept that adding more capability to a team increases liability, management costs, licensing and so on, which quickly reaches a point of diminishing returns. Instead they decided to solve for efficiency: simplify and optimise.
This is not as easy as it sounds. Optimising complex systems and teams requires a different way of thinking, one with a systemic view that is mindful of interactions and their effects, rather than an isolated view focusing on individual systems.
One of the main optimisations was to focus on time. As an example, there was a process that had to start on month end and had to be complete by the next morning. The problem was that it took 16 hours. In order for it to complete in that time the machine had to be made unavailable to the business, which meant that the contact centre couldn’t work while this process was running.
The machine had to go down at 3 in the afternoon, and this meant that the business lost 2 hours for the entire contact centre every month. Calculate that out for a 20-person contact centre and that’s 480 hours a year wasted. Think about all those angry clients not being able to get answers to their queries.
And the work they couldn’t do during those lost hours doesn’t go away, so overtime was needed to catch up. Staff would have to come in on the weekend. Many of them relied on public transport, so on weekends taxis were needed to get staff to work. The canteen was closed on weekends, so catering was needed. All these costs add up, quickly.
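The 480 hours is simple multiplication, sketched below in Python. All three inputs come from the example above, and the sketch assumes exactly one month-end run per month.

```python
HOURS_LOST_PER_RUN = 2      # contact centre downtime on each month-end run
CONTACT_CENTRE_STAFF = 20   # people unable to work during that downtime
RUNS_PER_YEAR = 12          # ASSUMPTION: exactly one month-end run per month


def lost_hours_per_year(hours_per_run: int, staff: int, runs: int) -> int:
    """Total staff hours lost in a year to the month-end downtime."""
    return hours_per_run * staff * runs


total = lost_hours_per_year(HOURS_LOST_PER_RUN, CONTACT_CENTRE_STAFF, RUNS_PER_YEAR)
print(f"Contact centre hours lost per year: {total}")
```

That counts only the direct loss, not the overtime and other knock-on costs.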
Obviously, this run had to be carefully monitored; we couldn’t dare have it fail. So someone had to baby-sit it all night. That’s another 14 extra hours being spent on this process.
They could have thrown another machine at the problem but that would have other costs and difficulties associated. You might give the business back some of their hours, but you would pay for it with increased IT effort and management.
Instead they optimised the process. They got it down from 16 hours to 7 hours. They gave the business back the lost 480 hours a year. The system administrator got back 5 hours each month, 60 hours a year. Add to this the less obvious savings: the server now had 9 more hours of processing capacity, so the machine could be used to do more things, to run 5 applications instead of only 3. Reinvestment of these savings is important because, if done wisely, it becomes a compounding improvement.
They called this “organisational clock speed improvements”. It’s not the catchiest title but we work in IT so it will have to do. Think back to our CPUs: we’re not adding more CPUs, we’re making our existing CPUs faster. No increase in overhead, no increase in licensing costs, but doing more nonetheless. This initiative was run as a competition between teams to see who could push back the biggest savings. They only tracked the direct savings, not the downstream and knock-on savings, and they saved thousands of hours.
They gave people their weekends back.
Now some of you might think that this reduction in required work could be threatening. What do you think when 15 days of your work disappear? You start thinking you can be replaced. And that’s maybe one way to use that saving, but that’s not a reinvestment. Instead they encouraged the use of that time for the most important and underrated activity: thinking. Let’s say that for every 8 hours of thinking, a person saves 1 hour a month for the organisation. But that 1 hour a month is EVERY month. In a year you’re looking at a 50% return: 8 hours invested for a 12-hour return.
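Here is that return worked through as a small Python sketch. The 8 hours invested and 1 hour saved per month are the illustrative figures from the paragraph above, not measured values, and the sketch assumes the saving simply continues at the same rate.

```python
HOURS_INVESTED = 8         # one-off thinking time (illustrative figure)
HOURS_SAVED_PER_MONTH = 1  # recurring saving (illustrative figure)


def net_hours(months: int) -> int:
    """Hours saved so far minus the hours originally invested."""
    return HOURS_SAVED_PER_MONTH * months - HOURS_INVESTED


def return_percent(months: int) -> float:
    """Net gain as a percentage of the hours invested."""
    return net_hours(months) / HOURS_INVESTED * 100


for months in (8, 12, 24):
    print(f"After {months:>2} months: net {net_hours(months):>2} hours, "
          f"{return_percent(months):.0f}% return")
```

The investment breaks even after 8 months, returns 50% after a year, and keeps paying back every month after that.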
So they changed the way they thought about capacity and what was the result?
Actual spend on ICT was 10% lower than budgeted. With the same staff they delivered 48 projects in 2014 compared to 9 the year before. They had 10 major releases compared to 4 the previous year and all for 10% less money.
Simplification gives unintended and unexpected results. Simplification of systems, simplification of activities, results in more capacity from the same capabilities. This investment in simplicity is applied recursively, again and again, winding up the clock more and more, resulting in faster and faster pace and delivery. All with the same staff, but happier, less stressed, less tired, and more thoughtful.
You just need to change the way you think.
Organisational Clock Speed by Sean Hederman is licensed under a Creative Commons Attribution 4.0 International License.