On stuff-ups

One of my team members recently made a flurry of mistakes in production. We were chatting, and he seemed a bit downcast about it, so I told him not to worry too much: we’ve all made mistakes, and we all will make more; the trick is not to keep making the same mistakes again and again.

Tonight we test in Production

So, I thought I’d share some of the tricks I use to make sure I don’t mess up Production. Please contribute any tricks you use too.

Avoiding SQL production mistakes

SQL mistakes have the fun attribute of being very, very quick, and affecting lots and lots of users all at once. To stop ourselves from making silly mistakes there’s a range of techniques we can use.

Clearly Identify Production connections

Most SQL development tools have a way of indicating that there’s something special about this particular connection.

SQL Server Management Studio allows you to set a color for each connection
SQL Server Management Studio showing the warning red of Production
LinqPad connection screen showing “Contains production data” checkbox
LinqPad query showing “Production Connection” warning

Always Rollback until you’re sure

begin transaction

-- Put your code here

rollback transaction

This is a good habit to get into and helps ensure the only effect you’ll have on Production is via locking/performance issues.

Always have a WHERE clause

begin transaction

update customers
set name = 'Test'
where 1 = 0

rollback transaction

The first part of anything you write should be “where 1 = 0”; this again helps ensure that you won’t have any kind of effect until you want to. Use this technique for SELECTs as well.

Always use two independent WHERE clauses

You never want to assume that something is what you think it is just based on one piece of evidence. You must have at least 2. For example, assume that you want to update the customer “Sean Hederman” which has ID 456 to a web site address of “http://www.palantir.co.za”. The wrong way is:

begin transaction

update customers
set website = 'http://www.palantir.co.za'
where ID = 456

rollback transaction

The reason this is wrong is that you could be mistaken about who the ID refers to. Instead, the right way is:

begin transaction

update customers
set website = 'http://www.palantir.co.za'
where ID = 456
and name = 'Sean Hederman'

rollback transaction

This ensures that if ID is wrong and refers to another customer, the update will fail; which is the behaviour we want.

Combine these techniques into a Data Fix Template

I strongly suggest creating a data fix template that fits the following pattern:

  1. Performs all the checks and updates in a transaction.
  2. Checks that the data before the update is as expected, using multiple where clauses for each check, and fails if it is not.
  3. Checks errors and rowcounts when it does the update. If you expect 3 rows to be updated, fail the script if more or less than 3 are updated.
  4. Checks that the data after the update is as expected, using multiple where clauses for each check, and fails if it is not.
  5. Prints out status messages about its progress and evidence through the various steps.
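As a sketch, a minimal T-SQL template following that pattern might look like the following (the table, customer details, and expected rowcount are illustrative placeholders, not a real fix):

```sql
begin transaction

-- Pre-check: verify the data is as expected, using two independent predicates
if not exists (select 1 from customers
               where ID = 456 and name = 'Sean Hederman')
begin
    raiserror('Pre-check failed: customer 456 is not who we expected', 16, 1)
    rollback transaction
    return
end

-- The fix itself
update customers
set website = 'http://www.palantir.co.za'
where ID = 456
and name = 'Sean Hederman'

-- Check the rowcount: exactly one row should have changed
if @@rowcount <> 1
begin
    raiserror('Expected exactly 1 row to be updated', 16, 1)
    rollback transaction
    return
end

-- Post-check: output evidence and status
select ID, name, website from customers where ID = 456
print 'Data fix completed'

-- Leave as rollback until reviewed; switch to commit for the real run
rollback transaction
```

Note the final rollback: the script stays harmless until the reviewer deliberately changes it to a commit.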

I then strongly recommend the following practices around the data fixes:

  • Get them peer reviewed by a DBA or SQL expert who knows the database in question.
  • Check them into source control, prefixed with the date of their execution.
  • Take the execution output (with the evidence outputted by it), and check that into source control with the data fix script.

I also like having the DBAs actually execute the scripts. This ensures that the developer doesn’t “fix” any minor mistakes out of sight of anybody.

Avoiding Server Production Mistakes

Server mistakes are incredibly common, usually because someone does something dumb.

Test before making changes

No change to Production (apart from maybe critical resolutions) should be done without being tested in a similar environment.

Clearly identify Production servers on the desktop wallpaper

No-brainer. Your Production servers should be obviously Production servers. Using a different colour background in addition to text is advisable.

Clearly identify Production servers in Remote Desktop Tools

If you have a remote desktop tool that shows a list of servers you can access, clearly identify and separate which servers are Production servers.

Automate all the things

The easiest way to not mess up Production servers is to not go into Production servers. The best way to do that is to automate all the tasks you do in Production. Go in to clear out log files? Automate it. Going in to deploy new versions of your software? Use an automated deployment tool like Octopus Deploy. Going in to look at log files? Have them aggregated into a tool like Splunk.

Having reliable Change Requests that tell you why people are logging in helps you decide what needs to be automated next.

Make sure you have strict controls over your automation

There is nothing that can stuff up your Production environment as fast or as effectively as an automated system. Ensure that these tools require authorisation, and that the authorisers check that the right tool is being applied to the right environment.

Installs are evil

I made this mistake yesterday. Happily it wasn’t a front line Production server, but it was our source code server, so it annoyed the devs quite a lot.

Simply put, some installs break things. An install should be treated like any other deployment. Testing must be done, evidence must be provided, a change must be logged.

Check that DRP/backup is working first

Just in case what you’re about to do breaks stuff, do you have a backup plan if the rollback fails? Another machine? A backup?

Always backup first

You’re about to make a change to a directory? Back it up first. You’re going to make a change to the server installation? Back up the server. Making a big change to a database? Snapshot or backup is your friend.

Make sure you back up everything [New]

Do you know everything that needs to be backed up? The Program Files directory? The ProgramData directory? Per-user app settings? The Registry? I literally just made this mistake with a TeamCity upgrade: I backed up Program Files and not ProgramData.

Happily in my case we do take full backups of the server so I will get this mistake resolved, but a full server restore is a lot slower than if I’d just backed up the right directories in the first place.

General rules to avoid Production Mistakes

Here are some general rules of thumb for avoiding/recovering from mistakes in Production.

Step away from the keyboard

The moment you realize you’ve made a production mistake, the temptation is to dive in and fix it immediately. This is a very big mistake. You probably made the mistake because of a misunderstanding or faulty assumption, and rushing back in, in a panicked state, is likely to worsen your error, not make it better.

No matter how confident you are in the fix, step away.

Do NOT hide the mistake

Under no circumstances should you hide the mistake. You should immediately, and calmly, let the relevant people know about your error. This protects you, it provides a team of people to help you think through the problem resolution and verify your thinking, and it protects Production.

Hiding a Production mistake will often get you fired. Making a Production mistake will only rarely get you fired.

Note down the facts

Write down what you did, and why you thought it was a good idea at the time. Include any evidence you may have that backs up the assumptions that caused the mistake. In most cases, mistakes in production are caused by:

  • Faulty assumptions (e.g. assuming you’re not dealing with production)
  • Faulty thinking
  • Finger trouble
  • Misleading evidence
  • Insufficient evidence
  • Stupidity (this is qualitatively different from faulty thinking)

It’s important to note down why you made the mistake, so that lesson can be used to close the gap in your Production protocols.

Don’t panic

Spend a bit of effort to keep calm. Often panic produces even worse thinking and can also cause you to miscommunicate the problem to the team. Take some deep breaths, have a glass of cold water, and calm yourself before doing anything of importance.

People make mistakes all the time, it’s how we learn. This is a learning experience, a crap one to be sure, but still something that is going to make you better at what you do.

Don’t get jaded

If you often deal with Production, it is easy to become jaded and lackadaisical. This is human nature: we become acclimatised to risks we encounter often, even incredibly dangerous ones. For example, pedestrians on our highways are a major problem in South Africa. Research has shown that the first time pedestrians cross the highway they are fearful, full of adrenaline, and fully appreciative of the risk of death. They usually only do it in special circumstances. The next time, however, their appreciation of the risks decreases, and they widen the circumstances under which they will take the risk because they perceive the risk as less.

The risk hasn’t got less; they just perceive it as less due to acclimatisation. This is why South Africa has a very high pedestrian death rate: we put people on the other side of highways from their livelihoods, and provide too few crossing points. This results in many, many people becoming acclimatised to a deadly risk.

Production must always be treated with respect and care. Try and psych yourself up about it whenever you’re about to touch Production, give yourself a talking to about the importance of being certain about everything you do.

Don’t use common logins

A lot of organisations use shared login accounts to access Production. This is a Very Bad Idea™. The reason is simple: when someone stuffs up and doesn’t own up to it, how will you know who did it? Related to this is that login and file auditing should ideally be turned on, especially on servers containing sensitive information.

Separate Production access logins from normal logins

Make sure that users cannot log into Production directly with their normal logins. This is security best practice, but for our purposes the main reason is psychological: it forces the users to be aware that something different is going on.

Invest in a Privileged Access Management system

A Privileged Access Management (PAM) system doles out connections to Production. So basically, no-one can log into Production without asking the PAM system for a login. This allows you to use “common logins”, because now you have a record of who “owned” that login at a point in time. Most PAM systems allow you to remote desktop directly from the PAM system. All of those I’ve seen also support password provision, where they control the common login password and provide the user a “temp password”.

Why would you do this rather than just give everyone a second login to Production? Well, one reason is ease of administration. Maybe you don’t want to manage a huge number of Production logins. Another reason is that a normal login can be used at any time, whereas a PAM controlled login can be part of an approval process.

My preference when using a PAM is to have two main logins for Production: a read-only user which does not require approval, but is simply tracked, and an admin user which requires an approval process. Normally said process would also require a logged Change Request. This is to ensure that every change to Production is logged and recorded, so we can analyse the reasons and drive them down as much as possible.

Have a monitoring system

No-brainer, but its absence is surprisingly common. Which servers are running out of memory, CPU, or disk space? Which services have stopped? Which scheduled tasks failed? What you want is a set of alerts which notify you when something is badly wrong. You don’t want alerts about everything, because otherwise you won’t see the wood for the trees.

A big dashboard with red for when things go wrong, and green when everything is okay is also a must. In this way you get quick visibility that a problem has happened.

There’s nothing worse than making a mistake in Production and not knowing about it until the customers start complaining.

Organisational Clock Speed

Below is an edited version of the speech I gave at the ITWeb Software Development Management Conference 2015 about organisational clock speed.

Adding Capacity

How do we build capacity? What do you say when your boss comes to you and says “Fred, we need to double the amount we delivered last year”. What’s normally our first thought? “How many more people do I need in order to double my capacity?” Logically the answer would be double the people?

Of course it’s not as simple as just doubling the amount of staff is it? Unless you have a huge amount of spare time, you’re going to have to delegate a lot of tasks in order to keep the team well managed, so you probably need a team lead kind of person for your current team, and another one for the whole new team you’re creating. Okay, so a bit more than doubling and that’s just the start. When you bring in more people they all come with their own needs and wants. The bigger things get the more complicated they become.

The Mythical Man-Month by Fred Brooks explores how adding software developers to a project doesn’t increase overall productivity. There is an added communication and synchronization overhead to having more developers. From this hypothesis we’ve derived what’s called Brooks’ Law: adding software developers to a late project makes it later. What this means is that you’ll need to more than double your number of staff in order to achieve a doubling of output.

Gordon Moore, from Intel, observed that the number of transistors in a dense integrated circuit like a CPU doubles every 2 years. Put simply, computers double their performance every 2 years. But lately they’ve had to do some pretty clever tricks to achieve that performance, and currently the state of the art is multi-core CPUs. The problem is that CPUs also suffer from a variant of Brooks’ Law. Adding an extra CPU to a single CPU does not double performance. There’s synchronisation that needs to be done, locks are needed for shared resources, and all this puts in place an overhead that gets worse as you increase the number of CPUs.

If transistors were people

If you think about this in human terms: if you have one developer and you need to double capacity, you get a second. They sit next to each other, they share information, and they probably manage themselves. Now if you have 20 developers, suddenly you need 20 more. The logistics become more difficult to manage, as does communication. Who sits where? Who is working on what? Who is managing these developers? Now if you have 100 developers… you get the idea.

Currently there is a lot of talk about scale. How do we scale out? We add virtual machines, lots of them, each with an operating system and monitoring programs and network cards. We explode the communication and management overheads. We add Redis and Hadoop and Varnish, and a hundred other systems. We shard our data onto multiple machines. We do all this to spread the load for the flood of users we expect.

HighScalability.com recently had an interesting story about a system serving 10,000 daily customers. They had 130 VMs. That’s 77 users per server, and they weren’t handling the load, which is just ridiculous. Plus it’s 2 or 4 CPUs per server. On average that’s 1 CPU for every 20 users.

And what do you think all this computing power is doing? Calculating orbital trajectories? Bringing about world peace? Maybe creating a universe from scratch? Nope, what you have is almost 400 CPUs being thrown at this massive problem, a website! Just a website with a few simple pages.

Obviously this is absurd, but what had gone wrong? They had made the faulty assumption that more machines equals more capability. That assumption is just wrong. To fix it, one of the things they did was cut the number of machines from 130 down to…

One server

Drumroll maestro…

One, one machine. This one machine serves all 10,000 of their customers with excellent performance. Okay, I’m sure this one machine has got a fair bit of grunt, but even if it’s got 64 CPUs we’re still talking 156 users per CPU, and I think 64 would be overkill.

Now consider how much less this solution must be costing the company. We’ve gone from 130 VMs, with all those licensing and support costs to one. Imagine how many fewer support technicians they need, imagine how much quicker and easier it is to track down defects. Imagine how much simpler it is to develop against and deploy changes to. Imagine these improvements as a wave rolling through the organisation.

I was working on a huge project for a Palantir client, huge costs, huge required performance, huge team, and everything was huge. We were taking software designed to handle a large brokerage and then we were scaling it to handle lots of large brokerages. To accomplish this we created multiple databases, each talking to multiple servers, all with complex integration patterns between them.

The teams working on the project were huge and as such became disjointed. Requirements were misunderstood; communication and decision making were major bottlenecks. More team members were added and as a result more overheads were added. More servers and environments were added, which meant more system and management and support overheads. As a result of this people were working more overtime to maintain everything. But despite all those challenges we managed to get to a point where we could see the end in sight.

However, the complexity of the solution and the massive hardware required meant that the system operationally would cost more than the mainframe it was meant to replace. So what happened? They killed the project, and wrote off many, many millions of Rands. An abject failure by any standard. But it could have been worse. We could have worked for an extra year to complete the system, and worse yet, taken it live. The support for this beast would have crippled the client. Killing the project was the right decision; my regret is that it took so long to come to that decision.

Throwing capacity at the problem, at both a systemic and at a team level had made the problems worse not better.

Improving Capacity

Another client I worked with had some serious problems; it’s my company’s specialty, helping to turn around struggling teams. Anyway, a lot of their problems appeared to relate to a lack of capacity. People were too busy, servers were overloaded, and project deliveries were few and far between. What was different at this client, and what inspired me to give this talk, was their approach to the problem.

They had a new CIO, who had brought in some counterintuitive ideas about capacity. First and foremost was the concept that adding more capability to a team increases liability, management costs, licensing and so on, which quickly reaches a point of diminishing returns. Instead they decided to solve for efficiency, simplify and optimise.

This is not as easy as it sounds, optimising complex systems and teams requires a different way of thinking, thinking with a systemic view: mindful of interactions and their effects rather than an isolated view focusing on individual systems.

One of the main optimisations was to focus on time. As an example, there was a process that had to start on month end and had to be complete by the next morning. The problem was that it took 16 hours. In order for it to complete in that time, the machine had to be made unavailable to the business, which meant that the contact centre couldn’t work while this process was running.

The machine had to go down at 3 in the afternoon, and this meant that the business lost 2 hours for the entire contact centre every month. Calculate that out for a 20 person contact centre, that’s 480 hours a year wasted. Think about all those angry clients not being able to get answers to their queries.

And the work they couldn’t do during those lost hours doesn’t go away, so overtime was needed to catch up. Staff would have to come in on the weekend. Many of them relied on public transport, so on weekends taxis were needed to get staff to work. The canteen was closed on weekends, so catering was needed. All these costs add up, quickly.

Obviously, this run had to be carefully monitored; we couldn’t dare have it fail. So someone had to baby-sit it all night. That’s another 14 extra hours being spent on this process.

They could have thrown another machine at the problem but that would have other costs and difficulties associated. You might give the business back some of their hours, but you would pay for it with increased IT effort and management.

Instead they optimised the process. They got it down from 16 hours to 7 hours. They gave the business back the lost 480 hours a year. The system administrator got back 5 hours each month, 60 hours a year. Add to this the less obvious cost savings: there was now extra capacity of 9 more hours of processing time on the server, so the machine could be used to do more things, to run 5 applications instead of only 3. Reinvestment of these savings is important because, if done wisely, it becomes a compounding improvement.

They called this “organisational clock speed improvements”. It’s not the catchiest title, but we work in IT so it will have to do. Think back to our CPUs: we’re not adding more CPUs, we’re making our existing CPUs faster. No increase in overhead, no increase in licensing costs, but doing more nonetheless. This initiative was run as a competition between teams to see who could push back the biggest savings. They only tracked the direct savings, not the downstream and knock-on savings, and they saved thousands of hours.

They gave people their weekends back.

Now some of you might think that this reduction in required work could be threatening. What do you think when 15 days of your work disappear? You start thinking you can be replaced. And that’s maybe one way to use that saving, but that’s not a reinvestment. Instead they encouraged the use of that time for the most important and underrated activity: thinking. Let’s say that for every 8 hours of thinking, a person saves 1 hour a month for the organisation. But that 1 hour a month is EVERY month. In a year you’re looking at a 50% return: 8 hours invested for a 12 hour return.

So they changed the way they thought about capacity and what was the result?

Actual spend on ICT was 10% lower than budgeted. With the same staff they delivered 48 projects in 2014 compared to 9 the year before. They had 10 major releases compared to 4 the previous year and all for 10% less money.

Simplification gives unintended and unexpected results. Simplification of systems, simplification of activities, results in more capacity from the same capabilities. This investment in simplicity is applied recursively, again and again, winding up the clock more and more, resulting in faster and faster pace and delivery. All with the same staff, but happier, less stressed, less tired, and more thoughtful.

You just need to change the way you think.

Archive: Threading Do’s and Dont’s

For work recently I was asked to write a little document on some threading tips, and while I was about it, I noticed this thread on StackOverflow asking for the same thing. I don’t pretend that this is comprehensive or even necessarily 100% correct. However, it’s a start at applying some simple guidelines to threading. Any improvements, suggestions, or additions, please let me know and I’ll make the necessary changes. Since I see this as a “living” page, I won’t clutter it with edit marks as it changes. Look in the comments for change history.

DO reconsider your options

Concurrency is very tricky and difficult to get right. It often leads to subtle and difficult to debug programming errors, and far too commonly does not result in significant speed improvements. Make absolutely certain that there are not other alternatives.

DO Use lock() { … }

In almost all cases this is faster and more efficient than more complex low-lock schemes involving Interlocked or ReaderWriterLocks.

DO Ensure that all static methods are thread-safe

It is an accepted pattern that static methods and properties are thread safe, and non-statics are not. If you violate this in either direction, document it, and be prepared to explain why.

DO NOT use the object being locked as the lock

Always create a new object to control the lock e.g.

private object lockCollection = new object();
private List<string> collection = new List<string>();

This ensures that even if you pass the collection around and someone else locks on it, that this will not affect your locking code and will not result in deadlocks.
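A short sketch of what this looks like in use (the Add method here is illustrative):

```csharp
private readonly object lockCollection = new object();
private List<string> collection = new List<string>();

public void Add(string item)
{
    // Lock on the dedicated object, never on the collection itself
    lock (lockCollection)
    {
        collection.Add(item);
    }
}
```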

DO place locks wherever multithreaded execution order can be significant

Especially if it seems that your code does not actually need a lock. It is impossible to determine in what order code will be executed if there are no direct dependencies. Consider the following code:

int x = 0, y = 0;

// Thread 1
x = 10;                 // Line a
y++;                    // Line b

// Thread 2
y = 4;                  // Line c
x++;                    // Line d

What will the possible outputs be after both threads complete?

What perhaps is not clear is that there is no requirement that b execute after a, or that d execute after c. The reason is processor reordering, where the CPU sees that the “local” execution is unaffected by order and takes it upon itself to run in any order that makes sense from a performance perspective. Thus, a result of x = 10 and y = 4 is possible. Just because your code is in a certain sequence does not mean that the sequence will be honored by the CLR or the CPU. Wrapping it in a lock will ensure correct ordering. Alternatively you can use Thread.MemoryBarrier to separate a from b, and c from d.

DO NOT assume that multithreading will automatically create a speedup

Concurrency can lead to enhancements in performance, but not in all cases. A good understanding of execution times should be acquired before looking at concurrency. Note: not estimated execution times, actual execution times. If you have a task that can be broken into two concurrent portions, it does not help if one of the portions only takes 5% of the total execution time. The overhead in threading, context switches, and synchronization will almost certainly exceed the expected concurrency gains. Ideally, the tasks should be fairly similar in their execution times for the best improvements.

DO try and use well-understood patterns like Producer-Consumer

Many concurrency issues can be simplified to one of the Producer-Consumer options (single Producer/multi Consumer, multi Producer/single Consumer, or multi Producer/multi Consumer). There are a great many articles, code samples, and libraries (e.g. the Task Parallel Library) around this pattern. Make use of them to make your life easier. That way it’s much less likely that you’ll be bitten by some obscure and difficult-to-debug problem.
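For instance, a minimal single Producer/single Consumer sketch using BlockingCollection<T> (the item type and counts are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ProducerConsumerSketch
{
    static void Main()
    {
        var queue = new BlockingCollection<int>(boundedCapacity: 100);

        var producer = Task.Factory.StartNew(() =>
        {
            for (int i = 0; i < 10; i++)
                queue.Add(i);       // blocks if the queue is full
            queue.CompleteAdding(); // signal that no more items are coming
        });

        var consumer = Task.Factory.StartNew(() =>
        {
            // The enumeration ends once CompleteAdding has been called
            foreach (var item in queue.GetConsumingEnumerable())
                Console.WriteLine(item);
        });

        Task.WaitAll(producer, consumer);
    }
}
```

The blocking semantics give you back-pressure for free: a slow consumer simply stalls the producer instead of letting the queue grow without bound.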

DO NOT have long-running tasks in QueueUserWorkItem and other thread pools

Thread pools are designed for short-running tasks. Long-running tasks in such pools cause initial starvation while the pool works out that the occupied thread is not going to become available. This can cause significant performance hits, especially early on in service startup. Long-running tasks should have their own thread.
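A sketch of giving a long-running task its own dedicated thread (the thread name and work are illustrative):

```csharp
using System.Threading;

var worker = new Thread(() =>
{
    // ... long-running work, e.g. a polling loop ...
})
{
    IsBackground = true,        // don't keep the process alive on shutdown
    Name = "LongRunningWorker"  // easier to spot in a debugger
};
worker.Start();
```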

DO NOT spin up threads for short-running tasks

Threads are expensive resources, and should not be created and destroyed without good reason. If you have a quick task, rather use the ThreadPool (or the Task Parallel Library) to execute it.

DO use Begin… and End…

Many classes, such as the IO classes, have methods like these, e.g. FileStream.BeginRead and FileStream.EndRead. These are often much more efficient than the equivalent synchronous methods. A synchronous Read effectively removes a thread from your app for the duration of the call. BeginRead makes use of IO completion ports and does not keep a thread occupied. The callback is fired when the OS (via the device driver) notifies .NET that the operation has completed. This effectively means you are not using a thread for the read operation at all, just a tiny bit at the beginning, and to invoke the callback on completion.
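As a sketch of the pattern (the file name and buffer size are placeholders):

```csharp
using System;
using System.IO;
using System.Text;

var buffer = new byte[4096];
var stream = new FileStream("log.txt", FileMode.Open, FileAccess.Read,
                            FileShare.Read, 4096, useAsync: true);

stream.BeginRead(buffer, 0, buffer.Length, ar =>
{
    // Always pair BeginRead with EndRead to release the operation's resources
    int bytesRead = stream.EndRead(ar);
    Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, bytesRead));
    stream.Dispose();
}, null);
```

Note the useAsync flag on the FileStream constructor: without it the Begin/End calls can fall back to blocking a worker thread.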

DO call End…

Many async operations create expensive resources which are only disposed when the relevant End… method is called. Always ensure you call End, otherwise you can leak resources. If you’re really, really lucky these will be .NET resources and will eventually be reclaimed. In many cases they will not, which will lead to resource leaks.

DO NOT create threads in IIS or SQL CLR

IIS and SQL Server are heavily controlled environments with many, many threads. They are finely tuned to make effective use of their threads, and adding new threads into the mix can make them run less effectively. Always try to use ThreadPool threads when running in IIS. In fact, in IIS it is sometimes better to schedule mid- to long-running tasks in the ThreadPool than to spin up a new thread, as IIS monitors the pool and adjusts accordingly. In SQL CLR, creating new threads is usually not a good idea at all.

DO use concurrency libraries and controls

BackgroundWorker and the Task Parallel Library are brilliant examples of making threading less difficult. Make use of such tools and libraries extensively when you can, as they will help shield you from some of the more common concurrency issues.

DO use InvokeRequired

If you’re writing multi-threaded Windows Forms applications, please, please, please use BackgroundWorker, and only update the UI in the RunWorkerCompleted and ProgressChanged events. If that is not an option for whatever reason, use the following pattern in the method called by the threaded code:

private void OnEventOccurred()
{
    if (InvokeRequired)
    {
        Invoke(new Action(OnEventOccurred));
        return;
    }
    // Do work here
}

Obviously if your method takes parameters you would use a delegate other than Action and would pass the parameters in when you call Invoke.

DO buy Joe’s Book

One of the best concurrency books around.  It’s a bit of a hard slog, mainly due to the depth and breadth of the content, but well worth it if you’re interested in concurrent programming.

This article has been recovered from an archive from my old blog site. Slight changes have been made.

Archive: Faking Performance


What’s something everybody wants for their application, but very few people have the time to deliver? Performance. Let’s face it: in most software projects, performance requirements are relegated to the very end of the project, when everyone knows they won’t have the time to address them. In one sense this is a good thing, as one of my biggest bugbears is premature optimisation.

Premature optimization is the root of all evil.

- Hoare’s Dictum, Sir Tony Hoare

When should I optimise?

Now keep in mind that by denigrating premature optimisation I am not saying that you should never think of performance when writing an application. Of course you should, especially when looking at your design. Sir Tony’s quote is all too often taken out of context, as Randall Hyde argues. Good performance is most effectively obtained by thinking very carefully about your design up-front.

However, when you have the choice between an easy-to-write, slow algorithm and a difficult, fast one, chances are you should probably write the easy one. This is not always true, of course. For example, you may know that the algorithm is being used on a critical path in your application, in which case you definitely should go for speed. Far too few people even keep these trade-offs in mind, and go for one extreme or the other: optimising code that will not be a bottleneck, or ignoring code that will. I often mark code that I know is slow with a // PERF: comment, so that I can go back to it later to improve it. The nice thing about this approach is that you can move the slow code into your unit tests in order to ensure that its results match your optimised code, since all too often bugs are introduced during optimisation.

The type of application you are writing matters a lot as well. Obviously, if you are developing a software product, you will want much tighter performance requirements from most of your code than if you’re writing a bog-standard enterprise app. The reason? Well, in my experience almost all custom enterprise applications are IO-bound, and spend an awful lot of time waiting for the user or for results from the database. In such an application, your database design and tweaks will likely make far more impact than anything else you may do.

That said, what happens when, at the end of your project you find that your code is too slow to deliver to the customer? Well, apparently in one talk Rico Mariani said that the Ten Commandments of Performance are

1. Measure
2. Measure
3. Measure
4. Measure
5. Measure
6. Measure
7. Measure
8. Measure
9. Measure
10. Measure

Scott Kirkwood has some interesting arguments and counter arguments to premature optimisation:

[…] Now back to premature optimization. I think what they really want to say is that “Unnecessary optimization makes code that is unmanageable, buggy and late” and there’s more:

  • When a program has performance problems the programmer always knows which part of the code is slow…and is always wrong.
  • Only through profiling do you really see where the performance issue is.
  • You can waste a lot of time doing optimization that doesn’t matter.
  • Optimization can often make the code more obscure, and hard to maintain.
  • Spending more time on optimization means you are spending less time on other things (like correctness and testing).

Well that’s the theory. And here’s some of my counter arguments:

  • If a developer really enjoys what he is doing it won’t take any “extra” time. In other words, taking time to optimize doesn’t necessarily steal time from testing; more likely it steals time from surfing the web
  • Every time a developer looks at the code for something to optimize, he’s looking at the code! He understands (groks) it better and may fix more bugs.
  • Encouraging developers to leave in code that they know is embarrassingly slow makes them a little less proud of their code, a little less enthusiastic about finding and fixing their bugs.
  • Products have failed because in a review they mention that it took twice as long to load a document than the competitor (even though it was 2 seconds instead of 1)
  • When you put the code in production and it’s too slow, you may be able to fix it by profiling and optimizing, but then again, you may not – you may have to redesign it.

So, crack out your profiler (I’m a big fan of the ANTS Profiler from Red Gate) and measure and find your bottlenecks and optimise them. If you follow my approach with marking code you know to be slow, you might be surprised to find how rarely you are correct in your estimation of what code is a performance bottleneck.

When should I stop optimising?

Obviously, this is a big question to ask. Usually, you will get a few low-hanging fruit, a couple of optimisations that give you large performance benefits. After that, however, it will become more and more difficult to find good high-value optimisations. At this point most people stop optimising and ship the code. However, what if that’s still not good enough? Well, let’s think about faking performance.

John Maeda says

Often times, the perception of waiting less is just as effective as the actual fact of waiting less. For instance, an owner of a Porsche achieves the thrill of directness between translation of a slight tap on the acceleration pedal, to be manifest as an immediate burst of speed. Yet in any normal rush hour situation, a Porsche doesn’t go any faster than a Hyundai. The Porsche owner, however, still derives pleasure from his or her perception that they are getting to work faster in a quantitatively faster machine. The visual and tactile semantics of the Porsche’s cockpit all support the qualitative illusion that the driver is going faster than when he or she is sitting inside a Hyundai.

[…] The premise was that when a user was presented with a task that required time for the computer to crunch on something, and a progress bar was shown, the user would perceive that the computer took less time to process versus having been shown no progress bar at all.

So, one of the easiest ways to fake performance is to slap in a BackgroundWorker component, put your expensive code in there, and report progress via a progress bar. Since you are adding not only an extra thread but also more UI updates, there is no doubt whatsoever that your code is less efficient, yet the user will perceive it as more efficient.
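In .NET that is literally a BackgroundWorker with ReportProgress, but the pattern is the same in any language: a worker thread does the expensive work and posts progress updates back for the UI to display. A hedged sketch in Python, where the queue stands in for the UI’s progress bar:

```python
import queue
import threading

def expensive_work(progress):
    """Simulated long-running job that reports percent-complete."""
    total = 5
    for step in range(total):
        # ... the real work for this chunk would happen here ...
        progress.put(int((step + 1) / total * 100))

progress = queue.Queue()
worker = threading.Thread(target=expensive_work, args=(progress,))
worker.start()

# The UI thread stays responsive, draining progress updates as they arrive.
worker.join()
updates = []
while not progress.empty():
    updates.append(progress.get())

print(updates)  # -> [20, 40, 60, 80, 100]
```
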

Now, if even that’s not enough, another approach is obviously to offload the processing to another machine. This is even better than the progress bar if the user does not need the results of the calculation right away, since they say what they want to happen, press the start button, and can immediately begin working on something else. By offloading the processing, perhaps to a multiprocessor server, you gain a massive improvement in the user’s perception of the speed of your application, as well as an improvement in the running time. The cost, obviously, is the work required to implement the handover, as well as the hardware cost of the server.

Now, I am not advocating no optimisations at all, but I am trying to get across that sometimes these “faking it” approaches are easier and cheaper than extensive performance tweaking. Needless to say, sometimes, even with massive optimisation, you still need massive offloading capabilities. Just look at SETI@Home as an example.


So, keep performance very much in mind when designing your software, keep performance trade-offs in mind when writing it, keep the difficulty and impact of optimisations in mind when profiling, and keep faking it in mind when polishing your application.


Jeff Atwood has a nice post on how changes to the File Copy progress bar made users see the copy as less efficient, even when it was in fact more accurate.

This article has been recovered from an archive from my old blog site. Slight changes have been made.

Archive: Estimating Software Development

Creating an estimate for a software development project is hard, really hard. There are books and articles and speeches and academic papers, and you know none of them has got it completely right, because projects are still badly estimated. So what am I going to add to the mix? Nothing, really. Just a set of tools and techniques you can use to help you improve your estimations. But first off let’s dispel some myths.

Myth 1: Estimating isn’t necessary for Agile

Well, if you work in a company where they’re happy that the date of delivery and feature set be vague, then good for you. In the rest of the world we often have to present a quote to a customer that specifies what they’re going to get and how long it’s going to take. This means you’re going to have to estimate. Happily most estimation techniques work fine with Agile, they just require that you get an idea of your scope up front.

Even without a requirement for a full estimate, you still need to estimate the scope of the user stories to see which will likely make it into the sprint and which won’t.

Myth 2: Developers make good estimators

Cone of uncertainty
Sorry, but we really don’t. Look at our track record. We tend to underestimate tasks we consider “fun”, and overestimate those we consider “boring”. It would be great if that added up to the right amount, but it turns out that we suck even at overestimating, and generally forget about huge swathes of requirements when asked to “thumbsuck”. Oh, about giving a thumbsuck estimate: don’t, ever. It is about as reliable as throwing a dart at a board, so you might as well just do that instead. If you’re being pressured for one, consider a parable:

There was a pilot, who often flew from Paris to London
He was asked how much fuel he’d need for a trip between Madrid and New York
He didn’t ask in which direction, nor the make of plane, nor the number of passengers
He just pulled a number out the air
And the plane ran out of fuel mid-ocean and everyone died

If you think people’s lives are not affected by poor estimating for software projects, think again. I’ve helped rescue some projects where poor estimating had badly affected code quality on systems affecting people’s medical aids, finances, and yes, the possibility of making an airplane fall out of the sky.

It does not take long to do a proper estimate, a few hours at most, so do it properly.

Myth 3: Estimates aren’t important

A proper estimate serves a broad array of functions:

  • It lets you know when something is harder than expected, critically important knowledge in most projects
  • It allows you to synchronize timelines with other departments. For our Document Management System we delivered the first version 4 months before the web site and marketing were ready. Oops! Admittedly my software delivery estimate was spot on, but I hugely underestimated how long the marketing would take.
  • It allows you to determine what your profitability would be on a project, which can mean the difference between costly failure and profitable success. You can walk away from projects that wouldn’t make you money. Otherwise you just don’t know. In the early days of Palantir, before I joined, there was a project like that, and it very nearly sank the company.

Okay, so let’s move on to some of the techniques:

Technique 1: Always keep records

This comes from Joel Spolsky, I believe: always record your initial estimate, as well as the final result. Keep those records and use them in future estimation techniques. They allow you to build up a pattern of how various people estimate. Make sure you break up your estimates based on who gave which one. It is not embarrassing to be off in your estimates; what is embarrassing is being off consistently and not factoring that into your future estimates.

Project 1: Estimated 50% below actual (ouch)
Project 2: Estimated 45% below actual, 50% below quoted (yay!)

We learn from our mistakes. Not doing so is insanity. You can build up an estimation factor to apply to people’s estimates to get a more accurate feel. Just make sure you update your estimation factor constantly, as their skills will change. Make sure to keep it separated by project type. My estimation factor on Windows Services may be 80%, but on ASP.NET WebForms it could be 120%.
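The bookkeeping for this is trivial: a lookup of historical factors keyed by person and project type, multiplied into each raw estimate. A minimal sketch (the names and factors are made up for illustration):

```python
# Historical estimation factors: actual hours / estimated hours,
# tracked per person and per project type. Figures are illustrative.
factors = {
    ("Sean", "Windows Service"): 0.80,
    ("Sean", "WebForms"): 1.20,
}

def adjusted_estimate(person, project_type, raw_hours):
    """Scale a raw estimate by that person's historical factor for this project type."""
    return raw_hours * factors[(person, project_type)]

# A raw 10-hour WebForms estimate from Sean adjusts upwards to 12 hours.
print(adjusted_estimate("Sean", "WebForms", 10))  # -> 12.0
```
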

Technique 2: Multi-estimate

Have multiple people independently give estimates for the same item. One advantage of this technique is that you get a broader base of data for Technique 1. Probably the most important however is that it provides a level of confidence.

Assuming estimation factors already applied.

Item 1: Sean estimates 6 hours, Graeme estimates 7 hours, Craig estimates 5 hours. Standard deviation is 1. So, we can say the estimate is 6±1 hours with high confidence.

Item 2: Sean estimates 3 hours, Graeme estimates 7 hours, Craig estimates 14 hours. Standard deviation is 5.57. So, we can say the estimate is 8±6 hours with low confidence. With such wildly differing estimates they could all be way off.
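Python’s standard library reproduces these numbers directly (using the sample standard deviation, as above):

```python
import statistics

def summarise(estimates):
    """Return (mean, sample standard deviation) for a set of independent estimates."""
    mean = statistics.mean(estimates)
    spread = statistics.stdev(estimates)  # sample standard deviation
    return round(mean), round(spread, 2)

print(summarise([6, 7, 5]))   # Item 1 -> (6, 1.0): high confidence
print(summarise([3, 7, 14]))  # Item 2 -> (8, 5.57): low confidence
```
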

This is a hugely important technique, and the least often applied. It offers huge benefits in accuracy as well as feeding nicely into Technique 1.

Technique 3: Constantly compare progress against the estimate

This is often done as part of standard project planning, but the real-world data is not fed back into your estimation. As you go, you can start calculating a new estimation factor which is specific to this project. You can then apply that back into your original estimates to get an updated idea of how your estimation might differ. This would mean you now have the following:

  • Original Estimates
  • Final Estimate – Calculated from Originals with factors applied
  • Committed Timelines – Hopefully somewhere north of the Final Estimate
  • “Current” Estimate – Recalculated from new estimation factors generated from progress

If your current estimate is creeping upwards towards the committed timelines, you need to raise that as a problem before it becomes a problem. You might also find that one person’s estimates for the project seem to be more in accord with reality, and give their estimates more weight, but be careful to keep the uncertainty values in place.
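The recalculation itself is one division and one multiplication, assuming you track estimated versus actual hours on completed items. A sketch:

```python
def current_estimate(original_total, completed):
    """Re-project the total using the estimation factor observed on finished items.

    completed: list of (estimated_hours, actual_hours) pairs for done work.
    """
    estimated = sum(e for e, _ in completed)
    actual = sum(a for _, a in completed)
    factor = actual / estimated  # project-specific estimation factor so far
    return original_total * factor

# 40 hours estimated so far actually took 50, so a 200-hour
# original estimate re-projects to 250 hours.
print(current_estimate(200, [(16, 20), (24, 30)]))  # -> 250.0
```
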

Technique 4: Be detailed

I get very suspicious of “2 week” items in an estimate. Sounds like a thumbsuck to me. Joel says you should break everything down to 16 hours at most. I prefer even shorter than that, but could live with 16 hours.

So why break it down? Well, it turns out that we’re really bad at estimating all the pieces that go into a big task, but pretty good at estimating small tasks. So, by forcing us to break it down into smaller items we’re required to think more about the makeup of each task. How much of a difference can this make to your timelines?

Oh, about 50%.

Oh, and make sure you include everything in your estimate, including wasted time waiting for third party vendors, holidays, sick leave, maternity leave, scope creep, project start delays, document signing delays, testing, debugging, the likelihood of having to rewrite a module or two, the lot. You can use the data you accumulate to feed back into future projects, improving their accuracy.

Other Techniques

Take those variances you got from each developer in Technique 2, plug in their historical variances, and use Monte Carlo simulation to generate probability distributions. Now, you can confidently go along and say, “we have a 90% chance of hitting X months”.
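A minimal Monte Carlo sketch: sample each task’s duration from its estimate and variance, sum the samples, and read a delivery figure off the desired percentile. Modelling each task as a (truncated) normal distribution is an assumption here; use whatever shape your historical data actually supports.

```python
import random

random.seed(1)  # reproducible runs

# (mean_hours, std_dev) per task, taken from the multi-estimate data.
tasks = [(6, 1), (8, 5.57), (12, 2)]

def simulate(tasks, runs=10_000):
    """Sample total project duration many times; return the sorted totals."""
    return sorted(
        sum(max(0.0, random.gauss(mean, sd)) for mean, sd in tasks)
        for _ in range(runs)
    )

totals = simulate(tasks)
p90 = totals[int(0.9 * len(totals))]  # 90th percentile of simulated totals
print(f"90% chance of finishing within {p90:.0f} hours")
```

The same sorted list gives you the 80%, 95%, or 100% figures on demand, which is exactly what you need when someone pressures you for a single number.
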

If you find a great deal of variance on a task, it likely has not been scoped well. Consider investing a little more time in nailing down the requirements, and then re-estimate. Yes, still keep the original estimation data for historical reference.

If you find that some staff are more accurate, don’t use them in preference to everyone else. They could always have a bad day after all. Rather ask them to share the techniques that they use that make them so accurate.

Estimation must be seen as a high priority item, one of the highest. It’s more important than the project plan, more important than the specifications, more important than the actual development work. How can I say this? With a badly estimated project, you can do the development work perfectly and still make a loss.

Estimates are also not improved by padding them. Rather, give the real numbers, as accurately as you can, to management. They can then use that as input to their decisions about whether to go for a project and how much to charge for it. If you pad your estimates too much, you could lose out on very lucrative business opportunities.

Do not get pressured into removing the error bars. Make sure you include the ± variance, or if you’re using Monte Carlo (hopefully), the percentage probabilities. The nice thing about Monte Carlo is that you can give a 100% number if pressured for it, but it’s usually way more than the 90% figure. On some projects we go with 80%, on others 90%, on a few 95%. 100% is usually not worth using, unless millions ride on the delivery date.

Any others? Share them here.

This article has been recovered from an archive from my old blog site. Slight changes have been made.

Archive: Storing Millions of files in a File Structure

A recent question on HighScalability.com was “How Do I Organize Millions Of Images?”. The asker had found that storing files in a database was inefficient, and wanted to know what scheme he should use to structure the files. I started writing this as a comment, but decided to do it as a blog post instead. The questioner is moving in the right direction; databases are a very poor place to put large numbers of files from a performance perspective – although don’t discount the convenience of this.
Lots of files
So, the question to ask is this:

How many file system entries can a folder efficiently store?

I did tests on this a couple of years back, and on Windows at that time the answer was “about a thousand”. Okay, so that implies that we must go for a tree structure, and each node should have no more than about a thousand child nodes. This implies that we want to keep the tree nice and balanced. Having a huge tree structure with 90% of the files distributed into 5% of the nodes is not going to be hugely helpful.

So, for a million files, with a perfectly even distribution, a single folder level is sufficient. That’s one “root folder” containing 1000 child folders, each containing 1000 files. For simplicity’s sake, I’m going to assume that only “leaf” folders will store files. Okay, so that will efficiently store about 1,000,000 files. Except he wants to store millions of files. So that implies that either we accept more entries per node or we increase the tree depth. I’d suggest more entries per node as the starting point to consider, since my “1000 entry” testing is a bit outdated.

So, a 2-level structure – a “root folder” with 1000 folders, each containing 1000 folders, each containing 1000 files – gives us a nice even billion (1000³), assuming an even distribution. That last part is the tricky part. How do we ensure even distribution? Well, the simplest method would be to generate the folder names randomly using a pseudorandom number generator with even distribution, probably a cryptographically secure one. Some of the schemes suggested in the comments ranged from generating GUIDs to generating SHA-1 hashes of the files. Some of them may work well; I’ve personally used the GUID one myself to good effect. But a GUID does not guarantee good distribution, and it might bite you, badly.
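A sketch of the random-distribution scheme: pick each folder name uniformly from 0–999 with a cryptographically secure generator, and use a GUID for the file name. With two folder levels that gives 1000 × 1000 × 1000 = a billion files of capacity (the `.dat` extension is just a placeholder):

```python
import secrets
import uuid

def generate_path(fanout=1000):
    """Two uniformly random folder levels plus a GUID file name."""
    level1 = secrets.randbelow(fanout)  # cryptographically secure, evenly distributed
    level2 = secrets.randbelow(fanout)
    return f"{level1:03d}/{level2:03d}/{uuid.uuid4()}.dat"

path = generate_path()
print(path)  # e.g. 042/917/1f8b2c…​.dat
```
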

Using a hash function is cute, though it limits you to a folder size of 256 nodes, which implies a deeper folder structure – additionally, it means you must hash the file as part of determining the file location. But, um, if you’re looking for the file, how do you hash it? I assume you store the hash somewhere; this is good for detecting tampering, and if you are doing this or plan on doing this, then this seems like a good approach. Unfortunately it is inefficient compared to our “ideal” 1000 nodes per folder. As the commenter points out, one other benefit is that if the same image is uploaded multiple times, the same file path will be generated. The problem with this approach is that the commenter is incorrect when he says that SHA-1 does not have collisions; there is in fact a theoretical approach to generating collisions for SHA-1, and NIST suggests that Federal agencies should stop relying on it where collision resistance matters. So, maybe SHA-2? Well, it is built on a similar base to SHA-1, so it’s possible a collision attack could be found – although one hasn’t been found yet. Oh, and why should we worry about a collision attack? Because person A uploads a photo of her wedding and person B uploads some porn – and person B overwrites person A’s photo.
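The hash scheme looks like this (sketched with SHA-256 rather than SHA-1; the `.dat` extension and two-level depth are illustrative). Each hex byte gives only 256 children per level, and identical content always maps to the same path, which is exactly the dedupe benefit and the overwrite risk described above:

```python
import hashlib

def hash_path(content: bytes) -> str:
    """Derive a storage path from the file's SHA-256: first two bytes pick the folders."""
    digest = hashlib.sha256(content).hexdigest()
    # 256 folders per level (one hex byte each); full digest as the file name.
    return f"{digest[:2]}/{digest[2:4]}/{digest}.dat"

# Identical content always produces the same path -- dedupe for free,
# but two distinct files that ever collided would overwrite each other.
print(hash_path(b"wedding photo bytes"))
```
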

The technique I’ve used many times is the GUID one, and it works well in most cases. The random number generator approach I’ve used for larger systems, using random numbers for folders, and a GUID for the file name. The hashing approach is very interesting. I think I might have to give it a try in a year or two when I have some spare time. I’d want to modify it to have a few thousand nodes per level, rather than just 256; and I’d want to handle collisions – but it has some really nice emergent features; and it makes good use of the hash I always store for file verification.

I haven’t touched on the approaches of segregating based on user ID and similar; since in my case where I need to store millions of files for a single company, this doesn’t apply. It may well apply quite nicely to your needs however.

Here are some simple rules to live by:

  • DO Compress stored documents – this will save you huge amounts of storage space
  • DO Compress transmissions – this will save you huge amounts of bandwidth, and speed up downloads
  • DO Support HTTP Resume – this will save you huge amounts of bandwidth
  • DO NOT store large numbers of BLOBs in a database. If anyone tells you to do this, then they haven’t handled large numbers of binary documents. This always seems like a good idea at the time, and never is. Seriously. NEVER.
  • DO Separate your path generation logic from your path lookup. In other words, don’t replicate your path generation on lookups. Rather store the generated path, and just read it. This allows you to move files around if you need to, rebase them, change your algorithm – a whole bunch of things.
  • DO NOT use MD5 for anything. Ever. No, not even path generation.
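The “store the generated path” rule in practice: generate the path exactly once at write time, persist it with the record, and have lookups read only the stored value. A minimal sketch, where the dict stands in for your database table:

```python
import uuid

documents = {}  # stand-in for a database table: doc_id -> stored path

def store_document(doc_id):
    # The path is generated exactly once, at write time...
    path = f"{uuid.uuid4()}.dat"
    documents[doc_id] = path
    return path

def lookup_document(doc_id):
    # ...and lookups only read the stored value, never re-run the generator.
    # Changing the generation algorithm later won't strand existing files.
    return documents[doc_id]

stored = store_document(42)
assert lookup_document(42) == stored
```
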

This article has been recovered from an archive from my old blog site. Slight changes have been made.