On stuff-ups

One of my team members recently made a flurry of mistakes in Production. When we chatted about it he seemed a bit downcast, so I told him not to worry too much: we’ve all made mistakes and we’ll all make more. The trick is not to keep making the same mistakes again and again.

So, I thought I’d share some of the tricks I use to make sure I don’t mess up Production. Please contribute any tricks you use too.

Avoiding SQL production mistakes

SQL mistakes have the fun attribute of being very, very quick, and affecting lots and lots of users all at once. To stop ourselves from making silly mistakes there’s a range of techniques we can use.

Clearly Identify Production connections

Most SQL development tools have a way of indicating that there’s something special about this particular connection.

SQL Server Management Studio allows you to set a color for each connection
SQL Server Management Studio showing the warning red of Production
LINQPad connection screen showing “Contains production data” checkbox
LINQPad query showing “Production Connection” warning

Always Rollback until you’re sure

begin transaction

-- Put your code here

rollback transaction

This is a good habit to get into and helps ensure the only effect you’ll have on Production is via locking/performance issues.
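
As a rough sketch of how this plays out in practice, you can look at the data before and after your change while the transaction is still open, and only then decide whether to commit. The customers table, its ID, name and website columns, and the values used here are purely illustrative.

```sql
begin transaction

-- State before the change
select ID, name, website from customers where ID = 456 and name = 'Sean Hederman'

update customers
set website = 'http://www.palantir.co.za'
where ID = 456
and name = 'Sean Hederman'

-- State after the change: this session can see its own uncommitted update
select ID, name, website from customers where ID = 456 and name = 'Sean Hederman'

-- Swap this for "commit transaction" only once both result sets look right
rollback transaction
```

Keep the window between begin and rollback short, since the updated rows stay locked until the transaction ends.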

Always have a WHERE clause

begin transaction

update customers
set name = 'Test'
where 1 = 0

rollback transaction

The first part of anything you write should be “where 1 = 0”; this again helps ensure that you won’t have any kind of effect until you want to. Use this technique for SELECTs as well.

Always use two independent WHERE clauses

You never want to assume that something is what you think it is just based on one piece of evidence. You must have at least two. For example, assume that you want to update the customer “Sean Hederman”, who has ID 456, to a web site address of “http://www.palantir.co.za”. The wrong way is:

begin transaction

update customers
set website = 'http://www.palantir.co.za'
where ID = 456

rollback transaction

The reason this is wrong is that you could be mistaken about who the ID refers to. Instead, the right way is:

begin transaction

update customers
set website = 'http://www.palantir.co.za'
where ID = 456
and name = 'Sean Hederman'

rollback transaction

This ensures that if the ID is wrong and refers to another customer, the update matches no rows and changes nothing, which is the behaviour we want.
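
Strictly speaking, a WHERE clause that matches nothing doesn’t raise an error in SQL Server; the update simply affects 0 rows. A minimal sketch of making that outcome loud, assuming you expect exactly one row to change (the messages are illustrative):

```sql
declare @rows int

begin transaction

update customers
set website = 'http://www.palantir.co.za'
where ID = 456
and name = 'Sean Hederman'

-- @@rowcount must be read straight after the statement it describes
set @rows = @@rowcount

if @rows <> 1
    print 'Expected 1 row, updated ' + cast(@rows as varchar(10)) + '. Check the ID and name.'
else
    print 'Updated exactly 1 row.'

rollback transaction
```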

Combine these techniques into a Data Fix Template

I strongly suggest creating a data fix template that fits the following pattern:

  1. Performs all the checks and updates in a transaction.
  2. Checks that the data before the update is as expected, using multiple where clauses for each check, and fails if it is not.
  3. Checks errors and rowcounts when it does the update. If you expect 3 rows to be updated, fail the script if more or fewer than 3 are updated.
  4. Checks that the data after the update is as expected, using multiple where clauses for each check, and fails if it is not.
  5. Prints out status messages about its progress and evidence through the various steps.
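
A minimal T-SQL sketch of a template following the five steps above might look like this. It reuses the customer example from earlier; the expected row count of 1, the error numbers and the messages are assumptions for illustration, and THROW needs SQL Server 2012 or later. Treat it as a starting point, not a finished template.

```sql
set xact_abort on;   -- any runtime error dooms the whole transaction
set nocount on;

declare @expected int = 1;   -- assumption: this fix should touch exactly one row
declare @rows int;

begin try
    begin transaction;

    -- Check the data before the update, using two independent conditions
    select @rows = count(*)
    from customers
    where ID = 456
    and name = 'Sean Hederman';

    print 'Rows matching before update: ' + cast(@rows as varchar(10));
    if @rows <> @expected
        throw 50001, 'Pre-check failed: unexpected number of matching rows.', 1;

    -- Do the update and check the row count straight away
    update customers
    set website = 'http://www.palantir.co.za'
    where ID = 456
    and name = 'Sean Hederman';

    set @rows = @@rowcount;
    print 'Rows updated: ' + cast(@rows as varchar(10));
    if @rows <> @expected
        throw 50002, 'Update touched an unexpected number of rows.', 1;

    -- Check the data after the update
    select @rows = count(*)
    from customers
    where ID = 456
    and name = 'Sean Hederman'
    and website = 'http://www.palantir.co.za';

    print 'Rows correct after update: ' + cast(@rows as varchar(10));
    if @rows <> @expected
        throw 50003, 'Post-check failed: data is not as expected.', 1;

    -- Evidence for the execution output
    select ID, name, website from customers where ID = 456;

    -- Change to "commit transaction" only after a reviewed dry run
    rollback transaction;
    print 'Rolled back (dry run).';
end try
begin catch
    if @@trancount > 0 rollback transaction;
    print 'Data fix FAILED: ' + error_message();
    throw;   -- re-raise so the failure shows up in the output
end catch
```

Leaving the script ending in a rollback by default means a reviewer can run it as a dry run and compare the printed evidence before anyone changes it to a commit.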

I then strongly recommend the following practices around the data fixes:

  • Get them peer reviewed by a DBA or SQL expert who knows the database in question.
  • Check them into source control, prefixed with the date of their execution.
  • Take the execution output (with the evidence it produces), and check that into source control with the data fix script.

I also like having the DBAs actually execute the scripts. This ensures that the developer doesn’t “fix” any minor mistakes out of sight of anybody.

Avoiding Server Production Mistakes

Server mistakes are incredibly common, usually because someone does something dumb.

Test before making changes

No change to Production (apart from maybe critical resolutions) should be made without first being tested in a similar environment.

Clearly identify Production servers on the desktop wallpaper

No brainer. Your Production servers should be obviously Production servers. Using a different colour background in addition to text is advisable.

Clearly identify Production servers in Remote Desktop Tools

If you have a remote desktop tool that shows a list of servers you can access, clearly identify and separate which servers are Production servers.

Automate all the things

The easiest way to not mess up Production servers is to not go into Production servers. The best way to do that is to automate all the tasks you do in Production. Going in to clear out log files? Automate it. Going in to deploy new versions of your software? Use an automated deployment tool like Octopus Deploy. Going in to look at log files? Have them aggregated into a tool like Splunk.

Having reliable Change Requests that tell you why people are logging in helps you decide what needs to be automated next.

Make sure you have strict controls over your automation

There is nothing that can stuff up your Production environment as fast or effectively as an automated system. Make sure that these tools require authorisation, and that the authorisers check that the right tool is being applied to the right environment.

Installs are evil

I made this mistake yesterday. Happily it wasn’t a front line Production server, but it was our source code server, so it annoyed the devs quite a lot.

Simply put, some installs break things. An install should be treated like any other deployment. Testing must be done, evidence must be provided, a change must be logged.

Check that DRP/backup is working first

Just in case what you’re about to do breaks stuff, do you have a backup plan if the rollback fails? Another machine? A backup?

Always backup first

You’re about to make a change to a directory? Back it up first. You’re going to make a change to the server installation? Back up the server. Making a big change to a database? Snapshot or backup is your friend.
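
For the database case, a rough T-SQL sketch of both options follows. The database name CustomerDb, the logical data file name CustomerDb_Data and the D:\Backups and D:\Snapshots folders are all hypothetical, so substitute your own. A snapshot needs one file entry per data file in the source database, and snapshot support depends on your SQL Server version and edition.

```sql
-- Option 1: a copy-only full backup, which leaves the regular backup chain alone
backup database CustomerDb
to disk = N'D:\Backups\CustomerDb_before_fix.bak'
with copy_only, checksum;

-- Confirm the backup file is readable before relying on it
restore verifyonly
from disk = N'D:\Backups\CustomerDb_before_fix.bak'
with checksum;

-- Option 2: a database snapshot, which is quick to create
create database CustomerDb_before_fix
on (name = CustomerDb_Data, filename = N'D:\Snapshots\CustomerDb_before_fix.ss')
as snapshot of CustomerDb;
```

If the change goes wrong, the database can be reverted with restore database ... from database_snapshot, but note that this discards everything written since the snapshot was taken, not just your change.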

Make sure you back up everything [New]

Do you know everything that needs to be backed up? Program Files directory? ProgramData directory? Per user app settings? Registry? I literally just made this mistake with a TeamCity upgrade: I backed up Program Files and not ProgramData.

Happily in my case we do take full backups of the server so I will get this mistake resolved, but a full server restore is a lot slower than if I’d just backed up the right directories in the first place.

General rules to avoid Production Mistakes

Here are some general rules of thumb for avoiding/recovering from mistakes in Production.

Step away from the keyboard

The moment you realize you’ve made a Production mistake, the temptation is to dive in and fix it immediately. This is a very big mistake. You probably made the mistake because of a misunderstanding or faulty assumption, and rushing back in, in a panicked state, is likely to worsen your error, not make it better.

No matter how confident you are in the fix, step away.

Do NOT hide the mistake

Under no circumstances should you hide the mistake. You should immediately, and calmly, let the relevant people know about your error. This protects you, it provides a team of people to help you think through the problem resolution and verify your thinking, and it protects Production.

Hiding a Production mistake will often get you fired. Making a Production mistake will only rarely get you fired.

Note down the facts

Write down what you did, and why you thought it was a good idea at the time. Include any evidence you may have that backs up the assumptions you made that caused the mistake. In most cases, mistakes in Production are caused by:

  • Faulty assumptions (e.g. assuming you’re not dealing with production)
  • Faulty thinking
  • Finger trouble
  • Misleading evidence
  • Insufficient evidence
  • Stupidity (this is qualitatively different from faulty thinking)

It’s important to note down why you made the mistake, so that lesson can be used to close the gap in your Production protocols.

Don’t panic

Spend a bit of effort to keep calm. Often panic produces even worse thinking and can also cause you to miscommunicate the problem to the team. Take some deep breaths, have a glass of cold water, and calm yourself before doing anything of importance.

People make mistakes all the time, it’s how we learn. This is a learning experience, a crap one to be sure, but still something that is going to make you better at what you do.

Don’t get jaded

If you often deal with Production, it is easy to become jaded and lackadaisical. This is human nature: we become acclimatised to risk we encounter often, even incredibly dangerous risk. For example, pedestrians on our highways are a major problem in South Africa. Research has shown that the first time pedestrians cross the highway they are fearful, full of adrenaline, and fully appreciative of the risk of death. They usually only do it in special circumstances. However, the next time, their appreciation of the risk decreases, and they widen the circumstances under which they will take the risk because they perceive the risk as less.

It hasn’t got less, they just perceive it as less due to acclimatisation. This is why South Africa has a very high pedestrian death rate: we put people on the other side of highways from their livelihoods, and provide too few crossing points. This results in many, many people becoming acclimatised to a deadly risk.

Production must always be treated with respect and care. Psych yourself up whenever you’re about to touch Production: give yourself a talking-to about the importance of being certain about everything you do.

Don’t use common logins

A lot of organisations use shared login accounts to access Production. This is a Very Bad Idea™. The reason is simple: when someone stuffs up and doesn’t own up to it, how will you know who did it? Related to this is that login and file auditing should ideally be turned on, especially on servers containing sensitive information.

Separate Production access logins from normal logins

Make sure that users cannot log into Production directly with their normal logins. This is security best practice, but for our purposes the main reason is psychological: it forces the users to be aware that something different is going on.

Invest in a Privileged Access Management system

A Privileged Access Management (PAM) system doles out connections to Production. So basically, no-one can log into Production without asking the PAM system for a login. This allows you to use “common logins” because now you have a record of who “owned” that login at a point in time. Most PAM systems allow you to remote desktop directly from the PAM system. All of the ones I’ve seen also support password provision, where they control the common login password and provide the user a “temp password”.

Why would you do this rather than just give everyone a second login to Production? Well, one reason is ease of administration. Maybe you don’t want to manage a huge number of Production logins. Another reason is that a normal login can be used at any time, whereas a PAM controlled login can be part of an approval process.

My preference when using a PAM is to have two main logins for Production: a read-only user which does not require approval, but is simply tracked, and an admin user which requires an approval process. Normally said process would also require a logged Change Request. This is to ensure that every change to Production is logged and recorded, so we can analyse the reasons and drive them down as much as possible.

Have a monitoring system

No brainer, but its absence is surprisingly common. Which servers are running out of memory, CPU, or disk space? Which services have stopped? Which scheduled tasks failed? What you want is a set of alerts which notify you when something is badly wrong. You don’t want alerts about everything because otherwise you won’t see the wood for the trees.

A big dashboard with red for when things go wrong, and green when everything is okay is also a must. In this way you get quick visibility that a problem has happened.

There’s nothing worse than making a mistake in Production and not knowing about it until the customers start complaining.