These days all the rage is continuous deployment. Once development is “complete”, you push a button, (insert automated process here), and boom, the code is in production. I applaud shops which can reach this stage. It requires extreme devotion to automated testing and scripted processes. I haven’t worked with that sort of a system. However, many of the same principles apply to deploying the old fashioned way.
The One and Only
First, let’s get this out of the way. There is no “right” deployment process– rather, there are merely multiple options with different advantages and disadvantages depending on the environment. You’ve heard this before: It depends. I would hope that the process around upgrading the firmware on a pacemaker looks a lot different than what you followed when you changed the colors on your personal blog. And that’s a good thing. We live in a world of limited resources. Spend them where they matter. Take more care only where there is more risk.
The discussion below is from the perspective of a system with most behavior controlled by database-level code, where the interface has little to no logic, though many of the ideas are broadly applicable.
A cheap way to reduce risk is to simply do less– at a time. The more changes you wrap together into a launch, the more things can go wrong. Most new development involves new things– new tables, procedures, functions, and views. Those can all go live without affecting users, so do it. Next, there are often changes which shouldn’t affect users– code optimizations, new columns, optional parameters. Put those live and catch any errors when they are easy to fix, before they’re entangled with more complex changes.
This technique is more effective when it’s baked into the development process. When coding, you will always notice something which could be done a little differently (“better”, in your mind). It’s oh-so-tempting to just make that quick change in the middle of whatever else you were doing. Succumb and you have just increased your development footprint (even if you’ve simplified the code) and therefore increased the risk of error. If you start trying to convince yourself it’s trivial, then use that momentum to drive it as a separate change. Stash your other changes, adjust that small section, and get your tweak live. Then you can return to your originally scheduled programming with one less thing to worry about.
It’s not just about exercising restraint, though. It’s also about taking small additional steps to minimize the impact of new development. If you are making a new stored procedure parameter, can you provide a default so that its omission will make the procedure act as before? If so, you can now deploy that change without affecting any consuming code. If you are changing the name of a column, can you create a computed column with the old name which simply mirrors the original column’s data? All insert- and update-related code would still need to change, but you could save the SELECT statements for the next round.
Ideally, these steps are transitory. Optional parameters can be made mandatory once all of the consuming procedures have been adapted. The legacy-named computed columns should be deleted once all old queries are updated. But recognize that this might not happen. Other priorities come up, and these supposedly temporary constructs can linger.
Just because code is in production doesn’t mean it has to run. Ideally, every new behavior in a program could be toggled on and off quickly and easily through configuration without any code changes. Even if you don’t have a system to manage these switches, you can directly circumvent your code by wrapping it in impossible conditions (such as IF 1 = 0). This allows you to put your actual code into production without affecting users, making final implementation a quick change which can be more easily coordinated with other processes or people. Importantly, it also makes turning it off easy, in case you encounter problems.
This is another practice which accumulates complexity over time. Each switch doubles the possible paths through your logic, growing exponentially (literally). But coupled with a system for users to manage these settings, it empowers the business to experiment with different behaviors and control their own destiny at their own speed. Of course, they can also configure themselves into a corner and cause all sorts of problems.
Your development environment can get pretty messy. With application code, source control branching helps you manage things, though you have to work through the occasional merge conflict. With database structure and data, multiple developers are often integrating changes into a single database server instance. Understanding which changes are related and must travel to production together is a challenge. (Having frequent implementations helps mitigate this issue.) We’ve found one of the most frequent implementation errors is omitting an object (especially when the collection spans databases). Having an environment between development and production allows a set of development to be tested in isolation from other changes, and moving to that environment serves as a trial run for the move to production.
Inevitably, someday something will go wrong. You will need to go back to the way you had things. If you are working with source control, getting your application code back the way it was is not a problem. However, database code is often not included. There are tools which automate getting your database schema into source control, but you can at least script out the affected objects from production and be prepared to re-apply them in case something goes wrong. Of course, you have to revert them in the right order when dependencies are involved. Small deployments make these recoveries easier.
Code is the easy part, though. Sometimes new features require data changes. At a certain scale, those can be scripted out as well and treated as code, but it can be voluminous (you’re essentially importing data into your codebase). It’s often easier (especially with manual deployments) to simply create a backup of the data. We’re not talking about an actual database backup, because then to get at the affected data, you’d need a special tool or a full restore to another environment. Instead, you can do a quick an easy snapshot of particular data with a single line of code. You’re going to mess with TableA? `SELECT * INTO TableA_Backup FROM TableA` Be careful for concurrency issues (you probably wouldn’t want to do this on large, active tables) and recognize that your data becomes progressively more stale with each passing moment. Ideally, you would have your rollback scripts created in advance, but this can be challenging with large data changes, and you might be forced into emergency data wizardry.
The best way to be able to be able to revert your changes is to have another production-like environment waiting in the wings. If you provision a backup production database using log shipping (with the secondary in read-only mode), your previous schema and data will be available for copying whenever you might need it. This can also be handy for all sorts of mistakes. (Consider setting up a linked server to be able to quickly execute cross-server queries without having to use an SSIS package or Import/Export wizard to move the data.)
For over a decade, we’ve been using RedGate’s SQL Compare and SQL Data Compare for schema and data comparisons across instances (development, staging, production, etc.). Implementing schema changes without a comparison tool is a recipe for failure. The tool will detect all changes in all objects and create a script to implement the changes in that target environment, respecting the order required for dependent objects. (There’s even one built into the free SQL Server Data Tools (SSDT) from Microsoft.)
If you don’t have a backup production environment which you can use as a source for rolling back your changes, you can use SQL Compare to create a schema-only snapshot for your production environment (prior to changes) which can be used a base for comparison. You can even use these snapshots as a rudimentary form of primitive source control (though Redgate has tool for that too). You can also get an HTML version of your schema changes which can be used for code review and change logging.
If you are having trouble finding the resources to create development and staging environments, but somehow have enough money for SQL Clone, you can use it to easily create as many environments as you’d like (including one for each developer), all based on a backup of production data.
Finally, when you are ready for the next step, Redgate also has a continuous integration and deployment tool.
Even if everything is fully automated, you should carefully consider when you implement your changes. If your changes are disruptive, you may be tempted to make them during off hours. If you are making the changes manually, that means you’re staying up late or getting up early, so you will likely not be operating at your peak. If you are staying up late, you will be sleeping after the event, which means that by the time you wake up, users will already potentially be encountering errors. So then you’re fixing things from home or cursing rush hour with extra fervor, all while under stress. You might have also had to act during your maintenance window, so your backups or index changes might happen at inopportune times.
If you hit “Deploy” during the day, you might cause problems for more users, but you will also get immediate feedback and be able to manage the incident with a full IT staff during their normal hours. If you start early in the day and experience an extended outage, business is affected for most of the day and workers might not have other things to work on. If you start late in the day, workers can avoid the brunt of the impact by simply leaving early, but that might mean you’re working late to fix things.
Obviously the best time will depend on the type of changes and the needs of the business, but don’t discount the human effects (on business users and IT staffs).
No matter how many precautions you take and how you mitigate the risk, eventually something bad will happen. And when it does, you will be faced with a choice: do you fix the problem quickly in production, or do you roll back and put the fix through your normal process? As with everything else, the best answer depends on your technology stack and your business.
For me personally, it’s hard to resist the temptation to keep moving forward. Rolling back feels like admitting defeat. But no one’s going to die if we make the problem worse in our business. Let the consequences and your conscience be your guide.