In this post, I’m going to lay out the basis of version control when dealing with databases. Before starting, it’s worth pointing out that some systems don’t have version control for their databases; and, similarly, some systems (maybe not the same systems) don’t have version control for their code either.
This article is not an attempt to persuade you (or your company) to adopt version control for either; it just outlines the possibilities and concerns if you do.
What is Version Control?
Let’s start by outlining exactly what is meant by version control in this article. I’m sure there are better definitions, but for me, something that is version controlled should have the following properties:
- It should be possible to recall the state of a specific file at any time in the past, since version control began.
- It should be possible to recover the state of the entire system to any time in the past, since version control began.
- It should be possible to deploy from the files held in version control. (It’s not really the responsibility of the version control system that you choose to ensure that the deployment works - but that would be nice, too.)
- It should be possible to determine what changes were made, when, and by whom.
I’ve purposely left out things like conflict resolution, branching, etc. Without those (admittedly essential) features you could, I feel, still refer to a system as source controlled - but not without the points above. These are also the facets relevant to this post.
Why is it Different for Databases?
Most source controlled code files - in fact, I’ll go as far as to say all of them - represent the current state of that software. For example, let’s imagine that we have the following file:
MyFile.cs (v1):

```csharp
public void PrintSomething() => Console.WriteLine("One");
```
Now, let’s imagine that version 2 of that file looks a little different:
MyFile.cs (v2):

```csharp
public void PrintSomething() => Console.WriteLine("Two");
```
I can deploy version 1, then deploy version 2, then back to 1. In fact, I can switch these out all day long, and all that will happen is that the software will behave differently.
Now let’s imagine the following DB File:
MySqlFile.sql (v1):

```sql
ALTER TABLE MyTable ADD NewColumn VARCHAR(10);
```
And let’s imagine version two of that file:
MySqlFile.sql (v2):

```sql
ALTER TABLE MyTable DROP COLUMN NewColumn;
ALTER TABLE MyTable ADD NewColumn2 VARCHAR(50);
```
You can deploy these two files in order, but you cannot just restore v1 after deploying v2. Or, to be specific, you can, but you will not end up with the system in the state that you expect (that is, NewColumn2 will still be there). Further, you can’t switch back and forth, as executing the second file twice will error. Finally, and most importantly, these columns may contain data (if they don’t, then why are you creating them in the first place?).
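To make the repeatability problem concrete, here’s a small sketch using Python’s built-in sqlite3 module as a stand-in for a production database (the table and column names follow the article’s example; sqlite3 is just a convenient illustration):

```python
# Sketch of the repeatability problem: the same migration script cannot
# simply be replayed against a database that has already seen it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY)")

migration = "ALTER TABLE MyTable ADD NewColumn VARCHAR(10)"
conn.execute(migration)  # first deployment succeeds

err = None
try:
    conn.execute(migration)  # second run fails: the column already exists
except sqlite3.OperationalError as exc:
    err = exc
print("Re-running the script errors:", err)
```

Any database deployment tooling has to account for this in some way - either by tracking what has already run, or by making each script safe to run repeatedly.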
We can boil this down to two specific problems that we’re faced with when specifically dealing with database source control:
- Data Loss - as you move between versions, there should be a way to manage the data.
- Repeatability - it needs to be possible to move from a version of the database and back to that version of the database without causing the deployment to crash.
Obviously, the other requirements stated above are still applicable, but these are additional. In fact, these requirements also exist for all source control - but you might not typically think to include them, because they are understood as a kind of baseline.
Both of these problems boil down to a single issue, which is state - deploying a database affects the state of the system, which, in turn, means that the baseline for deployment changes each time: this isn’t true for software - destroying and re-creating a software environment is a common (expected) activity.
What Approaches Are Available?
Let’s start with the fundamental approaches, and then we can deal with the specifics. The two options available are state and migration driven.
For the state based approach, the idea is to take the target system, or a representation of that system, and compare that to the system as it stands, and then generate a change based on the difference between the two. Let’s see how we could do that manually based on the above example; here’s version 1:
```sql
CREATE TABLE MyTable (
    ID INT PRIMARY KEY,
    Column1 VARCHAR(100),
    NewColumn VARCHAR(10)
);
```
And version 2:
```sql
CREATE TABLE MyTable (
    ID INT PRIMARY KEY,
    Column1 VARCHAR(100),
    NewColumn2 VARCHAR(50)
);
```
Since I have the definition of both systems, I can generate a script that moves between the two; for example, if I’m at v2 and I want to move to v1, I could generate the following script:

```sql
ALTER TABLE MyTable DROP COLUMN NewColumn2;
ALTER TABLE MyTable ADD NewColumn VARCHAR(10);
```
I’m purposely ignoring the data aspect of this for now, but will come back to that later.
I generated this script manually, but there are a number of automated tools that can do it for you. If you’re using SQL Server, then SQL Server Data Tools will do this; Redgate make their living out of providing such tools; and there are a few other options depending on what type of DB you’re using.
At all times, I have an accurate representation of the entire system and what it should look like - in the same way that a series of code files represents a target system.
Before we move on from this approach, we need to mention possibly its best feature: let’s imagine that, unbeknown to me, someone has changed Column1 to be a 99 character field, instead of 100; suddenly, the generated script would change, highlighting this fact.
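As a rough illustration of what such comparison tools do, here’s a toy diff generator in Python. The schema-as-dictionary representation and the limited ALTER handling are my own simplifications for the sketch - real tools work from the live catalog and cover far more than columns:

```python
# Toy state-based diff: schemas are represented as column-name -> type maps,
# and the output is the script that moves the current schema to the target.
def diff_schema(current: dict, target: dict, table: str = "MyTable") -> list:
    statements = []
    for col in current.keys() - target.keys():   # columns to remove
        statements.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    for col in target.keys() - current.keys():   # columns to add
        statements.append(f"ALTER TABLE {table} ADD {col} {target[col]};")
    for col in current.keys() & target.keys():   # columns that drifted
        if current[col] != target[col]:
            statements.append(f"ALTER TABLE {table} ALTER COLUMN {col} {target[col]};")
    return statements

v1 = {"ID": "INT PRIMARY KEY", "Column1": "VARCHAR(100)", "NewColumn": "VARCHAR(10)"}
v2 = {"ID": "INT PRIMARY KEY", "Column1": "VARCHAR(100)", "NewColumn2": "VARCHAR(50)"}

# Moving from v2 back to v1, as in the article:
for stmt in diff_schema(current=v2, target=v1):
    print(stmt)

# Drift is highlighted too: if someone quietly made Column1 99 characters,
# the generated script changes to put it back.
drifted = dict(v1, Column1="VARCHAR(99)")
print(diff_schema(current=drifted, target=v1))
```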
Migration is almost the opposite of the state driven approach. A migration driven approach can be very simple and, arguably, requires no tools whatsoever: you simply maintain a list of changes to the system since its inception. In our example above, the script might look like this:
MySqlFile.sql:

```sql
ALTER TABLE MyTable ADD NewColumn VARCHAR(10);
ALTER TABLE MyTable DROP COLUMN NewColumn;
ALTER TABLE MyTable ADD NewColumn2 VARCHAR(50);
```
If the file itself is version controlled, then we can recover the state to a point in the past. This isn’t ideal, as it would mean tearing down and rebuilding the entire data structure. However, some migration tools, such as Entity Framework, will generate a script to move to the next or previous version from wherever you are (in fact, EF does exhibit some features of a state driven approach).
There are dozens of other tools available for this approach; I’ve previously written about DbUp. All of these tools work in the same basic manner: they store the state of the system inside a database table, meaning that they can track which scripts have run. Some of them, like EF, provide a means to move down the version list as well as up, but to do this, you essentially have to provide the reverse of each migration script.
Finally, most of them require (or at least strongly suggest) that each script is idempotent.
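Here’s a minimal sketch, in Python against sqlite3, of how such tools typically work: a journal table records which scripts have been applied, so a deployment only executes what it hasn’t seen before. The table and script names are invented for illustration, and the second script is trimmed to a plain ADD so it runs on any SQLite version:

```python
# A minimal migration runner in the spirit of tools like DbUp.
import sqlite3

MIGRATIONS = [
    ("001_add_newcolumn.sql", "ALTER TABLE MyTable ADD NewColumn VARCHAR(10)"),
    ("002_add_newcolumn2.sql", "ALTER TABLE MyTable ADD NewColumn2 VARCHAR(50)"),
]

def deploy(conn: sqlite3.Connection) -> list:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS SchemaVersions (ScriptName TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT ScriptName FROM SchemaVersions")}
    ran = []
    for name, sql in MIGRATIONS:
        if name in applied:
            continue  # already journalled, so skip - deployment becomes repeatable
        conn.execute(sql)
        conn.execute("INSERT INTO SchemaVersions (ScriptName) VALUES (?)", (name,))
        ran.append(name)
    return ran

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY)")
print(deploy(conn))  # first run: both scripts apply
print(deploy(conn))  # second run: nothing to do - prints []
```

The journal is what makes re-running the deployment safe; without it, the second run would hit the duplicate-column error shown earlier.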
There are some downsides to this approach; for example, if the source or target system is changed outside of source control, it will, at best, crash the deployment and, at worst, go unnoticed. Further, there are few checks and balances: it’s easy to make a mistake in SQL and, although the code can be put through a code review process, damage can occur that can be difficult to rectify.
For both approaches, the biggest issue, and danger, is that you make an unintended change that destroys data. Some of the available tools can help with this, but one approach is to follow a process whereby you create first; for example, imagine that we’re renaming Column1 to Description:

```sql
ALTER TABLE MyTable ADD Description VARCHAR(100);
UPDATE MyTable SET Description = Column1;
```
We now have a system with two columns: Column1 and Description. Once this is deployed, we can remove Column1 at a later date; deferring the removal means that data is much less likely to be destroyed. This technique applies to both the state and migration driven approaches.
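Played out against sqlite3 (again, purely as an illustrative stand-in), the “create first” phase looks like this - note that Column1 survives the deployment intact:

```python
# The "create first" phase: add Description and copy the data across,
# leaving Column1 untouched so nothing is destroyed until a later,
# separate deployment drops it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, Column1 VARCHAR(100))")
conn.execute("INSERT INTO MyTable (ID, Column1) VALUES (1, 'important data')")

conn.execute("ALTER TABLE MyTable ADD Description VARCHAR(100)")
conn.execute("UPDATE MyTable SET Description = Column1")

row = conn.execute("SELECT Column1, Description FROM MyTable").fetchone()
print(row)  # ('important data', 'important data') - both columns populated
```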
Let’s imagine that, in the above example, we have some code that references Column1; we would need to change that code, but the order of the deployment would matter. Typically, the way to approach this is to deploy the DB change (as above) first - that is, you would have the two columns - and then change the software. Although this can be a more difficult problem than it initially seems; what if the code in question looks like this:
```csharp
string updateQuery = "UPDATE MyTable SET Column1 = @NewValue WHERE ID = @ID";

using SqlConnection connection = new SqlConnection(connectionString);
using SqlCommand command = new SqlCommand(updateQuery, connection);

command.Parameters.AddWithValue("@NewValue", newValue);
command.Parameters.AddWithValue("@ID", id);

connection.Open();
int rowsAffected = command.ExecuteNonQuery();
. . .
```
If we deploy the DB change first, but this code executes before the software change is enacted, we end up with a delta. In this situation, that can be overcome by having a two-phase change (i.e. update the DB, update the software, then update the DB again) - however, that’s then so tightly coupled that you might as well be making the change manually.
Other possible solutions here include adding something akin to a trigger, so that the two columns are temporarily linked - although the state of the software is then bound to the state of the database.
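A sketch of that trigger idea, again using sqlite3 as a stand-in (trigger syntax varies by database, and the trigger name here is invented):

```python
# Linking the old and new columns with a trigger: old code that only
# writes Column1 keeps Description in step until the software catches up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, "
             "Column1 VARCHAR(100), Description VARCHAR(100))")
conn.execute("""
    CREATE TRIGGER SyncDescription AFTER UPDATE OF Column1 ON MyTable
    BEGIN
        UPDATE MyTable SET Description = NEW.Column1 WHERE ID = NEW.ID;
    END
""")
conn.execute("INSERT INTO MyTable (ID, Column1, Description) VALUES (1, 'old', 'old')")

# Old code, unaware of Description, updates only Column1 ...
conn.execute("UPDATE MyTable SET Column1 = 'new value' WHERE ID = 1")
# ... and the trigger mirrors the write into the new column.
print(conn.execute("SELECT Description FROM MyTable WHERE ID = 1").fetchone()[0])
```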
Entity Framework addresses this problem quite nicely, in that you can force it to update the DB as the software starts.
This certainly wasn’t an in-depth look at the subject, but I think it lays out the main issues with database version control. It’s worth pointing out that, in most cases, this is massively overthinking the problem: for the vast majority of situations, using something very simple like DbUp will solve the problem for you, and you’ll never need to think about it again. However, databases do have a tendency to grow in size and complexity - and these problems are easier to solve on a small database than retrospectively on a large and complex one.