Surrogate Keys, Primary Keys and IDENTITY
by Brian Walker

The debate about surrogate keys is a heated argument (one of several) with no end in sight.

It's hard for me to respect the opinions of those who argue with a near religious fervor, especially when they almost always resort to demeaning others who have different opinions. They seem to be too consumed with ego to recognize that they make themselves look very hateful and/or very arrogant. I believe that a person who is willing to deny or accept severe problems with their self-representation is also likely to deny or accept severe problems with their technical judgment. I wish the debate could be civil and rational, but it never seems to happen that way.

Dale Preston, who I had never heard of before, is apparently an example of the above. With regard to surrogate keys he states "Not that I am wrong - I am not - but that there is so much debate..." in his blog. He further suggests that only he and Joe Celko really understand the issue. He emphatically ends two of his points with "Period." as the next sentence. He clearly believes that his opinions are right, his opinions are irrefutable, and there should be no discussion on the matter. I could be wrong, but those beliefs sound very arrogant to me.

Those who favor natural data keys generally support their dogmatic arguments with theoretical ideas while ignoring technical merits. They often spout theory as though blind promotion of theoretical ideas somehow makes them practical facts. They love to talk jargon while pretending that every term has an extremely precise meaning, but their favorite expression may be "by definition" and they repeat it quite often. Their world seems to consist of a utopia of logical models while others are left to deal with the reality of physical existence. They very often mention duplicate rows as an inherent problem with using surrogate keys as though a simple mechanism (such as unique constraints) to prevent them does not exist, and as though natural data key systems are somehow immune from the same software development errors. They refer to an IDENTITY value as a "physical locator" or "file position pointer" even though it's several levels removed from such concepts, but the descriptions support their disdain for anything "physical" and defamatory labels serve to make surrogate keys sound terribly obsolete. The blog by Preston and the post by Celko (referenced by Preston) are both typical of proponents of natural data keys.

Those who favor surrogate keys generally take a much more pragmatic approach. They usually do not preach or lecture. They may feel strongly that their methods are better for various practical reasons, but they normally do not belittle those who prefer natural data keys. They acknowledge that surrogate key systems are not a perfect solution. They do not view the two approaches as being mutually exclusive. They do not view the other method as a form of pollution in a pristine world. They do not enter database discussions with a propensity to vigorously and aggressively defend intellectual territory. They simply want to implement reliable and efficient databases.

I think there is a dramatic difference in the tone demonstrated by the two sides of this debate.

I think there are advantages and drawbacks to both natural data keys and surrogate keys.

Natural data keys make bulk import of child rows a little easier. The child rows can often be imported directly into their destination table. The same task in a surrogate key system often requires an extra step involving a staging table. The staging table provides an opportunity to look up the surrogate keys of the parent rows (to become foreign keys in the child rows) using candidate keys within the natural data. There's nothing wrong with taking advantage of the relationships that may exist between two sets of natural data. I'm not aware of anybody that has ever proposed ignoring the candidate keys. However, the natural data candidate keys are not necessarily the best choice for primary keys and foreign keys in the physical implementation of most databases.

Natural data keys may mean fewer joins in a complex query. The natural data primary keys of parent rows exist in all the corresponding child rows as foreign keys, so there may be no need to perform joins to the highest level parent tables. However, this situation is also a compelling reason to avoid natural data keys. The mere fact that natural data exists in multiple places seems like a violation of basic normalization rules. If/when the natural data changes the change must be cascaded down through layers of dependent tables. In addition, the natural data keys, potentially several columns wide, will require significant storage space in the physical implementation of most databases.

The proponents of natural data keys say we should use industry standard codes as primary keys. That's supposed to minimize the need to change key data and possibly minimize the width of foreign keys. I think it would be very unlikely to find industry standard codes for anything more than a few high level tables in most databases. It's quite probable that most of the lower level tables would have composite primary keys. The lowest layers could easily have composite primary keys consisting of many columns. I work with a mission-critical corporate database every day. The database is of only modest size, but yet it's up to 14 levels deep. There are few applicable industry standard codes. Most of the lower level tables would require very wide natural data primary keys consisting of several columns. The table that contains the largest number of rows would require a composite primary key consisting of six columns.

These issues would be very worrisome if the corporate database were using natural data keys. It's not. It's using single column surrogate keys. Every table has an integer column with the IDENTITY property and that column is the primary key. Every foreign key is the single column primary key from the parent table. Every join between a parent table and a child table can be coded with one predictable column from each table. This architecture provides substantial benefits. The primary key values never change, so foreign key values are stable. All of the primary key and foreign key indexes are compact, which is very good for join performance (there are some additional indexes on selected natural data columns). The SQL code expression of a relationship between any two tables is simple and very consistent.

I believe that surrogate keys have stability and performance advantages over natural data keys in most of the current relational database systems. However, it's the consistency of universally applied surrogate keys that makes the architecture so appealing to me. A given query in a natural data key architecture might join together eight tables with the join structure being radically different between each pair because of how the relationships (foreign key references) are defined. The same query in a surrogate key architecture might join together 10 tables, but the join structure would be short and predictable between each pair.

The consistency of a surrogate key architecture makes life easier for developers. I believe they can write data retrieval code faster and with fewer bugs. With surrogate keys there is only one column from each table involved in most joins, and that column is obvious. With natural data composite keys there are several columns from each table involved in most joins, and those columns are not obvious. Surely that must make it more tedious to write the join structure and more difficult to get it correct, and forgetting to include one column in the join could have disastrous consequences. Many DBAs become indignant with the thought of making architecture decisions with developers in mind. I see nothing wrong with decisions that do not compromise the database and allow developers to be more productive.

The consistency of a surrogate key architecture makes it practical to have generic routines that perform very powerful data manipulation for maintenance purposes. Among many other things, such routines allow DBAs to perform bulk operations (copy/delete/merge) on sets of relational data with a simple stored procedure call. The popular database utility applications do not even attempt such operations because many of the existing databases do not possess sufficient consistency in their architecture. A consistent architecture also makes it easier to create a variety of code generation tools and administrative tools.

I strongly favor using surrogate keys over natural data keys, and an integer column with the IDENTITY property (or a similar feature on other database platforms) makes a great surrogate. There are several practical reasons for the preference including stability, performance, and simplicity. However, my primary reason for the preference is the architectural consistency that surrogate keys provide. It's the consistency that paves the way for some very desirable features and benefits.