Surrogate Keys, Primary Keys and IDENTITY Revisited
by Brian Walker
I would like to revisit some of the points I made in my article. I have no illusion of being able to convince anybody to change his/her mind on this matter because it's very apparent that will not happen. My goal here is to further comment on some technical concerns and the debate itself. I refer to the surrogate keys versus natural data keys debate as a subjective matter because there is clearly a difference of opinion. Obviously there are many databases of each type operating quite successfully, despite the assertion by one side of the debate that the opposing approach does not work.
Joe Celko wrote in his article:
"IDENTITY is an exposed PHYSICAL locator."
"The value is created by looking at the internal state of the hardware ..."
"The purpose is to fake a sequential file's positional record number, so I can reference the PHYSICAL storage location."
An IDENTITY value is not derived from any hardware state, it's not derived from any operating system state, and it's not derived from the state of the SQL Server application. You could say that it's derived from the state of the table at the time of insertion. Envision a Brand X server box running Windows 2003 Server and SQL Server 2000 Enterprise. Envision a Brand Y server box running Windows 2000 Server and SQL Server 2000 Standard. If you copy a SQL Server database from one to the other the generating of IDENTITY values is unaffected. A new row inserted into both copies of a table will be assigned the same IDENTITY value.
An IDENTITY value is not a positional record number and it does not reference a physical storage location. If you have a table with no clustered index new rows are inserted where sufficient free space exists. There is no intentional correlation between the physical location of a new row and the IDENTITY value generated for the new row. If you add a clustered index to the table it physically rearranges the rows of data by copying them to completely different data pages in the order of the index values. Despite this physical rearrangement the IDENTITY value for any given row never changes.
Celko has been told that his statements about IDENTITY values are false. It's been stated many different times and by several different people. Yet he persists in making the same false statements. Does that indicate he is unwilling/unable to learn the facts or he is aware of the facts and chooses to ignore them? When he is categorically wrong about this simple objective matter it's difficult to believe that he could be right about related complex subjective matters.
Celko points out in his article that IDENTITY values are natural numbers (1, 2, 3, and so on). Would his arguments implode if the column property was defined as IDENTITY (-1,-1)? He also points out that gaps in IDENTITY values can develop, but he may be the only person who cares about such gaps. I do not care about gaps in IDENTITY values any more than I care about gaps in Social Security numbers. The actual values of surrogate keys are irrelevant and have no meaning, but Celko tries to impose an ordinal meaning on them for the sake of his arguments.
It may have been only a grammatical error in Celko's article, but he seems to say that a unique constraint is procedural code. He further seems to say that a unique constraint on a candidate key renders an IDENTITY value primary key redundant. I'm not sure how one could make the other redundant when they serve two different purposes. The IDENTITY value is used to provide a stable and efficient table join column. The unique constraint is used to ensure that a candidate key is unique.
Some say that a natural data key system has the advantage of inherent enforcement of referential integrity. They say it's simply impossible to insert a child row that's a mismatch with the parent row. Is it really? Some say that a surrogate key system has the downfall of providing no referential integrity. They say it's alone in being susceptible to inserting a child row that's a mismatch with the parent row. Is it really? Envision an application with a grid of parent rows. Click on a parent row to enter a new child row. The entry form displays the appropriate parent row data and accepts input of child row data. Click the Save button and all is well, right? Yikes! The developer has a bug in his/her code. The application prepares to save the child row and accidentally grabs data from the parent row immediately above the correct row in the grid. The new child row contains a foreign key connecting it to the wrong parent row! Does it make any difference at all whether the data grabbed incorrectly is a 4-column natural data key or a 1-column surrogate key?
I have no doubt that a natural data key system works. I have no doubt that a surrogate key system works too, and it offers several advantages. I have mentioned stability, performance, and simplicity as among the benefits. However, I see consistency as the biggest benefit. A consistent key architecture, when applied universally, makes it practical to accomplish some wonderful things.
Envision yourself as a DBA for a company. Your company has other companies as customers. Each customer has several contacts. Each contact has several addresses and phone numbers. Each phone number has a call log. Each customer places many orders. Each order includes products. Some products have a service log. Each log entry can involve several technicians. Each technician... Well, you get the idea. Imagine that there are 10 levels of descendent tables below customers in your database. Now suppose one of the customer companies goes out of business and you want to copy their data into an archive database and remove it from the production database. How would you perform the copy? I have asked several database professionals such a question and I have gotten a variety of answers. Some suggested using DTS. Some suggested writing SQL code. Some suggested using an INSERT statement generator. All of these are painful options. I'm aware of only one commercial product that claims to be able to perform such a copy, and that product is extremely expensive. Here's what we would do:
EXECUTE CopyBranch 'Archive','Production','Customer','CustomerID',@CustomerID -- names were changed to protect the innocent
The CopyBranch stored procedure is a homegrown generic routine that can copy any branch of our corporate database. In this example it starts with a single row in the Customer table and copies all associated rows from all descendent tables. It does not matter how many tables are involved or how deeply the structure is layered. This routine is a very powerful tool for us. We can easily copy subsets of related data into an archive database, a development database, a QA database, or a demo database. This capability was made possible by having a very consistent key architecture.
There are strong feelings on both sides of the surrogate keys versus natural data keys debate, but it's how those feelings are expressed that makes the debate itself bizarre. Dale Preston calls my article a "personal attack" in his web log. There is some justification to that complaint because the first third of the article is along those lines (he conveniently completely ignores the other two thirds of the article). Review the related reader comments in the SSWUG daily newsletter. Review Preston's blog entries. Review Celko's posts in the newsgroups. Review Celko's article (the first paragraph alone validates several of my comments). Review the feedback on Celko's article. Review the feedback on my article. I believe I have described the tone of the debate accurately. Some individuals take exception to my description because they find themselves among the group that frequently exhibits arrogance and irrational contempt for those who have different opinions. I think it's one thing to demean people who have a different opinion on a subjective technical matter (as Celko often does), but it's something quite different to decry those who have been doing the demeaning (as I did in my article).