Utilities for a Surrogate Key Architecture by Brian Walker In article 1 of this series I described a surrogate key architecture and explained how and why to use it. I also provided several tools to support the proposed architecture. In article 2 of this series I described six standard table columns and nine standard table-specific objects. I also provided several tools to generate and manage CRUD routines. This article offers some additional commentary about the use of SQL Server IDENTITY values as surrogate keys. I will describe several utility routines to help maintain a surrogate key database. I will also provide six stored procedures to implement the utility functionality. Before I get into describing the utility routines, there's a general point that I want to make sure is very clear to readers. The point is this: a surrogate key value has no intrinsic meaning. Surrogate key values can be used to facilitate table relationships, but that's all they do. There should be no business logic or intelligence attached to surrogate key values. The values in a surrogate key column should not be assumed to imply any useful order for the rows. There should be no expectations about the range or continuity of values in a surrogate key column. I want to stress these aspects of surrogate keys because some of the utility routines could lead readers to believe that I think surrogate key values have intrinsic meaning. The utility routines in this article provide some helpful manipulations of IDENTITY values. An IDENTITY value is just one form of surrogate key, but it happens to be a very convenient and efficient form for most SQL Server databases. A GUID value is another common form of surrogate key. GUID values are beneficial when surrogate key values must be unique across multiple databases, but the data type is relatively large and that can become a hindrance for performance. The architecture proposed in my earlier article uses IDENTITY values for surrogate keys. There are a few database "experts" out there who like to make false statements about IDENTITY values. There's one particularly outspoken character whose initials are JC. JC has often stated that IDENTITY values are exposed physical locators created by looking at the internal state of the hardware. JC is categorically wrong about this simple objective matter, so it's difficult to believe that JC could be right about related complex subjective matters. An IDENTITY value does not reference a physical storage location. If you have a table with no clustered index, new rows are inserted where sufficient free space exists. There is no intentional correlation between the physical location of a new row and the IDENTITY value generated for the new row. If you add a clustered index to the table it physically rearranges the rows of data by copying them to completely different data pages in the order of the index values. Despite this physical rearrangement the IDENTITY value for any given row never changes. An IDENTITY value is not derived from any hardware state, it's not derived from any operating system state, and it's not derived from the state of the SQL Server application. You could say that it's derived from the state of the table at the time of insertion. Envision a Brand X server box running Windows 2003 Server and SQL Server 2000 Enterprise. Envision a Brand Y server box running Windows 2000 Server and SQL Server 2000 Standard. If you copy a SQL Server database from one to the other the generating of IDENTITY values is unaffected. A new row inserted into both copies of a table (one on each server) will be assigned the same IDENTITY value. Now for the utility routines... The T-SQL code in Listing 1 creates six system stored procedures in the master database. I chose the master database for convenience, but the stored procedures could be created in user databases instead (remove the "sp_" prefixes from the names before creating the stored procedures in a user database). The six stored procedures are described separately, and their parameters are listed with the descriptions for easy reference. My surrogate key architecture article includes an example database. The examples for each system stored procedure presented here are intended to be executed in the context of that database, which uses the required surrogate keys. NOTE: The naming of things within this T-SQL code is not very helpful for understanding how it works. Parameters, variables, tables, and columns are named rather capriciously. I humbly beg your forgiveness for my programming idiosyncrasies! These stored procedures definitely reflect my weird style. These stored procedures contain a local variable, @TPre, that can be used to specify a table name prefix. If a table name prefix is identified with this variable then the prefix can be omitted from table names provided as parameters. The SQL code below should be used to add an extra table to my example database. The extra table is referenced in some of the examples below. The SQL code also adds a detail/child table to demonstrate the handling of foreign keys. -- DROP TABLE ExampleDetail -- DROP TABLE Example CREATE TABLE Example (ExampleID int IDENTITY(1,1), MyString char(1)) INSERT Example (MyString) VALUES ('A') INSERT Example (MyString) VALUES ('B') INSERT Example (MyString) VALUES ('C') INSERT Example (MyString) VALUES ('D') INSERT Example (MyString) VALUES ('E') INSERT Example (MyString) VALUES ('F') INSERT Example (MyString) VALUES ('G') INSERT Example (MyString) VALUES ('H') INSERT Example (MyString) VALUES ('I') INSERT Example (MyString) VALUES ('J') INSERT Example (MyString) VALUES ('K') INSERT Example (MyString) VALUES ('L') INSERT Example (MyString) VALUES ('M') INSERT Example (MyString) VALUES ('N') INSERT Example (MyString) VALUES ('O') INSERT Example (MyString) VALUES ('P') INSERT Example (MyString) VALUES ('Q') INSERT Example (MyString) VALUES ('R') INSERT Example (MyString) VALUES ('S') INSERT Example (MyString) VALUES ('T') INSERT Example (MyString) VALUES ('U') INSERT Example (MyString) VALUES ('V') INSERT Example (MyString) VALUES ('W') INSERT Example (MyString) VALUES ('X') INSERT Example (MyString) VALUES ('Y') INSERT Example (MyString) VALUES ('Z') ALTER TABLE Example ADD CONSTRAINT ExamplePK PRIMARY KEY NONCLUSTERED (ExampleID) CREATE TABLE ExampleDetail (ExampleDetailID int IDENTITY(1,1), ExampleID int, WhatEver char(2)) INSERT ExampleDetail (ExampleID, WhatEver) VALUES ( 5,'E1') INSERT ExampleDetail (ExampleID, WhatEver) VALUES ( 5,'E2') INSERT ExampleDetail (ExampleID, WhatEver) VALUES ( 6,'F1') INSERT ExampleDetail (ExampleID, WhatEver) VALUES ( 6,'F2') INSERT ExampleDetail (ExampleID, WhatEver) VALUES (21,'U1') INSERT ExampleDetail (ExampleID, WhatEver) VALUES (21,'U2') ALTER TABLE ExampleDetail ADD CONSTRAINT ExampleDetailPK PRIMARY KEY NONCLUSTERED (ExampleDetailID) ALTER TABLE ExampleDetail ADD CONSTRAINT ExampleDetailFK FOREIGN KEY (ExampleID) REFERENCES Example (ExampleID) DELETE FROM Example WHERE ExampleID IN (3,12,13,14,25,26) Utility Routine: sp_CheckRowCounts The sp_CheckRowCounts routine returns a result set containing row counts along with IDENTITY value information for selected tables. The sp_CheckRowCounts stored procedure is useful for examining the current state of selected tables with regard to row counts and IDENTITY values used for surrogate keys. It returns a result set with six columns. The columns include the table name, the number of rows in the table, the smallest existing IDENTITY value, the largest existing IDENTITY value, the last assigned IDENTITY value, and the next available IDENTITY value. CREATE PROCEDURE dbo.sp_CheckRowCounts @DBIntra varchar(8000) = NULL, @DBExtra varchar(8000) = NULL, @PCIntra varchar(100) = NULL, @PCExtra varchar(100) = NULL, @PCUltra bit = 0 The first four parameters are all optional and they work together to form a combination of table selection criteria using names. The first and second parameters (@DBIntra/@DBExtra) are lists of table names separated by pipes (vertical bars). The @DBIntra parameter specifies table names to be included. The @DBExtra parameter specifies table names to be excluded. The third and fourth parameters (@PCIntra/@PCExtra) offer table selection based on pattern matching of names. The @PCIntra parameter includes tables by using a LIKE operator on the names. The @PCExtra parameter excludes tables by using a NOT LIKE operator. The effects of these four parameters are combined with AND operators. Often only one of the parameters (or none) would be used for a given call, but it may be useful to provide @DBExtra and/or @PCExtra in combination with @PCIntra in order to work with the desired subset of tables. If the parameters are omitted, or if null values are provided, they are effectively ignored for selection purposes. The method used to select tables in these routines is not necessarily the most efficient way of handling delimited strings. The CHARINDEX function was used for simplicity because performance is not a significant issue when referencing tables with very modest row counts in an administrative routine. If this kind of selection task were being done in a production routine that referenced tables with larger row counts it would probably be better to parse the delimited string and use the result set for a join. The fifth parameter (@PCUltra) is also optional and it determines whether table name prefixes (if used and if defined) are included in the output. A value of zero (0) means they will be omitted and a value of one (1) means they will be included. This example returns IDENTITY value information for selected tables in my example database. EXECUTE sp_CheckRowCounts NULL,NULL,'Purchase%' This example returns IDENTITY value information for the extra table in my example database. EXECUTE sp_CheckRowCounts 'Example' Utility Routine: sp_CheckKeyValues The sp_CheckKeyValues routine returns a result set containing the integer value ranges represented in a selected table and column. The result set also includes the size of each range and the existing row count within each range. The sp_CheckKeyValues stored procedure is useful for finding gaps in IDENTITY values. Do gaps affect the operation of surrogate keys? No! However, the information can be interesting. This routine is not limited to checking IDENTITY values. It can be used to look for gaps in the values of any column with an integer data type. It does not assume that all values in the column are unique. CREATE PROCEDURE dbo.sp_CheckKeyValues @DBFetch varchar(4000), @DBField varchar(100), @DBUltra bit = 0 The first parameter (@DBFetch) specifies the source of rows to examine. The parameter value can be a table name (including a temporary table), view name, user-defined function call, or SELECT statement. The second parameter (@DBField) specifies the column to examine within the set of rows defined by @DBFetch. The third parameter (@DBUltra) is optional and it determines whether the output will represent ranges of existing values or ranges of missing values. A value of zero (0) returns existing values and a value of one (1) returns missing values. This example examines the primary key column of the Product table in my example database. EXECUTE sp_CheckKeyValues 'Product','ProductID' These examples examine the IDENTITY values in the ExampleID column of the extra table in my example database. EXECUTE sp_CheckKeyValues 'Example','ExampleID',0 EXECUTE sp_CheckKeyValues 'Example','ExampleID',1 Utility Routine: sp_ResetNextKeyValue The sp_ResetNextKeyValue routine resets the next available IDENTITY value for selected tables. The sp_ResetNextKeyValue stored procedure is useful for helping to maximize the potential range of IDENTITY values. If the rows most recently inserted into a table were subsequently deleted, this routine allows the IDENTITY values allocated for those rows to be reassigned. The next row inserted will be assigned an IDENTITY value that is one increment larger than the largest existing IDENTITY value. The increment for IDENTITY values is almost always one (1), but that's not a requirement. CREATE PROCEDURE dbo.sp_ResetNextKeyValue @DBIntra varchar(8000) = NULL, @DBExtra varchar(8000) = NULL, @PCIntra varchar(100) = NULL, @PCExtra varchar(100) = NULL The four parameters are all optional and they work together to form a combination of table selection criteria using names. The first and second parameters (@DBIntra/@DBExtra) are lists of table names separated by pipes (vertical bars). The @DBIntra parameter specifies table names to be included. The @DBExtra parameter specifies table names to be excluded. The third and fourth parameters (@PCIntra/@PCExtra) offer table selection based on pattern matching of names. The @PCIntra parameter includes tables by using a LIKE operator on the names. The @PCExtra parameter excludes tables by using a NOT LIKE operator. The effects of these four parameters are combined with AND operators. Often only one of the parameters (or none) would be used for a given call, but it may be useful to provide @DBExtra and/or @PCExtra in combination with @PCIntra in order to work with the desired subset of tables. If the parameters are omitted, or if null values are provided, they are effectively ignored for selection purposes. The method used to select tables in these routines is not necessarily the most efficient way of handling delimited strings. The CHARINDEX function was used for simplicity because performance is not a significant issue when referencing tables with very modest row counts in an administrative routine. If this kind of selection task were being done in a production routine that referenced tables with larger row counts it would probably be better to parse the delimited string and use the result set for a join. This example resets the next available IDENTITY value for the extra table in my example database. The change can be observed by checking the output of the two surrounding sp_CheckRowCounts stored procedure calls. EXECUTE sp_CheckRowCounts 'Example' EXECUTE sp_ResetNextKeyValue 'Example' EXECUTE sp_CheckRowCounts 'Example' Utility Routine: sp_OrganizeKeyValues The sp_OrganizeKeyValues routine organizes the IDENTITY values into a single contiguous range (1-N) for a selected table. The sp_OrganizeKeyValues stored procedure is useful for eliminating gaps in the IDENTITY values in the primary key column of a table. Do gaps affect the operation of surrogate keys? No! However, some people prefer to have their data as orderly as possible. This routine rearranges rows within the indicated table to close gaps in the IDENTITY values. The result is a single contiguous range of IDENTITY values from 1 through N, where N is the number of rows in the table. It also resets the next available IDENTITY value. The sp_OrganizeKeyValues stored procedure preserves the integrity of foreign keys that reference rows in the affected table. It automatically adjusts the foreign key values in all child tables of the affected table. CREATE PROCEDURE dbo.sp_OrganizeKeyValues @DBTable varchar(100), @DBUltra bit = 0 The first parameter (@DBTable) specifies the table to be affected. The table must follow the rules of the surrogate key architecture suggested in my earlier article. The second parameter (@DBUltra) is optional and it determines whether the routine will minimize the movement of rows or preserve the relative order of rows based on their IDENTITY values. A value of zero (0) means row movement will be minimized and a value of one (1) means the relative order will be preserved. This example reorganizes the IDENTITY values in the extra table in my example database. It minimizes the movement of rows. NOTE: Be sure to DROP and CREATE the extra tables using the SQL code above before trying this example. The actions performed by sp_OrganizeKeyValues can be observed by checking the output of the surrounding SELECT statements. SELECT * FROM Example SELECT * FROM ExampleDetail EXECUTE sp_OrganizeKeyValues 'Example' SELECT * FROM Example SELECT * FROM ExampleDetail This example reorganizes the IDENTITY values in the extra table in my example database. It preserves the relative order of the rows. NOTE: Be sure to DROP and CREATE the extra tables using the SQL code above before trying this example. The actions performed by sp_OrganizeKeyValues can be observed by checking the output of the surrounding SELECT statements. SELECT * FROM Example SELECT * FROM ExampleDetail EXECUTE sp_OrganizeKeyValues 'Example',1 SELECT * FROM Example SELECT * FROM ExampleDetail Utility Routine: sp_FindDuplicateRows The sp_FindDuplicateRows routine finds cases where multiple rows match across a specified set of criteria (columns) in a selected table. The sp_FindDuplicateRows stored procedure is useful for finding duplicate data. In order for the routine to identify rows that are actually duplicates from a business perspective it's important to specify an appropriate column list (the @DBThere parameter). A column, or a list of columns, that represents an alternate key is often a good choice. For example, a name column would not be used as the primary key in this architecture and it may not be practical to place a unique constraint on such a column. However, it may be typical to look for duplicates in such a column. Any column, or any list of columns, that usually (but not always) holds unique data might be of interest to check. The proponents of natural data keys might say that the sp_FindDuplicateRows routine is necessitated by the use of surrogate keys. They seem to believe that natural data keys would ensure that no duplicate rows could exist. That's nonsense. Any column (or set of columns) that would be the primary key in a natural data key system could easily be the target of a unique constraint in a surrogate key system. The systems would be equally susceptible to duplicate rows. A lot of data (possibly the vast majority of data) is entered into databases by humans, and humans make mistakes. An employee table could have two rows with the same name and address where the Social Security number differs by one digit. It does not matter whether the Social Security number is the primary key, or the target of a unique constraint (with an IDENTITY value as the primary key). A book table could have two rows with the same title and author where the ISBN differs by one digit. It does not matter whether the ISBN is the primary key, or the target of a unique constraint (with an IDENTITY value as the primary key). The use of check digits can reduce the number of such errors, but it does not eliminate them. Both natural data key systems and surrogate key systems can contain duplicate rows. CREATE PROCEDURE dbo.sp_FindDuplicateRows @DBTable varchar(100), @DBWhere varchar(2000) = NULL, @DBThere varchar(2000) = NULL, @DBUltra bit = 0 The first parameter (@DBTable) specifies the table to be analyzed. The table must follow the rules of the surrogate key architecture suggested in my earlier article. The second parameter (@DBWhere) is optional and it provides a way to limit the rows to be analyzed. The parameter value is used as a WHERE clause. The third parameter (@DBThere) is optional and it specifies the criteria (columns) to be considered for matching. The parameter value, which is typically a list of column names separated by pipes (vertical bars), is used to form a GROUP BY clause and an ORDER BY clause. Expressions involving columns can be used in place of column names in the list. The default is to use all columns with "Name" in their name, or all columns except the first column. The fourth parameter (@DBUltra) is optional and it determines whether summary information is returned or matching rows are returned. A value of zero (0) means the match criteria and match counts are returned and a value of one (1) means the matching rows themselves are returned along with reference information. This example finds duplicate rows in the Product table in my example database. It returns a summary of the findings. The column used in this example does not represent an actual duplication of data from a business perspective, but it demonstrates how the stored procedure works. EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID',0 This example finds duplicate rows in the Product table in my example database. It returns a summary of the findings. The combination of columns used in this example returns an empty result set because there is no duplication of data in those columns. EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID|Description',0 This example finds duplicate rows in the Product table in my example database. It returns a result set including the rows found to be duplicates. The combination of a column and an expression used in this example does not represent an actual duplication of data from a business perspective, but it demonstrates how the stored procedure works. EXECUTE sp_FindDuplicateRows 'Product',NULL,'VendorID|LEFT(Description,7)',1 Utility Routine: sp_ChangeParentValue The sp_ChangeParentValue routine transfers the relationship to child table rows from one parent table row to another. The sp_ChangeParentValue stored procedure is useful for taking child rows found under multiple parent rows and combining them under a single parent row. Envision three customer rows that are found to be duplicates, where each customer row has several child rows in an order table. This routine can be used to combine all the order rows under a single customer row. Then the other two customer (parent) rows can be deleted. CREATE PROCEDURE dbo.sp_ChangeParentValue @DBTable varchar(100), @DBValue int, @DBAdmin int The first parameter (@DBTable) specifies the parent table. The table must follow the rules of the surrogate key architecture suggested in my earlier article. The second parameter (@DBValue) specifies the primary key IDENTITY value that acts as the source. The third parameter (@DBAdmin) specifies the primary key IDENTITY value that acts as the destination. The sp_ChangeParentValue stored procedure checks the child tables of @DBTable and changes the foreign key value @DBValue to @DBAdmin. The effect is to transfer relationship to (ownership of) the child rows from the @DBValue parent row to the @DBAdmin parent row. This example uses the extra tables in my example database. It combines the child rows in the detail table under a single parent row. NOTE: Be sure to DROP and CREATE the extra tables using the SQL code above before trying this example. The actions performed by sp_ChangeParentValue can be observed by checking the output of the surrounding SELECT statements. SELECT P.* , C.* FROM Example AS P LEFT JOIN ExampleDetail AS C ON P.ExampleID = C.ExampleID WHERE P.ExampleID IN (5,6,21) EXECUTE sp_ChangeParentValue 'Example',5,21 EXECUTE sp_ChangeParentValue 'Example',6,21 SELECT P.* , C.* FROM Example AS P LEFT JOIN ExampleDetail AS C ON P.ExampleID = C.ExampleID WHERE P.ExampleID IN (5,6,21)