A Surrogate Key Architecture and Support Tools
by Brian Walker

I have written previous articles in support of surrogate keys, but they were mainly rebuttals to ridiculous comments by others.  This article goes beyond rhetoric and describes a proposed surrogate key architecture.  It also provides several T-SQL tools to support the architecture.

The rules for this proposed architecture are quite simple.  Here they are in a loose order of importance (with 1 being the most important)...

  1) Every table has a primary key.
  2) The primary key is a single column.
  3) The primary key is the first column.
  4) The primary key column is named to correspond with the table name.
  5) The primary key migrates to child tables as a foreign key with the same characteristics.
  6) The primary key column is numeric.
  7) The primary key column is a 4-byte integer data type.
  8) The primary key column uses the IDENTITY property (starting at 1 and incrementing by 1).

I know, I know, some purists will complain about rule 3.  They will say there's no such thing as column order in Relational Theory.  In theory, there is no order.  In practice, there is an order to the columns.  It appears that the vast majority of databases using surrogate keys already follow rule 3, so all I have done is document current practice.

Perhaps the best way to describe this surrogate key architecture is with an example database.  The simplistic table definitions below are intended to serve as examples of the rules listed above.  They are not intended to represent a realistic model for any particular business.

  CREATE TABLE dbo.Region
     (RegionID                                int IDENTITY(1,1),
      RegionCode                         char( 2) NOT NULL,
      RegionName                      varchar(40) NOT NULL,
      Representative                  varchar(40) NOT NULL)

  CREATE TABLE dbo.Customer
     (CustomerID                              int IDENTITY(1,1),
      RegionID                                int NOT NULL,
      Name                            varchar(80) NOT NULL,
      Address                         varchar(80) NOT NULL,
      Phone                           varchar(20) NOT NULL)

  CREATE TABLE dbo.Vendor
     (VendorID                                int IDENTITY(1,1),
      Name                            varchar(80) NOT NULL,
      Address                         varchar(80) NOT NULL,
      Phone                           varchar(20) NOT NULL)

  CREATE TABLE dbo.Product
     (ProductID                               int IDENTITY(1,1),
      VendorID                                int NOT NULL,
      Description                     varchar(80) NOT NULL,
      WholesaleCost                  decimal(7,2) NOT NULL,
      RetailPrice                    decimal(7,2) NOT NULL)

  CREATE TABLE dbo.Purchase
     (PurchaseID                              int IDENTITY(1,1),
      CustomerID                              int NOT NULL,
      OrderNumber                             int NOT NULL,
      OrderDate                     smalldatetime NOT NULL)

  CREATE TABLE dbo.PurchaseItem
     (PurchaseItemID                          int IDENTITY(1,1),
      PurchaseID                              int NOT NULL,
      ProductID                               int NOT NULL,
      Quantity                           smallint NOT NULL,
      LineNumber                         smallint NOT NULL)

  CREATE TABLE dbo.Shipment
     (ShipmentID                              int IDENTITY(1,1),
      Carrier                         varchar(20) NOT NULL,
      TrackingNumber                  varchar(20) NOT NULL,
      ShipDate                      smalldatetime NOT NULL)

  CREATE TABLE dbo.PurchaseItemShipment
     (PurchaseItemShipmentID                  int IDENTITY(1,1),
      PurchaseItemID                          int NOT NULL,
      ShipmentID                              int NOT NULL)

  CREATE TABLE dbo.Payment
     (PaymentID                               int IDENTITY(1,1),
      PurchaseID                              int NOT NULL,
      Method                          varchar(20) NOT NULL,
      Amount                         decimal(7,2) NOT NULL)

You may notice that the Region table and the PurchaseItemShipment table each have an IDENTITY column that will become the primary key.  Some would consider an IDENTITY column to be excessive in such tables.  The RegionCode column could be the primary key for the Region table.  The PurchaseItemID column and the ShipmentID column could form a composite primary key for the PurchaseItemShipment table.  It's hard to argue against the point from a pure performance perspective.  However, a main objective with this architecture is consistency and I'm willing to sacrifice an insignificant amount of performance to achieve it.  In order to remain very consistent this example database follows all eight rules above.  The rules are universally applied to every table, including tables such as Region and PurchaseItemShipment.  I believe the benefits of consistency far outweigh the costs and this article will try to demonstrate some of the benefits.

You may notice that these table definitions do not include a primary key declaration.  That's where the support tools come into play.  Because these tables follow a consistent architecture, and do so universally, it's easy to create tools that allow you to implement Declarative Referential Integrity (DRI) for the entire database with two stored procedure calls.  I have provided the stored procedures here.  The T-SQL code in Listing 1 creates six system stored procedures in the master database.  I chose the master database for convenience, but the stored procedures could be created in user databases instead (remove the "sp_" prefixes from the names before creating the stored procedures in a user database).  The six stored procedures are grouped as three sets of two each.  The first pair creates/deletes primary key constraints.  The second pair creates/deletes foreign key constraints.  The third pair creates/deletes foreign key indexes.

NOTE:  The naming of things within this T-SQL code is not very helpful for understanding how it works.  Parameters, variables, tables, and columns are named rather capriciously.  I humbly beg your forgiveness for my programming idiosyncrasies!  These stored procedures definitely reflect my weird style.

These stored procedures accept five or six parameters, and all of them are optional.  The first five parameters are the same for all the routines.

The first parameter (@DBUltra) determines whether T-SQL script is generated or the actions are carried out immediately.  T-SQL script might be used to build a distribution script file.  Immediate actions take place in the current database.  A value of zero (0) causes T-SQL script to be generated and a value of one (1) causes the actions to be carried out immediately.

The next four parameters work together to form a combination of table selection criteria using names.

The second and third parameters (@DBIntra/@DBExtra) are lists of table names separated by pipes (vertical bars).  The @DBIntra parameter specifies table names to be included.  The @DBExtra parameter specifies table names to be excluded.

The fourth and fifth parameters (@PCIntra/@PCExtra) offer table selection based on pattern matching of names.  The @PCIntra parameter includes tables by using a LIKE operator on the names.  The @PCExtra parameter excludes tables by using a NOT LIKE operator.

The effects of these four parameters are combined with AND operators.  Often only one of the parameters (or none) would be used for a given call, but it may be useful to provide @DBExtra and/or @PCExtra in combination with @PCIntra in order to work with the desired subset of tables.  If the parameters are omitted, or if null values are provided, they are effectively ignored for selection purposes.

The method used to select tables in these routines is not necessarily the most efficient way of handling delimited strings.  The CHARINDEX function was used for simplicity because performance is not a significant issue when referencing tables with very modest row counts in an administrative routine.  If this kind of selection task were being done in a production routine that referenced tables with larger row counts it would probably be better to parse the delimited string and use the result set for a join.

The sp_CreatePrimaryKeys stored procedure creates primary key constraints for selected tables.  The sixth parameter (@PCUltra) determines whether the supporting indexes for the primary key constraints are clustered.  A value of zero (0) creates nonclustered primary keys and a value of one (1) creates clustered primary keys.

You can create primary key constraints for the entire database with this stored procedure call:

  EXECUTE sp_CreatePrimaryKeys 1

The sp_CreateForeignKeys stored procedure creates foreign key constraints for selected tables.  The primary key constraints must already exist.  There is a local variable, @FKey, that's used as a template for recognizing foreign key columns.  The asterisk in the value is dynamically replaced by a potential parent table name and the result is compared to column names.

You can create foreign key constraints for the entire database with this stored procedure call:

  EXECUTE sp_CreateForeignKeys 1

The sp_CreateForeignKeyIndexes stored procedure creates foreign key indexes for selected tables.  The foreign key constraints must already exist.  The sixth parameter (@DBAdmin) allows a FILLFACTOR percentage to be specified for the indexes.

You can create foreign key indexes for the entire database with this stored procedure call:

  EXECUTE sp_CreateForeignKeyIndexes 1

The sp_DeletePrimaryKeys stored procedure deletes primary key constraints for selected tables.  The foreign key constraints must be deleted first.

The sp_DeleteForeignKeys stored procedure deletes foreign key constraints for selected tables.

The sp_DeleteForeignKeyIndexes stored procedure deletes foreign key indexes for selected tables.

These routines contain a local variable, @TPre, that can be used to specify a table name prefix.  If a table name prefix is identified with this variable then the prefix can be omitted from table names provided as parameters.  It's especially important to specify a table name prefix (if a prefix is being used) in the sp_CreateForeignKeys stored procedure or the routine will expect to find the prefix in foreign key column names.

An example database is not really complete without some example data.  The INSERT statements below add some simple data that's intended to make apparent the relationships between tables.

  INSERT Region (RegionCode,RegionName,Representative) VALUES ('NW','NW USA'    ,'NW Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('SW','SW USA'    ,'SW Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('NE','NE USA'    ,'NE Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('SE','SE USA'    ,'SE Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('CA','Canada'    ,'CA Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('EU','Europe'    ,'EU Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('DU','Down Under','DU Sales')
  INSERT Region (RegionCode,RegionName,Representative) VALUES ('AO','All Others','AO Sales')

  INSERT Customer (RegionID,Name,Address,Phone) VALUES (1,'Customer X','Address X','Phone X')
  INSERT Customer (RegionID,Name,Address,Phone) VALUES (4,'Customer Y','Address Y','Phone Y')
  INSERT Customer (RegionID,Name,Address,Phone) VALUES (7,'Customer Z','Address Z','Phone Z')

  INSERT Vendor (Name,Address,Phone) VALUES ('Vendor X','Address X','Phone X')
  INSERT Vendor (Name,Address,Phone) VALUES ('Vendor Y','Address Y','Phone Y')
  INSERT Vendor (Name,Address,Phone) VALUES ('Vendor Z','Address Z','Phone Z')

  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (1,'Product X-1',2.00,4.00)
  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (1,'Product X-2',4.00,6.00)
  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (2,'Product Y-1',3.00,5.00)
  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (2,'Product Y-2',5.00,7.00)
  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (2,'Product Y-3',5.00,9.00)
  INSERT Product (VendorID,Description,WholesaleCost,RetailPrice) VALUES (3,'Product Z-1',6.00,8.00)

  INSERT Purchase (CustomerID,OrderNumber,OrderDate) VALUES (1,1001,'07/01/2005')
  INSERT Purchase (CustomerID,OrderNumber,OrderDate) VALUES (2,1002,'07/02/2005')
  INSERT Purchase (CustomerID,OrderNumber,OrderDate) VALUES (2,1003,'07/03/2005')

  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (1,1,1,1)
  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (2,1,1,1)
  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (2,3,2,2)
  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (3,2,1,1)
  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (3,4,2,2)
  INSERT PurchaseItem (PurchaseID,ProductID,Quantity,LineNumber) VALUES (3,5,3,3)

  INSERT Shipment (Carrier,TrackingNumber,ShipDate) VALUES ('UPS'  ,'1','07/01/2005')
  INSERT Shipment (Carrier,TrackingNumber,ShipDate) VALUES ('UPS'  ,'2','07/02/2005')
  INSERT Shipment (Carrier,TrackingNumber,ShipDate) VALUES ('FedEx','1','07/03/2005')
  INSERT Shipment (Carrier,TrackingNumber,ShipDate) VALUES ('FedEx','2','07/05/2005')

  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (1,1)
  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (2,2)
  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (3,2)
  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (4,3)
  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (5,4)
  INSERT PurchaseItemShipment (PurchaseItemID,ShipmentID) VALUES (6,4)

  INSERT Payment (PurchaseID,Method,Amount) VALUES (1,'PayPal', 4.00)
  INSERT Payment (PurchaseID,Method,Amount) VALUES (2,'PayPal',14.00)
  INSERT Payment (PurchaseID,Method,Amount) VALUES (3,'Check' ,16.00)
  INSERT Payment (PurchaseID,Method,Amount) VALUES (3,'Check' ,16.00)
  INSERT Payment (PurchaseID,Method,Amount) VALUES (3,'Check' ,15.00)

I know, I know, natural data key proponents will point out that my use of assumed IDENTITY values in the VALUES clause is exactly the failing of surrogate keys.  They are right in a sense.  The INSERT statements above make some assumptions about parent rows because child rows are not connected to parent rows using natural data.  However, customers do not place their orders using a batch of INSERT statements.  Customers use an application, and an application does not have to make any assumptions about parent rows.  An application can handle IDENTITY values dynamically, but my example data must be static.

The T-SQL code in Listing 2 creates a system stored procedure named sp_ExamineChildren.  The stored procedure accepts two required parameters, a table name (@DBTable) and a primary key value (@DBValue).  The routine returns a result set containing a row for every child table.  Each row includes a child table name and the number of rows in the table that relate to the parent table row specified by the parameters.  This stored procedure might be used for simple data analysis.  A SUM aggregation on the number of rows column can determine whether the parent row has any child rows.  If the sum is greater than zero then child rows exist and the parent row must not be deleted.

You can examine the children of a row in the Purchase table with this stored procedure call:

  EXECUTE sp_ExamineChildren 'Purchase',3

The query below returns a set of data containing purchase details for a certain customer.  The query involves eight tables with a variety of parent/child relationships.  The result is a denormalized set of data suitable for a report.  However, the result may not be of much use to a DBA or developer because the source of each column is not apparent in the output.  A problem could easily be concealed among all the redundant data (the same parent data repeated with each child row).

    SELECT R.RegionName
         , C.Name
         , O.OrderNumber
         , O.OrderDate
         , I.LineNumber
         , I.Quantity
         , V.Name
         , P.Description
         , S.Carrier
         , S.TrackingNumber
         , S.ShipDate
      FROM Customer AS C
      JOIN Region AS R
        ON C.RegionID = R.RegionID
      JOIN Purchase AS O
        ON C.CustomerID = O.CustomerID
      JOIN PurchaseItem AS I
        ON O.PurchaseID = I.PurchaseID
      JOIN Product AS P
        ON I.ProductID = P.ProductID
      JOIN Vendor AS V
        ON P.VendorID = V.VendorID
      JOIN PurchaseItemShipment AS E
        ON I.PurchaseItemID = E.PurchaseItemID
      JOIN Shipment AS S
        ON E.ShipmentID = S.ShipmentID
     WHERE C.CustomerID = 2
  ORDER BY O.OrderNumber
         , I.LineNumber

The T-SQL code in Listing 3 creates a system stored procedure named sp_FetchRelatedRows.  The routine returns related data similar to the query above, but it returns the data in a separate result set for each table involved.  This kind of output might be more useful to a DBA or developer as he/she is investigating an issue.  All the appropriate rows from all the appropriate tables are presented at once.  The rows included for each table are limited to those involved in a parent or child relationship with rows from one or more of the other sets.

The sp_FetchRelatedRows stored procedure accepts seven parameters.

The first parameter (@DBChain) is a list of table names separated by pipes (vertical bars).  Each table beyond the first one must be a parent or child of a previous table in the list.  Typically, the chain of tables would traverse the hierarchy down from parent to child.  When the lowest level table has been specified a table that joins to any table appearing earlier in the list would be specified, starting a new sequence of tables.  The new sequence would then traverse up or down the hierarchy as necessary.

The second parameter (@DBWhere) is optional and is used as a WHERE clause with the first table.

The third parameter (@DBAdmin) is optional and it controls how and when joins are established between the tables in the chain specified by @DBChain.  The default value is zero (0).

If @DBAdmin is negative then the tables are joined as a unit (like the query above).  This option eliminates parent rows that have no child rows (the other options do not). 

If @DBAdmin is zero then each table is joined to every preceding table for which a relationship exists.  Rows are selected based on the sequence(s) up to that point.  This option (or the previous option) is required if any sequence of tables runs parallel to a previous sequence.

If @DBAdmin is positive then each table is joined to only the nearest preceding table for which a relationship exists.  Rows are selected based on the sequence(s) up to that point.

The next two parameters (@DBField and @DBValue) are optional and are used to limit the rows selected from the first table, based on a column with an integer value.  Typically, a primary key or foreign key column, along with a corresponding value, is used for this purpose.

The next two parameters (@PCField and @PCValue) are optional and are used to limit the rows selected from each table, based on a column with an integer value.  If this pair of parameters is used, the indicated column must exist in every table.

You can investigate the data used in the query above with this stored procedure call:

  EXECUTE sp_FetchRelatedRows 'Customer|Region|Purchase|PurchaseItem|Product|Vendor|PurchaseItemShipment|Shipment',NULL,0,'CustomerID',2

You can easily include an additional table to investigate with this stored procedure call:

  EXECUTE sp_FetchRelatedRows 'Customer|Region|Purchase|PurchaseItem|Product|Vendor|PurchaseItemShipment|Shipment|Payment',NULL,0,'CustomerID',2

The sp_FetchRelatedRows stored procedure is presented here in the context of database administration work or development work.  However, it's worth mentioning that the output of the routine can be handled very gracefully by ADO.NET.  A DataSet object can be populated with all the tables at once using a single method.  Further, a generic routine can build all the correct relationships between the DataTable objects in the DataSet.  The process creates a mini-database that can be navigated using the GetParentRow and GetChildRows methods.  I would not recommend calling sp_FetchRelatedRows from a production application without fully understanding the potential ramifications of using dynamic T-SQL code.

All the stored procedures presented here are made possible, or at least made practical, by having a consistent surrogate key architecture.  These routines are quite simple, but there are many other much more powerful routines that can use the consistency to great advantage.