Last February I posted on how storing UUIDs using the native UUID datatype in PostgreSQL was about 15% faster then using a VARCHAR(35) datatype. In March I posted about the performance effects of the “String Format” setting for MS SQL Server databases when using cfqueryparam. Recently I have been working on checking the performance effects when combining different datatypes for UUIDs with the string formats settings for a MS SQL Server 2005 database. (Or to be more exact, I have been checking the performance effects of different datatypes for storing 128-bit integers in the database.)

The starting point is an existing database for an application. That database stores UUIDs using a VARCHAR(35) datatype. From that database I extracted the CREATE statements for the schema using SQL Server Management Studio and build a script from there that follows the following pattern:

  1. create database
  2. create tables including:
    • primary key constraints
    • indexes
    • check constraints
  3. load data (and transform where needed)
  4. create foreign key constraints

Due to the ordering of the load script where all data is loaded before the foreign keys are created the order in which the tables are created and loaded doesn’t really matter: there will never be foreign key constraint errors, not even with circular foreign keys.

For the different datatypes I then manipulated the script a little bit. The first change was a replace to change all occurrences of VARCHAR(35) to NVARCHAR(35). The second was to change all occurrences of VARCHAR(35) with UNIQUEIDENTIFIER. For the last one I then wrote a function to convert the ColdFusion UUID format to the standard format the the UNIQUEIDENTIFIER datatype expects. This function looks like:

CREATE  function var_to_UUID(@cfUUID VARCHAR(35)) RETURNS uniqueidentifier AS
BEGIN
  Return
    CASE
      WHEN Len(@cfUUID) = 35 THEN Substring(@cfUUID,1,23) + '-' + Substring(@cfUUID,24,12)
      ELSE NULL
    END;
END

Next I had to manipulate the data loading script to use this function to convert all the UUIDs. That required some manual editing to make all the insert statements look like:

INSERT INTO [testUUID].[dbo].[language](
  [languageID]
  ,[language]
  )
SELECT
  testUUID.dbo.var_to_uuid([languageID])
  ,[language]
FROM
  [source_DB].[dbo].[language]

The database I used for these tests is on the one hand far from trivial: 80+ tables with normal and composite primary keys, 120+ foreign keys and many check constraints. On the other hand it is very trivial: just one stored procedure and one UDF. To make sure that I am comparing apples to apples I ran the existing database through the create and load sequence to establish my base for comparison and I ran all databases through the example index optimization script from MSDN. The numbers I got after just that sequence are:

Datatype Storage requirements (bytes) Database size
UNIQUEIDENTIFIER 16 29.6 GB
VARCHAR(35) 35 32.5 GB
NVARCHAR(35) 70 38.6 GB

As you can see the database size correlates very well to the storage size of the different datatypes. Apparently we have about 26.5 GB of non-UUID data in the database (which for this database is mostly in NTEXT fields in 2 tables of 18 GB and 6 GB data size respectively). For the tables most and least affected by the change the statistics are:

Table Datatype Storage requirements (bytes) Table size Index size
Most affected UNIQUEIDENTIFIER 16 2.0 GB 1.5 GB
VARCHAR(35) 35 3.4 GB 2.6 GB
NVARCHAR(35) 70 6.4 GB 5.1 GB
Least affected UNIQUEIDENTIFIER 16 17.9 GB 0.3 MB
VARCHAR(35) 35 17.9 GB 0.7 MB
NVARCHAR(35) 70 17.9 GB 2.0 MB

So as you can see there may be good reason to pick the right datatype: if your schema has many UUID columns you can save upto a factor three in storage space (or hardly any space at al if you have few UUIDs). Looking at it from the MS SQL Sever perspective the preferred storage format is UNIQUEIDENTIFIER, followed by VARCHAR(35), and lastly NVARCHAR(35). But MS SQL Server is only half of the equation, ColdFusion is the other half and in the next installment I will show you why the order from the ColdFusion side is radically different.

4 Comments

  1. duncan says:

    very interesting. I don’t think I’ve ever used the UniqueIdentifier datatype; all our UUIDs are stored as varchars. I’ll wait for your second instalment before making any changes to our tables!

  2. “it could be bunnies” » Blog Archive » Querying MS SQL Server G/UUIDs from ColdFusion says:

    [...] the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database [...]

  3. Orange is my favorite color » Blog Archive » Converting from UUIDs to GUIDs says:

    [...] been reading through a handful of Jochem’s posts about Adobe’s (incompatible with the standard 35-character) UUID and [...]

  4. “it could be bunnies” » Blog Archive » Converting a CF application from MS SQL Server to PostgreSQL? says:

    [...] At Prisma IT we have an application that has been under constant development for 2 years. Over that time it has seen lots of feature creep and changing requirements, and no refactoring. As you can imagine that doesn’t make for pretty code or an optimal database model. Due to a retargeting of that application and a much more focused requirements process we are finally getting ready to start reworking that application. Over the past month I have been working on and off (mostly in the train on my way to clients) on putting more structure in the long list with things we (I) really would like to change about the technology of the application. I have also been running some experiments on the code to see what the impact of certain changes would be. Amongst others this includes the experiments with changing the datatype used in MS SQL Server to store ColdFusion UUIDs. [...]