Posts tagged ‘UUID’

I have been promising myself (and others) to write about the ColdFusion UUID implementation for quite a while now and I feel like I have been procrastinating long enough. So at long last the definitive guide to ColdFusion UUIDs, based on many years of experience and a few conversations with the ColdFusion engineering team over beer at the MAX.

What is a UUID

A UUID is an Universally Unique Identifier which is just a fancy name for a 128-bit integer. While a 128-bit integer is a really large number, it is not an infinite number, so it is not really unique, it is just so rare for a conflict to occur that we normally just presume it is actually unique. This 128-bit integer is typically represented as a hexadecimal string split into 5 groups by hyphens in the pattern 8-4-4-4-12. This UUID is typically generated from one of 5 different algorithms:

  1. MAC address based
  2. DCE based
  3. MD5 hash based
  4. Random
  5. SHA-1 hash based

Each of these versions offers different guarantees for uniqueness and randomness. For ColdFusion developers the import version are 1 and 4.

MAC address based UUIDs

The algorithm for a MAC address based UUID is based on 3 different components:

  1. timestamp
  2. clock sequence
  3. node identifier

The timestamp is a 60-bit integer counting the number of 100 nanosecond increments since the beginning of the Gregorian calendar in 1582. The clock sequence is an initially random number used to prevent duplicate UUIDs when the time is reset backwards for instance through an NTP client. The node identifier is a supposedly unique identification for the node on which the UUID is generated. Since this node identifier is typically the MAC address of one of the NICs of the system this version is commonly referred to as a MAC based UUID.

From this algorithm a few things stand out:

  1. The timestamp will overflow in stardate 3400 or something and from that moment on the generated UUIDs may conflict with earlier generated UUIDs. But since I doubt anybody was generating UUIDs in 1582 it is safe to assume the first actual conflicts from that will occur a few hundred years later.
  2. The UUID is only as unique as the MAC address is. While MAC addresses are supposedly unique anybody who has run a somewhat larger network like a campus network will know that in reality they are not.
  3. It is impossible to generate more then 10 million version 1 UUIDs per second per node due to the 100 nanosecond timestamp resolution.
  4. MAC based UUIDs are actually quite predictable.

The MAC based algorithm is the algorithm used in ColdFusion.

Random UUIDs

Random UUIDs are generated mostly random. The version number and 2 other bits are restricted, but the other 122 bits are generated from a random source. This means:

  1. Version 4 UUIDs are unpredictable.
  2. Version 4 UUIDs are more likely to conflict than version 1 UUIDs. Still for all practical purposes they are unique.
  3. The quality and speed of the generation of version 4 UUIDs depends on your entropy source.

Amongst others, java.util.UUID is one of the implementations of a version 4 UUID generator.

UUIDs in ColdFusion

UUIDs are generated in ColdFusion through the createUUID() function. This function generates UUIDs using the version 1 algorithm (MAC address based).  The one thing that makes these UUIDs stand out very much is that they have a non-standard string representation. Instead of being grouped in 5 groups with the pattern 8-4-4-4-12 they are grouped in 4 groups with the pattern 8-4-4-16. I have been told this was an unintentional deviation that was not discovered until after shipping and then backward compatibility was deemed more important than conforming to the string representation of others.

The ColdFusion createUUID() function gets interesting with the rewrite to Java in ColdFusion MX. At that time Java had no API to find the MAC address of a NIC in the system, so on Windows a little bit of native code in NeoUUID.dll was used to find the MAC address and on other platforms a MAC address was faked. When doing a native Java deployment on Windows (EAR/WAR file) the system would also fall back the same as on other platforms. In addition the timestamp resolution of the Sun JVMs was rather limited (10 milliseconds on Windows, 1 millisecond on other platforms). Since you can generate only one UUID per clock tick, the theoretical limit for the number of UUIDs generated per second was 100 on Windows (64 on multi-core systems).

A particular problem in this version was a bug in the Sun JVM where using createUUID() would cause the system clock to move forward a little bit. Under heavy use the clock would move forward up to 12 seconds per minute. Then when the time was resynchronized with the NTP server and the server clock went back a minute or so, the generation of UUIDs was stalled until the system was back in the future. Very much the intended behavior of a UUID generation algorithm that values uniqueness over everything else, but still an unpleasant surprise.

With the arrival of ColdFusion 9 createUUID() got a speed boost. The implementation was rewritten from using a millisecond time API to use a new Java API that provides timestamps with a nanosecond resolution. That means the theoretical limit of 100 or 1000 UUIDs per second got increased to 10 million per second. The practical limit is still a bit lower because the clock tick is not really 1 nanosecond, but the speed improvement is still very significant. The speed of createUUID() now actually varies depending on the clock speed of the hardware you use to run the test.

GUIDs in ColdFusion

In addition to a UUID datatype ColdFusion also has a GUID datatype. This is another 128-bit integer that is unfortunately incompatible with ColdFusion UUIDs because it uses the 8-4-4-4-12 string representation . On the other hand it has the huge benefit that it is compatible with the way the rest of the world represents UUIDs so we can natively exchange them with Java, databases etc. instead of having to serialize them to a string. I have written previously about the performance benefits you can reap if you use a native uniqueidentifier datatype in MS SQL Server instead of a string representation.

What ColdFusion does not have is a native function to generate GUIDs. Typically this is solved by generating GUIDs from UUIDs by just inserting another hyphen, or by falling back to the Java java.util.UUID class. Just remember that when you use the ColdFusion createUUID() function you get better uniqueness guarantees since it is a version 1 UUID, while when using java.util.UUID you get better performance since it is a version 4 UUID (if you have sufficient entropy).

In the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database size. Using an NVARCHAR(35) instead of a UNIQUEIDENTIFIER can triple the size of your tables, and a VARCHAR(35) is somewhere in between. Conventional wisdom relates the performance inversely to the size of a database, with a factor between O(ln(n)) for indexed access and O(n) for table access. To test that, I created an artificial benchmark for my databases with several queries that mainly exercise range access to the tables that saw the highest size ratio from the datatype conversion and access more data then fits in RAM. (Due to the string format difference the values used in the queries differed slightly.) Then I ran those queries from the MS SQL Management Studio and checked the wall clock time.

Datatype Storage requirements (B) Database size (GB) Run time (s)
UNIQUEIDENTIFIER 16 29.6 22
VARCHAR(35) 35 32.5 35
NVARCHAR(35) 70 38.6 47

The benchmark proved conventional wisdom right (unsurprisingly I might add, because it was designed to prove conventional wisdom right).

Querying from ColdFusion

If we add ColdFusion to the mix, the equation changes. Crazy as it may sound, the typeless ColdFusion language has 2 incompatible 128-bit integer datatypes. First there is the UUID, a 128-bit integer with a 35-byte string representation. It has a terribly slow generator function CreateUUID(), but it is incompatible with the rest of the world. Then there is the GUID, a 128-bit integer with a 36-byte string representation. There is no function in ColdFusion to generate GUIDs, but you can either insert an extra - in a UUID or generate them from java.util.UUID. This is the string representation that the rest of the world, including MS SQL Server and Java, uses. Since neither the database nor ColdFusion can cast automatically, we need to make sure we use the right string representation on both sides.

The next question is whether or not to use cfqueryparam. In my opinion not using cfqueryparam should only be done in very specific circumstances where you have a reason not to use it (and comments in the code should explain that reason for the next programmer).

Since we use cfqueryparam we need to decide is whether to use Unicode for our string parameters or not. This is an option in the ColdFusion administrator that enables or disables this for the entire datasource. Most applications I work with need some form of i18n support so we need unicode and have to enable this.

Lastly we need to pick a database driver. ColdFusion ships with a MS SQL Server driver from DataDirect, but there are other alternatives. I have tested the drivers from Microsoft as well during this round.

Test 1

The benchmark I am using for this first round is extremely simple, just a primary key lookup of a row of the table that had the largest relative size reduction between NVARCHAR and UNIQUEIDENTIFIER. This is run for each database datatype, for each driver, once with cfqueryparam and once without cfqueryparam.

Query 1
	SELECT
		*
	FROM
		table
	WHERE
		primary_key= '#searchValue#'
Query 2
	SELECT
		*
	FROM
		table
	WHERE
		primary_key = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">

The query without cfqueryparam is the query that is most similar to a query as we would run it from the MS SQL Management Studio. It is only included to provide a reference since I couldn’t believe the results at first, I would not want this in any production code. The table below shows the results, averaged over all runs and all drivers.

Datatype Query 1 (ms) Query 2 (ms)
UNIQUEIDENTIFIER 5 6
VARCHAR(35) 5 39700
NVARCHAR(35) 6 6

As you can see the differences are minimal, except for the case where we have a query sending a cfqueryparam with Unicode hint. This is the same behavior I have shown before: a character set mismatch forces MS SQL Server to change the execution plan to do a table scan instead of an index lookup. The effect is just much bigger here because this is a real sized dataset and not a small testcase.

Test 2

After seeing this I dropped the VARCHAR database from the tests. That allowed me to increase the number of iterations for each test and add a few more queries and still finish this year. So for the following test I had the following queries:

Query 1
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = '#searchValue#'
Query 2
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = N'#searchValue#'
Query 3
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">
Query 4
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_uniqueidentifier" value="#searchValue#">

The way I ended up with Query 4 is that I had no idea what the proper cfsqltype for a uniqueidentifier was, so I just tried something and it didn’t generate an error. (It doesn’t really work either because it doesn’t do any validation on the client side either.) The results over 100 queries provided an interesting view.

With unicode hint for datasource Without unicode hint for datsource
Datatype Driver Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms) Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms)
UNIQUEIDENTIFIER A 266 485 969 1360 265 500 637 1406
B 250 500 953 1391 250 547 984 1437
NVARCHAR(35) A 297 562 1141 1578 313 579 1110 1594
B 297 579 1079 1625 296 579 1093 1640

As you can see there is a consistent pattern where queries without cfqueryparam are faster then queries with cfqueryparam, and also that cf_sql_varchar is faster then cf_sql_uniqueidentifier. There is a small difference between the speed of a UNIQUEIDENTIFIER and a NVARCHAR and queries with a Unicode hint appear to be marginally faster. The differences between the drivers are well within the margin of error for this test.

But the most important thing is that no combination shows the awful behavior the combination of a VARCHAR field and Unicode hints show.

Preliminary conclusions

So with this batch of tests I can now indicate an order of preference for the datatype to store 128-bit integers in:

  1. If you are building a new application, use the GUID datatype on the ColdFusion side and the UNIQUEIDENTIFIER datatype on the MS SQL Server  side. This minimizes the size of your database, maximizes the speed and gives you a string representation that is compatible with the rest of the world.
  2. If you have an existing application where you use UUIDs and store them in a VARCHAR field, it may be worthwhile to convert that VARCHAR field. Conversion to an NVARCHAR should be quick and easy, conversion to a UNIQUEIDENTIFIER gives the best performance but will require a lot more work.

A still need to figure out where the speed differences in the last table originate: the MS SQL Server or ColdFusion. I have run some initial tests that suggest that the performance difference is actually in the driver and not the database, but since the differences are a lot smaller I need more tests and SQL profiles to figure that out. That may take a while, but when I do I’ll post back.

Last February I posted on how storing UUIDs using the native UUID datatype in PostgreSQL was about 15% faster then using a VARCHAR(35) datatype. In March I posted about the performance effects of the “String Format” setting for MS SQL Server databases when using cfqueryparam. Recently I have been working on checking the performance effects when combining different datatypes for UUIDs with the string formats settings for a MS SQL Server 2005 database. (Or to be more exact, I have been checking the performance effects of different datatypes for storing 128-bit integers in the database.)

The starting point is an existing database for an application. That database stores UUIDs using a VARCHAR(35) datatype. From that database I extracted the CREATE statements for the schema using SQL Server Management Studio and build a script from there that follows the following pattern:

  1. create database
  2. create tables including:
    • primary key constraints
    • indexes
    • check constraints
  3. load data (and transform where needed)
  4. create foreign key constraints

Due to the ordering of the load script where all data is loaded before the foreign keys are created the order in which the tables are created and loaded doesn’t really matter: there will never be foreign key constraint errors, not even with circular foreign keys.

For the different datatypes I then manipulated the script a little bit. The first change was a replace to change all occurrences of VARCHAR(35) to NVARCHAR(35). The second was to change all occurrences of VARCHAR(35) with UNIQUEIDENTIFIER. For the last one I then wrote a function to convert the ColdFusion UUID format to the standard format the the UNIQUEIDENTIFIER datatype expects. This function looks like:

CREATE  function var_to_UUID(@cfUUID VARCHAR(35)) RETURNS uniqueidentifier AS
BEGIN
  Return
    CASE
      WHEN Len(@cfUUID) = 35 THEN Substring(@cfUUID,1,23) + '-' + Substring(@cfUUID,24,12)
      ELSE NULL
    END;
END

Next I had to manipulate the data loading script to use this function to convert all the UUIDs. That required some manual editing to make all the insert statements look like:

INSERT INTO [testUUID].[dbo].[language](
  [languageID]
  ,[language]
  )
SELECT
  testUUID.dbo.var_to_uuid([languageID])
  ,[language]
FROM
  [source_DB].[dbo].[language]

The database I used for these tests is on the one hand far from trivial: 80+ tables with normal and composite primary keys, 120+ foreign keys and many check constraints. On the other hand it is very trivial: just one stored procedure and one UDF. To make sure that I am comparing apples to apples I ran the existing database through the create and load sequence to establish my base for comparison and I ran all databases through the example index optimization script from MSDN. The numbers I got after just that sequence are:

Datatype Storage requirements (bytes) Database size
UNIQUEIDENTIFIER 16 29.6 GB
VARCHAR(35) 35 32.5 GB
NVARCHAR(35) 70 38.6 GB

As you can see the database size correlates very well to the storage size of the different datatypes. Apparently we have about 26.5 GB of non-UUID data in the database (which for this database is mostly in NTEXT fields in 2 tables of 18 GB and 6 GB data size respectively). For the tables most and least affected by the change the statistics are:

Table Datatype Storage requirements (bytes) Table size Index size
Most affected UNIQUEIDENTIFIER 16 2.0 GB 1.5 GB
VARCHAR(35) 35 3.4 GB 2.6 GB
NVARCHAR(35) 70 6.4 GB 5.1 GB
Least affected UNIQUEIDENTIFIER 16 17.9 GB 0.3 MB
VARCHAR(35) 35 17.9 GB 0.7 MB
NVARCHAR(35) 70 17.9 GB 2.0 MB

So as you can see there may be good reason to pick the right datatype: if your schema has many UUID columns you can save upto a factor three in storage space (or hardly any space at al if you have few UUIDs). Looking at it from the MS SQL Sever perspective the preferred storage format is UNIQUEIDENTIFIER, followed by VARCHAR(35), and lastly NVARCHAR(35). But MS SQL Server is only half of the equation, ColdFusion is the other half and in the next installment I will show you why the order from the ColdFusion side is radically different.