Posts tagged ‘NVARCHAR’

In the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database size. Using an NVARCHAR(35) instead of a UNIQUEIDENTIFIER can triple the size of your tables, and a VARCHAR(35) is somewhere in between. Conventional wisdom relates the performance inversely to the size of a database, with a factor between O(ln(n)) for indexed access and O(n) for table access. To test that, I created an artificial benchmark for my databases with several queries that mainly exercise range access to the tables that saw the highest size ratio from the datatype conversion and access more data then fits in RAM. (Due to the string format difference the values used in the queries differed slightly.) Then I ran those queries from the MS SQL Management Studio and checked the wall clock time.

Datatype Storage requirements (B) Database size (GB) Run time (s)
UNIQUEIDENTIFIER 16 29.6 22
VARCHAR(35) 35 32.5 35
NVARCHAR(35) 70 38.6 47

The benchmark proved conventional wisdom right (unsurprisingly I might add, because it was designed to prove conventional wisdom right).

Querying from ColdFusion

If we add ColdFusion to the mix, the equation changes. Crazy as it may sound, the typeless ColdFusion language has 2 incompatible 128-bit integer datatypes. First there is the UUID, a 128-bit integer with a 35-byte string representation. It has a terribly slow generator function CreateUUID(), but it is incompatible with the rest of the world. Then there is the GUID, a 128-bit integer with a 36-byte string representation. There is no function in ColdFusion to generate GUIDs, but you can either insert an extra - in a UUID or generate them from java.util.UUID. This is the string representation that the rest of the world, including MS SQL Server and Java, uses. Since neither the database nor ColdFusion can cast automatically, we need to make sure we use the right string representation on both sides.

The next question is whether or not to use cfqueryparam. In my opinion not using cfqueryparam should only be done in very specific circumstances where you have a reason not to use it (and comments in the code should explain that reason for the next programmer).

Since we use cfqueryparam we need to decide is whether to use Unicode for our string parameters or not. This is an option in the ColdFusion administrator that enables or disables this for the entire datasource. Most applications I work with need some form of i18n support so we need unicode and have to enable this.

Lastly we need to pick a database driver. ColdFusion ships with a MS SQL Server driver from DataDirect, but there are other alternatives. I have tested the drivers from Microsoft as well during this round.

Test 1

The benchmark I am using for this first round is extremely simple, just a primary key lookup of a row of the table that had the largest relative size reduction between NVARCHAR and UNIQUEIDENTIFIER. This is run for each database datatype, for each driver, once with cfqueryparam and once without cfqueryparam.

Query 1
	SELECT
		*
	FROM
		table
	WHERE
		primary_key= '#searchValue#'
Query 2
	SELECT
		*
	FROM
		table
	WHERE
		primary_key = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">

The query without cfqueryparam is the query that is most similar to a query as we would run it from the MS SQL Management Studio. It is only included to provide a reference since I couldn’t believe the results at first, I would not want this in any production code. The table below shows the results, averaged over all runs and all drivers.

Datatype Query 1 (ms) Query 2 (ms)
UNIQUEIDENTIFIER 5 6
VARCHAR(35) 5 39700
NVARCHAR(35) 6 6

As you can see the differences are minimal, except for the case where we have a query sending a cfqueryparam with Unicode hint. This is the same behavior I have shown before: a character set mismatch forces MS SQL Server to change the execution plan to do a table scan instead of an index lookup. The effect is just much bigger here because this is a real sized dataset and not a small testcase.

Test 2

After seeing this I dropped the VARCHAR database from the tests. That allowed me to increase the number of iterations for each test and add a few more queries and still finish this year. So for the following test I had the following queries:

Query 1
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = '#searchValue#'
Query 2
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = N'#searchValue#'
Query 3
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">
Query 4
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_uniqueidentifier" value="#searchValue#">

The way I ended up with Query 4 is that I had no idea what the proper cfsqltype for a uniqueidentifier was, so I just tried something and it didn’t generate an error. (It doesn’t really work either because it doesn’t do any validation on the client side either.) The results over 100 queries provided an interesting view.

With unicode hint for datasource Without unicode hint for datsource
Datatype Driver Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms) Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms)
UNIQUEIDENTIFIER A 266 485 969 1360 265 500 637 1406
B 250 500 953 1391 250 547 984 1437
NVARCHAR(35) A 297 562 1141 1578 313 579 1110 1594
B 297 579 1079 1625 296 579 1093 1640

As you can see there is a consistent pattern where queries without cfqueryparam are faster then queries with cfqueryparam, and also that cf_sql_varchar is faster then cf_sql_uniqueidentifier. There is a small difference between the speed of a UNIQUEIDENTIFIER and a NVARCHAR and queries with a Unicode hint appear to be marginally faster. The differences between the drivers are well within the margin of error for this test.

But the most important thing is that no combination shows the awful behavior the combination of a VARCHAR field and Unicode hints show.

Preliminary conclusions

So with this batch of tests I can now indicate an order of preference for the datatype to store 128-bit integers in:

  1. If you are building a new application, use the GUID datatype on the ColdFusion side and the UNIQUEIDENTIFIER datatype on the MS SQL Server  side. This minimizes the size of your database, maximizes the speed and gives you a string representation that is compatible with the rest of the world.
  2. If you have an existing application where you use UUIDs and store them in a VARCHAR field, it may be worthwhile to convert that VARCHAR field. Conversion to an NVARCHAR should be quick and easy, conversion to a UNIQUEIDENTIFIER gives the best performance but will require a lot more work.

A still need to figure out where the speed differences in the last table originate: the MS SQL Server or ColdFusion. I have run some initial tests that suggest that the performance difference is actually in the driver and not the database, but since the differences are a lot smaller I need more tests and SQL profiles to figure that out. That may take a while, but when I do I’ll post back.

Last February I posted on how storing UUIDs using the native UUID datatype in PostgreSQL was about 15% faster then using a VARCHAR(35) datatype. In March I posted about the performance effects of the “String Format” setting for MS SQL Server databases when using cfqueryparam. Recently I have been working on checking the performance effects when combining different datatypes for UUIDs with the string formats settings for a MS SQL Server 2005 database. (Or to be more exact, I have been checking the performance effects of different datatypes for storing 128-bit integers in the database.)

The starting point is an existing database for an application. That database stores UUIDs using a VARCHAR(35) datatype. From that database I extracted the CREATE statements for the schema using SQL Server Management Studio and build a script from there that follows the following pattern:

  1. create database
  2. create tables including:
    • primary key constraints
    • indexes
    • check constraints
  3. load data (and transform where needed)
  4. create foreign key constraints

Due to the ordering of the load script where all data is loaded before the foreign keys are created the order in which the tables are created and loaded doesn’t really matter: there will never be foreign key constraint errors, not even with circular foreign keys.

For the different datatypes I then manipulated the script a little bit. The first change was a replace to change all occurrences of VARCHAR(35) to NVARCHAR(35). The second was to change all occurrences of VARCHAR(35) with UNIQUEIDENTIFIER. For the last one I then wrote a function to convert the ColdFusion UUID format to the standard format the the UNIQUEIDENTIFIER datatype expects. This function looks like:

CREATE  function var_to_UUID(@cfUUID VARCHAR(35)) RETURNS uniqueidentifier AS
BEGIN
  Return
    CASE
      WHEN Len(@cfUUID) = 35 THEN Substring(@cfUUID,1,23) + '-' + Substring(@cfUUID,24,12)
      ELSE NULL
    END;
END

Next I had to manipulate the data loading script to use this function to convert all the UUIDs. That required some manual editing to make all the insert statements look like:

INSERT INTO [testUUID].[dbo].[language](
  [languageID]
  ,[language]
  )
SELECT
  testUUID.dbo.var_to_uuid([languageID])
  ,[language]
FROM
  [source_DB].[dbo].[language]

The database I used for these tests is on the one hand far from trivial: 80+ tables with normal and composite primary keys, 120+ foreign keys and many check constraints. On the other hand it is very trivial: just one stored procedure and one UDF. To make sure that I am comparing apples to apples I ran the existing database through the create and load sequence to establish my base for comparison and I ran all databases through the example index optimization script from MSDN. The numbers I got after just that sequence are:

Datatype Storage requirements (bytes) Database size
UNIQUEIDENTIFIER 16 29.6 GB
VARCHAR(35) 35 32.5 GB
NVARCHAR(35) 70 38.6 GB

As you can see the database size correlates very well to the storage size of the different datatypes. Apparently we have about 26.5 GB of non-UUID data in the database (which for this database is mostly in NTEXT fields in 2 tables of 18 GB and 6 GB data size respectively). For the tables most and least affected by the change the statistics are:

Table Datatype Storage requirements (bytes) Table size Index size
Most affected UNIQUEIDENTIFIER 16 2.0 GB 1.5 GB
VARCHAR(35) 35 3.4 GB 2.6 GB
NVARCHAR(35) 70 6.4 GB 5.1 GB
Least affected UNIQUEIDENTIFIER 16 17.9 GB 0.3 MB
VARCHAR(35) 35 17.9 GB 0.7 MB
NVARCHAR(35) 70 17.9 GB 2.0 MB

So as you can see there may be good reason to pick the right datatype: if your schema has many UUID columns you can save upto a factor three in storage space (or hardly any space at al if you have few UUIDs). Looking at it from the MS SQL Sever perspective the preferred storage format is UNIQUEIDENTIFIER, followed by VARCHAR(35), and lastly NVARCHAR(35). But MS SQL Server is only half of the equation, ColdFusion is the other half and in the next installment I will show you why the order from the ColdFusion side is radically different.