In the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database size. Using an NVARCHAR(35) instead of a UNIQUEIDENTIFIER can triple the size of your tables, and a VARCHAR(35) is somewhere in between. Conventional wisdom relates the performance inversely to the size of a database, with a factor between O(ln(n)) for indexed access and O(n) for table access. To test that, I created an artificial benchmark for my databases with several queries that mainly exercise range access to the tables that saw the highest size ratio from the datatype conversion and access more data then fits in RAM. (Due to the string format difference the values used in the queries differed slightly.) Then I ran those queries from the MS SQL Management Studio and checked the wall clock time.
Datatype | Storage requirements (B) | Database size (GB) | Run time (s) |
---|---|---|---|
UNIQUEIDENTIFIER | 16 | 29.6 | 22 |
VARCHAR(35) | 35 | 32.5 | 35 |
NVARCHAR(35) | 70 | 38.6 | 47 |
The benchmark proved conventional wisdom right (unsurprisingly I might add, because it was designed to prove conventional wisdom right).
Querying from ColdFusion
If we add ColdFusion to the mix, the equation changes. Crazy as it may sound, the typeless ColdFusion language has 2 incompatible 128-bit integer datatypes. First there is the UUID, a 128-bit integer with a 35-byte string representation. It has a terribly slow generator function CreateUUID(), but it is incompatible with the rest of the world. Then there is the GUID, a 128-bit integer with a 36-byte string representation. There is no function in ColdFusion to generate GUIDs, but you can either insert an extra - in a UUID or generate them from java.util.UUID. This is the string representation that the rest of the world, including MS SQL Server and Java, uses. Since neither the database nor ColdFusion can cast automatically, we need to make sure we use the right string representation on both sides.
The next question is whether or not to use cfqueryparam. In my opinion not using cfqueryparam should only be done in very specific circumstances where you have a reason not to use it (and comments in the code should explain that reason for the next programmer).
Since we use cfqueryparam we need to decide is whether to use Unicode for our string parameters or not. This is an option in the ColdFusion administrator that enables or disables this for the entire datasource. Most applications I work with need some form of i18n support so we need unicode and have to enable this.
Lastly we need to pick a database driver. ColdFusion ships with a MS SQL Server driver from DataDirect, but there are other alternatives. I have tested the drivers from Microsoft as well during this round.
Test 1
The benchmark I am using for this first round is extremely simple, just a primary key lookup of a row of the table that had the largest relative size reduction between NVARCHAR and UNIQUEIDENTIFIER. This is run for each database datatype, for each driver, once with cfqueryparam and once without cfqueryparam.
Query 1 SELECT * FROM table WHERE primary_key= '#searchValue#' |
Query 2
SELECT
*
FROM
table
WHERE
primary_key = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#"> |
The query without cfqueryparam is the query that is most similar to a query as we would run it from the MS SQL Management Studio. It is only included to provide a reference since I couldn’t believe the results at first, I would not want this in any production code. The table below shows the results, averaged over all runs and all drivers.
Datatype | Query 1 (ms) | Query 2 (ms) |
---|---|---|
UNIQUEIDENTIFIER | 5 | 6 |
VARCHAR(35) | 5 | 39700 |
NVARCHAR(35) | 6 | 6 |
As you can see the differences are minimal, except for the case where we have a query sending a cfqueryparam with Unicode hint. This is the same behavior I have shown before: a character set mismatch forces MS SQL Server to change the execution plan to do a table scan instead of an index lookup. The effect is just much bigger here because this is a real sized dataset and not a small testcase.
Test 2
After seeing this I dropped the VARCHAR database from the tests. That allowed me to increase the number of iterations for each test and add a few more queries and still finish this year. So for the following test I had the following queries:
Query 1 SELECT * FROM questionScore WHERE questionScoreID = '#searchValue#' |
Query 2 SELECT * FROM questionScore WHERE questionScoreID = N'#searchValue#' |
Query 3
SELECT
*
FROM
questionScore
WHERE
questionScoreID = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#"> |
Query 4
SELECT
*
FROM
questionScore
WHERE
questionScoreID = <cfqueryparam cfsqltype="cf_sql_uniqueidentifier" value="#searchValue#"> |
The way I ended up with Query 4 is that I had no idea what the proper cfsqltype for a uniqueidentifier was, so I just tried something and it didn’t generate an error. (It doesn’t really work either because it doesn’t do any validation on the client side either.) The results over 100 queries provided an interesting view.
With unicode hint for datasource | Without unicode hint for datsource | ||||||||
---|---|---|---|---|---|---|---|---|---|
Datatype | Driver | Query 1 (ms) | Query 2 (ms) | Query 3 (ms) | Query 4 (ms) | Query 1 (ms) | Query 2 (ms) | Query 3 (ms) | Query 4 (ms) |
UNIQUEIDENTIFIER | A | 266 | 485 | 969 | 1360 | 265 | 500 | 637 | 1406 |
B | 250 | 500 | 953 | 1391 | 250 | 547 | 984 | 1437 | |
NVARCHAR(35) | A | 297 | 562 | 1141 | 1578 | 313 | 579 | 1110 | 1594 |
B | 297 | 579 | 1079 | 1625 | 296 | 579 | 1093 | 1640 |
As you can see there is a consistent pattern where queries without cfqueryparam are faster then queries with cfqueryparam, and also that cf_sql_varchar is faster then cf_sql_uniqueidentifier. There is a small difference between the speed of a UNIQUEIDENTIFIER and a NVARCHAR and queries with a Unicode hint appear to be marginally faster. The differences between the drivers are well within the margin of error for this test.
But the most important thing is that no combination shows the awful behavior the combination of a VARCHAR field and Unicode hints show.
Preliminary conclusions
So with this batch of tests I can now indicate an order of preference for the datatype to store 128-bit integers in:
- If you are building a new application, use the GUID datatype on the ColdFusion side and the UNIQUEIDENTIFIER datatype on the MS SQL Server side. This minimizes the size of your database, maximizes the speed and gives you a string representation that is compatible with the rest of the world.
- If you have an existing application where you use UUIDs and store them in a VARCHAR field, it may be worthwhile to convert that VARCHAR field. Conversion to an NVARCHAR should be quick and easy, conversion to a UNIQUEIDENTIFIER gives the best performance but will require a lot more work.
A still need to figure out where the speed differences in the last table originate: the MS SQL Server or ColdFusion. I have run some initial tests that suggest that the performance difference is actually in the driver and not the database, but since the differences are a lot smaller I need more tests and SQL profiles to figure that out. That may take a while, but when I do I’ll post back.