In the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database size. Using an NVARCHAR(35) instead of a UNIQUEIDENTIFIER can triple the size of your tables, and a VARCHAR(35) is somewhere in between. Conventional wisdom relates the performance inversely to the size of a database, with a factor between O(ln(n)) for indexed access and O(n) for table access. To test that, I created an artificial benchmark for my databases with several queries that mainly exercise range access to the tables that saw the highest size ratio from the datatype conversion and access more data then fits in RAM. (Due to the string format difference the values used in the queries differed slightly.) Then I ran those queries from the MS SQL Management Studio and checked the wall clock time.
Datatype | Storage requirements (B) | Database size (GB) | Run time (s) |
---|---|---|---|
UNIQUEIDENTIFIER | 16 | 29.6 | 22 |
VARCHAR(35) | 35 | 32.5 | 35 |
NVARCHAR(35) | 70 | 38.6 | 47 |
The benchmark proved conventional wisdom right (unsurprisingly I might add, because it was designed to prove conventional wisdom right).
Querying from ColdFusion
If we add ColdFusion to the mix, the equation changes. Crazy as it may sound, the typeless ColdFusion language has 2 incompatible 128-bit integer datatypes. First there is the UUID, a 128-bit integer with a 35-byte string representation. It has a terribly slow generator function CreateUUID(), but it is incompatible with the rest of the world. Then there is the GUID, a 128-bit integer with a 36-byte string representation. There is no function in ColdFusion to generate GUIDs, but you can either insert an extra - in a UUID or generate them from java.util.UUID. This is the string representation that the rest of the world, including MS SQL Server and Java, uses. Since neither the database nor ColdFusion can cast automatically, we need to make sure we use the right string representation on both sides.
The next question is whether or not to use cfqueryparam. In my opinion not using cfqueryparam should only be done in very specific circumstances where you have a reason not to use it (and comments in the code should explain that reason for the next programmer).
Since we use cfqueryparam we need to decide is whether to use Unicode for our string parameters or not. This is an option in the ColdFusion administrator that enables or disables this for the entire datasource. Most applications I work with need some form of i18n support so we need unicode and have to enable this.
Lastly we need to pick a database driver. ColdFusion ships with a MS SQL Server driver from DataDirect, but there are other alternatives. I have tested the drivers from Microsoft as well during this round.
Test 1
The benchmark I am using for this first round is extremely simple, just a primary key lookup of a row of the table that had the largest relative size reduction between NVARCHAR and UNIQUEIDENTIFIER. This is run for each database datatype, for each driver, once with cfqueryparam and once without cfqueryparam.
Query 1 SELECT * FROM table WHERE primary_key= '#searchValue#' |
Query 2
SELECT
*
FROM
table
WHERE
primary_key = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#"> |
The query without cfqueryparam is the query that is most similar to a query as we would run it from the MS SQL Management Studio. It is only included to provide a reference since I couldn’t believe the results at first, I would not want this in any production code. The table below shows the results, averaged over all runs and all drivers.
Datatype | Query 1 (ms) | Query 2 (ms) |
---|---|---|
UNIQUEIDENTIFIER | 5 | 6 |
VARCHAR(35) | 5 | 39700 |
NVARCHAR(35) | 6 | 6 |
As you can see the differences are minimal, except for the case where we have a query sending a cfqueryparam with Unicode hint. This is the same behavior I have shown before: a character set mismatch forces MS SQL Server to change the execution plan to do a table scan instead of an index lookup. The effect is just much bigger here because this is a real sized dataset and not a small testcase.
Test 2
After seeing this I dropped the VARCHAR database from the tests. That allowed me to increase the number of iterations for each test and add a few more queries and still finish this year. So for the following test I had the following queries:
Query 1 SELECT * FROM questionScore WHERE questionScoreID = '#searchValue#' |
Query 2 SELECT * FROM questionScore WHERE questionScoreID = N'#searchValue#' |
Query 3
SELECT
*
FROM
questionScore
WHERE
questionScoreID = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#"> |
Query 4
SELECT
*
FROM
questionScore
WHERE
questionScoreID = <cfqueryparam cfsqltype="cf_sql_uniqueidentifier" value="#searchValue#"> |
The way I ended up with Query 4 is that I had no idea what the proper cfsqltype for a uniqueidentifier was, so I just tried something and it didn’t generate an error. (It doesn’t really work either because it doesn’t do any validation on the client side either.) The results over 100 queries provided an interesting view.
With unicode hint for datasource | Without unicode hint for datsource | ||||||||
---|---|---|---|---|---|---|---|---|---|
Datatype | Driver | Query 1 (ms) | Query 2 (ms) | Query 3 (ms) | Query 4 (ms) | Query 1 (ms) | Query 2 (ms) | Query 3 (ms) | Query 4 (ms) |
UNIQUEIDENTIFIER | A | 266 | 485 | 969 | 1360 | 265 | 500 | 637 | 1406 |
B | 250 | 500 | 953 | 1391 | 250 | 547 | 984 | 1437 | |
NVARCHAR(35) | A | 297 | 562 | 1141 | 1578 | 313 | 579 | 1110 | 1594 |
B | 297 | 579 | 1079 | 1625 | 296 | 579 | 1093 | 1640 |
As you can see there is a consistent pattern where queries without cfqueryparam are faster then queries with cfqueryparam, and also that cf_sql_varchar is faster then cf_sql_uniqueidentifier. There is a small difference between the speed of a UNIQUEIDENTIFIER and a NVARCHAR and queries with a Unicode hint appear to be marginally faster. The differences between the drivers are well within the margin of error for this test.
But the most important thing is that no combination shows the awful behavior the combination of a VARCHAR field and Unicode hints show.
Preliminary conclusions
So with this batch of tests I can now indicate an order of preference for the datatype to store 128-bit integers in:
- If you are building a new application, use the GUID datatype on the ColdFusion side and the UNIQUEIDENTIFIER datatype on the MS SQL ServerĀ side. This minimizes the size of your database, maximizes the speed and gives you a string representation that is compatible with the rest of the world.
- If you have an existing application where you use UUIDs and store them in a VARCHAR field, it may be worthwhile to convert that VARCHAR field. Conversion to an NVARCHAR should be quick and easy, conversion to a UNIQUEIDENTIFIER gives the best performance but will require a lot more work.
A still need to figure out where the speed differences in the last table originate: the MS SQL Server or ColdFusion. I have run some initial tests that suggest that the performance difference is actually in the driver and not the database, but since the differences are a lot smaller I need more tests and SQL profiles to figure that out. That may take a while, but when I do I’ll post back.
Zac Spitzer says:
I’d be interested to see the stats where you just use a plain old (of course not replicate friendly) numeric key
2008/12/16, 03:04Tjarko says:
Really curious about your next findings, i always had the sneaking suspicion that without cfqueryparam everything went a lot faster, and apparently this is true.
2008/12/16, 08:19Jochem says:
Numeric or integer? I expect numeric to perform about the same as a uniqueidentifier because, although it is smaller, it is a ‘complex’ type: it has some internal representation that needs to be decoded before any operation can be performed on it.
An integer will be a bit faster because it is smaller. Not that much though, when I went from a varchar to a uniqueidentifier I already got most of the gains becuase that conversion gets rid of all the extra internal operations that are involved with manipulating strings (locale, case sensitivity etc.). So all the gains going from uniqueidentifier to integer stem from the difference in storage size, and that matters less.
But this benchmark is designed for a specific situation (real data from a production system that already has UUIDs that can not be shortened to integers) and to prove a specific point (I benchmark tables with predominantly UUID columns), so it would not be very suited to benchmark anything else.
2008/12/16, 10:38Jochem says:
@Tjarko: I think you not extrapolate a benchmark I wrote to prove a specific point to “everything” and “a lot faster”.
2008/12/16, 10:40Orange is my favorite color » Blog Archive » Converting from UUIDs to GUIDs says:
[...] been reading through a handful of Jochem’s posts about Adobe’s (incompatible with the standard 35-character) [...]
2008/12/16, 19:12Tjarko says:
@Jochem What I mean is that when i looked at some sites we are running I had several statements “tuned” with cfqueryparam and after those changes the site went from tens of milliseconds to hundreds of milliseconds. Changing it back after several days returned everything to tens of milliseconds. These sites have about 10 to 15 thousand visitors every day.. and that makes a difference in performance.. hence “everything a lot faster” The only difference for this was the use of cfqueryparam.
2008/12/18, 10:15Judah McAuley says:
Jochem, just an fyi, but there is a cfsqlparam datatype specified for uniqueidentifiers. It is CF_SQL_IDSTAMP I would not have known but I was just running Illudium against a table with a uniqueidentifier and noticed a weird cfsqlparam that I hadn’t seen before.
http://livedocs.adobe.com/coldfusion/7/htmldocs/wwhelp/wwhimpl/common/html/wwhelp.htm?context=ColdFusion_Documentation&file=00000317.htm
2009/03/06, 23:55