Archive for the ‘databases’ Category

At Prisma IT we have an application that has been under constant development for 2 years. Over that time it has seen lots of feature creep and changing requirements, and no refactoring. As you can imagine that doesn’t make for pretty code or an optimal database model. Due to a retargeting of that application and a much more focused requirements process we are finally getting ready to start reworking that application. Over the past month I have been working on and off (mostly in the train on my way to clients) on putting more structure in the long list with things we (I) really would like to change about the technology of the application. I have also been running some experiments on the code to see what the impact of certain changes would be. Amongst others this includes the experiments with changing the datatype used in MS SQL Server to store ColdFusion UUIDs.

One of the last things I wanted was to get a feel for the cost/benefit ratio of a conversion from MS SQL Server to PostgreSQL. Continue reading ‘Converting a CF application from MS SQL Server to PostgreSQL?’ »

So far in our attempts to get access to the templates and data of other hosted customers we have primarily focused on accessing the templates of other users in order to get access to their data through them. A more direct approach is to try to access the database directly.

Registered datasources

To access a registered datasource all you need is its name. More often then not, names are easy to guess. Depending on how the datasource is registered in the CF Administrator you may need a username and password as well. If the username and password to the datasource are registered in the ColdFusion Administrator ColdFusion will write them to neo-datasource.xml. They are encrypted there, but with a reversible algorithm (ColdFusion needs to be able to decrypt them in order to authenticate). The way to decrypt these passwords is well known. So if you have read access to neo-datasource.xml through an incorrect Sandbox configuration you have all usernames and passwords registered there.
If the username and password of the datasource are not registered in the ColdFusion Administrator chances are they are stored in the application scope somewhere and we have already seen that the application scope is insecure.

The first line of defense against this is to configure Sandbox Security to only allow access to specific datasources to each Sandbox. This will effectively lock down access from ColdFusion to the datasource, even if you know the full connection string.

Database to database connections

But as good as the protection for datasources is in ColdFusion, it only protects against connections from ColdFusion’s cfquery and cfstoredproc. Just as the trick to work around a disabled cfexecute is to create Java / COM / .NET objects, the trick to work around a disabled ColdFusion datasource connection is to start the connection from somewhere else. And the way to do that is to connect from one database to another. The protocol to do so, SQL/MED, is not widely implemented in databases, but most databases offer proprietary ways to do so. For instance, in MS SQL Server you would use:

SELECT *
FROM OpenDataSource(
  ’SQLOLEDB‘
  ,Data Source=server.asdf.com;User ID=yourusername;Password=yourpassword‘
  ).somedatabase.dbo.sometable;

In PostgreSQL that would be:

SELECT *
FROM dblink(
  'dbname=postgres'
  , 'SELECT ID, value FROM tables'
  ) AS t1(
  ID INT
  , value TEXT
  );

The one thing that makes this more difficult is that you usually need some sort of superuser privilege on the database server to be allowed to do this.

MS Access

MS Access is a file based database that is pretty popular in shared hosting. While people frequently question performance, scalability and security, it is still an attractive offering because it allows users to download the database, edit it offline and then upload an entirely new database with new data.

MS Access offers a very simple feature to access tables in other MS Access databases without the superuser privileges you would need on a more full-featured database server. You just specify the path to the data file you want to access directly in the query:

SELECT *
FROM table IN 'h:\sites\database.mdb'

I was reading a little discussion on whether the maxrows attribute of cfquery was pushed down all the way into the database like a TOP or LIMIT statement would, or whether the full query statement was executed and only a limited number of rows was retrieved. And while there was a lot of debate, there was no proof, not even a reference to some standard that said how it should work (which is something entirely different from how it does work).

Since I am personally convinced that the full query statement is executed and only a limited number of rows is retrieved I set out to prove so. In order to prove so, we need to have a query statement that returns rows to ColdFusion, but has a side-effect as well that is registered in the database. The the results of that side effect should be visible in ColdFusion. My first idea was to do a select with a RAISE NOTICE call per record, and then see if the number of print notices was equal to the maxrows in the cfquery. Only the output of a notice isn’t visible in ColdFusion.

The new idea was to write a query statement that not just inserts a record, but returns a record at the same time. That is easily done with the RETURNING clause in PostgreSQL.

INSERT INTO x (y)
VALUES (z)
RETURNING *

With some searching I figured out that MS SQL Server offers similar functionality

INSERT INTO x (y)
OUTPUT INSERTED.*
VALUES (z)

So here is the script to set up the database tables and values:

create table src (ID INTEGER);
create table tgt (ID INTEGER);
 
insert into src (ID) values(1);
insert into src (ID) values(2);
insert into src (ID) values(3);
insert into src (ID) values(4);
insert into src (ID) values(5);
insert into src (ID) values(6);
insert into src (ID) values(7);
insert into src (ID) values(8);
insert into src (ID) values(9);
insert into src (ID) values(10);

And the test script to run from ColdFusion:

<!--- baseline --->
<cfquery datasource="pgtest">
DELETE FROM tgt;
</cfquery>
<cfquery datasource="pgtest" result="debug">
INSERT INTO tgt (ID) SELECT ID FROM src RETURNING *;
</cfquery>
<cfquery datasource="pgtest" name="result">
SELECT COUNT(*) AS result FROM tgt;
</cfquery>
<cfdump var="#debug#">
<cfdump var="#result.result#">
 
<!--- LIMIT/TOP --->
<cfquery datasource="pgtest">
DELETE FROM tgt;
</cfquery>
<cfquery datasource="pgtest" result="debug">
INSERT INTO tgt (ID) SELECT ID FROM src LIMIT 5 RETURNING *;
</cfquery>
<cfquery datasource="pgtest" name="result">
SELECT COUNT(*) AS result FROM tgt;
</cfquery>
<cfdump var="#debug#">
<cfdump var="#result.result#">
 
<!--- maxrows --->
<cfquery datasource="pgtest">
DELETE FROM tgt;
</cfquery>
<cfquery datasource="pgtest" result="debug" maxrows="5">
INSERT INTO tgt (ID) SELECT ID FROM src RETURNING *;
</cfquery>
<cfquery datasource="pgtest" name="result">
SELECT COUNT(*) AS result FROM tgt;
</cfquery>
<cfdump var="#debug#">
<cfdump var="#result.result#">

Results

The results are very clear:

Query PostgreSQL MS SQL Server
cfquery recordcount database count cfquery recordcount database count
Baseline 10 10 10 10
LIMIT / TOP 5 5 5 5
maxrows 5 10 5 10

So using a maxrows attribute in your query does not prevent you database from executing the query statement completely. The limit in the number of records is only enforced in the JDBC driver (or in ColdFusion) and the database runs the full query.

I would be interested in seeing results from other databases, although I don’t expect them to be different.

In the previous part I have shown that the datatype you use in MS SQL Server has major consequences for the database size. Using an NVARCHAR(35) instead of a UNIQUEIDENTIFIER can triple the size of your tables, and a VARCHAR(35) is somewhere in between. Conventional wisdom relates the performance inversely to the size of a database, with a factor between O(ln(n)) for indexed access and O(n) for table access. To test that, I created an artificial benchmark for my databases with several queries that mainly exercise range access to the tables that saw the highest size ratio from the datatype conversion and access more data then fits in RAM. (Due to the string format difference the values used in the queries differed slightly.) Then I ran those queries from the MS SQL Management Studio and checked the wall clock time.

Datatype Storage requirements (B) Database size (GB) Run time (s)
UNIQUEIDENTIFIER 16 29.6 22
VARCHAR(35) 35 32.5 35
NVARCHAR(35) 70 38.6 47

The benchmark proved conventional wisdom right (unsurprisingly I might add, because it was designed to prove conventional wisdom right).

Querying from ColdFusion

If we add ColdFusion to the mix, the equation changes. Crazy as it may sound, the typeless ColdFusion language has 2 incompatible 128-bit integer datatypes. First there is the UUID, a 128-bit integer with a 35-byte string representation. It has a terribly slow generator function CreateUUID(), but it is incompatible with the rest of the world. Then there is the GUID, a 128-bit integer with a 36-byte string representation. There is no function in ColdFusion to generate GUIDs, but you can either insert an extra - in a UUID or generate them from java.util.UUID. This is the string representation that the rest of the world, including MS SQL Server and Java, uses. Since neither the database nor ColdFusion can cast automatically, we need to make sure we use the right string representation on both sides.

The next question is whether or not to use cfqueryparam. In my opinion not using cfqueryparam should only be done in very specific circumstances where you have a reason not to use it (and comments in the code should explain that reason for the next programmer).

Since we use cfqueryparam we need to decide is whether to use Unicode for our string parameters or not. This is an option in the ColdFusion administrator that enables or disables this for the entire datasource. Most applications I work with need some form of i18n support so we need unicode and have to enable this.

Lastly we need to pick a database driver. ColdFusion ships with a MS SQL Server driver from DataDirect, but there are other alternatives. I have tested the drivers from Microsoft as well during this round.

Test 1

The benchmark I am using for this first round is extremely simple, just a primary key lookup of a row of the table that had the largest relative size reduction between NVARCHAR and UNIQUEIDENTIFIER. This is run for each database datatype, for each driver, once with cfqueryparam and once without cfqueryparam.

Query 1
	SELECT
		*
	FROM
		table
	WHERE
		primary_key= '#searchValue#'
Query 2
	SELECT
		*
	FROM
		table
	WHERE
		primary_key = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">

The query without cfqueryparam is the query that is most similar to a query as we would run it from the MS SQL Management Studio. It is only included to provide a reference since I couldn’t believe the results at first, I would not want this in any production code. The table below shows the results, averaged over all runs and all drivers.

Datatype Query 1 (ms) Query 2 (ms)
UNIQUEIDENTIFIER 5 6
VARCHAR(35) 5 39700
NVARCHAR(35) 6 6

As you can see the differences are minimal, except for the case where we have a query sending a cfqueryparam with Unicode hint. This is the same behavior I have shown before: a character set mismatch forces MS SQL Server to change the execution plan to do a table scan instead of an index lookup. The effect is just much bigger here because this is a real sized dataset and not a small testcase.

Test 2

After seeing this I dropped the VARCHAR database from the tests. That allowed me to increase the number of iterations for each test and add a few more queries and still finish this year. So for the following test I had the following queries:

Query 1
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = '#searchValue#'
Query 2
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = N'#searchValue#'
Query 3
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_varchar" value="#searchValue#">
Query 4
	SELECT
		*
	FROM
		questionScore
	WHERE
		questionScoreID = <cfqueryparam cfsqltype="cf_sql_uniqueidentifier" value="#searchValue#">

The way I ended up with Query 4 is that I had no idea what the proper cfsqltype for a uniqueidentifier was, so I just tried something and it didn’t generate an error. (It doesn’t really work either because it doesn’t do any validation on the client side either.) The results over 100 queries provided an interesting view.

With unicode hint for datasource Without unicode hint for datsource
Datatype Driver Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms) Query 1 (ms) Query 2 (ms) Query 3 (ms) Query 4 (ms)
UNIQUEIDENTIFIER A 266 485 969 1360 265 500 637 1406
B 250 500 953 1391 250 547 984 1437
NVARCHAR(35) A 297 562 1141 1578 313 579 1110 1594
B 297 579 1079 1625 296 579 1093 1640

As you can see there is a consistent pattern where queries without cfqueryparam are faster then queries with cfqueryparam, and also that cf_sql_varchar is faster then cf_sql_uniqueidentifier. There is a small difference between the speed of a UNIQUEIDENTIFIER and a NVARCHAR and queries with a Unicode hint appear to be marginally faster. The differences between the drivers are well within the margin of error for this test.

But the most important thing is that no combination shows the awful behavior the combination of a VARCHAR field and Unicode hints show.

Preliminary conclusions

So with this batch of tests I can now indicate an order of preference for the datatype to store 128-bit integers in:

  1. If you are building a new application, use the GUID datatype on the ColdFusion side and the UNIQUEIDENTIFIER datatype on the MS SQL Server  side. This minimizes the size of your database, maximizes the speed and gives you a string representation that is compatible with the rest of the world.
  2. If you have an existing application where you use UUIDs and store them in a VARCHAR field, it may be worthwhile to convert that VARCHAR field. Conversion to an NVARCHAR should be quick and easy, conversion to a UNIQUEIDENTIFIER gives the best performance but will require a lot more work.

A still need to figure out where the speed differences in the last table originate: the MS SQL Server or ColdFusion. I have run some initial tests that suggest that the performance difference is actually in the driver and not the database, but since the differences are a lot smaller I need more tests and SQL profiles to figure that out. That may take a while, but when I do I’ll post back.

Last February I posted on how storing UUIDs using the native UUID datatype in PostgreSQL was about 15% faster then using a VARCHAR(35) datatype. In March I posted about the performance effects of the “String Format” setting for MS SQL Server databases when using cfqueryparam. Recently I have been working on checking the performance effects when combining different datatypes for UUIDs with the string formats settings for a MS SQL Server 2005 database. (Or to be more exact, I have been checking the performance effects of different datatypes for storing 128-bit integers in the database.)

The starting point is an existing database for an application. That database stores UUIDs using a VARCHAR(35) datatype. From that database I extracted the CREATE statements for the schema using SQL Server Management Studio and build a script from there that follows the following pattern:

  1. create database
  2. create tables including:
    • primary key constraints
    • indexes
    • check constraints
  3. load data (and transform where needed)
  4. create foreign key constraints

Due to the ordering of the load script where all data is loaded before the foreign keys are created the order in which the tables are created and loaded doesn’t really matter: there will never be foreign key constraint errors, not even with circular foreign keys.

For the different datatypes I then manipulated the script a little bit. The first change was a replace to change all occurrences of VARCHAR(35) to NVARCHAR(35). The second was to change all occurrences of VARCHAR(35) with UNIQUEIDENTIFIER. For the last one I then wrote a function to convert the ColdFusion UUID format to the standard format the the UNIQUEIDENTIFIER datatype expects. This function looks like:

CREATE  function var_to_UUID(@cfUUID VARCHAR(35)) RETURNS uniqueidentifier AS
BEGIN
  Return
    CASE
      WHEN Len(@cfUUID) = 35 THEN Substring(@cfUUID,1,23) + '-' + Substring(@cfUUID,24,12)
      ELSE NULL
    END;
END

Next I had to manipulate the data loading script to use this function to convert all the UUIDs. That required some manual editing to make all the insert statements look like:

INSERT INTO [testUUID].[dbo].[language](
  [languageID]
  ,[language]
  )
SELECT
  testUUID.dbo.var_to_uuid([languageID])
  ,[language]
FROM
  [source_DB].[dbo].[language]

The database I used for these tests is on the one hand far from trivial: 80+ tables with normal and composite primary keys, 120+ foreign keys and many check constraints. On the other hand it is very trivial: just one stored procedure and one UDF. To make sure that I am comparing apples to apples I ran the existing database through the create and load sequence to establish my base for comparison and I ran all databases through the example index optimization script from MSDN. The numbers I got after just that sequence are:

Datatype Storage requirements (bytes) Database size
UNIQUEIDENTIFIER 16 29.6 GB
VARCHAR(35) 35 32.5 GB
NVARCHAR(35) 70 38.6 GB

As you can see the database size correlates very well to the storage size of the different datatypes. Apparently we have about 26.5 GB of non-UUID data in the database (which for this database is mostly in NTEXT fields in 2 tables of 18 GB and 6 GB data size respectively). For the tables most and least affected by the change the statistics are:

Table Datatype Storage requirements (bytes) Table size Index size
Most affected UNIQUEIDENTIFIER 16 2.0 GB 1.5 GB
VARCHAR(35) 35 3.4 GB 2.6 GB
NVARCHAR(35) 70 6.4 GB 5.1 GB
Least affected UNIQUEIDENTIFIER 16 17.9 GB 0.3 MB
VARCHAR(35) 35 17.9 GB 0.7 MB
NVARCHAR(35) 70 17.9 GB 2.0 MB

So as you can see there may be good reason to pick the right datatype: if your schema has many UUID columns you can save upto a factor three in storage space (or hardly any space at al if you have few UUIDs). Looking at it from the MS SQL Sever perspective the preferred storage format is UNIQUEIDENTIFIER, followed by VARCHAR(35), and lastly NVARCHAR(35). But MS SQL Server is only half of the equation, ColdFusion is the other half and in the next installment I will show you why the order from the ColdFusion side is radically different.

So I have a shiny new laptop (Lenovo Thinkpad T61p) that I have been configuring for the last 2 weeks now. I have installed all the office applications and most of my development tools. ColdFusion 8.0.1, Eclipse 3.4, FlexBuilder, CS4 and MS SQL Server Express 2008. verything worked reasonably well. Except that I had to update the JDBC drivers from Microsoft to version 2 CTP, and then I had to switch to the JDBC 4 drivers because the other driver is supposedly incompatible with an 1.6 JVM. (It worked just fine with the 1.x versions of the drivers.)

And that is where the trouble started, many queries suddenly gave datatype resolution errors in ColdFusion. After some testing it turned out that problem queries are typically similar to:

	SELECT
		T.credits
		, PT.pageText AS displayName
	FROM
		test
			INNER JOIN
		pageText PT ON T.nameTextID = PT.pageTextID

In a subsequent QoQ there would be an error if the datatype of PT.pageText was an N-datatype. What fixes the problem is to change the datatype using an explicit cast:

	SELECT
		T.credits
		, CAST(PT.pageText AS VARCHAR) AS displayName
	FROM
		test
			INNER JOIN
		pageText PT ON T.nameTextID = PT.pageTextID

Not something that I fancy changing in all queries, but since this version of the driver is still a technology preview not something that really bothers me either.

Next on the list of applications to install was LiveCycle ES. Installation was a bit troublesome, the installer doesn’t recognize Flash 10, the LC Designer from LC ES is incompatible with the LC Designer from CS4 and the configuration instructions haven’t been updated for CS4 yet, but in the end I got it all installed. But not running. LC Designer crashed consistently within 3 seconds of being started. After lots of digging and searching I finally stumbled upon an explanation of the problem.

SQL Server installs a newer version of the MFC DLL on top of the one we install, and the new version has a critical bug.

Other people have reported there are lots of other applications that suddenly start crashing as soon as MS SQL Server 2008 is installed. Or maybe it is not really MS SQL Server 2008 but on of the dependencies, like the .NET Framework 3.5. I don’t know, I don’t care, I think MS SQL Server 2008 has wasted enough of my time and I have uninstalled it completely.

Hidden below the “Show Advanced Settings” button on the datasource settings page for MS SQL Server lies the “String format” setting. According to the manual this setting is used to “Enable this option if your application uses Unicode data in DBMS-specific Unicode datatypes, such as National Character or nchar.” In other words, this allows you to use N fields in your database without adding those pesky N’s to your SQL and without butchering your data in the character set conversion process.

What that doesn’t tell you is that this setting can have serious performance consequences. When you enable this setting all variables send to the database using cfqueryparam are sent in their N format. Even the ones where the corresponding database field is not an N field but a regular field. And when that happens you have an encoding mismatch between the parameter and the column. Not really a problem when it is part of an insert statement, but disastrous when it is part of a predicate. Because an encoding mismatch will often prevent MS SQL Server from using indexes to execute the query, the query plan you nicely tuned from the SQL Server Management Studio (without unicode) will degrade into table scans.

Same encoding:
Index Seek(OBJECT:([playground].[dbo].[testCase].[testCase_reg]), SEEK:([playground].[dbo].[testCase].[regular]=[@1]) ORDERED FORWARD)

Encoding mismatch:
Table Scan(OBJECT:([playground].[dbo].[testCase]), WHERE:(CONVERT_IMPLICIT(nvarchar(36),[playground].[dbo].[testCase].[regular],0)=[@P1]))

In a simple testcase with only 256K short records this already accounts for a speed degradation from 1 to 100 milliseconds. I suspect that in real scenario’s the performance impact may very well be magnitudes larger because wide columns will make your I/O explode and there are all sorts of concurrency penalties if you switch from index access to full table scans.

The good news is that it is pretty easy to identify whether you are having this problem. Just profile your SQL Server with the option to trace the event SHOWPLAN ALL and in all your execution plans search for the string CONVERT_IMPLICIT. The bad news is that this is hard to fix. Switching “String format” off means your data gets garbled, switching it on means your server dies from an overload, so pretty much the only solution is to switch your entire database to N fields. That makes you wonder why other databases can do without this setting. For PostgreSQL I know the answer: charsets are a database wide setting so this situation can’t happen, you just switch your entire database to UTF-8 (and since it is UTF-8 and not UCS-16 like in MS SQL Server that doesn’t have the I/O overhead). But how about MySQL, Oracle etc., are they smart enough to do the conversion once on the side of that parameter instead of for every row on the side of the query or is it something else entirely?

Download the MS SQL Server String format testcase

Through the Planet PostgreQL site I found a nice paper that compares database internals. It provides an interesting read, even if you are not really into databases.

Architecture of a Database System
Joseph M. Hellerstein, Michael Stonebraker and James Hamilton

The common misconception that a serializable transaction will single thread access to an entire table or even an entire database is pretty hard to eradicate. And the sad thing is that everybody just keeps repeating other supposedly informed sources instead of doing a little experimentation. So I have gathered some scripts that will show you how 2 serializable transactions are writing to the same database table simultaneously. When you run these scripts, run them from top to bottom, the commands on the left through one connection, the commands on the right through another connection.

PostgreSQL

There is no need to set up any configuration in PostgreSQL as it will be fully concurrent out of the box.

CREATE TABLE test (ID INT PRIMARY KEY);

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
INSERT INTO test (id) VALUES (1);

                              SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
                              BEGIN TRANSACTION;
                              INSERT INTO test (id) VALUES (2);
                              COMMIT;
                              SELECT * FROM test WHERE id = 2;
                              SELECT * FROM test;

COMMIT;
SELECT * FROM test;

As you can see, despite having 2 serializable transactions, the transaction that was started last commits first. And none of the queries block at any point.

MS SQL Server

By default MS SQL Server 2005 is somewhat less concurrent then PostgreSQL because it does predicate locking. As long as you stay out of the way of the predicates of other transactions it is still concurrent, but when you cross other running transactions you might need to wait for those other transactions to commit or rollback. So here is the first example.

CREATE TABLE test (ID INT PRIMARY KEY);

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
INSERT INTO test (id) VALUES (1);

                              SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
                              BEGIN TRANSACTION;
                              INSERT INTO test (id) VALUES (2);
                              COMMIT;
                              SELECT * FROM test WHERE id = 2;
                              SELECT * FROM test;

COMMIT;
SELECT * FROM test;

Again you can see the transaction that was started last commits first. But while the query for record 2 from the test table didn’t block, the query for the entire table blocked until the first transaction was committed. This is caused by predicate locking.

So what does serializable mean?

Mathematically speaking serializable means that it is possible to replay all the queries that were executed simultaneously one after another in a single threaded fashion and still get the same results. In terms of the SQL standard it doesn’t mean that. In terms of the SQL standard it just means that 3 specific events are not possible (and I quote this directly from SQL/IEC 9075-2:2003):

  • Dirty read: SQL-transaction T1 modifies a row. SQL-transactionT2 then reads that row before T1 performs a COMMIT. If T1 then performs a ROLLBACK, T2 will have read a row that was never committed and that may thus be considered to have never existed.
  • Non-repeatable read: SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.
  • Phantom: SQL-transaction T1 reads the set of rows N that satisfy some <search condition>. SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the <search condition> used by the SQL-transaction T1. If SQL-transaction T1 then repeats the initial read with the same <search confition>, it obtains a different collection of rows.

The above is what serializable means in a database. It does not mean mathematical serializability. It does not mean full table locks. It just means that these three specific events can not occur.