At Prisma IT we have an application that has been under constant development for two years. Over that time it has seen lots of feature creep and changing requirements, and no refactoring. As you can imagine, that doesn’t make for pretty code or an optimal database model. Due to a retargeting of that application and a much more focused requirements process, we are finally getting ready to start reworking it. Over the past month I have been working on and off (mostly on the train on my way to clients) on putting more structure into the long list of things we (I) would really like to change about the technology of the application. I have also been running some experiments on the code to see what the impact of certain changes would be. Among other things, this includes the experiments with changing the datatype used in MS SQL Server to store ColdFusion UUIDs.
One of the last things I wanted to do was get a feel for the cost/benefit ratio of a conversion from MS SQL Server to PostgreSQL. One of the problems we are having with this application is that it stores large XML documents. Since these documents are between 70 and 300 KB each, the database grows large fast. As you have seen in the posts about the storage of UUIDs in the database, the total database size was approximately 32 GB, and 24 GB of that was XML. At that size you can no longer use MS SQL Server Express, so local development on a developer’s laptop suddenly gets a lot more complicated. And you have to shell out for CPU licenses for MS SQL Server, which gets expensive fast. Even with a retargeting of the application for running in the cloud, that is still the difference between $0.40 per hour for a Linux instance and $1.10 per hour for a Windows + SQL Server Standard instance (about $500 per month).
While PostgreSQL is my database of choice, some of my co-workers feel more comfortable with MySQL. But in this case I had a very good technical reason for trying PostgreSQL. The way PostgreSQL internally deals with wide rows is to move columns to an overflow table. That overflow or TOAST table is not visible to the end user; it is completely transparent. But PostgreSQL also compresses the content of those columns. The compression algorithm is very lightweight: light compression, but even lighter CPU usage. Considering that the content was XML, I had high hopes for the compression ratio and the effect it would have on the size and manageability of the database. Compression of fields is possible in MySQL, but it isn’t transparent: you need to write it yourself, and I’m not interested in that complication. So PostgreSQL it was.
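You can see the effect of TOAST for yourself: the overflow storage is included in a table’s total size but not in its visible heap size. A quick sketch (the table name `documents` is made up for illustration; the size functions are standard PostgreSQL):

```sql
-- Compare the visible heap size of a table with its total size,
-- which includes the (compressed) TOAST overflow table and indexes.
-- 'documents' is a hypothetical table name.
SELECT pg_size_pretty(pg_relation_size('documents'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('documents')) AS total_size;
```

For a table full of large XML columns, the gap between the two numbers is mostly TOAST, and compression is what keeps that part small.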
Converting the database
Scripting the schema for the database was a breeze. I had the DDL scripts for MS SQL Server from my earlier experiments and just did a lot of search and replace on them: switched all bit datatypes to booleans, changed all datetimes to timestamps, removed the N from all national character datatypes (in PostgreSQL you just declare the whole database to be UTF-8 and be done with it) and switched some fields with IP addresses to the INET datatype.
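To give an idea of what that search and replace amounts to, here is a sketch with a made-up table (not from the real schema):

```sql
-- MS SQL Server original (illustrative):
-- CREATE TABLE document (
--     id      UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
--     created DATETIME NOT NULL,
--     active  BIT NOT NULL,
--     body    NVARCHAR(MAX),
--     client  VARCHAR(15)
-- );

-- PostgreSQL equivalent after the search-and-replace pass:
CREATE TABLE document (
    id      UUID NOT NULL PRIMARY KEY,  -- UNIQUEIDENTIFIER -> UUID
    created TIMESTAMP NOT NULL,         -- DATETIME -> TIMESTAMP
    active  BOOLEAN NOT NULL,           -- BIT -> BOOLEAN
    body    TEXT,                       -- drop the N; the database is UTF-8
    client  INET                        -- IP address field -> INET
);
```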
The actual data conversion was pretty trivial as well. I created linked tables from MS Access and ran queries to select the records from one database and insert them into the other. I just had to make sure the queries only selected a limited number of records, because apparently Access runs out of resources if you try to insert 20 million records in one insert statement. Running these conversions took almost a whole day, which is quite excessive even for a 30 GB database. I got the impression Access would process the queries one record at a time instead of in batches. But since it is a process that can run unattended, that is not really a problem either.
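The batching idea is simply to constrain each copy query to a key range and advance the range per run. A minimal sketch, with hypothetical linked-table and column names:

```sql
-- Copy one batch from the linked MS SQL Server table to the linked
-- PostgreSQL table; 'seq' and both table names are made up.
INSERT INTO pg_document (id, created, body)
SELECT id, created, body
FROM   mssql_document
WHERE  seq >= 1 AND seq < 100000;  -- advance the range for each batch
```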
The results in terms of database size are pretty amazing. For comparison I have included the results I previously recorded with MS SQL Server as well.
| Database | Datatype | Storage requirements (B) | Database size (GB) |
|---|---|---|---|
| MS SQL Server | UNIQUEIDENTIFIER | 16 | 29.6 |
| MS SQL Server | VARCHAR(35) | 35 | 32.5 |
| MS SQL Server | NVARCHAR(35) | 70 | 38.6 |
| PostgreSQL | UUID | 16 | 7.6 |
As you can see the database was reduced in size by about a factor of 4, from 29.6 to 7.6 GB. That alone will make some of the management issues a whole lot easier to deal with. I also did some apples-to-oranges benchmarks with this database (I have PostgreSQL on my laptop instead of on a heavy server) and the results are in the same ballpark as the results for MS SQL Server. At the very least the benchmarks showed that there were no performance anomalies like with MS SQL Server and the wrong Unicode hints. I am also very confident that PostgreSQL will scale to many more concurrent users than the application currently has. So all in all, on the database side PostgreSQL is looking like a very good option.
Fixing the CFML code
On the ColdFusion side I had to overcome some hurdles though. First, ColdFusion comes with an ancient version of the PostgreSQL JDBC driver; to use the UUID datatype I had to install the PostgreSQL 8.3 driver. With that out of the way I had to fix many of our cfqueryparam tags to use the cf_sql_other cfsqltype instead of cf_sql_varchar (this application doesn’t use an ORM (yet)). Once that is done there remain about 50 queries that require more work. In most cases the extra work is due to different casting rules: in the old database the UUIDs were stored in VARCHAR fields, and in quite a few queries these varchar fields were concatenated or processed by string functions. (This extra work is the same as when the database stays MS SQL Server but the datatype is changed from VARCHAR to UNIQUEIDENTIFIER.) For only an estimated 15 queries out of 800 do I actually need to rewrite the SQL because the syntax is really different. This is mainly related to locking (FOR UPDATE vs. WITH (UPDLOCK)), recursive queries, and in three cases to reversed case sensitivity (MS SQL Server needed extra code to make the password lookup case sensitive; PostgreSQL will need extra code to make the username lookup case insensitive).
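Two of those rewrite categories are easy to illustrate. A sketch with made-up table and column names:

```sql
-- Locking: MS SQL Server uses a table hint...
--   SELECT * FROM account WITH (UPDLOCK) WHERE id = @id;
-- ...while PostgreSQL uses a row-level lock clause:
SELECT * FROM account WHERE id = :id FOR UPDATE;

-- Casting: a UUID column no longer behaves as a string, so
-- concatenation needs an explicit cast to text:
SELECT id::text || '-' || suffix FROM account;

-- Case sensitivity: PostgreSQL compares text case sensitively by
-- default, so the username lookup needs explicit folding:
SELECT * FROM account WHERE lower(username) = lower(:username);
```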
So does that mean this application will soon run on PostgreSQL? Well, no. Everything works great, and if I have my way it will happen in due time. But the most important short-term gain of a database conversion is the elimination of the performance inconsistencies with MS SQL Server caused by Unicode hints. The cheapest way to get that gain is to convert the datatype used to store the UUIDs from VARCHAR to NVARCHAR while leaving the database on MS SQL Server. That change doesn’t require any code changes on the ColdFusion side. Sure, it will make the whole application a few percent slower and the database a few GB larger, but for an existing deployment that isn’t a real problem.
Gains like a reduction in license fees and lower costs for backups are long term and will have to wait while I first fix a bad case of class explosion in the user roles, rewrite the reporting engine from HTML and cfdocument to dynamic PDF and cfpdfformdata, and refactor a webservices API that duplicates thousands of lines of code from the main application (with one change every 10 lines, of course).