Archive for the ‘PostgreSQL’ Category

At Prisma IT we have an application that has been under constant development for two years. Over that time it has seen lots of feature creep and changing requirements, and no refactoring. As you can imagine, that doesn’t make for pretty code or an optimal database model. Due to a retargeting of that application and a much more focused requirements process, we are finally getting ready to start reworking it. Over the past month I have been working on and off (mostly on the train on my way to clients) on putting more structure into the long list of things we (I) would really like to change about the technology of the application. I have also been running some experiments on the code to see what the impact of certain changes would be. Among others, this includes the experiments with changing the datatype used in MS SQL Server to store ColdFusion UUIDs.

One of the last things I wanted to do was get a feel for the cost/benefit ratio of converting the application from MS SQL Server to PostgreSQL. Continue reading ‘Converting a CF application from MS SQL Server to PostgreSQL?’ »

So far, in our attempts to get access to the templates and data of other hosted customers, we have primarily focused on accessing the templates of other users in order to reach their data through them. A more direct approach is to try to access the database directly.

Registered datasources

To access a registered datasource all you need is its name. More often than not, names are easy to guess. Depending on how the datasource is registered in the ColdFusion Administrator, you may need a username and password as well. If the username and password for the datasource are registered in the ColdFusion Administrator, ColdFusion writes them to neo-datasource.xml. They are encrypted there, but with a reversible algorithm (ColdFusion needs to be able to decrypt them in order to authenticate), and the way to decrypt these passwords is well known. So if you have read access to neo-datasource.xml through an incorrect Sandbox configuration, you have every username and password registered there.
If the username and password for the datasource are not registered in the ColdFusion Administrator, chances are they are stored somewhere in the application scope, and we have already seen that the application scope is insecure.

The first line of defense against this is to configure Sandbox Security so that each sandbox is only allowed access to its own datasources. This effectively locks down access from ColdFusion to the datasource, even if you know the full connection string.

Database to database connections

But as good as the protection for datasources in ColdFusion is, it only protects against connections made through ColdFusion’s cfquery and cfstoredproc tags. Just as the trick to work around a disabled cfexecute is to create Java / COM / .NET objects, the trick to work around a disabled ColdFusion datasource connection is to start the connection from somewhere else: from one database to another. The standard protocol for this, SQL/MED, is not widely implemented, but most databases offer proprietary ways to do the same. For instance, in MS SQL Server you would use:

SELECT *
FROM OpenDataSource(
  'SQLOLEDB'
  , 'Data Source=server.asdf.com;User ID=yourusername;Password=yourpassword'
  ).somedatabase.dbo.sometable;

In PostgreSQL that would be:

SELECT *
FROM dblink(
  'dbname=postgres'
  , 'SELECT ID, value FROM tables'
  ) AS t1(
  ID INT
  , value TEXT
  );

The one thing that makes this more difficult is that you usually need some sort of superuser privilege on the database server to be allowed to do this.
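
On the PostgreSQL side, dblink is an optional module, and whether hosted users can call it at all is something the administrator controls. A minimal hardening sketch, assuming a recent PostgreSQL where dblink ships as an extension (on 8.3 you would install the contrib module instead):

-- Installing the dblink module requires sufficient privileges on the database.
CREATE EXTENSION IF NOT EXISTS dblink;

-- EXECUTE on functions is granted to PUBLIC by default, so in a shared hosting
-- environment you may want to revoke it from the two-argument dblink() form
-- and grant it back only to trusted roles.
REVOKE ALL ON FUNCTION dblink(text, text) FROM PUBLIC;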

MS Access

MS Access is a file-based database that is pretty popular in shared hosting. While people frequently question its performance, scalability and security, it is still an attractive offering because it allows users to download the database, edit it offline and then upload an entirely new database with new data.

MS Access offers a very simple feature to access tables in other MS Access databases without the superuser privileges you would need on a more full-featured database server. You just specify the path to the data file you want to access directly in the query:

SELECT *
FROM sometable IN 'h:\sites\database.mdb'

Through the Planet PostgreSQL site I found a nice paper that compares database internals. It makes for an interesting read, even if you are not really into databases.

Architecture of a Database System
Joseph M. Hellerstein, Michael Stonebraker and James Hamilton

The common misconception that a serializable transaction will single-thread access to an entire table, or even an entire database, is pretty hard to eradicate. And the sad thing is that everybody just keeps repeating other supposedly informed sources instead of doing a little experimentation. So I have gathered some scripts that show how two serializable transactions can write to the same database table simultaneously. Run these scripts from top to bottom, the commands on the left through one connection, the commands on the right through another connection.

PostgreSQL

There is no need to set up any configuration in PostgreSQL as it will be fully concurrent out of the box.

CREATE TABLE test (ID INT PRIMARY KEY);

BEGIN TRANSACTION;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
INSERT INTO test (id) VALUES (1);

                              BEGIN TRANSACTION;
                              SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
                              INSERT INTO test (id) VALUES (2);
                              COMMIT;
                              SELECT * FROM test WHERE id = 2;
                              SELECT * FROM test;

COMMIT;
SELECT * FROM test;

As you can see, despite there being two serializable transactions, the transaction that was started last commits first. And none of the queries blocks at any point.

MS SQL Server

By default, MS SQL Server 2005 is somewhat less concurrent than PostgreSQL because it uses predicate locking. As long as you stay out of the way of the predicates of other transactions it is still concurrent, but when you cross another running transaction you may need to wait for that transaction to commit or roll back. So here is the first example.

CREATE TABLE test (ID INT PRIMARY KEY);

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
INSERT INTO test (id) VALUES (1);

                              SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
                              BEGIN TRANSACTION;
                              INSERT INTO test (id) VALUES (2);
                              COMMIT;
                              SELECT * FROM test WHERE id = 2;
                              SELECT * FROM test;

COMMIT;
SELECT * FROM test;

Again you can see that the transaction that was started last commits first. But while the query for record 2 from the test table didn’t block, the query for the entire table blocked until the first transaction was committed. This is caused by predicate locking.

So what does serializable mean?

Mathematically speaking, serializable means that it is possible to replay all the queries that were executed simultaneously one after another, in a single-threaded fashion, and still get the same results. In terms of the SQL standard it doesn’t mean that. In terms of the SQL standard it just means that three specific phenomena are not possible (and I quote these directly from ISO/IEC 9075-2:2003):

  • Dirty read: SQL-transaction T1 modifies a row. SQL-transaction T2 then reads that row before T1 performs a COMMIT. If T1 then performs a ROLLBACK, T2 will have read a row that was never committed and that may thus be considered to have never existed.
  • Non-repeatable read: SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.
  • Phantom: SQL-transaction T1 reads the set of rows N that satisfy some <search condition>. SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the <search condition> used by SQL-transaction T1. If SQL-transaction T1 then repeats the initial read with the same <search condition>, it obtains a different collection of rows.

The above is what serializable means in a database. It does not mean mathematical serializability. It does not mean full table locks. It just means that these three specific phenomena cannot occur.

In part 1 I showed how to adapt ColdFusion to work with PostgreSQL’s native UUID datatype. In this part, I will show how to adapt PostgreSQL to accept ColdFusion’s native UUID format.

UUIDs: equal, but not the same

As ColdFusion developers are very aware, the string representation of a UUID in ColdFusion is a 35-character string. However, the rest of the world has settled on a 36-character string representation, both in standards (RFC 4122, ITU-T X.667 and The Open Group) and in implementations (MS SQL Server’s NEWID(), Java’s java.util.UUID etc.). PostgreSQL is no exception and uses the 36-character string representation. As a result, the output of the ColdFusion CreateUUID() function will not be accepted into a PostgreSQL UUID field, and the output of a PostgreSQL UUID field will not be recognized by ColdFusion as a UUID. I actually believe this is a bug in ColdFusion’s type detection and conversion system. Since CFML is a typeless language, it should just convert both the 35- and the 36-character string representations. ColdFusion will convert both an m/d/yy string and a yyyy-mm-dd string to a date object on the fly, so why not convert both UUID string representations to their native integer representation?
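
To make the difference concrete: the only difference between the two forms is the hyphen before the last block. A small illustration in PostgreSQL, using arbitrary sample values, of converting between the two with plain string functions:

-- ColdFusion form (35 chars, 8-4-4-16):   164AB167-1517-0F51-7CEF1C6654E721B5
-- standard form   (36 chars, 8-4-4-4-12): 164AB167-1517-0F51-7CEF-1C6654E721B5
SELECT
  -- insert the missing hyphen after character 23 to get the standard form
  substr('164AB167-1517-0F51-7CEF1C6654E721B5', 1, 23) || '-' ||
    substr('164AB167-1517-0F51-7CEF1C6654E721B5', 24)   AS standard_36,
  -- drop the fourth hyphen to get the ColdFusion form
  substr('164AB167-1517-0F51-7CEF-1C6654E721B5', 1, 23) ||
    substr('164AB167-1517-0F51-7CEF-1C6654E721B5', 25)  AS coldfusion_35;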

Patching PostgreSQL

Unfortunately ColdFusion is a closed-source product and I cannot adapt it to work the way I want it to work. However, PostgreSQL is Open Source and I can adapt that. A little digging in the source code reveals that the input and output formats for the UUID datatype are defined in uuid.c, and it is trivial to adapt them to accept the ColdFusion format. Just apply the patch for PostgreSQL 8.3 to accept a 35-character string representation as a UUID and recompile PostgreSQL. Once you have done that, the following code will work:

CREATE TABLE test (testID UUID);

<cfquery>
INSERT INTO test (testID)
VALUES (<cfqueryparam cfsqltype="cf_sql_other" value="#CreateUUID()#">)
</cfquery>

Warnings:

  • C is not a language I am comfortable in, so the code is probably suboptimal;
  • I do not intend to submit this as a patch to PostgreSQL, since they follow the standards and ColdFusion does not;
  • a permanent solution to the problem of ColdFusion UUIDs clashing with the rest of the world should come from Adobe.

So to go from ColdFusion to PostgreSQL we can now use the 36-character representation as an inline variable, as a bind variable with the updated JDBC driver, and, with this patch, as a 35-character string as well. What is left to do is patching the JDBC driver to convert the 16-byte representation it receives from PostgreSQL into a 35-character string that ColdFusion understands. It looks like that requires writing a new type. Unfortunately I will be busy for the rest of the month with a new three-week Flex/LiveCycle class we will be teaching, so I won’t be able to start on that for a while.
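
In the meantime a workaround that stays entirely on the SQL side is possible. This is not the JDBC patch described above, just a stopgap sketch: cast the UUID column to text in the query and strip the fourth hyphen so ColdFusion receives its own 35-character format.

-- Return a UUID column in the 35-character ColdFusion format by reformatting
-- its text representation in the query itself (upper() to match the uppercase
-- output of CreateUUID()).
SELECT upper(substr(testID::text, 1, 23) || substr(testID::text, 25)) AS testid_cf
FROM test;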

I set up a little test to measure the performance difference between native PostgreSQL UUIDs and ColdFusion UUIDs stored in a CHAR(35) field in PostgreSQL. First the schema.

CREATE TABLE uuidanswer (
 answerID UUID,
 displayOrder INT
);
CREATE TABLE uuidtestAssignment (
     testAssignmentID UUID
);
CREATE TABLE uuidquestionScore (
     testAssignmentID UUID,
     answerID UUID,
     questionID UUID
);
-- load data (400000 answers, 1000000 questionscores and 77000 testassignments)
CREATE INDEX uuidanswer_idx
  ON uuidanswer (answerID);
CREATE INDEX uuidtestAssignment_idx
  ON uuidtestAssignment (testAssignmentID);
CREATE INDEX uuidquestionScore_idx
  ON uuidquestionScore (testassignmentid, questionid, answerid);
ANALYZE uuidanswer;
ANALYZE uuidtestAssignment;
ANALYZE uuidquestionScore;

For the string version all UUID datatypes were replaced with a CHAR(35) datatype, and in the data loading stage the appropriate hyphens were added or removed. This is a very much trimmed-down version of a real query from one of our applications. It aggregates the score of a candidate in some sort of Human Resources testing scenario. The real one joins 11 tables in a 19 GB database, but I am too lazy to move all that data to PostgreSQL just for a little demonstration. The following query was then run against the dataset 1000 times:

SELECT SUM(displayorder)
FROM
 uuidanswer ca
 INNER JOIN uuidquestionscore cqs USING (answerid)
 INNER JOIN uuidtestassignment cta USING (testassignmentid)
WHERE
 cta.testassignmentid = '164ab167-1517-0f51-7cef-1c6654e721b5'
 OR
 cta.testassignmentid = '01F3168A423B2BA34E914AEBFE9C8757'

The results, averaged over 5 runs were:

  • CHAR(35): 2680 ms
  • UUID: 2210 ms

Due to the test setup, the difference here is probably smaller than it would be in a real production environment. Since all the data fits into cache, the effects of I/O are pretty much removed from the equation. With a dataset that is larger than RAM, the results will favor the smaller dataset even more (the total size of the UUID dataset is 164 MB versus 311 MB for the CHAR(35) dataset).
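
For reference, those dataset sizes are easy to reproduce. A small sketch, using the table names from the schema above and PostgreSQL’s built-in size functions (totals include the indexes):

-- Report the size of each test table, indexes included.
SELECT pg_size_pretty(pg_total_relation_size('uuidanswer'))         AS answers,
       pg_size_pretty(pg_total_relation_size('uuidquestionscore'))  AS questionscores,
       pg_size_pretty(pg_total_relation_size('uuidtestassignment')) AS testassignments;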

UUID datatype and cfqueryparam

But obviously it is desirable from a security perspective to use cfqueryparam for all data values. A quick check of the cfqueryparam docs reveals that there is no cfsqltype for UUIDs, so we have to fall back to cf_sql_other for the UUID case. That unfortunately blows up with ColdFusion 8, because the bundled driver is the 8.1 JDBC 3 driver and the UUID datatype is only supported in the 8.3 drivers. Replacing postgresql-8.1-407.jdbc3.jar with postgresql-8.3-603.jdbc3.jar and restarting solves that problem. (An enhancement request for updating the PostgreSQL JDBC driver has been entered, but a few more voices wouldn’t hurt.) Now the query statement is changed to:

SELECT SUM(displayorder)
FROM
 uuidanswer ca
 INNER JOIN uuidquestionscore cqs USING (answerid)
 INNER JOIN uuidtestassignment cta USING (testassignmentid)
WHERE
 cta.testassignmentid = <cfqueryparam cfsqltype="cf_sql_other" value="164ab167-1517-0f51-7cef-1c6654e721b5" />
 OR
 cta.testassignmentid = <cfqueryparam cfsqltype="cf_sql_other" value="01F3168A423B2BA34E914AEBFE9C8757" />

The results, averaged over 5 runs were:

  • CHAR(35): 2640 ms
  • UUID: 2250 ms

So in addition to all the data validation and integrity benefits, there are real performance benefits to using the UUID datatype.

Today the PostgreSQL Global Development Group announced the release of PostgreSQL 8.3. The major new features that are most exciting to me are:

  • TSearch has been integrated into the core so PostgreSQL now has Full Text Searching without the need to install additional modules;
  • all the performance features;
  • a UUID datatype for storing 128-bit integers.

I don’t think I need to explain why Full Text Searching is an important addition, but let me expand a little bit on my other favorites.

Performance features

Let me start by saying that PostgreSQL hasn’t been slow in a very long time. Sure, it comes out of the box with a very conservative configuration so it will start on a P90 with 64 MB of RAM, but once you have a configuration that matches your hardware it will perform. Where it really shines is scalability. Back in October 2006 I had the privilege of assisting tweakers.net in optimizing PostgreSQL for their database benchmark, and the outcome there was that “eight cores deliver 7.4 times as much as a single core”. Other benchmarks consistently show that if you have the hardware, PostgreSQL will use it.

That is not to say there is nothing to improve. There are several specific scenarios where PostgreSQL is not very fast. One of those scenarios is the repeated update of a single record, which is actually a pretty common pattern for websites that store user sessions in the database and update the record on each page hit. The reason for this is that PostgreSQL uses Multi-Version Concurrency Control and hence doesn’t overwrite records on updates, but invalidates them and writes a new copy. Repeating this over and over again strains the garbage collection mechanism and bloats the table. To improve that scenario PostgreSQL now has Heap Only Tuples (HOT), where an update that does not change an indexed field will only change the heap and not the indexes. Simple, efficient, and up to 4 times as fast.
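
A small sketch of how you could observe this yourself, assuming the session-table scenario above and the n_tup_hot_upd counter in the pg_stat_user_tables view (statistics are collected asynchronously, so the counters may lag a little):

-- The primary key is indexed, last_hit is not, so updates that only touch
-- last_hit are candidates for Heap Only Tuples.
CREATE TABLE sessions (
  session_id UUID PRIMARY KEY,
  last_hit   TIMESTAMP
);
INSERT INTO sessions VALUES ('164ab167-1517-0f51-7cef-1c6654e721b5', now());

-- Simulate a page hit: only the non-indexed column changes.
UPDATE sessions
SET last_hit = now()
WHERE session_id = '164ab167-1517-0f51-7cef-1c6654e721b5';

-- n_tup_hot_upd counts the updates that did not have to touch the indexes.
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'sessions';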

Amongst the other performance improvements is a lot of work to plan OUTER JOINs more efficiently and to smooth out the performance spikes caused by background database maintenance processes.

UUID support

While technically they are 128-bit integers, I have historically stored UUIDs as CHAR or VARCHAR fields of 35 or 36 characters (depending on which application is talking to the database). Due to overhead and byte alignment issues that means a UUID takes 40 bytes of space, instead of the 16 bytes it would need if it were just a 128-bit integer. And that space usage is repeated for every index, so for a simple table with 2 UUIDs that creates a many-to-many relationship and has 2 indexes this is the difference between 200 and 80 bytes.
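
If you want to check the per-value cost yourself, a quick sketch using pg_column_size(), which reports the storage size of a single value (the literals are arbitrary samples, and the numbers exclude per-row alignment and index overhead):

-- Compare the storage footprint of one UUID value and its CHAR(35) string form.
SELECT pg_column_size('164ab167-1517-0f51-7cef-1c6654e721b5'::uuid)    AS uuid_bytes,
       pg_column_size('164AB167-1517-0F51-7CEF1C6654E721B5'::char(35)) AS char35_bytes;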

That isn’t all though. Even though the operations performed on UUIDs are pretty limited (equality testing and sorting are pretty much the only relevant ones), they too are affected by the difference between a string and an integer. Operations on integers are simple, bitwise operations. Operations on strings have to take the locale into account, which means convoluted (multi-byte) functions. So not only is the storage smaller and thus faster, the operations are faster too.

In a follow-up I will provide some benchmarks of the performance difference between native UUIDs and VARCHAR UUIDs. ColdFusion users can also look forward to a way to get the PostgreSQL UUID datatype working from ColdFusion and to a patch that will make PostgreSQL accept the 35-character string representation of ColdFusion UUIDs.