Today the PostgreSQL Global Development Group announced the release of PostgreSQL 8.3. The major new features that are most exciting to me are:
- TSearch has been integrated into the core so PostgreSQL now has Full Text Searching without the need to install additional modules;
- all the performance features;
- a UUID datatype for storing 128-bit integers.
I don’t think I need to explain why Full Text Searching is an important addition, but let me expand a little bit on my other favorites.
Let me start by saying that PostgreSQL hasn’t been slow in a very long time. Sure, it comes out of the box with an arcane configuration so it will start on a P90 with 64 MB of RAM, but once you have a configuration that matches your hardware it will perform. But where it shines is scalability. Back in October 2006 I had the privilege of assisting tweakers.net in optimizing PostgreSQL for their database benchmark and the outcome there was that “eight cores deliver 7.4 times as much as a single core”. Other benchmarks consistently show that if you have the hardware, PostgreSQL will use it.
That is not to say there is nothing to improve. There are several specific scenarios where PostgreSQL will not be very fast. One of those scenarios was the repeated update of a single record, which is actually pretty common for websites that store user sessions in the database and update the record on each page hit. The reason for this is that PostgreSQL uses MultiVersion Concurrency Control and hence doesn’t overwrite records on updates, but invalidates them and writes a new copy. Repeating this over and over again would strain the garbage collection mechanism and bloat the table. To improve that scenario PostgreSQL now has Heap Only Tuples (HOT), where an update that does not change an indexed field will only change the heap and not the index. Simple, efficient, and up to 4 times as fast.
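To make the session-table scenario concrete, here is a rough Python toy of what happens to the heap and the index under repeated updates. Everything in it (`ToyTable`, `simulate`, the `session_id` and `hits` fields) is made up for illustration; it ignores pages, vacuuming and everything else PostgreSQL really does, and only counts how many row versions and index entries pile up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy single-row table with one index. Not PostgreSQL's real storage format."""
    heap: list = field(default_factory=list)   # every row version ever written
    index: list = field(default_factory=list)  # (indexed value, heap position) entries
    live: int = -1                             # heap position of the current version

    def insert(self, session_id, hits):
        self.heap.append({"session_id": session_id, "hits": hits, "dead": False})
        self.live = len(self.heap) - 1
        self.index.append((session_id, self.live))

    def update(self, session_id, hits, hot=True):
        old = self.heap[self.live]
        old["dead"] = True  # MVCC: invalidate the old version instead of overwriting it
        self.heap.append({"session_id": session_id, "hits": hits, "dead": False})
        self.live = len(self.heap) - 1
        # Without HOT every new version gets its own index entry.
        # With HOT only a change to the indexed value touches the index.
        if not hot or session_id != old["session_id"]:
            self.index.append((session_id, self.live))


def simulate(page_hits, hot):
    """Update one session row once per page hit; return (heap versions, index entries)."""
    table = ToyTable()
    table.insert("session-1", 0)
    for n in range(1, page_hits + 1):
        table.update("session-1", n, hot=hot)
    return len(table.heap), len(table.index)
```

In this toy, `simulate(1000, hot=True)` ends with 1001 heap versions but a single index entry, while `simulate(1000, hot=False)` ends with 1001 of each.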
Amongst the other performance improvements is a lot of work to plan OUTER JOINs more efficiently and to smooth out the performance spikes caused by background database maintenance processes.
While technically they are 128-bit integers, I historically store UUIDs as CHAR or VARCHAR fields of 35 or 36 characters (depending on which application is talking to the DB). Due to overhead and byte alignment issues that means a UUID takes 40 bytes of space, instead of the 16 bytes it would need if it were just a 128-bit integer. And that space usage is repeated for every index, so for a simple table with 2 UUIDs that creates a many-to-many relationship and has 2 indexes this is the difference between 200 and 80 bytes.
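The size difference is easy to see with Python's standard `uuid` module. This is only a sketch: the sample value is arbitrary, the 40-byte figure is the one quoted above rather than something the code measures, and `total_bytes` with its default of five stored copies is my assumption about one layout that reproduces the 200 and 80 byte totals (two columns in the table, a two-column index and a single-column index).

```python
import uuid

# Arbitrary sample value, for illustration only.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")

text_length = len(str(sample))     # characters in the canonical text form
binary_length = len(sample.bytes)  # bytes in the raw 128-bit form


def total_bytes(bytes_per_uuid, copies=5):
    """Space taken by `copies` stored UUIDs; the default of 5 is an assumption."""
    return bytes_per_uuid * copies


varchar_total = total_bytes(40)            # 40 bytes per UUID, as estimated above
native_total = total_bytes(binary_length)  # 16 bytes per UUID
```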
That isn’t all though. Even though the operations performed on UUIDs are pretty limited (equality testing and sorting are pretty much the only relevant operations), they are also impacted by the difference between a string and an integer. Operations on integers are simple, bitwise operations. Operations on strings have to take the locale into account which means convoluted (multi-byte) functions. So not only is the storage smaller and thus faster, operations are faster too.
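A small Python illustration of why comparing the text form is the more delicate operation. It says nothing about PostgreSQL's internals or about speed; it only shows that the ordering of UUID strings depends on character-level rules, while the ordering of the 128-bit values does not. The two sample values and the function names are invented for the example.

```python
import uuid

# Two made-up UUIDs that differ in the first hex digit and in letter case.
LOWER = "a0000000-0000-0000-0000-000000000000"
UPPER = "B0000000-0000-0000-0000-000000000000"


def sort_as_text(values):
    """Plain string sort: compares code points, so 'B' sorts before 'a'."""
    return sorted(values)


def sort_as_integers(values):
    """Sort by the 128-bit value; the spelling of the hex digits is irrelevant."""
    return [str(u) for u in sorted(uuid.UUID(v) for v in values)]


text_order = sort_as_text([LOWER, UPPER])
integer_order = sort_as_integers([LOWER, UPPER])
```

Here `text_order` puts the `B…` value first because of its upper-case letter, while `integer_order` puts the `a…` value first because 0xa is smaller than 0xb. A locale-aware collation could give yet another answer for the text form, which is exactly the extra work the integer comparison avoids.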
In a follow-up I will provide some benchmarks of the performance difference between native UUIDs and VARCHAR UUIDs. Also, ColdFusion users can look forward to a way to get the PostgreSQL UUID datatype working from ColdFusion and to a patch that will make PostgreSQL accept the 35-byte string representation of ColdFusion UUIDs.
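For readers who have not seen the two formats side by side: a ColdFusion UUID has the 8-4-4-16 layout (35 characters), where the standard text form is 8-4-4-4-12 (36 characters). The sketch below converts between them on the client side in Python. It is not the patch mentioned above; the function names are hypothetical, and the upper-casing is an assumption about what ColdFusion usually produces.

```python
import uuid


def coldfusion_to_uuid(cf_uuid):
    """Parse a 35-character ColdFusion UUID (8-4-4-16 hex groups)."""
    if len(cf_uuid) != 35 or [i for i, c in enumerate(cf_uuid) if c == "-"] != [8, 13, 18]:
        raise ValueError("expected a 35-character 8-4-4-16 ColdFusion UUID")
    # uuid.UUID ignores hyphens, so the 32 hex digits are all it needs.
    return uuid.UUID(cf_uuid)


def uuid_to_coldfusion(value):
    """Format a UUID in the 8-4-4-16 layout by dropping the fourth hyphen."""
    text = str(value).upper()  # assumption: ColdFusion UUIDs are usually upper case
    return text[:23] + text[24:]
```

For example, `coldfusion_to_uuid("12345678-1234-5678-1234567812345678")` yields the UUID whose standard form is `12345678-1234-5678-1234-567812345678`, and `uuid_to_coldfusion` turns it back into the 35-character string.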