Helsinki PostgreSQL meetup on Tuesday September 1st

Helsinki PostgreSQL User Group will convene tomorrow, on Tuesday September 1st 2015 at 15:00.

I will be babbling about the various index types in PostgreSQL. GIN, GiST, and the humble B-tree. What does the structure of each one look like? What are the strengths and weaknesses of each? And of course, a lot of free chatter on whatever other topics you want to bring up.

We have also created a Meetup group. Registration at Meetup is recommended to ensure you get a seat:

This is the fourth meeting of the Helsinki PostgreSQL User Group. Come and meet fellow PostgreSQL users and developers. Welcome!

September 1st 2015

Richardinkatu Library
Rikhardinkatu 3

pg_rewind in PostgreSQL 9.5

Before PostgreSQL got streaming replication, back in version 9.0, people kept asking when we’re going to get replication. That was a common conversation-starter when standing at a conference booth. I don’t hear that anymore, but this dialogue still happens every now and then:

- I have streaming replication set up, with a master and standby. How do I perform failover?
- That’s easy, just kill the old master node, and run “pg_ctl promote” on the standby.
- Cool. And how do I fail back to the old master?
- Umm, well, you have to take a new base backup from the new master, and re-build the node from scratch..
- Huh, what?!?

pg_rewind is a better answer to that. One way to think of it is that it’s like rsync on steroids. Like rsync, it copies files that differ between the source and target. The trick is in how it determines which files have changed. Rsync compares timestamps, file sizes and checksums, but pg_rewind understands the PostgreSQL file formats, and reads the WAL to get that information instead.

I started hacking on pg_rewind about a year ago, while working for VMware. I got it working, but it was a bit of a pain to maintain. Michael Paquier helped to keep it up-to-date, whenever upstream changes in PostgreSQL broke it. A big pain was that it has to scan the WAL, and understand all different WAL record types – miss even one and you might end up with a corrupt database. I made big changes to the way WAL-logging works in 9.5, to make that easier. All WAL record types now contain enough information to know what block it applies to, in a common format. That slashed the amount of code required in pg_rewind, and made it a lot easier to maintain.

I have just committed pg_rewind into the PostgreSQL git repository, and it will be included in the upcoming 9.5 version. I always intended pg_rewind to be included in PostgreSQL itself; I started it as a standalone project to be able to develop it faster, outside the PostgreSQL release cycle, so I’m glad it finally made it into the main distribution now. Please give it a lot of testing!

PS. I gave a presentation on pg_rewind in Nordic PGDay 2015. It was a great conference, and I think people enjoyed the presentation. Have a look at the slides for an overview on how pg_rewind works. Also take a look at the page in the user manual.

GIN as a substitute for bitmap indexes

On some other DBMS’s, bitmap indexes are often used for columns that have only a few distinct values. Boolean columns, status codes and such. Compared to normal B-tree indexes, bitmap indexes are very efficient at packing duplicates.

PostgreSQL doesn’t have bitmap indexes, but GIN is also very good at handling duplicates. The internal structure of a GIN index is actually quite similar to a bitmap index. It doesn’t pack items in a bitmap, but in the upcoming 9.4 version it uses so-called varbyte encoding, which in the best case uses only a single byte per indexed row. 2-3 bytes is typical. In earlier versions, each item was packed using 6 bytes, which is still a lot less than the per-tuple overhead of a B-tree index.

For example, let’s create a table with a column, and fill it with some test data:

postgres=# create table numbers (n int4, mod2 int4);
postgres=# insert into numbers (n, mod2) select n, n % 2 from generate_series(1, 10000000) n;
INSERT 0 10000000

To create a GIN index as a substitute for a regular B-tree index, use the btree_gin extension that comes with PostgreSQL:

postgres=# create extension btree_gin;
postgres=# create index numbers_mod2 on numbers using gin (mod2);

For comparison, let’s also create a regular B-tree index on it:

postgres=# create index numbers_mod2_btree on numbers using btree (mod2);

And check the sizes of the indexes:

postgres=# \\\\di+
                               List of relations
 Schema |        Name        | Type  | Owner  |  Table  |  Size  | Description
 public | numbers_mod2_btree | index | heikki | numbers | 214 MB |
 public | numbers_mod2_gin   | index | heikki | numbers | 10 MB  |
(2 rows)

GIN is 20x smaller than a B-tree index!

The above was from a 9.4devel checkout. On 9.3, the GIN index takes 58 MB instead of 10 MB. That’s still a lot smaller than a B-tree, but 9.4 will really up the ante.

How to write a Java Transaction Manager that works with PostgreSQL

Two-phase commit was implemented in PostgreSQL in version 8.1, and driver support for it was added to the JDBC driver at the same time. However, we never added support for two features required by the Java Transaction API: transaction interleaving and suspend/resume. Even though they are required by the specification, they are not necessary for reliable two-phase commit, and many other JTA implementations also don’t implement them. They are difficult to implement if the underlying protocol used to talk to the database doesn’t support them natively.

People have been using the driver happily without those features for years, but every now and then someone runs into issues, when trying to configure a new application server to use the PostgreSQL’s XADataSource implementation. With some configuration, you usually can get things to work – major application servers don’t rely on those features, because they are not widely implemented, or at least have an option to do so if it’s not the default.

But what are these missing features, and why are they so difficult to implement?

The Java Transaction API

The way to implement two-phase commit in a JDBC driver is specified by the Java Transaction API specification. The API is modeled after the older X/Open XA specification, which defines the same thing for native Unix applications. That explains some peculiarities of the API; it’s quite different than most Java APIs. I believe the history also explains how the suspend/resume and transaction interleaving features sneaked into the Java standard: they made a lot more sense in a traditional single-threaded Unix process, than they do in a multi-threaded Java environment.

To implement the API, a driver needs to provide a class implementing the XAResource interface. It consists of five basic methods, plus some support methods related to timeouts and heuristic commits, which are optional features. The basic required methods are:

public void start(Xid xid, int flags);
Associates the current JDBC connection with a global transaction.

public void end(Xid xid, int flags);
Disassociates the current JDBC connection from a global transaction.

public int prepare(Xid xid);
The 1st, prepare phase of committing a transaction.

public void commit(Xid xid, boolean onePhase);
The 2nd phase commit.

public void rollback(Xid xid);
Rolls back a transaction.

public Xid[] recover(int flag);
Returns a list of global transactions that have been prepared but not committed yet. Used for crash recovery.

Basic two-phase commit

Here are the minimum steps to perform a global transaction using the API:

  1. XAResource.start(<xid>, TMNOFLAGS);
  2. Do your stuff with the Connection.
  3. XAResource.end(<xid>, TMSUCCESS);
  4. XAResource.prepare(<xid>);
  5. XAResource.commit(<xid>, false);

Simple, really. If you stick to the above steps, your transaction manager will work with any driver.

To work reliably with any JDBC driver, perform steps 1-4 in the same connection. The specification is more flexible than that, but not all implementations support transferring a global transaction from one connection to another, including PostgreSQL. Do not use the connection for anything else in between, consider the connection to be reserved for that transaction for the whole duration. If the connection is lost after step 4, you can use a new connection to perform the 2nd-phase commit() step, however.

Note that the above steps are performed by the transaction manager, typically included in a Java Application Server. The application developer doesn’t see any of that, he will use a completely different set of APIs to interact with the transaction manager.


According to the JTA specification, it’s possible to temporarily disassociate (suspend) a global transaction from the connection. Ie. you can do this:

  1. XAResource.start(xid1, TMNOFLAGS);
  2. Do some stuff with the Connection.
  3. XAResource.end(xid1, TMSUSPEND); // Suspend the old transaction
  4. XAResource.end(xid2, TMNOFLAGS); // Start a new transaction using the same connection
  5. Do other stuff with the Connection.

So, the same connection is used for two different transactions, before preparing either one. The problem is that in PostgreSQL, a single database connection can only perform work for a single transaction at a time. If you want to use the connection for another transaction, you have to prepare, commit or rollback the current transaction first. And once you do that, you cannot return to the old transaction, continuing to add more work to it.

This is no problem for an application server. It can simply open a new connection for the 2nd transaction. But the JTA specification requires the driver to support that! There is no way for the driver to say “hey, I can’t do this stuff, please use one connection for one transaction only”. Furthermore, this is completely transparent to the application – it’s the application servers choice to use suspend/resume or not.

There are other complications in the JTA spec, like support for “joining” a connection to an in-progress global transaction. Like suspend/resume, that’s also totally unnecessary for reliable two-phase commit.


The authors of the JTA specification screwed up. Instead of writing a simple API for two-phase commit, they mixed other non-essential features into the specification that have nothing to do with two-phase commit, and made them required. One can argue that there is more to the JTA specification than two-phase commit, but in reality, if you ask any Java App Server administrator, the only reason to ever use an XA-enabled driver is if you want to use two-phase commit.

The result is that some drivers implement the whole specification, while others implement only the minimum set of features required for reliable two-phase commit. But a transaction manager doesn’t know which features are actually implemented – they’re all required by the spec – so to work with real-world drivers, a transaction manager has to be written to the lowest common denominator. Had the JTA authors kept the API simple, and at least made the non-essential features optional, more JDBC implementations could fulfill the letter of the spec, and TM developers could code against the spec instead of an ill-defined subset of the spec.

So, why would an application server use the advanced features of the spec? One use case I’ve seen is to interleave transactions to improve concurrency, but in practice it’s unlikely to give any performance benefit. Even on a DBMS that supports that natively, switching transaction contexts isn’t free – you will most likely achieve the same or better performance by just opening one more connection.

If you’re developing a Transaction Manager, please stick to the basic steps listed above! Don’t use suspend/resume or other advanced features of the JTA specification. This ensures that your software works reliably with the widest possible range of drivers, including the PostgreSQL driver, and saves your users from the headache of debugging strange concurrency bugs.

Joining VMWare

Today is my last day at EnterpriseDB. I’ve had a good time working there for the past six years, and will miss many of my colleagues. Tomorrow, I’m starting at VMWare. I will continue to be an active member of the PostgreSQL community; VMWare wants to see a healthy community and wants to be part of it.

Thank you EnterpriseDB for all the good time I’ve had, and thank you VMWare for all the good time I will have in the future!