pg_rewind in PostgreSQL 9.5

Before PostgreSQL got streaming replication, back in version 9.0, people kept asking when we were going to get replication. That was a common conversation-starter at a conference booth. I don’t hear that anymore, but this dialogue still happens every now and then:

– I have streaming replication set up, with a master and standby. How do I perform failover?
– That’s easy, just kill the old master node, and run “pg_ctl promote” on the standby.
– Cool. And how do I fail back to the old master?
– Umm, well, you have to take a new base backup from the new master, and rebuild the node from scratch…
– Huh, what?!?

pg_rewind is a better answer to that. One way to think of it is that it’s like rsync on steroids. Like rsync, it copies files that differ between the source and target. The trick is in how it determines which files have changed. Rsync compares timestamps, file sizes and checksums, but pg_rewind understands the PostgreSQL file formats, and reads the WAL to get that information instead.
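
To give a feel for it, a failback with pg_rewind looks roughly like this; the data directory path and connection string here are just placeholders:

    # run on the old master, after it has been shut down cleanly
    pg_rewind --target-pgdata=/path/to/old_master_data \
              --source-server='host=new-master port=5432 user=postgres'
    # then set up recovery.conf to point at the new master, and start the
    # old master as a standby
    pg_ctl -D /path/to/old_master_data start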

I started hacking on pg_rewind about a year ago, while working for VMware. I got it working, but it was a bit of a pain to maintain. Michael Paquier helped to keep it up-to-date whenever upstream changes in PostgreSQL broke it. A big pain was that it has to scan the WAL and understand all the different WAL record types – miss even one and you might end up with a corrupt database. I made big changes to the way WAL-logging works in 9.5 to make that easier. Every WAL record now contains enough information, in a common format, to tell which blocks it applies to. That slashed the amount of code required in pg_rewind, and made it a lot easier to maintain.

I have just committed pg_rewind into the PostgreSQL git repository, and it will be included in the upcoming 9.5 version. I always intended pg_rewind to be included in PostgreSQL itself; I started it as a standalone project to be able to develop it faster, outside the PostgreSQL release cycle, so I’m glad it has finally made it into the main distribution. Please give it a lot of testing!

PS. I gave a presentation on pg_rewind at Nordic PGDay 2015. It was a great conference, and I think people enjoyed the presentation. Have a look at the slides for an overview of how pg_rewind works. Also take a look at the page in the user manual.

20 thoughts on “pg_rewind in PostgreSQL 9.5”

  1. This is interesting and all, but pg_ctl promote (or equivalently, touching the trigger file) has always been the wrong way to do a *controlled* failover.

    The right way:
    1. stop the master
    2. stop the slave(s)
    3. re-configure as needed
    4. start the new master
    5. start the old-master-as-new-slave (and any other slaves you might have); it will connect and replicate fine because it stopped at the same point on the timeline (see the command sketch below)
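
    For concreteness, that sequence might look roughly like the following on a typical installation; the data directory paths are placeholders and the details depend on your setup:

      # 1. stop the master cleanly
      pg_ctl -D /path/to/old_master_data stop -m fast
      # 2. stop the standby cleanly
      pg_ctl -D /path/to/standby_data stop -m fast
      # 3. reconfigure: remove recovery.conf on the standby, and create one on
      #    the old master pointing at the soon-to-be new master
      # 4. start the new master (the former standby)
      pg_ctl -D /path/to/standby_data start
      # 5. start the old master, now running as the new standby
      pg_ctl -D /path/to/old_master_data start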

    Yes, there’s downtime. It’s a failover, you’re going to have downtime.

    If pg_rewind lets us do a controlled failover faster and easier than that, fantastic! But for all the folks out there struggling with this today, you don’t have to rebuild your old master to make it a slave! Stop doing that!

        • Actually no, in a setup like this it’s common to have a connection pooler like pgpool/pgbouncer/haproxy that is the endpoint for your app.

    • The purpose of pg_rewind is to recover the *old* master without needing to recreate it from scratch from the new master.

      • Right, that’s exactly what I said. You don’t *need* to rebuild the old master to reconnect it to the new one *if* you stop them properly before switching.

        • Is there any way to recover the “old” master in Postgres 9.1? Do I always have to start from scratch by taking an entire base backup?
          Thanks

          • Not really. pg_rewind only works with later versions. On older versions, you’ll have to take a new base backup. You can use rsync to speed that up, so that only the differences are copied, but rsync will still need to scan all the data. And you have to be careful about which flags you use, to make it safe.

            The presentation I linked to above mentions the rsync method.
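
            The gist of it is roughly the following; the paths are placeholders, and details like pg_xlog and the configuration files are glossed over here:

              # old master shut down cleanly; run this on the old master
              rsync -av --delete --checksum \
                  new-master:/path/to/data/ /path/to/old_master_data/
              # --checksum guards against cases where size and timestamp alone
              # would make rsync skip a file that has in fact changed
              # then set up recovery.conf and start the old master as a standby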

    • Doesn’t help much in an uncontrolled scenario, where the master fails and then comes back – as illustrated in the first few slides of the presentation linked in the blog post.

  2. I wrote a little library for Postgres failover orchestration: https://github.com/compose/governor

    I would like to get your feedback on it.

    I’ve not had issues with old leaders coming back online after the old leader was confirmed dead. The only case where I’ve needed to rebuild was when the old primary continued to run in a leader role after the new primary took over. The WAL timeline will never resolve in that situation.

    I found it generally works with an old, dead leader by using promote on the new leader, and setting the timeline to latest on the old leader when bringing it back online. The code above shows those changes properly.
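
    For reference, by “setting the timeline to latest” I mean roughly this in the old leader’s recovery.conf (the conninfo is a placeholder):

      standby_mode = 'on'
      primary_conninfo = 'host=new-leader port=5432 user=replicator'
      recovery_target_timeline = 'latest'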

    How could I incorporate pg_rewind into this project?

    Cheers,
    Chria

  3. Hi,

    pg_rewind is really useful for PostgreSQL DBAs. Thanks!

    I just want to verify an interesting point. Sync replication is quite different from async replication. There may be data loss (transactions already committed on the master) with async replication, which may cause data inconsistency on the standby and may not be acceptable for some applications. In the async replication situation, recovery should start from the master.

    On the other hand, sync replication is no-data-loss replication. Consider the following algorithm, which is derived from a presentation by the PostgreSQL sync replication development team:

    Here, T1, T2, … represent points in time on the timeline:

    T1. Master issues a commit for transaction TX1 (xid: TX1).

    T2. Master flushes TX1’s WAL records to disk.

    T3. Master sends TX1’s WAL records to the standby, and the standby flushes the received WAL records to disk.

    (Notes from the presentation diagram: WAL segments are formed from records on the standby server; WAL entries are sent before the commit returns.)

    T4. Standby returns an acknowledgement to the master.

    T5. Master commits the transaction TX1 in memory and flushes it to database disk storage.

    The worst-case scenario in this algorithm is that after T2 the master crashes and the WAL records cannot be sent to the standby. However, on the master, the changes in memory and on database disk storage have not been committed yet, since it crashed before receiving any acknowledgement from the standby, so end users and/or customers have no idea whether the transaction TX1 was successful or not.

    The standby machine is now promoted to the new master, and transaction TX1 has no trace on the standby, so no rollback is needed. As long as the old master uses pg_rewind to sync from the new master, TX1 never happened (TX1 only exists in the old master’s old WAL log).

  4. As long as the old master uses pg_rewind to sync from the new master, TX1 never happened (TX1 only exists in the old master’s old WAL log).

    From the PostgreSQL HA-cluster viewpoint, TX1 never committed. From the end-user viewpoint, TX1 never committed.

    For database research analysts, this is an interesting topic. TX1 is only considered committed if both the master and the standby have the WAL record on disk.

  5. If the standby machine fails, then only the master is running and it is not an HA cluster anymore; a WAL record written to the master is considered committed.

  6. Finally, if both the master and the standby fail after T2, then recover from the master. TX1 will be considered committed.

    The above scenario assumes there is only one sync replication standby machine; other standby machines should be attached through cascading async replication for performance reasons.
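
    For reference, a minimal setup along those lines might look roughly like this in the master’s postgresql.conf (the standby name is a placeholder):

      synchronous_standby_names = 'standby1'  # the single synchronous standby
      synchronous_commit = on                 # commit waits for the standby to flush the WAL
      # further standbys can cascade asynchronously from standby1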

  7. General operations guidelines for a 1 master + 1 standby HA setup are posted at the following link:

    my.oschina.net/u/2399919/blog/469330

  8. Hi Heikki,

    Thanks for this wonderful utility. But I have a doubt: in case of a disaster (unclean shutdown of the database), pg_rewind does not work.
    What would be the best way to bring back the old master as a new slave?

    Thanks in advance

    • You can restart the cluster, let crash recovery complete, and shut it down again. Then you can run pg_rewind.
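
      Roughly, with the data directory path as a placeholder:

        # start the crashed old master so that crash recovery runs
        # (keep applications away from it in the meantime)
        pg_ctl -D /path/to/old_master_data start -w
        # once recovery has completed, shut it down cleanly again
        pg_ctl -D /path/to/old_master_data stop -m fast
        # now pg_rewind can be run against the new master
        pg_rewind --target-pgdata=/path/to/old_master_data \
                  --source-server='host=new-master port=5432 user=postgres'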

      • Hello Heikki

        Though the approach seems logical, in practice it doesn’t work.

        Please refer to the log below, from starting the DB server after successfully running pg_rewind:
        ./pg_ctl -D /opt/app/D360_Database/shopping_cart_42 start
        server starting
        LOG: database system was interrupted; last known up at 2016-02-18 15:37:42 CET
        LOG: entering standby mode
        LOG: database system was not properly shut down; automatic recovery in progress
        LOG: redo starts at 0/39000028
        LOG: invalid record length at 0/3AAA75F8
        LOG: consistent recovery state reached at 0/3AAA75F8
        LOG: database system is ready to accept read only connections
        LOG: fetching timeline history file for timeline 7 from primary server
        LOG: started streaming WAL from primary at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8

        LOG: database system was shut down at 2016-02-18 15:51:54 CET
        LOG: entering standby mode
        LOG: consistent recovery state reached at 0/3D000098
        LOG: invalid record length at 0/3D000098
        LOG: database system is ready to accept read only connections
        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CD60C00

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CE96F30

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CEE42E8

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CF3ABA8

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CF88410

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CFD52B0

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        LOG: invalid resource manager ID 96 at 0/3D000098
        FATAL: terminating walreceiver process due to administrator command
        LOG: invalid resource manager ID 96 at 0/3D000098
        LOG: invalid resource manager ID 96 at 0/3D000098

        Any suggestions?

  9. Hello Heikki,

    we are currently using PostgreSQL streaming replication, and we sometimes recycle old masters. I read through your presentation and saw that this is considered dangerous.

    I wrote to the PostgreSQL mailing list to understand why (https://www.postgresql.org/message-id/1462941074281.128157.22300%40webmail6), but I did not get a clear response on whether or not our setup is dangerous.

    If at all possible, I would greatly appreciate it if you would read my original post and comment (either here or there).

    Best regards,
    Fredrik
