pg_rewind in PostgreSQL 9.5

Before PostgreSQL got streaming replication, back in version 9.0, people kept asking when we were going to get replication. That was a common conversation-starter at conference booths. I don’t hear that anymore, but this dialogue still happens every now and then:

– I have streaming replication set up, with a master and standby. How do I perform failover?
– That’s easy, just kill the old master node, and run “pg_ctl promote” on the standby.
– Cool. And how do I fail back to the old master?
– Umm, well, you have to take a new base backup from the new master, and rebuild the node from scratch…
– Huh, what?!?

pg_rewind is a better answer to that. One way to think of it is that it’s like rsync on steroids. Like rsync, it copies files that differ between the source and target. The trick is in how it determines which files have changed. Rsync compares timestamps, file sizes and checksums, but pg_rewind understands the PostgreSQL file formats, and reads the WAL to get that information instead.
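
In practice, failing back with pg_rewind looks roughly like this. This is a minimal sketch: the data directory and connection string are made up, and the old master must have been shut down cleanly first.

    # Run on the failed old master, after a clean shutdown.
    # --source-server points at the promoted new master.
    pg_rewind --target-pgdata=/var/lib/pgsql/9.5/data \
              --source-server='host=newmaster port=5432 user=postgres dbname=postgres'

After pg_rewind finishes, create a recovery.conf pointing at the new master and start the old master as a standby.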

I started hacking on pg_rewind about a year ago, while working for VMware. I got it working, but it was a bit of a pain to maintain. Michael Paquier helped to keep it up-to-date whenever upstream changes in PostgreSQL broke it. A big pain was that it had to scan the WAL and understand all the different WAL record types – miss even one and you might end up with a corrupt database. I made big changes to the way WAL-logging works in 9.5 to make that easier: all WAL record types now contain enough information to know which blocks they apply to, in a common format. That slashed the amount of code required in pg_rewind, and made it a lot easier to maintain.
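
You can see that common format with the pg_xlogdump tool that ships with PostgreSQL: each record’s block references are printed uniformly, regardless of record type, and that is exactly the information pg_rewind needs. A quick sketch, with a made-up segment file name:

    # Dump the records in one WAL segment; every record lists the
    # blocks it touches as uniform "blkref" lines.
    pg_xlogdump /var/lib/pgsql/9.5/data/pg_xlog/000000010000000000000001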

I have just committed pg_rewind into the PostgreSQL git repository, and it will be included in the upcoming 9.5 version. I always intended pg_rewind to be included in PostgreSQL itself; I started it as a standalone project so that I could develop it faster, outside the PostgreSQL release cycle, and I’m glad it has finally made it into the main distribution. Please give it a lot of testing!

PS. I gave a presentation on pg_rewind at Nordic PGDay 2015. It was a great conference, and I think people enjoyed the presentation. Have a look at the slides for an overview of how pg_rewind works. Also take a look at the page in the user manual.

20 thoughts on “pg_rewind in PostgreSQL 9.5”

  1. This is interesting and all, but pg_ctl promote (or equivalently, touching the trigger file) has always been the wrong way to do a *controlled* failover.

    The right way:
    1. stop the master
    2. stop the slave(s)
    3. re-configure as needed
    4. start the new master
    5. start the old-master-as-new-slave (and any other slaves you might have); it will connect and replicate fine, because it stopped at the same point in the timeline

    Yes, there’s downtime. It’s a failover, you’re going to have downtime.

    If pg_rewind lets us do a controlled failover faster and easier than that, fantastic! But for all the folks out there struggling with this today, you don’t have to rebuild your old master to make it a slave! Stop doing that!
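
    A minimal shell sketch of those five steps, with hypothetical data directories (the idea is to move recovery.conf from the promoted node to the demoted one):

        pg_ctl -D /data/master stop -m fast    # 1. stop the master
        pg_ctl -D /data/standby stop -m fast   # 2. stop the slave(s)
        mv /data/standby/recovery.conf /data/master/recovery.conf  # 3. re-configure
        # (edit primary_conninfo in the moved file to point at the new master)
        pg_ctl -D /data/standby start          # 4. start the new master
        pg_ctl -D /data/master start           # 5. start the old master as a slave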

        • Actually no, in a setup like this it’s common to have a connection pooler like pgpool/pgbouncer/haproxy that is the endpoint for your app.

    • The purpose of pg_rewind is to recover the *old* master without needing to recreate it from scratch from the new master.

      • Right, that’s exactly what I said. You don’t *need* to rebuild the old master to reconnect it to the new one *if* you stop them properly before switching.

        • Is there any way to recover the “old” master in Postgres 9.1? Do I always have to start from scratch by taking an entire base backup?
          Thanks

          • Not really; pg_rewind only works with later versions. On older versions, you’ll have to take a new base backup. You can use rsync to speed that up, copying only the differences, but rsync will still need to scan all the data. And you have to be careful with which flags you use, to make it safe.

            The presentation I linked to above mentions the rsync method.
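
            For older versions, the rsync invocation looks something like this (a rough sketch: hosts and paths are hypothetical, and it is shown with PostgreSQL stopped on both ends for simplicity). The --checksum flag is the safety-critical part, because data files can have identical size and modification time yet different contents:

                # Run on the old master, copying from the new master.
                rsync -av --delete --checksum \
                    newmaster:/var/lib/pgsql/9.1/data/ /var/lib/pgsql/9.1/data/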

    • Doesn’t help much in an uncontrolled scenario, where the master fails and then comes back – as illustrated in the first few slides of the presentation linked in the blog.

  2. I wrote a little library for Postgres failover orchestration: https://github.com/compose/governor

    I would like to get your feedback on it.

    I’ve not had issues with old leaders coming back online after the old leader was confirmed dead. The only case that has required rebuilding was when the old primary continued to run in a leader role after the new primary took over. The WAL timelines will never converge in that case.

    I found it generally works with old dead leaders by using promote on the new leader, and setting the recovery target timeline to ‘latest’ on the old leader when bringing it back online. The code above makes those changes properly.
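
    For reference, the “timeline to latest” setting mentioned above goes in recovery.conf on the old leader. A minimal sketch, with made-up host and user names:

        # recovery.conf on the old leader
        standby_mode = 'on'
        primary_conninfo = 'host=new-leader port=5432 user=replicator'
        recovery_target_timeline = 'latest'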

    How could I incorporate pg_rewind into this project?

    Cheers,
    Chris

  3. Hi,

    pg_rewind is really useful for PostgreSQL DBAs. Thanks!

    I just want to verify an interesting point. Sync replication is quite different from async replication. There may be data loss (transactions already committed on the master) with async replication, which may cause data inconsistency on the standby and may not be acceptable for some applications. In the async replication situation, recovery should start from the master.

    On the other hand, sync replication is no-data-loss replication. Consider the following algorithm, which is derived from a presentation by the PostgreSQL sync replication development team:

    Here, T1, T2, … represent points in the timeline:

    T1. Master issues a commit for a transaction (xid: TX1).

    T2. Master flushes TX1’s WAL records to disk.

    T3. Master sends TX1’s WAL records to the standby, and the standby flushes the received WAL records to disk. (The WAL records are sent before the commit returns to the client.)

    T4. Standby returns an acknowledgement to the master.

    T5. Master commits transaction TX1 in memory and flushes to database disk storage.

    The worst-case scenario for this algorithm is that the master crashes after T2, before the WAL records can be sent to the standby. However, the master never reported the transaction as committed, since it crashed before receiving any ack from the standby, so end users and/or customers have no idea whether transaction TX1 was successful or not.

    The standby machine is now promoted to new master, and transaction TX1 has no trace on the standby, so no rollback is needed.

  4. As long as the old master uses pg_rewind to sync from the new master, TX1 never happened (TX1 only exists in the old master’s old WAL log).

    From the PostgreSQL HA-cluster viewpoint, TX1 never committed. From the end-user viewpoint, TX1 never committed.

    For database research analysts, this is an interesting topic: TX1 is only considered committed if both the master and the standby have the WAL record on disk.

  5. If the standby machine fails, then only the master is running and it is not an HA cluster anymore; a WAL record written to the master is considered committed.

  6. Finally, if both master and standby fail after T2, then recover from the master. TX1 will be considered committed.

    The above scenario assumes there is only one sync-replication standby machine; other standby machines should be attached through cascading async replication for performance reasons.
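
    For context, this flavor of synchronous replication is configured on the master roughly like so (a sketch; ‘standby1’ is a made-up name that must match the application_name the standby uses in its primary_conninfo):

        # postgresql.conf on the master
        synchronous_standby_names = 'standby1'
        synchronous_commit = on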

  7. General operations guidelines for a 1 master + 1 standby HA setup are posted at the following link:

    my.oschina.net/u/2399919/blog/469330

  8. Hi Heikki,

    Thanks for this wonderful utility. But I have a doubt: in case of a disaster (unclean shutdown of the database), pg_rewind does not work.
    What would be the best way to bring back the old master as a new slave?

    Thanks in advance

    • You can restart the cluster, let crash recovery complete, and shut it down again. Then you can run pg_rewind.
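
      For example, a sketch of that sequence on the crashed old master (the connection string is made up):

        pg_ctl -D "$PGDATA" start          # let crash recovery run to completion
        pg_ctl -D "$PGDATA" stop -m fast   # clean shutdown
        pg_rewind --target-pgdata="$PGDATA" \
                  --source-server='host=newmaster port=5432 user=postgres dbname=postgres'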

      • Hello Heikki

        The approach seems logical, but in practice it doesn’t work.

        Please refer to the log below from starting the DB server after successfully running pg_rewind:
        ./pg_ctl -D /opt/app/D360_Database/shopping_cart_42 start
        server starting
        LOG: database system was interrupted; last known up at 2016-02-18 15:37:42 CET
        LOG: entering standby mode
        LOG: database system was not properly shut down; automatic recovery in progress
        LOG: redo starts at 0/39000028
        LOG: invalid record length at 0/3AAA75F8
        LOG: consistent recovery state reached at 0/3AAA75F8
        LOG: database system is ready to accept read only connections
        LOG: fetching timeline history file for timeline 7 from primary server
        LOG: started streaming WAL from primary at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8
        LOG: restarted WAL streaming at 0/3A000000 on timeline 6
        LOG: replication terminated by primary server
        DETAIL: End of WAL reached on timeline 6 at 0/3A9F5BE8.
        LOG: new timeline 7 forked off current database system timeline 6 before current recovery point 0/3AAA75F8

        LOG: database system was shut down at 2016-02-18 15:51:54 CET
        LOG: entering standby mode
        LOG: consistent recovery state reached at 0/3D000098
        LOG: invalid record length at 0/3D000098
        LOG: database system is ready to accept read only connections
        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CD60C00

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CE96F30

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CEE42E8

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CF3ABA8

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CF88410

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        FATAL: could not receive data from WAL stream: ERROR: requested starting point 0/3D000000 is ahead of the WAL flush position of this server 0/3CFD52B0

        LOG: started streaming WAL from primary at 0/3D000000 on timeline 7
        LOG: invalid resource manager ID 96 at 0/3D000098
        FATAL: terminating walreceiver process due to administrator command
        LOG: invalid resource manager ID 96 at 0/3D000098
        LOG: invalid resource manager ID 96 at 0/3D000098

        Any suggestions?

  9. Hello Heikki,

    we are currently using PostgreSQL streaming replication, and we are sometimes recycling old masters. I read through your presentation, and saw that this is considered dangerous.

    I wrote to the PostgreSQL mailing list to understand why (https://www.postgresql.org/message-id/1462941074281.128157.22300%40webmail6), but I did not get a clear response on whether or not our setup is dangerous.

    If at all possible, I would greatly appreciate it if you would read my original post and comment (either here or there).

    Best regards,
    Fredrik
