Lustre on ZFS Metadata Failover

by Andrew Wagner — last modified May 19, 2014

 

Important Note: The definitive source for Lustre documentation is the Lustre Operations Manual available at https://wiki.hpdd.intel.com/display/PUB/Documentation.

These documents are copied from internal SSEC working documentation that may be useful to some, but we provide no guarantee of accuracy, correctness, or safety. Use at your own risk.

 

NOTE: Use at your own risk, especially for backup and recovery. Test these procedures yourself and do not simply trust these instructions.

Switching over to a ZFS Snapshot Copy on a Backup Metadata Server

  1. Shut down the primary Lustre MDS. Ensure it is actually powered off, as two live copies of the MDT will be bad news bears.
  2. Shut down the OSTs.
  3. On the backup MDS, change the network settings for em1, ib0, and /etc/sysconfig/network to match the primary MDS. Ensure that /etc/ldev.conf references the right vdev devices.
  4. Reboot the backup MDS to apply the network settings.
  5. Ensure that winbind is working. If not, run "net join -U Administrator" and restart winbind. Verify with "id" that usernames resolve.
  6. After verifying that the network settings are correct, start Lustre on the backup MDS with "service lustre start".
  7. Start Lustre on the OSTs.
  8. Recovery must complete on the OSTs and the MDT. Check both before mounting any clients.
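The "verify network settings" part of step 6 can be scripted as a quick identity check before starting Lustre. This is a hedged sketch: the expected hostname and NID are assumptions taken from this document's examples (server-1, 172.16.23.14@o2ib), so adjust them for your site.

```shell
#!/bin/sh
# Sketch: confirm the backup MDS has assumed the primary's identity before
# "service lustre start". Hostname and NID values are examples from this doc.

check() {  # check <what> <expected> <actual> -> prints PASS or FAIL
    if [ "$2" = "$3" ]; then
        echo "PASS: $1 = $2"
    else
        echo "FAIL: $1 is '$3', expected '$2'"
    fi
}

expected_host=server-1              # identity the backup MDS is assuming
expected_nid="172.16.23.14@o2ib"    # NID from the lctl list_nids example

check hostname "$expected_host" "$(hostname -s)"
# lctl only exists on Lustre servers; skip the NID check elsewhere.
if command -v lctl >/dev/null 2>&1; then
    check nid "$expected_nid" "$(lctl list_nids | head -n1)"
fi
```

Any FAIL line means the server is not yet presenting the primary's identity and Lustre should not be started.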
 

Method for switch-over:

Snapshot transfer example:

zfs send -R lustre-meta@2014051521 | ssh server-2 zfs receive -uv backup/lustre-meta
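After that initial full send, subsequent transfers can be incremental. This is a hedged sketch: the newer snapshot name (2014051600) is hypothetical, while the pool and target names come from the command above. The run() wrapper prints each command instead of executing it, so the sketch is safe to walk through; remove the wrapper to run it for real.

```shell
#!/bin/sh
# Sketch: incremental follow-up to the full "zfs send -R" above.
run() { echo "+ $*"; }   # print instead of executing; drop to run for real

old=lustre-meta@2014051521   # snapshot already received on server-2
new=lustre-meta@2014051600   # hypothetical newer snapshot

run zfs snapshot -r "$new"
# -i sends only blocks changed between $old and $new; -R keeps the
# descendant datasets in sync; -u leaves the received file systems
# unmounted on the backup server.
run "zfs send -R -i $old $new | ssh server-2 zfs receive -uv backup/lustre-meta"
```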
 
 
 

Switch-over Procedure

 
This procedure assumes your snapshots / backups have already been taken and are current. If they have not, review:
 
https://jira.hpdd.intel.com/browse/LUDOC-161
 
  1. Unmount the Lustre clients.
    1. To list clients, run lshowmount -lv on a Lustre server (probably the MDS).
  2. Prepare the primary MDS for shutdown
    1. Unmount the OSTs.
    2. Edit its /etc/sysconfig/network-scripts/ifcfg-ib0, ifcfg-ib1, and ifcfg-em1 so they match the backup server.
    3. Update /etc/sysconfig/network so that the hostname is the secondary hostname (server-2 in our example).
    4. Ensure the Lustre service won't start if the server is powered on: chkconfig lustre off
    5. Review /etc/ldev.conf and the ZFS file system mount point.
    6. Verify that your backup script exists on the server about to become primary before shutting this one down.
    7. Power off the server.
  3. Prepare the secondary to become primary
    1. Edit its /etc/sysconfig/network-scripts/ifcfg-ib0, ifcfg-ib1, and ifcfg-em1 so they match the primary server.
    2. Update /etc/sysconfig/network so that the hostname is the primary hostname (server-1 in our example).
    3. Reboot.
    4. Check winbind (or whatever authentication you use).
    5. Review /etc/ldev.conf and the ZFS file system mount point.
    6. Upon boot completion, start Lustre (service lustre start).
      1. Test/debug with lctl list_nids; you should see 172.16.23.14@o2ib.
    7. Ensure the MDT is mounted.
    8. Start the service / ensure the OST is mounted on each OSS (rocks run host $geoarc_oss command="service lustre start").
    9. Monitor recovery. (Do we really want to allow recovery at all? It will recover the client list as of the time of the snapshot, which probably doesn't make sense; otherwise it is just a wasted five minutes minimum.)
    10. Verify that your backup script is now set to execute on this server once it becomes the primary.
  4. Remount the clients.
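The recovery monitoring in step 9 comes down to reading the recovery_status parameter until it reports COMPLETE. On a live server you would read it with "lctl get_param mdt.*.recovery_status" (MDS) or "obdfilter.*.recovery_status" (OSS); the sketch below parses a sample transcript instead so the logic is runnable anywhere, and the field values are made up.

```shell
#!/bin/sh
# Sketch: decide whether Lustre recovery has finished from recovery_status
# output. The sample below stands in for "lctl get_param ... recovery_status".
sample=$(mktemp)
cat > "$sample" <<'EOF'
status: COMPLETE
recovery_start: 1400500000
recovery_duration: 143
completed_clients: 12/12
EOF

state=$(awk '/^status:/ {print $2}' "$sample")
clients=$(awk '/^completed_clients:/ {print $2}' "$sample")
if [ "$state" = COMPLETE ]; then
    echo "recovery complete ($clients clients)"
else
    echo "still recovering: $state"
fi
rm -f "$sample"
```

Do not mount clients until every target reports COMPLETE.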
 
 

ZFS file system / /etc/ldev.conf consistency

 
    1. Verify the volume is unmounted.
    2. service lustre stop
    3. service lnet stop (LNet cannot stop while Lustre targets are still mounted, so stop Lustre first)
    4. Rename the file system (mount point) with zfs rename so that it matches what is in /etc/ldev.conf. For example:
      1. Change the contents of /etc/ldev.conf to reflect the final ZFS mount points ("device-path").
      2. zfs rename backup/lustre-meta/mgs lustre-meta/mgs
      3. zfs rename backup/lustre-meta/arcdata-meta lustre-meta/arcdata-meta
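After the renames, it is worth checking that every zfs: device listed in /etc/ldev.conf actually exists as a ZFS dataset. This is a hedged sketch using sample data in place of the real files; the hostnames and labels are assumptions modeled on this document's server-1/server-2 example, and ldev.conf columns are local-host, foreign-host, label, device-path.

```shell
#!/bin/sh
# Sketch: cross-check ldev.conf device paths against existing ZFS datasets.
ldev=$(mktemp); datasets=$(mktemp)
cat > "$ldev" <<'EOF'
server-1 server-2 MGS zfs:lustre-meta/mgs
server-1 server-2 arcdata-MDT0000 zfs:lustre-meta/arcdata-meta
EOF
# Stand-in for the real dataset list: zfs list -H -o name
printf '%s\n' lustre-meta/mgs lustre-meta/arcdata-meta > "$datasets"

missing=0
while read -r _local _foreign label dev; do
    ds=${dev#zfs:}                       # strip the zfs: prefix
    grep -qx "$ds" "$datasets" || { echo "MISSING $label: $ds"; missing=1; }
done < "$ldev"
[ "$missing" -eq 0 ] && echo "ldev.conf matches ZFS datasets"
rm -f "$ldev" "$datasets"
```

On a real server, replace the two heredoc/printf stand-ins with /etc/ldev.conf and the output of zfs list -H -o name.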