2013-10-09

Lustre 2.x Problems

Interesting Lustre 2.x problems showed up today. Actually, a convergence of a number of problems.

There is a bug affecting 1.8 clients with 2.1 servers that can truncate directory listings. The truncated listings still look properly formed: there are no IO errors or anything else that would indicate a problem. This appears to happen when the MDS is in some sort of death spiral.

When that happens, large-memory clients cache the incorrect data and keep serving it even after the failover MDS comes up. So although the backup MDS coming up has resolved the underlying problem, the clients won't know until something causes the cache for that directory to be marked dirty (like creating a file in the directory).
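A minimal sketch of that workaround, assuming that creating and then removing a scratch file is enough to make the client drop its cached listing. The function name `refresh_directory_listing` is hypothetical, and this only nudges the client cache; it does nothing about the underlying bug.

```python
import os
import tempfile


def refresh_directory_listing(directory):
    """Create and remove a scratch file in `directory`, then re-list it.

    Assumption: modifying the directory marks the client's cached listing
    dirty, so the next listing is fetched from the (now healthy) MDS.
    """
    # mkstemp picks a unique name, so no existing file is overwritten.
    fd, scratch_path = tempfile.mkstemp(prefix=".refresh-", dir=directory)
    os.close(fd)
    os.remove(scratch_path)
    # Return the fresh listing, sorted for easy comparison.
    return sorted(os.listdir(directory))
```

Comparing the entry count before and after the call on a directory suspected of being truncated would show whether the client had been serving a stale listing.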

There are timing issues (at least on our pacemaker setup) that cause the volume group for the MGS to be activated but not yet available for mounting. Need to stick a sleep in there somewhere.
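Instead of a fixed sleep, one option is to poll until the device node shows up, with a timeout. This is only a sketch: `wait_for_path` is a hypothetical helper, and it assumes that the device path appearing is a good enough signal that the volume can be mounted, which may not hold on every setup.

```python
import os
import time


def wait_for_path(path, timeout=30.0, interval=0.5):
    """Poll until `path` exists or `timeout` seconds have passed.

    Returns True if the path appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Short pause between checks to avoid a busy loop.
        time.sleep(interval)
```

The mount step would then run only if something like `wait_for_path("/dev/vg_mgs/mgs")` returns True, where the device path is a made-up example and not our actual volume name.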

Continuing pacemaker problems: after a reboot, the original host does not rejoin the cluster.