[gpfsug-discuss] Disk can't be recovered due to uncorrectable read error in vdisk (GSS)

Mon Oct 17 18:35:37 BST 2016

hi ralph,

>>Currently our file system is down due to down/unrecovered disks. We
>> try to start the disks again with mmchdisk, but when we do this, we
>> see this error in our mmfs.log:
>> ...
>> This is a 3-way replicated vdisk, and not one of the recovering disks, but
>> this disk is in 'up' state.
> First, please open a PMR through your normal support organization, and make it
> clear in the PMR that the problem is GPFS and GNR (a.k.a. ESS).  That way, it
> will be assigned to the correct support group.  Support will request that you
> upload a snap.
the PMR is created, but we are particularly puzzled by the IO read error,
and getting some details about what is going on here (like the details you
provided) is something we usually do not get from support ;)

> There seems to be a combination of two problems here:
> One, an NSD (which is also a GNR vdisk) is down, which is usually caused by an IO
> error on the vdisk, or by both servers for the recovery group that contains the 
> vdisk being down simultaneously.  Usually, that is easily fixed by running 
> mmchdisk with a start option, but you tried that and it didn't work.  This 
> problem is at the NSD layer (meaning in the GPFS client that accesses the GNR 
> vdisk), not in the GNR layer.
it's actually more than one disk down, all on the same recovery group. the
vdisk with the read error is on the other recovery group (there are only 2;
this is/was a GSS24)

> Second, another vdisk has an internal error, caused by a read error from the
> physical disks (which is what "uncorrectable read error" means).
what does physical mean here? we have no IO error anywhere
(OS/kernel/scsi, also the error counters on the mmlspdisk output do not
increase).

> Now, given that
> you say that this vdisk is 3-way replicated, that probably means that there are 
> multiple problems.  This error is purely in the GNR layer, and the error message 
> you quote "smallRead VIO..." comes from the GNR layer.  Now, an error from one 
> vdisk can't prevent mmchdisk on a different vdisk from working, so these two 
> problems seem unrelated.
well, they can: the disks with all metadata replicas on the one
recovery group are down. starting those forces a read of the replicas on
the other group, and this runs into the IO read error, and everything
stops (well, that's how we understand it ;)
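
to check that ordering against the log, a minimal sketch (not a GPFS tool) that keeps only the log lines containing one of the two phrases quoted in this thread, so they can be lined up against the mmchdisk attempts. the function name and the path in the usage comment are assumptions, and the match is a plain substring test that knows nothing about the real mmfs.log layout:

```python
from typing import Iterable, List, Tuple

# Phrases quoted in this thread. Matching is a case-insensitive substring
# test; this is NOT a parser for the actual mmfs.log format.
PHRASES = ("uncorrectable read error", "smallRead VIO")


def find_gnr_read_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line containing one of PHRASES.

    Hypothetical helper: line numbers are 1-based positions in the input.
    """
    wanted = [phrase.lower() for phrase in PHRASES]
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(phrase in lowered for phrase in wanted):
            hits.append((number, line.rstrip("\n")))
    return hits


# Usage (the path is an assumption; point it at your own log file):
# with open("/var/adm/ras/mmfs.log.latest", errors="replace") as log:
#     for number, text in find_gnr_read_errors(log):
#         print(number, text)
```

this only narrows down where to look; it says nothing about the cause.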


> Furthermore, I'm going to bet that the two problems (which at first seem 
> unrelated) must in reality have a common root cause; it would be too weird a 
> coincidence to get two problems that are unrelated at the same time.  To debug 
> this requires looking at way more information than a single line from the 
> mmfs.log file, which is why the support organization needs a complete PMR 
> opened, and then have the usual snap (with logs, dumps, ...) uploaded, so it can 
> see what the cause of the problem is.
yep, trace and snap uploaded.

> Good luck!
thanks (and thanks again for the insights, much appreciated!)

> Ralph Becker-Szendy
> IBM Almaden Research Center - Computer Science -Storage Systems
> ralphbsz at us.ibm.com
> 408-927-2752
> 650 Harry Road, K56-B3, San Jose, CA 95120