[gpfsug-discuss] node lockups in gpfs > 4.1.1.14

Mon Aug 14 22:53:35 BST 2017

I was remiss in not following up on this sooner, and thank you to the 
kind individual who sent me a direct message to ask the question.

It turns out that when I asked for the fix for APAR IV96776 I got an 
early release of 4.1.1.16 that had a fix for the APAR but also 
introduced the lockup bug. IBM kindly delayed the release of 4.1.1.16 
proper until they had addressed the lockup bug (APAR IV98888).

As I understand it, the version of 4.1.1.16 that was released via Fix 
Central should have a fix for this bug. I haven't tested it, but I 
have no reason to believe it's not fixed.

-Aaron

On 08/04/2017 11:02 AM, Aaron Knister wrote:
> I've narrowed the problem down to 4.1.1.16. We'll most likely be 
> downgrading to 4.1.1.15.
>
> -Aaron
>
> On 8/4/17 4:00 AM, Aaron Knister wrote:
>> Hey All,
>>
>> Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16?
>>
>> We are mid-upgrade to 4.1.1.16 from 4.1.1.14 and have seen some 
>> rather disconcerting behavior. Specifically, on some of the upgraded 
>> nodes GPFS will seemingly deadlock the entire node, rendering it 
>> unusable. I can't even get a session on the node (but I can trigger a 
>> crash dump via a sysrq trigger).
>>
>> Most blocked tasks have cxiWaitEventWait at the top of 
>> their call trace. That's probably not very helpful in and of itself but 
>> I'm curious if anyone else out there has run into this issue or if 
>> this is a known bug.
>>
>> (I'll open a PMR later today once I've gathered more diagnostic 
>> information).
>>
>> -Aaron
>>
>