[gpfsug-discuss] node lockups in gpfs > 4.1.1.14
Aaron Knister
aaron.s.knister at nasa.gov
Mon Aug 14 22:53:35 BST 2017
I was remiss in not following up on this sooner. Thank you to the
kind individual who sent me a direct message to ask about it.
It turns out that when I asked for the fix for APAR IV96776 I got an
early release of 4.1.1.16 that had a fix for the APAR but also
introduced the lockup bug. IBM kindly delayed the release of 4.1.1.16
proper until they had addressed the lockup bug (APAR IV98888).
As I understand it, the version of 4.1.1.16 that was released via Fix
Central should have a fix for this bug. I haven't tested it, but I
have no reason to believe it's not fixed.
-Aaron
On 08/04/2017 11:02 AM, Aaron Knister wrote:
> I've narrowed the problem down to 4.1.1.16. We'll most likely be
> downgrading to 4.1.1.15.
>
> -Aaron
>
> On 8/4/17 4:00 AM, Aaron Knister wrote:
>> Hey All,
>>
>> Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16?
>>
>> We are mid-upgrade to 4.1.1.16 from 4.1.1.14 and have seen some
>> rather disconcerting behavior. Specifically, on some of the upgraded
>> nodes GPFS will seemingly deadlock the entire node, rendering it
>> unusable. I can't even get a session on the node (but I can trigger a
>> crash dump via a sysrq trigger).
>>
>> Most blocked tasks have cxiWaitEventWait at the top of their call
>> trace. That's probably not very helpful in and of itself, but
>> I'm curious whether anyone else out there has run into this issue or
>> whether this is a known bug.
>>
>> (I'll open a PMR later today once I've gathered more diagnostic
>> information).
>>
>> -Aaron
>>
>