[gpfsug-discuss] data integrity documentation

Sven Oehme oehmes at gmail.com
Wed Aug 2 20:10:07 BST 2017


ok, i think i understand now: the data was already corrupted. the config
change i proposed only prevents potential future on-the-wire corruption;
it will not fix anything that already made it to disk.

Sven



On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
wrote:

> yes ;)
>
> the system is in preproduction, so there is nothing that can't be
> stopped/started in a few minutes (the current setup has only 4 NSDs and
> no clients).
> mmfsck triggers the errors very early, during the inode replica compare.
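> (for reference, the check is an offline, report-only mmfsck pass, roughly
> like the sketch below; 'gpfs0' is just a placeholder for the filesystem
> name:
>
>   mmumount gpfs0 -a   # unmount on all nodes so mmfsck can run offline
>   mmfsck gpfs0 -n     # -n: only report inconsistencies, don't repair
>
> the replica compare errors show up within the first few minutes.)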
>
>
> stijn
>
> On 08/02/2017 08:47 PM, Sven Oehme wrote:
> > How can you reproduce this so quickly?
> > Did you restart all daemons after that?
> >
> > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <stijn.deweirdt at ugent.be>
> > wrote:
> >
> >> hi sven,
> >>
> >>
> >>> the very first thing you should check is whether you have this setting
> >>> set:
> >> maybe the very first thing to check should be the FAQ/wiki that has this
> >> documented?
> >>
> >>>
> >>> mmlsconfig envVar
> >>>
> >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
> >>> MLX5_USE_MUTEX 1
> >>>
> >>> if that doesn't come back as shown above, you need to set it:
> >>>
> >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
> >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
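> >>>
> >>> note that these are environment variables for the daemon, so they are
> >>> picked up when mmfsd starts; restarting the daemons on every node is
> >>> typically needed for the change to take effect. a minimal sketch,
> >>> assuming a brief full-cluster outage is acceptable:
> >>>
> >>>   mmshutdown -a   # stop gpfs on all nodes
> >>>   mmstartup -a    # start it again; the envVar is read at daemon start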
> >> i just set this (it wasn't set before), but the problem is still present.
> >>
> >>>
> >>> there was a problem in various versions of the Mellanox FW that was
> >>> never completely addressed (bugs were found and fixed, but it was never
> >>> fully proven to be resolved). the environment variables above turn on
> >>> code in the mellanox driver that prevents this potential code path from
> >>> being used in the first place.
> >>>
> >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in
> >>> Scale so that the problem can't happen anymore even if you don't set
> >>> these variables. until then, the only choice you have is the envVar
> >>> above (which btw ships as the default on all ESS systems).
> >>>
> >>> you should also be on the latest available Mellanox FW & drivers, as
> >>> not all versions even have the code that is activated by the environment
> >>> variables above. i think at a minimum you need to be at 3.4, but i don't
> >>> remember the exact version. there have been multiple defects opened
> >>> around this area; the last one i remember was:
> >> we run mlnx ofed 4.1. the fw is not the latest: we have edr cards from
> >> dell, and their fw is a bit behind. i'm trying to convince dell to make
> >> a new one. mellanox used to let you build your own, but they don't
> >> anymore.
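> >>
> >> for completeness, this is roughly how we check what we are actually
> >> running (exact output format varies per release):
> >>
> >>   ofed_info -s                # installed mlnx ofed release
> >>   ibstat | grep -i firmware   # hca firmware version per adapter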
> >>
> >>>
> >>> 00154843: ESS ConnectX-3 performance issue - spinning on
> >>> pthread_spin_lock
> >>>
> >>> you may ask your mellanox representative if they can get you access to
> >>> this defect. while it was found on ESS, i.e. on PPC64 and with
> >>> ConnectX-3 cards, it's a general issue that affects all cards, on intel
> >>> as well as Power.
> >> ok, thanks for this. maybe such a reference is enough for dell to update
> >> their firmware.
> >>
> >> stijn
> >>
> >>>
> >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt
> >>> <stijn.deweirdt at ugent.be> wrote:
> >>>
> >>>> hi all,
> >>>>
> >>>> is there any documentation wrt data integrity in spectrum scale?
> >>>> assuming a crappy network, does gpfs somehow guarantee that data
> >>>> written by a client ends up safe in the nsd gpfs daemon, and similarly
> >>>> from the nsd gpfs daemon to disk?
> >>>>
> >>>> and what about rdma on a crappy network? is it the same?
> >>>>
> >>>> (we are hunting down a crappy infiniband issue; ibm support says it's a
> >>>> network issue, but we see no errors anywhere...)
> >>>>
> >>>> thanks a lot,
> >>>>
> >>>> stijn
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

