RuntimeError: unable to open shared memory object (depending on the model)

Hi!
Thanks for your reply, you perfectly summarized my approaches.
And thanks to your suggestion, my model finally works!
As you suggested, I had to call:

ulimit -n 64000

The 64000 is kinda arbitrary; it corresponds to the open files limit. I’m not really sure how this is related to distributed training or what a good value should be. The default value of 1024 was enough for some models but not for others.
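For anyone who would rather do this from inside the training script, here is a minimal sketch using Python's standard `resource` module (Unix only). The function name `raise_open_file_limit` and the default of 64000 are just illustrative, not something from PyTorch. It only raises the soft limit, capped at the hard limit, so it should not need admin rights. I'd assume it has to run before any worker processes are started, since child processes inherit the limit from the parent.

```python
import resource


def raise_open_file_limit(target=64000):
    """Raise the soft open-files limit toward `target`, capped at the hard limit.

    Hypothetical helper: returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # The soft limit can never exceed the hard limit without extra privileges.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)

    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems refuse values above a kernel-level cap;
            # in that case the current limit is kept unchanged.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_open_file_limit())
```

If the hard limit itself is too low (for example 4096), this will not get you to 64000, and the hard limit has to be raised by an admin.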

Anyway, some relevant info in case someone ends up with the same issue:

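You may need admin rights to call ulimit: raising the value above the current hard limit requires them. Note that calling ulimit only affects the current shell and the processes started from it, so the value is back to the default in a new shell or after a reboot.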
Thanks again.
