Hi!
Thanks for your reply; you summarized my approaches perfectly.
And thanks to your suggestion, my model finally works!
As you suggested, I had to call:
ulimit -n 64000
The value 64000 is fairly arbitrary; it sets the limit on the number of open files (file descriptors). I’m not really sure how this is related to distributed training or what a good value would be. The default value of 1024 was enough for some models but not for others.
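Anyway, some relevant info in case someone ends up with the same issue:
You need admin rights to raise the limit beyond the hard limit (ulimit -Hn shows it); up to that value, a regular user can raise it. Note that calling ulimit only affects the current shell and the processes started from it, so the value has to be set again in every new shell and after a reboot.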
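In case it helps, the same limit can also be checked and raised from inside the training script instead of the shell. This is only a rough sketch, assuming the script is written in Python and runs on Linux or another Unix; the function name raise_open_file_limit and the target of 64000 are just my choices for illustration. It uses the standard resource module and never goes above the hard limit, so it should not need admin rights.

```python
import resource


def raise_open_file_limit(target=64000):
    """Raise the soft open-files limit toward `target`, capped at the hard limit.

    Returns the (soft, hard) pair in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process may only raise the soft limit up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)

    # Only ever raise the limit; leave an already higher or unlimited value alone.
    if soft != resource.RLIM_INFINITY and new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # The OS refused the new value; keep the current limit.
            pass

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = raise_open_file_limit()
    print(f"open files limit: soft={soft}, hard={hard}")
```

The call would have to happen early, before the data loaders or worker processes are created, since the limit applies to the current process and the children it starts afterwards. If the hard limit itself is below what the model needs, it still has to be raised by an admin.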
Thanks again.