How many hours did you take to train agents in each substrate? #15


Closed · YetAnotherPolicy opened this issue Feb 25, 2022 · 12 comments

@YetAnotherPolicy

Dear authors,

Thanks for building such ambitious environments for MARL research. In your paper, I found that the simulation runs for 10^9 steps per agent. To train the agents, how many rollout workers did you use, and how many hours did it take to get the final results in Table 1 (focal per-capita returns)?

Thank you.

@duenez (Contributor) commented Mar 10, 2022

Hello,

Estimating training time is very difficult, since it depends entirely on the training stack, available compute, etc. There is typically a fundamental tradeoff between wall-clock time and compute. On our side, we have tried two very different training stacks: one of them trained populations in a bit under a week, and the other took just one day. The number of workers was also quite different in the two stacks.

We recognise that compute is likely a limiting factor in training these populations, which is why we are actively working on improving the performance of the substrates, including reducing the time spent in Python by delegating to the underlying C++ implementation of the substrate engine (Lab2D) as soon as possible.

Hope this helps

@duenez closed this as completed Mar 10, 2022
@YetAnotherPolicy (Author)

Dear @duenez, thanks for the detailed and helpful reply. I appreciate your team's efforts to make MeltingPot a great testbed for MARL research.

@ManuelRios18

@YetAnotherPolicy I am curious to know how long it takes you to train these populations!

In my case, I can train 1e6 steps in almost exactly an hour using 4 RLlib workers on a machine with 64 GB of RAM and an Nvidia RTX 3060 GPU.

@ManuelRios18

@YetAnotherPolicy
Could you please tell me your hardware specs?
I mean, number of CPUs, RAM, GPU?
Or do you train in the cloud?

@YetAnotherPolicy (Author) commented Jul 8, 2022

> @YetAnotherPolicy I am curious to know how long it takes you to train these populations!
>
> In my case, I can train 1e6 steps in almost exactly an hour using 4 RLlib workers on a machine with 64 GB of RAM and an Nvidia RTX 3060 GPU.

Hi, in my case I use 32 workers and it takes about 8 minutes to run 1M steps. Note that this depends on the simulation speed.

@YetAnotherPolicy (Author) commented Jul 8, 2022

> @YetAnotherPolicy
> Could you please tell me your hardware specs?
> I mean, number of CPUs, RAM, GPU?
> Or do you train in the cloud?

I use common Intel CPUs, 40 in total. Since the observations are RGB images, I use an A100 GPU, which can be faster than a 3090. RAM is 256 GB.

@ManuelRios18

@YetAnotherPolicy Sorry, I am back with more questions!

Which algorithm are you using to train?
I have noticed that in my case PPO is 8 times slower than A3C. Have you experienced anything similar?

@YetAnotherPolicy (Author)

> @YetAnotherPolicy Sorry, I am back with more questions!
>
> Which algorithm are you using to train? I have noticed that in my case PPO is 8 times slower than A3C. Have you experienced anything similar?

Hi, I use PPO. Note that PPO runs an inner training loop over each batch of collected data; see this link: https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/ppo/ppo.py#L265. Please also check whether RLlib uses this trick.

Training with PPO takes 1.5 days for 200M steps.

@yesfon commented Jul 12, 2022

Hello @YetAnotherPolicy,

I was confused by your last message; did you use the RLlib library to train the workers?

@YetAnotherPolicy (Author)

> Hello @YetAnotherPolicy,
>
> I was confused by your last message; did you use the RLlib library to train the workers?

Hi, @yesfon, I did not use RLlib.

@yesfon commented Jul 12, 2022

> > Hello @YetAnotherPolicy,
> > I was confused by your last message; did you use the RLlib library to train the workers?
>
> Hi, @yesfon, I did not use RLlib.

May I ask what you used?

@YetAnotherPolicy (Author)

> > > Hello @YetAnotherPolicy,
> > > I was confused by your last message; did you use the RLlib library to train the workers?
> >
> > Hi, @yesfon, I did not use RLlib.
>
> May I ask what you used?

Hi, I use multiprocessing as well as Ray's remote actors to collect data. RLlib is also good, but it takes a lot of time to learn its APIs.
