Baseline Model Architecture #11
Hi Kinal, thanks for your kind words! The answers to your questions are as follows.
The complete list of observations used by all agents is RGB, READY_TO_SHOOT, and INVENTORY. The implementation is here. Many substrates also expose other observations like POSITION and ORIENTATION, but those are only intended for debugging. There is also a one-hot "layers" encoding (sprite identity by position), which can be selected instead of RGB; we didn't use it ourselves, but it is available as an alternative. Not all substrates naturally expose INVENTORY; in those cases we add a "fake" observation of a zero tensor with the same structure, so that the same agent code can be used without modification for all substrates. The implementation is here.
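To make the zero-tensor padding concrete, here is a minimal sketch of the idea; the function name and the inventory shape are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

# Hypothetical inventory shape; real substrates carry 2 or 3 resource types.
INVENTORY_SHAPE = (3,)

def pad_missing_inventory(observation):
    """Add an all-zero INVENTORY so every substrate exposes the same keys."""
    if "INVENTORY" not in observation:
        observation = dict(observation)  # avoid mutating the caller's dict
        observation["INVENTORY"] = np.zeros(INVENTORY_SHAPE, dtype=np.float32)
    return observation
```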
Yes, it's essentially as you say. We used a CNN followed by an MLP. The output of that MLP fed into an LSTM, and the output of the LSTM was transformed through another MLP into the policy. The extra observations (READY_TO_SHOOT, INVENTORY), along with a one-hot representation of the previous step's action, were all concatenated with the input to the LSTM. We used this basic network architecture for all the algorithms we tested. There are some more details in part C of the ICML paper's appendix.
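A rough Sonnet/TensorFlow sketch of that wiring, assuming batched inputs; every layer size below is a placeholder, not a value from the paper's appendix:

```python
import sonnet as snt
import tensorflow as tf

class BaselineNet(snt.Module):
    """CNN -> MLP -> LSTM -> MLP -> policy logits. Sizes are placeholders."""

    def __init__(self, num_actions):
        super().__init__()
        self._num_actions = num_actions
        self._visual_torso = snt.Sequential([
            snt.Conv2D(16, kernel_shape=8, stride=8), tf.nn.relu,
            snt.Flatten(),
            snt.nets.MLP([64, 64]),
        ])
        self._lstm = snt.LSTM(128)
        self._policy_head = snt.nets.MLP([64, num_actions])

    def initial_state(self, batch_size):
        return self._lstm.initial_state(batch_size)

    def __call__(self, obs, prev_action, lstm_state):
        # Observations are assumed batched; scalar observations like
        # READY_TO_SHOOT arrive with shape [batch, 1].
        visual = self._visual_torso(obs["RGB"])
        extras = tf.concat([
            obs["READY_TO_SHOOT"],
            obs["INVENTORY"],
            tf.one_hot(prev_action, self._num_actions),
        ], axis=-1)
        lstm_out, next_state = self._lstm(
            tf.concat([visual, extras], axis=-1), lstm_state)
        return self._policy_head(lstm_out), next_state
```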
The prosocial algorithms we tried were the most naive kind. On each timestep we simply replaced each individual's instantaneous reward with the sum of everyone's instantaneous rewards on that step. We didn't try VDN or QMIX or any other "cooperative MARL" algorithms. I'm very curious what the result would be for these though. I think there's a nice paper waiting to be written by someone who wants to look into this.

We also did not try using the third-person global observation (WORLD.RGB). I think it is justifiable under the "centralized training and decentralized execution" paradigm, though it depends a bit on your interpretation. If you are happy to assume free access to anything the simulator can produce during training, then why stop at the global observation? You might as well use other debug signals too. Of course none of these are available at test time, so the algorithm would need to be able to handle that.

Aside from using the third-person global observation, another representation that could be used in centralized training is obtained by concatenating the individual observations from all players together. That's the representation we used in the centralized training phase for OPRE, which was the only algorithm we tested so far that was actually designed for the centralized training + decentralized execution regime. Regarding OPRE's prosocial variant, it's fair to describe it as using a centralized critic, though the OPRE algorithm was designed mainly for non-cooperative games, especially zero-sum ones, not cooperative MARL. All the other algorithms we tried were designed for the fully decentralized regime (decentralized even at training time).

I expect that performance improvements could be obtained by using more of the available global information during centralized training. But it's not completely obvious that it would work. It might cause agents to overfit to one another's behavior; if so, they would generalize poorly when faced with unfamiliar co-players at test time. I think a thorough investigation of this issue, probably including algorithms like VDN and QMIX, would make for an interesting paper in its own right. Someone should write that paper!
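To illustrate the naive prosocial substitution described above (a sketch of the idea, not the training code we used):

```python
import numpy as np

def prosocial_rewards(step_rewards):
    """Replace each player's reward with the group sum for this step."""
    return np.full_like(step_rewards, np.sum(step_rewards))

# Example: per-player rewards (1.0, 0.0, -0.5) all become 0.5.
print(prosocial_rewards(np.array([1.0, 0.0, -0.5])))
```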
Thank you for the detailed response. I completely missed the supplementary material. Two more questions.
You should be able to get it by including "LAYER" in the substrate's config.individual_observation_names (e.g. here). Then, if you want your agent to use it for inference, you would also have to make sure to extract the key "LAYER" from the observation and pass it to the neural network. You could, for instance, replace "RGB" with "LAYER" throughout. If you want a third-person, global layers view, you can also use "WORLD.LAYER"; it's analogous to "WORLD.RGB".
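As a sketch of that config change, assuming the substrate configs are ml_collections ConfigDicts with the field names mentioned above (treat this fragment as illustrative, not a complete config):

```python
from ml_collections import config_dict

# Illustrative fragment only; real substrate configs define many more fields.
config = config_dict.ConfigDict()
config.individual_observation_names = ["LAYER", "READY_TO_SHOOT", "INVENTORY"]
config.global_observation_names = ["WORLD.LAYER"]  # analogous to "WORLD.RGB"
```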
Yes, that's true. The sizes of the observations differ a bit across those substrates. Also, the two-player version of Running With Scissors in The Matrix has an observation size of (40, 40, 3), like the collaborative cooking substrates. The inventories are either size two or size three, depending on whether the substrate in question has two or three different kinds of resources to collect. The network architectures will need to be adapted to these changes, but that might not require changing any code: most neural net specification libraries let you define a network layer just by specifying its number of output units, and the actual number of parameters is then inferred automatically from the size of the input. This is how Sonnet's Linear module works, for instance.
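For instance, this Sonnet snippet shows the same layer specification absorbing inventories of different sizes (the sizes here are just for illustration):

```python
import sonnet as snt
import tensorflow as tf

# Only the output size is specified; each layer infers its input size (and
# hence its parameter shapes) from the first batch it sees.
layer_a = snt.Linear(128)
layer_b = snt.Linear(128)
out_a = layer_a(tf.zeros([1, 2]))  # two-resource inventory
out_b = layer_b(tf.zeros([1, 3]))  # three-resource inventory
print(out_a.shape, out_b.shape)    # both (1, 128); weight shapes differ
```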
You can try using the special strides we did; I'm not sure whether they help or hinder, though. Since we know the image is made by tiling 8x8 sprites, you can get away with choosing both the kernel size and the kernel stride to be 8. We made this choice early on, after we started using DMLab2D, and never investigated its implications for Melting Pot substrates. In theory it should make things a bit faster. Though, anecdotally, one of my colleagues mentioned he had tried reverting to a more normal stride value and found that doing so improved performance. So it's a bit unclear right now what the best thing to do is. We'll try to include specific suggested conv net parameters when we next release an update to the repo. In the meantime, feel free to use the 2-layer conv net settings that we used. They are as follows:
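As an illustration of the sprite-aligned idea itself, here is a minimal sketch; the channel count is a placeholder, not one of the actual baseline settings:

```python
import tensorflow as tf

# With kernel size = stride = 8, each output position sees exactly one 8x8
# sprite tile: an 88x88 frame becomes an 11x11 grid of tile features.
frame = tf.zeros([1, 88, 88, 3])
filters = tf.zeros([8, 8, 3, 16])  # placeholder: 16 output channels
tiled = tf.nn.conv2d(frame, filters, strides=8, padding="VALID")
print(tiled.shape)  # (1, 11, 11, 16)
```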
This is a pretty standard thing at DeepMind. I agree with you, though: it seems like redundant information and is probably not necessary. We left it there because, since it's so common at DeepMind, removing it felt like an unnecessary departure from a common default. It very likely makes no difference one way or the other.
Hi,
Thanks for this awesome repo and a great accompanying paper.
Following are the questions I couldn't find answers to in the paper:
Thank you
Kinal