Support MultiGPU training? #4

Open
804609 opened this issue Oct 26, 2017 · 9 comments

Comments

804609 commented Oct 26, 2017

Hi,
Does your code support multi-GPU training?
It seems there is no response; the output stops at:

Create generator for train set
Found 3775 images belonging to 3 classes.
Create generator for val set
Found 501 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1

lishen (Owner) commented Oct 26, 2017 via email

zccoder commented Nov 1, 2017

@804609
I found that it can. However, I wonder how to get the dataset?

804609 (Author) commented Nov 3, 2017

Hi Lishen,
I tried upgrading to Keras v2.0.9, which supports multi-GPU training via the new multi_gpu_model() function.
However, the code then fails with the following error.

Create generator for train set
Found 3768 images belonging to 3 classes.
Create generator for val set
Found 496 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1
Traceback (most recent call last):
File "patch_clf_train.py", line 309, in
run(args.train_dir, args.val_dir, args.test_dir, **run_opts)
File "patch_clf_train.py", line 151, in run
hidden_dropout2=hidden_dropout2)
File "/breast_cancer/end2end-all-conv/ddsm_train/dm_keras_ext.py", line 204, in do_3stage_training
verbose=2)
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2046, in fit_generator
generator_output = next(output_generator)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 518, in get
raise StopIteration(e)
StopIteration: can't pickle generator objects

I traced the code but couldn't find the reason.
Do you know how to fix this error?

804609 (Author) commented Nov 3, 2017

@zccoder
Lishen's paper has that information.

lishen (Owner) commented Nov 3, 2017 via email

zccoder commented Nov 6, 2017

@804609
Which Python files did you run, and how did you configure them? Could you post the files here? Thank you!

804609 (Author) commented Nov 6, 2017

@lishen,
I can now run in parallel on an 8-GPU platform. Please see the code I modified below.
You said it's somewhat unstable; can you explain in more detail?
How much speedup do you get on 8 GPUs compared to 1 GPU?

# ================= Model creation ============== #
if gpu_count > 1:
    # Build the template model on the CPU so its weights are stored in host
    # memory, then replicate it across the GPUs with make_parallel.
    with tf.device('/cpu:0'):
        model, preprocess_input, top_layer_nb = get_dl_model(
            net, nb_class=len(class_list), use_pretrained=use_pretrained,
            resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
            weight_decay=weight_decay, hidden_dropout=hidden_dropout,
            nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
            init_conv_stride=init_conv_stride, pool_size=pool_size,
            pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
            inp_dropout=inp_dropout)
    model, org_model = make_parallel(model, gpu_count)
else:
    model, preprocess_input, top_layer_nb = get_dl_model(
        net, nb_class=len(class_list), use_pretrained=use_pretrained,
        resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
        weight_decay=weight_decay, hidden_dropout=hidden_dropout,
        nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
        init_conv_stride=init_conv_stride, pool_size=pool_size,
        pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
        inp_dropout=inp_dropout)
    # Single-GPU case: the original model doubles as the training model.
    org_model = model
if featurewise_center:
    preprocess_input = None

lishen (Owner) commented Nov 6, 2017

@804609 ,

Your code doesn't make use of the new multi_gpu_model API. It uses the make_parallel function, which is a "monkey patch" for multi-GPU support. You should change it to the new function.

I found that Keras' new function works, but it sometimes blows up the GPU with a "resource exhausted" error even though the same code ran successfully before. I'm not sure what the reason was.
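
For reference, a minimal sketch of the multi_gpu_model pattern in Keras >= 2.0.9 (the ResNet50 base model here is only a placeholder; in this repo the model would come from get_dl_model()):

# Minimal sketch of keras.utils.multi_gpu_model (added in Keras 2.0.9).
import tensorflow as tf
from keras.applications.resnet50 import ResNet50
from keras.utils import multi_gpu_model

gpu_count = 8

# Instantiate the template model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    base_model = ResNet50(weights=None, classes=3)

# Each replica processes a slice of every batch; results are merged on the CPU.
parallel_model = multi_gpu_model(base_model, gpus=gpu_count)
parallel_model.compile(optimizer='sgd', loss='categorical_crossentropy')

# Train with the parallel model, but save weights through the template model:
# parallel_model.fit_generator(train_generator, steps_per_epoch, epochs=1)
# base_model.save_weights('patch_clf_weights.h5')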

804609 (Author) commented Nov 8, 2017

@lishen,
I can't get the code to run with the new version or the new function.
I got the error "StopIteration: can't pickle generator objects" when I upgraded to 2.0.9 with the same code, but it works fine on 2.0.8. I found some messages in keras-team/keras/issues/8368. They say an additional update to fit_generator() leads to the use of OrderedEnqueuer instead of GeneratorEnqueuer if the underlying generator is a Sequence, which can break your code.

Is your code perhaps passing the wrong class or generator to fit_generator()?
How did you patch your code?
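
For context, a rough sketch of what that Keras issue describes (the class and data here are hypothetical, not this repo's actual pipeline): in 2.0.9 a keras.utils.Sequence is handled by the OrderedEnqueuer, and with use_multiprocessing=True the object has to be handed to worker processes, so anything holding an unpicklable internal generator can fail with exactly this StopIteration. A self-contained Sequence like the one below, or calling fit_generator with use_multiprocessing=False, is a typical workaround.

# Sketch of a picklable keras.utils.Sequence (available since Keras 2.0.6).
# The images/labels arrays are placeholders for the real patch data.
import numpy as np
from keras.utils import Sequence

class PatchSequence(Sequence):
    def __init__(self, images, labels, batch_size=32):
        self.images = images
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one (inputs, targets) batch.
        start = idx * self.batch_size
        end = start + self.batch_size
        return self.images[start:end], self.labels[start:end]

# Hypothetical usage:
# train_seq = PatchSequence(train_images, train_labels, batch_size=64)
# model.fit_generator(train_seq, steps_per_epoch=len(train_seq), epochs=1,
#                     workers=4, use_multiprocessing=True)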
