Support MultiGPU training? #4

Open
804609 opened this issue Oct 26, 2017 · 9 comments

Comments

804609 commented Oct 26, 2017

Hi,
Does your code support multi-GPU training?
It seems there is no response; the output stops at:

Create generator for train set
Found 3775 images belonging to 3 classes.
Create generator for val set
Found 501 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1

lishen (Owner) commented Oct 26, 2017 via email

zccoder commented Nov 1, 2017

@804609
I found that it can. However, I wonder how to get the dataset?

804609 (Author) commented Nov 3, 2017

Hi Lishen,
I tried upgrading to Keras v2.0.9, which supports multi-GPU training via the new multi_gpu_model() function.
However, the code then fails with the following error.

Create generator for train set
Found 3768 images belonging to 3 classes.
Create generator for val set
Found 496 images belonging to 3 classes.
Start model training on the last dense layer only
Epoch 1/1
Traceback (most recent call last):
File "patch_clf_train.py", line 309, in
run(args.train_dir, args.val_dir, args.test_dir, **run_opts)
File "patch_clf_train.py", line 151, in run
hidden_dropout2=hidden_dropout2)
File "/breast_cancer/end2end-all-conv/ddsm_train/dm_keras_ext.py", line 204, in do_3stage_training
verbose=2)
File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 87, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 2046, in fit_generator
generator_output = next(output_generator)
File "/usr/local/lib/python2.7/dist-packages/keras/utils/data_utils.py", line 518, in get
raise StopIteration(e)
StopIteration: can't pickle generator objects

I traced the code but couldn't find the reason.
Do you know how to fix this error?

804609 (Author) commented Nov 3, 2017

@zccoder
Lishen's paper has that information.

lishen (Owner) commented Nov 3, 2017 via email

zccoder commented Nov 6, 2017

@804609
Which Python files did you run, and how did you configure them? Could you post the files here? Thank you!

804609 (Author) commented Nov 6, 2017

@lishen,
I can now run in parallel on an 8-GPU platform. Please see the code I modified below.
You said it's somewhat unstable; can you explain in more detail?
How much speedup do you get on 8 GPUs compared to 1 GPU?

# ================= Model creation ============== #
if gpu_count > 1:
    # Build the template model on the CPU so its weights are stored in host
    # memory, then replicate it across the GPUs with make_parallel.
    with tf.device('/cpu:0'):
        model, preprocess_input, top_layer_nb = get_dl_model(
            net, nb_class=len(class_list), use_pretrained=use_pretrained,
            resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
            weight_decay=weight_decay, hidden_dropout=hidden_dropout,
            nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
            init_conv_stride=init_conv_stride, pool_size=pool_size,
            pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
            inp_dropout=inp_dropout)
    model, org_model = make_parallel(model, gpu_count)
else:
    model, preprocess_input, top_layer_nb = get_dl_model(
        net, nb_class=len(class_list), use_pretrained=use_pretrained,
        resume_from=resume_from, img_size=img_size, top_layer_nb=top_layer_nb,
        weight_decay=weight_decay, hidden_dropout=hidden_dropout,
        nb_init_filter=nb_init_filter, init_filter_size=init_filter_size,
        init_conv_stride=init_conv_stride, pool_size=pool_size,
        pool_stride=pool_stride, alpha=alpha, l1_ratio=l1_ratio,
        inp_dropout=inp_dropout)
    # Single-GPU case: the original model doubles as the training model.
    org_model = model
if featurewise_center:
    preprocess_input = None

lishen (Owner) commented Nov 6, 2017

@804609 ,

Your code doesn't make use of the new multi_gpu_model API. It uses the make_parallel function, which is a "monkey patch" for multi-GPU support. You should change it to the new function.

I found that Keras' new function works, but it sometimes blows up the GPU with a "resource exhausted" error even though the same code ran successfully before. I'm not sure what the reason was.
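
For reference, a minimal sketch of the multi_gpu_model pattern in Keras >= 2.0.9 (the ResNet50 base model here is only a placeholder; in this repo the model would come from get_dl_model()):

# Minimal sketch of keras.utils.multi_gpu_model (added in Keras 2.0.9).
import tensorflow as tf
from keras.applications.resnet50 import ResNet50
from keras.utils import multi_gpu_model

gpu_count = 8

# Instantiate the template model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    base_model = ResNet50(weights=None, classes=3)

# Each replica processes a slice of every batch; results are merged on the CPU.
parallel_model = multi_gpu_model(base_model, gpus=gpu_count)
parallel_model.compile(optimizer='sgd', loss='categorical_crossentropy')

# Train with the parallel model, but save weights through the template model:
# parallel_model.fit_generator(train_generator, steps_per_epoch, epochs=1)
# base_model.save_weights('patch_clf_weights.h5')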

804609 (Author) commented Nov 8, 2017

@lishen,
I can't get the code to run with the new version or the new function.
I got the error "StopIteration: can't pickle generator objects" when I upgraded to 2.0.9 with the same code, but it works fine on 2.0.8. I found some messages in keras-team/keras/issues/8368. They say an additional update to fit_generator() leads to the use of OrderedEnqueuer instead of GeneratorEnqueuer if the underlying generator is a Sequence, which can break your code.

Is your code perhaps passing the wrong class or generator to fit_generator()?
How did you patch your code?
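
For context, a rough sketch of what that Keras issue describes (the class and data here are hypothetical, not this repo's actual pipeline): in 2.0.9 a keras.utils.Sequence is handled by the OrderedEnqueuer, and with use_multiprocessing=True the object has to be handed to worker processes, so anything holding an unpicklable internal generator can fail with exactly this StopIteration. A self-contained Sequence like the one below, or calling fit_generator with use_multiprocessing=False, is a typical workaround.

# Sketch of a picklable keras.utils.Sequence (available since Keras 2.0.6).
# The images/labels arrays are placeholders for the real patch data.
import numpy as np
from keras.utils import Sequence

class PatchSequence(Sequence):
    def __init__(self, images, labels, batch_size=32):
        self.images = images
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return one (inputs, targets) batch.
        start = idx * self.batch_size
        end = start + self.batch_size
        return self.images[start:end], self.labels[start:end]

# Hypothetical usage:
# train_seq = PatchSequence(train_images, train_labels, batch_size=64)
# model.fit_generator(train_seq, steps_per_epoch=len(train_seq), epochs=1,
#                     workers=4, use_multiprocessing=True)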
