
python - Dimension error when using multiple GPUs for Pytorch MaskRCNN training - Stack Overflow


I have implemented a basic training loop for PyTorch's (torchvision) implementation of MaskRCNN. I have 4 GPUs available, and I wrap the model in torch.nn.DataParallel() so that I can train on several of them when I want to. However, when I pass an even number of GPUs, such as 0,1 or 0,1,2,3, I get the following error:

RuntimeError: Caught RuntimeError in replica 0 on device 6.
Original Traceback (most recent call last):
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py", line 83, in forward
    images, targets = self.transform(images, targets)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 129, in forward
    image = self.normalize(image)
  File "/raid/training_data/motor_insurance/env/lib/python3.8/site-packages/torchvision/models/detection/transform.py", line 157, in normalize
    return (image - mean[:, None, None]) / std[:, None, None]
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 0

But when I use an odd number of GPUs, training runs without errors and I get correct results too. Can anyone help me solve this?

I have tried everything I can think of, and I suspect something is wrong in the PyTorch code itself.
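
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: torch.nn.DataParallel splits every input tensor along dimension 0, and the torchvision detection models take a list of 3-channel images rather than one batched tensor, so each image may be getting split along its channel dimension. The sketch below uses no PyTorch at all; chunk_sizes is a hypothetical helper that only mimics the ceil-division rule torch.chunk documents, to show what sizes a 3-channel dimension would be cut into for different GPU counts.

```python
import math


def chunk_sizes(length, num_chunks):
    """Sizes obtained when cutting `length` items into at most `num_chunks`
    pieces of ceil(length / num_chunks) items each (hypothetical helper that
    mirrors the documented splitting rule of torch.chunk)."""
    step = math.ceil(length / num_chunks)
    sizes = []
    remaining = length
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


CHANNELS = 3  # an RGB image has shape (3, H, W); dimension 0 is the channel axis

for num_gpus in (1, 2, 3, 4):
    # Each entry is the channel count one replica would receive.
    print(num_gpus, chunk_sizes(CHANNELS, num_gpus))
```

If this reading is right, the two-GPU case gives replica 0 a 2-channel image, which cannot be broadcast against the 3-element mean in normalize and would produce exactly the "tensor a (2) must match the size of tensor b (3)" message. With three GPUs each replica would receive a 1-channel image, which broadcasts against the mean without raising, so the run may only appear to work and the results are worth double-checking. The sketch does not by itself account for the four-GPU failure. A commonly suggested alternative for these models is torch.nn.parallel.DistributedDataParallel, which runs one process per GPU and does not scatter the inputs.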
