Issue description

I found that someone had reported this error before and it could not be reproduced. However, I always get this error during my training. When it occurs, the code keeps running, so the error keeps recurring; it seems to have no effect on training.

Code example

import torch

# lmdbDataset, resizeNormalize and alignCollate are assumed to come from
# this project's dataset utilities (a crnn.pytorch-style setup).
from dataset import lmdbDataset, resizeNormalize, alignCollate

train_dataset = lmdbDataset(root=opt.trainroot, transform=resizeNormalize(size=(592, 32)))
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=opt.batchSize,
    shuffle=True, sampler=None,
    num_workers=int(opt.workers),
    collate_fn=alignCollate())

# Pull one batch; the error below appears while workers hand results back.
train_iter = iter(train_loader)
cpu_images, cpu_texts, cpu_lengths = next(train_iter)

ERROR message:

ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fec6928a9b0>>
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:

(The same traceback is printed several times, interleaved with stray lines such as "chunk = read(handle, remaining)" from other processes; only one clean copy is shown here.)
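
The traceback shows the failure path: while _DataLoaderIter.__del__ drains worker_result_queue, a worker process has already exited, so the file-descriptor handoff in recvfds hits a reset or closed socket. A commonly suggested mitigation, distinct from the upstream fix discussed in the comments below, is to switch torch.multiprocessing to the file_system sharing strategy, so tensors are shared through files rather than by passing descriptors over Unix sockets. A minimal sketch (set it once, before creating the DataLoader):

import torch.multiprocessing as mp

# Share tensor storage via files in shared memory instead of passing
# file descriptors over Unix sockets; recvfds() is the call raising
# ConnectionResetError/EOFError above. Caveat: this strategy can leave
# stale shared-memory files behind if processes die uncleanly.
mp.set_sharing_strategy('file_system')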

System Info

  • PyTorch or Caffe2: PyTorch
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • OS: CentOS 7
  • PyTorch version: 0.4.0
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0

    I found that this error has a serious effect on my training: the labels I get back are garbled.
    Some useful information:
    #1551
    https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847/6

    @ssnl
    I am using PyTorch 0.4.0 and I encounter this error frequently. I found three PRs related to this dataloader bug: #10366 #11985 #12011. Is there an easy way to fix it in 0.4.0? Since 0.4.0 differs substantially from the current master branch, I would rather not build master from source.
    If I manually merge the two files changed in #10366 into 0.4.0, will that work?

    @ssnl Thanks again. After merging torch/_six.py and torch/utils/data/dataloader.py from master-034c96 (#10366 #11985 #12700) into v0.4.0, the error is fixed.

    This also works in v0.4.1, thanks a lot~


    I also ran into this problem. Here is my code:

    from tensorflow.examples.tutorials.mnist import input_data

    print("Begin of file")
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    print(mnist.train.images.shape)
    print("End of file")

    The error: ConnectionResetError: [Errno 104] Connection reset by peer.

    This error still emerges repeatedly in the new version, 1.2.0. Setting num_workers=0 is not an optimal workaround, since it slows down the whole process. Strangely, I only get this error for my test data, not the training data. Is it possible that the error comes from the test data itself, e.g. a corrupted image?
    Thanks
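
    If a corrupted image is the suspect, that can be checked independently of the DataLoader. A minimal sketch, assuming the test images are plain JPEG files under a hypothetical data/test directory and that PIL is available:

    from pathlib import Path
    from PIL import Image

    # Walk the (hypothetical) test directory and flag any file PIL
    # cannot parse; verify() is a cheap integrity check that does not
    # fully decode the image.
    for path in sorted(Path('data/test').rglob('*.jpg')):
        try:
            with Image.open(path) as img:
                img.verify()
        except Exception as exc:
            print(f'corrupt: {path} ({exc})')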