I found that someone had reported this error before but could not reproduce it. However, I always get it during training. The code keeps running when the error appears, so I keep hitting it, but it seems to have no effect on training.
import torch
import torch.utils.data

# lmdbDataset, resizeNormalize and alignCollate are defined in my own dataset code (not shown here).
train_dataset = lmdbDataset(root=opt.trainroot, transform=resizeNormalize(size=(592, 32)))
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=opt.batchSize,
    shuffle=True, sampler=None,
    num_workers=int(opt.workers),
    collate_fn=alignCollate())
train_iter = iter(train_loader)
cpu_images, cpu_texts, cpu_lengths = next(train_iter)
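For reference, a manual iterator like this is normally driven inside an epoch loop roughly like the sketch below; total_steps and the training-step body are placeholders rather than my real code, and the old iterator is only replaced once it raises StopIteration:

for step in range(total_steps):  # total_steps is a placeholder for the number of iterations
    try:
        cpu_images, cpu_texts, cpu_lengths = next(train_iter)
    except StopIteration:
        # epoch exhausted: create a fresh iterator (and its worker processes) and fetch again
        train_iter = iter(train_loader)
        cpu_images, cpu_texts, cpu_lengths = next(train_iter)
    # ... run one training step on the fetched batch ...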
ERROR message:
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fec6928a9b0>>
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/root/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/root/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/root/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/root/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/root/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError:
The same output repeats several more times during training: a ConnectionResetError ([Errno 104] Connection reset by peer, raised at "chunk = read(handle, remaining)") followed by the EOFError traceback above.
System Info
PyTorch or Caffe2: PyTorch
How you installed PyTorch (conda, pip, source): pip
Build command you used (if compiling from source):
OS: CentOS 7
PyTorch version: 0.4.0
Python version: 3.6
CUDA/cuDNN version: CUDA 9.0 / cuDNN 7.0
I found that this error does have a serious effect on my training: the labels I get back are garbled.
Some useful information:
#1551
https://discuss.pytorch.org/t/using-torch-tensor-over-multiprocessing-queue-process-fails/2847/6
@ssnl
I am using PyTorch 0.4.0 and I hit this error frequently. I found three PRs related to this DataLoader bug: #10366, #11985, and #12011. Is there an easy way to fix it in 0.4.0? There is a big difference between 0.4.0 and the current master branch, so I do not want to build master from source.
If I manually merge the two files changed in #10366 into 0.4.0, will that work?

@ssnl Thanks again. After merging torch/_six.py and torch/utils/data/dataloader.py of master-034c96 (#10366, #11985, #12700) into v0.4.0, this error is fixed.
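In case it helps anyone applying the same manual patch, the two installed files that need to be overwritten can be located like this (just a sketch, assuming a normal pip install):

import torch._six
import torch.utils.data.dataloader

# These are the installed copies of torch/_six.py and torch/utils/data/dataloader.py;
# back them up, then replace them with the patched versions from master.
print(torch._six.__file__)
print(torch.utils.data.dataloader.__file__)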
This also works in v0.4.1, thanks a lot!
I also hit this issue. This is my code:

from tensorflow.examples.tutorials.mnist import input_data

print("Begin of file")
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
print(mnist.train.images.shape)
print("End of file")

ConnectionResetError: [Errno 104] Connection reset by peer.
This error still comes up repeatedly in the new version, 1.2.0. Setting num_workers=0 is not an optimal workaround, since it slows things down. Strangely, I only get this error for my test data, not my training data. Is it possible that the error comes from the test data itself, for instance an image that has been corrupted?
Thanks
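One way to test the corrupted-image theory is a quick Pillow scan over the test set. This is only a sketch: image_paths is a placeholder, and if the test set lives in an LMDB you would iterate over its keys and decode from bytes instead.

from PIL import Image

image_paths = []  # placeholder: fill with the paths of the test images

bad = []
for path in image_paths:
    try:
        with Image.open(path) as img:
            img.verify()   # quick integrity check (does not decode pixel data)
        with Image.open(path) as img:
            img.load()     # full decode; catches truncated files that verify() misses
    except Exception as exc:
        bad.append((path, exc))

print("corrupted images:", len(bad))
for path, exc in bad:
    print(path, exc)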