Hello all, I want to report an issue with PyTorch and an HDF5 loader. The full source code and the error are provided below.
The problem occurs when I run test_dataloader.py in two terminals at the same time. The script loads a custom HDF5 dataset (custom_h5_loader). To create the data, first run convert_to_h5, which generates 100 random h5 files.
To reproduce the error, please follow these steps.
Step 1: Generate the HDF5 files
from __future__ import print_function
import h5py
import numpy as np
import random
import os
if not os.path.exists('./data_h5'):
    os.makedirs('./data_h5')
for index in range(100):
    data = np.random.uniform(0,1, size=(3,128,128))
    data = data[None, ...]
    print (data.shape)
    with h5py.File('./data_h5/' +'%s.h5' % (str(index)), 'w') as f:
        f['data'] = data
Step 2: Create a Python file dataloader.py (the module that Step 3 imports custom_h5_loader from) and paste the following code
import h5py
import torch.utils.data as data
import glob
import torch
import numpy as np
import os
class custom_h5_loader(data.Dataset):
    def __init__(self, root_path):
        self.hdf5_list = [x for x in glob.glob(os.path.join(root_path, '*.h5'))]
        self.data_list = []
        for ind in range (len(self.hdf5_list)):
            self.h5_file = h5py.File(self.hdf5_list[ind])
            data_i = self.h5_file.get('data')
            self.data_list.append(data_i)
    def __getitem__(self, index):
        self.data = np.asarray(self.data_list[index])
        return (torch.from_numpy(self.data).float())
    def __len__(self):
        return len(self.hdf5_list)
Step 3: Create a Python file named test_dataloader.py
from dataloader import custom_h5_loader
import torch
import torchvision.datasets as dsets
train_h5_dataset = custom_h5_loader('./data_h5')
h5_loader = torch.utils.data.DataLoader(dataset=train_h5_dataset, batch_size=2, shuffle=True, num_workers=4)
for epoch in range(100000):
    for i, data in enumerate(h5_loader):
        print (data.shape)
Step 4: Open the first terminal and run the script (this works)
python test_dataloader.py
Step 5: Open a second terminal and run the script again (this fails with the error below)
python test_dataloader.py
The error is:
Traceback (most recent call last):
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 162, in make_fid
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 165, in make_fid
    fid = h5f.open(name, h5f.ACC_RDONLY, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to lock file, errno = 11, error message = 'Resource temporarily unavailable')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test_dataloader.py", line 5, in <module>
    train_h5_dataset = custom_h5_loader('./data_h5')
  File "/home/john/test_hdf5/dataloader.py", line 13, in __init__
    self.h5_file = h5py.File(self.hdf5_list[ind])
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 312, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/john/anaconda3/lib/python3.6/site-packages/h5py/_hl/files.py", line 167, in make_fid
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 98, in h5py.h5f.create
OSError: Unable to create file (unable to open file: name = './data_h5/47.h5', errno = 17, error message = 'File exists', flags = 15, o_flags = c2)
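The three chained exceptions show what happens when h5py.File is called without a mode in this h5py version: it tries a read/write open, then a read-only open, then an exclusive create. Since the loader only reads, one thing worth trying is to request read-only access explicitly in every process, so that no write lock is asked for in the first place. This is only a minimal sketch under that assumption; the helper name read_h5_array is hypothetical and not part of the code above.

```python
import h5py


def read_h5_array(path, key="data"):
    """Open an HDF5 file read-only and return one dataset as a NumPy array."""
    # mode="r" requests read-only access, so h5py does not attempt the
    # read/write open that appears first in the traceback.
    with h5py.File(path, "r") as f:
        # [()] reads the whole dataset into memory before the file is closed.
        return f[key][()]
```

For example, read_h5_array('./data_h5/47.h5') should return an array of shape (1, 3, 128, 128) for the files generated in Step 1.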
This is my configuration:
PyTorch version (print (torch.__version__)): 1.0.0.dev20181227
HDF5 Version: 1.10.2
Configured on: Wed May 9 23:24:59 UTC 2018
Features:
---------
Parallel HDF5: no
High-level library: yes
Threadsafety: yes
With deprecated public symbols: yes
I/O filters (external): deflate(zlib),szip(encoder)
MPE: no
Direct VFD: no
dmalloc: no
Packages w/ extra debug output: none
API tracing: no
Using memory checker: no
Memory allocation sanity checks: no
Metadata trace file: no
Function stack tracing: no
Strict file format checks: no
Optimization instrumentation: no
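The HDF5 1.10 series introduced file locking, which matches the "unable to lock file" message in the traceback. A workaround that is sometimes suggested is to switch locking off through the HDF5_USE_FILE_LOCKING environment variable, which as far as I know is honored from HDF5 1.10.1 onward. The sketch below assumes the files are never modified while they are being read; whether it helps with this particular build is untested.

```python
import os

# Assumption: the installed HDF5 build honors this variable (reported for
# HDF5 >= 1.10.1). It has to be set before the library opens any file;
# setting it before importing h5py is the safe order.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

import h5py  # noqa: E402  (imported after setting the variable on purpose)

# HDF5 library version that h5py was built against, for reference.
print(h5py.version.hdf5_version)
```

The same effect can be had by exporting the variable in each terminal before running python test_dataloader.py.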
I think your question is related to the way the HDF5 format is specified. There are multiple threads on this board that concern DataLoader issues in combination with HDF5.
I have not had time to reproduce your errors yet. Did I understand correctly that you attempted to access the same HDF5 files from two different training sessions?
I am currently facing similar issues. One remedy that is supposed to work is enabling SWMR (single-writer/multiple-reader) in recent versions of libhdf5, which is said to resolve the concurrency issues. Honestly, these issues have left me in doubt about the suitability of this data format for scientific applications altogether.
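To make the suggestion concrete, here is a rough sketch of a dataset that keeps no file handles open and opens each file read-only inside __getitem__, with SWMR read mode as an opt-in switch. The class name LazyH5Dataset and the use_swmr parameter are hypothetical. I have not verified that it removes the error in the two-terminal setup, and SWMR read mode may expect files that were written with libver='latest', so it is off by default.

```python
import glob
import os

import h5py
import torch
import torch.utils.data as data


class LazyH5Dataset(data.Dataset):
    """Hypothetical variant of custom_h5_loader that opens files on demand."""

    def __init__(self, root_path, use_swmr=False):
        # Only collect the paths here; no HDF5 handle is created in __init__,
        # so nothing is held open or shared with DataLoader worker processes.
        self.hdf5_list = sorted(glob.glob(os.path.join(root_path, '*.h5')))
        self.use_swmr = use_swmr

    def __getitem__(self, index):
        # mode='r' requests read-only access. swmr=True additionally asks for
        # SWMR read access (assumption: the file is compatible with it).
        with h5py.File(self.hdf5_list[index], 'r', swmr=self.use_swmr) as f:
            array = f['data'][()]  # read into memory before the file closes
        return torch.from_numpy(array).float()

    def __len__(self):
        return len(self.hdf5_list)
```

It can be dropped into test_dataloader.py in place of custom_h5_loader, e.g. train_h5_dataset = LazyH5Dataset('./data_h5'). Opening a file on every access costs some speed, but it keeps each worker and each training session independent of the others.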