How to read dataset from .tar files? - PyTorch Forums

This works if you have an image dataset in a .tar file.
You also need a .csv or .txt file that lists the name of each item in your dataset, one per line.
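If you do not have such a list yet, it can probably be generated from the archive itself. The following is only a minimal sketch: `write_file_list` and its `extensions` parameter are names made up for illustration, and it assumes member names contain no spaces (the class below splits lines on spaces).

```python
import tarfile


def write_file_list(tar_path, txt_path, extensions=(".jpg", ".jpeg", ".png")):
    """Write one member name per line for every image-like file in the archive.

    Hypothetical helper, not part of the original post.
    """
    with tarfile.open(tar_path) as tf:
        names = [
            member.name
            for member in tf.getmembers()
            if member.isfile() and member.name.lower().endswith(extensions)
        ]
    with open(txt_path, "w") as f:
        f.writelines(name + "\n" for name in names)
    return names
```

Calling `write_file_list('data.tar', 'filelist.txt')` would then produce the file that the dataset class expects as `txt_path`.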

from PIL import Image
from torchvision.transforms import ToTensor, ToPILImage
import numpy as np
import random
import tarfile
import io
import os
import pandas as pd
from torch.utils.data import Dataset
import torch
class YourDataset(Dataset):
    def __init__(self, txt_path='filelist.txt', img_dir='data.tar', transform=None):
        """
        Initialize data set as a list of IDs corresponding to each item of data set

        :param img_dir: path to image files as an uncompressed tar archive
        :param txt_path: a text file containing names of all of images line by line
        :param transform: apply some transforms like cropping, rotating, etc on input image
        """
        # header=None so the first line is read as an image name, not as a column header
        df = pd.read_csv(txt_path, sep=' ', header=None, index_col=0)
        self.img_names = df.index.values
        self.txt_path = txt_path
        self.img_dir = img_dir
        self.transform = transform
        self.to_tensor = ToTensor()
        self.to_pil = ToPILImage()
        self.tf = tarfile.open(self.img_dir)
    def get_image_from_tar(self, name):
        """
        Gets an image by a name gathered from file list csv file

        :param name: name of targeted image
        :return: a PIL image
        """
        image = self.tf.extractfile(name)
        image = image.read()
        image = Image.open(io.BytesIO(image))
        return image
    def __len__(self):
        """
        Return the length of data set using list of IDs

        :return: number of samples in data set
        """
        return len(self.img_names)
    def __getitem__(self, index):
        """
        Generate one item of data set.

        :param index: index of item in IDs list
        :return: a sample of data as a dict
        """
        if index == (self.__len__() - 1) :  # close tarfile opened in __init__
            self.tf.close()
        image = self.get_image_from_tar(self.img_names[index])
        if self.transform is not None:
            image = self.transform(image)
        sample = {'X': image}
        return sample
Hello, thank you for your solution. I have an issue when I create a DataLoader with a dataset from a tar archive.

It gives me this error if I run the DataLoader more than once. Can you help me with that problem?
OSError: TarFile is closed
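
A likely cause is that `__getitem__` in the class above closes the archive as soon as the last index is requested, so a second pass finds it closed. One possible workaround, sketched below under that assumption, is to open the archive lazily and never close it inside `__getitem__`. `LazyTarReader` is a hypothetical name; it is a plain class that returns raw bytes so the sketch stays free of dependencies, and the same pattern should carry over to a `Dataset` subclass.

```python
import tarfile


class LazyTarReader:
    """Map-style container: supports len() and indexing, returns raw bytes."""

    def __init__(self, tar_path, names):
        self.tar_path = tar_path
        self.names = list(names)
        self._tf = None  # opened on first access, not in __init__

    def _tar(self):
        # (Re)open the archive whenever no handle is available.
        if self._tf is None:
            self._tf = tarfile.open(self.tar_path)
        return self._tf

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        # No close() here, so the archive stays usable across epochs.
        return self._tar().extractfile(self.names[index]).read()

    def close(self):
        """Release the handle explicitly once the data is no longer needed."""
        if self._tf is not None:
            self._tf.close()
            self._tf = None
```

Because the handle is reopened on demand, reading after `close()` works again instead of raising `OSError`.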
When I tried this method of extracting individual images inside the `__getitem__` method for the TinyImageNet dataset, I ran into zlib errors during DataLoader collation. The error disappeared once I extracted the images in `__init__` and stored them in a dict, so that `__getitem__` only applied the transformation and returned the image and target.
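
The preloading approach described here could look roughly like the sketch below. It is not the poster's actual code: `load_tar_into_memory` is a hypothetical helper, it keeps raw bytes rather than decoded images, and it is only practical when the whole archive fits in memory.

```python
import tarfile


def load_tar_into_memory(tar_path):
    """Return {member name: raw bytes} for every regular file in the archive."""
    data = {}
    with tarfile.open(tar_path) as tf:
        for member in tf:  # iterating a TarFile yields TarInfo objects
            if member.isfile():
                data[member.name] = tf.extractfile(member).read()
    return data
```

A dataset's `__init__` could call this once, and `__getitem__` would then decode `data[name]` (for example with `Image.open(io.BytesIO(...))`) without touching the tar file again.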
That is possible. I only used this implementation for a large tar file on a single machine, so it may not work in some other situations.

Still, the best approach even for large datasets is to extract the entire dataset and loop over it. Here is the original post that led to this code:

    Fastest way to read images from uncompressed TAR file in __getitem__ method of Custom Dataset vision
    Hello everyone,
I have a huge dataset (2 million) of jpg images in one uncompressed TAR file. I also have a txt file in which each line is the name of an image in the TAR file, in order.
img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
and the images in the tar file are named exactly the same.
I searched a lot and found out that the tarfile module is the best option, but when I tried to read images from the tar file by name, it took too long. The reason is that every time I call the getmember(name) method, it calls the getmembers() metho…
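
One common way around slow lookups by name, shown here as a sketch rather than as the linked post's solution, is to scan the archive once and keep a dictionary from member name to its `TarInfo`. `IndexedTar` is a hypothetical name.

```python
import tarfile


class IndexedTar:
    """Read members by name through a dictionary built in a single scan."""

    def __init__(self, tar_path):
        self.tf = tarfile.open(tar_path)
        # One pass over the archive; later lookups are dictionary hits.
        self.index = {m.name: m for m in self.tf.getmembers() if m.isfile()}

    def read(self, name):
        # extractfile accepts a TarInfo, so no search by name is needed here.
        return self.tf.extractfile(self.index[name]).read()

    def close(self):
        self.tf.close()
```

The initial scan still has to walk the whole archive, but it happens once instead of on every access.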
Inspired by Nikronic’s solution, I made a similar class that reads from Tar archives.
I noticed that it didn’t work out-of-the-box with DataLoaders with multiple workers (depends on the OS/multiprocessing method), so this new class essentially opens a different TarFile (with unique file handles) per worker process.
I also added some methods to access files, including text and images, to allow easier subclassing; and the Tar archive’s content is cached on first read (takes under a minute for a 140GB Tar file with >1M files).
In case you want to check it out: Simple Tar Dataset on GitHub.
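
The idea of one TarFile per worker process can be illustrated with the sketch below. It is not the code from the linked repository: `PerProcessTar` is a hypothetical name, and keying handles by `os.getpid()` is just one way to give each process its own file handle.

```python
import os
import tarfile


class PerProcessTar:
    """Keep a separate TarFile handle for each process that reads from it."""

    def __init__(self, tar_path):
        self.tar_path = tar_path
        self._handles = {}  # process id -> TarFile

    def _tar(self):
        pid = os.getpid()
        if pid not in self._handles:
            # First access in this process: open a fresh handle.
            self._handles[pid] = tarfile.open(self.tar_path)
        return self._handles[pid]

    def read(self, name):
        return self._tar().extractfile(name).read()

    def __getstate__(self):
        # Open handles are dropped when the object is pickled for a worker.
        state = self.__dict__.copy()
        state["_handles"] = {}
        return state
```

With a forking start method each worker inherits the object but opens its own handle on first read; with a spawning start method the object is pickled, which is why `__getstate__` drops the handles.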