HuggingFace’s datasets package has proven to be 🔥 for sharing NLP datasets. So, why aren’t people sharing other types of datasets? Well, frankly, idk. So, let’s be the change we wanna see in the world and start sharing our own 😎.

This post will walk you through sharing an image classification dataset on HuggingFace’s Dataset Hub with minimal friction. We’ll use Microsoft’s Cats and Dogs dataset as an example.

When we're done, you'll be able to load the dataset with just a couple of lines, like this:
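from datasets import load_dataset

ds = load_dataset('nateraw/cats-and-dogs')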

Source Code:

Data Preparation

Downloading the Dataset Locally

First, we’ll download the dataset from Microsoft and unzip it.

curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip
unzip -q kagglecatsanddogs_3367a.zip

This will create a new directory called PetImages containing two folders, Cat/ and Dog/, each holding images of the animal it's named after.
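The layout looks roughly like this (the numbered file names are illustrative):

PetImages/
├── Cat/
│   ├── 0.jpg
│   ├── 1.jpg
│   └── ...
└── Dog/
    ├── 0.jpg
    ├── 1.jpg
    └── ...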

Fixing Corrupted Images

Unfortunately, the dataset comes with some corrupted images. These will really screw you up when training, so we’ll need to get rid of them. To do so, we’ll use a script modified from Keras’s documentation. You’ll need to pip install tensorflow-cpu (or any version of TensorFlow) to run it.

from pathlib import Path

import tensorflow as tf

root = Path('./PetImages')
num_skipped = 0
for folder_name in ("Cat", "Dog"):

    # Loop over all image files
    for fpath in (root / folder_name).glob('*'):
        with fpath.open('rb') as f:
            # If is_jfif, the image is fine. If not, we delete it.
            is_jfif = tf.compat.as_bytes('JFIF') in f.peek(10)
            if is_jfif:
                continue

        num_skipped += 1

        # Deletes the file
        fpath.unlink()

print("Deleted %d images" % num_skipped)

Exporting to Bytes

We can actually take the script above and modify it slightly to export our images as bytes. Representing the images as bytes instead of files makes them play nice with pyarrow, and in turn with HuggingFace’s datasets package.

Here, we basically do the same thing, except when we come across valid images, we store them in a list of dicts called examples. Each dict will look like:

{'img_bytes': <the bytes>, 'labels': <the string label>}

After gathering all the examples, we simply save them as a pickle file, train.pt. This will be the source data we upload to the hub.

import pickle
from pathlib import Path

import tensorflow as tf

root = Path('./PetImages')
num_skipped = 0
examples = []
for folder_name in ("Cat", "Dog"):
    for fpath in (root / folder_name).glob('*'):
        with fpath.open('rb') as f:
            is_jfif = tf.compat.as_bytes('JFIF') in f.peek(10)
            if is_jfif:
                examples.append({'img_bytes': f.read(), 'labels': folder_name.lower()})
                continue
        
        num_skipped += 1
        fpath.unlink()

print("Deleted %d images" % num_skipped)

with Path("train.pt").open('wb') as f:
    pickle.dump(examples, f)

🤓 The preparation script can also be found in my repo

Sharing our Dataset

Repo Creation + Uploading Source Data

Now, we can share our dataset with the world! Let’s see how that works…

First, create a repo on HuggingFace’s hub. You’ll need an account to do so, so go sign up if you haven’t already! You’ll also need git-lfs, which can be installed from here.

huggingface-cli repo create cats-and-dogs --type dataset

Then, cd into that repo and make sure git lfs is enabled.

cd cats-and-dogs/
git lfs install

After that, we can copy our train.pt file into that repo.

🚨 Make sure to git lfs track your source file before adding/committing it 🚨

cp ../train.pt .
git lfs track "*.pt"

Finally, we can add, commit, and push the .gitattributes file that stores the tracking info, along with the train.pt file we created earlier. We’ll push the data before writing the required loader script, as it’s easier to develop that way.

git add .gitattributes
git add *.pt
git commit -m "initial commit"
git push -u origin main

Writing the Builder Script

Now we can write the script that’ll be used to load our dataset. If you’re impatient, it can be found here. It’s easy! We’ll first create a file called cats-and-dogs.py in our repo. Then, we’ll set up some imports and global variables that will be used later. They’re pretty self-explanatory, so I’ll just let you read them.

import pickle
from pathlib import Path
from typing import List

import datasets

logger = datasets.logging.get_logger(__name__)

_HOMEPAGE = "https://www.microsoft.com/en-us/download/details.aspx?id=54765"
_URL = "https://huggingface.co/datasets/nateraw/cats-and-dogs/resolve/main/"
_URLS = {
    "train": _URL + "train.pt",
}
_DESCRIPTION = "A large set of images of cats and dogs. There are 1738 corrupted images that are dropped."
_CITATION = """\
@Inproceedings (Conference){asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization,
    author = {Elson, Jeremy and Douceur, John (JD) and Howell, Jon and Saul, Jared},
    title = {Asirra: A CAPTCHA that Exploits Interest-Aligned Manual Image Categorization},
    booktitle = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
    year = {2007},
    month = {October},
    publisher = {Association for Computing Machinery, Inc.},
    url = {https://www.microsoft.com/en-us/research/publication/asirra-a-captcha-that-exploits-interest-aligned-manual-image-categorization/},
    edition = {Proceedings of 14th ACM Conference on Computer and Communications Security (CCS)},
}
"""

Now, we’ll create the class itself, inheriting from datasets.GeneratorBasedBuilder. We only need to define 3 simple functions.

1. _info

This sets up some metadata and the schema of your data. The key for our dataset is that we use datasets.Value('binary') to represent the byte data we created before.

Also, notice that the names of the features are the same as what we exported before. This will make things easier later.

class CatsAndDogs(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=datasets.Features(
                {
                    "img_bytes": datasets.Value("binary"),
                    "labels": datasets.features.ClassLabel(names=["cat", "dog"]),
                }
            ),
            supervised_keys=("img_bytes", "labels"),
            homepage=_HOMEPAGE,
            citation=_CITATION,
        )

2. _split_generators

This function is used to define steps for downloading + preparing dataset splits. We won’t be splitting the dataset, as there aren’t official splits distributed by Microsoft.

class CatsAndDogs(datasets.GeneratorBasedBuilder):

    # ...

    def _split_generators(self, dl_manager: datasets.DownloadManager) -> List[datasets.SplitGenerator]:
        downloaded_files = dl_manager.download_and_extract(_URLS)
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": downloaded_files["train"]}),
        ]

Basically, the dl_manager takes the _URLS dict and downloads the files it points to. It returns downloaded_files, a dict mapping _URLS’s keys to the local filepaths the corresponding files were downloaded to.

We pass gen_kwargs={'filepath': downloaded_files['train']} to the SplitGenerator to specify the signature of the next function - _generate_examples. Whatever you define in the dict here will be what is passed to that function.
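To make that concrete, here’s roughly what the pieces look like for our dataset (the cache path is illustrative; the real one depends on your machine):

# _URLS == {'train': 'https://huggingface.co/datasets/nateraw/cats-and-dogs/resolve/main/train.pt'}
# downloaded_files == {'train': '/root/.cache/huggingface/datasets/downloads/<hash>'}
# gen_kwargs == {'filepath': downloaded_files['train']}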

3. _generate_examples

We define _generate_examples with a signature matching the dict we defined in _split_generators. We’ll get the filepath of our train.pt file, which we can then unpickle before generating examples. Since we used the same keys in examples’s inner dicts as the features we defined in _info, we can simply yield each item. We also yield a string identifier for each example, using the iteration index.

class CatsAndDogs(datasets.GeneratorBasedBuilder):

    # ...

    def _generate_examples(self, filepath):
        """This function returns the examples in the raw (text) form."""
        logger.info("generating examples from = %s", filepath)

        with Path(filepath).open("rb") as f:
            examples = pickle.load(f)

        for i, ex in enumerate(examples):
            yield str(i), ex

Running it Locally

We can test that our script works by running it locally before uploading it to our repo. To do that, we just pass the name of the file, cats-and-dogs.py, instead of an HF Hub identifier. This will download and load the data files we already uploaded.

from datasets import load_dataset

ds = load_dataset('cats-and-dogs.py')

Once you see that work successfully, we can push it up to the repo:

git add cats-and-dogs.py
git commit -m "add builder"
git push

Using it in PyTorch

Converting to PIL from Bytes

Now that our dataset is public, we can load it using datasets.load_dataset, passing it '<your-username>/cats-and-dogs'.

from datasets import load_dataset

ds = load_dataset('nateraw/cats-and-dogs')

If you take a look at ds['train'][0], you’ll notice we still have image bytes, not an actual image. To fix that, we simply convert the bytes back to PIL.Image. We’ll use ds.with_transform instead of map because it applies the conversion lazily, so there’s no up-front processing pass. It might be more efficient to use map instead, though (I haven’t tested it 😅)

from io import BytesIO
from PIL import Image

from datasets import load_dataset


def bytes_to_pil(example_batch):
    example_batch['img'] = [
        Image.open(BytesIO(b)) for b in example_batch.pop('img_bytes')
    ]
    return example_batch

ds = load_dataset('nateraw/cats-and-dogs')
ds = ds.with_transform(bytes_to_pil)

Now, when we look at an example, we see we have a friendly PIL.Image type instead of bytes:

{'img': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x448 at 0x7FACF1411F50>,
 'labels': 0}
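Since labels is a ClassLabel feature, you can also map that integer back to its string name. A quick sketch, assuming ds is the dataset we loaded above:

labels = ds['train'].features['labels']
print(labels.int2str(0))
# >>> 'cat'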

Usage Example

Here’s a more complete example showing how to convert your data to tensors and feed it to a torch.utils.data.DataLoader.

from io import BytesIO
from PIL import Image
from torchvision.transforms import (
    Compose,
    ToTensor,
    RandomResizedCrop,
    RandomHorizontalFlip,
    Normalize,
)
from torch.utils.data import DataLoader
import datasets
 
 
_train_transforms = Compose([
    RandomResizedCrop(224),
    RandomHorizontalFlip(),
    ToTensor(),
    Normalize([0.500, 0.500, 0.500], [0.500, 0.500, 0.500])
])
 
def bytes_to_pil(b):
    return Image.open(BytesIO(b))
 
def apply_train_transforms(example_batch):
    example_batch['pixel_values'] = [
        _train_transforms(bytes_to_pil(b)) for b in example_batch.pop('img_bytes')
    ]
    return example_batch

data = datasets.load_dataset("nateraw/cats-and-dogs")
train_loader = DataLoader(
    data['train'].with_transform(apply_train_transforms),
    batch_size=32,
    num_workers=2,
)

We can take a look at a single batch to make sure it works:

batch = next(iter(train_loader))
print(batch['pixel_values'].shape, batch['labels'].shape)
# >>> torch.Size([32, 3, 224, 224]) torch.Size([32])

Now you’ve got a dataloader ready to train any image classification model in PyTorch! 🚀

for batch in train_loader:
    out = your_model(**batch)
    # ...

Conclusion

This post showed you how to use HuggingFace’s datasets package to upload an image classification dataset to the HuggingFace Hub. The same strategy can be used to upload video, audio, segmentation masks, and more.

If this tutorial helped you out, feel free to follow me on Twitter or GitHub to stay up to date on my next tutorials. If you give this a try, have any questions, or have a suggestion for future tutorials you’d like to see, shoot me a message on Twitter.

Source Code:

Cheers 🍻