Creating custom datasets

Making updates to the CNN dataset

After a week of testing and tweaking, my Convolutional Neural Network project is finished. I moved away from the CIFAR10 dataset and began scraping the web with DuckDuckGo for celebrity pictures.

Although the model works, the dataset is fairly homogeneous. Many of the chosen celebrities, from Beyoncé to Nicki Minaj, look similar, especially once their pictures are scaled down to 32x32px.

Moving away from celebrities

Since many of these celebrity pictures looked alike, I decided to switch from celebrities to dog breeds, to make sure the image-folder-based model could actually be trained accurately.

Problems regarding image scraping

Image scraping is a genuinely difficult problem, as most search engines do not condone the practice. The first few iterations of my web scraper used Google, and none of them worked, due to Google's constantly changing APIs and the legal implications of scraping Google results.

I decided to move away from Google and found a way to extract images from DuckDuckGo. While my download speeds were being throttled, I was still able to scrape images fairly quickly (~5 images per second).
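
To give a sense of what the download step looks like, here is a minimal sketch. It is not my exact scraper: it assumes you already have a list of image URLs (however they were gathered from DuckDuckGo) and simply saves them into a per-class folder.

import os
import requests

def download_images(urls, out_dir):
    # Save each URL's contents into out_dir as a numbered .jpg file
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            with open(os.path.join(out_dir, f"{i}.jpg"), "wb") as f:
                f.write(resp.content)
        except requests.RequestException as e:
            print(f"Skipping {url}: {e}")

# Example: download_images(urls_for_breed, "dataset/golden_retriever")

A simplified sketch of the download step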

After my images were successfully scraped, I quickly learned about the principles behind designing a dataset.

Using the scraped images in PyTorch

This being my first time developing a dataset, I had no idea what the principles behind the practice were. The first time I loaded my images in PyTorch, I knew that I needed to resize them all to the same size and normalize the data to a common range.

from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize(32),               # scale the shorter side to 32px
    transforms.CenterCrop(32),           # crop to a 32x32 square
    transforms.RandomHorizontalFlip(),   # light augmentation
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

How I preprocessed my images

Although I knew I needed to normalize my data, I had no idea that I also needed to split the training and testing data into two separate folders. At this point, I was training and testing on the same data, unknowingly producing a model with misleadingly high accuracy.

I decided to split my data 160:40, so that each class had 160 training images and 40 testing images, ensuring that my model was not being tested on the data it had trained on.
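
A rough sketch of how such a per-class split can be done with the standard library; the dataset/, train/, and test/ folder names are placeholders for whatever layout you use, and copying (rather than moving) the files is just one way to do it.

import os
import random
import shutil

def split_class(src_dir, train_dir, test_dir, n_train=160, n_test=40):
    # Copy n_train images into train_dir and n_test into test_dir
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    files = os.listdir(src_dir)
    random.shuffle(files)  # avoid any ordering bias from the scraper

    for f in files[:n_train]:
        shutil.copy(os.path.join(src_dir, f), train_dir)
    for f in files[n_train:n_train + n_test]:
        shutil.copy(os.path.join(src_dir, f), test_dir)

# Example: split_class("dataset/beagle", "train/beagle", "test/beagle")

A sketch of the 160:40 split for one class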

After splitting the two groups, there was a noticeable 10-30 second bottleneck before training would begin. I decided to move the resizing and cropping from the PyTorch transform pipeline to the web scraper, so that images are resized and cropped as they are downloaded.

transform = transforms.Compose([
    # transforms.Resize(32),      # now done by the scraper at download time
    # transforms.CenterCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

How I preprocess my images now

import os
from PIL import Image

def check_images(folder):
    for file in os.listdir(folder):
        img_path = os.path.join(folder, file)
        img = None
        try:
            # First pass: make sure Pillow can read the file at all
            img = Image.open(img_path)
            img.verify()

            # verify() leaves the file unusable, so reopen it for processing
            img = Image.open(img_path)
            img = img.resize((32, 32), resample=Image.Resampling.BILINEAR)

            if img.mode != "RGB":
                img = img.convert("RGB")

            img.save(img_path, "JPEG")

            img.close()

        except FileNotFoundError:
            print(f"File not found: {img_path}")
            continue
        except Exception as e:
            print(f"\033[91mError processing {img_path}: {e}\033[0m")
            print(f"\tRemoved corrupted image: {img_path}")

            if img is not None:
                img.close()

            os.remove(img_path)
            continue

Where images were verified and resized

While moving this resampling over to the CPU side of the pipeline, I realized that the web scraper was downloading unreliable files, from 404 error pages to SVGs. I filtered these files out of the dataset folders by verifying that Pillow could read them, as seen above.

Overall, the overhead went from about 30 seconds before training to about 7-10 seconds, showing that the move away from load-time, tensor-based preprocessing was beneficial.
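
For reference, this is roughly how the resulting train/ and test/ folders are loaded in PyTorch: ImageFolder treats each class subfolder as a label, and the lighter transform shown earlier is applied on the fly. The folder names and batch size here are placeholders, not my exact settings.

from torch.utils.data import DataLoader
from torchvision import datasets

# Each subfolder of train/ and test/ is treated as one class
train_set = datasets.ImageFolder("train", transform=transform)
test_set = datasets.ImageFolder("test", transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)

Loading the split folders with ImageFolder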

What I Learned

  • Datasets should be split unevenly: the training set needs to be larger than the testing set
  • Web scraping is a risky, unreliable practice
  • Avoid homogeneous data when designing a dataset
This post is licensed under CC BY 4.0 by the author.