Image Classification From Scratch
Author: fchollet
Further Explanation: Huighlet Eyram Zowonu
Date created: 2020/04/27
Last Modified: 2023/11/09
Description: Training an image classifier from scratch on Kaggle Cats and Dogs dataset.
Imagine you have a robot friend who wants to learn how to recognize pictures of cats and dogs. It’s
never seen a cat or dog before, so it doesn’t know what they look like! Today, we’re going to teach it
by showing it lots of pictures and helping it practise until it gets really good at telling them apart.
We have a big photo album with lots of pictures of cats and dogs. This photo album is called a
dataset, and it’s just like a photo book where each picture is labelled. One side has cats, and the other
side has dogs. This specific album is called the Cats vs. Dogs Dataset, and it’s going to help our robot
learn to tell the difference between these two animals.
Now, we’re starting from scratch, which means our robot has no idea what a cat or dog is yet. Some
robots already know how to recognize objects like cars, trees, or people, but our robot doesn’t have
that experience—it’s a blank slate. So, we’ll be teaching it everything it needs to know about cats and
dogs, one picture at a time.
To make things easier for our robot, we’ll use a special tool called image_dataset_from_directory.
Imagine you have folders labelled “Cats” and “Dogs,” and each folder is filled with pictures. This tool
will gather all the photos from these folders and make sure the robot knows which photos are cats and
which ones are dogs. It’s like helping your robot friend keep its photo album nice and organised so it
doesn’t get confused!
Just like we need to get ready before doing something new, we need to prepare these photos for the
robot. This preparation is called preprocessing. Here’s how we do it:
1. Standardising the Images: First, we make sure all the photos are the same size and colour
format. It’s like cutting out all the photos so they’re the same shape—this makes it much
easier for the robot to look at them without getting distracted by different sizes.
2. Adding Extra Practice Photos (Data Augmentation): If we don’t have enough pictures, we
can make some more by creating variations of each photo. We might rotate or flip a picture of
a cat or dog so the robot sees it from different angles. This is like showing the robot the same
dog but sometimes upside down or turned to the side. The more it practises, the better it gets
at recognizing them from different viewpoints!
Once all the photos are organised and prepared, we can start the real learning! Here’s what happens
next:
● Each time the robot sees a photo, it tries to guess if it’s a cat or a dog.
● At first, it might get a lot of them wrong, but that’s okay! Every time it makes a mistake, it
learns a little more.
● After looking at hundreds of pictures, the robot starts getting better and better at telling cats
from dogs.
Soon, it can look at a new picture, one it’s never seen before, and say with confidence, “This is a cat!”
or “This is a dog!” It becomes a Cat-and-Dog Expert, ready to recognize furry friends in photos!
The os module helps us work with files and folders on our computer. Think of it as a helper
that knows where to find all of our cat and dog photos. We’ll use os to help our computer look
in the right places when it needs to find the images.
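For example, os.path.join builds a full path out of folder and file names; the cleanup code later in this guide uses it to reach every photo:
folder_path = os.path.join("PetImages", "Cat")  # the path to the cat photos, "PetImages/Cat"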
import numpy as np
numpy (imported as np for short) is a powerful tool for working with numbers. Computers see
pictures as grids of numbers, where each number represents the colour of a tiny dot (or pixel)
in the picture. numpy helps us handle all those numbers easily, which is important when we’re
showing lots of photos to our computer friend.
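As a rough illustration (the 180x180 size simply matches the image size used later in this guide), a colour photo is a 3-D grid of numbers that numpy can hold and manipulate:
blank_image = np.zeros((180, 180, 3), dtype="uint8")  # 180x180 pixels, 3 colour channels, all set to 0 (black)
print(blank_image.shape)  # (180, 180, 3)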
keras is like a recipe book for building "neural networks," which are special types of
computer models that can learn by example. It provides us with all the ingredients we need to
create a model that can recognize images. The layers we’re importing are like building blocks,
allowing us to stack different pieces to build our image-recognizing robot.
tensorflow is a powerful library that helps us create and train our models. Here, we’re
importing a tool called data from TensorFlow (and calling it tf_data for short). This tool will
help us load our images in a way that the computer can easily process. Think of it as a
conveyor belt that brings each image to our model so it can learn from them one by one.
Setup
AC1
import os
import numpy as np
import keras
from keras import layers  # the building blocks for our model
from tensorflow import data as tf_data  # the data pipeline tool described above
import matplotlib.pyplot as plt  # used later to display images
The curl command is like asking the internet for a specific file. Here, it’s downloading a big
ZIP file (around 786MB) that contains lots of cat and dog images.
The -O flag just means we want to keep the original file name (kagglecatsanddogs_5340.zip).
Note: The ! at the beginning tells the computer this is a command for the shell (not Python),
which is why we can download the file directly.
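For reference, the download command in the original Keras example looks like the line below (the URL is the one used in that example and may change over time):
!curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip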
!unzip -q kagglecatsanddogs_5340.zip
Now that we’ve downloaded the ZIP file, it’s like a packed suitcase that we need to unpack to
see everything inside.
The unzip command does this unpacking for us. The -q flag (quiet) keeps the unzipping
process from showing too much information on the screen.
This command lets us take a quick look at what’s inside our current folder, so we can confirm
that our images have been unpacked correctly.
AC3
!unzip -q kagglecatsanddogs_5340.zip
AC4
!ls PetImages
Now that we’ve unzipped the file, we have a folder called PetImages. Inside this folder, we’ll find two
other folders: Cat and Dog. Each of these folders contains images of cats and dogs, respectively.
● The !ls command lists all the files and folders in a specific location. Here, PetImages is the
folder we’re checking, so this command will show us its contents.
● By running this command, we’ll see two subfolders named Cat and Dog.
Each subfolder contains a bunch of image files (like .jpg files) that we’ll use to teach our model. We
now know where all of our cat and dog photos are.
In this part, we’re cleaning up our cat and dog image folders to make sure every image is usable.
Sometimes, images get corrupted, meaning they’re broken or unreadable by our model. Here’s how
this code finds and removes these corrupted images:
try:
    fobj = open(fpath, "rb")  # Open image file in binary mode (as bytes)
    is_jfif = b"JFIF" in fobj.peek(10)  # Valid JPEGs contain "JFIF" in their header
finally:
    fobj.close()
1. Counting Skipped Files: num_skipped = 0 starts a counter to keep track of how many
corrupted images we delete.
2. Looping Through Folders: The for folder_name in ("Cat", "Dog") loop goes through the Cat
and Dog folders in PetImages.
3. Checking Each Image File:
○ We create folder_path, the path to each image folder, and then fpath, the path to each
image file.
○ We open each file as fobj in "rb" (read-binary) mode, allowing us to look at the raw
data in the file.
Detecting Corruption
The trick here is to check if the file contains a special pattern of characters, JFIF, which usually
appears in valid JPEG images. If it’s missing, the file might be corrupted.
● is_jfif = b"JFIF" in fobj.peek(10) checks if JFIF appears in the first 10 bytes of the file. If it
doesn’t, the image may be broken.
● If an image doesn’t have JFIF, we add one to our num_skipped counter and delete the file.
This cleaning step is crucial to ensure our model won’t have trouble when trying to "look at" each
picture. After running this, our folders will only contain healthy, readable images, ready for training!
AC6
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = b"JFIF" in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            os.remove(fpath)  # delete the corrupted image
Now that our cat and dog images are cleaned up, we’re ready to turn them into a dataset that our
model can understand and learn from. Here’s what each line does in this code to make that happen:
image_size = (180, 180)
batch_size = 128
1. Setting Image Size: We define image_size = (180, 180). This will resize every image to 180
pixels by 180 pixels, making each picture the same size. This helps the model process images
consistently, no matter the original dimensions.
2. Batch Size: We define batch_size = 128, which means our images will be grouped into
batches of 128. Processing images in batches helps speed up the training and reduces the
computer's memory usage.
"PetImages",
validation_split=0.2,
subset="both",
seed=1337,
image_size=image_size,
batch_size=batch_size,
AC7
image_size = (180, 180)
batch_size = 128
Resulting Datasets
● train_ds: This is our training dataset, containing 80% of the images. Our model will use these
images to learn how to identify cats and dogs.
● val_ds: This is our validation dataset, containing 20% of the images. The model won’t see
these images during training, so we’ll use them to test how well it can recognize cats and dogs
in new images.
Now that we have our dataset ready, let’s take a look at some of the images in it! This code snippet
will display the first 9 images in the training dataset, giving us a preview of what our model will be
learning from.
Visualising the First 9 Images in the Training Dataset
plt.figure(figsize=(10, 10))
● We’re creating a figure (or canvas) that’s 10 inches by 10 inches in size. This will be the grid
where we display our images.
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(np.array(images[i]).astype("uint8"))
        plt.title(int(labels[i]))
        plt.axis("off")
AC8
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(np.array(images[i]).astype("uint8"))
        plt.title(int(labels[i]))
        plt.axis("off")
After running this code, we’ll see a 3x3 grid of the first 9 images from the training dataset. Each
image will have a title indicating whether it’s a 0 (cat) or a 1 (dog). This visualisation step helps us
confirm that our dataset is loaded correctly and gives us an idea of what the model will be "seeing"
during training!
To help our model learn better, especially when we have a small dataset, we can use data
augmentation. This means creating slightly modified versions of each image to make our dataset
appear larger and more varied. By flipping, rotating, or otherwise adjusting images, we expose the
model to different "views" of the data, which helps prevent it from "memorizing" specific images—a
problem called overfitting.
data_augmentation_layers = [
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
]
These small, random changes create new variations of each image without needing additional data.
def data_augmentation(images):
    for layer in data_augmentation_layers:
        images = layer(images)
    return images
To see what our augmented images look like, we can apply data_augmentation to the first few images
in the training dataset:
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
    for i in range(9):
        augmented_images = data_augmentation(images)
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(np.array(augmented_images[0]).astype("uint8"))
        plt.axis("off")
1. Setting Up the Figure: plt.figure(figsize=(10, 10)) creates a blank 10x10 grid for our images.
2. Looping Through Images:
○ for images, _ in train_ds.take(1): retrieves one batch of images.
○ for i in range(9): selects the first 9 images from that batch.
3. Applying Data Augmentation:
○ augmented_images = data_augmentation(images) applies our flip and rotation
transformations to the batch.
4. Displaying the Augmented Image:
○ Each transformed image is displayed in a 3x3 grid using plt.subplot(3, 3, i + 1).
○ plt.imshow(np.array(augmented_images[0]).astype("uint8")) converts the image to a
displayable format.
○ plt.axis("off") hides the axes to keep the grid tidy.
AC9
data_augmentation_layers = [
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
]

def data_augmentation(images):
    for layer in data_augmentation_layers:
        images = layer(images)
    return images
AC10
plt.figure(figsize=(10, 10))
for images, _ in train_ds.take(1):
    for i in range(9):
        augmented_images = data_augmentation(images)
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(np.array(augmented_images[0]).astype("uint8"))
        plt.axis("off")
After running this code, you'll see 9 images with small random transformations. These transformed
versions help the model learn to recognize cats and dogs in slightly different orientations, which will
make it better at recognizing them in real-world situations!
To help our model understand the image data better, we’ll standardize our images to make sure the
values are in a range that’s easier for the model to work with. We’ll also look at two different ways to
apply data augmentation and standardization.
Currently, each image has pixel values between 0 and 255 for each color channel (red, green, and
blue). Neural networks perform better when input values are smaller, so we’ll scale these values to a
range of 0 to 1. To do this, we’ll use a special layer in Keras called Rescaling.
● Rescaling Layer: Divides each pixel value by 255 so it falls between 0 and 1. This will be
one of the first steps in our model, making the data suitable for learning.
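As a quick sketch (using the same names that appear in the model code later in this guide), the Rescaling layer is applied like any other Keras layer:
x = layers.Rescaling(1.0 / 255)(inputs)  # pixel values are now between 0 and 1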
Two Ways to Preprocess the Data
When it comes to combining data augmentation and rescaling, we have two choices:
Option 1: Make preprocessing part of the model. In this option, we include data augmentation and rescaling as layers within the model itself:
AC11
inputs = keras.Input(shape=input_shape)
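Continuing from that input layer, Option 1 looks roughly like the sketch below, which follows the pattern of the original Keras example (the remaining model layers are elided):
x = data_augmentation(inputs)       # augmentation runs inside the model
x = layers.Rescaling(1.0 / 255)(x)  # rescaling runs inside the model
# ... the rest of the model layers follow here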
How it Works:
● Data augmentation happens on-device (i.e., on the GPU if available), alongside the rest of the
model.
● The transformations are applied only during training (when calling fit()), not when testing or
predicting (evaluate() or predict()).
This option is great if you have a GPU because it lets data augmentation happen on the same device,
benefiting from GPU acceleration.
Option 2: Apply preprocessing to the dataset. Here, we apply data augmentation separately to the dataset itself, rather than including it in the model.
AC12
augmented_train_ds = train_ds.map(
    lambda x, y: (data_augmentation(x), y))
This option is ideal if you’re training on a CPU: the augmentation runs on the CPU asynchronously and is buffered before it reaches the model, so it doesn’t hold up training.
To make our model training faster and more efficient, we can configure the dataset for better
performance. This involves two main steps:
1. Applying Data Augmentation: We add slight modifications to each image, like flipping or
rotating them. By applying data_augmentation to our training images, we’re giving the model
more varied examples to learn from, which can improve its generalization.
2. Using Prefetching: Prefetching is like setting up the data in advance so that it’s ready exactly
when the model needs it, reducing delays during training. It helps us pull data from the disk
(storage) without slowing down the training process.
AC13
train_ds = train_ds.map(
    lambda img, label: (data_augmentation(img), label),
    num_parallel_calls=tf_data.AUTOTUNE,
)
train_ds = train_ds.prefetch(tf_data.AUTOTUNE)
val_ds = val_ds.prefetch(tf_data.AUTOTUNE)
How This Works
By augmenting and prefetching data, we keep the model constantly supplied with images, maximizing
GPU utilization and speeding up training. This approach makes sure that data loading doesn’t become
a bottleneck.
To build our image classification model, we're creating a smaller version of the well-known Xception
architecture. This model architecture uses special layers and techniques to help it learn patterns in
images efficiently.
1. Data Standardization: We start with a Rescaling layer that scales image pixel values to the
[0, 1] range. This is important for helping the model learn effectively, as it makes pixel values
smaller and more manageable for the neural network.
2. Convolutional Blocks: Convolutional layers are the heart of a CNN model. They help the
model learn spatial patterns in images, like edges or textures. In this model, we use a special
type of convolution called Separable Convolution to make it both fast and efficient. Each
block of layers applies these convolutions, followed by:
○ Batch Normalization: Stabilizes the learning process by normalizing the outputs of
convolutional layers.
○ Activation (ReLU): Adds non-linearity, enabling the model to learn more complex
patterns.
○ Max Pooling: Reduces the size of the data, helping the model focus on the most
important features.
3. Residual Connections: Residual connections (also called "skip connections") add the
previous block’s output back into the current layer's output. This helps preserve information
across layers and makes training faster and more stable.
4. Global Pooling & Dropout: After all the convolutional layers, we use a Global Average
Pooling layer to compress each feature map into a single value. This helps reduce the total
number of parameters, making the model faster and more efficient. We also use a Dropout
layer to prevent overfitting by randomly ignoring parts of the model during training.
5. Output Layer: Finally, a Dense layer gives the output for classification. Since we’re working
with two classes (Cats vs. Dogs), we use a single unit in this layer to get our classification
result.
The make_model function constructs the architecture of the CNN. It takes two arguments:
● input_shape: Specifies the shape of the input images, e.g., (180, 180, 3) for
180x180 RGB images.
● num_classes: Number of output classes for classification. Here, it’s set to 2 for a binary
classification problem (cats vs. dogs).
2. Input Layer
inputs = keras.Input(shape=input_shape)
● This line creates the input layer, where we define the expected shape of the input data.
● inputs will be the entry point of data into our model.
3. Image Rescaling
x = layers.Rescaling(1.0 / 255)(inputs)
● Rescaling normalizes the pixel values to be between 0 and 1 (from the original range of 0 to
255).
● This is helpful for the model to learn efficiently, as it standardizes the inputs.
The entry block then applies a first convolution before the separable-convolution blocks, followed by batch normalization and a ReLU activation (as in the original example):
x = layers.Conv2D(128, 3, strides=2, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
The code then enters a loop that repeats a pattern three times with increasing filter sizes (256, 512,
728).
for size in [256, 512, 728]:
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)

    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(size, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
Residual Connection
previous_block_activation = x
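Inside the loop (see the full AC14 listing below), the saved activation is then projected with a small convolution and added back at the end of each block. A sketch of that step, following the original Keras example:
residual = layers.Conv2D(size, 1, strides=2, padding="same")(
    previous_block_activation
)
x = layers.add([x, residual])  # add the residual back in
previous_block_activation = x  # set aside for the next block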
x = layers.SeparableConv2D(1024, 3, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.GlobalAveragePooling2D()(x)
● SeparableConv2D(1024, 3): This is the final separable convolution layer, with 1024
filters to extract detailed features.
● GlobalAveragePooling2D: This layer computes the average of each feature map,
resulting in a single number per feature map, reducing the data size significantly and helping
the model generalize well.
x = layers.Dropout(0.25)(x)
● Dropout: Dropout randomly sets 25% of the activations to zero, preventing the model from
overfitting by making it more robust.
8. Output Layer
● Dense Layer: This is the final layer, with units = 1 since we have two classes (cats and
dogs). The activation=None setting outputs raw logits instead of applying a function
like sigmoid or softmax.
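Putting that into code (matching the make_model function in AC14 below), the output layer and the finished model look like this:
outputs = layers.Dense(units, activation=None)(x)  # units = 1 for cats vs. dogs; raw logits
return keras.Model(inputs, outputs)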
9. Model Creation
● Model Instantiation: Here, we create the model using our defined function make_model.
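The instantiation itself is one call; the input shape is the image_size defined earlier plus 3 colour channels, as in the original example:
model = make_model(input_shape=image_size + (3,), num_classes=2)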
keras.utils.plot_model(model, show_shapes=True)
● This line generates a visual representation of the model’s structure, displaying the shapes of
each layer’s output.
AC14
from keras import layers

def make_model(input_shape, num_classes):
    inputs = keras.Input(shape=input_shape)
    # Entry block
    x = layers.Rescaling(1.0 / 255)(inputs)
    x = layers.Conv2D(128, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    previous_block_activation = x  # Set aside residual
    for size in [256, 512, 728]:
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.SeparableConv2D(size, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
        # Project the residual and add it back
        residual = layers.Conv2D(size, 1, strides=2, padding="same")(previous_block_activation)
        x = layers.add([x, residual])
        previous_block_activation = x  # Set aside the next residual
    x = layers.SeparableConv2D(1024, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    units = 1 if num_classes == 2 else num_classes
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(units, activation=None)(x)  # raw logits
    return keras.Model(inputs, outputs)

model = make_model(input_shape=image_size + (3,), num_classes=2)
keras.utils.plot_model(model, show_shapes=True)
This model is now ready to be trained on our Cats vs. Dogs dataset!
epochs = 25
callbacks = [
    keras.callbacks.ModelCheckpoint("save_at_{epoch}.keras"),
]
● epochs = 25: The model will be trained for 25 epochs, meaning the model will go through
the entire training dataset 25 times.
● callbacks: This defines a list of callback functions that will be executed during training.
○ ModelCheckpoint: This callback saves the model weights after each epoch. The
model is saved in a file with the format save_at_{epoch}.keras, where
{epoch} is replaced by the epoch number. This is useful in case training is
interrupted and you want to resume from the last saved checkpoint.
model.compile(
    optimizer=keras.optimizers.Adam(3e-4),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy(name="acc")],
)
● optimizer=keras.optimizers.Adam(3e-4): Adam is the optimisation algorithm that updates the model’s weights during training; 3e-4 is its learning rate.
● loss=keras.losses.BinaryCrossentropy(from_logits=True): the loss function for a two-class problem; from_logits=True tells Keras that the model outputs raw logits rather than probabilities.
● metrics=[keras.metrics.BinaryAccuracy(name="acc")]: tracks how often the model’s guesses are correct, reported as "acc".
model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_ds,
)
● train_ds: This is the training dataset that yields batches of augmented and rescaled
images.
● epochs=epochs: The number of epochs the model will train for (25 epochs as set earlier).
● callbacks=callbacks: The ModelCheckpoint callback is passed here so the model
weights will be saved after every epoch.
● validation_data=val_ds: This specifies the validation dataset that will be used to
evaluate the model at the end of each epoch. It helps monitor how well the model generalizes
to unseen data during training.
AC15
epochs = 25

callbacks = [
    keras.callbacks.ModelCheckpoint("save_at_{epoch}.keras"),
]
model.compile(
    optimizer=keras.optimizers.Adam(3e-4),
    loss=keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=[keras.metrics.BinaryAccuracy(name="acc")],
)
model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
    validation_data=val_ds,
)
Run inference on new data
Note that data augmentation and dropout are inactive at inference time.
img = keras.utils.load_img("PetImages/Cat/6779.jpg", target_size=image_size)
plt.imshow(img)
img_array = keras.utils.img_to_array(img)
img_array = keras.ops.expand_dims(img_array, 0)  # add a batch axis so predict() receives a batch of one image
3. Making Predictions
predictions = model.predict(img_array)
score = float(keras.ops.sigmoid(predictions[0][0]))
img = keras.utils.load_img("PetImages/Cat/6779.jpg", target_size=image_size)
plt.imshow(img)

img_array = keras.utils.img_to_array(img)
img_array = keras.ops.expand_dims(img_array, 0)  # create a batch axis

predictions = model.predict(img_array)
score = float(keras.ops.sigmoid(predictions[0][0]))
print(f"This image is {100 * (1 - score):.2f}% cat and {100 * score:.2f}% dog.")
And that’s how we taught a computer to tell cats and dogs apart! Now, anytime it sees a picture of a
cat or dog, it can shout out its answer. Thanks to our careful preparation and practice, our robot has
learned to see in a whole new way. Who knows what it’ll learn to recognize next!