Project Report - Intro to AI

The document presents a project report on digit recognition using artificial intelligence, detailing the problem, database, libraries used, data preprocessing, algorithms, difficulties faced, and future developments. The project utilizes the MNIST dataset for training and testing, implementing various algorithms like K-Nearest Neighbours, Multi-layer Perceptron, and Convolutional Neural Networks. The report includes insights on the performance of these algorithms and the challenges encountered during implementation.

Uploaded by

lmno1dienbien

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND
COMMUNICATION TECHNOLOGY

PROJECT REPORT

DIGIT RECOGNITION

Group 2 – Introduction to AI:


Pham Thi Ngoc Anh – 20235473
Tran Mai Duong – 20235493
Le Quang Huy – 20235503
Vu Thuy Linh – 20235521
Nguyen Phan Thang – 20225529
Contents
1 Introduction
1.1 About the problem
1.2 Database
1.3 Packages/ Libraries used
1.3.1 NumPy
1.3.2 Pandas
1.3.3 Matplotlib – Seaborn
1.3.4 Scikit-learn
1.3.5 OpenCV
1.3.6 TensorFlow
1.3.7 Pillow
1.3.8 Tkinter

2 Preprocessing Data
3 Algorithm
3.1 K-Nearest Neighbours (KNN)
3.2 Multi-layer Perceptron (MLP) / Artificial Neural Network (ANN)
3.3 Convolutional Neural Network (CNN)
4 Arose Difficulties
4.1 KNN Algorithm
4.2 ANN Algorithm
4.3 CNN Algorithm
5 Final Product
6 Conclusion & Development in the Future
7 References

1 Introduction
1.1 About the problem
In recent years, advancements in the field of Artificial
Intelligence (AI) have demonstrated its growing popularity worldwide.
Since the launch of ChatGPT in 2022, AI has garnered significant
attention from the public. Numerous AI technologies have been
developed and applied across various domains, including healthcare,
the automotive industry, real estate, stock markets, and many more.
Today, people can easily perceive the presence of AI everywhere.
As a group of students enrolled in “Introduction to Artificial
Intelligence - IT3160E,” we recognize the significance of this subject
and its foundational concepts, which are essential for delving into more
advanced topics. One fundamental yet crucial challenge lies in designing
a system capable of recognizing digits from input images. Such a system
has numerous practical applications, including reading license plates at
shopping mall entrances or detecting potential traffic violations
captured by cameras. Given the complexity of processing data from real
environments, this project focuses on building a system that recognizes
digits drawn on a canvas. The foundational components of this system can
be further developed to address real-world problems effectively.

1.2 Database
We use a simple dataset called MNIST (an acronym for Modified
National Institute of Standards and Technology), which consists of
70,000 black-and-white images of handwritten digits. It has a training
set of 60,000 examples and a testing set of 10,000 examples. The images
have been normalized, processed to fit into a 28x28 pixel bounding box,
and anti-aliased.

Figure 1. MNIST Dataset

When the system learns to recognize digits from the MNIST
dataset, it may perform exceptionally well on this dataset but poorly on
canvas images provided to the model, which is an overfitting issue.
The MNIST dataset is highly clean and lacks variations such as rotation,
zooming, and other transformations, so it does not fully represent the
diversity of real-world images. To address this problem, we implement
data augmentation, collect additional data, and introduce more
variability to the images. We found an additional dataset on Kaggle
for this purpose. This supplementary dataset is also processed to fit
into a 28x28 bounding box but intentionally includes variations such as
rotation and skewing, as well as improper centering, so that the
model generalizes better and can recognize digits drawn on a canvas.

1.3 Packages/ Libraries used

1.3.1 NumPy
NumPy is a powerful Python library for working with arrays. It
provides fast computation on 1D arrays, matrices, and multi-dimensional
arrays (also known as tensors), operations that take too much time with
ordinary Python lists. The library has also proven to be incredibly
powerful for data analysis and serves as the foundation for other packages
and modules such as Matplotlib, Pandas, Seaborn, and Scikit-learn. NumPy,
in particular, is renowned for its simplicity and efficiency. It enables
faster computation of distances in KNN and of matrix multiplications in
neural networks, making it a valuable tool in developing and training our
system to “learn”.

1.3.2 Pandas
Pandas is a Python library designed for data manipulation and
analysis. It is an incredibly powerful tool, particularly for working with
tabular data. In this project, we use Pandas specifically to organize the
images from the additional datasets into tabular form for easier
manipulation and storage. Each column corresponds to one pixel of the
image, resulting in a total of 784 columns.

1.3.3 Matplotlib – Seaborn


Matplotlib is a versatile library in Python for creating static,
animated, and interactive visualizations. Seaborn, built on top of
Matplotlib, enhances it with additional features and tools for data
visualization. These libraries will assist us in visualizing multiple images
during training to ensure the system is learning correctly.

1.3.4 Scikit-learn
Scikit-learn is a free machine learning library for Python that
provides tools for implementing fast and efficient machine learning
models. It offers a variety of useful methods for preprocessing images
before they are fed into the system. Beyond preprocessing, Scikit-learn
also includes numerous features that we will leverage to experiment with
different algorithms to determine which performs best for the problem at
hand.

1.3.5 OpenCV
OpenCV (Open Source Computer Vision Library) is a library
that provides a wide range of programming functions, primarily
designed for real-time computer vision tasks. While it offers numerous
methods and utilities, in this project, we will specifically use a few of its
functions to draw bounding boxes with predictions on the canvas.

1.3.6 TensorFlow
TensorFlow is a free and open-source software library designed
for machine learning and artificial intelligence applications. While it
supports a wide range of tasks, it is particularly well-suited for training
and inference of deep neural networks. Given the complexity of
programming and the time required for training processes, we will use
TensorFlow to implement a Convolutional Neural Network. This library
enables us to leverage GPU acceleration, significantly improving the
training speed.

1.3.7 Pillow
Pillow is a free and open-source Python Imaging Library that
extends Python’s capabilities to open, manipulate, and save a wide variety
of image file formats. In the context of this project, we use Pillow to
retrieve images from the canvas, enabling us to process them, pass them
through the model, and ultimately generate predictions.

1.3.8 Tkinter
Tkinter is Python’s standard library for creating Graphical User
Interfaces (GUIs). It will be used in this project to build and deploy our
models and solutions. The interface will include a canvas where users can
draw numbers, along with five buttons: three for making predictions using
the proposed algorithms, one for displaying the bounding boxes around
digits drawn on the canvas, and one for clearing the canvas.

2 Preprocessing Data
2.1 Preprocessing the training data
First, we load the MNIST dataset, which contains 70000 handwritten
digits in the form of 28x28 images, so the image array has shape
(70000, 28, 28) and the label array has shape (70000, ).
Most algorithms that we use for this problem take a vector as input
and output a prediction. It is therefore reasonable for our team to start
preprocessing by reshaping each 28 x 28 image into a vector of size
(784, ). This process is called flattening the image: all rows of the
image are placed next to each other on a single row. The dataset of shape
(70000, 28, 28) is thus transformed into a 2D array of shape (70000, 784).
Then, to avoid computing with large numbers and large gradients (for
neural networks), as well as to reduce the effect of illumination on the
image, we scale the pixels from the range [0, 255] down to [0, 1] by
dividing each pixel value by 255.
For the neural network implementations (MLP and Convolutional
Neural Network), we also have to one-hot encode the label, that is, change
the label of each instance into a probability distribution, which we
discuss in more detail later. For now, the label y of each instance is
transformed into the following probability distribution:

$$P(y = i \mid X) = \begin{cases} 1 & \text{if } y = i \\ 0 & \text{otherwise} \end{cases}$$

The function that processes the MNIST dataset divides each pixel value
by 255, so that it lies in the range [0, 1], and converts each label into
a probability distribution.
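
The steps described in this section can be sketched as follows. This is only a minimal illustration, not the report's original listing: the helper name `preprocess` and the small random array standing in for MNIST are assumptions made for the example.

```python
import numpy as np

def preprocess(images, labels, num_classes=10):
    """Flatten, scale to [0, 1], and one-hot encode (hypothetical helper)."""
    # (n, 28, 28) -> (n, 784): the rows of each image are laid end to end.
    x = images.reshape(len(images), -1).astype(np.float32) / 255.0
    # Row i of the identity matrix is the one-hot vector for class i.
    y = np.eye(num_classes, dtype=np.float32)[labels]
    return x, y

# Tiny stand-in for MNIST: 4 random "images" with labels 3, 0, 9, 3.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28)).astype(np.uint8)
labels = np.array([3, 0, 9, 3])

x, y = preprocess(images, labels)
print(x.shape, y.shape)
```

With the real dataset the same call would turn arrays of shape (70000, 28, 28) and (70000, ) into arrays of shape (70000, 784) and (70000, 10).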

2.2 Preprocessing the testing data


The image is captured from the canvas in the GUI, resized to a
28x28 picture, and then converted to grayscale, all using built-in
functions of the Pillow library.
The image is then transformed into a NumPy array, and the pixel
values are scaled from the range [0, 255] down to [0, 1] by dividing
them by 255.

3 Algorithm
3.1 K-Nearest Neighbours (KNN)

3.1.1 Intuitive Idea


K-nearest neighbors (KNN) [17] is a supervised learning algorithm
in Machine Learning. It is used for classification and regression by picking
the k closest training examples in the dataset to predict the output (label)
of new data. For classification problems, the class of the data is decided
by “majority vote”: among the K nearest data points (in this example,
vectors), measured by Euclidean or another distance metric, the class that
appears most often is chosen as the class of the testing data. This is the
first algorithm that we propose for the system. To pick the k nearest
neighbors, we consider 2 main ways of measuring distance.
The first way is the Euclidean distance. It is more popular than the
second way because of its efficiency. In this case, we find the k nearest
neighbors by computing the square root of the sum of the squared
deviations. A deviation is the difference between a component of the
training vector and the component at the same position of the test vector.
We then sort the calculated distances in ascending order and keep the K
smallest distances.
The Euclidean distance between two vectors p, q is:

$$d(p, q) = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$$

The second way is the Manhattan distance. We just need to calculate
the sum of the absolute deviations instead of the square root of the sum
of squares, and then proceed as in the first way.
The Manhattan distance between two vectors p, q is:

$$d(p, q) = \sum_{i=1}^{N} |p_i - q_i|$$
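
The two distance measures can be written in a few lines of NumPy. The sketch below is illustrative only: the function names and the toy vectors are assumptions, and the last lines show how distances from one test vector to every training vector can be computed at once and sorted to find the nearest neighbors.

```python
import numpy as np

def euclidean(p, q):
    # Square root of the sum of squared deviations.
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    # Sum of absolute deviations.
    return np.sum(np.abs(p - q))

p = np.array([0.0, 3.0, 1.0])
q = np.array([4.0, 0.0, 1.0])
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0

# Distances from one test vector to every row of a small training matrix,
# then the indices of the 2 closest training vectors.
train = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test = np.array([0.0, 0.0])
dists = np.sqrt(((train - test) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:2]
print(nearest)
```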

In plain KNN, every nearest neighbor has an equal weight, which
means that every neighbor in the K-nearest neighbors set counts the same,
whether it is the closest one or the K-th closest one to the testing
vector. In practice this may lead to inaccurate predictions, so we need to
assign weights to the neighbors: the closer a vector is to the testing
vector, the more influence it has on the predicted label, and the further
ones have less.
Our goal is to find the best distance measure, a suitable value of K,
and the best weight option so that KNN has the highest accuracy and can
be used to predict real hand-written numbers. For this problem, we
implement KNN using the “Sklearn” library, with the “Matplotlib” library
to help us visualize our experiments. We split the MNIST dataset into 2
different sets: 60000 samples for training and 10000 for testing.

3.1.2. New Findings / Insights


First, we implement KNN with Euclidean distance and different
weights. We test two popular “kernel methods” for weighting, “distance”
and “Gaussian”, alongside the “uniform” option, which does not use any
kernel function and gives all nearest neighbors the same weight.
The Gaussian weight is the following:

$$w_i = e^{-\frac{d_i^2}{2\sigma^2}}$$

• where d_i is the distance between the neighbor i and the testing
instance.
• σ is a fixed constant, and can be modified to adjust the influence
of the weights.
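
A minimal NumPy sketch of a Gaussian-weighted vote is given below to make the effect of σ concrete. It is not the code used in the experiments (those rely on Sklearn); the function names and the one-dimensional toy data are assumptions for illustration.

```python
import numpy as np

def gaussian_weights(dists, sigma):
    # w_i = exp(-d_i^2 / (2 * sigma^2)): closer neighbors get larger weights.
    return np.exp(-(dists ** 2) / (2.0 * sigma ** 2))

def weighted_knn_predict(train_x, train_y, test_x, k, sigma, num_classes=10):
    # Euclidean distance from the test vector to every training vector.
    dists = np.sqrt(((train_x - test_x) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]                # indices of the k nearest
    w = gaussian_weights(dists[idx], sigma)
    # Sum the weights per class and pick the class with the largest total.
    votes = np.bincount(train_y[idx], weights=w, minlength=num_classes)
    return int(np.argmax(votes))

# Toy data: two points of class 1 close to the query, three of class 7 far away.
train_x = np.array([[0.0], [0.1], [2.0], [2.1], [2.2]])
train_y = np.array([1, 1, 7, 7, 7])
query = np.array([0.5])

small_sigma = weighted_knn_predict(train_x, train_y, query, k=5, sigma=0.5)
large_sigma = weighted_knn_predict(train_x, train_y, query, k=5, sigma=100.0)
print(small_sigma, large_sigma)
```

With a small σ the two close neighbors dominate the vote; with a very large σ all weights approach 1 and the vote behaves like the uniform one. In scikit-learn, the `weights` parameter of `KNeighborsClassifier` also accepts a callable that maps an array of distances to an array of weights, which is one way such a kernel could be plugged in.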
By implementing the KNN algorithm on the MNIST dataset and
testing different numbers of nearest neighbors, we have found some
interesting results.

Plots of KNN with Euclidean distance

From the above line graph, we can conclude the following:


• For larger values of K, the accuracy, although still very
high, tends to decrease with the “uniform” and “distance” weighting
functions.
• The accuracy of “my weight” (the Gaussian weighted
kernel function) appears to be more consistent.
• We discovered that a different sigma value, specifically σ
= 2.5, seems to be the best for KNN with a higher number of K.
Smaller sigma values increase the relative influence of the nearest
neighbors on the prediction in the Gaussian weighted kernel function.
• Each of these experiments takes a considerable amount of
time to run: each cell in our Python notebook takes approximately 5
minutes, which is quite slow. Thus, it might not be practical to apply
this algorithm in real-life scenarios.
Secondly, we implement KNN with the “Manhattan” distance and
different weights, applying the same “kernel methods” for weighting as in
the previous experiment. The result is shown below.

Plots of KNN with Manhattan distance

Here are some of our conclusions:


• Surprisingly, the accuracy remains relatively consistent
for large values of σ (σ = 10, 20).
• While conducting the experiments, we found that for every
value of σ < 10, the accuracy of the model also remains consistent as
the number of K changes.
• As for the uniform weight (where all nearest neighbors have
the same weight of 1), the accuracy fluctuates and tends to decrease as the
number of K increases.
• The time taken to run each variation of KNN was extremely
long: each model took around 10 minutes of runtime. The training
process was nearly instantaneous, but the testing process took a very long
time.
• The reason behind this is that KNN doesn’t really “learn”
anything; it simply stores the training data. When the testing data are
fed into the model, it calculates the distance from each testing vector to
every vector in the training set, and then sorts the distances to find
the K nearest ones.
From these experiments we can conclude that KNN performs really
well and takes almost no training time, but its testing phase is very
slow, so it is not suitable for real-life applications.
Conclusion: Through all the plots, we have concluded that KNN with
Euclidean distance, a Gaussian kernel with σ = 4, and k = 7 is the model
that will be deployed.

3.1.3 Implementation
Below is the part of the code that we used to plot some of the
experimental results for the Euclidean distance. The code for the other
weights is nearly the same, so we only show the part where we test the
accuracy with σ = 4.

(The rest of the code and the experiment can be found at the end of this report.)

3.2 Multi-layer Perceptron (MLP) / Artificial Neural Network (ANN)
3.2.1 Intuitive idea
Neural networks were first proposed in 1943 by Warren
McCulloch and Walter Pitts, two University of Chicago researchers. The
idea is to simulate how neurons in our brain might work. In 1957, a
layered network of perceptrons, consisting of an input layer, a hidden
layer with randomized weights that did not learn, and an output layer
with learning connections, was introduced by Frank Rosenblatt in his
book Perceptron [19]. This is our second algorithm to recognize
hand-written digits.
Multilayer Perceptron (MLP) is another name for a modern
feedforward artificial neural network, which consists of fully connected
neurons with some non-linear activation function.
It is known to learn and distinguish data that are not linearly
separable. Given data, it can be used to recognize non-linear structure
and make accurate predictions about it, which in this case is our
hand-written digits.

MLP structure (Source: IBM)

An MLP consists of at least three layers: one input layer, one or
more hidden layers, and one output layer. In general, the more hidden
layers we add, the more complex the structures the MLP can learn. It
learns through an algorithm called Back Propagation, which calculates the
gradient of the loss function with respect to each of the weights; the
weights of the neural network are then updated by the “Gradient descent”
method in order to minimize the loss function.
The loss function that the MLP uses to learn is the cross-entropy
loss. Cross entropy is a measure of the difference between two
probability distributions for a given random variable or set of events.
The formula of cross entropy is below:

$$\text{Cross-Entropy Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(p_{ij})$$

where N is the number of instances, C is the number of classes, y_ij
is the true probability of class j for the i-th instance (an entry of the
one-hot encoded vector), and p_ij is the predicted probability of class j
for the i-th instance.
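
As a small worked example of this formula, the sketch below evaluates the loss for two instances and three classes. The function name, the clipping constant `eps` (added only to avoid log(0)), and the sample values are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(y_true, p_pred, eps=1e-12):
    # Clip predictions so that log(0) never occurs.
    p = np.clip(p_pred, eps, 1.0)
    # Inner sum over classes j, then the mean over the N instances.
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

# Two instances, three classes; the true classes are 0 and 1.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

loss = cross_entropy(y_true, p_pred)
print(loss)  # equals -(log(0.7) + log(0.8)) / 2
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each instance's term, so a confident correct prediction drives the loss towards 0.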

Our team’s neural network has an input layer of 784 units, each
unit representing a pixel in a flattened 28 x 28 image. The output layer has
10 units/nodes, one for each digit 0, 1, . . ., 9; the node that corresponds
to the recognized digit is “switched on”. Moreover, the Softmax activation
function is used for the output layer. It is a typical choice for
multi-class classification problems. In this case, it classifies each image
by predicting whether the digit is 0, 1, . . ., 9.

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$

where N is the number of output nodes and Softmax(z_i) represents the
probability P(y_predicted = i | X = given image).
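
The Softmax formula can be sketched in NumPy as below. Subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result; it is an addition for the example, not something stated in the report, and the input values are arbitrary.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged (it cancels in
    # the ratio) but prevents overflow in exp for large inputs.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())
```

The outputs are positive and sum to 1, so they can be read as class probabilities, and the largest input always receives the largest probability.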
For the hidden layers, we use “relu” (Rectified Linear Unit) as the
activation function. It is simple, fast to compute, non-linear, and a very
robust choice of activation function for neural networks nowadays.

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
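
To show how ReLU and Softmax fit together, here is a minimal sketch of a forward pass through a fully connected network. It is not the report's implementation: the helper names, the single hidden layer of 64 units, and the random initialization scheme are all assumptions chosen to keep the example short.

```python
import numpy as np

def relu(x):
    # f(x) = x if x > 0, otherwise 0, applied element-wise.
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def init_params(sizes, rng):
    # One (weights, bias) pair per layer; the scaling is an assumed choice.
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out)),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    a = x
    for w, b in params[:-1]:
        a = relu(a @ w + b)          # hidden layers use ReLU
    w, b = params[-1]
    return softmax(a @ w + b)        # output layer uses Softmax

rng = np.random.default_rng(0)
params = init_params([784, 64, 10], rng)   # 64 hidden units is arbitrary
x = rng.random((5, 784))                   # 5 fake flattened images
probs = forward(x, params)
print(probs.shape)
```

Each row of `probs` holds 10 probabilities, one per digit; training would adjust `params` by back propagation so that the probability of the correct digit increases.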

Our problem now is to determine the best architecture for our
artificial neural network, that is, to find the number of hidden layers
and the number of units in each hidden layer so that the network achieves
high accuracy without overfitting and generalizes well to unseen data.

3.2.2 New findings/conclusions


The number of nodes in every hidden layer is a power of 2, inspired
by this article [7]. We split the MNIST dataset into 3 sets: a training set
of 48000 samples, a validation set of 12000 images to make sure the neural
network generalizes well, and a test set of 10000 samples for the final
testing. We studied simple feed-forward neural networks trained with the
Adam optimizer, which adapts the learning rate during training and helps
the optimization escape poor local extrema, using a batch size of 32. We
have come up with interesting results.

Epochs   Batch size   Hidden layer dims              Accuracy (%)
128      32           512, 256, 128                  98.21
128      32           256, 256, 256                  97.82
128      32           512, 512, 512                  98.0
128      32           32                             92.18
128      32           512, 512, 256, 256, 128, 64    97.96
Experimental result of MLP architecture

• The accuracy of the neural network did not increase
significantly when we tried increasing the number of hidden
layers as well as their dimensions.
• The more hidden layers we add and the more dimensions we add
to them, the longer the training time becomes. For instance, the last
architecture took the longest time to train, while the third
architecture took the second longest.
• Furthermore, the more hidden layers we add, the more
vulnerable the neural network is to overfitting.
Conclusion: Since the accuracy barely changes across architectures,
we have decided to pick an architecture that has an accuracy above 97%
and a comparatively small number of parameters, so that the computation is
time-effective. This is the first architecture (3 hidden layers of sizes
512, 256 and 128).

Accuracy and loss of selected neural network architecture

3.3 Convolutional Neural Network (CNN)

3.3.1. Intuitive Idea


In the 1980s, Convolutional Neural Networks (CNNs) were first
introduced to the world by Yann LeCun, a computer science researcher.
This marked the beginning of a revolutionary approach to the field of
Artificial Intelligence, especially Computer Vision [14]. The key idea behind
the convolutional neural network is also to mimic how the brain works.
However, the difference between ANNs and CNNs is that while ANNs are
designed to be fully connected networks where each neuron is connected
to every neuron in the next layer, CNNs include convolutional layers that
use specific filters to extract image features from the given input.
Thus, CNNs are more effective for image-related tasks, and in our
case, they have shown noticeable improvement in terms of accuracy in the
task of recognizing hand-written digits.

Structure of a convolutional neural network (Source: ANH H. REYNOLDS)

The 2 basic components that distinguish CNNs from MLPs are
Convolution and Subsampling (or pooling).

• Convolution: Convolution (sometimes called cross-correlation)
is a mathematical operation that allows the merging of two sets of
information. In the case of CNN, convolution is applied to the input data to
filter the information and produce a feature map. This filter is also called
a kernel, or feature detector. Each element in the filter matrix is a
parameter that needs to be learned so that the network can extract
features from input images more effectively.
• Pooling: Pooling in a convolutional network is also termed down
sampling or subsampling. It reduces the size of the feature maps produced
by the convolutional layer while retaining the important information. The
two most common kinds are max pooling and average pooling. In this
problem, we use max pooling in our pooling layers.

Example of 2x2 pooling layers with stride = 2 (Source: Muhamad Yani)
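
The two operations can be illustrated on a tiny 4x4 input. The sketch below is a simplified, assumption-laden example: it works on a single channel, uses no padding (so the output shrinks, unlike the “same” padding used later in the experiments), and the kernel values are fixed by hand instead of learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image and sum the element-wise products.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool_2x2(fmap):
    # Split the map into non-overlapping 2x2 blocks (stride 2), keep each max.
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    blocks = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])        # hand-picked diagonal-difference filter

fmap = conv2d_valid(image, kernel)      # shape (3, 3), every entry is -5.0
pooled = max_pool_2x2(image)            # [[5, 7], [13, 15]]
print(fmap.shape, pooled)
```

Pooling halves each spatial dimension while keeping the strongest response in every block, which is why it reduces the amount of computation in the following layers.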

To be more specific, CNNs consist of several significant components.
They usually have one input layer followed by a series of pairs of
convolutional and subsampling layers. In these layers, we can choose the
activation function, specify properties such as the number of filters, the
size of kernels, strides, and padding, and decide on the type of
subsampling layers (MaxPooling, AveragePooling, etc.). After several
convolutional and subsampling blocks, a flattening process is applied to
transform the feature maps into a vector. This vector is then passed
through fully connected layers to make the final prediction.
As mentioned, the network takes the 784 pixel values of an image as
input (kept here as a 28 x 28 array rather than flattened) and ends with a
10-unit output layer that uses the Softmax activation function. The
difference lies in the CNN architecture implemented between these two
layers. The challenge is how to choose the architecture to achieve high
accuracy without overfitting or underfitting.

3.3.2 Experiments/new findings


In our experiment, we also utilized the rescaled and split dataset
from the ANN part. Due to the numerous components that required
decisions, we opted to conduct separate tests for each element. The
architecture experiment for CNNs is based on the Kaggle notebook cited
here: [2].
To begin with, we aimed to determine the optimal number of pairs
of convolutional layers and subsampling blocks that would yield the
highest accuracy while maintaining acceptable computational costs. To
keep the comparison fair, we fixed the kernel size to 5x5 and employed
the ReLU activation function with “same” padding and a stride of 1.
Throughout the process, we compared three different scenarios: 1 pair, 2
pairs, and 3 pairs of Conv-Subsampling blocks. With 32 filters in the
first block, we increased the number of filters in the second and third
blocks to 48 and 64 respectively, and then plotted the validation
accuracy of those 3 models.

Accuracy plot of the model with different number of pairs of Conv-Subsampling blocks

As can be seen from the graph above, the accuracy on the validation
set for the 2-pair and 3-pair configurations was similar. Due to
computational cost, we prefer to implement 2 pairs of Conv-Subsampling
blocks in our architecture.
Next, we need to determine the number of filters to use in each
convolutional layer. In this step, we use 2 pairs of Conv-Subsampling
blocks, each with a kernel size of 5x5 and the ReLU activation function.
We compare the results of six different models, each with a different
number of filters: the i-th model has ((i − 1) ∗ 8 + 8) filters in the
first layer and ((i − 1) ∗ 16 + 16) filters in the second layer.
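
For reference, the six filter configurations can be listed with a short loop. This sketch assumes that the second layer always has twice as many filters as the first, which is the reading consistent with the 24-48 configuration selected below; it is not taken from the original notebook.

```python
# (first-layer filters, second-layer filters) for models i = 1..6,
# assuming the second layer has twice as many filters as the first.
configs = [(8 * i, 16 * i) for i in range(1, 7)]

for i, (first, second) in enumerate(configs, start=1):
    print(f"model {i}: {first} filters -> {second} filters")
```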
We can then plot the graph of accuracy on the validation set as
follows:

Accuracy plot of the model with different number of filters in each Conv-Subsampling blocks

From the graph, we can see that 24 filters in the first convolutional
layer and 48 filters in the second layer gave stable accuracy. Hence, we
choose 24-48 filters for our architecture.
After the 2 Conv-Subsampling blocks, we flatten the resulting feature
maps into a vector. Our next target is to decide how many nodes we should
use in the fully connected layer to make the final prediction. In this
experiment, we choose the best model out of 8 models whose number of
nodes ranges from 0 (no fully connected layer at all) to 2048. We can
illustrate the accuracy of each model as follows:

Accuracy plot of the model with different numbers of nodes in fully connected hidden layer
As illustrated, the accuracy with 256 nodes in the fully connected
layer is among the best of the 8 models. Regarding the number of fully
connected layers, our experiments show that the accuracy does not change
much when the number of layers varies, so we decided to use only 1 fully
connected hidden layer. Thus, the fully connected part consists of one
layer with 256 nodes, connected to the 10-node output layer.
To reduce overfitting, a common practice is to add a regularization
technique. In this case, we have chosen dropout regularization. Dropout is
a technique where neurons are removed at random during training, so that
the removed neurons do not take part in the forward pass or the backward
pass. We need to determine the probability that a randomly selected node
is dropped. We evaluate 8 models with dropout rates ranging
from 0 (no dropout) to 70% chance that a node is dropped. The results are
as follows:

Accuracy plot of the model with different dropout rates
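
The mechanism of dropout described above can be sketched in NumPy as follows. The example uses “inverted” dropout, in which the surviving activations are rescaled by 1 / (1 − rate) so that their expected value is unchanged; this rescaling, the function name, and the toy activations are assumptions for illustration and are not taken from the report's code.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # At inference time (or with rate 0) the layer passes values through.
    if not training or rate == 0.0:
        return activations
    # Each unit is kept with probability 1 - rate.
    keep = rng.random(activations.shape) >= rate
    # Rescale the survivors so the expected activation stays the same.
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((1000, 256))                 # fake activations of a 256-node layer
dropped = dropout(a, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0.0)
print(zero_fraction)                     # close to 0.2
```

Because a different random subset of nodes is silenced at every training step, the network cannot rely on any single node, which is what makes dropout act as a regularizer.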

In conclusion, the final CNN architecture consists of an input layer
that takes in a (28,28) shape array of pixel values, followed by two
pairs of convolutional layers and subsampling blocks.
The number of filters in these layers is set to 24 and 48, with a
kernel size of 5x5, using “same” padding and a stride of 1. The
architecture also includes a fully connected layer with 256 nodes and a
dropout rate of 0.2. The final accuracy on the test set is 99.30%.
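
To make the size of this architecture concrete, the sketch below traces the tensor shape through the layers and counts the trainable parameters. It rests on assumptions that the text does not state explicitly: a single grayscale input channel, one convolutional layer per block, and 2x2 max pooling with stride 2 after each convolution. Dropout adds no parameters.

```python
def conv_same(shape, filters, kernel=5):
    # "Same" padding with stride 1 keeps height and width unchanged.
    h, w, c_in = shape
    params = kernel * kernel * c_in * filters + filters   # weights + biases
    return (h, w, filters), params

def pool(shape, size=2):
    # 2x2 pooling with stride 2 halves height and width; no parameters.
    h, w, c = shape
    return (h // size, w // size, c)

def dense(n_in, n_out):
    return n_out, n_in * n_out + n_out                    # weights + biases

shape = (28, 28, 1)
total = 0
shape, p = conv_same(shape, 24); total += p   # -> (28, 28, 24)
shape = pool(shape)                           # -> (14, 14, 24)
shape, p = conv_same(shape, 48); total += p   # -> (14, 14, 48)
shape = pool(shape)                           # -> (7, 7, 48)
flat = shape[0] * shape[1] * shape[2]         # flattened vector length
n, p = dense(flat, 256); total += p
n, p = dense(n, 10); total += p

print(shape, flat, total)
```

Under these assumptions the flattened vector has 2352 entries and most of the parameters sit in the 256-node fully connected layer rather than in the convolutional layers.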

Accuracy plot of train set and validation set on selected CNN structure

3.3.3. Implementation
Implementing a convolutional network from scratch takes a lot of
time and does not come with support for training on a GPU, so it is not
effective to take that approach. Therefore, we use the “Tensorflow”
library to help us implement the CNN easily.
The full code for the experiments is very long and consists of many
cells inside an ipynb file, so we only show the creation of the final
selected CNN architecture, which is implemented below:

4 Arose Difficulties
4.1 KNN Algorithm

Choosing a suitable value of K for KNN is always a difficult problem.
If K is too small, the model is very sensitive to noise and tends to
overfit: it follows the training data closely but gives inaccurate
results on the testing data. If K is too large, the decision boundary is
over-smoothed and the model may underfit.
Choosing a suitable distance/similarity measure is also a challenge,
because it requires domain knowledge to work out which measure to use.
KNN is also known as a “lazy learning” algorithm because it takes
essentially no time to train. All it does is store the training data,
which is then used at prediction time to find the nearest neighbors of
each testing sample. With the 60,000 training images of MNIST, computing
and predicting the labels of the digits has proven to take a very long
time.
If more data (in this case more digits) is added to the training
data, prediction takes even longer, making the model very hard to use in
real-life practice because of its computing speed.
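
The points above can be illustrated with a minimal brute-force sketch,
which is not our notebook code. It assumes Euclidean distance and majority
voting; the function name knn_predict and the tiny synthetic data set
(standing in for flattened digit images) are hypothetical.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # "Training" was only storing train_x / train_y; all the work happens here:
    # one distance per stored sample, so the cost grows with the training set.
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(train_y[nearest])
    return int(np.argmax(votes))

# Tiny synthetic stand-in for flattened digit images (not MNIST).
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [1.0, 1.0], [0.9, 1.0], [0.05, 0.05]])
# The last point lies inside the class-0 cluster but carries label 1 (noise).
train_y = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.04, 0.04])

print(knn_predict(train_x, train_y, query, k=1))  # follows the noisy point
print(knn_predict(train_x, train_y, query, k=5))  # the vote outweighs the noise
```

With k=1 the prediction is decided by the single noisy neighbour, while a
larger k lets the surrounding samples outvote it. Every prediction still
scans the whole stored training set, which is why prediction time grows
with the amount of data.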

4.2 ANN Algorithm


4.2.1 Problems

Firstly, the dataset does not fully represent all the variants of
each digit. When we ran the test, the digits 1, 2, 4, 7, and 9 had the
most incorrect predictions. We did some error analysis, and this is what
we found.
For formal digit recognition, such as reading license plates, the
samples of digits 1 and 2 in the dataset are sufficient; however, the
samples of 1, 2, and 4 do not cover all the different ways of writing
these digits. For digit 7, the diagonal stroke is very similar to the
corresponding stroke of digit 2, so 7 is often mispredicted as 2. For
digit 9, the lower half is very similar to that of digit 3, while the
images of 9 in the dataset are drawn in unusual ways that do not match how
most people write it, so the model cannot learn to predict the common 9
with a curved lower half from the samples in the dataset.
Although it is very powerful and capable of outperforming various
machine learning algorithms, the ANN is very sensitive to image
variations: slight changes in the rotation, width, or zoom of the image
can result in wrong predictions. Beyond the complex structure of the
dataset, images also carry spatial information, meaning that pixels that
are adjacent to each other are related. When we flattened our 28 x 28
images into a 784-element vector, we destroyed that spatial
relationship.
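
A short NumPy illustration of this effect follows; it is only a sketch,
and the pixel position (10, 10) is an arbitrary choice. After flattening,
horizontal neighbours remain next to each other, but vertical neighbours
end up 28 positions apart, so the vector itself no longer shows that they
are adjacent.

```python
import numpy as np

image = np.arange(28 * 28).reshape(28, 28)  # each pixel holds its own flat index
flat = image.reshape(-1)                    # the 784-element vector fed to the ANN

row, col = 10, 10
idx = row * 28 + col            # position of pixel (10, 10) in the flat vector
right = row * 28 + (col + 1)    # horizontal neighbour
below = (row + 1) * 28 + col    # vertical neighbour

# Prints 290 1 28: the vertical neighbour is 28 positions away.
print(idx, right - idx, below - idx)
```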
4.2.2 Proposed Solutions
To resolve the misprediction of digits 1, 2, 4, 7, and 9, we will
augment the dataset with our own handwritten samples, and then retrain
the neural network on the augmented dataset.
Each augmented image is processed by inverting it into a black
background with a white hand-written digit, and dilating it to increase
the thickness of the strokes. The images are then processed by our image
processing algorithm and saved as a tabular data frame with 785 columns:
the first 784 columns represent the pixels of each digit, and the last
column is the label value of that digit.
We will then duplicate this augmented data 2 or 3 times to increase
its influence on the training process. After that, we will concatenate
this data with the training data. Then we will train the neural network
as normal.
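
The steps above can be sketched in NumPy as follows. This is an
illustration, not our image processing algorithm: the 3x3 dilation window,
the helper names, the synthetic "scanned" sample, and the small zero table
standing in for the MNIST training data are all assumptions.

```python
import numpy as np

def invert(img):
    """Turn a dark digit on a white background into a white digit on black."""
    return 255 - img

def dilate(img):
    """Thicken strokes: each pixel becomes the max of its 3x3 neighbourhood."""
    padded = np.pad(img, 1, mode="constant", constant_values=0)
    h, w = img.shape
    shifted = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.max(shifted, axis=0)

def to_row(img, label):
    """Flatten a 28x28 image into 784 pixel columns plus 1 label column."""
    return np.append(img.reshape(-1), label)

# Hypothetical scanned sample: a black vertical stroke (digit 1) on white paper.
scan = np.full((28, 28), 255, dtype=np.uint8)
scan[4:24, 14] = 0

row = to_row(dilate(invert(scan)), label=1)
extra = np.tile(row, (3, 1))                      # duplicate the new sample 3 times
train = np.zeros((5, 785), dtype=row.dtype)       # stand-in for the training table
augmented_train = np.concatenate([train, extra])  # table used for retraining
print(augmented_train.shape)
```

Duplicating the new rows is a simple way to weight them more heavily
without changing the training loop itself.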

4.3 CNN Algorithm
4.3.1 Problems

While the CNN implementation showed significant improvements in
accuracy on the test set, it still faces various challenges. Firstly,
although recognition has improved, the CNN still struggles with
variations in scale and image rotation. If the image is rotated, zoomed,
or its proportions (height/width) are changed, mispredictions can still
occur. However, it can recognize some slightly rotated images, which the
ANN cannot.
As the amount of training data in MNIST is limited and the data
still has the particular problems mentioned above, the CNN may be prone
to overfitting. Without larger datasets, the CNN architecture may have
difficulty generalizing.
Despite its strong performance, the CNN is computationally
expensive. Without the help of GPUs, the training process can take very
long, making experiments with different architectures difficult. The
computational cost increases as the architecture becomes more complex, so
the selection of parameters/hyperparameters also becomes a challenge: we
need to choose an architecture that costs less while keeping acceptable
performance.
4.3.2 Proposed Solutions
To address the problem of overfitting, we need more data
augmentation. The MNIST dataset is clean, so to make the model more
robust to such variations, we need to slightly rotate, zoom, and shift
the current images in the dataset. However, training on the augmented
data is computationally expensive and requires a long training time, so
it will not be done in this capstone project.
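
As an illustration of the idea, the sketch below implements only the
shifting part of such an augmentation in plain NumPy; rotation and zoom
need interpolation and would normally be done with an image library. The
function names, the maximum shift of 2 pixels, and the synthetic digit are
assumptions, and this code was not part of our experiments.

```python
import numpy as np

def shift(img, dy, dx):
    """Shift a 2-D image by (dy, dx) pixels, filling the exposed border with 0."""
    h, w = img.shape
    out = np.zeros_like(img)
    # Part of the source image that stays inside the frame after shifting.
    src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out

def augment(img, rng, max_shift=2):
    """Return a randomly shifted copy of img (by at most max_shift pixels)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return shift(img, int(dy), int(dx))

rng = np.random.default_rng(0)
digit = np.zeros((28, 28), dtype=np.uint8)
digit[10:18, 12:16] = 255   # stand-in blob for a centred digit
batch = np.stack([augment(digit, rng) for _ in range(4)])
print(batch.shape)
```

Each call produces a slightly different copy of the same digit, so the
network sees the digit at several positions instead of always centred.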

5 Final Product
After developing and testing three different approaches to solve the
problem, we can now integrate them and implement a simple GUI for
deployment.

5.1 General Description


The three approaches are deployed in a simple GUI which has:
• 5 buttons: 3 buttons for the 3 different approaches to
prediction, 1 button to initiate the prediction process, and 1 button
to clear the digit in the box.
• Canvas areas: one at the top indicating the currently
selected model, one where users can draw a digit, one showing the
predicted digit, and one displaying the accuracy of the prediction.

Implementation
To deploy it, we save the model structure and then use Tkinter to
build the simple GUI. The implementation is below:
Firstly, we create a new window with a fixed resolution

Then, we create a canvas for drawing

We create 5 buttons as described above

Each of the 5 buttons is bound to a function. The prediction
functions process the image, feed it into the chosen model, and print
out the result. This is the implementation of one such function,
use_knn; the other functions, use_ann and use_cnn, are similar.
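
The flow of such a function (process the image, feed it to the chosen
model, report the result) can be sketched independently of Tkinter as
follows. This is not our use_knn code: the names preprocess_canvas,
predict_digit, and fake_model are hypothetical, and the sketch assumes the
drawing is available as a 280x280 grayscale array with white strokes on a
black background and that the model returns 10 class probabilities.

```python
import numpy as np

def preprocess_canvas(canvas):
    """Downsample a 280x280 grayscale drawing to a flat 784-vector in [0, 1]."""
    # Average each 10x10 block so the drawing matches the 28x28 MNIST format.
    small = canvas.reshape(28, 10, 28, 10).mean(axis=(1, 3))
    return (small / 255.0).reshape(1, 784)

def predict_digit(model_fn, canvas):
    """Run the chosen model and return (predicted digit, confidence)."""
    probs = model_fn(preprocess_canvas(canvas))[0]
    digit = int(np.argmax(probs))
    return digit, float(probs[digit])

def fake_model(x):
    """Hypothetical stand-in for a trained model: always favours digit 7."""
    probs = np.full((x.shape[0], 10), 0.02)
    probs[:, 7] = 0.82
    return probs

canvas = np.zeros((280, 280))
canvas[40:240, 130:150] = 255   # a thick vertical stroke
print(predict_digit(fake_model, canvas))
```

Passing the model as a function means the same preprocessing can be shared
by all three buttons, with only model_fn changing between them.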

5.2 Source code:
• For KNN, we worked in a Google Colab notebook, which can be
accessed here:
https://colab.research.google.com/drive/1AUMPK-2gzvKPwmz8X5vz0Ce51AHPRXKt?usp=sharing
• For MLP (ANN), we coded our neural network class in a Google Colab
notebook, which can be accessed here:
https://colab.research.google.com/drive/1GcMm4ZvW0p-CoA6C7vofDb7KGJDjCDW9?usp=sharing
• For CNN, we also conducted our experiments in a Google Colab
notebook, which can be accessed here:
https://colab.research.google.com/drive/1bIPwn-2S216U2j0g6Oaohm1ZLcbAOPFl?usp=sharing
• For the GUI application:
https://github.com/bababyVN/DR-GUI.git
https://drive.google.com/file/d/1wkFTsixsjlaTdgt-GLkrxuz7XZcy45iq/view?usp=sharing

6 Conclusion & Development in the Future
In summary, we explored and addressed the problem using three
distinct methods: K-Nearest Neighbors, Multi-Layer Perceptron, and
Convolutional Neural Networks. These approaches are among the most
common and straightforward techniques for tackling spatial recognition
challenges. By integrating advanced image processing techniques, our
system holds the potential to identify and recognize digits from real-world
images effectively.
In addition, we propose the following ideas for improving our project
in the future:

6.1 Accuracy improvement


We can research and apply more advanced machine learning
algorithms or improve input data to enhance digit recognition accuracy.
• Utilize advanced deep learning models such as Transformers
or improve existing algorithms like CNNs to boost
performance.
• Clean and expand the training dataset (adding diverse data in
terms of size and handwriting styles) to ensure higher accuracy
in real-world conditions.
• Apply data augmentation techniques, such as rotation, blurring,
or resizing, to improve recognition capabilities in non-standard
scenarios.

6.2 Models optimization


Minimize processing time and resource requirements for
deployment on low-configuration devices or mobile platforms.
• Apply techniques like quantization and pruning to reduce
model size.
• Use lightweight models such as MobileNet, or TinyML frameworks,
for deployment on IoT devices or mobile phones.
• Optimize source code and infrastructure to ensure fast
processing speeds while maintaining high accuracy.
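
As a rough illustration of the quantization and pruning ideas listed
above, the following NumPy sketch compresses the weights of a single
layer. It is not a production method: symmetric 8-bit quantization with
one scale per layer and magnitude-based pruning are assumed choices, and
the function names and random weights are hypothetical.

```python
import numpy as np

def quantize(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from their int8 representation."""
    return q.astype(np.float32) * scale

def prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
weights = rng.normal(size=(256, 10)).astype(np.float32)  # stand-in dense layer
q, scale = quantize(weights)
pruned = prune(weights, 0.5)
print(weights.nbytes, "->", q.nbytes, "bytes")
```

Storing int8 values instead of float32 cuts the memory for these weights
to a quarter, at the cost of a rounding error of at most about half the
scale factor per weight.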

6.3 Expanding applications


Explore new practical applications for digit recognition systems,
such as in finance (check number recognition) or education (automatic
grading for math exercises).
• Finance: Recognize digits on checks, invoices, or accounting
documents to automate data entry processes and reduce
errors.
• Education: Apply the technology to math exercises or exams
for automatic grading, reducing the workload for teachers.
• Healthcare: Automate data extraction from charts, patient
records, or test results.
• Transportation: Read license plates or information from
electronic tickets.

6.4 User Interface development


Create a user-friendly interface to make the digit recognition system
easy to use.
• Develop interfaces using frameworks like React, Flutter, or
Tkinter for cross-platform applications.
• Optimize user experience (UX) with features like integrated
user guidance, visual result displays, and instant feedback
during recognition.
• Support multiple languages and design interfaces suitable for
a wide range of users.

