Project Report - Intro to AI
SCHOOL OF INFORMATION AND
COMMUNICATION TECHNOLOGY
PROJECT REPORT
DIGIT RECOGNITION
2 Preprocessing Data
3 Algorithm
3.1 K-Nearest Neighbours (KNN)
3.2 Multi-layer Perceptron (MLP) / Artificial Neural Network (ANN)
3.3 Convolutional Neural Network (CNN)
4 Arose Difficulties
4.1 KNN Algorithm
4.2 ANN Algorithm
4.3 CNN Algorithm
5 Final Product
6 Conclusion & Development in the Future
1 Introduction
1.1 About the problem
In recent years, advances in the field of Artificial Intelligence (AI)
have made the technology increasingly popular worldwide. Since the
launch of ChatGPT in 2022, AI has attracted significant attention from
the public. Numerous AI technologies have been developed and applied
across various domains, including healthcare, the automotive industry,
real estate, stock markets, and many more. Today, the presence of AI
is easy to perceive almost everywhere.
As a group of students enrolled in “Introduction to Artificial
Intelligence - IT3160E,” we recognize the significance of this subject
and its foundational concepts, which are essential for delving into more
advanced topics. One fundamental yet crucial challenge lies in designing
a system capable of recognizing digits from input images. Such a system
has numerous practical applications, including reading license plates at
shopping mall entrances or detecting potential traffic violations
captured by cameras. Because processing data from real environments is
complex, this project focuses on building a system to recognize digits
drawn on a canvas. The foundational components of this system can be
further developed to help address real-world problems.
1.2 Database
We use a simple dataset called MNIST (an acronym for Modified
National Institute of Standards and Technology), which consists of
70,000 black-and-white images of handwritten digits. It has a training
set of 60,000 examples and a test set of 10,000 examples. The images
have been normalized, processed to fit into a 28x28 pixel bounding box,
and anti-aliased.
Figure 1. MNIST Dataset
1.3 Packages/Libraries used
1.3.1 NumPy
NumPy is a powerful Python library for working with arrays. It
provides fast computation on 1D arrays, matrices, and multi-dimensional
arrays (also known as tensors), operations that take much longer with
plain Python lists. It has also proven to be incredibly powerful for
data analysis and serves as the foundation for other packages and
modules such as Matplotlib, Pandas, Seaborn, and Scikit-learn. NumPy,
in particular, is renowned for its simplicity and efficiency. In this
project, it speeds up the distance computations in KNN and the matrix
multiplications in neural networks, making it a valuable tool in
developing and training our system to “learn”.
1.3.2 Pandas
Pandas is a Python library designed for data manipulation and
analysis. It is an incredibly powerful tool, particularly for working
with tabular data. In this project, we use Pandas specifically to
organize images from additional datasets into tabular form for easier
manipulation and storage. Each column corresponds to one pixel in the
image, resulting in a total of 784 pixel columns.
1.3.4 Scikit-learn
Scikit-learn is a free machine learning library for Python that
provides tools for implementing fast and efficient machine learning
models. It offers a variety of useful methods for preprocessing images
before they are fed into the system. Beyond preprocessing, Scikit-learn
also includes numerous features that we will leverage to experiment with
different algorithms to determine which performs best for the problem at
hand.
1.3.5 OpenCV
OpenCV (Open Source Computer Vision Library) is a library
that provides a wide range of programming functions, primarily
designed for real-time computer vision tasks. While it offers numerous
methods and utilities, in this project, we will specifically use a few of its
functions to draw bounding boxes with predictions on the canvas.
1.3.6 TensorFlow
TensorFlow is a free and open-source software library designed
for machine learning and artificial intelligence applications. While it
supports a wide range of tasks, it is particularly well-suited for training
and inference of deep neural networks. Given the complexity of
programming and the time required for training processes, we will use
TensorFlow to implement a Convolutional Neural Network. This library
enables us to leverage GPU acceleration, significantly improving the
training speed.
1.3.7 Pillow
Pillow is a free and open-source Python Imaging Library that
extends Python’s capabilities to open, manipulate, and save a wide variety
of image file formats. In the context of this project, we use Pillow to
retrieve images from the canvas, enabling us to process them, pass them
through the model, and ultimately generate predictions.
1.3.8 Tkinter
Tkinter is Python’s standard library for creating Graphical User
Interfaces (GUIs). It will be used in this project to build and deploy our
models and solutions. The interface will include a canvas where users can
draw numbers, along with five buttons: three for making predictions using
the proposed algorithms, one for displaying the bounding boxes around
digits drawn on the canvas, and one for clearing the canvas.
2 Preprocessing Data
2.1 Preprocessing the training data
First, we obtain the MNIST dataset, which contains 70,000
handwritten digits in the form of 28x28 images, so the image array has
size (70000, 28, 28) and the label array has size (70000, ).
Most algorithms that we use for this problem take a vector as input
and output a prediction. It is therefore reasonable for our team to
begin preprocessing by reshaping each 28 x 28 image into a vector of
size (784, ). This process is called flattening the image: all rows of
the image are placed next to each other on a single row. The dataset of
size (70000, 28, 28) is thus transformed into a 2D array of size
(70000, 784).
Then, to avoid working with large numbers and large gradients (for
neural networks), as well as to reduce the effect of illumination on the
image, we scale the pixel values from the range [0, 255] down to [0, 1]
by dividing each pixel value by 255.
For the neural network (MLP) and the convolutional neural network
(CNN), we also have to one-hot encode the labels, that is, change the
label of each instance into a probability distribution over the ten
digit classes (discussed further later). The label y of each instance is
transformed into the following distribution:
$$P(y = i \mid X) = \begin{cases} 1 & \text{if } y = i \\ 0 & \text{otherwise} \end{cases}$$
Below is the function where we process the MNIST dataset: it divides
by 255 so that each pixel value lies in the range [0, 1], and it turns
each label into a probability distribution.
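The preprocessing steps described in this section (flattening, scaling, and one-hot encoding) can be sketched roughly as follows. This is a minimal illustration rather than the team's original function: the name `preprocess`, the `one_hot` flag, and the random stand-in data are assumptions made for the example.

```python
import numpy as np

NUM_CLASSES = 10  # digits 0-9


def preprocess(images, labels, one_hot=True):
    """Flatten images, scale pixels to [0, 1], optionally one-hot encode labels.

    `preprocess` is a hypothetical helper name used only for this sketch.
    """
    images = np.asarray(images)
    labels = np.asarray(labels)
    # (n, 28, 28) -> (n, 784): the rows of each image are laid end to end.
    x = images.reshape(images.shape[0], -1).astype(np.float32) / 255.0
    if not one_hot:
        return x, labels
    # Row i of the identity matrix is the one-hot vector for class i.
    y = np.eye(NUM_CLASSES, dtype=np.float32)[labels]
    return x, y


# Random stand-in data shaped like MNIST (not the real dataset).
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
fake_labels = np.array([3, 0, 9, 1, 3])
x, y = preprocess(fake_images, fake_labels)
print(x.shape, y.shape)
```

KNN can work directly with integer labels, so `one_hot=False` would be the natural setting there, while the MLP and CNN would use the one-hot form.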
3 Algorithm
3.1 K-Nearest Neighbours (KNN)
$$d(p, q) = \sqrt{\sum_{i=1}^{N} (p_i - q_i)^2}$$
$$d(p, q) = \sum_{i=1}^{N} |p_i - q_i|$$
σ is a fixed constant, and can be modified to adjust
the influence of the weights.
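To make the two distance measures concrete, here is a minimal NumPy sketch of KNN classification with either the Euclidean or the Manhattan distance. It uses a plain, unweighted majority vote; the σ-based weighting is not reproduced here, and the function names and toy data are assumptions for illustration only.

```python
import numpy as np


def euclidean(p, q):
    """Euclidean distance along the last axis."""
    return np.sqrt(np.sum((p - q) ** 2, axis=-1))


def manhattan(p, q):
    """Manhattan distance along the last axis."""
    return np.sum(np.abs(p - q), axis=-1)


def knn_predict(x_train, y_train, x_query, k=3, distance=euclidean):
    """Predict the label of one query vector by unweighted majority vote."""
    # Broadcasting: (n, d) training rows against one (d,) query -> n distances.
    dists = distance(x_train, x_query)
    nearest = np.argsort(dists)[:k]
    # Count the labels of the k nearest neighbours; ties go to the smaller label.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))


# Toy 2-D data standing in for flattened images.
x_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(x_train, y_train, np.array([0.2, 0.1]), k=3)
pred_b = knn_predict(x_train, y_train, np.array([5.5, 5.2]), k=3, distance=manhattan)
print(pred_a, pred_b)
```

On MNIST, `x_train` would be the (n, 784) array of flattened images and `x_query` a single flattened test image.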
By running the KNN algorithm on the MNIST dataset with different
numbers of nearest neighbors, we have found some interesting
results.
given experiment, and the result is shown below
Plots of KNN with Manhattan distance
3.1.3 Implementation
Below is the part of the code that we used to plot some
experimental results for the Euclidean distance. The code for the other
weights is nearly the same, so we only show the part where we test the
accuracy with σ = 4.
(The rest of the code and the experiment can be found at the end of this report.)
An MLP is known to learn and distinguish complex data that are not
linearly separable. Given data, it can be used to recognize and make
accurate predictions about non-linear structure, which in this case is
our handwritten digits.
An MLP consists of at least three layers: one input layer, one or
more hidden layers, and one output layer. Adding hidden layers increases
the capacity of the MLP to learn complex structures. It learns through
an algorithm called backpropagation, which uses the gradient descent
method: it tries to minimize the loss function by calculating the
gradient of the loss function with respect to each of the weights, and
then updates the weights of the neural network accordingly.
The loss function that the MLP uses to learn is the cross-entropy
loss. Cross entropy is a measure of the difference between two
probability distributions for a given random variable or set of events.
The formula of cross entropy is below:
$$\text{Cross-Entropy Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(p_{ij})$$
where N is the number of instances, C is the number of classes,
$y_{ij}$ is the true probability of class j for the i-th instance (taken
from the one-hot encoded vector), and $p_{ij}$ is the predicted
probability of class j for the i-th instance.
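As a small worked example of the cross-entropy formula, the sketch below computes the loss for two instances and three classes with NumPy. The function name `cross_entropy` and the clipping constant `eps` (added only to avoid taking the logarithm of zero) are assumptions of this sketch, not part of the formula.

```python
import numpy as np


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy over N instances with one-hot targets.

    `eps` is an assumed safeguard that keeps log() away from zero.
    """
    p = np.clip(p_pred, eps, 1.0)
    # Inner sum over the C classes, then the mean over the N instances.
    return float(-np.mean(np.sum(y_true * np.log(p), axis=1)))


# Two instances, three classes; the true classes are 0 and 1.
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
loss = cross_entropy(y_true, p_pred)
print(round(loss, 4))
```

Because the targets are one-hot, only the predicted probability of the true class contributes: here the loss is -(log 0.7 + log 0.8) / 2, roughly 0.29.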
Our team’s neural network has an input layer of 784 units, each
unit representing a pixel in a flattened 28 x 28 image. The output layer
has 10 units/nodes, one for each digit 0, 1, . . ., 9; the node that
corresponds to the predicted digit is “switched” on. Moreover, the
Softmax activation function is used for the output layer. It is a
typical choice for multi-class classification problems; in this case, it
classifies each image as one of the digits 0, 1, . . ., 9.
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$
Here $z_i$ is the input to the i-th output node and N is the number of
output nodes (10 in our case).
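The Softmax output and the gradient-descent update can be sketched together in a few lines of NumPy. The example below is a deliberate simplification, not the team's network class: it trains a single softmax layer with no hidden layers on random stand-in data, and the learning rate of 0.1, the helper names, and the max-subtraction used for numerical stability are all assumptions of the sketch.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


def loss_and_grad(w, x, y_onehot):
    """Cross-entropy loss of a single softmax layer and its gradient w.r.t. w."""
    p = softmax(x @ w)
    n = x.shape[0]
    loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    # For softmax combined with cross-entropy, dLoss/dz = (p - y) / n.
    grad = x.T @ (p - y_onehot) / n
    return loss, grad


# Random stand-in data: 8 instances, 4 features, 3 classes.
rng = np.random.default_rng(0)
x = rng.random((8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]

w = np.zeros((4, 3))
loss_before, grad = loss_and_grad(w, x, y)
w = w - 0.1 * grad  # one gradient descent step
loss_after, _ = loss_and_grad(w, x, y)
print(loss_before, loss_after)
```

With all-zero weights every class receives probability 1/3, so the starting loss is log 3; the single update then lowers it. A full MLP repeats the same idea layer by layer, propagating the gradient backwards through the hidden layers.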
Num epochs   Batch size   Hidden layer dims               Accuracy (%)
128          32           512, 256, 128                   98.21
128          32           256, 256, 256                   97.82
128          32           512, 512, 512                   98.0
128          32           32                              92.18
128          32           512, 512, 256, 256, 128, 64     97.96
Experimental result of MLP architecture
3.3 Convolutional Neural Network (CNN)
Convolution: Convolution (sometimes called cross-correlation)
is a mathematical operation that merges two sets of information. In a
CNN, convolution is applied to the input data to filter the information
and produce a feature map. The filter is also called a kernel or
feature detector. Each element in the filter matrix is a parameter that
needs to be learned so that the network can extract features from input
images more effectively.
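The operation can be illustrated with a naive NumPy implementation. This is a sketch for intuition only: it slides the kernel without flipping it (cross-correlation, which is what deep learning libraries compute under the name convolution), uses no padding and a stride of 1, and the hand-written edge kernel stands in for weights that a CNN would learn.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for r in range(out_h):
        for c in range(out_w):
            # Element-wise product of the kernel with the patch under it.
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


# A 6x6 image: left half dark (0), right half bright (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-written 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The resulting 4x4 feature map is non-zero only in the columns where the kernel straddles the edge, which is the sense in which a filter "detects" a feature.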
Pooling: Pooling in a convolutional network, also termed
downsampling or subsampling, reduces the number of neurons coming out of
the convolutional layer while retaining the important information. The
two best-known variants are max pooling and average pooling. In this
problem, we use max pooling in our pooling layers.
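Max pooling can be sketched in NumPy as below. The example assumes non-overlapping windows (stride equal to the window size) and a 2x2 window, which is a common default but is an assumption here; the function name is hypothetical.

```python
import numpy as np


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop any rows/columns that do not fill a complete window.
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split each axis into (block index, position inside the block).
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return blocks.max(axis=(1, 3))


fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
pooled = max_pool2d(fm)
print(pooled)
```

Each 2x2 block is replaced by its largest value, so the 4x4 map shrinks to 2x2 while the strongest responses are kept.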
Accuracy plot of the model with different number of pairs of Conv-Subsampling
blocks
As can be seen from the graph above, the accuracy of the validation
set for both 2-pair and 3-pair configurations was similar. Due to
computational cost, we prefer to implement 2 pairs of Conv-Subsampling
blocks into our architecture.
Next, we need to determine the number of filters to use in each
convolutional layer. Having decided on 2 pairs of Conv-Subsampling
blocks, each with a kernel size of 5x5 and the ReLU activation function,
we compare the results of six different models, each with a different
number of filters. The i-th model has ((i − 1) ∗ 8 + 8) filters in the
first layer and ((i − 1) ∗ 16 + 16) filters in the second layer, that
is, 8-16, 16-32, 24-48, 32-64, 40-80, and 48-96 filters.
We can then plot the graph of accuracy on the validation set as
follows:
Accuracy plot of the model with different number of filters in each Conv-
Subsampling blocks
From the graph, we can see that 24 filters in the first convolutional
layer and 48 filters in the second layer gave stable accuracy. Hence, we
will choose 24-48 filters in our architecture.
After the 2 Conv-Subsampling blocks, we flatten their output into a
vector. Our next target is to decide how many nodes we should use in the
fully connected layer to make the final prediction. In this experiment,
we choose the best of 8 models whose fully connected layer ranges from
0 nodes (no fully connected layer at all) to 2048 nodes. We can
illustrate the accuracy of each model as follows:
Accuracy plot of the model with different numbers of nodes in fully connected
hidden layer
As illustrated, the accuracy with 256 nodes in the fully connected
layer is acceptable among the 8 models. Regarding the number of fully
connected layers, our experiments show that the accuracy does not change
much when the number of layers varies, so we decided to use only 1
fully connected layer. Thus, our fully connected part consists of one
layer of 256 nodes, connected to the 10-node output layer.
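To see what the selected configuration (two 5x5 Conv-Subsampling blocks with 24 and 48 filters, a 256-node fully connected layer, and a 10-node output) implies in terms of tensor shapes and parameter counts, the arithmetic can be sketched in plain Python. Several details are assumptions of this sketch because the report does not state them: "same" padding with stride 1 in the convolutions, 2x2 pooling, and a bias term in every layer. With different padding or pooling the numbers would change.

```python
def conv_same(shape, filters, kernel):
    """Output shape and parameter count of a conv layer (assumed 'same' padding)."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights + biases
    return (h, w, filters), params


def max_pool(shape, size=2):
    """Output shape of an assumed size x size pooling layer (no parameters)."""
    h, w, c = shape
    return (h // size, w // size, c)


def dense(units_in, units_out):
    """Parameter count of a fully connected layer (weights + biases)."""
    return units_in * units_out + units_out


shape = (28, 28, 1)  # one grayscale 28 x 28 image
total = 0
for filters in (24, 48):
    shape, params = conv_same(shape, filters, kernel=5)
    total += params
    shape = max_pool(shape)

flat = shape[0] * shape[1] * shape[2]  # length of the flattened vector
total += dense(flat, 256)  # fully connected hidden layer
total += dense(256, 10)    # output layer
print(shape, flat, total)
```

Under these assumptions the flattened vector has 2352 entries, and most of the parameters sit in the 256-node fully connected layer rather than in the convolutions.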
To reduce overfitting, a common practice is to add a regularization
technique. In this case, we have chosen dropout regularization. Dropout
is a technique where neurons are removed at random during training; the
removed neurons do not take part in the forward pass or the backward
pass of that training step. We need to determine the probability with
which a node is dropped. We evaluate 8 models with dropout rates ranging
from 0 (no dropout) to a 70% chance that a node is dropped. The results
are as follows:
Accuracy plot of train set and validation set on selected CNN structure
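The dropout mechanism described in this section can be sketched in NumPy as follows. The sketch uses "inverted" dropout, in which the surviving activations are scaled up by 1 / (1 - rate) so that their expected value is unchanged; the rate of 0.4 is purely illustrative and is not the value selected in the experiments, and the function name is hypothetical.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at prediction time
    keep_prob = 1.0 - rate
    # True where the unit is kept, False where it is dropped.
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept units so the expected activation stays the same.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((1000, 256))  # stand-in activations of a 256-node layer
dropped = dropout(a, rate=0.4, rng=rng)
print((dropped == 0).mean(), dropped.mean())
```

Roughly 40% of the entries become zero and the mean stays close to 1. At prediction time the function is called with `training=False`, so all nodes are used.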
3.3.3 Implementation
Implementing a convolutional network from scratch takes a lot of
time and would not have GPU support for training, so it is not an
effective approach. Therefore, we use the TensorFlow library to
implement the CNN.
The full experiment code is very long and consists of many cells
inside an ipynb file, so we only show the creation of the final
selected CNN architecture, which is implemented below:
4 Arose Difficulties
4.1 KNN Algorithm
First, the dataset does not fully represent all variants of each
digit. When we run the test, the digits 1, 2, 4, 7, and 9 are the ones
with the most incorrect predictions. We have done some error analysis,
and this is what we found.
For formal digit recognition, such as reading license plates, the
examples of 1 and 2 in the dataset are sufficient. However, the examples
of 1, 2, and 4 in the dataset do not cover all the different ways of
writing these digits. For 7, the diagonal line is very similar to the
corresponding feature of 2, so 7 is often mispredicted as 2. For 9, the
lower half is very similar to that of 3; in addition, the images of 9 in
the dataset are drawn in unusual ways that do not match how most people
write the digit, so the model cannot learn to predict the common 9 with
a curved lower half from the examples given in the dataset.
Although it is very powerful and capable of outperforming various
machine learning algorithms, the ANN is very sensitive to image
variations: slight changes in the rotation, width, or zoom of the image
can result in wrong predictions. Beyond the complex structure of the
dataset, images carry spatial information, meaning that pixels adjacent
to each other are related. When we flattened our 28 x 28 images into
784-element vectors, we destroyed that spatial information.
4.2.2 Problems
To resolve the misprediction of the digits 1, 2, 4, 7, and 9, we
augment the dataset with additional images of our own and then retrain
the neural network on the augmented dataset.
Each augmented image is processed by inverting it into a black
background with a white handwritten digit, and then dilating it to
increase the thickness of the digit. The images are then processed by
our image processing algorithm and saved as a tabular data frame with
785 columns: the first 784 columns represent the pixels of each digit,
and the last column is the label of that digit.
We then duplicate this augmented data 2 or 3 times to increase its
influence on the training process. After that, we concatenate this data
with the training data and train the neural network as normal.
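The augmentation pipeline described above (invert, dilate, store as 785-column rows, duplicate, concatenate) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the team's image processing code: the 3x3 dilation neighbourhood, the helper names, and the plain NumPy array used in place of a Pandas data frame are all choices made for the example, and the training set is an all-zero stand-in.

```python
import numpy as np


def invert(image):
    """Turn black-on-white strokes into white-on-black strokes (uint8 image)."""
    return 255 - image


def dilate(image, iterations=1):
    """Thicken strokes: each pixel becomes the maximum of its 3x3 neighbourhood."""
    out = image
    for _ in range(iterations):
        h, w = out.shape
        padded = np.pad(out, 1, mode="constant", constant_values=0)
        # The nine shifted views of the padded image cover the 3x3 neighbourhood.
        shifted = [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
        out = np.max(shifted, axis=0)
    return out


def to_rows(images, labels):
    """Flatten each image to 784 values and append its label as column 785."""
    flat = np.stack([img.reshape(-1) for img in images])
    return np.column_stack([flat, labels])


# A drawn "1": a black vertical stroke on a white 28 x 28 canvas.
canvas = np.full((28, 28), 255, dtype=np.uint8)
canvas[4:24, 14] = 0

digit = dilate(invert(canvas))
extra = to_rows([digit], [1])    # shape (1, 785)
extra = np.tile(extra, (3, 1))   # duplicate the augmented data 3 times

train = np.zeros((10, 785), dtype=extra.dtype)  # stand-in for the training rows
combined = np.concatenate([train, extra], axis=0)
print(combined.shape)
```

The pixel columns would still need the same scaling to [0, 1] as the original training data before the network is retrained.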
4.3 CNN Algorithm
4.3.1 Problems
5 Final Product
After developing and testing three different approaches to solve the
problem, we can now integrate them and implement a simple GUI for
deployment.
Implementation
To deploy it, we save the model structure and then use Tkinter to
build a simple GUI. The implementation is below:
Firstly, we create a new window with fixed resolution
Then, we create a canvas for drawing
5.2 Source code:
• For KNN, we conducted our experiments in a Google Colab notebook,
which can be accessed here:
https://colab.research.google.com/drive/1AUMPK-2gzvKPwmz8X5vz0Ce51AHPRXKt?usp=sharing
• For MLP (ANN), we conducted our experiments and coded our neural
network class in a Google Colab notebook, which can be accessed here:
https://colab.research.google.com/drive/1GcMm4ZvW0p-CoA6C7vofDb7KGJDjCDW9?usp=sharing
• For CNN, we also conducted our experiments in a Google Colab
notebook, which can be accessed here:
https://colab.research.google.com/drive/1bIPwn-2S216U2j0g6Oaohm1ZLcbAOPFl?usp=sharing
• For the GUI application:
https://github.com/bababyVN/DR-GUI.git
https://drive.google.com/file/d/1wkFTsixsjlaTdgt-GLkrxuz7XZcy45iq/view?usp=sharing
6 Conclusion & Development in the Future
In summary, we explored and addressed the problem using three
distinct methods: K-Nearest Neighbors, Multi-Layer Perceptron, and
Convolutional Neural Networks. These approaches are among the most
common and straightforward techniques for tackling spatial recognition
challenges. By integrating advanced image processing techniques, our
system holds the potential to identify and recognize digits from real-world
images effectively.
In addition, we have come up with some ideas to improve our project
in the future, listed below:
grading for math exercises).
• Finance: Recognize digits on checks, invoices, or accounting
documents to automate data entry processes and reduce errors.
• Education: Apply the technology to math exercises or exams for
automatic grading, reducing the workload for teachers.
• Healthcare: Automate data extraction from charts, patient records,
or test results.
• Transportation: Read license plates or information from electronic
tickets.