Module 1
Deep learning involves training artificial neural networks, computational models inspired
by the human brain, to recognize patterns and make decisions based on vast amounts of
data.
Unlike traditional machine learning, which may require feature engineering and manual
intervention, deep learning models automatically discover representations and features
from raw data, making them particularly effective for tasks like image and speech
recognition.
5. Autoencoders: Neural networks trained to reconstruct their own input, learning a compressed representation of the data in the process.
Deep learning models have a wide array of applications across various industries. Key advantages of deep learning include:
Automatic Feature Extraction: Eliminates the need for manual feature engineering,
allowing models to learn directly from raw data.
Scalability: Can handle large volumes of data and complex models with millions of
parameters.
Versatility: Applicable to diverse domains and tasks, from vision and speech to text and
beyond.
Performance: Achieves state-of-the-art results in many benchmark tasks, often
surpassing human-level performance.
Despite these strengths, deep learning also faces several challenges:
Data Requirements: Deep learning models typically require vast amounts of labeled
data, which can be costly and time-consuming to obtain.
Computational Resources: Training deep models demands significant computational
power, often necessitating specialized hardware like GPUs.
Interpretability: Deep networks are often considered "black boxes," making it difficult
to understand how decisions are made.
Overfitting: Models can become too tailored to training data, reducing their ability to
generalize to new, unseen data.
Deep learning, a branch of machine learning, has experienced tremendous growth and
transformation over the decades.
While its core principles date back to the mid-20th century, it has undergone several stages of
advancement due to technological innovations, better algorithms, and increased computational
power. Below is a timeline highlighting key historical trends in deep learning:
The foundation for deep learning lies in early research on neural networks and the imitation of
human cognition in machines. Several key milestones shaped the beginnings of the field:
1943: McCulloch and Pitts: The concept of a neuron as a binary classifier was
introduced by Warren McCulloch and Walter Pitts. They proposed a mathematical model
of a neuron that laid the groundwork for later neural network research.
1958: Perceptron by Frank Rosenblatt: The perceptron was a simple neural network
designed to perform binary classification tasks. It could learn by adjusting weights based
on input-output relationships, similar to modern deep learning models. However, its
limitations in handling non-linearly separable data, such as the XOR problem, restricted
its capabilities.
1960s: Backpropagation Concept Introduced: Although it wasn’t widely used until
much later, the concept of backpropagation—the algorithm for training multilayer neural
networks—was introduced by multiple researchers, including Bryson and Ho.
After initial interest, neural networks entered a period of decline, often called the "AI winter."
There was disappointment in the limitations of single-layer perceptrons, and other machine
learning methods, such as support vector machines (SVMs) and decision trees, gained traction.
1970s: The limitations of early neural networks, like the perceptron, led to reduced
funding and enthusiasm for the approach.
1980s: Interest was revived through theoretical work, and some breakthroughs in deep
learning principles were laid during this period, though they wouldn’t be fully realized
for decades.
2006: Deep Belief Networks (DBNs): Geoffrey Hinton and his team proposed the idea
of using deep belief networks, a type of unsupervised deep neural network. This marked
the beginning of modern deep learning, where the goal was to train deeper neural
networks that could learn complex representations.
2007–2009: GPU Acceleration: The adoption of Graphics Processing Units (GPUs) for
deep learning computations drastically improved the ability to train deeper networks
faster. This technological breakthrough allowed for more practical training of neural
networks with multiple layers.
The 2010s are often referred to as the "Golden Age" of deep learning. With the
combination of better hardware (especially GPUs), large datasets, and advanced
algorithms, deep learning achieved state-of-the-art performance across various domains.
2012: AlexNet and ImageNet Competition: A deep CNN called AlexNet, developed by
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual
Recognition Challenge by a large margin. This victory demonstrated the power of deep
learning in image recognition and spurred widespread interest in the field.
2014:
o Generative Adversarial Networks (GANs): Introduced by Ian Goodfellow,
GANs became one of the most revolutionary architectures in deep learning.
GANs consist of two networks—a generator and a discriminator—that compete
against each other, enabling the creation of highly realistic synthetic data.
o VGGNet and ResNet: VGGNet (2014) and ResNet (introduced the following year,
in 2015) were breakthroughs in CNN architectures that allowed much deeper networks
to be trained without performance degradation. ResNet's skip connections addressed
the vanishing-gradient problem in very deep networks.
2017: Transformers and Attention Mechanisms:
o The introduction of the Transformer model by Vaswani et al. transformed the
field of natural language processing (NLP). The Transformer, which uses self-
attention mechanisms to process sequences in parallel, has since become the
foundation of cutting-edge NLP models, including BERT and GPT.
2018–2019: Transfer Learning and Pre-trained Models:
o Large pre-trained models like BERT (from Google) and GPT-2 (from OpenAI)
demonstrated the power of transfer learning, where a model pre-trained on
massive datasets can be fine-tuned for specific tasks with smaller datasets,
drastically reducing training time and improving performance.
The 2020s have seen deep learning evolve further, with a focus on more efficient models,
ethical AI practices, and novel applications.
Reinforcement Learning: Systems like AlphaGo and AlphaZero (developed by DeepMind)
demonstrate the potential of AI in learning strategies through trial and error in dynamic
environments.
Ethics and Interpretability: As deep learning models are increasingly deployed in real-
world applications, attention has shifted toward ensuring fairness, reducing biases, and
improving the interpretability of these "black box" models.
Resource Efficiency: There has been a growing interest in optimizing deep learning
models to make them more resource-efficient, addressing concerns about the
environmental impact of training massive models. Techniques like pruning,
quantization, and distillation aim to reduce the computational and energy demands of
deep learning models.
Machine learning allows computers to learn from data to improve their performance on
certain tasks. The main components of machine learning are the task (T), the
performance measure (P), and the experience (E). These three elements form the basis
of any machine learning algorithm.
The task in machine learning is the problem that we want the system to solve. It could be
recognizing images, predicting numbers, translating languages, or even detecting fraud.
The task doesn’t include learning itself but refers to the goal or action we want the
machine to perform.
Classification: The algorithm assigns an input (like an image) into one of several
categories. For example, identifying whether an image is of a cat or a dog is a
classification task.
Regression: The algorithm predicts a continuous value, like forecasting house prices or
stock market trends.
Transcription: The algorithm converts unstructured data into a structured format, such
as recognizing text in images (optical character recognition) or converting speech into
text.
Machine Translation: Translating text from one language to another, like English to
French.
Anomaly Detection: Finding unusual patterns or behaviors, such as detecting fraud in
transactions.
Structured Output: Tasks where the output involves multiple values that are connected,
such as generating captions for images.
Synthesis and Sampling: The algorithm creates new data that is similar to the training
data, like generating realistic images or audio.
Imputation of Missing Values: Predicting missing data points based on the available
information.
Denoising: Cleaning up corrupted data by predicting what the original data was before it
got corrupted.
Density Estimation: Learning the probability distribution that explains how data points
are spread out in the dataset.
The performance measure tells us how well the machine learning algorithm is doing. It
helps us compare the system’s predictions with the actual results. Different tasks require
different performance measures.
For example, in classification tasks, the performance measure might be accuracy, which
tells us how many predictions were correct. Alternatively, we can measure the error
rate, which counts how many predictions were wrong. In some cases, we may want a
more detailed performance measure, such as giving partial credit for partially correct
answers.
For tasks that don’t involve predicting categories (like density estimation), accuracy isn’t
useful, so we use other performance measures, like log-probability.
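As a quick illustration (not part of the original notes), accuracy and error rate can be computed directly from predicted and true labels; the label arrays below are made-up examples:

    import numpy as np

    # Hypothetical predictions and ground-truth labels for a binary classifier
    y_true = np.array([1, 0, 1, 1, 0, 1])
    y_pred = np.array([1, 0, 0, 1, 0, 1])

    accuracy = np.mean(y_pred == y_true)    # fraction of correct predictions
    error_rate = 1.0 - accuracy             # fraction of incorrect predictions
    print(accuracy, error_rate)             # 0.833..., 0.166...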
The experience refers to the data that the algorithm learns from. There are different types
of experiences:
Supervised Learning: The system is trained using data that includes both input features
and their corresponding outputs or labels. For example, training a model with labeled
images of cats and dogs, so it learns to classify them.
Unsupervised Learning: The system is trained using data without labels. It tries to find
patterns or structure in the data, such as grouping similar data points together (clustering)
or estimating the data distribution (density estimation).
Semi-Supervised Learning: Some examples in the training data have labels, but others
don’t. This is useful when getting labeled data is difficult or expensive.
Reinforcement Learning: The system learns by interacting with an environment and
receiving feedback based on its actions. This approach is used in robotics and game
playing, where the system gets rewards or penalties based on the decisions it makes.
To make the concept clearer, we can look at an example of a machine learning task called linear
regression, which predicts a continuous value. In linear regression, the algorithm uses the input
data (represented as a vector) to predict a value by calculating a linear combination of the input
features.
For example, if you want to predict the price of a house based on its size and location, the
algorithm might use a linear function to estimate the price. The output is calculated by
multiplying the input features by their corresponding weights and summing them up.
The weights are the parameters that the algorithm adjusts during training. The goal is to find the
weights that minimize the mean squared error (MSE), which measures how far off the
predictions are from the actual values.
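A minimal NumPy sketch of this idea follows; the house sizes, location scores, and prices are made up for illustration. The weights are obtained with the normal equations and the fit is evaluated with mean squared error.

    import numpy as np

    # Illustrative inputs: each row is [size_in_sqft, location_score]
    X = np.array([[1400.0, 3.0],
                  [1600.0, 4.0],
                  [1700.0, 2.0],
                  [1875.0, 5.0]])
    y = np.array([245.0, 312.0, 279.0, 308.0])   # made-up house prices

    # Add a bias column of ones so the model also learns an intercept
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])

    # Normal equations: solve (X^T X) w = X^T y for the weights w
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

    y_hat = Xb @ w                               # prediction = weighted sum of features
    mse = np.mean((y_hat - y) ** 2)              # mean squared error
    print(w, mse)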
Supervised learning algorithms learn to map inputs (x) to outputs (y) using a training set. The
output labels often have to be provided by a human, but they can sometimes be collected automatically.
Most supervised learning algorithms estimate the probability of an output y given an input x,
written p(y | x). This can be done with maximum likelihood estimation, which finds the
parameters θ that make the observed training data most probable under the model.
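Written out in the notation above (a standard formulation; m, the number of training examples, is notation assumed here rather than taken from these notes), maximum likelihood chooses:

    \theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right)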
2. Logistic Regression
For classification tasks (e.g., binary classification), we predict a class by squashing the linear
output into a probability between 0 and 1 using the logistic sigmoid function
σ(z) = 1 / (1 + exp(−z)), so that p(y = 1 | x) = σ(θᵀx).
This technique is known as logistic regression. Despite its name, it is used for
classification, not regression.
Linear regression allows us to compute optimal weights using a simple formula (normal
equations).
Logistic regression does not have a closed-form solution. Instead, the optimal weights
are found by minimizing the negative log-likelihood (NLL) using gradient descent.
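A rough sketch of the gradient-descent route just described follows; the toy data, learning rate, and iteration count are arbitrary illustrative choices, not values from the notes.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy binary-classification data: 2 features per example, labels in {0, 1}
    X = np.array([[0.5, 1.2], [1.5, 0.3], [3.0, 2.2], [2.8, 3.0]])
    y = np.array([0, 0, 1, 1])
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add bias column

    w = np.zeros(Xb.shape[1])
    lr = 0.1                                        # learning rate (arbitrary)
    for _ in range(1000):
        p = sigmoid(Xb @ w)                         # predicted P(y = 1 | x)
        grad = Xb.T @ (p - y) / len(y)              # gradient of the average NLL
        w -= lr * grad                              # gradient descent step

    p = sigmoid(Xb @ w)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(w, nll)

Unlike the normal-equations solve used for linear regression, the weights here improve gradually, one gradient step at a time.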
5. Decision Trees
Decision Trees divide the input space into regions based on decisions made at each node
of the tree. Internal nodes make binary decisions, and leaf nodes map each region to a
constant output.
Strength: They are easy to understand and interpret.
Weakness: Decision trees may struggle with problems where decision boundaries aren’t
axis-aligned, requiring many nodes to approximate simple boundaries.
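As a small illustration of the axis-aligned splits described above (this sketch uses scikit-learn; the synthetic data and the depth limit are arbitrary choices):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Made-up 2-D points; the two classes are separated by the diagonal x0 = x1,
    # a boundary that is NOT axis-aligned, so the tree needs many splits to trace it.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, size=(200, 2))
    y = (X[:, 0] > X[:, 1]).astype(int)

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(tree.score(X, y))   # depth-3 axis-aligned splits only coarsely
                              # approximate the diagonal boundary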
Unsupervised learning algorithms deal with data that contains only features and no
labeled targets. They aim to extract meaningful patterns or structures from the data
without human supervision, and they are often used for tasks like clustering, density
estimation, and learning data representations.
The main goal in unsupervised learning is often to find the best representation of the
data. A good representation preserves the most important information about the data
while simplifying it or making it easier to work with.
2. Types of Representations
Reducing the dimensionality of the data helps with compression and makes it easier to
find and use the key features.
Sparse and independent representations make the data easier to interpret and process in
machine learning algorithms.
Principal Component Analysis (PCA)
1. Goals of PCA
PCA reduces the dimensionality of the data while ensuring that the new representation’s
features are decorrelated (no linear correlations between the features). It is a step toward
achieving statistical independence of the features, though PCA only removes linear
relationships.
Linear Transformation: PCA projects the data onto new axes that capture the directions
of maximum variance in the data.
The algorithm learns an orthogonal transformation that projects an input x to a new
representation z = xᵀW, where W is a matrix of principal components (the directions of
maximum variance).
The first principal component explains the most variance in the data, and each subsequent
component captures the remaining variance while being orthogonal to the previous ones.
PCA transforms the data such that the covariance matrix of the new representation is
diagonal, meaning the new features are uncorrelated.
It uses eigenvectors of the data’s covariance matrix or singular value decomposition
(SVD) to find the directions of maximum variance.
The result is a compact, decorrelated representation of the data that can be used for
further analysis while minimizing information loss.
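A compact sketch of this projection, assuming NumPy and a small random data matrix (illustrative only): the principal directions are obtained from the SVD of the centered data.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))          # 100 examples, 5 features (made-up data)
    Xc = X - X.mean(axis=0)                # center the data first

    # SVD of the centered data: columns of Vt.T are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt.T[:, :2]                        # keep the top-2 directions (most variance)

    Z = Xc @ W                             # new representation z = x^T W for each row
    print(np.cov(Z, rowvar=False))         # approximately diagonal: decorrelated features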
k-Means Clustering
The algorithm begins by initializing k centroids (cluster centers), which are assigned
random values.
Assignment Step: Each data point is assigned to the nearest centroid, forming clusters.
Update Step: Each centroid is recalculated as the mean of the points assigned to it.
This process repeats until the centroids no longer change significantly, signaling
convergence.
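A minimal NumPy version of the two alternating steps just described (the random data and k = 3 are arbitrary choices for illustration; empty clusters are not handled in this simple sketch):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k random data points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assignment step: index of the nearest centroid for each point
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its assigned points
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # centroids stopped moving
                break
            centroids = new_centroids
        return centroids, labels

    X = np.random.default_rng(1).normal(size=(150, 2))
    centroids, labels = kmeans(X, k=3)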
2. One-Hot Representation
k-Means clustering provides a one-hot representation for each data point. If a point
belongs to cluster i, its representation vector h has a 1 at position i and 0 everywhere
else.
This is an example of a sparse representation because only one element in the vector is
non-zero for each point.
However, this representation is limited because it treats clusters as mutually exclusive
and doesn’t capture relationships between different clusters.
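Continuing the sketch above, the cluster assignments can be turned into these sparse one-hot vectors in one line (the example labels are made up):

    import numpy as np

    k = 3
    labels = np.array([0, 2, 1, 2])        # example cluster assignments
    H = np.eye(k)[labels]                  # one row per point, a single 1 per row
    print(H)
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]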
3. Limitations of k-Means
Ill-posed Problem: There is no single, definitive way to evaluate how well the clustering
reflects real-world structures. For example, clustering based on vehicle color (red vs.
gray) is as valid as clustering based on type (car vs. truck), but each reveals different
information.
Lack of Fine-Grained Similarity: k-Means provides a strict one-hot output, which
doesn’t capture nuanced similarities between examples. For instance, it can’t show that
red cars are more similar to gray cars than gray trucks.