
Module 6

Function Approximation & Deep Reinforcement Learning
• Function Approximation
  • Drawbacks of tabular implementation
  • Function Approximation
  • Gradient Descent Methods
  • Linear parameterization
  • Policy gradient with function approximation
• Deep Reinforcement Learning
  • Intro of Deep Learning in Reinforcement Learning
  • Deep learning training workflow
  • Categories of Deep learning
  • Deep Q-Network
  • Ways of improving Deep Q-Network
  • REINFORCE in Full Reinforcement Learning
  • Actor-Critic Algorithm, Algorithm Summary
  • DDPG
  • Case study on AlphaGo by Google DeepMind
Function Approximation
• Function approximation in reinforcement learning (RL) is a technique used
to estimate value functions or policies when the state and/or action spaces
are too large to represent explicitly in a table.
• Instead of storing a value for each state-action pair, we use a
parameterized function to approximate the value function or policy.
• This allows RL agents to generalize from previously seen states to new,
unseen states, making learning feasible in complex environments.
• There are many function approximators, e.g.
• Artificial neural network
• Decision tree
• Nearest neighbor
• Fourier/wavelet bases
• Coarse coding
Need of Function Approximation
• Large State Spaces: Many real-world problems have an enormous number of
possible states (e.g., Go has about 10^170 states). Storing and updating a table for each
state is computationally infeasible.
• Continuous State Spaces: In environments with continuous state variables (e.g.,
the position and velocity of a robot arm), the number of states is infinite. Tabular
methods cannot be applied directly.
• Generalization: Function approximation enables the agent to generalize its
learning. By finding patterns and relationships in the state space, the agent can
estimate values for states it has never encountered before.
• Efficiency: Representing value functions or policies with a smaller set of
parameters is much more memory-efficient. Learning can also be faster as updates
to the parameters affect the estimated values of many states simultaneously
Function approximation involves using a
parameterized function to represent either
• Value Function Approximation:
• Approximating the state-value function V(s) ≈ V_θ(s) or the action-value function Q(s,a) ≈ Q_θ(s,a), where θ is a vector of parameters.
• The goal is to find the optimal θ that minimizes the error between the true value function and
the approximation.
• Policy Approximation:
• Directly approximating the policy π(a|s) ≈ π_θ(a|s), which maps states to probabilities of taking actions.
• The goal is to find the parameters θ that lead to an optimal policy.
Types of Function Approximators
• Linear Function Approximation:
• The value function or policy is represented as a linear combination of features of the state (and action):
  V_θ(s) = θ^T φ(s) = Σ_i θ_i φ_i(s),
  where φ(s) is a feature vector and θ is the weight vector.
• Examples include polynomial basis functions, Fourier basis, tile coding, and radial basis functions (RBFs).
• Non-linear Function Approximation:
• More complex functions are used to capture non-linear relationships in the state space.
• Neural Networks: Deep learning models, particularly deep Q-networks (DQNs) and actor-critic
networks, have achieved significant success in RL with high-dimensional state spaces like images.
• Decision Trees and Random Forests: These can learn complex, non-linear decision boundaries.
• Kernel Methods: Techniques like Support Vector Regression can be used for value function
approximation.
• Nearest Neighbors: Non-parametric methods that estimate values based on similar previously seen
states.
Function Approximation for Learning
• The parameters of the function approximator (θ) are learned using RL algorithms
like:
• Monte Carlo (MC): Updates are made after the completion of an episode using
the actual returns.
• Temporal Difference (TD) Learning: Updates are made based on the difference
between the current estimate and a bootstrapped target value; in Q-learning the target is
r + γ max_{a'} Q(s', a'; θ).
• With function approximation, the update rule adjusts the parameters θ to reduce this error, for example via gradient descent.
• Policy Gradient Methods: Directly update the parameters of the policy based on the gradient of the expected reward.
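As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of a semi-gradient Q-learning update with a linear function approximator; the feature function phi and the action set are hypothetical placeholders.

```python
import numpy as np

def semi_gradient_q_update(theta, phi, s, a, r, s_next, actions,
                           alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning step for Q(s, a) = theta . phi(s, a).

    phi(s, a) is a hypothetical feature function returning a NumPy vector.
    """
    q_sa = theta @ phi(s, a)
    # Bootstrapped target: r + gamma * max_a' Q(s', a')
    target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    # For a linear approximator, grad_theta Q(s, a) = phi(s, a)
    return theta + alpha * (target - q_sa) * phi(s, a)
```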
Function Approximation

Benefits
• Scalability: Handles problems with large or continuous state and action spaces.
• Generalization: Enables learning and decision-making in unseen states.
• Efficiency: Reduces memory requirements and can speed up learning.
• Flexibility: Allows the use of powerful machine learning techniques.
Challenges
• Stability and Convergence: Especially with non-linear function approximators like neural networks, ensuring the stability and convergence of learning algorithms can be challenging.
• Bias-Variance Trade-off: Choosing the right complexity of the function approximator is crucial to avoid underfitting (high bias) or overfitting (high variance).
• Feature Engineering: The performance of linear function approximation heavily depends on the choice of appropriate features.
• Exploration vs. Exploitation: Function approximators need sufficient exploration to generalize well and avoid getting stuck in suboptimal solutions.
Gradient Descent Method
• Gradient descent methods are the engine that often drives the learning
process for function approximators in reinforcement learning.
• Function approximators are the tools we use to represent complex value
functions or policies when we can't use simple tables.
• These tools have adjustable parameters ("knobs"); examples include linear functions, neural networks, and decision trees.
• Gradient descent methods are the algorithms that tell us how to adjust those
parameters in the right direction to improve the function approximator.
• The "right direction" is the one that reduces the error between the approximator's
output and the desired output (the target value or policy).
Gradient Descent Method (1)
• Gradient descent is an iterative optimization algorithm used to find the
minimum of a function.
• In the context of machine learning and deep learning this function is
typically a loss function or a cost function that measures the error
between our model's predictions and the actual values.
• The goal is to find the set of parameters for our model that minimizes
this error.
Gradient Descent Method (2)
• Let J(θ) be the cost function we want to minimize, where θ represents
the vector of parameters of our model (e.g., weights in a neural
network, coefficients in a linear function).
• The gradient of J(θ) with respect to θ, denoted as ∇θJ(θ), is a vector
of partial derivatives, where each element indicates the rate of
change of the cost function with respect to a particular parameter.
• The update rule for gradient descent is:
  θ_{k+1} = θ_k − α ∇_θ J(θ_k)
where:
• θ_k is the vector of parameters at the k-th iteration.
• α (alpha) is the learning rate, a positive scalar that determines the step size taken in the direction of the negative gradient.
• ∇_θ J(θ_k) is the gradient of the cost function evaluated at the current parameter vector θ_k. The negative sign indicates that we are moving in the direction that decreases the cost function.
• θ_{k+1} is the updated parameter vector for the next iteration.
Gradient Descent
• Start with an initial guess for the parameter vector θ_0. This can be random or based on some heuristic.
• Compute the gradient of the cost function at the current parameter vector, ∇_θ J(θ_k). This involves calculating the partial derivative of the cost function with respect to each parameter.
• Update the parameter vector using the gradient descent update rule: θ_{k+1} = θ_k − α ∇_θ J(θ_k).
• Repeat steps 2 and 3 until a termination condition is met.
• Common termination conditions include:
  • A sufficiently small value for the cost function.
  • A sufficiently small change in the parameter vector between iterations.
  • Reaching a maximum number of iterations.
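A minimal Python sketch of this loop (assuming the gradient of the cost function is available as a callable; the quadratic example cost is purely illustrative):

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Iterate theta_{k+1} = theta_k - alpha * grad_J(theta_k) until convergence."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        theta_new = theta - alpha * grad_J(theta)
        if np.linalg.norm(theta_new - theta) < tol:  # small parameter change -> stop
            break
        theta = theta_new
    return theta_new

# Example: J(theta) = ||theta - c||^2 has gradient 2 * (theta - c), minimum at c.
c = np.array([3.0, -1.0])
print(gradient_descent(lambda th: 2 * (th - c), theta0=np.zeros(2)))  # approx [3, -1]
```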
Types of Gradient Descent
• Batch Gradient Descent: Full data for stable but slow updates
• Stochastic Gradient Descent (SGD): Single data point for fast but
noisy updates.
• Mini-batch Gradient Descent: Small subset for a balance of both.
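The three variants differ only in how much data is used per update; here is a small sketch, with a least-squares cost chosen purely for illustration:

```python
import numpy as np

def gd_step(theta, X, y, alpha, batch_size=None):
    """One update on J(theta) = mean((X @ theta - y)^2).

    batch_size=None         -> batch gradient descent (all data)
    batch_size=1            -> stochastic gradient descent (one sample)
    1 < batch_size < len(y) -> mini-batch gradient descent
    """
    n = len(y)
    idx = np.arange(n) if batch_size is None else np.random.choice(n, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)
    return theta - alpha * grad
```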
Gradient Descent

Benefits
• Simplicity: The core concept is relatively easy to understand and implement.
• Scalability (especially SGD and mini-batch GD): Can be applied to large datasets.
• Generality: Can be used to optimize a wide range of differentiable cost functions.
Challenges / Limitations
• Sensitivity to Learning Rate: Choosing an appropriate learning rate is crucial and can be difficult.
• Local Minima: For non-convex cost functions (common in deep learning), gradient descent can get stuck in local minima, which are not the global optimum.
• Plateaus and Saddle Points: The gradient can be very small in flat regions (plateaus) or at saddle points, slowing down or halting the optimization process.
• Uneven Gradients: If the cost function has very different sensitivities to different parameters, the optimization can be inefficient (e.g., taking large steps in some directions and very small steps in others).
Advanced Gradient Descent Methods:
• Momentum: Helps accelerate gradient descent in the relevant direction and dampens
oscillations by adding a fraction of the previous update vector to the current update.
• AdaGrad (Adaptive Gradient Algorithm): Adapts the learning rate for each parameter
based on the historical sum of squared gradients. Parameters that have received large
gradients in the past get smaller learning rates, and vice versa.
• RMSprop (Root Mean Square Propagation): Similar to AdaGrad but uses a moving
average of squared gradients, which helps to mitigate AdaGrad's issue of the learning rate
becoming too small too quickly.
• Adam (Adaptive Moment Estimation): Combines the ideas of momentum and
RMSprop. It computes adaptive learning rates for each parameter based on estimates of
both the first and second moments of the gradients. Adam is a very popular and often
effective optimization algorithm.
• Learning Rate Scheduling: Gradually reduces the learning rate during training, often
based on the number of epochs or the progress of the optimization. This can help to fine-
tune the parameters and avoid overshooting the minimum.
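As a rough illustration of how Adam combines momentum and RMSprop, here is a minimal NumPy sketch of a single Adam update; the default hyperparameters shown are the commonly used values, assumed rather than taken from the slides.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-like first moment plus RMSprop-like second moment."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```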
Linear parameterization
• Linear parameterization refers to representing a function, such as a value
function or a policy in reinforcement learning, as a linear combination of features.
• This means you take the input (e.g., the state), extract some relevant features, and multiply each feature by a weight (a parameter).
• The function's output is the sum of these weighted features.
Linear parameterization (1)
• If φ(s) = [φ_1(s), φ_2(s), ..., φ_n(s)]^T is a vector of features extracted from a state s,
and θ = [θ_1, θ_2, ..., θ_n]^T is a vector of parameters (weights), then a
linearly parameterized value function is:
  V_θ(s) = θ^T φ(s) = Σ_i θ_i φ_i(s)
• A linearly parameterized action-value function Q_θ(s,a) using features φ(s,a) is:
  Q_θ(s,a) = θ^T φ(s,a)
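A tiny numeric sketch (the feature values and weights below are made up for illustration):

```python
import numpy as np

phi_s = np.array([1.0, 0.5, -2.0])   # hypothetical features phi(s): bias + two state features
theta = np.array([0.2, 1.0, 0.1])    # learned weights

v_s = theta @ phi_s                  # V_theta(s) = sum_i theta_i * phi_i(s)
print(v_s)                           # 0.2*1.0 + 1.0*0.5 + 0.1*(-2.0) = 0.5
```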
Linear parameterization (2)

Advantages
• Simplicity: Easier to understand and implement compared to non-linear methods like neural networks.
• Theoretical Guarantees: Some learning algorithms with linear parameterization have convergence guarantees.
• Computational Efficiency: Updates to the parameters are often computationally less expensive than with complex non-linear models.
Disadvantages
• Limited Representational Power: Can only represent functions that are linear in the chosen features. If the underlying relationships are highly non-linear and the features are not well designed, the approximation can be poor.
• Feature Engineering: The performance heavily relies on the quality and relevance of the hand-engineered features. This can be a time-consuming and domain-specific process.
Policy gradient with function approximation
• Policy gradient methods are a class of reinforcement learning
algorithms that directly learn a parameterized policy, π_θ(a|s), which
represents the probability of taking action a in state s.
• Instead of learning a value function and then deriving a policy from it
(as in value-based methods), policy gradient methods aim to find the
optimal policy by directly optimizing its parameters θ.
• When the state and/or action spaces are large or continuous, we
need to use function approximation to represent the policy.
• This means that the policy π_θ(a|s) is parameterized by θ, which could
be the weights of a neural network, the coefficients of a linear
function of features, or another type of function approximator.
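One common concrete choice (shown here as an illustrative sketch, not the slides' specific method) is a softmax policy over linear features; its score function ∇_θ log π_θ(a|s) is what a policy gradient update such as REINFORCE multiplies by the observed return.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a))."""
    prefs = np.array([theta @ phi(s, a) for a in actions])
    prefs -= prefs.max()                       # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def grad_log_pi(theta, phi, s, a_idx, actions):
    """Score function: grad_theta log pi(a|s) = phi(s, a) - sum_b pi(b|s) * phi(s, b)."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, actions[a_idx]) - expected_phi

# REINFORCE-style update with an observed return G_t:
# theta += alpha * G_t * grad_log_pi(theta, phi, s_t, a_t_index, actions)
```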
Policy gradient with function
approximation (1)
Discuss the advantages and disadvantages of this approach.
Introduction of Deep Learning in
Reinforcement Learning
• Deep Learning (DL) revolutionized RL by providing powerful function
approximators capable of learning intricate patterns directly from raw,
high-dimensional data.
• Deep Neural Networks (DNNs), with their ability to automatically
learn hierarchical representations through multiple layers, can
effectively map complex states to value estimates or action
probabilities.
Deep Learning on RL
• Handling High-Dimensional State Spaces: DNNs can process raw
sensory inputs like images, audio, and text, eliminating the need for
manual feature engineering.
• Learning Complex Policies and Value Functions: The non-linear nature
of DNNs allows for the representation of highly complex relationships
between states, actions, and rewards.
• End-to-End Learning: In many cases, DL enables end-to-end learning,
where the RL agent learns directly from the environment's raw
observations to its actions.
• Breakthroughs in Complex Domains: DL has been instrumental in
achieving human-level or superhuman performance in challenging
domains like Atari games, Go, and complex robotic tasks.
Deep Learning Training Workflow
• Data Collection/Generation:
• Supervised Learning: Gathering labeled data (input-output pairs).
• Reinforcement Learning: Interacting with the environment to collect
experiences (state, action, reward, next state). This is often done through
exploration.
• Data Preprocessing:
• Cleaning, normalizing, and transforming the collected data into a format
suitable for the neural network.
• Examples in RL: Scaling rewards, normalizing state features, creating frame
stacks from game screens.
Deep Learning Training Workflow (1)
• Model Architecture Definition:
• Choosing the type of neural network (e.g., Convolutional Neural Network
(CNN) for image data, Recurrent Neural Network (RNN) for sequential data,
Feedforward Network for simpler cases).
• Designing the network's structure: number of layers, types of activation
functions, number of neurons per layer, etc.
• Loss Function Definition:
• Specifying a function that quantifies the error between the model's
predictions and the target values.
• RL Examples: Mean Squared Error (MSE) for value function approximation
(e.g., TD error), policy gradient loss for policy optimization, cross-entropy loss
for discrete action policies.
Deep Learning Training Workflow (2)
• Optimizer Selection:
• Choosing an algorithm to update the model's parameters (weights and biases)
to minimize the loss function.
• Common optimizers include Stochastic Gradient Descent (SGD), Adam,
RMSprop, and their variants.
• Training the Model:
• Feeding the preprocessed data (or experiences in RL) to the network.
• Calculating the loss based on the network's output and the target values.
• Computing the gradients of the loss with respect to the network's parameters
using backpropagation.
• Updating the parameters using the chosen optimizer.
• Repeating these steps for multiple iterations (epochs or steps) until the model
converges or reaches a satisfactory performance level.
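Putting the workflow steps above together, a minimal supervised-style training loop might look like this sketch (assuming PyTorch and synthetic data in place of real preprocessed experiences):

```python
import torch
import torch.nn as nn

# Synthetic stand-in for preprocessed data: 256 samples, 4 input features, 1 target
X, y = torch.randn(256, 4), torch.randn(256, 1)

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # architecture
loss_fn = nn.MSELoss()                                                # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)             # optimizer

for epoch in range(100):          # training loop
    pred = model(X)               # forward pass
    loss = loss_fn(pred, y)       # error between predictions and targets
    optimizer.zero_grad()
    loss.backward()               # backpropagation computes gradients
    optimizer.step()              # optimizer updates the parameters
```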
Deep Learning Training Workflow (3)
• Evaluation:
• Assessing the trained model's performance on unseen data (or by evaluating
the learned policy in the environment).
• Using appropriate metrics to measure performance (e.g., accuracy, average
reward, success rate).
• Hyperparameter Tuning:
• Adjusting the learning rate, network architecture, batch size, and other
parameters to optimize the model's performance.
• This often involves experimentation and techniques like grid search or
random search.
• Deployment (for practical applications):
• Integrating the trained model into the desired system or application.
Categories of Deep Learning
• Deep learning models can be broadly categorized based on their
architecture and the type of data they are designed to process:
• Feedforward Neural Networks (FNNs) / Multi-Layer Perceptrons (MLPs):
• Information flows in one direction, from input to output layers, without any loops.
• Suitable for processing static data where the order of information doesn't matter
significantly.
• Can be used in RL for approximating value functions or policies based on the current
state.
• Convolutional Neural Networks (CNNs):
  • Designed for processing grid-like data, such as images. They use convolutional layers to automatically learn spatial hierarchies of features.
  • Crucial for RL tasks with visual inputs, like playing Atari games or controlling robots with cameras.
Categories of Deep Learning (1)
• Recurrent Neural Networks (RNNs):
• Designed for processing sequential data (e.g., time series, text). They have
feedback connections that allow them to maintain a memory of past inputs.
• Useful in RL for tasks with partial observability or when the history of states
and actions is important for decision-making. LSTMs and GRUs are popular
types of RNNs that address the vanishing gradient problem.
• Autoencoders:
• Unsupervised learning models that aim to learn a compressed representation
(encoding) of the input data and then reconstruct the original input from this
representation (decoding).
• Can be used in RL for dimensionality reduction or learning useful state
representations
Categories of Deep Learning(2)
• Generative Adversarial Networks (GANs):
• Consist of two networks, a generator and a discriminator, that compete with each
other.
• The generator tries to create realistic data samples, while the discriminator tries to
distinguish between real and generated data.
• Can be used in RL for tasks like generating realistic environments for training or
learning robust policies.
• Transformer Networks:
• Primarily designed for sequence-to-sequence tasks, especially in natural language
processing.
• They rely on attention mechanisms to weigh the importance of different parts of the
input sequence.
• Increasingly being explored in RL for tasks involving sequential decision-making and
processing complex histories.
Deep Q-Network (DQN)
• Deep Q-Network (DQN) is a seminal deep reinforcement learning
algorithm that combines Q-learning (a value-based RL method) with
deep neural networks to handle high-dimensional state spaces,
particularly visual inputs.
• The Idea behind the DQN:
• Q-function Approximation: Instead of using a Q-table, DQN uses a deep
neural network Q(s,a;θ) to approximate the action-value function Q(s,a),
where θ are the weights of the neural network. The network takes the state s
as input and outputs the Q-values for all possible actions a in that state.
Deep Q-Network (DQN) (1)

• Experience Replay:
• To improve data efficiency and break the correlation between consecutive
experiences, DQN stores the agent's experiences (state s_t, action a_t, reward r_{t+1}, next
state s_{t+1}) in a replay buffer.
• During training, a mini-batch of experiences is randomly sampled from this buffer to
update the Q-network.
• Target Network:
  • To stabilize training, DQN uses two separate Q-networks:
    • Online Q-network Q(s,a;θ): the network whose weights θ are being updated during training.
    • Target network Q(s,a;θ⁻): a copy of the Q-network whose weights θ⁻ are periodically updated (e.g., every N steps) with the weights of the online network. The target network is used to compute the target Q-values in the Bellman equation, which helps to reduce oscillations and improve stability.
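A condensed sketch of these ideas (assuming PyTorch; the buffer layout and tensor shapes are illustrative, not a full DQN implementation):

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        s, a, r, s_next, done = zip(*random.sample(self.buffer, batch_size))
        to_t = lambda x, dt=torch.float32: torch.as_tensor(np.array(x), dtype=dt)
        return to_t(s), to_t(a, torch.int64), to_t(r), to_t(s_next), to_t(done)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s_t, a_t; theta)
    with torch.no_grad():                                      # target uses frozen theta^-
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# Periodically copy the online weights into the target network:
# target_net.load_state_dict(q_net.state_dict())
```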
Ways of Improving Deep Q-Network
• Double DQN (DDQN):
• Addresses the overestimation bias in Q-learning by decoupling action selection from action evaluation in the target Q-value calculation.
• Instead of max_{a'} Q(s', a'; θ⁻), DDQN uses Q(s', argmax_{a'} Q(s', a'; θ); θ⁻) (see the target-computation sketch after this list).
• Prioritized Experience Replay:
• Instead of sampling experiences uniformly from the replay buffer, prioritize
sampling transitions with higher temporal difference (TD) errors.
• This allows the agent to learn more efficiently from the most surprising or
important experiences.
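A sketch of the Double DQN target computation referenced above (assuming PyTorch tensors shaped as in the earlier DQN sketch):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Select the greedy action with the online network, evaluate it with the target network."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)          # argmax_a' Q(s', a'; theta)
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)    # Q(s', best_a; theta^-)
        return r + gamma * (1 - done) * q_eval
```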
Ways of Improving Deep Q-Network (1)
• Dueling DQN:
• Separates the Q-network into two streams: one estimating the state value V(s) and the other estimating the advantage A(s,a) of each action relative to the state value.
• The Q-value is then computed as Q(s,a) = V(s) + A(s,a) − mean_{a'} A(s,a') (or another aggregation function); a minimal network sketch follows this list.
• This can lead to better learning of the value function, especially in states where the value matters more than the effect of individual actions.
• Noisy Networks:
• Introduce stochasticity into the network's weights during exploration, rather
than relying solely on ϵ-greedy action selection.
• This can lead to more directed and efficient exploration.
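A minimal sketch of the dueling network head mentioned above (assuming PyTorch; layer sizes are arbitrary):

```python
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared trunk, then separate V(s) and A(s, a) streams combined into Q(s, a)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return v + a - a.mean(dim=1, keepdim=True)
```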
Ways of Improving Deep Q-Network (2)
• Distributional DQN (C51):
• Instead of predicting a single Q-value for each state-action pair, C51 learns
the distribution of returns.
• This richer representation of value can lead to more stable and effective
learning.
• Rainbow:
• Combines several of these improvements (e.g., Double DQN, Prioritized
Replay, Dueling Networks, Distributional DQN, Noisy Nets, and Multi-step
Returns) into a single powerful agent.
Ways of Improving Deep Q-Network (3)
• Multi-step Returns:
• Instead of using a one-step target, use discounted returns over n steps to
provide a more direct signal for the value update, reducing variance
compared to Monte Carlo methods and bias compared to one-step TD.
• Gradient Clipping:
• Helps stabilize training by preventing gradients from becoming too large
during backpropagation.
• Regularization Techniques:
• Techniques like dropout or weight decay can help prevent overfitting,
especially when dealing with complex neural networks.
Actor-Critic Algorithm
• Actor-Critic methods are a class of reinforcement learning algorithms
that combine the strengths of both value-based (critic) and policy-
based (actor) methods.
• Actor: The "actor" is responsible for learning the policy π θ​ (a∣s),
which determines the agent's actions in a given state. It aims to select
actions that maximize the expected reward.
• Critic: The "critic" is responsible for estimating the value function, V w​
(s) or Q w​ (s,a), which evaluates how good it is to be in a particular
state or to take a specific action in a state. The critic provides
feedback to the actor, telling it whether its actions are better or worse
than expected.
Working of Actor-Critics
• The actor proposes an action based on the current policy.
• The critic evaluates the chosen action (or the resulting state) and
provides a signal (e.g., the TD error) to the actor.
• The actor uses this signal to update its policy parameters, aiming to
take actions that lead to higher values as estimated by the critic.
• Simultaneously, the critic updates its value function parameters to
more accurately reflect the rewards received.
Advantages
• Can learn stochastic policies: Like policy-based methods, actor-critic
can handle environments where the optimal policy is stochastic.
• Lower variance than pure policy gradient: The critic provides a
baseline that can reduce the variance of the policy gradient
estimates, leading to more stable learning.
• Can learn online, single-step updates: Unlike Monte Carlo policy
gradient methods, actor-critic can update its parameters after each
step of interaction.
Disadvantages
• Can be more complex to implement and tune: Requires managing
two function approximators (actor and critic).
• Stability can still be an issue: The learning process can be unstable if
the actor and critic are not updated appropriately.
Actor-Critic algorithm
1. Initialize: Initialize the parameters for the actor (θ) and the critic (w). The actor represents the policy π_θ(a|s), and the critic represents the value function (e.g., V_w(s) or Q_w(s,a)).
2. For each episode (or step):
  • Observe the current state s_t.
  • The actor selects an action a_t according to the current policy π_θ(a_t|s_t).
  • Execute action a_t and observe the reward r_{t+1} and the next state s_{t+1}.
  • The critic evaluates the current state s_t (or the state-action pair (s_t, a_t)) and the next state s_{t+1} (or the next state-action pair (s_{t+1}, a_{t+1}) if using Q-values).
  • Calculate the TD error (δ_t):
    • For a V-based critic: δ_t = r_{t+1} + γ V_w(s_{t+1}) − V_w(s_t)
    • For a Q-based critic: δ_t = r_{t+1} + γ Q_w(s_{t+1}, a_{t+1}) − Q_w(s_t, a_t) (where a_{t+1} is chosen according to the actor's policy in s_{t+1})
Actor-Critic algorithm
  • Update the critic's parameters (w) to reduce the TD error:
    w ← w + α_w δ_t ∇_w V_w(s_t) (for a V-based critic), or w ← w + α_w δ_t ∇_w Q_w(s_t, a_t) (for a Q-based critic).
  • Update the actor's parameters (θ) in the direction that improves the policy, using the critic's evaluation (the TD error) as the signal:
    θ ← θ + α_θ δ_t ∇_θ log π_θ(a_t|s_t)
3. Repeat until convergence.
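A compact sketch of one online update of this algorithm with a linear V-critic and a linear softmax actor (all feature vectors, action probabilities, and step sizes are hypothetical inputs):

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, phi_s_next, r, a_idx, probs, phi_sa_all,
                      alpha_w=0.05, alpha_theta=0.01, gamma=0.99):
    """One step of the algorithm above with V_w(s) = w . phi(s) and a softmax actor.

    probs      : pi_theta(.|s) over the action set
    phi_sa_all : list of action feature vectors phi(s, a), one per action
    """
    # TD error: delta_t = r + gamma * V_w(s') - V_w(s)
    delta = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    # Critic update: w <- w + alpha_w * delta * grad_w V_w(s), with grad_w V_w(s) = phi(s)
    w = w + alpha_w * delta * phi_s
    # Actor update: theta <- theta + alpha_theta * delta * grad_theta log pi(a|s)
    grad_log_pi = phi_sa_all[a_idx] - sum(p * f for p, f in zip(probs, phi_sa_all))
    theta = theta + alpha_theta * delta * grad_log_pi
    return theta, w
```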
Deep Deterministic Policy Gradient (DDPG)
• Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy
actor-critic algorithm designed for continuous action spaces.
• It builds upon the Deterministic Policy Gradient (DPG) algorithm and
incorporates techniques from Deep Q-Networks (DQN) to enable
learning with function approximation in high-dimensional state and
action spaces.
• DDPG has been successful in various continuous control tasks, such as
robotics and autonomous driving simulations.
Features of DDPG
1. Deterministic Policy: The actor learns a deterministic policy μ_θ(s) that directly
maps states to a specific action, instead of a probability distribution over
actions. This is suitable for continuous action spaces where choosing a single,
precise action is often required.
2. Actor-Critic Architecture: DDPG uses two neural networks for the actor and
two for the critic:
  • Actor Network μ_θ(s): Parameterized by θ, it learns the deterministic policy.
  • Critic Network Q_w(s,a): Parameterized by w, it learns the action-value function. The critic takes both the state and the action as input and outputs the Q-value.
  • Target Actor Network μ_{θ'}(s): A delayed copy of the actor network, used to stabilize the learning of the target policy. Its weights θ' are updated slowly by tracking the main actor's weights.
  • Target Critic Network Q_{w'}(s,a): A delayed copy of the critic network, used to stabilize the learning of the target Q-values. Its weights w' are updated slowly by tracking the main critic's weights.
Features of DDPG
3. Experience Replay: Similar to DQN, DDPG uses a replay buffer to store and
sample past experiences, breaking correlations and improving data efficiency.
4. Soft Target Updates: The target network parameters are updated slowly by
taking a small step towards the main network parameters:
  θ' ← τ θ + (1 − τ) θ',  w' ← τ w + (1 − τ) w'
where τ is a small value (e.g., 0.001), ensuring slow and stable learning of the
target networks.
5. Exploration: Since the policy is deterministic, DDPG adds noise to the actor's
output during training to encourage exploration. A common approach is to use
Ornstein-Uhlenbeck (OU) noise, which generates temporally correlated noise.
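A small sketch of the soft target update and noisy exploration steps (assuming PyTorch modules for the actor and critic networks):

```python
import torch

def soft_update(target_net, main_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter pair."""
    with torch.no_grad():
        for p_t, p_m in zip(target_net.parameters(), main_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p_m)

# Exploration: act with the deterministic policy plus noise, e.g.
# action = actor(state) + noise.sample()   # noise could be OU noise or simple Gaussian noise
```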
Case study on AlphaGo by Google
DeepMind
The Challenge: The Game of Go
• Go is an ancient Chinese board game renowned for its vast search
space and the complexity of strategic thinking required.
• With around 10^170 possible board positions, it is far more complex
than chess (about 10^47). Brute-force search, which worked for Deep Blue in
chess, was infeasible for Go.
• Expert Go players rely heavily on intuition, pattern recognition, and
strategic understanding, which were thought to be difficult for AI to
replicate.
• For more information, refer to your notes; this was discussed in detail in class.
