Introduction to AI Concepts
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
AI is one of the newest disciplines, formally initiated in 1956 when the name was coined. However, the study of
intelligence is one of the oldest disciplines, going back roughly 2000 years. The advent of computers made it
possible for the first time for people to test the models they proposed for learning, reasoning, perceiving, and so on.
Artificial Intelligence is composed of two words, Artificial and Intelligence, where Artificial means "man-made"
and Intelligence means "thinking power"; hence AI means "a man-made thinking power."
ACTING HUMANLY: THE TURING TEST
A human interrogates the program and another human via a terminal simultaneously. If, after a reasonable
period, the interrogator cannot tell which is which, the program passes. To pass such a test, a program would need capabilities such as:
Machine learning
Computer vision
Robotics
THINKING HUMANLY
This requires "getting inside" the human mind to see how it works and then comparing our computer
programs' behaviour against human behaviour.
Another way to do this is to observe a human solving problems and argue that one's programs go about
solving them in a similar way.
EXAMPLE:
GPS (General Problem Solver) was an early computer program that attempted to model human thinking.
The developers were not so much interested in whether or not GPS solved problems correctly.
They were more interested in showing that it solved problems like people, going through the same steps
and taking around the same amount of time to perform those steps.
THINKING RATIONALLY
Aristotle was one of the first to attempt to codify "thinking". His syllogisms provided patterns of
argument structure that always give correct conclusions, given correct premises.
EXAMPLE: All computers use energy. Using energy always generates heat. Therefore, all computers generate heat.
This initiated the field of logic. Formal logic was developed in the late nineteenth century. This was the first step toward automated reasoning.
By 1965, programs existed that could, given enough time and memory, take a description of the problem in logical notation
and find the solution, if one existed. The logicist tradition in AI hopes to build on such programs to create intelligence.
There are two main obstacles to this approach: First, it is difficult to take informal knowledge and state it in the formal terms required by logical notation.
Second, there is a big difference between being able to solve a problem in principle and doing so in practice.
ACTING RATIONALLY / THE RATIONAL AGENT APPROACH
Acting rationally means acting so as to achieve one's goals, given one's beliefs. An agent is just something
that perceives and acts.
In the logical approach to AI, the emphasis is on correct inferences. This is often part of being a rational agent because
one way to act rationally is to reason logically and then act on one's conclusions. But this is not all of rationality because
agents often find themselves in situations where there is no provably correct thing to do, yet they must do something.
There are also ways to act rationally that do not seem to involve inference, e.g., reflex actions.
The study of AI as rational agent design has two advantages:
1. It is more general than the logical approach because correct inference is only a useful mechanism for achieving
rationality, not a necessary one.
2. It is more amenable to scientific development than approaches based on human behaviour or human thought
because a standard of rationality can be defined independent of humans.
Achieving perfect rationality in complex environments is not possible because the computational demands are too high.
However, we will study perfect rationality as a starting place.
FOUNDATIONS OF AI
Like any history, this one is forced to concentrate on a small number of people, events, and ideas and to ignore others
that also were important. We organize the history around a series of questions. We certainly would not wish to give the
impression that these questions are the only ones the disciplines address or that the disciplines have all been working
toward AI as their ultimate fruition.
1. PHILOSOPHY
Can formal rules be used to draw valid conclusions?
How does the mind arise from a physical brain?
Where does knowledge come from?
How does knowledge lead to action?
Aristotle (384–322 B.C.), was the first to formulate a precise set of laws governing the rational part of the mind. He
developed an informal system of syllogisms for proper reasoning, which in principle allowed one to generate conclusions
mechanically, given initial premises.
Much later, Ramon Lull (d. 1315) had the idea that useful reasoning could actually be carried out by a mechanical artifact.
Thomas Hobbes (1588–1679) proposed that reasoning was like numerical computation that “we add and subtract in our
silent thoughts.”
2 MATHEMATICS
Philosophers staked out most of the important ideas of AI, but to move to a formal science required a level of mathematical formalization of logic and computation.
Mathematicians have proved that there exists an algorithm to prove any true statement in first-order logic.
However, if one adds the principle of induction required to capture the semantics of the natural numbers, then
this is no longer the case. Specifically, the incompleteness theorem showed that in any language expressive
enough to describe the properties of the natural numbers, there are true statements that are undecidable: their
truth cannot be established by any algorithm.
4 NEUROSCIENCE
How do brains process information?
Neuroscience is the study of the nervous system, particularly the brain. Although the exact way in which the
brain enables thought is one of the great mysteries of science, the fact that it does enable thought has been
appreciated for thousands of years because of the evidence that strong blows to the head can lead to mental
incapacitation.
It has also long been known that human brains are somehow different; in about 335 B.C. Aristotle wrote, “Of
all the animals, man has the largest brain in proportion to his size.”5 Still, it was not until the middle of the
18th century that the brain was widely recognized as the seat of consciousness. Before then, candidate
locations included the heart and the spleen.
The parts of a nerve cell or neuron. Each neuron consists of a cell body, or soma, that contains a cell nucleus.
Branching out from the cell body are a number of fibers called dendrites and a single long fiber called the axon.
The axon stretches out for a long distance, much longer than the scale in this diagram indicates.
5 PSYCHOLOGY
How do humans and animals think and act?
The principal characteristic of cognitive psychology is that the brain possesses and processes information. The
claim is that beliefs, goals, and reasoning steps can be useful components of a theory of human behaviour. The
knowledge-based agent has three key steps: (1) the stimulus is translated into an internal representation, (2) the representation is manipulated by cognitive processes to derive new internal representations, and (3) these are in turn translated back into action.
6 COMPUTER ENGINEERING
How can we build an efficient computer?
For artificial intelligence to succeed, we need two things: intelligence and an artifact. The computer has
been the artifact of choice. The modern digital electronic computer was invented independently and
almost simultaneously by scientists in three countries embattled in World War II.
7 CONTROL THEORY AND CYBERNETICS
How can artifacts operate under their own control?
8. LINGUISTICS
Having a theory of how humans successfully process natural language is an AI-complete problem - if we
could solve this problem then we would have created a model of intelligence.
Much of the early work in knowledge representation was done in support of programs that attempted natural
language understanding.
HISTORY OF ARTIFICIAL INTELLIGENCE
Artificial Intelligence is not a new term and not a new technology for researchers. The idea is much
older than you might imagine; there are even myths of mechanical men in ancient Greek and Egyptian
mythology. The following are some milestones in the history of AI, outlining the journey from the birth of AI to
its development to date.
Maturation of Artificial Intelligence (1943-1952)
Year 1943: The first work which is now recognized as AI was done by Warren McCulloch and
Walter Pitts, who proposed a model of artificial neurons.
Year 1949: Donald Hebb demonstrated an updating rule for modifying the connection strength
between neurons, now known as Hebbian learning.
Year 1950: Alan Turing, an English mathematician who pioneered machine learning, published
"Computing Machinery and Intelligence," in which he proposed a test. The test can check a machine's
ability to exhibit intelligent behaviour equivalent to that of a human.
Year 1955: Allen Newell and Herbert A. Simon created the "first artificial intelligence program,"
named the Logic Theorist. It proved 38 of 52 mathematics theorems and found new and more elegant proofs for some of them.
Year 1956: The term "Artificial Intelligence" was first adopted by American computer scientist
John McCarthy at the Dartmouth Conference. For the first time, AI was coined as an academic field.
At that time, high-level computer languages such as FORTRAN, LISP, and COBOL were
invented, and enthusiasm for AI was very high.
The golden years-Early enthusiasm (1956-1974)
Year 1966: The researchers emphasized developing algorithms which could solve mathematical
problems. Joseph Weizenbaum created the first chatbot, named ELIZA, in 1966.
Year 1972: The first intelligent humanoid robot, named WABOT-1, was built in Japan.
The first AI winter (1974-1980)
The duration between 1974 and 1980 was the first AI winter. An AI winter refers to a
period when computer scientists dealt with a severe shortage of funding from governments for AI research.
Year 1980: After the AI winter, AI came back with "Expert Systems," programs designed to
emulate the decision-making ability of a human expert.
In the year 1980, the first national conference of the American Association of Artificial
Intelligence (AAAI) was held at Stanford University.
The second AI winter (1987-1993)
The duration between the years 1987 and 1993 was the second AI winter.
Investors and governments again stopped funding AI research due to high costs and limited results,
even though some expert systems, such as XCON, had been very cost effective.
The emergence of intelligent agents (1993-2011)
Year 1997: IBM's Deep Blue beat world chess champion Garry Kasparov and became the first
computer to defeat a reigning world chess champion.
Year 2002: For the first time, AI entered the home in the form of Roomba, a vacuum cleaner.
Year 2006: AI came into the business world. Companies like Facebook, Twitter, and Netflix also started using AI.
1. COMPETITIVE ADVANTAGE
Organizations that want a serious edge over their competitors are banking on AI technologies to acquire it.
Take the example of the Autopilot feature offered by Tesla in its vehicles. Tesla uses deep learning
algorithms to achieve autonomous driving. Earlier, this was only one feature among many, yet now it
is defining the brand.
2. ACCESSIBILITY
Hardware speed, availability, and sheer scale have enabled bolder computations to tackle
increasingly exciting problems. Not only is the hardware faster, augmented by specialized
processors (e.g., GPUs), it is also available in the form of cloud services.
What used to run in specialized labs with access to supercomputers can now be run in the cloud at a
lower cost. This has democratized access to the hardware platforms needed to run AI, enabling
new organizations to build on them.
3. FEAR OF MISSING OUT (FOMO)
No typo, you read that right! It is not just individuals; organizations also feel the fear of missing
out on a major opportunity. To stay competitive and not get pushed out of the market, they need to adapt
accordingly. This is done by investing in technologies that could disrupt their industries.
Take the example of the financial sector, where practically all the banks have invested heavily in chatbots
so that they will not miss the next wave of disruption.
4. COST-EFFECTIVENESS
As with all other technologies, AI is becoming more and more affordable over time. This has
made it feasible for many organizations that could not afford it in the past to adopt
these advances.
Organizations no longer face the same cost barrier to implementing AI.
5. FUTURE PROOF
One thing that we all need to understand is that a future built on AI is very secure, making it a future-proof investment.
WHY ARTIFICIAL INTELLIGENCE?
Before learning about Artificial Intelligence, we should know the importance of AI and why we should
learn it. Following are some main reasons to learn about AI:
With the help of AI, you can create such software or devices which can solve real-world
problems very easily and with accuracy such as health issues, marketing, traffic issues, etc.
With the help of AI, you can create your personal virtual Assistant, such as Cortana, Google
Assistant, Siri, etc.
With the help of AI, you can build such Robots which can work in an environment where
survival of humans can be at risk.
AI opens a path for other new technologies, new devices, and new Opportunities.
GOALS OF ARTIFICIAL INTELLIGENCE
Replicate human intelligence
Building a machine which can perform tasks that require human intelligence, such as:
Proving a theorem
Playing chess
APPLICATIONS OF ARTIFICIAL INTELLIGENCE
2. AI in Healthcare
In the last five to ten years, AI has become more advantageous for the healthcare industry and is going to
have a significant impact on this industry.
Healthcare industries are applying AI to make better and faster diagnoses than humans. AI can help
doctors with diagnoses and can warn when a patient's condition is worsening so that medical help can reach
the patient before hospitalization.
3. AI in Gaming
AI can be used for gaming purposes. AI machines can play strategic games like chess, where the
machine needs to think through a large number of possible positions.
4. AI in Finance
AI and finance industries are the best matches for each other. The finance industry is implementing
automation, chatbot, adaptive intelligence, algorithm trading, and machine learning into financial
processes.
5. AI in Data Security
The security of data is crucial for every company, and cyber-attacks are growing very rapidly in the
digital world. AI can be used to make your data more safe and secure. Examples such as the AEG
bot and the AI2 Platform are used to detect software bugs and cyber-attacks more effectively.
6. AI in Social Media
Social Media sites such as Facebook, Twitter, and Snapchat contain billions of user profiles, which
need to be stored and managed in a very efficient way. AI can organize and manage massive amounts
of data. AI can analyze lots of data to identify the latest trends, hashtag, and requirement of different
users.
7. AI in Travel & Transport
AI is in high demand in the travel industry. AI is capable of doing various travel-related
tasks, from making travel arrangements to suggesting hotels, flights, and the best routes to
customers. The travel industry is using AI-powered chatbots which can make human-like interactions
with customers for better and faster responses.
8. AI in Automotive Industry
Some automotive companies are using AI to provide a virtual assistant to their users for better
performance. For example, Tesla has introduced TeslaBot, an intelligent virtual assistant.
Various companies are currently working on developing self-driving cars which can make your journey
safer and more secure.
9. AI in Robotics:
Artificial Intelligence has a remarkable role in robotics. Usually, general robots are programmed to
perform some repetitive task, but with the help of AI, we can create intelligent robots
which can perform tasks based on their own experience without being pre-programmed.
Humanoid robots are the best examples of AI in robotics; recently, the intelligent humanoid robots named
Erica and Sophia have been developed, which can talk and behave like humans.
10. AI in Entertainment
We are currently using some AI based applications in our daily life with some entertainment
services such as Netflix or Amazon. With the help of ML/AI algorithms, these services show the
recommendations for programs or shows.
11. AI in Agriculture
Agriculture is an area which requires various resources, labour, money, and time for the best result. Nowadays
agriculture is becoming digital, and AI is emerging in this field. Agriculture is applying AI for
agricultural robotics, soil and crop monitoring, and predictive analysis. AI in agriculture can be very helpful
for farmers.
12. AI in E-commerce
AI is providing a competitive edge to the e-commerce industry and is increasingly in demand in
the e-commerce business. AI helps shoppers discover associated products in their recommended
size, colour, or even brand.
13. AI in education:
AI can automate grading so that the tutor can have more time to teach. AI chatbot can communicate
with students as a teaching assistant.
TYPES OF ARTIFICIAL INTELLIGENCE
The main aim of Artificial Intelligence is to enable machines to perform human-like functions.
Artificial Intelligence can be divided into various types; there are mainly two kinds of
categorization, one based on capabilities and the other based on functionality of AI.
BASED ON FUNCTIONALITY
1. REACTIVE MACHINES
Purely reactive machines are the most basic types of Artificial Intelligence.
Such AI systems do not store memories or past experiences for future actions.
These machines only focus on current scenarios and react on it as per possible best action.
IBM's Deep Blue system is an example of reactive machines.
Google's AlphaGo is also an example of reactive machines.
2. LIMITED MEMORY
Limited memory machines can store past experiences or some data for a short period of time.
These machines can use stored data for a limited time period only.
Self-driving cars are one of the best examples of Limited Memory systems. These cars can
store recent speed of nearby cars, the distance of other cars, speed limit, and other information to
navigate the road.
3. THEORY OF MIND
Theory of Mind AI should understand human emotions, people, and beliefs, and be able to
interact socially like humans.
This type of AI machine has still not been developed, but researchers are making lots of efforts and
improvements toward developing such machines.
4. SELF-AWARENESS
Self-awareness AI is the future of Artificial Intelligence. These machines will be super
intelligent, and will have their own consciousness, sentiments, and self-awareness.
BASED ON CAPABILITIES: GENERAL AI
General AI is a type of intelligence which could perform any intellectual task with efficiency
like a human.
The idea behind general AI is to make a system which could be smarter and think like a
human on its own.
Currently, there is no such system in existence which could come under general AI and perform
any task as perfectly as a human.
Researchers worldwide are now focused on developing machines with General AI.
As systems with general AI are still under research, it will take lots of effort and time to
develop such systems.
Machines, in contrast to people, do not need to take rests; they can work nonstop.
We can now depend on machines to keep required manufacturing units running with their own
judgment, which would lead to 24×7 production units and complete automation.
THE STATE OF ART OR WHAT CAN AI DO TODAY?
ROBOTIC VEHICLES: A driverless robotic car named STANLEY sped through the rough
terrain of the Mojave Desert at 22 mph, finishing the 132-mile course first to win the 2005 DARPA
Grand Challenge.
STANLEY is a Volkswagen Touareg outfitted with cameras, radar, and laser rangefinders to sense
the environment and onboard software to command the steering, braking, and acceleration (Thrun,
2006).
The following year CMU’s BOSS won the Urban Challenge, safely driving in traffic through the
streets of a closed Air Force base, obeying traffic rules and avoiding pedestrians and other vehicles.
SPEECH RECOGNITION: A traveller calling United Airlines to book a flight can have the entire
conversation guided by an automated speech recognition and dialog management system.
AUTONOMOUS PLANNING AND SCHEDULING: NASA's REMOTE AGENT generated plans from high-level goals specified from the ground and monitored
the execution of those plans, detecting, diagnosing, and recovering from problems as they
occurred.
Successor program MAPGEN (Al-Chang et al., 2004) plans the daily operations for NASA’s
Mars Exploration Rovers, and MEXAR2 (Cesta et al., 2007) did mission planning—both logistics
and science planning—for the European Space Agency’s Mars Express mission in 2008.
GAME PLAYING: IBM’s DEEP BLUE became the
first computer program to defeat the world champion in
a chess match when it bested Garry Kasparov by a score
of 3.5 to 2.5 in an exhibition match (Goodman and
Keene, 1997). Kasparov said that he felt a “new kind of
intelligence” across the board from him. Newsweek
magazine described the match as “The brain’s last
stand.” The value of IBM’s stock increased by $18
billion. Human champions studied Kasparov’s loss and
were able to draw a few matches in subsequent years,
but the most recent human-computer matches have been
won convincingly by the computer.
SPAM FIGHTING: Each day, learning algorithms classify over a billion messages as spam,
saving the recipient from having to waste time deleting what, for many users, could comprise 80%
or 90% of all messages, if not classified away by algorithms. Because the spammers are continually
updating their tactics, it is difficult for a static programmed approach to keep up, and learning
algorithms work best (Sahami et al., 1998; Goodman and Heckerman, 2004).
LOGISTICS PLANNING: During the Persian Gulf crisis of 1991, U.S. forces deployed a Dynamic
Analysis and Replanning Tool, DART (Cross and Walker, 1994), to do automated logistics planning
and scheduling for transportation. This involved up to 50,000 vehicles, cargo, and people at a time,
and had to account for starting points, destinations, routes, and conflict resolution among all
parameters. The AI planning techniques generated in hours a plan that would have taken weeks with
older methods. The Defense Advanced Research Projects Agency (DARPA) stated that this single
application more than paid back DARPA's 30-year investment in AI.
ROBOTICS: The iRobot Corporation has sold over two million Roomba robotic vacuum cleaners
for home use. The company also deploys the more rugged PackBot to Iraq and Afghanistan, where
it is used to handle hazardous materials, clear explosives, and identify the location of snipers.
MACHINE TRANSLATION: A computer program automatically translates from Arabic to
English, allowing an English speaker to see the headline "Ardogan Confirms That Turkey Would
Not Accept Any Pressure, Urging Them to Recognize Cyprus." The program uses a statistical
model built from examples of Arabic-to-English translations and from examples of English text
totalling two trillion words (Brants et al., 2007). None of the computer scientists on the team speak
Arabic, but they do understand statistics and machine learning algorithms.
AGENTS AND ENVIRONMENTS
Agents in Artificial Intelligence are the core concepts that AI technologies work upon.
AI software or AI-enabled devices with sensors generally capture information from
the environment and process the data for further actions.
There are mainly two ways the agents interact with the environment: perception and
action.
Perception is only a passive way of capturing information, without changing the actual
environment, whereas action is the active form of interaction, changing the actual
environment.
AI technologies such as virtual assistant chatbots and AI-enabled devices work by processing and
learning from previous perception data to select their actions.
WHAT IS AN AGENT?
An Agent is anything that takes actions according to the information that it gains from the environment.
HUMAN-AGENT: A human agent has eyes, ears, and other organs which work as sensors, and
hands, legs, and a vocal tract which work as actuators.
ROBOTIC AGENT: A robotic agent can have cameras, infrared range finders, and NLP as sensors,
and various motors as actuators.
SOFTWARE AGENT: A software agent can have keystrokes and file contents as sensory input and
can act on those inputs by displaying output on the screen.
2. ACTION
Action is an active interaction where the environment is changed. When the robot moves
an obstacle using its arm, it is called an action as the environment is changed. The arm of the robot
is called an “Effector” as it performs the action.
SENSOR: Sensor is a device which detects the change in the environment and sends the
information to other electronic devices. An agent observes its environment through sensors.
ACTUATORS: Actuators are the component of machines that converts energy into motion. The
actuators are only responsible for moving and controlling a system. An actuator can be an electric
motor, gears, rails, etc.
EFFECTORS: Effectors are the devices which affect the environment. Effectors can be legs,
wheels, arms, fingers, wings, fins, and display screen.
CONSIDER A VACUUM CLEANER WORLD
Let's suppose that the world has just two rooms. The robot can be in either room and there can
be dirt in zero, one, or two rooms.
Goal formulation: intuitively, we want all the dirt cleaned up. Formally, the goal is {State 7, state 8}.
Problem formulation (Actions):Left, Right, Suck, NoOp
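To make this formulation concrete, here is a minimal Python sketch of the two-room vacuum world as a search problem. The state encoding (location plus a dirt flag per room), the action names, and the deterministic transition function are assumptions chosen for illustration; the goal test corresponds to the all-clean states 7 and 8.

# A minimal, illustrative formulation of the two-room vacuum world.
# A state is (location, dirt_in_A, dirt_in_B); location is 'A' or 'B'.

ACTIONS = ["Left", "Right", "Suck", "NoOp"]

def result(state, action):
    """Deterministic transition model for the vacuum world."""
    loc, dirt_a, dirt_b = state
    if action == "Left":
        return ("A", dirt_a, dirt_b)
    if action == "Right":
        return ("B", dirt_a, dirt_b)
    if action == "Suck":
        if loc == "A":
            return (loc, False, dirt_b)
        return (loc, dirt_a, False)
    return state  # NoOp

def goal_test(state):
    """The goal is 'all dirt cleaned up' (states 7 and 8 in the figure)."""
    _, dirt_a, dirt_b = state
    return not dirt_a and not dirt_b

# Example: start in room A with both rooms dirty.
s = ("A", True, True)
for a in ["Suck", "Right", "Suck"]:
    s = result(s, a)
print(goal_test(s))  # True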
An omniscient agent knows what impact the action will have and can act accordingly, but it is not
possible in reality.
The percept sequence is the entire sequence of perceptions made by the agent up to the present
moment.
Example: Car driver agent
Sensors: Speedometer, GPS, cameras, microphone
Actuators: Steering control, accelerator, brake, talk to passenger
Performance measure: Safe, legal, comfortable journey
Environment: Road, traffic, pedestrians, etc.
GOOD BEHAVIOUR: THE CONCEPT OF RATIONALITY
INTELLIGENT AGENTS:
An intelligent agent is an autonomous entity which acts upon an environment using sensors and
actuators to achieve its goals. An intelligent agent may learn from the environment to improve its performance.
NOTE: Rationality differs from Omniscience because an Omniscient agent knows the actual
outcome of its action and act accordingly, which is not possible in reality.
MAPPING OF PERCEPT SEQUENCES TO ACTIONS
When it is known that the action of agent depends completely on the perceptual history – the
percept sequence, then the agent can be described by using a mapping. Mapping is a list that maps the
percept sequence to the action. When we specify which action an agent should take corresponding to
the given percept sequence, we specify the design for an ideal agent.
AUTONOMY
The behaviour of an agent depends on its own experience as well as the built-in knowledge of the
agent instilled by the agent designer. A system is autonomous if it takes actions according to its
experience. So for the initial phase, as it does not have any experience, it is good to provide built-in
knowledge. The agent learns then through evolution. A truly autonomous intelligent agent should be
able to operate successfully in a wide variety of environments if given sufficient time to adapt.
TASK ENVIRONMENTS
To design a rational agent we need to specify a task environment
A problem specification for which the agent is a solution
PEAS Representation
PEAS is a type of model on which an AI agent works upon. When we define an AI agent or
rational agent, then we can group its properties under PEAS representation model. It is made up
of four words:
P: Performance measure
E: Environment
A: Actuators
S: Sensors
Here performance measure is the objective for the success of an agent's behaviour.
PEAS: SPECIFYING AN AUTOMATED TAXI DRIVER
Performance measure: ?
Environment: ?
Actuators: ?
Sensors: ?
Performance measure:
safe, fast, legal, comfortable, maximize profits
Environment:
roads, other traffic, pedestrians, customers
Actuators:
steering, accelerator, brake, signal, horn
Sensors:
cameras, sonar, speedometer, GPS
PEAS: MEDICAL DIAGNOSIS SYSTEM
Performance measure: healthy patient, minimized costs, avoided lawsuits
Environment: patient, hospital, staff
Actuators: screen display (questions, tests, diagnoses, treatments, referrals)
Sensors: keyboard (entry of symptoms, findings, patient's answers)
An environment in artificial intelligence is the surrounding of the agent. The agent takes
input from the environment through sensors and delivers the output to the environment
through actuators.
An environment is everything in the world which surrounds the agent, but it is not a part of
the agent itself.
The environment is where the agent lives and operates, and it provides the agent with something to sense
and act upon. Environments can be classified along the following dimensions:
1. Fully observable vs Partially observable
2. Static vs Dynamic
3. Discrete vs Continuous
4. Deterministic vs Stochastic
5. Single-agent vs Multi-agent
6. Episodic vs sequential
7. Known vs Unknown
8. Accessible vs Inaccessible
1. FULLY OBSERVABLE VS PARTIALLY OBSERVABLE:
If an agent sensor can sense or access the complete state of an environment at each point of time
then it is a fully observable environment, else it is partially observable.
A fully observable environment is easy as there is no need to maintain the internal state to keep
track history of the world.
If an agent has no sensors in any environment, then such an environment is called unobservable.
2. DETERMINISTIC VS STOCHASTIC:
If an agent's current state and selected action can completely determine the next state of the
environment, then such an environment is called a deterministic environment; otherwise it is a stochastic environment.
In a deterministic, fully observable environment, the agent does not need to worry about uncertainty.
3. EPISODIC VS SEQUENTIAL:
In an episodic environment, there is a series of one-shot actions, and only the current percept is
required for the action.
However, in Sequential environment, an agent requires memory of past actions to determine the
next best actions.
4. SINGLE-AGENT VS MULTI-AGENT
If only one agent is involved in an environment, and operating by itself then such an
environment is called single agent environment.
However, if multiple agents are operating in an environment, then such an environment is called
a multi-agent environment.
The agent design problems in the multi-agent environment are different from single agent
environment.
5. STATIC VS DYNAMIC:
If the environment can change itself while an agent is deliberating then such environment is
called a dynamic environment else it is called a static environment.
Static environments are easy to deal with because an agent does not need to keep looking at the
world while deciding on an action.
However, in a dynamic environment, agents need to keep looking at the world before each action.
6. DISCRETE VS CONTINUOUS
If an environment consists of a finite number of percepts and actions, then it is called a discrete environment; otherwise it is a continuous environment.
A chess game comes under a discrete environment, as there is a finite number of moves that can be
performed.
7. ACCESSIBLE VS INACCESSIBLE
If an agent can obtain complete and accurate information about the state's environment, then such
an environment is called an Accessible environment else it is called inaccessible.
An empty room whose state can be defined by its temperature is an example of an accessible
environment.
8. KNOWN VS UNKNOWN
Known and unknown are not actually features of an environment, but of the agent's state of knowledge.
In a known environment, the results of all actions are known to the agent, while in an unknown
environment the agent needs to learn how the environment works in order to make good decisions.
TYPES OF AGENTS
Agents can be grouped into five classes based on their degree of perceived intelligence and
capability. All these agents can improve their performance and generate better actions over time:
Simple reflex agent
Model-based reflex agent
Goal-based agents
Utility-based agent
Learning agent
1. SIMPLE REFLEX AGENT:
The Simple reflex agents are the simplest agents. These agents take decisions on the basis of the
current percepts and ignore the rest of the percept history.
These agents only succeed in the fully observable environment.
The Simple reflex agent does not consider any part of percepts history during their decision and
action process.
The Simple reflex agent works on Condition-action rule, which means it maps the current state to
action. Such as a Room Cleaner agent, it works only if there is dirt in the room.
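As an illustration, here is a minimal Python sketch of such a condition-action agent for the room-cleaner example; the percept format and the particular rules are assumptions made for this sketch.

def simple_reflex_vacuum_agent(percept):
    """Condition-action rules: decide only on the current percept,
    ignoring the percept history."""
    location, status = percept          # e.g. ('A', 'Dirty')
    if status == "Dirty":
        return "Suck"
    if location == "A":
        return "Right"
    return "Left"

print(simple_reflex_vacuum_agent(("A", "Dirty")))   # Suck
print(simple_reflex_vacuum_agent(("A", "Clean")))   # Right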
Problems for the simple reflex agent design approach:
They have very limited intelligence.
They do not have knowledge of non-perceptual parts of the current state.
The condition-action rules are mostly too big to generate and to store.
They are not adaptive to changes in the environment.
Schematic diagram of a simple reflex agent.
2. MODEL-BASED REFLEX AGENT
The Model-based agent can work in a partially observable environment, and track the
situation.
A model-based agent has two important factors:
Model: It is knowledge about "how things happen in the world," so it is called a
Model-based agent.
Internal State: It is a representation of the current state based on percept history.
These agents have the model, "which is knowledge of the world" and based on the model
they perform actions.
Updating the agent state requires information about:
How the world evolves
How the agent's action affects the world.
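A rough Python sketch of a model-based reflex agent for the vacuum world follows, assuming a very simple internal model (which rooms the agent believes are clean); the class name and percept format are illustrative assumptions.

class ModelBasedVacuumAgent:
    """Illustrative model-based reflex agent: it keeps an internal state
    (which rooms it believes are clean) so it can act sensibly in a
    partially observable version of the vacuum world."""

    def __init__(self):
        self.believed_clean = {"A": False, "B": False}   # internal state

    def update_state(self, percept):
        # Model of how the percept reveals part of the world.
        location, status = percept
        self.believed_clean[location] = (status == "Clean")

    def act(self, percept):
        self.update_state(percept)
        location, status = percept
        if status == "Dirty":
            return "Suck"
        if all(self.believed_clean.values()):
            return "NoOp"                 # model says everything is clean
        return "Right" if location == "A" else "Left"

agent = ModelBasedVacuumAgent()
print(agent.act(("A", "Clean")))  # Right: the model says B may still be dirty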
A model-based reflex agent.
3. GOAL-BASED AGENTS
The knowledge of the current state of the environment is not always sufficient to decide for an
agent what to do.
The agent needs to know its goal, which describes desirable situations.
Goal-based agents expand the capabilities of the model-based agent by having the "goal"
information.
These agents may have to consider a long sequence of possible actions before deciding
whether the goal is achieved or not. Such considerations of different scenarios are called
searching and planning, which makes an agent proactive.
4. UTILITY-BASED AGENTS
These agents are similar to the goal-based agent but provide an extra component of utility
measurement, giving a measure of success at a given state.
Utility-based agents act based not only on goals but also on the best way to achieve the goal.
The utility-based agent is useful when there are multiple possible alternatives and the agent
has to choose the best one.
The utility function maps each state to a real number to check how efficiently each action
achieves the goals.
5. LEARNING AGENT
A learning agent in AI is the type of agent which can learn from its past experiences, or it has
learning capabilities.
It starts acting with basic knowledge and is then able to act and adapt automatically through learning.
A learning agent has mainly four conceptual components, which are:
Learning element: It is responsible for making improvements by learning from the environment.
Critic: The learning element takes feedback from the critic, which describes how well the agent is
doing with respect to a fixed performance standard.
Performance element: It is responsible for selecting external actions.
Problem generator: This component is responsible for suggesting actions that will lead to new
and informative experiences.
Hence, learning agents are able to learn, analyze performance, and look for new ways to improve
the performance.
A general learning agent.
PROBLEM-SOLVING AGENT
The problem-solving agent performs precisely by defining problems and their several solutions.
According to psychology, "problem solving refers to a state where we wish to reach a
definite goal from a present state or condition."
According to computer science, problem solving is a part of artificial intelligence which
encompasses a number of techniques, such as algorithms and heuristics, to solve a problem.
Initial State: It is the starting state or initial step of the agent towards its goal.
Path cost: It assigns a numeric cost to each path that follows the goal. The problem-solving
agent selects a cost function, which reflects its performance measure. Remember, an optimal
solution has the lowest path cost among all the solutions.
Search: It identifies all the best possible sequences of actions to reach the goal state from the current
state.
Solution: It finds the best algorithm out of various algorithms, which may be proven as the best
optimal solution.
Execution: It executes the best optimal solution found by the searching algorithm to reach the goal
state.
NOTE: Initial state, actions, and transition model together define the state-space of the problem
implicitly. State-space of a problem is a set of all states which can be reached from the initial state
followed by any sequence of actions. The state-space forms a directed map or graph where nodes are
the states, links between the nodes are actions, and the path is a sequence of states connected by the
sequence of actions.
EXAMPLE PROBLEMS
Basically, there are two types of problem approaches:
Toy Problem: It is a concise and exact description of the problem which is used by the
researchers to compare the performance of algorithms.
Real-world Problem: It is real-world based problems which require solutions. Unlike a toy
problem, it does not depend on descriptions, but we can have a general formulation of the
problem.
8 Puzzle Problem: Here, we have a 3×3 matrix with movable tiles numbered from 1 to 8 and
a blank space. The tile adjacent to the blank space can slide into that space. The objective is to
reach the specified goal configuration, as shown in the figure below.
SOME TOY PROBLEMS
We also know the eight-puzzle problem by the name of N puzzle problem or sliding puzzle
problem.
An N-puzzle consists of N tiles (N+1 positions including the empty tile) where N can be 8, 15, 24, and so
on.
In our example N = 8 (that is, square root of (8+1) = 3 rows and 3 columns).
In the same way, if we have N = 15 or 24, then the numbers of rows and columns are as
follows: square root of (N+1) rows and square root of (N+1) columns.
That is, if N = 15 the number of rows and columns is 4, and if N = 24 the number of rows and
columns is 5.
So, basically, in these types of problems we are given an initial state or initial configuration
(start state) and a goal state or goal configuration.
Here We are solving a problem of 8 puzzle that is a 3x3 matrix.
The puzzle can be solved by moving the tiles one by one in the single empty space and thus
achieving the Goal state.
Rules of solving puzzle
Instead of moving the tiles in the empty space we can visualize moving the empty space in place
of the tile.
The empty space can only move in four directions (Movement of empty space)
Up Down Right or Left
The empty space cannot move diagonally and can take only one step at a time.
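These rules can be captured in a short Python successor function; the tuple representation (nine numbers read row by row, with 0 for the empty space) and the sample start state are assumptions made for illustration.

# Illustrative successor function for the 8-puzzle.
# A state is a tuple of 9 numbers read row by row; 0 denotes the empty space.

MOVES = {"Up": -3, "Down": 3, "Left": -1, "Right": 1}

def successors(state):
    """Return {action: new_state} for every legal move of the empty space."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    result = {}
    for action, delta in MOVES.items():
        if action == "Up" and row == 0:
            continue
        if action == "Down" and row == 2:
            continue
        if action == "Left" and col == 0:
            continue
        if action == "Right" and col == 2:
            continue
        target = blank + delta
        new_state = list(state)
        new_state[blank], new_state[target] = new_state[target], new_state[blank]
        result[action] = tuple(new_state)
    return result

start = (1, 2, 3, 4, 0, 5, 6, 7, 8)   # example start state (an assumption)
print(successors(start))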
8-QUEENS PROBLEM: The goal is to place eight queens on a chessboard so that no queen attacks any other. For this problem, there are two main kinds of formulation:
Incremental formulation: It starts from an empty state, and the operator adds a queen at
each step.
Complete-state formulation: It starts with all eight queens on the board and moves them around to remove the attacks.
REAL-WORLD PROBLEM: VLSI LAYOUT. The VLSI layout problem requires positioning components and connections on a chip; it is usually split into two parts:
Cell layout: Here, the primitive components of the circuit are grouped into cells, each
performing its specific function. Each cell has a fixed shape and size. The task is to place the
cells on the chip without overlapping each other.
Channel routing: It finds a specific route for each wire through the gaps between the cells.
Protein design: The objective is to find a sequence of amino acids which will fold into a 3D
protein having a property to cure some disease.
ROBOT NAVIGATION is a generalization of the route-finding problem described earlier. Rather than
following a discrete set of routes, a robot can move in a continuous space with (in principle) an infinite set of
possible actions and states. For a circular robot moving on a flat surface, the space is essentially two-
dimensional. When the robot has arms and legs or wheels that must also be controlled, the search space becomes
many-dimensional. Advanced techniques are required just to make the search space finite. We examine some of
these methods in Chapter 25. In addition to the complexity of the problem, real robots must also deal with errors
in their sensor readings and motor controls.
AUTOMATIC ASSEMBLY SEQUENCING of complex objects by a robot was first demonstrated
by FREDDY (Michie, 1972). Progress since then has been slow but sure, to the point where the assembly of intricate
objects such as electric motors is economically feasible. In assembly problems, the aim is to find an order in which to
assemble the parts of some object. If the wrong order is chosen, there will be no way to add some part later in the
sequence without undoing some of the work already done. Checking a step in the sequence for feasibility is a difficult
geometrical search problem closely related to robot navigation. Thus, the generation of legal actions is the expensive
part of assembly sequencing. Any practical algorithm must avoid exploring all but a tiny fraction of the state space.
Another important assembly problem is protein design.
SEARCHING FOR SOLUTIONS
INFRASTRUCTURE FOR SEARCH ALGORITHMS
Search algorithms require a data structure to keep track of the search tree that is being constructed.
For each node n of the tree, we have a structure that contains four components:
n. STATE: the state in the state space to which the node corresponds;
n. PARENT: the node in the search tree that generated this node;
n. ACTION: the action that was applied to the parent to generate the node;
n. PATH-COST: the cost, traditionally denoted by g(n), of the path from the initial state to the
node, as indicated by the parent pointers.
Completeness: It measures if the algorithm guarantees to find a solution (if any solution exists).
1. BREADTH-FIRST SEARCH
Breadth-first search is the most common search strategy for traversing a tree or graph.
The BFS algorithm starts searching from the root node of the tree and expands all successor nodes at
the current level before moving to the nodes of the next level.
ADVANTAGES:
BFS will provide a solution if any solution exists.
If there are more than one solution for a given problem, then BFS will provide the minimal solution
which requires the least number of steps.
DISADVANTAGES:
It requires lots of memory since each level of the tree must be saved into memory to expand the
next level.
BFS needs lots of time if the solution is far away from the root node.
Time Complexity: The time complexity of the BFS algorithm is given by the number of nodes
traversed in BFS until the shallowest goal node, where d = depth of the shallowest solution and b is the
branching factor: T(b) = 1 + b + b^2 + ... + b^d = O(b^d).
Space Complexity: The space complexity of the BFS algorithm is given by the memory size of the frontier,
which is O(b^d).
Completeness: BFS is complete, which means that if the shallowest goal node is at some finite depth,
BFS will find a solution.
Optimality: BFS is optimal if the path cost is a non-decreasing function of the depth of the node.
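A minimal Python sketch of BFS over an explicit graph follows; the example graph, node names, and path-reconstruction details are assumptions made for this illustration.

from collections import deque

def breadth_first_search(graph, start, goal):
    """BFS over an explicit graph given as {node: [successors]}.
    Returns the shallowest path from start to goal, or None."""
    if start == goal:
        return [start]
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        node = frontier.popleft()
        for child in graph.get(node, []):
            if child not in parent:          # not yet visited
                parent[child] = node
                if child == goal:            # goal test on generation
                    path = [child]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return list(reversed(path))
                frontier.append(child)
    return None

# Small example graph (an assumption for illustration).
graph = {"S": ["A", "B"], "A": ["C"], "B": ["D"], "C": ["G"], "D": ["G"]}
print(breadth_first_search(graph, "S", "G"))   # ['S', 'A', 'C', 'G']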
2. DEPTH-FIRST SEARCH
Depth-first search is a recursive algorithm for traversing a tree or graph data structure.
It is called the depth-first search because it starts from the root node and follows each path to its
greatest depth node before moving to the next path.
DFS uses a stack data structure for its implementation.
The process of the DFS algorithm is similar to the BFS algorithm.
ADVANTAGE:
DFS requires very less memory as it only needs to store a stack of the nodes on the path from root
node to the current node.
It takes less time to reach to the goal node than BFS algorithm (if it traverses in the right path).
DISADVANTAGE:
There is the possibility that many states keep re-occurring, and there is no guarantee of finding the
solution.
DFS algorithm goes for deep down searching and sometime it may go to the infinite loop.
EXAMPLE:
Depth-first search, and it will follow the order as:
Root node--->Left node ----> right node.
It will start searching from root node S, and traverse A, then B, then D and E, after traversing E, it
will backtrack the tree as E has no other successor and still goal node is not found. After
backtracking it will traverse node C and then G, and here it will terminate as it found goal node.
Completeness: DFS search algorithm is complete within finite state space as it will expand every node
within a limited search tree.
Time Complexity: The time complexity of DFS is equivalent to the number of nodes traversed by the algorithm.
It is given by:
T(b) = 1 + b + b^2 + b^3 + ... + b^m = O(b^m)
where m = the maximum depth of any node, which can be much larger than d (the depth of the shallowest
solution).
Space Complexity: The DFS algorithm needs to store only a single path from the root node, hence the space
complexity of DFS is equivalent to the size of the fringe set, which is O(bm).
Optimal: DFS search algorithm is non-optimal, as it may generate a large number of steps or high cost
to reach to the goal node.
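A small recursive Python sketch of DFS that roughly mirrors the traversal order described above; the example graph is an assumption.

def depth_first_search(graph, start, goal, path=None, visited=None):
    """Recursive DFS over {node: [successors]}; returns one path or None.
    Not optimal: the path found is simply the first one reached."""
    if path is None:
        path, visited = [start], {start}
    if start == goal:
        return path
    for child in graph.get(start, []):
        if child not in visited:
            visited.add(child)
            found = depth_first_search(graph, child, goal, path + [child], visited)
            if found:
                return found
    return None

# Assumed graph roughly matching the example: S, A, B, D, E are explored
# first, then the search backtracks and finds G via C.
graph = {"S": ["A", "C"], "A": ["B", "D"], "B": [], "D": ["E"], "C": ["G"]}
print(depth_first_search(graph, "S", "G"))   # ['S', 'C', 'G']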
3. DEPTH-LIMITED SEARCH ALGORITHM:
A depth-limited search algorithm is similar to depth-first search with a predetermined limit. Depth-
limited search can solve the drawback of the infinite path in the Depth-first search. In this algorithm,
the node at the depth limit is treated as if it has no further successor nodes.
Depth-limited search can be terminated with two conditions of failure:
Standard failure value: It indicates that the problem does not have any solution.
Cutoff failure value: It indicates that there is no solution for the problem within the given depth limit.
Advantages:
Depth-limited search is Memory efficient.
Disadvantages:
Depth-limited search also has a disadvantage of incompleteness.
It may not be optimal if the problem has more than one solution.
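An illustrative Python sketch of depth-limited search, distinguishing the standard failure value (None) from the cutoff value; the example graph and the choice of return values are assumptions.

def depth_limited_search(graph, node, goal, limit):
    """DLS: depth-first search that treats nodes at the depth limit as
    having no successors. Returns a path, 'cutoff', or None (failure)."""
    if node == goal:
        return [node]
    if limit == 0:
        return "cutoff"
    cutoff_occurred = False
    for child in graph.get(node, []):
        result = depth_limited_search(graph, child, goal, limit - 1)
        if result == "cutoff":
            cutoff_occurred = True
        elif result is not None:
            return [node] + result
    return "cutoff" if cutoff_occurred else None

graph = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"], "E": ["G"]}
print(depth_limited_search(graph, "S", "G", limit=2))   # 'cutoff' (G is at depth 3)
print(depth_limited_search(graph, "S", "G", limit=3))   # ['S', 'B', 'E', 'G']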
EXAMPLE
4. UNIFORM-COST SEARCH ALGORITHM
Uniform-cost search is used for traversing a weighted tree or graph; it always expands the node with the lowest cumulative path cost g(n).
Advantages:
Uniform-cost search is optimal because at every state the path with the least cost is chosen.
Disadvantages:
It does not care about the number of steps involved in the search and is only concerned with path
cost, due to which this algorithm may get stuck in an infinite loop.
EXAMPLE
Completeness: Uniform-cost search is complete: if there is a solution, UCS will find it.
Time Complexity: Let C* be the cost of the optimal solution and ε the minimum step cost toward the goal
node. Then the number of steps is C*/ε + 1 (we add +1 because we start from state 0 and end
at C*/ε). Hence, the worst-case time complexity of uniform-cost search is O(b^(1 + C*/ε)).
Space Complexity: The same logic applies for space complexity, so the worst-case space complexity of
uniform-cost search is O(b^(1 + C*/ε)).
Optimal: Uniform-cost search is always optimal, as it only selects a path with the lowest path cost.
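A minimal Python sketch of uniform-cost search using a priority queue ordered by path cost g(n); the weighted example graph is an assumption.

import heapq

def uniform_cost_search(graph, start, goal):
    """UCS over a weighted graph {node: [(successor, step_cost), ...]}.
    Always expands the node with the lowest path cost g(n)."""
    frontier = [(0, start, [start])]           # (path_cost, node, path)
    explored = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path                  # optimal: lowest-cost goal
        if node in explored and explored[node] <= cost:
            continue
        explored[node] = cost
        for child, step in graph.get(node, []):
            heapq.heappush(frontier, (cost + step, child, path + [child]))
    return None

graph = {"S": [("A", 1), ("B", 5)], "A": [("B", 2)], "B": [("G", 1)]}
print(uniform_cost_search(graph, "S", "G"))    # (4, ['S', 'A', 'B', 'G'])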
5. ITERATIVE DEEPENING DEPTH-FIRST SEARCH
The iterative deepening algorithm is a combination of DFS and BFS: it finds the best depth limit by gradually increasing the limit until a goal is found.
Advantages:
It combines the benefits of BFS and DFS search algorithm in terms of fast search and memory
efficiency.
Disadvantages:
The main drawback of IDDFS is that it repeats all the work of the previous phase.
EXAMPLE
Following tree structure is showing the iterative deepening depth-first search.
IDDFS algorithm performs various iterations until it does not find the goal node. The iteration
performed by the algorithm is given as:
1'st Iteration-----> A
2'nd Iteration----> A, B, C
3'rd Iteration------>A, B, D, E, C, F, G
4'th Iteration------>A, B, D, H, I, E, C, F, K, G
In the fourth iteration, the algorithm will find the
goal node
Completeness: This algorithm is complete if the branching factor is finite.
Time Complexity: Let b be the branching factor and d the depth of the shallowest goal; then the worst-case time
complexity is O(b^d).
Optimal: The IDDFS algorithm is optimal if the path cost is a non-decreasing function of the depth of the
node.
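A short Python sketch of IDDFS that matches the iteration pattern shown above; the example tree and the max_depth safeguard are assumptions.

def iterative_deepening_search(graph, start, goal, max_depth=20):
    """IDDFS: run depth-limited DFS with limits 0, 1, 2, ... until the goal
    is found, combining DFS's low memory use with BFS's completeness."""
    def dls(node, limit):
        if node == goal:
            return [node]
        if limit == 0:
            return None
        for child in graph.get(node, []):
            result = dls(child, limit - 1)
            if result is not None:
                return [node] + result
        return None

    for limit in range(max_depth + 1):
        path = dls(start, limit)
        if path is not None:
            return path
    return None

# Assumed tree matching the iterations above; the goal K is found at depth 3.
graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"], "D": ["H", "I"], "F": ["K"]}
print(iterative_deepening_search(graph, "A", "K"))   # ['A', 'C', 'F', 'K']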
6. BIDIRECTIONAL SEARCH ALGORITHM
The bidirectional search algorithm runs two simultaneous searches, one from the initial state, called
forward search, and the other from the goal node, called backward search, to find the goal node.
Bidirectional search replaces one single search graph with two small subgraphs in which one
starts the search from an initial vertex and other starts from goal vertex.
The search stops when these two graphs intersect each other.
Bidirectional search can use search techniques such as BFS, DFS, DLS, etc.
Advantages:
Bidirectional search is fast.
Bidirectional search requires less memory
Disadvantages:
Implementation of the bidirectional search tree is difficult.
In bidirectional search, one should know the goal state in advance.
EXAMPLE
In the below search tree, bidirectional search algorithm is applied.
This algorithm divides one graph/tree into two sub-graphs.
It starts traversing from node 1 in the forward direction and starts from goal node 16 in the
backward direction.
The algorithm terminates at node 9 where two searches meet.
HEURISTIC FUNCTION AND ADMISSIBILITY
Here h(n) is the heuristic (estimated) cost, and h*(n) is the actual cost of the optimal path from n to the goal. A heuristic is admissible if
h(n) <= h*(n)
i.e., the heuristic cost should be less than or equal to the actual optimal cost.
PURE HEURISTIC SEARCH:
Pure heuristic search is the simplest form of heuristic search algorithm. It expands nodes based on their heuristic value h(n). It maintains two lists, OPEN and CLOSED: the CLOSED list holds the nodes which have already been expanded, and the OPEN list holds the nodes which have yet to be expanded.
BEST-FIRST SEARCH ALGORITHM (GREEDY SEARCH):
Greedy best-first search algorithm always selects the path which appears best at that moment.
It is the combination of depth-first search and breadth-first search algorithms.
It uses the heuristic function and search. Best-first search allows us to take the advantages of
both algorithms.
With the help of best-first search, at each step, we can choose the most promising node.
In the best-first search algorithm, we expand the node which is closest to the goal node, and the
closest cost is estimated by a heuristic function, i.e.
f(n) = h(n)
where h(n) = estimated cost from node n to the goal.
The greedy best-first algorithm is implemented using a priority queue.
BEST FIRST SEARCH ALGORITHM:
Step 1: Place the starting node into the OPEN list.
Step 2: If the OPEN list is empty, stop and return failure.
Step 3: Remove the node n from the OPEN list which has the lowest value of h(n), and place it in
the CLOSED list.
Step 4: Expand the node n and generate its successors.
Step 5: Check each successor of node n, and find whether any node is a goal node or not. If any
successor node is a goal node, then return success and terminate the search; otherwise proceed to Step 6.
Step 6: For each successor node, the algorithm checks the evaluation function f(n) and then checks whether
the node has been in either the OPEN or CLOSED list. If the node has not been in either list, then add it
to the OPEN list.
Step 7: Return to Step 2.
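The steps above can be sketched in Python with a priority queue keyed on h(n); the example graph and heuristic values are assumptions chosen for illustration.

import heapq

def greedy_best_first_search(graph, h, start, goal):
    """Greedy best-first search: always expand the node from the OPEN list
    with the lowest heuristic value f(n) = h(n)."""
    open_list = [(h[start], start, [start])]
    closed = set()
    while open_list:
        _, node, path = heapq.heappop(open_list)
        if node == goal:
            return path
        if node in closed:
            continue
        closed.add(node)
        for child in graph.get(node, []):
            if child not in closed:
                heapq.heappush(open_list, (h[child], child, path + [child]))
    return None

# Assumed example graph and heuristic values (estimated costs to G).
graph = {"S": ["A", "B"], "A": ["C", "D"], "B": ["E"], "E": ["G"], "D": ["G"]}
h = {"S": 6, "A": 4, "B": 5, "C": 4, "D": 2, "E": 3, "G": 0}
print(greedy_best_first_search(graph, "S", "G"))   # ['S', 'A', 'D', 'G']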
ADVANTAGES:
Best first search can switch between BFS and DFS by gaining the advantages of both
the algorithms.
DISADVANTAGES:
It can behave like an unguided depth-first search in the worst case.
It can get stuck in a loop like DFS.
It is not optimal.
EXAMPLE: Consider the search problem below, traversed using greedy best-first search. At each
iteration, each node is expanded using the evaluation function f(n) = h(n), which is given in the below
table.
In this search example, we are using two lists, the OPEN and CLOSED lists. Following are the
iterations for traversing the above example.
Space Complexity: The worst-case space complexity of greedy best-first search is O(b^m), where m is the maximum depth of the search space.
Complete: Greedy best-first search is also incomplete, even if the given state space is
finite.
So, there are a total of three tiles out of position, i.e., 6, 5 and 4 (do not count the empty tile present
in the goal state), i.e., h(n) = 3. Now, we need to minimize the value of h(n) to 0.
It can be seen from the above state-space tree that the heuristic value is minimized from h(n) = 3 to h(n) = 0 as we approach the goal state.
However, we can create and use several heuristic functions as per the requirement. It is also clear
from the above example that a heuristic function h(n) can be defined as the information required to
solve a given problem more efficiently. The information can be related to the nature of the state, the
cost of transforming from one state to another, goal node characteristics, etc., and is expressed in the form of a heuristic function.
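For instance, the misplaced-tiles heuristic used above can be written as a small Python function; the example configurations below are assumptions, not the exact states from the figure.

def misplaced_tiles(state, goal):
    """h(n) = number of tiles out of position (the empty tile, 0, is not counted)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

# Assumed example configurations (tuples read row by row, 0 = empty space).
current = (1, 2, 3, 0, 4, 6, 7, 5, 8)
goal    = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(misplaced_tiles(current, goal))   # 3 tiles out of place in this example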
The informed and uninformed search strategies expand the nodes systematically in two ways:
keeping different paths in memory, and
selecting the best suitable path,
which leads to a solution state required to reach the goal node. But beyond these "classical
search algorithms," we have some "local search algorithms" in which the path cost does not
matter, and the only focus is on the solution state needed to reach the goal node.
A local search algorithm completes its task by operating on a single current node rather than
multiple paths, generally moving only to the neighbours of that node.
Does the local search algorithm work for a pure optimized problem?
Yes, the local search algorithm works for pure optimized problems. A pure optimization
problem is one where all the nodes can give a solution. But the target is to find the best state out of
all according to the objective function. Unfortunately, the pure optimization problem fails to find
high-quality solutions to reach the goal state from the current state.
WORKING OF A LOCAL SEARCH ALGORITHM
Some widely used local search algorithms are:
Hill-climbing Search
Simulated Annealing
The hill climbing algorithm is a technique which is used for optimizing mathematical problems. One of
the widely discussed examples of the hill climbing algorithm is the Travelling Salesman Problem, in which we
need to minimize the distance travelled by the salesman.
It is also called greedy local search, as it only looks to its good immediate neighbour state and not beyond
that.
A node of hill climbing algorithm has two components which are state and value.
In this algorithm, we don't need to maintain and handle the search tree or graph as it only keeps a single
current state.
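A minimal Python sketch of steepest-ascent hill climbing that keeps only a single current state; the toy objective function and neighbour definition are assumptions made for illustration.

import random

def hill_climbing(initial_state, neighbors, value):
    """Steepest-ascent hill climbing: keep only the single current state and
    move to the best neighbour until no neighbour is better (a peak)."""
    current = initial_state
    while True:
        candidates = neighbors(current)
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):   # no uphill move left
            return current
        current = best

# Toy example (an assumption): maximize f(x) = -(x - 7)^2 over the integers.
value = lambda x: -(x - 7) ** 2
neighbors = lambda x: [x - 1, x + 1]
print(hill_climbing(random.randint(0, 20), neighbors, value))   # 7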
STATE-SPACE LANDSCAPE OF HILL CLIMBING ALGORITHM
To understand the concept of hill climbing algorithm, consider the below landscape representing
the goal state/peak and the current state of the climber. The topographical regions shown in the
figure can be defined as:
Global Maximum: It is the highest point on the hill, which is the goal state.
Local Maximum: It is a peak that is higher than each of its neighbouring states but lower than the global
maximum.
Flat local maximum: It is a flat area of the landscape where all the neighbouring states of the current state have the same value.
2. If the CURRENT node=GOAL node, return GOAL and terminate the search.
NOTE: Both simple and steepest-ascent hill climbing search fail when there is no closer (better)
node.
Stochastic hill climbing does not examine all the neighbours. It selects one neighbouring node at random and
decides whether to move to it or examine another.
The random-restart algorithm is based on a try-and-try strategy. It iteratively searches, restarting
from a random starting point and selecting the best state at each step, until the goal is found. Success depends most
commonly on the shape of the hill: if there are few plateaus, local maxima, and ridges, it
becomes easy to reach the goal.
Hill climbing algorithm is a fast and furious approach. It finds the solution state rapidly because it
is quite easy to improve a bad state. But, there are following limitations of this search:
Local Maxima: It is a peak of the landscape which is higher than all its neighbouring states
but lower than the global maximum. It is not the goal peak because there is another peak higher
than it.
Generate and Test variant: Hill climbing is a variant of the Generate and Test method.
The Generate and Test method produces feedback which helps to decide which direction
to move in the search space.
No backtracking: It does not backtrack in the search space, as it does not remember
previous states.
SIMULATED ANNEALING
Simulated annealing is similar to the hill climbing algorithm. It works on the current situation, but it
picks a random move instead of the best move. If the move leads to an improvement of
the current situation, it is always accepted as a step towards the solution state; otherwise the
move is accepted with a probability less than 1. This search technique was first used in 1980 to
solve VLSI layout problems. It is also applied to factory scheduling and other large optimization
tasks.
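A rough Python sketch of simulated annealing follows; the cooling schedule, step count, and toy objective function are assumptions, and the acceptance probability exp(delta/T) is one common choice.

import math
import random

def simulated_annealing(initial_state, neighbors, value,
                        temperature=10.0, cooling=0.95, steps=2000):
    """Simulated annealing sketch: pick a random move; always accept an
    improvement, accept a worse move with probability exp(delta / T) < 1,
    and slowly lower the temperature T."""
    current = initial_state
    for _ in range(steps):
        temperature *= cooling
        if temperature < 1e-6:
            break
        candidate = random.choice(neighbors(current))
        delta = value(candidate) - value(current)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            current = candidate
    return current

# Toy example (an assumption): maximize a slightly bumpy one-dimensional function.
value = lambda x: -(x - 7) ** 2 + 3 * math.sin(x)
neighbors = lambda x: [x - 1, x + 1]
print(simulated_annealing(0, neighbors, value))   # usually ends near x = 7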
LOCAL BEAM SEARCH
Local beam search is quite different from random-restart search. It keeps track of k states
instead of just one. It selects k randomly generated states, and expand them at each step. If any
state is a goal state, the search stops with success. Else it selects the best k successors from the
complete list and repeats the same process. In random-restart search each search process
runs independently, but in local beam search the necessary information is shared between the
parallel search processes.
DISADVANTAGES OF LOCAL BEAM SEARCH
This search can suffer from a lack of diversity among the k states.
It is an expensive version of hill climbing search.
LOCAL SEARCH IN CONTINUOUS SPACES
Gradient Descent is one of the most used machine learning algorithms in the
industry. And yet it confounds a lot of newcomers.
What is a Cost Function?
It is a function that measures the performance of a model for any given data. Cost
Function quantifies the error between predicted values and expected values and presents it in the form
of a single real number.
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal
of reducing the cost function, we modify the parameters using the gradient descent algorithm over the
given data. For example, with the commonly used mean squared error cost, the mathematical representation is:
J(theta) = (1/2m) * sum over i of (h_theta(x_i) - y_i)^2
where m is the number of training examples, h_theta(x_i) is the predicted value, and y_i is the expected value.
What is Gradient Descent?
Let’s say you are playing a game where the players are at the top of
a mountain, and they are asked to reach the lowest point of the
mountain. Additionally, they are blindfolded. So, what approach do you
think would make you reach the lake?
The best way is to observe the ground and find where the land
descends. From that position, take a step in the descending direction and
iterate this process until we reach the lowest point.
Gradient descent is an iterative optimization algorithm for finding the local minimum of a
function.
To find the local minimum of a function using gradient descent, we must take steps proportional
to the negative of the gradient (move away from the gradient) of the function at the current point. If
we take steps proportional to the positive of the gradient (moving towards the gradient), we will
approach a local maximum of the function, and the procedure is called Gradient Ascent.
Gradient descent was originally proposed by CAUCHY in 1847. It is also known as steepest
descent.
The goal of the gradient descent algorithm is to minimize the given function (say, a cost function). To
achieve this goal, it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Make a step (move) in the direction opposite to the gradient, i.e., opposite to the direction in which the slope
increases, moving from the current point by alpha times the gradient at that point.
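These two steps can be sketched in a few lines of Python; the toy cost function J(x) = (x - 3)^2, the learning rate, and the iteration count are assumptions made for illustration.

def gradient_descent(gradient, start, alpha=0.1, iterations=100):
    """Gradient descent sketch: repeatedly move opposite to the gradient,
    stepping by alpha (the learning rate) times the gradient."""
    x = start
    for _ in range(iterations):
        x = x - alpha * gradient(x)     # step in the direction of steepest descent
    return x

# Toy cost function (an assumption): J(x) = (x - 3)^2, whose gradient is 2(x - 3).
gradient = lambda x: 2 * (x - 3)
print(gradient_descent(gradient, start=10.0))   # converges to about 3.0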
SEARCHING WITH NONDETERMINISTIC ACTIONS
As an example, we use the vacuum world. Recall that the state space has eight states, as shown in
the figure. There are three actions (Left, Right, and Suck), and the goal is to clean up all the
dirt (states 7 and 8).
If the environment is observable, deterministic, and completely known, then the problem is
trivially solvable by any of the algorithms described earlier, and the solution is an action sequence.
The next question is how to find contingent solutions to nondeterministic problems.
As before, we begin by constructing search trees, but here the trees have a different character.
In a deterministic environment, the only branching is introduced by the agent’s own choices in
each state. We call these nodes OR nodes.
In the vacuum world, for example, at an OR node the agent chooses Left or Right or Suck.
In a nondeterministic environment, branching is also introduced by the environment’s choice
of outcome for each action. We call these nodes AND nodes.
A solution for an AND–OR search problem is a subtree that (1) has a goal node at every leaf, (2)
specifies one action at each of its OR nodes, and (3) includes every outcome branch at each of its
AND nodes.
Suck action: applied to a dirty square, cleans the square but sometimes cleans adjacent square; applied
to a clean square, sometimes deposits dirt.
SENSOR-LESS VACUUM WORLD
Assume belief states are the same but no location or dust sensors
Initial state = {1, 2, 3, 4, 5, 6, 7, 8}
Action: Right
Result = {2, 4, 6, 8}
Right, Suck
Result = {4, 8}
Right, Suck, Left, Suck
Result = {7} guaranteed!
WHAT IS THE CORRESPONDING SENSOR-LESS PROBLEM
States: Belief states, i.e., every possible set of physical states.
If there are N physical states, the number of belief states can be 2^N.
Initial State: Typically, the set of all states in P
Actions: Consider b = {s1, s2}.
If Actions_P(s1) != Actions_P(s2), should we take the union of both sets of actions or the intersection?
Union if all actions are legal, intersection if not.
TRANSITION MODEL
The union of all states that Result_P(s, a) returns for every state s in the current belief state:
b' = Result(b, a) = {s' : s' = Result_P(s, a) and s ∈ b}
This is the prediction step, Predict_P(b, a).
Goal test: satisfied if all physical states in the belief state satisfy Goal-Test_p.
Path cost: tricky in general; consider what happens if actions in different physical states have different
costs. For now, assume the cost of an action is the same in all states.
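The sketch below reproduces the prediction sequence from the sensor-less example above in Python. The numbering of the eight physical states is an assumption (odd numbers for the agent in the left square, even for the right; 1-2 both squares dirty, 3-4 only the left dirty, 5-6 only the right dirty, 7-8 both clean), chosen so the results match the belief states listed earlier:

```python
# (agent location, left square dirty, right square dirty) for each numbered state
STATES = {
    1: ('L', True, True),   2: ('R', True, True),
    3: ('L', True, False),  4: ('R', True, False),
    5: ('L', False, True),  6: ('R', False, True),
    7: ('L', False, False), 8: ('R', False, False),
}
NUMBER = {v: k for k, v in STATES.items()}

def result_p(s, action):
    """Deterministic physical transition model Result_p(s, a)."""
    agent, dirt_left, dirt_right = STATES[s]
    if action == 'Left':
        agent = 'L'
    elif action == 'Right':
        agent = 'R'
    elif action == 'Suck':
        if agent == 'L':
            dirt_left = False
        else:
            dirt_right = False
    return NUMBER[(agent, dirt_left, dirt_right)]

def predict(belief, action):
    """Prediction step: union of Result_p(s, a) over every s in the belief state."""
    return {result_p(s, action) for s in belief}

belief = set(range(1, 9))
for a in ['Right', 'Suck', 'Left', 'Suck']:
    belief = predict(belief, a)
    print(a, sorted(belief))
# Right [2, 4, 6, 8], Suck [4, 8], Left [3, 7], Suck [7]
```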
One solution is to represent the belief state by some more compact description. In English, we could
say the agent knows "Nothing" in the initial state; after moving Left, we could say, "Not in the rightmost
column," and so on. Chapter 7 explains how to do this in a formal representation scheme. Another
approach is to avoid the standard search algorithms, which treat belief states as black boxes just like any
other problem state. Instead, we can look inside the belief states and develop incremental belief-state
search algorithms that build up the solution one physical state at a time.
SEARCHING WITH OBSERVATIONS
When observations are partial, it will usually be the case that several states could have produced
any given percept. For example, the percept [A, Dirty] is produced by state 3 as well as by state
1.
Hence, given this as the initial percept, the initial belief state for the local-sensing vacuum
world will be {1, 3}.
The ACTIONS, STEP-COST, and GOAL-TEST are constructed from the underlying physical problem
just as for sensorless problems, but the transition model is a bit more complicated.
We can think of transitions from one belief state to the next for a particular action as occurring
in three stages.
The prediction stage is the same as for sensorless problems: given the action a in belief
state b, the predicted belief state is b̂ = PREDICT(b, a).
The observation prediction stage determines the set of percepts o that could be observed in the
predicted belief state:
POSSIBLE-PERCEPTS(b̂) = {o : o = PERCEPT(s) and s ∈ b̂}
The update stage determines, for each possible percept, the belief state that would result from the
percept. The new belief state b_o is just the set of states in b̂ that could have produced the percept:
b_o = UPDATE(b̂, o) = {s : o = PERCEPT(s) and s ∈ b̂}
Notice that each updated belief state b_o can be no larger than the predicted belief state b̂;
observations can only help reduce uncertainty compared with the sensorless case.
Moreover, for deterministic sensing, the belief states for the different possible percepts will be disjoint,
forming a partition of the predicted belief state.
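Put together, the three stages can be sketched as below (the `percept` and `results_p` functions stand in for the physical problem's PERCEPT and RESULTS, so they are assumptions of the sketch):

```python
def predict(b, a, results_p):
    """Prediction stage: b_hat = PREDICT(b, a)."""
    return {s2 for s in b for s2 in results_p(s, a)}

def possible_percepts(b_hat, percept):
    """Observation prediction stage: the percepts that could be observed in b_hat."""
    return {percept(s) for s in b_hat}

def update(b_hat, o, percept):
    """Update stage: the states in b_hat that could have produced percept o."""
    return {s for s in b_hat if percept(s) == o}

def belief_results(b, a, results_p, percept):
    """RESULTS for the belief-state problem: one updated belief state per possible percept."""
    b_hat = predict(b, a, results_p)
    return {o: update(b_hat, o, percept) for o in possible_percepts(b_hat, percept)}
```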
The preceding section showed how to derive the RESULTS function for a nondeterministic
belief-state problem from an underlying physical problem and the PERCEPT function.
Given such a formulation, the AND–OR search algorithm of Figure can be applied directly to
derive a solution.
The figure shows part of the search tree for the local-sensing vacuum world, assuming an initial percept [A, Dirty].
ONLINE SEARCH
Online search is great for dynamic and nondeterministic domains, where the agent does not know in
advance about obstacles, where the goal is, or even that Up from (1,1) goes to (1,2).
Competitive ratio = cost of the actual agent path / cost of the shortest path (the path the agent would follow if it already knew the search space).
Irreversible actions can lead to dead ends, and the competitive ratio can become infinite.
ONLINE SEARCH ALGORITHMS
Hill-climbing is already an online search algorithm but stops at local optimum. How about
randomization?
Cannot do random restart (you can’t teleport a robot)
How about just a random walk instead of hill-climbing?
It can be very bad (in some spaces there are two ways back for every way forward, as in the figure).
Instead, let's augment hill climbing with memory:
Learning real-time A* (LRTA*)
Updates the cost-to-goal estimates, H(s), for the state it leaves
Prefers unexplored states: f(s) = h(s), not g(s) + h(s), for unexplored states
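Below is a minimal, assumed sketch of such an LRTA*-style agent (unit step costs are an assumption; `actions(s)`, `goal_test(s)`, and `h(s)` are placeholders for the problem being explored):

```python
class LRTAStarAgent:
    def __init__(self, actions, goal_test, h):
        self.actions = actions          # actions(s) -> list of available actions
        self.goal_test = goal_test
        self.h = h                      # heuristic estimate of cost to the goal
        self.H = {}                     # learned cost-to-goal estimates
        self.result = {}                # experienced transitions (s, a) -> s'
        self.s, self.a = None, None     # previous state and action

    def cost(self, s, a, s_next):
        # f(s) = h(s) for unexplored outcomes, else step cost (assumed 1) + H(s')
        if s_next is None:
            return self.h(s)
        return 1 + self.H[s_next]

    def __call__(self, s_prime):
        """Called with the state the agent currently perceives; returns the next action."""
        if self.goal_test(s_prime):
            return None
        self.H.setdefault(s_prime, self.h(s_prime))
        if self.s is not None:
            self.result[(self.s, self.a)] = s_prime
            # update the cost estimate of the state the agent just left
            self.H[self.s] = min(self.cost(self.s, b, self.result.get((self.s, b)))
                                 for b in self.actions(self.s))
        # pick the apparently best action from the current state
        self.a = min(self.actions(s_prime),
                     key=lambda b: self.cost(s_prime, b, self.result.get((s_prime, b))))
        self.s = s_prime
        return self.a
```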
REINFORCEMENT LEARNING
Reinforcement learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and seeing the results of those actions.
For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
In reinforcement learning, the agent learns automatically from this feedback, without any labelled
data, unlike supervised learning.
RL solves a specific type of problem where decision making is sequential and the goal is long-
term, such as game playing, robotics, etc.
The agent learns by trial and error, and based on its experience it learns to perform the task in a
better way. Hence, we can say that "reinforcement learning is a type of machine learning method
where an intelligent agent (computer program) interacts with the environment and learns to act
within it." How a robotic dog learns the movement of its arms is an example of reinforcement learning.
EXAMPLE: Suppose there is an AI agent in a maze environment, and its goal is to find
the diamond. The agent interacts with the environment by performing some actions; based on
those actions, the state of the agent changes, and it also receives a reward or penalty as
feedback.
TERMS USED IN REINFORCEMENT LEARNING
Agent: An entity that can perceive/explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a
stochastic environment, which means it is random in nature.
Action: The moves taken by an agent within the environment.
State: The situation returned by the environment after each action taken by the agent.
Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
Policy: The strategy applied by the agent to choose the next action based on the current state.
Value: The expected long-term return with discounting, as opposed to the short-term reward.
Q-value: Mostly similar to the value, but it takes one additional parameter, the current action a.
KEY FEATURES OF REINFORCEMENT LEARNING
In RL, the agent is not instructed about the environment and what actions need to be taken.
It is based on a trial-and-error process.
The agent takes the next action and changes states according to the feedback of the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it in order to obtain the maximum
positive reward.
APPROACHES TO IMPLEMENT REINFORCEMENT LEARNING
There are mainly three ways to implement reinforcement-learning in ML, which are:
Value-based
Policy-based
Model-based
1. VALUE-BASED:
The value-based approach is about finding the optimal value function, which is the maximum value
at a state under any policy. The agent expects the long-term return at any state s under
policy π.
2. POLICY-BASED:
Policy-based approach is to find the optimal policy for the maximum future rewards without
using the value function. In this approach, the agent tries to apply such a policy that the action
performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
Deterministic: The same action is produced by the policy (π) at any state.
Stochastic: In this policy, probability determines the produced action.
3. MODEL-BASED:
In the model-based approach, a virtual model is created for the environment, and the agent explores
that environment to learn it. There is no particular solution or algorithm for this approach because the
model representation is different for each environment.
ELEMENTS OF REINFORCEMENT LEARNING
There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
1) POLICY:
A policy defines the way an agent behaves at a given time. It maps the perceived
states of the environment to the actions to be taken in those states. A policy is the core element of RL, as
it alone can define the behaviour of the agent. In some cases it may be a simple function or a lookup
table, whereas in other cases it may involve general computation, such as a search process. It can be a
deterministic or a stochastic policy:
For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[At =a | St = s]
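For illustration only (the states and actions below are made up), a deterministic policy can be stored as a simple lookup table, while a stochastic policy stores a distribution over actions for each state:

```python
import random

deterministic_policy = {'s1': 'right', 's2': 'up'}          # a = pi(s)

stochastic_policy = {                                       # pi(a | s) = P[At = a | St = s]
    's1': {'right': 0.8, 'up': 0.2},
    's2': {'right': 0.5, 'up': 0.5},
}

def act(policy, state, stochastic=False):
    """Return the action the policy prescribes (or samples) for the given state."""
    if not stochastic:
        return policy[state]
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, 's1'))                 # always 'right'
print(act(stochastic_policy, 's1', stochastic=True))   # 'right' 80% of the time
```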
2) REWARD SIGNAL:
The goal of reinforcement learning is defined by the reward signal. At each state, the
environment sends an immediate signal to the learning agent, and this signal is known as a reward
signal. These rewards are given according to the good and bad actions taken by the agent. The agent's
main objective is to maximize the total number of rewards for good actions. The reward signal can
change the policy, such as if an action selected by the agent leads to low reward, then the policy may
change to select other actions in the future.
3) VALUE FUNCTION:
The value function gives information about how good the situation and action are and how
much reward an agent can expect. A reward indicates the immediate signal for each good and bad
action, whereas a value function specifies the good state and action for the future. The value
function depends on the reward as, without reward, there could be no value. The goal of estimating
values is to achieve more rewards.
4) MODEL:
The last element of reinforcement learning is the model, which mimics the behaviour of the
environment. With the help of the model, one can make inferences about how the environment will
behave. Such as, if a state and an action are given, then a model can predict the next state and reward.
The model is used for planning, which means it provides a way to take a course of action by
considering all future situations before actually experiencing those situations. The approaches for
solving the RL problems with the help of the model are termed as the model-based approach.
Comparatively, an approach without using a model is called a model-free approach.
HOW DOES REINFORCEMENT LEARNING WORK?
To understand the working process of the RL, we need to consider two main things:
Environment: It can be anything such as a room, maze, football ground, etc.
Agent: An intelligent agent such as AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the below
image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block,
which is a wall, an S8 block, which is a fire pit, and an S4 block, which holds the diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets
a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move
up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few steps as possible.
Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward point.
The agent will try to remember the preceding steps it has taken to reach the final step. To
memorize the steps, it assigns a value of 1 to each preceding step.
THE BELLMAN EQUATION
The Bellman equation is a way of calculating value functions in dynamic programming; it is a key step
on the path to modern reinforcement learning.
The key-elements used in Bellman equations are:
Action performed by the agent is referred to as "a"
State occurred by performing the action is "s."
The reward/feedback obtained for each good and bad action is "R."
A discount factor is Gamma "γ."
The Bellman equation can be written as:
V(s) = max_a [R(s,a) + γV(s')]
where
V(s) = the value calculated at a particular state,
R(s,a) = the reward obtained at state s by performing action a,
γ = the discount factor,
V(s') = the value of the next state s'.
In the above equation, we take the maximum over actions because the agent always tries to find
the optimal solution.
So now, using the Bellman equation, we will find value at each state of the given environment. We
will start from the block, which is next to the target block.
For the 1st block:
V(s3) = max [R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For the 2nd block:
V(s2) = max [R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For the 3rd block:
V(s1) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For the 4th block:
V(s5) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For the 5th block:
V(s9) = max [R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
Now, we move on to the 6th block, where the agent may change its route because it always tries to
find the optimal path. So let's now consider the block next to the fire pit.
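The backups above can be reproduced with a few lines of Python (the path and rewards are exactly those of the worked example, with γ = 0.9 as assumed there):

```python
gamma = 0.9
V = {}

# (state, reward for the best action from that state, next state along the path)
path = [('s3', 1, None), ('s2', 0, 's3'), ('s1', 0, 's2'),
        ('s5', 0, 's1'), ('s9', 0, 's5')]

for state, reward, nxt in path:
    v_next = V[nxt] if nxt else 0.0     # V(s') = 0 beyond the goal block
    V[state] = reward + gamma * v_next  # V(s) = R(s, a) + gamma * V(s')
    print(state, round(V[state], 2))
# s3 1.0, s2 0.9, s1 0.81, s5 0.73, s9 0.66
```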
HOW TO REPRESENT THE AGENT STATE?
We can represent the agent state using the Markov state, which contains all the required
information from the history. The state St is a Markov state if it satisfies the condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is independent of the
past and can only be defined with the present. The RL works on fully observable environments,
where the agent can observe the environment and act toward the new state. This complete process is
known as a Markov Decision Process, or MDP, which is used to formalize reinforcement learning problems.
If the environment is completely observable, then its dynamics can be modelled as a Markov process.
In MDP, the agent constantly interacts with the environment and performs actions; at each action, the
environment responds and generates a new state.
MDP is used to describe the environment for the RL, and almost all the RL problem can be formalized
using MDP.
MDP contains a tuple of four elements (S, A, Pa, Ra):
A set of finite states S
A set of finite actions A
Ra: the reward received after transitioning from state S to state S' due to action a
Pa: the probability of transitioning from state S to state S' due to action a
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
MARKOV PROPERTY
It says that "if the agent is present in the current state s1 and performs an action a1 to move to
state s2, then the state transition from s1 to s2 depends only on the current state; future
actions and states do not depend on past actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does not depend on any
past action or state. Hence, MDP is an RL problem that satisfies the Markov property. Such as in
a Chess game, the players only focus on the current state and do not need to remember
past actions or states.
FINITE MDP
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we consider
only the finite MDP.
MARKOV PROCESS
Markov Process is a memoryless process with a sequence of random states S1, S2, ....., St that
uses the Markov Property. Markov process is also known as Markov chain, which is a tuple (S, P) on
state S and transition function P. These two components (S and P) can define the dynamics of the
system.
REINFORCEMENT LEARNING ALGORITHMS
Reinforcement learning algorithms are mainly used in AI applications and gaming applications.
The main algorithms are:
Q-Learning:
Q-learning is an off-policy RL algorithm used for temporal-difference learning. Temporal-difference
learning methods compare temporally successive predictions.
It learns the action-value function Q(s, a), which measures how good it is to take action a in a
particular state s.
The standard Q-learning flowchart illustrates this loop: initialize the Q-table, choose an action, perform it, observe the reward and next state, update Q, and repeat.
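A minimal tabular sketch of the Q-learning update is shown below; the Gym-style `env` object with `reset()` and `step(action)` is an assumed interface, not part of the original text:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in actions)   # off-policy max
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```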
State Action Reward State action (SARSA):
SARSA stands for State Action Reward State action, which is an on-policy temporal
difference learning method. The on-policy control method selects the action for each state
while learning using a specific policy.
The goal of SARSA is to calculate the Q π (s, a) for the selected current policy π and all
pairs of (s-a).
The main difference between the Q-learning and SARSA algorithms is that, unlike Q-learning, SARSA
does not use the maximum Q-value of the next state when updating the Q-value in the table.
In SARSA, the new action and reward are selected using the same policy that determined the
original action.
SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where
s: original state
a: original action
r: reward observed while following the current policy
s': new state
a': next action (chosen by the same policy)
In the Bellman equation, we have various components, including the reward, the discount factor (γ),
the transition probability, and the end state s', but no Q-value is given, so first consider the image below:
In the image, the agent has three value options, V(s1), V(s2), and V(s3). As this is an MDP, the agent
only cares about the current state and the future state. The agent can move in any direction (up, left,
or right), so it needs to decide where to go to follow the optimal path. The agent moves on a
probability basis and changes state, but if we want exact moves, we need to make some changes in
terms of Q-values.
ADAPTIVE DYNAMIC PROGRAMMING (ADP)
The utility of a state under a fixed policy π satisfies
Uπ(s) = R(s) + γ Σ_s' P(s' | s, π(s)) Uπ(s')
where R(s) = the reward for being in state s, P(s' | s, π(s)) = the transition model, γ = the discount factor,
and Uπ(s) = the utility of being in state s.
It can be solved using the value-iteration algorithm. The algorithm converges fast but can become
quite costly to compute for large state spaces. ADP is a model-based approach and requires the
transition model of the environment. A model-free alternative is temporal-difference learning.
3. TEMPORAL DIFFERENCE LEARNING (TD)
TD learning does not require the agent to learn the transition model. The update occurs between
successive states, and the agent only updates states that are directly affected:
Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))
To drive exploration, the agent can use an exploration function f(u, n) that increases with the expected
value u and decreases with the number of tries n; a common choice is f(u, n) = R+ if n < Ne, and u otherwise.
R+ is an optimistic reward estimate and Ne is the number of times we want the agent to be forced to try an action in
every state. The exploration function converts a passive agent into an active one.
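A minimal sketch of the TD(0) utility update, together with the exploration function f(u, n) described above (the R+ and Ne values below are illustrative):

```python
def exploration_f(u, n, R_plus=2.0, Ne=5):
    """f(u, n): optimistic value R+ until an action has been tried Ne times, then u."""
    return R_plus if n < Ne else u

def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): nudge U(s) toward r + gamma * U(s') after observing one transition."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```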
GENERALIZATION IN REINFORCEMENT LEARNING
The study of generalisation in deep Reinforcement Learning (RL) aims to produce RL algorithms
whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting
to their training environments.
Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios,
where the environment will be diverse, dynamic and unpredictable.
This survey is an overview of this nascent field. We provide a unifying formalism and terminology
for discussing different generalisation problems, building upon previous works.
We go on to categorise existing benchmarks for generalisation, as well as current methods for
tackling the generalisation problem. Finally, we provide a critical discussion of the current state of
the field, including recommendations for future work.
Among other conclusions, we argue that taking a purely procedural content generation approach to
benchmark design is not conducive to progress in generalisation, and we suggest fast online adaptation
and tackling RL-specific problems as some areas for future work on methods for generalisation.
POLICIES IN REINFORCEMENT LEARNING
The underlying idea of reinforcement learning, states Russell, is that intelligence is an emergent property
of the interaction between an agent and its environment.
This property guides the agent’s actions by orienting its choices in the conduct of some tasks.
We can say, analogously, that intelligence is the capacity of the agent to select the appropriate
strategy in relation to its goals. Strategy, a teleologically-oriented subset of all possible
behaviours, is here connected to the idea of “policy”.
A policy is, therefore, a strategy that an agent uses in pursuit of goals. The policy dictates the
actions that the agent takes as a function of the agent’s state and the environment.
MATHEMATICAL DEFINITION OF A POLICY
With formal terminology, we define a policy π in terms of the Markov Decision Process to which
it refers. A Markov Decision Process is a tuple of the form (S,A,R,P), structured as follows.
The first element is a set S containing the internal states of the agent. Together, all possible states
span a so-called state space for the agent. In the case of the grid worlds for agent simulations, S
normally consists of the position of the agent on a board plus, if necessary, some parameters.
The second element is a set A containing the actions of the agent. The actions correspond to the
possible behaviors that the agent can take in relation to the environment. Together, the set of all
actions spans the action space for that agent.
An action can also lead to a modification of the state of the agent. This is represented by the
matrix P containing the probabilities of transition from one state to another. Its elements Pa(S, S')
contain the probabilities Pr(S' | S, a) for all possible actions a ∈ A and pairs of states (S, S').
The fourth element R(s) comprises the reward function for the agent. It takes as input the state of the
agent and outputs a real number that corresponds to the agent’s reward.
We can now formally define the policy, which we indicate with π(s). A policy π(s) comprises the
suggested actions that the agent should take for every possible state s ∈ S.
In the grid example, the reward function is defined in this manner: if the agent is in an empty cell, it
receives a negative reward of -1, to simulate the effect of hunger. If instead the agent is in a cell with
fruit, in this case (3,2) for the pear and (4,4) for the apple, it receives a reward of +5 or +10, respectively.
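For illustration, the reward function just described can be written directly (the grid coordinates are those of the example):

```python
def reward(cell):
    if cell == (3, 2):
        return 5      # pear
    if cell == (4, 4):
        return 10     # apple
    return -1         # any empty cell: simulated "hunger"

print(reward((1, 1)), reward((3, 2)), reward((4, 4)))   # -1 5 10
```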
EVALUATION OF THE POLICIES
The agent then considers two policies π1 and π2. If we simplify slightly the notation, we can indicate
a policy as a sequence of actions starting from the state of the agent at s0:
The agent then has to select between the two policies. By computing the utility function U over them, the
agent obtains:
The evaluation of the policies suggests that the utility is maximized with π2, which then the agent
chooses as its policy for this task.
NATURAL LANGUAGE PROCESSING
WHAT IS NLP?
NLP stands for Natural Language Processing, which is a part of Computer Science, Human
language, and Artificial Intelligence.
It is the technology used by machines to understand, analyse, manipulate, and interpret
human languages. It helps developers organize knowledge for performing tasks such
as translation, automatic summarization, named entity recognition (NER), speech
recognition, relationship extraction, and topic segmentation.
HISTORY OF NLP
(1940-1960) - Focused on Machine Translation (MT)
1948 - In the Year 1948, the first recognisable NLP application was introduced in Birkbeck College, London.
1950s - In the Year 1950s, there was a conflicting view between linguistics and computer science
In 1957, Chomsky introduced the idea of Generative Grammar, which consists of rule-based descriptions of syntactic
structures.
(1960-1980) - Flavored with Artificial Intelligence (AI)
In the year 1960 to 1980, the key developments were:
Augmented Transition Networks (ATN)
An Augmented Transition Network is a graph-structured extension of the finite-state machine, used to parse natural-language sentences.
Case Grammar was developed by Linguist Charles J. Fillmore in the year 1968.
SHRDLU is a program written by Terry Winograd in 1968-70.
LUNAR is the classic example of a natural-language database interface system; it used ATNs and Woods'
Procedural Semantics.
Till the year 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980,
NLP introduced machine learning algorithms for language processing.
ADVANTAGES OF NLP
NLP helps users to ask questions about any subject and get a direct response within seconds.
NLP offers exact answers to questions, meaning that it does not return unnecessary or unwanted
information.
Most companies use NLP to improve the efficiency of documentation processes and the accuracy of documentation.
DISADVANTAGES OF NLP
NLP can be unpredictable.
NLP is unable to adapt to a new domain, and it has limited functionality, which is why NLP is built for a single, specific task.
NLU vs NLG
NLU is the process of reading and interpreting language. NLG is the process of writing or generating language.
LANGUAGE MODELS
A language can be defined as a set of strings; "print(2 + 2)" is a legal program in the language
Python, whereas "2)+(2 print" is not.
Since there are an infinite number of legal programs, they cannot be enumerated; instead they are
specified by a set of rules called a grammar.
Formal languages also have rules that define the meaning or semantics of a program; for example, the
rules say that the “meaning” of “2 + 2” is 4, and the meaning of “1/0” is that an error is signaled.
Everyone agrees that “Not to be invited is sad” is a sentence of English, but people disagree on the
grammaticality of “To be not invited is sad.”
Therefore, it is more fruitful to define a natural language model as a probability distribution over
sentences rather than a definitive set, writing P(S = words) for the probability that a sentence S consists
of a given string of words.
Natural languages are also ambiguous: we cannot speak of a single meaning for a
sentence, but rather of a probability distribution over possible meanings.
Finally, natural languages are difficult to deal with because they are very large, and constantly
changing.
Thus, one of the simplest language models is a probability distribution over sequences of
characters.
We write P(C1:N) for the probability of a sequence of N characters, C1 through CN.
A sequence of written symbols of length n is called an n-gram (from the Greek root for writing or
letters), with special case “unigram” for 1-gram, “bigram” for 2-gram, and “trigram” for 3-gram.
A model of the probability distribution of n-letter sequences is thus called an n-gram model. (But be
careful: we can have n-gram models over sequences of words, syllables, or other units; not just over
characters.)
In a Markov chain the probability of character ci depends only on the immediately preceding
characters, not on any other characters.
We can define the probability of a sequence of characters P(c1:N) under the trigram model by first
factoring with the chain rule and then using the Markov assumption:
P(c1:N) = Π_{i=1..N} P(ci | c(i−2):(i−1))
We call a body of text a corpus (plural corpora), from the Latin word for body.
What can we do with n-gram character models? One task for which they are well suited is
language identification .
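As a rough illustration of language identification with character n-gram models, the sketch below scores a string against per-language trigram counts (the tiny training strings and the smoothing vocabulary size are placeholders; a real system would train on large corpora):

```python
import math
from collections import Counter

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

def score(text, counts, vocab_size=10000):
    """Log-probability of the text under add-one-smoothed trigram counts."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in trigrams(text))

models = {
    'en': Counter(trigrams("the quick brown fox jumps over the lazy dog")),
    'de': Counter(trigrams("der schnelle braune fuchs springt ueber den faulen hund")),
}
query = "the dog jumps over the fox"
print(max(models, key=lambda lang: score(query, models[lang])))   # 'en'
```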
Smoothing n-gram models
The major complication of n-gram models is that the training corpus provides only an estimate of the
true probability distribution.
For common character sequences such as “ _th” any English corpus will give a good estimate: about
1.5% of all trigrams. On the other hand, “ _ht” is very uncommon—no dictionary words start with ht.
The process of adjusting the probability of low-frequency counts is called smoothing.
The simplest type of smoothing was suggested by Pierre-Simon Laplace in the 18th century: he said
that, in the lack of further information, if a random Boolean variable X has been false in all n
observations so far then the estimate for P (X = true) should be 1/(n+2).
That is, he assumes that with two more trials, one might be true and one false. Laplace smoothing
(also called add-one smoothing) is a step in the right direction, but performs relatively poorly.
A better approach is a back-off model, in which we start by estimating n-gram counts, but for any
particular sequence that has a low (or zero) count, we back off to (n - 1)-grams.
Linear interpolation smoothing is a back-off model that combines trigram, bigram, and unigram
models by linear interpolation. It defines the probability estimate as
P̂(ci | c(i−2):(i−1)) = λ3 P(ci | c(i−2):(i−1)) + λ2 P(ci | c(i−1)) + λ1 P(ci)
where λ3 + λ2 + λ1 = 1. The parameter values λi can be fixed, or they can be trained with an
expectation-maximization algorithm.
It is also possible to have the values of λi depend on the counts: if we have a high count of
trigrams, then we weigh them relatively more; if only a low count, then we put more weight on the
bigram and unigram models.
Model evaluation
Split the corpus into a training corpus and a validation corpus. Determine the parameters of the model
from the training data. Then evaluate the model on the validation corpus.
The evaluation can be a task-specific metric, such as measuring accuracy on language identification.
Alternatively we can have a task-independent model of language quality: calculate the probability
assigned to the validation corpus by the model; the higher the probability the better.
This metric is inconvenient because the probability of a large corpus will be a very small number, and
floating-point underflow becomes an issue.
A different way of describing the probability of a sequence is with a measure called perplexity, defined as
Perplexity(c1:N) = P(c1:N)^(−1/N)
Perplexity can be thought of as the reciprocal of probability, normalized by sequence length.
It can also be thought of as the weighted average branching factor of a model. Suppose there are
100 characters in our language, and our model says they are all equally likely. Then, for a sequence of
any length, the perplexity will be 100.
If some characters are more likely than others, and the model reflects that, then the model will have
perplexity less than 100.
All the same mechanism applies equally to word and character models. The main difference is that the vocabulary—the
set of symbols that make up the corpus and the model—is larger.
There are only about 100 characters in most languages, and sometimes we build character models that are even more
restrictive, for example by treating “A” and “a” as the same symbol or by treating all punctuation as the same symbol.
But with word models we have at least tens of thousands of symbols, and sometimes millions.
In English, a sequence of letters surrounded by spaces is a word, but in some languages, like Chinese, words are not
separated by spaces; even in English, many decisions must be made to have a clear policy on word boundaries
(for example, deciding how many words a hyphenated expression counts as).
With character models, we didn’t have to worry about someone inventing a new letter of the alphabet.
But with word models there is always the chance of a new word that was not seen in the training corpus, so we need to
model that explicitly in our language model.
TEXT CLASSIFICATION
We now consider in depth the task of text classification, also known as categorization: given a text
of some kind, decide which of a predefined set of classes it belongs to. Language identification and
genre classification are examples of text classification
Consider the task of classifying a message as spam or ham. A training set is readily available: the
positive (spam) examples are in the spam folder, the negative (ham) examples are in the inbox.
In the language-modeling approach, we define one n-gram language model for P(Message | spam)
by training on the spam folder, and one model for P(Message | ham) by training on the inbox.
A lossless compression algorithm takes a sequence of symbols, detects repeated patterns in it, and
writes a description of the sequence that is more compact than the original.
To do classification by compression, we first lump together all the spam training messages and
compress them as a unit. We do the same for the ham. Then, when given a new message to classify,
we append it to the spam messages and compress the result. We also append it to the ham and
compress that. Whichever class compresses better (adds the fewer additional bytes for the new
message) is the predicted class.
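A minimal sketch of this idea using a standard lossless compressor (zlib here, as an illustrative choice):

```python
import zlib

def compressed_size(text):
    return len(zlib.compress(text.encode('utf-8')))

def classify(message, spam_text, ham_text):
    """Pick the class whose compressed training text grows least when the message is appended."""
    extra_spam = compressed_size(spam_text + message) - compressed_size(spam_text)
    extra_ham = compressed_size(ham_text + message) - compressed_size(ham_text)
    return 'spam' if extra_spam < extra_ham else 'ham'

# Usage (with placeholder training texts):
# print(classify("win a free prize now", all_spam_messages, all_ham_messages))
```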
INFORMATION RETRIEVAL
Information retrieval is the task of finding documents that are relevant to a user’s need for
information.
The best-known examples of information retrieval systems are search engines on the World Wide
Web.
A Web user can type a query such as “AI book” into a search engine and see a list of relevant pages.
An information retrieval system can be characterized by:
A corpus of documents. Each system must decide what it wants to treat as a document: a
paragraph, a page, or a multipage text.
Queries posed in a query language. A query specifies what the user wants to know. The query
language can be just a list of words, such as [AI book]; or it can specify a phrase of words that
must be adjacent, as in [“AI book”]; it can contain Boolean operators as in [AI AND book]; it can
include non-Boolean operators such as [AI NEAR book].
A result set. This is the subset of documents that the IR system judges to be relevant to the query.
A presentation of the result set. This can be as simple as a ranked list of document titles or as
complex as a rotating color map of the result set projected onto a three-dimensional space,
rendered as a two-dimensional display.
The earliest IR systems used a Boolean keyword model, which has several disadvantages. First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order the relevant documents for presentation.
Second, Boolean expressions are unfamiliar to users who are not programmers or logicians.
Third, it can be hard to formulate an appropriate query, even for a skilled user.
IR scoring functions
Most IR systems have abandoned the Boolean model and use models based on the statistics of
word counts. We describe the BM25 scoring function.
A scoring function takes a document and a query and returns a numeric score; the most relevant
documents have the highest scores.
In the BM25 function, the score is a linear weighted combination of scores for each of the
words that make up the query.
Three factors affect the weight of a query term:
First, the frequency with which a query term appears in a document (also known as TF, for term
frequency). For the query [farming in Kansas], documents that mention "farming" frequently will have higher scores.
Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost
every document, so it has a high document frequency, and thus a low inverse document frequency,
and thus it is not as important to the query.
Third, the length of the document. A million-word document will probably mention all the query
words, but may not actually be about the query. A short document that mentions all the words is a
much better candidate.
IDF(qi) is the inverse document frequency of word qi, given by IDF(qi) = log((N − DF(qi) + 0.5) / (DF(qi) + 0.5)), where N is the number of documents in the corpus and DF(qi) is the number of documents that contain qi.
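A minimal sketch of a BM25-style scoring function (the constants k = 2.0 and b = 0.75 are commonly quoted defaults, assumed here; documents are represented simply as lists of words):

```python
import math

def bm25(query_terms, doc, docs, k=2.0, b=0.75):
    """Score one document against a query using TF, IDF, and document-length normalization."""
    N = len(docs)
    L = sum(len(d) for d in docs) / N               # average document length (in words)
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in docs if q in d)          # document frequency of q
        idf = math.log((N - df + 0.5) / (df + 0.5))  # inverse document frequency
        tf = doc.count(q)                            # term frequency in this document
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(doc) / L))
    return score

docs = [["farming", "in", "kansas"], ["ai", "book"], ["the", "ai", "book", "list"]]
print(bm25(["ai", "book"], docs[1], docs))
```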
IR system evaluation
Imagine that an IR system has returned a result set for a single query, for which we know which
documents are and are not relevant, out of a corpus of 100 documents:
                In result set    Not in result set
Relevant              30                20
Not relevant          10                40
Two measures are used in evaluation: recall and precision.
Precision measures the proportion of documents in the result set that are actually relevant.
In our example, the precision is 30/(30 + 10) = .75. The false positive rate is 1 - .75 = .25.
Recall measures the proportion of all the relevant documents in the collection that are in the result set.
In our example, recall is 30/(30 + 20) = .60. The false negative rate is 1 - .60 = .40.
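The two measures can be computed directly from the counts in the example:

```python
# Counts from the example result set above
relevant_returned, irrelevant_returned, relevant_missed = 30, 10, 20

precision = relevant_returned / (relevant_returned + irrelevant_returned)   # 0.75
recall = relevant_returned / (relevant_returned + relevant_missed)          # 0.60
print(precision, recall)
```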
In a very large document collection, such as the World Wide Web, recall is difficult to compute,
because there is no easy way to examine every page on the Web for relevance.
All we can do is either estimate recall by sampling or ignore recall completely and just judge
precision.
IR refinements
There are many possible refinements to the system described here, and indeed Web search engines
are continually updating their algorithms as they discover new approaches and as the Web grows
and changes.
One common refinement is a better model of the effect of document length on relevance.
Singhal et al. (1996) observed that simple document length normalization schemes tend to favor short
documents too much and long documents not enough.
They propose a pivoted document length normalization scheme; the idea is that the pivot is the
document length at which the old-style normalization is correct; documents shorter than that get a
boost, and documents longer than that get a penalty.
The next step is to recognize synonyms, such as “sofa” for “couch.” As with stemming, this has
the potential for small gains in recall, but can hurt precision.
On the Web, hypertext links between documents are a crucial source of information.
The PageRank algorithm
PageRank was one of the two original ideas that set Google’s search apart from other Web search
engines when it was introduced in 1997. (The other innovation was the use of anchor text—the
underlined text in a hyperlink).
PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM], how
do we make sure that IBM’s home page, ibm.com, is the first result, even if another page mentions
the term “IBM” more frequently?
The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher:
each in-link is a vote for the quality of the linked-to page.
But if we only counted in-links, then it would be possible for a Web spammer to create a network of
pages and have them all point to a page of his choosing, increasing the score of that page.
Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily.
What is a high quality site? One that is linked to by other high-quality sites.
The definition is recursive, but we will see that the recursion bottoms out properly. The PageRank
for a page p is defined as:
PR(p) = (1 − d)/N + d Σ_i PR(in_i)/C(in_i)
where PR(p) is the PageRank of page p, N is the total number of pages in the corpus, in_i are the
pages that link in to p, and C(in_i) is the count of the total number of out-links on page in_i.
The constant d is a damping factor. It can be understood through the random surfer model :
imagine a Web surfer who starts at some random page and begins exploring.
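A minimal sketch of computing PageRank by repeated application of the recurrence, on a tiny made-up link graph (the damping factor d = 0.85 is an assumption):

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}
    for _ in range(iters):
        new_pr = {}
        for p in pages:
            in_links = [q for q in pages if p in links[q]]
            # (1 - d)/N for the random jump, plus the damped votes from in-links
            new_pr[p] = (1 - d) / N + d * sum(pr[q] / len(links[q]) for q in in_links)
        pr = new_pr
    return pr

print(pagerank({'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}))
```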
The HITS algorithm
The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or
HITS, is another influential link-analysis algorithm .
Given a query, HITS first finds a set of pages that are relevant to the query. It does that by
intersecting hit lists of query words and then adding pages in the link neighborhood of these
pages.
Both PageRank and HITS played important roles in developing our understanding of Web
information retrieval.
These algorithms and their extensions are used in ranking billions of queries daily as search
engines steadily develop better ways of extracting yet finer signals of search relevance.
Question answering
Information retrieval is the task of finding documents that are relevant to a query, where the query may be a question, or just a topic area or concept.
Question answering is a somewhat different task, in which the query really is a question, and the
answer is not a ranked list of documents but rather a short response—a sentence, or even just a
phrase.
There have been question-answering NLP (natural language processing) systems since the 1960s,
but only since 2001 have such systems used Web information retrieval to radically increase their
breadth of coverage.
INFORMATION EXTRACTION
Information extraction is the process of acquiring knowledge by skimming a text and looking for
occurrences of a particular class of object and for relationships among objects.
A typical task is to extract instances of addresses from Web pages, with database fields for street,
city, state, and zip code; or instances of storms from weather reports, with fields for temperature,
wind speed, and so on.
In a limited domain, this can be done with high accuracy. As the domain gets more general, more
complex linguistic models and more complex learning techniques are necessary.
Finite-state automata for information extraction:
The simplest type of information extraction system is an attribute-based extraction system
that assumes that the entire text refers to a single object and the task is to extract attributes of
that object.
For example, the problem of extracting from the text “IBM Think Book 970. Our price:
$399.00” the set of attributes {Manufacturer=IBM, Model=ThinkBook970, Price=$399.00}.
We can address this problem by defining a template (also known as a pattern) for each
attribute we would like to extract. The template is defined by a finite state automaton, the
simplest example of which is the regular expression, or regex.
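For illustration, here is a minimal regex-template sketch for the example text; the patterns are illustrative stand-ins, not the templates an actual system would use:

```python
import re

text = "IBM Think Book 970. Our price: $399.00"

templates = {
    "Manufacturer": r"\b(IBM|Dell|HP)\b",          # small illustrative vocabulary
    "Model": r"Think\s?Book\s?\d{3,4}",            # hand-written pattern for this product line
    "Price": r"\$\d+(?:\.\d{2})?",
}

attributes = {}
for name, pattern in templates.items():
    match = re.search(pattern, text)
    if match:
        attributes[name] = match.group(0)

print(attributes)
# {'Manufacturer': 'IBM', 'Model': 'Think Book 970', 'Price': '$399.00'}
```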
One step up from attribute-based extraction systems are relational extraction systems, which deal
with multiple objects and the relations among them.
Thus, when these systems see the text "$249.99," they need to determine not just that it is a price,
but also which object has that price.
A typical relational extraction system is FASTUS, which handles news stories about corporate
mergers and acquisitions.
That is, the system consists of a series of small, efficient finite-state automata (FSAs), where
each automaton receives text as input, transduces the text into a different format, and passes it
along to the next automaton.
FASTUS consists of five stages:
1. Tokenization
2. Complex-word handling
3. Basic-group handling
4. Complex-phrase handling
5. Structure merging
1. FASTUS's first stage is tokenization, which segments the stream of characters into tokens (words,
numbers, and punctuation). Some tokenizers also deal with markup languages such as HTML, SGML, and XML.
2. The second stage handles complex words, including collocations such as "set up" and "joint venture,"
as well as proper names such as "Bridgestone Sports Co."
3. The third stage handles basic groups, meaning noun groups and verb groups. The idea is to
chunk these into units that will be managed by the later stages.
4. The fourth stage combines the basic groups into complex phrases.
5. The final stage merges structures that were built up in the previous step.
Probabilistic models for information extraction
When information extraction must be attempted from noisy or varied input, simple finite-state
approaches fare poorly.
It is too hard to get all the rules and their priorities right; it is better to use a probabilistic model
rather than a rule-based model.
The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or
HMM.
HMM models a progression through a sequence of hidden states, xt, with an observation et at each
step.
To apply HMMs to information extraction, we can either build one big HMM for all the attributes
or build a separate HMM for each attribute. We’ll do the second.
Conditional random fields for information extraction
One issue with HMMs for the information extraction task is that they model a lot of probabilities that
we don't really need: they are generative models of both the observations and the hidden attributes,
whereas for extraction we only need the conditional probability of the hidden attributes given the
observations.
Modeling this conditional probability directly gives us some freedom. We don't need the independence
assumptions of the Markov model; we can have an xt that is dependent on x1.
A framework for this type of model is the conditional random field, or CRF, which models a
conditional probability distribution of a set of target variables given a set of observed variables.
Like Bayesian networks, CRFs can represent many different structures of dependencies among
the variables.
One common structure is the linear-chain conditional random field for representing Markov
dependencies among variables in a temporal sequence.
Thus, HMMs are the temporal version of naive Bayes models, and linear-chain CRFs are the
temporal version of logistic regression.
Ontology extraction from large corpora
Ontology extraction differs from the extraction tasks above in several ways. First, it is open-ended: we
want to acquire facts about all types of domains, not just one specific domain.
Second, with a large corpus, this task is dominated by precision, not recall—just as with question
answering on the Web .
Third, the results can be statistical aggregates gathered from multiple sources, rather than being
extracted from one specific text.
Automated template construction
Fortunately, it is possible to learn templates from a few examples, then use the templates to learn
more examples, from which more templates can be learned, and so on.
In one of the first experiments of this kind, Brin (1999) started with a data set of just five examples
(“Isaac Asimov”, “The Robots of Dawn”)
(“David Brin”, “Startide Rising”)
(“James Gleick”, “Chaos—Making a New Science”)
(“Charles Dickens”, “Great Expectations”)
(“William Shakespeare”, “The Comedy of Errors”)
Clearly these are examples of the author–title relation, but the learning system had no knowledge
of authors or titles.
The words in these examples were used in a search over a Web corpus, resulting in 199 matches.
Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL) ,
where Order is true if the author came first and false if the title came first, Middle is the characters between the
author and title, Prefix is the 10 characters before the match, Postfix is the 10 characters after the match, and
URL is the Web address where the match was made.
Machine reading
Automated template construction is a big step up from handcrafted template construction, but it still requires a
handful of labeled examples of each relation to get started.
To build a large ontology with many thousands of relations, even that amount of work would be onerous; we
would like to have an extraction system with no human input of any kind—a system that could read on its own
and build up its own database.
Such a system would be relation-independent; would work for any relation. In practice, these systems work on
all relations in parallel, because of the I/O demands of large corpora.
They behave less like a traditional information extraction system that is targeted at a few relations and more
like a human reader who learns from the text itself; because of this the field has been called machine reading.
INTRODUCTION
Communication is the intentional exchange of information brought about by the production
and perception of signs drawn from a shared system of conventional signs. Most animals use signs
to represent important messages: food here, predator nearby, approach, withdraw, let's mate.
PHRASE STRUCTURE GRAMMARS
The n-gram language models were based on sequences of words.
The big issue for these models is data sparsity: with a vocabulary of, say, 10^5 words there are 10^15 trigram
probabilities to estimate, and so a corpus of even a trillion words will not be able to supply reliable estimates for all of them.
Despite the exceptions, the notion of a lexical category (also known as a part of speech) such as noun or
adjective is a useful generalization—useful in its own right, but more so when we string together lexical
categories to form syntactic categories such as noun phrase or verb phrase, and combine these syntactic
categories into trees representing the phrase structure of sentences: nested phrases, each marked with a
category .
GENERATIVE CAPACITY
Grammatical formalisms can be classified by their generative capacity: the set of languages they
can represent.
Chomsky (1957) describes four classes of grammatical formalisms that differ only in the form of
the rewrite rules.
The classes can be arranged in a hierarchy, where each class can be used to describe all the
languages that can be described by a less powerful class, as well as some additional languages.
1. Recursively enumerable grammars use unrestricted rules: both sides of the rewrite rules can have
any number of terminal and nonterminal symbols, as in the rule A B C → D E.
2. Context-sensitive grammars are restricted only in that the right-hand side must contain at least as
many symbols as the left-hand side. The name "context sensitive" comes from the fact that a rule such
as A X B → A Y B says that an X can be rewritten as a Y in the context of a preceding A and a following B.
3. In context-free grammars (or CFGs), the left-hand side consists of a single nonterminal
symbol. Thus, each rule licenses rewriting the nonterminal as the right-hand side in any context.
4. In regular grammars, every rule has a single nonterminal on the left-hand side and a terminal symbol
optionally followed by a nonterminal on the right-hand side. Regular grammars are equivalent in power
to finite state machines. They are poorly suited for programming languages, because they cannot
represent constructs such as balanced opening and closing parentheses.
The closest they can come is representing a∗b∗, a sequence of any number of as followed by
any number of bs.
There have been many competing language models based on the idea of phrase structure; we will
describe a popular model called the probabilistic context-free grammar, or PCFG.
A grammar is a collection of rules that defines a language as a set of allowable strings of words.
Probabilistic means that the grammar assigns a probability to every string.
VP → Verb [0.70]
   | VP NP [0.30]
Here VP (verb phrase) and NP (noun phrase) are non-terminal symbols. The grammar also refers to
actual words, which are called terminal symbols.
This rule is saying that with probability 0.70 a verb phrase consists solely of a verb, and with
probability 0.30 it is a VP followed by an NP.
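As a toy illustration of a probabilistic grammar, the sketch below encodes the VP rule above plus a few made-up NP, Verb, and Noun expansions, and samples strings from it:

```python
import random

grammar = {
    'VP':   [(['Verb'], 0.70), (['VP', 'NP'], 0.30)],   # the rule described above
    'NP':   [(['Noun'], 1.00)],
    'Verb': [(['eat'], 0.5), (['sleep'], 0.5)],          # placeholder terminals
    'Noun': [(['bananas'], 1.0)],
}

def generate(symbol):
    """Expand a symbol by sampling rules; words not in the grammar are terminals."""
    if symbol not in grammar:
        return [symbol]
    expansions, probs = zip(*grammar[symbol])
    chosen = random.choices(expansions, weights=probs, k=1)[0]
    return [word for s in chosen for word in generate(s)]

print(' '.join(generate('VP')))   # e.g. "eat" or "eat bananas"
```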
The lexicon of E0
First we define the lexicon, or list of allowable words. The words are grouped into the lexical
categories familiar to dictionary users: nouns, pronouns, and names to denote things; verbs to denote
events; adjectives to modify nouns; adverbs to modify verbs; and function words: articles (such as
the), prepositions (in), and conjunctions (and).
Each of the categories ends in . . . to indicate that there are other words in the category.
The Grammar of E0
The next step is to combine the words into phrases.
A grammar for E0, with rules for each of the six syntactic categories and an example for each rewrite rule.
SYNTACTIC ANALYSIS (PARSING)
Parsing is the process of analyzing a string of words to uncover its phrase structure, according
to the rules of a grammar.
1. Have the students in section 2 of Computer Science 101 take the exam.
2. Have the students in section 2 of Computer Science 101 taken the exam?
If the algorithm guesses wrong, it will have to backtrack all the way to the first word and reanalyze the whole
sentence under the other interpretation.
To avoid this source of inefficiency we can use dynamic programming: every time we analyze a substring,
store the results so we won’t have to reanalyze it later.
For example, once we discover that “the students in section 2 of Computer Science 101” is an NP, we can
record that result in a data structure known as a chart.
There are many types of chart parsers; we describe a bottom-up version called the CYK algorithm, after its
inventors, John Cocke, Daniel Younger, and Tadao Kasami.
CYK algorithm
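A minimal sketch of a CYK-style chart recognizer for a grammar in Chomsky normal form (the tiny lexicon and rules below are illustrative):

```python
from collections import defaultdict

def cyk_parse(words, lexical, binary, start='S'):
    """Return True if the word sequence can be derived from the start symbol."""
    n = len(words)
    chart = defaultdict(set)                     # chart[(i, j)]: categories spanning words i..j-1
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {A for A, word in lexical if word == w}
    for length in range(2, n + 1):               # span length
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):            # split point
                for A, B, C in binary:
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        chart[(i, j)].add(A)     # store the result so it is never recomputed
    return start in chart[(0, n)]

lexical = [('NP', 'students'), ('Verb', 'take'), ('NP', 'exams')]
binary = [('S', 'NP', 'VP'), ('VP', 'Verb', 'NP')]
print(cyk_parse(['students', 'take', 'exams'], lexical, binary))   # True
```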
Learning probabilities for PCFGs
This suggests that learning the grammar from data might be better than a knowledge engineering
approach.
Learning is easiest if we are given a corpus of correctly parsed sentences, commonly called a
treebank.
The Penn Treebank is the best known; it consists of 3 million words which have been annotated
with part of speech and parse-tree structure, using human labor assisted by some automated tools.
Comparing context-free and Markov models
The problem with PCFGs is that they are context-free.
That means that the difference between P (“eat a banana”) and P (“eat a bandanna”) depends only
on P (Noun → “banana”) versus
P (Noun → “bandanna”) and not on the relation between “eat” and the respective objects.
A Markov model of order two or more, given a sufficiently large corpus, will know that “eat a
banana” is more probable.
We can combine a PCFG and Markov model to get the best of both. The simplest approach is to
estimate the probability of a sentence with the geometric mean of the probabilities computed by
both models.
Another problem with PCFGs is that they tend to have too strong a preference for shorter
sentences.
AUGMENTED GRAMMARS AND SEMANTIC INTERPRETATION
Lexicalized PCFGs
To get at the relationship between the verb “eat” and the nouns “banana” versus “bandanna,”
we can use a lexicalized PCFG, in which the probabilities for a rule depend on the
relationship between words in the parse tree, not just on the adjacency of words in a sentence.
Of course, we can’t have the probability depend on every word in the tree, because we won’t have
enough training data to estimate all those probabilities.
It is useful to introduce the notion of the head of a phrase—the most important word. Thus, “eat”
is the head of the VP “eat a banana” and “banana” is the head of the NP “a banana.”
We use the notation VP(v) to denote a phrase with category VP whose head word is v. We say
that the category VP is augmented with the head variable v.
Formal definition of augmented grammar rules
Augmented rules are complicated, so we will give them a formal definition by showing how an
augmented rule can be translated into a logical sentence.
The sentence will have the form of a definite clause, so the result is called a definite clause
grammar, or DCG.
Case agreement and subject–verb agreement
We split NP into two categories, NPS and NPO, to stand for noun phrases in the subjective and objective
case, respectively.
We would also need to split the category Pronoun into the two categories PronounS (which includes “I”) and
PronounO (which includes “me”).
Semantic interpretation
To show how to add semantics to a grammar, we start with an example that is simpler than English: the
semantics of arithmetic expressions.
MACHINE TRANSLATION
Rough translation, as provided by free online services, gives the "gist" of a foreign sentence or document.
Pre-edited translation is used by companies to publish their documentation and sales materials in
multiple languages.
The original source text is written in a constrained language that is easier to translate
automatically, and the results are usually edited by a human to correct any errors.
Restricted-source translation works fully automatically, but only on highly stereotypical language, such as a weather report.
They keep a database of translation rules (or examples), and whenever the rule (or example)
matches, they translate directly.
All it needs is data: sample translations from which a translation model can be learned. To
translate a sentence in, say, English (e) into French (f), we find the string of words f* that
maximizes
f* = argmax_f P(f | e) = argmax_f P(e | f) P(f)
Here the factor P(f) is the target language model for French; it says how probable a given
sentence is in French. P(e | f) is the translation model.
All that remains is to learn the phrasal and distortion probabilities; we only sketch the procedure here.
SPEECH RECOGNITION
Speech recognition is the task of identifying a sequence of words uttered by a speaker, given the
acoustic signal.
It has become one of the mainstream applications of AI—millions of people interact with speech
recognition systems every day to navigate voice mail systems, search the Web from mobile
phones, and other applications.
Speech recognition is difficult because the sounds made by a speaker are ambiguous and,
well, noisy.
First, segmentation: in fast speech there are no pauses between words, so word boundaries must be inferred.
Second, coarticulation: when speaking quickly, the "s" sound at the end of "nice" merges
with the "b" sound at the beginning of "beach," yielding something that is close to "sp."
Another problem that does not show up in this example is homophones—words like “to,”
“too,” and “two” that sound the same but differ in meaning.
Most speech recognition systems use a language model that makes the Markov assumption—that
the current state Word t depends only on a fixed number n of previous states—and represent
Word t as a single random variable taking on a finite set of values, which makes it a Hidden
Markov Model (HMM).
Acoustic model
The precision of each measurement is determined by the quantization factor; speech recognizers
typically keep 8 to 12 bits.
A phoneme is the smallest unit of sound that has a distinct meaning to speakers of a particular
language.
For example, the “t” in “stick” sounds similar enough to the “t” in “tick” that speakers of English
consider them the same phoneme.
First, we observe that although the sound frequencies in speech may be several kHz, the changes
in the content of the signal occur much less often, perhaps at no more than 100 Hz.
Language model
For general-purpose speech recognition, the language model can be an n-gram model of text learned
from a corpus of written sentences.
However, spoken language has different characteristics than written language, so it is better to get a
corpus of transcripts of spoken language.
For task-specific speech recognition, the corpus should be task-specific: to build your airline
reservation system, get transcripts of prior calls.
It also helps to have task-specific vocabulary, such as a list of all the airports and cities served,
and all the flight numbers.
IMAGE FORMATION
Imaging distorts the appearance of objects. For example, a picture taken looking down a long
straight set of railway tracks will suggest that the rails converge and meet. As another example, if
you hold your hand in front of your eye, you can block out the moon, which is not smaller than your
hand. As you move your hand back and forth or tilt it, your hand will seem to shrink and grow in the
image, but it is not doing so in reality. Models of these effects are essential for both recognition and
reconstruction.
Images without lenses: The pinhole camera
In cameras, the image is formed on an image plane, which can be a piece of film coated with
silver halides or a rectangular grid of a few million photosensitive pixels, each a complementary
metal-oxide semiconductor (CMOS) or charge-coupled device (CCD).
Lens systems
Scaled orthographic projection
The appropriate model is scaled orthographic projection. The idea is as follows: if the depth Z
of points on the object varies within some range Z0 ± ΔZ, with ΔZ ≪ Z0, then the perspective scaling
factor f/Z can be approximated by a constant s = f/Z0. The equations for projection from the scene
coordinates (X, Y, Z) to the image plane become x = sX and y = sY. Scaled orthographic projection is
an approximation that is valid only for those parts of the scene with not much internal depth
variation. For example, scaled orthographic projection can be a good model for the features on the
front of a distant building.
Light and shading
The brightness of a surface patch in the image depends on three main causes. The first cause is the overall intensity of the light. Even though a white object in shadow may be less
bright than a black object in direct sunlight, the eye can distinguish relative brightness well, and
perceive the white object as white. Second, different points in the scene may reflect more or less of
the light. Third, surface patches facing the light are brighter than surface patches tilted away from the
light, an effect known as shading. Most surfaces reflect light by a process of diffuse reflection.
The main source of illumination outside is the sun, whose rays all travel parallel to one another. We
model this behaviour as a distant point light source. This is the most important model of lighting, and
is quite effective for indoor scenes as well as outdoor scenes. The amount of light collected by a surface
patch in this model depends on the angle θ between the illumination direction and the normal to the
surface.
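The sketch below is a minimal version of this distant point-light-source model: brightness is proportional to cos θ, computed as the dot product of unit vectors, and clamped at zero for patches facing away from the light. The albedo and intensity values are arbitrary.

```python
import numpy as np

def diffuse_brightness(normal, light_dir, albedo=0.8, intensity=1.0):
    """Lambert's cosine law: brightness proportional to cos(theta).

    theta is the angle between the surface normal and the illumination
    direction; patches tilted away from the light receive no direct light.
    """
    n = np.asarray(normal, dtype=float)
    l = np.asarray(light_dir, dtype=float)
    n /= np.linalg.norm(n)
    l /= np.linalg.norm(l)
    cos_theta = max(0.0, float(np.dot(n, l)))
    return albedo * intensity * cos_theta

# A patch facing the light is brighter than one tilted 60 degrees away.
print(diffuse_brightness([0, 0, 1], [0, 0, 1]))        # cos(theta) = 1.0 -> 0.8
print(diffuse_brightness([0, 0.866, 0.5], [0, 0, 1]))  # cos(theta) ~ 0.5 -> 0.4
```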
Colour
The principle of trichromacy states that for any spectral energy density, no matter how
complicated, it is possible to construct another spectral energy density consisting of a mixture of
just three colours—usually red, green, and blue—such that a human can’t tell the difference
between the two. That means that our TVs and computer displays can get by with just the three
red/green/blue (or R/G/B) colour elements. It makes our computer vision algorithms easier, too.
Each surface can be modelled with three different albedos for R/G/B. Similarly, each light source
can be modelled with three R/G/B intensities. We then apply Lambert’s cosine law to each to get
three R/G/B pixel values. This model predicts, correctly, that the same surface will produce
different coloured image patches under different-coloured lights. In fact, human observers are
quite good at ignoring the effects of different coloured lights and are able to estimate the colour of
the surface under white light, an effect known as colour constancy.
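Building on the diffuse model above, the sketch below applies Lambert’s cosine law independently to each R/G/B channel, with invented per-channel albedos and light intensities; it shows the model’s prediction that the same surface produces different pixel colours under different-coloured lights.

```python
def rgb_pixel(albedo_rgb, light_rgb, cos_theta):
    """Apply Lambert's cosine law independently to each R/G/B channel."""
    return tuple(a * i * cos_theta for a, i in zip(albedo_rgb, light_rgb))

surface = (0.9, 0.6, 0.3)        # per-channel albedo of one surface patch
white_light  = (1.0, 1.0, 1.0)
bluish_light = (0.4, 0.5, 1.0)

# The same surface yields different pixel colours under different lights,
# even though humans (via colour constancy) would call it the same colour.
print(rgb_pixel(surface, white_light, 0.8))
print(rgb_pixel(surface, bluish_light, 0.8))
```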
EARLY IMAGE-PROCESSING OPERATIONS
We will study three useful image-processing operations: edge detection, texture analysis,
and computation of optical flow. These are called “early” or “low-level” operations because
they are the first in a pipeline of operations. Early vision operations are characterized by
their local nature (they can be carried out in one part of the image without regard for
anything more than a few pixels away) and by their lack of knowledge: we can perform
these operations without consideration of the objects that might be present in the scene.
This makes the low-level operations good candidates for implementation in parallel
hardware—either in a graphics processor unit (GPU) or an eye. We will then look at one
mid-level operation: segmenting the image into regions.
Edge detection
Edges are straight lines or curves in the image plane across which there is a “significant”
change in image brightness. The goal of edge detection is to abstract away from the messy,
multimegabyte image toward a more compact, abstract representation; the motivation is that
edge contours in the image correspond to important scene contours. Such contours can arise from
several distinct causes, such as depth discontinuities, changes in surface orientation or
reflectance, and shadows. Edge detection is concerned only with the image, and thus does not
distinguish between these causes.
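A minimal gradient-based edge detector is sketched below using the standard Sobel kernels; the threshold and test image are arbitrary, and practical detectors add smoothing, non-maximum suppression, and hysteresis thresholding.

```python
import numpy as np

def sobel_edges(image, threshold=1.0):
    """Mark pixels where the image gradient magnitude exceeds a threshold.

    `image` is a 2-D array of brightness values.  The Sobel kernels estimate
    the horizontal and vertical brightness derivatives; a large gradient
    magnitude indicates a significant local change, i.e. a candidate edge.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = image.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = image[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(kx * patch)
            gy[i, j] = np.sum(ky * patch)
    return np.hypot(gx, gy) > threshold

# A synthetic image: dark on the left, bright on the right -> a vertical edge.
img = np.zeros((8, 8))
img[:, 4:] = 10.0
print(sobel_edges(img).astype(int))
```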
ROBOTICS
Robots are physical agents that perform tasks by manipulating the physical world. To do so they are
equipped with effectors such as legs, wheels, joints, and grippers. Effectors have a single purpose:
to assert physical forces on the environment. Robots are also equipped with sensors, which allow
them to perceive their environment.
Present day robotics employs a diverse set of sensors, including cameras and lasers to measure the
environment, and gyroscopes and accelerometers to measure the robot’s own motion.
Most of today’s robots fall into one of three primary categories. The first category is the
manipulator, or robot arm, which is physically anchored to its workplace.
The second category is the mobile robot. Mobile robots move about their environment using
wheels, legs, or similar mechanisms.
They have been put to use delivering food in hospitals, moving containers at loading docks, and
similar tasks. Unmanned ground vehicles, or UGVs, drive autonomously on streets, highways, and
off-road.
A planetary rover explored Mars for a period of three months in 1997.
Other types of mobile robots include unmanned air vehicles (UAVs), commonly used for
surveillance, crop-spraying, and military operations, and autonomous underwater vehicles (AUVs),
used in deep-sea exploration. Mobile robots also deliver packages in the workplace and vacuum
the floors at home.
The third type of robot combines mobility with manipulation, and is often called a mobile
manipulator. Humanoid robots mimic the human torso; two early humanoid robots were manufactured
by Honda Corp. in Japan. Mobile manipulators can apply their effectors further afield than
anchored manipulators can, but their task is made harder because they lack the rigidity that an
anchor provides.
ROBOT HARDWARE
Sensors: Sensors are the perceptual interface between robot and environment.
Passive sensors, such as cameras, are true observers of the environment: they capture signals that
are generated by other sources in the environment.
Active sensors, such as sonar, send energy into the environment.
Range finders are sensors that measure the distance to nearby objects.
In the early days of robotics, robots were commonly equipped with sonar sensors.
Stereo vision relies on multiple cameras to image the environment from slightly
different viewpoints.
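For a rectified stereo pair, depth follows from the standard relation Z = f·b/d, where b is the baseline between the cameras, f the focal length in pixels, and d the disparity (the pixel shift of a scene point between the two images). The numbers in the sketch below are purely illustrative.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """For a rectified stereo pair, depth Z = f * b / d.

    Nearby points have large disparity, distant points small disparity.
    """
    if disparity_px <= 0:
        return float("inf")      # zero disparity: the point is effectively at infinity
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 700-pixel focal length, cameras 10 cm apart.
for d in (70.0, 7.0, 0.7):
    print(d, "px ->", depth_from_disparity(d, focal_px=700.0, baseline_m=0.1), "m")
```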
A time-of-flight camera acquires range images at up to 60 frames per second.
These sensors are called scanning lidars (short for light detection and ranging).
At the other extreme of range sensing are tactile sensors such as whiskers.
A second important class of sensors is location sensors.
Outdoors, the Global Positioning System is the most common solution to the localization problem.
Differential GPS involves a second ground receiver with known location, providing millimetre
accuracy under ideal conditions.
The third important class is proprioceptive sensors, which inform the robot of its own motion.
Inertial sensors, such as gyroscopes, rely on the resistance of mass to the change of velocity. They
can help reduce uncertainty.
EFFECTORS
To understand the design of effectors, it will help to talk about motion and shape in the abstract,
using the concept of a degree of freedom (DOF). We count one degree of freedom for each
independent direction in which a robot, or one of its effectors, can move.
A free-moving rigid body such as a UAV has six degrees of freedom: three for its position in
space and three for its angular orientation (yaw, roll, and pitch). These six degrees define the
kinematic state or pose of the robot. The dynamic state of a robot includes these six plus an
additional six dimensions for the rate of change of each kinematic dimension, that is, their
velocities.
For nonrigid bodies, such as robot arms, there are additional degrees of freedom within the robot
itself. For example, one early robot arm had exactly six degrees of freedom, created by five
revolute joints that generate rotational motion and one prismatic joint that
generates sliding motion. You can verify that the human arm as a whole has more than six degrees
of freedom by a simple experiment: put your hand on the table and notice that you still have the
freedom to rotate your elbow without changing the configuration of your hand. Manipulators that
have extra degrees of freedom are easier to control than robots with only the minimum number of
DOFs. Many industrial manipulators therefore have seven DOFs, not six.
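The sketch below illustrates degrees of freedom with forward kinematics for a planar arm made of revolute joints (a simplified, two-dimensional stand-in for a real manipulator): two different joint configurations place the hand at the same point, which is a planar analogue of rotating your elbow while your hand stays fixed.

```python
import math

def fk_planar(thetas, link_lengths):
    """Forward kinematics of a planar arm made of revolute joints.

    Each joint adds one rotational degree of freedom; the end-effector
    position is the sum of the link vectors.
    """
    x = y = 0.0
    angle = 0.0
    for theta, length in zip(thetas, link_lengths):
        angle += theta
        x += length * math.cos(angle)
        y += length * math.sin(angle)
    return (x, y)

links = [1.0, 1.0]
elbow_up   = [math.radians(30), math.radians(40)]
elbow_down = [math.radians(70), math.radians(-40)]

# Two different joint configurations reach the same hand position.
print(fk_planar(elbow_up, links))
print(fk_planar(elbow_down, links))
```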
Differential drive robots possess two independently actuated wheels (or tracks), one on each side, as
on a military tank. If both wheels move at the same velocity, the robot moves on a straight line. If they
move in opposite directions, the robot turns on the spot. An alternative is the synchro drive,
in which each wheel can both move and turn around its own axis.
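A commonly used kinematic model for the differential drive (standard in robotics texts, not spelled out above) sets the forward speed to the average of the two wheel speeds and the turning rate to their difference divided by the wheel separation; the sketch below integrates one small time step with arbitrary parameter values.

```python
import math

def diff_drive_step(x, y, heading, v_left, v_right, wheel_sep=0.5, dt=0.1):
    """One Euler-integration step of the common differential-drive model.

    v = (v_right + v_left) / 2   forward speed of the robot centre
    w = (v_right - v_left) / L   turning rate (L = wheel separation)
    Equal wheel speeds give straight-line motion; opposite speeds spin in place.
    """
    v = (v_right + v_left) / 2.0
    w = (v_right - v_left) / wheel_sep
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += w * dt
    return x, y, heading

print(diff_drive_step(0, 0, 0, 1.0, 1.0))    # both wheels forward: straight line
print(diff_drive_step(0, 0, 0, -1.0, 1.0))   # opposite wheels: turn on the spot
```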
Legged robots have been made to walk, run, and even hop, as demonstrated by early one-legged
hopping robots. Such a robot is dynamically stable, meaning that it can remain upright while
hopping around. A robot that can remain upright without moving its legs is called statically
stable.
ROBOTIC PERCEPTION
Perception is the process by which robots map sensor measurements into internal representations
of the environment. Perception is difficult because sensors are noisy, and the environment is partially
observable, unpredictable, and often dynamic. In other words, robots have all the problems of state
estimation (or filtering). As a rule of thumb, good internal representations for robots have three
properties: they contain enough information for the robot to make good decisions, they are structured
so that they can be updated efficiently, and they are natural in the sense that internal variables
correspond to natural state variables in the physical world.
Earlier we saw that Kalman filters, HMMs, and dynamic Bayes nets can represent the transition and
sensor models of a partially observable environment, and we described both exact and approximate
algorithms for updating the belief state.
We would like to compute the new belief state, P(X_{t+1} | z_{1:t+1}, a_{1:t}), from the current
belief state P(X_t | z_{1:t}, a_{1:t−1}) and the new observation z_{t+1}. We did this in Section
15.2, but here there are two differences: we condition explicitly on the actions as well as the
observations, and we deal with continuous rather than discrete variables.
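As a rough illustration of this recursive update, the sketch below implements a discrete Bayes filter over a small one-dimensional grid of positions: a prediction step that conditions on the action through an invented noisy motion model, followed by a correction step that weights by an invented sensor model and renormalizes. The continuous case described above replaces the sums with integrals (or uses Kalman or particle filters).

```python
# A minimal discrete Bayes filter over a small 1-D grid of robot positions.
# The motion and sensor models are made up for illustration.

N = 10                                   # grid cells
belief = [1.0 / N] * N                   # uniform prior over positions

def predict(belief, action):
    """Prediction step: condition on the action with a noisy motion model.

    The robot intends to move `action` cells to the right, but with 10%
    probability it undershoots by one cell and 10% it overshoots by one.
    The world wraps around only to keep the sketch short.
    """
    new = [0.0] * N
    for i, p in enumerate(belief):
        for offset, w in ((action - 1, 0.1), (action, 0.8), (action + 1, 0.1)):
            new[(i + offset) % N] += p * w
    return new

def correct(belief, z, landmark_cells):
    """Correction step: weight by P(z | position) and renormalize."""
    likelihood = [0.9 if (i in landmark_cells) == z else 0.1 for i in range(N)]
    unnormalized = [l * p for l, p in zip(likelihood, belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

landmarks = {2, 6}                       # cells where the sensor reads "True"
belief = correct(predict(belief, action=1), z=True, landmark_cells=landmarks)
belief = correct(predict(belief, action=1), z=True, landmark_cells=landmarks)
print([round(b, 2) for b in belief])     # probability mass concentrates near the landmarks
```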