
Compiler Construction: Lexical Analysis
Lecture Slides | Based on CSC412 Module
Unit 1: The Scanner
OBJECTIVES

At the end of this unit, you should be able to:
• state the need for a compiler
• state the role of a compiler
• define the scanner
• state the functions of the scanner.

Introduction to lexical analysis, tokens, and scanning

• The lexical analyzer (scanner) reads source code character by character and converts it into tokens.
• Tokens: the smallest units in a program (keywords, identifiers, operators, etc.).
• The scanner simplifies syntax analysis by grouping meaningful symbols together.
Need for Lexical Analysis
• Simplifies parser design: tokens are easier to analyze than raw characters.
• Removes whitespace, comments, and unnecessary symbols.
• Keeps track of line numbers for error reporting.

Role of Lexical Analyzer
• Acts as an interface between the source code and the parser.
• Two approaches:
– Separate pass (writes tokens to an intermediate file).
– Integrated with the parser (called as needed).
• Example: if the input is x = y + 5;, the scanner generates: id(x) = id(y) + num(5) ;
The Scanner
• Definition: groups characters into meaningful tokens.
• Error handling: detects errors (e.g., incomplete strings, invalid symbols).
• Pattern matching: uses regular expressions to recognize tokens.
• Example: tokenizing int x = 10; (a sketch of such a scanner follows below)
– Keyword: int
– Identifier: x
– Operator: =
– Number: 10
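
As an illustration of this behaviour (not code from the module), a minimal hand-written scanner sketch in C could classify the tokens of int x = 10; roughly as follows; the printed token names and the single-keyword table are simplifying assumptions:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token classifier: prints one token per line for a tiny
   language with the single keyword "int", identifiers, numbers and
   one-character operators/punctuation. */
void scan(const char *src) {
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* skip whitespace */
        if (isalpha((unsigned char)*p)) {                     /* identifier or keyword */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            int len = (int)(p - start);
            if (len == 3 && strncmp(start, "int", 3) == 0)
                printf("Keyword: %.*s\n", len, start);
            else
                printf("Identifier: %.*s\n", len, start);
        } else if (isdigit((unsigned char)*p)) {              /* number */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("Number: %.*s\n", (int)(p - start), start);
        } else {                                              /* operator or punctuation */
            printf("Operator: %c\n", *p++);
        }
    }
}

int main(void) {
    scan("int x = 10;");  /* prints Keyword, Identifier, Operator, Number, Operator(;) */
    return 0;
}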
Summary
• The scanner is the first phase of compilation.
• Converts raw input into tokens for further processing.
• Uses pattern-matching techniques (regular expressions, finite automata).
Unit 2: Hand Implementation of Lexical Analyzer
OBJECTIVES

At the end of this unit, you should be able to:
– list the various methods of constructing a lexical analyser
– describe the input buffering method of constructing a lexical analyser
– explain the transition diagram method of constructing a lexical analyser
– state the problems with the hand implementation method of constructing lexical analysers
– construct transition diagrams to handle keywords, identifiers and delimiters.
Introduction
Lexical analyzers can be implemented manually or using tools.
• Manual implementation is based on pattern-matching techniques.
• Two major methods:
– Input Buffering
– Transition Diagrams
Input Buffering
• The input buffer approach is a technique used in lexical analysis to optimize character-by-character scanning by storing the source code in memory buffers.
• Instead of reading one character at a time from disk (which is slow), the scanner loads a block of text into memory and uses two pointers to track tokens efficiently.
Working of Input Buffering
• The lexical analyzer reads input characters into a buffer (usually a two-buffer scheme).
• The forward pointer moves ahead to scan characters and recognize tokens.
• If a token is recognized, the lexeme is extracted and processed.
• If lookahead is needed, the forward pointer moves ahead while lexeme_begin stays.
• If the forward pointer reaches the end of a buffer half, the next half is loaded from the source file (a sketch of this scheme follows below).
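
A minimal sketch (not from the module) of the two-buffer scheme in C; the half size, the '\0' sentinel and the helper names are assumptions, and a real scanner would use a dedicated end-of-buffer character and also copy out the lexeme between lexeme_begin and forward:

#include <stdio.h>

#define BUF_HALF 4096
#define SENTINEL '\0'            /* marker placed after each buffer half */

static char buf[2 * BUF_HALF + 2];
static char *lexeme_begin;       /* start of the lexeme being scanned */
static char *forward;            /* lookahead pointer */
static FILE *src;

/* load one half of the buffer and terminate it with a sentinel */
static void load_half(char *half) {
    size_t n = fread(half, 1, BUF_HALF, src);
    half[n] = SENTINEL;
}

static void init(FILE *f) {
    src = f;
    load_half(buf);              /* fill the first half */
    lexeme_begin = forward = buf;
}

/* advance forward, reloading the other half when a sentinel is reached;
   a sentinel that is not at the end of a half means real end of input */
static int next_char(void) {
    while (*forward == SENTINEL) {
        if (forward == buf + BUF_HALF) {                 /* end of first half */
            load_half(buf + BUF_HALF + 1);
            forward = buf + BUF_HALF + 1;
        } else if (forward == buf + 2 * BUF_HALF + 1) {  /* end of second half */
            load_half(buf);
            forward = buf;
        } else {
            return EOF;
        }
    }
    return (unsigned char)*forward++;
}

int main(void) {
    init(stdin);
    long count = 0;
    while (next_char() != EOF) count++;   /* just count characters as a demo */
    printf("read %ld characters\n", count);
    return 0;
}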
Advantages of Input Buffering
• Speeds up lexical analysis (reduces character-by-character disk reads).
• Efficient handling of lookahead (especially for multi-character tokens like ==).
• Prevents unnecessary disk accesses (since input is loaded into memory buffers).
Disadvantages of Input Buffering
• Requires memory space to store the buffers.
• Complex implementation (managing buffer boundaries).
• Handling lookahead requires extra logic.


Transition Diagrams
• The Transition Diagram (TD) approach is a finite state machine (FSM)-based method used in lexical analysis to recognize tokens.
• A transition diagram consists of:
– States (nodes) → represent different steps in recognizing a token.
– Transitions (edges) → define how the scanner moves between states based on input characters.
– Final (accepting) state → when reached, a token is successfully recognized.
– Start state → the initial state where scanning begins.
– Error state → if an invalid character is encountered, the scanner rejects the input.
Example transition diagrams (written linearly):
(Start) → 'i' → 'f' → (Final) → "IF keyword"
(Start) → (Letter) → (Letter | Digit)* → (Final) → "Identifier"
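
As an illustration (not taken from the module), the identifier diagram above can be encoded directly as a switch on a state variable in C; the state numbers are arbitrary:

#include <ctype.h>
#include <stdio.h>

/* state 0 = start, state 1 = inside an identifier (final state) */
int is_identifier(const char *s) {
    int state = 0;
    for (; *s; s++) {
        switch (state) {
        case 0:                                  /* start state: must see a letter */
            if (isalpha((unsigned char)*s)) state = 1;
            else return 0;                       /* error state: reject */
            break;
        case 1:                                  /* letters or digits may follow */
            if (isalnum((unsigned char)*s)) state = 1;
            else return 0;
            break;
        }
    }
    return state == 1;                           /* accept if we end in the final state */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("x1"), is_identifier("if"), is_identifier("9a"));
    /* prints 1 1 0; note that "if" is accepted as an identifier here --
       distinguishing keywords is the topic of the next slide */
    return 0;
}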


How to Handle Keywords
• There are two ways we can handle keywords.
– We can use the transition diagram for identifiers and, when a delimiter is reached, look up a dictionary (containing all the keywords) to see whether the identifier just scanned is a keyword or not (a sketch of this lookup follows below).
– Another way is to bring all the keywords together in a TD, i.e. construct a TD for each keyword.
• E.g. suppose the following keywords exist in a language: BEGIN, END, IF, THEN, ELSE.
• You can construct a single TD for all of them, giving something like the diagram below.
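
A minimal sketch (not from the module) of the first approach, a dictionary lookup applied to each identifier lexeme; the table contents follow the slide's example keywords:

#include <string.h>

static const char *keywords[] = { "BEGIN", "END", "IF", "THEN", "ELSE" };

/* return 1 if the lexeme is one of the language's keywords */
int is_keyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}

/* A scanner using this approach emits a keyword token when is_keyword()
   returns 1 for a scanned identifier, and an id token otherwise. */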


Advantages of Transition Diagrams
• Clear visual representation of token recognition.
• Easy to implement using finite state machines (FSMs).
• Handles different token patterns in a structured way.

Disadvantages of Transition Diagrams
• Can become complex for large grammars.
• Hard to manage when multiple tokens have overlapping patterns.
• Does not inherently handle whitespace and comments.


Unit 3: Automatic Generation of Lexical Analyzer
• In the previous unit, we discussed manual (hand) implementation of lexical analysers and the difficulties involved. This unit introduces automatic lexical analyser generation using regular expressions and tools such as Lex and Flex.
• Instead of writing a lexer manually, we can use automated tools that take a description of the tokens and generate a lexer automatically.
Language Theory Background
To understand lexical analysers, we need a foundation in language theory.

Definitions
– Symbol: a single character (e.g., a, b, 0, 1).
– Alphabet: a finite set of symbols (e.g., {0, 1}, ASCII, Unicode).
– String (word, sentence): a sequence of characters from an alphabet (e.g., "hello").
– Language: a set of valid strings (words) defined over an alphabet (e.g., { "aa", "ab", "ba", "bb" }).
Operations on Strings
• Concatenation: combining two strings ("ab" + "cd" = "abcd").
• Exponentiation: repeating a string multiple times ("a"^3 = "aaa").
• Prefix, suffix, substring, subsequence: different ways of extracting parts of a string.
Operations on Languages
• Union (L1 ∪ L2) → combines languages (e.g., { "a", "b" } ∪ { "b", "c" } = { "a", "b", "c" }).
• Concatenation (L1L2) → joins words (e.g., { "a", "b" } { "c", "d" } = { "ac", "ad", "bc", "bd" }).
• Exponentiation (L^n) → repeats elements (for L = { "a", "b" }, L^2 = { "aa", "ab", "ba", "bb" }).
• Kleene closure (L*) → zero or more repetitions, including the empty string ({ ε, "a", "aa", "aaa", ... }).
• The Kleene closure always includes ε (the empty string).


Regular Expressions (REs)
• Regular expressions (REs) are used to describe patterns in text. They define regular languages, which are recognized by finite automata.
• Definition of regular expressions (basic rules):
– ε → represents the empty string.
– A single character (e.g., a, b, 0, 1) represents the language { a }.
– Union (|) → matches either pattern (a | b matches "a" or "b").
– Concatenation (rs) → joins two expressions ("ab" means "a" followed by "b").
– Kleene star (*) → matches zero or more repetitions (a* matches ε, "a", "aa", "aaa", ...).
Example:
The RE (a|b)*aa represents words that:
– contain only "a" and "b";
– end with "aa".
Examples: "aa", "baa", "abaa", "bbaaa", etc. (a quick test of this pattern follows below).

• Regular expressions are used in languages and tools such as: awk, Java, JavaScript, Perl, Python, grep, sed.
Lex is a tool that extends regular expressions for tokenizing input, adding special symbols (operators) to the basic RE notation.

Example of a Lex regular expression:
• "a.*b" matches an "a", followed by any characters, ending in "b".
Tokens, Patterns, Lexemes, and Attributes

Tokens
• A token is a pair: <token_name, optional_attribute>
• Examples: <id, ptr_to_symbol_table>, <number, 42>

Patterns and Lexemes
• Pattern: a description of lexemes (usually given as a regular expression).
• Lexeme: a sequence of characters matching a pattern.
Examples of Patterns and Lexemes
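Typical illustrative examples (in the style of standard textbooks, not taken from the original slide):

Token     Pattern                          Sample lexemes
id        letter (letter | digit)*         count, x1, total
number    digit+                           42, 10, 789
relop     < | <= | = | <> | > | >=         <=, >
if        the characters i f               if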
Specifying a Lexical Analyser with Lex

Structure of a Lex Program
• Lex programs consist of three sections:
Declarations
%%
Translation Rules
%%
Auxiliary Functions
• Each translation rule has the form:
pattern { action }
• Pattern: a regular expression defining a token.
• Action: C code executed when the pattern matches.


Example of a Lex Program
• Lex program to count words, numbers, and lines (a sketch of such a program follows below).
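
A minimal sketch of such a Lex specification (assuming a standard Lex/Flex toolchain; the file name count.l is hypothetical):

%{
#include <stdio.h>
int words = 0, numbers = 0, lines = 0;   /* counters updated by the rules below */
%}
%%
[0-9]+           { numbers++; }          /* a run of digits counts as a number */
[a-zA-Z]+        { words++; }            /* a run of letters counts as a word  */
\n               { lines++; }            /* count line ends                    */
.                { /* ignore everything else */ }
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("Words: %d, Numbers: %d, Lines: %d\n", words, numbers, lines);
    return 0;
}

Building it with lex count.l && cc lex.yy.c -o count (or the flex equivalent) produces the scanner.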
Example Input and Output
• Input text:
Hello world 123
This is Lex 456
It counts words and numbers 789
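
For this input, the sketch above would report Words: 10, Numbers: 3, Lines: 3 (assuming each input line ends with a newline).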


Steps in Lex Implementation
• Read the language (token) specification.
• Construct an NFA (Non-Deterministic Finite Automaton).
• Convert the NFA to a DFA (Deterministic Finite Automaton).
• Optimize the DFA.
• Generate the scanner tables and code.

Key question: why are finite automata important in lexical analysis?
• They efficiently recognize patterns (e.g., detecting keywords like for, while).
• They convert regular expressions (REs) into a structured format that a compiler can process.
Unit 4: Implementing a Lexical Analyzer
Introduction
• The lexical analyser (or scanner) is the first stage of a compiler.
• It reads the source code and converts it into tokens (basic units like keywords, identifiers, operators, etc.).
• In this unit, we discuss finite state machines, the recognisers that recognise regular expressions, and how to convert REs into finite automata and vice versa.
Objectives

By the end of this unit, you should be able to:
– define finite automata (FA)
– convert regular expressions (REs) to NFAs
– convert an NFA to a DFA

Finite Automata
• A finite automaton is a machine that takes an input string X and decides whether it belongs to a specific language L.
• Types of finite automata:
– Nondeterministic Finite Automaton (NFA)
– Deterministic Finite Automaton (DFA)
Nondeterministic Finite Automaton (NFA)
• An NFA is a 5-tuple (S, Σ, δ, s₀, F) consisting of:
• ✔️ a finite non-empty set of states S
• ✔️ a finite non-empty input alphabet Σ
• ✔️ a transition function δ mapping a state and an input symbol (or ε) to a set of states, i.e. δ: S × (Σ ∪ {ε}) → P(S)
• ✔️ an initial state s₀ in S
• ✔️ a set of final states F ⊆ S


Example: How an NFA Works
– The machine can be in multiple states at once (because of multiple transitions on the same input).
– Epsilon (ε) moves allow moving between states without consuming input.

Acceptance condition:
– An NFA accepts a string if it can move from the start state to a final state while reading the string.
• The language defined by an NFA is the set of strings accepted by the NFA (see the simulation sketch below).
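
To make the acceptance condition concrete (an illustration, not taken from the slides), the following C sketch simulates a small hypothetical NFA for (a|b)*aa by keeping the set of live states in a bitmask; the nondeterminism shows up in state 0, which has two transitions on 'a':

#include <stdio.h>

/* 3-state NFA for (a|b)*aa:
   state 0: on 'a' -> {0,1}, on 'b' -> {0}
   state 1: on 'a' -> {2}
   state 2: accepting, no outgoing transitions */
int accepts(const char *s) {
    unsigned cur = 1u << 0;                    /* set of current states as a bitmask */
    for (; *s; s++) {
        unsigned next = 0;
        if (cur & (1u << 0)) {
            if (*s == 'a') next |= (1u << 0) | (1u << 1);
            else if (*s == 'b') next |= (1u << 0);
        }
        if (cur & (1u << 1)) {
            if (*s == 'a') next |= (1u << 2);
        }
        cur = next;
        if (!cur) return 0;                    /* no live states left: reject */
    }
    return (cur & (1u << 2)) != 0;             /* accept if a final state is live */
}

int main(void) {
    const char *tests[] = { "aa", "baa", "abaa", "ab", "b" };
    for (int i = 0; i < 5; i++)
        printf("%-5s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}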
Deterministic Finite Automaton (DFA)
• A DFA is an NFA with no ε-moves and only one possible transition per input character.
• Key differences:

Feature                                    NFA      DFA
Multiple transitions for the same input?   ✅ Yes    ❌ No
ε-moves allowed?                           ✅ Yes    ❌ No
Deterministic behavior?                    ❌ No     ✅ Yes


Converting a Regular Expression to an NFA
• Regular expressions (REs) describe patterns in text. We can convert them to an NFA in five steps.
• Steps to convert RE → NFA:
– For a single character "a" → create two states with a transition on 'a'.
– For concatenation (AB) → link the two NFAs together.
– For alternation (A|B) → create a start state with ε-moves to both NFAs.
– For repetition (A)* → add ε-moves to allow looping.
– For grouping (AB)* → combine the rules above.
Converting NFA to DFA (Subset Construction Algorithm)
To create a DFA from an NFA, we:
– 1. Find ε-closures (all states reachable without consuming input).
– 2. Track all possible states the NFA could be in for a given input.
– 3. Merge states where possible to create a deterministic machine.

• Key challenge:
– The number of DFA states can be much larger than the number of NFA states.
– But a DFA is faster in execution because it has a single, clear transition path for each input (see the sketch below).
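
A minimal sketch (not from the module) of these steps in C, for a small hypothetical NFA recognising a*b with one ε-move; closure() implements step 1 (ε-closure) and step() performs the subset move for one input symbol, so the DFA states are the bitmask values that actually arise:

#include <stdio.h>

#define NSTATES 3

/* NFA for a*b:  0 --a--> 0,  0 --ε--> 1,  1 --b--> 2,  state 2 accepting */
static const unsigned eps[NSTATES]  = { 1u << 1, 0, 0 };  /* ε-moves per state */
static const unsigned on_a[NSTATES] = { 1u << 0, 0, 0 };  /* moves on 'a'      */
static const unsigned on_b[NSTATES] = { 0, 1u << 2, 0 };  /* moves on 'b'      */

/* ε-closure: keep adding states reachable by ε-moves until nothing changes */
unsigned closure(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* one subset-construction step: move every live state on c, then take the closure */
unsigned step(unsigned set, char c) {
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s))
            next |= (c == 'a') ? on_a[s] : (c == 'b') ? on_b[s] : 0;
    return closure(next);
}

int accepts(const char *w) {
    unsigned set = closure(1u << 0);            /* DFA start state = closure({0}) */
    for (; *w; w++) set = step(set, *w);
    return (set & (1u << 2)) != 0;              /* accepting if it contains state 2 */
}

int main(void) {
    const char *tests[] = { "b", "aab", "aaa", "ba" };
    for (int i = 0; i < 4; i++)
        printf("%-4s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}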
Conclusion & Summary
• NFA allows multiple paths and ε-moves.
• DFA ensures only one path for each input.
• Regular expressions → NFA → DFA for pattern matching in compilers.
