Lexical Analysis in Modern Compilers

The document discusses lexical analysis, which involves breaking a program's source code into tokens. It describes how regular expressions can be used to define tokens and how a lexical analyzer works by scanning the input, identifying tokens based on the regular expressions, and outputting a sequence of tokens. The key steps in lexical analysis are defining tokens with regular expressions, constructing a finite state machine to recognize tokens, handling errors, and automatically generating an efficient scanner from the token definitions and finite state machine.

Uploaded by

cute_guddy

Lexical Analysis

Textbook: Modern Compiler Design, Chapter 2.1

A motivating example
Create a program that counts the number of lines in a given input text file

Solution
%option noyywrap
%{
int num_lines = 0;    /* initial value */
%}
%%
\n      ++num_lines;  /* newline: count it */
.       ;             /* any other character: ignore it */
%%
int main(void)
{
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}
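For comparison, the same count can be written by hand in C. This is only a sketch of what the two Flex rules do, not the code Flex generates. The function name `count_lines` and the use of an in-memory string instead of a file are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Counts newline characters in a NUL-terminated buffer.
   Mirrors the two Flex rules: '\n' increments the counter,
   any other character is ignored. */
int count_lines(const char *text)
{
    assert(text != NULL);
    int num_lines = 0;
    for (const char *p = text; *p != '\0'; ++p) {
        if (*p == '\n')
            ++num_lines;      /* rule:  \n   ++num_lines;  */
        /* rule:  .   ;   (nothing to do) */
    }
    return num_lines;
}
```

Calling `count_lines` on the contents of a file gives the number that the Flex program prints for that file.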

Outline
Roles of lexical analysis
What is a token
Regular expressions and regular descriptions
Lexical analysis
Automatic Creation of Lexical Analysis
Error Handling

Basic Compiler Phases

Source program (string)
  Front-End
    lexical analysis → Tokens
    syntax analysis → Abstract syntax tree
    semantic analysis → Annotated abstract syntax tree
  Back-End
    Fin. Assembly

Example Tokens
Type      Examples
ID        foo n_14 last
NUM       73 00 517 082
REAL      66.1 .5 10. 1e67 5.5e-10
IF        if
COMMA     ,
NOTEQ     !=
LPAREN    (
RPAREN    )

Example NonTokens
Type                      Examples
comment                   /* ignored */
preprocessor directive    #include <foo.h>
                          #define NUMS 5, 6
macro                     NUMS
whitespace                \t \n \b

Example
void match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.;
}

VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN
LBRACE
IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN
RETURN REAL(0.0) SEMI
RBRACE
EOF

Lexical Analysis (Scanning)

Input: program text (file)
Output: sequence of tokens
Read input file
Identify language keywords and standard identifiers
Handle include files and macros
Count line numbers
Remove whitespace
Report illegal symbols
Produce symbol table

Why Lexical Analysis

Simplifies the syntax analysis
And language definition

Modularity Reusability Efficiency

What is a token?
Defined by the programming language
Can be separated by spaces
Smallest units
Defined by regular expressions

A simplified scanner for C

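Token nextToken() {
    int c;                      /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;        /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();          /* look one character ahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin);   /* not part of this token: push it back */
                  return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}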

Regular Expressions

Escape characters in regular expressions

\ converts a single operator into text
a\+ (a\+\*)+

Double quotes surround text

"a+*"+

Esthetically ugly
But standard

Regular Descriptions
EBNF where non-terminals are fully defined before first use
letter → [a-zA-Z]
digit → [0-9]
underscore → _
letter_or_digit → letter | digit
underscored_tail → underscore letter_or_digit+
identifier → letter letter_or_digit* underscored_tail

Token description
A token name
A regular expression

The Lexical Analysis Problem

Given
A set of token descriptions
An input string

Partition the string into tokens (class, value)

Ambiguity resolution
The longest matching token
Between two equal length tokens select the first
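The two ambiguity rules can be sketched in C as follows. This is a minimal illustration, not how a generated scanner works internally. The `TokenDesc` type, the hand-written matcher functions and the four sample descriptions are all assumptions introduced for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A token description: a name plus a matcher that returns the length of
   the longest prefix of s it accepts, or 0 if it accepts none. */
typedef struct {
    const char *name;
    size_t (*match)(const char *s);
} TokenDesc;

static size_t match_while(const char *s)
{
    return strncmp(s, "while", 5) == 0 ? 5 : 0;
}

static size_t match_id(const char *s)
{
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            ++n;
    }
    return n;
}

static size_t match_plusplus(const char *s)
{
    return strncmp(s, "++", 2) == 0 ? 2 : 0;
}

static size_t match_plus(const char *s)
{
    return s[0] == '+' ? 1 : 0;
}

/* Declaration order matters: earlier entries win ties. */
static const TokenDesc token_table[] = {
    { "While",    match_while },
    { "Id",       match_id },
    { "PlusPlus", match_plusplus },
    { "Plus",     match_plus },
};

/* Returns the winning description for the text at s and stores the
   matched length in *len; returns NULL if no description matches. */
const TokenDesc *longest_match(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    const TokenDesc *best = NULL;
    *len = 0;
    for (size_t i = 0; i < sizeof token_table / sizeof token_table[0]; ++i) {
        size_t n = token_table[i].match(s);
        if (n > *len) {     /* strictly longer only: a tie keeps the earlier rule */
            *len = n;
            best = &token_table[i];
        }
    }
    return best;
}
```

With this table the text `while` is classified as `While` (a tie with `Id`, won by the earlier declaration), while `whiles` is classified as `Id` because the longer match wins.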

A Flex specification of C Scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]       {;}
[\n]        {line_count++;}
";"         { return SemiColumn; }
"++"        { return PlusPlus; }
"+="        { return PlusEqual; }
"+"         { return Plus; }
"while"     { return While; }
{Letter}({Letter}|{Digit})*   { return Id; }
"<="        { return LessOrEqual; }
"<"         { return LessThen; }

Flex
Input
regular expressions and actions (C code)

Output
A scanner program that reads the input and applies actions when an input regular expression is matched

regular expressions → flex → scanner
input program → scanner → tokens

Naïve Lexical Analysis

Automatic Creation of Efficient Scanners

Naïve approach on regular expressions (dotted items)
Construct non-deterministic finite automaton over items
Convert to a deterministic automaton
Minimize the resultant automaton
Optimize (compress) representation

Dotted Items

Example
T → a+ b+
Input: aab
After parsing aa:
T → a+ • b+

Item Types
Shift item
The dot is in front of a basic pattern
A → (ab)+ • c (de|fe)*

Reduce item
The dot is at the end of the rhs
A → (ab)+ c (de|fe)* •

Basic item
Shift or reduce items

Character Moves
For shift items character moves are simple
T → α • c β   on input c moves to   T → α c • β

Digit → [0-9]
T → α • [0-9] β   on input 7 moves to   T → α [0-9] • β

Moves
For non-shift items the situation is more complicated
What character do we need to see?
Where are we in the matching?
T → • a*
T → • (a*)

Moves for Repetitions

Where can we get from T → α • (R)* β?
If R occurs zero times: T → α (R)* • β
If R occurs one or more times: T → α ( • R)* β

When R ends, from T → α (R • )* β:
T → α (R)* • β
T → α ( • R)* β

Moves

I → [0-9]+
F → [0-9]*.[0-9]+

Input: 3.1;

[Figure: the successive dotted items of F while scanning 3.1;]

Concurrent Search
How to scan multiple token classes in a single run?

I → [0-9]+
F → [0-9]*.[0-9]+

Input: 3.1;

[Figure: the dotted items of I and F, advanced side by side while scanning 3.1;]

The Need for Backtracking

A simple-minded solution may require unbounded backtracking
T1 → a+;
T2 → a
Quadratic behavior
Does not occur in practice
A linear solution exists
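The quadratic behavior can be made visible by counting how many characters a longest-match scanner examines. The sketch below hard-codes the two descriptions T1 → a+; and T2 → a. The function names and the counting scheme are assumptions made for illustration, and the lookahead character that ends each scan is deliberately not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scans one token at s for the two descriptions
       T1 -> a+;      T2 -> a
   using longest match. Returns the token length (0 if there is none)
   and adds the number of characters consumed to *examined. */
static size_t next_token(const char *s, size_t *examined)
{
    size_t i = 0, last_accept = 0;
    while (s[i] == 'a') {
        ++i;
        ++*examined;
        if (i == 1)
            last_accept = 1;    /* a single "a" is a complete T2 */
    }
    if (i > 0 && s[i] == ';') {
        ++*examined;
        last_accept = i + 1;    /* "a...a;" is a complete T1 */
    }
    return last_accept;         /* otherwise back up to the last accepted token */
}

/* Tokenizes the whole input and returns the total number of
   characters consumed, including those re-read after backing up. */
size_t count_examined(const char *input)
{
    assert(input != NULL);
    size_t examined = 0, pos = 0, len = strlen(input);
    while (pos < len) {
        size_t n = next_token(input + pos, &examined);
        if (n == 0)
            break;              /* illegal symbol: stop scanning */
        pos += n;
    }
    return examined;
}
```

For an input of n letters a without a closing semicolon, every call runs to the end of the input hoping for T1 and then falls back to a single-character T2, so the total is n + (n-1) + ... + 1. With the semicolon present, each character is consumed once.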

A Non-Deterministic Finite State Machine

Add a production S → T1 | T2 | … | Tn
Construct NDFA over the items
Initial state S → • (T1 | T2 | … | Tn)
For every character move, construct a character transition <T → α • c β, c> ⇒ T → α c • β
For every ε-move construct an ε-transition
The accepting states are the reduce items
Accept the language defined by Ti

Moves

I → [0-9]+
F → [0-9]*.[0-9]+
S → (I|F)

[Figure: NDFA over the dotted items of I and F, with ε-moves and character transitions on [0-9] and .]

Efficient Scanners
Construct Deterministic Finite Automaton
Every state is a set of items
Every transition is followed by an ε-closure
When a set contains two reduce items select the one declared first

Minimize the resultant automaton
Rejecting states are initially indistinguishable
Accepting states of the same token are indistinguishable

Exponential worst case complexity
Does not occur in practice

Compress representation
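The step "every transition is followed by an ε-closure" can be sketched with bit sets, where each DFA state is a set of NDFA states (items). This is an illustration under stated assumptions: the `StateSet` type, the limit of 32 NDFA states and the table layout (`eps[i]`, `on_char[i]`) are choices made for the example, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t StateSet;      /* bit i set <=> NDFA state i is in the set */

/* eps[i] is the set of states reachable from state i by one epsilon-move.
   Grows the set until no epsilon-move adds a new state. */
StateSet epsilon_closure(StateSet set, const StateSet eps[], int num_states)
{
    assert(num_states >= 0 && num_states <= 32);
    StateSet closure = set;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < num_states; ++i) {
            if ((closure >> i) & 1u) {
                StateSet grown = closure | eps[i];
                if (grown != closure) {
                    closure = grown;
                    changed = 1;
                }
            }
        }
    }
    return closure;
}

/* One DFA transition: follow the character moves of every member state,
   then take the epsilon-closure of the result.
   on_char[i] is the set of states reachable from state i on one fixed character. */
StateSet dfa_move(StateSet set, const StateSet on_char[],
                  const StateSet eps[], int num_states)
{
    StateSet target = 0;
    for (int i = 0; i < num_states; ++i)
        if ((set >> i) & 1u)
            target |= on_char[i];
    return epsilon_closure(target, eps, num_states);
}
```

Starting from the ε-closure of the initial item and applying `dfa_move` for every character until no new sets appear yields the deterministic automaton; an empty result plays the role of the Sink state.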

I → [0-9]+
F → [0-9]*.[0-9]+

[Figure: DFA for I and F. Each state is a set of dotted items; transitions are labeled [0-9] and ., and all other characters lead to the Sink state.]

A Linear-Time Lexical Analyzer

IMPORT Input Char [1..];
SET Read Index TO 1;

PROCEDURE Get_Next_Token:
    SET Start of token TO Read Index;
    SET End of last token TO uninitialized;
    SET Class of last token TO uninitialized;
    SET State TO Initial;
    WHILE State /= Sink:
        SET ch TO Input Char[Read Index];
        SET State TO δ[State, ch];
        IF accepting(State):
            SET Class of last token TO Class(State);
            SET End of last token TO Read Index;
        SET Read Index TO Read Index + 1;
    SET Token .class TO Class of last token;
    SET Token .repr TO Input Char[Start of token .. End of last token];
    SET Read Index TO End of last token + 1;
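The procedure above can be sketched in C for the two token classes I → [0-9]+ and F → [0-9]*.[0-9]+. The state numbering, the `delta` and `class_of` helpers and the 0-based indexing are assumptions made for the example; a generated scanner would use a compressed transition table instead of a `switch`.

```c
#include <assert.h>
#include <stddef.h>

enum { SINK = 0, S1, S2, S3, S4 };      /* S1 is the initial state */
typedef enum { NONE = 0, TOK_I, TOK_F } TokenClass;

/* Transition function for I -> [0-9]+ and F -> [0-9]*.[0-9]+ */
static int delta(int state, char ch)
{
    int digit = (ch >= '0' && ch <= '9');
    switch (state) {
    case S1: return digit ? S2 : (ch == '.' ? S3 : SINK);
    case S2: return digit ? S2 : (ch == '.' ? S3 : SINK);
    case S3: return digit ? S4 : SINK;
    case S4: return digit ? S4 : SINK;
    default: return SINK;
    }
}

/* S2 accepts I, S4 accepts F, every other state is rejecting. */
static TokenClass class_of(int state)
{
    return state == S2 ? TOK_I : state == S4 ? TOK_F : NONE;
}

/* Scans one token starting at input[*read_index]. Stores the token
   length in *length and moves *read_index just past the token.
   Returns NONE (and leaves *read_index unchanged) if no token starts here. */
TokenClass get_next_token(const char *input, size_t *read_index, size_t *length)
{
    assert(input != NULL && read_index != NULL && length != NULL);
    size_t start = *read_index, pos = start, end_of_last = start;
    TokenClass class_of_last = NONE;
    int state = S1;
    while (state != SINK) {
        state = delta(state, input[pos]);   /* the terminating NUL leads to SINK */
        ++pos;
        if (class_of(state) != NONE) {
            class_of_last = class_of(state);
            end_of_last = pos;              /* one past the last accepted character */
        }
    }
    *length = end_of_last - start;
    *read_index = end_of_last;              /* back up to the end of the last token */
    return class_of_last;
}
```

On the input `3.1;` the scanner passes through the accepting state for I after `3`, continues, and finally reports F with length 3. On `3.;` it reads past the dot, finds no digit, and backs up to report I with length 1.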

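Scanning 3.1;
(^ marks the read position)

input     state   next state   last token
^3.1;     1       2            I
3^.1;     2       3            I
3.^1;     3       4            F
3.1^;     4       Sink         F

DFA transitions
1 on [0-9] → 2 (accepts I)
1 on . → 3
1 on [^0-9.] → Sink
2 on [0-9] → 2
2 on . → 3
2 on [^0-9.] → Sink
3 on [0-9] → 4 (accepts F)
3 on [^0-9] → Sink
4 on [0-9] → 4
4 on [^0-9] → Sink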

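Scanning aaa
T1 → a+;
T2 → a
(^ marks the read position, $ the end of the input)

input     state   next state   last token
^aaa$     1       2            T2
a^aa$     2       4            T2
aa^a$     4       4            T2
aaa^$     4       Sink

[Figure: DFA for T1 and T2 with states 1, 2, 4 and Sink; transitions are labeled a, ; and the complementary character classes.]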

Error Handling
Illegal symbols
Common errors

Missing
Creating a lexical analysis by hand
Table compression
Symbol Tables
Handling Macros
Start states
Nested comments

Summary
For most programming languages lexical analyzers can be easily constructed automatically
Exceptions:
Fortran
PL/1

Lex/Flex/Jlex are useful beyond compilers
