
Lecture 2.1 - Lexical Analysis

The document discusses the phases of a compiler and lexical analysis as the first phase. It notes that the six phases of a compiler are: 1) lexical analysis, 2) syntax analysis, 3) semantic analysis, 4) intermediate code generation, 5) code optimization, and 6) code generation. Lexical analysis involves scanning source code and grouping characters into tokens. It identifies tokens and enters them into symbol tables which are passed to subsequent phases.


CSE2004

Theory of Computation
and
Compiler Design
Lecture-2.1
Lexical Analyser

14-12-2020 1 Dr. Chandan Kumar Behera


Phases of Compiler
 The compiler operates in various phases; each phase transforms the source program
from one representation to another.
 Every phase takes its input from the previous phase and feeds its output to the next
phase of the compiler.
 There are six phases in a compiler. Each of these phases helps in converting the
high-level language into machine code.
 The phases of a compiler are:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generator
5. Code optimizer
6. Code generator
Phases of Compiler



Different Phases of Compilers
 Lexical analysis is the first phase, in which the compiler scans the source code
 Syntax analysis is all about discovering structure in the text
 Semantic analysis checks the semantic consistency of the code
 Once the semantic analysis phase is over, the compiler generates intermediate code
for the target machine
 Code optimization phase removes unnecessary code lines and rearranges the sequence
of statements
 Code generation phase gets its input from the code optimization phase and produces
machine code or object code as a result
 A symbol table contains a record for each identifier, with fields for the attributes of
the identifier
 The error-handling routine detects and reports errors arising during the various phases



Phase 1: Lexical Analysis
 Lexical analysis is the first phase, in which the compiler scans the source code.
 The scan proceeds left to right, character by character, grouping the characters
into tokens.
 The character stream from the source program is grouped into meaningful
sequences by identifying the tokens. The analyzer makes an entry for each token
in the symbol table and passes the token to the next phase.
 The primary functions of this phase are:
 Identify the lexical units in the source code
 Classify lexical units into classes such as constants and reserved words, and enter them in
different tables; comments in the source program are ignored
 Report any character sequence that does not form a valid token of the language



Example: x = y + 10
 Tokens: x, =, y, +, 10

x    Identifier
=    Assignment operator
y    Identifier
+    Addition operator
10   Number



What is Lexical Analysis?
 Lexical analysis is the very first phase in compiler design.
 The lexical analyzer breaks the source text into a series of tokens.
 It removes extra whitespace and any comments written in the source code.
 Programs that perform lexical analysis are called lexical analyzers or lexers.
 A lexer contains a tokenizer or scanner.
 If the lexical analyzer detects that a token is invalid, it generates an error.
 It reads character streams from the source code, checks for legal tokens, and passes the
data to the syntax analyzer when it demands it.
 A lexer takes the source code, which is written in the form of sentences.
 In other words, it converts a sequence of characters into a sequence of tokens.
 A lexical analyzer is a program that transforms a stream of characters into a stream
of 'atomic chunks of meaning', so-called tokens.



Lexical Analyzer



Basic Terminologies
 Lexeme: A lexeme is the lowest-level syntactic unit of a language (e.g., total, start).
 Token: A token is a category of lexemes.
 Keywords and reserved words – identifiers used as a fixed part of the syntax of a
statement; a reserved word cannot be used as a variable name or identifier.
 Comments – an important part of the documentation, usually delimited by /* */ or //.
 Delimiters – syntactic elements that mark the start or end of some syntactic unit,
such as a statement or expression: "begin"..."end", or { }.
 Character set – ASCII, Unicode.
 Identifiers – names chosen by the programmer; languages typically restrict which
characters they may contain, and overly long identifiers reduce readability.
 Operator symbols – e.g., + and – perform two basic arithmetic operations.



Basic Terminologies



Attributes for Tokens
 When more than one lexeme can match a pattern, the lexical analyzer must provide the compiler
additional information about the lexeme that was matched.
 Information about an identifier – its lexeme, its type, and the location at which it was first found – is kept in the symbol table.
 The appropriate attribute value for an identifier is a pointer to the symbol table entry for that identifier.
 Example: E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>



Lexical Analyzer Architecture
 How tokens are recognized?
 The main task of lexical analysis is to read input characters in the code and
produce tokens.
 Lexical analyzer scans the entire source code of the program. It identifies each
token one by one. Scanners are usually implemented to produce tokens only
when requested by a parser.
1. "Get next token" is a command which is sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next
token.
3. It returns the token to Parser.
 The lexical analyzer skips whitespace and comments while creating these tokens. If
any error is present, the lexical analyzer will correlate that error with the
source file and line number.
How tokens are recognized?



Roles of the Lexical analyzer
 Helps to identify tokens and enter them into the symbol table
 Removes white space and comments from the source program
 Expands macros if they are found in the source program
 Reads input characters from the source program



Example of Lexical Analysis, Tokens, Non-Tokens
#include <stdio.h>

int maximum(int x, int y)
{
    // This will compare 2 numbers
    if (x > y)
        return x;
    else
    {
        return y;
    }
}



Example of Lexical Analysis, Tokens, Non-Tokens
The code is the same as on the previous slide; the tokens created are:

Lexeme    Token
int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword
Example of Lexical Analysis, Tokens, Non-Tokens
In the same code, these character sequences are non-tokens (the lexical analyzer consumes them but produces no token for them):

Type                     Examples
Comment                  // This will compare 2 numbers
Pre-processor directive  #include <stdio.h>
Pre-processor directive  #define NUMS 8,9
Macro                    NUMS
Whitespace               \n \b \t



Regular Definition
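The figure for this slide is not reproduced in the text. As a standard illustration of the notation, a regular definition gives names to regular expressions and builds new expressions from previously defined names, e.g. for identifiers:

```
letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*
```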



Unsigned Number



All Relational Operators



Lexical Errors
 A character sequence that cannot be scanned into any valid token is a lexical
error.
 Important facts about lexical errors:
 Lexical errors are not very common, but they should be managed by the scanner
 Misspellings of identifiers, operators, and keywords are considered lexical errors
 Generally, a lexical error is caused by the appearance of some illegal character, mostly
at the beginning of a token



Handling Lexical Errors
 It is hard for the lexical analyzer to tell, without the aid of other components, that
there is a source-code error.
 If the string fi is encountered for the first time in a C program, the lexical analyzer
cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.

• Error handling is very localized with respect to the input source

• For example: whil ( x = 0 ) do
generates no lexical errors in PASCAL, because whil is scanned as a perfectly valid identifier



Handling Lexical Errors
 In what Situations do Errors Occur?
• The lexical analyzer is unable to proceed because none of the patterns for tokens
matches a prefix of the remaining input.
 Panic mode recovery
• Delete successive characters from the remaining input until the analyzer can
find a well-formed token.
• This may confuse the parser, creating syntax errors.
 Possible error recovery actions:
• Deleting or Inserting Input Characters
• Replacing or Transposing Characters



Summary
 Lexical analysis is the very first phase in compiler design
 A lexeme is a sequence of characters in the source program that matches the pattern
of a token
 The lexical analyzer is implemented to scan the entire source code of the program
 The lexical analyzer helps to identify tokens and enter them into the symbol table
 A character sequence that cannot be scanned into any valid token is a lexical
error
 Deleting characters from the remaining input is a useful error-recovery method
 The lexical analyzer scans the input program while the parser performs syntax analysis
 It eases syntax analysis by eliminating unwanted tokens such as whitespace
and comments
 A drawback of a separate lexical analyzer is the additional runtime overhead
required to generate the lexer tables and construct the tokens
