LN6 - C2 - Elements of Lexical Analysis
Lexical Analysis:
o Lexical analysis is the first phase of a compiler; it is also called scanning.
o It is performed by the lexical analyzer.
o Sometimes, lexical analyzers are divided into a cascade of two processes:
   – Scanning consists of the simple processes that do not require tokenization of
     the input, such as deletion of comments and compaction of consecutive
     whitespace characters into one.
   – Lexical analysis proper is the more complex portion, which produces tokens
     from the output of the scanner.
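The scanning stage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scan` is hypothetical, comments are assumed to be C-style (`/* ... */` and `// ...`), and string literals that happen to contain comment markers are not handled.

```python
import re

def scan(source: str) -> str:
    """Scanning stage: delete comments, then compact whitespace runs."""
    # Assumption: block comments look like /* ... */ and may span lines.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    # Assumption: line comments start with // and run to the end of the line.
    no_line = re.sub(r"//[^\n]*", " ", no_block)
    # Compact each run of consecutive whitespace characters into one blank.
    return re.sub(r"\s+", " ", no_line).strip()

print(scan("x  =  1; /* init */\n\ny = x + 2; // sum"))
# prints: x = 1; y = x + 2;
```

The output of such a stage is still plain text; producing tokens from it is left to the second process.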
Main tasks
– Scan the source program, that is, read the characters of the source code.
– Group them into lexemes.
– Produce a sequence of tokens (one for each lexeme) for syntax analysis.
– Pattern:
• a description of the form that the lexemes of a token may take.
• In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword.
• For identifiers and some other tokens, the pattern is a more complex
structure that is matched by many strings.
CSE4129: Formal languages and Compilers : Lecture 7
– Lexeme:
• a sequence of characters in the source program that matches the pattern
for a token and is identified by the lexical analyzer as an instance of that
token.
• Each lexeme is recognized as an instance of a certain type of token, such as
keywords, operators, identifiers, constants, literal strings, or punctuation
symbols (separators) such as commas and semicolons.
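To make the pattern/lexeme distinction concrete, here is a minimal sketch that writes two patterns as regular expressions and tests candidate strings against them. The names `KEYWORD_IF`, `IDENTIFIER`, and `is_lexeme_of` are hypothetical, and the identifier rule (letter or underscore, then letters, digits, or underscores) is an assumption borrowed from C-like languages.

```python
import re

# A keyword pattern is just the keyword's own characters: one matching string.
KEYWORD_IF = re.compile(r"if")
# An identifier pattern is a more complex structure matched by many strings.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_lexeme_of(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole string matches the token's pattern."""
    return pattern.fullmatch(text) is not None

for text in ["count", "x1", "_tmp", "9lives"]:
    print(text, is_lexeme_of(IDENTIFIER, text))
# count True, x1 True, _tmp True, 9lives False
```

Note that the string `if` also fits the identifier pattern in this sketch; a lexical analyzer has to give keywords priority, for example by checking keyword patterns first.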
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword
itself.
2. Tokens for the operators, either individually or in classes.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
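The five token classes listed above can be sketched as a small regular-expression tokenizer. Everything named here (`TOKEN_SPEC`, `tokenize`, the particular keyword and operator sets) is an assumption chosen for illustration, not part of any specific language definition.

```python
import re

# Hypothetical token classes. Order matters: keywords are tried before identifiers.
TOKEN_SPEC = [
    ("KEYWORD",     r"\b(?:if|else|while|return)\b"),  # class 1: one pattern per keyword
    ("IDENTIFIER",  r"[A-Za-z_][A-Za-z0-9_]*"),        # class 3: all identifiers
    ("NUMBER",      r"\d+(?:\.\d+)?"),                 # class 4: numeric constants
    ("STRING",      r'"[^"\n]*"'),                     # class 4: literal strings
    ("OPERATOR",    r"==|!=|<=|>=|[+\-*/=<>]"),        # class 2: operators as one class
    ("PUNCTUATION", r"[(),;{}]"),                      # class 5: punctuation symbols
    ("SKIP",        r"\s+"),                           # whitespace, not a token
    ("MISMATCH",    r"."),                             # any other character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source: str) -> list[tuple[str, str]]:
    """Group characters into lexemes and emit (token class, lexeme) pairs."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {lexeme!r}")
        tokens.append((kind, lexeme))
    return tokens

for token in tokenize('if (count >= 10) return "done";'):
    print(token)
```

For the sample input this yields nine tokens, starting with `('KEYWORD', 'if')` and ending with `('PUNCTUATION', ';')`. Whether operators get one token each or one shared class, as here, is a design choice the notes leave open.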
• In [a, b, c, d, e], each comma (,) is a single delimiter. The left and right brackets
([ and ]) are pair-delimiters.
• In "hello", the two quote symbols (") are pair-delimiters.
• Common delimiters are commas (,), semicolons (;), quotes ( ", ' ), braces ({ }), pipes (|),
and slashes ( / \ ).
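As a rough illustration of the two kinds of delimiter, the sketch below checks the pair-delimiters around a list and then splits its contents on the single delimiter. The names `PAIRS` and `split_list` are hypothetical, and nested lists or quoted commas are not handled.

```python
# Assumed pair-delimiters: each opening symbol maps to its closing symbol.
PAIRS = {"[": "]", "{": "}", "(": ")"}

def split_list(text: str) -> list[str]:
    """Split a bracketed list on ',' after checking its pair-delimiters."""
    text = text.strip()
    # The first and last characters must form a matching pair.
    if len(text) < 2 or PAIRS.get(text[0]) != text[-1]:
        raise ValueError("missing or mismatched pair-delimiters")
    inner = text[1:-1]
    # The comma is the single delimiter between items.
    return [item.strip() for item in inner.split(",")] if inner.strip() else []

print(split_list("[a, b, c, d, e]"))
# prints: ['a', 'b', 'c', 'd', 'e']
```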