5CAI4-02 Compiler Design
Unit 6: Definition of basic blocks, control flow graphs; DAG representation of basic blocks, advantages of DAG, sources of optimization, loop optimization, idea about global data flow analysis, loop invariant computation, peephole optimization, issues in the design of a code generator, a simple code generator, code generation from DAG. (07 hours)
Total: 42 hours
3. Interpreter
An interpreter is a language processor that translates a single statement of the source program into machine code and executes it immediately, before moving on to the next line. If there is an error in a statement, the interpreter stops translating at that statement and displays an error message; it moves on to the next line only after the error has been removed. An interpreter directly executes instructions written in a programming or scripting language without first converting them to object code or machine code: it translates one line at a time and then executes it.
Examples: Perl, Python and MATLAB.
Phases of a Compiler
A compiler is a software tool that converts high-level programming code into machine code that a
computer can understand and execute. It acts as a bridge between human-readable code and machine-
level instructions, enabling efficient program execution. The process of compilation is divided into six
phases:
1. Lexical Analysis: The first phase, where the source code is broken down into tokens such as
keywords, operators, and identifiers for easier processing.
2. Syntax Analysis or Parsing: This phase checks if the source code follows the correct syntax
rules, building a parse tree or abstract syntax tree (AST).
3. Semantic Analysis: It ensures the program’s logic makes sense, checking for errors like type
mismatches or undeclared variables.
4. Intermediate Code Generation: In this phase, the compiler converts the source code into an
intermediate, machine-independent representation, simplifying optimization and translation.
5. Code Optimization: This phase improves the intermediate code to make it run more efficiently,
reducing resource usage or increasing speed.
6. Target Code Generation: The final phase where the optimized code is translated into the target
machine code or assembly language that can be executed on the computer.
These six phases are grouped into two main parts, the front end and the back end, with the intermediate code generation phase acting as the link between them. The front end analyzes the source code for syntax and semantics and generates the intermediate code, ensuring correctness. The back end optimizes this intermediate code and converts it into efficient machine code for execution. The front end is mostly machine-independent, while the back end is machine-dependent.
The compilation process is an essential part of transforming high-level source code into machine-
readable code. A compiler performs this transformation through several phases, each with a specific
role in making the code efficient and correct. Broadly, the compilation process can be divided into two
main parts:
1. Analysis Phase: The analysis phase breaks the source program into its basic components and creates an intermediate representation of the program. It is often referred to as the front end.
2. Synthesis Phase: The synthesis phase creates the final target program from the intermediate representation. It is often referred to as the back end.
1. Lexical Analysis
Lexical analysis is the first phase of a compiler, responsible for converting the raw source code into a
sequence of tokens. A token is the smallest unit of meaningful data in a programming language. Lexical
analysis involves scanning the source code, recognizing patterns, and categorizing groups of characters
into distinct tokens.
The lexical analyzer scans the source code character by character, grouping these characters into
meaningful units (tokens) based on the language's syntax rules. These tokens can represent keywords,
identifiers, constants, operators, or punctuation marks. By converting the source code into tokens,
lexical analysis simplifies the process of understanding and processing the code in later stages of
compilation.
Example: int x = 10;
The lexical analyzer would break this line into the following tokens:
int - Keyword token (data type)
x - Identifier token (variable name)
= - Operator token (assignment operator)
10 - Numeric literal token (integer value)
; - Punctuation token (semicolon, used to terminate statements)
Each of these tokens is then passed on to the next phase of the compiler for further processing, such as
syntax analysis.
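As a small illustration of how such a scanner can work, the following C program tokenizes the fixed statement "int x = 10;". It is only a sketch: the keyword list, token names and fixed input string are illustrative, not part of any real compiler.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Scan the statement "int x = 10;" character by character and print the
   token class of each lexeme. */
int main(void) {
    const char *src = "int x = 10;";
    const char *p = src;

    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {            /* skip whitespace */
            p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            char word[64];                           /* keyword or identifier */
            int n = 0;
            while (isalnum((unsigned char)*p) || *p == '_')
                word[n++] = *p++;
            word[n] = '\0';
            if (strcmp(word, "int") == 0 || strcmp(word, "float") == 0)
                printf("KEYWORD     %s\n", word);
            else
                printf("IDENTIFIER  %s\n", word);
        } else if (isdigit((unsigned char)*p)) {     /* numeric literal */
            char num[64];
            int n = 0;
            while (isdigit((unsigned char)*p))
                num[n++] = *p++;
            num[n] = '\0';
            printf("NUMBER      %s\n", num);
        } else if (*p == '=') {
            printf("OPERATOR    =\n");
            p++;
        } else if (*p == ';') {
            printf("PUNCTUATION ;\n");
            p++;
        } else {
            printf("ILLEGAL     %c\n", *p);          /* lexical error */
            p++;
        }
    }
    return 0;
}

Running it prints one line per token (KEYWORD int, IDENTIFIER x, OPERATOR =, NUMBER 10, PUNCTUATION ;), which is exactly the token stream handed to the parser.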
2. Syntax Analysis
Syntax analysis, also known as parsing, is the second phase of a compiler where the structure of the
source code is checked. This phase ensures that the code follows the correct grammatical rules of the
programming language.
The role of syntax analysis is to verify that the sequence of tokens produced by the lexical analyzer is
arranged in a valid way according to the language's syntax. It checks whether the code adheres to the
language's rules, such as correct use of operators, keywords, and parentheses. If the source code is not
structured correctly, the syntax analyzer will generate errors.
To represent the structure of the source code, syntax analysis uses parse trees or syntax trees.
Parse Tree: A parse tree is a tree-like structure that represents the syntactic structure of the
source code. It shows how the tokens relate to each other according to the grammar rules. Each
branch in the tree represents a production rule of the language, and the leaves represent the
tokens.
Syntax Tree: A syntax tree is a more abstract version of the parse tree. It represents the
hierarchical structure of the source code but with less detail, focusing on the essential syntactic
structure. It helps in understanding how different parts of the code relate to each other.
(Figure: Parse Tree, omitted.)
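To show how a parser applies grammar rules, the following C program is a minimal recursive-descent parser sketch for arithmetic expressions; the tiny grammar, the fixed input string "b + c * d" and the printed trace are illustrative only, not a full language parser.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Grammar:
     expr   -> term  { '+' term }
     term   -> factor { '*' factor }
     factor -> identifier
   The parser prints each production it applies, which corresponds to a
   pre-order walk of the parse tree. */
static const char *input = "b + c * d";
static int pos = 0;

static void skip_spaces(void) {
    while (input[pos] == ' ')
        pos++;
}

static void factor(void) {
    skip_spaces();
    if (isalpha((unsigned char)input[pos])) {
        printf("factor -> identifier '%c'\n", input[pos]);
        pos++;
    } else {
        printf("syntax error at position %d\n", pos);
        exit(1);
    }
}

static void term(void) {
    printf("term -> factor { '*' factor }\n");
    factor();
    skip_spaces();
    while (input[pos] == '*') {          /* repeated '*' factor */
        pos++;
        factor();
        skip_spaces();
    }
}

static void expr(void) {
    printf("expr -> term { '+' term }\n");
    term();
    skip_spaces();
    while (input[pos] == '+') {          /* repeated '+' term */
        pos++;
        term();
        skip_spaces();
    }
}

int main(void) {
    expr();
    skip_spaces();
    if (input[pos] == '\0')
        printf("input accepted\n");
    else
        printf("syntax error: unexpected character '%c'\n", input[pos]);
    return 0;
}

In the trace for "b + c * d", the multiplication is reduced inside the second term, which mirrors how operator precedence shows up in the parse tree.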
3. Semantic Analysis
Semantic analysis is the phase of the compiler that ensures the source code makes sense logically. It
goes beyond the syntax of the code and checks whether the program has any semantic errors, such as
type mismatches or undeclared variables.
Semantic analysis checks the meaning of the program by validating that the operations performed in
the code are logically correct. This phase ensures that the source code follows the rules of the
programming language in terms of its logic and data usage.
Some key checks performed during semantic analysis include:
Type Checking: The compiler ensures that operations are performed on compatible data types.
For example, trying to add a string and an integer would be flagged as an error because they
are incompatible types.
Variable Declaration: It checks whether variables are declared before they are used. For
example, using a variable that has not been defined earlier in the code would result in a semantic
error.
Example:
int a = 5;
float b = 3.5;
a = a + b;
Type Checking:
a is int and b is float. Adding them (a + b) results in float, which cannot be assigned to int a.
Error: Type mismatch: cannot assign float to int.
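The type check from this example can be sketched in C as follows; the Type enum and the add_result_type and check_assignment helpers are purely illustrative and stand in for the symbol-table and type information a real compiler would consult.

#include <stdio.h>

/* Illustrative types; a real compiler reads these from its symbol table. */
typedef enum { TYPE_INT, TYPE_FLOAT } Type;

/* The result type of "left + right": mixing int and float yields float. */
static Type add_result_type(Type left, Type right) {
    return (left == TYPE_FLOAT || right == TYPE_FLOAT) ? TYPE_FLOAT : TYPE_INT;
}

/* Check the assignment "name = expr" under the strict rule used above. */
static void check_assignment(const char *name, Type target, Type expr) {
    if (target == TYPE_INT && expr == TYPE_FLOAT)
        printf("Error: Type mismatch: cannot assign float to int '%s'\n", name);
    else
        printf("OK: assignment to '%s'\n", name);
}

int main(void) {
    Type a = TYPE_INT, b = TYPE_FLOAT;
    /* a = a + b;  ->  a + b has type float, assigning it to int a is flagged */
    check_assignment("a", a, add_result_type(a, b));
    return 0;
}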
4. Intermediate Code Generation
Intermediate code is a form of code that lies between the high-level source code and the final machine
code. It is not specific to any particular machine, making it portable and easier to optimize. Intermediate
code acts as a bridge, simplifying the process of converting source code into executable code.
The use of intermediate code plays a crucial role in optimizing the program before it is turned into
machine code.
Platform Independence: Since the intermediate code is not tied to any specific hardware, it can
be easily optimized for different platforms without needing to recompile the entire source code.
This makes the process more efficient for cross-platform development.
Simplifying Optimization: Intermediate code simplifies the optimization process by providing
a clearer, more structured view of the program. This makes it easier to apply optimization
techniques such as:
o Dead Code Elimination: Removing parts of the code that don’t affect the program’s
output.
o Loop Optimization: Improving loops to make them run faster or consume less memory.
o Common Subexpression Elimination: Reusing previously calculated values to avoid
redundant calculations.
Easier Translation: Intermediate code is often closer to machine code, but not specific to any
one machine, making it easier to convert into the target machine code. This step is typically
handled in the back end of the compiler, allowing for smoother and more efficient code
generation.
Example: a = b + c * d;
t1 = c * d
t2 = b + t1
a = t2
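A rough sketch of how a compiler could emit this three-address code from an expression tree is shown below; the Node structure and the t1, t2, ... temporary-naming scheme are illustrative.

#include <stdio.h>

/* Expression tree for  b + c * d.  Leaves hold variable names; interior
   nodes hold the operator. */
typedef struct Node {
    char op;                    /* '+' or '*' for operators, 0 for a leaf */
    const char *name;           /* variable name when the node is a leaf  */
    struct Node *left, *right;
} Node;

static int temp_count = 0;
static char temps[16][8];       /* storage for generated temporary names  */

/* Generate code for a subtree; return the place that holds its value. */
static const char *gen(const Node *n) {
    if (n->op == 0)
        return n->name;                        /* a leaf is just the variable */
    const char *l = gen(n->left);
    const char *r = gen(n->right);
    char *t = temps[temp_count];
    sprintf(t, "t%d", temp_count + 1);         /* create a fresh temporary */
    temp_count++;
    printf("%s = %s %c %s\n", t, l, n->op, r);
    return t;
}

int main(void) {
    Node b = {0, "b", NULL, NULL};
    Node c = {0, "c", NULL, NULL};
    Node d = {0, "d", NULL, NULL};
    Node mul = {'*', NULL, &c, &d};            /* c * d       */
    Node add = {'+', NULL, &b, &mul};          /* b + (c * d) */
    printf("a = %s\n", gen(&add));
    return 0;
}

Its output is exactly the three lines above: t1 = c * d, t2 = b + t1, a = t2.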
5. Code Optimization
Code Optimization is the process of improving the intermediate or target code to make the program run
faster, use less memory, or be more efficient, without altering its functionality. It involves techniques
like removing unnecessary computations, reducing redundancy, and reorganizing code to achieve better
performance. Optimization is classified broadly into two types:
Machine-Independent
Machine-Dependent
Common Techniques:
Constant Folding: Precomputing constant expressions.
Dead Code Elimination: Removing unreachable or unused code.
Loop Optimization: Improving loop performance through invariant code motion or unrolling.
Strength Reduction: Replacing expensive operations with simpler ones.
Example:
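The statement x = 4 * 5 + y; can be folded at compile time to x = 20 + y; (constant folding), and a multiplication such as i * 2 can be replaced by the cheaper addition i + i (strength reduction). Likewise, an assignment whose result is never used can be removed entirely (dead code elimination).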
6. Code Generation
Code Generation is the final phase of a compiler, where the intermediate representation of the source
program (e.g., three-address code or abstract syntax tree) is translated into machine code or assembly
code. This machine code is specific to the target platform and can be executed directly by the hardware.
The code generated by the compiler is object code in some lower-level programming language, for example assembly language. The source code written in the higher-level language is thus transformed into lower-level object code, which should have the following minimum properties:
It should carry the exact meaning of the source code.
It should be efficient in terms of CPU usage and memory management.
Example: translating three-address code into assembly code.
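Assuming a simple register-based target machine (the MOV and ADD mnemonics below are illustrative, not a specific instruction set):
Three-address code:
t1 = b + c
a = t1
Assembly code:
MOV R1, b
ADD R1, c
MOV a, R1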
Symbol Table - It is a data structure used and maintained by the compiler, containing all the identifiers' names along with their types. It helps the compiler function smoothly by locating identifiers quickly.
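A very small symbol table can be sketched in C as a fixed-size array of (name, type) pairs; real compilers typically use hash tables and record far more information (scope, size, memory offset, and so on).

#include <stdio.h>
#include <string.h>

/* A tiny symbol table: a fixed-size array of (name, type) pairs. */
#define MAX_SYMBOLS 100

struct symbol {
    char name[32];
    char type[16];
};

static struct symbol table[MAX_SYMBOLS];
static int symbol_count = 0;

/* Insert an identifier and its type. */
static void insert(const char *name, const char *type) {
    if (symbol_count < MAX_SYMBOLS) {
        strcpy(table[symbol_count].name, name);
        strcpy(table[symbol_count].type, type);
        symbol_count++;
    }
}

/* Look up an identifier; return its type, or NULL if it is undeclared. */
static const char *lookup(const char *name) {
    for (int i = 0; i < symbol_count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].type;
    return NULL;
}

int main(void) {
    insert("x", "int");
    insert("b", "float");
    printf("x : %s\n", lookup("x"));                              /* found     */
    printf("y : %s\n", lookup("y") ? lookup("y") : "undeclared"); /* not found */
    return 0;
}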
Error Handling in Phases of Compiler
Error Handling refers to the mechanism in each phase of the compiler to detect, report and recover from
errors without terminating the entire compilation process.
Lexical Analysis: Detects errors in the character stream and ensures valid token formation.
o Example: Identifies illegal characters or invalid tokens (e.g., @var as an identifier).
Syntax Analysis: Checks for structural or grammatical errors based on the language's grammar.
o Example: Detects missing semicolons or unmatched parentheses.
Semantic Analysis: Verifies the meaning of the code and ensures it follows language semantics.
o Example: Reports undeclared variables or type mismatches (e.g., adding a string to an
integer).
Intermediate Code Generation: Ensures the correctness of intermediate representations used in
further stages.
o Example: Detects invalid operations, such as dividing by zero.
Code Optimization: Ensures that the optimization process doesn’t produce errors or alter code
functionality.
o Example: Identifies issues with unreachable or redundant code.
Code Generation: Handles errors in generating machine code or allocating resources.
o Example: Reports insufficient registers or invalid machine instructions.
Bootstrapping
Bootstrapping is an important technique in compiler design, where a basic compiler is used to create a
more advanced version of itself. This process helps in building compilers for new programming
languages and improving the ones already in use. By starting with a simple compiler, bootstrapping
allows gradual improvements and makes the compiler more efficient over time.
Bootstrapping relies on the idea of a self-compiling compiler, where each iteration improves
the compiler's ability to handle more complex code.
It simplifies the development cycle, allowing incremental improvements and faster deployment
of more robust compilers.
Many successful programming languages, including C and Java, have used bootstrapping
techniques during their development.
A compiler can be represented using a T Diagram. In this diagram, the source language of the compiler
is positioned at the top-left, the target language (the language produced by the compiler) is placed at the
top-right, and the language in which the compiler is implemented is shown at the bottom.
Working of Bootstrapping
Bootstrapping is the process of creating compilers. It involves a methodology where a slightly more
complicated compiler is created using a simple language (such as assembly language). This slightly
more complicated compiler, in turn, is used to create an even more advanced compiler, and this process
continues until the desired result is achieved.
Here’s a step-by-step look at how bootstrapping works in compiler design:
Step 1: Start with a Basic Compiler
The first step is to create a basic compiler that can handle the most essential features of a programming
language. This simple compiler is often written in assembly language or machine language to make it
easier to build.
Step 2: Use the Basic Compiler to Create a More Advanced Version
Once the basic compiler is ready, it is used to compile a more advanced version of itself. This new
version can handle more complex features, like better error checking and optimizations.
Step 3: Gradually Improve the Compiler
With each new version, the compiler becomes more capable. The process is repeated, and each iteration
adds more features, making the compiler stronger and more efficient.
For example, assume we want a compiler that takes the C language as input and generates assembly language as output.
1. To generate this compiler, we first write a compiler for a small subset of C, call it C0, in assembly language. A subset of C means the C language with reduced functionality. In the T diagram, the source language is the subset C0, the target language is assembly language, and the implementation language is also assembly language.
2. Then, using the C0 language, we write a compiler for the full C language. Compiled with the C0 compiler, it takes C as its source language and generates assembly language as its target language.
With the help of bootstrapping, we have generated a compiler for the C language written in (a subset of) C itself, i.e. a self-compiling compiler.
Cross-Compilation Using Bootstrapping
Cross-compilation is a process where a compiler runs on one platform (host) but generates machine
code for a different platform (target). This is useful when the target platform is not powerful enough to
run the full compiler or when the target architecture is different from the host system. Using
bootstrapping in cross-compilation can help create a compiler that runs on one system (the host) but
produces code for another system (the target).
For example, suppose we want to write a cross-compiler for a new language X that generates code in language Z. We start from an existing compiler Y running on machine M. The first step is to use Y to create a basic compiler XYZ (source language X, implemented in Y, target language Z), which translates X code into Z code while running on machine M. This results in a cross-compiler XMZ, which generates target code in language Z for source code written in language X, but itself runs on machine M. This method allows us to create a compiler for a new language without needing to run the compiler on the target machine itself.
Recognition of tokens:
What is a Token?
In programming, a token is the smallest unit of meaningful data; it may be an identifier, keyword,
operator, or symbol. A token represents a series or sequence of characters that cannot be decomposed
further. In languages such as C, some examples of tokens would include:
Keywords: reserved words in C such as int, char, float, const, goto, etc.
Identifiers: names of variables and user-defined functions.
Operators: +, -, *, /, etc.
Delimiters/Punctuators: symbols such as the comma (,), the semicolon (;) and braces ({ }).
By and large, tokens may be divided into three categories:
Terminal Symbols (TRM) : Keywords and operators.
Literals (LIT) : Values like numbers and strings.
Identifiers (IDN) : Names defined by the user.
Let us now see how to count the tokens in C source code:
Example 1:
int a = 10; //Input Source code
Tokens
int (keyword), a(identifier), =(operator), 10(constant) and ;(punctuation-semicolon)
Answer - Total number of tokens = 5
Example 2:
int main() {
Token vs. Lexeme vs. Pattern
Definition:
Token: a sequence of characters that is treated as a single unit because it cannot be broken down further.
Lexeme: a sequence of characters in the source code that is matched against the predefined rules of the language so that it can be classified as a valid token.
Pattern: the set of rules that a scanner follows to recognize a lexeme as a valid token.
Interpretation for operators:
Token: all the operators are considered tokens.
Lexeme: +, =
Pattern: +, =
Interpretation for literals:
Token: a grammar rule or boolean literal.
Lexeme: "Welcome to GeeksforGeeks!"
Pattern: any string of characters (except ') enclosed between " and ".
Lexical Analysis
Lexical analysis is the first phase of the compiler. It takes the stream of characters as input and converts it into tokens; this process is also known as tokenization. Tokens can be classified into various types such as identifiers, separators, keywords, operators, constants and special characters, and these tokens are then stored in the symbol table. Lexical analysis involves three tasks:
1. Tokenization: the process of converting a stream of characters into tokens.
2. Error messages: the character sequences taken as input are also called lexemes. In this phase the analyzer reports errors found while scanning the input, such as illegal characters, unmatched strings and identifiers that exceed the allowed length.
3. Comment elimination: comments and whitespace (spaces, tabs and blank lines) are removed before the tokens are generated.
Automatic Lexical Generator
An automatic lexical generator is a tool that generates the code of a lexical analyzer, which can then perform lexical analysis and produce tokens as output. Such tools are widely used in compiler design.
As discussed above, the job of lexical analysis is to take the stream of characters and convert it into tokens.
A lexical generator such as Lex works in the following steps (a sample Lex specification is shown after this list):
1. First, the Lex source program is given as input to the Lex compiler, which generates the file lex.yy.c as output.
2. Next, lex.yy.c is given as input to a C compiler, which generates the executable a.out.
3. Finally, a.out takes the stream of characters as input and generates a sequence of tokens as output.
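A minimal Lex specification along these lines might look as follows; the token classes, keyword list and printed messages are illustrative only. Running lex (or flex) on it produces lex.yy.c, which is then compiled with a C compiler (for example, cc lex.yy.c -o scanner).

%{
#include <stdio.h>
%}
%%
[0-9]+                  { printf("NUMBER      %s\n", yytext); }
"int"|"float"|"char"    { printf("KEYWORD     %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER  %s\n", yytext); }
"+"|"-"|"*"|"/"|"="     { printf("OPERATOR    %s\n", yytext); }
[;,(){}]                { printf("PUNCTUATION %s\n", yytext); }
[ \t\n]+                { /* skip whitespace */ }
.                       { printf("ILLEGAL     %s\n", yytext); }
%%
int yywrap(void) { return 1; }

int main(void) {
    yylex();            /* read from standard input, print one token per lexeme */
    return 0;
}

Note that the keyword rule is placed before the identifier rule, so a lexeme such as int is reported as a keyword rather than an identifier.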
Advantages of Automatic Lexical Generators
Lexical generators help overcome several problems. Their main use is to build lexical analyzers for any language in an easy and efficient manner.
1. Writing a lexical analyzer by hand for each programming language requires the same level of design, coding and testing effort every time; a lexical generator does this work efficiently.
2. It is very difficult to hand-write lexical analyzers for every programming language, as it is a sophisticated process; lexical generators solve that problem.