CMP 352
University of Agriculture, Makurdi

A compiler is software that transforms a program written in a source language, or high-level language, into an equivalent program in a machine language, or low-level language.

What is a Compiler?

A compiler is a translator that turns a program in a source language into the same program in a target language. Usually that target language is executable, which means that a machine can run it.

A compiler is a computer program that translates code from one programming language into another. Specifically, it translates a program written in the source language into the machine language.

The compilation process involves two essential operations: translation and error detection.

Language Processing System

We have learnt that any computer system is made of hardware and software. The hardware understands a language that humans cannot easily work with, so we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to obtain the code the machine can use. This pipeline is known as the Language Processing System.

Why Study Compilers

Computers are an intricate combination of hardware and software. Hardware consists of physical devices whose operations are managed by corresponding software. Hardware responds to instructions in the form of electronic signals, which correspond to the binary language used at the machine level. The binary language comprises only two symbols, 0 and 1, so instructions for hardware must be coded as sequences of 1s and 0s. Writing such binary code manually would be complex and laborious for programmers, which is why compilers are used to generate it.

The program written in a high-level language is known as a source program, and the program
converted into a low-level language is known as an object (or target) program. Without compilation,
no program written in a high-level language can be executed. For every programming language, we
have a different compiler; however, the basic tasks performed by every compiler are the same.

High-Level Programming Language

A high-level programming language is one that provides strong abstraction from the details of the computer. High-level programming enables a program to be developed in a more user-friendly context and makes it more convenient for the user to write a program.

What is a Low-Level Programming Language?

A low-level language is a programming language that provides little or no abstraction from the computer's instruction set. Programming in it deals with the machine's own operations rather than with high-level programming ideas and concepts.

Types of Compiler

Broadly, there are three types of compilers:

• Single Pass Compilers

• Two Pass Compilers

• Multipass Compilers

Single Pass Compiler:

When we merge all the phases of compiler design into a single module, the result is called a single-pass compiler. A single-pass compiler converts the source code into machine code in one traversal of the program.

Two Pass Compiler:

A compiler that runs through the program to be translated twice is considered a two-pass compiler.

Multipass Compiler:

A program's source code or syntax tree is processed several times by a multipass compiler. It breaks a large compilation job into a number of smaller passes, each of which produces intermediate code and uses the previous pass's output as its input. Because only part of the work is held in memory at once, it requires less memory. 'Wide compiler' is another name for it.

Operations of Compiler

The important tasks executed by the compiler are:

• Breaking the source program into pieces and imposing grammatical structure on each one.

• Building the symbol table and deriving the desired target program from the intermediate representation.

• Compiling source code and detecting errors in it.

• Organising and saving all code and variables.

• Supporting separate compilation.

• Reading the full program, analysing it, and translating it into a semantically equivalent form.

• Converting source code to object code suited to the target machine.

Steps for Language Processing Systems

• Preprocessor: It is an integral component of the compiler toolchain. It is a tool that generates the compiler's input, dealing with file inclusion (augmentation), macro processing, and language extension, among other things.

• Interpreter: Like a compiler, it translates high-level language into machine-executable form, but it translates and executes one statement at a time instead of producing a complete target program first.

• Assembler: It translates assembly language code into a language that computers can understand. The output of the assembler, known as an object file, consists of machine instructions together with the information required to place those instructions in memory.

• Linker: It links and combines numerous object files to produce an executable file. The main responsibility of a linker is to locate the modules a program refers to and to identify the memory address where every module will be kept.

• Loader: The loader is the component of the operating system that handles the loading and execution of executable files. It also calculates the size of a program and creates the memory space needed to run it.

• Cross-compiler: A cross-compiler runs on one platform and generates executable code for a different platform.

• Source-to-source Compiler: A source-to-source compiler is used when the source code of one programming language is converted into the source code of some other language.

Compiler Modules

The compiler has two modules, namely the front end and the back end. The front end is also referred to as the analysis phase and comprises the lexical analyzer, syntax analyzer, semantic analyzer, and intermediate code generator, while the code optimization and code generation phases form the back end (the synthesis phase).

The compilation process is a sequence of phases. Each phase takes the source program in one representation and produces output in another representation, taking its input from the previous stage.

The various phases of a compiler are:

Analysis Phase

Lexical Analysis:

The lexical analysis phase is the first phase of the compilation process, also referred to as the scanning phase. It takes source code as input, reads the source program one character at a time, and converts it into meaningful lexemes, which the lexical analyzer represents in the form of tokens. This stage is all about transforming the program into tokens, where a token is a sequence of characters that corresponds to a unit in the grammar of the programming language.

The lexical analyzer breaks the source code into a series of tokens, removing whitespace along the way. If the lexical analyzer encounters an invalid token, it reports an error. It reads the stream of characters, picks out the legal tokens, and passes the data on to the syntax analyzer.

These are some valid tokens in many popular programming languages:

• 15 (INT), -0.6 (FLOAT)

• “compilers rock” (STRING)

• foo, x (IDENTIFIER)

• +, &&, < (OPERATORS)

• ; (SEMICOLON)

• return (RETURN).

For instance:

public static void main(String[] args)
{
    System.out.println("Hi Mom!");
}

At the end of the lexing stage, this may return:


PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET ID(args) RPAREN LBRACE
ID(System) DOT ID(out) DOT ID(println) LPAREN STRING(Hi Mom!) RPAREN SEMICOLON RBRACE

Specifications of Tokens

Some terminology from language theory:

Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern of a token.

Patterns: A pattern is a set of rules a scanner follows to match a lexeme in the input program to
identify a valid token.

Alphabets

Any finite set of symbols is called an alphabet: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the set of English-language letters.

Strings

Any finite sequence of symbols from an alphabet is called a string. The length of a string is the total number of symbols in it; e.g., the length of the string "computer" is 8, denoted |computer| = 8. A string of zero length is known as an empty string and is denoted by ε (epsilon).

Language

A language is a set of strings over some finite alphabet. Since languages are sets, mathematical set operations can be performed on them. Regular expressions give a finite description of such languages.

Regular Expressions

The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules.

Regular expressions can express such languages by defining patterns for strings of symbols. The grammar defined by regular expressions is known as a regular grammar, and the language defined by a regular grammar is known as a regular language.

Breaking the input into lexemes

First, the lexical analyzer has to read the input program and break it into tokens. Efficient reading is achieved by a technique called "input buffering".

Input Buffering

Assume that the line of code is:

int i, j;

i = j + 1;

j = j + 1;

The input is stored in buffers to avoid going to secondary memory.

Initially, a one-buffer scheme was used:

Two pointers are used to read and find tokens: bp (begin) and fp (forward). bp is kept at the beginning of the lexeme, and fp traverses the buffer. Once fp finds a delimiter such as white space or a semicolon, the traversed part between bp and the delimiter is identified as a token. bp and fp are then set just past the delimiter to continue searching for tokens.

The drawback of the one-buffer scheme: when the string to be read is longer than the buffer, the end of the buffer is reached before the whole string has been read, and the entire buffer has to be reloaded with the rest of the string, which makes identification hard. Hence, the two-buffer scheme is introduced.

Here, two buffers of the same size are used. The advantage is that while the first buffer is being consumed, the second buffer can be loaded, and vice versa, so strings are never lost midway.

Whenever fp moves forward, an eof (sentinel) check determines whether it is reaching the end of one buffer so that the other can be reloaded. This is how an input program is read and divided into tokens.
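To make the scheme concrete, here is a minimal sketch of a two-buffer scanner in Java. It is illustrative only: the buffer size, the bp/fp bookkeeping, and the choice of delimiters are assumptions, and a real lexer would extract the lexeme from between bp and fp rather than copying characters as it goes.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TwoBufferScanner {
    static final int N = 16;             // size of each buffer half (illustrative)
    final char[] buf = new char[2 * N];  // two halves, reloaded alternately
    int bp = 0, fp = 0;                  // begin and forward pointers
    final Reader in;

    TwoBufferScanner(Reader in) throws IOException {
        this.in = in;
        load(0);                         // fill the first half before scanning starts
    }

    // Reload one half of the buffer; '\0' acts as the eof sentinel.
    void load(int start) throws IOException {
        int n = in.read(buf, start, N);
        if (n < N) buf[start + Math.max(n, 0)] = '\0';
    }

    // Move fp forward; reload the other half when a buffer boundary is crossed.
    void advance() throws IOException {
        fp++;
        if (fp == N) load(N);                        // first half exhausted
        else if (fp == 2 * N) { load(0); fp = 0; }   // second half exhausted: wrap
    }

    // Return the next lexeme, using whitespace, ',' and ';' as delimiters.
    String nextToken() throws IOException {
        StringBuilder lexeme = new StringBuilder();
        while (buf[fp] != '\0') {
            char c = buf[fp];
            advance();
            if (c == ' ' || c == '\t' || c == '\n' || c == ',' || c == ';') {
                if (lexeme.length() > 0) break;      // delimiter ends the token
            } else {
                lexeme.append(c);
            }
        }
        bp = fp;                                     // both pointers move past the token
        return lexeme.length() == 0 ? null : lexeme.toString();
    }

    public static void main(String[] args) throws IOException {
        TwoBufferScanner s = new TwoBufferScanner(new StringReader("int i, j; i = j + 1;"));
        for (String t; (t = s.nextToken()) != null; )
            System.out.println(t);   // prints each token on its own line: int, i, j, i, =, j, +, 1
    }
}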

Pattern matching: how the lexical analyzer checks the validity of lexemes

The lexical analyzer has to scan and identify only the valid tokens/lexemes of the language, and for this it uses patterns. Patterns are used to find valid lexemes in the program, and they are specified using regular grammars. Every valid token class is given a pre-defined pattern against which detected lexemes are checked.

Numbers

A number can be in the form of:

1. A whole number (0, 1, 2...)

2. A decimal number (0.1, 0.2...)

3. Scientific notation (e.g., 1.25E23)

The grammar has to identify all types of numbers:

Sample Regular grammar:

Digit -> 0 | 1 | ... | 9

Digits -> Digit (Digit)*

Number -> Digits (.Digits)? (E[+-]? Digits)?

or, equivalently, written with the + operator:

Number -> Digit+ (.Digit+)? (E[+-]? Digit+)?

➢ ? represents 0 or 1 occurrence of the preceding expression

➢ * represents 0 or more occurrences of the preceding expression

➢ + represents 1 or more occurrences of the preceding expression
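As a sketch, the Number pattern above can be transcribed into a Java regular expression (the pattern string is my own rendering of the grammar, not taken from any particular lexer):

import java.util.regex.Pattern;

public class NumberPattern {
    // Digit+ (.Digit+)? (E[+-]? Digit+)?  rendered as a Java regex
    static final Pattern NUMBER = Pattern.compile("\\d+(\\.\\d+)?([Ee][+-]?\\d+)?");

    public static void main(String[] args) {
        for (String s : new String[] {"15", "0.2", "1.25E23", "1.", "x9"})
            System.out.println(s + " -> " + NUMBER.matcher(s).matches());
        // 15, 0.2 and 1.25E23 match; "1." and "x9" do not
    }
}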

Delimiters

There are different types of delimiters like white space, newline character, tab space, etc.

Sample Regular grammar:

Delimiter -> ' ' | '\t' | '\n'

Delimiters -> Delimiter (Delimiter)*

Identifiers

The rules for an identifier are:

1. It must start with a letter.

2. After the first letter, it can have any number of letters, digits, and underscores.

Sample Regular grammar:

Letter -> a | b | ... | z | A | B | ... | Z

Digit -> 0 | 1 | ... | 9

Identifier -> Letter (Letter | Digit | _)*
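Combining the identifier, number, and delimiter patterns gives a toy scanner. The sketch below uses java.util.regex; the token names and the driver loop are illustrative assumptions, and a real scanner would also check matched identifiers against a keyword table:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyScanner {
    static final Pattern TOKEN = Pattern.compile(
        "(?<ID>[A-Za-z][A-Za-z0-9_]*)"             // Letter (Letter | Digit | _)*
      + "|(?<NUM>\\d+(\\.\\d+)?([Ee][+-]?\\d+)?)"  // the Number grammar
      + "|(?<WS>[ \\t\\n]+)"                       // delimiters
      + "|(?<SYM>.)");                             // any other single character

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("int i = j + 15;");
        while (m.find()) {
            if (m.group("ID") != null)        System.out.println("IDENTIFIER " + m.group());
            else if (m.group("NUM") != null)  System.out.println("NUMBER " + m.group());
            else if (m.group("SYM") != null)  System.out.println("SYMBOL " + m.group());
            // whitespace (WS) is matched but discarded
        }
    }
}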

Syntax Analysis

Syntax analysis (parsing) is the second phase of the compilation process. It takes tokens as input and generates a parse tree as output. In the syntax analysis phase, the parser checks whether the expression made by the tokens is syntactically correct: whether the structure of these tokens is legal and follows the rules of the language's grammar. It takes the tokens one by one and uses a context-free grammar to construct the parse tree.

The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of the source
code, which is a hierarchical representation of the source code that reflects the grammatical
structure of the program.

Features of syntax analysis:

Syntax Trees: Syntax analysis creates a syntax tree, which is a hierarchical representation of the
code’s structure. The tree shows the relationship between the various parts of the code, including
statements, expressions, and operators.

Context-Free Grammar: Syntax analysis uses context-free grammar to define the syntax of the
programming language. Context-free grammar is a formal language used to describe the structure of
programming languages.

Top-Down and Bottom-Up Parsing: Syntax analysis can be performed using two main approaches: top-down parsing and bottom-up parsing. Top-down parsing starts from the highest level of the syntax tree and works its way down, while bottom-up parsing starts from the lowest level and works its way up.

Error Detection: Syntax analysis is responsible for detecting syntax errors in the code. If the code
does not conform to the rules of the programming language, the parser will report an error and halt
the compilation process.

Intermediate Code Generation: Syntax analysis generates an intermediate representation of the code, which is used by the subsequent phases of the compiler. The intermediate representation is usually a more abstract form of the code, which is easier to work with than the original source code.

Optimization: Syntax analysis can perform basic optimizations on the code, such as removing
redundant code and simplifying expressions.

A pushdown automaton (PDA) is used to design the syntax analysis phase.

The Grammar for a Language consists of Production rules.

Example: Suppose Production rules for the Grammar of a language are:

S -> cAd

A -> bc|a

And the input string is “cad”.

Now the parser attempts to construct a syntax tree from this grammar for the given input string. It
uses the given production rules and applies those as needed to generate the string. To generate string
“cad” it uses the rules:

S -> cAd      (rule 1)

  -> cad      (rule 2, alternative A -> a)

Thus, the given input can be produced by the given grammar; therefore, the input is syntactically correct.

Exercise: Design a parse tree for the above

The parse tree is also called the derivation tree. Parse trees are generally constructed to check for
ambiguity in the given grammar. There are certain rules associated with the derivation tree.

• Any identifier is an expression

• Any number can be called an expression

• Performing any operations in the given expression will always result in an expression. For
example, the sum of two expressions is also an expression.

• The parse tree can be compressed to form a syntax tree

Syntax errors can be detected at this level if the input is not in accordance with the grammar.

Example:

i = 3 + (4 * 17)^2;

After lexing (the lexical analysis phase), we'll have:

ID(i) EQUALS INT(3) PLUS LPAREN INT(4) TIMES INT(17) RPAREN EXPONENT INT(2) SEMICOLON

<id, i> <=> <3> <+> <(> <4> <*> <17> <)> <^> <2>

After parsing, these tokens are organised into a parse tree.

The kinds of errors that would be caught during this stage of compilation are the traditional "syntax errors", such as:

• Mismatched or missing parentheses in if conditions or while loops

• Missing semicolons after variable assignments, statements, etc.

• Missing a return statement

The output of this phase is usually an Abstract Syntax Tree (AST).

Example: the lexemes and the corresponding AST.
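As a sketch of what such an AST might look like in code, the statement i = 3 + (4 * 17)^2 can be built from a few illustrative node types (the names below are assumptions, not any particular compiler's API; Java 16+ is assumed for records):

public class AstDemo {
    interface Expr {}
    record Num(int value) implements Expr {}
    record BinOp(String op, Expr left, Expr right) implements Expr {}
    record Assign(String target, Expr value) {}

    public static void main(String[] args) {
        // i = 3 + (4 * 17)^2: the parentheses and semicolon disappear;
        // only the grammatical structure of the statement remains.
        Assign ast = new Assign("i",
            new BinOp("+",
                new Num(3),
                new BinOp("^",
                    new BinOp("*", new Num(4), new Num(17)),
                    new Num(2))));
        System.out.println(ast);   // records print their nested structure
    }
}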

The syntax analysis phase typically involves the following steps:

1. Tokenization: The input program is divided into a sequence of tokens, which are basic building
blocks of the programming language, such as identifiers, keywords, operators, and literals.

2. Parsing: The tokens are analyzed according to the grammar rules of the programming
language, and a parse tree or AST is constructed that represents the hierarchical structure of
the program.

3. Error handling: If the input program contains syntax errors, the syntax analyzer detects and
reports them to the user, along with an indication of where the error occurred.

4. Symbol table creation: The syntax analyzer creates a symbol table, which is a data structure
that stores information about the identifiers used in the program, such as their type, scope,
and location.

Parsing

The process of transforming data from one format to another is called parsing, and it is carried out by the parser. The parser is a component of the translator that organises linear text structure according to a set of defined rules, known as a grammar.

Types of Parsing:

There are two types of Parsing:

• The Top-down Parsing

• The Bottom-up Parsing

Top-down Parsing

Top-down parsing traces out the left-most derivation of the input while expanding the parse tree from the top. It initiates with the start symbol and ends on the terminals, parsing the input while constructing the parse tree from the root node and gradually moving down to the leaf nodes. Such parsing is also known as predictive parsing.

Recursive Descent Parsing

Recursive descent is a top-down parsing technique that constructs the parse tree from the top and
the input is read from left to right. It uses procedures for every terminal and non-terminal entity. This
parsing technique recursively parses the input to make a parse tree, which may or may not require
back-tracking. But the grammar associated with it (if not left factored) cannot avoid back-tracking. A
form of recursive-descent parsing that does not require any back-tracking is known as predictive
parsing.

This parsing technique is regarded as recursive because it uses a context-free grammar, which is recursive in nature.

Back-tracking

Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched). To understand this, take the following example of a CFG:

S → rXd | rZd

X → oa | ea

Z → ai

For an input string: “read”, a top-down parser will:

i. Start with S from the production rules and match its yield to the left-most letter of the input, i.e. 'r'. The first production of S (S → rXd) matches it.
ii. For the next input letter 'e', the parser tries to expand the non-terminal 'X' and checks its production from the left (X → oa). This does not match the next input symbol, so the top-down parser backtracks to obtain the next production rule of X, (X → ea).

Now the parser matches all the input letters in an ordered manner. The string is accepted.
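The same backtracking can be sketched as a recursive-descent parser in Java: each non-terminal becomes a method, and the saved position is what makes backtracking possible (the method names and driver are illustrative):

public class BacktrackParser {
    static String input;
    static int pos;

    // S -> rXd | rZd
    static boolean S() {
        int save = pos;
        if (eat('r') && X() && eat('d')) return true;
        pos = save;                 // backtrack, then try the second alternative
        return eat('r') && Z() && eat('d');
    }

    // X -> oa | ea
    static boolean X() {
        int save = pos;
        if (eat('o') && eat('a')) return true;
        pos = save;                 // X -> oa failed on input 'e': backtrack
        return eat('e') && eat('a');
    }

    // Z -> ai
    static boolean Z() { return eat('a') && eat('i'); }

    // Consume one terminal if it matches the next input character.
    static boolean eat(char c) {
        if (pos < input.length() && input.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    public static void main(String[] args) {
        input = "read";
        pos = 0;
        System.out.println("accepted: " + (S() && pos == input.length()));
        // prints: accepted: true
    }
}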

Predictive Parser

Predictive parser is a recursive descent parser, which has the capability to predict which
production is to be used to replace the input string. The predictive parser does not suffer from
backtracking.

To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbols. To make the parser back-tracking free, the predictive parser puts some constraints on the grammar and accepts only the class of grammars known as LL(k) grammars, where the first L stands for left-to-right scanning of the input, the second L for leftmost derivation, and k for the number of lookahead symbols.
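For the grammar used earlier, one lookahead symbol is enough to choose between the two productions for X: FIRST(oa) = {o} and FIRST(ea) = {e} are disjoint, so on seeing 'e' a predictive parser commits to X → ea without ever trying X → oa. Both productions for S begin with 'r', however, so that grammar would first have to be left-factored (for example S → rYd, Y → X | Z) before a predictive parser could handle it without backtracking.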

Bottom-up Parsing

Bottom-up parsing starts from the leaf nodes of a tree and works in an upward direction until it reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order to reach the start symbol.

Semantic Analysis

Semantic analysis is the third phase of the compilation process. It checks whether the parse tree follows the rules of the language, i.e., whether the code conforms to the language's type system and other semantic rules. In this stage, the compiler checks the meaning of the source code to ensure that it makes sense. It performs type checking, which ensures that variables are used correctly and that operations are performed on compatible data types. The compiler also checks for other semantic errors, such as undeclared variables and incorrect function calls.

It also keeps track of identifiers, their types, and expressions. The output of the semantic analysis phase is an annotated syntax tree.
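A minimal flavour of such checking can be sketched in Java. The node types, the string-named types, and the error messages are all illustrative assumptions (Java 16+ is assumed for records and pattern matching):

import java.util.Map;

public class TypeCheckDemo {
    interface Expr {}
    record IntLit(int v) implements Expr {}
    record StrLit(String v) implements Expr {}
    record Var(String name) implements Expr {}
    record Add(Expr left, Expr right) implements Expr {}

    // Returns the type of an expression, or throws on a semantic error.
    static String typeOf(Expr e, Map<String, String> symbols) {
        if (e instanceof IntLit) return "int";
        if (e instanceof StrLit) return "string";
        if (e instanceof Var v) {
            String t = symbols.get(v.name());
            if (t == null)
                throw new RuntimeException("undeclared variable: " + v.name());
            return t;
        }
        Add a = (Add) e;
        String lt = typeOf(a.left(), symbols);
        String rt = typeOf(a.right(), symbols);
        if (!lt.equals(rt))
            throw new RuntimeException("type mismatch: " + lt + " + " + rt);
        return lt;                  // operands must have compatible types
    }

    public static void main(String[] args) {
        Map<String, String> symbols = Map.of("i", "int");   // a tiny symbol table
        System.out.println(typeOf(new Add(new Var("i"), new IntLit(1)), symbols)); // int
        typeOf(new Add(new Var("i"), new StrLit("x")), symbols); // throws: type mismatch
    }
}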

Intermediate Code Generation

In intermediate code generation, the compiler translates the source code into an intermediate code. Intermediate code sits between the high-level language and the machine language, and it is generated in such a way that it can easily be translated into the target machine code.
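For example, the statement i = 3 + (4 * 17)^2 used earlier might be lowered to three-address code, a common intermediate form in which each instruction has at most one operator on its right-hand side (the temporary names t1, t2, t3 are illustrative):

t1 = 4 * 17
t2 = t1 ^ 2
t3 = 3 + t2
i = t3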

Synthesis Phase

Code Optimization

Code optimization is used to improve the intermediate code so that the output of the program could
run faster and take less space. It removes the unnecessary lines of the code and arranges the
sequence of statements in order to speed up the program execution.
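Applied to the three-address code above, where every operand is a constant, an optimizer could fold the entire computation at compile time (4 * 17 = 68, 68 ^ 2 = 4624, 3 + 4624 = 4627) and eliminate the temporaries altogether:

i = 4627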

Code Generation

Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and maps it to the target machine language: the code generator translates the intermediate code into the machine code of the specified computer.

Example:
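Continuing the running example, the three-address code for i = 3 + (4 * 17)^2 might be mapped to instructions for a hypothetical register machine. The mnemonics below are illustrative, not a real instruction set:

LOAD  R1, #4        ; R1 = 4
MUL   R1, #17       ; R1 = 4 * 17
POW   R1, #2        ; R1 = R1 ^ 2 (often expanded to repeated multiplications)
ADD   R1, #3        ; R1 = R1 + 3
STORE i, R1         ; i = R1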

Compiler Passes

A pass is one complete traversal of the source program. By the number of passes made over the source program, compilers fall into two categories: multi-pass and one-pass.

Multi-pass Compiler

o A multi-pass compiler processes the source code of a program several times.

o In the first pass, the compiler reads the source program, scans it, extracts the tokens, and stores the result in an output file.

o In the second pass, the compiler reads the output file produced by the first pass, builds the syntax tree, and performs syntactic analysis. The output of this phase is a file that contains the syntax tree.

o In the third pass, the compiler reads the output file produced by the second pass and checks whether the tree follows the rules of the language. The output of this semantic analysis phase is an annotated syntax tree.

o Passes continue in this way until the target output is produced.

One-pass Compiler

o A one-pass compiler traverses the program only once, passing through the parts of each compilation unit a single time and translating each part into its final machine code.

o In a one-pass compiler, as each line of source is processed, it is scanned and its tokens are extracted.

o Then the syntax of each line is analyzed and the tree structure is built. After the semantic part, the code is generated.

o The same process is repeated for each line of code until the entire program is compiled.

The advantages of using a compiler to translate high-level programming languages into machine code are:

1. Portability: Compilers allow programs to be written in a high-level programming language and executed on different hardware platforms without the need for modification. This means that programs can be written once and run on multiple platforms, making them more portable.

2. Optimization: Compilers can apply various optimization techniques to the code, such as
loop unrolling, dead code elimination, and constant propagation, which can significantly
improve the performance of the generated machine code.

3. Error Checking: Compilers perform a thorough check of the source code, which can
detect syntax and semantic errors at compile-time, thereby reducing the likelihood of
runtime errors.

4. Maintainability: Programs written in high-level languages are easier to understand and maintain than programs written in low-level assembly language. Compilers translate high-level code into machine code, making programs easier to maintain and modify.

5. Productivity: High-level programming languages and compilers increase developer productivity. Developers can write code faster in high-level languages, which can then be compiled into efficient machine code.

