
Unit-1,2

The document provides an overview of compilers, detailing the roles of translators, compilers, and interpreters in converting source code from one programming language to another. It outlines the phases of compilation, including lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation, along with the importance of symbol table management and error handling. Additionally, it discusses the grouping of compiler phases into front end and back end, as well as tools used in compiler construction.

Uploaded by

Niharikha Rao
INTRODUCTION TO COMPILERS

1.1 Translators

A translator in a programming language is a way of converting a program written in a given programming language into a functionally equivalent program in a different computer language.

Compiler: A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).

[Fig. 1.1 Compiler: source program -> compiler -> target program, with error messages reported along the way]

The source program may be written in any programming language, such as FORTRAN, C, or C++. The target program may be in another programming language or in the machine language of a particular computer.

Interpreter: An interpreter is also a translator which converts the source program into the target program. An interpreter is a program in which the code is directly interpreted by the microcode residing in the control memory of the machine, and it executes the code line by line. This microcode generates the control signals for execution.

The target program from the compiler is translated into assembly code, which is then translated by an assembler into machine code. A linker in turn links the library routines with the machine code, which is then loaded into primary memory by a loader.

[Fig. 1.2 A language processing system: source program with preprocessor directives -> preprocessor -> source program -> compiler -> target assembly program -> assembler -> relocatable machine code -> loader/linker (with library routines and relocatable object files) -> absolute machine code]

The input to a compiler may be produced by one or more preprocessors. The contexts in which the compiler typically operates are:
+ Preprocessor
+ Assembler
+ Loader and Link Editor

The compilation tool chain:

    [Preprocessor] -> [Compiler] -> [Assembler] -> [Linker] -> [Loader]

Preprocessor: A preprocessor is a program that processes its input data to produce output that is used as input to another program. The output is said to be a preprocessed form of the input data, which is often used by some subsequent program such as a compiler.
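As a toy illustration of the preprocessing step just described, the Python sketch below handles a simplified `#include` directive and object-like `#define` macros; the file name `consts.h`, the macro `PI`, and the in-memory file table are all hypothetical, invented for this example.

```python
import re

# Hypothetical in-memory "files" standing in for header files on disk.
FILES = {
    "consts.h": "#define PI 3.14159\n",
}

def preprocess(source, files):
    """A toy preprocessor: file inclusion plus object-like macro expansion."""
    macros = {}
    lines = source.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        m = re.match(r'#include\s+"(.+)"', line)
        if m:
            # File inclusion: splice the included file's lines in place.
            lines[i:i] = files[m.group(1)].splitlines()
            continue
        m = re.match(r'#define\s+(\w+)\s+(.+)', line)
        if m:
            # Macro definition: remember the mapping, emit nothing.
            macros[m.group(1)] = m.group(2)
            continue
        # Macro expansion: substitute each defined name in ordinary lines.
        for name, body in macros.items():
            line = re.sub(r'\b%s\b' % name, body, line)
        out.append(line)
    return "\n".join(out)

print(preprocess('#include "consts.h"\narea = PI * r * r', FILES))
# -> area = 3.14159 * r * r
```

The output is the preprocessed form of the input, ready to be handed to the compiler proper.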
They may perform the following functions:
i) Macro processing
ii) File inclusion
iii) Rational preprocessing
iv) Language extension

i) Macro processing
A macro is a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to a defined procedure. The mapping process that instantiates a macro into a specific output sequence is known as macro expansion.

ii) File inclusion
A preprocessor includes header files into the program text. When the preprocessor finds an #include directive, it replaces it by the entire content of the specified file.

iii) Rational preprocessing
These processors augment older languages with more modern flow-of-control and data-structuring facilities.

iv) Language extension
These processors attempt to add capabilities to the language by what amounts to built-in macros. For example, the language Equel is a database query language embedded in C.

Assembler:
An assembler converts assembly language into machine language. Typically a modern assembler creates object code by translating assembly instruction mnemonics into opcodes and by resolving symbolic names for memory locations and other entities. There are two types of assemblers, based on how many passes through the source are needed to produce the executable program:
i) One-pass
ii) Two-pass

+ A one-pass assembler goes through the source code once and assumes that all symbols will be defined before any instruction that references them.
+ A two-pass assembler creates a table with all symbols and their values in the first pass, then uses the table in a second pass to generate code.

Loader and Link Editor:
The process of loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations. A linker or link editor is a program that takes one or more objects generated by a compiler and combines them into a single executable program.

1.3 PHASES OF A COMPILER

In each phase of the compiler, the source program is transformed from one representation to another. The six phases of a compiler are lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. Two other activities, symbol-table management and error handling, interact with all six phases of the compiler. Consider the statement

    total = balance + i * 50;

1.3.1 Lexical Analysis
The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation symbol, or a multi-character operator like := (in the Pascal programming language). This phase is also called the scanner. The character sequence forming a token is called a lexeme; e.g., balance is a lexeme. The statement

    total = balance + i * 50;

can be represented in lexical form as

    id1 = id2 + id3 * 50

[Fig. 1.4 Translation of the statement through the phases: the lexical analyzer produces id1 = id2 + id3 * 50; the syntax and semantic analyzers build trees over id1, id2, id3, with 50 converted by inttoreal to 50.0; the intermediate code generator emits temp1 := id3 * 50.0 and id1 := id2 + temp1; the code generator emits MOVF id3, R2 / MULF #50.0, R2 / MOVF id2, R1 / ADDF R2, R1 / MOVF R1, id1]

1.3.2 Syntax Analysis
In this phase, the scanned input symbols, called tokens, are hierarchically structured into a valid syntax tree.

[Fig. 1.5 Syntax tree]

1.3.3 Semantic Analysis
In this phase, the compiler performs type checking and attaches rules or actions to the grammar, which is then converted into a form suitable for intermediate code generation; e.g., the integer 50 is converted by inttoreal to the real value 50.0 before being multiplied with id3.
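The grouping of characters into tokens described in 1.3.1 can be sketched in a few lines of Python; the token names (`id`, `num`, `assign`, and so on) are illustrative choices for this example, not a fixed standard.

```python
import re

# Token patterns for the tiny statement language of the running example.
TOKEN_SPEC = [
    ("num",    r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("semi",   r";"),
    ("ws",     r"\s+"),
]

def tokenize(text):
    """Group the input characters into a stream of (token, lexeme) pairs."""
    pattern = "|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC)
    tokens = []
    for m in re.finditer(pattern, text):
        if m.lastgroup != "ws":      # the scanner discards white space
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("total = balance + i * 50;"))
```

The identifiers total, balance, and i would then be entered into the symbol table, abstracting the statement to id1 = id2 + id3 * 50.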
[Fig. 1.7 Semantic tree]

1.3.4 Intermediate Code Generation
In this phase, the compiler receives valid syntactic constructs and converts them into an intermediate form, which may be a syntax tree, postfix notation, or three-address code. Three-address code looks like assembly language in which every memory location can act as a register. Three-address code consists of a sequence of instructions, each of which has at most three operands. The three-address code generated for the syntax tree is as follows:

    temp1 := inttoreal(50)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3

Before starting the conversion, the compiler has to decide the order in which the operations are to be done. The intermediate form has the following properties:
i) Each three-address instruction has at most one operator in addition to the assignment operator.
ii) The compiler must generate a temporary name to hold the value computed by each instruction.
iii) Some instructions may have fewer than three operands, e.g., id1 := temp3.

1.3.5 Code Optimization
To make the intermediate code run faster, the code optimizer performs code-improving transformations such as redundant-code elimination and unreachable-code elimination. It is an optional phase. The optimized code for the intermediate code above is

    temp1 := id3 * 50.0
    id1 := id2 + temp1

1.3.6 Code Generation
The final phase of the compiler is target code generation. The target code may be either relocatable machine code or assembly code. The memory location for each variable has to be selected by the compiler in this phase. A complex task here is assigning variables to registers. The target code generated for this example is as follows:

    MOVF  id3, R2      ; R2 := id3
    MULF  #50.0, R2    ; R2 := 50.0 * R2
    MOVF  id2, R1      ; R1 := id2
    ADDF  R2, R1       ; R1 := R1 + R2
    MOVF  R1, id1      ; id1 := R1

All these phases work in association with symbol-table management and error handling.
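The translation into three-address code above can be imitated by a postorder walk over a hand-built syntax tree. This is only a sketch: the temporary-naming scheme and the tuple representation of tree nodes are choices made for this example.

```python
temps = []

def new_temp():
    """Generate a fresh temporary name: temp1, temp2, ..."""
    t = "temp%d" % (len(temps) + 1)
    temps.append(t)
    return t

def gen(node, code):
    """Postorder walk: emit one three-address instruction per operator."""
    if isinstance(node, str):        # a leaf: an id or a constant
        return node
    op, left, right = node
    l = gen(left, code)
    r = gen(right, code)
    t = new_temp()                   # a temporary holds each computed value
    code.append("%s := %s %s %s" % (t, l, op, r))
    return t

# Syntax tree for id2 + id3 * 50.0 (after the semantic phase's inttoreal).
tree = ("+", "id2", ("*", "id3", "50.0"))
code = []
result = gen(tree, code)
code.append("id1 := %s" % result)
for instr in code:
    print(instr)
```

Because multiplication is deeper in the tree, it is emitted first, matching the (optimized) sequence shown in the text.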
1.3.7 Symbol-Table Management
A symbol table is a data structure maintained by the compiler to record the identifiers used in the source program and to collect information about various attributes of each identifier. The attributes provide information about
i) the storage allocated for an identifier,
ii) its type,
iii) its scope (where in the program it is valid), and
iv) the number and types of arguments used in procedures.
Using this symbol table, an identifier can be fetched quickly or stored easily. The code generator enters and uses detailed information about the storage assigned to identifiers.

1.3.8 Error Handling
Each phase of the compiler can encounter errors. Each error is handled by an error-handling mechanism so that errors are dealt with at the respective phase. The syntax analysis phase detects errors where the token stream violates the structure rules (syntax rules) of the language. The semantic analysis phase detects constructs that have the right syntactic structure but no meaning for the operation involved (e.g., adding an array element to a procedure).

[Fig. 1.8 Phases of a compiler: source program -> analysis phase (lexical analyzer, syntax analyzer, semantic analyzer) -> synthesis phase (intermediate code generator, code optimizer, code generator) -> target program, with symbol-table management and error detection and handling interacting with all phases]

1.6 GROUPING OF PHASES
Depending on the relationship between phases, the phases are grouped together as a front end and a back end.

1.6.1 Front and Back Ends
The front end consists of those phases, or parts of phases, that depend primarily on the source language and are largely independent of the target machine. This includes phases like:
i) lexical analysis
ii) syntactic analysis
iii) symbol table creation
iv) semantic analysis
v) generation of intermediate code
Some part of the code optimization can be performed at the front end as well. The back end includes those phases of the compiler that depend mainly on the target machine and not on the source language.
i) part of the code optimization phase
ii) code generation

1.6.2 Passes
All the phases of a compiler are implemented using either a single pass or a multi-pass compilation. A pass is one complete scan of the source language, i.e., reading one input file and writing one output file. It is common for several phases to be grouped into one pass and for the activity of these phases to be interleaved during the pass. Compiling involves a great deal of work, and early computers did not have enough memory to hold one program that performed all of it.

Front End and Back End
The different phases of the compiler are grouped into two parts. The first three phases are grouped into the front end, and the last two phases are grouped into the back end. Between these two groups the intermediate code generator is placed.

[Fig. 1.12 Grouping of phases: scanner -> parser -> semantic analyzer -> intermediate code generator -> code optimizer -> code generator; the first group forms the front end, the second the back end]

Front End:
+ The front end is dependent on the source language but independent of the target language.
+ The phases of analysis (machine independent) are grouped as the front end.
+ These normally include lexical and syntactic analysis, the creation of the symbol table, semantic analysis, and the generation of intermediate code. The front end also includes the error handling that goes along with each of these phases.

Back End:
+ The back end is dependent on the target language but independent of the source language.
+ The phases of synthesis (machine dependent) are grouped as the back end.
+ It includes the code optimization phase and code generation, along with the necessary error handling and symbol-table operations.

Advantages of Grouping of Phases
+ By keeping the same front end and attaching different back ends, we can produce compilers for the same source language on different machines.
+ By keeping different front ends and the same back end, we can compile several different languages on the same machine.
+ A collection of phases is traversed only once (single pass) or multiple times (multi-pass).
+ Single pass: usually requires everything to be defined before being used in the source program.
+ Multi-pass: the compiler may have to keep the entire program representation in memory.

Several phases can be grouped into a single pass, and the activities of these phases are interleaved during the pass. For example, lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped into one pass. Within a pass, the phases are grouped logically.

1.9 COMPILER CONSTRUCTION TOOLS
Compiler writers use software tools such as debuggers, version managers, profilers, and so on. The following is a list of some useful compiler-construction tools:
1. Scanner generators
2. Parser generators
3. Syntax-directed translation engines
4. Automatic code generators
5. Data-flow engines

Scanner Generators
+ These generate lexical analyzers, normally from a specification based on regular expressions.
+ The organization of the generated lexical analyzers is based on finite automata.
+ LEX is a scanner generator used in compilers.

Parser Generators
+ These produce syntax analyzers, normally from input that is based on a context-free grammar.
+ The syntax analyzer consumes a large fraction of the running time of a compiler.
+ Example: YACC (Yet Another Compiler-Compiler).

Syntax-Directed Translation Engines
+ These produce routines that walk the parse tree and, as a result, generate intermediate code.
+ Each translation is defined in terms of translations at its neighbor nodes in the tree.
+ This construction tool converts the parse tree into intermediate code as output.

Automatic Code Generators
+ These take a collection of rules that translate the intermediate language into machine language. The rules must include sufficient detail to handle the different possible access methods for data.
+ They convert the intermediate code into assembly language.

Data-Flow Engines
+ These perform code optimization using data-flow analysis, that is, the gathering of information about how values are transmitted from one part of a program to every other part. They convert assembly code into optimized assembly code.

1.4 ERRORS ENCOUNTERED IN DIFFERENT PHASES
A good compiler should detect errors early and should have good diagnostic facilities. A sophisticated compiler should repair the error and transform the erroneous input into a normal input so that the compiler can resume its operations. Errors are encountered in all phases of the compiler. For example, the errors encountered in each of the phases are as follows:
i) Lexical analyzer: misspelling of tokens
ii) Syntax analyzer: missing parenthesis
iii) Intermediate code generator: incompatible types
iv) Code optimizer: unreachable statements
v) Code generator: compiler-created constants incompatible with target memory

A good error-diagnostic routine will reduce debugging and maintenance activity. It should possess the following properties:
i) The error messages should identify the error in the source program in terms understandable to the user.
ii) The error messages should be specific and clear, e.g., "Missing right parenthesis in line 4."
iii) The error message should locate the problem line together with the corresponding message, e.g., "rate not declared."
iv) The error messages should not be redundant.

1.4.1 Sources of Error
There is no simple way of classifying programming errors. One way is to classify errors according to how they are introduced. A program may encounter different errors in different stages of compilation. An error may be:
i) due to an inconsistency in the design specifications;
ii) due to inadequate or incorrect design, called "algorithmic errors";
iii) due to logical errors made while implementing the program in a programming language;
iv) due to keypunching or transcription errors;
v) due to exceeding a compiler or machine limit (e.g., an array may be too large to be allocated at run time);
vi) due to the compiler itself during translation.

Some examples are listed below:
i) insertion of an extraneous character or token;
ii) deletion of a required character or token;
iii) replacement of a correct character or token by an incorrect one;
iv) transposition of two adjacent characters or tokens.

Normally the compiler designer classifies errors as either syntactic or semantic. A syntactic error is an error detectable by the lexical or syntactic phase of the compiler. Other kinds of errors are called semantic errors.

1.4.2 Syntactic Errors
These errors are identified in the syntax analysis phase.
i) Missing right parenthesis: x = (x + y) * (y + z
ii) Extraneous comma (in a FORTRAN program): DO 20, I = 1, 200
iii) Colon in place of semicolon
iv) Misspelled keyword: wihle (x > y)
v) Extra blank: / * Area of circle * /
From these examples it is easy to understand the type of error and the position of the error.

1.4.3 Semantic Errors
These types of errors can be detected both at compile time and at run time. Most of the semantic errors detected at compile time are errors of declaration and scope, e.g.:
i) undeclared identifier;
ii) type incompatibility between operators and operands;
iii) type incompatibility between formal and actual arguments.
Depending upon the language used, proper type checking needs to be performed. Some languages support automatic type conversions.

1.4.4 Errors in Each Phase of the Compiler
All the phases of a compiler should follow certain specifications for proper execution of the source code. If the input does not comply with the specification, the phase encounters an error. A good compiler should recover from each error and continue processing the input.
The recovery routines detect the error and print error diagnostic messages. The diagnostic routines communicate with the symbol table to avoid redundant messages. If a particular phase transmits errors without repair, then all the subsequent phases must deal with the erroneous input passed on to them. The errors can be classified broadly as lexical errors, syntactic errors, or semantic errors.

11.2 APPROACHES TO COMPILER DEVELOPMENT
There are several general approaches that a compiler writer can adopt to implement a compiler. The simplest is to retarget or rehost an existing compiler. If there is no suitable existing compiler, the compiler writer might adopt the organization of a known compiler for a similar language and implement the corresponding components, using component-generation tools or implementing them by hand. It is relatively rare that a completely new compiler organization is required.

No matter what approach is adopted, compiler writing is an exercise in software engineering. Lessons from other software efforts (see, for example, Brooks [1975]) can be applied to enhance the reliability and maintainability of the final product. A design that readily accommodates change will allow the compiler to evolve with the language. The use of compiler-building tools can be a significant help in this regard.

Bootstrapping
A compiler is a complex enough program that we would like to write it in a friendlier language than assembly language. In the UNIX programming environment, compilers are usually written in C. Even C compilers are written in C. Using the facilities offered by a language to compile itself is the essence of bootstrapping. Here we shall look at the use of bootstrapping to create compilers and to move them from one machine to another by modifying the back end. The basic ideas of bootstrapping have been known since the mid-1950's (Strong et al. [1958]).

Bootstrapping may raise the question, "How was the first compiler compiled?", which sounds like, "What came first, the chicken or the egg?", but is easier to answer. For an answer we consider how Lisp became a programming language. McCarthy [1981] notes that in late 1958 Lisp was used as a notation for writing functions; they were then hand-translated into assembly language and run. The implementation of an interpreter for Lisp occurred unexpectedly. McCarthy wanted to show that Lisp was a notation for describing functions "much neater than Turing machines or the general recursive definitions used in recursive function theory," so he wrote a function eval[e, a] in Lisp that took a Lisp expression e as an argument. S. R. Russell noticed that eval could serve as an interpreter for Lisp, hand-coded it, and thus created a programming language with an interpreter. As mentioned in Section 1.1, rather than generating target code, an interpreter actually performs the operations of the source program.

For bootstrapping purposes, a compiler is characterized by three languages: the source language S that it compiles, the target language T that it generates code for, and the implementation language I that it is written in. We represent the three languages using a diagram called a T-diagram because of its shape (Bratman [1961]). Within text, we abbreviate the T-diagram as SIT. The three languages S, I, and T may all be quite different. For example, a compiler may run on one machine and produce target code for another machine. Such a compiler is often called a cross-compiler.

Suppose we write a cross-compiler for a new language L in implementation language S to generate code for machine N; that is, we create LsN. If an existing compiler for S runs on machine M and generates code for M, it is characterized by SMM. If LsN is run through SMM, we get a compiler LMN, that is, a compiler from L to N that runs on M. This process is illustrated in Fig. 11.1 by putting together the T-diagrams for these compilers.

[Fig. 11.1 Compiling a compiler]
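The bookkeeping behind the equation LsN + SMM = LMN can be modeled by treating each compiler as a (source, implementation, target) triple. This is only a sketch of the composition rule, with the language names as placeholder strings.

```python
def compile_compiler(new, existing):
    """Run compiler `new` through compiler `existing`.

    new      = (S, I, T): compiles S to T, written in language I
    existing = (I, M, M): compiles I to M, and runs on machine M
    result   = (S, M, T): compiles S to T, and runs on machine M
    """
    s, i, t = new
    src, impl, tgt = existing
    # The implementation language of `new` must be the existing
    # compiler's source language, and the existing compiler must
    # run on the machine it targets.
    assert i == src and impl == tgt
    return (s, tgt, t)

# LsN + SMM = LMN: a compiler from L to N that now runs on M.
print(compile_compiler(("L", "S", "N"), ("S", "M", "M")))  # ('L', 'M', 'N')
```

The assertions encode exactly the matching conditions on T-diagrams discussed next.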
When T-diagrams are put together as in Fig. 11.1, note that the implementation language S of the compiler LsN must be the same as the source language of the existing compiler SMM, and that the target language M of the existing compiler must be the same as the implementation language of the translated form LMN. A trio of T-diagrams such as Fig. 11.1 can be thought of as an equation

    LsN + SMM = LMN

Example 11.1. The first version of the EQN compiler (see Section 12.1) had C as the implementation language and generated commands for the text formatter TROFF. As shown in the following diagram, a cross-compiler for EQN, running on a PDP-11, was obtained by running EQN C TROFF through the C compiler C 11 11 on the PDP-11.

2.4 SPECIFICATION OF TOKENS
We can specify the tokens in the following ways:
1. Strings
2. Languages
3. Regular expressions

Strings and Languages
+ An alphabet or character class is a finite set of symbols.
+ A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
+ The empty string, denoted ε, is the string of length zero.
+ A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, compiler is a string of length eight.

Operations on Strings
The following terms relate to strings:
+ A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, com is a prefix of compiler.
+ A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, piler is a suffix of compiler.
+ A substring of s is obtained by deleting any prefix and any suffix from s. For example, pil is a substring of compiler.
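These string operations correspond directly to slicing, so the definitions can be checked mechanically (a quick sketch in Python):

```python
s = "compiler"

# A prefix removes zero or more symbols from the end of s.
prefixes = {s[:i] for i in range(len(s) + 1)}
# A suffix removes zero or more symbols from the beginning of s.
suffixes = {s[i:] for i in range(len(s) + 1)}
# A substring deletes some prefix and some suffix.
substrings = {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

assert "com" in prefixes
assert "piler" in suffixes
assert "pil" in substrings
assert "" in prefixes and "" in suffixes   # ε is a prefix, suffix, and substring
print(len(s))  # |compiler| = 8
```

Note that a string of length n has n + 1 prefixes (including ε and s itself), which is why the proper prefixes defined next exclude those two cases.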
+ The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
+ A subsequence of s is any string formed by deleting zero or more, not necessarily consecutive, symbols of s.

Operations on Languages
The following operations can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows the operations on languages. Let L = {0, 1} and S = {a, b, c}.
1. Union: L ∪ S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, ...}
4. Positive closure: L+ = {0, 1, 00, ...}

Regular Expressions
+ Each regular expression r denotes a language L(r).
+ Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   a. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
   b. (r)(s) is a regular expression denoting the language L(r)L(s).
   c. (r)* is a regular expression denoting (L(r))*.
   d. (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence and is left associative.
5. Concatenation has the second highest precedence and is left associative.
6. | has the lowest precedence and is left associative.

    AXIOM                      DESCRIPTION
    r|s = s|r                  | is commutative
    r|(s|t) = (r|s)|t          | is associative
    (rs)t = r(st)              concatenation is associative
    r(s|t) = rs|rt             concatenation distributes over |
    (s|t)r = sr|tr
    εr = rε = r                ε is the identity element for concatenation
    r* = (r|ε)*                relation between * and ε
    r** = r*                   * is idempotent

Table 2.2 Algebraic properties of regular expressions

Regular Sets
A language that can be defined by a regular expression is called a regular set. If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s. There are a number of algebraic laws for regular expressions that can be used to manipulate them into equivalent forms. For instance, r|s = s|r (| is commutative) and r|(s|t) = (r|s)|t (| is associative).

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

    d1 → r1
    d2 → r2
    ...
    dn → rn

where:
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.

Example: Identifiers are the set of strings of letters and digits beginning with a letter. A regular definition for this set is:

    letter → A | B | ... | Z | a | b | ... | z
    digit  → 0 | 1 | ... | 9
    id     → letter ( letter | digit )*

Notational Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.
1. One or more instances (+)
+ The unary postfix operator + means "one or more instances of."
+ If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+.
+ Thus the regular expression a+ denotes the set of all strings of one or more a's.
+ The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?)
+ The unary postfix operator ? means "zero or one instance of."
+ The notation r? is a shorthand for r | ε.
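The four language operations can be checked on the finite example above. Kleene closure is an infinite set, so the sketch below truncates it at strings of length 2; the helper name `closure_upto` is an invention of this example.

```python
L = {"0", "1"}
S = {"a", "b", "c"}

union = L | S                                   # L ∪ S
concatenation = {x + y for x in L for y in S}   # L.S

def closure_upto(lang, n):
    """Finite cut of the Kleene closure: union of L^0 .. L^n."""
    result = {""}            # ε: the L^0 term
    current = {""}
    for _ in range(n):
        current = {x + y for x in current for y in lang}
        result |= current
    return result

kleene2 = closure_upto(L, 2)      # strings of L* with length <= 2
positive2 = kleene2 - {""}        # positive closure omits ε

print(sorted(union))
print(sorted(concatenation))
print(sorted(kleene2))
```

The only difference between L* and L+ is the presence of ε, which the last line makes explicit.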
+ If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character classes
+ The notation [abc], where a, b, and c are alphabet symbols, denotes the regular expression a | b | c.
+ A character class such as [a-z] denotes the regular expression a | b | c | ... | z.
+ We can describe identifiers as the strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.

Non-regular Sets
A language which cannot be described by any regular expression is a non-regular set. Example: the set of all strings of balanced parentheses, and the set of repeating strings, cannot be described by a regular expression. These sets can be specified by a context-free grammar.

2.5 RECOGNITION OF TOKENS
For a programming language there are various types of tokens, such as identifiers, keywords, constants, and operators. A token is usually represented as a pair: token type and token value.

[Fig. 2.6 Token representation]

+ Token type: a pointer to the symbol table for variables and constants.
+ Token value: information about the token.

Consider the following grammar fragment:

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num

where the terminals if, then, else, relop, id, and num generate sets of strings given by the following regular definitions:

    if    → if
    then  → then
    else  → else
    relop → < | <= | = | <> | > | >=
    id    → letter ( letter | digit )*
    num   → digit+ (. digit+)? ( E (+|-)? digit+ )?

For this language fragment the lexical analyzer will recognize the keywords if, then, and else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers.

2.3.3 Expressing Tokens by Regular Expressions
A regular expression is built out of simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r).
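The regular definitions for this fragment translate almost directly into Python's re syntax. The sketch below is one possible rendering: keywords are recognized by first matching the id pattern and then checking the lexeme against a reserved-word set, which is how the reservation assumption is usually implemented.

```python
import re

KEYWORDS = {"if", "then", "else"}

# Alternatives mirror the regular definitions above; within relop,
# two-character operators are listed before their one-character prefixes.
TOKEN_RE = re.compile(r"""
    (?P<num>\d+(\.\d+)?(E[+-]?\d+)?)    # num -> digit+ (. digit+)? (E(+|-)? digit+)?
  | (?P<id>[A-Za-z][A-Za-z0-9]*)        # id  -> letter (letter | digit)*
  | (?P<relop><=|<>|>=|<|=|>)           # relop
  | (?P<ws>\s+)
""", re.VERBOSE)

def tokens(text):
    out = []
    for m in re.finditer(TOKEN_RE, text):
        kind = m.lastgroup
        if kind == "ws":
            continue                     # white space separates tokens
        lexeme = m.group()
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme                # reserved words become their own tokens
        out.append((kind, lexeme))
    return out

print(tokens("if x1 <= 4.5E2 then y else z"))
```

Running it on `if x1 <= 4.5E2 then y else z` yields the keyword tokens, an id, a relop, and a num with its optional fraction and exponent parts.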
The rules that define the regular expressions are:
i) ε is a regular expression denoting the language {ε}.
ii) If a is a symbol in Σ, then a is a regular expression denoting the language {a}.
iii) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
   a) (r)|(s) is a regular expression denoting L(r) ∪ L(s);
   b) (r).(s) is a regular expression denoting L(r).L(s);
   c) (r)* is a regular expression denoting (L(r))*;
   d) (r) is a regular expression denoting L(r).
The precedence and associativity of the operators are as follows:
i) The unary operator * has the highest precedence and is left associative.
ii) The concatenation operator has the second highest precedence and is left associative.
iii) | has the lowest precedence and is left associative.

2.1 ROLE OF THE LEXICAL ANALYZER
The lexical analyzer is also called the scanner. It is the first phase of a compiler. The main task of the lexical analyzer is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

[Fig. 2.1 Lexical analyzer]

The lexical analyzer reads the source program and removes comments and white space in the form of blank, tab, and newline characters. The lexical analyzer keeps track of the number of newline characters so that it can associate an error message with a line number.

2.1.1 Issues in Lexical Analysis
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing:
i) To make the design simpler. The separation of lexical analysis from syntax analysis allows the other phases to be simpler. For example, parsing a document with comments and white space is more complex than parsing one from which they have already been removed.
ii) To improve the efficiency of the compiler. A separate lexical analyzer allows us to construct an efficient processor. A large amount of time is spent reading the source program and partitioning it into tokens; specialized buffering techniques speed up this work.
iii) To enhance compiler portability. Input-alphabet peculiarities and device-specific anomalies can be restricted to the lexical analyzer.

2.1.2 Tokens, Patterns, Lexemes
A token is an atomic unit representing a logically cohesive sequence of characters, such as an identifier, a keyword, an operator, a constant, a literal string, or a punctuation symbol such as a parenthesis, comma, or semicolon.
e.g.,

    rate  - identifier
    +     - operator
    if    - keyword

A pattern is a set of strings in the input for which the same token is produced as output; it is a rule describing the lexemes of a token. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
e.g., for the declaration

    float rate;

the tokens are float and the identifier rate; the lexeme for the identifier token is rate; and the pattern for the token float is the character sequence f followed by l, o, a, t. In a grammar, tokens are considered terminal symbols.

2.1.3 Attributes for Tokens
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. For example, the pattern relop matches the operators <, >, <=, and others; it is necessary to identify which operator actually matched. The lexical analyzer collects such information about tokens as their attributes. A token has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. For certain token-attribute pairs there is no need for an attribute value; for others, the compiler stores the character string that forms a value in a symbol table.

2.1.4 Lexical Errors
A lexical analyzer has a very localized view of a source program.
The possible error-recovery actions are:
i) deleting an extraneous character
ii) inserting a missing character
iii) replacing an incorrect character by a correct character
iv) transposing two adjacent characters

For example, in fi (a == b) ..., fi is a valid identifier, but the open parenthesis following the identifier may tell us that fi is a misspelling of the keyword if or an undeclared function identifier.

LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used together with the yacc parser generator.

Creating a lexical analyzer
+ First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c.
+ Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms an input stream into a sequence of tokens.

[Fig. 2.9 Creating a lexical analyzer with Lex: lex.l -> Lex compiler -> lex.yy.c; lex.yy.c -> C compiler -> a.out; input stream -> a.out -> sequence of tokens]

Lex Specification
A Lex program consists of three parts:

    { definitions }
    %%
    { rules }
    %%
    { user subroutines }

+ Definitions include declarations of variables, constants, and regular definitions.
+ Rules are statements of the form

    p1 { action1 }
    p2 { action2 }
    ...
    pn { actionn }

where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when pattern pi matches a lexeme. Actions are written in C code.
+ User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded with the lexical analyzer.

The implementation of a Lex compiler can be based on either non-deterministic or deterministic automata.
