Chapter Three: Language Translation Issues
1. Programming Language syntax
Syntax is defined as “the arrangement of words as elements in a sentence to show their
relationship”; it also describes the sequence of symbols that make up valid programs. In other
words syntax is the study of how words combine to make sentences. The order of words in
sentences varies from language to language. Syntax provides significant information needed for
understanding a program and provides much needed information toward the translation of the
source program into an object program.
The syntax of a programming language describes the structure of programs without any
consideration of their meaning.
In C, X = Y + Z is a valid sequence of symbols, but XY +- is not.
Syntax provides significant information for understanding a program and for translating it into an object program.
Syntax rules also fix the interpretation of an expression: 2 + 3 x 4 is 14, not 20. To obtain 20, the grouping must be stated in the syntax as (2 + 3) x 4. In this way the syntax guides the translator.
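The two points above (valid versus invalid symbol sequences, and grouping fixed by syntax) can be tried out with a short sketch. It uses Python's standard ast module as a stand-in for a translator's parser; Python writes multiplication as * rather than x, and the helper names is_valid and top_operator are made up for this illustration.

```python
import ast


def is_valid(source):
    """Return True if the parser accepts the text as a program."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def top_operator(expression):
    """Return the operator at the root of the parse tree (applied last)."""
    root = ast.parse(expression, mode="eval").body
    return type(root.op).__name__


print(is_valid("X = Y + Z"))   # a valid sequence of symbols
print(is_valid("XY +-"))       # rejected by the syntax rules

# Without parentheses the multiplication is grouped first.
print(2 + 3 * 4, top_operator("2 + 3 * 4"))
# Parentheses change the grouping, so the addition is done first.
print((2 + 3) * 4, top_operator("(2 + 3) * 4"))
```

The root of the tree is the operation performed last, which is why it is the addition in the first expression and the multiplication in the second.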
Examples of syntax features:
statements end with ';' (C, C++, Pascal), with '.' (Prolog), or do not have an ending symbol
(FORTRAN)
variable names may start with any letter (C, C++, Java), or must start with a capital letter (Prolog).
the symbol for the assignment statement is '=', or ':=', or something else.
Key/General Syntactic criteria
Provide a common notation between the programmer and the programming language
processor.
The choice of notation is constrained only slightly by the necessity to communicate particular items of
information.
For example, the fact that a variable is real may be communicated by an explicit
declaration, as in Pascal, or by an implicit naming convention, as in FORTRAN.
General criteria: the syntax should be easy to read, write, and translate, and it should be unambiguous.
Readability – a program is considered readable if the algorithm and data are apparent by
inspection.
Algorithm is apparent from inspection of text
self-documenting
natural statement formats
liberal use of key words and noise words
providing for embedded comments
unrestricted length identifiers
mnemonic operator symbols
Write-ability – ease of writing the program.
Enhanced by concise and regular structures. Notice the tension with readability, which is helped by
wordier, more varied constructs that let us distinguish program features.
FORTRAN - implicit naming does not help us catch misspellings (for example, indx and index
are both valid integer variables, even though the programmer wanted indx to be index).
Redundancy can be good: it makes a program easier to read and allows for error checking.
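The misspelling problem can be illustrated by analogy. Python does not type variables by their first letter as FORTRAN does, but it also creates a variable on first assignment with no declaration, so the same kind of slip goes unreported. The function count_items below is a made-up example, not taken from any real program.

```python
def count_items(items):
    """Intended to count the items, but contains a misspelled variable."""
    index = 0
    for _ in items:
        # The programmer meant "index = index + 1". The misspelled name
        # silently becomes a second variable, so index never changes.
        indx = index + 1
    return index


print(count_items(["a", "b", "c"]))
```

A language that required every variable to be declared would reject indx at translation time; that redundancy is what makes the error detectable.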
Verifiability – ability to prove program correctness (very difficult issue)
Translatability – ease of translating the program into executable form.
The key to easy translation is regularity of structure.
LISP can be translated with a few short, easy rules, but it is hard to read.
COBOL has a large number of syntactic constructs, which makes it hard to translate.
Lack of ambiguity – the syntax should provide for ease of avoiding ambiguous structures.
Central problem in every language design!
Ambiguous construction allows for two or more different interpretations
These do not arise in the structure of individual program elements but in the interplay
between structures
Basic syntactic elements/concepts in a programming language
Character set – the alphabet of the language.
Several different character sets are used: ASCII, EBCDIC, Unicode.
o The choice of character set is one of the first to be made in designing language
syntax.
Identifiers – strings of letters and digits, usually beginning with a letter.
Operator Symbols – + - * / are special characters that most languages use to represent the
basic arithmetic operations.
Keywords or Reserved Words – used as a fixed part of the syntax of a statement.
Noise words – optional words inserted into statements to improve readability.
Comments – used to improve readability and for documentation purposes.
Comments are usually enclosed by special markers.
Blanks – rules vary from language to language. Usually only significant in literal strings.
Delimiters – used to denote the beginning and the end of syntactic constructs, for example { and }.
Expressions – functions that access data objects in a program and return a value
o An expression is a piece of a statement that describes a series of computations to
be performed on some of the program’s variables, such as X + Y/Z, in which the
variables are X, Y, and Z and the computations are addition and division.
Statements – these are the sentences of the language, describe a task to be performed. A
statement in a program is a basic sentence that expresses a simple idea—its purpose is to
give the computer a basic instruction. Statements define the types of data allowed, how
data are to be manipulated, and the ways that procedures and functions work.
o simple - no embedding
o structured or nested - embedded
Overall Program-Subprogram Structure
Separate subprogram definitions (Common blocks in FORTRAN): separate
compilation, linked at load time.
Advantages: easy modification.
Separate data definitions (class mechanism): Group together all definitions that
manipulate a data object.
General approach in OOP.
Nested subprogram definitions (Pascal nesting one subprogram in the other):
Subprogram definitions appear as declarations within the main program or other
subprograms. Not used in many contemporary languages. Provides for static type
checking in non-local referencing environments.
Separate interface definitions: Subprogram interface - the way programs and
subprograms interact by means of arguments and returned results. A program
specification component may be used to describe the type of information transferred
between separate components of the program. For example, C and C++ use header files as
specification components.
o package interface in Ada - in C you can do this with an include file
Data descriptions separated from executable statements. A centralized data division
contains all data declarations, e.g. in COBOL. Advantage - the logical data format is independent
of the algorithms.
Unseparated subprogram definitions: No syntactic distinction between main program
statements and subprogram statements. Allows for run-time translation and execution.
o no organization - early BASIC and SNOBOL
2. Stages in Translation
The process of translating a program from its original syntax into executable form is central to
every programming language implementation. Translation can be quite simple, as in LISP and Prolog, but
more often it is quite complex. Most languages could be implemented with only trivial translation if
you were willing to write a software interpreter and to accept slow execution speeds.
Stages in Translation
Syntactic recognition parts of compiler theory are fairly standard
Analysis of the Source Program
the structure of the program must be laboriously built up character by
character during translation
Synthesis of the Object Program
construction of the executable program from the output of the semantic
analysis
2.1. Analysis of the source program
To a translator, the source program appears initially as one long undifferentiated sequence of
symbols composed of thousands or tens of thousands of characters.
Lexical analysis (scanning/ tokenizing) – identifying the tokens of the programming
language: the keywords, identifiers, constants and other symbols appearing in the
language.
In the program
void main()
{
printf("Hello World\n");
}
The tokens are
void, main, (, ), {, printf, (, "Hello World\n", ), ;, }
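The scanning step above can be sketched with a small regular-expression tokenizer. This is a minimal illustration and not how any particular compiler is written: the token classes, the KEYWORDS subset, and the names TOKEN_PATTERN and tokenize are assumptions made for the example, and only enough of C is covered to handle the program shown.

```python
import re

KEYWORDS = {"void", "int", "return"}  # small illustrative subset of C keywords

TOKEN_PATTERN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")      # string literal, escapes allowed
  | (?P<identifier>[A-Za-z_]\w*)       # identifiers and keywords
  | (?P<number>\d+)                    # integer constants
  | (?P<symbol>[(){};,=+\-*/])         # single-character symbols
''', re.VERBOSE)


def tokenize(source):
    """Split source text into (kind, text) pairs, skipping white space."""
    tokens = []
    position = 0
    while position < len(source):
        if source[position].isspace():
            position += 1
            continue
        match = TOKEN_PATTERN.match(source, position)
        if match is None:
            raise SyntaxError(f"unexpected character {source[position]!r}")
        kind, text = match.lastgroup, match.group()
        if kind == "identifier" and text in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
        position = match.end()
    return tokens


PROGRAM = 'void main()\n{\nprintf("Hello World\\n");\n}\n'

for kind, text in tokenize(PROGRAM):
    print(kind, text)
```

The scanner reports void as a keyword and main and printf as identifiers; deciding that printf names a function is left to the later stages.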
Syntactic analysis (parsing) – determines the structure of the program, as defined by the
language grammar.
Semantic analysis - assigns meaning to the syntactic structures.
Example:
int variable1;
The meaning is that the program needs storage (typically 4 bytes on current machines) to serve as a location for
variable1. Further on, only a specific set of operations can be used with variable1,
namely integer operations.
The semantic analysis builds the bridge between analysis and synthesis.
Basic semantic tasks:
1. Symbol–table maintenance
2. Insertion of implicit information
3. Error detection
4. Macro processing and compile-time operations
The result of the semantic analysis is an internal representation, suitable to be used for
code optimization and code generation.
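Two of the semantic tasks listed above, symbol-table maintenance and error detection, can be sketched as follows. The class SymbolTable and its methods are hypothetical names chosen for this illustration; a real translator records much more per name (scope, storage size, address).

```python
class SymbolTable:
    """Maps each declared name to its type, as recorded from declarations."""

    def __init__(self):
        self._entries = {}

    def declare(self, name, type_name):
        # Error detection: the same name may not be declared twice.
        if name in self._entries:
            raise NameError(f"'{name}' is already declared")
        self._entries[name] = type_name

    def lookup(self, name):
        # Error detection: a name must be declared before it is used.
        if name not in self._entries:
            raise NameError(f"'{name}' is used but never declared")
        return self._entries[name]


table = SymbolTable()
table.declare("variable1", "int")    # from the declaration: int variable1;
print(table.lookup("variable1"))

try:
    table.lookup("variable2")        # never declared
except NameError as error:
    print("semantic error:", error)
```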
2.2. Synthesis of the object program
The final stages of translation are concerned with the construction of the executable program
from the outputs produced by the semantic analyzer. The final result is the executable code of the
program. It is obtained in three main steps:
Optimization - The semantic analyzer ordinarily produces as output the translated program
represented in some intermediate code. Code optimization applies rules and algorithms to the
intermediate and/or assembler code in order to make it more efficient, i.e. faster and smaller.
Code generation - After the translated program in the internal representation has been optimized, it must be
formed into assembly language statements, machine code, or another object program
form that is to be the output of the translation. This step generates assembler commands with relative
memory addresses for the separate program modules, yielding the object code of the program.
Linking and loading - References to external data or other subprograms are resolved to addresses,
yielding the executable code of the program. In this optional final stage of translation, the pieces of
code resulting from separate translations of subprograms are coalesced into the final executable program.
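As one concrete instance of an optimization rule, the sketch below performs constant folding: subexpressions whose operands are all constants are computed at translation time. The tuple form (operator, left, right) is an assumed internal representation invented for this example, and fold_constants is a hypothetical name.

```python
import operator

OPERATIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(node):
    """Return an equivalent tree with constant subexpressions precomputed."""
    if not isinstance(node, tuple):
        return node  # a leaf: a numeric constant or a variable name
    op, left, right = node
    left = fold_constants(left)
    right = fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPERATIONS[op](left, right)
    return (op, left, right)


# x + 3 * 4: the product involves only constants, so it is folded to 12.
print(fold_constants(("+", "x", ("*", 3, 4))))
```

The optimized tree needs one operation at run time instead of two, which is the "faster and smaller" goal on a very small scale.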
2.3. Bootstrapping
The compiler for a given language can be written in the same language. The process is based on
the notion of a virtual machine.
A virtual machine is characterized by the set of operations, assumed to be executable by the
machine.
Assume we have:
A real machine (at the lowest level, with machine code operations implemented in
hardware)
A firmware machine (the next level - its set of operations is the assembly language operations, and the
program that translates them into machine operations is stored in a special read-only memory)
A virtual machine for some internal representation (this is the third level, and there is a
program that translates each operation into assembler code)
A compiler for the language L (some language) written in L (the same language)
The translation of the compiler into the internal representation is done manually - the
programmer manually re-writes the compiler into the internal representation. This is done once
and though tedious, it is not difficult - the programmer uses the algorithm that is encoded into the
compiler.
From there on the internal representation is translated into assembler and then into machine
language.
Chapter Four: Data Types
A data type defines a collection of data values and a set of predefined operations on those
values. Computer programs produce results by manipulating data. An important factor in
determining the ease with which they can perform this task is how well the data types available
in the language being used match the objects in the real-world of the problem being addressed.
Therefore, it is crucial that a language supports an appropriate collection of data types and
structures.
The type declarations in a program document information about its data, which provides clues
about the program’s behavior. The type system of a programming language defines how a type is
associated with each expression in the language and includes its rules for type equivalence and
type compatibility. Certainly, one of the most important parts of understanding the semantics of a
programming language is understanding its type system.
1 Primitive Data Types
Data types that are not defined in terms of other types are called primitive data types. Nearly all
programming languages provide a set of primitive data types. Some of the primitive types are
merely reflections of the hardware, for example, most integer types. Others require only a little
non-hardware support for their implementation.
To provide the structured types, the primitive data types of a language are used, along with one
or more type constructors.
1.1 Numeric Types
Many early programming languages had only numeric primitive types. Numeric types still play a
central role among the collections of types supported by contemporary languages.
1.1.1 Integer
The most common primitive numeric data type is integer. Many computers now support several
sizes of integers. These sizes of integers, and often a few others, are supported by some
programming languages. For example, Java includes four signed integer sizes: byte, short, int,
and long. Some languages, for example, C++ and C#, include unsigned integer types, which are
simply types for integer values without signs. Unsigned types are often used for binary data.
A signed integer value is represented in a computer by a string of bits, with one of the bits
(typically the leftmost) representing the sign. Most integer types are supported directly by the
hardware. One example of an integer type that is not supported directly by the hardware is the
long integer type of Python 2 (F# also provides such integers). Values of this type can have
unlimited length.
In Python 2, long integer values can be specified as literals with an L suffix, as in the following example:
243725839182756281923L
Integer arithmetic operations in Python 2 that produce values too large to be represented with the int
type store them as long integer type values. In Python 3 the two types were merged: the single int
type has unlimited length, and the L suffix is no longer accepted.
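A short sketch of integers that outgrow the hardware word, written for Python 3, where the literal is written without any suffix. The variable names are chosen only for this illustration.

```python
big = 243725839182756281923           # too large for a 64-bit hardware integer
largest_signed_64 = 2**63 - 1

print(big > largest_signed_64)        # the value does not fit in 64 bits
print(big.bit_length())               # number of bits needed for the magnitude

# Arithmetic never overflows; the result simply grows as needed.
product = largest_signed_64 * largest_signed_64
print(product)
```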
A negative integer could be stored in sign-magnitude notation, in which the sign bit is set to
indicate negative and the remainder of the bit string represents the absolute value of the number.
Sign-magnitude notation, however, does not lend itself to computer arithmetic. Most computers now
use a notation called twos complement to store negative integers, which is convenient for addition
and subtraction. In
twos-complement notation, the representation of a negative integer is formed by taking the
logical complement of the positive version of the number and adding one. Ones-complement
notation is still used by some computers. In ones-complement notation, the negative of an integer
is stored as the logical complement of its absolute value. Ones-complement notation has the
disadvantage that it has two representations of zero. See any book on assembly language
programming for details of integer representations.
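The two rules just described can be checked on 8-bit values with a minimal sketch. The function names and the choice of 8 bits are assumptions made for the example.

```python
def ones_complement_negate(value, bits=8):
    """Negate by taking the logical complement of every bit."""
    mask = (1 << bits) - 1
    return value ^ mask


def twos_complement_negate(value, bits=8):
    """Negate by complementing every bit and then adding one."""
    mask = (1 << bits) - 1
    return ((value ^ mask) + 1) & mask


print(format(5, "08b"))                           # +5
print(format(ones_complement_negate(5), "08b"))   # -5 in ones complement
print(format(twos_complement_negate(5), "08b"))   # -5 in twos complement

# Ones complement has a second, "negative" zero; twos complement does not.
print(format(ones_complement_negate(0), "08b"))
print(format(twos_complement_negate(0), "08b"))

# Adding +5 and -5 as plain unsigned bit patterns gives zero in twos complement.
print((5 + twos_complement_negate(5)) & 0xFF)
```

The last line shows why the notation is convenient: the same adder circuit works for signed and unsigned values.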
1.1.2 Floating-Point
Floating-point data types model real numbers, but the representations are only approximations
for many real values. For example, neither of the fundamental numbers pi or e (the base for the
natural logarithms) can be correctly represented in floating-point notation. Of course, neither of
these numbers can be accurately represented in any finite space. On most computers, floating
point numbers are stored in binary, which exacerbates the problem. For example, even the value
0.1 in decimal cannot be represented by a finite number of binary digits. Another problem with
floating-point types is the loss of accuracy through arithmetic operations. For more information
on the problems of floating-point notation, see any book on numerical analysis.
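The approximation problem is easy to observe. The sketch below assumes the usual case in which Python's float is an IEEE 754 double; it uses the standard decimal and math modules to expose the value actually stored for 0.1.

```python
import math
from decimal import Decimal

# Decimal(0.1) shows the exact binary value stored for the literal 0.1.
stored = Decimal(0.1)
print(stored)
print(stored == Decimal("0.1"))      # the stored value is not exactly 0.1

# Small representation errors surface in arithmetic.
total = 0.1 + 0.2
print(total == 0.3)

# Comparisons are therefore usually made with a tolerance.
print(math.isclose(total, 0.3))
```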
Floating-point values are represented as fractions and exponents, a form that is borrowed from
scientific notation. Older computers used a variety of different representations for floating-point
values. However, most newer machines use the IEEE Floating-Point Standard 754 format.
Language implementors use whatever representation is supported by the hardware. Most
languages include two floating-point types, often called float and double. The float type is the
standard size, usually being stored in four bytes of memory. The double type is provided for
situations where larger fractional parts and/or a larger range of exponents is needed. Double-
precision variables usually occupy twice as much storage as float variables and provide at least
twice the number of bits of fraction.
The collection of values that can be represented by a floating-point type is defined in terms of
precision and range. Precision is the accuracy of the fractional part of a value, measured as the
number of bits. Range is a combination of the range of fractions and, more important, the range
of exponents.
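The difference in precision between the four-byte and eight-byte formats can be seen by packing a value into each and reading it back. This sketch uses the standard struct module, whose "f" and "d" format codes denote IEEE 754 single and double precision; round_trip is a hypothetical helper name.

```python
import struct


def round_trip(value, fmt):
    """Store a value in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


print(struct.calcsize("<f"), struct.calcsize("<d"))   # storage sizes in bytes

as_float = round_trip(0.1, "<f")
as_double = round_trip(0.1, "<d")

print(as_float)     # fewer fraction bits, so a coarser approximation of 0.1
print(as_double)
print(abs(as_float - 0.1))
```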
Boolean Types
Boolean types are perhaps the simplest of all types. Their range of values has only two elements:
one for true and one for false. They were introduced in ALGOL 60 and have been included in
most general-purpose languages designed since 1960.
Boolean types are often used to represent switches or flags in programs. Although other types,
such as integers, can be used for these purposes, the use of Boolean types is more readable. A
Boolean value could be represented by a single bit, but because a single bit of memory cannot be
accessed efficiently on many machines, they are often stored in the smallest efficiently
addressable cell of memory, typically a byte.
Character Types
Character data are stored in computers as numeric codings. Traditionally, the most commonly
used coding was ASCII (American Standard Code for Information Interchange), a 7-bit code
usually stored one character per 8-bit byte, which uses the values 0 to 127 to code 128 different
characters. ISO 8859-1 is an 8-bit character code that allows 256 different characters. Unicode
includes the characters from most
of the world’s natural languages. For example, Unicode includes the Cyrillic alphabet, as used in
Serbia, and the Thai digits. The first 128 characters of Unicode are identical to those of ASCII.
Java was the first widely used language to use the Unicode character set. Since then, it has found
its way into JavaScript, Python, Perl, C#, and F#.
To provide the means of processing codings of single characters, most programming languages
include a primitive type for them. However, Python supports single characters only as character
strings of length 1.
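These points can be checked directly in Python. The sketch uses only built-in functions; the variable names are chosen for the illustration, and "\u0416" is the escape for one Cyrillic capital letter.

```python
letter = "A"
print(ord(letter))                   # the numeric coding of the character
print(chr(65))                       # and back again

# The first 128 Unicode characters coincide with ASCII, so the ASCII and
# UTF-8 encodings of such a character are the same single byte.
print(letter.encode("ascii") == letter.encode("utf-8"))

cyrillic = "\u0416"                  # a Cyrillic capital letter
print(len(cyrillic.encode("utf-8"))) # needs more than one byte in UTF-8

# There is no separate character type: indexing a string yields a string.
first = "Java"[0]
print(type(first).__name__, len(first))
```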
Structure Data Type
Composite data types are language components which can hold more than a single data value at
the same time. The most common form of composite data type is the array; however, the array
data type is somewhat restrictive inasmuch as every data value within a given
array must be of the same type (it is possible to have an array of int values, or an array of float
values, but not an array of int and float values).
Sometimes it is useful to be able to collect several different data values (often of different types)
together as a single item. One of the most common examples of this is the record structure, used
to encapsulate multiple pieces of information about an individual (person, event, thing). Coding
this type of structure requires something a bit more complex than a simple array or list of a single
data type sharing a common name.
Arrays
An array is a contiguous set of storage locations set aside to hold one type of data. An array
is simply a means whereby we can store values of the same type by using one generic name.
The list of items is stored linearly, hence the items can be accessed by their relative position in
the list.
Arrays are regarded as real objects in Java. Storage space is not allocated for an array
during compilation time, but rather during execution time. This means that the declaration alone
does not imply that memory is allocated.
Arrays are specified with curly brackets, e.g., "{1, 2, 3}" is an array of int, while
"{"x", "y", "z"}" is an array of string. The types are denoted "{int}" and "{string}"
respectively. An array is an ordered list of tokens of any type, with the only constraint being that
the elements all have the same type. If an array is given with mixed types, the expression
evaluator will attempt to losslessly convert the elements to a common type. Thus, for example,
{1, 2.3}
has value
{1.0, 2.3}
String
The composite type String has a unique feature. On the one hand it resembles an ordinary
data type, but in fact it is an object represented by a class called String. The data representation
of a string is a finite sequence of characters, be they letters, digits, or special characters. We will
study strings from both perspectives: treated as an ordinary type, and treated in its full form, as a
class.
Declaring and Initializing String Variables
Treated as an ordinary type, a string variable is declared and initialized the same way
variables of primitive types are declared. That is, the declaration follows the pattern:
String myString;
where String is the data type and myString is the name of the variable.
String values are not treated the same way as character and numeric arrays. When assigning a
value to a string variable, the value to be assigned must be enclosed within double quotation
marks. For example, the following is a valid declaration and assignment:
String myString;
myString = "I am learning Java";
Alternately,
String myString = "I am learning Java";
Two strings can be joined by use of the plus ( + ) symbol to form a third string. Hence if str1,
str2, and str3 are string objects, then the following statement joins str2 to the tail of str1
and stores the result in str3.
str3 = str1 + str2;
This is the only operation that can be performed on strings without looking to the class String.
Record Types
A record is an aggregate of data elements in which the individual elements are identified
by names and accessed through offsets from the beginning of the structure.
There is frequently a need in programs to model a collection of data in which the
individual elements are not of the same type or size. For example, information about a college
student might include name, student number, grade point average, and so forth. A data type for
such a collection might use a character string for the name, an integer for the student number, a
floating point for the grade point average, and so forth. Records are designed for this kind of
need. The elements of a record are of potentially different sizes and reside in adjacent memory
locations.
In some languages that support object-oriented programming, data classes serve as
records. In C, C++, and C#, records are supported with the struct data type. In C++, structures
are a minor variation on classes. In C#, structs are also related to classes, but are also quite
different. C# structs are stack-allocated value types, as opposed to class objects, which are heap-
allocated reference types. Structs in C++ and C# are normally used as encapsulation structures,
rather than data structures. Structs are also included in ML and F#. In Python and Ruby, records
can be implemented as hashes, which themselves can be elements of arrays.
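The college student record described above can be sketched in Python in two ways: as a hash (a dict) and as a data class. The field names and sample values are invented for the illustration.

```python
from dataclasses import dataclass

# A record as a hash: field names are the keys.
student_hash = {
    "name": "A. Student",            # character string
    "student_number": 1024,          # integer
    "grade_point_average": 3.6,      # floating point
}


# The same record as a data class: the fields and their types are declared once.
@dataclass
class Student:
    name: str
    student_number: int
    grade_point_average: float


student_record = Student("A. Student", 1024, 3.6)

# Fields are referenced by name, not by a numeric index.
print(student_hash["grade_point_average"])
print(student_record.grade_point_average)

# Hashes can themselves be elements of an array (a Python list).
class_list = [student_hash, {"name": "B. Student", "student_number": 1025,
                             "grade_point_average": 3.1}]
print(class_list[1]["name"])
```

The data class form documents the expected fields up front, while the hash form accepts any key, so a misspelled field name in an assignment to the hash would go unnoticed.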
The fundamental difference between a record and an array is that record elements, or
fields, are not referenced by indices. Instead, the fields are named with identifiers, and references
to the fields are made using these identifiers. Another difference between arrays and records is
that records in some languages are allowed to include unions.
In Java and C#, records can be defined as data classes, with nested records defined as
nested classes. Data members of such classes serve as the record fields.
Records and arrays are closely related structural forms, and it is therefore interesting to
compare them. Arrays are used when all the data values have the same type and/or are processed
in the same way. This processing is easily done when there is a systematic way of sequencing
through the structure. Such processing is well supported by using dynamic subscripting as the
addressing method.
Records are used when the collection of data values is heterogeneous and the different
fields are not processed in the same way. Also, the fields of a record often need not be processed
in a particular order. Field names are like literal, or constant, subscripts. Because they are static,
they provide very efficient access to the fields. Dynamic subscripts could be used to access
record fields, but it would disallow type checking and would also be slower.
Records and arrays represent thoughtful and efficient methods of fulfilling two separate
but related applications of data structures.
Declaring A Record Structure Data Type
The structure being described must be given a "name" (so that, later on, variables can be defined
using this new type), and the components which make up its internal structure must be identified.
The sub-structural components will normally be basic, built-in data types; each must be
identified with its own name and with its data type.
The form for describing a structure, called the structure declaration is composed of:
1. the keyword struct
2. the name to be used for this new structure data type
3. a left brace bracket ({)
4. a list of the component declaration statements, each composed of:
the component's data type; followed by the name of the component within the structure;
(followed by an array size, in square brackets, if the component is an array);
and terminated with a semi-colon (;)
5. a terminating right brace bracket (}) followed by a semi-colon (;)
For example, a structure declaration for a new data type which could be used to describe a room
within an institution, such as a college building, might look like:
struct roomrec
{
char roomNum[7]; /* null-terminated char array */
char roomType;
int capacity;
};
Notice that although the "component declarations" within a structure declaration look like
variable definition statements, they are not variable definition statements. No space is reserved
to hold data by these statements; they are simply "descriptions" or declarations of what each
"component" should look like, if a variable of this new structure type were defined. As a
particular note, since no actual space is being set aside for these components within the structure
declaration, C and versions of C++ before C++11 do not allow initial values to be assigned to such
structural components; for example:
(within the structure declaration above)
int capacity = 0; is not valid in those languages! (C++11 and later accept it as a default member initializer.)
Structure declarations which describe a new composite data type that might be commonly
used by a collection of programs within some larger system are often stored in separate
"header" (.h) files, which can then be included in each program that requires variables of that
type.
Defining Variables of A Declared Record Structure
After a specific structure data type has been declared, it is possible to define variables of this new
type. The syntax for a structure variable definition is basically the same as for any other type of
variable.
For example, assuming the roomrec structure declaration (above), it is possible to define a
variable, classRoom, as:
roomrec classRoom;
(in C, as opposed to C++, this must be coded as struct roomrec classRoom; unless a typedef name
has been introduced for the structure)
Notice that structure variable definitions cannot be coded until after the code for the structure
declaration.
Referencing Elementary Values Within A Record Structure Variable
While structure variables are often read from or written to files as single, composite data items,
most of the processing is performed on the sub-structure elementary data fields. To work with a
"sub-structure elementary data field" it is necessary to specify both:
the name of the structure variable, and
the component name from within the structure declaration.
These two names are combined into a single name, first the "variable name" and then the
"component name", separated by a period (.); as with any name in C++, there can be no spaces in
this two part name.
Using the examples from above:
structure declaration:
struct roomrec
{
char roomNum[7]; /* null-terminated char array */
char roomType;
int capacity;
};
and variable definition:
roomrec classRoom;
then if we wanted to check whether the room described by the data in the classRoom record
structure had a capacity of more than 40 people, we would code:
if (classRoom.capacity > 40) ....
Nested Record Structures
Sometimes it is desirable to code a sub-structure component as being a structure itself, instead of
a simple built-in data type. This is permissible. For example, it is often desirable to always use
the same format for all dates within a system; since there is no built-in type date, the easiest way
to accomplish this is to declare a new structured data type:
struct dayt
{
int year;
int month;
int day;
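};
(notice the unusual spelling "dayt"; "date" is not a reserved word in C++, but a distinct name avoids
confusion with other identifiers called date that a system may already define).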
The structure of an employee record for some company, might then use this new structure type
(possibly multiple times):
struct employee
{
int employeeNumber;
// (other field declarations)
dayt dateHired;
// (some more field declarations)
dayt dateOfBirth;
// (etc.)
};
If a record structure variable were defined, for example as:
employee janitorStaff;
and data were then "read" into the janitorStaff record, the year in which that employee
was hired could be displayed using:
cout << janitorStaff.dateHired.year;