Compiler Techniques & Tools For Embedded Processor Architectures
The document discusses compiler techniques and tools for embedded processor architectures. It notes trends toward application-specific instruction set architectures (ASIPs) tailored for tasks like multimedia, wireless communications, and telecom. Customizable architectures can offer performance gains and lower costs compared to general purpose processors. Examples are given of embedded architectures developed specifically for applications like mobile phones, set-top boxes, and video teleconferencing that achieve better results than commercial off-the-shelf processors. Embedded software development tools must be tailored differently than for general purpose processors due to architectural variability across embedded domains.
COMPILER TECHNIQUES & TOOLS
FOR EMBEDDED PROCESSOR
ARCHITECTURES
PROF B ABDUL RAHIM AITS RAJAMPET 1
• Compiler technology and firmware development tools are becoming a key differentiator in the design of embedded-processor-based systems
Trends in embedded processor architectures
• High-performance applications such as multimedia, wireless communications and telecom require new design methodologies
• The key trend in embedded CPUs is the application-specific instruction set processor (ASIP), a processor designed for a particular application
• Advances in VLSI technology are pushing designers to implement complete systems on silicon
• As higher levels of integration make current design practices more complex, reuse becomes a key methodology
• The focus here is on design tool needs for deeply embedded processors, such as DSPs and MCUs, used in the wireless, telecom and multimedia domains, where:
– The system runs embedded firmware and requires real-time response
– Low cost and low power are of critical importance
– Products are sold in high volumes
– The processor can be used on-chip as an embedded core
MODERN EMBEDDED ARCHITECTURES
• Increasing product complexity is driving architectures towards greater reliance on CPUs
• Furthermore, customized cores are increasingly used to meet cost/performance goals
• The market for workstation/PC microprocessors is driven by compatibility
• Embedded architectures show a different picture: custom programmable architectures tuned to application types (ASIPs)
• Multimedia applications include MPEG-1 and MPEG-2 encoding and decoding, Dolby AC-3 and video telephones
• Many companies are developing custom programmable solutions
Ex: Philips TM-1, a custom VLIW processor for MPEG-1 and MPEG-2 encoding and decoding
• Custom programmable architectures are popular in multimedia systems for three main reasons:
1. A tailored architecture is more cost-effective than a general-purpose architecture, and cost-reduced designs are a major source of revenue
2. Special-purpose memories and datapaths give considerable performance gains
3. Instruction set compatibility is not an issue for a chip which executes a small set of specialized routines, so the instruction set can be tuned to save memory and/or increase performance
• The increasing demands of multimedia require a higher-end x86 processor (for example, one with the MMX multimedia extensions, aimed at high-quality game programs and professional-level 3D graphics), whereas the same function can be offered more cost-effectively by a dedicated multimedia processor
• A number of custom DSPs are used in the design of GSM handsets and base stations
Ex:
– CNET, the R&D group of France Telecom, and Alcatel designed a custom DSP solution that consumes half the power of a commercial DSP solution
– Italtel uses two in-house VLIW DSPs to replace a GSM base-station equalization system that required eight commercial DSPs
– The 109-bit instruction word of the VLIW machine controls many parallel execution units, giving high throughput
– Northern Telecom (Nortel) designed a custom DSP for a key system unit (i.e., a local telephone switch)
• The ASIP outperforms standard DSPs in power and speed because it needs fewer buses and multiplexers
Summary:
• Performance: a custom solution reduces the number of components and the complexity of the communication between components
• Power: for handheld applications low power is essential; a custom architecture carries no unused extra features that draw power unnecessarily
• Cost: wireless and telecom applications cannot afford the luxury of multi-chip solutions, so designers turn to custom chips
Examples of emerging architectures
• Commercial programmable DSPs, characterized by:
– A Harvard architecture with separate program and data memories
– A fixed-point core
– Fixed I/O peripherals
– A highly encoded instruction set tuned to multiply-accumulate operations (used for convolution algorithms, common in DSP)
• VLIW processors: architectures characterized by a network of horizontally programmable execution units
• ASIPs: processors specific to a particular domain
– An ASIP is optimized for the performance, speed and power characteristics of the application
• These may be regarded as base categories, but in practice a processor may combine characteristics of more than one of them
• In multimedia, memory manipulation and operation throughput are important criteria
• The parallelism of a VLIW architecture can provide the needed speed, while an orthogonal instruction set also makes the architecture easier to compile for
• The MMDSP is a processor core designed at Thomson Consumer Electronics Components (TCEC)
– It performs MPEG-2, Dolby AC-3 and Dolby Pro Logic audio encoding [an evolution of the architecture of an MPEG-1 audio decoder]
– It is used in the satellite-to-set-top-box DirecTV application
– Other applications for this architecture include:
• DVD (Digital Versatile Disc)
• Multimedia PC
• HDTV (High Definition Television)
• High-end audio equipment
• This is an example of a high-volume, low-cost multimedia application
• It is based on a load/store Harvard architecture
• Communication between the ALU, the ACU (address calculation unit) and the memories is through a bus
• The controller is a standard pipelined decoder with common branching capabilities, interrupt capability and hardware do-loop capability
• Three sets of registers provide three nesting levels of hardware loops
• The memory structure was developed around the data types needed by the applications
• The multiply-accumulate unit was designed around the time-critical inner-loop functions of the application
• The very large instruction word format (61 bits) allows the parallelism required to meet the performance demands of the MPEG-2 and Dolby AC-3 audio standards
• The SGS-Thomson integrated video telephone illustrates an approach to macro-block reuse of an ASIP core
• This system-on-chip contains block operators which communicate through a set of buses
• The design increased the number of embedded processor cores from two (in the previous-generation video codec) to six
• Four of the ASIP blocks are based on the same controller architecture:
– The MSQ (micro sequencer)
– The BSP (bit stream processor)
– The VIP (VLIW image processor)
– The HME (hierarchical motion estimator)
• These operators communicate through a bus interface protocol
• The compiler for each ASIP could be based on an original compiler, modified according to the needs of each new block
• Key to this design is the significant reuse of a template control structure and bus communication protocol
• One of the most complete portfolios of DSP offerings is that of Motorola
• The Motorola 56K series of DSPs is categorized into five main families:
– DSP56000 for digital audio applications
– DSP56100 for wireless and wireline communications
– DSP56300 for wireless infrastructure and high-MIPS applications, including Dolby AC-3 encoders
– DSP56600 for the wireless subscriber market
– DSP56800 for low-cost consumer applications
• These architectures resemble one another but have different characteristics depending on the target application and its area, cost, performance and power requirements
• From this it is clear that the embedded processor market has no single product with a fixed architecture
• Each architecture is specialized for its application domain:
– The architecture is refined for the type of algorithm to be executed
– Instruction sets are encoded to reduce program memory
• To illustrate one part of the Motorola DSP architecture, consider the address calculation unit (ACU) of the Motorola 56000
– The ACU has two halves, one for each of the two data memories, X and Y
– Each half has its own arithmetic unit and its own set of registers
– The halves drive the X and Y address buses, XAB and YAB
– The ACU operates in parallel with the central data calculation unit (DCU) and gives access to both memories
– Registers are treated as triplets (R0:N0:M0, R1:N1:M1, etc.)
Rn: address register
Nn: index register
Mn: register that determines the type of address arithmetic (linear, modulo or reverse carry)
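The three kinds of address arithmetic selected by Mn can be illustrated with a small model. The Python sketch below is a simplified illustration and not the exact DSP56000 register encoding: the function names, the default 16-bit address width, and the `base`/`size` buffer parameters are assumptions made for the example.

```python
def linear_update(r, n, width=16):
    """Linear arithmetic: R = R + N, wrapping at the address width."""
    return (r + n) & ((1 << width) - 1)


def modulo_update(r, n, base, size):
    """Modulo arithmetic: R stays inside the circular buffer [base, base + size)."""
    return base + (r - base + n) % size


def reverse_bits(value, width):
    """Mirror the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_update(r, n, width):
    """Reverse-carry arithmetic: add N to R with the carry propagating
    from the most significant bit towards the least significant bit."""
    total = reverse_bits(r, width) + reverse_bits(n, width)
    return reverse_bits(total & ((1 << width) - 1), width)
```

Modulo updates keep a pointer inside a circular buffer, which suits delay lines in filters. Repeated reverse-carry updates with N set to half the buffer length visit the buffer in bit-reversed order, the access pattern needed by FFT routines.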
Embedded software development needs
• Design tool needs for embedded processor systems are quite different from the standard ASIC design flow
• They also differ from the tools used for general-purpose processors
• The reason is the wide variety of architecture styles and customized architectures, which depend on the application domain
Commercial support of embedded processors
• The EDA industry offers tools for commercial processor cores
• These include HW/SW co-simulation offerings (co-verification of embedded processor software)
• Semiconductor companies offering commercial DSPs and MCUs rely either on compiler companies or on internal compiler development
• The quality of the code produced by commercial C compilers is not promising
• DSPStone benchmarking showed that compiled machine code runs 2 to 12 times slower than a hand-crafted program
• For low-cost, fixed-point, register-poor DSPs, it is difficult to develop efficient compilers
• Even compilers for high-end, floating-point DSPs need further work
• If the designer falls back on assembly code, designs become locked to old architectures, so flexibility is an issue
Design tool requirements
• Compilers are the number-one need, followed by instruction set simulation, multilevel co-simulation and source-level debugging (Northern Telecom)
• The principal set of design tools envisioned for embedded processor systems is shown in the figure
• The core technology is retargetable compilation, which is driven by an instruction set specification
– Modifications to this specification allow changes to the processor
– An instruction set simulator, or a hardware description of the processor itself, could be generated from the specification
– The instruction set specification is the principal means of supporting a variety of architectures as well as processor evolution and reuse
• The host compiler (e.g., on a workstation or PC) serves multiple purposes:
1. Early functional verification of the source algorithm, even before the processor design is available; in addition, standard debugging tools are available
2. Validation of the retargetable compiler by running a test suite on both the host and the instruction set simulator and comparing the results
• Together, the retargetable compiler and the host compiler are used for:
– Debugging in various forms
– Architecture exploration
– Algorithm exploration
– Execution-based optimization strategies for the retargetable compiler
• In addition to these tools, a number of further technologies are important for HW/SW co-design of embedded systems. These include:
– HW/SW estimation and partitioning
– HW/SW co-simulation (VHDL-C co-simulation)
– High-level synthesis of hardware
– Processor design and synthesis
• Lastly, an RTOS is needed for run-time scheduling of tasks
Compilation technologies
• Compilation for embedded processors has been converging from two main areas:
1. Software compilation for general-purpose microprocessors
2. High-level synthesis for ASICs
• Aho, Sethi and Ullman define compilation as the translation of a program in a source language (e.g., C) to an equivalent program in a target language (e.g., assembly language or absolute machine code)
• This translation is typically decomposed into a series of phases, as shown in the figure
• Lexical analysis: tokenizes the program source
• Syntax analysis: parses the program into grammatical units
– Produces an intermediate representation of the source code as syntax trees
– In each tree, a node represents an operation (=, +)
• Semantic analysis: checks the meaning of the program
– Semantic checks include:
• Type checking
• Flow-of-control checking
• Symbol name checking
• Intermediate code generation: following these phases, many compilers produce intermediate code
– This is code for an abstract or virtual machine
– It is typically three-address code (two sources and one destination)
• Optimization:
– Range: local to global optimization
– Gain: high-gain, high-risk transformations
– Compilation is unsatisfactory without optimization
• Code generation: code in the intermediate form is translated to assembly code for the target
– Storage locations (registers and memory) are chosen for variables
– The generated code is suitable for running on the target machine
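The step from syntax tree to three-address code can be sketched in a few lines. The example below is a minimal illustration, assuming expression trees are given as nested tuples `(op, left, right)`; the function name `to_three_address` and the temporary names `t1`, `t2`, ... are choices made for this sketch, not part of any particular compiler.

```python
import itertools


def to_three_address(tree):
    """Flatten a nested expression tuple (op, left, right) into three-address code.

    Leaves are variable names or constants. Returns (result_name, code), where
    code is a list of (dest, op, src1, src2) tuples.
    """
    counter = itertools.count(1)
    code = []

    def visit(node):
        if not isinstance(node, tuple):
            return node  # a leaf: variable name or constant
        op, left, right = node
        src1 = visit(left)
        src2 = visit(right)
        dest = f"t{next(counter)}"  # fresh temporary for this operation
        code.append((dest, op, src1, src2))
        return dest

    result = visit(tree)
    return result, code
```

For the expression a + b * c, written as `("+", "a", ("*", "b", "c"))`, the sketch emits the multiplication into `t1` first and then the addition into `t2`, so every instruction has two sources and one destination.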
• Software tool designers encounter several major problems when adapting the traditional compilation model to embedded processors
• Retargetability:
– In the traditional model, retargeting to a new architecture is confined to the final code generation phase
– The intermediate code must resemble the final target to produce efficient code
– If the instruction set of the final target differs from that of the virtual machine, it can be difficult to produce efficient target code
– Defining a general form of intermediate code is troublesome because embedded processor instruction sets vary widely
• Register constraints:
– Embedded processors often contain a number of special-purpose registers
– Registers are reserved for special functions mainly to keep instruction widths low
– Instruction width translates directly into costly on-chip program memory
– The constraints on register assignment can affect all phases of compilation
• Arithmetic specialization:
– Three-address code artificially decomposes dataflow operations into small pieces
– Arithmetic operations which require more than three operands are not naturally handled by three-address code
– Such operations occur frequently on DSP architectures
• Instruction-level parallelism:
– Architectures with parallel execution engines require different compilation techniques
Ex: a DSP has both a DCU and an ACU
– A compiler should distribute operations over the different functional units and choose the most compact solution
• Optimization:
– Real-time embedded firmware cannot tolerate performance penalties due to poor compilation
– Efficient compilation is only possible with many optimization algorithms
– Optimizations which are restricted to intermediate code work most efficiently on a local area
– Global optimizations, which use data structure, dataflow and control flow information, are better suited to a higher level of abstraction, closer to the source program structure
Retargetability, specification languages and models
• Retargetability is a topic of active interest for compilers
• Retargetability allows the rapid set-up of a compiler for a specific processor [algorithm developers can evaluate the efficiency of application code on different existing architectures]
• Retargetability permits architecture exploration [the processor designer tunes the architecture to run efficiently for a set of source applications in a particular domain]
Compiler techniques for specialized architectures
• Languages and models provide support for retargetability
• Code quality depends on particular compiler phases
• There are three principal compiler tasks:
– Instruction set matching and selection
– Register allocation and assignment
– Instruction scheduling and compaction
• These tasks are highly interdependent, a concept known in the compiler community as phase coupling
• Determining the instructions which can implement the source code is a main task of compilation, defined in two parts:
– Instruction set matching is the process of determining a wide set of target instructions which can implement the source code
– Instruction set selection is the process of choosing the best subset of instructions from the matched set
• The matching and selection process is given varying levels of importance in different compilers; the methods fall into two groups:
– Pattern-based methods
– Constructive methods
• It is possible to translate a source program into a forest of syntax trees and then match these against a pattern set of syntax trees
• A subset of all the matched patterns is selected to perform the implementation in microcode [dynamic programming can be used to select a cover of patterns for the subject tree, but this is restricted to homogeneous register sets; embedded processors have heterogeneous register sets and instructions that are better suited to graph-based patterns]
• Tree-based pattern selection can handle heterogeneous register sets when the register constraints are encapsulated in a trellis diagram, which turns selection into a path-minimization problem
• Other pattern-based methods are:
– SPAM: the Synopsys Princeton Aachen MIT project
– The CodeSyn compiler, based on CDFGs
– The FlexCC compiler of SGS-Thomson Microelectronics
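Pattern matching followed by selection of a minimum-cost cover can be illustrated on a toy target. The sketch below assumes a hypothetical machine with single-cycle ADD and MUL instructions plus a MAC instruction that computes x + y * z in one cycle; it also assumes a homogeneous register set, so register constraints play no role. The function name `select` and the tuple tree format are choices made for the example.

```python
def select(node):
    """Return (cost, instructions) of a minimum-cost cover of an expression tree.

    Trees are nested tuples (op, left, right); leaves are operand names.
    Hypothetical target: ADD and MUL cost one cycle each, and a MAC
    instruction computes x + y * z in a single cycle.
    """
    if not isinstance(node, tuple):
        return 0, []  # operands are assumed to be available already
    op, left, right = node
    left_cost, left_code = select(left)
    right_cost, right_code = select(right)
    name = {"+": "ADD", "*": "MUL"}[op]
    # Default cover: one instruction for this node plus covers of both subtrees.
    best = (left_cost + right_cost + 1, left_code + right_code + [name])
    # Pattern +(x, *(y, z)): one MAC covers both the add and the multiply.
    if op == "+" and isinstance(right, tuple) and right[0] == "*":
        y_cost, y_code = select(right[1])
        z_cost, z_code = select(right[2])
        mac = (left_cost + y_cost + z_cost + 1,
               left_code + y_code + z_code + ["MAC"])
        if mac[0] < best[0]:
            best = mac
    return best
```

Each node's best cover is built from the best covers of its subtrees, which is the bottom-up principle behind dynamic-programming selectors. The sketch only recognizes the multiply as the right operand of the add; a realistic pattern set would also include the commuted form.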
• Mapping variables to the storage units of the architecture is a principal means of compiling efficient code
• Register allocation is the determination of a set of registers which may hold the value of a variable
• Register assignment is the determination of the physical register which is specified to hold the value of a variable
• Register allocation and assignment for embedded processors are further complicated by:
– Special-purpose registers
– Heterogeneous register files
– Overlapping register functions
• Graph coloring is used to determine the number of registers needed for a program's variables
• One approach to handling heterogeneous register sets is register classes
• The CodeSyn compiler builds upon the concept of register classes for special registers
• Register assignment that takes the architecture structure into account is known as data routing
– The aim is the best flow of data through the architecture, minimizing time
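The coloring idea can be sketched with a greedy heuristic over an interference graph. This is a minimal illustration for a single homogeneous register file; the function name `color_registers`, the dictionary graph format and the highest-degree-first ordering are assumptions of the sketch, and production allocators use more elaborate spill and coalescing strategies.

```python
def color_registers(interference, num_registers):
    """Greedy coloring of an interference graph.

    interference maps each variable to the set of variables live at the
    same time. Returns (assignment, spilled): assignment maps variables
    to register indices, spilled lists variables that received no register.
    """
    assignment = {}
    spilled = []
    # Color the most constrained variables first (highest degree).
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # would be kept in memory instead
    return assignment, spilled
```

Register classes could be modeled in the same framework by giving each variable its own list of permitted registers instead of the shared `range(num_registers)`.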
• Scheduling is the process of determining an order of execution of instructions
• Although it can be treated separately, its interdependencies with instruction selection and register allocation make it a particularly difficult problem for embedded processors
• Furthermore, machines which support instruction-level parallelism require fine-grained scheduling, also known as compaction
• The SPAM group optimally solves tree-based dataflow schedules for structures like the TI C25 architecture
• Mutation scheduling is an approach whereby different implementations of instructions can be regenerated by means of a mutation set
• After the generation of three-address code, critical paths are calculated
• Attempts are made to improve speed by identifying the instructions which lie on critical paths and mutating them to other implementations which allow the instructions to be rescheduled
• In this way the overall schedule is improved
• The advantage is that it works directly on critical paths and improves timing at a level of the code which is very close to the machine structure
• Compaction is a form of scheduling designed to increase parallelism in microinstruction words
• Compaction attempts to pack micro-operations as tightly as possible within the given constraints
• Consequently, the compaction algorithm must obey all the dataflow dependencies as well as the resource restrictions of the machine
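A greedy form of compaction can be sketched as follows. The example assumes that operations are listed in program order, that every operation takes one cycle so a result is usable in the next word, and that only true data dependencies are modeled. The function name `compact`, the unit names and the per-word capacities are hypothetical.

```python
def compact(ops, deps, unit_of, units_per_word):
    """Greedy compaction of micro-operations into instruction words.

    ops: operation names in original program order.
    deps: maps an operation to the set of operations it depends on.
    unit_of: maps an operation to the functional unit it needs.
    units_per_word: maps a functional unit to how many of its operations
    fit in one instruction word.
    Returns a list of words, each a list of operations.
    """
    words = []
    placed = {}  # operation -> index of the word holding it
    for op in ops:
        # Earliest word after all producers of this operation's inputs.
        earliest = max((placed[d] + 1 for d in deps.get(op, ())), default=0)
        index = earliest
        while True:
            if index == len(words):
                words.append([])
            used = sum(1 for o in words[index] if unit_of[o] == unit_of[op])
            if used < units_per_word[unit_of[op]]:
                break  # the required functional unit is still free here
            index += 1
        words[index].append(op)
        placed[op] = index
    return words
```

With two memory ports, one DCU slot and one ACU slot per word, two loads and an independent address update share the first word, while the dependent multiply and add each occupy a later word.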
Optimizations for embedded processors
• Optimization theory is not well understood for embedded real-time architectures
• One reason is the immaturity of standard mapping techniques for embedded architectures
• Another is the number of constraints that embedded processors impose on standard optimization techniques
• Existing standard optimization techniques include:
– Constant propagation
– Constant folding
– Common subexpression elimination
– Strength reduction
• These are not as effective for embedded processors
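To make three of the standard techniques concrete, the sketch below applies constant propagation, constant folding and strength reduction to straight-line three-address code. It is a simplified illustration: only the operators `+` and `*` are handled, each destination is assumed to be assigned once, and the function name `optimize` and the tuple format are choices made for the example.

```python
def optimize(code):
    """Apply constant propagation, constant folding and strength reduction
    to straight-line three-address code.

    code is a list of (dest, op, src1, src2) tuples; operands are variable
    names (str) or integer constants. Each dest is assumed to be assigned
    only once. Returns (remaining_code, known_constants).
    """
    constants = {}
    result = []
    for dest, op, a, b in code:
        # Constant propagation: replace operands with known values.
        a = constants.get(a, a)
        b = constants.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            # Constant folding: evaluate at compile time.
            constants[dest] = a + b if op == "+" else a * b
            continue
        if op == "*" and isinstance(b, int) and b > 0 and b & (b - 1) == 0:
            # Strength reduction: multiply by 2**k becomes a left shift by k.
            op, b = "<<", b.bit_length() - 1
        result.append((dest, op, a, b))
    return result, constants
```

For the sequence t1 = 3 + 5; t2 = x * t1, the first instruction is folded away and the second becomes a shift of x by three positions.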
Solutions:
• An effective compiler could apply a set of optimizations based on the characteristics of the architecture
• Provisions could be made so that the programmer can control where and when optimizations are applied
• In real-time applications, loops are the critical regions of the code, since most of the time is spent in them
• Streamlining the retrieval of data from, and the storage of data to, memory elements can produce substantial gains
• Loop pipelining (software pipelining) improves the instruction-level parallelism of code within loops
• Memory optimizations are important because program memory is an expensive part of an embedded architecture
• Narrowing instruction words through encoding creates recurring difficulties for compiler methods
• One example is the limited range of absolute program memory addresses: because instruction words are short, an instruction set which uses exclusively absolute memory addresses is limited in program size
Solution:
– Provide near and far program calls and branches
– Organize program memory as a set of pages; paged memory is good for hardware but creates problems for the compiler developer
– Retrieval caching: a cache is a temporary buffer which acts as an intermediary between the processor and program or data memory, exploiting the local nature of the data
Practical considerations in a compiler development environment
The techniques presented so far form the basis of a compiler system, but many other factors remain to be discussed:
• Language support: which features of a programming language should be provided to the user?
• Coding style: what level of abstraction in coding style should be supported, and what are the trade-offs?
• Validation: what level of confidence will be provided with a retargeted compiler?
• Source-level debugging: how does debugging on the host fit in with debugging on the target?
• Architecture and algorithm exploration: how well do the architecture and instruction set fit the application?
• Language support is one of the most visible choices to the designer, since the language provides the interface to the machine
• The language of choice is C because of the wide availability of compilers and tools (linkers, libraries, debuggers, etc.); furthermore, ISO and ITU provide executable models in C
• C's limitations as an embedded programming language include:
– Limited word-length support: fixed-point support in C is typically limited to 8 bits (char), 16 bits (short int) and 32 bits (long int). This is sufficient for speech processing but insufficient for audio processing (24 bits) and image processing
– A limited set of storage classes: in DSP systems, in addition to multiple register files, there are two or more data memories as well as program memory, and specific addresses are needed to distinguish memory-mapped I/O. ANSI C provides only the auto, static, extern and register storage classes, which is insufficient to give the user control over where data is placed
– A fixed set of operators: embedded systems may have hardware operators which do not correspond to the classical operations found in C
– Separate compilation and linking: C allows modules to be compiled separately and then linked together in a separate phase; because register resources are limited, this makes optimization difficult
Solution:
• A subset of C could be chosen to allow certain optimizations
• A minimal extension to the C language provides the desired features
• Another candidate language is C++, whose programming benefit is its object-oriented capabilities
• Java is a further alternative: it is much simpler, but needs further progress
Four levels of C coding style can be distinguished:
1. High-level behavioral ANSI C: this level is characterized by the use of variables, arrays, structures and all the operations available in C. It is the goal for compiler technology and provides the most abstract and portable source description, aiming at optimization and retargeting capabilities. The programmer writes algorithms without knowing the hardware
2. Mid level: this level allows the use of built-in functions. Any arrays or structures must be accessed by pointers. Variables and pointers may be allocated to extended storage classes and register sets, which allows specific arrays to be declared in specific memories
3. Low level: this level allows variables and pointers to be assigned to specific registers, by specifying new variables allocated to register sets
4. Assembly level: this level allows the programmer to write in-line assembly directly in the C code, with assembly instructions specifying particular operations and registers
Compiler validation
• A retargetable compiler implies countless existing and even potential targets
• The embedded system designer must be assured that the compiler produces correct code
• Compiler validation is done by simulation
• Selecting a suitable test suite which covers the possible faults is an issue
• Available C test suites include:
– Plum-Hall [http://www.plumhall.com]
– Metaware [http://www.metaware.com]
• These test suites are not directly applicable to embedded processors, which use a subset of C:
– Only some data types and some operations may be supported
– Any extensions to C are not tested
The bit-true library consists of:
– Any built-in functions that are provided by the target compiler
– Any operators with data types differing from ANSI C
– Any other operations that are implemented differently on the target hardware than on the host processor
• Constructing the bit-true library involves careful handling of bit-widths with shifts and masking
• The function library contains functions which write values to a pre-defined test buffer
• After compilation and execution on both paths, the buffers are compared; any difference indicates a discrepancy in the retargetable compiler, the instruction set simulator or the bit-true library
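The kind of shift-and-mask handling involved can be sketched for a single operation. The example below assumes a 24-bit target word, as in audio DSPs, and models wrap-around, saturating addition and the comparison of the two test buffers. All names (`WIDTH`, `wrap`, `sat_add`, `first_mismatch`) are hypothetical and do not come from any actual bit-true library.

```python
WIDTH = 24  # assumed target word length, e.g. a 24-bit audio DSP


def wrap(value, width=WIDTH):
    """Reduce a Python integer to a two's-complement value of `width` bits."""
    mask = (1 << width) - 1
    value &= mask
    # Reinterpret the top bit as the sign bit.
    return value - (1 << width) if value & (1 << (width - 1)) else value


def sat_add(a, b, width=WIDTH):
    """Saturating addition, as found on many fixed-point DSP datapaths."""
    high = (1 << (width - 1)) - 1
    low = -(1 << (width - 1))
    return max(low, min(high, a + b))


def first_mismatch(host_buffer, target_buffer):
    """Return the index of the first differing test-buffer entry, or None."""
    for index, (h, t) in enumerate(zip(host_buffer, target_buffer)):
        if h != t:
            return index
    if len(host_buffer) != len(target_buffer):
        return min(len(host_buffer), len(target_buffer))
    return None
```

On the host, the same arithmetic would otherwise be carried out at the host's native word length, so results near the 24-bit limits would differ from the target; modeling the width explicitly is what makes the buffer comparison meaningful.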
• A second strategy was demonstrated for the ST Integrated Video Telephone (IVT)
– The processors of the IVT are an integral part of the entire system
– The function of each processor is validated
– The key is a co-simulation approach
– It validates both the processor compiler and the VHDL processor model [using a test bench which thoroughly exercises the C application code]
– The two principal elements are the host compiler and the target compiler
Source level debugging
• The compilation path to the host allows the use of standard source-level debugging tools
• A freely available debugger is gdb, the GNU source-level debugger distributed by the Free Software Foundation, which can be used through interfaces such as xxgdb, Emacs and ddd
• The host source-level debugging interface is reused in different modes:
– Mode 1: host debugging mode
– Mode 2: uses the instruction set simulator. This mode is principally used for verifying the retargetable compiler, although bugs can occur in the instruction set simulator or the bit-true library as well
– Mode 3: interfaces with a cycle-true model of the processor, or with the chip itself through an in-circuit emulator (ICE) interface. It allows verification of the functionality of the VHDL cycle-true model of the processor, but is extremely slow; a real-time interface with the chip itself is costly in terms of I/O pins to the exterior
Architecture and algorithm exploration
• Design exploration tools are of great use in the development of a system
• Designers need tools which provide feedback on how well a piece of application code fits an architecture, such as:
– Statistics on resource usage
– Suggestions for changes to the program to improve its fit onto the architecture
• After analyzing the instruction set and the code, the designer can use a set of editing functions to adjust the instruction set to the application code
• The designer gains by:
– Removing unused hardware
– Relieving bottlenecks in the hardware
– Removing unused instruction codes
• The tools work together with the instruction set specification used by the retargetable compiler and automatically regenerate the specification after changes
• The tools also support dynamic analysis
• Functions are available which estimate real-time performance based on execution on the host
• The same information is reused for dynamic analysis of instruction usage
• This can be used for exploring changes in the architecture
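The instruction usage analysis described above can be sketched with a simple counter. The example assumes the instruction set is available as a list of mnemonics and that the code is available as a sequence of mnemonics, either a static listing or a dynamic trace; the function name `instruction_usage` and this data format are assumptions of the sketch.

```python
from collections import Counter


def instruction_usage(instruction_set, trace):
    """Summarize how often each instruction of the set appears in a trace.

    instruction_set: iterable of mnemonics defined by the processor.
    trace: iterable of mnemonics, either the static code listing or the
    dynamic sequence recorded during a host or simulator run.
    Returns (counts, unused) where unused is a sorted list of mnemonics
    that never occur and are candidates for removal.
    """
    counts = Counter(trace)
    unused = sorted(m for m in instruction_set if counts[m] == 0)
    return counts, unused
```

Run on a static listing, the result points at instruction codes that could be removed; run on a dynamic trace, the counts show which instructions dominate execution and are worth speeding up.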