
Unit 5 DSP

The document discusses digital signal processors and their architecture. P-DSPs are designed specifically for digital signal processing with advantages over general processors. Key features of P-DSPs include hardware multipliers, modified Harvard architecture with separate program and data memory, and pipelining for improved performance.

Uploaded by

banu pradeep
Copyright
© © All Rights Reserved

UNIT-V

INTRODUCTION TO DSP PROCESSORS


The programmable digital signal processors (P-DSPs) are designed with features that are specifically
required for digital signal processing applications.

P-DSPs have an advantage over advanced microprocessors and RISC processors in terms of low power requirement, cost, real-time I/O capability and availability of high-speed on-chip memories.
MULTIPLIER AND MULTIPLIER ACCUMULATOR (MAC)
One of the most common operations required in digital signal processing applications is array
multiplication.

E.g. convolution and correlation.

An important requirement of these array multipliers is that they have to process the signals in real time.

Before the next sample of the input signal arrives at the input to the array, the array multiplication should be
completed.

This requires the multiplication as well as accumulation to be carried out using hardware elements.

There are two approaches to solve this problem.


• A dedicated MAC unit may be implemented in hardware, which integrates multiplier and accumulator in a
single hardware unit.

• The other approach is to have multiplier and accumulator separate.

In both of the above approaches, the MAC operation can be completed in one clock cycle. The presence of H/W
multipliers and/or multiplier accumulator is one of the mandatory requirements of a P-DSP.
Implementation of convolver with single multiplier/adder

A convolver with a single multiplier/adder is realized in a P-DSP by using a special instruction called MACD
(multiply accumulate with data shift). For example, the TMS320C5X has the instruction MACD pma, dma, which
multiplies the content of the program memory location with address pma by the content of the data memory
location with address dma and stores the result in the product register. The previous content of the product
register is added to the accumulator before the new product is stored. Further, the content of dma is copied
to the next location, whose address is dma + 1.
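
The MACD behaviour described above can be sketched in Python as a toy model. This is only an illustration under stated assumptions: the names `macd`, `fir_output` and the dictionary-based "machine state" are hypothetical, and the model ignores word lengths, overflow and the repeat mechanism of a real TMS320C5X.

```python
def macd(state, pma, dma):
    """One MACD step on a toy machine (hypothetical model, not TI's simulator)."""
    # 1. Add the previous product to the accumulator.
    state["acc"] += state["preg"]
    # 2. New product: program-memory coefficient times data-memory sample.
    state["preg"] = state["prog"][pma] * state["data"][dma]
    # 3. Data move: copy data[dma] to data[dma + 1] (shifts the delay line).
    state["data"][dma + 1] = state["data"][dma]


def fir_output(coeffs, delay_line):
    """Compute one convolution output sample using repeated MACD steps."""
    n = len(coeffs)
    state = {
        "acc": 0,
        "preg": 0,
        "prog": list(coeffs),             # h[0..n-1] held in "program memory"
        "data": list(delay_line) + [0],   # x[k], x[k-1], ... plus one spare word
    }
    # Walk from the oldest sample to the newest so that each copy to dma + 1
    # overwrites a sample that has already been used.
    for i in range(n - 1, -1, -1):
        macd(state, pma=i, dma=i)
    state["acc"] += state["preg"]         # flush the last product
    return state["acc"], state["data"][1:n + 1]
```

For coefficients 1, 2, 3 and samples 4, 5, 6 the accumulator ends at 1·4 + 2·5 + 3·6 = 32, and the samples have each moved up by one location, leaving the first location free for the next input sample.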
MODIFIED BUS STRUCTURES AND MEMORY ACCESS SCHEMES IN P-DSPs

An instruction cycle is the time that elapses from the fetching of an instruction until that instruction completes
execution, including the time taken for writing the result into a register or memory.

MAC operation with data move (i.e. the MACD instruction) requires four memory accesses per instruction cycle.

The four memory accesses/clock period required for the MACD instruction are as follows:
1. Fetch the MACD instruction from the program memory
2. Fetch one of the operands from the program memory
3. Fetch the second operand from the data memory
4. Write the content of the data memory with address dma into the location with the address
dma + 1

The relatively static impulse response coefficients are stored in the program memory and the
samples of the input data are stored in the data memory.

If the MACD instruction is to be executed in a machine with Von Neumann architecture, it requires
four clock cycles. This is because in the Von Neumann architecture shown, there is a single address bus and a single
data bus for accessing the program as well as the data memory area.
One of the ways in which the number of clock cycles required for the memory accesses can be reduced is to use more
than one bus for both address and data.

In the Harvard architecture shown, there are two separate buses for the program and data memory. Hence the
content of program memory and data memory can be accessed in parallel.
The instruction code can be fed from the program memory to the control unit while the operand is fed to the
processing unit from the data memory.
The processing unit, consisting of the registers and processing elements such as MAC units, multiplier, ALU, shifter,
etc., is also referred to as the data path.
The P-DSPs follow the modified Harvard architecture shown, in which one set of buses is used to access a memory that
has both program and data and another set is used to access a memory that has data alone. Data can also be transferred
from one memory to another. The modified Harvard architecture is used in several P-DSPs.

The P-DSPs use multiple buses only for connecting the on-chip memory to the control unit and data path. For
accessing off-chip memory, only a single bus is used for both the program memory and the data memory.
Because of this, any operation that involves an off-chip memory is slow compared to one that uses the on-chip memory.
MULTIPLE ACCESS MEMORY
The number of memory accesses/clock period can also be increased by using a high speed memory that
permits more than one memory access/clock period.

For example, the DARAM, the dual access RAM, permits two memory accesses/clock period.

Multiple access RAM may be connected to the processing unit of the P-DSP by using the Harvard architecture.

For example DARAM connected to a P-DSP with two independent data and address buses can be used to achieve
four memory accesses/ clock period.
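
The arithmetic behind these access counts can be sketched as follows. This is a simplified model, assuming that every bus can be kept busy in every clock period and that the accesses never conflict; the function name `clocks_for_accesses` and its parameters are illustrative and not taken from any vendor documentation.

```python
import math


def clocks_for_accesses(accesses_needed, buses, accesses_per_bus_per_clock=1):
    """Clock periods needed for a given number of memory accesses (idealised)."""
    per_clock = buses * accesses_per_bus_per_clock
    return math.ceil(accesses_needed / per_clock)


# MACD needs four accesses: instruction fetch, two operand fetches, one write.
MACD_ACCESSES = 4

von_neumann = clocks_for_accesses(MACD_ACCESSES, buses=1)
harvard = clocks_for_accesses(MACD_ACCESSES, buses=2)
harvard_daram = clocks_for_accesses(MACD_ACCESSES, buses=2,
                                    accesses_per_bus_per_clock=2)
```

Under these assumptions the single-bus machine needs four clock periods for MACD, two independent buses halve that, and two buses connected to DARAM reach the four accesses per clock period mentioned above.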
MULTIPORTED MEMORY
Another technique adopted for increasing the number of accesses/clock period is to use multiported memory.
For example, the dual port memory has two independent data and address buses as shown, and hence two memory
accesses can be achieved in a clock period.
Multiported memories dispense with the need for storing the program and data in two different memory chips in
order to permit simultaneous access to both program and data memory.

A major limitation of the dual-ported memory is its increased cost compared to two single-port memories
of the same total capacity. This is because of the increased number of pins and the larger chip area required for
the dual-ported memory. A larger number of I/O pins requires a larger and more expensive package and a larger
die size.

P-DSPs combine the modified Harvard architecture with the dualported memories.
VLIW ARCHITECTURE

Another architecture used in P-DSPs is the very long instruction word (VLIW) architecture. These P-DSPs have a
number of processing units (data paths). In other words, they have a number of ALUs, MAC units, shifters, etc.
The VLIW is accessed from memory and is used to specify the operands and operations to be performed by each
of the data paths.

The multiple functional units share a common multiported register file for fetching the operands and storing the
results.

Parallel random access by the functional units to the register file is facilitated by the read/write cross bar.

Execution of the operations in the functional units is carried out concurrently with the load/store operation of
data between a RAM and the register file.

The performance gains that can be achieved with VLIW architecture depend on the
degree of parallelism in the algorithm selected for a DSP application and the number
of functional units.

The throughput will be higher only if the algorithm involves execution of independent
operations.
For example, in a VLIW architecture with eight functional units, the time required for
convolution can be reduced by a factor of 8 compared to the case where a single
functional unit is used.

However, it may not always be possible to have independent streams of data for
processing.

Further the number of functional units is also limited by the hardware cost for the
multiported register file and cross bar switch.
PIPELINING

One of the approaches adopted for increasing the efficiency of advanced microprocessors as well
as P-DSPs is instruction pipelining.

An instruction cycle, starting with the fetching of an instruction and ending with the execution of the instruction
including the time for storage of the results, can be split into a number of microinstructions.

Execution of each of the microinstructions is also referred to as one phase of an instruction.

An instruction cycle requiring four microinstructions can be said to be in four phases as follows:

1. Fetch phase in which the instruction is fetched from the program memory
2. Decode phase in which the instruction is decoded
3. Memory read phase in which the operand required for the execution of the instruction may be
read from the data memory
4. Execution phase in which execution as well as the storage of the results in either one of the
registers or memory is carried out
The above microinstructions may be carried out separately by four functional units. Let us assume that each of
the above four phases takes equal time for completion.
In this case, in a conventional microprocessor with no pipelining, each of the functional units is busy only 25%
of the time. This is because only one instruction is processed at the CPU at a time. Figure 2.7 (a) shows when
each of the functional units is busy when a program containing three instructions I1, I2, I3 is executed.

(a) Instruction cycles of a processor with no pipelining (b) Instruction cycles of a processor with pipelining
The functional units can be kept busy almost all the time by processing a number of instructions
simultaneously in the CPU. For example, in a machine with four functional units, four instructions I1,
I2, I3 and I4 can be processed simultaneously as shown in Fig (b).
When I1 enters the decode phase, I2 can enter the opcode fetch phase.
When I1 enters the operand read phase, I2 enters the decode phase and I3 enters the opcode fetch phase.
When I1 enters the execute phase, I2 enters the operand read phase, I3 enters the decode phase and I4 enters the
opcode fetch phase.
The pipeline is fully loaded now and all the functional units have useful work to do.
The instructions that follow I4 keep the functional units busy till the program is exited.

Let T denote the time required for each phase of the instruction. One clock cycle of the processor corresponds to T.
In a period of 12T only three instructions can be executed in a machine without pipelining.
In the same period nine instructions can be executed as shown in Fig (b). Hence the throughput is increased by a
factor of 3 in this case.
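The initial latency of a machine with four phases is 4T: the first instruction completes after 4T and each
subsequent instruction completes one clock period later. Hence for executing a program with N instructions, the time
required for execution is (N + 3)T, which agrees with the nine instructions executed in 12T above.
With a non-pipelined machine, the time required for executing N instructions is 4NT.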
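
The timing figures above can be checked with a short calculation. The sketch below assumes an ideal four-phase pipeline with no stalls or flushes, in which the first instruction finishes after four clock periods and every later instruction finishes one period after its predecessor; the function names are illustrative only.

```python
def pipelined_time(n_instructions, phases=4):
    """Clock periods (in units of T) to run n instructions on an ideal pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs all the phases; each later one adds one period.
    return phases + (n_instructions - 1)


def non_pipelined_time(n_instructions, phases=4):
    """Clock periods when each instruction must finish before the next starts."""
    return phases * n_instructions


def instructions_completed(periods, phases=4, pipelined=True):
    """Number of instructions fully executed within the given number of periods."""
    if pipelined:
        return max(0, periods - phases + 1)
    return periods // phases
```

With these definitions, a period of 12T gives three completed instructions without pipelining and nine with pipelining, matching the factor of 3 quoted in the text.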

Limitations:
The above throughput is normally achieved with reduced instruction set computers (RISC).

1. In complex instruction set computers (CISC), there are also multiple-word instructions that require multiple
clock cycles for execution. In this case all the functional units cannot be kept busy all the time.

For example, in the case of call and branch instructions of a P-DSP, four phases or T states are required for the
call/branch instruction to exit the execution phase. By that time two more single-word instructions or one
double-word instruction will have entered the instruction pipeline. These instructions should not be executed. Hence
two words have to be flushed out of the instruction pipeline before the instructions are fetched.

To overcome this problem, some of the P-DSPs have special branch/call and return instructions called delayed
branch/call/return instructions.
2. The throughput efficiency of the pipeline may also be reduced because of conflicts between the instructions in the
instruction pipeline in different phases.
This happens if the same memory is used to store the data and program and there is only a single address bus for
addressing both the program and data memory.

For example, an instruction in fetch phase may try to fetch the instruction code from a memory chip that is also
accessed by another instruction that is in the operand read phase. To avoid the conflict, the operand read phase will
be done first and the opcode fetch phase will be repeated till there is no conflict again.
The number of instructions that are processed simultaneously in the CPU is referred to as the depth
of the instruction pipeline. It differs among the different families of P-DSPs.

The pipeline depths of some of the P-DSPs


SPECIAL ADDRESSING MODES IN P-DSPs
P-DSPs have special addressing modes that permit single word/instruction format and thereby speed up the execution
by making effective use of the instruction pipelining.
There are also special addressing modes such as cyclic addressing and bit reversed addressing that are specifically for
DSP applications.

• Short Immediate Addressing
• Short Direct Addressing
• Memory-mapped Addressing
• Indirect Addressing
• Bit Reversed Addressing Mode
• Circular Addressing

Short Immediate Addressing

This permits the operand to be specified using a short constant that forms part of a single word instruction. The
length of the short constant depends on the instruction type and the P-DSP.

For example, in the case of the TI TMS320C5X, an 8-bit constant can be specified as one of the operands in the single
word instructions for addition, subtraction, AND, OR and XOR.
Short Direct Addressing

This permits the lower order address of the operand of an instruction to be specified in the single word
instruction.

In the TI TMS320 DSPs, the higher order 9 bits of the memory address are stored in the data page pointer and only the
lower 7 bits are specified as a part of the instruction.
Each contiguous block of 128 words is referred to as one page in the TI DSPs.
The argument in the instruction specifies only the location within the current page.
In the Motorola DSP5600X, short direct memory addressing permits a 6-bit address to be specified in the instruction.
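
The way the TI scheme described above forms a full data address can be sketched as follows. This is a minimal illustration of the 9-bit page / 7-bit offset split stated in the text; the function name `direct_address` and the range checks are assumptions added for the example.

```python
PAGE_BITS = 9                      # upper address bits held in the data page pointer
OFFSET_BITS = 7                    # lower address bits carried in the instruction
PAGE_SIZE = 1 << OFFSET_BITS       # 128 words per page


def direct_address(data_page, offset):
    """Combine the data page pointer and the in-instruction offset (illustrative)."""
    if not 0 <= data_page < (1 << PAGE_BITS):
        raise ValueError("data page pointer must fit in 9 bits")
    if not 0 <= offset < PAGE_SIZE:
        raise ValueError("offset must fit in 7 bits")
    return (data_page << OFFSET_BITS) | offset
```

For instance, offset 5 on page 1 corresponds to word 133 of the data memory, and the largest page and offset together reach the top of a 16-bit address space.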

Memory-mapped Addressing

The CPU registers and the I/O registers of the P-DSPs are also accessible as memory locations. This is achieved by
mapping them into either the starting page or the final page of the memory space.
For example, in TMS320C5X, page 0 corresponds to the CPU registers and I/O registers.
In the case of Motorola DSP5600X, the last page of the memory space containing 64 locations is used as the memory
map for the CPU and I/O registers.
When these registers are accessed using memory mapped addressing modes, the higher address bits are not taken from
the data page pointer and instead made to be 0 in the case of TI DSPs and made to be 1 in Motorola DSPs.
Indirect Addressing

In P-DSPs this addressing mode has a number of options. This permits an array of data to be processed in P-DSP to
be efficiently fetched and stored. The address of the operands can be stored in one of the registers called indirect
address registers.

In the case of TI processors, the indirect address registers are called auxiliary registers ARs.

Any of these registers can be updated while the instruction whose operand was fetched using that register is being
executed. This is made possible by having an additional ALU in the CPU core specifically for the indirect address
registers or ARs.
The ARs may be incremented or decremented either in steps of 1 or in steps specified by the content of an offset
register.
In the case of TI processors, the offset register is called the INDX register.
In the P-DSPs from Analog Devices it is called the modifier register.

The indirect addressing mode used in TI 5X P-DSPs is called indirect addressing mode with
post-increment/decrement. In the TI 5X processors the new address computed by the auxiliary ALU is not used for
fetching the operand for the current instruction that is being decoded and executed. It is used for fetching the
operand of the next instruction that uses the indirect addressing mode with this particular AR.
In Motorola DSP563XX, the updated indirect address register content may also be used to fetch the operand for the
current instruction. Hence this mode is called the indirect addressing mode with pre-increment/ decrement.

In TI TMS320C54X processors both post-increment/decrement and pre-increment/decrement operations are supported.
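
The difference between the two update orders can be sketched with a small model of an indirect address register. The class `AuxRegister` and its method names are hypothetical and only mirror the behaviour described above; a Python list stands in for the data memory.

```python
class AuxRegister:
    """Toy indirect address register (illustrative, not a vendor API)."""

    def __init__(self, address):
        self.address = address

    def post_modify(self, memory, step=1):
        """Fetch with the current address, then update the register."""
        operand = memory[self.address]
        self.address += step
        return operand

    def pre_modify(self, memory, step=1):
        """Update the register first, then fetch with the new address."""
        self.address += step
        return memory[self.address]


memory = [10, 20, 30, 40]
ar = AuxRegister(0)
first = ar.post_modify(memory)    # uses address 0, register becomes 1
second = ar.pre_modify(memory)    # register becomes 2, uses address 2
```

A negative step gives the decrement variants, and a step read from an offset register would model the modification by the INDX or modifier register.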

Bit Reversed Addressing Mode

The bit reversed binary pattern corresponding to a particular decimal number is obtained by writing the natural
binary equivalent of the number in the reverse order, so that the most significant bit of the natural binary number
becomes the least significant bit of the bit reversed number and vice versa. In the bit reversed addressing mode, the
address is incremented/decremented by the number represented in the bit reversed form.

For the computation of the FFT, the data is to be arranged in the bit reversed order and the 2-point DFT
of the resulting sequence is to be computed first. In the bit reversed addressing mode, when a 16-point
FFT is to be computed, the 2-point DFT of X(0) and X(8) is to be found, similarly the 2-point DFT of X(4) and X(12),
and so on. It may be noted that the values 0, 8, 4, 12 correspond to consecutive numbers in the bit reversed number
representation.
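
The bit reversed ordering for the 16-point case can be generated with a few lines of Python. This is a software sketch of the index pattern only; a P-DSP produces these addresses in hardware, and the function names below are illustrative.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)   # move the LSB of value into result
        value >>= 1
    return result


def bit_reversed_order(n_points):
    """Indices 0..n_points-1 in bit reversed order (n_points a power of two)."""
    bits = n_points.bit_length() - 1
    if n_points < 1 or (1 << bits) != n_points:
        raise ValueError("n_points must be a power of two")
    return [bit_reverse(i, bits) for i in range(n_points)]
```

For 16 points the sequence begins 0, 8, 4, 12, so the first two 2-point DFTs operate on X(0), X(8) and X(4), X(12), as stated above.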
Circular Addressing

In linear addressing mode for real time processing of signals, the input signal is continuously stored in the
memory. The processed data is stored in another memory space continuously and may be written onto the output
device.
In this case input as well as output program will be simple. However, since the input as well as output memory
space will be finite in size, the entire memory space would be exhausted after processing the input signal for some
time, if the data is written into the memory.

To overcome this problem it is necessary to keep checking whether the range of either the input or the output
memory space is exceeded. In that case, the new data is to be stored starting from the beginning of the particular
memory space. However, checking this condition is an overhead that can be overcome using the circular
addressing mode.

In this mode, the memory can be organized as a circular buffer with the beginning memory address and the ending
memory address corresponding to this buffer defined by the programmer. In the circular addressing mode, when
the address pointer is incremented, the address will be checked with the ending memory address of the circular
buffer. If it exceeds that, the address will be made equal to the beginning address of the circular buffer.
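
The pointer update described above can be sketched as follows. This is a minimal software model, assuming increments of one word; the names `circular_increment` and `store_samples` are illustrative, and a dictionary stands in for the memory. In a P-DSP the comparison and wrap-around are done by the address generation hardware, which is what removes the checking overhead from the program.

```python
def circular_increment(address, start, end, step=1):
    """Advance the pointer; wrap to `start` once it passes `end`."""
    address += step
    if address > end:
        address = start
    return address


def store_samples(samples, start, end):
    """Write samples into a circular buffer occupying start..end inclusive."""
    memory = {}
    pointer = start
    for sample in samples:
        memory[pointer] = sample
        pointer = circular_increment(pointer, start, end)
    return memory, pointer
```

Writing five samples into a three-word buffer overwrites the two oldest samples, so the memory space is never exhausted however long the input signal runs.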
ON-CHIP PERIPHERALS
The P-DSPs have a number of on-chip peripherals that relieve the CPU from routine functions.

They help to reduce the chip count on the DSP system based around P-DSP.

Some of the on-chip peripherals in the P-DSPs are

• On-chip Timer
• Serial Port
• TDM Serial Port
• Parallel Port
• Bit I/O Ports
• Host Port
• Comm Ports
• On-Chip A/D and D/A Converters
• P-DSPs with RISC and CISC
On-chip Timer

The on-chip timer generates periodic interrupts to the P-DSP.
It can also generate the sampling clocks for the A/D converters.
The timer mode can be programmed by the P-DSP.
The period of the timer is also programmable.
The timers can generate a single pulse or a periodic train of pulses.
They can also generate a single square wave or a periodic square wave.
Serial Port

The serial port enables data communication between the P-DSP and an external peripheral such as an A/D converter,
a D/A converter or an RS-232C device.
These ports normally have input and output buffers so that the P-DSP writes to or reads from the serial port in
parallel form and the serial port sends and receives data to and from the peripherals in serial form.
They also generate interrupts when the serial port output buffer is empty or the input buffer is full.
These devices have parallel-to-serial and serial-to-parallel converters built into them.
The serial ports can operate either in the asynchronous mode or in the synchronous mode.
In the asynchronous mode, the transmit data and receive data lines alone are used for communication and no bit clock
is transmitted from either end.
In the synchronous mode, both a bit clock and a frame sync signal, which indicates the beginning of the first bit of
the data, are transmitted along with the data from the serial port to the I/O device and also from the I/O device to
the serial port.
TDM Serial Port

The P-DSPs have a special serial port called the TDM serial port. This permits a P-DSP to communicate
with other devices or P-DSPs by using time division multiplexing (TDM). One of the devices can generate the
frame sync pulse, which indicates the beginning of a TDM frame, and the bit clock, which sets the duration
for which a bit is to be transmitted. The TDM frame is split into a number of equal slots and each slot can be
allotted to one of the devices.

TDM frame with 8 time slots


In each of the slots, a number of bits may be transmitted by a channel. The TDM serial port normally uses four
lines for the purpose of serial communication. They are
• TFRM: the frame sync signal
• TClock: the bit clock
• TADD: the address of the serial device that is outputting data in a particular TDM slot
• TDAT: the data transmitted into the TDM channel by the authorized device
The signals TADD and TDAT are bidirectional and are tristate controlled so that only one of the devices transmits the
data and address on these lines at a time. Any one of the devices can generate the TFRM and clock signals and they
are used by the other devices as a reference. A scheme where eight devices are interconnected using the TDM serial
port is shown in Fig. below

An example of each of the devices outputting 16-bit data (D15-D0) in its slot, and also the address (A0-A15) of the
device that is supposed to receive this data, is shown in Fig. below
Parallel Port

Parallel ports enable communication between the P-DSP and other devices to be faster compared to the serial
communication by using a number of lines in parallel.
In addition, they also have additional lines, which are for strobing or for handshaking purposes.

The P-DSPs have two approaches for assigning lines for parallel port.
In one approach used by the TI, the data bus itself is used for parallel ports.
This is achieved by allocating a specific address space for I/O and whenever this address space is addressed
using the I/O instructions, the parallel port signals including the handshaking signals are sent over the data bus.
In another approach, separate lines are dedicated for parallel ports including the handshaking signals.
Bit I/O Ports

The P-DSPs have additional I/O ports that are single bit wide.
These port bits may be individually set, reset or read.
These bits are normally used for control purposes but they can also be used for data transfer.
There are no handshaking signals for these I/O ports.
Some of these bits are also used for conditional branching or calls.
For example, in TI processors there are instructions such as branch if I/O zero.

Host Port

The P-DSPs also have a special parallel port, normally 8 or 16 bits wide, called the host port.
This enables them to communicate with a microprocessor or PC, which is called the host.
In addition to data communication, the host can generate interrupts and also cause the P-DSP to load a program
from ROM to the RAM on reset.
Comm Ports

These are parallel ports that are used for interprocessor communication between a number of identical
P-DSPs in a multiprocessor system.
For example, a multiprocessor system may be built using a number of TI TMS320C4X devices.
For the purpose of communicating data between these processors, six comm ports, each 8 bits wide, are
provided.
Since the data to be processed may be 32 or more bits wide, the P-DSPs have provision for splitting the data into
streams of 8 bits and also for assembling the 8-bit streams into words of 32 bits.
The Analog Devices ADSP 2106X has 6 comm ports, each of which is 4 bits wide.
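
The splitting and reassembly of words for a narrow comm port can be sketched as below. The byte order (most significant chunk first) and the function names are assumptions made for the illustration; the actual transfer order of a given P-DSP should be taken from its data sheet.

```python
def split_word(word, chunk_bits=8, word_bits=32):
    """Split a word into chunks of chunk_bits, most significant chunk first."""
    mask = (1 << chunk_bits) - 1
    shifts = range(word_bits - chunk_bits, -1, -chunk_bits)
    return [(word >> shift) & mask for shift in shifts]


def assemble_word(chunks, chunk_bits=8):
    """Rebuild the word from chunks received most significant chunk first."""
    word = 0
    for chunk in chunks:
        word = (word << chunk_bits) | chunk
    return word
```

With the default parameters a 32-bit word travels as four 8-bit transfers, as on an 8-bit comm port; setting `chunk_bits=4` models a 4-bit wide port, which needs eight transfers for the same word.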

On-Chip A/D and D/A Converters

P-DSPs targeted towards voice applications such as cellular telephones and tapeless answering machines have
A/D and D/A converters inside the P-DSP.

Motorola DSP 561XX and Analog devices ADSP 21MSP5X both have the A/D and D/A on chip and permit
effective
P-DSPs with RISC and CISC

P-DSPs may be implemented using either a RISC processor or a CISC processor.
The TI TMS320C6X P-DSPs use a RISC processor.
The TI TMS320C54X, the Motorola DSP563XX and the Analog Devices ADSP 2100X make use of CISC.
The TI TMS320C8X has a RISC processor and four P-DSPs with CISC on a single chip.

RISC Advantages
• The chip area dedicated to the realization of the control unit is considerably reduced because of the reduced
number of instructions. 20% of the chip area may be used for the control unit in RISC.
• Therefore in a RISC there is more area available for incorporating other features.
• Because of the reduction in the control area, the CPU registers and the data paths (processing units) can be
replicated and the throughput of the processor can be increased by applying pipelining and parallel processing.
• Computational speed is high. All the instructions are of uniform length and take the same time for execution.
Hence the dummy periods or hold periods in the instruction pipeline are reduced to the minimum.
• The control unit in RISC has fewer gates. This reduces the propagation delay and increases the speed. The reduced
number of instructions, formats and addressing modes results in a simpler and smaller decoder, which, in turn,
increases the speed.
• The delayed branch and call instructions can be effectively used and they improve the speed.
Disadvantages
The HLL compilers are costlier by several orders of magnitude than the P-DSPs themselves.
For P-DSPs with RISC architecture, compilers are essential. For most low-cost applications, DSP
platforms without compilers are preferred.

Since RISC has a smaller number of instructions, implementation of a single CISC instruction might require a
number of instructions in RISC. This increases the memory required for storing the program and also the traffic
between CPU and memory. This on the one hand increases the computation time and on the
other hand makes the program difficult to debug.
CISC Advantages

CISC processors have a very rich instruction set that even supports high level languages.
The P-DSPs with CISC also have instructions specifically required for DSP applications such as MACD,
FIRS, etc.
This makes an application program written in assembly language shorter and easier to follow.
The HLL compilers are costly by several orders of magnitude compared to the P-DSPs themselves.
For P-DSP with RISC architecture, compilers are essential. For most of the low cost applications, DSP
platforms without the compilers are preferred. Hence a majority of P-DSPs are CISC based.

Code Composer Studio from TI permits programming in HLL as well as assembly language in a single
development environment so that the best features of both HLL and assembly language programming can be
used by the programmer.
ARCHITECTURE OF TMS320C6X
• TMS320C6X DSPs are the first DSPs to use advanced VLIW (Very Long Instruction Word) architecture to
achieve high performance through increased instruction parallelism.
• C6X DSPs are an excellent choice for multichannel and multifunction applications.
• TMS320C6X DSPs use the VelociTI architecture.
• The VelociTI architecture is a highly deterministic architecture having reduced code size, flexibility of code
and data type and zero overhead in branching.
• The C6X devices execute up to eight 32-bit instructions per cycle with an execution speed of up
to 6000 million instructions per second (MIPS).


FEATURES OF C6X PROCESSORS

• Advanced VLIW CPU with eight functional units, including two multipliers and six ALUs
• Executes up to eight instructions per cycle and allows effective RISC-like code to be developed
• Instruction packing reduces code size, program fetches and power consumption
• Conditional execution of all instructions
• Efficient code execution on independent functional units
• Supports 8/16/32-bit data formats, 40-bit arithmetic operations, and saturation and normalization operations
• Field manipulation instructions: extract, set, clear and bit counting operations
• The C67X device has hardware support for single precision (32-bit) and double precision (64-bit) IEEE floating
point operations and also 32 x 32 bit integer multiplication with 32 or 64-bit results.
• The C64X device multiplier can perform two 16 x 16 bit or four 8 x 8 bit multiplications per cycle. The C64X
also has quad 8-bit and dual 16-bit instruction set extensions with data flow support, memory access for
non-aligned 32-bit and 64-bit data, special communication-specific instructions useful in realizing
error-correcting codes, and bit count and rotate hardware.
INTERNAL ARCHITECTURE
The C6X devices contain a 32-bit CPU, on-chip program and data memory, and on-chip peripherals.
The on-chip memory has cache either for program space or for both program and data space.
The C6X devices have peripherals such as external memory interface (EMIF), direct memory access controller (DMA),
timers, multi-channel buffered serial ports (McBSP), host port interface (HPI) and power down logic.
CPU

The central processing unit of the C6X device is 32 bits wide. The block diagram of the C6X CPU is shown.

The CPU contains the following units:

• Program fetch unit
• Instruction dispatch unit
• Instruction decode unit
• Two data paths, each consisting of four functional units
• Register file for each data path
• Control registers
• Control logic
• Test, emulation and interrupt logic
The C6X CPU is based on advanced VLIW architecture, which accepts eight 32-bit instructions (the instruction fetch
packet size is 256 bits) at a time.
The program fetch unit generates the addresses of the eight instructions and sends them to the program memory for each
fetch packet.
Once the program memory read occurs, the fetch packet is received at the CPU.
The instruction dispatch unit receives the fetch packet and splits it into execute packets.
The instructions in an execute packet (up to eight instructions) are assigned to the appropriate functional units in
the data paths.
During the instruction decode, the source registers, destination registers and associated paths are decoded for the
execution of the instructions in the functional units. Finally the instructions are executed by the functional units.
The register files (A and B) of the C62X/C67X devices together contain 32 32-bit registers (16 registers for each
data path).
The C6X CPU contains eight functional units: six arithmetic and logic units and two multipliers
(.L1, .L2, .S1, .S2, .M1, .M2, .D1 and .D2).
These functional units can be divided into two groups of four.
The .L, .S and .D units are arithmetic and logic units (ALUs), and the .M unit is a multiplier unit. Each data path
has almost identical functional units.
GENERAL-PURPOSE REGISTER
FILES

The general-purpose registers can be used to hold data, as data address pointers or as condition registers.
There are two general-purpose register files, A and B, in the C6X CPU data paths.

In C62X/C67X devices each register file contains sixteen 32-bit registers: A0-A15 in register file A and B0-B15 in register file B.

C64X devices have double the number of general-purpose registers: there are thirty-two 32-bit registers for each data path, A0-A31 in register file A and B0-B31 in register file B.

The C62X/C67X general-purpose register files support data ranging from packed 16-bit data through 40-bit fixed-point and 64-bit floating-point data. A packed data type stores four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.
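The packed data types can be illustrated with a short sketch. The Python helpers below (pack_2x16, unpack_2x16 and pack_4x8 are hypothetical names) only model how several small values share one 32-bit register; they are not C6X instructions.

```python
def pack_2x16(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi in bits 31..16)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack_2x16(word):
    """Return the (hi, lo) 16-bit halves of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def pack_4x8(b3, b2, b1, b0):
    """Pack four 8-bit values into one 32-bit word (b3 in bits 31..24)."""
    return (((b3 & 0xFF) << 24) | ((b2 & 0xFF) << 16)
            | ((b1 & 0xFF) << 8) | (b0 & 0xFF))


packed_halves = pack_2x16(0x1234, 0xABCD)        # 0x1234ABCD
packed_bytes = pack_4x8(0x12, 0x34, 0x56, 0x78)  # 0x12345678
```

Packing lets one 32-bit operation act on two 16-bit samples or four 8-bit samples at once, which is why the register files support these types.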
FUNCTIONAL UNITS AND OPERATION

The C6X CPU consists of eight functional units: .L1, .S1, .M1, .D1, .L2, .S2, .M2 and .D2.
These eight functional units are divided into two groups, one group for each data path.
Each functional unit in one data path is almost identical to the corresponding unit in the other data path, and the two groups are arranged as mirror images of each other.
The .L, .S and .D units are arithmetic and logic units; the .M unit is a multiplier unit.

The .L unit performs arithmetic and logical operations, as well as operations such as compare and count.
The .S unit is used for arithmetic and logical operations as well as for branch, shift, constant generation and move operations.
The .D unit performs add and subtract operations and is the dedicated unit for load and store operations and for linear and circular address calculations.
The .M unit is the dedicated unit for multiply operations.
DATA PATHS

The C6X CPU has two data paths, data path A and data path B.

• Register File Data Paths
• Register File Cross Paths
• Register File Memory Access Paths
Register File Data Paths
Data lines in the CPU data paths are 32 bits wide, but some paths also support 40-bit (long) and 64-bit (double-word) operands.
The functional units ending in 1 (.L1, .S1, .M1 and .D1) have access to register file A, and functional units
ending in 2 (.L2, .S2, .M2 and .D2) to register file B.
Each functional unit has two 32-bit ports for reading the source operands src1 and src2 from the respective register file.
The .L and .S units have an extra 8-bit line for 40-bit long src operand reads.
Each functional unit, except the .M unit, has its own 32-bit write port into the respective register file for the destination (dst) operand.
The .L and .S units also have an extra 8-bit line for 40-bit long dst operand writes.
Since each unit has its own ports for operand reads and writes, all eight functional units can be used in parallel in every machine cycle when performing 32-bit operations.
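The extra 8-bit lines on the .L and .S units carry the upper bits of a 40-bit long operand. A minimal sketch follows, assuming the usual C6X convention that a long value occupies a register pair, with the low 32 bits in the even register and the upper 8 bits in the low byte of the odd register. The helper names split_long and join_long are hypothetical.

```python
MASK40 = (1 << 40) - 1  # a long operand is 40 bits wide


def split_long(value):
    """Split a 40-bit value into (high8, low32), as carried on the
    8-bit and 32-bit lines of a .L or .S unit port."""
    value &= MASK40
    return (value >> 32) & 0xFF, value & 0xFFFFFFFF


def join_long(high8, low32):
    """Rebuild the 40-bit value from its 8-bit and 32-bit parts."""
    return ((high8 & 0xFF) << 32) | (low32 & 0xFFFFFFFF)


high8, low32 = split_long(0x123456789A)  # (0x12, 0x3456789A)
```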
Register File Cross Paths

Functional units can read and write operands directly from and to their respective register files using their own data paths.
The register files are connected to the opposite-side functional units through the 1X and 2X cross paths.
These cross paths allow the functional units of one data path to access a 32-bit operand from the opposite-side register file.

The functional units of data path A read their source operands from register file B via the 1X cross path, and the 2X cross path allows the functional units of data path B to read their source operands from register file A.

In C62X and C67X processors

Six of the eight functional units (.L1, .L2, .S1, .S2, .M1 and .M2) have access to the opposite-side register file via a cross path.
In the .S1, .S2, .M1 and .M2 units, the src2 operand is selectable between the cross path and the same-side register file path.
In the .L1 and .L2 units, both src1 and src2 operands are selectable.

In C64X processors

All eight functional units have access to the register file of the opposite side through a cross path.

In the .L1 and .L2 units, both src1 and src2 operands are selectable between the cross path and the same-side register file path, but in the other six functional units only the src2 operand is selectable.
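The cross path rules for the two processor groups can be summarised in a small lookup sketch. The function name cross_path_operands and the family strings are hypothetical; the rules themselves are only those stated above.

```python
def cross_path_operands(family, unit):
    """Return the source operands of a functional unit that may be taken
    from the opposite-side register file via the 1X/2X cross path."""
    family = family.upper()
    unit = unit.upper().lstrip(".")
    if family not in ("C62X", "C67X", "C64X"):
        raise ValueError("unknown processor family: " + family)
    if len(unit) != 2 or unit[0] not in "LSMD" or unit[1] not in "12":
        raise ValueError("unknown functional unit: " + unit)

    kind = unit[0]
    if kind == "D" and family in ("C62X", "C67X"):
        return ()                   # .D units have no cross path access here
    if kind == "L":
        return ("src1", "src2")     # both operands are selectable on .L units
    return ("src2",)                # only src2 is selectable on the other units


c62x_d1 = cross_path_operands("C62X", ".D1")   # ()
c64x_d1 = cross_path_operands("C64X", ".D1")   # ("src2",)
c67x_l2 = cross_path_operands("C67X", ".L2")   # ("src1", "src2")
```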
Register File Memory Access Paths

To move data between memory and the CPU register files, the C6X CPU has address paths, data load paths and data store paths.
DA1 and DA2 are the address paths, LD1 and LD2 the data load paths, and ST1 and ST2 the data store paths.

Address path
The DA1 and DA2 address paths are 32 bits wide and are connected to the .D units of the respective data paths. These paths allow addresses generated by either data path to access data to or from any register.

There is also a cross path for the address buses: the addresses generated in the .D1 and .D2 units can access the opposite paths, DA2 and DA1 respectively.

Load path
The C62X processor has two 32-bit paths for loading data from memory to the register files, LD1 for register file A and LD2 for register file B.

C64X and C67X processors have additional 32-bit load paths (LD1a and LD1b, LD2a and LD2b) for register files A and B.
Store path
The C62X processor has two 32-bit paths to store data values from the register files to memory.

The C64X has additional 32-bit store paths, ST1a and ST1b, and ST2a and ST2b, for register files A and B.
The sizes of the memory access paths in C6X processors are given in the table.
CONTROL REGISTER FILE
The control register file of the C6X processor contains ten control registers.

Only the .S2 unit can read from and write to the control register file. The control registers are generally accessed with the MVC (move between the control file and register file) instruction.

The control registers common to all C6X processors and their descriptions are listed in the table.
Addressing Mode Register (AMR)

The eight registers A4-A7 and B4-B7 of the CPU register file can be used for linear and circular addressing.
The Addressing Mode Register (AMR) specifies the addressing mode for each of these registers; it consists of mode select fields and block size fields.

The various fields of the AMR are shown below:


A 2-bit mode select field for each register in the AMR selects the address modification mode, either linear or circular.

The 5-bit block size fields BK0 and BK1 select the block size of the circular buffer in circular addressing.

The 2-bit mode select field also specifies which block size (BK) field is used for the circular buffer.
The mode select field encoding is given below:
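A sketch of how an AMR value could be built and how a circular pointer wraps. The bit positions and encodings used here are assumptions taken from the commonly documented C6X layout, not from this text: mode 00 = linear, 01 = circular using BK0, 10 = circular using BK1; mode fields for A4..B7 in bits 0..15; BK0 in bits 16..20; BK1 in bits 21..25; block size = 2^(N+1) bytes. All function names are hypothetical.

```python
LINEAR, CIRC_BK0, CIRC_BK1 = 0b00, 0b01, 0b10   # assumed mode encodings

# Assumed bit position of each register's 2-bit mode select field.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}
BK0_SHIFT, BK1_SHIFT = 16, 21                   # assumed 5-bit field positions


def make_amr(modes, bk0=0, bk1=0):
    """Build a 32-bit AMR value from {register: mode} and the BK fields."""
    if not (0 <= bk0 < 32 and 0 <= bk1 < 32):
        raise ValueError("BK0 and BK1 are 5-bit fields")
    amr = (bk0 << BK0_SHIFT) | (bk1 << BK1_SHIFT)
    for reg, mode in modes.items():
        if mode not in (LINEAR, CIRC_BK0, CIRC_BK1):
            raise ValueError("unsupported mode for " + reg)
        amr |= mode << MODE_SHIFT[reg]
    return amr


def block_size_bytes(n):
    """Circular buffer size for a BK field value n (assumed 2^(n+1))."""
    return 1 << (n + 1)


def circular_add(address, offset, n):
    """Add offset to address, wrapping inside the aligned circular block."""
    size = block_size_bytes(n)
    base = address & ~(size - 1)                 # upper bits stay unchanged
    return base | ((address + offset) & (size - 1))


amr = make_amr({"A4": CIRC_BK0}, bk0=3)          # A4 circular, 16-byte block
wrapped = circular_add(0x100C, 8, 3)             # 0x100C + 8 wraps to 0x1004
```

Registers not listed in the dictionary keep mode 00 and therefore use linear addressing.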
Control Status Register (CSR)

The Control Status Register (CSR) of C6X contains control and status bits of the processor.

The various fields of the CSR are

The functions of each field are


Control Register File Extensions

C67X and C64X processors contain additional control registers.

The C67X processor contains three configuration registers to support floating-point operations. These registers specify the desired floating-point rounding mode for the .L, .S and .M units.
The additional C67X control registers and their functions are given
REFERENCE:

B. Venkataramani and M. Bhaskar, Digital Signal Processors: Architecture, Programming and Applications (Chapters 2 and 13)
