CHAPTER – 1
BASIC STRUCTURE OF COMPUTERS
A list of instructions is called a program, and the internal storage is called computer
memory.
[Figure: Basic functional units of a computer: the input unit, memory, arithmetic and logic unit (ALU), control unit, and output unit, together with the I/O and processor interconnections.]
Finally, the results are sent to the outside world through the output device. All of
these actions are coordinated by the control unit.
Input unit: -
The source program, high-level language program, coded information, or simply data
is fed to a computer through input devices; the keyboard is the most common type. Whenever a
key is pressed, the corresponding letter or digit is translated into its equivalent binary
code and sent over a cable to either the memory or the processor.
Memory unit: -
Its function is to store programs and data. It is basically of two types:
1. Primary memory
2. Secondary memory
1. Primary memory: - Is the one exclusively associated with the processor and operates
at electronic speeds. Programs must be stored in this memory while they are being
executed. The memory contains a large number of semiconductor storage cells, each
capable of storing one bit of information. These cells are processed in groups of fixed size
called words.
Number of bits in each word is called word length of the computer. Programs
must reside in the memory during execution. Instructions and data can be written into the
memory or read out under the control of processor.
Memory in which any location can be reached in a short and fixed amount of
time after specifying its address is called random-access memory (RAM).
The time required to access one word is called the memory access time. Memory
that is only readable by the user, and whose contents cannot be altered, is called read-only
memory (ROM); it contains the operating system.
Caches are small, fast RAM units which are coupled with the processor and
are often contained on the same IC chip to achieve high performance. Although primary
storage is essential, it tends to be expensive.
2. Secondary memory: - Is used where large amounts of data & programs have to be
stored, particularly information that is accessed infrequently.
Examples: - Magnetic disks & tapes, optical disks (i.e., CD-ROMs), floppies, etc.
The control unit and the ALU are many times faster than other devices connected to a
computer system. This enables a single processor to control a number of external devices
such as keyboards, displays, magnetic and optical disks, sensors, and other mechanical
controllers.
Output unit:-
These actually are the counterparts of the input unit. Their basic function is to send the
processed results to the outside world.
Control unit:-
It effectively is the nerve center that sends signals to other units and senses their
states. The actual timing signals that govern the transfer of data between input unit,
processor, memory and output unit are generated by the control unit.
Consider a typical instruction, Add LOCA, R0, which adds the operand at memory
location LOCA to the contents of register R0. Executing it involves the following steps:
1. First the instruction is fetched from the memory into the processor.
2. The operand at LOCA is fetched and added to the contents of R0
3. Finally the resulting sum is stored in the register R0
The preceding Add instruction combines a memory access operation with an ALU
operation. In some other types of computers, these two operations are performed
by separate instructions for performance reasons:
Load LOCA, R1
Add R1, R0
Transfers between the memory and the processor are started by sending the
address of the memory location to be accessed to the memory unit and issuing the
appropriate control signals. The data are then transferred to or from the memory.
[Figure: Connections between the processor and the memory. The processor contains the control circuitry, the ALU, the program counter (PC), the instruction register (IR), the memory address register (MAR), the memory data register (MDR), and n general-purpose registers (GPRs) R0 through Rn-1.]
The figure shows how the memory and the processor can be connected. In addition to the
ALU and the control circuitry, the processor contains a number of registers used for several
different purposes.
The instruction register (IR):- Holds the instruction that is currently being executed.
Its output is available to the control circuits, which generate the timing signals that
control the various processing elements involved in executing the instruction.
The program counter (PC):- Keeps track of the execution of a program; it contains the
memory address of the next instruction to be fetched and executed. Besides the IR and
PC, there are n general-purpose registers, R0 through Rn-1.
The other two registers which facilitate communication with memory are: -
1. MAR – (Memory Address Register):- It holds the address of the location to be
accessed.
2. MDR – (Memory Data Register):- It contains the data to be written into or read
out of the addressed location.
An interrupt is a request signal from an I/O device for service by the processor.
The processor provides the requested service by executing an appropriate interrupt
service routine.
This diversion may change the internal state of the processor, so its state must be
saved in memory locations before the interrupt is serviced. When the interrupt-service
routine is completed, the state of the processor is restored so that the interrupted program
may continue.
A bus is the simplest and most common way of interconnecting the various parts of a
computer. To achieve a reasonable speed of operation, a computer must be organized so
that all its units can handle one full word of data at a given time. A group of lines that
serves as a connecting path for several devices is called a bus.
In addition to the lines that carry the data, the bus must have lines for address and
control purposes. The simplest way to interconnect the units is to use a single bus, as shown.
Since the bus can be used for only one transfer at a time, only two units can
actively use the bus at any given time. Bus control lines are used to arbitrate multiple
requests for use of one bus.
The advantages of a single-bus structure are:
1. Low cost
2. Very flexible for attaching peripheral devices
A multiple-bus structure certainly increases performance, but it also increases the
cost significantly.
The interconnected devices do not all operate at the same speed, which leads to a
timing problem. This is solved by using buffer registers. These buffers are
electronic registers of small capacity when compared to the main memory, but of
comparable speed.
The data from the processor are loaded into these buffers at once, and the complete
transfer then takes place from the buffer at a rate suited to the device.
1.5 Performance
The total time required to execute a program, the elapsed time, is a measure of the
performance of the entire computer system. It is affected by the speed of the processor,
the disk, and the printer. The time needed by the processor to execute the program's
instructions is called the processor time.
Just as the elapsed time for the execution of a program depends on all units in a
computer system, the processor time depends on the hardware involved in the execution
of individual machine instructions. This hardware comprises the processor and the
memory which are usually connected by the bus as shown in the fig c.
[Figure c: The processor and the main memory connected by a bus.]
The pertinent parts of the fig. c are repeated in fig. d which includes the cache
memory as part of the processor unit.
Let us examine the flow of program instructions and data between the memory
and the processor. At the start of execution, all program instructions and the required data
are stored in the main memory. As execution proceeds, instructions are fetched one
by one over the bus into the processor, and a copy is placed in the cache. Later, if the same
instruction or data item is needed a second time, it is read directly from the cache.
The processor and relatively small cache memory can be fabricated on a single
IC chip. The internal speed of performing the basic steps of instruction processing on
chip is very high and is considerably faster than the speed at which the instruction and
data can be fetched from the main memory. A program will be executed faster if the
movement of instructions and data between the main memory and the processor is
minimized, which is achieved by using the cache.
For example:- Suppose a number of instructions are executed repeatedly over a short
period of time as happens in a program loop. If these instructions are available in the
cache, they can be fetched quickly during the period of repeated use. The same applies to
the data that are used repeatedly.
Processor clock: -
Processor circuits are controlled by a timing signal called a clock. The clock
defines regular time intervals called clock cycles. To execute a machine instruction,
the processor divides the action to be performed into a sequence of basic steps such that each
step can be completed in one clock cycle. The length P of one clock cycle is an important
parameter that affects processor performance.
Processors used in today's personal computers and workstations have clock rates
that range from a few hundred million to over a billion cycles per second.
We now focus our attention on the processor time component of the total elapsed
time. Let ‘T’ be the processor time required to execute a program that has been prepared
in some high-level language. The compiler generates a machine language object program
that corresponds to the source program. Assume that complete execution of the program
requires the execution of N machine language instructions. The number N is the
actual number of instruction executions and is not necessarily equal to the number of
machine instructions in the object program. Some instructions may be executed
more than once, which is the case for instructions inside a program loop; others may not
be executed at all, depending on the input data used.
Suppose that the average number of basic steps needed to execute one machine
instruction is S, where each basic step is completed in one clock cycle. If the clock rate
is R cycles per second, the program execution time is given by
T = (N × S) / R
This is often referred to as the basic performance equation.
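For instance, a quick numerical check of this equation can be written in a few lines of C; the values of N, S, and R below are hypothetical, chosen only to illustrate the arithmetic.

#include <stdio.h>

int main(void) {
    double N = 50e6;          /* hypothetical: 50 million instruction executions */
    double S = 4.0;           /* hypothetical: 4 basic steps per instruction */
    double R = 500e6;         /* hypothetical: 500 MHz clock (cycles per second) */
    double T = (N * S) / R;   /* the basic performance equation */
    printf("Processor time T = %.3f seconds\n", T);   /* prints 0.400 */
    return 0;
}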
We must emphasize that N, S, and R are not independent parameters; changing one
may affect another. Introducing a new feature in the design of a processor will lead to
improved performance only if the overall result is to reduce the value of T.
Consider Add R1, R2, R3
This adds the contents of R1 & R2 and places the sum into R3.
The contents of R1 and R2 are first transferred to the inputs of the ALU. After the
addition operation is performed, the sum is transferred to R3. The processor can read the
next instruction from the memory while the addition operation is being performed. If
that instruction also uses the ALU, its operands can be transferred to the ALU inputs at
the same time that the result of the Add instruction is being transferred to R3.
In the ideal case if all instructions are overlapped to the maximum degree
possible the execution proceeds at the rate of one instruction completed in each clock
cycle. Individual instructions still require several clock cycles to complete. But for the
purpose of computing T, effective value of S is 1.
This overlapping of instruction execution is called pipelining, and it makes execution
substantially faster than the strictly serial execution of program instructions. Nowadays
many processors are designed in this manner.
There are two possibilities for increasing the clock rate R:
1. Improving the IC technology makes logic circuits faster, which reduces the time
needed to complete a basic step. This allows the clock period P to be reduced and the
clock rate R to be increased.
2. Reducing the amount of processing done in one basic step also makes it possible
to reduce the clock period P. However, if the actions that have to be performed by
an instruction remain the same, the number of basic steps needed may increase.
The performance measure is the time taken by the computer to execute a given
benchmark. Initially, some attempts were made to create artificial programs that could be
used as benchmark programs, but synthetic programs do not properly predict the
performance obtained when real application programs are run.
The programs selected range from game playing, compiler, and database
applications to numerically intensive programs in astrophysics and quantum chemistry. In
each case, the program is compiled for the computer under test, and the running time on a
real computer is measured. The same program is also compiled for and run on one computer
selected as a reference.
The ‘SPEC’ rating is computed as follows:
SPEC rating = (Running time on the reference computer) / (Running time on the computer under test)
A SPEC rating of 50 means that the computer under test is 50 times as fast as the
reference computer (an UltraSPARC 10) for that program. This is repeated for all the
programs in the SPEC suite, and the geometric mean of the results is computed.
Let SPECi be the rating for program ‘i’ in the suite. The overall SPEC rating for
the computer is given by
SPEC rating = (SPEC1 × SPEC2 × … × SPECn)^(1/n)
Since actual execution time is measured, the SPEC rating is a measure of the
combined effect of all factors affecting performance, including the compiler, the OS, the
processor, and the memory of the computer being tested.
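As a small illustration of how the geometric mean is formed, the following C sketch combines a set of made-up per-program ratings (these are not real SPEC results); link with -lm.

#include <stdio.h>
#include <math.h>

int main(void) {
    double spec[] = {40.0, 55.0, 50.0, 62.0, 48.0};   /* hypothetical ratings */
    int n = sizeof spec / sizeof spec[0];
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(spec[i]);        /* summing logs forms the product safely */
    double rating = exp(log_sum / n);   /* n-th root of the product */
    printf("Overall SPEC rating = %.2f\n", rating);
    return 0;
}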
We obviously need to represent both positive and negative numbers. Three systems are
used for representing such numbers :
• Sign-and-magnitude
• 1’s-complement
• 2’s-complement
In all three systems, the leftmost bit is 0 for positive numbers and 1 for negative numbers.
Fig 2.1 illustrates all three representations using 4-bit numbers. Positive values have
identical representations in all systems, but negative values have different representations.
In the sign-and-magnitude systems, negative values are represented by changing the most
significant bit (b3 in figure 2.1) from 0 to 1 in the B vector of the corresponding positive
value. For example, +5 is represented by 0101, and -5 is represented by 1101. In 1's-
complement representation, negative values are obtained by complementing each bit of the
corresponding positive number; thus, -3 is obtained by complementing each bit of 0011 to
yield 1100. In the 2's-complement system, a negative number is represented by the result of
subtracting the corresponding positive number from 2^4.

B (b3 b2 b1 b0)   Sign and magnitude   1's complement   2's complement
0 1 1 1                  +7                  +7               +7
0 1 1 0                  +6                  +6               +6
0 1 0 1                  +5                  +5               +5
0 1 0 0                  +4                  +4               +4
0 0 1 1                  +3                  +3               +3
0 0 1 0                  +2                  +2               +2
0 0 0 1                  +1                  +1               +1
0 0 0 0                  +0                  +0               +0
1 0 0 0                  -0                  -7               -8
1 0 0 1                  -1                  -6               -7
1 0 1 0                  -2                  -5               -6
1 0 1 1                  -3                  -4               -5
1 1 0 0                  -4                  -3               -4
1 1 0 1                  -5                  -2               -3
1 1 1 0                  -6                  -1               -2
1 1 1 1                  -7                  -0               -1

Figure 2.1 Binary, signed-integer representations.
Hence, the 2’s complement of a number is obtained by adding 1 to the 1’s complement of
that number.
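This rule is easy to check in C. The fragment below forms the 1's and 2's complements of +5 within 4 bits; the masking with 0xF keeps only the low 4 bits.

#include <stdio.h>

int main(void) {
    unsigned x = 0x5;                  /* +5 = 0101 */
    unsigned ones = (~x) & 0xF;        /* 1's complement within 4 bits: 1010 */
    unsigned twos = (ones + 1) & 0xF;  /* add 1 for the 2's complement: 1011 = -5 */
    printf("1's complement: %X, 2's complement: %X\n", ones, twos);
    return 0;
}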
The four possible cases of adding two 1-bit numbers are:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 10 (the high-order 1 is the carry-out)
Figure 2.2 Addition of 1-bit numbers.
Number and character operands, as well as instructions, are stored in the memory
of a computer. The memory consists of many millions of storage cells, each of which can
store a bit of information having the value 0 or 1. Because a single bit represents a very
small amount of information, bits are seldom handled individually. The usual approach is
to deal with them in groups of fixed size. For this purpose, the memory is organized so
that a group of n bits can be stored or retrieved in a single, basic operation. Each group of
n bits is referred to as a word of information, and n is called the word length. The
memory of a computer can be schematically represented as a collection of words as
shown in figure (a).
Modern computers have word lengths that typically range from 16 to 64 bits. If
the word length of a computer is 32 bits, a single word can store a 32-bit 2’s complement
number or four ASCII characters, each occupying 8 bits. A unit of 8 bits is called a byte.
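For instance, the following C sketch stores four ASCII characters in one 32-bit word and also shows a 32-bit 2's-complement number; the printed word value depends on the byte ordering of the machine.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    char text[4] = {'W', 'o', 'r', 'd'};   /* four 8-bit ASCII characters */
    uint32_t word;
    memcpy(&word, text, sizeof word);      /* the same 32 bits viewed as one word */
    int32_t number = -5;                   /* a 32-bit 2's-complement number */
    printf("word = 0x%08X  number = %d (0x%08X)\n",
           (unsigned)word, number, (unsigned)number);
    return 0;
}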
The hardware required to connect an I/O device to the bus consists of an address
decoder, data and status registers, and control circuitry. The address decoder
enables the device to recognize its address when this address appears on the address lines.
The data register holds the data being transferred to or from the processor. The status
register contains information relevant to the operation of the I/O device. Both the data
and status registers are connected to the data bus and assigned unique addresses. The
address decoder, the data and status registers, and the control circuitry required to
coordinate I/O transfers constitute the device’s interface circuit.
I/O devices operate at speeds that are vastly different from that of the processor.
When a human operator is entering characters at a keyboard, the processor is capable of
executing millions of instructions between successive character entries. An instruction
that reads a character from the keyboard should be executed only when a character is
available in the input buffer of the keyboard interface. Also, we must make sure that an
input character is read only once.
This example illustrates program-controlled I/O, in which the processor
repeatedly checks a status flag to achieve the required synchronization between the
processor and an input or output device. We say that the processor polls the device. There
are two other commonly used mechanisms for implementing I/O operations: interrupts
and direct memory access. In the case of interrupts, synchronization is achieved by
having the I/O device send a special signal over the bus whenever it is ready for a data
transfer operation. Direct memory access is a technique used for high-speed I/O devices.
It involves having the device interface transfer data directly to or from the memory,
without continuous involvement by the processor.
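A minimal sketch of program-controlled I/O in C is shown below. The status flag and data register are modeled as memory-mapped locations; the addresses and the name read_char are invented for illustration, and the SIN flag is assumed to occupy bit 0 of the status register, as described later for the keyboard interface.

#include <stdint.h>

#define STATUS_REG ((volatile uint8_t *)0x4004)   /* hypothetical address; bit 0 = SIN */
#define DATAIN_REG ((volatile uint8_t *)0x4000)   /* hypothetical address of DATAIN */

/* Poll the status flag, then read the character exactly once. */
uint8_t read_char(void) {
    while ((*STATUS_REG & 0x01) == 0)
        ;                       /* busy-wait until a character is available */
    return *DATAIN_REG;         /* reading DATAIN clears SIN in the interface */
}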
The routine executed in response to an interrupt request is called the interrupt-
service routine, which is the PRINT routine in our example. Interrupts bear considerable
resemblance to subroutine calls. Assume that an interrupt request arrives during
execution of instruction i in figure 1
[Figure 1: Transfer of control through the use of interrupts. Program 1 (the COMPUTE routine) is interrupted after instruction i; execution switches to Program 2 (the PRINT routine), and on completion control returns to instruction i+1 of Program 1.]
The processor must inform the device that its request has been recognized so that the
device may remove its interrupt-request signal. This may be accomplished by means of a
special control signal on the bus, an interrupt-acknowledge signal. Alternatively, the
execution of an instruction in the interrupt-service routine that accesses a status or data
register in the device interface implicitly informs the device that its interrupt request has
been recognized.
So far, treatment of an interrupt-service routine is very similar to that of a
subroutine. An important departure from this similarity should be noted. A subroutine
performs a function required by the program from which it is called. However, the
interrupt-service routine may not have anything in common with the program being
executed at the time the interrupt request is received. In fact, the two programs often
belong to different users. Therefore, before starting execution of the interrupt-service
routine, any information that may be altered during the execution of that routine must be
saved. This information must be restored before execution of the interrupt program is
resumed. In this way, the original program can continue execution without being affected
in any way by the interruption, except for the time delay. The information that needs to
be saved and restored typically includes the condition code flags and the contents of any
registers used by both the interrupted program and the interrupt-service routine.
The task of saving and restoring information can be done automatically by the
processor or by program instructions. Most modern processors save only the minimum
amount of information needed to maintain the integrity of program execution. This is
because the process of saving and restoring registers involves memory transfers that
increase the total execution time, and hence represent execution overhead. Saving
registers also increases the delay between the time an interrupt request is received and the
start of execution of the interrupt-service routine. This delay is called interrupt latency.
When several devices share a single interrupt-request line, the line is driven through
open-collector gates, so that closing one or more switches will cause the line voltage to
drop to 0. The value of INTR is then the logical OR of the requests from individual
devices, that is,
INTR = INTR1 + INTR2 + … + INTRn
After saving the contents of the program counter (PC) and the processor status register
(PS) on the stack, the processor performs the equivalent of executing an Interrupt-disable
instruction. It is often the case that one bit in the PS register, called Interrupt-enable,
indicates whether interrupts are enabled.
In the third option, the processor has a special interrupt-request line for which the
interrupt-handling circuit responds only to the leading edge of the signal. Such a line is
said to be edge-triggered.
Before proceeding to study more complex aspects of interrupts, let us summarize
the sequence of events involved in handling an interrupt request from a single device.
Assuming that interrupts are enabled, the following is a typical scenario.
1. The device raises an interrupt request.
2. The processor interrupts the program currently being executed.
3. Interrupts are disabled by changing the control bits in the PS (except in the case of
edge-triggered interrupts).
4. The device is informed that its request has been recognized, and in response, it
deactivates the interrupt-request signal.
5. The action requested by the interrupt is performed by the interrupt-service routine.
6. Interrupts are enabled and execution of the interrupted program is resumed.
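To make the ordering concrete, here is a toy C walk-through of these six steps; the interrupt-request line and the Interrupt-enable bit of the PS are simulated with plain variables, so this is only a model of the protocol, not real hardware.

#include <stdio.h>
#include <stdbool.h>

static bool intr_request = false;   /* device's interrupt-request line */
static bool int_enable = true;      /* Interrupt-enable bit in the PS */

static void interrupt_service_routine(void) {
    intr_request = false;   /* step 4: device informed, request deactivated */
    printf("4: request acknowledged and deactivated\n");
    printf("5: requested action performed by the ISR\n");
}

int main(void) {
    intr_request = true;    /* step 1: device raises a request */
    printf("1: device raises an interrupt request\n");
    if (int_enable && intr_request) {
        printf("2: processor interrupts the current program\n");
        int_enable = false; /* step 3: interrupts disabled via the PS */
        printf("3: interrupts disabled\n");
        interrupt_service_routine();
        int_enable = true;  /* step 6: interrupts re-enabled */
        printf("6: interrupted program resumes\n");
    }
    return 0;
}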
Vectored Interrupts:-
To reduce the time involved in the polling process, a device requesting an interrupt may
identify itself directly to the processor by sending a special code over the bus. The code's
length is typically in the range of 4 to 8 bits. The remainder of the address is supplied by
the processor based on the area in its memory where the addresses for interrupt-service
routines are located.
This arrangement implies that the interrupt-service routine for a given device
must always start at the same location. The programmer can gain some flexibility by
storing in this location an instruction that causes a branch to the appropriate routine.
Interrupt Nesting: -
The processor’s priority is usually encoded in a few bits of the processor status
word. It can be changed by program instructions that write into the PS. These are
privileged instructions, which can be executed only while the processor is running in the
supervisor mode. The processor is in the supervisor mode only when executing operating
system routines. It switches to the user mode before beginning to execute application
programs. Thus, a user program cannot accidentally, or intentionally, change the priority
of the processor and disrupt the system’s operation. An attempt to execute a privileged
instruction while in the user mode leads to a special type of interrupt called a privilege
exception.
[Figure 2: Implementation of interrupt priority using individual interrupt-request lines (INTR1 through INTRp) and interrupt-acknowledge lines (INTA1 through INTAp), one pair for each of Device 1 through Device p.]
Simultaneous Requests:-
Polling the status registers of the I/O devices is the simplest such mechanism. In
this case, priority is determined by the order in which the devices are polled. When
vectored interrupts are used, we must ensure that only one device is selected to send its
interrupt vector code. A widely used scheme is to connect the devices to form a daisy
chain, as shown in figure 3a. The interrupt-request line INTR is common to all devices.
The interrupt-acknowledge line, INTA, is connected in a daisy-chain fashion, such that
the INTA signal propagates serially through the devices.
[Figure 3: (a) Daisy-chain arrangement: all devices share a common INTR line, and the INTA signal propagates serially from device to device. (b) Arrangement of priority groups: devices are organized into groups connected to a priority arbitration circuit at different priority levels (INTR1/INTA1 through INTRp/INTAp), with a daisy chain within each group.]
When several devices raise an interrupt request and the INTR line is activated,
the processor responds by setting the INTA line to 1. This signal is received by device 1.
Device 1 passes the signal on to device 2 only if it does not require any service. If device
1 has a pending request for interrupt, it blocks the INTA signal and proceeds to put its
identifying code on the data lines. Therefore, in the daisy-chain arrangement, the device
that is electrically closest to the processor has the highest priority. The second device
along the chain has second highest priority, and so on.
The scheme in figure 3.a requires considerably fewer wires than the individual
connections in figure 2. The main advantage of the scheme in figure 2 is that it allows the
processor to accept interrupt requests from some devices but not from others, depending
upon their priorities. The two schemes may be combined to produce the more general
structure in figure 3b. Devices are organized in groups, and each group is connected at a
different priority level. Within a group, devices are connected in a daisy chain. This
organization is used in many computer systems.
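The daisy-chain rule itself is simple enough to express in a few lines of C. The sketch below passes the acknowledge down the chain until it reaches the first device with a pending request; the device positions are illustrative.

#include <stdio.h>
#include <stdbool.h>

/* pending[i] is true if device i+1 has a pending interrupt request.
   Returns the 1-based number of the device that captures INTA, or 0. */
int daisy_chain_ack(const bool pending[], int n) {
    for (int i = 0; i < n; i++) {
        if (pending[i])
            return i + 1;   /* this device blocks INTA and sends its vector */
        /* otherwise the device passes INTA on to the next one in the chain */
    }
    return 0;               /* no device was requesting */
}

int main(void) {
    bool pending[] = {false, true, true};   /* devices 2 and 3 are requesting */
    printf("Device %d is acknowledged\n", daisy_chain_ack(pending, 3));
    return 0;
}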
Until now, we have assumed that an I/O device interface generates an interrupt
request whenever it is ready for an I/O transfer, for example whenever the SIN flag is 1.
It is important to ensure that interrupt requests are generated only by those I/O devices
that are being used by a given program. Idle devices must not be allowed to generate
interrupt requests, even though they may be ready to participate in I/O transfer
operations. Hence, we need a mechanism in the interface circuits of individual devices to
control whether a device is allowed to generate an interrupt request.
This is usually done by means of an interrupt-enable bit in the device's interface circuit.
When this bit is set to 1, the device is allowed to generate an interrupt request; if the
interrupt-enable bit is equal to 0, the interface circuit will not generate an interrupt
request, regardless of the state of the status flag.
3.6 EXCEPTIONS:
An interrupt is an event that causes the execution of one program to be suspended
and the execution of another program to begin. So far, we have dealt only with interrupts
caused by requests received during I/O data transfers. However, the interrupt mechanism
is used in a number of other situations.
The term exception is often used to refer to any event that causes an interruption.
Hence, I/O interrupts are one example of an exception. We now describe a few other
kinds of exceptions.
Computers use a variety of techniques to ensure that all hardware components are
operating properly. For example, many computers include an error-checking code in the
main memory, which allows detection of errors in the stored data. If errors occur, the
control hardware detects them and informs the processor by raising an interrupt.
Debugging:
Another important type of exception is used as an aid in debugging programs.
System software usually includes a program called a debugger, which helps the
programmer find errors in a program. The debugger uses exceptions to provide two
important facilities called trace and breakpoints.
When the processor operates in the trace mode, an exception occurs after the execution of
every instruction, using the debugging program as the exception-service routine; this lets
the user examine the results of each instruction as it executes. Breakpoints provide a
similar facility, except that the program being debugged is interrupted only at specific
points selected by the user. An instruction called Trap or
Software-interrupt is usually provided for this purpose. Execution of this instruction
results in exactly the same actions as when a hardware interrupt request is received.
While debugging a program, the user may wish to interrupt program execution after
instruction i. The debugging routine saves instruction i+1 and replaces it with a software
interrupt instruction. When the program is executed and reaches that point, it is
interrupted and the debugging routine is activated. This gives the user a chance to
examine memory and register contents. When the user is ready to continue executing the
program being debugged, the debugging routine restores the saved instruction that was at
location i+1 and executes a Return-from-interrupt instruction.
Privilege Exception:
To protect the operating system of a computer from being corrupted by user
programs, certain instructions can be executed only while the processor is in supervisor
mode. These are called privileged instructions. For example, when the processor is
running in the user mode, it will not execute an instruction that changes the priority level
of the processor or that enables a user program to access areas in the computer memory
that have been allocated to other users. An attempt to execute such an instruction will
produce a privilege exception, causing the processor to switch to the supervisor mode
and begin executing an appropriate routine in the operating system.
Move DATAIN, R0
An instruction to transfer input or output data is executed only after the processor
determines that the I/O device is ready. To do this, the processor either polls a status flag
in the device interface or waits for the device to send an interrupt request. In either case,
considerable overhead is incurred, because several program instructions must be executed
for each data word transferred. In addition to polling the status register of the device,
instructions are needed for incrementing the memory address and keeping track of the
word count. When interrupts are used, there is the additional overhead associated with
saving and restoring the program counter and other state information.
A special control unit may be provided to allow the transfer of a block of data directly
between an external device and the main memory, without continuous intervention by the
processor. This approach is called direct memory access, or DMA.
DMA transfers are performed by a control circuit that is part of the I/O device
interface. We refer to this circuit as a DMA controller. The DMA controller performs the
functions that would normally be carried out by the processor when accessing the main
memory. For each word transferred, it provides the memory address and all the bus
signals that control data transfer. Since it has to transfer blocks of data, the DMA
controller must increment the memory address for successive words and keep track of the
number of transfers.
While a DMA transfer is taking place, the program that requested the transfer
cannot continue, and the processor can be used to execute another program. After the
DMA transfer is completed, the processor can return to the program that requested the
transfer.
I/O operations are always performed by the operating system of the computer in
response to a request from an application program. The OS is also responsible for
suspending the execution of one program and starting another. Thus, for an I/O operation
involving DMA, the OS puts the program that requested the transfer in the Blocked state,
initiates the DMA operation, and starts the execution of another program. When the
transfer is completed, the DMA controller informs the processor by sending an interrupt
request. In response, the OS puts the suspended program in the Runnable state so that it
can be selected by the scheduler to continue execution.
Figure 4 shows an example of the DMA controller registers that are accessed by
the processor to initiate transfer operations. Two registers are used for storing the
starting address and the word count. The third register contains status and control flags.
[Figure 4: Registers in a DMA interface. A 32-bit status and control register holds the flags IRQ (bit 31), IE (bit 30), Done (bit 1), and R/W (bit 0); the other two registers hold the starting address and the word count.]
[Figure 5: Use of DMA controllers in a computer system. The processor and the main memory share the system bus with DMA-equipped device controllers (e.g., a disk/DMA controller) and other devices such as a printer and a keyboard.]
The R/W bit determines the direction of the transfer. When this bit is set to 1 by a
program instruction, the controller performs a read operation, that is, it transfers data
from the memory to the I/O device. Otherwise, it performs a write operation. When the
controller has completed transferring a block of data and is ready to receive another
command, it sets the Done flag to 1. Bit 30 is the Interrupt-enable flag, IE. When this flag
is set to 1, it causes the controller to raise an interrupt after it has completed transferring a
block of data. Finally, the controller sets the IRQ bit to 1 when it has requested an
interrupt.
To start a DMA transfer of a block of data from the main memory to one of the
disks, a program writes the address and word count information into the registers of the
corresponding channel of the disk controller. It also provides the disk controller with
information to identify the data for future retrieval. The DMA controller proceeds
independently to implement the specified operation. When the DMA transfer is
completed, this fact is recorded in the status and control register of the DMA channel by
setting the Done bit. At the same time, if the IE bit is set, the controller sends an interrupt
request to the processor and sets the IRQ bit. The status register can also be used to
record other information, such as whether the transfer took place correctly or errors
occurred.
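The following C fragment sketches how a program might start such a transfer and poll for completion. The bit positions follow figure 4, but the register layout, names, and memory-mapped access are hypothetical; a real controller's programming model may differ.

#include <stdint.h>

/* Hypothetical memory-mapped register block of one DMA channel. */
typedef struct {
    volatile uint32_t status_ctrl;   /* bit 31: IRQ, bit 30: IE, bit 1: Done, bit 0: R/W */
    volatile uint32_t start_addr;    /* starting memory address of the block */
    volatile uint32_t word_count;    /* number of words to transfer */
} dma_channel;

#define DMA_RW   (1u << 0)    /* 1 = read: transfer from memory to the device */
#define DMA_DONE (1u << 1)
#define DMA_IE   (1u << 30)
#define DMA_IRQ  (1u << 31)

/* Start a memory-to-device transfer and busy-wait on Done (no interrupt used). */
void dma_transfer(dma_channel *ch, uint32_t addr, uint32_t nwords) {
    ch->start_addr = addr;
    ch->word_count = nwords;
    ch->status_ctrl = DMA_RW;              /* IE left 0, so we poll instead */
    while ((ch->status_ctrl & DMA_DONE) == 0)
        ;                                  /* controller sets Done when finished */
}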
Memory accesses by the processor and the DMA controller are interwoven.
Requests by DMA devices for using the bus are always given higher priority than
processor requests. Among different DMA devices, top priority is given to high-speed
peripherals such as a disk, a high-speed network interface, or a graphics display device.
Since the processor originates most memory access cycles, the DMA controller can be
said to “steal” memory cycles from the processor. Hence, the interweaving technique is
usually called cycle stealing. Alternatively, the DMA controller may be given exclusive
access to the main memory to transfer a block of data without interruption. This is known
as block or burst mode.
Most DMA controllers incorporate a data storage buffer. In the case of the
network interface in figure 5 for example, the DMA controller reads a block of data from
the main memory and stores it into its input buffer. This transfer takes place using burst
mode at a speed appropriate to the memory and the computer bus. Then, the data in the
buffer are transmitted over the network at the speed of the network.
A conflict may arise if both the processor and a DMA controller or two DMA
controllers try to use the bus at the same time to access the main memory. To resolve
these conflicts, an arbitration procedure is implemented on the bus to coordinate the
activities of all devices requesting memory transfers.
Bus Arbitration:-
The device that is allowed to initiate data transfers on the bus at any given time is
called the bus master. When the current master relinquishes control of the bus, another
device can acquire this status. Bus arbitration is the process by which the next device to
become the bus master is selected and bus mastership is transferred to it. The selection of
the bus master must take into account the needs of various devices by establishing a
priority system for gaining access to the bus.
Centralized Arbitration:-
The bus arbiter may be the processor or a separate unit connected to the bus. Consider a
basic arrangement in which the processor contains the bus arbitration circuitry. In this
case, the processor is normally the bus master unless it grants bus mastership to one of
the DMA controllers. A DMA controller indicates that it needs to become the bus master
by activating the Bus-Request line, BR. The signal on the Bus-Request line is the logical
OR of the bus requests from all the devices connected to it. When Bus-Request is
activated, the processor activates the Bus-Grant signal, BG1, indicating to the DMA
controllers that they may use the bus when it becomes free. This signal is connected to all
DMA controllers using a daisy-chain arrangement. Thus, if DMA controller 1 is
requesting the bus, it blocks the propagation of the grant signal to other devices.
Otherwise, it passes the grant downstream by asserting BG2. The current bus master
indicates to all devices that it is using the bus by activating another open-collector line
called Bus-Busy, BBSY. Hence, after receiving the Bus-Grant signal, a DMA controller
waits for Bus-Busy to become inactive, then assumes mastership of the bus. At this time,
it activates Bus-Busy to prevent other devices from using the bus at the same time.
Distributed Arbitration:-
[Figure 6: A distributed arbitration scheme. All device interfaces share the open-collector (O.C.) lines ARB3 through ARB0 (pulled up to Vcc) and a Start-Arbitration line; the example shows the interface circuit for device A, with ID 0101, competing against ID 0111.]
Distributed arbitration means that all devices waiting to use the bus have equal
responsibility in carrying out the arbitration process, without using a central arbiter. A
simple method for distributed arbitration is illustrated in figure 6. Each device on the bus
is assigned a 4-bit identification number. When one or more devices request the bus, they
assert the Start-Arbitration signal and place their 4-bit ID numbers on four open-collector
lines, ARB3 through ARB0. Because the lines are of the open-collector type, the pattern
that appears on them is the logical OR of all the transmitted ID numbers. Each device
compares this pattern with its own ID, starting from the most significant bit; when a
device detects a 1 on a line where its own ID has a 0, it stops driving that line and all
lower-order lines. The code that remains on the lines is the ID of the winning device, that
is, the contender with the highest ID number.
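The bit-by-bit interaction can be simulated in C as shown below. The open-collector lines are modeled by OR-ing the bits that the still-contending devices drive; the IDs are made-up examples.

#include <stdio.h>

/* Simulate distributed arbitration among up to 8 contenders.
   Returns the winning 4-bit ID, which is the highest one. */
unsigned arbitrate(const unsigned ids[], int n) {
    unsigned driving[8];
    for (int i = 0; i < n; i++)
        driving[i] = ids[i];
    for (int bit = 3; bit >= 0; bit--) {        /* ARB3 down to ARB0 */
        unsigned line = 0;
        for (int i = 0; i < n; i++)             /* open-collector OR of drivers */
            line |= driving[i] & (1u << bit);
        for (int i = 0; i < n; i++)             /* a device seeing 1 where it has 0 */
            if (line && !(ids[i] & (1u << bit)))
                driving[i] = 0;                 /* ...stops driving the lines */
    }
    unsigned winner = 0;
    for (int i = 0; i < n; i++)
        if (driving[i] > winner)
            winner = driving[i];
    return winner;
}

int main(void) {
    unsigned ids[] = {0x5, 0x7};   /* device A (0101) contends with 0111 */
    printf("Winner ID: %X\n", arbitrate(ids, 2));   /* prints 7 */
    return 0;
}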
3.8 BUSES:
The processor, main memory, and I/O devices can be interconnected by means of
a common bus whose primary function is to provide a communication path for the
transfer of data. The bus includes the lines needed to support interrupts and arbitration. In
this section, we discuss the main features of the bus protocols used for transferring data.
A bus protocol is the set of rules that govern the behavior of the various devices connected
to the bus, specifying when to place information on the bus, when to assert control signals,
and so on. After
describing bus protocols, we will present examples of interface circuits that use these
protocols.
Synchronous Bus:-
In a synchronous bus, all devices derive timing information from a common clock line,
and the clock signal travels over the bus at a speed determined by its physical and
electrical characteristics. The clock pulse
width, t1 – t0, must be longer than the maximum propagation delay between two devices
connected to the bus. It also has to be long enough to allow all devices to decode the
address and control signals so that the addressed device (the slave) can respond at time t1.
It is important that slaves take no action or place any data on the bus before t1. The
information on the bus is unreliable during the period t0 to t1 because signals are changing
state. The addressed slave places the requested input data on the data lines at time t1.
At the end of the clock cycle, at time t2, the master strobes the data on the data
lines into its input buffer. In this context, “strobe” means to capture the values of the
data at a given instant and store them into a buffer.
[Figure 7: Timing of an input transfer on a synchronous bus, showing the bus clock, the address and command lines, and the data lines over one bus cycle, from t0 to t2, with the slave placing its data at t1.]
For data to be loaded correctly into
any storage device, such as a register built with flip-flops, the data must be available at
the input of that device for a period greater than the setup time of the device. Hence, the
period t2 - t1 must be greater than the maximum propagation time on the bus plus the
setup time of the input buffer register of the master.
A similar procedure is followed for an output operation. The master places the
output data on the data lines when it transmits the address and command information. At
time t2, the addressed device strobes the data lines and loads the data into its data buffer.
The master sends the address and command signals on the rising edge at the
beginning of clock period 1 (t0). However, these signals do not actually appear on the bus
until tAM, largely due to the delay in the bus driver circuit. A while later, at tAS, the
signals reach the slave. The slave decodes the address and at t1 sends the requested data.
Here again, the data signals do not appear on the bus until tDS. They travel toward the
master and arrive at tDM. At t2, the master loads the data into its input buffer. Hence the
period t2-tDM is the setup time for the master’s input buffer. The data must continue to be
valid after t2 for a period equal to the hold time of that buffer.
[Figure 8: A detailed timing diagram for the input transfer, showing the bus clock, the address and command lines, and the data lines between t0, t1, and t2; the data appear on the bus at tDS and reach the master at tDM.]
Multiple-Cycle transfers:-
The scheme described above results in a simple design for the device interface,
however, it has some limitations. Because a transfer has to be completed within one clock
cycle, the clock period, t2-t0, must be chosen to accommodate the longest delays on the
bus and the slowest device interface. This forces all devices to operate at the speed of the
slowest device.
Also, the processor has no way of determining whether the addressed device has
actually responded. It simply assumes that, at t2, the output data have been received by
the I/O device or the input data are available on the data lines. If, because of a
malfunction, the device does not respond, the error will not be detected.
An example of this approach is shown in figure 4.25. During clock cycle 1, the
master sends address and command information on the bus, requesting a read operation.
The slave receives this information and decodes it. On the following active edge of the
clock, that is, at the beginning of clock cycle 2, it makes a decision to respond and begins
to access the requested data. We have assumed that some delay is involved in getting the
data, and hence the slave cannot respond immediately. The data become ready and are
placed on the bus in clock cycle 3. At the same time, the slave asserts a control signal
called Slave-ready.
If the addressed device does not respond within a predefined number of clock cycles,
the master aborts the operation. This could be the result of an incorrect address or a
device malfunction.
[Figure 4.25: An input transfer using multiple clock cycles. The address and command are sent during clock cycle 1; the slave places the data on the bus and asserts Slave-ready during cycle 3; the transfer completes in cycle 4.]
ASYNCHRONOUS BUS:-
An alternative scheme for controlling data transfers on the bus is based on the use
of a handshake between the master and the slave. The concept of a handshake is a
generalization of the idea of the Slave-ready signal in figure 10. The common clock is
replaced by two timing control lines, Master-ready and Slave-ready. The first is asserted
by the master to indicate that it is ready for a transaction, and the second is a response
from the slave.
[Figure 4.26: Handshake control of data transfer during an input operation, showing the address and command lines, Master-ready, Slave-ready, and the data lines over one bus cycle, from t0 to t5.]
An example of the timing of an input data transfer using the handshake scheme is
given in figure 4.26, which depicts the following sequence of events.
t0 – The master places the address and command information on the bus, and all devices
on the bus begin to decode this information.
t1 – The master sets the Master-ready line to 1 to inform the I/O devices that the address
and command information is ready. The delay t1-t0 is intended to allow for any skew that
may occur on the bus. Skew occurs when two signals simultaneously transmitted from one
source arrive at the destination at different times. This happens because different lines of
the bus may have different propagation speeds. Thus, to guarantee that the Master-ready
signal does not arrive at any device ahead of the address and command information, the
delay t1-t0 should be larger than the maximum possible bus skew.
t2 – The selected slave, having decoded the address and command information, performs
the required input operation by placing the data from its data register on the data lines.
t3 – The Slave-ready signal arrives at the master, indicating that the input data are
available on the bus.
t4 – The master removes the address and command information from the bus. The delay
between t3 and t4 is again intended to allow for bus skew.
t5 – When the device interface receives the 1 to 0 transition of the Master-ready signal, it
removes the data and the Slave-ready signal from the bus. This completes the input
transfer.
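The sequence can be summarized as an executable trace; the C sketch below simply models the two control lines as variables and steps through the events in order, so it shows only the protocol's ordering, not real bus behavior.

#include <stdio.h>
#include <stdbool.h>

int main(void) {
    bool master_ready = false, slave_ready = false;
    printf("t0: master places address and command on the bus\n");
    master_ready = true;    /* t1: asserted after allowing for bus skew */
    printf("t1: Master-ready = %d\n", master_ready);
    printf("t2: selected slave places its data on the data lines\n");
    slave_ready = true;     /* t3: slave's response reaches the master */
    printf("t3: Slave-ready = %d; master strobes the data\n", slave_ready);
    master_ready = false;   /* t4: master removes address and command */
    printf("t4: Master-ready = %d\n", master_ready);
    slave_ready = false;    /* t5: slave sees the 1-to-0 transition */
    printf("t5: Slave-ready = %d; data removed; transfer complete\n", slave_ready);
    return 0;
}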
UNIT-4
[Figure 11: Keyboard to processor connection. Key switches feed a debouncing circuit and an encoder, which supply the character data and a Valid signal to the input interface. The interface contains the DATAIN register and the SIN flag, and connects to the processor through the data, address, R/W, Master-ready, and Slave-ready lines.]
The output of the encoder consists of the bits that represent the encoded character
and one control signal called Valid, which indicates that a key is being pressed. This
information is sent to the interface circuit, which contains a data register, DATAIN, and a
status flag, SIN. When a key is pressed, the Valid signal changes from 0 to 1, causing the
ASCII code to be loaded into DATAIN and SIN to be set to 1. The status flag SIN is
cleared to 0 when the processor reads the contents of the DATAIN register. The interface
circuit is connected to an asynchronous bus on which transfers are controlled using the
handshake signals Master-ready and Slave-ready, as indicated in figure 11. The third
control line, R/W, distinguishes read and write transfers.
Figure 12 shows a suitable circuit for an input interface. The output lines of the
DATAIN register are connected to the data lines of the bus by means of three-state
drivers, which are turned on when the processor issues a read instruction with the address
that selects this register. The SIN signal is generated by a status flag circuit. This signal is
also sent to the bus through a three-state driver. It is connected to bit D0, which means it
will appear as bit 0 of the status register. Other bits of this register do not contain valid
information. An address decoder is used to select the input interface when the high-order
31 bits of an address correspond to any of the addresses assigned to this interface.
Address bit A0 determines whether the status or the data registers is to be read when the
Master-ready signal is active. The control handshake is accomplished by activating the
Slave-ready signal when either Read-status or Read-data is equal to 1.
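In C-like terms, the interface's read-side behavior might look as follows; the assumption that A0 = 1 selects the status register (with SIN in bit D0) and A0 = 0 selects DATAIN is mine, since only the roles of the two registers are given above.

#include <stdint.h>
#include <stdbool.h>

static uint8_t DATAIN;   /* latched character from the keyboard */
static bool SIN;         /* 1 when DATAIN holds an unread character */

/* Model of a read cycle addressed to this interface. */
uint32_t interface_read(uint32_t addr) {
    if (addr & 1) {            /* assumed: A0 = 1 selects the status register */
        return SIN ? 1u : 0u;  /* SIN appears as bit D0; other bits invalid */
    } else {                   /* assumed: A0 = 0 selects the data register */
        SIN = false;           /* reading DATAIN clears the status flag */
        return DATAIN;
    }
}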
[Figure 13: Printer to processor connection. The output interface contains the DATAOUT register and the SOUT flag; it connects to the processor through the data, address, R/W, Master-ready, and Slave-ready lines, and to the printer through the data lines and the Valid and Idle handshake signals.]
Let us now consider an output interface that can be used to connect an output
device, such as a printer, to a processor, as shown in figure 13. The printer operates under
control of the handshake signals Valid and Idle in a manner similar to the handshake used
on the bus with the Master-ready and Slave-ready signals. When it is ready to accept a
character, the printer asserts its Idle signal. The interface circuit can then place a new
character on the data lines and activate the Valid signal. In response, the printer starts
printing the new character and negates the Idle signal, which in turn causes the interface
to deactivate the Valid signal.
The circuit in figure 16 has separate input and output data lines for connection to
an I/O device. A more flexible parallel port is created if the data lines to I/O devices are
bidirectional. Figure 17 shows a general-purpose parallel interface circuit that can be
configured in a variety of ways. Data lines P7 through P0 can be used for either input or
output purposes. For increased flexibility, the circuit makes it possible for some lines to
serve as inputs and some lines to serve as outputs, under program control. The
DATAOUT register is connected to these lines via three-state drivers that are controlled
by a data direction register, DDR. The processor can write any 8-bit pattern into DDR.
For a given bit, if the DDR value is 1, the corresponding data line acts as an output line;
otherwise, it acts as an input line.
[Figure 17: A general-purpose parallel interface circuit. Bus data lines D7 through D0 connect to the DATAIN and DATAOUT registers and the data direction register; port lines P7 through P0 connect to the I/O device. My-address, register-select lines RS2 through RS0, R/W, Ready, and Accept control register access; the status and control block provides control lines C1 and C2 and an INTR output.]
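Under program control, configuring such a port comes down to writing a bit pattern into DDR and then using the data registers; the sketch below models the registers as hypothetical memory-mapped variables, since the actual addresses depend on the system.

#include <stdint.h>

static volatile uint8_t DDR;       /* 1 = corresponding P line is an output */
static volatile uint8_t DATAOUT;   /* drives the lines configured as outputs */
static volatile uint8_t PORT_IN;   /* reflects the current state of P7-P0 (hypothetical) */

int main(void) {
    DDR = 0xF0;                        /* P7-P4 as outputs, P3-P0 as inputs */
    DATAOUT = 0xA0;                    /* drive pattern 1010 on P7-P4 */
    uint8_t inputs = PORT_IN & 0x0F;   /* read the four input lines P3-P0 */
    (void)inputs;                      /* placeholder use of the value */
    return 0;
}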
[Figure: Timing of a transfer on the parallel interface, showing the clock, address, R/W, data, Go, and Slave-ready signals over three clock cycles.]
SERIAL PORT:-
A Serial port is used to connect the processor to I/O devices that require
transmission of data one bit at a time. The key feature of an interface circuit for a serial
port is that it is capable of communicating in a bit-serial fashion on the device side and in
a bit-parallel fashion on the bus side. The transformation between the parallel and serial
formats is achieved with shift registers that have parallel access capability. A block
diagram of a typical serial interface is shown in figure 20. It includes the familiar
DATAIN and DATAOUT registers. The input shift register accepts bit-serial input from
the I/O device. When all 8 bits of data have been received, the contents of this shift
register are loaded in parallel into the DATAIN register. Similarly, output data in the
DATAOUT register are loaded into the output register, from which the bits are shifted
out and sent to the I/O device.
[Figure 20: A serial interface. An input shift register, driven by the receiving clock, accepts bit-serial data and loads DATAIN in parallel; DATAOUT is loaded in parallel into an output shift register, driven by the transmission clock, which produces the serial output. My-address, chip and register select lines RS1 and RS0, R/W, Ready, and Accept control bus access; the status and control block provides INTR.]
The double buffering used in the input and output paths is important. A simpler
interface could be implemented by turning DATAIN and DATAOUT into shift registers
and eliminating the separate shift registers shown in figure 20. However, this would impose awkward
restrictions on the operation of the I/O device; after receiving one character from the
serial line, the device cannot start receiving the next character until the processor reads
the contents of DATAIN. Thus, a pause would be needed between two characters to
allow the processor to read the input data. With the double buffer, the transfer of the
second character can begin as soon as the first character is loaded from the shift register
into the DATAIN register. Thus, provided the processor reads the contents of DATAIN
before the serial transfer of the second character is completed, the interface can receive a
continuous stream of serial data. An analogous situation occurs in the output path of the
interface.
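The receive-path double buffering can be modeled compactly in C; real interfaces implement this in hardware, and the names below are invented.

#include <stdint.h>
#include <stdbool.h>

static uint8_t shift_reg;      /* input shift register */
static int bits_received;
static uint8_t DATAIN;         /* parallel buffer read by the processor */
static bool sin_flag;          /* set when DATAIN holds an unread character */

/* Called once per incoming serial bit (least significant bit first). */
void receive_bit(int bit) {
    shift_reg = (uint8_t)((shift_reg >> 1) | (bit ? 0x80 : 0x00));
    if (++bits_received == 8) {
        DATAIN = shift_reg;    /* parallel load: the double buffer */
        sin_flag = true;       /* processor may now read DATAIN */
        bits_received = 0;     /* shifting of the next character can begin */
    }
}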
The processor bus is the bus defined by the signals on the processor chip
itself. Devices that require a very high-speed connection to the processor, such as the
main memory, may be connected directly to this bus. For electrical reasons, only a few
devices can be connected in this manner. The motherboard usually provides another bus
that can support more devices. The two buses are interconnected by a circuit, which we
will call a bridge, that translates the signals and protocols of one bus into those of the
other. Devices connected to the expansion bus appear to the processor as if they were
connected directly to the processor’s own bus. The only difference is that the bridge
circuit introduces a small delay in data transfers between the processor and those devices.
It is not possible to define a uniform standard for the processor bus. The structure
of this bus is closely tied to the architecture of the processor. It is also dependent on the
electrical characteristics of the processor chip, such as its clock speed. The expansion bus
is not subject to these limitations, and therefore it can use a standardized signaling
scheme. A number of standards have been developed. Some have evolved by default,
when a particular design became commercially successful. For example, IBM developed
a bus they called ISA (Industry Standard Architecture) for their personal computer known
at the time as PC AT.
Some standards have been developed through industrial cooperative efforts, even
among competing companies driven by their common self-interest in having compatible
products. In some cases, organizations such as the IEEE (Institute of Electrical and
Electronics Engineers), ANSI (American National Standards Institute), or international
bodies such as ISO (International Standards Organization) have blessed these standards
and given them an official status.
A given computer may use more than one bus standard. A typical Pentium
computer has both a PCI bus and an ISA bus, thus providing the user with a wide range
of devices to choose from.
[Figure: An example of a computer system using the PCI bus. The processor and the main memory are attached to the processor bus, and a bridge connects the processor bus to the PCI bus.]
The PCI bus is a good example of a system bus that grew out of the need for
standardization. It supports the functions found on a processor bus but in a standardized
format that is independent of any particular processor. Devices connected to the PCI bus
appear to the processor as if they were connected directly to the processor bus. They are
assigned addresses in the memory address space of the processor.
The PCI follows a sequence of bus standards that were used primarily in IBM
PCs. Early PCs used the 8-bit XT bus, whose signals closely mimicked those of Intel’s
80x86 processors. Later, the 16-bit bus used on the PC AT computers became known as
the ISA bus. Its extended 32-bit version is known as the EISA bus. Other buses
developed in the eighties with similar capabilities are the Microchannel used in IBM PCs
and the NuBus used in Macintosh computers.
The PCI was developed as a low-cost bus that is truly processor independent. Its
design anticipated a rapidly growing demand for bus bandwidth to support high-speed
disks and graphic and video devices, as well as the specialized needs of multiprocessor
systems. As a result, the PCI is still popular as an industry standard almost a decade after
it was first introduced in 1992.
Data Transfer:-
In today’s computers, most memory transfers involve a burst of data rather than
just one word. The reason is that modern processors include a cache memory. Data are
transferred between the cache and the main memory in burst of several words each. The
words involved in such a transfer are stored at successive memory locations. When the
processor (actually the cache controller) specifies an address and requests a read
operation from the main memory, the memory responds by sending a sequence of data
words starting at that address. Similarly, during a write operation, the processor sends a
memory address followed by a sequence of data words, to be written in successive
memory locations starting at the address. The PCI is designed primarily to support this
mode of operation. A read or write operation involving a single word is simply treated as
a burst of length one.
The bus supports three independent address spaces: memory, I/O, and
configuration. The first two are self-explanatory. The I/O address space is intended for
use with processors, such as Pentium, that have a separate I/O address space. However, as
noted earlier, the system designer may choose to use memory-mapped I/O even when a separate
I/O address space is available. In fact, this is the approach recommended by the PCI
standard because of its plug-and-play capability. A 4-bit command that accompanies the
address identifies which
of the three spaces is being used in a given data transfer operation.
The signaling convention on the PCI bus is similar to the one used in the examples
above, in which we assumed that the master maintains the address information on the bus
until the data transfer is completed. But this is not necessary. The address is needed only
long enough for the
slave to be selected. The slave can store the address in its internal buffer. Thus, the
address is needed on the bus for one clock cycle only, freeing the address lines to be used
for sending data in subsequent clock cycles. The result is a significant cost reduction
because the number of wires on a bus is an important cost factor. This approach is used
in the PCI bus.
At any given time, one device is the bus master. It has the right to initiate data
transfers by issuing read and write commands. A master is called an initiator in PCI
terminology. This is either a processor or a DMA controller. The addressed device that
responds to read and write commands is called a target.
Device Configuration:-
When an I/O device is connected to a computer, several actions are needed to
configure both the device and the software that communicates with it.
The PCI simplifies this process by incorporating in each I/O device interface a
small configuration ROM memory that stores information about that device. The
configuration ROMs of all devices are accessible in the configuration address space. The
PCI initialization software reads these ROMs whenever the system is powered up or
reset. In each case, it determines whether the device is a printer, a keyboard, an Ethernet
interface, or a disk controller. It can further learn about various device options and
characteristics.
Devices are assigned addresses during the initialization process. This means that
during the bus configuration operation, devices cannot be accessed based on their
address, as they have not yet been assigned one. Hence, the configuration address space
uses a different mechanism. Each device has an input signal called Initialization Device
Select, IDSEL#.
The PCI bus has gained great popularity in the PC world. It is also used in many
other computers, such as SUNs, to benefit from the wide range of I/O devices for which a
PCI interface is available. In the case of some processors, such as the Compaq Alpha, the
PCI-processor bridge circuit is built on the processor chip itself, further simplifying
system design and packaging.
SCSI Bus:-
The acronym SCSI stands for Small Computer System Interface. It refers to a
standard bus defined by the American National Standards Institute (ANSI) under the
designation X3.131. In the original specifications of the standard, devices such as disks
are connected to a computer via a 50-wire cable, which can be up to 25 meters in length
and can transfer data at rates up to 5 megabytes/s.
The SCSI bus standard has undergone many revisions, and its data transfer
capability has increased very rapidly, almost doubling every two years. SCSI-2 and
SCSI-3 have been defined, and each has several options. A SCSI bus may have eight data
lines, in which case it is called a narrow bus and transfers data one byte at a time.
Alternatively, a wide SCSI bus has 16 data lines and transfers data 16 bits at a time.
There are also several options for the electrical signaling scheme used.
Devices connected to the SCSI bus are not part of the address space of the
processor in the same way as devices connected to the processor bus. The SCSI bus is
connected to the processor bus through a SCSI controller. This controller uses DMA to
transfer data packets from the main memory to the device, or vice versa. A packet may
contain a block of data, commands from the processor to the device, or status information
about the device.
To illustrate the operation of the SCSI bus, let us consider how it may be used
with a disk drive. Communication with a disk drive differs substantially from
communication with the main memory.
Data transfers on the SCSI bus are always controlled by the target controller. To
send a command to a target, an initiator requests control of the bus and, after winning
arbitration, selects the controller it wants to communicate with and hands control of the
bus over to it. Then the controller starts a data transfer operation to receive a command
from the initiator.
The processor sends a command to the SCSI controller, which causes the
following sequence of events to take place:
1. The SCSI controller, acting as an initiator, contends for control of the bus.
2. When the initiator wins the arbitration process, it selects the target controller and
hands over control of the bus to it.
3. The target starts an output operation (from initiator to target); in response to this,
the initiator sends a command specifying the required read operation.
4. The target, realizing that it first needs to perform a disk seek operation, sends a
message to the initiator indicating that it will temporarily suspend the connection
between them. Then it releases the bus.
5. The target controller sends a command to the disk drive to move the read head to
the first sector involved in the requested read operation. Then, it reads the data
stored in that sector and stores them in a data buffer. When it is ready to begin
transferring data to the initiator, the target requests control of the bus. After it
wins arbitration, it reselects the initiator controller, thus restoring the suspended
connection.
6. The target transfers the contents of the data buffer to the initiator and then
suspends the connection again. Data are transferred either 8 or 16 bits in parallel,
depending on the width of the bus.
7. The target controller sends a command to the disk drive to perform another seek
operation. Then, it transfers the contents of the second disk sector to the initiator
as before. At the end of this transfer, the logical connection between the two
controllers is terminated.
8. As the initiator controller receives the data, it stores them into the main memory
using the DMA approach.
9. The SCSI controller sends an interrupt to the processor to inform it that the
requested operation has been completed.
This scenario shows that the messages exchanged over the SCSI bus are at a higher
level than those exchanged over the processor bus. In this context, a “higher level” means
that the messages refer to operations that may require several steps to complete,
depending on the device. Neither the processor nor the SCSI controller need be aware of
the details of operation of the particular device involved in a data transfer. In the
preceding example, the processor need not be involved in the disk seek operation.
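As an illustration only (an addition to the notes, not part of any SCSI controller interface), the short C sketch below lists the bus phases of the disk-read scenario just described; the phase names are invented for this sketch.

    #include <stdio.h>

    /* The bus phases of the disk-read scenario above, with one
       disconnect/reselect cycle per sector read. */
    int main(void)
    {
        const char *phases[] = {
            "arbitrate for the bus (initiator)",
            "select target and hand over the bus",
            "target receives the read command",
            "disconnect while the target seeks",
            "reselect initiator when data are buffered",
            "data in: transfer sector from target buffer",
            "disconnect, seek, reselect, transfer next sector",
            "initiator stores data in memory via DMA",
            "interrupt the processor: operation complete"
        };
        for (unsigned i = 0; i < sizeof phases / sizeof phases[0]; i++)
            printf("%u. %s\n", i + 1, phases[i]);
        return 0;
    }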
Universal Serial Bus (USB):-
The USB supports two speeds of operation, called low-speed (1.5 megabits/s)
and full-speed (12 megabits/s). The most recent revision of the bus specification (USB
2.0) introduced a third speed of operation, called high-speed (480 megabits/s). The USB
is quickly gaining acceptance in the marketplace, and with the addition of the high-speed
capability it may well become the interconnection method of choice for most computer
devices.
Port Limitation:-
The parallel and serial ports described in the previous section provide a general-
purpose point of connection through which a variety of low- to medium-speed devices can
be connected to a computer. For practical reasons, only a few such ports are provided in a
typical computer.
Device Characteristics:-
The kinds of devices that may be connected to a computer cover a wide range of
functionality. The speed, volume, and timing constraints associated with data transfers to
and from such devices vary significantly.
A variety of simple devices that may be attached to a computer generate data of a
similar nature – low speed and asynchronous. Computer mice and the controls and
manipulators used with video games are good examples.
Plug-and-Play:-
The plug-and-play feature means that a new device can be connected at any time
while the system is operating; the system should detect the device's existence
automatically, identify the appropriate device-driver software, and establish the
appropriate addresses and logical connections.
USB Architecture:-
The discussion above points to the need for an interconnection system that
combines low cost, flexibility, and high data-transfer bandwidth. Also, I/O devices may
be located at some distance from the computer to which they are connected. The
requirement for high bandwidth would normally suggest a wide bus that carries 8, 16, or
more bits in parallel. However, a large number of wires increases cost and complexity
and is inconvenient to the user. Also, it is difficult to design a wide bus that carries data
for a long distance because of the data skew problem discussed earlier. The amount of skew
increases with distance and limits the data rates that can be used.
A serial transmission format has been chosen for the USB because a serial bus
satisfies the low-cost and flexibility requirements. Clock and data information are
encoded together and transmitted as a single signal. Hence, there are no limitations on
clock frequency or distance arising from data skew. Therefore, it is possible to provide a
high data transfer bandwidth by using a high clock frequency. As pointed out earlier, the
USB offers three bit rates, ranging from 1.5 to 480 megabits/s, to suit the needs of
different I/O devices.
Figure: USB tree topology. The root hub in the host computer connects, through
intermediate hubs and simple point-to-point serial links, to the individual I/O devices.
The tree structure enables many devices to be connected while using only simple
point-to-point serial links. Each hub has a number of ports where devices may be
connected, including other hubs. In normal operation, a hub copies a message that it
receives from its upstream connection to all its downstream ports. As a result, a message
sent by the host computer is broadcast to all I/O devices, but only the addressed device
will respond to that message. In this respect, the USB functions in the same way as the
bus in figure 4.1. However, unlike the bus in figure 4.1, a message from an I/O device is
sent only upstream towards the root of the tree and is not seen by other devices. Hence,
the USB enables the host to communicate with the I/O devices, but it does not enable
these devices to communicate with each other.
Note how the tree structure helps meet the USB’s design objectives. The tree
makes it possible to connect a large number of devices to a computer through a few ports
(the root hub). At the same time, each I/O device is connected through a serial point-to-
point connection. This is an important consideration in facilitating the plug-and-play
feature, as we will see shortly.
The USB operates strictly on the basis of polling. A device may send a message
only in response to a poll message from the host. Hence, upstream messages do not
encounter conflicts or interfere with each other, as no two devices can send messages at
the same time. This restriction allows hubs to be simple, low-cost devices.
The mode of operation described above is observed for all devices operating at
either low speed or full speed. However, one exception has been necessitated by the
introduction of high-speed operation in USB version 2.0. Consider the situation in figure
24. Hub A is connected to the root hub by a high-speed link. This hub serves one high-
speed device, C, and one low-speed device, D. Normally, a message to device D would
be sent at low speed from the root hub. At 1.5 megabits/s, even a short message takes
several tens of microseconds. For the duration of this message, no other data transfers can
take place, thus reducing the effectiveness of the high-speed links and introducing
unacceptable delays for high-speed devices. To mitigate this problem, the USB protocol
requires that a message transmitted on a high-speed link is always transmitted at high
speed, even when the ultimate receiver is a low-speed device. Hence, a message intended
for device D is sent at high speed from the root hub to hub A, and hub A then forwards it
to device D at low speed. The latter transfer will take a long time, during which high-speed
traffic to other nodes is allowed to continue. For example, the root hub may exchange
several messages with device C while the low-speed message is being sent from hub A to
device D. During this period, the bus is said to be split between high-speed and low-
speed traffic. The message to device D is preceded and followed by special commands to
hub A to start and end the split-traffic mode of operation, respectively.
UNIT - 5
MEMORY SYSTEM
Cache Memory:-
The CPU of a computer can usually process instructions and data faster than they
can be fetched from a compatibly priced main memory unit. Thus the memory cycle time
becomes the bottleneck in the system. One way to reduce the memory access time is to use
cache memory. This is a small, fast memory that is inserted between the larger,
slower main memory and the CPU. It holds the currently active segments of a program
and its data. Because of the locality of address references, the CPU can, most of the time,
find the relevant information in the cache memory itself (a cache hit) and only infrequently
needs access to the main memory (a cache miss). With a suitable size of cache memory,
cache hit rates of over 90% are possible, leading to a cost-effective increase in the
performance of the system.
Memory Interleaving: -
This technique divides the memory system into a number of memory modules
and arranges addressing so that successive words in the address space are placed in
different modules. When requests for memory access involve consecutive addresses, the
access will be to different modules. Since parallel access to these modules is possible, the
average rate of fetching words from the Main Memory can be increased.
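As a concrete illustration (an addition to the notes, with the module count chosen arbitrarily), the following C sketch shows low-order interleaving, in which the low-order address bits select the module:

    #include <stdio.h>

    /* Low-order interleaving across 4 modules: consecutive addresses
       fall in different modules, so they can be accessed in parallel. */
    enum { NUM_MODULES = 4 };

    int main(void)
    {
        for (unsigned addr = 0; addr < 8; addr++) {
            unsigned module = addr % NUM_MODULES; /* low bits pick module */
            unsigned word   = addr / NUM_MODULES; /* high bits pick word  */
            printf("address %u -> module %u, word %u\n", addr, module, word);
        }
        return 0;
    }

Running the sketch shows addresses 0, 1, 2, 3 landing in modules 0, 1, 2, 3, so a burst of consecutive accesses keeps all four modules busy.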
Virtual Memory: -
In a virtual memory System, the address generated by the CPU is referred to as a
virtual or logical address. The corresponding physical address can be different and the
required mapping is implemented by a special memory control unit, often called the
memory management unit. The mapping function itself may be changed during program
execution according to system requirements.
Because of the distinction made between the logical (virtual) address space and
the physical address space, the former can be as large as the addressing capability
of the CPU, while the actual physical memory can be much smaller. Only the active portion of
the virtual address space is mapped onto the physical memory, and the rest of the virtual
address space is mapped onto the bulk storage device used. If the addressed information
is in the Main Memory (MM), it is accessed and execution proceeds. Otherwise, an
exception is generated, in response to which the memory management unit transfers a
contiguous block of words containing the desired word from the bulk storage unit to the
MM, displacing some block that is currently inactive. If the memory is managed in such a
way that such transfers are required relatively infrequently (i.e., the CPU will generally
find the required information in the MM), the virtual memory system can provide
reasonably good performance and succeed in creating an illusion of a large memory with
a small, inexpensive MM.
Memory chips are usually organized in the form of an array of cells, in which
each cell is capable of storing one bit of information. A row of cells constitutes a memory
word, and the cells of a row are connected to a common line referred to as the word line,
and this line is driven by the address decoder on the chip. The cells in each column are
connected to a sense/write circuit by two lines known as bit lines. The sense/write circuits
are connected to the data input/output lines of the chip. During a READ operation, the
Sense/Write circuits sense, or read, the information stored in the cells selected by a word
line and transmit this information to the output lines. During a write operation, they
receive input information and store it in the cells of the selected word.
The data input and the data output of each Sense/Write circuit are connected to a
single bi-directional data line in order to reduce the number of pins required. One control
line, the R/W (Read/Write) input, is used to specify the required operation, and another
control line, the CS (Chip Select) input, is used to select a given chip in a multichip
memory system. This circuit requires 14 external connections and, allowing 2 pins for
power supply and ground connections, can be manufactured in the form of a 16-pin chip.
It can store 16 x 8 = 128 bits.
Another type of organization for 1k x 1 format is shown below:
Figure: organization of a 1K x 1 memory chip. A 5-bit decoder drives the 32 word lines
(W0-W31) of a 32 x 32 cell array, and two 32-to-1 multiplexers (one for input, one for
output) connect the selected column to the Sense/Write circuit.
The 10-bit address is divided into two groups of 5 bits each to form the row and column
addresses for the cell array. A row address selects a row of 32 cells, all of which are
accessed in parallel. One of these, selected by the column address, is connected to the
external data lines by the input and output multiplexers. This structure can store 1024
bits and can be implemented in a 16-pin chip.
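A minimal sketch of this row/column split, added here for illustration only (the address value is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        unsigned addr10 = 0x2A5;              /* any 10-bit address   */
        unsigned row = (addr10 >> 5) & 0x1F;  /* one of 32 word lines */
        unsigned col = addr10 & 0x1F;         /* 32-to-1 column mux   */
        printf("address %u -> row %u, column %u\n", addr10, row, col);
        return 0;
    }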
Semiconductor memories may be divided into bipolar and MOS types. They may
be compared as follows.
A bipolar storage cell consists of two transistor inverters connected to implement a basic
flip-flop. The cell is connected to one word line and two bit lines as shown. Normally, the
bit lines are kept at about 1.6 V, and the word line is kept at a slightly higher voltage of
about 2.5 V. Under these conditions, the two diodes D1 and D2 are reverse biased. Thus,
because no current flows through the diodes, the cell is isolated from the bit lines.
Read Operation:-
Let us assume that Q1 on and Q2 off represents a 1. To read the contents of a given
cell, the voltage on the corresponding word line is reduced from 2.5 V to approximately
0.3 V. This causes one of the diodes D1 or D2 to become forward-biased, depending on
whether transistor Q1 or Q2 is conducting. As a result, current flows from bit line b
when the cell is in the 1 state and from bit line b' when the cell is in the 0 state. The
Sense/Write circuit at the end of each pair of bit lines monitors the current on lines b and
b' and sets the output bit line accordingly.
Write Operation: -
While a given row of bits is selected, that is, while the voltage on the
corresponding word line is 0.3 V, the cells can be individually forced to either the 1 state
or the 0 state by applying appropriate voltages to the bit lines.
Bipolar as well as MOS memory cells that use a flip-flop-like structure to store
information can maintain the information as long as power is supplied to the cell.
Such memories are called static memories. In contrast, dynamic memories require not
only a power supply, but also a periodic "refresh" to maintain the
information stored in them. Dynamic memories can have very high bit densities and very
low power consumption relative to static memories and are thus generally used to
realize the main memory unit.
Dynamic Memories:-
The basic idea of dynamic memory is that information is stored in the form of a
charge on the capacitor. An example of a dynamic memory cell is shown below:
When the transistor T is turned on and an appropriate voltage is applied to the bit
line, information is stored in the cell in the form of a known amount of charge on
the capacitor. After the transistor is turned off, the capacitor begins to discharge. This is
caused by the capacitor's own leakage resistance and the very small amount of current
that still flows through the transistor. Hence the data are read correctly only if they are read
before the charge on the capacitor drops below some threshold value. During a Read
operation, the bit line is placed in a high-impedance state, the transistor is turned on, and
a sense circuit connected to the bit line is used to determine whether the charge on the
capacitor is above or below the threshold value. During such a Read, the charge on the
capacitor is restored to its original value; thus the cell is refreshed with every read
operation.
Figure: internal organization of a dynamic memory chip. The multiplexed address lines
A7-0 supply row and column addresses (strobed by RAS and CAS); the column address
decoder and the Sense/Write circuits connect the selected cell to the DI/DO line under
control of the R/W and CS inputs.
It is important to note that the application of a row address causes all the cells on
the corresponding row to be read and refreshed during both Read and Write operations.
To ensure that the contents of a dynamic memory are maintained, each row of cells must
be addressed and refreshed periodically, typically once every few milliseconds.
Another feature available on many dynamic memory chips is that once the row
address is loaded, successive locations can be accessed by loading only column
addresses. Such block transfers can be carried out typically at a rate that is double that for
transfers involving random addresses. Such a feature is useful when memory accesses
follow a regular pattern, for example, in a graphics terminal.
Because of their high density and low cost, dynamic memories are widely used in
the main memory units of computers. Commercially available chips range in size from 1k
to 4M bits or more, and are available in various organizations like 64K x 1, 16K x 4, 1M
x 1, etc.
The choice of a RAM chip for a given application depends on several factors like
speed, power dissipation, size of the chip, availability of block transfer feature etc.
Bipolar memories are generally used when very fast operation is the primary
requirement. High power dissipation in bipolar circuits makes it difficult to realize high
bit densities.
Static MOS memory chips have higher densities and slightly longer access times
compared to bipolar chips. They have lower densities than dynamic memories but are
easier to use because they do not require refreshing.
Figure: a larger memory built from 512K x 8 memory chips. The 21-bit address
(A0-A20) is decoded so that its high-order bits select one chip and the remaining bits
select a byte within the chip.
The design of a memory system using dynamic memory chips involves some issues that
do not arise with static chips:
• The row and column parts of the address of each chip have to be multiplexed;
• A refresh circuit is needed; and
• The timing of various steps of a memory cycle must be carefully controlled.
A memory system of 256k x 16 designed using 64k x 1 DRAM chips, is shown below;
The memory unit is assumed to be connected to an asynchronous memory bus that has
18 address lines (ADRS17-0), 16 data lines (DATA15-0), two handshake signals (Memory
Request and MFC), and a Read/Write line to specify the operation (Read or Write).
The memory chips are organized into 4 rows, with each row having 16 chips.
Thus each of the 16 columns implements one bit position. The higher-order 2
bits of the address are decoded to obtain four chip-select control signals, which are used to
select one of the four rows. The remaining 16 bits, after multiplexing, are used to access
specific bit locations inside each chip of the selected row. The R/W inputs of all chips are
tied together to provide a common Read/Write control. The DI and DO lines, together,
provide D15 – D0 i.e. the data bus DATA15-0.
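The address decoding just described can be sketched as follows (an illustration added to the notes; the address value is arbitrary):

    #include <stdio.h>

    int main(void)
    {
        unsigned addr18 = 0x2ABCD;                    /* any 18-bit address        */
        unsigned chip_row  = (addr18 >> 16) & 0x3;    /* decoded to 4 chip selects */
        unsigned chip_addr = addr18 & 0xFFFF;         /* address within the chips  */
        unsigned row_part  = (chip_addr >> 8) & 0xFF; /* multiplexed first (RAS)   */
        unsigned col_part  = chip_addr & 0xFF;        /* multiplexed second (CAS)  */
        printf("chip row %u, row address %u, column address %u\n",
               chip_row, row_part, col_part);
        return 0;
    }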
Refresh Operation:-
The Refresh control block periodically generates Refresh requests, causing the
access control block to start a memory cycle in the normal way. The access control block
allows the refresh operation by activating the Refresh Grant line. It arbitrates
between Memory Access requests and Refresh requests, giving priority to Refresh requests
in the case of a tie to ensure the integrity of the stored data.
As soon as the Refresh control block receives the Refresh Grant signal, it
activates the Refresh line. This causes the address multiplexer to select the Refresh
counter as the source and its contents are thus loaded into the row address latches of all
memory chips when the RAS signal is activated. During this time the R/W line may be
low, causing an inadvertent write operation. One way to prevent this is to use the Refresh
line to control the decoder block to deactivate all the chip select lines. The rest of the
refresh cycle is the same as in a normal cycle. At the end, the Refresh control block
increments the refresh counter in preparation for the next Refresh cycle.
Even though the row address has 8 bits, the Refresh counter need only be 7 bits
wide because of the cell organization inside the memory chips. In a 64k x 1 memory chip,
the 256 x 256 cell array actually consists of two 128 x 256 arrays. The low-order 7 bits of
the row address select a row from both arrays, and thus a row in each array is
refreshed simultaneously.
Ideally, the refresh operation should be transparent to the CPU. However, the
response of the memory to a request from the CPU or from a DMA device may be
delayed if a Refresh operation is in progress. Further, in the case of a tie, Refresh
operation is given priority. Thus the normal access may be delayed. This delay will be
more if all memory rows are refreshed before the memory is returned to normal use. A
more common scheme, however, interleaves Refresh operations on successive rows with
accesses from the memory bus. In either case, Refresh operations generally use less than
5% of the available memory cycles, so the time penalty caused by refreshing is very
small.
The variability in the access times resulting from refresh can be easily
accommodated in an asynchronous bus scheme. With synchronous buses, it may be
necessary for the Refresh circuit to request bus cycles as a DMA device would!
Read-Only Memories:-
The word line is normally held at a low voltage. If a word is to be selected, the voltage of
the corresponding word line is momentarily raised, which causes all transistors whose
emitters are connected to their corresponding bit lines to be turned on. The current that
flows from the voltage supply to the bit line can be detected by a sense circuit. The bit
positions in which current is detected are read as 1s, and the remaining bits are read as 0s.
Therefore, the contents of a given word are determined by the pattern of emitter-to-bit-
line connections. Similar configurations are possible in MOS technology.
Data are written into a ROM at the time of manufacture. Programmable ROM
(PROM) devices allow the data to be loaded by the user. Programmability is achieved by
connecting a fuse between the emitter and the bit line. Thus, prior to programming, the
memory contains all 1s. The user can insert 0s at the required locations by burning out
the fuses at those locations using high-current pulses. This process is irreversible.
ROMs are attractive when high production volumes are involved. For smaller
numbers, PROMs provide a faster and considerably less expensive approach.
Some chips allow the stored data to be erased and new data to be loaded. Such a
chip is an erasable, programmable ROM, usually called an EPROM. It provides
considerable flexibility during the development phase. An EPROM cell bears
considerable resemblance to the dynamic memory cell. As in the case of dynamic
memory, information is stored in the form of a charge on a capacitor. The main
difference is that the capacitor in an EPROM cell is very well insulated. Its rate of
discharge is so low that it retains the stored information for very long periods. To write
information, a special programming voltage is applied, allowing charge to be stored on
the capacitor.
The contents of EPROM cells can be erased by increasing the discharge rate of
the storage capacitor by several orders of magnitude. This can be accomplished by
allowing ultraviolet light into the chip through a window provided for that purpose, or by
the application of a high voltage similar to that used in a write operation. If ultraviolet
light is used, all cells in the chip are erased at the same time. When electrical erasure is
used, however, the process can be made selective. An electrically erasable EPROM is
often referred to as an EEPROM (or E2PROM). However, the circuit must now include
high-voltage generation; some E2PROM chips incorporate the circuitry for generating
these voltages on the chip itself. Depending on the requirements, a suitable device can be
selected.
Figure: classification of memory devices.
Two-Level Memory Hierarchy: We will adopt the term primary level for the
smaller, faster memory and secondary level for the larger, slower memory. We will also
allow the cache to be a primary level, with the slower semiconductor main memory as the
corresponding secondary level. At a different point in the hierarchy, the same
semiconductor memory could be the primary level, with the disk as the secondary level.
A two level hierarchy and its addressing are illustrated in fig.2. A system address
is applied to the memory management unit (MMU) that handles the mapping function for
the particular pair in the hierarchy. If the MMU finds the address in the Primary level, it
provides Primary address, which selects the item from the Primary memory. This
translation must be fast, because every time memory is accessed, the system address must
be translated. The translation may fail to produce a primary address because the
requested item is not found in the primary memory; the item must then be retrieved from
the secondary level and transferred to the primary level.
Hits and Misses:- Successful translation of a reference into a primary address is called a
hit, and failure is a miss. The hit ratio h is (1 - miss ratio). If tp is the primary memory
access time and ts is the secondary access time, the average access time for the two-level
hierarchy is
ta = h tp + (1 - h) ts
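A small worked example (the access times below are assumed values, used only to show how strongly the hit ratio dominates):

    #include <stdio.h>

    int main(void)
    {
        double tp = 10.0;   /* primary access time, ns (assumed)   */
        double ts = 100.0;  /* secondary access time, ns (assumed) */
        for (double h = 0.80; h <= 1.0001; h += 0.05) {
            double ta = h * tp + (1.0 - h) * ts;   /* ta = h tp + (1-h) ts */
            printf("h = %.2f -> ta = %.1f ns\n", h, ta);
        }
        return 0;
    }

For h = 0.95, ta = 0.95 x 10 + 0.05 x 100 = 14.5 ns, already close to the primary access time.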
The figure below shows a schematic view of the cache mapping function. The mapping
function is responsible for all cache operations. It is implemented in hardware, because of
the required high speed operation. The mapping function determines the following.
Figure: the mapping function stands between the CPU and the main memory, moving
blocks between the cache and the main memory and delivering words to the CPU.
1. Associative mapped caches:- In this scheme, any block from main memory can be
placed anywhere in the cache. After being placed in the cache, a given block is identified
uniquely
by its main memory block number, referred to as the tag, which is stored inside a separate
tag memory in the cache.
Regardless of the kind of cache, a given block in the cache may or may not
contain valid information. For example, when the system has just been powered up and
before the cache has had any blocks loaded into it, all the information there is invalid.
The cache maintains a valid bit for each block to keep track of whether the information in
the corresponding block is valid.
Fig. 4 shows the various memory structures in an associative cache. The cache
itself contains 256 8-byte blocks, a 256 x 13 bit tag memory for holding the tags of the
blocks currently stored in the cache, and a 256 x 1 bit memory for storing the valid bits.
Main memory contains 8192 8-byte blocks. The figure indicates that main memory
address references are partitioned into two fields, a 3-bit word field describing the location
of the desired word in the cache line, and a 13-bit tag field describing the main memory
block number desired. The 3-bit word field becomes essentially a "cache address"
specifying where to find the word if indeed it is in the cache.
The remaining 13 bits must be compared against every 13 bit tag in the tag
memory to see if the desired word is present.
In the figure above, main memory block 2 has been stored in the 256th cache block,
and so the 256th tag entry is 2. MM block 119 has been stored in the second cache block;
the corresponding entry in tag memory is 119. MM block 421 has been stored in cache
block 0, and tag memory location 0 has been set to 421. Three valid bits have also been
set, indicating valid information in these locations.
The associative cache makes the most flexible and complete use of its capacity,
storing blocks wherever it needs to, but there is a penalty to be paid for this flexibility: the
tag memory must be searched for each memory reference.
Fig. 5 shows the mechanism of operation of the associative cache. The process
begins with the main memory address being placed in the argument register of the
(associative) tag memory (1). If there is a match (hit) (2), and if the valid bit for the block
is set (3), then the block is gated out of the cache (4), and the 3-bit offset field is used to
select the byte corresponding to the block offset field of the main memory address (5).
That byte is forwarded to the CPU (6).
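The lookup can be modeled by the sketch below (added for illustration; in hardware the 256 tag comparisons occur in parallel, and the loop here only models the result):

    #include <stdio.h>

    #define NUM_BLOCKS 256

    static unsigned tag_mem[NUM_BLOCKS];     /* 13-bit tags (MM block nos.) */
    static unsigned char valid[NUM_BLOCKS];  /* valid bits                  */

    static int assoc_lookup(unsigned addr16) /* cache block index, or -1    */
    {
        unsigned tag = (addr16 >> 3) & 0x1FFF;  /* top 13 bits of address   */
        for (int i = 0; i < NUM_BLOCKS; i++)
            if (valid[i] && tag_mem[i] == tag)
                return i;                       /* hit                      */
        return -1;                              /* miss                     */
    }

    int main(void)
    {
        tag_mem[0] = 421;                        /* MM block 421 in block 0 */
        valid[0] = 1;
        printf("%d\n", assoc_lookup((421u << 3) | 5)); /* hit  -> 0  */
        printf("%d\n", assoc_lookup(7u << 3));         /* miss -> -1 */
        return 0;
    }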
2. Direct mapped caches:- In this scheme, a given main memory block can be placed in
one and only one place in the cache. Fig. 6 shows an example of a direct-mapped cache.
For simplicity, the example again uses a 256-block x 8-byte cache and a 16-bit main
memory address. The main memory in the figure has 256 rows x 32 columns, still
yielding 256 x 32 = 8192 = 2^13 total blocks as before. Notice that the main memory
address is partitioned into 3 fields. The word field still specifies the word in the block.
The group field specifies which of the 256 cache locations the block will be in, if it is
indeed in the cache. The tag field specifies which of the 32 blocks from main memory is
actually present in the cache. Now the cache address is composed of the group field,
which specifies the address of the block location in the cache, and the word field, which
specifies the address of the word in the block. There is also a valid bit specifying whether
the information in the selected block is valid.
Fig. 6 shows block 7680, from group 0 of MM, placed in block location 0 of
the cache, with the corresponding tag set to 30. Similarly, MM block 259 is in MM group
2, column 1; it is placed in block location 2 of the cache, and the corresponding tag
memory entry is 1.
The tasks required of the direct-mapped cache in servicing a memory request
are shown in Fig. 7.
The figure shows the group field of the memory address being decoded (1) and used
to select the tag of the one cache block location in which the block must be stored if it is
in the cache. If the valid bit for that block location is set (2), then that tag is gated out (3)
and compared with the tag of the incoming memory address (4). A cache hit gates the
cache block out (5), and the word field selects the specified word from the block (6). Only
one tag needs to be compared, resulting in considerably less hardware than in the
associative memory case.
The direct mapped cache has the advantage of simplicity, but the obvious disadvantage
that only a single block from a given group can be present in the cache at any given time.
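A sketch of the direct-mapped lookup (added for illustration; note that only the single tag stored at the group location is compared):

    #include <stdio.h>

    static unsigned dm_tag[256];        /* 5-bit tags, one per group */
    static unsigned char dm_valid[256];

    static int direct_hit(unsigned addr16)
    {
        unsigned group = (addr16 >> 3) & 0xFF;  /* 8-bit group field */
        unsigned tag   = (addr16 >> 11) & 0x1F; /* 5-bit tag field   */
        return dm_valid[group] && dm_tag[group] == tag;
    }

    int main(void)
    {
        dm_tag[0] = 30;                  /* block 7680: tag 30, group 0 */
        dm_valid[0] = 1;
        printf("%d\n", direct_hit(7680u * 8));  /* hit -> 1 */
        return 0;
    }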
3. Block-set-Associative cache:-
Fig. 8 shows a 2-way set-associative cache that is similar to the direct-mapped
cache in the previous example, but with twice as many blocks in the cache, arranged so
that a set of any two blocks from each main memory group can be stored in the cache.
The main memory address is still partitioned into an 8-bit set field and a 5-bit tag field,
but now there are two possible places in which a given block can reside, and both must be
searched associatively.
The cache group address is the same as that of the direct-mapped cache: an 8-bit
block location and a 3-bit word location. Fig. 8 shows that the cache entries corresponding
to the second group contain blocks 513 and 2304. The group field, now called the set
field, is again decoded and directs the search to the correct group, and now only the tags
in the selected group must be searched. So instead of 256 comparisons, the cache only
needs to do 2.
For simplicity, the valid bits are not shown, but they must be present. The cache
hardware would be similar to that shown in Fig. 7, but there would be two simultaneous
comparisons of the two blocks in the set.
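A sketch of the two-way search (added for illustration; only the two tags of the selected set are compared):

    #include <stdio.h>

    static unsigned sa_tag[256][2];        /* two tags per set */
    static unsigned char sa_valid[256][2];

    static int set_assoc_hit(unsigned addr16)
    {
        unsigned set = (addr16 >> 3) & 0xFF;   /* 8-bit set field */
        unsigned tag = (addr16 >> 11) & 0x1F;  /* 5-bit tag field */
        for (int way = 0; way < 2; way++)      /* 2 compares, not 256 */
            if (sa_valid[set][way] && sa_tag[set][way] == tag)
                return 1;
        return 0;
    }

    int main(void)
    {
        sa_tag[1][0] = 2;                        /* e.g. MM block 513 in set 1 */
        sa_valid[1][0] = 1;
        printf("%d\n", set_assoc_hit(513u * 8)); /* hit -> 1 */
        return 0;
    }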
Solved Problems:-
1. A block-set-associative cache consists of a total of 64 blocks divided into 4-block
sets. The MM contains 4096 blocks, each containing 128 words.
a) How many bits are there in an MM address?
b) How many bits are there in each of the TAG, SET and WORD fields?
Solution:- The MM contains 4096 x 128 = 2^12 x 2^7 = 2^19 words, so an MM address
has 19 bits. The WORD field needs 7 bits (128 = 2^7 words per block). There are
64/4 = 16 = 2^4 sets, so the SET field needs 4 bits. The TAG field uses the remaining
19 - 7 - 4 = 8 bits.
Virtual Memory:-
Virtual memory is the technique of using secondary storage such as disks to extend
the apparent size of accessible memory beyond its actual physical size. Virtual memory is
implemented by employing a memory-management unit (MMU) to translate every
logical address reference into a physical address reference, as shown in Fig. 1. The MMU
is interposed between the CPU and the physical memory, where it performs these
translations under the control of the operating system. Each memory reference issued by
the CPU is translated from the logical address space to the physical address space.
Mapping tables guide the translation, again under the control of the operating system.
Figure 1: the MMU and its mapping tables sit between the CPU and the memory
hierarchy (cache, main memory, disk), translating each virtual (logical) address into a
physical address.
Virtual memory usually uses demand paging, which means that a page is moved from
disk into main memory only when the processor accesses a word on that page. Virtual
memory pages always have a place on the disk once they are created, but are copied to
main memory only on a miss or page fault.
The advantages of virtual memory include:
1. Simplified addressing: - Programs need not be relocated at load time, nor must
they be broken into fragments merely to accommodate memory limitations.
2. Cost effective use of memory: - Less expensive disk storage can replace more
expensive RAM memory, since the entire program does not need to occupy
physical memory at one time.
3. Access control: - Since each memory reference must be translated, it can be
simultaneously checked for read, write and execute privileges. This allows
hardware-level control of access to system resources and also prevents
buggy programs or intruders from causing damage to the resources of
other users or the system.
Each virtual address arriving from the CPU is added to the contents of the
segment base register in the MMU to form the physical address. The virtual address may
also optionally be compared to a segment limit register to trap references beyond a
specified limit.
If the presence bit indicates a hit, then the page field of the page table entry will
contain the physical page number. If the presence bit indicates a miss, that is, a page
fault, then the page field of the page table entry contains an address in secondary memory
where the page is stored. This miss condition also generates an interrupt. The interrupt
service routine will initiate the page fetch from secondary memory and will also
suspend the requesting process until the page has been brought into main memory. If
the CPU operation is a write hit, then the dirty bit is set. If the CPU operation is a write
miss, then the MMU will begin a write-allocate process.
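The translation path just described can be sketched as follows (the names, sizes and page-table layout are assumptions of this illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 10            /* assumed 1K-word pages */

    typedef struct {
        unsigned present : 1;       /* page is in main memory         */
        unsigned dirty   : 1;       /* set on a write hit             */
        uint32_t frame;             /* physical page no. (if present) */
    } PTE;

    static int translate(PTE *pt, uint32_t va, int is_write, uint32_t *pa)
    {
        uint32_t vpage  = va >> PAGE_BITS;
        uint32_t offset = va & ((1u << PAGE_BITS) - 1);
        if (!pt[vpage].present)
            return -1;              /* page fault: OS must fetch the page */
        if (is_write)
            pt[vpage].dirty = 1;    /* write hit sets the dirty bit       */
        *pa = (pt[vpage].frame << PAGE_BITS) | offset;
        return 0;
    }

    int main(void)
    {
        static PTE pt[8];
        uint32_t pa;
        pt[0].present = 1;
        pt[0].frame = 3;            /* virtual page 0 -> physical block 3 */
        if (translate(pt, 37, 0, &pa) == 0)
            printf("physical address %u\n", pa);  /* 3*1024 + 37 = 3109 */
        return 0;
    }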
Problems:-
1. An address space is specified by 24 bits and the corresponding memory space by 16
bits.
a) How many words are there in the address space?
b) How many words are there in the memory space?
c) If a page has 2K words, how many pages and blocks are in the system?
Solution:-
a) Address space = 24 bits
2^24 = 2^4 x 2^20 = 16M words
b) Memory space = 16 bits
2^16 = 64K words
c) A page consists of 2K words.
Number of pages in the address space = 16M/2K = 8192 pages
Number of blocks = 64K/2K = 32 blocks
2. A virtual memory has a page size of 1K words. There are 8 pages and 4 blocks. The
associative memory page table contains the following entries:
Page    Block
0       3
1       1
4       2
6       0
Make a list of virtual addresses (in decimal) that will cause a page fault when used by
the CPU.
Solution:- Pages 2, 3, 5 and 7 are not in main memory, so any address within them
faults: addresses 2048-4095 (pages 2 and 3), 5120-6143 (page 5) and 7168-8191 (page 7).
UNIT - 7
The heart of any computer is the central processing unit (CPU). The CPU
executes all the machine instructions and coordinates the activities of all other units
during the execution of an instruction. This unit is also called the Instruction Set
Processor (ISP). By looking at its internal structure, we can understand how it performs
the tasks of fetching, decoding, and executing the instructions of a program. The processor
is generally called the central processing unit (CPU) or micro processing unit (MPU). A
high-performance processor can be built by making various functional units operate in
parallel. High-performance processors have a pipelined organization, where the execution
of one instruction is started before the execution of the preceding instruction is
completed. In another approach, known as superscalar operation, several instructions are
fetched and executed at the same time. Pipelining and superscalar architectures can
provide very high performance.
A typical computing task consists of a series of steps specified by a sequence of
machine instructions that constitute a program. A program is a set of instructions
performing a meaningful task. An instruction is a command to the processor and is
executed by carrying out a sequence of sub-operations called micro-operations. Figure 1
indicates various blocks of a typical processing unit. It consists of PC, IR, ID, MAR,
MDR, a set of register arrays for temporary storage, Timing and Control unit as main
units.
Instructions are fetched and executed one after the other until a branch
instruction is encountered. The processor keeps track of the address of the memory
location containing the next instruction to be fetched using the program counter (PC) or
Instruction Pointer (IP). After fetching an instruction, the contents of the PC are updated
to point to the next instruction in the sequence. But when a branch instruction is to be
executed, the PC will be loaded with a different value (the jump/branch address).
Fig-1
Instruction register, IR is another key register in the processor, which is used to
hold the op-codes before decoding. IR contents are then transferred to an instruction
decoder (ID) for decoding. The decoder then informs the control unit about the task to be
executed. The control unit along with the timing unit generates all necessary control
signals needed for the instruction execution. Suppose that each instruction comprises 2
bytes, and that it is stored in one memory word. To execute an instruction, the processor
has to perform the following steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents
of this location are interpreted as an instruction code to be executed. Hence, they are
loaded into the IR/ID. Symbolically, this operation can be written as
IR ← [(PC)]
2. Assuming that the memory is byte addressable, increment the contents of the
PC by 2, that is,
PC ← [PC] + 2
3. Decode the instruction to understand the operation & generate the control
signals necessary to carry out the operation.
4. Carry out the actions specified by the instruction in the IR.
In cases where an instruction occupies more than one word, steps 1 and 2 must be
repeated as many times as necessary to fetch the complete instruction. These two steps
together are usually referred to as the fetch phase; step 3 constitutes the decoding phase;
and step 4 constitutes the execution phase.
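The fetch phase can be sketched in C (an illustration only; the little-endian byte order, the memory array and the execute() stub are assumptions of this sketch, not part of the processor described here):

    #include <stdint.h>
    #include <stdio.h>

    static uint8_t  mem[65536];     /* byte-addressable memory */
    static uint16_t PC, IR;

    static void execute(uint16_t ir) { (void)ir; /* decode and carry out */ }

    static void cpu_step(void)
    {
        IR = (uint16_t)(mem[PC] | (mem[PC + 1] << 8)); /* 1. IR <- [(PC)] */
        PC = (uint16_t)(PC + 2);                       /* 2. PC <- [PC]+2 */
        execute(IR);                                   /* 3,4. decode/run */
    }

    int main(void)
    {
        cpu_step();
        printf("PC after one fetch: %u\n", PC);        /* prints 2 */
        return 0;
    }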
To study these operations in detail, let us examine the internal organization of the
processor. The main building blocks of a processor are interconnected in a variety of
ways. A very simple organization is shown in Figure 2. A more complex structure that
provides high performance will be presented at the end.
Fig 2
Figure shows an organization in which the arithmetic and logic unit (ALU) and all
the registers are interconnected through a single common bus, which is internal to the
processor. The data and address lines of the external memory bus are shown in Figure 2
connected to the internal processor bus via the memory data register, MDR, and the
memory address register, MAR, respectively. Register MDR has two inputs and two
outputs. Data may be loaded into MDR either from the memory bus or from the internal
processor bus. The data stored in MDR may be placed on either bus. The input of MAR
is connected to the internal bus, and its output is connected to the external bus. The
control lines of the memory bus are connected to the instruction decoder and control logic
block. This unit is responsible for issuing the signals that control the operation of all the
units inside the processor and for interacting with the memory bus.
The number and use of the processor registers R0 through R(n - 1) vary considerably
from one processor to another. Registers may be provided for general-purpose use by the
programmer. Some may be dedicated as special-purpose registers, such as index registers
or stack pointers. Three registers, Y, Z, and TEMP in Figure 2, have not been mentioned
before. These registers are transparent to the programmer, that is, the programmer need
not be concerned with them because they are never referenced explicitly by any
instruction. They are used by the processor for temporary storage during execution of
some instructions. These registers are never used for storing data generated by one
instruction for later use by another instruction.
The multiplexer MUX selects either the output of register Y or a constant value 4 to be
provided as input A of the ALU. The constant 4 is used to increment the contents of the
program counter. We will refer to the two possible values of the MUX control input
Select as Select4 and Select Y for selecting the constant 4 or register Y, respectively.
As instruction execution progresses, data are transferred from one register to another,
often passing through the ALU to perform some arithmetic or logic operation. The
instruction decoder and control logic unit is responsible for implementing the actions
specified by the instruction loaded in the IR register. The decoder generates the control
signals needed to select the registers involved and direct the transfer of data. The
registers, the ALU, and the interconnecting bus are collectively referred to as the data
path.
With few exceptions, an instruction can be executed by performing one or more of the
following operations in some specified sequence:
1. Transfer a word of data from one processor register to another or to the ALU
2. Perform an arithmetic or a logic operation and store the result in a processor
register
3. Fetch the contents of a given memory location and load them into a processor
register
4. Store a word of data from a processor register into a given memory location
We now consider in detail how each of these operations is implemented, using the simple
processor model in Figure 2.
Instruction execution involves a sequence of steps in which data are transferred from one
register to another. For each register, two control signals are used to place the contents of
that register on the bus or to load the data on the bus into the register. This is represented
symbolically in Figure 3. The input and output of register Ri are connected to the bus via
switches controlled by the signals Riin and Riout, respectively. When Riin is set to 1, the
data on the bus are loaded into Ri. Similarly, when Riout is set to 1, the contents of
register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for
transferring data from other registers.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be
accomplished as follows:
1. Enable the output of register R1 by setting R1out to 1. This places the contents
of R1 on the processor bus.
2. Enable the input of register R4 by setting R4in to 1. This loads data from the
processor bus into register R4.
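The two gating signals can be modeled with a small sketch (an illustration added here; a real bus uses tri-state gates and edge-triggered loading, which the two assignments below only approximate):

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t reg[8];     /* R0 .. R7 */
    static uint32_t bus;

    static void clock_cycle(int r_out, int r_in)
    {
        bus = reg[r_out];       /* Riout = 1: register drives the bus      */
        reg[r_in] = bus;        /* Riin = 1: loaded at the next clock edge */
    }

    int main(void)
    {
        reg[1] = 0xABCD;
        clock_cycle(1, 4);      /* R1out, R4in: transfer R1 -> R4 */
        printf("R4 = 0x%X\n", reg[4]);
        return 0;
    }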
All operations and data transfers within the processor take place within time periods
defined by the processor clock. The control signals that govern a particular transfer are
asserted at the start of the clock cycle. In our example, R1out and R4in are set to 1. The
registers consist of edge-triggered flip-flops. Hence, at the next active edge of the clock,
the flip-flops that constitute R4 will load the data present at their inputs. At the same
time, the control signals R1out and R4in will return to 0. We will use this simple model of
the timing of data transfers for the rest of this chapter. However, we should point out that
other schemes are possible. For example, data transfers may use both the rising and
falling edges of the clock. Also, when edge-triggered flip-flops are not used, two or more
clock signals may be needed to guarantee proper transfer of data. This is known as
multiphase clocking.
An implementation for one bit of register Ri is shown in Figure 7.3 as an example. A
two-input multiplexer is used to select the data applied to the input of an edge-triggered
D flip-flop. When the control input Riin is equal to 1, the multiplexer selects the data on
the bus. This data will be loaded into the flip-flop at the rising edge of the clock. When
Riin is equal to 0, the multiplexer feeds back the value currently stored in the flip-flop.
The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is
equal to 0, the gate's output is in the high-impedance (electrically disconnected) state.
This corresponds to the open-circuit state of a switch. When Riout = 1, the gate drives the
bus to 0 or 1, depending on the value of Q.
Let us now put together the sequence of elementary operations required to execute one
instruction. Consider the instruction
Add (R3), R1
which adds the contents of a memory location pointed to by R3 to register R1. Executing
this instruction requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.
Fig 7
The listing shown in figure 7 above indicates the sequence of control steps
required to perform these operations for the single-bus architecture of Figure 2.
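In outline, the control sequence (each step of which is described in the walkthrough that follows) is:
Step 1: PCout, MARin, Read, Select4, Add, Zin
Step 2: Zout, PCin, Yin, WMFC
Step 3: MDRout, IRin
Step 4: R3out, MARin, Read
Step 5: R1out, Yin, WMFC
Step 6: MDRout, SelectY, Add, Zin
Step 7: Zout, R1in, End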
Instruction execution proceeds as follows. In step 1, the instruction fetch operation is
initiated by loading the contents of the PC into the MAR and sending a Read request to
the memory. The Select signal is set to Select4, which causes the multiplexer MUX to
select the constant 4. This value is added to the operand at input B, which is the contents
of the PC, and the result is stored in register Z. The updated value is moved from register
Z back into the PC during step 2, while waiting for the memory to respond. In step 3, the
word fetched from the memory is loaded into the IR.
Steps 1 through 3 constitute the instruction fetch phase, which is the same for all
instructions. The instruction decoding circuit interprets the contents of the IR at the
beginning of step 4. This enables the control circuitry to activate the control signals for
steps 4 through 7, which constitute the execution phase. The contents of register R3 are
transferred to the MAR in step 4, and a memory read operation is initiated.
Then the contents of R1 are transferred to register Y in step 5, to prepare for the
addition operation. When the Read operation is completed, the memory operand is
available in register MDR, and the addition operation is performed in step 6. The contents
of MDR are gated to the bus, and thus also to the B input of the ALU, and register Y is
selected as the second input to the ALU by choosing Select Y. The sum is stored in
register Z, then transferred to R1 in step 7. The End signal causes a new instruction fetch
cycle to begin by returning to step 1.
This discussion accounts for all control signals in Figure 7 except Yin in step 2.
There is no need to copy the updated contents of PC into register Y when executing the
Add instruction. But, in Branch instructions the updated value of the PC is needed to
compute the Branch target address. To speed up the execution of Branch instructions, this
value is copied into register Y in step 2. Since step 2 is part of the fetch phase, the same
action will be performed for all instructions. This does not cause any harm because
register Y is not used for any other purpose at that time.
Branch Instructions:
A branch instruction replaces the contents of the PC with the branch target
address. This address is usually obtained by adding an offset X, which is given in the
branch instruction, to the updated value of the PC. Listing in figure 8 below gives a
control sequence that implements an unconditional branch instruction. Processing starts,
as usual, with the fetch phase. This phase ends when the instruction is loaded into the IR
in step 3. The offset value is extracted from the IR by the instruction decoding circuit,
which will also perform sign extension if required. Since the value of the updated PC is
already available in register Y, the offset X is gated onto the bus in step 4, and an
addition operation is performed. The result, which is the branch target address, is loaded
into the PC in step 5.
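In outline, the control sequence for the unconditional branch shares steps 1-3 (the fetch phase) with Figure 7 and continues:
Step 4: Offset-field-of-IRout, Add, Zin
Step 5: Zout, PCin, End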
The offset X used in a branch instruction is usually the difference between the branch
target address and the address immediately following the branch instruction.
Fig 8
For example, if the branch instruction is at location 2000 and if the branch target
address is 2050, the value of X must be 46. The reason for this can be readily appreciated
from the control sequence in Figure 7. The PC is incremented during the fetch phase,
before knowing the type of instruction being executed. Thus, when the branch address is
computed in step 4, the PC value used is the updated value, which points to the
instruction following the branch instruction in the memory.
Consider now a conditional branch. In this case, we need to check the status of the
condition codes before loading a new value into the PC. For example, for a Branch-on-
negative (Branch<0) instruction, step 4 is replaced with
Offset-field-of-IRout, Add, Zin, If N = 0 then End
so that the fetch of the next instruction begins immediately if the branch is not taken;
otherwise, step 5 loads the branch target address into the PC.
Multiple-Bus Organization:
Buses A and B are used to transfer the source operands to the A and B inputs of
the ALU, where an arithmetic or logic operation may be performed. The result is
transferred to the destination over bus C. If needed, the ALU may simply pass one of its
two input operands unmodified to bus C. We will call the ALU control signals for such
an operation R=A or R=B. The three-bus arrangement obviates the need for registers Y
and Z in Figure 2.
A second feature in Figure 9 is the introduction of the incrementer unit, which is
used to increment the PC by 4. The source for the constant 4 at the ALU input
multiplexer is still useful. It can be used to increment other addresses, such as the
memory addresses in Load Multiple and Store Multiple instructions.
Fig 9
Consider the three-operand instruction
Add R4,R5,R6
Fig 10
The control sequence for executing this instruction is given in Figure 10. In step
1, the contents of the PC are passed through the ALU, using the R=B control signal, and
loaded into the MAR to start a memory read operation. At the same time the PC is
incremented by 4. Note that the value loaded into MAR is the original contents of the PC.
The incremented value is loaded into the PC at the end of the clock cycle and will not
affect the contents of MAR. In step 2, the processor waits for MFC and loads the data
received into MDR, then transfers them to IR in step 3. Finally, the execution phase of
the instruction requires only one control step to complete, step 4.
By providing more paths for data transfer a significant reduction in the number of
clock cycles needed to execute an instruction is achieved.
To execute instructions, the processor must have some means of generating the
control signals needed in the proper sequence. Computer designers use a wide variety of
techniques to solve this problem. The approaches used fall into one of two categories:
hardwired control and micro programmed control. We discuss each of these techniques in
detail, starting with hardwired control in this section.
Consider the sequence of control signals given in Figure 7. Each step in this
sequence is completed in one clock period. A counter may be used to keep track of the
control steps, as shown in Figure 11. Each state, or count, of this counter corresponds to
one control step. The required control signals are determined by the following
information:
1. Contents of the control step counter
2. Contents of the instruction register
3. Contents of the condition code flags
4. External input signals, such as MFC and interrupt requests
Fig 11
To gain insight into the structure of the control unit, we start with a simplified
view of the hardware involved. The decoder/encoder block in Figure 11 is a
combinational circuit that generates the required control outputs, depending on the state
of all its inputs. By separating the decoding and encoding functions, we obtain the more
detailed block diagram in Figure 12. The step decoder provides a separate signal line for
each step, or time slot, in the control sequence. Similarly, the output of the instruction
decoder consists of a separate line for each machine instruction. For any instruction
loaded in the IR, one of the output lines INS1 through INSm is set to 1, and all other lines
are set to 0. (For design details of decoders, refer to Appendix A.) The input signals to the
encoder block in Figure 12 are combined to generate the individual control signals Yin,
PCout, Add, End, and so on. An example of how the encoder generates the Zin control
signal for the processor organization in Figure 2 is given in Figure 13. This circuit
implements the logic function
Zin = T1 + T6 · ADD + T4 · BR + ...
That is, Zin is asserted during time slot T1 for all instructions, during T6 for an Add
instruction, during T4 for an unconditional Branch instruction, and so on.
The End signal starts a new instruction fetch cycle by resetting the control step counter to
its starting value. Figure 12 contains another control signal called RUN. When
Fig 12
set to 1, RUN causes the counter to be incremented by one at the end of every clock
cycle. When RUN is equal to 0, the counter stops counting. This is needed whenever the
WMFC signal is issued, to cause the processor to wait for the reply from the memory.
Fig 13a
The control hardware shown can be viewed as a state machine that changes from
one state to another in every clock cycle, depending on the contents of the instruction
register, the condition codes, and the external inputs. The outputs of the state machine are
the control signals. The sequence of operations carried out by this machine is determined
by the wiring of the logic elements, hence the name "hardwired." A controller that uses
this approach can operate at high speed. However, it has little flexibility, and the
complexity of the instruction set it can implement is limited.
Fig 13b
ALU is the heart of any computing system, while Control unit is its brain. The design
of a control unit is not unique; it varies from designer to designer. Some of the
commonly used control logic design methods are:
• Sequence Reg & Decoder method
• Hard-wired control method
• PLA control method
• Micro-program control method
The control signals required inside the processor can be generated using a control
step counter and a decoder/ encoder circuit. Now we discuss an alternative scheme, called
micro programmed control, in which control signals are generated by a program similar
to machine language programs.
Fig 15
First, we introduce some common terms. A control word (CW) is a word whose
individual bits represent the various control signals in Figure 12. Each of the control steps
in the control sequence of an instruction defines a unique combination of 1s and 0s in the
CW. The CWs corresponding to the 7 steps of Figure 7 are shown in Figure 15. We have
assumed that Select Y is represented by Select = 0 and Select4 by Select = 1. A sequence
of CWs corresponding to the control sequence of a machine instruction constitutes the
micro routine for that instruction, and the individual control words in this micro routine
are referred to as microinstructions.
The micro routines for all instructions in the instruction set of a computer are
stored in a special memory called the control store. The control unit can generate the
control signals for any instruction by sequentially reading the CWs of the corresponding
micro routine from the control store. This suggests organizing the control unit as shown
in Figure 16. To read the control words sequentially from the control store, a micro
program counter (µPC) is used. Every time a new instruction is loaded into the IR, the
output of the block labeled "starting address generator" is loaded into the µPC. The µPC
is then automatically incremented by the clock, causing successive microinstructions to
be read from the control store. Hence, the control signals are delivered to various parts of
the processor in the correct sequence.
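The read-out loop of Figure 16 can be sketched as follows (the store size, the End-bit position and the starting-address mapping are all assumptions of this illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define CSTORE_WORDS 1024
    #define END_BIT 1u                    /* assumed: bit 0 of the CW = End */

    static uint64_t control_store[CSTORE_WORDS];
    static unsigned uPC;

    static unsigned starting_address(uint16_t ir)
    {
        return ir & (CSTORE_WORDS - 1);   /* placeholder mapping from IR */
    }

    static void issue(uint64_t cw)        /* would drive the control lines */
    {
        printf("CW = 0x%llx\n", (unsigned long long)cw);
    }

    static void run_microroutine(uint16_t ir)
    {
        uPC = starting_address(ir);       /* loaded when IR is loaded */
        for (;;) {
            uint64_t cw = control_store[uPC++]; /* clock increments uPC */
            issue(cw);
            if (cw & END_BIT)             /* End: back to the fetch routine */
                break;
        }
    }

    int main(void)
    {
        control_store[5] = 0x10;          /* a two-step micro routine */
        control_store[6] = 0x20 | END_BIT;
        run_microroutine(5);
        return 0;
    }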
One important function of the control unit cannot be implemented by the simple
organization in Figure 16. This is the situation that arises when the control unit is
required to check the status of the condition codes or external inputs to choose between
alternative courses of action. In the case of hardwired control, this situation is handled by
the condition-code and external inputs to the encoder. In microprogrammed control, an
alternative approach is to use conditional branch microinstructions, which, in addition to
a branch address, specify which of the external inputs, condition codes, or possibly bits of
the instruction register should be checked as a condition for the branch to take place.
Fig 16
Fig 17
Fig 18
To support micro program branching, the organization of the control unit should
be modified as shown in Figure 18. The starting address generator block of Figure 16
becomes the starting and branch address generator. This block loads a new address into
the µPC when a microinstruction instructs it to do so. To allow implementation of a
conditional branch, inputs to this block consist of the external inputs and condition codes
as well as the contents of the instruction register. In this control unit, the µPC is
incremented every time a new microinstruction is fetched from the micro program
memory, except in the following situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting
address of the micro routine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is satis-
fied, the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the address
of the first CW in the micro routine for the instruction fetch cycle.
Microinstructions
A straightforward way to structure microinstructions is to assign one bit position to
each control signal, as in Figure 15. However, this scheme has one serious drawback:
assigning individual bits to each control signal results in long microinstructions, because
the number of required signals is usually large. Moreover, only a few bits are set to 1 (to
be used for active gating) in any given microinstruction, which means the available bit
space is poorly used. Consider
again the simple processor of Figure 2, and assume that it contains only four general-
purpose registers, R0, R1, R2, and R3. Some of the connections in this processor are
permanently enabled, such as the output of the IR to the decoding circuits and both inputs
to the ALU. The remaining connections to various registers require a total of 20 gating
signals. Additional control signals not shown in the figure are also needed, including the
Read, Write, Select, WMFC, and End signals. Finally, we must specify the function to be
performed by the ALU. Let us assume that 16 functions are provided, including Add,
Subtract, AND, and XOR. These functions depend on the particular ALU used and do not
necessarily have a one-to-one correspondence with the machine instruction OP codes. In
total, 42 control signals are needed.
If we use the simple encoding scheme described earlier, 42 bits would be needed
in each microinstruction. Fortunately, the length of the microinstructions can be reduced
easily. Most signals are not needed simultaneously, and many signals are mutually
exclusive. For example, only one function of the ALU can be activated at a time. The
source for a data transfer must be unique because it is not possible to gate the contents of
two different registers onto the bus at the same time. Read and Write signals to the
memory cannot be active simultaneously. This suggests that signals can be grouped so
that all mutually exclusive signals are placed in the same group. Thus, at most one micro
operation per group is specified in any microinstruction. Then it is possible to use a
binary coding scheme to represent the signals within a group. For example, four bits
suffice to represent the 16 available functions in the ALU. Register output control signals
can be placed in a group consisting of PCout, MDRout, Zout, Offsetout, R0out, R1out, R2out,
R3out, and TEMPout. Any one of these can be selected by a unique 4-bit code.
Further natural groupings can be made for the remaining signals. Figure 19
shows an example of a partial format for the microinstructions, in which each group
occupies a field large enough to contain the required codes. Most fields must include one
inactive code for the case in which no action is required. For example, the all-zero pattern
in F1 indicates that none of the registers that may be specified in this field should have its
contents placed on the bus. An inactive code is not needed in all fields. For example, F4
contains 4 bits that specify one of the 16 operations performed in the ALU. Since no
spare code is included, the ALU is active during the execution of every microinstruction.
However, its activity is monitored by the rest of the machine through register Z, which is
loaded only when the Zin signal is activated.
Grouping control signals into fields requires a little more hardware because
decoding circuits must be used to decode the bit patterns of each field into individual
control signals. The cost of this additional hardware is more than offset by the reduced
number of bits in each microinstruction, which results in a smaller control store. In Figure
19, only 20 bits are needed to store the patterns for the 42 signals.
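As a rough sketch of this idea, the C fragment below unpacks two fields of a 20-bit
microword; the bit positions, field order, and code assignments are illustrative
assumptions, not the exact layout of Figure 19.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed layout: F1 = bits 19..16 (bus source), F4 = bits 7..4 (ALU function). */
    #define F1(cw) (((cw) >> 16) & 0xFu)  /* 0 = inactive, 1 = PCout, 2 = MDRout, ... */
    #define F4(cw) (((cw) >>  4) & 0xFu)  /* 16 ALU functions, no inactive code       */

    int main(void)
    {
        uint32_t cw = 0x20030u;                   /* an arbitrary 20-bit microword      */
        printf("bus source code = %u\n", F1(cw)); /* a decoder drives one gating signal */
        printf("ALU function    = %u\n", F4(cw)); /* a decoder drives one ALU control   */
        return 0;
    }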
So far we have considered grouping and encoding only mutually exclusive
control signals. We can extend this idea by enumerating the patterns of required signals
in all possible microinstructions.
Fig 19
Each meaningful combination of active control signals can then be assigned a distinct
code that represents the microinstruction. Such full
encoding is likely to further reduce the length of micro words but also to increase the
complexity of the required decoder circuits.
Highly encoded schemes that use compact codes to specify only a small number of
control functions in each microinstruction are referred to as a vertical organization. On
the other hand, the minimally encoded scheme of Figure 15, in which many resources can
be controlled with a single microinstruction, is called a horizontal organization. The
horizontal approach is useful when a higher operating speed is desired and when the
machine structure allows parallel use of resources. The vertical approach results in
considerably slower operating speeds because more microinstructions are needed to
perform the desired control functions. Although fewer bits are required for each
microinstruction, this does not imply that the total number of bits in the control store is
smaller. The significant factor is that less hardware is needed to handle the execution of
microinstructions. Horizontal and vertical organizations represent the two organizational
extremes in micro programmed control. Many intermediate schemes are also possible, in
which the degree of encoding is a design parameter. The layout in Figure 19 is a
horizontal organization because it groups only mutually exclusive micro operations in the
same fields. As a result, it does not limit in any way the processor's ability to perform
various micro operations in parallel.
Although we have considered only a subset of all the possible control signals, this
subset is representative of actual requirements. We have omitted some details that are not
essential for understanding the principles of operation.
UNIT - 8
MULTICORES, MULTIPROCESSORS, AND CLUSTERS
8.1 PERFORMANCE:
Computer performance is characterized by the amount of useful work accomplished by a
computer system compared to the time and resources used. Depending on the context,
good computer performance may involve one or more of the following: short response
time for a given piece of work, high throughput (rate of processing work), low utilization
of computing resources, and high availability of the computing system or application.
There are a wide variety of technical performance metrics that indirectly affect overall
computer performance. Because there are too many programs to test a CPU's speed on all
of them, benchmarks were developed. The most famous benchmarks are the SPECint
and SPECfp benchmarks developed by the Standard Performance Evaluation Corporation
and the ConsumerMark benchmark developed by the Embedded Microprocessor
Benchmark Consortium (EEMBC).
The performance gain realized by a multi-core processor is limited by the fraction of the
software that can be run in parallel simultaneously on multiple cores; this effect is
described by Amdahl's law. In the best case, so-called embarrassingly parallel problems
may realize speedup factors near the number of cores, or even more if the problem is
split up enough to fit within each core's cache(s), avoiding use of much slower main
system memory. Most applications, however, are not accelerated as much unless
programmers invest a prohibitive amount of effort in re-factoring the whole problem.
Amdahl's law is a model for the relationship between the expected speedup of
parallelized implementations of an algorithm relative to the serial algorithm, under the
assumption that the problem size remains the same when parallelized. For example, if for
a given problem size a parallelized implementation of an algorithm can run 12% of the
algorithm's operations arbitrarily quickly (while the remaining 88% of the operations are
not parallelizable), Amdahl's law states that the maximum speedup of the parallelized
version is 1/(1 – 0.12) = 1.136 times as fast as the non-parallelized implementation.
More technically, the law is concerned with the speedup achievable from an
improvement to a computation that affects a proportion P of that computation where the
improvement has a speedup of S. (For example, if 30% of the computation may be the
subject of a speed up, P will be 0.3; if the improvement makes the portion affected twice
as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the
improvement will be:
Speedup = 1 / ((1 − P) + P/S)
To see how this formula was derived, assume that the running time of the old
computation was 1, for some unit of time. The running time of the new computation will
be the length of time the unimproved fraction takes (which is 1 - P), plus the length of
time the improved fraction takes. The length of time for the improved part of the
computation is the length of the improved part's former running time divided by the
speedup, making the length of time of the improved part P/S. The final speedup is
computed by dividing the old running time by the new running time, which is what the
above formula does.
Here's another example. We are given a sequential task which is split into four
consecutive parts: P1, P2, P3 and P4 with the percentages of runtime being 11%, 18%,
23% and 48% respectively. Then we are told that P1 is not sped up, so S1 = 1, while P2 is
sped up 5×, P3 is sped up 20×, and P4 is sped up 1.6×. By using the formula P1/S1 +
P2/S2 + P3/S3 + P4/S4, we find the new sequential running time is 0.11/1 + 0.18/5 +
0.23/20 + 0.48/1.6 = 0.4575, or a little less than 1⁄2 the original running time. Using the
formula (P1/S1 + P2/S2 + P3/S3 + P4/S4)^−1, the overall speed boost is 1/0.4575 = 2.186,
or a little more than double the original speed. Notice how the 20× and 5× speedups do
not have much effect on the overall speed when P1 (11%) is not sped up and P4 (48%) is
sped up only 1.6 times.
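The arithmetic above is easy to check mechanically. The short C program below (an
illustrative sketch, not part of the original example) evaluates the formula for this
four-part case:

    #include <stdio.h>

    int main(void)
    {
        double P[] = {0.11, 0.18, 0.23, 0.48}; /* runtime fractions of P1..P4 */
        double S[] = {1.0, 5.0, 20.0, 1.6};    /* per-part speedups           */
        double new_time = 0.0;

        for (int i = 0; i < 4; i++)
            new_time += P[i] / S[i];           /* P1/S1 + P2/S2 + P3/S3 + P4/S4 */

        printf("new running time = %.4f\n", new_time);       /* prints 0.4575 */
        printf("overall speedup  = %.3f\n", 1.0 / new_time); /* prints 2.186  */
        return 0;
    }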
What is MPI?
1. MPI stands for Message Passing Interface, and its standard is set by the Message
Passing Interface Forum.
2. It is a library of subroutines/functions, NOT a language.
3. MPI subroutines are callable from Fortran and C.
4. The programmer writes Fortran/C code with appropriate MPI library calls, compiles it
with a Fortran/C compiler, then links with the Message Passing library.
Why use MPI?
5. For large problems that demand better turn-around time (and access to more
memory).
6. For Fortran "dusty deck" codes, it would often be very time-consuming to rewrite
them to take advantage of parallelism. Even on SMP machines, such as the SGI
PowerChallengeArray and Origin2000, an automatic parallelizer might not be able to
detect the parallelism.
7. For distributed memory machines, such as a cluster of Unix workstations or a
cluster of NT/Linux PCs.
8. To maximize portability; MPI works on both distributed and shared memory
architectures.
•In a user code, wherever MPI library calls occur, the following header file must be
included:
#include “mpi.h” for C code or
include “mpif.h” for Fortran code
These files contain definitions of constants, prototypes, etc., which are necessary
to compile a program that contains MPI library calls.
•MPI is initiated by a call to MPI_Init. This MPI routine must be called before any
other MPI routines and it must only be called once in the program.
•MPI processing ends with a call to MPI_Finalize.
•Essentially the only difference between MPI subroutines (for Fortran programs) and
MPI functions (for C programs) is the error reporting flag. In Fortran, it is returned as
the last member of the subroutine's argument list. In C, the integer error flag is
returned through the function value. Consequently, MPI Fortran routines always
contain one more variable in the argument list than their C counterparts.
•C's MPI function names start with "MPI_" and are followed by a character string
whose leading character is in upper case and the rest in lower case. Fortran
subroutines bear the same names but are case-insensitive.
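The points above can all be seen in a minimal C program. The sketch below assumes an
MPI implementation is installed and the program is compiled with a wrapper such as
mpicc:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);               /* must precede all other MPI calls */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id                */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes        */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();                       /* must be the last MPI call        */
        return 0;
    }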
Overview
•Even though it is very difficult to further speed up a single thread or single program,
most computer systems are actually multi-tasking among multiple threads or programs.
•Techniques that would allow speedup of the overall system throughput of all tasks
would be a meaningful performance gain.
The two major techniques for throughput computing are multiprocessing and
multithreading.
Advantages:
•If a thread gets a lot of cache misses, the other thread(s) can continue, taking advantage
of the unused computing resources; this can lead to faster overall execution, as
these resources would have been idle if only a single thread were executing.
•If a thread cannot use all the computing resources of the CPU (because instructions
depend on each other's results), running another thread can avoid leaving these idle.
•If several threads work on the same set of data, they can actually share their cache,
leading to better cache usage or synchronization of its values.
Disadvantages:
•Multiple threads can interfere with each other when sharing hardware resources such as
caches or translation lookaside buffers (TLBs).
•Execution times of a single thread are not improved but can be degraded, even when
only one thread is executing, due to the slower frequencies and/or additional pipeline
stages that are necessary to accommodate thread-switching hardware.
•Hardware support for multithreading is more visible to software, thus requiring more
changes to both application programs and operating systems than multiprocessing.
The mileage thus varies; Intel claims up to a 30 percent improvement with its
Hyper-Threading technology, while a synthetic program that just performs a loop of non-
optimized, dependent floating-point operations actually gains a 100 percent speed
improvement when run in parallel. On the other hand, hand-tuned assembly
language programs using MMX or Altivec extensions and performing data pre-fetches (as
a good video encoder might) do not suffer from cache misses or idle computing
resources. Such programs therefore do not benefit from hardware multithreading and can
indeed see degraded performance due to contention for shared resources.
Terminology:
This type of multithreading is known as block, cooperative, or coarse-grained
multithreading.
Hardware cost:
The goal of multithreading hardware support is to allow quick switching between
a blocked thread and another thread ready to run. To achieve this goal, the hardware cost
is to replicate the program-visible registers, as well as some processor control registers
(such as the program counter). Switching from one thread to another then means the
hardware switches from using one register set to another.
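As a rough software analogy (all names and sizes below are illustrative), replicating the
register set means a thread switch only changes which bank is active; nothing is copied:

    #include <stdint.h>

    #define NUM_THREADS 2
    #define NUM_REGS    32

    typedef struct {
        uint32_t gpr[NUM_REGS]; /* replicated program-visible registers */
        uint32_t pc;            /* replicated program counter           */
    } RegSet;

    static RegSet bank[NUM_THREADS]; /* one register set per hardware thread */
    static int    active = 0;        /* which set the datapath is using      */

    /* A thread switch merely selects another register set. */
    void thread_switch(int next) { active = next % NUM_THREADS; }

    uint32_t read_reg(int r)              { return bank[active].gpr[r]; }
    void     write_reg(int r, uint32_t v) { bank[active].gpr[r] = v; }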
Examples
•Many families of microcontrollers and embedded processors have multiple register
banks to allow quick context switching for interrupts. Such schemes can be considered a
type of block multithreading among the user program thread and the interrupt threads.
Hardware costs:
In addition to the hardware costs discussed for interleaved multithreading, SMT
has the additional cost of each pipeline stage tracking the Thread ID of each instruction
being processed. Again, shared resources such as caches and TLBs have to be sized for
the large number of active threads being processed.
Examples
•DEC (later Compaq) EV8 (not completed)
•Intel Hyper-Threading
•IBM POWER5
•Sun Microsystems UltraSPARC T2
•MIPS MT
•CRAY XMT
SPMD vs SIMD
In SPMD, multiple autonomous processors simultaneously execute the same program at
independent points, rather than in the lockstep that SIMD imposes on different data. With
SPMD, tasks can be executed on general purpose CPUs; SIMD requires vector
processors to manipulate data streams. Note that the two are not mutually exclusive.
Distributed memory
SPMD usually refers to message passing programming on distributed
memory computer architectures. A distributed memory computer consists of a collection
of independent computers, called nodes. Each node starts its own program and
communicates with other nodes by sending and receiving messages, calling send/receive
routines for that purpose. Barrier synchronization may also be implemented by messages.
The messages can be sent by a number of communication mechanisms, such
as TCP/IP over Ethernet, or specialized high-speed interconnects such
as Myrinet and Supercomputer Interconnect. Serial sections of the program are
implemented by identical computation on all nodes rather than computing the result on
one node and sending it to the others. Nowadays, the programmer is isolated from the
details of the message passing by standard interfaces, such as PVM and MPI.
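As a small illustration of the SPMD style, the following C/MPI sketch runs the same
program on every node; node 0 sends one integer to node 1. This assumes at least two
processes and is not taken from the original text:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* every node runs this same code */
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to node 1   */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* from node 0 */
            printf("node 1 received %d from node 0\n", value);
        }
        MPI_Finalize();
        return 0;
    }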
Shared memory:
On a shared memory machine (a computer with several CPUs that access the same
memory space), messages can be sent by depositing their contents in a shared memory
area. This is often the most efficient way to program shared memory computers with a
large number of processors, especially on NUMA machines, where memory is local to
processors and accessing memory of another processor takes longer. SPMD on a shared
memory machine is usually implemented by standard (heavyweight) processes.
At the end of a parallel section, execution is synchronized: only one
processor continues, and the others wait. The current standard interface for shared
memory multiprocessing is OpenMP. It is usually implemented by lightweight processes,
called threads.
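A minimal OpenMP example in C (an illustrative sketch; compile with a flag such as
-fopenmp) shows how a single directive creates a team of such threads:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel  /* fork a team of lightweight threads */
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                     /* implicit barrier: the team joins here */
        return 0;
    }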
Bus-based
MIMD machines with shared memory have processors which share a common,
central memory. In the simplest form, all processors are attached to a bus which connects
them to memory.