
UNIT 2

Control Unit Design


Computer Memories
o There are several memories in computer: internal (cache and main
memory); external (secondary memory).



Memory Hierarchy
o Classification: cache, main (primary) memory, secondary/auxiliary
memory.



Memory Hierarchy
o Memory speed increases from bottom to top of the hierarchy, while memory size
decreases in the same order. This is a trade-off between memory size and speed.

o In a computer, memories are interfaced to each other in order of their speed, i.e. the
fastest memory (cache) stays closest to the processor and the slowest (secondary
memory) stays farthest from it.



Memory Hierarchy
o Levels of Memory:

o Level 1 or Registers: This is the memory that holds the data the CPU is
currently operating on. The most commonly used registers are the
Accumulator, Program Counter, Address Register, etc.

o Level 2 or Cache memory: It is the fastest memory, with a faster
access time, where data is temporarily stored for quicker access.

o Level 3 or Main Memory: It is the memory on which the computer
works currently. It is small in size, and once power is off the data no
longer stays in this memory.

o Level 4 or Secondary Memory: It is external memory that is not as
fast as the main memory, but data stays permanently in this memory.



Memory Hierarchy and Cache Memory

Hierarchical memory is a hardware optimization that takes advantage of
spatial and temporal locality and can be applied at several levels of the
memory hierarchy.

Paging: Paging also benefits from temporal and spatial locality. A cache
is a simple example of exploiting temporal locality, because it is a specially
designed, faster but smaller memory area, generally used to keep recently
referenced data and data near recently referenced data, which can lead to
potential performance increases.
Memory Hierarchy and Cache Levels/Types
Typical memory hierarchy (access times and cache
sizes are approximate for the purpose of discussion):
1. CPU registers (8-256 registers) – immediate access,
with the speed of the innermost core of the processor.

2. L1 CPU caches (32 KB to 512 KB) – fast access, with
the speed of the innermost memory bus owned
exclusively by each core.

3. L2 CPU caches (128 KB to 24 MB) – slightly slower
access, with the speed of the memory bus shared
between pairs of cores.

4. L3 CPU caches (2 MB to 32 MB) – even slower
access, with the speed of the memory bus shared
between even more cores of the same processor.
Memory Hierarchy and Cache Levels/Types
5. Main physical memory (RAM) (256 MB to
64 GB) – slow access, the speed of which is
limited by the spatial distances and general
hardware interfaces between the processor and
the memory modules on the motherboard.

6. Disk (virtual memory, file system) (1 GB to
256 TB) – very slow, due to the narrower (in bit
width), physically much longer data channel
between the main board of the computer and the
disk devices, and due to the extraneous software
protocol needed on top of the slow hardware
interface.

7. Remote memory (other computers or the cloud)
(practically unlimited) – speed varies from very
slow to extremely slow.
Cache Memory (SRAM) and Mapping
v If the active portions of the program and data are placed in a fast small
memory such as the cache, the average memory access time will approach
the access time of the cache.

v Although the cache is only a small fraction of the size of main memory, a
large fraction of memory requests will be found in the fast cache memory
because of the locality of reference of programs.

v Locality of reference refers to a phenomenon in which a computer program
tends to access the same set of memory locations for a particular time period.

v In particular, spatial locality of reference refers to the tendency of the
computer program to access instructions whose addresses are near one
another.
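
As a small illustration (not from the slides), the Python loop below shows both kinds of locality: the running total is reused on every iteration (temporal locality), while the list elements are accessed at consecutive addresses (spatial locality).

# Illustrative sketch only: a loop exhibiting temporal and spatial locality.
data = list(range(1024))

total = 0                      # 'total' is reused every iteration -> temporal locality
for i in range(len(data)):     # data[0], data[1], ... are adjacent -> spatial locality
    total += data[i]
print(total)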
Cache Memory Working
v When the CPU needs to access memory, the cache is examined first. If
the word is found in the cache, it is read from the cache memory.

v If the word addressed by the CPU is not found in the cache, the main
memory is accessed to read the word.

v A block of words containing the one just accessed is then transferred


from main memory to cache memory.

v The block size may vary from one word (the one just accessed) to about 16
words adjacent to the one just accessed.

v In this manner, some data are transferred to cache so that future references
to memory find the required words in the fast cache memory.
Cache Memory: Hit Ratio
v The performance of cache memory is frequently measured in
terms of a quantity called the hit ratio.

v When the CPU refers to memory and finds the word in


cache, it is said to produce a hit (or cache hit).

v If the word is not found in cache, it is in main memory and it


counts as a miss (or cache miss).

v Hit ratio = number of hits / (number of hits + number of misses)
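
As a hedged sketch, the hit ratio and the resulting average memory access time can be computed as below; the hit/miss counts and access times are assumed figures for illustration and are not taken from the slides.

# Hedged sketch: hit ratio and average memory access time for assumed figures.
hits, misses = 970, 30                      # assumed counts, for illustration only
hit_ratio = hits / (hits + misses)          # = 0.97

t_cache, t_main = 10, 100                   # assumed access times in ns
avg_access_time = hit_ratio * t_cache + (1 - hit_ratio) * t_main
print(f"hit ratio = {hit_ratio:.2f}, average access time = {avg_access_time:.1f} ns")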


Memory Mapping and Types
v Cache mapping refers to a process/scheme by which content that is
present in the main memory is brought into the cache memory.

v So, basically, the transfer of data from main memory to cache


memory is referred to as a mapping process.

v Three types of mapping procedures are of practical interest


when considering the organization of cache memory:
§ Direct mapping
§ Associative mapping
§ Set-associative mapping
1. Associative Mapping:

•The fastest and most flexible cache
organization uses an associative memory.
•It stores both the address and the content
(data) of the memory word.
•Any location in the cache can store any
word from main memory.
•A CPU address of 15 bits is placed in the
argument register and the associative
memory is searched for a matching
address.
•If a match is found, the data is read from the cache;
otherwise the word is read from main memory and the
address-data pair is stored in the cache.
•If the cache is full, a pair must be displaced, e.g. using
the FIFO replacement algorithm.
•Because an associative memory is used, this cache
organization is expensive.
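
A minimal Python sketch of this organization is given below, assuming a small cache capacity and using a dictionary to stand in for the associative search; the class name and interface are illustrative, not a reference design.

from collections import OrderedDict

# Hedged sketch of a fully associative cache with FIFO replacement.
class AssociativeCache:
    def __init__(self, size=8):
        self.size = size
        self.lines = OrderedDict()           # address -> data; insertion order = FIFO order

    def read(self, address, main_memory):
        if address in self.lines:            # associative search over all stored addresses
            return self.lines[address], "hit"
        data = main_memory[address]          # miss: read the word from main memory
        if len(self.lines) >= self.size:
            self.lines.popitem(last=False)   # cache full: evict the oldest address-data pair
        self.lines[address] = data
        return data, "miss"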
2. Direct Mapping:
A random-access memory is used to implement the cache.
The CPU address of 15 bits is divided into two fields:
the nine least significant bits constitute the index field and the
remaining six bits form the tag field.
To access main memory, we need an address that includes both the tag and the
index bits.
The number of bits in the index field is equal to the number of address bits
required to access the cache memory.
•Each word in cache consists of the data word and its associated tag.

•When a new word is first brought into the cache, the tag bits are stored
alongside the data bits.

•When the CPU generates a memory request, the index field is used as
the address to access the cache, and the tag field of the CPU address is
compared with the tag in the word read from the cache.

•If the two tags match, there is a hit and the desired data word is in
cache.
•If there is no match, there is a miss and the required word is read from
main memory.

•It is then stored in the cache together with the new tag, replacing the
previous value.
•In this example, main memory is divided into 2^6 blocks of 2^9 words each.
•For a given index, one word from each block maps to the same cache location,
and only one of them can be kept in the cache at a time.
In general:
if there are n bits in the main memory address and k
bits in the cache memory address, then main memory is divided into
2^(n-k) blocks and the size of each block is 2^k words.
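
The address split described above can be sketched in Python as follows, using the slide's figures (15-bit address, 9-bit index, 6-bit tag); the function and variable names are illustrative assumptions.

# Hedged sketch of direct mapping: 9-bit index selects one of 512 cache words,
# the remaining 6 bits are stored as the tag.
INDEX_BITS = 9
CACHE_WORDS = 1 << INDEX_BITS               # 512 words of cache

cache = [None] * CACHE_WORDS                # each entry holds a (tag, data) pair

def read(address, main_memory):
    index = address & (CACHE_WORDS - 1)     # low 9 bits select the cache word
    tag = address >> INDEX_BITS             # remaining 6 bits form the tag
    entry = cache[index]
    if entry is not None and entry[0] == tag:
        return entry[1], "hit"              # tags match: the desired word is in cache
    data = main_memory[address]             # miss: read the word from main memory ...
    cache[index] = (tag, data)              # ... and replace the previous (tag, data) pair
    return data, "miss"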
Disadvantage of Direct Mapping:

•The disadvantage of direct mapping is that the hit ratio can drop
considerably if two or more words whose addresses have the
same index but different tags are accessed repeatedly.
•However, the possibility is low, as such words are far away from
each other in main memory.
•Two words with the same index in their address but with different tag
values cannot reside in cache memory at the same time.
Example: a read from address 02000 (illustrated in the accompanying figure).
3. Set Associative Mapping

•In set-associative mapping, each word of cache can store two or more words of
memory under the same index address.
•Each data word is stored together with its tag, and the number of tag-data
items in one word of cache is said to form a set.

Two-way set associative mapping cache:

•Each index address refers to two data words and their associated tags.
•Each tag requires six bits and each data word has 12 bits,
so the word length is 2(6 + 12) = 36 bits.
•An index address of nine bits can accommodate 512 words. Thus the size of
cache memory is 512 x 36.
•It can accommodate 1024 words of main memory, since each word of cache
contains two data words.
•In general, a set-associative cache of set size k will accommodate k words of
main memory in each word of cache.
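
A hedged two-way set-associative sketch is shown below, matching the slide's figures (9-bit index, two tag-data items per set); the replacement choice inside a full set is FIFO here purely for brevity, and all names are illustrative.

# Hedged sketch of a two-way set-associative cache.
INDEX_BITS, WAYS = 9, 2
SETS = 1 << INDEX_BITS                      # 512 sets

cache = [[] for _ in range(SETS)]           # each set holds up to WAYS (tag, data) items

def read(address, main_memory):
    index = address & (SETS - 1)
    tag = address >> INDEX_BITS
    cache_set = cache[index]
    for stored_tag, data in cache_set:      # associative search within the set
        if stored_tag == tag:
            return data, "hit"
    data = main_memory[address]             # miss: read the word from main memory
    if len(cache_set) >= WAYS:
        cache_set.pop(0)                    # set full: replace one tag-data item (FIFO here)
    cache_set.append((tag, data))
    return data, "miss"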
Other Considerations

•The comparison logic is done by an associative search of the tags in the set
similar to an associative memory search: thus the name "set-associative."
•The hit ratio will improve as the set size increases because more words with
the same index but different tags can reside in cache.
•However, an increase in the set size increases the number of bits in words of
cache and requires more complex comparison logic.
•When a miss occurs in a set-associative cache and the set is full, it is
necessary to replace one of the tag-data items with a new value.
Replacement Algorithms

•The most common replacement algorithms used are:


• Random replacement
• First-in, first out (FIFO)
• Least recently used (LRU).
•With the random replacement policy the control chooses one tag-data
item for replacement at random.
•The FIFO procedure selects for replacement the item that has been in the
set the longest.
•The LRU algorithm selects for replacement the item that has been least
recently used by the CPU, i.e. the item that has remained unused for a
longer period of time than the others.
•Both FIFO and LRU can be implemented by adding a few extra bits in each
word of cache.

•The purpose of a replacement policy is to reduce cache misses and increase cache hits.


Numerical example: tracing a reference string under FIFO replacement and
under LRU replacement (LRU treats the entry that is oldest from the access
point of view as the victim); a small simulation sketch follows.
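
A hedged simulation of the two policies on an assumed reference string of block numbers (cache capacity of 3 blocks) is given below; the reference string and capacity are illustrative, not the numbers used in the lecture.

# Hedged sketch: count hits under FIFO and LRU for an assumed reference string.
def simulate(references, capacity, policy):
    cache, hits = [], 0
    for block in references:
        if block in cache:
            hits += 1
            if policy == "LRU":
                cache.remove(block)         # move the block to the most-recent position
                cache.append(block)
        else:
            if len(cache) >= capacity:
                cache.pop(0)                # evict: oldest arrival (FIFO) or least recently used (LRU)
            cache.append(block)
    return hits

refs = [1, 2, 3, 1, 4, 1, 2, 5, 1, 2]       # assumed reference string
print("FIFO hits:", simulate(refs, 3, "FIFO"))   # 3 hits
print("LRU hits:", simulate(refs, 3, "LRU"))     # 4 hits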
Pipelining
in Processor
Why use the Array Processor
• Array processors increase the overall instruction processing speed.
• As most array processors operate asynchronously from the host CPU,
they improve the overall capacity of the system.
• Array processors have their own local memory, hence providing extra memory
for systems with low memory.
SIMD Array Processors
SIMD array processors operate on a single instruction stream and multiple data
streams.
Pipelining
in Processor
Flynn's Classification of Computers

v Single Instruction stream and Single Data stream (SISD)
v Single Instruction stream and Multiple Data stream (SIMD)
v Multiple Instruction stream and Single Data stream (MISD)
v Multiple Instruction stream and Multiple Data stream (MIMD)
Flynn's Classification of Computers
Arithmetic Pipeline:
An arithmetic pipeline divides an arithmetic operation into sub-operations for
execution in various pipeline segments. It is used for floating point operations,
multiplication and various other computations.

4-Segment/sub-operations
Pipeline:
1. Compare the exponents
2. Align mantissas and choose
exponent.
3. Add or Subtract the mantissas
4. Normalise the result.
Arithmetic Pipeline:
§ First of all, the two exponents are compared and the larger of the two
is chosen as the result exponent.

§ The difference between the exponents then decides how many times the
mantissa associated with the smaller exponent must be shifted to the right.

§ After this shifting, both mantissas are aligned.

§ Finally the addition of the two mantissas takes place, followed by
normalisation of the result in the last segment.
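
The four sub-operations can be sketched in Python as below, using decimal (mantissa, exponent) pairs purely for illustration; the function name and the example values are assumptions, and only the carry-out case of normalisation is handled.

# Hedged sketch of the four floating-point addition sub-operations.
def fp_add(x, y):
    (m1, e1), (m2, e2) = x, y
    # 1. Compare the exponents and choose the larger one as the result exponent.
    exponent = max(e1, e2)
    # 2. Align the mantissas: shift the mantissa of the smaller-exponent operand right.
    m1 /= 10 ** (exponent - e1)
    m2 /= 10 ** (exponent - e2)
    # 3. Add (or subtract) the mantissas.
    mantissa = m1 + m2
    # 4. Normalise the result (only the carry-out case is handled in this sketch).
    while abs(mantissa) >= 1:
        mantissa /= 10
        exponent += 1
    return mantissa, exponent

print(fp_add((0.9504, 3), (0.8200, 2)))     # 950.4 + 82.0 -> (0.10324, 4), i.e. 1032.4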
Instruction Pipeline :
§ In an instruction pipeline, a stream of instructions can be executed by overlapping
the fetch, decode and execute phases of an instruction cycle.
§ This type of technique is used to increase the throughput of the computer
system.
§ An instruction pipeline reads instruction from the memory while previous
instructions are being executed in other segments of the pipeline.
§ Thus we can execute multiple instructions simultaneously.
§ The pipeline will be more efficient if the instruction cycle is divided into
segments of equal duration.

In the most general case, the computer needs to process each instruction in the following
sequence of steps:

• Fetch the instruction from memory (FI)


• Decode the instruction (DA)
• Calculate the effective address
• Fetch the operands from memory (FO)
• Execute the instruction (EX)
• Store the result in the memory.
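
As a hedged illustration of the overlap, the short script below prints a space-time table for a 4-segment pipeline (FI, DA, FO, EX) processing five instructions with no stalls; the segment names follow the list above, while the instruction count is an assumption.

# Hedged sketch: space-time diagram of a 4-segment instruction pipeline.
SEGMENTS = ["FI", "DA", "FO", "EX"]
NUM_INSTRUCTIONS = 5

for i in range(NUM_INSTRUCTIONS):
    row = ["  "] * (NUM_INSTRUCTIONS + len(SEGMENTS) - 1)
    for s, name in enumerate(SEGMENTS):
        row[i + s] = name                   # instruction i enters segment s at clock cycle i + s
    print(f"I{i + 1}: " + " ".join(row))
# With k = 4 segments and n = 5 instructions, the last instruction
# completes after k + n - 1 = 8 clock cycles instead of 4 * 5 = 20.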
Pipeline Hazards/Issues:
1. Resource conflicts caused by access to memory by
two segments at the same time. Most of these conflicts
can be resolved by using separate instruction and data
memories.
2. Data dependency conflicts arise when an instruction
depends on the result of a previous instruction, but this
result is not yet available.
3. Branch difficulties arise from branch and other
instructions that change the value of PC.
Branching Instruction in Pipeline:
Pipeline conflict:
Data Dependency:

A collision occurs when an instruction cannot proceed because


previous instructions did not complete certain operations. A
data dependency occurs when an instruction needs data that
are not yet available.
e.g.
an instruction in the FO segment may need to fetch an
operand that is being generated at the same time by the
previous instruction in segment EX. Therefore, the second
instruction must wait for the data to become available from the first
instruction, which delays the operation.
Dealing with data dependency:

1. Hardware Interlock: The most straightforward method is to insert


hardware interlocks. An interlock is a circuit that detects instructions
whose source operands are destinations of instructions farther up in the
pipeline. Detection of this situation causes the instruction whose source
is not available to be delayed by enough clock cycles to resolve the
conflict. This approach maintains the program sequence by using
hardware to insert the required delays.
2. Operand Forwarding:
This uses special hardware to detect a conflict and then avoid it by
routing the data through special paths between pipeline segments. For
example, instead of transferring an ALU result into a destination
register, the hardware checks the destination operand, and if it is
needed as a source in the next instruction, it passes the result directly
into the ALU input, bypassing the register file. This method requires
additional hardware paths through multiplexers as well as the circuit
that detects the conflict.
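
A hedged sketch of the forwarding check is shown below: if the destination register of the instruction currently in EX is needed as a source by the next instruction, the ALU result is routed straight to the ALU input instead of being read from the register file. The instruction format and register names are illustrative assumptions.

# Hedged sketch of operand forwarding (register bypass) between pipeline segments.
def alu_input(source_reg, registers, ex_dest_reg, ex_alu_result):
    if source_reg == ex_dest_reg:           # conflict detected by the forwarding logic
        return ex_alu_result                # bypass path: take the result directly from EX
    return registers[source_reg]            # no conflict: read the register file normally

registers = {"R1": 0, "R2": 7, "R3": 5}
# Previous instruction in EX: ADD R1, R2, R3 (result 12, not yet written back to R1).
value = alu_input("R1", registers, ex_dest_reg="R1", ex_alu_result=12)
print(value)                                # 12, taken from the bypass path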
3. Delayed Load:
A procedure employed in some computers is to give the responsibility
for solving data conflict problems to the compiler that translates the
high-level programming language into a machine language program. The
compiler for such computers is designed to detect a data conflict and
reorder the instructions as necessary to delay the loading of the
conflicting data by inserting no-operation instructions. This
method is referred to as delayed load.
Handling of Branch Instructions

1. Prefetch Target Instruction:


One way of handling a conditional branch is to prefetch the target
instruction in addition to the instruction following the branch. Both are
saved until the branch is executed. If the branch condition is successful,
the pipeline continues from the branch target instruction. An extension
of this procedure is to continue fetching instructions from both places
until the branch decision is made. At that time control chooses the
instruction stream of the correct program flow.
2. Branch target Buffer:
Another possibility is the use of a branch target buffer or BTB. The BTB is
an associative memory included in the fetch segment of the pipeline. Each
entry in the BTB consists of the address of a previously executed branch
instruction and the target instruction for that branch. It also stores the
next few instructions after the branch target instruction. When the
pipeline decodes a branch instruction, it searches the associative memory
BTB for the address of the instruction. If it is in the BTB, the instruction is
available directly and prefetch continues from the new path. If the
instruction is not in the BTB, the pipeline shifts to a new instruction
stream and stores the target instruction in the BTB. The advantage of this
scheme is that branch instructions that have occurred previously are
readily available in the pipeline without interruption.
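
A hedged sketch of the BTB as an associative lookup is given below: it maps the address of a previously executed branch to its target so that prefetching can continue from the new path without interruption. The dictionary-based structure and the names are illustrative assumptions.

# Hedged sketch of a branch target buffer (BTB) lookup in the fetch segment.
btb = {}                                    # branch instruction address -> branch target address

def next_fetch_address(branch_address, resolve_target):
    if branch_address in btb:               # branch seen before: target is available directly
        return btb[branch_address]
    target = resolve_target(branch_address) # not in the BTB: determine the target instruction
    btb[branch_address] = target            # store it so later occurrences hit in the BTB
    return target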
3. Loop Buffer: A variation of the BTB is the loop buffer. This
is a small very high speed register file maintained by the
instruction fetch segment of the pipeline. When a program
loop is detected in the program, it is stored in the loop
buffer in its entirety, including all branches. The program
loop can be executed directly without having to access
memory until the loop mode is removed by the final
branching out.
4.Branch Prediction:
Another procedure that some computers use is branch
prediction . A pipeline with branch prediction uses some
additional logic to guess the outcome of a conditional
branch instruction before it is executed. The pipeline then
begins prefetching the instruction stream from the
predicted path. A correct prediction eliminates the
wasted time caused by branch penalties.
5. Delayed Branching:
A procedure employed in most RISC processors is the delayed branch. In
this procedure, the compiler detects the branch instructions and
rearranges the machine language code sequence by inserting useful
instructions that keep the pipeline operating without interruptions. An
example of delayed branch is the insertion of a no-operation instruction
after a branch instruction. This causes the computer to fetch the target
instruction during the execution of the no operation instruction, allowing
a continuous flow of the pipeline.
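
A hedged sketch of the simplest form of this transformation is shown below: a no-operation instruction is inserted into the slot after every branch, so the target instruction can be fetched while the NOP executes. A real compiler would instead try to move a useful instruction into that slot; the instruction strings here are illustrative, not real machine code.

# Hedged sketch: fill each branch delay slot with a no-operation instruction.
def fill_delay_slots(program):
    out = []
    for instruction in program:
        out.append(instruction)
        if instruction.startswith(("BRANCH", "JUMP")):
            out.append("NOP")               # target instruction is fetched while NOP executes
    return out

print(fill_delay_slots(["LOAD R1", "ADD R2, R1", "BRANCH L1", "STORE R2"]))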
