Parallel and Distributed Systems
Performance Issues
Prepared by : Shah Murtaza Rashid Al Masud
Department of Computer Science
Najran University
Pipelining Example
Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers
Each stage takes 30 minutes to complete
Four loads of clothes (A, B, C, D) to wash, dry, and fold
Sequential Laundry
[Timing diagram: loads A–D washed, dried, and folded one after another, 6 PM to 12 AM, in 30-minute steps]
Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry
Pipelined Laundry: Start Load ASAP
[Timing diagram: loads A–D overlapped from 6 PM to 9 PM, each new load starting 30 minutes after the previous one]
Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining is to start a new task before finishing the previous one
The k stages work in parallel on k different tasks
Tasks enter/leave the pipeline at the rate of one task per time unit
[Diagram: tasks flowing through stages 1, 2, …, k, first serially and then overlapped]
Without Pipelining: one completion every k time units
With Pipelining: one completion every time unit
Pipeline Performance
In this subsection, we develop some simple measures of pipeline performance and
relative speedup (based on a discussion in [HWAN93]). The cycle time of an
instruction pipeline is the time needed to advance a set of instructions one stage
through the pipeline; each column in the following Figure represents one cycle
time. The cycle time can be determined as:
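(The equation itself did not survive extraction; in the treatment this passage follows, the cycle time is)
τ = max[τi] + d = τm + d,  1 ≤ i ≤ k
where τi is the time delay of the circuitry in stage i, τm is the maximum stage delay, k is the number of stages, and d is the latch delay needed to pass data between stages. A later slide simplifies this to t = max(ti) by ignoring d.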
Figure: Timing Diagram for Instruction Pipeline Operation
Pipeline Performance
A total of k cycles are required to complete the execution of the first instruction, and the remaining n – 1 instructions require n – 1 cycles, so the total time to execute n instructions is Tk,n = [k + (n – 1)]τ. This equation is easily verified from the Figure mentioned above: the ninth instruction completes at time cycle 14.
Pipeline Performance
Now consider a processor with equivalent functions but no pipeline, and assume that the
instruction cycle time is kτ. The speedup factor for the instruction pipeline compared to
execution without the pipeline is defined as:
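(Equation reconstructed; the standard form, consistent with the cycle counts on the next slide:)
Sk = T1,n / Tk,n = nkτ / [k + (n – 1)]τ = nk / (k + n – 1)
where T1,n = nkτ is the time to execute n instructions without the pipeline and Tk,n = [k + (n – 1)]τ is the time with a k-stage pipeline.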
Pipeline Performance Cont’d
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
Ideal speedup of a k-stage pipeline over serial execution
Sk = (serial execution in cycles) / (pipelined execution in cycles) = nk / (k + n – 1)
Sk → k for large n
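A minimal Python sketch of this ideal-speedup formula (illustration only, not part of the original slides):

def pipeline_speedup(k: int, n: int) -> float:
    """Ideal speedup of a k-stage pipeline processing n tasks (no stalls)."""
    return (n * k) / (k + n - 1)

for n in (4, 100, 10_000):
    print(f"k = 5, n = {n}: speedup = {pipeline_speedup(5, n):.2f}")
# Output approaches k (= 5) as n grows: 2.50, 4.81, 5.00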
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which …
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps
What is the single-cycle non-pipelined time?
What is the pipelined cycle time?
What is the speedup factor for pipelined execution?
Solution
Non-pipelined cycle = 200+150+200+200+150 = 900 ps
[Diagram: each instruction uses all five stages (IF, Reg, ALU, MEM, Reg) back to back, 900 ps per instruction]
Single-Cycle versus Pipelined – cont’d
Pipelined cycle time = max(200, 150) = 200 ps
[Diagram: instructions overlapped in the pipeline, a new instruction starting every 200 ps]
CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)
Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases
Speedup factor is less than 5 (the number of pipeline stages)
Because the pipeline stages are not balanced
INSTRUCTION PIPELINING
Instruction pipelining is similar to the use of an assembly line in a manufacturing plant. An
assembly line takes advantage of the fact that a product goes through various stages of
production. By laying the production process out in an assembly line, products at various
stages can be worked on simultaneously. This process is also referred to as pipelining,
because, as in a pipeline, new inputs are accepted at one end before previously accepted
inputs appear as outputs at the other end.
To apply this concept to instruction execution, we must recognize that, in fact, an instruction
has a number of stages. Figure 1, for example, breaks the instruction cycle up into 10
tasks, which occur in sequence. Clearly, there should be some opportunity for pipelining.
Figure 1 Instruction Cycle State Diagram
Pipeline cycling
Six-stage pipelining:
To gain more speedup, the pipeline must have more stages.
Let us consider the following decomposition of the instruction processing.
Figure: Six-Stage CPU Instruction Pipeline
1) Fetch instruction (FI): Read the next expected instruction into a buffer.
2) Decode instruction (DI): Determine the opcode and the operand specifiers.
3) Calculate operands (CO): Calculate the effective address of each source operand. This may involve displacement, register indirect, indirect, or other forms of address calculation.
4) Fetch operands (FO): Fetch each operand from memory. Operands in registers need not be fetched.
5) Execute instruction (EI): Perform the indicated operation and store the result, if any, in
the specified destination operand location.
6) Write operand (WO): Store the result in memory.
Figure 2 shows that a six-stage pipeline can reduce the execution time for 9 instructions from 54
time units to 14 time units.
Figure 2: Timing Diagram for Instruction Pipeline Operation
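A small Python sketch (not from the slides) that prints a timing table in the spirit of Figure 2, assuming an ideal six-stage pipeline with no branches or stalls:

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def timing_table(n_instructions, stages=STAGES):
    k = len(stages)
    total_cycles = k + n_instructions - 1               # k + n - 1 cycles in total
    print("Instr " + " ".join(f"{c:>3}" for c in range(1, total_cycles + 1)))
    for i in range(n_instructions):                     # instruction i+1 enters at cycle i+1
        cells = []
        for c in range(1, total_cycles + 1):
            s = c - 1 - i                               # stage index occupied in cycle c
            cells.append(f"{stages[s]:>3}" if 0 <= s < k else "  .")
        print(f"I{i + 1:<4} " + " ".join(cells))
    print(f"{n_instructions} instructions finish in {total_cycles} cycles")

timing_table(9)   # 9 instructions, 6 stages -> 14 cycles (54 without pipelining)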
Example
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50
ns, 60 ns, 60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?
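(The answers are not shown on this slide, but the same numbers are used in the speedup calculation later on:)
Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
Time to execute 100 instructions = 100 × 320 ns = 32000 ns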
Example
Suppose we introduce pipelining on this machine. Assume that when
introducing pipelining, the clock skew adds 5ns of overhead to each execution
stage.
- What is the instruction latency on the pipelined machine?
- How much time does it take to execute 100 instructions?
Solution
Remember that in the pipelined implementation, the length of the pipe stages must all
be the same, i.e., the speed of the slowest stage plus overhead. With 5ns overhead it
comes to:
The length of a pipelined stage = MAX(lengths of un-pipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 stages × 65 ns = 390 ns
Time to execute 100 instructions = 65*6*1 + 65*1*99 = 390 + 6435 = 6825 ns
Cont’d
Problem:
What is the speedup obtained from pipelining?
Solution:
Speedup is the ratio of the average instruction time without pipelining to the
average instruction time with pipelining.
Average instruction time not pipelined = 320 ns
Average instruction time pipelined = 65 ns
Speedup for 100 instructions = 32000 / 6825 = 4.69
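A short Python sketch that reproduces the numbers in this example (values taken from the slides above; the snippet itself is only an illustration):

stage_ns = [50, 50, 60, 60, 50, 50]    # un-pipelined stage lengths
overhead_ns = 5                        # clock-skew overhead per stage
n = 100                                # instructions to execute

non_pipelined_latency = sum(stage_ns)                   # 320 ns
non_pipelined_total = n * non_pipelined_latency         # 32000 ns

cycle = max(stage_ns) + overhead_ns                     # 65 ns
pipelined_latency = cycle * len(stage_ns)               # 390 ns for the first instruction
pipelined_total = cycle * (len(stage_ns) + n - 1)       # 65 * 105 = 6825 ns

print(round(non_pipelined_total / pipelined_total, 2))  # 4.69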
Pipeline: Enhancing Performance
- We observed that we can obtain better performance in executing instructions if a single cycle accomplishes multiple operations.
- We added a shift register after the ALU to improve the performance of the multiplication and division algorithms.
- Not all operations will need both the ALU and the SHIFT register.
- But all instructions do have some things in common:
Instruction fetch
Decode
Operand fetch (if operands)
PIPELINES (Parallel Level 1)
- Pipelines are an architectural enhancement that improves the performance of a computer system without increasing the clock speed.
- Pipelines take advantage of parallelism inherent in the instruction execution process and instructions.
- Based on the observation that all instructions go through the same set of phases (stages) in the process of being executed (one possible example):
INSTRUCTION FETCH
INSTRUCTION DECODE
OPERAND FETCH
EXECUTION
RESULT STORE (not all architectures)
PARALLEL STAGES
- Do each stage in parallel, with different instructions in each stage
- Instructions "flow" from one stage to the next
PIPELINE PERFORMANCE
k: stages in pipeline
t: time to do each stage (actually the slowest stage constrains performance)
n: number of instructions to be processed
Time required to complete the processing on a non-pipelined processor (CPU cycle = kt): n(kt)
Time required for a pipelined processor:
1st instruction (pipeline load): kt
All after the 1st (n – 1 more after the 1st): (n – 1)t
Total: kt + (n – 1)t
PIPELINE SPEEDUP
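(Formula reconstructed from the quantities defined above:)
Speedup = time without pipeline / time with pipeline = n(kt) / [kt + (n – 1)t] = nk / (k + n – 1)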
PIPELINE EFFICIENCY
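(Formula reconstructed, using the usual definition of pipeline efficiency as speedup per stage:)
Efficiency = Speedup / k = n / (k + n – 1), which approaches 1 as n grows.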
Taxonomy of Parallel Processor Architectures
Parallel Speedup
Parallel speedup is defined as the ratio of the time required to compute some function using a single processor (T1) to the time required to compute it using P processors (TP). That is: speedup = T1/TP.
For example, if it takes 10 seconds to run a program sequentially and 2 seconds to run it in parallel on some number of processors, P, then the speedup is 10/2 = 5 times.
Speedup Formula
Speedup = Sequential execution time / Parallel execution time
Parallel Efficiency
Parallel efficiency measures how much use we are making of the parallel processors. For P processors, it is defined as: efficiency = 1/P × speedup = 1/P × T1/TP. Continuing with the same example, if P is 10 processors and the speedup is 5 times, then the parallel efficiency is 5/10 = 0.5. In other words, on average, only half of the processors were used to gain the speedup and the other half were idle.
Efficiency Formula
Efficiency = Sequential execution time / (Processors × Parallel execution time)
Efficiency = Speedup / Processors
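A tiny Python sketch of these two formulas, using the example values from the slides (T1 = 10 s, TP = 2 s, P = 10); illustration only:

def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    return speedup(t_serial, t_parallel) / p

T1, TP, P = 10.0, 2.0, 10
print(speedup(T1, TP))        # 5.0
print(efficiency(T1, TP, P))  # 0.5 -> on average only half the processors are busy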
Parallel Processing and Amdahl’s Law
When considering system performance, computer system designers look for ways to improve
performance by improvement in technology or change in design. Examples include the use of parallel
processors, the use of a memory cache hierarchy, and speedup in memory access time and I/O
transfer rate due to technology improvements.
In all of these cases, it is important to note that a speedup in one aspect of the
technology or design does not result in a corresponding improvement in
performance. This limitation is succinctly expressed by Amdahl’s law.
Amdahl’s Law: A small number of sequential operations can effectively limit the speedup of a parallel algorithm.
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 <
f < 1. Then the maximum speedup S achievable by a parallel computer with p processors
performing the computation is S < 1 / [f + (1 - f) / p]. For example, if 10% of the computation must
be performed sequentially, then the maximum speedup achievable is 10, no matter how many
processors a parallel computer has.
Maximum Speedup in Amdahl’s Law
Amdahl’s law states that the maximum speedup possible in parallelizing an algorithm is limited by the sequential portion of the code.
Given an algorithm which is P% parallel, Amdahl’s law states that: Maximum Speedup = 1 / (1 – (P/100)).
For example, if 80% of a program is parallel, then the maximum speedup is 1 / (1 – 0.8) = 1 / 0.2 = 5 times.
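A small Python sketch of Amdahl’s law in the two forms used on these slides (function names are illustrative, not from the source):

def amdahl_speedup(f_sequential: float, p: int) -> float:
    """Speedup bound with a fraction f_sequential of inherently serial work and p processors."""
    return 1.0 / (f_sequential + (1.0 - f_sequential) / p)

def amdahl_limit(parallel_percent: float) -> float:
    """Upper bound 1 / (1 - P/100) as the number of processors grows without bound."""
    return 1.0 / (1.0 - parallel_percent / 100.0)

print(amdahl_speedup(0.10, 1_000_000))  # ~10: 10% sequential work caps the speedup near 10
print(amdahl_limit(80))                 # 5.0: an 80% parallel program speeds up at most 5x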
Parallel Processing and Amdahl’s Law
Amdahl’s law was first proposed by Gene Amdahl in [AMDA67] and deals with the potential
speedup of a program using multiple processors compared to a single processor. Consider a
program running on a single processor such that a fraction (1-f) of the execution time involves
code that is inherently serial and a fraction f that involves code that is infinitely parallelizable
with no scheduling overhead.
Let T be the total execution time of the program using a single processor. Then the speedup
using a parallel processor with N processors that fully exploits the parallel portion of the
program is as follows:
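(Equation reconstructed; the standard form, consistent with conclusion 2 below:)
Speedup = time on a single processor / time on N processors = T / [T(1 – f) + T·f / N] = 1 / [(1 – f) + f / N]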
Two important conclusions can be drawn:
1. When f is small, the use of parallel processors has little effect.
2. As N approaches infinity, speedup is bound by 1/(1-f), so that there are diminishing returns
for using more processors.
Parallel Processing and Amdahl’s Law
Amdahl’s law illustrates the problems facing industry in the development of multicore
machines with an ever-growing number of cores: The software that runs on such machines
must be adapted to a highly parallel execution environment to exploit the power of parallel
processing.
Amdahl’s law can be generalized to evaluate any design or technical improvement in a
computer system. Consider any enhancement to a feature of a system that results in a
speedup. The speedup can be expressed as
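(Expression reconstructed in its usual form:)
Speedup = execution time before enhancement / execution time after enhancement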
Suppose that a feature of the system is used during execution a fraction of the time f,
before enhancement, and that the speedup of that feature after enhancement is SUf . Then
the overall speedup of the system is:
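(Equation reconstructed; this is the generalized form used in the examples that follow:)
Overall speedup = 1 / [(1 – f) + f / SUf]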
Parallel Processing and Amdahl’s Law (Summary)
Amdahl’s Law
Examples
Questions
Parallel Processing and Amdahl’s Law
Example: suppose that a task makes extensive use of floating-point operations, with 40% of the time consumed by floating-point operations. With a new hardware design, the floating-point module is speeded up by a factor of K. Then the overall speedup is:
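(The worked expression is missing from the extracted slide; applying the generalized formula above with f = 0.4:)
Overall speedup = 1 / [(1 – 0.4) + 0.4 / K] = 1 / [0.6 + 0.4 / K], which approaches 1 / 0.6 ≈ 1.67 as K grows.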
Thus, independent of K, the maximum speedup is 1.67.
Parallel Processing and Amdahl’s Law
Problem:
Suppose you have a system that contains a special processor for doing floating-point operations. You have determined that 50% of your computations can use the floating-point processor. The speedup of the floating-point processor is 15.
Solution
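(The worked solution is missing from the extracted slide; the implied question is the overall system speedup, which the generalized formula gives as:)
Overall speedup = 1 / [(1 – 0.5) + 0.5 / 15] = 1 / (8/15) = 15/8 = 1.875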
Parallel Processing and Amdahl’s Law
Problems and solutions
EFFECTIVENESS OF PARALLEL PROCESSING
There are certain measures to compare the effectiveness of various parallel algorithms or
architectures for solving desired problems. The following definitions and notations are
applicable [Lee80]:
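(The list of definitions did not survive extraction; the standard [Lee80] notation, consistent with the worked example below, is:)
p: number of processors
W(p): total number of unit operations performed by p processors (computational work)
T(p): execution time with p processors; T(1) = W(1)
Speedup: S(p) = T(1) / T(p)
Efficiency: E(p) = S(p) / p = T(1) / (p T(p))
Redundancy: R(p) = W(p) / W(1)
Utilization: U(p) = W(p) / (p T(p))
Quality: Q(p) = T(1)³ / (p T(p)² W(p))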
The significance of each measure is self-evident from its name and defining equation given above. It is not difficult to
establish the following relationships between these parameters.
EFFECTIVENESS OF PARALLEL PROCESSING Cont’d
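(The relationships themselves are missing from the extracted slides; the ones usually derived from the definitions above are:)
1 ≤ S(p) ≤ p
U(p) = R(p) E(p)
E(p) ≤ U(p) ≤ 1
Q(p) ≤ S(p) ≤ p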
Example
Example: Finding the sum of 16 numbers can be represented by the binary-tree computation graph of the Figure mentioned below, with T(1) = W(1) = 15. Assume unit-time additions and ignore all else.
With p = 8 processors, we have
W(8) = 15, T(8) = 4, E(8) = 15/(8 × 4) ≈ 47%
S(8) = 15/4 = 3.75, R(8) = 15/15 = 1, Q(8) = 1.76
Essentially, the 8 processors perform all of the additions at the same tree level in
each time unit, beginning with the leaf nodes and ending at the root. The relatively
low efficiency is the result of limited parallelism near the root of the tree.
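A short Python sketch that recomputes the measures for this example (illustration only):

# Summing 16 numbers as a binary tree with p = 8 processors, unit-time additions
T1 = W1 = 15          # 15 additions in total
p, T_p, W_p = 8, 4, 15

S = T1 / T_p                      # 3.75
E = S / p                         # 0.47 (47%)
R = W_p / W1                      # 1.0
Q = T1**3 / (p * T_p**2 * W_p)    # ~1.76

print(S, round(E, 2), R, round(Q, 2))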
Thank You