The ARM Cortex-M4 Processor
Architecture
Module Syllabus
ARM Architectures and Processors
What is ARM Architecture
ARM Processor Families
ARM Cortex-M Series
Cortex-M4 Processor
ARM Processor vs. ARM Architectures
ARM Cortex-M4 Processor
Cortex-M4 Processor Overview
Cortex-M4 Block Diagram
Cortex-M4 Registers
ARM ARCHITECTURES AND PROCESSORS
What is ARM Architecture
ARM architecture is a family of RISC-based processor architectures
Well known for its power efficiency;
Hence widely used in mobile devices, such as smartphones and tablets
Designed and licensed to a wide ecosystem by ARM
ARM Holdings
The company designs ARM-based processors;
Does not manufacture, but licenses designs to semiconductor partners, who add their own Intellectual Property (IP) on top of ARM's IP, then fabricate and sell to customers;
Also offers other IP apart from processors, such as physical IP, interconnect IP, graphics cores, and development tools.
ARM Processor Families
Cortex-A series (Application)
High performance processors capable of full Operating System (OS)
support;
Applications include smartphones, digital TV, smart books, home
gateways etc.
Cortex-R series (Real-time)
High performance for real-time applications;
High reliability
Applications include automotive braking system, powertrains etc.
Cortex-M series (Microcontroller)
Cost-sensitive solutions for deterministic microcontroller applications;
Applications include microcontrollers, mixed signal devices, smart
sensors, automotive body electronics and airbags;
SecurCore series
High security applications.
Previous classic processors
Include ARM7, ARM9, ARM11 families
Processor families (as of Dec 2013)
Cortex-A: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A8, Cortex-A7, Cortex-A5
Cortex-R: Cortex-R7, Cortex-R5, Cortex-R4
Cortex-M: Cortex-M4, Cortex-M3, Cortex-M1, Cortex-M0+, Cortex-M0
SecurCore: SC000, SC100, SC300
Classic: ARM11, ARM9, ARM7
Cortex-M processors are the optimal solution for low-power embedded computing applications. The 32-bit Cortex-M
processor family is the key to transforming all sorts of embedded systems into smart and connected systems. Often
provided as a black box with pre-loaded applications, they have limited capability to expand hardware functionality and in
most cases no screen.
Merchant MCUs
Automotive Control Systems
White Goods controllers
Smart Meters
Sensors
Motor Control Systems
Internet of Things
Equipment Adopting ARM Cores
IR Fire Detector
Intelligent toys
Utility Meters
Exercise Machines
Energy Efficient Appliances
Tele-parking
Intelligent Vending
Source: ARM University Program Overview
More than 50 billion ARM-powered chips shipped
Over 10 billion ARM-powered chips shipped in 2013 alone.
Strong and consistent growth since 1993
This curve shows overall shipments leading to the 50 billion milestone. There's been an upward trend as shipments skyrocketed in recent years.
[Figure: cumulative shipments from 1993 to 2013, with the 10 billion and 50 billion milestones marked]
www.50billionchips.com
Source: ARM University Program Overview
Markets we're powering
58% | Mobile: Devices including smartphones, mobile phones, tablets, e-readers and wearables
20% | Embedded: Applications including automotive, touchscreen controllers, industrial equipment, connectivity and smartcards
16% | Enterprise: Applications such as hard disk drives, and wireless/wireline networking infrastructure equipment
6% | Home: Consumer devices such as smart TVs, game consoles and home networking gateways
www.50billionchips.com
Source: ARM University Program Overview
Based on Lecture Notes by Marilyn Wolf
ARM Architecture versions
(From arm.com)
Design an ARM-based SoC
Select a set of IP cores from ARM and/or other third-party IP vendors
Integrate IP cores into a single chip design
Give design to semiconductor foundries for chip fabrication
Licensable IPs (IP libraries)
Processors: Cortex-A9, Cortex-R5, Cortex-M4, ARM7, ARM9, ARM11
Memory controllers: DRAM ctrl, FLASH ctrl, SRAM ctrl
Buses: AXI bus, AHB bus, APB bus
Peripherals: GPIO, I/O blocks, Timer
ARM-based SoC
ARM processor, ROM, RAM, system bus, peripherals, external interface
Flow: licensable IPs -> SoC design -> chip manufacture
ARM Cortex-M Series
Cortex-M series: Cortex-M0, M0+, M1, M3, M4.
Energy-efficiency
Lower energy cost, longer battery life
Smaller code
Lower silicon costs
Ease of use
Faster software development and reuse
Embedded applications
Smart metering, human interface devices, automotive and industrial control
systems, white goods, consumer products and medical instrumentation
As of Dec 2013
ARM Processors vs. ARM Architectures
ARM architecture
Describes the details of the instruction set, programmer's model, exception model, and memory map
Documented in the Architecture Reference Manual
ARM processor
Developed using one of the ARM architectures
Adds implementation details, such as timing information
Documented in the processor's Technical Reference Manual
ARMv4/v4T Architecture: e.g. ARM7TDMI
ARMv5/v5E Architecture: e.g. ARM926EJ-S
ARMv6 Architecture: e.g. ARM1136
ARMv6-M: e.g. Cortex-M0, M1
ARMv7 Architecture
ARMv7-A: e.g. Cortex-A9
ARMv7-R: e.g. Cortex-R4
ARMv7-M: e.g. Cortex-M4
ARMv8 Architecture
ARMv8-A: e.g. Cortex-A53, Cortex-A57
ARMv8-R
As of Dec 2013
ARM Cortex-M Series Family
Processor  | ARM Architecture | Core Architecture | Thumb  | Thumb-2 | Hardware Multiply | Hardware Divide | Saturated Math | DSP Extensions | Floating Point
Cortex-M0  | ARMv6-M          | Von Neumann       | Most   | Subset  | 1 or 32 cycle     | No              | No             | Software       | No
Cortex-M0+ | ARMv6-M          | Von Neumann       | Most   | Subset  | 1 or 32 cycle     | No              | No             | Software       | No
Cortex-M1  | ARMv6-M          | Von Neumann       | Most   | Subset  | 3 or 33 cycle     | No              | No             | Software       | No
Cortex-M3  | ARMv7-M          | Harvard           | Entire | Entire  | 1 cycle           | Yes             | Yes            | Software       | No
Cortex-M4  | ARMv7E-M         | Harvard           | Entire | Entire  | 1 cycle           | Yes             | Yes            | Hardware       | Optional
RISC CPU Characteristics
32-bit
load/store architecture
Fixed instruction length
Fewer/simpler instructions than CISC CPU
Limited addressing modes, operand types
Simple design is easier to speed up, pipeline and scale
ARM assembly language
Fairly standard RISC assembly language:
label   LDR r0,[r8]     ; a comment
        ADD r4,r0,r1    ; r4=r0+r1
In ADD r4,r0,r1: r4 is the destination, r0 is the source/left operand, r1 is the source/right operand
ARM data types
Word is 32 bits long.
Word can be divided into four 8-bit bytes.
ARM addresses can be 32 bits long.
Address refers to byte.
Address 4 starts at byte 4.
Can be configured at power-up in either little- or big-endian mode.
Endianness
Relationship between bit and byte/word ordering defines endianness:
little-endian (default): bit 31 | byte 3 | byte 2 | byte 1 | byte 0 | bit 0
big-endian:              bit 31 | byte 0 | byte 1 | byte 2 | byte 3 | bit 0
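The byte ordering described above can be made concrete with a small C sketch. It is only an illustration: the helper names `is_little_endian`, `load_le32` and `load_be32` are hypothetical and do not come from the slides. The first function checks how the machine running the code lays out a word in memory; the other two assemble a word from four bytes read at ascending addresses under each convention.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the machine running this code stores the least significant
 * byte of a word at the lowest address (little-endian), 0 otherwise. */
int is_little_endian(void)
{
    uint32_t word = 0x11223344u;
    uint8_t bytes[4];
    memcpy(bytes, &word, sizeof word);   /* view the word byte by byte */
    return bytes[0] == 0x44;
}

/* Assemble a 32-bit word from four bytes at ascending addresses,
 * little-endian: byte 0 holds bits 7..0. */
uint32_t load_le32(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Same four bytes, big-endian: byte 0 holds bits 31..24. */
uint32_t load_be32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}
```

For the byte sequence 0x44, 0x33, 0x22, 0x11 the little-endian reading is 0x11223344 and the big-endian reading is 0x44332211.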
ARM CORTEX-M4 PROCESSOR OVERVIEW
Cortex-M4 Processor Overview
Cortex-M4 Processor
Introduced in 2010
Designed with a large variety of highly efficient signal processing features
Features extended single-cycle multiply accumulate instructions, optimized SIMD arithmetic,
saturating arithmetic and an optional Floating Point Unit.
High Performance Efficiency
1.25 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz), with dynamic power on the order of µW/MHz
Low Power Consumption
Longer battery life especially critical in mobile products
Enhanced Determinism
The critical tasks and interrupt routines can be served quickly in a known number of cycles
Cortex-M4 Processor Features
32-bit Reduced Instruction Set Computing (RISC) processor
Harvard architecture
Separated data bus and instruction bus
Instruction set
Includes the entire Thumb-1 (16-bit) and Thumb-2 (16/32-bit) instruction sets
3-stage + branch speculation pipeline
Performance efficiency
1.25 to 1.95 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz)
Supported Interrupts
Non-maskable Interrupt (NMI) + 1 to 240 physical interrupts
8 to 256 interrupt priority levels
Cortex-M4 Processor Features
Supports Sleep Modes
Up to 240 Wake-up Interrupts
Integrated WFI (Wait For Interrupt) and WFE (Wait For Event) Instructions and Sleep On Exit capability (to be
covered in more detail later)
Sleep & Deep Sleep Signals
Optional Retention Mode with ARM Power Management Kit
Enhanced Instructions
Hardware Divide (2-12 Cycles)
Single-Cycle 16, 32-bit MAC, Single-cycle dual 16-bit MAC
8, 16-bit SIMD arithmetic
Debug
Optional JTAG & Serial-Wire Debug (SWD) Ports
Up to 8 Breakpoints and 4 Watchpoints
Memory Protection Unit (MPU)
Optional 8 region MPU with sub regions and background region
Cortex-M4 Processor Features
Cortex-M4 processor is designed to meet the challenges of low dynamic power constraints while retaining a light footprint
180 nm ultra low power process: 157 µW/MHz
90 nm low power process: 33 µW/MHz
40 nm G process: 8 µW/MHz
ARM Cortex-M4 Implementation Data
Process                               | Dynamic Power | Floorplanned Area
180ULL (7-track, typical 1.8 V, 25 C) | 157 µW/MHz    | 0.56 mm²
90LP (7-track, typical 1.2 V, 25 C)   | 33 µW/MHz     | 0.17 mm²
40G (9-track, typical 0.9 V, 25 C)    | 8 µW/MHz      | 0.04 mm²
Cortex-M4 Block Diagram
ARM Cortex-M4 Microprocessor
Processor core, with optional FPU
Nested Vectored Interrupt Controller (NVIC)
Optional WIC
Optional Debug Access Port
Optional Embedded Trace Macrocell
Optional Memory protection unit
Optional Serial Wire Viewer
Optional Flash patch
Optional Data watchpoints
Bus matrix
Code interface
SRAM and peripheral interface
Cortex-M4 Block Diagram
Processor core
Contains internal registers, the ALU, data path, and some control logic
Registers include sixteen 32-bit registers for both general and special usage
Processor pipeline stages
Three-stage pipeline: fetch, decode, and execution
Some instructions may take multiple cycles to execute, in which case the pipeline will be stalled
The pipeline will be flushed if a branch instruction is executed
Up to two instructions can be fetched in one transfer (16-bit instructions)
Time ->
Instruction 1 | Fetch | Decode | Execute |         |         |
Instruction 2 |       | Fetch  | Decode  | Execute |         |
Instruction 3 |       |        | Fetch   | Decode  | Execute |
Instruction 4 |       |        |         | Fetch   | Decode  | Execute
Cortex-M4 Block Diagram
Nested Vectored Interrupt Controller (NVIC)
Up to 240 interrupt request signals and a non-maskable interrupt (NMI)
Automatically handles nested interrupts, such as comparing priorities between interrupt
requests and the current priority level
Wakeup Interrupt Controller (WIC)
For low-power applications, the microcontroller can enter sleep mode by shutting down
most of the components.
When an interrupt request is detected, the WIC can inform the power management unit
to power up the system.
Memory Protection Unit (optional)
Used to protect memory content, e.g. make some memory regions read-only or prevent user applications from accessing privileged application data
Cortex-M4 Block Diagram
Bus interconnect
Allows data transfer to take place on different buses simultaneously
Provides data transfer management, e.g. a write buffer, bit-oriented operations
(bit-band)
May include bus bridges (e.g. AHB-to-APB bus bridge) to connect different buses
into a network using a single global memory space
Includes the internal bus system, the data path in the processor core, and the
AHB LITE interface unit
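The bit-band feature mentioned above maps each bit of a bit-band region to its own word in an alias region, so a single word write can set or clear one bit. The sketch below computes the alias address for the SRAM region. The base addresses are those of the standard Cortex-M3/M4 memory map rather than anything stated on the slides, the function name `bitband_alias` is hypothetical, and bit-banding is an optional feature that a given device may not implement.

```c
#include <stdint.h>

#define SRAM_BITBAND_BASE 0x20000000u  /* start of the 1 MB SRAM bit-band region */
#define SRAM_ALIAS_BASE   0x22000000u  /* start of its 32 MB alias region */

/* Alias-word address for bit `bit` (0..7) of the byte at `byte_addr`.
 * Every bit in the bit-band region maps to one 32-bit word in the alias
 * region: 32 alias bytes per source byte, 4 alias bytes per bit. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - SRAM_BITBAND_BASE;
    return SRAM_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

For example, bit 2 of the byte at 0x20000300 is reached through the alias word at 0x22006008.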
Debug subsystem
Handles debug control, program breakpoints, and data watchpoints
When a debug event occurs, it can put the processor core in a halted state, where developers can analyse the status of the processor at that point, such as register values and flags
STM32F4xx Block Diagram
RDP (JTAG fuse)
More I/Os in UFBGA 176 package
JTAG/SW Debug
ETM
Nested vect IT Ctrl
1 x SysTick Timer
DMA
AHB2 (max 168MHz)
D-bus
I-bus
S-bus
16 Channels
Clock Control
AHB1
(max 168MHz)
51/82/114/140 I/Os
2x6x 16-bit PWM
Synchronized AC Timer
3 x 16bit Timer
Up to 16 Ext. ITs
1 x SPI
2 x USART/LIN
Bridge
512kB- 1MB
Flash Memory
Flash I/F
CORTEX-M4
CPU + FPU +
MPU
168 MHz
ARM 32-bit multi-AHB bus matrix
Arbiter (max 168MHz)
New application specific peripherals
USB OTG HS w/ ULPI interface
Camera interface
HW Encryption**: DES, 3DES, AES
256-bit, SHA-1 hash, RNG.
Enhanced peripherals
USB OTG Full speed
ADC: 0.416 µs conversion/2.4 Msps, up to 7.2 Msps in interleaved triple mode
ADC/DAC working down to 1.8V
Dedicated PLL for I2S precision
Ethernet w/ HW IEEE1588 v2.0
32-bit RTC with calendar
4KB backup SRAM in VBAT domain
2 x 32bit and 8 x 16bit Timers
high speed USART up to 10.5Mb/s
high speed SPI up to 37.5Mb/s
64KB CCM data RAM
APB2 (max 84MHz)
Cortex-M4 w/ FPU, MPU and ETM
Memory
Up to 1MB Flash memory
192KB RAM (including 64KB CCM data RAM)
FSMC up to 60MHz
Encryption**
Camera Interface
USB 2.0 OTG FS
128KB SRAM
External Memory
Interface
USB 2.0 OTG
FS/HS
Power Supply Reg
1.2V POR/PDR/PVD
XTAL oscillators
32KHz + 8~25MHz
Ethernet MAC 10/100,
IEEE1588
Bridge
APB1 (max 42MHz)
Int. RC oscillators
32KHz + 16MHz
PLL
RTC / AWU
5x 16-bit Timer
2x 32-bit Timer
4KB backup RAM
2x DAC + 2 Timers
2x Watchdog
(independent& window)
1x SDIO
3x 12-bit ADC
24 channels / 2Msps
Temp Sensor
HS requires an external PHY connected to ULPI interface,
** Encryption is only available on STM32F415 and STM32F417
2x CAN 2.0B
2 x SPI / I2S
4x USART/LIN
3x I2C
STM32F4 Series highlights 1/3
Based on Cortex M4 core
New DSP and FPU instructions combined with a clock of up to 168 MHz
Over 30 new part numbers pin-to-pin and software compatible with
existing STM32 F2 Series.
Advanced technology and process from ST:
Memory accelerator: ART Accelerator
Multi AHB Bus Matrix
90nm process
Outstanding results:
210DMIPS at 168MHz.
Execution from Flash equivalent to 0-wait state performance up to 168MHz thanks to ST
ART Accelerator
STM32F4 Series highlights 2/3
More Memory
Up to 1MB Flash with optional permanent readout protection (JTAG fuse),
192kB SRAM: 128kB on bus matrix + 64kB (Core Coupled Memory) on data bus dedicated to
the CPU usage
Advanced peripherals
USB OTG High speed 480Mbit/s
Ethernet MAC 10/100 with IEEE1588
PWM High speed timers: 168MHz max frequency
Crypto/Hash processor, 32-bit random number generator (RNG)
32-bit RTC with calendar: with sub 1 second accuracy, and <1 µA
STM32F4 Series highlights 3/3
Further improvements
Low voltage: 1.8V to 3.6V VDD , down to 1.7*V on most packages
Full duplex I2S peripherals
12-bit ADC: 0.41 µs conversion/2.4 Msps (7.2 Msps in interleaved mode)
High speed USART up to 10.5Mbits/s
High speed SPI up to 37.5Mbits/s
Camera interface up to 54MBytes/s
*external reset circuitry required to support 1.7V
STM32F4 portfolio
Extensive tools and SW
Evaluation board for full product feature evaluation: STM3240G-EVAL, $349
Hardware evaluation platform for all interfaces
Possible connection to all I/Os and all peripherals
Discovery kit for cost-effective evaluation and prototyping: STM32F4DISCOVERY, $14.90
Starter kits from 3rd parties available soon
Large choice of development IDE solutions from the STM32 and ARM ecosystem
Tools for development SW (examples)
Commercial ones:
IAR eval 32kB/30days for test
[RK-System]
Keil (ARM) eval 32kB for test
[WG Electronics]
Based on GCC commercial:
Atollic Lite (no hex/bin, limited debug), [Kamami]
Raisonance debug limited to 32kB
Rowley Crossworks 30 days for test
Free
STVP FLASH prog.
STLink utility FLASH prog.
(+cmd line)
ST FlashLoader FLASH prog.
Libraries (free)
Standard peripherals library with CMSIS
USB device library
Cortex-M processors binary compatible
Cortex-M feature set comparison
Feature                       | Cortex-M0                          | Cortex-M3       | Cortex-M4
Architecture version          | v6M                                | v7M             | v7E-M
Instruction set architecture  | Thumb, Thumb-2 System Instructions | Thumb + Thumb-2 | Thumb + Thumb-2, DSP, SIMD, FP
DMIPS/MHz                     | 0.9                                | 1.25            | 1.25
Bus interfaces                | 1                                  | 3               | 3
Integrated NVIC               | Yes                                | Yes             | Yes
Number interrupts             | 1-32 + NMI                         | 1-240 + NMI     | 1-240 + NMI
Interrupt priorities          | 4                                  | 8-256           | 8-256
Breakpoints, Watchpoints      | 4/2/0, 2/1/0                       | 8/4/0, 2/1/0    | 8/4/0, 2/1/0
Memory Protection Unit (MPU)  | No                                 | Yes (Option)    | Yes (Option)
Integrated trace option (ETM) | No                                 | Yes (Option)    | Yes (Option)
Fault Robust Interface        | No                                 | Yes (Option)    | No
Single Cycle Multiply         | Yes (Option)                       | Yes             | Yes
Hardware Divide               | No                                 | Yes             | Yes
WIC Support                   | Yes                                | Yes             | Yes
Bit banding support           | No                                 | Yes             | Yes
Single cycle DSP/SIMD         | No                                 | No              | Yes
Floating point hardware       | No                                 | No              | Yes
Bus protocol                  | AHB Lite                           | AHB Lite, APB   | AHB Lite, APB
CMSIS Support                 | Yes                                | Yes             | Yes
ARM CORTEX-M4 PROCESSOR REGISTERS
Cortex-M4 Registers
Processor registers
The internal registers are used to store and process temporary data within the
processor core
All registers are inside the processor core, hence they can be accessed quickly
Load-store architecture
To process memory data, it first has to be loaded from memory into registers, processed inside the processor core using register data only, and then written back to memory if needed
Cortex-M4 registers
Register bank
Sixteen 32-bit registers (thirteen are used for general-purpose);
Special registers
Cortex-M4 Registers
Register bank
R0-R7: Low registers (general purpose)
R8-R12: High registers (general purpose)
R13 (banked): Stack Pointer (SP), either MSP (Main Stack Pointer) or PSP (Process Stack Pointer)
R14: Link Register (LR)
R15: Program Counter (PC)
Special registers
Program Status Registers (xPSR): APSR (Application PSR), EPSR (Execution PSR), IPSR (Interrupt PSR)
Interrupt mask registers: PRIMASK, FAULTMASK, BASEPRI
CONTROL: stack definition
Cortex-M4 Registers
R0-R12: general purpose registers
Low registers (R0-R7) can be accessed by any instruction
High registers (R8-R12) sometimes cannot be accessed, e.g. by some Thumb (16-bit) instructions
R13: Stack Pointer (SP)
Records the current address of the stack
Used for saving the context of a program while switching between tasks
Cortex-M4 has two SPs: Main SP, used in applications that require privileged access, e.g. OS kernel and exception handlers, and Process SP, used in base-level application code (when not running an exception handler)
Program Counter (PC)
Records the address of the current instruction code
Automatically incremented by 4 at each operation (for 32-bit instruction code), except branching operations
A branching operation, such as a function call, changes the PC to a specific address and saves the return address to the Link Register (LR)
[Figure: memory from low to high addresses showing code, data, heap and stack regions, with the PC pointing into the code and the SP into the stack; PUSH and POP move data to and from the stack]
Cortex-M4 Registers
R14: Link Register (LR)
The LR is used to store the return address of a subroutine or a function call
The program counter (PC) will load the value from LR after a function is finished
Call a subroutine:
1. Save current PC to LR
2. Load PC with the starting address of the subroutine
Return from a subroutine to the main program:
Load PC with the address in LR to return to the main program
Cortex-M4 Registers
xPSR, combined Program Status Register
Provides information about program execution and ALU flags
Application PSR (APSR)
Interrupt PSR (IPSR)
Execution PSR (EPSR)
Bit layout (bit 31 down to bit 0):
APSR: N, Z, C, V, Q flags in the top bits; remaining bits reserved
IPSR: ISR number in the low bits; remaining bits reserved
EPSR: ICI/IT fields and the T bit; remaining bits reserved
xPSR: N, Z, C, V, Q | ICI/IT | T | ICI/IT | ISR number
Cortex-M4 Registers
APSR
N: negative flag, set to one if the result from the ALU is negative
Z: zero flag, set to one if the result from the ALU is zero
C: carry flag, set to one if an unsigned overflow occurs
V: overflow flag, set to one if a signed overflow occurs
Q: sticky saturation flag, set to one if saturation has occurred in saturating arithmetic instructions, or overflow has occurred in certain multiply instructions
IPSR
ISR number: number of the interrupt service routine currently executing
EPSR
T: Thumb state, always one since Cortex-M4 only supports the Thumb state (more on processor states in the next module)
ICI/IT: Interrupt-Continuable Instruction (ICI) bits, IF-THEN instruction status bits
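As a rough illustration of how the three views share one 32-bit value, the sketch below splits an xPSR word into the fields listed above. The bit positions used (N, Z, C, V, Q in bits 31 to 27, T in bit 24, exception number in bits 8 to 0) are taken from the ARMv7-M architecture rather than from the slide, and the type `xpsr_fields` and function `decode_xpsr` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    unsigned n, z, c, v, q;  /* APSR flags */
    unsigned t;              /* EPSR Thumb state bit */
    unsigned exception;      /* IPSR exception (ISR) number */
} xpsr_fields;

/* Split a raw xPSR value into its APSR, EPSR and IPSR fields. */
xpsr_fields decode_xpsr(uint32_t xpsr)
{
    xpsr_fields f;
    f.n = (xpsr >> 31) & 1u;
    f.z = (xpsr >> 30) & 1u;
    f.c = (xpsr >> 29) & 1u;
    f.v = (xpsr >> 28) & 1u;
    f.q = (xpsr >> 27) & 1u;
    f.t = (xpsr >> 24) & 1u;
    f.exception = xpsr & 0x1FFu;  /* bits 8..0 */
    return f;
}
```

Decoding 0x61000003, for instance, gives Z = 1, C = 1, T = 1 and exception number 3.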
ARM status bits
Every arithmetic, logical, or shifting operation can set CPSR bits (held in the APSR on Cortex-M):
N (negative), Z (zero), C (carry), V (overflow)
Examples:
-1 + 1 = 0: NZCV = 0110.
2^31 - 1 + 1 = -2^31: NZCV = 1001.
Setting status bits must be explicitly enabled on each instruction
e.g. adds sets status bits, whereas add does not
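The two examples above can be checked with a small software model of the flags produced by a 32-bit addition. This is a sketch of the flag definitions only, not of any particular instruction encoding; `add_flags` is a hypothetical helper that returns the flags packed as N:Z:C:V with N in bit 3.

```c
#include <stdint.h>

/* Flags produced by the 32-bit addition a + b, packed as N:Z:C:V
 * (N in bit 3, V in bit 0). */
unsigned add_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;              /* unsigned add wraps modulo 2^32 */
    unsigned n = r >> 31;            /* result is negative as a signed value */
    unsigned z = (r == 0);           /* result is zero */
    unsigned c = (r < a);            /* carry out of bit 31 (unsigned overflow) */
    /* Signed overflow: operands share a sign and the result's sign differs. */
    unsigned v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;
    return (n << 3) | (z << 2) | (c << 1) | v;
}
```

With a = 0xFFFFFFFF (that is, -1) and b = 1 the function returns binary 0110, and with a = 0x7FFFFFFF (2^31 - 1) and b = 1 it returns binary 1001, matching the slide.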
Cortex-M4 Registers
Interrupt mask registers
1-bit PRIMASK
Set to one will block all the interrupts apart from the non-maskable interrupt (NMI) and the hard fault exception
1-bit FAULTMASK
Set to one will block all the interrupts apart from the NMI
BASEPRI (up to 8 bits)
Holds a priority threshold: a nonzero value blocks all interrupts of the same or lower priority level (only interrupts with higher priorities are allowed); zero disables this masking
CONTROL: special register
1-bit stack definition
Set to one: use the process stack pointer (PSP)
Clear to zero: use the main stack pointer (MSP)
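The effect of the mask registers can be summarised as a small decision function. The sketch below is a simplified software model, not hardware behaviour in full: it assumes the usual ARMv7-M convention that a lower priority number means a higher priority, with NMI fixed at -2 and HardFault at -1, treats BASEPRI as a plain priority number, and ignores priority grouping and the priority of whatever handler is currently running. The names `mask_regs` and `exception_allowed` are hypothetical.

```c
#include <stdint.h>

#define PRIO_NMI       (-2)  /* fixed priority of the non-maskable interrupt */
#define PRIO_HARDFAULT (-1)  /* fixed priority of the hard fault exception */

typedef struct {
    unsigned primask;    /* 0 or 1 */
    unsigned faultmask;  /* 0 or 1 */
    unsigned basepri;    /* 0 = masking disabled, otherwise a priority threshold */
} mask_regs;

/* Returns 1 if an exception with the given priority number is allowed
 * through the mask registers, 0 if it is blocked. */
int exception_allowed(const mask_regs *m, int priority)
{
    if (m->faultmask && priority >= PRIO_HARDFAULT)
        return 0;                     /* everything except NMI is blocked */
    if (m->primask && priority >= 0)
        return 0;                     /* all configurable priorities are blocked */
    if (m->basepri != 0 && priority >= (int)m->basepri)
        return 0;                     /* same or lower priority than the threshold */
    return 1;
}
```

With BASEPRI set to 4, for example, an interrupt at priority 3 is still allowed while interrupts at priorities 4 and 5 are blocked.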
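Cortex-M4 Registers
Register layouts (bit 31 down to bit 0):
PRIMASK: PRIMASK bit in the lowest bit; remaining bits reserved
FAULTMASK: FAULTMASK bit in the lowest bit; remaining bits reserved
BASEPRI: BASEPRI field in the low byte; remaining bits reserved
CONTROL: stack definition bit; remaining bits reserved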
Cortex-M4 Operation Modes
Cortex-M4 DSP features
Cortex-M4 processor architecture
ARMv7E-M Architecture
Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (Up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3
Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces
Configurable for ultra low power
Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit
Flexible configurations for wider applicability
Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace
Cortex-M4 overview
Main Cortex-M4 processor features
ARMv7E-M architecture revision
Fully compatible with Cortex-M3 instruction set
Single-cycle multiply-accumulate (MAC) unit
Optimized single instruction multiple data (SIMD) instructions
Saturating arithmetic instructions
Optional single precision Floating-Point Unit (FPU)
Hardware divide (2-12 cycles), same as Cortex-M3
Barrel shifter (same as Cortex-M3)
Single-cycle multiply-accumulate unit
The multiplier unit allows any MUL or MAC
instructions to be executed in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)
Benefits : Speed improvement vs. Cortex-M3
4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC
Cortex-M4 extended single cycle MAC
OPERATION                       | INSTRUCTIONS                       | CM3 | CM4
16 x 16 = 32                    | SMULBB, SMULBT, SMULTB, SMULTT     | n/a | 1
16 x 16 + 32 = 32               | SMLABB, SMLABT, SMLATB, SMLATT     | n/a | 1
16 x 16 + 64 = 64               | SMLALBB, SMLALBT, SMLALTB, SMLALTT | n/a | 1
16 x 32 = 32                    | SMULWB, SMULWT                     | n/a | 1
(16 x 32) + 32 = 32             | SMLAWB, SMLAWT                     | n/a | 1
(16 x 16) ± (16 x 16) = 32      | SMUAD, SMUADX, SMUSD, SMUSDX       | n/a | 1
(16 x 16) ± (16 x 16) + 32 = 32 | SMLAD, SMLADX, SMLSD, SMLSDX       | n/a | 1
(16 x 16) ± (16 x 16) + 64 = 64 | SMLALD, SMLALDX, SMLSLD, SMLSLDX   | n/a | 1
32 x 32 = 32                    | MUL                                | 1   | 1
32 ± (32 x 32) = 32             | MLA, MLS                           | 2   | 1
32 x 32 = 64                    | SMULL, UMULL                       | 5-7 | 1
(32 x 32) + 64 = 64             | SMLAL, UMLAL                       | 5-7 | 1
(32 x 32) + 32 + 32 = 64        | UMAAL                              | n/a | 1
32 ± (32 x 32) = 32 (upper)     | SMMLA, SMMLAR, SMMLS, SMMLSR       | n/a | 1
(32 x 32) = 32 (upper)          | SMMUL, SMMULR                      | n/a | 1
All the above operations are single cycle on the Cortex-M4 processor
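To make the operand widths in the table concrete, here is a portable C model of two of the listed operations: the 32 x 32 + 64 accumulate performed by SMLAL and the 16 x 16 + 32 accumulate on the bottom halfwords performed by SMLABB. It is an arithmetic sketch only; the function names `mac64` and `mac16_bottom` are hypothetical, and the model assumes the accumulation does not overflow, so it does not reproduce the Q flag behaviour.

```c
#include <stdint.h>

/* (32 x 32) + 64 = 64: signed multiply-accumulate long, as SMLAL computes. */
int64_t mac64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* 16 x 16 + 32 = 32 on the bottom halfwords of a and b, as SMLABB computes.
 * Assumes the sum stays within the int32_t range. */
int32_t mac16_bottom(int32_t acc, uint32_t a, uint32_t b)
{
    int16_t a_lo = (int16_t)(uint16_t)(a & 0xFFFFu);  /* bits 15..0 as signed */
    int16_t b_lo = (int16_t)(uint16_t)(b & 0xFFFFu);
    return acc + (int32_t)a_lo * (int32_t)b_lo;
}
```

On a Cortex-M4 a compiler would typically map such expressions onto the single-cycle instructions in the table; on a Cortex-M3 the 64-bit case costs several cycles, which is where the quoted speed-up comes from.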
Saturated arithmetic
Intrinsically prevents overflow of a variable by clipping to min/max boundaries, and removes the CPU burden of software range checks
Benefits
Audio applications
[Figure: the same waveform on a -1.5 to 1.5 axis, without saturation and with saturation]
Control applications
The PID controller's integral term is continuously accumulated over time. The saturation automatically limits its value and saves several CPU cycles per regulator
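The difference between wrapping and saturating can be sketched in portable C. This is a software model of the clipping behaviour, in the spirit of a saturating add such as QADD, not the instruction itself; `sat_add32` and `wrap_add32` are hypothetical helper names.

```c
#include <stdint.h>

/* Saturating signed 32-bit add: the result is clipped to the
 * representable range instead of wrapping around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;  /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Wrapping add for comparison: the sum is taken modulo 2^32. */
int32_t wrap_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}
```

Adding 1 to INT32_MAX yields INT32_MAX with saturation, whereas the wrapping version jumps to INT32_MIN, which is the kind of full-scale sign flip that is audible in audio and destabilising in a control loop.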
Single-cycle SIMD instructions
Stands for Single Instruction Multiple Data
It operates on packed data
Allows several operations on 8-bit or 16-bit data to be performed simultaneously
e.g. dual 16-bit MAC (Result = 16x16 + 16x16 + 32)
Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of Load/Store instruction for exchanges between memory and register file
(2 or 4 data transferred at once), if 32-bit is not necessary
Maximizes register file use (1 register holds 2 or 4 values)
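The dual 16-bit MAC mentioned above can be modelled in portable C to show what "packed data" means in practice. The sketch below is an arithmetic model of what an instruction such as SMLAD computes, not an intrinsic; `pack16` and `dual_mac16` are hypothetical names, and the model assumes the accumulation stays within the 32-bit range.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word:
 * `hi` goes to bits 31..16, `lo` to bits 15..0. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Dual 16-bit MAC on packed operands:
 * result = lo(x) * lo(y) + hi(x) * hi(y) + acc. */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(uint16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(uint16_t)(x >> 16);
    int16_t y_lo = (int16_t)(uint16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(uint16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}
```

For x = pack16(2, 3), y = pack16(4, 5) and an accumulator of 10, the result is 10 + 3*5 + 2*4 = 33: two multiplies and two additions on data fetched with a single load per operand.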
Packed data types
Byte or halfword quantities packed into words
Allows more efficient access to packed structure types
SIMD instructions can act on packed data
Instructions to extract and pack data
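[Figure: Extract turns packed fields A and B into zero-extended words 00......00 A and 00......00 B; Pack combines separate values back into one word]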
IIR single cycle MAC benefit
Code line              | Cortex-M3 cycle count | Cortex-M4 cycle count
xN = *x++;             | 2                     | 2
yN = xN * b0;          | 3-7                   | 1
yN += xNm1 * b1;       | 3-7                   | 1
yN += xNm2 * b2;       | 3-7                   | 1
yN -= yNm1 * a1;       | 3-7                   | 1
yN -= yNm2 * a2;       | 3-7                   | 1
*y++ = yN;             | 2                     | 2
xNm2 = xNm1;           | 1                     | 1
xNm1 = xN;             | 1                     | 1
yNm2 = yNm1;           | 1                     | 1
yNm1 = yN;             | 1                     | 1
Decrement loop counter | 1                     | 1
Branch                 | 2                     | 2
Only looking at the inner loop, making these assumptions
Function operates on a block of samples
Coefficients b0, b1, b2, a1, and a2 are in registers
Previous states, x[n-1], x[n-2], y[n-1], and y[n-2] are in registers
Inner loop on Cortex-M3 takes 27-47 cycles per sample
Inner loop on Cortex-M4 takes 16 cycles per sample
$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] - a_1 y[n-1] - a_2 y[n-2]$$
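For reference, the inner loop analysed above corresponds to a direct form I biquad section. The sketch below is one way to write it as a block-processing function. Using `float` samples and the names `biquad` and `biquad_process` are assumptions made for illustration; the slide does not state the data type. The signs follow the listing, where the feedback terms are subtracted.

```c
#include <stddef.h>

typedef struct {
    float b0, b1, b2, a1, a2;  /* filter coefficients */
    float x1, x2, y1, y2;      /* x[n-1], x[n-2], y[n-1], y[n-2] */
} biquad;

/* Filter n samples from x into y, keeping the state in *s between calls. */
void biquad_process(biquad *s, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float xn = x[i];
        float yn = s->b0 * xn + s->b1 * s->x1 + s->b2 * s->x2
                 - s->a1 * s->y1 - s->a2 * s->y2;
        y[i] = yn;
        s->x2 = s->x1;  /* shift the input history */
        s->x1 = xn;
        s->y2 = s->y1;  /* shift the output history */
        s->y1 = yn;
    }
}
```

Feeding a unit impulse through a filter with b0 = 1, b1 = 2, b2 = 3 and zero feedback coefficients returns 1, 2, 3, 0, which is a quick way to check the coefficient ordering.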
Further optimization strategies
Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics
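FIR Filter Standard C Code
/* state[] is a circular buffer of filtLen samples defined elsewhere */
void fir(q31_t *in, q31_t *out, q31_t *coeffs, int *stateIndexPtr,
         int filtLen, int blockSize)
{
    int sample;
    int k;
    q31_t sum;
    int stateIndex = *stateIndexPtr;

    for (sample = 0; sample < blockSize; sample++)
    {
        state[stateIndex] = in[sample];   /* store the newest sample */
        sum = 0;
        for (k = 0; k < filtLen; k++)
        {
            /* walk backwards from the newest sample, wrapping at 0 */
            sum += coeffs[k] * state[stateIndex];
            stateIndex--;
            if (stateIndex < 0)
            {
                stateIndex = filtLen - 1;
            }
        }
        out[sample] = sum;
        /* the loop ends back at the newest sample; advance the write position */
        stateIndex++;
        if (stateIndex >= filtLen)
        {
            stateIndex = 0;
        }
    }
    *stateIndexPtr = stateIndex;
}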
Block based processing
Inner loop consists of:
Dual memory
fetches
MAC
Pointer updates with
circular addressing
FIR Filter DSP Code
32-bit DSP processor assembly code
Only the inner loop is shown, executes in a single cycle
Optimized assembly code, cannot be achieved in C
lcntr=r2, do FIRLoop until lce;
FIRLoop: f12=f0*f4, f8=f8+f12, f4=dm(i1,m4), f0=pm(i12,m12);
Zero overhead loop (the lcntr / do ... until lce construct)
Multiply and accumulate previous (f12=f0*f4, f8=f8+f12)
Coeff fetch with linear addressing
State fetch with circular addressing
Cortex-M4 - Final FIR Code
sample = blockSize/4;
do
{
    sum0 = sum1 = sum2 = sum3 = 0;
    statePtr = stateBasePtr;
    coeffPtr = (q31_t *)(S->coeffs);
    x0 = *(q31_t *)(statePtr++);
    x1 = *(q31_t *)(statePtr++);
    i = numTaps>>2;
    do
    {
        c0 = *(coeffPtr++);
        x2 = *(q31_t *)(statePtr++);
        x3 = *(q31_t *)(statePtr++);
        sum0 = __SMLALD(x0, c0, sum0);
        sum1 = __SMLALD(x1, c0, sum1);
        sum2 = __SMLALD(x2, c0, sum2);
        sum3 = __SMLALD(x3, c0, sum3);
        c0 = *(coeffPtr++);
        x0 = *(q31_t *)(statePtr++);
        x1 = *(q31_t *)(statePtr++);
        sum0 = __SMLALD(x0, c0, sum0);
        sum1 = __SMLALD(x1, c0, sum1);
        sum2 = __SMLALD(x2, c0, sum2);
        sum3 = __SMLALD(x3, c0, sum3);
    } while(--i);
    *pDst++ = (q15_t) (sum0>>15);
    *pDst++ = (q15_t) (sum1>>15);
    *pDst++ = (q15_t) (sum2>>15);
    *pDst++ = (q15_t) (sum3>>15);
    stateBasePtr = stateBasePtr + 4;
} while(--sample);
Uses loop unrolling, SIMD intrinsics, caching of states and coefficients, and works around circular addressing by using a large state buffer.
Inner loop is 26 cycles for a total of 16 16-bit MACs.
Only 1.625 cycles per filter tap!
Cortex-M4 - FIR performance
DSP assembly code = 1 cycle
Cortex-M4 standard C code takes 12 cycles
Using circular addressing alternative = 8 cycles
After loop unrolling < 6 cycles
After using SIMD instructions < 2.5 cycles
After caching intermediate values ~ 1.6 cycles
Cortex-M4 C code now comparable in performance
Useful Resources
Architecture Reference Manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html
Cortex-M4 Technical Reference Manual:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_processor_r0p1_trm.pdf
Cortex-M4 Devices Generic User Guide:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf