Cortex-M4 Part1

Cortex-M4_part1(2)

Uploaded by

shalini


The ARM Cortex-M4 Processor

Architecture

Module Syllabus
ARM Architectures and Processors
What is ARM Architecture
ARM Processor Families
ARM Cortex-M Series
Cortex-M4 Processor
ARM Processor vs. ARM Architectures

ARM Cortex-M4 Processor

Cortex-M4 Processor Overview
Cortex-M4 Block Diagram
Cortex-M4 Registers

ARM ARCHITECTURES AND PROCESSORS

What is ARM Architecture

ARM architecture is a family of RISC-based processor architectures
Well known for its power efficiency;
Hence widely used in mobile devices, such as smartphones and tablets
Designed and licensed to a wide ecosystem by ARM

ARM Holdings
The company designs ARM-based processors;
Does not manufacture, but licenses designs to semiconductor partners, who add their own Intellectual Property (IP) on top of ARM's IP, then fabricate and sell to customers;
Also offers other IP apart from processors, such as physical IP, interconnect IP, graphics cores, and development tools.

ARM Processor Families

Cortex-A series (Application)
High-performance processors capable of full Operating System (OS) support;
Applications include smartphones, digital TV, smart books, home gateways etc.
Examples: Cortex-A57, Cortex-A53, Cortex-A15, Cortex-A9, Cortex-A8, Cortex-A7, Cortex-A5

Cortex-R series (Real-time)
High performance for real-time applications;
High reliability;
Applications include automotive braking systems, powertrains etc.
Examples: Cortex-R7, Cortex-R5, Cortex-R4

Cortex-M series (Microcontroller)
Cost-sensitive solutions for deterministic microcontroller applications;
Applications include microcontrollers, mixed signal devices, smart sensors, automotive body electronics and airbags;
Examples: Cortex-M4, Cortex-M3, Cortex-M1, Cortex-M0+, Cortex-M0

SecurCore series
High security applications.
Examples: SC000, SC100, SC300

Previous classic processors
Include ARM7, ARM9, ARM11 families

As of Dec 2013

Cortex-M processors are the optimal solution for low-power embedded computing applications. The 32-bit Cortex-M
processor family is the key to transforming all sorts of embedded systems into smart and connected systems. Often
provided as a black box with pre-loaded applications, they have limited capability to expand hardware functionality and in
most cases no screen.
Merchant MCUs
Automotive Control Systems
White Goods controllers
Smart Meters
Sensors
Motor Control Systems
Internet of Things

Equipment Adopting ARM Cores

[Figure: examples of equipment adopting ARM cores: IR fire detector, intelligent toys, utility meters, exercise machines, energy efficient appliances, tele-parking, intelligent vending]

Source: ARM University Program Overview
More Than 50 Billion ARM-Powered Chips Shipped

Over 10 billion ARM-powered chips shipped in 2013 alone.

Strong and Consistent Growth Since 1993
This curve shows overall shipments leading to the 50 billion milestone. There's been an upward trend as shipments skyrocketed in recent years.

[Figure: cumulative shipments from 1993 to 2013, with the 10 billion and 50 billion milestones marked]

www.50billionchips.com

Source: ARM University Program Overview

Markets We're Powering

20% | Embedded
Applications including automotive, touchscreen controllers, industrial equipment, connectivity and smartcards

16% | Enterprise
Applications such as hard disk drives, and wireless/wireline networking infrastructure equipment

6% | Home
Consumer devices such as smart TVs, game consoles and home networking gateways

58% | Mobile
Devices including smartphones, mobile phones, tablets, e-readers and wearables

www.50billionchips.com

Source: ARM University Program Overview

Based on Lecture Notes by Marilyn Wolf

ARM Architecture versions

(From arm.com)

Design an ARM-based SoC

Select a set of IP cores from ARM and/or other third-party IP vendors
Integrate IP cores into a single chip design
Give design to semiconductor foundries for chip fabrication
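IP libraries (licensable IPs)
Processors: Cortex-A9, Cortex-R5, Cortex-M4, ARM7, ARM9, ARM11
Memory controllers: DRAM ctrl, FLASH ctrl, SRAM ctrl
Buses: AXI bus, AHB bus, APB bus
Peripherals: GPIO, I/O blocks, Timer

[Figure: SoC design flow. Licensable IPs are integrated during SoC design into an ARM-based SoC (ARM processor, ROM, RAM, system bus, peripherals, external interface), which then goes to chip manufacture]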

ARM Cortex-M Series

Cortex-M series: Cortex-M0, M0+, M1, M3, M4.

Energy efficiency
Lower energy cost, longer battery life

Smaller code
Lower silicon costs

Ease of use
Faster software development and reuse

Embedded applications
Smart metering, human interface devices, automotive and industrial control systems, white goods, consumer products and medical instrumentation

As of Dec 2013

ARM Processors vs. ARM Architectures

ARM architecture
Describes the details of the instruction set, programmer's model, exception model, and memory map
Documented in the Architecture Reference Manual

ARM processor
Developed using one of the ARM architectures
Adds implementation details, such as timing information
Documented in the processor's Technical Reference Manual

Architecture versions and example processors
ARMv4/v4T: e.g. ARM7TDMI
ARMv5/v5E: e.g. ARM926EJ-S
ARMv6: e.g. ARM1136
ARMv6-M: e.g. Cortex-M0, M1
ARMv7-A: e.g. Cortex-A9
ARMv7-R: e.g. Cortex-R4
ARMv7-M: e.g. Cortex-M4
ARMv8-A: e.g. Cortex-A53, Cortex-A57
ARMv8-R

As of Dec 2013

ARM Cortex-M Series Family

Processor    ARM           Core          Thumb   Thumb-2  Hardware       Hardware  Saturated  DSP         Floating
             Architecture  Architecture                   Multiply       Divide    Math       Extensions  Point
Cortex-M0    ARMv6-M       Von Neumann   Most    Subset   1 or 32 cycle  No        No         Software    No
Cortex-M0+   ARMv6-M       Von Neumann   Most    Subset   1 or 32 cycle  No        No         Software    No
Cortex-M1    ARMv6-M       Von Neumann   Most    Subset   3 or 33 cycle  No        No         Software    No
Cortex-M3    ARMv7-M       Harvard       Entire  Entire   1 cycle        Yes       Yes        Software    No
Cortex-M4    ARMv7E-M      Harvard       Entire  Entire   1 cycle        Yes       Yes        Hardware    Optional

RISC CPU Characteristics

32-bit load/store architecture
Fixed instruction length
Fewer/simpler instructions than a CISC CPU
Limited addressing modes and operand types
Simple design is easier to speed up, pipeline and scale

ARM assembly language

Fairly standard RISC assembly language:

label   LDR r0,[r8]     ; a comment
        ADD r4,r0,r1    ; r4=r0+r1

In ADD r4,r0,r1: r4 is the destination, r0 is the source/left operand, and r1 is the source/right operand.

ARM data types

A word is 32 bits long.
A word can be divided into four 8-bit bytes.
ARM addresses can be 32 bits long.
An address refers to a byte.
Address 4 starts at byte 4.
Can be configured at power-up in either little- or big-endian mode.

Endianness
The relationship between bit and byte/word ordering defines endianness:

little-endian (default):   bit 31 | byte 3 | byte 2 | byte 1 | byte 0 | bit 0
big-endian:                bit 31 | byte 0 | byte 1 | byte 2 | byte 3 | bit 0
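The byte ordering described above can be observed from C. The following is a minimal sketch, not taken from the slides: the function names `is_little_endian` and `swap_bytes32` are hypothetical, and the check reports the byte order of whatever machine runs the code.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if this machine stores byte 0 (the least significant byte)
   of a 32-bit word at the lowest address, i.e. little-endian; else 0. */
int is_little_endian(void)
{
    uint32_t word = 0x03020100u;   /* byte 3 = 0x03 ... byte 0 = 0x00 */
    uint8_t bytes[4];
    memcpy(bytes, &word, sizeof word);
    return bytes[0] == 0x00;
}

/* Reverses the byte order of a 32-bit word, converting a value between
   the little-endian and big-endian layouts. */
uint32_t swap_bytes32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000FF00u) |
           ((w << 8) & 0x00FF0000u) | (w << 24);
}
```

For example, `swap_bytes32(0x03020100u)` yields `0x00010203`, which is the same four bytes read in the opposite order.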

ARM CORTEX-M4 PROCESSOR OVERVIEW

Cortex-M4 Processor Overview

Cortex-M4 Processor
Introduced in 2010
Designed with a large variety of highly efficient signal processing features
Features extended single-cycle multiply-accumulate instructions, optimized SIMD arithmetic, saturating arithmetic and an optional Floating Point Unit.

High Performance Efficiency
1.25 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz), with power on the order of µW/MHz

Low Power Consumption
Longer battery life, especially critical in mobile products

Enhanced Determinism
Critical tasks and interrupt routines can be served quickly, in a known number of cycles

Cortex-M4 Processor Features

32-bit Reduced Instruction Set Computing (RISC) processor
Harvard architecture
Separate data bus and instruction bus

Instruction set
Includes the entire Thumb-1 (16-bit) and Thumb-2 (16/32-bit) instruction sets

3-stage + branch speculation pipeline

Performance efficiency
1.25 to 1.95 DMIPS/MHz (Dhrystone Million Instructions Per Second / MHz)

Supported Interrupts
Non-maskable Interrupt (NMI) + 1 to 240 physical interrupts
8 to 256 interrupt priority levels

Cortex-M4 Processor Features

Supports Sleep Modes

Up to 240 Wake-up Interrupts

Integrated WFI (Wait For Interrupt) and WFE (Wait For Event) Instructions and Sleep On Exit capability (to be
covered in more detail later)

Sleep & Deep Sleep Signals

Optional Retention Mode with ARM Power Management Kit

Enhanced Instructions

Hardware Divide (2-12 Cycles)

Single-Cycle 16, 32-bit MAC, Single-cycle dual 16-bit MAC

8, 16-bit SIMD arithmetic

Debug

Optional JTAG & Serial-Wire Debug (SWD) Ports

Up to 8 Breakpoints and 4 Watchpoints

Memory Protection Unit (MPU)

Optional 8 region MPU with sub regions and background region

Cortex-M4 Processor Features

The Cortex-M4 processor is designed to meet the challenges of low dynamic power constraints while retaining a light footprint

180 nm ultra low power process: 157 µW/MHz
90 nm low power process: 33 µW/MHz
40 nm G process: 8 µW/MHz

ARM Cortex-M4 Implementation Data

Process             180ULL                          90LP                            40G
                    (7-track, typical 1.8 V, 25C)   (7-track, typical 1.2 V, 25C)   (9-track, typical 0.9 V, 25C)
Dynamic Power       157 µW/MHz                      33 µW/MHz                       8 µW/MHz
Floorplanned Area   0.56 mm2                        0.17 mm2                        0.04 mm2

Cortex-M4 Block Diagram

[Figure: ARM Cortex-M4 microprocessor block diagram. The processor core (with optional FPU) sits alongside the Nested Vectored Interrupt Controller (NVIC) with optional WIC, the optional memory protection unit, and the optional debug blocks (debug access port, Embedded Trace Macrocell, Serial Wire Viewer, flash patch, data watchpoints). A bus matrix provides the code interface and the SRAM and peripheral interface.]

Cortex-M4 Block Diagram

Processor core
Contains internal registers, the ALU, data path, and some control logic
Registers include sixteen 32-bit registers for both general and special usage

Processor pipeline stages

Three-stage pipeline: fetch, decode, and execution
Some instructions may take multiple cycles to execute, in which case the pipeline will be stalled
The pipeline will be flushed if a branch instruction is executed
Up to two instructions can be fetched in one transfer (16-bit instructions)

Instruction 1   Fetch   Decode   Execute
Instruction 2           Fetch    Decode    Execute
Instruction 3                    Fetch     Decode    Execute
Instruction 4                              Fetch     ...
Time ------------------------------------------------------>

Cortex-M4 Block Diagram

Nested Vectored Interrupt Controller (NVIC)
Up to 240 interrupt request signals and a non-maskable interrupt (NMI)
Automatically handles nested interrupts, e.g. by comparing priorities between interrupt requests and the current priority level

Wakeup Interrupt Controller (WIC)
For low-power applications, the microcontroller can enter sleep mode by shutting down most of the components.
When an interrupt request is detected, the WIC can inform the power management unit to power up the system.

Memory Protection Unit (optional)
Used to protect memory content, e.g. by making some memory regions read-only or preventing user applications from accessing privileged application data

Cortex-M4 Block Diagram

Bus interconnect
Allows data transfers to take place on different buses simultaneously
Provides data transfer management, e.g. a write buffer and bit-oriented operations (bit-band)
May include bus bridges (e.g. an AHB-to-APB bus bridge) to connect different buses into a network using a single global memory space
Includes the internal bus system, the data path in the processor core, and the AHB-Lite interface unit
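The bit-band operation mentioned in the list above maps each bit of a bit-band region to its own word in an alias region, so a single bit can be read or written with an ordinary word access. The sketch below shows only the address arithmetic for the SRAM region; the macro and function names are hypothetical, and the base addresses are the ones used by the Cortex-M3/M4 memory map (a given device may not implement bit-banding at all, so check its reference manual).

```c
#include <stdint.h>

#define SRAM_BB_BASE   0x20000000u  /* start of the SRAM bit-band region */
#define SRAM_BB_ALIAS  0x22000000u  /* start of the SRAM bit-band alias  */

/* Computes the alias word address for one bit of a byte in the SRAM
   bit-band region: each byte of the region expands to 32 bytes of alias
   space (8 bits x 4 bytes per alias word). This only computes the
   address; it does not access memory. */
uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - SRAM_BB_BASE;
    return SRAM_BB_ALIAS + byte_offset * 32u + bit * 4u;
}
```

For instance, bit 2 of the byte at 0x20000300 maps to the alias word at 0x22006008.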

Debug subsystem
Handles debug control, program breakpoints, and data watchpoints
When a debug event occurs, it can put the processor core in a halted state, where developers can analyse the status of the processor at that point, such as register values and flags

STM32F4xx Block Diagram

[Figure: STM32F4xx block diagram. The recoverable contents are listed below.]

Cortex-M4 w/ FPU, MPU and ETM
CORTEX-M4 CPU + FPU + MPU, 168 MHz
JTAG/SW Debug, ETM, nested vectored interrupt controller, 1 x SysTick timer
D-bus, I-bus, S-bus

Buses
ARM 32-bit multi-AHB bus matrix, arbiter (max 168 MHz)
AHB1 (max 168 MHz), AHB2 (max 168 MHz)
APB2 (max 84 MHz), APB1 (max 42 MHz), each behind a bridge
DMA, 16 channels

Memory
Up to 1 MB Flash memory (512 KB to 1 MB), Flash I/F, RDP (JTAG fuse)
192 KB RAM (128 KB SRAM plus 64 KB CCM data RAM)
4 KB backup RAM
FSMC (external memory interface) up to 60 MHz

Clock and power
Clock control, PLL
XTAL oscillators: 32 kHz + 8~25 MHz
Int. RC oscillators: 32 kHz + 16 MHz
Power supply regulator 1.2 V, POR/PDR/PVD
RTC / AWU

New application specific peripherals
USB OTG HS w/ ULPI interface
Camera interface
HW Encryption**: DES, 3DES, AES 256-bit, SHA-1 hash, RNG

Enhanced peripherals
USB OTG Full speed
ADC: 0.416 µs conversion/2.4 Msps, up to 7.2 Msps in interleaved triple mode
ADC/DAC working down to 1.8 V
Dedicated PLL for I2S precision
Ethernet w/ HW IEEE1588 v2.0
32-bit RTC with calendar
4 KB backup SRAM in VBAT domain
2 x 32-bit and 8 x 16-bit timers
High speed USART up to 10.5 Mb/s
High speed SPI up to 37.5 Mb/s

Other blocks shown in the diagram
51/82/114/140 I/Os (more I/Os in UFBGA 176 package)
Up to 16 ext. ITs
2x6x 16-bit PWM, synchronized AC timer
3 x 16-bit timer; 5x 16-bit timer; 2x 32-bit timer
1 x SPI; 2 x SPI / I2S
2 x USART/LIN; 4x USART/LIN
3x I2C
2x CAN 2.0B
1x SDIO
3x 12-bit ADC, 24 channels / 2 Msps
Temp sensor
2x DAC + 2 timers
2x watchdog (independent & window)
Ethernet MAC 10/100, IEEE1588
USB 2.0 OTG FS; USB 2.0 OTG FS/HS
Encryption**
Camera interface

HS requires an external PHY connected to the ULPI interface
** Encryption is only available on STM32F415 and STM32F417

STM32F4 Series highlights 1/3

Based on Cortex M4 core
The new DSP and FPU instructions combined to 168MHz
Over 30 new part numbers pin-to-pin and software compatible with
existing STM32 F2 Series.

Advanced technology and process from ST:

Memory accelerator: ART Accelerator
Multi AHB Bus Matrix
90nm process

Outstanding results:
210DMIPS at 168MHz.
Execution from Flash equivalent to 0-wait state performance up to 168MHz thanks to ST
ART Accelerator

STM32F4 Series highlights 2/3

More Memory
Up to 1MB Flash with option to permanent readout protection (JTAG fuse),
192kB SRAM: 128kB on bus matrix + 64kB (Core Coupled Memory) on data bus dedicated to
the CPU usage

Advanced peripherals

USB OTG High speed 480Mbit/s

Ethernet MAC 10/100 with IEEE1588
PWM High speed timers: 168MHz max frequency
Crypto/Hash processor, 32-bit random number generator (RNG)
32-bit RTC with calendar: with sub 1 second accuracy, and <1uA


STM32F4 Series highlights 3/3

Further improvements

Low voltage: 1.8V to 3.6V VDD , down to 1.7*V on most packages

Full duplex I2S peripherals
12-bit ADC: 0.41 µs conversion/2.4Msps (7.2Msps in interleaved mode)
High speed USART up to 10.5Mbits/s
High speed SPI up to 37.5Mbits/s
Camera interface up to 54MBytes/s

*external reset circuitry required to support 1.7V


STM32F4 portfolio

Extensive tools and SW

Evaluation board for full product feature evaluation

Hardware evaluation platform for all interfaces
Possible connection to all I/Os and all peripherals
Discovery kit for cost-effective evaluation and prototyping

STM3240G-EVAL
$349

Starter kits from 3rd parties available soon

Large choice of development IDE solutions from the STM32 and ARM
ecosystem

STM32F4DISCOVERY

$14.90

Tools for development SW (examples)

Commercial ones:
IAR eval 32kB/30days for test
[RK-System]
Keil (ARM) eval 32kB for test
[WG Electronics]
Based on GCC commercial:
Atollic Lite (no hex/bin, limited debug), [Kamami]
Raisonance debug limited to 32kB
Rowley Crossworks 30 days for test
Free
STVP FLASH prog.
STLink utility FLASH prog.
(+cmd line)
ST FlashLoader FLASH prog.
Libraries (free)
Standard peripherals library with CMSIS
USB device library

Cortex-M processors binary compatible

Cortex-M feature set comparison

                                Cortex-M0                Cortex-M3         Cortex-M4
Architecture Version            v6M                      v7M               v7ME
Instruction set architecture    Thumb, Thumb-2 System    Thumb + Thumb-2   Thumb + Thumb-2, DSP,
                                Instructions                               SIMD, FP
DMIPS/MHz                       0.9                      1.25              1.25
Bus interfaces                  1                        3                 3
Integrated NVIC                 Yes                      Yes               Yes
Number interrupts               1-32 + NMI               1-240 + NMI       1-240 + NMI
Interrupt priorities            4                        8-256             8-256
Breakpoints, Watchpoints        4/2/0, 2/1/0             8/4/0, 2/1/0      8/4/0, 2/1/0
Memory Protection Unit (MPU)    No                       Yes (Option)      Yes (Option)
Integrated trace option (ETM)   No                       Yes (Option)      Yes (Option)
Fault Robust Interface
Single Cycle Multiply           Yes (Option)             Yes               Yes
Hardware Divide                 No                       Yes               Yes
WIC Support                     Yes                      Yes               Yes
Bit banding support             No                       Yes               Yes
Single cycle DSP/SIMD           No                       No                Yes
Floating point hardware         No                       No                Yes
Bus protocol                    AHB Lite                 AHB Lite, APB     AHB Lite, APB
CMSIS Support                   Yes                      Yes               Yes

ARM CORTEX-M4 PROCESSOR REGISTERS

Cortex-M4 Registers
Processor registers
The internal registers are used to store and process temporary data within the processor core
All registers are inside the processor core, hence they can be accessed quickly
Load-store architecture
To process memory data, it first has to be loaded from memory into registers, processed inside the processor core using register data only, and then written back to memory if needed

Cortex-M4 registers
Register bank
Sixteen 32-bit registers (thirteen are used for general-purpose);
Special registers

Cortex-M4 Registers

Register bank
R0-R7: low registers (general purpose registers)
R8-R12: high registers (general purpose registers)
R13 (banked): Stack Pointer (SP)
    MSP: Main Stack Pointer
    PSP: Process Stack Pointer
R14: Link Register (LR)
R15: Program Counter (PC)

Special registers
Program Status Registers (xPSR): APSR (Application PSR), EPSR (Execution PSR), IPSR (Interrupt PSR)
Interrupt mask registers: PRIMASK, FAULTMASK, BASEPRI
CONTROL: stack definition

Cortex-M4 Registers
R0-R12: general purpose registers
Low registers (R0-R7) can be accessed by any instruction
High registers (R8-R12) sometimes cannot be accessed, e.g. by some Thumb (16-bit) instructions

R13: Stack Pointer (SP)
Records the current address of the stack
Used for saving the context of a program while switching between tasks
Cortex-M4 has two SPs: the Main SP, used in applications that require privileged access (e.g. OS kernel and exception handlers), and the Process SP, used in base-level application code (when not running an exception handler)

Program Counter (PC)
Records the address of the current instruction code
Automatically incremented by 4 at each operation (for 32-bit instruction code), except for branching operations
A branching operation changes the PC to a specific address; a function call additionally saves the return address in the Link Register (LR)

[Figure: memory map from low to high addresses showing code (addressed by the PC), data, heap, and stack (addressed by the SP, with PUSH and POP operations)]

Cortex-M4 Registers
R14: Link Register (LR)
The LR is used to store the return address of a subroutine or a function call
The program counter (PC) will load the value from LR after a function is finished

Call a subroutine
1. Save current PC to LR
2. Load PC with the starting address of the subroutine

Return from a subroutine to the main program
Load PC with the address in LR to return to the main program

[Figure: code region holding the main program code and the subroutine, with the current PC and LR marked for the call and for the return]

Cortex-M4 Registers
xPSR, combined Program Status Register
Provides information about program execution and ALU flags
Application PSR (APSR)
Interrupt PSR (IPSR)
Execution PSR (EPSR)

Bit layout (bit31 on the left, bit0 on the right):
APSR:  NZCVQ [31:27] | Reserved
IPSR:  Reserved | ISR number [8:0]
EPSR:  Reserved | ICI/IT | T [24] | Reserved | ICI/IT | Reserved
xPSR:  NZCVQ [31:27] | ICI/IT | T [24] | Reserved | ICI/IT | Reserved | ISR number [8:0]

Cortex-M4 Registers
APSR
N: negative flag, set to one if the result from the ALU is negative
Z: zero flag, set to one if the result from the ALU is zero
C: carry flag, set to one if an unsigned overflow occurs
V: overflow flag, set to one if a signed overflow occurs
Q: sticky saturation flag, set to one if saturation has occurred in saturating arithmetic instructions, or overflow has occurred in certain multiply instructions

IPSR
ISR number: the number of the currently executing interrupt service routine

EPSR
T: Thumb state, always one since Cortex-M4 only supports the Thumb state (more on processor states in the next module)
ICI/IT: Interrupt-Continuable Instruction (ICI) bits, IF-THEN instruction status bits
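The field descriptions above can be turned into a small decoder. This is an illustrative sketch rather than code from the slides: the struct and function names are hypothetical, and it uses the ARMv7-M bit positions (N, Z, C, V, Q in bits 31 to 27, T in bit 24, ISR number in bits 8 to 0). The ICI/IT bits are left out to keep the example short.

```c
#include <stdint.h>

/* Selected xPSR fields, one member per flag or field. */
typedef struct {
    unsigned n, z, c, v, q;   /* APSR flags, bits 31..27      */
    unsigned t;               /* EPSR Thumb state bit, bit 24 */
    unsigned isr_number;      /* IPSR, bits 8..0              */
} xpsr_fields;

/* Splits a raw 32-bit xPSR value into its fields. */
xpsr_fields decode_xpsr(uint32_t xpsr)
{
    xpsr_fields f;
    f.n = (xpsr >> 31) & 1u;
    f.z = (xpsr >> 30) & 1u;
    f.c = (xpsr >> 29) & 1u;
    f.v = (xpsr >> 28) & 1u;
    f.q = (xpsr >> 27) & 1u;
    f.t = (xpsr >> 24) & 1u;
    f.isr_number = xpsr & 0x1FFu;
    return f;
}
```

As a usage example, the value 0x6100000F decodes to Z = 1, C = 1, T = 1 and ISR number 15, with the remaining flags clear.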

ARM status bits

Every arithmetic, logical, or shifting operation can set CPSR bits:
N (negative), Z (zero), C (carry), V (overflow)
Examples:
-1 + 1 = 0: NZCV = 0110.
2^31 - 1 + 1 = -2^31: NZCV = 1001.
Setting status bits must be explicitly enabled on each instruction
e.g. adds sets status bits, whereas add does not
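The two NZCV examples above can be reproduced with a short model of a flag-setting 32-bit addition. This is a sketch for illustration only; the names `nzcv_flags` and `adds_flags` are hypothetical, and the function models the flags in portable C rather than reading the real status register.

```c
#include <stdint.h>

typedef struct { unsigned n, z, c, v; } nzcv_flags;

/* Models the flags produced by a flag-setting 32-bit add of a and b. */
nzcv_flags adds_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;          /* unsigned add wraps modulo 2^32 */
    nzcv_flags f;
    f.n = r >> 31;               /* sign bit of the result          */
    f.z = (r == 0);              /* result is zero                  */
    f.c = (r < a);               /* carry out = unsigned overflow   */
    /* Signed overflow: operands share a sign that the result lacks. */
    f.v = ((~(a ^ b) & (a ^ r)) >> 31) & 1u;
    return f;
}
```

Calling `adds_flags(0xFFFFFFFFu, 1u)` models -1 + 1 and gives NZCV = 0110, while `adds_flags(0x7FFFFFFFu, 1u)` models 2^31 - 1 + 1 and gives NZCV = 1001.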

Cortex-M4 Registers
Interrupt mask registers
1-bit PRIMASK
When set to one, blocks all interrupts apart from the non-maskable interrupt (NMI) and the hard fault exception

1-bit FAULTMASK
When set to one, blocks all interrupts apart from the NMI

BASEPRI (priority threshold, up to 8 bits)
When set to a nonzero value, blocks all interrupts of the same or lower priority level (only interrupts with higher priorities are allowed)

CONTROL: special register

1-bit stack definition
Set to one: use the process stack pointer (PSP)
Clear to zero: use the main stack pointer (MSP)
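The masking rules for PRIMASK, FAULTMASK and BASEPRI can be summarised as a priority threshold. The sketch below is a simplified model under stated assumptions, not the hardware algorithm: the function and macro names are hypothetical, a lower number means a higher priority, NMI and hard fault use the fixed priorities -2 and -1, and the priority of any handler that is already running is ignored.

```c
#include <stdint.h>

#define PRIO_NMI        (-2)   /* fixed priority of the NMI        */
#define PRIO_HARDFAULT  (-1)   /* fixed priority of the hard fault */

/* Returns 1 if an exception with the given priority is allowed through
   the mask registers, 0 if it is blocked. */
int passes_masks(int priority, unsigned primask, unsigned faultmask,
                 uint8_t basepri)
{
    int threshold = 256;                 /* nothing masked by default   */
    if (basepri != 0) threshold = basepri; /* same or lower is blocked  */
    if (primask)      threshold = 0;     /* only NMI and hard fault pass */
    if (faultmask)    threshold = -1;    /* only NMI passes              */
    return priority < threshold;
}
```

With BASEPRI set to 0x40, an interrupt at priority 0x20 still passes, while interrupts at 0x40 or 0x60 are blocked.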

Cortex-M4 Registers

Bit layout (bit31 on the left, bit0 on the right):
PRIMASK:    Reserved | PRIMASK [0]
FAULTMASK:  Reserved | FAULTMASK [0]
BASEPRI:    Reserved | BASEPRI [7:0]
CONTROL:    Reserved | Stack definition

Cortex-M4 Operation Modes

Cortex M4

DSP features

Cortex-M4 processor architecture

ARMv7ME Architecture

Thumb-2 Technology
DSP and SIMD extensions
Single cycle MAC (Up to 32 x 32 + 64 -> 64)
Optional single precision FPU
Integrated configurable NVIC
Compatible with Cortex-M3

Microarchitecture
3-stage pipeline with branch speculation
3x AHB-Lite Bus Interfaces

Configurable for ultra low power

Deep Sleep Mode, Wakeup Interrupt Controller
Power down features for Floating Point Unit

Flexible configurations for wider applicability

Configurable Interrupt Controller (1-240 Interrupts and Priorities)
Optional Memory Protection Unit
Optional Debug & Trace

Cortex-M4 overview
Main Cortex-M4 processor features
ARMv7-ME architecture revision
Fully compatible with Cortex-M3 instruction set

Single-cycle multiply-accumulate (MAC) unit

Optimized single instruction multiple data (SIMD) instructions
Saturating arithmetic instructions
Optional single precision Floating-Point Unit (FPU)
Hardware divide (2-12 cycles), same as Cortex-M3
Barrel shifter (same as Cortex-M3)

Single-cycle multiply-accumulate unit

The multiplier unit allows any MUL or MAC
instructions to be executed in a single cycle
Signed/Unsigned Multiply
Signed/Unsigned Multiply-Accumulate
Signed/Unsigned Multiply-Accumulate Long (64-bit)

Benefits : Speed improvement vs. Cortex-M3

4x for 16-bit MAC (dual 16-bit MAC)
2x for 32-bit MAC
up to 7x for 64-bit MAC

Cortex-M4 extended single cycle MAC

OPERATION                          INSTRUCTIONS                          CM3   CM4
16 x 16 = 32                       SMULBB, SMULBT, SMULTB, SMULTT        n/a   1
16 x 16 + 32 = 32                  SMLABB, SMLABT, SMLATB, SMLATT        n/a   1
16 x 16 + 64 = 64                  SMLALBB, SMLALBT, SMLALTB, SMLALTT    n/a   1
16 x 32 = 32                       SMULWB, SMULWT                        n/a   1
(16 x 32) + 32 = 32                SMLAWB, SMLAWT                        n/a   1
(16 x 16) ± (16 x 16) = 32         SMUAD, SMUADX, SMUSD, SMUSDX          n/a   1
(16 x 16) ± (16 x 16) + 32 = 32    SMLAD, SMLADX, SMLSD, SMLSDX          n/a   1
(16 x 16) ± (16 x 16) + 64 = 64    SMLALD, SMLALDX, SMLSLD, SMLSLDX      n/a   1
32 x 32 = 32                       MUL                                   1     1
32 ± (32 x 32) = 32                MLA, MLS                              2     1
32 x 32 = 64                       SMULL, UMULL                          5-7   1
(32 x 32) + 64 = 64                SMLAL, UMLAL                          5-7   1
(32 x 32) + 32 + 32 = 64           UMAAL                                 n/a   1
32 ± (32 x 32) = 32 (upper)        SMMLA, SMMLAR, SMMLS, SMMLSR          n/a   1
(32 x 32) = 32 (upper)             SMMUL, SMMULR                         n/a   1

All the above operations are single cycle on the Cortex-M4 processor

Saturated arithmetic
Intrinsically prevents overflow of a variable by clipping to min/max boundaries, and removes the CPU burden of software range checks

Benefits
Audio applications
[Figure: the same waveform plotted without saturation and with saturation, amplitude axis from -1.5 to 1.5]

Control applications
The PID controller's integral term is continuously accumulated over time. Saturation automatically limits its value and saves several CPU cycles per regulator
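The clipping behaviour described above can be written out in portable C to show what the software range check looks like when no saturating instruction is used. This is a sketch with a hypothetical function name, `sat_add32`; it is the kind of check that a saturating add instruction performs in hardware.

```c
#include <stdint.h>

/* Adds two signed 32-bit values and clips the result to the
   representable range instead of letting it wrap around. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (sum > INT32_MAX) return INT32_MAX;   /* clip at the upper boundary */
    if (sum < INT32_MIN) return INT32_MIN;   /* clip at the lower boundary */
    return (int32_t)sum;
}
```

For the PID case, accumulating the integral term as `integral = sat_add32(integral, error);` keeps it pinned at the limit instead of wrapping to the opposite sign.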

Single-cycle SIMD instructions

Stands for Single Instruction Multiple Data

Operates on packed data
Allows several operations to be performed simultaneously on 8-bit or 16-bit data

e.g. dual 16-bit MAC (Result = 16x16 + 16x16 + 32)

Benefits
Parallelizes operations (2x to 4x speed gain)
Minimizes the number of load/store instructions for exchanges between memory and the register file (2 or 4 data items transferred at once), when 32-bit data is not necessary
Maximizes register file use (1 register holds 2 or 4 values)
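The dual 16-bit MAC (Result = 16x16 + 16x16 + 32) can be modelled in portable C to make the packed-data idea concrete. The sketch below is a reference model with hypothetical names (`pack16x2`, `dual_mac16`); it assumes the accumulated result fits in 32 bits and does not model the hardware's overflow flag.

```c
#include <stdint.h>

/* Packs two signed 16-bit values into one 32-bit word
   (lo in bits 15..0, hi in bits 31..16). */
uint32_t pack16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Sign-extends the low 16 bits of v to a 32-bit signed value. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Dual 16-bit multiply-accumulate on packed operands:
   acc + x.lo * y.lo + x.hi * y.hi */
int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    return acc + sext16(x) * sext16(y) + sext16(x >> 16) * sext16(y >> 16);
}
```

With x holding (3, -4) and y holding (5, 6), `dual_mac16(x, y, 100)` computes 100 + 15 - 24 = 91 from just two packed operands.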

Packed data types

Byte or halfword quantities packed into words

Allows more efficient access to packed structure types
SIMD instructions can act on packed data
Instructions to extract and pack data

[Figure: extract splits a packed word into the zero-extended words 00......00 A and 00......00 B; pack combines them back into one word]

IIR single cycle MAC benefit

                           Cortex-M3      Cortex-M4
                           cycle count    cycle count
xN = *x++;                 2              2
yN = xN * b0;              3-7            1
yN += xNm1 * b1;           3-7            1
yN += xNm2 * b2;           3-7            1
yN -= yNm1 * a1;           3-7            1
yN -= yNm2 * a2;           3-7            1
*y++ = yN;                 2              2
xNm2 = xNm1;               1              1
xNm1 = xN;                 1              1
yNm2 = yNm1;               1              1
yNm1 = yN;                 1              1
Decrement loop counter     1              1
Branch                     2              2

Only looking at the inner loop, making these assumptions:
Function operates on a block of samples
Coefficients b0, b1, b2, a1, and a2 are in registers
Previous states, x[n-1], x[n-2], y[n-1], and y[n-2] are in registers

Inner loop on Cortex-M3 takes 27-47 cycles per sample

Inner loop on Cortex-M4 takes 16 cycles per sample

y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]
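The inner loop above can be wrapped into a complete function for experimentation. The following is a minimal sketch under assumptions not in the slides: it uses `float` samples for readability, and the `biquad` struct and `biquad_process` function are hypothetical names. The statement order inside the loop follows the slide's listing.

```c
/* Coefficients and previous states of one second-order IIR section. */
typedef struct {
    float b0, b1, b2, a1, a2;
    float xNm1, xNm2, yNm1, yNm2;   /* x[n-1], x[n-2], y[n-1], y[n-2] */
} biquad;

/* Filters n samples from x into y, updating the states in s. */
void biquad_process(biquad *s, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        float xN = x[i];
        float yN = xN * s->b0;
        yN += s->xNm1 * s->b1;
        yN += s->xNm2 * s->b2;
        yN -= s->yNm1 * s->a1;
        yN -= s->yNm2 * s->a2;
        y[i] = yN;
        s->xNm2 = s->xNm1;          /* shift the input history  */
        s->xNm1 = xN;
        s->yNm2 = s->yNm1;          /* shift the output history */
        s->yNm1 = yN;
    }
}
```

Feeding a unit impulse through a section with b0 = 1, b1 = 2, b2 = 3 and a1 = a2 = 0 returns the coefficients themselves, which is a quick sanity check of the feed-forward path.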

Further optimization strategies

Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics

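FIR Filter Standard C Code

/* state[] is the circular sample buffer of length filtLen; its
   declaration is not shown on the slide. */
void fir(q31_t *in, q31_t *out, q31_t *coeffs, int *stateIndexPtr,
         int filtLen, int blockSize)
{
    int sample; int k; q31_t sum;
    int stateIndex = *stateIndexPtr;
    for(sample=0; sample < blockSize; sample++)
    {
        state[stateIndex] = in[sample];   /* store the newest sample */
        sum=0;
        for(k=0;k<filtLen;k++)
        {
            sum += coeffs[k] * state[stateIndex]; stateIndex--;
            if (stateIndex < 0)
            {
                stateIndex = filtLen-1;   /* wrap around the buffer */
            }
        }
        /* stateIndex is back at the newest sample; advance with wrap */
        stateIndex++;
        if (stateIndex >= filtLen)
        {
            stateIndex = 0;
        }
        out[sample]=sum;
    }
    *stateIndexPtr = stateIndex;
}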

Block based processing

Inner loop consists of:
Dual memory fetches
MAC
Pointer updates with circular addressing
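The three inner-loop ingredients listed above (two memory fetches, a MAC, and a circular pointer update) can be seen in a self-contained version of the filter. This is a sketch under assumptions, not the slide's code: the `fir_filter` struct and `fir_process` function are hypothetical names, plain `int32_t` samples with a 64-bit accumulator replace the fixed-point types, and no output scaling is applied, so results are assumed to fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const int32_t *coeffs;   /* filt_len coefficients                 */
    int32_t *state;          /* circular buffer of filt_len samples   */
    int filt_len;
    int state_index;         /* slot that receives the next sample    */
} fir_filter;

/* Filters block_size samples from in to out. */
void fir_process(fir_filter *f, const int32_t *in, int32_t *out,
                 int block_size)
{
    assert(f->filt_len > 0);
    for (int n = 0; n < block_size; n++) {
        f->state[f->state_index] = in[n];          /* newest sample   */
        int64_t sum = 0;
        int idx = f->state_index;
        for (int k = 0; k < f->filt_len; k++) {
            /* two fetches (coefficient and state) and one MAC */
            sum += (int64_t)f->coeffs[k] * f->state[idx];
            if (--idx < 0) idx = f->filt_len - 1;  /* circular update */
        }
        out[n] = (int32_t)sum;
        if (++f->state_index >= f->filt_len) f->state_index = 0;
    }
}
```

The compare-and-wrap on every tap is the overhead that a DSP's hardware circular addressing removes, which is why the later slides look for alternatives to it.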

FIR Filter DSP Code

32-bit DSP processor assembly code
Only the inner loop is shown, executes in a single cycle
Optimized assembly code, cannot be achieved in C

lcntr=r2, do FIRLoop until lce;
FIRLoop: f12=f0*f4, f8=f8+f12, f4=dm(i1,m4), f0=pm(i12,m12);

The single instruction combines:
Zero overhead loop
Multiply and accumulate previous
Coeff fetch with linear addressing
State fetch with circular addressing

Cortex-M4 - Final FIR Code

sample = blockSize/4;
do
{
    sum0 = sum1 = sum2 = sum3 = 0;
    statePtr = stateBasePtr;
    coeffPtr = (q31_t *)(S->coeffs);
    x0 = *(q31_t *)(statePtr++);
    x1 = *(q31_t *)(statePtr++);
    i = numTaps>>2;
    do
    {
        c0 = *(coeffPtr++);
        x2 = *(q31_t *)(statePtr++);
        x3 = *(q31_t *)(statePtr++);
        sum0 = __SMLALD(x0, c0, sum0);
        sum1 = __SMLALD(x1, c0, sum1);
        sum2 = __SMLALD(x2, c0, sum2);
        sum3 = __SMLALD(x3, c0, sum3);

        c0 = *(coeffPtr++);
        x0 = *(q31_t *)(statePtr++);
        x1 = *(q31_t *)(statePtr++);
        sum0 = __SMLALD(x0, c0, sum0);
        sum1 = __SMLALD(x1, c0, sum1);
        sum2 = __SMLALD(x2, c0, sum2);
        sum3 = __SMLALD(x3, c0, sum3);
    } while(--i);
    *pDst++ = (q15_t) (sum0>>15);
    *pDst++ = (q15_t) (sum1>>15);
    *pDst++ = (q15_t) (sum2>>15);
    *pDst++ = (q15_t) (sum3>>15);
    stateBasePtr = stateBasePtr + 4;
} while(--sample);

Uses loop unrolling, SIMD intrinsics, caching of states and coefficients, and works around circular addressing by using a large state buffer.
Inner loop is 26 cycles for a total of 16, 16-bit MACs.
Only 1.625 cycles per filter tap!

Cortex-M4 - FIR performance

DSP assembly code = 1 cycle

Cortex-M4 standard C code takes 12 cycles

Using circular addressing alternative = 8 cycles
After loop unrolling < 6 cycles
After using SIMD instructions < 2.5 cycles
After caching intermediate values ~ 1.6 cycles
Cortex-M4 C code now comparable in performance

Useful Resources
Architecture Reference Manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403c/index.html

Cortex-M4 Technical Reference Manual:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439d/DDI0439D_cortex_m4_processor_r0p1_trm.pdf

Cortex-M4 Devices Generic User Guide:
http://infocenter.arm.com/help/topic/com.arm.doc.dui0553a/DUI0553A_cortex_m4_dgug.pdf
