Intel® Optimization Reference Manual, Volume 1 (Revision 050)
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel
Corporation or its subsidiaries. Other names and brands may be claimed as the property of
others.
REVISION HISTORY
CHAPTER 1
INTRODUCTION
1.1 ABOUT THIS MANUAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 TUNING YOUR APPLICATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-1
1.3 INTEL PROCESSORS SUPPORTING THE INTEL® 64 ARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
1.4 ORGANIZATION OF THIS MANUAL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
1.4.1 Chapter Summaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
1.4.1.1 Volume 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
1.4.1.2 Volume 2: Earlier Generations of Intel® 64 and IA-32 Processor Architectures . . . . . . . . . . . . . 1-5
1.5 GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
1.6 RELATED INFORMATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
CHAPTER 2
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
2.1 6TH GENERATION INTEL® XEON® SCALABLE PROCESSOR FAMILY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.1.1 The Redwood Cove Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.1.1.1 Branch Hint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
2.1.1.2 Profiling Support for Branch Hints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
2.1.1.3 New Redwood Cove Macro-Fusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
2.1.1.4 Improved Memory BW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
2.1.1.5 Array of Pointers Prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
2.1.1.6 LLC Page Prefetcher (LLCPP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
2.2 THE SAPPHIRE RAPIDS MICROARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
2.2.1 4th Generation Intel® Xeon® Scalable Family of Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.3 THE ALDER LAKE PERFORMANCE HYBRID ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.3.1 12th Generation Intel® Core™ Processors Supporting Performance Hybrid Architecture . . . . . . . 2-6
2.3.2 Hybrid Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.3.2.1 Intel® Thread Director . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.3.2.2 Scheduling with Intel® Hyper-Threading Technology Enabled on Processors Supporting x86 Hybrid Architecture . . . . . . . . . . . . . . . . . . 2-9
2.3.2.3 Scheduling with a Multi-E-Core Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.3.2.4 Scheduling Background Threads on x86 Hybrid Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.3.3 Recommendations for Application Developers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.4 THE GOLDEN COVE MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
2.4.1 Golden Cove Microarchitecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11
2.4.1.1 Cache Subsystem and Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
2.4.1.2 Avoiding Destination False Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
2.5 ICE LAKE CLIENT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.5.1 Ice Lake Client Microarchitecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
2.5.1.1 The Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
2.5.1.2 The Out of Order and Execution Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
2.5.1.3 Cache and Memory Subsystem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
2.5.1.4 Fast Store Forwarding Prediction (FSFP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
2.5.1.5 New Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
2.5.1.6 Ice Lake Client Microarchitecture Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
2.6 SKYLAKE SERVER MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
2.6.1 Skylake Server Microarchitecture Cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
2.6.1.1 Larger Mid-Level Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
2.6.1.2 Non-Inclusive Last Level Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
2.6.1.3 Skylake Server Microarchitecture Cache Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
2.6.2 Non-Temporal Stores on Skylake Server Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
2.6.3 Skylake Server Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
2.7 SKYLAKE CLIENT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
2.7.1 The Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34
2.7.2 The Out-of-Order Execution Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34
2.7.3 Cache and Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-36
2.7.4 Pause Latency in Skylake Client Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
3.1 PERFORMANCE TOOLS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.1 Intel® C++ and Fortran Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.1.2 General Compiler Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.1.3 VTune™ Performance Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.2 PROCESSOR PERSPECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.2.2 Transparent Cache-Parameter Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
3.2.3 Threading Strategy and Hardware Multithreading Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
3.3 CODING RULES, SUGGESTIONS, AND TUNING HINTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
3.4 OPTIMIZING THE FRONT END . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.4.1 Branch Prediction Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.4.1.1 Eliminating Branches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.4.1.2 Static Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3.4.1.3 Inlining, Calls, and Returns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.4.1.4 Code Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.4.1.5 Branch Type Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
3.4.1.6 Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
3.4.2 Fetch and Decode Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.4.2.1 Optimizing for Microfusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.4.2.2 Optimizing for Macrofusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
3.4.2.3 Length-Changing Prefixes (LCP). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
3.4.2.4 Optimizing the Loop Stream Detector (LSD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
3.4.2.5 Optimization for Decoded ICache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
3.4.2.6 Other Decoding Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5 OPTIMIZING THE EXECUTION CORE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5.1 Instruction Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5.1.1 Integer Divide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
3.5.1.2 Using LEA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
3.5.1.3 ADC and SBB in Sandy Bridge Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
CHAPTER 4
INTEL ATOM® PROCESSOR ARCHITECTURES
4.1 THE CRESTMONT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.1.1 Crestmont Microarchitecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.1.2 Predict and Fetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4.1.3 Dynamic Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
4.1.4 Instruction Decode and the On-Demand Instruction Length Decoder (OD-ILD) . . . . . . . . . . . . . . . 4-4
4.1.5 Allocation and Retirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.1.6 The Out-of-Order and Execution Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.1.7 Cache and Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4.1.8 Crestmont New Instruction Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.1 AVX-NE-CONVERT Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.2 AVX-IFMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.3 AVX-VNNI-INT8 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.9 Legacy Intel® AVX1/Intel® AVX2 Instruction Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
CHAPTER 5
CODING FOR SIMD ARCHITECTURES
5.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.1.1 Checking for MMX Technology Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
5.1.2 Checking for Intel® Streaming SIMD Extensions (Intel® SSE) Support . . . . . . . . . . . . . . . . . . . . . . . .5-2
5.1.3 Checking for Intel® Streaming SIMD Extensions 2 (Intel® SSE2) Support. . . . . . . . . . . . . . . . . . . . . .5-2
5.1.4 Checking for Intel® Streaming SIMD Extensions 3 (Intel® SSE3) Support. . . . . . . . . . . . . . . . . . . . . .5-3
5.1.5 Checking for Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) Support . . . . . . . . . .5-3
5.1.6 Checking for Intel® SSE4.1 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
5.1.7 Checking for Intel® SSE4.2 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
5.1.8 Detection of PCLMULQDQ and AESNI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.1.9 Detection of Intel® AVX Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-7
5.1.11 Detection of F16C Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-8
5.1.12 Detection of FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
5.1.13 Detection of Intel® AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
5.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING . . . . . . . . . . . . . . . . . . . . . . . 5-11
5.2.1 Identifying Hot Spots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.2.2 Determine If Code Benefits by Conversion to SIMD Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.3 CODING TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
5.3.1 Coding Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
5.3.1.1 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.3.1.2 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.3.1.3 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
5.3.1.4 Automatic Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
5.4 STACK AND DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
5.4.1 Alignment and Contiguity of Data Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.1.1 Using Padding to Align Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.1.2 Using Arrays to Make Data Contiguous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.2 Stack Alignment for 128-bit SIMD Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
5.4.3 Data Alignment for MMX™ Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.4.4 Data Alignment for 128-bit data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.4.4.1 Compiler-Supported Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.5 IMPROVING MEMORY UTILIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-22
5.5.1 Data Structure Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-22
5.5.2 Strip-Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-25
5.5.3 Loop Blocking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
5.6 INSTRUCTION SELECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
5.7 TUNING THE FINAL APPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
CHAPTER 6
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
6.1 GENERAL RULES ON SIMD INTEGER CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.2 USING SIMD INTEGER WITH X87 FLOATING-POINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.2.1 Using the EMMS Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.2.2 Guidelines for Using EMMS Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.3 DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.4 DATA MOVEMENT CODING TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.4.1 Unsigned Unpack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.4.2 Signed Unpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
6.4.3 Interleaved Pack with Saturation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
6.4.4 Interleaved Pack without Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.4.5 Non-Interleaved Unpack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.4.6 Extract Data Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-10
6.4.7 Insert Data Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-10
6.4.8 Non-Unit Stride Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-12
6.4.9 Move Byte Mask to Integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-12
6.4.10 Packed Shuffle Word for 64-bit Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-13
6.4.11 Packed Shuffle Word for 128-bit Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.4.12 Shuffle Bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.4.13 Conditional Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-14
6.4.14 Unpacking/Interleaving 64-bit Data in 128-bit Registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-15
6.4.15 Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-15
6.4.16 Conversion Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-15
6.5 GENERATING CONSTANTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.6 BUILDING BLOCKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.6.1 Absolute Difference of Unsigned Numbers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-16
6.6.2 Absolute Difference of Signed Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-17
6.6.3 Absolute Value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-17
6.6.4 Pixel Format Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-17
6.6.5 Endian Conversion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-19
6.6.6 Clipping to an Arbitrary Range [High, Low]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-20
6.6.6.1 Highly Efficient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-21
6.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-22
6.6.7 Packed Max/Min of Byte, Word and Dword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-22
6.6.8 Packed Multiply Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-22
6.6.9 Packed Sum of Absolute Differences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.6.10 MPSADBW and PHMINPOSUW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.6.11 Packed Average (Byte/Word). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-23
6.6.12 Complex Multiply by a Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-24
6.6.13 Packed 64-bit Add/Subtract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-24
6.6.14 128-bit Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-24
6.6.15 PTEST and Conditional Branch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-25
6.6.16 Vectorization of Heterogeneous Computations across Loop Iterations. . . . . . . . . . . . . . . . . . . . . .6-25
6.6.17 Vectorization of Control Flows in Nested Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-26
6.7 MEMORY OPTIMIZATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-29
6.7.1 Partial Memory Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-29
6.7.2 Increasing Bandwidth of Memory Fills and Video Fills. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-30
6.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction . . . . . . . . . . . . . . . . . . . . . . . . . .6-31
6.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page . .6-31
6.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores . . . . . . . . . . . . . . . . . . . . . . . .6-31
6.7.3 Reverse Memory Copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-31
6.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-34
6.8.1 SIMD Optimizations and Microarchitectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-35
6.8.1.1 Packed SSE2 Integer versus MMX Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-35
6.9 TUNING PARTIALLY VECTORIZABLE CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-35
6.10 PARALLEL MODE AES ENCRYPTION AND DECRYPTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-38
6.10.1 AES Counter Mode of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-39
CHAPTER 7
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
7.1 GENERAL RULES FOR SIMD FLOATING-POINT CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.2 PLANNING CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.3 USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.4 SCALAR FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5 DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5.1 Data Arrangement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5.1.1 Vertical versus Horizontal Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.5.1.2 Data Swizzling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.5.1.3 Data Deswizzling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.5.1.4 Horizontal ADD Using SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.5.3 Flush-to-Zero and Denormals-are-Zero Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6.1 Dot Product and Horizontal SIMD Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6.2 Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
7.6.3 Using Horizontal SIMD Instruction Sets and Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
7.6.3.1 SOA and Vector Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
CHAPTER 8
INT8 DEEP LEARNING INFERENCE
8.1 INTRODUCING INT8 AS DATA TYPE FOR DEEP LEARNING INFERENCE. . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2 INTRODUCING INTEL® DL BOOST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2.1 Multiply and Add Unsigned and Signed Bytes (VPDPBUSD Instruction). . . . . . . . . . . . . . . . . . . . . . 8-2
8.2.2 Multiply and Add Signed Word Integers (VPDPWSSD Instruction) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3 GENERAL OPTIMIZATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3.1 Memory Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3.2 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.1 Quantization of Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.2 Quantization of Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.3 Quantizing Negative Activations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3 Multicore Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3.1 Large Batch (Throughput Workload) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3.2 Small Batch (Throughput at Latency Workload) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.3.3.3 NUMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4 CNNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1 Convolutional Layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1.1 Direct Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1.2 Convolutional Layers with Low OFM Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
8.4.2 Post Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
8.4.2.1 Fused Quantization/Dequantization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
8.4.2.2 ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.4.2.3 EltWise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.4.2.4 Pooling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.4.2.5 Pixel Shuffler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.5 LSTM NETWORKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.5.1 Fused LSTM Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.5.2 Fused post GEMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
CHAPTER 9
OPTIMIZING CACHE USAGE
9.1 GENERAL PREFETCH CODING GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2 PREFETCH AND CACHEABILITY INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3 PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3.1 Software Data Prefetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3.2 Prefetch Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
9.3.3 Prefetch and Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4 CACHEABILITY CONTROL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4.1 The Non-temporal Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4.1.1 Fencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.2 Streaming Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.3 Memory Type and Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.4 Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.2 Streaming Store Usage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.2.1 Coherent Requests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.2.2 Non-Coherent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.3 Streaming Store Instruction Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.4 The Streaming Load Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5 FENCE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5.1 SFENCE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5.2 LFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.5.3 MFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.6 CLFLUSH Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.7 CLFLUSHOPT Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
9.5 MEMORY OPTIMIZATION USING PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.1 Software-Controlled Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.2 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.3 Example of Effective Latency Reduction with Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
9.5.4 Example of Latency Hiding with S/W Prefetch Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
9.5.5 Software Prefetching Usage Checklist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
9.5.6 Software Prefetch Scheduling Distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
9.5.7 Software Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
9.5.8 Minimize Number of Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-17
9.5.9 Mix Software Prefetch with Computation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-19
9.5.10 Software Prefetch and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-20
9.5.11 Hardware Prefetching and Cache Blocking Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-23
9.5.12 Single-Pass versus Multi-Pass Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-24
9.6 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.6.1 Non-Temporal Stores and Software Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.6.2 Cache Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.1 Video Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.2 Video Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.3 Conclusions from Video Encoder and Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . . . 9-27
9.6.2.4 Optimizing Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-27
9.6.2.5 Using the 8-byte Streaming Stores and Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-28
9.6.2.6 Using 16-byte Streaming Stores and Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-29
9.6.2.7 Performance Comparisons of Memory Copy Routines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-30
9.6.3 Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-30
9.6.3.1 Cache Sharing Using Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
9.6.3.2 Cache Sharing in Single-Core or Multicore. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
9.6.3.3 Determine Prefetch Stride. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
CHAPTER 10
SUB-NUMA CLUSTERING
10.1 SUB-NUMA CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.2 COMPARISON WITH CLUSTER-ON-DIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3 SNC USAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3.1 How to Check NUMA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
10.3.2 MPI Optimizations for SNC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-7
10.3.3 SNC Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-8
CHAPTER 11
MULTICORE AND INTEL® HYPER-THREADING TECHNOLOGY (INTEL® HT)
11.1 PERFORMANCE AND USAGE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
11.1.1 Multithreading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-1
11.1.2 Multitasking Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-3
11.2 PROGRAMMING MODELS AND MULTITHREADING. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
11.2.1 Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.1.1 Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.2 Functional Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.3 Specialized Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.3.1 Producer-Consumer Threading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-5
11.2.4 Tools for Creating Multithreaded Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.1 Programming with OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.2 Automatic Parallelization of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.3 Supporting Development Tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.3 OPTIMIZATION GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
11.3.1 Key Practices of Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.2 Key Practices of System Bus Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.3 Key Practices of Memory Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.4 Key Practices of Execution Resource Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.3.5 Generality and Performance Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.4 THREAD SYNCHRONIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.4.1 Choice of Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
11.4.2 Synchronization for Short Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
11.4.3 Optimization with Spin-Locks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-13
11.4.4 Synchronization for Longer Periods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-13
11.4.4.1 Avoid Coding Pitfalls in Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-14
11.4.5 Prevent Sharing of Modified Data and False-Sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
11.4.6 Placement of Shared Synchronization Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
11.5 SYSTEM BUS OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
11.5.1 Conserve Bus Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
11.5.2 Understand the Bus and Cache Interactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
11.5.3 Avoid Excessive Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
11.5.4 Improve Effective Latency of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.5.5 Use Full Write Transactions to Achieve Higher Data Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.6 MEMORY OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.6.1 Cache Blocking Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-20
11.6.2 Shared-Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-20
11.6.2.1 Minimize Sharing of Data between Physical Processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-20
11.6.2.2 Batched Producer-Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-20
11.6.3 Eliminate 64-KByte Aliased Data Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.7 FRONT END OPTIMIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.7.1 Avoid Excessive Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.8 AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.8.1 Topology Enumeration of Shared Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
11.8.2 Non-Uniform Memory Access (NUMA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
11.9 OPTIMIZATION OF OTHER SHARED RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
11.9.1 Expanded Opportunity for Intel® HT Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
CHAPTER 12
INTEL® OPTANE™ DC PERSISTENT MEMORY
12.1 MEMORY MODE AND APP-DIRECT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
12.1.1 Memory Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-1
12.1.2 App Direct Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-1
12.1.3 Selecting a Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-2
12.2 DEVICE CHARACTERISTICS OF INTEL® OPTANE™ DC PERSISTENT MEMORY MODULE . . . . . . . . . . . 12-4
12.2.1 Intel® Optane™ DC Persistent Memory Module Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
12.2.2 Read vs. Write Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
12.2.3 Number of Threads for Optimal Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-6
12.3 PLATFORM IMPLICATIONS OF HANDLING A SECOND TYPE OF MEMORY . . . . . . . . . . . . . . . . . . . . . . 12-8
12.3.1 Multi-Processor Cache Coherence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-8
12.3.2 Shared Queues in the Memory Hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-9
12.4 IMPLEMENTING PERSISTENCE FOR MEMORY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
12.5 POWER CONSUMPTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.5.1 Read-Write Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
12.5.2 Spatial and Temporal Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-12
CHAPTER 13
64-BIT MODE CODING GUIDELINES
13.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2 CODING RULES AFFECTING 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.1 Use Legacy 32-Bit Instructions When Data Size Is 32 Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.2 Use Extra Registers to Reduce Register Pressure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.3 Effective Use of 64-Bit by 64-Bit Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.2.4 Replace 128-bit Integer Division with 128-bit Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.2.5 Sign Extension to Full 64-Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.3 ALTERNATE CODING RULES FOR 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result . . . . . . . . . . . . 13-5
13.3.2 Using Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
CHAPTER 14
INTEL® SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING
14.1 INTEL® SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-1
14.1.1 CRC32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-4
14.2 USING INTEL® SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
14.2.1 Unaligned Memory Access and Buffer Size Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.2.2 Unaligned Memory Access and String Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3 INTEL® SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3.1 Null Character Identification (Strlen equivalent) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
14.3.2 White-Space-Like Character Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
14.3.3 Substring Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
14.3.4 String Token Extraction and Case Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-19
14.3.5 Unicode Processing and PCMPxSTRy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-22
14.3.6 Replacement String Library Function Using Intel® SSE4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-27
14.4 INTEL® SSE4.2-ENABLED NUMERICAL AND LEXICAL COMPUTATION . . . . . . . . . . . . . . . . . . . . . . . . 14-28
14.5 NUMERICAL DATA CONVERSION TO ASCII FORMAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
14.5.1 Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
14.5.1.1 MULX Instruction and Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
CHAPTER 15
OPTIMIZATIONS FOR INTEL® AVX, INTEL® AVX2, AND INTEL® FMA
15.1 INTEL® AVX INTRINSICS CODING. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-2
15.1.1 Intel® AVX Assembly Coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
15.2 NON-DESTRUCTIVE SOURCE (NDS). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-6
15.3 MIXING AVX CODE WITH SSE CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
15.3.1 Mixing Intel® AVX and Intel SSE in Function Calls. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
15.4 128-BIT LANE OPERATION AND AVX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12
15.4.1 Programming With the Lane Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-12
15.4.2 Strided Load Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-13
15.4.3 The Register Overlap Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-16
15.5 DATA GATHER AND SCATTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-18
15.5.1 Data Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-18
15.5.2 Data Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-20
15.6 DATA ALIGNMENT FOR INTEL® AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-22
15.6.1 Align Data to 32 Bytes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-22
15.6.2 Consider 16-Byte Memory Access when Memory is Unaligned . . . . . . . . . . . . . . . . . . . . . . . . . . 15-23
15.6.3 Prefer Aligned Stores Over Aligned Loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-25
15.7 L1D CACHE LINE REPLACEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-25
15.8 4K ALIASING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-26
15.9 CONDITIONAL SIMD PACKED LOADS AND STORES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-26
15.9.1 Conditional Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-27
15.10 MIXING INTEGER AND FLOATING-POINT CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-29
15.11 HANDLING PORT 5 PRESSURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-32
15.11.1 Replace Shuffles with Blends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-32
15.11.2 Design Algorithm with Fewer Shuffles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-34
15.11.3 Perform Basic Shuffles on Load Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-37
15.12 DIVIDE AND SQUARE ROOT OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-38
15.12.1 Single-Precision Divide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-40
15.12.2 Single-Precision Reciprocal Square Root. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-42
15.12.3 Single-Precision Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-44
15.13 OPTIMIZATION OF ARRAY SUB SUM EXAMPLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-46
15.14 HALF-PRECISION FLOATING-POINT CONVERSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-48
15.14.1 Packed Single-Precision to Half-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-48
15.14.2 Packed Half-Precision to Single-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-50
15.14.3 Locality Consideration for using Half-Precision FP to Conserve Bandwidth. . . . . . . . . . . . . . . . . 15-50
15.15 FUSED MULTIPLY-ADD (FMA) INSTRUCTIONS GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-52
15.15.1 Optimizing Throughput with FMA and Floating-Point Add/MUL . . . . . . . . . . . . . . . . . . . . . . . . . 15-52
15.15.2 Optimizing Throughput with Vector Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-54
15.16 AVX2 OPTIMIZATION GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-55
15.16.1 Multi-Buffering and Intel® AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-60
15.16.2 Modular Multiplication and AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-60
15.16.3 Data Movement Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-60
15.16.3.1 SIMD Heuristics to implement Memcpy() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-61
15.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-62
15.16.3.3 Memset() Implementation Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-62
15.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-63
15.16.3.5 256-bit Fetch versus Two 128-bit Fetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-63
15.16.3.6 Mixing MULX and AVX2 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-64
15.16.4 Considerations for Gather Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-70
15.16.4.1 Strided Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-73
15.16.4.2 Adjacent Loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-75
15.16.5 Intel® AVX2 Conversion Remedy to MMX Instruction Throughput Limitation . . . . . . . . . . . . . . 15-76
CHAPTER 16
POWER OPTIMIZATION FOR MOBILE USAGES
16.1 OVERVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-1
16.2 MOBILE USAGE SCENARIOS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-1
16.2.1 Intelligent Energy Efficient Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
16.3 ACPI C-STATES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-3
16.3.1 Processor-Specific C4 and Deep C4 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-4
16.3.2 Processor-Specific Deep C-States and Intel® Turbo Boost Technology . . . . . . . . . . . . . . . . . . . . . .16-5
16.3.3 Processor-Specific Deep C-States for Sandy Bridge Microarchitecture . . . . . . . . . . . . . . . . . . . . . .16-5
16.3.4 Intel® Turbo Boost Technology 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-6
16.4 GUIDELINES FOR EXTENDING BATTERY LIFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-6
16.4.1 Adjust Performance to Meet Quality of Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-6
16.4.2 Reducing Amount of Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-7
16.4.3 Platform-Level Optimizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-7
16.4.4 Handling Sleep State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-8
16.4.5 Using Enhanced Intel SpeedStep® Technology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-8
16.4.6 Enabling Intel® Enhanced Deeper Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-9
16.4.7 Multicore Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-10
16.4.7.1 Enhanced Intel SpeedStep® Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-10
16.4.7.2 Thread Migration Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-10
16.4.7.3 Multicore Considerations for C-States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-11
16.5 TUNING SOFTWARE FOR INTELLIGENT POWER CONSUMPTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-12
16.5.1 Reduction of Active Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-12
16.5.1.1 Multithreading to Reduce Active Cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-12
16.5.1.2 Vectorization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-13
16.5.2 PAUSE and Sleep(0) Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-14
16.5.3 Spin-Wait Loops. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-15
16.5.4 Using Event Driven Service Instead of Polling in Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-15
16.5.5 Reducing Interrupt Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-15
16.5.6 Reducing Privileged Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-15
16.5.7 Setting Context Awareness in the Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-16
16.5.8 Saving Energy by Optimizing for Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-17
16.6 PROCESSOR SPECIFIC POWER MANAGEMENT OPTIMIZATION FOR SYSTEM SOFTWARE . . . . . . . . 16-17
16.6.1 Power Management Recommendation of Processor-Specific Inactive State Configurations . . 16-17
16.6.1.1 Balancing Power Management and Responsiveness of Inactive To Active State
Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16-19
CHAPTER 17
SOFTWARE OPTIMIZATION FOR INTEL® AVX-512 INSTRUCTIONS
17.1 BASIC INTEL® AVX-512 VS. INTEL® AVX2 CODING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
17.1.1 Intrinsic Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-3
17.1.2 Assembly Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-5
17.2 MASKING. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-7
17.2.1 Masking Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-8
17.2.2 Masking Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-12
17.2.3 Masking vs. Blending. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-13
17.2.4 Nested Conditions / Mask Aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-15
17.2.5 Memory Masking Microarchitecture Improvements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-16
17.2.6 Peeling and Remainder Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-16
17.3 FORWARDING AND UNMASKED OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-19
17.4 FORWARDING AND MEMORY MASKING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-19
17.5 DATA COMPRESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-20
17.5.1 Data Compress Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-21
17.6 DATA EXPAND. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-24
17.6.1 Data Expand Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-25
17.7 TERNARY LOGIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-27
17.7.1 Ternary Logic Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-27
CHAPTER 18
INTEL® ADVANCED VECTOR EXTENSIONS 512 - FP16 INSTRUCTION SET FOR INTEL®
XEON® PROCESSORS
18.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.2 TERM DEFINITIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.3 OVERVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.4 FP16 NUMERIC INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-2
18.4.1 Data Type Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-2
18.4.2 Overview of Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-3
18.4.3 Fundamental Complex-Valued Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-4
18.4.4 Using Intel® AVX-512 Bit Masks for Real-Valued Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-5
18.4.5 Using Intel® AVX-512 Bit Masks for Complex-Valued Operations. . . . . . . . . . . . . . . . . . . . . . . . . . 18-6
18.5 NUMERICS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-9
18.5.1 Introduction to FP16 Number Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-9
18.5.2 Observations on Representing Numbers in FP16 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-10
18.5.3 Numeric Accuracy Guarantees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-11
18.5.4 Handling Denormal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-12
18.5.5 Embedded Rounding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-13
18.5.6 Legacy FP16 Data Type Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-13
18.5.7 FP16 Conversions to and from Other Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-14
18.5.8 Approximation Instructions and Their Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-14
18.5.8.1 Approximate Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-15
18.5.8.2 Approximate Division. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-15
18.5.8.3 Approximate Reciprocal Square Root. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-16
18.5.9 Approximate Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-17
18.6 USING EXISTING INTEL® AVX-512 INSTRUCTIONS TO AUGMENT FP16 SUPPORT . . . . . . . . . . . . . . 18-17
18.6.1 Using Existing Instructions to Extend Intel® AVX-512 FP16 Intrinsics. . . . . . . . . . . . . . . . . . . . . . 18-17
18.6.2 Common Convenience Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-18
18.6.3 Using Integer Comparisons for Fast Floating-Point Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 18-18
18.7 MATH LIBRARY SUPPORT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-19
CHAPTER 19
CRYPTOGRAPHY & FINITE FIELD ARITHMETIC ENHANCEMENTS
19.1 VECTOR AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-1
19.2 VPCLMULQDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
19.3 GALOIS FIELD NEW INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
19.4 INTEGER FUSED MULTIPLY ACCUMULATE OPERATIONS (AVX512_IFMA - VPMADD52) . . . . . . . . . . 19-4
CHAPTER 20
INTEL® ADVANCED MATRIX EXTENSIONS (INTEL® AMX)
20.1 DETECTING INTEL® AMX SUPPORT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.2 INTEL® AMX MICROARCHITECTURE OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.2.1 Intel® AMX Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.3 INTEL® AMX INSTRUCTIONS THROUGHPUT AND LATENCY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.4 DATA STRUCTURE ALIGNMENT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5 GEMMS / CONVOLUTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5.2 Tiles in the Intel® AMX Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-4
20.5.3 B Matrix Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-6
20.5.4 Straightforward GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-9
20.5.5 Optimizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
20.5.5.1 Minimizing Tile Loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
20.5.5.2 Software Pipelining of Tile Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-13
20.5.5.3 Optimized GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-13
20.5.5.4 Direct Convolution with Intel® AMX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-15
CHAPTER 21
INTEL® QUICKASSIST TECHNOLOGY (INTEL® QAT)
21.1 SOFTWARE DESIGN GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-1
21.1.1 Polling vs. Interrupts (If Supported). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-1
21.1.1.1 Interrupt Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
21.1.1.2 Polling Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
21.1.1.3 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-3
21.1.2 Use of Data Plane (DP) API vs. Traditional API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-3
21.1.2.1 Batch Submission of Requests Using the Data Plane API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-3
21.1.3 Synchronous (sync) vs. Asynchronous (async) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-4
21.1.4 Buffer Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-4
21.1.5 Maximum Number of Concurrent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-4
21.1.6 Symmetric Crypto Partial Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-5
21.1.7 Reusing Sessions in QAT Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-5
21.1.8 Maximizing QAT Device Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-5
21.1.9 Best Known Method (BKM) for Avoiding Performance Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . 21-6
21.1.10 Avoid Data Copies By Using SVM and ATS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-6
21.1.11 Avoid Page Faults When Using SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-6
CHAPTER 22
USING PERFORMANCE MONITORING EVENTS
22.1 TOP-DOWN ANALYSIS METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-1
22.1.1 Top-Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-2
22.1.2 Frontend Bound. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.3 Backend Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.4 Memory Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.5 Core Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-5
22.1.6 Bad Speculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-5
22.1.7 Retiring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.8 Golden Cove Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.9 Ice Lake Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.10 Optane Persistent Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.11 Skylake Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.11.1 TMA Use Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-7
22.1.11.2 TMA Use Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-7
22.2 PERFORMANCE MONITORING AND MICROARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-8
22.3 INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-15
22.4 PERFORMANCE ANALYSIS TECHNIQUES FOR INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . 22-16
22.4.1 Cycle Accounting and Uop Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-17
22.4.1.1 Cycle Drill Down and Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-18
22.4.1.2 Basic Block Drill Down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-21
22.4.2 Stall Cycle Decomposition and Core Memory Accesses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-22
22.4.2.1 Measuring Costs of Microarchitectural Conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-22
22.4.3 Core PMU Precise Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-23
22.4.3.1 Precise Memory Access Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-24
22.4.3.2 Load Latency Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-25
22.4.3.3 Precise Execution Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-27
22.4.3.4 Last Branch Record (LBR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-28
22.4.3.5 Measuring Per-Core Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-33
22.4.3.6 Miscellaneous L1 and L2 Events for Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.3.7 TLB Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.3.8 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.4 Front End Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.4.1 Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.4.2 Front End Code Generation Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.5 Uncore Performance Monitoring Events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-36
22.4.5.1 Global Queue Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-36
22.4.5.2 Global Queue Port Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-38
22.4.5.3 Global Queue Snoop Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-38
22.4.5.4 L3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-39
22.4.6 Intel QuickPath Interconnect Home Logic (QHL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-39
22.4.7 Measuring Bandwidth From the Uncore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-45
22.5 PERFORMANCE TUNING TECHNIQUES FOR SANDY BRIDGE MICROARCHITECTURE . . . . . . . . . . . . 22-45
22.5.1 Correlating Performance Bottleneck to Source Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-45
APPENDIX A
APPLICATION PERFORMANCE TOOLS
A.1 COMPILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.1.1 Recommended Optimization Settings for Intel® 64 and IA-32 Processors. . . . . . . . . . . . . . . . . . . . A-2
A.1.2 Vectorization and Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.2.1 Multithreading with OpenMP* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.2.2 Automatic Multithreading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.3 Inline Expansion of Library Functions (/Oi, /Oi-). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4 Interprocedural and Profile-Guided Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4.1 Interprocedural Optimization (IPO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.1.4.2 Profile-Guided Optimization (PGO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.1.5 Intel® Cilk™ Plus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2 PERFORMANCE LIBRARIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2.1 Intel® Integrated Performance Primitives (Intel® IPP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.2 Intel® Math Kernel Library (Intel® MKL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.3 Intel® Threading Building Blocks (Intel® TBB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.4 Benefits Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.3 PERFORMANCE PROFILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1 Intel® VTune™ Amplifier XE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.1 Hardware Event-Based Sampling Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.2 Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.3 Platform Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.4 THREAD AND MEMORY CHECKERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.4.1 Intel® Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.5 VECTORIZATION ASSISTANT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.5.1 Intel® Advisor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6 CLUSTER TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1 Intel® Trace Analyzer and Collector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1.1 MPI Performance Snapshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.6.2 Intel® MPI Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.6.3 Intel® MPI Benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.7 INTEL® COMMUNITIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
APPENDIX B
RUNTIME PERFORMANCE OPTIMIZATION BLUEPRINT: INTEL® ARCHITECTURE
OPTIMIZATION WITH LARGE CODE PAGES
B.1 OVERVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1 TLBs and Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-1
B.1.2 Large Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-2
B.2 DIAGNOSING THE PROBLEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.2.1 ITLB Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-2
B.2.2 Measuring the ITLB Miss Stall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-5
CHAPTER 1
INTRODUCTION
NOTE
A public repository is available with open source code samples from select chapters of this manual.
These code samples are released under a 0-Clause BSD license. Intel provides additional code
samples and updates to the repository as the samples are created and verified.
Public repository: https://github.com/intel/optimization-manual.
Link to license: https://github.com/intel/optimization-manual/blob/master/COPYING.
Knights Mill microarchitecture: Intel® Xeon® Phi™ Processor 7215, 7285, 7295 Series
Cascade Lake microarchitecture: 2nd generation Intel® Xeon® Scalable processor family
Cooper Lake microarchitecture: Some of the 3rd generation Intel® Xeon® Scalable processor family
Gracemont microarchitecture: Intel N-Series processors
Granite Rapids microarchitecture: 6th generation Intel® Xeon® Scalable processor family
Meteor Lake microarchitecture: 1st generation Intel® Core™ Ultra processors
1.4.1.1 Volume 1
• Chapter 1: Introduction: Defines the purpose and outlines the contents of this manual.
• Chapter 2: Intel® 64 and IA-32 Processor Architectures: Describes the microarchitecture of recent Intel 64 and IA-
32 processor families, and other features relevant to software optimization.
• Chapter 3: General Optimization Guidelines: Describes general code development and optimization techniques
that apply to all applications designed to take advantage of the common features of current Intel processors.
• Chapter 4: Intel Atom® Processor Architecture: Describes the microarchitecture of recent Intel Atom processor
families, and other features relevant to software optimization.
• Chapter 5: Coding for SIMD Architectures: Describes techniques and concepts for using the SIMD integer and
SIMD floating-point instructions provided by the MMX™ technology, Streaming SIMD Extensions, Streaming
SIMD Extensions 2, Streaming SIMD Extensions 3, SSSE3, and SSE4.1.
• Chapter 6: Optimizing for SIMD Integer Applications: Provides optimization suggestions and common building
blocks for applications that use the 128-bit SIMD integer instructions.
• Chapter 7: Optimizing for SIMD Floating-point Applications: Provides optimization suggestions and common
building blocks for applications that use the single-precision and double-precision SIMD floating-point
instructions.
• Chapter 8: INT8 Deep Learning Inference: Describes INT8 as a data type for Deep Learning Inference on Intel
technology. The chapter covers both Intel® AVX-512 implementations and implementations using the new Intel® DL
Boost instructions.
• Chapter 9: Optimizing Cache Usage: Describes how to use the PREFETCH instruction and cache control
management instructions to optimize cache usage, and describes the deterministic cache parameters.
• Chapter 10: Introducing Sub-NUMA Clustering: Describes Sub-NUMA Clustering (SNC), a mode for improving
average latency from last level cache (LLC) to local memory.
• Chapter 11: Multicore and Intel® Hyper-Threading Technology: Describes guidelines and techniques for
optimizing multithreaded applications to achieve optimal performance scaling. Use these when targeting
multicore processors, processors supporting Hyper-Threading Technology, or multiprocessor (MP) systems.
• Chapter 12: Intel® Optane™ DC Persistent Memory: Provides optimization suggestions for applications that use
Intel® Optane™ DC Persistent Memory.
• Chapter 13: 64-Bit Mode Coding Guidelines: This chapter describes a set of additional coding guidelines for
application software written to run in 64-bit mode.
• Chapter 14: SSE4.2 and SIMD Programming for Text-Processing/Lexing/Parsing: Describes SIMD techniques of
using SSE4.2 along with other instruction extensions to improve text/string processing and lexing/parsing
applications.
• Chapter 15: Optimizations for Intel® AVX, FMA, and Intel® AVX2: Provides optimization suggestions and
common building blocks for applications that use Intel® Advanced Vector Extensions, FMA, and Intel® Advanced
Vector Extensions 2 (Intel® AVX2).
• Intel® Transactional Synchronization Extensions (Intel® TSX): Provides tuning recommendations for using lock
elision techniques with Intel TSX to optimize multi-threaded software with contended locks.
• Chapter 16: Power Optimization for Mobile Usages: Provides background on power saving techniques in mobile
processors and makes recommendations that developers can leverage to provide longer battery life.
• Chapter 17: Software Optimization for Intel® AVX-512 Instructions: Provides optimization suggestions and
common building blocks for applications that use Intel® Advanced Vector Extensions 512.
• Chapter 18: Intel® Advanced Vector Extensions 512-FP16 Instruction Set for Intel® Xeon® Processors: Describes
the addition of the FP16 ISA for Intel AVX-512 to handle IEEE 754-2019 compliant half-precision floating-point
operations.
• Chapter 19: Cryptography & Finite Field Arithmetic Enhancements: Describes the new instruction extensions
designated for acceleration of cryptography flows and finite field arithmetic.
• Chapter 20: Intel® Advanced Matrix Extensions (Intel® AMX): Describes best practices to optimally code to the
metal on Intel® Xeon® Processors based on Sapphire Rapids SP microarchitecture. It extends the public
documentation on Optimizing DL code with DL Boost instructions.
• Chapter 21: Intel® QuickAssist Technology (Intel® QAT): Describes software development guidelines for the
Intel® QuickAssist Technology (Intel® QAT) API. This API supports both the Cryptographic and Data Compression
services.
• Chapter 22: Using Performance Monitoring Events: Provides information on the Top-Down Analysis Method and
information on how to use performance events specific to the Intel Xeon processor 5500 series, processors based
on Sandy Bridge microarchitecture, and Intel Core Solo and Intel Core Duo processors.
• Appendix A: Application Performance Tools: Introduces tools for analyzing and enhancing application
performance without having to write assembly code.
• Appendix B: Intel Architecture Optimization with Large Code Pages: Provides information on how the
performance of runtimes can be improved by using large code pages.
1.5 GLOSSARY
Table 1-2 provides definitions of commonly used terms throughout this volume.
Denormal: A subset of denormalized numbers that fill the underflow gap around zero in floating-point arithmetic.
FP16: Half-precision 16-bit floating-point data format.
Intrinsic: A function that can be called from a high-level language, like C/C++, which gives direct access to the
underlying ISA. Intrinsics allow the programmer to bypass the compiler and directly specify that a particular
instruction be used.
ISA: Instruction Set Architecture1: a part of the abstract model of a computer, which generally defines how software
controls the CPU2.
NOTES:
1. See Intel’s Instruction Set Architecture landing page.
2. See Wikipedia.
3. See Chapter 5, “Coding for SIMD Architectures”
4. See Intel’s Instruction Set Extensions Technology Support landing page.
5. See Wikipedia for a deep dive.
Intel® Development Topics & Technologies landing page: A landing page devoted to everything from storage to
computer vision (CV).
Intel® Distribution of OpenVINO™ Toolkit landing page: The official source for the Intel® distribution of OpenVINO™,
an open source toolkit that simplifies deployment. Includes a section with documentation.
C2C - False Sharing Detection in Linux Perf: An introduction to perf c2c in Linux.
Developing Multi-Threaded Applications: A Platform Consistent Approach: The objective of the Multithreading
Consistency Guide is to provide guidelines for developing efficient multithreaded applications across Intel-based
symmetric multiprocessors (SMP) and/or systems with Intel® Hyper-Threading Technology (Intel® HT). (2005)
CHAPTER 2
INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
This chapter overviews features relevant to software optimization for current generations of Intel® 64 and IA-32
processors1. These features include:
• Microarchitectures that enable executing instructions with high throughput at high clock speeds, a high-speed
cache hierarchy, and a high-speed system bus.
• Intel® Hyper-Threading Technology2 (Intel® HT Technology) support.
• Intel 64 architecture on Intel 64 processors.
• Single Instruction Multiple Data (SIMD) instruction extensions: MMX™ technology, Streaming SIMD Extensions
(Intel® SSE), Streaming SIMD Extensions 2 (Intel® SSE2), Streaming SIMD Extensions 3 (Intel® SSE3), Supplemental
Streaming SIMD Extensions 3 (SSSE3), Intel® SSE4.1, and Intel® SSE4.2.
• Intel® Advanced Vector Extensions (Intel® AVX).
• Half-precision floating-point conversion and RDRAND.
• Fused Multiply Add Extensions.
• Intel® Advanced Vector Extensions 2 (Intel® AVX2).
• ADX and RDSEED.
• Intel® Advanced Vector Extensions 512 (Intel® AVX-512).
• Intel® Thread Director.
1. Intel® Atom® processors are covered in Chapter 4, “Intel Atom® Processor Architectures.”
2. Intel® HT Technology requires a computer system with an Intel processor supporting hyper-threading and an
Intel® HT Technology-enabled chipset, BIOS, and operating system. Performance varies depending on the hard-
ware and software used.
1. As specified in section 2.1.1 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A
Chapter 2, “Instruction Format”.
The two instructions are macro-fused into a single micro-operation that is cached in the μop cache. The macro-fused
operation occupies a single slot in the IDQ and a single slot during allocation, execution, and retirement.
Similarly, with the new LOAD+OP macro-fusion, a LOAD instruction can be macro-fused with a following OP instruction
to form a single load-op micro-operation, even in cases where the combined operation cannot be encoded as one x86
instruction.
For example:
NOTES:
1. Note that only sub reg, [mem] is encodable within x86 instruction set.
The Redwood Cove microarchitecture adds a new Array of Pointers (AOP) hardware prefetcher that detects
array-of-pointers reference patterns and prefetches them into the cache hierarchy. The AOP prefetcher treats the
data prefetched for a constant-stride load as a pointer and may issue prefetch requests to the memory addresses
corresponding to the pointer's value.
Successful detection of the pattern allows the AOP prefetcher to bring the data into the cache before the program
accesses it, avoiding costly cache misses.
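For illustration, a minimal C sketch of the access pattern the AOP prefetcher targets is shown below: a constant-stride
walk over an array of pointers followed by a dereference of each loaded pointer. The structure and function names are
hypothetical and not part of any Intel API.

#include <stddef.h>

/* Hypothetical node type; any pointed-to data is handled the same way. */
typedef struct node { double value; } node_t;

/* The loads of ptrs[i] form a constant-stride stream. The AOP prefetcher may
 * treat each loaded value as a pointer and prefetch the pointed-to line
 * before the program dereferences it. */
double sum_through_pointers(node_t *const *ptrs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ptrs[i]->value;    /* indirect access driven by the pointer array */
    return sum;
}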
To benefit most from the AOP Prefetching capabilities, we recommend the following:
1. Please see the Intel® DSA Specification and Intel® DSA User Guide.
• Dynamic performance and energy efficiency capabilities of P-cores and E-cores based on power/thermal limits.
• Idling hints when power and thermal are constrained.
Intel Thread Director was first introduced in desktop and mobile variants of the 12th generation Intel Core processor
based on Alder Lake performance hybrid architecture.
A processor containing P-cores and E-cores with different performance characteristics challenges the operating
system’s scheduler. Additionally, different software threads see different performance ratios between the P-cores and
E-cores. For example, the performance ratio between the P-cores and E-cores for highly vectorized floating-point
code is higher than that for scalar integer code. So, when the operating system must make an optimal scheduling
decision, it must be aware of the characteristics of the software threads that are candidates for scheduling. Suppose
there are insufficient P-cores and a mix of software threads with different characteristics. In that case, the operating
system should schedule those threads that benefit most from the P-cores onto those cores and schedule the others
on the E-cores.
Intel Thread Director provides the necessary hint to the operating system about the characteristics of the software
thread executing on each of the logical processors. The hint is dynamic and reflects the recent characteristics of the
thread, i.e., it may change over time based on the dynamic instruction mix of the thread. The processor also considers
microarchitecture factors to define the dynamic software thread characteristics.
Thread-specific hardware support is enumerated via the CPUID instruction and enabled by the operating system by
writing to configuration MSRs. The Intel Thread Director implementation on processors based on Alder Lake
performance hybrid architecture defines four thread classes:
1. Non-vectorized integer or floating-point code.
2. Integer or floating-point vectorized code, excluding Intel® Deep Learning Boost (Intel® DL Boost) code.
3. Intel DL Boost code.
4. Pause (spin-wait) dominated code.
The dynamic instruction mix need not match a class definition 100%; it only needs a large enough share of matching
instructions for the thread to be considered as belonging to that class. Also, dynamic microarchitectural metrics such
as consumed memory bandwidth or cache bandwidth may move software threads between classes. Example
pseudo-code sequences for the Intel Thread Director classes
available on processors based on Alder Lake performance hybrid architecture are provided in Examples 2-1 through
2-4.
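The C sketches below are illustrative only and are not the manual's Examples 2-1 through 2-4; they show the kinds of
loops that tend to fall into each class, assuming Intel AVX2/FMA and Intel AVX-512 VNNI support where the
corresponding intrinsics are used. All function names are hypothetical.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Class 1-like: non-vectorized integer code. */
int64_t scalar_sum(const int32_t *a, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Class 2-like: vectorized floating-point code (Intel AVX2 + FMA). */
void axpy_avx2(float *y, const float *x, float alpha, size_t n)
{
    __m256 va = _mm256_set1_ps(alpha);
    for (size_t i = 0; i + 8 <= n; i += 8) {          /* remainder handling omitted */
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 vx = _mm256_loadu_ps(x + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
    }
}

/* Class 3-like: Intel DL Boost (AVX-512 VNNI) INT8 dot-product accumulation. */
__m512i vnni_accumulate(__m512i acc, __m512i u8_vals, __m512i s8_vals)
{
    return _mm512_dpbusd_epi32(acc, u8_vals, s8_vals);
}

/* Class 4-like: PAUSE (spin-wait) dominated code. */
void spin_wait(volatile int *flag)
{
    while (*flag == 0)
        _mm_pause();
}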
Intel Thread Director also provides a table in system memory, only accessible to the operating system, that defines
the P-core vs. E-core performance ratio per class. This allows the operating system to pick and choose the correct
software thread for the correct logical processor.
In addition to the performance ratio between P-cores and E-cores, Intel Thread Director provides the energy
efficiency ratio between those cores. The operating system can use this information when it prefers energy savings
over maximum performance. For example, a background task such as indexing can be scheduled on the most energy-
efficient core since its performance is less critical.
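The enumeration step mentioned above can be sketched in user-space C as follows. The CPUID leaf 06H bit positions
used here (EAX[23] for Intel Thread Director support and ECX[15:8] for the number of classes) are assumptions to be
verified against the Intel SDM, and the configuration MSRs themselves remain accessible only to the operating
system.

#include <cpuid.h>    /* GCC/Clang helper; MSVC users would call __cpuidex instead */
#include <stdio.h>

int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid_count(0x06, 0, &eax, &ebx, &ecx, &edx))
        return 1;   /* leaf 06H not available */

    /* Assumed bit layout (verify against the SDM description of CPUID leaf 06H):
     *   EAX[23]   - Intel Thread Director supported.
     *   ECX[15:8] - number of Intel Thread Director classes. */
    int supported = (eax >> 23) & 1;
    unsigned int classes = (ecx >> 8) & 0xFF;

    printf("Intel Thread Director: %s, classes reported: %u\n",
           supported ? "supported" : "not supported", classes);
    return 0;
}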
Figure 2-2. Processor Core Pipeline Functionality of the Golden Cove Microarchitecture
The Golden Cove front end is depicted in Figure 2-3. The front end is built to feed the wider and deeper out-of-order
core:
• Legacy decode pipeline fetch bandwidth increased from 16 to 32 bytes/cycle.
• The number of decoders increased from four to six, allowing the decode of up to six instructions per cycle.
• The micro-op cache size increased, and its bandwidth increased to deliver up to eight micro-ops per cycle.
• Improved branch prediction.
Figure 2-3. Front End of the Golden Cove Microarchitecture
Table 2-3. Dispatch Port and Execution Stacks of the Golden Cove Microarchitecture
Port 0: INT ALU, LEA, INT Shift, Jump1; FMA, Vec ALU, Vec Shift, FP Div
Port 1: INT ALU, LEA, INT Mul, INT Div; FMA*, Fast Adder*, Vec ALU*, Vec Shift*, Shuffle*
Port 2: Load
Port 3: Load
Port 4: Store Data
Port 5: INT ALU, LEA, MUL Hi; FMA**, Fast Adder, Vec ALU, Shuffle
Port 6: INT ALU, LEA, INT Shift, Jump2
Ports 7, 8: Store Address
Port 9: Store Data
Port 10: INT ALU, LEA
Port 11: Load
NOTES:
1. “*” in this table indicates that these features are unavailable for 512-bit vectors.
2. “**” in this table indicates that these features are unavailable for 512-bit vectors in Client parts.
3. The Golden Cove microarchitecture implemented performance improvements requiring constraint of the micro-
ops which use *H partial registers (i.e. AH, BH, CH, DH). See Section 3.5.2.3 for more details.
Table 2-4 lists execution units and common representative instructions that rely on these units.
Throughput improvements across the Intel® SSE, Intel AVX, and general-purpose instruction sets are related to the
number of units for the respective operations, and the varieties of instructions that execute using a particular unit.
Table 2-4. Golden Cove Microarchitecture Execution Units and Representative Instructions1
ALU (5 units; see note 2): add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*,
(v)movup*
SHFT (2 units; see note 3): sal, shl, rol, adc, sarx, adcx, adox, etc.
Slow Int (1 unit): mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM (2 units): andn, bextr, blsi, blsmsk, bzhi, etc.
Vec_Shft (2x256-bit, 1x512-bit): (v)psllv*, (v)psrlv*, vector shift count in imm8
VEC Add (in VEC FMA) (2x256-bit, 1x512-bit): (v)add*, (v)cmp*, (v)max*, (v)min*, (v)sub*, (v)padds*, (v)paddus*,
(v)psign, (v)pabs, (v)pavgb, (v)pcmpeq*, (v)pmax, (v)cvtps2dq, (v)cvtdq2ps, (v)cvtsd2si, (v)cvtss2si
Shuffle (2x256-bit, 1x512-bit): (v)shufp*, vperm*, (v)pack*, (v)unpck*, (v)punpck*, (v)pshuf*, (v)pslldq, (v)alignr,
(v)pmovzx*, vbroadcast*, (v)psrldq, (v)pblendw (new cross-lane shuffle on both ports)
Vec Mul/FMA (2x256-bit, (1 or 2)x512-bit): (v)mul*, (v)pmul*, (v)pmadd*
SIMD Misc (1 unit): STTNI, (v)pclmulqdq, (v)psadw, vector shift count in xmm
NOTES:
1. Execution unit mapping to MMX instructions are not covered in this table. See Section 15.16.5 on MMX instruc-
tion throughput remedy.
2. The Golden Cove microarchitecture implemented performance improvements requiring constraint of the micro-
ops which use *H partial registers (i.e. AH, BH, CH, DH). See Section 3.5.2.3 for more details.
3. Ibid.
Table 2-5 describes bypass delay in cycles between producer and consumer operations.
SIMD/0,1/1 0 1 1 1 0 0 0
FMA/0,1/4 1 0 1 0 0 0 0
MUL/0,1/4 1 0 1 0 0 0 0
Fast Adder/0,1/3 1 0 1 -1 0 0 0
SIMD/5/1,3 0 1 1 1 0 0 0
SHUF/1,5/1,3 0 0 1 0 0 0 0
V2I/0/3 0 0 1 0 0 0 0
I2V/5/1 0 1 1 0 0 0 0
The attributes relevant to the producer/consumer micro-ops for bypass are a triplet of abbreviation/one or more port
number/latency cycle of the μop. For example:
• “SIMD/0,1/1” applies to a 1-cycle vector SIMD μop dispatched to either port 0 or port 1.
• “SIMD/5/1,3” applies to either a 1-cycle or 3-cycle non-shuffle μop dispatched to port 5.
• “V2I/0/3” applies to a three-cycle vector-to-integer μop dispatched to port 0.
Recommendation: Use dependency breaking zero idioms on the destination register before the affected instructions
to avoid potential slowdown from the false dependency.
Figure 2-4. Processor Core Pipeline Functionality of the Ice Lake Client Microarchitecture1
NOTES:
1. “*” in the figure above indicates these features are unavailable for 512-bit vectors.
2. “INT” represents GPR scalar instructions.
3. “VEC” represents floating-point and integer vector instructions.
4. “MULHi” produces the upper 64 bits of the result of an iMul operation that multiplies two 64-bit registers and
places the result into two 64-bits registers.
5. The “Shuffle” on port 1 is new, and supports only in-lane shuffles that operate within the same 128-bit sub-vector.
6. The “IDIV” unit on port 1 is new, and performs integer divide operations at a reduced latency.
7. The Golden Cove microarchitecture implemented performance improvements requiring constraint of the micro-
ops which use *H partial registers (i.e. AH, BH, CH, DH). See Section 3.5.2.3 for more details.
Table 2-6 summarizes the OOO engine's capability to dispatch different types of operations to ports.
Table 2-6. Dispatch Port and Execution Stacks of the Ice Lake Client Microarchitecture
Port 0: INT ALU, LEA, INT Shift, Jump1; FMA, Vec ALU, Vec Shift, FP Div
Port 1: INT ALU, LEA, INT Mul, INT Div; FMA*, Vec ALU*, Vec Shift*, Vec Shuffle*
Port 2: Load
Port 3: Load
Port 4: Store Data
Port 5: INT ALU, LEA, MUL Hi; Vec ALU, Vec Shuffle
Port 6: INT ALU, LEA, INT Shift, Jump2
Port 7: Store Address
Port 8: Store Address
Port 9: Store Data
NOTES:
1. “*” in this table indicates these features are unavailable for 512-bit vectors.
Table 2-7 lists execution units and common representative instructions that rely on these units.
Throughput improvements across the Intel SSE, Intel AVX, and general-purpose instruction sets are related to the
number of units for the respective operations, and the varieties of instructions that execute using a particular unit.
Table 2-7. Ice Lake Client Microarchitecture Execution Units and Representative Instructions1
ALU (4 units): add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*
Slow Int (1 unit): mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM (2 units): andn, bextr, blsi, blsmsk, bzhi, etc.
SIMD Misc (1 unit): STTNI, (v)pclmulqdq, (v)psadw, vector shift count in xmm
NOTES:
1. Execution unit mapping to MMX instructions are not covered in this table. See Section 15.16.5 on MMX instruc-
tion throughput remedy.
Table 2-8 describes bypass delay in cycles between producer and consumer operations.
SIMD/0,1/1 0 1 1 0 0 0 NA
FMA/0,1/4 1 0 1 0 0 0 NA
VIMUL/0,1/4 1 0 1 0 0 0 NA
SIMD/5/1,3 0 1 1 0 0 0 NA
SHUF/5/1,3 0 0 1 0 0 0 NA
V2I/0/3 0 0 1 0 0 0 NA
I2V/5/1 0 1 1 0 0 0 NA
The attributes that are relevant to the producer/consumer micro-ops for bypass are a triplet of abbreviation/one or
more port number/latency cycle of the μop. For example:
• “SIMD/0,1/1” applies to 1-cycle vector SIMD μop dispatched to either port 0 or port 1.
• “SIMD/5/1,3” applies to either a 1-cycle or 3-cycle non-shuffle μop dispatched to port 5.
• “V2I/0/3” applies to a 3-cycle vector-to-integer μop dispatched to port 0.
• “I2V/5/1” applies to a 1-cycle integer-to-vector μop to port 5.
NOTES:
1. Software-visible latency/bandwidth will vary depending on access patterns and other factors.
2. This number depends on core count.
The TLB hierarchy consists of a dedicated first-level TLB for the instruction cache, a first-level TLB for the L1 data
cache, a shared L2 TLB for 4K and 4MB pages, and a dedicated L2 TLB for 1GB pages.
Table 2-10. TLB Parameters of the Ice Lake Client Microarchitecture (Contd.)
Second Level: shared for all page sizes; 2048 entries (ST)1; 2048 entries, competitively shared (per-thread entries,
MT); 16-way associative
NOTES:
1. 4K pages can use all 2048 entries. 2/4MB pages can use 1024 entries (in 8 ways), sharing them with 4K pages.
1GB pages can use the other 1024 entries (in 8 ways), also sharing them with 4K pages.
Paired Stores
Ice Lake Client microarchitecture includes two store pipelines in the core, with the following features:
• Two dedicated AGU for LDs on ports 2 and 3.
• Two dedicated AGU for STAs on ports 7 and 8.
• Two fully featured STA pipelines.
• Two 256-bit wide STD pipelines (Intel AVX-512 store data takes two cycles to write).
• Second senior store pipeline to the DCU via store merging.
Ice Lake Client microarchitecture can write two senior stores to the cache in a single cycle if these two stores can be
paired together. That is:
• The stores must be to the same cache line.
• Both stores are of the same memory type, WB or USWC.
• Neither store crosses a cache line or page boundary.
To maximize performance from the second store port try to:
• Align store operations whenever possible.
• Place consecutive stores in the same cache line (not necessarily as adjacent instructions).
As seen in Example 2-6, it is important to take into consideration all stores, explicit or not.
In some cases it is possible to rearrange the code to achieve store pairing. Example 2-7 provides details.
Slow Version Not Enabling PFSP:
Loop:
 mov r10, [rsi+r8*8]
 inc qword [rdi+r10*8]
 mov r11, [rsi+r8*8]
 inc r8
 inc qword [rdi+r11*8]
 …
 jmp Loop

Enabling FSFP Using LEA Operation:
Loop:
 mov r10, [rsi+r8*8]
 lea r12, [rdi+r10*8] ; using LEA to avoid an index register for the inc below
 inc qword [r12]
 mov r11, [rsi+r8*8]
 inc r8
 lea r13, [rdi+r11*8] ; another similar case
 inc qword [r13]
 ….
 jmp Loop
— New performance metrics for built-in support for Level 1 Top-Down method (% of Issue slots that are front-
end bound, back-end bound, bad speculation, retiring) while leaving the 8 general purpose counters free for
software use.
Figure 2-5. Processor Core Pipeline Functionality of the Skylake Server Microarchitecture
Parameter | Broadwell Microarchitecture | Skylake Server Microarchitecture
Associativity [ways] | 8 | 8
L2 Mid-level Cache (MLC):
 Latency [cycles] | 12 | 14
 Max bandwidth [bytes/cycles] | 32 | 64
 Sustained bandwidth [bytes/cycles] | 25 | 52
 Associativity [ways] | 8 | 16
NOTES:
1. Some Skylake Server parts have some cores disabled and hence have more than 1.375 MB per core of L3 cache.
The figure below shows how Skylake server microarchitecture shifts the memory balance from shared-distributed
with high latency, to private-local with low latency.
Figure 2-6. Broadwell Microarchitecture and Skylake Server Microarchitecture Cache Structures
The potential performance benefit from the cache changes is high, but software will need to adapt its memory tiling
strategy to be optimal for the new cache sizes.
Recommendation: Rebalance application shared and private data sizes to match the smaller,
non-inclusive L3 cache, and larger L2 cache.
The choice of cache blocking should be based on the application's bandwidth requirements and therefore changes from one application to another. Having four times the L2 cache size and twice the L2 cache bandwidth compared to the previous
generation Broadwell microarchitecture enables some applications to block to L2 instead of L1 and thereby improves
performance.
Recommendation: Consider blocking to L2 on Skylake Server microarchitecture if L2 can sustain the application’s
bandwidth requirements.
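As a rough illustration of blocking to L2, the sketch below tiles a matrix-vector style loop nest so that the reused slice of the x[] array stays within an L2 of about 1 MB. The function, array names, and the BLOCK constant are illustrative assumptions of this sketch, not values taken from this manual; the right tile size should be found by measurement.

#include <stddef.h>

/* Illustrative tile size: 64K doubles = 512 KB, which fits comfortably in a
   ~1 MB L2 but not in a 32 KB L1. Blocking to L1 instead would require a
   much smaller tile (e.g., 4K doubles). Tune by measurement. */
#define BLOCK (64 * 1024)

/* y[i] += sum_j a[i*n+j] * x[j], traversed in j-tiles so that the slice of
   x[] reused by every row stays resident in L2 rather than streaming from
   L3 or memory. */
void matvec_l2_blocked(size_t n, const double *a, const double *x, double *y)
{
    for (size_t jj = 0; jj < n; jj += BLOCK) {
        size_t jend = (jj + BLOCK < n) ? (jj + BLOCK) : n;
        for (size_t i = 0; i < n; i++) {
            double sum = y[i];
            for (size_t j = jj; j < jend; j++)
                sum += a[i * n + j] * x[j];
            y[i] = sum;
        }
    }
}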
The change from inclusive last level cache to non-inclusive means that the capacity of mid-level and last level cache
can now be added together. Programs that determine cache capacity per core at run time should now use a
combination of mid-level cache size and last level cache size per core to estimate the effective cache size per core.
Using just the last level cache size per core may result in non-optimal use of available on-chip cache; see Section 2.6.2
for details.
Recommendation: In case of no data sharing, applications should consider cache capacity per core as L2 and L3 cache
sizes and not only L3 cache size.
Table 2-12 includes information about the maximum Intel® Turbo Boost technology core frequency for each type of
instruction executed. The maximum frequency (P0n) is an array of frequencies which depend on the number of cores
within the category. The more cores belonging to a category at any given time, the lower the maximum frequency.
Table 2-12. Maximum Intel® Turbo Boost Technology Core Frequency Levels
Level | Category | Max Frequency Level (P0n) | Instruction Types
0 | Intel® AVX2 light instructions | Highest | Scalar, AVX128, SSE, Intel® AVX2 w/o FP or INT MUL/FMA
1 | Intel® AVX2 heavy instructions / Intel® AVX-512 light instructions | Medium | Intel® AVX2 FP + INT MUL/FMA, Intel® AVX-512 w/o FP or INT MUL/FMA
2 | Intel® AVX-512 heavy instructions | Lowest | Intel® AVX-512 FP + INT MUL/FMA
For per SKU max frequency details (reference figure 1-15), refer to the Intel® Xeon® Scalable Processor Family
Technical Resources page.
Figure 2-7 is an example for core frequency range in a given system where each core frequency is determined
independently based on the demand of the workload.
Figure 2-7. Core frequency ranges per workload category (Non-AVX, AVX2, AVX512) across cores; the frequency levels shown include P0n, P0n-AVX2, and P1-AVX-512.
The following performance monitoring events can be used to determine how many cycles were spent in each of the
three frequency levels.
• CORE_POWER.LVL0_TURBO_LICENSE: Core cycles where the core was running in a manner where the maximum
frequency was P0n.
• CORE_POWER.LVL1_TURBO_LICENSE: Core cycles where the core was running in a manner where the maximum
frequency was P0n-AVX2.
• CORE_POWER.LVL2_TURBO_LICENSE: Core cycles where the core was running in a manner where the maximum
frequency was P0n-AVX-512.
When the core requests a higher license level than its current one, it takes the PCU up to 500 micro-seconds to grant
the new license. Until then the core operates at a lower peak capability. During this time period the PCU evaluates
how many cores are executing at the new license level and adjusts their frequency as necessary, potentially lowering
the frequency. Cores that execute at other license levels are not affected.
A timer of approximately 2ms is applied before going back to a higher frequency level. Any condition that would have
requested a new license resets the timer.
NOTES
A license transition request may occur when executing instructions on a mis-speculated path.
A large enough mix of Intel AVX-512 light instructions and Intel AVX2 heavy instructions drives the
core to request License 2, despite the fact that they usually map to License 1. The same is true for
Intel AVX2 light instructions and Intel SSE heavy instructions that may drive the core to License 1
rather than License 0. For example, the Intel® Xeon® Platinum 8180 processor moves from license 1
to license 2 when executing a mix of 110 Intel AVX-512 light instructions and 20 256-bit heavy
instructions over a window of 65 cycles.
Some workloads do not cause the processor to reach its maximum frequency as these workloads are bound by other
factors. For example, the LINPACK benchmark is power limited and does not reach the processor's maximum
frequency. The following graph shows how frequency degrades as vector width grows, but, despite the frequency
drop, performance improves. The data for this graph was collected on an Intel Xeon Platinum 8180 processor.
(Figure: core frequency and relative performance for the SSE4.2, AVX, AVX2, and AVX512 builds of the workload; frequency drops as vector width grows while performance improves.)
Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain
performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. For example,
maximum frequency bound Deep Learning workloads that target Intel AVX-512 heavy instructions at a very high
percentage can gain 1.3x-1.5x performance improvement vs. the same workload built to target Intel AVX2 (both
operating on Skylake Server microarchitecture).
It is not always easy to predict whether a program's performance will improve from building it to target Intel AVX-512
instructions. Programs that enjoy high performance gains from the use of xmm or ymm registers may expect
performance improvement by moving to the use of zmm registers. However, some programs that use zmm registers
may not gain as much, or may even lose performance. It is recommended to try multiple build options and measure
the performance of the program.
Recommendation: To identify the optimal compiler options to use, build the application with each of the following
set of options and choose the set that provides the best performance.
• -xCORE-AVX2 -mtune=skylake-avx512 (Linux* and macOS*)
/QxCORE-AVX2 /tune=skylake-avx512 (Windows*)
• -xCORE-AVX512 -qopt-zmm-usage=low (Linux* and macOS*)
/QxCORE-AVX512 /Qopt-zmm-usage:low (Windows*)
• -xCORE-AVX512 -qopt-zmm-usage=high (Linux* and macOS*)
/QxCORE-AVX512 /Qopt-zmm-usage:high (Windows*)
See Section 17.26 for more information about these options.
The GCC Compiler has the option -mprefer-vector-width=none|128|256|512 to control vector width preference.
While -march=skylake-avx512 is designed to provide the best performance for the Skylake Server microarchitecture
some programs can benefit from different vector width preferences. To identify the optimal compiler options to use,
build the application with each of the following set of options and choose the set that provides the best performance.
-mprefer-vector-width=256 is the default for skylake-avx512.
• -march=skylake -mtune=skylake-avx512
• -march=skylake-avx512
• -march=skylake-avx512 -mprefer-vector-width=512
Clang/LLVM is currently implementing the option -mprefer-vector-width=none|128|256|512, similar to GCC. To
identify the optimal compiler options to use, build the application with each of the following set of options and
choose the set that provides the best performance.
• -march=skylake -mtune=skylake-avx512
• -march=skylake-avx512 (plus -mprefer-vector-width=256, if available)
• -march=skylake-avx512 (plus -mprefer-vector-width=512, if available)
Figure 2-9. CPU Core Pipeline Functionality of the Skylake Client Microarchitecture
Table 2-13. Dispatch Port and Execution Stacks of the Skylake Client Microarchitecture
Port 0: ALU, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, DIV, Branch2
Port 1: ALU, Fast LEA, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, Slow Int, Slow LEA
Port 2, 3: LD, STA
Port 4: STD
Port 5: ALU, Fast LEA, Vec ALU, Vec Shuffle
Port 6: ALU, Shft, Branch1
Port 7: STA
Table 2-14 lists execution units and common representative instructions that rely on these units. Throughput
improvements across the SSE, AVX and general-purpose instruction sets are related to the number of units for the
respective operations, and the varieties of instructions that execute using a particular unit.
Table 2-14. Skylake Client Microarchitecture Execution Units and Representative Instructions1
Execution Unit | # of Units | Instructions
ALU | 4 | add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*
SHFT | 2 | sal, shl, rol, adc, sarx, adcx, adox, etc.
Slow Int | 1 | mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM | 2 | andn, bextr, blsi, blsmsk, bzhi, etc.
NOTES:
1. Execution unit mapping to MMX instructions is not covered in this table. See Section 15.16.5 on the MMX instruction throughput remedy.
A significant portion of the Intel SSE, Intel AVX and general-purpose instructions also have latency improvements.
Appendix C lists the specific details. Software-visible latency exposure of an instruction sometimes may include
additional contributions that depend on the relationship between micro-ops flows of the producer instruction and
the micro-op flows of the ensuing consumer instruction. For example, a two-μop instruction like VPMULLD may
experience two cumulative bypass delays of 1 cycle each from each of the two micro-ops of VPMULLD.
Table 2-15 describes the bypass delay in cycles between a producer μop and the consumer μop. The
left-most column lists a variety of situations characteristic of the producer micro-op. The top row lists a variety of
situations characteristic of the consumer micro-op.
Table 2-15. Bypass Delay Between Producer and Consumer Micro-ops
Producer \ Consumer | SIMD/0,1/1 | FMA/0,1/4 | VIMUL/0,1/4 | SIMD/5/1,3 | SHUF/5/1,3 | V2I/0/3 | I2V/5/1
SIMD/0,1/1 | 0 | 1 | 1 | 0 | 0 | 0 | NA
FMA/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
VIMUL/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
SIMD/5/1,3 | 0 | 1 | 1 | 0 | 0 | 0 | NA
SHUF/5/1,3 | 0 | 0 | 1 | 0 | 0 | 0 | NA
V2I/0/3 | NA | NA | NA | NA | NA | NA | NA
I2V/5/1 | 0 | 0 | 1 | 0 | 0 | 0 | NA
The attributes that are relevant to the producer/consumer micro-ops for bypass are a triplet of abbreviation/one or
more port number/latency cycle of the μop. For example:
• “SIMD/0,1/1” applies to 1-cycle vector SIMD μop dispatched to either port 0 or port 1.
• “VIMUL/0,1/4” applies to 4-cycle vector integer multiply μop dispatched to either port 0 or port 1.
• “SIMD/5/1,3” applies to either 1-cycle or 3-cycle non-shuffle μop dispatched to port 5.
Level | Capacity / Associativity | Line Size [bytes] | Fastest Latency [cycles] | Peak Bandwidth [bytes/cycle] | Sustained Bandwidth [bytes/cycle] | Update Policy
First Level Data | 32 KB / 8 | 64 | 4 cycle1 | 96 (2x32B Load + 1*32B Store) | ~81 | Writeback
Instruction | 32 KB / 8 | 64 | N/A | N/A | N/A | N/A
NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.
The TLB hierarchy consists of dedicated level one TLB for instruction cache, TLB for L1D, plus unified TLB for L2. The
partition column of Table 2-17 indicates the resource sharing policy when Hyper-Threading Technology is active.
The following is an example of how to use the PAUSE instruction with a dynamic loop iteration count.
Notice that in the Skylake Client microarchitecture the RDTSC instruction counts at the machine's guaranteed P1
frequency independently of the current processor clock (see the INVARIANT TSC property), and therefore, when
running in Intel® Turbo-Boost-enabled mode, the delay will remain constant, but the number of instructions that
could have been executed will change.
Use Poll Delay function in your lock to wait a given amount of guaranteed P1 frequency cycles, specified in the
“clocks” variable.
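A minimal sketch of such a poll-delay routine is shown below. It assumes the __rdtsc() and _mm_pause() compiler intrinsics (declared in x86intrin.h on GCC/Clang) and is an illustration, not code taken from this manual.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_pause() */

/* Spin for approximately 'clocks' cycles of the invariant TSC, which counts
   at the guaranteed P1 frequency regardless of the current core frequency. */
static void poll_delay(uint64_t clocks)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < clocks)
        _mm_pause();     /* reduce power and pipeline pressure while spinning */
}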
For contended spinlocks of the form shown in the baseline example below, we recommend an exponential back off
when the lock is found to be busy, as shown in the improved example, to avoid significant performance degradation
that can be caused by conflicts between threads in the machine. This is more important as we increase the number of
threads in the machine and make changes to the architecture that might aggravate these conflict conditions. In multi-
socket Intel server processors with shared memory, conflicts across threads take much longer to resolve as the
number of threads contending for the same lock increases. The exponential back off is designed to avoid these
conflicts between the threads, thus avoiding the potential performance degradation. Note that in the example below, the number of PAUSE instructions is increased by a factor of 2 until some MAX_BACKOFF is reached, which is subject to tuning.
/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
while (lock == busy)
{
for (int i=mask; i; --i){
__asm__ ("pause");
}
mask = mask < max ? mask<<1 : max;
}
}
These instructions ensure the processor completes all modifications to flags, registers, and memory by previous
instructions and drains all buffered writes to memory before the next instruction is fetched and executed. The non-
privileged serialization instructions are:
• SERIALIZE
• CPUID
• IRET
• RSM
The SERIALIZE instruction was introduced in the Sapphire Rapids and Alder Lake platforms as a purpose-specific
method of providing serialization to supersede the current typical usages such as CPUID.(EAX=0H). For example,
CPUID usage for serialization has issues such that registers [EAX, EBX, ECX, EDX] are modified and, when executed on
top of a VMM, will always incur the latency of a VM exit/VM entry round trip. SERIALIZE does not modify registers,
arithmetic flags, or memory and does not incur a VM exit. The SERIALIZE instruction is enumerated via
CPUID.(EAX=07H,ECX=0):EDX[14]=1, and software must verify support before usage.
Software that uses CPUID for serialization should use leaf 0 [CPUID.(EAX=0H)]. CPUID leaves have variable performance, with leaf 0 providing the lowest latency when executed natively.
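A hedged sketch of this enumeration check is shown below. It uses the GCC/Clang <cpuid.h> helper __get_cpuid_count() and falls back to a leaf-0 CPUID when SERIALIZE is not supported; the helper choice and the fallback policy are assumptions of this sketch, not requirements of the architecture.

#include <cpuid.h>        /* __get_cpuid_count(), GCC/Clang */

static int has_serialize(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* SERIALIZE is enumerated by CPUID.(EAX=07H,ECX=0):EDX[14]. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (edx >> 14) & 1;
    return 0;
}

static void serialize_execution(void)
{
    if (has_serialize()) {
        /* SERIALIZE opcode bytes (0F 01 E8); newer compilers also expose _serialize(). */
        __asm__ volatile (".byte 0x0f, 0x01, 0xe8" ::: "memory");
    } else {
        unsigned int eax, ebx, ecx, edx;
        __get_cpuid_count(0, 0, &eax, &ebx, &ecx, &edx);  /* leaf-0 CPUID fallback */
    }
}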
Replicated Resources
The architectural state is replicated for each logical processor. The architecture state consists of registers that are used
by the operating system and application code to control program behavior and store data for computations. This state
includes the eight general-purpose registers, the control registers, machine state registers, debug registers, and
others. There are a few exceptions, most notably the memory type range registers (MTRRs) and the performance
monitoring resources. For a complete list of the architecture state and exceptions, see the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volumes 3A, 3B, 3C, & 3D.
Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track
execution and state changes of the two logical processors. The return stack predictor is replicated to improve branch
prediction of return instructions.
In addition, a few buffers (for example, the two-entry instruction streaming buffers) were replicated to reduce
complexity.
Partitioned Resources
Several buffers are shared by limiting the use of each logical processor to half the entries. These are referred to as
partitioned resources. Reasons for this partitioning include:
• Operational fairness.
• Permitting the ability to allow operations from one logical processor to bypass operations of the other logical
processor that may have stalled.
For example: a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from
making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from
blocking forward progress.
In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop
queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages
instructions for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory
ordering for each logical processor and detect memory ordering violations.
Shared Resources
Most resources in a physical processor are fully shared to improve the dynamic utilization of the resource, including
caches and all the execution units. Some shared resources which are linearly addressed, like the DTLB, include a
logical processor ID bit to distinguish whether the entry belongs to one logical processor or the other.
2.8.2.4 Retirement
The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the
instruction in program order for each logical processor by alternating between the two logical processors. If one
logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical
processor.
Once stores have retired, the processor must write the store data into the level-one data cache. Selection logic
alternates between the two logical processors to commit store data to the cache.
Earlier processors extended the SIMD computation model with the introduction of the Streaming SIMD Extensions
(SSE). SSE allows SIMD computations to be performed on operands that contain four packed single-precision floating-
point data elements. The operands can be in memory or in a set of eight 128-bit XMM registers (see Figure 2-12). SSE
also extended SIMD computational capability by adding additional 64-bit MMX instructions.
Figure 2-11 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4, and Y1, Y2,
Y3, and Y4) are operated on in parallel, with the same operation being performed on each corresponding pair of data
elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as
a set of four packed data elements.
X4 X3 X2 X1
Y4 Y3 Y2 Y1
OP OP OP OP
X4 op Y4 X3 op Y3 X2 op Y2 X1 op Y1
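As a concrete rendering of the four-wide operation sketched above, the fragment below adds two sets of four packed single-precision elements with one SSE instruction. It is a minimal sketch using the standard <xmmintrin.h> intrinsics; the function and variable names are illustrative.

#include <xmmintrin.h>   /* SSE intrinsics */

void add4(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);        /* X4..X1 */
    __m128 vy = _mm_loadu_ps(y);        /* Y4..Y1 */
    __m128 vr = _mm_add_ps(vx, vy);     /* one ADDPS performs all four adds */
    _mm_storeu_ps(out, vr);             /* X4+Y4 .. X1+Y1 */
}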
The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3), and the Intel Xeon processor 5100 series introduced Supplemental Streaming SIMD Extensions 3 (SSSE3).
SSE2 works with operands in either memory or in the XMM registers. The technology extends SIMD computations to
process packed double-precision floating-point data elements and 128-bit packed integers. There are 144
instructions in SSE2 that operate on two packed double-precision floating-point data elements or on 16 packed byte,
8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate application performance in specific
areas. These include video processing, complex arithmetic, and thread synchronization. SSE3 complements SSE and
SSE2 with instructions that process SIMD data asymmetrically, facilitate horizontal computation, and help avoid
loading cache line splits. See Figure 2-12.
SSSE3 provides additional enhancement for SIMD computation with 32 instructions on digital video and signal
processing.
SSE4.1, SSE4.2 and AESNI are additional SIMD extensions that provide acceleration for applications in media
processing, text/lexical processing, and block encryption/decryption.
The SIMD extensions operate the same way in Intel 64 architecture as in IA-32 architecture, with the following
enhancements:
• 128-bit SIMD instructions referencing XMM register can access 16 XMM registers in 64-bit mode.
• Instructions that reference 32-bit general purpose registers can access 16 general purpose registers in 64-bit
mode.
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and
applications that have the following characteristics:
• Inherently parallel.
• Recurring memory access patterns.
• Localized recurring operations performed on the data.
• Data-independent control flow.
2.8.5 SSE4.1
SSE4.1 introduces 47 new instructions to accelerate video, imaging and 3D applications. SSE4.1 also improves
compiler vectorization and significantly increases support for packed dword computation. These include:
• Two instructions perform packed dword multiplies.
• Two instructions perform floating-point dot products with input/output selects.
• One instruction provides a streaming hint for WC loads.
• Six instructions simplify packed blending.
• Eight instructions expand support for packed integer MIN/MAX.
• Four instructions support floating-point round with selectable rounding mode and precision exception override.
• Seven instructions improve data insertion and extraction from XMM registers.
• Twelve instructions improve packed integer format conversions (sign and zero extensions).
• One instruction improves SAD (sum absolute difference) generation for small block sizes.
• One instruction aids horizontal searching operations of word integers.
• One instruction improves masked comparisons.
• One instruction adds qword packed equality comparisons.
• One instruction adds dword packing with unsigned saturation.
2.8.5.1 SSE4.2
SSE4.2 introduces 7 new instructions. These include:
• A 128-bit SIMD integer instruction for comparing 64-bit integer data elements.
• Four string/text processing instructions providing a rich set of primitives; these primitives can accelerate:
— Basic and advanced string library functions from strlen, strcmp, to strcspn.
— Delimiter processing, token extraction for lexing of text streams.
— Parser, schema validation including XML processing.
• A general-purpose instruction (CRC32) for accelerating cyclic redundancy checksum signature calculations.
• A general-purpose instruction (POPCNT) for calculating the bit count population of integer numbers; a short usage sketch of these last two instructions follows this list.
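The sketch below exercises the CRC32 and POPCNT instructions through the commonly available <nmmintrin.h> intrinsics; the buffer handling and the initial CRC seed are illustrative assumptions, not part of the instruction definitions.

#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   /* SSE4.2: _mm_crc32_u8(), _mm_popcnt_u32() */

uint32_t crc32c_bytes(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                 /* illustrative seed */
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]);        /* CRC32 instruction */
    return crc ^ 0xFFFFFFFFu;
}

int count_set_bits(uint32_t v)
{
    return _mm_popcnt_u32(v);                   /* POPCNT instruction */
}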
The PCLMULQDQ instruction accelerates general-purpose block encryption, which can perform carry-less
multiplication for two binary numbers up to 64-bit wide.
Typically, algorithms based on the AES standard transform block data over multiple iterations via several primitives.
AES encryption involves processing 128-bit input data (plain text) through a finite number of iterative operation,
referred to as “AES round”, into a 128-bit encrypted block (ciphertext). Decryption follows the reverse direction of
iterative operation using the “equivalent inverse cipher” instead of the “inverse cipher”.
The cryptographic processing at each round involves two input data, one is the “state”, the other is the “round key”.
Each round uses a different “round key”. The round keys are derived from the cipher key using a “key schedule”
algorithm. The “key schedule” algorithm is independent of the data processing of encryption/decryption, and can be
carried out independently from the encryption/decryption phase.
The AES extensions provide two primitives to accelerate AES rounds on encryption, two primitives for AES rounds on
decryption using the equivalent inverse cipher, and two instructions to support the AES key expansion procedure.
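A minimal sketch of one AES-128 encryption flow built from these primitives is shown below. It assumes the round keys have already been expanded into the rk[] array (the key-schedule step is omitted) and uses the standard <wmmintrin.h> intrinsics; it is an illustration, not a complete implementation.

#include <wmmintrin.h>   /* AES-NI intrinsics */

/* Encrypt one 128-bit block with AES-128, given 11 expanded round keys. */
__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);          /* initial AddRoundKey */
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);   /* one full AES round per instruction */
    return _mm_aesenclast_si128(block, rk[10]);   /* final round (no MixColumns) */
}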
2.8.5.5 RDRAND
The RDRAND instruction retrieves a random number supplied by a cryptographically secure, deterministic random bit generator (DRBG). The DRBG is designed to meet the NIST SP 800-90A standard.
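Below is a minimal retry-loop sketch using the _rdrand32_step() intrinsic from <immintrin.h>. The retry limit of 10 is a common convention assumed here, not a requirement stated in this manual.

#include <stdint.h>
#include <immintrin.h>   /* _rdrand32_step() */

/* Returns 1 and writes a random value on success, or 0 if RDRAND kept
   reporting "no data available" (CF=0) after a few retries. */
int get_random_u32(uint32_t *out)
{
    for (int tries = 0; tries < 10; tries++) {
        unsigned int value;
        if (_rdrand32_step(&value)) {   /* CF=1: valid random data returned */
            *out = value;
            return 1;
        }
    }
    return 0;
}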
multiply-subtract operations. FMA extensions provide 36 256-bit floating-point instructions to perform computation
on 256-bit vectors and additional 128-bit and scalar FMA instructions.
2.8.5.9 RDSEED
The RDSEED instruction retrieves a random number supplied by a cryptographically secure, enhanced non-deterministic random bit generator (Enhanced NRBG). The Enhanced NRBG is designed to meet the NIST SP 800-90B and NIST SP 800-90C standards.
CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
This chapter discusses general optimization techniques that can improve the performance of applications running on
Intel® processors. These techniques take advantage of microarchitectural features described in Chapter 2, “Intel® 64
and IA-32 Processor Architectures.” Optimization guidelines focusing on Intel multi-core processors, Hyper-Threading
Technology, and 64-bit mode applications are discussed in Chapter 11, “Multicore and Intel® Hyper-Threading
Technology (Intel® HT),” and Chapter 13, “64-bit Mode Coding Guidelines.”
Practices that optimize performance focus on three areas:
• Tools and techniques for code generation.
• Analysis of the performance characteristics of the workload and its interaction with microarchitectural sub-
systems.
• Tuning code to the target microarchitecture (or families of microarchitecture) to improve performance.
Some hints on using tools are summarized first to simplify the first two tasks. The rest of the chapter will focus on
recommendations for code generation or code tuning to the target microarchitectures.
This chapter explains optimization techniques for the Intel® C++ Compiler, the Intel® Fortran Compiler, and other
compilers.
NOTE
Improving performance in one part of the machine does not necessarily bring significant gains to
overall performance. It is possible to degrade overall performance by improving performance for
some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of the VTune Performance Analyzer
events that provide measurable data on the performance gain achieved by following the recommendations. For more
on using the VTune analyzer, refer to the application’s online help.
The Intel C++ Compiler supports the integration of different versions of the code for different target processors. The
selection of which code to execute at runtime is made based on the CPU identifiers. Binary code targeted for different
processor generations can be generated under the control of the programmer or by the compiler. Refer to the “Intel®
C++ Compiler Classic Developer Guide and Reference” cpu_dispatch and cpu_specific sections for more information
on CPU dispatching (a.k.a function multi-versioning).
For applications that target multiple generations of microarchitectures, and where minimum binary code size and
single code path is important, a compatible code strategy is best. Optimizing applications using techniques developed for the Intel Core microarchitecture combined with the Nehalem microarchitecture is likely to improve code
efficiency and scalability when running on processors based on current and future generations of Intel 64 and IA-32
processors.
These recommendations are approximate. They can vary depending on coding style, application domain, and other
factors.
The purpose of the high, medium, and low (H, M, and L) priorities is to suggest the relative level of performance gain
one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of a particular code instance in applications, priority hints cannot
be directly correlated to application-level performance gain. In cases in which application-level performance gain has
been observed, we have provided a quantitative characterization of the gain (for information only). In cases in which
the impact has been deemed inapplicable, no priority is assigned.
The optimized code in Example 3-2 sets EBX to zero, then compares A and B. If A is greater than or equal to B, EBX is
set to one. Then EBX is decreased and AND’d with the difference of the constant values. This sets EBX to either zero
or the difference of the values. By adding CONST2 back to EBX, the correct value is written to EBX. When CONST2 is
equal to zero, the last instruction can be deleted.
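A C-level rendering of the same branch-removal idea is sketched below; the function and variable names are illustrative, and modern compilers often generate this form (or a CMOV) on their own.

/* Branch-free select following the mask technique described for Example 3-2:
   returns const2 when a >= b, otherwise const1. */
int select_no_branch(int a, int b, int const1, int const2)
{
    int mask = -(a < b);                    /* 0 when a >= b, all ones otherwise */
    return (mask & (const1 - const2)) + const2;
}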
Another way to remove branches is to use the CMOV and FCMOV instructions. Example 3-3 shows how to change a
TEST and branch instruction sequence using CMOV to eliminate a branch. If the TEST sets the equal flag, the value in
EBX will be moved to EAX. This branch is data-dependent, and is representative of an unpredictable branch.
test ecx, ecx ; Test the flags
jne 1H ; If the equal flag is not set, skip the move
mov eax, ebx
1H:
; To optimize code, combine jne and mov into one cmovcc instruction that checks the equal flag
test ecx, ecx ; Test the flags
cmove eax, ebx ; If the equal flag is set, move ebx to eax; the 1H: tag is no longer needed
An extension to this concept can be seen in the AVX-512 masked operations, as well as in some instructions such as
VPCMP which can be used to eliminate data dependent branches; see Section 17.4.
IF<condition> {...
}
Example 3-5 and Example 3-6 provide basic rules for a static prediction algorithm. In Example 3-5, the backward
branch (JC BEGIN) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static
predictor, however, will predict the branch to be taken, so a misprediction will not occur.
The first branch instruction (JC BEGIN) in Example 3-6 is a conditional forward branch. It is not in the BTB the first time
through, but the static predictor will predict the branch to fall through. The static prediction algorithm correctly
predicts that the CALL CONVERT instruction will be taken, even before the branch has any branch history in the BTB.
The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across
Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases
code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inlining to reduce the call depth.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) Favor inlining small functions that contain branches
with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a
performance penalty may be incurred.
Assembly/Compiler Coding Rule 8. (L impact, L generality) If the last statement in a function is a call to another
function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the
return stack buffer.
Assembly/Compiler Coding Rule 9. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than two end loop branches in a 16-
byte chunk.
If an indirect branch goes to the same address most of the time, then the BTB will predict it accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches. Adding a
conditional branch to a target is fruitful if:
• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last
target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. This may
increase the number of overall branch mispredictions, while improving the misprediction of indirect branches.
The profitability is lower if the number of mispredicting branches is very large.
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets
and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect
branch to a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply
this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches
(even at the expense of adding more branches). The added branches must be predictable for this to be worthwhile.
One reason for such predictability is a strong correlation with preceding branch history. That is, the directions taken
on preceding branches are a good indicator of the direction of the branch under consideration.
Example 3-7 shows a simple example of the correlation between a target of a preceding conditional branch and a
target of an indirect branch.
switch (n) {
case 0: handle_0(); break; // common target, correlated with
// branch history that is forward taken
case 1: handle_1(); break; // uncommon
case 3: handle_3(); break; // uncommon
default: handle_other(); // common target
}
}
Correlation can be difficult to determine analytically, for a compiler and for an assembly language programmer. It may
be fruitful to evaluate performance with and without peeling to get the best performance from a coding effort.
An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in
Example 3-8.
{
 if (n == 0) handle_0(); // Peeled-out common target, correlated with branch history
 else {
  switch (n) {
  case 1: handle_1(); break; // Uncommon
  case 3: handle_3(); break; // Uncommon
  default: handle_other(); // Remaining common target
  }
 }
}
In this example, the loop that executes 100 times assigns X to every even-numbered element and Y to every odd-
numbered element. By unrolling the loop you can make assignments more efficiently, removing one branch in the
loop body.
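A C sketch of the unrolled form is shown below; the array and value names are illustrative, and only the even/odd assignment pattern and the trip count of 100 are taken from the description above.

/* The loop described above, unrolled by two: the per-iteration even/odd
   test (and its branch) disappears from the loop body. */
void assign_alternating(int *a, int x, int y)
{
    for (int i = 0; i < 100; i += 2) {
        a[i]     = x;   /* even-numbered element */
        a[i + 1] = y;   /* odd-numbered element  */
    }
}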
— JGE or JNL
— JLE or JNG
— JG or JNLE
• Macrofusion is supported in 64-bit mode.
Enhanced macrofusion support in Sandy Bridge microarchitecture is summarized in Table 3-1 with additional
information in Example 3-14:
Jcc | TEST | AND | CMP | ADD | SUB | INC | DEC
JC/JB/JAE/JNB | Y | Y | Y | Y | Y | N | N
JE/JZ/JNE/JNZ | Y | Y | Y | Y | Y | Y | Y
JNA/JBE/JA/JNBE | Y | Y | Y | Y | Y | N | N
JS/JNS/JP/JPE/JNP/JPO | Y | Y | N | N | N | N | N
JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE | Y | Y | Y | Y | Y | Y | Y
71 0F 81 Jno N N Y
72 0F 82 Jc / Jb Y N Y
73 0F 83 Jae / Jnb Y N Y
74 0F 84 Je / Jz Y Y Y
75 0F 85 Jne / Jnz Y Y Y
76 0F 86 Jna / Jbe Y N Y
77 0F 87 Ja / Jnbe Y N Y
78 0F 88 Js N N Y
79 0F 89 Jns N N Y
7A 0F 8A Jp / Jpe N N Y
7B 0F 8B Jnp / Jpo N N Y
7C 0F 8C Jl / Jnge Y Y Y
7D 0F 8D Jge / Jnl Y Y Y
7F 0F 8F Jg / Jnle Y Y Y
Assembly/Compiler Coding Rule 17. (M impact, ML generality) Employ macrofusion where possible using
instruction pairs that support macrofusion. Prefer TEST over CMP if possible. Use unsigned variables and unsigned
jumps when possible. Try to logically verify that a variable is non-negative at the time of comparison. Avoid CMP or
TEST of MEM-IMM flavor when possible. However, do not add other instructions to avoid using the MEM-IMM
flavor.
Signed iteration count (macrofusion inhibited):
C code:
for (int1 i = 0; i < 1000; i++)
 a++;
Disassembly:
for (int i = 0; i < 1000; i++)
 mov dword ptr [ i ], 0
 jmp First
Loop:
 mov eax, dword ptr [ i ]
 add eax, 1
 mov dword ptr [ i ], eax
First:
 cmp dword ptr [ i ], 3E8H3
 jge End
 a++;
 mov eax, dword ptr [ a ]
 add eax, 1
 mov dword ptr [ a ], eax
 jmp Loop
End:

Unsigned iteration count (macrofusion compatible):
C code:
for ( unsigned int2 i = 0; i < 1000; i++)
 a++;
Disassembly:
for ( unsigned int i = 0; i < 1000; i++)
 xor eax, eax
 mov dword ptr [ i ], eax
 jmp First
Loop:
 mov eax, dword ptr [ i ]
 add eax, 1
 mov dword ptr [ i ], eax
First:
 cmp eax, 3E8H4
 jae End
 a++;
 mov eax, dword ptr [ a ]
 add eax, 1
 mov dword ptr [ a ], eax
 jmp Loop
End:
NOTES:
1. Signed iteration count inhibits macrofusion.
2. Unsigned iteration count is compatible with macrofusion.
3. CMP MEM-IMM, JGE inhibit macrofusion.
4. CMP REG-IMM, JAE permits macrofusion.
a++;
 mov eax, dword ptr [ a ]
 add eax, 1
 mov dword ptr [a], eax
else
 jmp End
a--;
Dec:
 mov eax, dword ptr [ a ]
 sub eax, 1
 mov dword ptr [ a ], eax
End:

a++;
 add eax, 1
 mov dword ptr [ a ], eax
else
 jmp End
a--;
Dec:
 sub eax, 1
 mov dword ptr [ a ], eax
End:
NOTES:
1. Signed iteration count inhibits macrofusion.
2. Unsigned iteration count is compatible with macrofusion.
3. CMP MEM-IMM, JGE inhibit macrofusion.
Assembly/Compiler Coding Rule 18. (M impact, ML generality) Software can enable macro fusion when it can be
logically determined that a variable is non-negative at the time of comparison; use TEST appropriately to enable
macrofusion when comparing a variable with 0.
For either signed or unsigned variable ‘a’; “CMP a,0” and “TEST a,a” produce the same result as far as the flags are
concerned. Since TEST can be macro-fused more often, software can use “TEST a,a” to replace “CMP a,0” for the
purpose of enabling macrofusion.
The Sandy Bridge microarchitecture enables more arithmetic and logic instructions to macro-fuse with conditional
branches. In loops where the ALU ports are already congested, performing one of these
macrofusions can relieve the pressure, as the macro-fused instruction consumes only port 5, instead of an ALU port
plus port 5.
In Example 3-14, the “add/cmp/jnz” loop contains two ALU instructions that can be dispatched via either port 0, 1, 5.
So there is a higher probability that port 5 binds to one of the ALU instructions, causing JNZ to wait a cycle. In the “sub/jnz” loop, the likelihood that SUB and JNZ can be dispatched in the same cycle is increased because only SUB is free to bind to port 0, 1, or 5.
The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in
Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do
not require LCP to change the immediate size are not subject to LCP stalls. The REX prefix (4xh) in 64-bit mode can
change the size of two classes of instruction, but does not cause an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a
bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.
Assembly/Compiler Coding Rule 19. (MH impact, MH generality) Favor generating code using imm8 or imm32
values instead of imm16 values.
If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.
Assembly/Compiler Coding Rule 20. (M impact, ML generality) Ensure that instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line; avoid using these instructions to operate on 16-bit data, and upcast short data to 32 bits.
Example 3-15. Avoiding False LCP Delays with 0xF7 Group Instructions
A Sequence Causing Delay in the Decoder Alternate Sequence to Avoid Delay
neg word ptr a movsx eax, word ptr a
neg eax
mov word ptr a, AX
• Make sure each hot code block is less than about 750 instructions. Specifically, do not unroll to more than 750
instructions in a loop. This should enable Decoded ICache residency even when
hyper-threading is enabled.
• For applications with very large blocks of calculations inside a loop, consider loop-fission: split the loop into
multiple loops that fit in the Decoded ICache, rather than a single loop that overflows.
• If an application can be sure to run with only one thread per core, it can increase hot code block size to about
1500 instructions.
Dense Read-Modify-Write Code
The Decoded ICache can hold only up to 18 micro-ops per each 32 byte aligned memory chunk. Therefore, code with
a high concentration of instructions that are encoded in a small number of bytes, yet have many micro-ops, may
overflow the 18 micro-op limitation and not enter the Decoded ICache. Read-modify-write (RMW) instructions are a
good example of such instructions.
RMW instructions accept one memory source operand, one register source operand, and use the source memory
operand as the destination. The same functionality can be achieved by two or three instructions: the first reads the
memory source operand, the second performs the operation with the second register source operand, and the last
writes the result back to memory. These instructions usually result in the same number of micro-ops but use more
bytes to encode the same functionality.
One case where RMW instructions may be used extensively is when the compiler optimizes aggressively for code size.
Here are some possible solutions to fit the hot code in the Decoded ICache:
• Replace RMW instructions with two or three instructions that have the same functionality. For example, “adc
[rdi], rcx“ is only three bytes long; the equivalent sequence “adc rax, [rdi]“ + “mov [rdi], rax“ has a footprint of six
bytes.
• Align the code so that the dense part is broken down among two different 32-byte chunks. This solution is useful
when using a tool that aligns code automatically, and is indifferent to code changes.
• Spread the code by adding multiple byte NOPs in the loop. Note that this solution adds micro-ops for execution.
Align Unconditional Branches for Decoded ICache
For code entering the Decoded ICache, each unconditional branch is the last micro-op occupying a Decoded ICache
Way. Therefore, only three unconditional branches per a 32 byte aligned chunk can enter the Decoded ICache.
Unconditional branches are frequent in jump tables and switch declarations. Below are examples for these
constructs, and methods for writing them so that they fit in the Decoded ICache.
Compilers create jump tables for C++ virtual class methods or DLL dispatch tables. Each unconditional branch
consumes five bytes; therefore up to seven of them can be associated with a 32-byte chunk. Thus jump tables may not
fit in the Decoded ICache if the unconditional branches are too dense in each 32Byte-aligned chunk. This can cause
performance degradation for code executing before and after the branch table.
The solution is to add multi-byte NOP instructions among the branches in the branch table. This may increase code
size and should be used cautiously. However, these NOPs are not executed and therefore have no penalty in later pipe
stages.
Switch-Case constructs represent a similar situation. Each evaluation of a case condition results in an unconditional branch. The same solution of using multi-byte NOPs can apply for every three consecutive unconditional branches that fit inside an aligned 32-byte chunk.
Two Branches in a Decoded ICache Way
The Decoded ICache can hold up to two branches in a way. Dense branches in a 32 byte aligned chunk, or their
ordering with other instructions may prohibit all the micro-ops of the instructions in the chunk from entering the
Decoded ICache. This does not happen often. When it does happen, you can space the code with NOP instructions
where appropriate. Make sure that these NOP instructions are not part of hot code.
Assembly/Compiler Coding Rule 22. (M impact, M generality) Avoid putting explicit references to ESP in a sequence
of stack operations (POP, PUSH, CALL, RET).
Assembly/Compiler Coding Rule 26. (M impact, L generality) Avoid prefixes, especially multiple non-0F-prefixed
opcodes.
Assembly/Compiler Coding Rule 27. (M impact, L generality) Do not use many segment registers.
Assembly/Compiler Coding Rule 28. (M impact, M generality) Avoid using complex instructions (for example, enter,
leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple
instructions instead.
Assembly/Compiler Coding Rule 29. (MH impact, M generality) Use push/pop to manage stack space and address
adjustments between function calls/returns instead of enter/leave. Using the ENTER instruction with non-zero immediates can cause significant delays in the pipeline, in addition to misprediction.
Theoretically, arranging instructions sequence to match the 4-1-1-1 template applies to processors based on Intel
Core microarchitecture. However, with macrofusion and micro-fusion capabilities in the front end, attempts to
schedule instruction sequences using the 4-1-1-1 template will likely provide diminishing returns.
Instead, software should follow these additional decoder guidelines:
• If you need to use multiple micro-op, non-microsequenced instructions, try to separate them with a few single micro-op instructions. The following instructions are examples of multiple micro-op instructions that do not require the microsequencer:
ADC/SBB
CMOVcc
Read-modify-write instructions
• If a series of multiple micro-op instructions cannot be separated, try breaking the series into a different
equivalent instruction sequence. For example, a series of read-modify-write instructions may go faster if
sequenced as a series of read-modify + store instructions. This strategy could improve performance even if the
new code sequence is larger than the original one.
• LEA can be dispatched via port 1 and 5 in most cases, doubling the throughput over prior generations. However, this applies only to LEA instructions with one or two source operands.
loop:
lea ecx, [ecx + ecx] // ecx = ecx*2
lea eax, [eax + eax *4] // eax = eax*5
and ecx, 0xff
and eax, 0xff
dec edx
jg loop
• For LEA instructions with three source operands and some specific situations, instruction latency has increased to
3 cycles, and must dispatch via port 1:
— LEA that has all three source operands: base, index, and offset.
— LEA that uses base and index registers where the base is EBP, RBP, or R13.
— LEA that uses RIP relative addressing mode.
— LEA that uses 16-bit addressing mode.
The LEA instruction or a sequence of LEA, ADD, SUB and SHIFT instructions can replace constant multiply instructions.
The LEA instruction can also be used as a multiple operand addition instruction, for example:
LEA ECX, [EAX + EBX*4 + A]
Using LEA in this way may avoid register usage by not tying up registers for operands of arithmetic instructions. This
use may also save code space.
If the LEA instruction uses a shift by a constant amount then the latency of the sequence of µops is shorter if adds are
used instead of a shift, and the LEA instruction may be replaced with an appropriate sequence of µops. This, however,
increases the total number of µops, leading to a trade-off.
Assembly/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the
critical path, a sequence with ADDs may be better.
Example 3-19. Clearing Register to Break Dependency While Negating Array Elements

Negation (-x = (x XOR (-1)) - (-1)) without breaking dependency:
 lea eax, a
 lea ecx, b
 lea edi, c
 xor edx, edx
 movdqa xmm7, allone
lp:
 movdqa xmm0, [eax + edx]
 paddd xmm0, [ecx + edx]
 pxor xmm0, xmm7
 psubd xmm0, xmm7
 movdqa [edi + edx], xmm0
 add edx, 16
 cmp edx, 4096
 jl lp

Negation (-x = 0 - x) using PXOR reg, reg to break dependency:
 lea eax, a
 lea ecx, b
 lea edi, c
 xor edx, edx
lp:
 movdqa xmm0, [eax + edx]
 paddd xmm0, [ecx + edx]
 pxor xmm7, xmm7
 psubd xmm7, xmm0
 movdqa [edi + edx], xmm7
 add edx, 16
 cmp edx, 4096
 jl lp
Assembly/Compiler Coding Rule 33. (M impact, MH generality) Break dependences on portions of registers
between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be
accomplished with 32-bit moves or by using MOVZX.
Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the
following statements does not need sign extension, nor does it need prefixes for operand size overrides:
static short int a, b;
IF (a == b) {
...
}
Code for comparing these 16-bit operands might be:
MOVZX EAX, WORD PTR [a]
MOVZX EBX, WORD PTR [b]
CMP EAX, EBX
These circumstances tend to be common. However, the technique will not work if the compare is for greater than,
less than, greater than or equal, and so on, or if the values in eax or ebx are to be used in another operation where
sign extension is required.
Assembly/Compiler Coding Rule 34. (M impact, M generality) Try to use zero extension or operate on 32-bit
operands instead of using moves with sign extension.
The trace cache can be packed more tightly when instructions with operands that can only be represented as 32 bits
are not adjacent.
Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid placing instructions that use 32-bit immediates
which cannot be encoded as sign-extended 16-bit immediates near each other. Try to schedule µops that have no
immediate immediately before or after µops with 32-bit immediates.
3.5.1.8 Compares
Use TEST when comparing a value in a register with zero. TEST essentially ANDs operands together without writing to
a destination register. TEST is preferred over AND because AND produces an extra result register. TEST is better than
CMP ..., 0 because the instruction size is smaller.
Use TEST when comparing the result of a logical AND with an immediate constant for equality or inequality if the
register is EAX for cases such as:
IF (AVAR & 8) { }
The TEST instruction can also be used to detect rollover of modulo of a power of 2. For example, the C code:
IF ( (AVAR % 16) == 0 ) { }
can be implemented using:
TEST EAX, 0x0F
JNZ AfterIf
Using the TEST instruction between the instruction that may modify part of the flag register and the instruction that
uses the flag register can also help prevent partial flag register stall.
Assembly/Compiler Coding Rule 36. (ML impact, M generality) Use the TEST instruction instead of AND when the
result of the logical AND is not used. This saves µops in execution. Use a TEST of a register with itself instead of a
CMP of the register to zero, this saves the need to encode the zero and saves encoding space. Avoid comparing a
constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.
Often a produced value must be compared with zero, and then used in a branch. Because most Intel architecture
instructions set the condition codes as part of their execution, the compare instruction may be eliminated. Thus the
operation can be tested directly by a JCC instruction. The notable exceptions are MOV and LEA. In these cases, use
TEST.
Assembly/Compiler Coding Rule 37. (ML impact, M generality) Eliminate unnecessary compare with zero
instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding
arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be certain that any code
transformations made do not introduce problems with overflow.
These are all true NOPs, having no effect on the state of the machine except to advance the EIP. Because NOPs require
hardware resources to decode and execute, use the fewest number to achieve the desired padding.
The one byte NOP:[XCHG EAX,EAX] has special hardware support. Although it still consumes a µop and its
accompanying resources, the dependence upon the old value of EAX is removed. This µop can be executed at the
earliest possible opportunity, reducing the number of outstanding instructions, and is the lowest cost NOP.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware.
Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP
will dispatch and release RS resources at the earliest possible opportunity.
Try to observe the following NOP generation priority:
• Select the smallest number of NOPs and pseudo-NOPs to provide the desired padding.
• Select NOPs that are least likely to execute on slower execution unit clusters.
• Select the register arguments of NOPs to reduce dependencies.
For modern microarchitectures, using dependence depth information in spill scheduling is even more important than
in previous processors. The loop-carried dependence in A makes it especially important that A not be spilled. Not only
would a store/load be placed in the dependence chain, but there would also be a data-not-ready stall of the load,
costing further cycles.
Assembly/Compiler Coding Rule 38. (H impact, MH generality) For small loops, placing loop invariants in memory is
better than spilling loop-carried dependencies.
A possibly counter-intuitive result is that in such a situation it is better to put loop invariants in memory than in
registers, since loop invariants never have a load blocked by store data that is not ready.
Example 3-22 shows how to process 8-bit integers using MOVZX to take advantage of zero-latency MOV
enhancement. Consider
X = (X * 3^N ) MOD 256;
Y = (Y * 3^N ) MOD 256;
When “MOD 256” is implemented using the “AND 0xff” technique, its latency is exposed in the result-dependency
chain. Using a form of MOVZX on a truncated byte input, it can take advantage of zero-latency MOV enhancement
and gain about 45% in speed.
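In C, the same effect can be approximated by truncating through an 8-bit type instead of masking with 0xFF, which lets the compiler use a zero-extending byte move (MOVZX). The sketch below is illustrative only; the generated code depends on the compiler, and the 45% figure above refers to the hand-tuned assembly sequence of Example 3-22, not to this sketch.

#include <stdint.h>

/* Masking form: the AND stays on the result-dependency chain. */
uint32_t step_and(uint32_t x)   { return (x * 3u) & 0xFFu; }

/* Truncating form: the cast through an 8-bit type can be compiled to a
   zero-extending byte move (MOVZX) of the truncated result, the pattern
   described in the text above. */
uint32_t step_movzx(uint32_t x) { return (uint8_t)(x * 3u); }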
The effectiveness of coding a dense sequence of instructions to rely on a zero-latency MOV instruction must also
consider internal resource constraints in the microarchitecture.
In Example 3-23, RBX/RCX and RDX/RAX are pairs of registers that are shared and continuously overwritten. In the
right-hand sequence, registers are overwritten with new results immediately, consuming less internal resources
provided by the underlying microarchitecture. As a result, it is about 8% faster than the left-hand sequence where
internal resources could only support 50% of the attempt to take advantage of zero-latency MOV instructions.
Figure 3-1. INT Execution Ports Within the Processor Core Pipeline
Example 3-24 illustrates the use of MOVZX to avoid a partial register stall when packing three byte values into a
register.
Follow these recommendations to avoid stalls from partial updates to XMM registers:
• Avoid using instructions which update only part of the XMM register.
• If a 64-bit load is needed, use the MOVSD or MOVQ instruction.
• If two 64-bit loads are required to the same register from non-contiguous locations, use MOVSD/MOVHPD instead of MOVLPD/MOVHPD.
• When copying the XMM register, use the following instructions for full register copy, even if you only want to copy
some of the source register data:
MOVAPS
MOVAPD
MOVDQA
In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware so it does
not experience partial flag stall.
In Sandy Bridge microarchitecture, the cost of partial flag access is replaced by the insertion of a micro-op instead of
a stall. However, it is still recommended to use fewer instructions that write only to some of the flags (such as INC, DEC, SET CL) before instructions that can write flags conditionally (such as SHIFT CL).
Example 3-26 compares two techniques to implement the addition of very large integers (e.g., 1024 bits). The
alternative sequence on the right side of Example 3-26 will be faster than the left side on Sandy Bridge
microarchitecture, but it will experience partial flag stalls on prior microarchitectures.
Assembly/Compiler Coding Rule 39. (M impact, ML generality) Avoid introducing dependences with partial
floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD
XMMREG1, XMMREG2 instruction instead.
The MOVSD XMMREG, MEM instruction writes all 128 bits and breaks a dependence.
3.5.3 VECTORIZATION
This section provides a brief summary of optimization issues related to vectorization. There is more detail in the
chapters that follow.
Vectorization is a program transformation that allows special hardware to perform the same operation on multiple
data elements at the same time. Successive processor generations have provided vector support through the MMX
technology, Intel Streaming SIMD Extensions (Intel SSE), Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel
Streaming SIMD Extensions 3 (Intel SSE3) and Intel Supplemental Streaming SIMD Extensions 3 (Intel SSSE3).
Vectorization is a special case of SIMD, a term defined in Flynn’s architecture taxonomy to denote a single instruction
stream capable of operating on multiple data elements in parallel. The number of elements which can be operated on
in parallel range from four single-precision floating-point data elements in Intel SSE and two double-precision
floating-point data elements in Intel SSE2 to sixteen byte operations in a 128-bit register in Intel SSE2. Thus, vector
length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without intervention from the user.
• The user can insert pragmas to help the compiler realize that it can vectorize the code.
• The user can write SIMD code explicitly using intrinsics and C++ classes.
To help enable the compiler to generate SIMD code, avoid global pointers and global variables. These issues may be
less troublesome if all modules are compiled simultaneously, and whole-program optimization is used.
User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or SIMD data type, to
enable more parallelism with the use of a (longer) SIMD vector. For example, use single precision instead of double
precision where possible.
User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of loops so that the innermost nesting
level is free of inter-iteration dependencies. Especially avoid the case where the store of data in an earlier iteration
happens lexically after the load of that data in a future iteration, something which is called a lexically backward
dependence.
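To make this rule concrete, the following minimal C sketch (with hypothetical arrays a and b and assumed dimensions N and M) contrasts an innermost loop that is free of inter-iteration dependencies with one whose innermost loop carries a lexically backward dependence:

#define N 1024
#define M 1024
static float a[N][M], b[N][M];

/* The inner (j) loop has no inter-iteration dependence: row i depends only on
   row i-1, so the compiler can vectorize over j. */
void independent_inner(void)
{
    for (int i = 1; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = a[i-1][j] + b[i][j];
}

/* The store to a[i][j+1] feeds the next j iteration's load and appears
   lexically after that load: a lexically backward dependence that blocks
   vectorization of the j loop. Because the dependence is along j only,
   interchanging the loops (where the algorithm allows it) makes the
   dependence-free i dimension innermost. */
void backward_dep_inner(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M - 1; j++) {
            b[i][j]   = a[i][j] * 0.5f;     /* load of a[i][j] ...            */
            a[i][j+1] = b[i][j] + 1.0f;     /* ... store read next iteration  */
        }
}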
The integer part of the SIMD instruction set extensions covers 8-bit, 16-bit, and 32-bit operands. Not all SIMD
operations are supported for 32 bits, meaning that some source code will not be able to be vectorized at all unless
smaller operands are used.
User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of conditional branches inside loops and
consider using SSE instructions to eliminate branches.
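As an illustration of the second point, the sketch below (assuming x is 16-byte aligned and n is a multiple of 4; names are illustrative) replaces a per-element if/else with an SSE MAXPS select expressed through intrinsics:

#include <immintrin.h>

/* Branch-free lower clamp: MAXPS performs the per-element select that an
   if/else inside the loop would otherwise express as a conditional branch. */
void clamp_lo_sse(float *x, int n, float lo)
{
    __m128 vlo = _mm_set1_ps(lo);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&x[i]);          /* aligned 128-bit load        */
        _mm_store_ps(&x[i], _mm_max_ps(v, vlo)); /* per-lane max, no branching */
    }
}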
User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop) variable expressions simple.
[Figure: generic flow of a partially vectorizable program: packing, an unvectorizable code/serial routine section, and unpacking.]
Example 3-27. Reference Code Template for Partially Vectorizable Program (Contd.)
// Unpacking ////////////////////////////
sub ebp, 32
and ebp, 0xfffffff0
movaps [ebp], xmm0
// Serial operations on components ///////
sub ebp, 4
mov eax, [ebp+4]
mov [ebp], eax
call foo
mov [ebp+16+4], eax
mov eax, [ebp+8]
mov [ebp], eax
call foo
mov [ebp+16+4+4], eax
// Packing ///////////////////////////////
movaps xmm0, [ebp+16+4]
// Epilog ////////////////////////////////
pop ebp
ret
Example 3-28. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty

Packing Method 1:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
punpckldq xmm0, xmm1
punpckldq xmm2, xmm3
punpcklqdq xmm0, xmm2

Packing Method 2:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
psllq xmm3, 32
orps xmm2, xmm3
psllq xmm1, 32
orps xmm0, xmm1
movlhps xmm0, xmm2

Packing Method 3:
movd xmm0, [ebp+16+4]
movd xmm1, [ebp+16+8]
movd xmm2, [ebp+16+12]
movd xmm3, [ebp+12+16+4]
movlhps xmm1, xmm3
psllq xmm1, 32
movlhps xmm0, xmm2
orps xmm0, xmm1
registers. Using registers to simplify result passing and reduce memory spills can improve performance by varying
degrees depending on the register pressure at runtime.
Example 3-29 shows the coding sequence that uses four extra XMM registers to reduce all memory spills of passing
results back to the parent routine. However, software must observe the following conditions when using this
technique:
• There is no register shortage.
• If the loop does not have many stores or loads but has many computations, this technique does not help
performance. This technique adds work to the computational units while the store and load ports are idle.
Example 3-29. Using Four Registers to Reduce Memory Spills and Simplify Result Passing
mov eax, [ebp+4]
mov [ebp], eax
call foo
movd xmm0, eax
add ebp, 4
call foo
mov [ebp+16], eax
add ebp, 4
call foo
Example 3-31. Base Line Code Sequence to Estimate Loop Overhead (Contd.)
mov [ebp], edi
call foo
add ebp, 4
pop ebp
ret
The average per-iteration cost of packing/unpacking can be derived from measuring the execution times of a large
number of iterations by:
((Cycles to run TestCase) - (Cycles to run equivalent baseline sequence) ) / (Iteration count).
For example, using a simple function that returns an input parameter (representative of tight, short loops), the per-
iteration cost of packing/unpacking may range from slightly more than 7 cycles (the shuffle with store forwarding
case, Example 3-27) to ~0.9 cycles (accomplished by several test cases). Across 27 test cases (consisting of one of the
alternate packing methods, no result-simplification/simplification of either 1 or 4 results, no stack optimization or
with stack optimization), the average per-iteration cost of packing/unpacking is about 1.7 cycles.
Generally speaking, packing method 2 and 3 (see Example 3-28) tend to be more robust than packing method 1; the
optimal choice of simplifying 1 or 4 results will be affected by register pressure of the runtime and other relevant
microarchitectural conditions.
Note that the numeric discussion of per-iteration cost of packing/unpacking is illustrative only. It will vary with test cases
using a different base line code sequence and will generally increase if the non-vectorizable routine requires longer
time to execute because the number of loop iterations that can reside in flight in the execution core decreases.
for (i=0;i<BUFF_SIZE;i++){
sum+=buff[i];
}
Alternative 1 is the assembly code generated by the Intel compiler for this C code, using the optimization flag for
Nehalem microarchitecture. The compiler vectorizes execution using Intel SSE instructions. In this code, each ADD
operation uses the result of the previous ADD operation. This limits the throughput to one load and ADD operation
per cycle. Alternative 2 is optimized for Sandy Bridge microarchitecture by enabling it to use the additional load
bandwidth. The code removes the dependency among ADD operations, by using two registers to sum the array
values. Two load and two ADD operations can be executed every cycle.
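The same two-accumulator idea can be expressed at the source level. The following is an illustrative C sketch (not the compiler-generated assembly of Example 3-32), assuming n is even and that floating-point reassociation is acceptable for the application:

/* Two independent partial sums remove the serial dependence between
   additions, so two loads and two adds can issue per cycle. */
double sum_two_accumulators(const double *buff, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < n; i += 2) {
        s0 += buff[i];
        s1 += buff[i + 1];
    }
    return s0 + s1;    /* combine the partial sums once, outside the loop */
}

Because the additions are reassociated, the numerical result can differ slightly from the strictly serial loop; compilers typically require an explicit reassociation option to perform this transformation automatically.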
Example 3-32. Optimizing for Load Port Bandwidth in Sandy Bridge Microarchitecture
The left side implements pointer chasing by traversing an index; the compiler then generates the code shown below,
addressing memory using base+index with an offset. The right side shows compiler-generated code from pointer
de-referencing code, which uses only a base register.
The code on the right side is faster than the left side across Sandy Bridge microarchitecture and prior
microarchitectures. However, the code that traverses the index will be slower on Sandy Bridge microarchitecture
relative to prior microarchitectures.
Example 3-34. Example of Bank Conflicts in L1D Cache and Remedy (Contd.)

(left column)
mov [r13+rsi*4], edi
inc ecx
mov [r13+rsi*4+4], r8d
mov [r13+rsi*4+8], r9d
mov [r13+rsi*4+12], r10d
cmp ecx, LEN
jb loop

(right column)
inc ecx
mov [r13+rsi*4], edi
mov [r13+rsi*4+4], r8d
mov [r13+rsi*4+8], r9d
mov [r13+rsi*4+12], r10d
cmp ecx, LEN
jb loop
Bank conflicts may occur with the introduction of the third load port in the Golden Cove microarchitecture. In this
microarchitecture, conflicts happen between three loads with the same bits 2-5 of their linear address even if they
access the same set of the cache. Up to two loads can access the same cache bank without a conflict; however, a third
load accessing the same bank must be delayed. The bank conflicts do not apply to 512-bit wide loads because their
bandwidth is limited to two per cycle.
Recommendation: In the Golden Cove microarchitecture, bank conflicts often happen when multiple loads access
the same memory location. Whenever possible, avoid reading the same memory location repeatedly within a tight
loop with multiple load operations. Commonly used memory locations are better kept in registers to prevent a
potential bank conflict penalty.
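A minimal C sketch of this recommendation (the function and parameter names are illustrative): hoisting a value that would otherwise be re-read from memory every iteration into a local variable keeps it in a register.

/* The compiler may reload *factor every iteration in scale_repeated_load,
   because it cannot prove that dst does not alias factor, so the same
   location is read repeatedly inside the loop. */
void scale_repeated_load(int *dst, const int *src, const int *factor, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * (*factor);
}

void scale_hoisted(int *dst, const int *src, const int *factor, int len)
{
    int f = *factor;                 /* single load; value lives in a register */
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * f;
}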
Example 3-35. Using XMM Register in Lieu of Memory for Register Spills

Register spills into memory:
loop:
mov rdx, [rsp+0x18]
movdqa xmm0, [rdx]
movdqa xmm1, [rsp+0x20]
pcmpeqd xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
jne end_loop
movzx rcx, [rbx+0x60]
add qword ptr[rsp+0x18], 0x10
add rdi, 0x4
movzx rdx, di
sub rcx, 0x4
add rsi, 0x1d0
cmp rdx, rcx
jle loop

Register spills into XMM:
movq xmm4, [rsp+0x18]
mov rcx, 0x10
movq xmm5, rcx
loop:
movq rdx, xmm4
movdqa xmm0, [rdx]
movdqa xmm1, [rsp+0x20]
pcmpeqd xmm1, xmm0
pmovmskb eax, xmm1
test eax, eax
jne end_loop
movzx rcx, [rbx+0x60]
paddq xmm4, xmm5
add rdi, 0x4
movzx rdx, di
sub rcx, 0x4
add rsi, 0x1d0
cmp rdx, rcx
jle loop
There are two kinds of requirements for store forwarding. If these requirements are violated, store forwarding cannot
occur and the load must get its data from the cache (so the store must write its data back to the cache first). This
incurs a penalty that is largely related to pipeline depth of the underlying micro-architecture.
The first requirement pertains to the size and alignment of the store-forwarding data. This restriction is likely to have
high impact on overall application performance. Typically, a performance penalty due to violating this restriction can
be prevented. The store-to-load forwarding restrictions vary from one microarchitecture to another. Several
examples of coding pitfalls that cause store-forwarding stalls and solutions to these pitfalls are discussed in detail in
Section 3.6.4.1. The second requirement is the availability of data, discussed in Section 3.6.4.2. A good practice is to
eliminate redundant load operations.
It may be possible to keep a temporary scalar variable in a register and never write it to memory. Generally, such a
variable must not be accessible using indirect pointers. Moving a variable to a register eliminates all loads and stores
of that variable and eliminates potential problems associated with store forwarding. However, it also increases
register pressure.
Load instructions tend to start chains of computation. Since the out-of-order engine is based on data dependence,
load instructions play a significant role in the engine’s ability to execute at a high rate. Eliminating loads should be
given a high priority.
If a variable does not change between the time when it is stored and the time when it is used again, the register that
was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the
store and the second load, it may not be possible to eliminate the second load.
Assembly/Compiler Coding Rule 40. (H impact, M generality) Pass parameters in registers instead of on the stack
where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is
optimized in hardware by providing the value to the load directly from the memory order buffer without the need to
access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in
forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long latency operation.
Parameter passing conventions may limit the choice of which parameters are passed in registers versus which are passed
on the stack. However, these limitations may be overcome if the compiler has control of the compilation of the whole
binary (using whole-program optimization).
Assembly/Compiler Coding Rule 43. (H impact, ML generality) If it is necessary to extract a non-aligned portion of
stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as
necessary. This is better than incurring the penalties of a failed store-forward.
Assembly/Compiler Coding Rule 44. (MH impact, ML generality) Avoid several small loads after large stores to the
same area of memory by using a single large read and register copies as needed.
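A minimal C sketch of Rules 43 and 44 (the helper name is illustrative): read the full, aligned doubleword once and extract the bytes with shifts and masks, instead of issuing several narrower loads after the wide store.

#include <stdint.h>

/* The caller has just stored a doubleword at *p and now needs its bytes.
   Loading the whole doubleword once (matching the store size) and extracting
   the fields in registers avoids several narrower loads that would each have
   to forward from the wider store. */
static inline void split_dword(const uint32_t *p, uint8_t out[4])
{
    uint32_t v = *p;                         /* one load, same size as the store */
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)((v >> 8) & 0xFF);
    out[2] = (uint8_t)((v >> 16) & 0xFF);
    out[3] = (uint8_t)((v >> 24) & 0xFF);
}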
Example 3-37 depicts several store-forwarding situations in which small loads follow large stores. The first three load
operations illustrate the situations described in Rule 44. However, the last load operation gets data from store-
forwarding without problem.
Example 3-38 illustrates a store-forwarding situation in which a large load follows several small stores. The data
needed by the load operation cannot be forwarded because all of the data that needs to be forwarded is not
contained in the store buffer. Avoid large loads after small stores to the same area of memory.
Example 3-39 illustrates a stalled store-forwarding situation that may appear in compiler generated code. Sometimes
a compiler generates code similar to that shown in Example 3-39 to handle a spilled byte to the stack and convert the
byte to an integer value.
Example 3-40 offers two alternatives to avoid the non-forwarding situation shown in Example 3-39.
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are
more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the
movement of 64 bits at a time, floating-point instructions should not be used for this purpose, as data may be
inadvertently modified.
As an additional example, consider the cases in Example 3-41.
In the first case (A), there is a large load after a series of small stores to the same area of memory (beginning at
memory address MEM). The large load will stall.
The FLD must wait for the stores to write to memory before it can access all the data it requires. This stall can also
occur with other data types (for example, when bytes or words are stored and then words or doublewords are read
from the same area of memory).
In the second case (B), there is a series of small loads after a large store to the same area of memory (beginning at
memory address MEM). The small loads will stall.
The word loads must wait for the quadword store to write to memory before they can access the data they require.
This stall can also occur with other data types (for example, when doublewords or words are stored and then words
or bytes are read from the same area of memory). This can be avoided by moving the store as far from the loads as
possible.
Store forwarding restrictions for processors based on Intel Core microarchitecture are listed in Table 3-4.
Table 3-4. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture
Store Alignment | Width of Store (bits) | Load Alignment (byte) | Width of Load (bits) | Store Forwarding Restriction
To Natural size | 16 | word aligned | 8, 16 | not stalled
To Natural size | 16 | not word aligned | 8 | stalled
In modern microarchitectures, hardware predicts when loads are dependent on and get their data forwarded from
preceding stores. These predictions can significantly improve performance. However, if a load is scheduled too soon
after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
There are several cases in which data is passed through memory, and the store may need to be separated from the
load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with
inlined code to be in memory, creating more memory variables and preventing the elimination of redundant
loads.
Assembly/Compiler Coding Rule 45. (H impact, MH generality) Where it is possible to do so without incurring other
penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to
minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a
long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest store-
load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially
avoid including a store forward on a loop-carried dependence chain.
Example 3-42 shows an example of a loop-carried dependence chain.
Assembly/Compiler Coding Rule 46. (M impact, MH generality) Calculate store addresses as early as possible to
avoid having stores block loads.
Example 3-43 shows how a data structure could be rearranged to reduce its size.
Cache line size of 64 bytes can impact streaming applications (for example, multimedia). These reference and use
data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less
efficient utilization of system memory bandwidth. For example, arrays of structures can be decomposed into several
arrays to achieve better packing, as shown in Example 3-44.
The efficiency of such optimizations depends on usage patterns. If the elements of the structure are all accessed
together but the access pattern of the array is random, then ARRAY_OF_STRUCT avoids unnecessary prefetch even
though it wastes memory.
However, if the access pattern of the array exhibits locality (for example, if the array index is being swept through)
then processors with hardware prefetchers will prefetch data from STRUCT_OF_ARRAY, even if the elements of the
structure are accessed together.
When the elements of the structure are not accessed with equal frequency, such as when element A is accessed
ten times more often than the other entries, then STRUCT_OF_ARRAY not only saves memory, but
it also prevents fetching unnecessary data items B, C, D, and E.
Using STRUCT_OF_ARRAY also enables the use of the SIMD data types by the programmer and the compiler.
Note that STRUCT_OF_ARRAY can have the disadvantage of requiring more independent memory stream references.
This can require the use of more prefetches and additional address generation calculations. It can also have an impact
on DRAM page access efficiency. An alternative, HYBRID_STRUCT_OF_ARRAY blends the two approaches. In this
case, only 2 separate address streams are generated and referenced: 1 for HYBRID_STRUCT_OF_ARRAY_ACE and 1
for HYBRID_STRUCT_OF_ARRAY_BD. The second alternative also prevents fetching unnecessary data — assuming
that (1) the variables A, C and E are always used together, and (2) the variables B and D are always used together, but
not at the same time as A, C and E.
The hybrid approach ensures:
• Simpler/fewer address generations than STRUCT_OF_ARRAY.
• Fewer streams, which reduces DRAM page misses.
• Fewer prefetches due to fewer streams.
• Efficient cache line packing of data elements that are used concurrently.
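The three layouts can be sketched in C as follows (the type and member names are illustrative and assume, as above, that A/C/E and B/D are used together):

/* Array of structures: a single stream, but touching only member a also
   drags the unused b..e members into the cache. */
struct aos_elem { float a, b, c, d, e; };
struct aos_elem array_of_struct[1000];

/* Structure of arrays: best packing and SIMD-friendly, at the cost of five
   independent memory streams (more prefetches and address generation). */
struct soa {
    float a[1000], b[1000], c[1000], d[1000], e[1000];
};
struct soa struct_of_array;

/* Hybrid: members used together share a stream, so only two address
   streams are generated and cache lines stay well packed. */
struct hybrid_ace { float a, c, e; };
struct hybrid_bd  { float b, d; };
struct hybrid_ace hybrid_soa_ace[1000];
struct hybrid_bd  hybrid_soa_bd[1000];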
Assembly/Compiler Coding Rule 47. (H impact, M generality) Try to arrange data structures so they permit
sequential access.
If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data that will be needed
by the application, reducing the effective memory latency. If the data is accessed in a
non-sequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize
up to eight concurrent streams. See Chapter 9 for more information on the hardware prefetcher.
User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing within a cache line (64 bytes).
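A common way to reduce false sharing, shown here as a hedged C sketch with illustrative names, is to pad independently written data (for example, per-thread counters) out to the 64-byte line size so that each writer owns its own cache line:

#include <stdint.h>

#define CACHE_LINE 64

/* Per-thread counters packed contiguously would share 64-byte lines and
   ping-pong between cores; padding each counter to a full line confines
   every writer to its own line. */
struct padded_counter {
    volatile uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct padded_counter counters[16];   /* e.g., one slot per thread */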
// 64-bit environment
sub rsp, $<stack_size + N>
mov r13, $<offset_of_aligned_section_in_stack>
and r13, $-<N> ; r13 points to the aligned section in the stack
... ; use r13 as the base for aligned data
If for some reason it is not possible to align the stack for 64-bits, the routine should access the parameter and save it
into a register or known aligned storage, thus incurring the penalty only once.
Assembly/Compiler Coding Rule 50. (H impact, L generality) Always put code and data on separate pages. Avoid
self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that
performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-
KByte subpages.
call _lblcx
... ; ECX now contains the IP of this instruction
...
_lblcx:
mov ecx, [esp]
ret
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit.
This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached
memory.
Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Beginning with
Nehalem microarchitecture, there are 10 buffers available for write-combining. Beginning with Ice Lake Client
microarchitecture, there are 12 buffers available for write-combining.
Assembly/Compiler Coding Rule 51. (H impact, L generality) If an inner loop writes to more than four arrays (four
distinct cache lines), apply loop fission to break up the body of the loop so only four arrays are being written to in
each iteration of each of the resulting loops.
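A minimal C sketch of loop fission under this rule (illustrative array names; six output arrays split into a four-array loop and a two-array loop):

/* One loop writing six distinct arrays competes for more write-combining
   buffers than the rule allows; splitting it so each resulting loop writes at
   most four arrays lets the buffers fill completely before eviction. */
void write_six_fused(float *a, float *b, float *c, float *d, float *e, float *f,
                     const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;  b[i] = src[i] + 2.0f;  c[i] = src[i] + 3.0f;
        d[i] = src[i] + 4.0f;  e[i] = src[i] + 5.0f;  f[i] = src[i] + 6.0f;
    }
}

void write_six_fissioned(float *a, float *b, float *c, float *d, float *e, float *f,
                         const float *src, int n)
{
    for (int i = 0; i < n; i++) {        /* first loop: four output arrays   */
        a[i] = src[i] + 1.0f;  b[i] = src[i] + 2.0f;
        c[i] = src[i] + 3.0f;  d[i] = src[i] + 4.0f;
    }
    for (int i = 0; i < n; i++) {        /* second loop: remaining two arrays */
        e[i] = src[i] + 5.0f;  f[i] = src[i] + 6.0f;
    }
}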
Write combining buffers are used for stores of all memory types. They are particularly important for writes to
uncached memory: writes to different parts of the same cache line can be grouped into a single, full-cache-line bus
transaction instead of going across the bus (since they are not cached) as several partial writes. Avoiding partial writes
can have a significant impact on bus bandwidth-bound graphics applications, where graphics buffers are in uncached
memory. Separating writes to uncached memory and writes to writeback memory into separate phases can assure
that the write combining buffers can fill before getting evicted by other write traffic. Eliminating partial write
transactions has been found to have performance impact on the order of 20% for some applications. Because the
cache lines are 64 bytes, a write to the bus for 63 bytes will result in partial bus transactions.
When coding functions that execute simultaneously on two threads, reducing the number of writes that are allowed
in an inner loop will help take full advantage of write-combining store buffers. For write-combining buffer
recommendations for Intel® Hyper-Threading Technology (Intel® HT), see Chapter 11.
Store ordering and visibility are also important issues for write combining. When a write to a
write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a
subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line.
Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been
serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining,
there will be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see
Chapter 9, “Optimizing Cache Usage”
Example 3-47. Using Non-Temporal Stores and 64-byte Bus Write Transactions
#define STRIDESIZE 256
lea ecx, p64byte_Aligned
mov edx, ARRAY_LEN
xor eax, eax
slloop:
movntps XMMWORD ptr [ecx + eax], xmm0
movntps XMMWORD ptr [ecx + eax+16], xmm0
movntps XMMWORD ptr [ecx + eax+32], xmm0
movntps XMMWORD ptr [ecx + eax+48], xmm0
; 64 bytes is written in one bus transaction
add eax, STRIDESIZE
cmp eax, edx
jl slloop
3.7 PREFETCHING
Recent Intel processor families employ several prefetching mechanisms to accelerate the movement of data or code
and improve performance:
• Hardware instruction prefetcher.
• Software prefetch for data.
• Hardware prefetch for cache lines of data or instructions.
do_some_work_1: do_some_work_1:
add eax, eax add eax, eax
and eax, 6 and eax, 6
sub ecx, 1 sub ecx, 1
jnz do_some_work_1 jnz do_some_work_1
mov eax, [ebx+64] mov eax, [ebx+64]
mov ecx, 30 mov ecx, 30
do_some_work_2: do_some_work_2:
add eax, eax add eax, eax
and eax, 6 and eax, 6
sub ecx, 1 sub ecx, 1
jnz do_some_work_2 jnz do_some_work_2
The additional instructions to load data from one member in the modified sequence can trigger the DCU hardware
prefetch mechanisms to prefetch data in the next cache line, enabling the work on the second member to complete
sooner.
Software can gain from the first-level data cache prefetchers in two cases:
• If data is not in the second-level cache, the first-level data cache prefetcher enables early trigger of the second-
level cache prefetcher.
• If data is in the second-level cache and not in the first-level data cache, then the first-level data cache prefetcher
triggers earlier data bring-up of sequential cache line to the first-level data cache.
There are situations where software should pay attention to a potential side effect of triggering unnecessary DCU
hardware prefetches. If a large data structure with many members spanning many cache lines is accessed in a way that
only a few of its members are actually referenced, but there are multiple pairs of accesses to the same cache line, the
DCU hardware prefetcher can trigger fetching of cache lines that are not needed. In Example 3-50, references to
the “Pts” array and “AltPts” will trigger DCU prefetch to fetch additional cache lines that won’t be needed. If
significant negative performance impact is detected due to DCU hardware prefetch on a portion of the code, software
can try to reduce the size of that contemporaneous working set to be less than half of the L2 cache.
Example 3-50. Avoid Causing DCU Hardware Prefetch to Fetch Unneeded Lines
while ( CurrBond != NULL )
{
MyATOM *a1 = CurrBond->At1 ;
MyATOM *a2 = CurrBond->At2 ;
a2->AuxPts[0].x += ux ;
a2->AuxPts[0].y += uy ;
a2->AuxPts[0].z += uz ;
CurrBond = CurrBond->Next ;
};
To fully benefit from these prefetchers, organize and access the data using one of the following methods:
Method 1:
• Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
• Access the data in constant strides, forward or backward (IP prefetcher).
Method 2:
• Organize the data in consecutive lines.
• Access the data in increasing addresses, in sequential cache lines.
Example 3-51 demonstrates accesses to sequential cache lines that can benefit from the first-level cache prefetcher.
By elevating the load operations from memory to the beginning of each iteration, it is likely that a significant part of
the latency of the pair cache line transfer from memory to the second-level cache will be in parallel with the transfer
of the first cache line.
The IP prefetcher uses only the lower 8 bits of the address to distinguish a specific address. If the code size of a loop is
bigger than 256 bytes, two loads may appear similar in the lowest 8 bits and the IP prefetcher will be restricted.
Therefore, if you have a loop bigger than 256 bytes, make sure that no two loads have the same lowest 8 bits in order
to use the IP prefetcher.
string of doublewords. To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a
count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.
• When N is less than half the size of last level cache, throughput consideration may favor either:
— An approach using a REP string with the largest data granularity because a REP string has little overhead for
loop iteration, and the branch misprediction overhead in the prolog/epilogue code to handle address
alignment is amortized over many iterations.
— An iterative approach using the instruction with largest data granularity, where the overhead for SIMD
feature detection, iteration overhead, and prolog/epilogue for alignment control can be minimized. The
trade-off between these approaches may depend on the microarchitecture.
— An example of MEMSET() implemented using stosd for arbitrary counter value with the destination address
aligned to doubleword boundary in 32-bit mode is shown in Example 3-52.
• When N is larger than half the size of the last level cache, using 16-byte granularity streaming stores with
prolog/epilog for address alignment will likely be more efficient, if the destination addresses will not be
referenced immediately afterwards.
Example 3-52. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination

A 'C' example of Memset():
void memset(void *dst, int c, size_t size)
{
    char *d = (char *)dst;
    size_t i;
    for (i = 0; i < size; i++)
        *d++ = (char)c;
}

Equivalent Implementation Using REP STOSD:
push edi
movzx eax, byte ptr [esp+12]
mov ecx, eax
shl ecx, 8
or ecx, eax
mov eax, ecx
shl ecx, 16
or eax, ecx
mov edi, [esp+8]   ; 4-byte aligned
mov ecx, [esp+16]  ; byte count
shr ecx, 2         ; do dword
cmp ecx, 127
jle _main
test edi, 4
jz _main
stosd              ; peel off one dword
dec ecx
_main:             ; 8-byte aligned
rep stosd
mov ecx, [esp + 16]
and ecx, 3         ; do count <= 3
rep stosb          ; optimal with <= 3
pop edi
ret
Memory routines in the runtime library generated by Intel compilers are optimized across a wide range of address
alignments, counter values, and microarchitectures. In most cases, applications should take advantage of the default
memory routines provided by Intel compilers.
In some situations, the byte count of the data is known by the context (as opposed to being known by a parameter
passed from a call), and one can take a simpler approach than those required for a general-purpose library routine.
For example, if the byte count is also small, using REP MOVSB/STOSB with a count less than four can ensure good
address alignment and loop-unrolling to finish the remaining data; using MOVSD/STOSD can reduce the overhead
associated with iteration.
Using a REP prefix with string move instructions can provide high performance in the situations described above.
However, using a REP prefix with string scan instructions (SCASB, SCASW, SCASD, SCASQ) or compare instructions
(CMPSB, CMPSW, CMPSD, CMPSQ) is not recommended for high performance. Consider using SIMD instructions
instead.
[Figure 3-3: memcpy performance chart; y-axis in cycles (0 to 120), x-axis is copy length in bytes.]
Figure 3-3 depicts the relative performance of memcpy implementation on a third-generation Intel Core processor
using Enhanced REP MOVSB and STOSB versus REP MOVSD+B, for alignment conditions when both the source and
destination addresses are aligned to a 16-Byte boundary and the source region does not overlap with the destination
region. Using Enhanced REP MOVSB and STOSB always delivers better performance than using REP MOVSD+B. If the
length is a multiple of 64, it can produce even higher performance. For example, copying 65-128 bytes takes 40 cycles,
while copying 128 bytes needs only 35 cycles.
If an application wishes to bypass standard memcpy library implementation with its own custom implementation and
have freedom to manage the buffer length allocation for both source and destination, it may be worthwhile to
manipulate the lengths of its memory copy operation to be multiples of 64 to take advantage of the code size and
performance benefit of Enhanced REP MOVSB and STOSB.
The performance characteristic of implementing a general-purpose memcpy library function using a SIMD register is
significantly more colorful than an equivalent implementation using a general-purpose register, depending on length,
instruction set selection between SSE2, 128-bit AVX, 256-bit AVX, relative alignment of source/destination, and
memory address alignment granularities/boundaries, etc.
Hence comparing performance characteristics between a memcpy using Enhanced REP MOVSB and STOSB versus a
SIMD implementation is highly dependent on the particular SIMD implementation. The remainder of this section
discusses the relative performance of memcpy using Enhanced REP MOVSB and STOSB versus unpublished,
optimized 128-bit AVX implementation of memcpy to illustrate the hardware capability of Ivy Bridge
microarchitecture.
Table 3-5. Relative Performance of Memcpy() Using Enhanced REP MOVSB and STOSB Vs. 128-bit AVX
Range of Lengths (bytes) | <128 | 128 to 2048 | 2048 to 4096
Memcpy_ERMSB / Memcpy_AVX128 | 0.7X | 1X | 1.02X
Table 3-5 shows the relative performance of the Memcpy function implemented using enhanced REP MOVSB versus
128-bit AVX for several ranges of memcpy lengths, when both the source and destination addresses are 16-byte
aligned and the source region and destination region do not overlap. For memcpy length less than 128 bytes, using
Enhanced REP MOVSB and STOSB is slower than what’s possible using 128-bit AVX, due to internal start-up overhead
in the REP string.
For situations with address misalignment, memcpy performance will generally be reduced relative to the 16-byte
alignment scenario (see Table 3-6).
Memcpy() implemented with Enhanced REP MOVSB and STOSB can benefit further from the 256-bit SIMD integer
data-path in Haswell microarchitecture. See Section 15.16.3.
When the destination buffer is 16-byte aligned, memset() using Enhanced REP MOVSB and STOSB can perform better
than SIMD approaches. When the destination buffer is misaligned, memset() performance using Enhanced REP
MOVSB and STOSB can degrade about 20% relative to aligned case, for processors based on Ivy Bridge
microarchitecture. In contrast, SIMD implementation of memset() will experience smaller degradation when the
destination is misaligned.
Memset() implemented with Enhanced REP MOVSB and STOSB can benefit further from the 256-bit data path in
Haswell microarchitecture. see Section 15.16.3.3.
User/Source Coding Rule 10. (H impact, ML generality) Make sure your application stays in range to avoid
denormal values and underflows.
Out-of-range numbers cause very high overhead.
When converting floating-point values to 16-bit, 32-bit, or 64-bit integers using truncation, the instructions
CVTTSS2SI and CVTTSD2SI are recommended over instructions that access x87 FPU stack. This avoids changing the
rounding mode.
User/Source Coding Rule 11. (M impact, ML generality) Usually, math libraries take advantage of the
transcendental instructions (for example, FSIN) when evaluating elementary functions. If there is no critical need to
evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an
alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is
possible to improve transcendental performance with these techniques by choosing the desired numeric precision
and the size of the look-up table, and by taking advantage of the parallelism of the Intel SSE and the Intel SSE2
instructions.
Refer to Chapter 4 of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 for definitions of
overflow, underflow and denormal exceptions.
Denormalized floating-point numbers impact performance in two ways:
• Directly, when they are used as operands.
• Indirectly, when they are produced as a result of an underflow situation.
If a floating-point application never underflows, the denormals can only come from floating-point constants.
User/Source Coding Rule 12. (H impact, ML generality) Denormalized floating-point constants should be avoided as
much as possible.
Denormal and arithmetic underflow exceptions can occur during the execution of x87 instructions or Intel SSE/Intel
SSE2/Intel SSE3 instructions. Processors based on Intel NetBurst microarchitecture handle these exceptions more
efficiently when executing Intel SSE/Intel SSE2/Intel SSE3 instructions and when speed is more important than
complying with the IEEE standard. The following paragraphs give recommendations on how to optimize your code to
reduce performance degradations related to floating-point exceptions.
1. “IEEE Standard for Floating-Point Arithmetic,” in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-
84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
the only denormal floating-point numbers that can be encountered in FTZ mode are the ones specified as constants
(read only).
The DAZ mode is provided to handle denormal source operands efficiently when running a SIMD floating-point
application. When the DAZ mode is enabled, input denormals are treated as zeros with the same sign. Enabling the
DAZ mode is the way to deal with denormal floating-point constants when performance is the objective.
If departing from the IEEE 754 specification is acceptable and performance is critical, run Intel SSE/Intel SSE2/Intel
SSE3/Intel AVX/Intel AVX2/Intel AVX-512 applications with FTZ and DAZ modes enabled.
NOTE
The DAZ mode is available with both the Intel SSE and Intel SSE2 extensions, although the speed
improvement expected from this mode is fully realized only in SSE code and later.
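One common way to enable these modes from C, shown as a sketch using the MXCSR helper macros provided with the SSE intrinsic headers, is:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Enable FTZ and DAZ in MXCSR before entering performance-critical SIMD
   floating-point code; this departs from IEEE 754 as described above. */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* underflowed results flushed to zero */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* denormal inputs treated as zero     */
}

MXCSR is per-thread state, so the setting should be applied in each thread that executes the SIMD floating-point code.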
without ever needing to change rounding modes. The cost savings of using these instructions over the methods
below is enough to justify using Intel SSE and Intel SSE2 wherever possible when truncation is involved.
For x87 floating-point, the FIST instruction uses the rounding mode represented in the floating-point control word
(FCW). The rounding mode is generally “round to nearest”, so many compiler writers implement a change in the
rounding mode in the processor in order to conform to the C and FORTRAN standards. This implementation requires
changing the control word on the processor using the FLDCW instruction. For a change in the rounding, precision, and
infinity bits, use the FSTCW instruction to store the floating-point control word. Then use the FLDCW instruction to
change the rounding mode to truncation.
In a typical code sequence that changes the rounding mode in the FCW, a FSTCW instruction is usually followed by a
load operation. The load operation from memory should be a 16-bit operand to prevent store-forwarding problem. If
the load operation on the previously-stored FCW word involves either an 8-bit or a 32-bit operand, this will cause a
store-forwarding problem due to mismatch of the size of the data between the store operation and the load
operation.
To avoid store-forwarding problems, make sure that the write and read to the FCW are both 16-bit operations.
If there is more than one change to the rounding, precision, and infinity bits, and the rounding mode is not important
to the result, use the algorithm in Example 3-53 to avoid synchronization issues, the overhead of the FLDCW
instruction, and having to change the rounding mode. Note that the example suffers from a store-forwarding problem
which will lead to a performance penalty. However, its performance is still better than changing the rounding,
precision, and infinity bits among more than two values.
Assembly/Compiler Coding Rule 53. (H impact, L generality) Minimize the number of changes to the rounding
mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a total of
more than two values of the set of rounding, precision, and infinity bits.
3.9.3.2 Precision
If single precision is adequate, use it instead of double precision. This is true because:
• Single precision operations allow the use of longer SIMD vectors, since more single precision data elements can
fit in a register.
• If the precision control (PC) field in the x87 FPU control word is set to single precision, the floating-point divider
can complete a single-precision computation much faster than either a double-precision computation or an
extended double-precision computation. If the PC field is set to double precision, this will enable those x87 FPU
operations on double-precision data to complete faster than extended double-precision computation. These
characteristics affect computations including floating-point divide and square root.
Assembly/Compiler Coding Rule 54. (H impact, L generality) Minimize the number of changes to the precision
mode.
• The cost of converting from floating-point to integer with truncation is significantly lower with Intel SSE and Intel
SSE2 in the processors based on Intel NetBurst microarchitecture than with either changes to the rounding mode
or the sequence prescribed in Example 3-53.
Assembly/Compiler Coding Rule 55. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD
Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their x87
counterparts, and they eliminate the overhead associated with the management of the x87 register stack.
Table 3-7. Intel Processor CPU RP Device IDs for Processors Optimizing PCIe Performance
Processor CPU RP Device IDs
Intel® Xeon® processors based on Broadwell microarchitecture 6F01H-6F0EH
Intel® Xeon® processors based on Haswell microarchitecture 2F01H-2F0EH
Here are the PMU event id and Umask for the 2 CHA events that are very useful for detecting contention:
• Phys_addr_match event: Event id: 0x19, Umask: 0x80
• CHA_clockticks event: Event id: 0x01, Umask: 0x01
These events have to be measured on a per-CHA basis. If the ratio of phys_addr_match counts to CHA_clockticks
counts is more than 0.15 on any CHA, that indicates more than 30% of that CHA's cycles (2x the ratio, as this event can
count only once every two cycles) are spent with multiple requests outstanding to the same address.
The Recipe to Measure Events with Linux perf:
Once confirmed that the ratio of phys_addr_match events to the CHA clockticks is more than 0.15, the next step is
figuring out where this may be happening in the code. Intel CPUs provide a PMU mechanism wherein a load operation
is randomly selected and tracked through completion, and the true latency is recorded if it is over a given threshold.
The threshold value is specified in cycles and must be a power of 2. The following “perf mem record”
command samples all loads that take more than 128 cycles to complete.
Once the above data is collected, execute the following command to process the data collected:
Information similar to the table below will be generated. Such information will include details on hot loads along with
data linear address and the actual latency that the load experienced. This can be used to identify the necessary fixes
to the code.
[Sample “perf mem” report output. Columns include Overhead, Samples, Local Weight, Data Symbol, Data Object, Shared Object, Symbol, TLB Access, Snoop, Locked, and Blocked. The hot loads resolve to the asm_mutex data symbol in the lockcontention binary (data object [heap]), with HitM snoops and recorded load latencies in the tens of thousands of cycles (e.g., 40411 and 36652).]
lock_loop:
while (lock is not free) // just a load operation
execute pause;
Additionally, as the core counts continue to increase, exploring other algorithmic fixes that dissolve or reduce
contention on memory variables (including locks) is essential. For example, instead of frequently updating a hot
statistical variable from all threads, consider updating a copy of it per thread (without contention) and later aggregate
the updated per-thread copies on a less frequent basis or use some existing atomic-free concurrency methods such as
rseq1. As another example, restructure locking algorithms to use hierarchical locking when excessive contention is
detected on a global lock.
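A hedged C11 sketch of the per-thread-copy approach (illustrative names; the aggregation interval is left to the application):

#include <stdint.h>
#include <stdatomic.h>

#define MAX_THREADS 64
#define CACHE_LINE  64

/* Each thread increments only its own padded slot (no shared line, no
   contended atomic); a reader aggregates the slots far less frequently. */
struct stat_slot {
    _Atomic uint64_t count;
    char pad[CACHE_LINE - sizeof(_Atomic uint64_t)];
};
static struct stat_slot per_thread_stats[MAX_THREADS];

void stat_bump(int tid)                     /* hot path */
{
    atomic_fetch_add_explicit(&per_thread_stats[tid].count, 1,
                              memory_order_relaxed);
}

uint64_t stat_total(void)                   /* cold path: occasional aggregation */
{
    uint64_t total = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        total += atomic_load_explicit(&per_thread_stats[i].count,
                                      memory_order_relaxed);
    return total;
}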
The baseline case (blue line) showed a sharp throughput drop with increased thread count, as all cores reduced their
throughput while suffering from an increasing percentage of Fast Asserts. With the same work distributed across
instances (red line), Fast Asserts dropped. Similarly, with a software fix (gray line), the Fast Asserts again dropped even
though only one instance was in execution.
1. https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2
2. The most current version is MariaDB 10.3.39
unnecessary cache invalidations and the resulting traffic between caches. By reducing false sharing, the HITM counter
can be reduced, leading to better performance and scalability in multi-threaded programs.
Steps for perf c2c analysis1:
1. Collect perf c2c data on the target system (this example is for the full system):
perf c2c record -a -u --ldlat 50 -- sleep 30
2. Generate the report (this can take considerable time to process):
perf c2c report -NN -g --call-graph --full-symbols -c pid,iaddr --stdio > perf_report.txt
3. Check the generated perf_report.txt for the “Shared Data Cache Line Table” (see Table 3-9). This table is sorted by
HITM. Pay attention to the top “CacheLine address” entries. See Example 3-55.
4. Read perf_report.txt for the “Shared Cache Line Distribution Pareto” (see Table 3-10). Check the “Offset”
column to see if there are multiple offsets within a single cache line. If there are, that points to a
potential false sharing issue. See Example 3-56.
next := atomic.Load64(&node.next)
6290 425fb6: mov (%rcx),%rdx // lfstack.go.48
425fb9: lea 0x87fb88(%rip),%rbx
9703 425fc0: lock cmpxchg %rdx,(%rbx) // lfstack.go.49
425fc5: sete %dl
425fc8: test %dl,%dl
425fca: je 425f9e <runtime.getempty+0x19e>
425fcc: jmp 425fde <runtime.getempty+0x1d0>
425fce: xor %ecx,%ecx
src/runtime/mgc.go
@@ -285,8 + @@ func pollFractionalWorkerExit() bool {
CPI 0.84
core cycles than on the previous generation Sunny Cove CPU microarchitecture for the Ice Lake version of the 3rd
Generation of Intel Xeon Scalable processors. The higher core cycles are due to the execution of additional micro-
operations.
Table 3-12. Instruction Sequence Mixing VEX on the Sapphire Rapids and Ice Lake Server Microarchitectures
Intel Assembly Code Syntax | Ice Lake Server Microarchitecture (Sunny Cove Cores) | Sapphire Rapids Microarchitecture (Golden Cove Cores)
1. Using upstream perf. If OS doesn’t have support for the event use
cpu/event=0xc1,umask=0x10,name=assists_sse_avx_mix/
a. With this change, the Core Cycles do not have a performance inversion relative to the previous generation.
Table 3-13. Fixed Instruction Sequence with Improved Performance on Sapphire Rapids Microarchitecture
Intel Assembly Code Syntax | Ice Lake Microarchitecture (Sunny Cove Cores): Inst Retired, Core Cycles | Sapphire Rapids Microarchitecture (Golden Cove Cores): Inst Retired, Core Cycles | ASSISTS.SSE_AVX_MIX
VPXOR XMM3, XMM3, XMM3; VEXTRACTI128 XMM3, YMM3, 1; PXOR XMM3, XMM3 | 4.00, 2.00 | 4.00, 1.00 | 0
Table 3-14. WordPress/PHP Case Study: With and Without a 2GB Fix for Branch Misprediction
 | WP4.2 / PHP7.4.29 - NO FIX | WP4.2 / PHP7.4.29 - 2G FIX in Glibc | 2G FIX / NO FIX
Config (Workers) | 8c x 42 | 8c x 42 | -
Config (Cores Per Socket) | 56 | 56 | 1.00
Config (Sockets) | 2 | 2 | 1.00
3.13 SYNCHRONIZATION
Synchronous applications:
• When two hardware threads from the same core use user-level monitor and user-level MWAIT, they can progress
effectively, since some of the hardware resources become available to the other thread when one hyperthread issues
a user-level MWAIT.
To achieve the best performance using user-level monitor and user-level MWAIT:
• The entire contents of monitored locations must be verified after user-level MWAIT to avoid a false wake-up.
• It is the developer’s responsibility to check the contents of monitored locations:
— Before issuing monitor.
— Before issuing user-level MWAIT.
— After user-level MWAIT. See Example 3-59.
• If an application expects a store to a monitored location, the timeout value should be as high as is supported.
Since user-level MWAIT and TPAUSE are hints to the processor, a user should selectively identify the locations in the
application where they are applied.
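A minimal sketch of this check/monitor/re-check sequence using the WAITPKG intrinsics (illustrative names; assumes compiler and processor support for UMONITOR/UMWAIT, compiled with WAITPKG enabled, and a deadline expressed as a TSC value):

#include <immintrin.h>   /* _umonitor, _umwait (WAITPKG) */
#include <stdint.h>

static volatile uint32_t wake_flag;   /* hypothetical flag written by another thread */

/* Wait until wake_flag becomes nonzero or the TSC deadline expires,
   re-checking the monitored location before and after each UMWAIT. */
void wait_for_flag(uint64_t tsc_deadline)
{
    while (wake_flag == 0) {               /* check before arming the monitor   */
        _umonitor((void *)&wake_flag);     /* arm address monitoring            */
        if (wake_flag != 0)                /* re-check before going to sleep    */
            break;
        _umwait(0, tsc_deadline);          /* ctrl=0 selects C0.2; returns on a
                                              write, the deadline, or other wake
                                              events                            */
    }                                      /* loop re-checks: wake-ups may be false */
}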
CHAPTER 4
INTEL ATOM® PROCESSOR ARCHITECTURES
This chapter gives an overview of features relevant to software optimization for current generations of Intel Atom®
processors.
[Figure: Crestmont microarchitecture block diagram: dual IP queues feeding fetch (128B) from a 64KB instruction cache, dual μop queues, 6-wide allocation/rename, memory and vector/floating-point execution, and a shared L2 cache.]
The Crestmont microarchitecture supports flexible integration of multiple processor cores with a shared un-core
subsystem consisting of a number of components including a ring interconnect to multiple slices of L3, processor
graphics, integrated memory controller, interconnect fabrics, and more.
[Figure: Crestmont front end: dual IP queues feeding fetch (128B) from the 64KB instruction cache into dual μop queues.]
Each cycle, the predicted IP is sent down the instruction fetch pipeline. These predictions can look up the Instruction
TLB (ITLB) and the instruction cache tag to determine the physical address and instruction cache hit or miss. Upon
successful translation, and depending on resource availability, these accesses are stored into the instruction pointer
(IP) queues. This enables decoupling of the instruction cache hit/miss handling from the delivery of raw instruction
bytes to the rest of the front end. In the case of an instruction cache miss, the IP queue holds the address but signals
that the data cannot be read until it is returned from the memory subsystem. The stream of IPs generated at fetch can
handle up to eight concurrent instruction cache misses. There are two independent IP queues, each with its own
instruction data buffers.
Combined with their associated decoders, these are referred to as clusters. For each taken branch or inserted toggle
point, the prediction will toggle back and forth between each IP queue and cluster. This toggling enables out-of-order
decode, which is the key feature that enables this microarchitecture to fetch and decode up to 6 variable length x86
instructions per cycle.
Performance debug of prediction or fetch can be done utilizing the front-end bound events in the top-down category
of performance monitoring events1. Front-end bound events count slots at allocation only when slots are available,
but no μops are present. If bubbles caused by the three-cycle predictor percolate to allocation, for example, these will
be represented by TOPDOWN_FE_BOUND.BRANCH_RESTEER. You can precisely tag the instruction following such a
bubble via FRONTEND_RETIRED.BRANCH_RESTEER. If the predictor failed to cache a branch target and redirection
occurred during decode, those slots are counted by TOPDOWN_FE_BOUND.BRANCH_DETECT. If μops are not
delivered due to misses in the Instruction Cache or Instruction TLB, these appear as TOPDOWN_FE_BOUND.ICACHE
and TOPDOWN_FE_BOUND.ITLB, respectively. Like BRANCH_RESTEER, all front-end bound slot-based accounting can
be tracked precisely via the corresponding FRONTEND_RETIRED set of events. The instruction code can often be
rearranged to optimize such a bottleneck away. Multiple event classes can be tracked simultaneously (e.g., mark both
ICACHE and ITLB events) on the same general-purpose performance counter or with different events across multiple
performance counters.
Sometimes, a code loop is too short and/or poorly aligned within the cache to enable the machine to decode
sufficiently fast. In this situation you could be fetching every cycle and never inserting bubbles, but still unable to keep
the back-end fed. When this happens, the event class that detects this is TOPDOWN_FE_BOUND.OTHER. The “other”
event class catches front-end bound behavior that cannot be pinpointed to other specific sources.
1. Refer to Chapter 6, “Earlier Generations of Intel Atom® Microarchitecture and Software Optimization” in the Intel® 64 and IA-
32 Architectures Optimization Reference Manual Documentation Volume 2: Earlier Generations of Intel® 64 and IA-32 Pro-
cessor Architectures, Throughput, and Latency.
[Figure: Crestmont out-of-order engine: 6-wide allocation/rename feeding the memory and vector/floating-point execution clusters.]
Allocation delivers μops to three types of structures. Each μop is written into one or more of five reservation stations
for pure integer operations. These hold instructions, track their dependencies, and schedule them for execution.
• Four are for ALU operations, labeled ports 00 to 03.
There are three independent L1 prefetchers. The first does a simple next-line fetch on DL1 load misses. The second is an
instruction pointer-based prefetcher capable of detecting striding access patterns of various sizes; this prefetcher works
in the linear address space, so it can cross page boundaries and start translations for TLB misses. The final
prefetcher is a next-page prefetcher that detects accesses likely to cross a page boundary and starts the access early.
L1 data misses generated by these prefetchers communicate additional information to the L2 prefetchers, which
helps them work together.
The L2 cache delivers 64 bytes of data per cycle at a latency of 17 cycles, and that bandwidth is shared amongst 4
cores. The L2 cache subsystem also contains multiple prefetchers, including a streaming prefetcher that detects
striding access patterns. An additional L2 prefetcher attempts to detect more complicated access patterns. These
prefetches can also be generated such that they only fill the LLC but do not fill into the L2 to help reduce DRAM
latency.
The L2 cache subsystem of a single 4-core module can have 64 requests and 32 L2 data evictions outstanding on the
fabric. To ensure fairness, these are competitively shared amongst the cores with per-core reservations.
4.1.8.2 AVX-IFMA
AVX-IFMA includes two instructions, VPMADD52LUQ and VPMADD52HUQ. They are designed to accelerate Big
Integer Arithmetic (BIA). These instructions can multiply eight 52-bit unsigned integers residing in YMM registers,
produce the low (VPMADD52LUQ) and high (VPMADD52HUQ) halves of the 104-bit products, and add the results to
64-bit accumulators (i.e., SIMD elements), placing them in the destination register.
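The arithmetic performed by each 64-bit lane can be modeled in scalar C as follows (a sketch using the GCC/Clang unsigned __int128 extension; the helper names are illustrative):

#include <stdint.h>

/* Scalar model of one lane: multiply the low 52 bits of b and c to form a
   104-bit product, then add either its low or its high 52 bits to the 64-bit
   accumulator a. */
static inline uint64_t madd52lo(uint64_t a, uint64_t b, uint64_t c)
{
    const uint64_t mask52 = (1ULL << 52) - 1;
    unsigned __int128 p = (unsigned __int128)(b & mask52) * (c & mask52);
    return a + (uint64_t)(p & mask52);          /* VPMADD52LUQ: low 52 bits  */
}

static inline uint64_t madd52hi(uint64_t a, uint64_t b, uint64_t c)
{
    const uint64_t mask52 = (1ULL << 52) - 1;
    unsigned __int128 p = (unsigned __int128)(b & mask52) * (c & mask52);
    return a + (uint64_t)(p >> 52);             /* VPMADD52HUQ: high 52 bits */
}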
operand dependencies across two cycles. The dependent second μop executes the 256-bit operation by using a single
128-bit execution port for two consecutive cycles with a five-cycle latency for a total latency of seven cycles.
• VPERM2I128 ymm1, ymm2, ymm3/m256, imm8
• VPERM2F128 ymm1, ymm2, ymm3/m256, imm8
• VPERMPD ymm1, ymm2/m256, imm8
• VPERMPS ymm1, ymm2, ymm3/m256
• VPERMD ymm1, ymm2, ymm3/m256
• VPERMQ ymm1, ymm2/m256, imm8
• Dual generic load and store execution pipes capable of two loads, two stores, or one load and store per cycle.
• Dedicated integer and vector integer/floating point store data ports.
• New and improved cryptography.
— New Galois-field instructions (GFNI).
— Dual AES units.
— Enhanced SHA-NI implementation.
— Faster PCLMULQDQ.
• Support for user-level low-power and low-latency spin-loop instructions UMWAIT/UMONITOR and TPAUSE.
[Figure: Tremont microarchitecture block diagram: dual IP queues feeding fetch from a 32KB instruction cache, dual μop queues, 6-wide allocation/rename, integer/memory/vector execution ports (ALU, JMP, AGU, store data, AES, SHA, FADD, FMUL, FDIV, IMUL, DIV), a 32KB data cache, the L2 queue, and up to 4.5MB of L2 cache.]
The Tremont microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-
system consisting of several components, including a ring interconnect to multiple slices of L3, processor graphics,
integrated memory controller, interconnect fabrics, and more.
Tremont microarchitecture has a 32B predict pipeline that feeds dual 3-wide decode clusters capable of 6 instruction
decode per cycle. Each cluster can access a banked 32KB instruction cache at 16B/cycle for a maximum of 32B/cycle.
Due to differences in the number of instructions per block and other decode latency differences, younger blocks of
code can decode before older blocks. At the end of each decode cluster is a queue of decoded instructions (µop
queue).
The allocation and rename pipeline reads both µop queues in parallel and puts the instruction stream back in order
for register renaming and resource allocation. Whereas increasing decode width for x86 traditionally requires
exponential resources and triggers efficiency loss, clustering allows for x86 decode to be built with linear resources
and little efficiency loss.
As the clustering algorithm is dependent on the ability to predict taken branches within the branch predictor, very
long assembly sequences that lack taken branches (long unrolled code utilizing the floating point unit, for example)
can be bottlenecked due to being unable to utilize both decode clusters simultaneously. Inserting unconditional JMP
instructions to the next sequential instruction pointer at intervals between 16 to 32 instructions may relieve this
bottleneck if encountered.
While Tremont microarchitecture did not build a dynamic mechanism to load balance the decode clusters, future
generations of Intel Atom processors will include hardware to recognize and mitigate these cases without the need
for explicit insertions of taken branches into the assembly code.
In addition to the novel clustered decode scheme, Tremont microarchitecture enhanced the branch predictor and
doubled the size of the L2 Predecode cache from 64KB on the Goldmont Plus microarchitecture to 128 KB.
The low level characteristics of the microarchitecture within each decode cluster remain the same as in the Goldmont
Plus microarchitecture. For example, instructions should avoid more than 4 Bytes of prefixes and escapes.
Table 4-2 summarizes the OOO engine's capability to dispatch different types of operations to ports.
Table 4-2. Dispatch Port and Execution Stacks of the Tremont Microarchitecture
Port 00 (INT): ALU, LEA1, Shift, IMUL, IDIV, POPCNT, CRC32
Port 01 (INT): ALU, LEA2, Bit Ops
Port 02 (INT): ALU, LEA3
Port 08 (INT): JUMP
Port 09 (INT): Store Data
Port 10 (INT): Load, Store Address
Port 11 (INT): Load, Store Address
Port 20 (FP/VEC): ALU, AES, SHA-RND, FMUL, FDIV, Shuffle, Shift, SIMUL, GFNI, Converts
Port 21 (FP/VEC): ALU, AES, SHA-MSG, FADD, Shuffle
Port 29 (FP/VEC): Store Data
NOTES:
1. LEAs without a scaled index and only two sources (among base, index, and displacement inputs) execute as one
operation on any ALU port (00, 01, or 02).
2. LEAs with three sources fracture into two operations and take an additional cycle of latency. Index consuming por-
tion, regardless of scale value, will bind to port 02 while second operation binds to either port 00 or 01.
3. LEAs with a scaled index but without a displacement execute as one operation on port 02.
The TLB hierarchy consists of a dedicated level one TLB for instruction cache and data cache with a shared second-
level TLB for all page translations.
NOTES:
1. The first level instruction TLB (ITLB) caches small and large page translations but large pages are cached as 256KB
regions per ITLB entry.
2. The first level data TLB (uTLB) caches small and large page translations but large pages are fully fractured into
4KB regions per uTLB entry.
CHAPTER 5
CODING FOR SIMD ARCHITECTURES
• Processors based on Intel Core microarchitecture support MMX™, Intel® SSE, Intel® SSE2, Intel® SSE3, and Intel®
SSSE3.
• Processors based on Enhanced Intel Core microarchitecture support MMX, Intel SSE, Intel SSE2, Intel SSE3, Intel
SSSE3, and Intel SSE4.1.
• Processors based on Westmere microarchitecture support MMX, Intel SSE, Intel SSE2, Intel SSE3, Intel SSSE3, Intel
SSE4.1, Intel SSE4.2, and AESNI.
• Processors based on Sandy Bridge microarchitecture support MMX, Intel SSE, Intel SSE2, Intel SSE3, Intel SSSE3,
Intel SSE4.1, Intel SSE4.2, AESNI, PCLMULQDQ, and Intel® AVX.
• Intel® Core™ Solo and Intel® Core™ Duo processors support MMX, Intel SSE, Intel SSE2, and Intel SSE3.
Single-instruction, multiple-data (SIMD) technologies enable the development of advanced multimedia, signal
processing, and modeling applications.
SIMD techniques can be applied to text/string processing, lexing and parser applications. This is covered in Chapter
14, “Intel® SSE4.2 and SIMD Programming For Text-Processing/Lexing/Parsing.” Techniques for optimizing AESNI are
discussed in Section 6.10.
To take advantage of the performance opportunities presented by these capabilities, do the following:
• Ensure that the processor supports MMX technology, Intel SSE, Intel SSE2, Intel SSE3, Intel SSSE3, and Intel
SSE4.1.
• Ensure that the operating system supports MMX technology and Intel SSE (OS support for Intel SSE2, Intel SSE3
and Intel SSSE3 is the same as OS support for Intel SSE).
• Employ the optimization and scheduling strategies described in this book.
• Use stack and data alignment techniques to keep data properly aligned for efficient memory use.
• Utilize the cacheability instructions offered by Intel SSE and Intel SSE2, where appropriate.
Example 5-3 shows how to find the SSE2 feature bit (bit 26) in the CPUID feature flags.
Software must check for support of MONITOR and MWAIT before attempting to use MONITOR and MWAIT. Detecting
the availability of MONITOR and MWAIT can be done using a code sequence similar to Example 5-4. The availability of
MONITOR and MWAIT is indicated by bit 3 of the returned value in ECX.
Example 5-5 shows how to find the Intel SSSE3 feature bit in the CPUID feature flags.
1. If CPUID.01H:ECX.OSXSAVE reports 1, it also indirectly implies that the processor supports XSAVE, XRSTOR, XGETBV, and the
processor extended state bit vector XFEATURE_ENABLED_MASK register. Thus an application may streamline the
checking of CPUID feature flags for XSAVE and OSXSAVE. XSETBV is a privileged instruction.
[Flowchart: if the OS provides processor extended state management (CPUID.01H:ECX.OSXSAVE = 1), hardware support for XSAVE, XRSTOR, XGETBV, and XFEATURE_ENABLED_MASK is implied.]
The following pseudocode illustrates the recommended application-level Intel AVX detection process:
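A C sketch of this detection sequence follows; the __get_cpuid and _xgetbv helpers are toolchain-specific assumptions (GCC and Clang provide them in <cpuid.h> and <immintrin.h>), and the routine is illustrative rather than a definitive implementation:
#include <cpuid.h>       /* __get_cpuid (GCC/Clang) */
#include <immintrin.h>   /* _xgetbv                 */
int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    /* Require CPUID.1:ECX.OSXSAVE[bit 27] and CPUID.1:ECX.AVX[bit 28]. */
    unsigned int mask = (1u << 27) | (1u << 28);
    if ((ecx & mask) != mask)
        return 0;
    /* XCR0[2:1] = 11b means the OS manages both XMM and YMM state. */
    return (_xgetbv(0) & 0x6) == 0x6;
}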
NOTE
It is unwise for an application to rely exclusively on CPUID.1:ECX.AVX[bit 28] or at all on
CPUID.1:ECX.XSAVE[bit 26]: These indicate hardware support but not operating system support. If
YMM state management is not enabled by the operating system, AVX instructions will #UD
regardless of CPUID.1:ECX.AVX[bit 28]. “CPUID.1:ECX.XSAVE[bit 26] = 1” does not guarantee the OS
actually uses the XSAVE process for state management.
Similarly, the detection sequence for VPCLMULQDQ must combine checking for CPUID.1:ECX.PCLMULQDQ[bit 1] = 1
and the sequence for detecting application support for AVX.
This is shown in the pseudocode:
----------------------------------------------------------------------------------------
int supports_f16c()
{ ; result in eax
mov eax, 1
cpuid
and ecx, 038000000H
cmp ecx, 038000000H ; check OSXSAVE, AVX, F16C feature flags
jne not_supported
; processor supports AVX, F16C instructions and XGETBV is enabled by OS
mov ecx, 0 ; specify 0 for XFEATURE_ENABLED_MASK register
xgetbv ; result in EDX:EAX
and eax, 06H
cmp eax, 06H ; check OS has enabled both XMM and YMM state support
jne not_supported
mov eax, 1
jmp done
not_supported:
mov eax, 0
done:
}
-------------------------------------------------------------------------------
[Flowchart: Converting to Streaming SIMD Extensions — determine whether the code benefits from SIMD; identify whether the data is integer or floating-point; for floating-point, determine whether performance or range/precision is the motivation; if possible, re-arrange data for SIMD efficiency; align data structures; schedule instructions to optimize performance.]
To use any of the SIMD technologies optimally, you must evaluate the following situations in your code:
• Fragments that are computationally intensive.
• Fragments that are executed often enough to have an impact on performance.
• Fragments with little data-dependent control flow.
• Fragments that require floating-point computations.
• Fragments that can benefit from moving data 16 bytes at a time.
• Fragments of computation that can be coded using fewer instructions.
• Fragments that require help in using the cache hierarchy efficiently.
Converting sequential, or scalar, code into code that can execute in parallel takes advantage of the SIMD architecture's parallelism. This section discusses the coding techniques available for an application to make use of the SIMD architecture.
To vectorize your code and thus take advantage of the SIMD architecture, do the following:
• Determine if the memory accesses have dependencies that would prevent parallel execution.
• “Strip-mine” the inner loop to reduce the iteration count by the length of the SIMD operations (for example, four
for single-precision floating-point SIMD, eight for 16-bit integer SIMD on the XMM registers).
• Re-code the loop with the SIMD instructions.
Each of these actions is discussed in detail in the subsequent sections of this chapter. These sections also discuss
enabling automatic vectorization using the Intel C++ Compiler.
[Figure: programming method trade-offs — assembly and intrinsics offer the highest performance, while C/C++/Fortran with automatic vectorization offers the greatest ease of programming and portability.]
The examples that follow illustrate the use of coding adjustments to enable the algorithm to benefit from the Intel
SSE. The same techniques may be used for single-precision floating-point, double-precision floating-point, and
integer data under Intel SSE3, Intel SSE2, Intel SSE, and MMX technology.
As a basis for the usage model discussed in this section, consider a simple loop shown in Example 5-13.
Note that the loop runs for only four iterations. This allows a simple replacement of the code with Streaming SIMD
Extensions.
For optimal use of Intel SSE instructions that need data alignment on a 16-byte boundary, all examples in this chapter assume that the arrays passed to the routine, A, B, C, are aligned to 16-byte boundaries by a calling routine.
The sections that follow provide details on the coding methodologies: inlined assembly, intrinsics, C++ vector classes,
and automatic vectorization.
5.3.1.1 Assembly
Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler or assembler recognizes the new instructions and registers and directly generates the corresponding code. This model offers the opportunity for attaining the greatest performance, but this performance is not portable across different processor architectures.
Example 5-14 shows the Intel SSE inlined assembly encoding.
Example 5-14. Intel® Streaming SIMD Extensions (Intel® SSE) Using Inlined Assembly Encoding
void add(float *a, float *b, float *c)
{
__asm {
mov eax, a
mov edx, b
mov ecx, c
movaps xmm0, XMMWORD PTR [eax]
addps xmm0, XMMWORD PTR [edx]
movaps XMMWORD PTR [ecx], xmm0
}
}
5.3.1.2 Intrinsics
Intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language. Intel has defined three sets of intrinsic functions that are implemented in the Intel C++ Compiler to support MMX technology, Intel SSE, and Intel SSE2. Four new C data types, representing 64-bit and 128-bit objects, are used as the operands of these intrinsic functions. __m64 is used for MMX integer SIMD, __m128 is used for single-precision floating-point SIMD, __m128i is used for Streaming SIMD Extensions 2 integer SIMD, and __m128d is used for double-precision floating-point SIMD. These types enable the programmer to choose the implementation of an algorithm directly, while allowing the compiler to perform register allocation and instruction scheduling where possible. The intrinsics are portable among all Intel architecture-based processors supported by a compiler.
The use of intrinsics allows you to obtain performance close to the levels achievable with assembly. The cost of
writing and maintaining programs with intrinsics is considerably less. For a detailed description of the intrinsics and
their use, refer to the Intel C++ Compiler documentation.
Example 5-15 shows the loop from Example 5-13 using intrinsics.
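For illustration, a minimal intrinsics rendering of the same four-element add (a sketch that assumes, as elsewhere in this chapter, that a, b, and c are 16-byte aligned):
#include <xmmintrin.h>
void add(float *a, float *b, float *c)
{
    __m128 t = _mm_add_ps(_mm_load_ps(a), _mm_load_ps(b));  /* movaps/addps */
    _mm_store_ps(c, t);                                     /* movaps store */
}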
The intrinsics map one-to-one with actual Intel SSE assembly code. The XMMINTRIN.H header file in which the
prototypes for the intrinsics are defined is part of the Intel C++ Compiler included with the VTune Performance
Enhancement Environment CD.
Intrinsics are also defined for the MMX technology ISA. These are based on the __m64 data type to represent the
contents of an mm register. You can specify values in bytes, short integers, 32-bit values, or as a 64-bit object.
The intrinsic data types, however, are not basic ANSI C data types, and therefore you must observe the following usage restrictions:
• Use intrinsic data types only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (for example, “+”, “>>”).
• Use intrinsic data type objects in aggregates, such as unions, to access the byte elements and structures; the address of an __m64 object may also be used.
• Use intrinsic data type data only with the MMX technology intrinsics described in this guide.
For complete details of the hardware instructions, see the Intel Architecture MMX Technology Developer’s Guide.
For a description of data types, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual.
5.3.1.3 Classes
A set of C++ classes has been defined and is available in the Intel C++ Compiler to provide both a higher-level abstraction
and more flexibility for programming with MMX technology, Intel SSE and Intel SSE2. These classes provide an easy-
to-use and flexible interface to the intrinsic functions, allowing developers to write more natural C++ code without
worrying about which intrinsic or assembly language instruction to use for a given operation. Since the intrinsic
functions underlie the implementation of these C++ classes, the performance of applications using this methodology
can approach that of one using the intrinsics. Further details on the use of these classes can be found in the Intel C++
Class Libraries for SIMD Operations page.
Example 5-16 shows the C++ code using a vector class library. The example assumes the arrays passed to the routine
are already aligned to 16-byte boundaries.
Here, fvec.h is the class definition file and F32vec4 is the class representing an array of four floats. The “+” and “=”
operators are overloaded so that the actual Streaming SIMD Extensions implementation in the previous example is
abstracted out, or hidden, from the developer. Note how much more this resembles the original code, allowing for
simpler and faster programming.
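A minimal sketch of such a vector-class version of the four-element add, assuming fvec.h from the Intel C++ Class Libraries and 16-byte-aligned arrays:
#include <fvec.h>
void add(float *a, float *b, float *c)
{
    F32vec4 *av = (F32vec4 *) a;
    F32vec4 *bv = (F32vec4 *) b;
    F32vec4 *cv = (F32vec4 *) c;
    *cv = *av + *bv;    /* overloaded "+" and "=" generate the SSE code */
}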
Again, the example is assuming the arrays, passed to the routine, are already aligned to 16-byte boundary.
Compile this code using the -Qax and -Qrestrict switches of the Intel C++ Compiler, version 4.0 or later.
The restrict qualifier in the argument list is necessary to let the compiler know that there are no other aliases to
the memory to which the pointers point. In other words, the pointer for which it is used, provides the only means of
accessing the memory in question in the scope in which the pointers live. Without the restrict qualifier, the compiler
will still vectorize this loop using runtime data dependence testing, where the generated code dynamically selects
between sequential or vector execution of the loop, based on overlap of the parameters. The restrict keyword avoids
the associated overhead altogether.
See Intel® C++ Compiler Classic Developer Guide and Reference for details.
The following declaration allows you to vectorize the scaling operation and further improve the alignment of the data
access patterns:
short ptx[N], pty[N], ptz[N];
for (i=0; i<N; i++) pty[i] *= scale;
With the SIMD technology, choice of data organization becomes more important and should be made carefully based
on the operations that will be performed on the data. In some applications, traditional data arrangements may not
lead to the maximum performance.
A simple example of this is an FIR filter. An FIR filter is effectively a vector dot product in the length of the number of
coefficient taps.
Consider the following code:
(data[j]*coeff[0] + data[j+1]*coeff[1] + ... + data[j+num_of_taps-1]*coeff[num_of_taps-1]),
If, in the code above, the filter operation of data element i is the vector dot product that begins at data element j, then the filter operation of data element i+1 begins at data element j+1.
Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first
data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned.
For an example of how to avoid the misalignment problem in the FIR filter, refer to Intel application notes on
Streaming SIMD Extensions and filters.
Duplication and padding of data structures can be used to avoid the problem of data accesses in algorithms which are
inherently misaligned. Section 5.5.1 discusses trade-offs for organizing data structures.
NOTE
The duplication and padding technique overcomes the misalignment problem, thus
avoiding the expensive penalty for misaligned data access, at the cost of increasing the
data size. When developing your code, you should consider this tradeoff and use the
option which gives the best performance.
The algorithm in Example 5-18 aligns an array of 64-bit elements on a 64-bit boundary. The constant of 7 is derived
from one less than the number of bytes in a 64-bit element, or 8-1. Aligning data in this manner avoids the significant
performance penalties that can occur when an access crosses a cache line boundary.
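As an illustration of this arithmetic, a hypothetical helper that rounds a pointer up to the next 64-bit boundary using the constant 7 (= 8 - 1) described above:
#include <stdint.h>
/* Round ptr up to the next 8-byte (64-bit) boundary. */
static void *align_to_8(void *ptr)
{
    uintptr_t p = (uintptr_t) ptr;
    return (void *) ((p + 7) & ~(uintptr_t) 7);
}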
Another way to improve data alignment is to copy the data into locations that are aligned on 64-bit boundaries. When
the data is accessed frequently, this can provide a significant performance improvement.
The variable buffer could then be used as if it contained 100 objects of type __m128 or F32vec4. In the code below, the construction of the F32vec4 object, x, will occur with aligned data.
void foo() {
F32vec4 x = *(__m128 *) buffer;
...
}
Without the declaration of __declspec(align(16)), a fault may occur.
union {
float f[400];
__m128 m[100];
} buffer;
Now, 16-byte alignment is used by default due to the __m128 type in the union; it is not necessary to use __declspec(align(16)) to force the result.
In C++ (but not in C) it is also possible to force the alignment of a class/struct/union type, as in the code that follows:
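A minimal sketch of such a declaration (the class name my_aligned_vec4 is illustrative):
class __declspec(align(16)) my_aligned_vec4 {
    float f[4];
};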
If the data in such a class is going to be used with Intel SSE or Intel SSE2 instructions, it is preferable to use a union to make this explicit. In C++, an anonymous union can be used to make this more convenient:
class my_m128 {
union {
__m128 m;
float f[4];
};
};
Because the union is anonymous, the names m and f can be used as immediate member names of my_m128. Note that __declspec(align) has no effect when applied to a class, struct, or union member in either C or C++.
The best processing method for code using SIMD technology is to arrange the data in an array for each coordinate
(Example 5-20). This data arrangement is called structure of arrays (SoA).
There are two options for computing data in AoS format: perform operation on the data as it stands in AoS format, or
re-arrange it (swizzle it) into SoA format dynamically. See Example 5-21 for code samples of each option based on a
dot-product computation.
;
; AoS code
; All values marked DC are “don’t-care.”
; In the AOS model, the vertices are stored in the xyz format
movaps xmm0, Array ; xmm0 = DC, x0, y0, z0
movaps xmm1, Fixed ; xmm1 = DC, xF, yF, zF
mulps xmm0, xmm1 ; xmm0 = DC, x0*xF, y0*yF, z0*zF
movhlps xmm1, xmm0 ; xmm1 = DC, DC, DC, x0*xF
; SoA code
; X = x0,x1,x2,x3
; Y = y0,y1,y2,y3
; Z = z0,z1,z2,z3
; A = xF,xF,xF,xF
; B = yF,yF,yF,yF
; C = zF,zF,zF,zF
Performing SIMD operations on the original AoS format can require more calculations and some operations do not
take advantage of all SIMD elements available. Therefore, this option is generally less efficient.
The recommended way for computing data in AoS format is to swizzle each set of elements to SoA format before
processing it using SIMD technologies. Swizzling can either be done dynamically during program execution or
statically when the data structures are generated. See Chapter 6, “Optimizing for SIMD Integer Applications” and
Chapter 7, “Optimizing for SIMD Floating-Point Applications” for examples. Performing the swizzle dynamically is
usually better than using AoS, but can be somewhat inefficient because there are extra instructions during
computation. Performing the swizzle statically, when data structures are being laid out, is best as there is no runtime
overhead.
As mentioned earlier, the SoA arrangement allows more efficient use of the parallelism of SIMD technologies because
the data is ready for computation in a more optimal vertical manner: multiplying components X0,X1,X2,X3 by
XF,XF,XF,XF using four SIMD execution slots to produce four unique results. In contrast, computing directly on AoS
data can lead to horizontal operations that consume SIMD execution slots but produce only a single scalar result (as
shown by the many “don’t-care” (DC) slots in Example 5-21).
Use of the SoA format for data structures can lead to more efficient use of caches and bandwidth. When the elements
of the structure are not accessed with equal frequency, such as when element x, y, z are accessed ten times more
often than the other entries, then SoA saves memory and prevents fetching unnecessary data items a, b, and c.
typedef struct{
float x[SIMDwidth];
float y[SIMDwidth];
float z[SIMDwidth];
} VerticesCoordList;
typedef struct{
int a[SIMDwidth];
int b[SIMDwidth];
int c[SIMDwidth];
...
} VerticesColorList;
VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];
Note that SoA can have the disadvantage of requiring more independent memory stream references. A computation
that uses arrays X, Y, and Z (see Example 5-20) would require three separate data streams. This can require the use of
more prefetches, additional address generation calculations, as well as having a greater impact on DRAM page access
efficiency.
There is an alternative: a hybrid SoA approach blends the two alternatives (see Example 5-22). In this case, only two
separate address streams are generated and referenced:
• One contains XXXX, YYYY, ZZZZ, XXXX, YYYY, ZZZZ, ... .
• The other contains AAAA, BBBB, CCCC, AAAA, BBBB, CCCC, ... .
The approach prevents fetching unnecessary data, assuming the variables X, Y, Z are always used together; whereas
the variables A, B, C would also be used together, but not at the same time as X, Y, Z.
The hybrid SoA approach ensures:
• Data is organized to enable more efficient vertical SIMD computation.
• Simpler/less address generation than AoS.
• Fewer streams, which reduces DRAM page misses.
• Use of fewer prefetches, due to fewer streams.
• Efficient cache line packing of data elements that are used concurrently.
With the advent of the SIMD technologies, the choice of data organization becomes more important and should be
carefully based on the operations to be performed on the data.
5.5.2 STRIP-MINING
Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of
loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this
technique consists of the generation of code when each vector operation is done for a size less than or equal to the
maximum vector length on a given vector machine. By fragmenting a large loop into smaller segments or strips, this
technique transforms the loop structure by:
• Increasing the temporal and spatial locality in the data cache if the data are reusable in different passes of an
algorithm.
• Reducing the number of iterations of the loop by a factor of the length of each “vector,” or number of operations
being performed per SIMD operation. In the case of Intel SSE, this vector or strip-length is reduced by 4 times:
four floating-point data items per single Streaming SIMD Extensions single-precision floating-point SIMD
operation are processed.
Consider Example 5-23:
main()
{
Vertex_rec v[Num];
....
for (i=0; i<Num; i++) {
Transform(v[i]);
}
for (i=0; i<Num; i++) {
Lighting(v[i]);
}
....
}
The main loop consists of two functions: transformation and lighting. For each object, the main loop calls a
transformation routine to update some data, then calls the lighting routine to further work on the data. If the size of
array V[NUM] is larger than the cache, then the coordinates for V[I] that were cached during TRANSFORM(V[I])
will be evicted from the cache by the time we do LIGHTING(V[I]). This means that V[I] will have to be fetched from
main memory a second time, reducing performance.
In Example 5-24, the computation has been strip-mined to a size STRIP_SIZE. The value STRIP_SIZE is chosen so
STRIP_SIZE elements of array V[NUM] fit into the cache hierarchy. By doing this, a given element V[I] brought into
the cache by TRANSFORM(V[I]) will still be in the cache when we perform LIGHTING(V[I]), and thus improve
performance over the non-strip-mined code.
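A sketch of the strip-mined structure just described, reusing the Transform/Lighting/STRIP_SIZE names from the surrounding text:
for (i = 0; i < Num; i += STRIP_SIZE) {
    int n = (i + STRIP_SIZE < Num) ? (i + STRIP_SIZE) : Num;
    for (j = i; j < n; j++) Transform(v[j]);  /* the strip stays resident in cache */
    for (j = i; j < n; j++) Lighting(v[j]);   /* and is reused here                */
}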
For the first iteration of the inner loop, each access to array B will generate a cache miss. If the size of one row of array
A, that is, A[2, 0:MAX-1], is large enough, by the time the second iteration starts, each access to array B will always
generate a cache miss. For instance, on the first iteration, the cache line containing B[0, 0:7] will be brought in when
B[0,0] is referenced because the float type variable is four bytes and each cache line is 32 bytes. Due to the limitation
of cache capacity, this line will be evicted due to conflict misses before the inner loop reaches the end.
For the next iteration of the outer loop, another cache miss will be generated while referencing B[0, 1]. In this
manner, a cache miss occurs when each element of array B is referenced, that is, there is no data reuse in the cache at
all for array B.
This situation can be avoided if the loop is blocked with respect to the cache size. In Figure 5-5, a BLOCK_SIZE is
selected as the loop blocking factor. Suppose that BLOCK_SIZE is 8, then the blocked chunk of each array will be
eight cache lines (32 bytes each). In the first iteration of the inner loop, A[0, 0:7] and B[0, 0:7] will be brought into
the cache. B[0, 0:7] will be completely consumed by the first iteration of the outer loop. Consequently, B[0, 0:7]
will only experience one cache miss after applying loop blocking optimization in lieu of eight misses for the original
algorithm.
As illustrated in Figure 5-5, arrays A and B are blocked into smaller rectangular chunks so that the total size of two
blocked A and B chunks is smaller than the cache size. This allows maximum data reuse.
[Figure 5-5. Loop Blocking Access Pattern — arrays A and B are partitioned into blocked chunks sized so that the two chunks fit in the cache together.]
As one can see, all the redundant cache misses can be eliminated by applying this loop blocking technique. If MAX is
huge, loop blocking can also help reduce the penalty from DTLB (data translation look-aside buffer) misses. In
addition to improving the cache/memory performance, this optimization technique also saves external bus
bandwidth.
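A generic C sketch of the two-dimensional blocking transformation described above (the A[i][j] += B[j][i] access pattern and BLOCK_SIZE are illustrative; BLOCK_SIZE is chosen so that the blocked chunks of A and B fit in the cache together):
for (jj = 0; jj < MAX; jj += BLOCK_SIZE) {
    for (i = 0; i < MAX; i++) {
        for (j = jj; j < jj + BLOCK_SIZE && j < MAX; j++) {
            A[i][j] += B[j][i];   /* B's blocked columns are reused across i */
        }
    }
}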
Data-dependent branches within a loop can be emulated with masked compares and logicals, as shown in Example 5-26. SSE4.1 provides a packed blend instruction that can vectorize data-dependent branches in a loop.
MMX assembly code processes four short values per iteration:
xor eax, eax
top_of_loop:
movq mm0, [A + eax]
pcmpgtw mm0, [B + eax] ; Create compare mask
movq mm1, [D + eax]
pand mm1, mm0 ; Drop elements where A<B
pandn mm0, [E + eax] ; Drop elements where A>B
por mm0, mm1 ; Combine into a single result
movq [C + eax], mm0
add eax, 8
cmp eax, MAX_ELEMENT*2
jle top_of_loop
SSE4.1 assembly code processes eight short values per iteration:
xor eax, eax
top_of_loop:
movdqa xmm0, [A + eax]
pcmpgtw xmm0, [B + eax] ; Create compare mask
movdqa xmm1, [E + eax]
pblendvb xmm1, [D + eax], xmm0 ; Blend in elements from D where the XMM0 mask is set
movdqa [C + eax], xmm1
add eax, 16
cmp eax, MAX_ELEMENT*2
jle top_of_loop
If there are multiple consumers of an instance of a register, group the consumers together as closely as possible.
However, the consumers should not be scheduled near the producer.
CHAPTER 6
OPTIMIZING FOR SIMD INTEGER APPLICATIONS
SIMD integer instructions provide performance improvements in applications that are integer-intensive and can take
advantage of SIMD architecture.
Guidelines in this chapter for using SIMD integer instructions (in addition to those described in Chapter 3, “General
Optimization Guidelines”) may be used to develop fast and efficient code that scales across processor generations.
The collection of 64-bit and 128-bit SIMD integer instructions supported by MMX technology, SSE, SSE2, SSE3, SSSE3,
SSE4.1, and PCMPEQQ in SSE4.2 is referred to as SIMD integer instructions.
Code sequences in this chapter demonstrate the use of basic 64-bit SIMD integer instructions and more efficient
128-bit SIMD integer instructions.
Processors based on Intel Core microarchitecture support MMX, SSE, SSE2, SSE3, and SSSE3. Processors based on
Enhanced Intel Core microarchitecture support SSE4.1 and all previous generations of SIMD integer instructions.
Processors based on Nehalem microarchitecture support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2.
Single-instruction, multiple-data techniques can be applied to text/string processing, lexing and parser applications.
SIMD programming in string/text processing and lexing applications often require sophisticated techniques beyond
those commonly used in SIMD integer programming. This is covered in Chapter 14, “Intel® SSE4.2 and SIMD
Programming For Text-Processing/Lexing/Parsing.”
Execution of 128-bit SIMD integer instructions in Intel Core microarchitecture and Enhanced Intel Core
microarchitecture are substantially more efficient than on previous microarchitectures. Thus newer SIMD capabilities
introduced in SSE4.1 operate on 128-bit operands and do not introduce equivalent 64-bit SIMD capabilities.
Conversion from 64-bit SIMD integer code to 128-bit SIMD integer code is highly recommended.
This chapter contains examples that will help you to get started with coding your application. The goal is to provide
simple, low-level operations that are frequently used. The examples use a minimum number of instructions necessary
to achieve best performance on the current generation of Intel 64 and IA-32 processors.
Each example includes a short description, sample code, and notes if necessary. These examples do not address
scheduling as it is assumed the examples will be incorporated in longer code sequences.
For planning considerations of using the SIMD integer instructions, refer to Section 5.1.3.
Code sequences containing cross-typed usage produce the same result across different implementations but incur a significant performance penalty. Using SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched SIMD data in the XMM register is strongly discouraged.
• Use the optimization rules and guidelines described in Chapter 3 and Chapter 5, “Coding for SIMD Architectures”.
• Take advantage of hardware prefetcher where possible. Use the PREFETCH instruction only when data access
patterns are irregular and prefetch distance can be pre-determined. See Chapter 9, “Optimizing Cache Usage.”
• Emulate conditional moves by using blend, masked compares and logicals instead of using conditional branches.
NOTE
Failure to reset the tag word for FP instructions after using an MMX instruction can result in faulty
execution or poor performance.
2. Insert the EMMS instruction at the end of all 64-bit SIMD integer code segments to avoid an x87
floating-point stack overflow exception when an x87 floating-point instruction is executed.
When writing an application that uses both floating-point and 64-bit SIMD integer instructions, use the following
guidelines to help you determine when to use EMMS:
• If next instruction is x87 FP: Use _mm_empty() after a 64-bit SIMD integer instruction if the next instruction is an x87 FP instruction; for example, before doing calculations on floats, doubles or long doubles.
• Don’t empty when already empty: If the next instruction uses an MMX register, _mm_empty() incurs a cost with no benefit.
• Group Instructions: Try to partition regions that use x87 FP instructions from those that use 64-bit SIMD integer instructions. This eliminates the need for an EMMS instruction within the body of a critical loop.
• Runtime initialization: Use _mm_empty() during runtime initialization of __m64 and x87 FP data types. This ensures resetting the register between data type transitions. See Example 6-1 for coding usage.
Example 6-1. Resetting Register Between __m64 and FP Data Types Code
Incorrect Usage Correct Usage
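A minimal C sketch of the correct pattern (empty the MMX state with _mm_empty() after the 64-bit SIMD work and before any x87 floating-point use; the values and function are illustrative):
#include <mmintrin.h>
double mmx_then_fp(double scale)
{
    __m64 sum = _mm_add_pi32(_mm_cvtsi32_si64(40), _mm_cvtsi32_si64(2)); /* 64-bit SIMD integer work */
    int s = _mm_cvtsi64_si32(sum);
    _mm_empty();            /* reset the x87/MMX state before floating-point code */
    return s * scale;       /* floating-point computation is now safe             */
}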
You must be aware that your code generates an MMX instruction, which uses MMX registers with the Intel C++
Compiler, in the following situations:
• When using a 64-bit SIMD integer intrinsic from MMX technology, SSE/SSE2/SSSE3.
• When using a 64-bit SIMD integer instruction from MMX technology, SSE/SSE2/SSSE3 through inline assembly.
• When referencing the __m64 data type variable.
Additional information on the x87 floating-point programming model can be found in the Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 1. For more on EMMS, visit the Intel® C++ Compiler Classic
Developer Guide and Reference.
Using PALIGNRs to replace unaligned loads improves performance by eliminating cache line splits and other penalties.
In routines like MEMCPY( ), PALIGNR can boost the performance of misaligned cases. Example 6-2 shows a situation
that benefits by using PALIGNR.
Example 6-3 compares an optimal SSE2 sequence of the FIR loop and an equivalent SSSE3 implementation. Both implementations unroll four iterations of the FIR inner loop to enable SIMD coding techniques. The SSE2 code cannot avoid experiencing a cache line split once every four iterations. PALIGNR allows the SSSE3 code to avoid the delays associated with cache line splits.
Example 6-3. SSE2 and SSSE3 Implementation of FIR Processing Code (Contd.)
Optimized for SSE2:
add ecx, 16
cmp ecx, 4*TAP
jl inner_loop
mov eax, dword ptr[output]
movaps xmmword ptr[eax], xmm0
Optimized for SSSE3:
movaps xmm2, xmm1
palignr xmm2, xmm3, 12
mulps xmm2, xmmword ptr[ebx+4*ecx+48]
addps xmm0, xmm2
add ecx, 16
cmp ecx, 4*TAP
jl inner_loop
Example 6-4. Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack Instructions Code
; Input:
; XMM0 8 16-bit values in source
; XMM7 0 a local variable can be used
; instead of the register XMM7 if
; desired.
; Output:
; XMM0 four zero-extended 32-bit
; doublewords from four low-end
; words
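The unsigned-unpack operation described by the example can be sketched with SSE2 intrinsics as follows (the zero operand is supplied by _mm_setzero_si128() instead of register XMM7):
#include <emmintrin.h>
/* Zero-extend the four low and four high 16-bit values of v into 32-bit lanes. */
void zero_extend_words(__m128i v, __m128i *lo, __m128i *hi)
{
    __m128i zero = _mm_setzero_si128();
    *lo = _mm_unpacklo_epi16(v, zero);   /* four low words  -> four doublewords */
    *hi = _mm_unpackhi_epi16(v, zero);   /* four high words -> four doublewords */
}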
SSE2 extends PACKSSDW so that it saturates four signed doublewords from a source operand and four signed
doublewords from a destination operand into eight signed words; the eight signed words are packed into the
destination.
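A short intrinsics sketch of the signed-saturating pack just described:
#include <emmintrin.h>
/* Pack eight signed doublewords (four from a, four from b) into eight
   signed-saturated words in the destination. */
__m128i pack_dwords_to_words(__m128i a, __m128i b)
{
    return _mm_packs_epi32(a, b);   /* PACKSSDW */
}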
[Figure: pack with signed saturation — elements D, C, B, A of the mm/m64 and mm source operands are saturated and packed into the destination mm register as D1, C1, B1, A1.]
Figure 6-2 illustrates where two pairs of values are interleaved in a destination register; Example 6-6 shows MMX
code that accomplishes the operation.
Two signed doublewords are used as source operands and the result is interleaved signed words. The sequence in
Example 6-6 can be extended in SSE2 to interleave eight signed words using XMM registers.
[Figure: interleaved pack with saturation — the saturated results from the two source operands are interleaved into the destination mm register as D1, B1, C1, A1.]
Pack instructions always assume that source operands are signed numbers. The result in the destination register is
always defined by the pack instruction that performs the operation. For example, PACKSSDW packs each of two
signed 32-bit values of two sources into four saturated 16-bit signed values in a destination register. PACKUSWB, on
the other hand, packs the four signed 16-bit values of two sources into eight saturated eight-bit unsigned values in the
destination.
; Output:
; MM0 the first and third words contain the
; low 16-bits of the doublewords in MM0,
; the second and fourth words contain the
; low 16-bits of the doublewords in MM1
The other destination register will contain the opposite combination illustrated in Figure 6-4.
[Figures: result of non-interleaved unpack — one destination register receives the low doublewords of the two sources (21 20 11 10), while the other receives the high doublewords (23 22 13 12).]
Code in the Example 6-8 unpacks two packed-word sources in a non-interleaved way. The goal is to use the instruction
which unpacks doublewords to a quadword, instead of using the instruction which unpacks words to doublewords.
[Figure: PEXTRW — a selected word (X1) of the MM register is zero-extended into the low 16 bits of a 32-bit integer register.]
With SSE2, PINSRW can insert a word from the lower 16 bits of an integer register or memory into an XMM register.
SSE4.1 provides insertion of a byte, dword and qword from either a memory location or integer register into an XMM
register.
[Figure: PINSRW — the low word (Y1) of a 32-bit integer register is inserted into a selected word position of the MM register.]
If all of the operands in a register are being replaced by a series of PINSRW instructions, it can be useful to clear the
content and break the dependence chain by either using the PXOR instruction or loading the register. See
Example 6-11 and Section 3.5.1.7
Goal: Non-Unit Stride Load Dwords Goal: Non-Unit Stride Store Dwords
movd xmm0, [addr] movd [addr], xmm0
pinsrd xmm0, [addr + stride], 1 pextrd [addr + stride], xmm0, 1
pinsrd xmm0, [addr + 2*stride], 2 pextrd [addr + 2*stride], xmm0, 2
pinsrd xmm0, [addr + 3*stride], 3 pextrd [addr + 3*stride], xmm0, 3
Example 6-13 provides two examples: using INSERTPS and PEXTRD to perform gather operations on floating-point
data; using EXTRACTPS and PEXTRD to perform scatter operations on floating-point data.
[Figure: PMOVMSKB — the most significant bit of each byte of the MM register is collected into the low byte of a 32-bit integer register; the remaining bits are zeroed.]
[Table fragment: PSHUFLW/PSHUFHW/PSHUFD usage examples — e.g., PSHUFHW with immediate (3,2,1,1) and PSHUFD with immediate (2,2,2,2) — illustrating goals such as swapping the values in word 6 and word 1 and reversing the order of the words.]
vectorizing conditional flows within a loop and can be more efficient than inserting a single element at a time in some situations.
NOTE
Because SIMD integer instruction sets do not support shift instructions for bytes, 2n–1 and -2n are
relevant only for packed words and packed doublewords.
This example will not work if the operands are signed. Note that PSADBW may also be used in some situations. See
Section 6.6.9 for details.
;Output:
; XMM1 absolute difference of the unsigned operands
NOTE
The absolute value of the most negative number (that is, 8000H for 16-bit) cannot be represented
using positive numbers. This algorithm will return the original value for the absolute value (8000H).
Use PSHUFB if the alternative code uses 5 or more instructions. Example 6-21 shows the basic form of conversion of
color pixel formats.
Example 6-22 and Example 6-23 show SSE2 code and SSSE3 code for pixel format conversion. In the SSSE3 example,
PSHUFB replaces six SSE2 instructions.
add esi, 64
add edi, 64
sub ecx, 1
jnz convert16Pixs
add eax, 16
add ecx, 16
sub edx, 4
jnz start
With SSE4.1, Example 6-25 can be easily extended to clip signed bytes, unsigned words, signed and unsigned dwords.
The code above converts values to unsigned numbers first and then clips them to an unsigned range. The last
instruction converts the data back to signed data and places the data within the signed range.
Conversion to unsigned data is required for correct results when (High - Low) < 0x8000. If (High - Low) >= 0x8000,
simplify the algorithm as in Example 6-27.
This algorithm saves a cycle when it is known that (High - Low) >= 0x8000. The three-instruction algorithm does not
work when (High - Low) < 0x8000 because 0xffff minus any number < 0x8000 will yield a number greater in magnitude
than 0x8000 (which is a negative number).
When the second instruction, psubssw MM0, (0xffff - High + Low) in the three-step algorithm (Example 6-27) is
executed, a negative number is subtracted. The result of this subtraction causes the values in MM0 to be increased
instead of decreased, as should be the case, and an incorrect answer is generated.
The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned
words.
When an individual result is too large to be represented in 64-bits, the lower 64-bits of the result are written to the
destination operand and therefore the result wraps around. These instructions are added in both a 64-bit and 128-bit
version; the latter performs 2 independent operations, on the low and high halves of a 128-bit register.
Example 6-30. Using PTEST to Separate Vectorizable and Non-Vectorizable Loop Iterations
(a) Loops Requiring Infrequent Exception Handling:
float a[CNT];
unsigned int i;
for (i=0;i<CNT;i++)
{
if (a[i] != 0.0)
{ a[i] = 1.0f/a[i];
}
else
{ call DivException();
}
}
(b) PTEST Enables Early Out to Handle Infrequent, Non-Vectorizable Portion:
xor eax,eax
movaps xmm7, [all_ones]
xorps xmm6, xmm6
lp:
movaps xmm0, a[eax]
cmpeqps xmm6, xmm0 ; convert each non-zero to ones
ptest xmm6, xmm7
jnc zero_present ; carry will be set if all 4 were non-zero
movaps xmm1, [_1_0f_]
divps xmm1, xmm0
movaps a[eax], xmm1
add eax, 16
cmp eax, CNT
jnz lp
jmp end
zero_present:
// execute one by one, call
// exception when value is zero
Example 6-30(b) shows an assembly sequence that uses PTEST to cause an early-out branch whenever any one of the
four floating-point values in xmm0 is zero. The fall-through path enables the rest of the floating-point calculations to
be vectorized because none of the four values are zero.
Example 6-31(b) depicts an assembly sequence that uses BLENDVPS to vectorize the handling of heterogeneous
computations occurring across four consecutive loop iterations.
Example 6-32. Baseline C Code for Mandelbrot Set Map Evaluation
void mandelbrot_C()
{ int i,j;
float x,y;
for (i=0,x=-1.8f;i<DIMX;i++,x+=X_STEP)
{
for (j=0,y=-0.2f;j<DIMY/2;j++,y+=Y_STEP)
{float sx,sy;
int iter = 0;
sx = x;
sy = y;
while (iter < 256)
{ if (sx*sx + sy*sy >= 4.0f) break;
float old_sx = sx;
sx = x + sx*sx - sy*sy;
sy = y + 2*old_sx*sy;
iter++;
}
map[i][j] = iter;
}
}
}
Example 6-33 shows a vectorized implementation of Mandelbrot map evaluation. Vectorization is not done on the innermost loop, because the presence of the break statement implies the iteration count will vary from one pixel to the next. The vectorized version takes into account the parallel nature of the 2-D map, vectorizes over the Y values of four consecutive pixels, and conditionally handles three scenarios:
• In the innermost iteration, when none of the four pixels has reached the break condition, compute four pixels in parallel.
• When one or more pixels has reached the break condition, use blend intrinsics to accumulate the complex height vector for the remaining pixels that have not reached the break condition and continue the inner iteration of the complex height vector.
• When all four pixels have reached the break condition, exit the inner loop.
Example 6-33. Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics
__declspec(align(16)) float _INIT_Y_4[4] = {0,Y_STEP,2*Y_STEP,3*Y_STEP};
F32vec4 _F_STEP_Y(4*Y_STEP);
I32vec4 _I_ONE_ = _mm_set1_epi32(1);
F32vec4 _F_FOUR_(4.0f);
F32vec4 _F_TWO_(2.0f);
void mandelbrot_C()
{ int i,j;
F32vec4 x,y;
{ F32vec4 sx,sy;
I32vec4 iter = _mm_setzero_si128();
int scalar_iter = 0;
sx = x;
sy = y;
The goal of the above recommendations is twofold. First, the loading and storing of SIMD data is more efficient using
the larger block sizes. Second, following the above recommendations helps to avoid mixing of 8-, 16-, or 32-bit load
and store operations with SIMD integer technology load and store operations to the same SIMD data.
This prevents situations in which small loads follow large stores to the same area of memory, or large loads follow
small stores to the same area of memory.
MOVQ must wait for the stores to write memory before it can access all data it requires. This stall can also occur with
other data types (for example, when bytes or words are stored and then words or doublewords are read from the
same area of memory). When you change the code sequence as shown in Example 6-35, the processor can access the
data without delay.
Consider a case with a series of small loads after a large store to the same area of memory (beginning at memory
address MEM), as shown in Example 6-36. Most of the small loads stall because they are not aligned with the store.
See Section 3.6.4 for details.
The word loads must wait for the quadword store to write to memory before they can access the data they require.
This stall can also occur with other data types (for example: when doublewords or words are stored and then words
or bytes are read from the same area of memory).
When you change the code sequence as shown in Example 6-37, the processor can access the data without delay.
Example 6-37. Eliminating Delay for a Series of Small Loads after a Large Store
movq mem, mm0 ; store qword to address “mem"
:
:
psrlq mm1, 32
shr eax, 16
movd ebx, mm1 ; transfer “mem + 4" to bx from
; MMX register, not memory
and ebx, 0ffffh
These transformations, in general, increase the number of instructions required to perform the desired operation.
The following are guidelines for obtaining higher bandwidth and shorter latencies for sequential memory fills (video
fills). These recommendations are relevant for all Intel architecture processors with MMX technology and refer to
cases in which the loads and stores do not hit in the first- or second-level cache.
6.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM
Page
DRAM is divided into pages, which are not the same as operating system (OS) pages. The size of a DRAM page is a
function of the total size of the DRAM and the organization of the DRAM. Page sizes of several Kilobytes are common.
Like OS pages, DRAM pages are constructed of sequential addresses. Sequential memory accesses to the same DRAM
page have shorter latencies than sequential accesses to different DRAM pages.
In many systems the latency for a page miss (that is, an access to a different page instead of the page previously
accessed) can be twice as large as the latency of a memory page hit (access to the same page as the previous access).
Therefore, if the loads and stores of the memory fill cycle are to the same DRAM page, a significant increase in the
bandwidth of the memory fill cycles can be achieved.
Using MOVDQA or MOVDQU, software can load and store up to 16 bytes at a time, but it must either ensure the 16-byte alignment requirement (if using MOVDQA) or minimize the delays MOVDQU may encounter when data spans a cache line boundary.
[Figure 6-8. Data Alignment of Loads and Stores in Reverse Memory Copy — panels (a) and (b) show the source and destination byte streams relative to 16-byte-aligned addresses and cache line boundaries for the two alignment cases discussed below.]
Given the general problem of an arbitrary byte count to copy, arbitrary offsets of the leading source and destination bytes, and address alignment relative to 16-byte and cache line boundaries, these alignment situations can be a bit complicated. Figure 6-8 (a) and (b) depict the alignment situations of a reverse memory copy of N bytes.
The general guidelines for dealing with unaligned loads and stores are (in order of importance):
• Avoid stores that span cache line boundaries.
• Minimize the number of loads that span cacheline boundaries.
• Favor 16-byte aligned loads and stores over unaligned versions.
In Figure 6-8 (a), the guidelines above can be applied to the reverse memory copy problem as follows:
1. Peel off several leading destination bytes until the destination aligns on a 16-byte boundary; the ensuing destination bytes can then be written using MOVAPS until the remaining byte count falls below 16 bytes.
2. After the leading source bytes have been peeled (corresponding to step 1 above), the source alignment in Figure 6-8 (a) allows loading 16 bytes at a time using MOVAPS until the remaining byte count falls below 16 bytes.
Switching the byte ordering of each 16 bytes of data can be accomplished by a 16-byte mask with PSHUFB. The
pertinent code sequence is shown in Example 6-39.
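For illustration, a sketch of such a 16-byte byte-order reversal with SSSE3 intrinsics (the reversal mask shown here is an assumption of this sketch, not a mask taken from Example 6-39):
#include <tmmintrin.h>
/* Reverse the byte order of one 16-byte block, as used in the reverse-copy inner loop. */
__m128i reverse_bytes16(__m128i v)
{
    const __m128i rev = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, rev);   /* PSHUFB */
}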
In Figure 6-8 (b), we also start with peeling the destination bytes:
1. Peel off several leading destination bytes until the destination aligns on a 16-byte boundary; the ensuing destination bytes can then be written using MOVAPS until the remaining byte count falls below 16 bytes. However, the remaining source bytes are not aligned on 16-byte boundaries, so replacing MOVDQA with MOVDQU for the loads will inevitably run into cache line splits.
2. To achieve higher data throughput than loading unaligned bytes with MOVDQU, the 16 bytes of data
targeted to each of 16 bytes of aligned destination addresses can be assembled using two aligned
loads. This technique is illustrated in Figure 6-9.
[Figure 6-9. A Technique to Avoid Cacheline Split Loads in Reverse Memory Copy Using Two Aligned Loads — each aligned 16-byte destination block is assembled from two aligned 16-byte source loads, with the byte order reversed in a register before the aligned store.]
Although some of the arithmetic computations and the input/output to the data array in each iteration are easily vectorizable, the table look-up via an index array is not. This creates different approaches to tuning. A compiler can take a scalar approach to execute each iteration sequentially. Hand-tuning of such loops may use a couple of different techniques to handle the non-vectorizable table look-up operation. One vectorization technique is to load the input data for four iterations at once, then use SSE2 instructions to shift individual indices out of an XMM register to carry out the table look-up sequentially. The shift technique is depicted by Example 6-41. Another technique is to use PEXTRD in SSE4.1 to extract the indices from an XMM register directly and then carry out the table look-up sequentially. The PEXTRD technique is depicted by Example 6-42.
add ebx, 16
add esi, 16
add edi, 16
sub ecx, 1
test ecx, ecx
jne lloop
The effectiveness of these two hand-tuning techniques on partially vectorizable code depends on the relative cost of
transforming data layout format using various forms of pack and unpack instructions.
The shift technique requires additional instructions to pack scalar table values into an XMM to transition into
vectorized arithmetic computations. The net performance gain or loss of this technique will vary with the
characteristics of different microarchitectures. The alternate PEXTRD technique uses fewer instructions to extract each index and does not require extraneous packing of scalar data into a packed SIMD data format before beginning vectorized arithmetic computation.
To achieve optimal parallel operation with multiple blocks, write the AES software sequences in a way that it
computes one AES round on multiple blocks, using one Round Key, and then it continues to compute the subsequent
round for multiple blocks, using another Round Key.
For such software optimization, you need to define the number of blocks that are processed in parallel. In Sandy
Bridge microarchitecture, the optimal parallelization parameter is eight blocks, compared to four blocks on prior
microarchitecture.
Example 6-44 in the following pages shows the assembly implementation of the above code, optimized for the Sandy Bridge microarchitecture.
.align 16
TWO_N_ONE: .quad 0x00000002,0x00000001
.align 16
TWO_N_TWO: .quad 0x00000002,0x00000002
.align 16
LOAD_HIGH_BROADCAST_AND_BSWAP: .byte 15,14,13,12,11,10,9,8
.byte 15,14,13,12,11,10,9,8
.align 16
BSWAP_EPI_64: .byte 7,6,5,4,3,2,1,0
.byte 15,14,13,12,11,10,9,8
AES_CTR_encrypt:
# parameter 1: %rdi # parameter 2: %rsi
# parameter 3: %rdx # parameter 4: %rcx
# parameter 5: %r8 # parameter 6: %r9
# parameter 7: 8 + %rsp
movq %r8, %r10
movl 8(%rsp), %r12d
shrq $4, %r8
shlq $60, %r10
je NO_PARTS
addq $1, %r8
NO_PARTS:
movq %r8, %r10
shlq $61, %r10
shrq $61, %r10
jb LAST
jb LAST
dec %r8
REMAINDER:
cmp $0, %r10
je END
shufpd $2, %xmm1, %xmm0
IN_LOOP:
movdqa %xmm0, %xmm11
pshufb (BSWAP_EPI_64), %xmm0
pxor (%r9), %xmm11
paddq (ONE), %xmm0
aesenc 16(%r9), %xmm11
aesenc 32(%r9), %xmm11
pshufb (BSWAP_EPI_64), %xmm0
aesenc 48(%r9), %xmm11
aesenc 64(%r9), %xmm11
aesenc 80(%r9), %xmm11
aesenc 96(%r9), %xmm11
aesenc 112(%r9), %xmm11
aesenc 128(%r9), %xmm11
jb IN_LAST
jb IN_LAST
IN_LAST:
aesenclast %xmm2, %xmm11
pxor (%rdi) ,%xmm11
movdqu %xmm11, (%rsi)
addq $16,%rdi
addq $16,%rsi
dec %r10
jne IN_LOOP
END:
ret
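For comparison, the round-by-round scheduling across multiple blocks can also be sketched with AES-NI intrinsics; this is a simplified illustration of the eight-block parallelization (AES-128, with key schedule and counter handling omitted), not a restatement of the listing above:
#include <wmmintrin.h>
/* Apply one AES round to eight blocks with the same round key. */
static void aes_round_x8(__m128i blk[8], __m128i rk)
{
    for (int i = 0; i < 8; i++)
        blk[i] = _mm_aesenc_si128(blk[i], rk);
}
/* AES-128 encryption of eight blocks: whitening, nine full rounds, final round. */
void aes128_encrypt_x8(__m128i blk[8], const __m128i rk[11])
{
    for (int i = 0; i < 8; i++)
        blk[i] = _mm_xor_si128(blk[i], rk[0]);
    for (int r = 1; r < 10; r++)
        aes_round_x8(blk, rk[r]);
    for (int i = 0; i < 8; i++)
        blk[i] = _mm_aesenclast_si128(blk[i], rk[10]);
}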
The optimal objective for light-weight compression/decompression is to deliver high throughput at reasonably low
CPU utilization, so the finite total compute bandwidth can be divided more favorably between query processing and
decompression to achieve maximal query throughput. SSE4.2 can raise the compute bandwidth for some query
operations to a significantly higher level (see Section 14.3.3), compared to query primitives implemented using
general-purpose-register instructions. This also places higher demand on the streaming data feed of decompressed
columnar data.
1. “SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units,” T. Willhalm, et al., Proceedings of the VLDB Endowment, Vol. 2, #1, August 2009.
2. “Super-Scalar RAM-CPU Cache Compression,” M. Zukowski, et al., Data Engineering, International Conference, vol. 0, no. 0, pp. 59, 2006.
int i, j;
__m128i a0, a1, a2, a3, c0, c1, b0, b1, b2, b3, bb;
__m128i msk4 ;
__m128i sprd4 = _mm_loadu_si128( (__m128i*) &sprdb_0_5_10_15[0]);
switch( bucket_width) {
case 5:j= 0;
• Four-way bit stitching: In each way (dword) of the destination, 5 bits are packed consecutively from the
corresponding byte element that contains 5 non-zero bit patterns. Since each dword destination will be
completely filled up by the contents of 7 consecutive elements, the remaining three bits of the 7th element and
the 8th element are done separately in a similar 4-way stitching operation but require the assistance of shuffle
operations.
Example 6-47 shows the reverse operation of decompressing consecutively packed 5-bit buckets into 32-bit data
elements.
Example 6-47. Decompression of a Stream of 5-bit Integers into 32-bit Elements (Contd.)
switch( bucket_width) {
case 5:j= 0;
msk4 = _mm_loadu_si128( (__m128i*) &mask_dw_5b[0]);
for (i = 0; i < cnt; i+= 32) {
a1 = _mm_loadu_si128( (__m128i*) &src[j +4]);
// pick up bytes 4, 9, 14, 19 and shuffle into offset 3, 7, 11, 15
c0 = _mm_shuffle_epi8(a1, pck4);
b1 = _mm_and_si128( _mm_srli_epi32(c0, 3), _mm_slli_epi32(msk4, 24));
// put 3 unaligned dword 1-4, 6-9, 11-14 to vacate bytes 0-3
a1 = _mm_shuffle_epi8(a1, pckdw3);
b0 = _mm_and_si128( _mm_srli_epi32(c0, 6), _mm_slli_epi32(msk4, 16));
a0 = _mm_cvtsi32_si128( *(int *)&src[j ]);
b1 = _mm_or_si128( b0, b1); // finished decompress source bytes 4, 9, 14, 19
a0 = _mm_or_si128( a0, a1); // bytes 0-16 contain compressed bits
b0 = _mm_and_si128( _mm_srli_epi32(a0, 14), _mm_slli_epi32(msk4, 16));
b1 = _mm_or_si128( b0, b1);
b0 = _mm_and_si128( _mm_srli_epi32(a0, 17), _mm_slli_epi32(msk4, 8));
b1 = _mm_or_si128( b0, b1);
b0 = _mm_and_si128( _mm_srli_epi32(a0, 20), msk4);
b1 = _mm_or_si128( b0, b1);// b1 now full with decompressed 4-7,12-15,20-23,28-31
_mm_storeu_si128( (__m128i *) &out[i+4] , _mm_cvtepu8_epi32(b1));
b0 = _mm_and_si128( _mm_slli_epi32(a0, 9), _mm_slli_epi32(msk4, 24));
c0 = _mm_and_si128( _mm_slli_epi32(a0, 6), _mm_slli_epi32(msk4, 16));
b0 = _mm_or_si128( b0, c0);
_mm_storeu_si128( (__m128i *) &out[i+12] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 4)));
c0 = _mm_and_si128( _mm_slli_epi32(a0, 3), _mm_slli_epi32(msk4, 8));
_mm_storeu_si128( (__m128i *) &out[i+20] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 8)));
b0 = _mm_or_si128( b0, c0);
_mm_storeu_si128( (__m128i *) &out[i+28] , _mm_cvtepu8_epi32(_mm_srli_si128(b1, 12)));
c0 = _mm_and_si128( a0, msk4);
b0 = _mm_or_si128( b0, c0);// b0 now full with decompressed 0-3,8-11,16-19,24-27
Compression and decompression of integers with a dynamic range that is not a power of two can generally use a similar mask/packed-shift/stitch technique, with additional adaptation of the horizontal rearrangement of partially stitched vectors. The increase in throughput relative to using general-purpose scalar instructions will depend on the implementation and the bucket width.
When compiled with the “/O2” option on an Intel Compiler, the compression throughput can reach 6 Bytes/cycle on
Sandy Bridge microarchitecture, and the throughput varies little for working set sizes due to the streaming data
access pattern and the effectiveness of hardware prefetchers. The decompression throughput of the above example
is more than 5 Bytes/cycle at full utilization, allowing a database query engine to partition CPU utilization effectively
to allocate a small fraction for on-the-fly decompression to feed vectorized query computation.
The decompression throughput increase using a SIMD light-weight compression technique offers database architects
new degrees of freedom to relocate critical performance bottlenecks from a lower-throughput technology (disk I/O,
DRAM) to a faster pipeline.
CHAPTER 7
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
This chapter discusses rules for optimizing the single-instruction, multiple-data (SIMD) floating-point instructions
available in Intel® SSE, Intel® SSE2, Intel® SSE3, and Intel® SSE4.1. The chapter also provides examples illustrating the
optimization techniques for single-precision and double-precision SIMD floating-point applications.
For some applications (3D geometry, for example), traditional data arrangement requires some changes to use the
SIMD registers and parallel techniques fully. Traditionally, the data layout has been an array of structures (AoS). A new
data layout has been proposed to fully use the SIMD registers in such applications: a structure of arrays (SoA) resulting
in more optimized performance.
[Figure: SIMD vertical computation — corresponding elements X3..X0 and Y3..Y0 are operated on in parallel, producing X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0.]
The organization of structured data significantly impacts SIMD programming efficiency and performance. This can be
illustrated using two common type of data structure organizations:
• Array of Structure (AoS): This refers to arranging an array of data structures. Within the data structure, each
member is a scalar. This is shown in Figure 7-2. Typically, a repetitive computation sequence is applied to each
element of an array, i.e., a data structure. The computational sequence for the scalar members of the structure is
likely to be non-homogeneous within each iteration. AoS is generally associated with a horizontal computation
model.
[Figure 7-2: AoS layout — each structure holds the scalar members X, Y, Z, W of one element.]
• Structure of Array (SoA): Here, each member of the data structure is an array, and each element of the array is a scalar.
This is shown in Table 7-1. The repetitive computational sequence is applied to scalar elements, and a homogeneous
operation can easily be achieved across consecutive iterations within the same structural member. Consequently, SoA
is generally amenable to the vertical computation model (a C sketch of both layouts follows Table 7-1).
Table 7-1. SoA Form of Representing Vertices Data
Vx array   X1 X2 X3 X4 ..... Xn
Vy array   Y1 Y2 Y3 Y4 ..... Yn
Vz array   Z1 Z2 Z3 Z4 ..... Zn
Vw array   W1 W2 W3 W4 ..... Wn
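As a concrete illustration of the two layouts (the type and array names below are illustrative, not from this manual), the same vertex data can be declared in AoS or SoA form:

// AoS: one structure per vertex; the fields of one vertex are contiguous.
typedef struct { float x, y, z, w; } VertexAoS;
VertexAoS vertices_aos[1024];

// SoA: one array per component; like components of many vertices are
// contiguous, so four consecutive x values map directly onto one XMM register.
typedef struct {
    float x[1024];
    float y[1024];
    float z[1024];
    float w[1024];
} VerticesSoA;
VerticesSoA vertices_soa;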
SIMD instructions with vertical computation on the SoA arrangement can achieve higher efficiency and performance
than AoS and horizontal computation. This can be seen with dot-product operation on vectors. The dot product
operation on the SoA arrangement is shown in Figure 7-3.
[Figure 7-3: dot product on SoA data; (X1..X4)*Fx + (Y1..Y4)*Fy + (Z1..Z4)*Fz + (W1..W4)*Fw = (R1..R4).]
Example 7-1 shows how one result would be computed using seven instructions if the data were organized as AoS and
only SSE were used; four results would require 28 instructions.
Now consider the case where the data is organized as SoA. Example 7-2 demonstrates how four results are computed
using five instructions.
Example 7-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
mulps ; x*x' for all 4 x-components of 4 vertices
mulps ; y*y' for all 4 y-components of 4 vertices
mulps ; z*z' for all 4 z-components of 4 vertices
addps ; x*x' + y*y'
addps ; x*x'+y*y'+z*z'
For the most efficient use of the four component-wide registers, reorganizing the data into the SoA format yields
increased throughput and hence much better performance for the instructions used.
This simple example shows that vertical computation can yield 100% use of the available SIMD registers to produce
four results. Note that results may vary for other situations. Suppose the data structures are represented in a format
that is not “friendly” to vertical computation. In that case, it can be rearranged “on the fly” to facilitate better
utilization of the SIMD registers. This operation is referred to as a “swizzling” operation. The reverse operation is
referred to as “deswizzling.”
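As a hedged illustration of an AoS-to-SoA swizzle (this is not the code of Example 7-3; the function and array names are assumptions), the _MM_TRANSPOSE4_PS macro from <xmmintrin.h> performs the 4x4 transpose at the heart of the operation:

#include <xmmintrin.h>

// Swizzle four AoS vertices (x, y, z, w each) into SoA registers:
// *vx = x0..x3, *vy = y0..y3, *vz = z0..z3, *vw = w0..w3.
void swizzle4(const float *aos /* 16 floats: 4 vertices */,
              __m128 *vx, __m128 *vy, __m128 *vz, __m128 *vw)
{
    __m128 r0 = _mm_loadu_ps(aos + 0);   // x0 y0 z0 w0
    __m128 r1 = _mm_loadu_ps(aos + 4);   // x1 y1 z1 w1
    __m128 r2 = _mm_loadu_ps(aos + 8);   // x2 y2 z2 w2
    __m128 r3 = _mm_loadu_ps(aos + 12);  // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // 4x4 transpose: AoS -> SoA
    *vx = r0; *vy = r1; *vz = r2; *vw = r3;
}

Deswizzling is the same transpose applied in the opposite direction, from component registers back to interleaved structures.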
Example 7-4 shows a similar data-swizzling algorithm using SIMD instructions in the integer domain.
The technique in Example 7-3 (loading 16 bytes, then using SHUFPS and copying halves of XMM registers) is preferable
to the alternative of loading halves of each vector using MOVLPS/MOVHPS on newer microarchitectures.
This is because loading 8 bytes with MOVLPS/MOVHPS can create a code dependency and reduce the throughput of
the execution engine.
The performance trade-off between Example 7-3 and Example 7-4 often depends on the characteristics of each
microarchitecture. For example, in Intel Core microarchitecture, executing SHUFPS tends to be slower than a PUNPCKxxx
instruction. In Enhanced Intel Core microarchitecture, SHUFPS and PUNPCKxxx instructions execute with one-cycle
throughput due to the 128-bit shuffle execution unit. Another important consideration is that PUNPCKxxx can execute
on only one port, whereas MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both
techniques improves on Intel Core microarchitecture relative to previous microarchitectures because three ports can
execute SIMD instructions, and improves further on Enhanced Intel Core microarchitecture due to the 128-bit
shuffle unit.
Example 7-6 shows a similar deswizzle function using SIMD integer instructions. Both techniques demonstrate
loading 16 bytes and performing horizontal data movement in registers. This approach is likely more efficient than
alternative techniques of storing 8-byte halves of XMM registers using MOVLPS and MOVHPS.
[Figure: horizontal add using SHUFPS and ADDPS; xmm0..xmm3 hold A1..A4, B1..B4, C1..C4, D1..D4, which are shuffled and added pairwise to produce the four horizontal sums.]
Intel SSE4.1 provides additional enhancement with instructions capable of directly evaluating dot-product operations
of vectors of 2, 3, or 4 components.
1. "IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84,
22 July 2019, doi: 10.1109/IEEESTD.2019.8766229. https://ieeexplore.ieee.org/document/8766229
Example 7-9, Example 7-10, and Example 7-11 compare the basic code sequence to compute one dot-product result
for a pair of vectors.
The selection of an optimal sequence, in conjunction with an application’s memory access patterns, may favor
different approaches. For example, if each dot-product result is immediately consumed by additional computational
sequences, the relative latency of these approaches may matter more. If dot products can be computed for an array of
vectors and kept in the cache for subsequent computations, then the better choice may depend on the relative
throughput of the instruction sequences.
In Intel Core microarchitecture, Example 7-10 has higher throughput than Example 7-9. However, due to the relatively
long latency of HADDPS, Example 7-10 is slightly slower than Example 7-9 in latency.
In Enhanced Intel Core microarchitecture, Example 7-11 is faster than Example 7-9 and Example 7-10 in both latency
and throughput. Although the latency of DPPS is also relatively long, it is compensated by the reduction in the number
of instructions in Example 7-11 needed to do the same amount of work.
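For reference, the following hedged intrinsics sketch (not the manual's Examples 7-9 through 7-11; function names are illustrative) contrasts the SSE3 HADDPS approach and the SSE4.1 DPPS approach for one dot product of two 4-component vectors:

#include <smmintrin.h>   // SSE4.1; also provides the SSE2/SSE3 intrinsics used below

// SSE3-style: multiply, then two horizontal adds.
static inline float dot4_haddps(__m128 a, __m128 b)
{
    __m128 t = _mm_mul_ps(a, b);   // a0*b0, a1*b1, a2*b2, a3*b3
    t = _mm_hadd_ps(t, t);         // pairwise sums
    t = _mm_hadd_ps(t, t);         // full sum replicated in every element
    return _mm_cvtss_f32(t);
}

// SSE4.1-style: a single DPPS evaluates the dot product and broadcasts it.
static inline float dot4_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xFF));
}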
Unrolling can further improve the throughput of each of the three dot-product implementations. Example 7-12 shows
two unrolled versions using the basic SSE2 and SSE3 sequences. The SSE4.1 version can also be unrolled, using
INSERTPS to pack four dot-product results.
for (i=0;i<CNT;i++)
{ float size = nodes[i].vec.dot();
if (size != 0.0)
{ size = 1.0f/sqrtf(size); }
else
{ size = 0.0; }
nodes[i].vec.x *= size;
nodes[i].vec.y *= size;
nodes[i].vec.z *= size;
}
Example 7-14 shows an assembly sequence that normalizes the x, y, z components of a vector.
Vec3 *p = &nodes[i].vec;
__asm
{ mov rax, p
xorps xmm2, xmm2
movups xmm1, [rax] // loads the (x, y, z) of input vector plus x of next vector
movaps xmm7, xmm1 // save a copy of data from memory (to restore the unnormalized value)
movaps xmm5, _mask // mask to select (x, y, z) values from an xmm register to normalize
andps xmm1, xmm5 // mask 1st 3 elements
movaps xmm6, xmm1 // save a copy of (x, y, z) to compute normalized vector later
mulps xmm1,xmm1 // 0, z*z, y*y, x*x
pshufd xmm3, xmm1, 0x1b // x*x, y*y, z*z, 0
addps xmm1, xmm3 // x*x, z*z+y*y, z*z+y*y, x*x
pshufd xmm3, xmm1, 0x41 // z*z+y*y, x*x, x*x, z*z+y*y
addps xmm1, xmm3 // x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z
comisd xmm1, xmm2 // compare size to 0
jz zero
movaps xmm3, xmm4 // preloaded unitary vector (1.0, 1.0, 1.0, 1.0)
sqrtps xmm1, xmm1
divps xmm3, xmm1
jmp store
zero:
movaps xmm3, xmm2
store:
mulps xmm3, xmm6 //normalize the vector in the lower 3 elements
andnps xmm5, xmm7 // mask off the lower 3 elements to keep the un-normalized value
orps xmm3, xmm5 // order the un-normalized component after the normalized vector
movaps [rax], xmm3 // writes normalized x, y, z; followed by unmodified value
Example 7-15 shows an assembly sequence using SSE4.1 to normalize the x, y, z components of a vector.
Vec3 *p = &nodes[i].vec;
__asm
{ mov rax, p
xorps xmm2, xmm2
movups xmm1, [rax] // loads the (x, y, z) of input vector plus x of next vector
movaps xmm7, xmm1 // save a copy of data from memory
dpps xmm1, xmm1, 0x7f // x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z, x*x+y*y+z*z
comisd xmm1, xmm2 // compare size to 0
jz zero
movaps xmm3, xmm4 // preloaded unitary vector (1.0, 1.0, 1.0, 1.0)
sqrtps xmm1, xmm1
divps xmm3, xmm1
jmp store
zero:
movaps xmm3, xmm2
store:
mulps xmm3, xmm7 // normalize the vector in the lower 3 elements (xmm7 holds the original data)
blendps xmm3, xmm7, 0x8 // copy the un-normalized component next to the normalized vector
movaps [rax], xmm3
In Example 7-14 and Example 7-15, the throughput of these instruction sequences is basically limited by the long-
latency DIVPS and SQRTPS instructions. In Example 7-15, DPPS replaces eight SSE2 instructions to evaluate and
broadcast the dot-product result to the four elements of an XMM register. This can make Example 7-15 relatively
faster than Example 7-14.
Example 7-16 shows the vector-matrix data layout in AOS, where the input and output vectors are stored as an array of
structures.
Example 7-17 shows an example using HADDPS and MULPS to perform vector-matrix multiplication with the data laid
out in AOS. After three HADDPS instructions complete the summation of each output vector component, the output
components are arranged in AOS.
Example 7-18 shows an example using DPPS to perform vector-matrix multiplication in AOS.
Example 7-17 and Example 7-18 both work with the AOS data layout, using different horizontal processing techniques
provided by SSE3 and SSE4.1. The effectiveness of either technique will vary, depending on the degree of exposure of
long-latency instructions in the inner loop, the overhead/efficiency of data movement, and the latency of HADDPS
vs. DPPS.
On processors that support both HADDPS and DPPS, the choice between either technique may depend on
application-specific considerations. If the output vectors are written back to memory directly in a batch situation,
Example 7-17 may be preferable over Example 7-18, because the latency of DPPS is long and storing each output
vector component individually is less than ideal for storing an array of vectors.
There may be partially-vectorizable situations in which each output vector component is consumed immediately by
other non-vectorizable computations. In such cases, using DPPS to produce an individual component may be more
suitable than dispersing the packed output vector produced by three HADDPS instructions as in Example 7-17.
Vector-matrix multiplication is shown in Example 7-19. Each matrix element is replicated four times to minimize the
data-movement overhead for producing packed results.
The corresponding vector-matrix multiply example in SOA (unrolled for four iterations of vectors) is shown in
Example 7-20.
CHAPTER 8
INT8 DEEP LEARNING INFERENCE
This chapter describes INT8 as a data type for deep learning inference on Intel technology. It covers both
Intel® AVX-512 implementations and implementations using the new Intel® DL Boost instructions.
The chapter is divided into several parts. The first part introduces INT8, and more specifically the Intel DL Boost
instructions as the core data type and instructions for use in ML workloads. The second part discusses general
methodologies and guidelines for efficient inference computation. The third part discusses optimizations specific to
CNNs and the final part discusses optimizations specific to LSTM/RNNs.
When relevant, examples are provided with and without the new Intel DL Boost instruction set. In many cases
(quantization, memory layout) there are steps that can be taken offline and steps that must be taken at runtime; we
try to state clearly when each step is taken.
Example 8-1 uses the VPDPBUSD instruction to perform faster matrix multiplication of two byte matrices, SIGNAL and
WEIGHT. Assuming the source matrices have dimensions MxK and KxN, respectively, and are given in row-major
order, the source matrices in the example have the layouts defined below.
• Matrix signal[K/64][M][64], built out of matrix SIGNAL[M][K] by the following procedure:
FOR m = 0 … M-1
FOR k = 0 … K-1
signal[k/64][m][k%64] = SIGNAL[m][k]
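A minimal C sketch of this offline re-layout step (function and parameter names are illustrative), assuming K is a multiple of 64:

#include <stddef.h>
#include <stdint.h>

// Re-layout SIGNAL[M][K] (row-major) into signal[K/64][M][64] so that 64
// consecutive K-values of one row land in one contiguous 64-byte block,
// matching the access pattern of the VPDPBUSD-based kernel.
void relayout_signal(const uint8_t *SIGNAL, uint8_t *signal, int M, int K)
{
    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            // signal[k/64][m][k%64] = SIGNAL[m][k]
            signal[(size_t)(k / 64) * M * 64 + (size_t)m * 64 + (k % 64)] =
                SIGNAL[(size_t)m * K + k];
        }
    }
}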
NOTES:
1. Client architectures based on processors that support Intel® DL Boost, such as processors based on Ice Lake
microarchitecture, will only see a 2x speedup. This is because VPADDD can exploit the vector SIMD unit on port 5,
so the baseline takes 2 cycles per 64 MACs (peak) vs. 1 cycle with Intel® DL Boost.
1. See the Intel AI & Machine Learning Landing Page for additional details.
8.3.2 QUANTIZATION
Quantization is the process of reducing the size of the data type used for activations and weights, typically from float
to int8/uint8¹.
Example 1-1
1. Please see: Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and the AI & Machine Learn-
ing Landing page.
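As a hedged illustration of the basic idea (a simple asymmetric min/max scheme; the scheme used by a particular framework may differ, and the function name is illustrative), a float buffer can be mapped to uint8 as follows:

#include <math.h>
#include <stdint.h>

// Quantize float values in [min_val, max_val] to uint8:
//   x_q = round(x / scale) + zero_point, clamped to [0, 255].
void quantize_u8(const float *x, uint8_t *x_q, int n,
                 float min_val, float max_val)
{
    float scale = (max_val - min_val) / 255.0f;
    int zero_point = (int)roundf(-min_val / scale);
    for (int i = 0; i < n; i++) {
        int q = (int)roundf(x[i] / scale) + zero_point;
        if (q < 0)   q = 0;     // clamp to the uint8 range
        if (q > 255) q = 255;
        x_q[i] = (uint8_t)q;
    }
}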
8.3.3.3 NUMA
The Cascade Lake Advanced Performance 2-Socket server contains two Cascade Lake Advanced Performance
packages where each of the packages is made of two processor dies connected via one Intel® Ultra Path Interconnect
(Intel® UPI) link, creating a total of four NUMA domains. In such a setup it is crucial to maintain a separate DL process
per NUMA domain/die. The same applies to 2-socket configurations of previous-generation product lines with
multiple NUMA domains.
8.4 CNNS
Memory Layout
To present the inputs in matrix form we flatten the spatial dimension to be the rows of A (henceforth called the M
dimension), and the channel (IFM) dimension to be the columns of A (henceforth called the K dimension). Similarly
the spatial dimension of the outputs becomes the rows of C, and the channel (OFM) dimension becomes the columns
of C (henceforth called the N dimension). In Figure 8-2 there are six channels of size 5x5 in the inputs which are
transformed by the convolutional layer to four channels of size 3x3.
Standard 2D convolution uses #OFMs different 3D kernels of size KH x KW x #IFMs, for each target OFM, where KH,
KW, #IFMs and #OFMs are the height and width of the convolutional kernel, number of input channels and number of
output channels respectively.
The weights are transformed into KH x KW matrices of size #IFMs x #OFMs (see Figure 8-3).
Matrix Multiplication
Matrix multiplication is performed in a standard manner (see Chapter 17, “Software Optimization for Intel® AVX-512
Instructions”).
Blocking
Since the matrices in question are generally large, one needs to traverse the matrices iteratively, while accumulating
the results of the multiplication. Hence, one needs to allocate several registers (accumulators) to accumulate the
results, and optionally have several registers for temporary caching of the A and B matrices, to enable reuse.
The order in which the matrices are traversed can have a monumental effect on the overall performance. In general,
it is preferable to traverse the entire K dimension before moving on to the next element in the M or N dimension.
When the K dimension is exhausted, the results in the accumulators are final (see discussion below about partial
results). They can be fed to the post-convolution stage (See Post Convolution) and stored to the output location. If,
however, the K dimension is not exhausted, before a move in the M or N dimension is initiated, the results in the
accumulators are partial, i.e., they are the result of multiplication of some columns of A by some rows of B. These
results have to be saved in an auxiliary buffer to free the accumulators for the new chunk of data. When we return to
these M, N coordinates, these results have to be loaded from the buffer. Thus we perform additional store(s) and
load(s) when compared to the exhaustive-K scenario. Furthermore, it is generally advisable to limit the advancement
in the M or N dimension. Generally speaking, best results were obtained when the Accumulator K cache level of
matrix B (Figure 8-6) was in the DCU, and when the accumulative size of the cache blocks (Figure 8-7) was as large as
possible while still in MLC. However, there are cases where the best results are achieved when the accumulative size
is much larger (even up to 3x of the MLC). These hyper-parameters are usually found by experimentation.
The “exhaustive-K” guideline does not yield optimal results in all cases, and the optimal extent of M,N locality should
be determined on a case-by-case basis. We outline a simple yet effective way of structuring the control flow of the
convolution process to accommodate the variety of scenarios abundant in modern CNNs.
sixteen times in the ZMM register. Then a ZMM register is loaded from the B buffer and the VNNI instruction is
executed on the two registers.
// blocking params
int n_accum;
int m_accum;
};
#define B_MATRIX_BLOCK 4
#define C_MATRIX_BLOCK 16
#define a_buffer_vnni_at(d, a, h, w, k) \
(a[((h) * (d)->w_dim * (d)->k_dim) + ((w) * (d)->k_dim) + (k)])
#define b_buffer_vnni_at(d, b, k_h, k_w, k_d, n, k_m) \
(b[((k_h) * (d)->kw * ((d)->k_dim / B_MATRIX_BLOCK) * (d)->n_dim * \
B_MATRIX_BLOCK) + \
((k_w) * ((d)->k_dim / B_MATRIX_BLOCK) * (d)->n_dim * B_MATRIX_BLOCK) + \
((k_d) * (d)->n_dim * B_MATRIX_BLOCK) + ((n)*B_MATRIX_BLOCK) + (k_m)])
void direct_conv_opt(const direct_conv_dims_t *dims, const char *a_buffer_vnni,
const char *b_buffer_vnni, int32_t *c_buffer_vnni)
{
int m_dim = OUT_H(dims) * OUT_W(dims);
__m512i *cvec =
_mm_malloc(dims->n_accum * dims->m_accum * sizeof(*cvec), 64);
In this code we allocate M_ACCUM (4) ZMM registers, zmm8-zmm11, for IFM values, and N_ACCUM (2) ZMM registers,
zmm12-zmm13, for weights.
In the beginning the accumulators must be zeroed out. Then the entire K dimension (#IFMs=32) must be traversed,
each iteration operating on four consecutive IFMs. The convolution consists of a series of 4-byte broadcasts of IFM
data, 64-byte loads of weight data, and multiply-accumulate operations. Due to the large IFM data
overlap between different kh,kw values, the IFM data can be efficiently reused and the number of data loads
significantly lowered.
Figure 8-8. Standard vs. Optimized vs. Low OFM Optimized Data Layouts¹
NOTES:
1. The 4x16 blocks of the Low OFM optimization are created on the fly and used only once.
Example 1-2
# IFM_W % 16 == 0
# NUM_OFMS = 3
# NUM_IFMS = 64
# dqfs - array of dequantization factors for the down convert
__m512i gather_indices = _mm512_setr_epi32(0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60);
__m512 dqf_broadcast[NUM_OFMS];
#pragma unroll(NUM_OFMS)
for (int ofm = 0 ; ofm < NUM_OFMS ; ofm++) {
dqf_broadcast[ofm] = _mm512_set1_ps(dqfs[ofm]);
}
for (int h = 0 ; h < IFM_H ; h++) {
int src_line_offset = h * IFM_W * IFMBlock;
int w = 0;
int src_w_offset = src_line_offset;
for ( ; w < IFM_W ; w += 16) {
__m512i res_i32[NUM_OFMS] = { 0 };
// Convolve 4x16 OFMs by reorganizing on the fly.
for (int ifm = 0 ; ifm < NUM_IFMS ; ifm += 4) {
int src_block_offset = ifm & 0xf;
int src_ifm_index = ifm >> 4;
size_t src_ifm_offset = src_w_offset + src_ifm_index * src_ifm_size + src_block_offset;
__m512i ivec = _mm512_i32gather_epi32(gather_indices, input + src_ifm_offset, 4);
#pragma unroll(NUM_OFMS)
for (int ofm = 0 ; ofm < NUM_OFMS ; ofm++) {
int weight_offset = (ofm * NUM_IFMS + ifm) * 16;
__m512i wvec = _mm512_load_si512(weights_reorged + weight_offset);
res_i32[ofm] = _mm512_dpbusd_epi32(res_i32[ofm], ivec, wvec);
}
}
// Down convert and store results in native layout.
#pragma unroll(NUM_OFMS)
for (int ofm = 0 ; ofm < NUM_OFMS ; ofm++) {
__m512 res_f32 = _mm512_cvtepi32_ps(res_i32[ofm]);
res_f32 = _mm512_mul_ps(res_f32, dqf_broadcast[ofm]);
size_t output_offset = ofm * ofm_size + h * IFM_W + w;
_mm512_store_ps(output + output_offset, res_f32);
}
src_w_offset += 16 * IFMBlock;
}
}
}
// add bias if there is one and then dequantize + requantize in a single step
if (bias) {
resf = _mm512_fmadd_ps(resf,
_mm512_load_ps(dqfs + OFMChannelOffset),
_mm512_load_ps((__m512*) (bias + OFMChannelOffset)));
} else {
resf = _mm512_mul_ps(resf,
_mm512_load_ps(dqfs + OFMChannelOffset));
}
#if ELTWISE
/* fused Eltwise ops */
#else
# if !RELU
res = _mm512_add_epi32(res, _mm512_set1_epi32(128));
# endif
res8 = _mm512_cvtusepi32_epi8(res);
#endif // ELTWISE
#if POOLING
/* fused pooling ops */
#endif
8.4.2.2 ReLU
ReLU is implemented as a max with a zero register, with negligible overhead, as it is fused with the convolution.
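For example, with the INT32 accumulators still in a ZMM register, the fused ReLU can be a single instruction (a sketch; the function name is illustrative):

#include <immintrin.h>

// Fused ReLU on the INT32 accumulators: max(res, 0) before down-conversion.
static inline __m512i fused_relu_epi32(__m512i res)
{
    return _mm512_max_epi32(res, _mm512_setzero_si512());
}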
8.4.2.3 EltWise
Element-wise operations are usually easier to fuse into the convolution step because they operate directly on the
accumulators just before the final result is saved. Note, however, that the quantization factors of the different input
layers are usually not the same, so the inputs must first be dequantized to f32, operated on, and then quantized again
(we show an optimization for this step in the vectorized code).
The following examples show the eltwise operation required, given the data types of the inputs and outputs. In all the
examples it is assumed that the data from the convolution operation is the INT32 data returned by the
VPDPBUSD operation, and that the quantized output must be uint8, even though in some cases the unquantized
output could be negative. See Section 8.3.2 to understand how quantization to uint8 works with negative values.
The following optimized code shows eltwise implementations for several eltwise use cases, assuming that “dest”
points to a vector of 16 OFMs belonging to the same pixel in the output. In principle we need to dequantize the
eltwise data and the convolution data, do the addition, and then requantize, as in the following equation.
However, we can preprocess the factors offline (the operations in square brackets) so that we have only two
multiplications online.
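The equation referenced above is not reproduced in this text; a hedged restatement (the factor names below are illustrative, not the manual's) is:

out_u8 = Qout * (conv_s32 * DQconv + elt_u8 * DQelt)
       = conv_s32 * [Qout * DQconv] + elt_u8 * [Qout * DQelt]

where DQconv and DQelt are the dequantization factors of the convolution and eltwise inputs, Qout is the output quantization factor, and the bracketed products are the factors that can be folded offline.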
if (!relu) {
res8 = _mm_add_epi8(res8, _mm_set1_epi8(-128)); //hack to add 128
}
8.4.2.4 Pooling
Pooling layers are not always simple to fuse into the convolution step, but in some cases the fusion is easy. The average
pooling of the last convolutional layer of Inception ResNet 50, for example, amounts to averaging all the 8x8 pixels of
every OFM channel into a single value, thus emitting a single value per OFM. Such an operation is easy to fuse
because it behaves the same for every pixel.
Example 8-8. 8x8 Average Pooling with Stride 1 of 8x8 Layers (Contd.)
__m512 res_tmp_ps = _mm512_add_ps(resf, prev_val);
_mm512_store_ps((__m512 *) pool_dest, res_tmp_ps);
The following unfused vectorized code can be used to do max and average pooling. In the example below, the pooling
step can also adjust the input quantization range to the output quantization range. This is usually necessary before a
concat layer, which is implemented as a no-op; this means the output quantization range of all the
concatenated layers must be the same.
}
if (AVERAGE) {
uint8_t kernel_size_u8 = kernel_h_ * kernel_w_;
__m256i broadcast_kernel_size = _mm256_set1_epi16(kernel_size_u8);
res_pixel=_mm256_div_epu16(res_pixel,broadcast_kernel_size);
}
// compute final offset and save
uint8_t * total_offset =
top_data + output_image_offset + layer_offset + block_offset_out +
y_ofsset_out + x_ofsset_out ;//+ vect_idx;
_mm_store_si128((__m128i*) total_offset, _mm256_cvtusepi16_epi8(res_pixel));
}
}
}
Because of the memory layout of the vectorized directConv, it is easy to fuse the pixel shuffler layer to the
convolution. The only change that is required is to save the result of the convolution in the correct place in the output.
Implementing the activation part as full-precision scalar or SVML-based vectorized code may be slow. The alternative
is to use approximations, which provide good performance. One approach for approximating transcendental
functions is to use piece-wise polynomial approximations.
1. Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M Schwartz, and John Makhoul. 2014. Fast
and robust neural network joint models for statistical machine translation. In ACL (1). Citeseer, pages 1370-1380.
// clang-format on
/*
* Approximate
*/
__m512 func_p0 = _mm512_permutexvar_ps(indices, sigmoid_coeff0);
__m512 func_p1 = _mm512_permutexvar_ps(indices, sigmoid_coeff1);
__m512 func_p2 = _mm512_permutexvar_ps(indices, sigmoid_coeff2);
func = _mm512_fmadd_ps(abs_arg, func_p2, func_p1);
func = _mm512_fmadd_ps(abs_arg, func, func_p0);
While the minimax polynomial approximation may show the best accuracy on a layer-by-layer basis, the end-to-end
accuracy may suffer in some topologies (notably NMT). In such cases a different approach might be better. The
following approximation uses the fact that
Where
Different RNN objects (e.g., sentences) can require very different computation effort; consider a short sentence vs. a
long sentence. When batching multiple objects together, it is important to take this into consideration to avoid
unnecessary computation. In NMT, for example (see Figure 8-9), if we ensure sentences are ordered by length, it is
easy to adapt each iteration to the actual number of active sentences.
In Neural Machine Translation (NMT), a significant amount of time is spent searching for the current top
BEAM_WIDTH attention scores out of BEAM_WIDTH*VOCAB_SIZE values, which could be very large. Please see the
following for more information:
index = 0
pxor ZMM4
while index < MAX_DATA
vbroadcastss ZMM2, array[index]
VPCMPPS K1,ZMM0,ZMM2, _CMP_LT_OQ
KTESTW K1,K1
JZ … // K1 == 0 so we can just put new score first
//if K1!=0
VPERMPS ZMM0(k1),ZMM0
VPERMPS ZMM1(k1),ZMM0
KSHIFT k2,k1,1
KXOR k3,k2,k1
VPBLENDMPS k3, ZMM0,ZMM2
VPBLENDMD k3, ZMM1,ZMM4
VPADD ZMM4, 1
add index, 1
CHAPTER 9
OPTIMIZING CACHE USAGE
While processor speed has increased, memory access speed has increased at a slower pace. The resulting disparity
has made it important to tune applications in one of two ways:
1. Ensure that a majority of data accesses are fulfilled from processor caches.
2. Effectively mask memory latency to utilize peak memory bandwidth as much as possible.
Hardware prefetching mechanisms are enhancements in microarchitecture to facilitate the latter aspect, and will be
most effective when combined with software tuning. The performance of most applications can be considerably
improved if the data required can be fetched from the processor caches or if memory traffic can take advantage of
hardware prefetching effectively.
Standard techniques to bring data into the processor before it is needed involve additional programming which can
be difficult to implement and may require special steps to prevent performance degradation. Streaming SIMD
Extensions addressed this issue by providing various prefetch instructions.
Streaming SIMD Extensions also introduced non-temporal store instructions. SSE2 extends this support to new
data types and also introduces non-temporal store support for the 32-bit integer registers.
This chapter focuses on:
• Hardware Prefetch Mechanism, Software Prefetch and Cacheability Instructions — Discusses microarchitectural
feature and instructions that allow you to affect data caching in an application.
• Memory Optimization Using Hardware Prefetching, Software Prefetch and Cacheability Instructions
— Discusses techniques for implementing memory optimizations using the above instructions.
• Using deterministic cache parameters to manage cache hierarchy.
— Single-pass, or unlayered execution passes a single data element through an entire computation pipeline.
— Multi-pass, or layered execution performs a single stage of the pipeline on a batch of data elements before
passing the entire batch on to the next stage.
— If your algorithm is single-pass, use PREFETCHNTA. If your algorithm is multi-pass, use PREFETCHT0.
• Resolve memory bank conflict issues. Minimize memory bank conflicts by applying array grouping to group
contiguously used data together or by allocating data within 4-KByte memory pages.
• Resolve cache management issues. Minimize the disturbance of temporal data held within processor’s caches by
using streaming store instructions.
• Optimize software prefetch scheduling distance:
— Far ahead enough to allow interim computations to overlap memory access time.
— Near enough that prefetched data is not replaced from the data cache.
• Use software prefetch concatenation. Arrange prefetches to avoid unnecessary prefetches at the end of an inner
loop and to prefetch the first few iterations of the inner loop inside the next outer loop.
• Minimize the number of software prefetches. Prefetch instructions are not completely free in terms of bus cycles,
machine cycles and resources; excessive usage of prefetches can adversely impact application performance.
• Interleave prefetches with computation instructions. For best performance, software prefetch instructions must
be interspersed with computational instructions in the instruction sequence (rather than clustered together).
9.3 PREFETCH
This section discusses the mechanics of the software PREFETCH instructions. In general, software prefetch
instructions should be used to supplement the practice of tuning an access pattern to suit the automatic hardware
prefetch mechanism.
PREFETCH loads either non-temporal data or temporal data in the specified cache level. This data access type and the
cache level are specified as a hint. Depending on the implementation, the instruction fetches 32 or more aligned
bytes (including the specified address byte) into the instruction-specified cache levels.
PREFETCH is implementation-specific; applications need to be tuned to each implementation to maximize
performance.
NOTE
Using the PREFETCH instruction is recommended only if data does not fit in the cache. Use of software
prefetch should be limited to memory addresses that are managed or owned within the application
context. Prefetching addresses that are not mapped to physical pages can incur a non-
deterministic performance penalty. For example, specifying a NULL pointer (0L) as the address for a
prefetch can cause long delays.
PREFETCH provides a hint to the hardware; it does not generate exceptions or faults except for a few special cases (see
Section 9.3.3). However, excessive use of PREFETCH instructions may waste memory bandwidth and result in a
performance penalty due to resource constraints.
Nevertheless, PREFETCH can lessen the overhead of memory transactions by preventing cache pollution and by using
caches and memory efficiently. This is particularly important for applications that share critical system resources,
such as the memory bus. See an example in Section 9.6.2.1.
PREFETCH is mainly designed to improve application performance by hiding memory latency in the background. If
segments of an application access data in a predictable manner (for example, using arrays with known strides), they
are good candidates for using PREFETCH to improve performance.
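As a hedged sketch of this pattern in C intrinsics (the function name, loop, and the prefetch distance PDIST are illustrative and must be tuned per platform):

#include <xmmintrin.h>

#define PDIST 16   // prefetch scheduling distance, in elements (tune per platform)

float sum_with_prefetch(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        // Hint: bring a[i + PDIST] toward the caches ahead of its use.
        // Prefetching past the end of the array is harmless; it is only a hint.
        _mm_prefetch((const char *)&a[i + PDIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}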
Use the PREFETCH instructions in:
• Predictable memory access patterns.
• Time-consuming innermost loops.
• Locations where the execution pipeline may stall if data is unavailable.
NOTE
At the time of PREFETCH, if data is already found in a cache level that is closer to the processor than
the cache level specified by the instruction, no data movement occurs.
The implementation details of the prefetch hint instructions vary across different microarchitectures. A summary is
given in the tables below.
Table 9-1. Implementation Details of Prefetch Hint Instructions for Intel® Core™ Duo, Intel® Core™ 2, and Intel® Atom® Processors
Instruction     Fill L1 Cache?   Fill L2 Cache?
PrefetchT0      Yes              Yes
PrefetchT1      No               Yes
PrefetchT2      No               Yes
PrefetchNTA     Yes              No
NOTES:
1. PrefetchW is only available on Intel Atom processors; not Intel Core Duo or Intel Core 2 processors.
Table 9-2. Details of Prefetch Hint Instructions for Processors Based on Skylake Microarchitectures
Instruction     Fill L1 Cache?   Fill L2 Cache?   Fill L3 Cache?
PrefetchT0      Yes              Yes              No
PrefetchT1      No               Yes              No
PrefetchT2      No               Yes              No
PrefetchNTA     Yes              No               No
Table 9-3. Implementation Details of Prefetch Hint Instructions for Intel® Xeon® Scalable Family of Processors
Instruction     Fill L1 Cache?   Fill L2 Cache?   Fill L3 Cache?   Fill Snoop Filter?
PrefetchT0      Yes              Yes              Yes              Yes
PrefetchT1 (1)  No               Yes              Yes              Yes
NOTES:
1. There is no implementation difference between PrefetchT1/T2 on any microarchitecture.
2. For PrefetchNTA, the fill into the L3 cache or Snoop Filter may not be placed into the Most Recently Used position
and may be chosen for replacement faster than a regular cache fill.
3. PrefetchW is only available on processors based on Broadwell/Skylake microarchitecture; it is unavailable on pro-
cessors based on Haswell microarchitecture or earlier microarchitectures.
9.4.1.1 Fencing
Because streaming stores are weakly ordered, a fencing operation is required to ensure that the stored data is flushed
from the processor to memory. Failure to use an appropriate fence may result in data being “trapped” within the
processor and will prevent visibility of this data by other processors or system agents.
WC stores require software to ensure coherence of data by performing the fencing operation. See Section 9.4.5.
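A minimal sketch of the fencing requirement with streaming stores (the function and buffer names are illustrative):

#include <xmmintrin.h>

// Write a buffer with non-temporal stores, then fence so the data is globally
// visible before any dependent flag or message is published.
void nt_fill(float *dst, int n /* multiple of 4; dst 16-byte aligned */)
{
    __m128 zero = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        _mm_stream_ps(&dst[i], zero);   // weakly-ordered streaming store
    _mm_sfence();                       // drain WC buffers; order the stores
}

Publishing a ready flag before the SFENCE would risk a consumer observing the flag while the streamed data is still trapped in a write-combining buffer.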
9.4.1.4 Write-Combining
Generally, WC semantics require software to ensure coherence with respect to other processors and other system
agents (such as graphics cards). Appropriate use of synchronization and a fencing operation must be performed for
producer-consumer usage models (see Section 9.4.5). Fencing ensures that all system agents have global visibility of
the stored data. For instance, failure to fence may result in a written cache line staying within a processor, and the line
would not be visible to other agents.
For processors which implement non-temporal stores by updating data in-place that already resides in the cache
hierarchy, the destination region should also be mapped as WC. Otherwise, if mapped as WB or WT, there is a
potential for speculative processor reads to bring the data into the caches. In such a case, non-temporal stores would
then update in place and data would not be flushed from the processor by a subsequent fencing operation.
The memory type visible on the bus in the presence of memory type aliasing is implementation-specific. As one
example, the memory type written to the bus may reflect the memory type for the first store to the line, as seen in
program order. Other alternatives are possible. This behavior should be considered reserved and dependence on the
behavior of any particular implementation risks future incompatibility.
NOTE
Failure to map the region as WC may allow the line to be speculatively read into the processor
caches (via the wrong path of a mispredicted branch).
If the region is not mapped as WC, the streaming store might update in place in the cache, and a subsequent SFENCE
would not result in the data being written to system memory. Explicitly mapping the region as WC in this case ensures
that any data read from this region will not be placed in the processor’s caches. Without this, a read of this memory
location by a non-coherent I/O device could return incorrect/out-of-date results.
For a processor which solely implements Case 2 (Section 9.4.1.3), a streaming store can be used in this non-coherent
domain without requiring the memory region to also be mapped as WB, since any cached data will be flushed to
memory by the streaming store.
1. Memory order recommendation of CLFLUSH in previous manuals had required software to add MFENCE after
CLFLUSH. MFENCE is not required following CLFLUSH as all processors implementing the CLFLUSH instruction
also order it relative to the other operations enumerated above.
The throughput characteristics of using CLFLUSH to flush cache lines can vary significantly depending on several
factors. In general, using CLFLUSH back-to-back to flush a large number of cache lines will incur a larger cost per
cache line than flushing a moderately-sized buffer (e.g., less than 4 KB); the reduction in CLFLUSH throughput can be
an order of magnitude. Flushing cache lines in the modified state is more costly than flushing cache lines in non-
modified states.
User/Source Coding Rule 13. If CLFLUSHOPT is available, use CLFLUSHOPT over CLFLUSH and use SFENCE to guard
CLFLUSHOPT to ensure that write order is globally observed. If CLFLUSHOPT is unavailable, consider flushing large
buffers with CLFLUSH in smaller chunks of less than 4 KB.
Example 9-2 gives equivalent assembly sequences for flushing cache lines using CLFLUSH or CLFLUSHOPT. The
corresponding sequences in C are:
CLFLUSH:
for (i = 0; i < iSizeOfBufferToFlush; i += CACHE_LINE_SIZE) _mm_clflush( &pBufferToFlush[ i ] );
CLFLUSHOPT:
_mm_sfence();
for (i = 0; i < iSizeOfBufferToFlush; i += CACHE_LINE_SIZE) _mm_clflushopt( &pBufferToFlush[ i ] );
_mm_sfence();
* If imposing memory ordering rules is important for the application, then executing CLFLUSHOPT instructions
should be guarded with SFENCE instructions to guarantee the order of memory writes. As in the example above,
such a solution still performs better than using the CLFLUSH instruction, and its performance is identical to
unguarded CLFLUSHOPT for buffers of 2048 bytes and larger.
— The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less
than the trigger threshold distance and close to 64 bytes.
• There is a start-up penalty before the prefetcher triggers, and there may be fetches after an array finishes. For short
arrays, this overhead can reduce effectiveness.
— The hardware prefetcher requires a couple of misses before it starts operating.
— Hardware prefetching generates a request for data beyond the end of an array, which is not utilized. This
behavior wastes bus bandwidth. In addition, this behavior results in a start-up penalty when fetching the
beginning of the next array. Software prefetching may recognize and handle these cases.
• It will not prefetch across a 4-KByte page boundary. A program has to initiate demand loads for the new page
before the hardware prefetcher starts prefetching from the new page.
• The hardware prefetcher may consume extra system bandwidth if the application’s memory traffic has significant
portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-
stride memory traffic).
• The effectiveness with existing applications depends on the proportions of small-stride versus large-stride
accesses in the application’s memory traffic. An application with a preponderance of small-stride memory traffic
with good temporal locality will benefit greatly from the automatic hardware prefetcher.
• In some situations, memory traffic consisting of a preponderance of large-stride cache misses can be transformed
by re-arrangement of data access sequences to alter the concentration of small-stride cache misses at the
expense of large-stride cache misses to take advantage of the automatic hardware prefetcher.
Example 9-3. Populating an Array for Circular Pointer Chasing with Constant Stride
register char ** p;
char *next; // Populating pArray for circular pointer
// chasing with constant access stride
// p = (char **) *p; loads a value pointing to next load
p = (char **)&pArray;
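// A hedged completion of the population loop (illustrative, not the manual's
// exact code); 'num_nodes' and 'stride' (in bytes, at least sizeof(char *))
// are assumed to be defined elsewhere.
for (int i = 0; i < num_nodes; i++) {
    next = (char *)&pArray + ((i + 1) % num_nodes) * stride;
    *p = next;          // node i stores the address of node (i+1) mod num_nodes
    p = (char **)next;  // follow the link so the next iteration fills that node
}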
The effective latency reduction for several microarchitecture implementations is shown in Figure 9-2. For a constant-
stride access pattern, the benefit of the automatic hardware prefetcher begins at half the trigger threshold distance
and reaches maximum benefit when the cache-miss stride is 64 bytes.
[Figure 9-2. Upper bound of pointer-chasing latency reduction: effective latency reduction (0% to 120%) plotted against cache-miss stride in bytes, for several processor families/models (Family 15 Models 0-2, Models 3-4, Model 6; Family 6 Models 13, 14).]
[Figure: execution timeline issuing loads for vertex data without software prefetch.]
[Figure 9-4: execution pipeline processing vertices n-2 through n+1 while prefetches for vertex n, Vn+1, and Vn+2 are issued; the front-side bus overlaps the memory latency for Vn and Vn+2 with execution.]
The performance loss caused by poor utilization of resources can be completely eliminated by correctly scheduling
the PREFETCH instructions. As shown in Figure 9-4, prefetch instructions are issued two vertex iterations ahead. This
assumes that only one vertex gets processed in one iteration and a new data cache line is needed for each iteration.
As a result, when iteration n, vertex Vn, is being processed; the requested data is already brought into cache. In the
meantime, the front-side bus is transferring the data needed for iteration n+1, vertex Vn+1. Because there is no
dependence between Vn+1 data and the execution of Vn, the latency for data access of Vn+1 can be entirely hidden
behind the execution of Vn. Under such circumstances, no “bubbles” are present in the pipelines and thus the best
possible performance can be achieved.
Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being
compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately
memory bandwidth-bound.
When data is already located in the first level cache, prefetching can be useless and could even slow down the
performance because the extra µops either back up waiting for outstanding memory accesses or may be dropped
altogether. This behavior is platform-specific and may change in the future.
.....
.....
add esi, 128
cmp esi, ecx
jl top_loop
For nested loops, memory de-pipelining could occur during the interval between the last iteration of an inner loop
and the next iteration of its associated outer loop. Without paying special attention to prefetch insertion, loads from
the first iteration of an inner loop can miss the cache and stall the execution pipeline waiting for data returned, thus
degrading the performance.
In Example 9-5, the cache line containing A[II][0] is not prefetched at all and always misses the cache. This assumes
that no A[][] footprint resides in the cache. The penalty of memory de-pipelining stalls can be amortized across
the inner loop iterations, but it may become very harmful when the inner loop is short. In addition, the prefetches
issued in the last PSD iterations are wasted and consume machine resources. Prefetch concatenation is introduced
here to eliminate the performance issue of memory de-pipelining.
Prefetch concatenation can bridge the execution pipeline bubbles between the boundary of an inner loop and its
associated outer loop. Simply by unrolling the last iteration out of the inner loop and specifying the effective prefetch
address for data used in the following iteration, the performance loss of memory de-pipelining can be completely
removed. Example 9-6 gives the rewritten code.
Example 9-6. Concatenation and Unrolling the Last Iteration of Inner Loop
for (ii = 0; ii < 100; ii++) {
for (jj = 0; jj < 24; jj+=8) { /* N-1 iterations */
prefetch a[ii][jj+8]
computation a[ii][jj]
}
prefetch a[ii+1][0]
computation a[ii][jj]/* Last iteration */
}
This code segment for data prefetching is improved and only the first iteration of the outer loop suffers any memory
access latency penalty, assuming the computation time is larger than the memory latency. Inserting a prefetch of the
first data element needed prior to entering the nested loop computation would eliminate or reduce the start-up
penalty for the very first iteration of the outer loop. This uncomplicated high-level code optimization can improve
memory performance significantly.
One approach to solve the excessive prefetching issue is to unroll and/or software-pipeline loops to reduce the
number of prefetches required. Figure 9-5 presents a code example which implements prefetch and unrolls the loop
to remove the redundant prefetch instructions whose prefetch addresses hit the previously issued prefetch
instructions. In this particular example, unrolling the original loop once saves six prefetch instructions and nine
instructions for conditional jumps in every other iteration.
Original loop:
top_loop:
  prefetchnta [edx+esi+32]
  prefetchnta [edx*4+esi+32]
  . . . . .
  movaps xmm1, [edx+esi]
  movaps xmm2, [edx*4+esi]
  . . . . .
  add esi, 16
  cmp esi, ecx
  jl top_loop

Unrolled iteration:
top_loop:
  prefetchnta [edx+esi+128]
  prefetchnta [edx*4+esi+128]
  . . . . .
  movaps xmm1, [edx+esi]
  movaps xmm2, [edx*4+esi]
  . . . . .
  movaps xmm1, [edx+esi+16]
  movaps xmm2, [edx*4+esi+16]
  . . . . .
  movaps xmm1, [edx+esi+96]
  movaps xmm2, [edx*4+esi+96]
  . . . . .
  add esi, 128
  cmp esi, ecx
  jl top_loop
The X axis in Figure 9-6 indicates the number of computation clocks per loop (each iteration is independent). The Y
axis indicates the execution time measured in clocks per loop. The secondary Y axis indicates the percentage of bus
bandwidth utilization. The tests vary by the following parameters:
• Number of load/store streams: Each load and store stream accesses one 128-byte cache line per iteration.
• Amount of computation per loop: This is varied by increasing the number of dependent arithmetic operations
executed.
• Number of software prefetches per loop: For example, one every 16 bytes, 32 bytes, 64 bytes, or 128 bytes.
As expected, the leftmost portion of each of the graphs in Figure 9-6 shows that when there is not enough
computation to overlap the latency of memory access, prefetch does not help and that the execution is essentially
memory-bound. The graphs also illustrate that redundant prefetches do not increase performance.
Clustered prefetches:
top_loop:
  prefetchnta [ebx+128]
  prefetchnta [ebx+1128]
  prefetchnta [ebx+2128]
  prefetchnta [ebx+3128]
  . . . .
  prefetchnta [ebx+17128]
  prefetchnta [ebx+18128]
  prefetchnta [ebx+19128]
  prefetchnta [ebx+20128]
  movps xmm1, [ebx]
  addps xmm2, [ebx+3000]
  mulps xmm3, [ebx+4000]
  addps xmm1, [ebx+1000]
  addps xmm2, [ebx+3016]
  mulps xmm1, [ebx+2000]
  mulps xmm1, xmm2
  . . . . . .
  add ebx, 128
  cmp ebx, ecx
  jl top_loop

Spread prefetches:
top_loop:
  prefetchnta [ebx+128]
  movps xmm1, [ebx]
  addps xmm2, [ebx+3000]
  mulps xmm3, [ebx+4000]
  prefetchnta [ebx+1128]
  addps xmm1, [ebx+1000]
  addps xmm2, [ebx+3016]
  prefetchnta [ebx+2128]
  mulps xmm1, [ebx+2000]
  mulps xmm1, xmm2
  prefetchnta [ebx+3128]
  . . . . . . .
  prefetchnta [ebx+18128]
  . . . . . .
  prefetchnta [ebx+19128]
  . . . . . .
  prefetchnta [ebx+20128]
  add ebx, 128
  cmp ebx, ecx
  jl top_loop
NOTE
To avoid instruction execution stalls due to the over-utilization of the resource, PREFETCH
instructions must be interspersed with computational instructions. The spreading of PREFETCH
instructions may need to be re-tuned for new processors.
[Figure: data traversal in temporally adjacent passes vs. temporally non-adjacent passes.]
In the temporally-adjacent scenario, subsequent passes use the same data and find it already in second-level cache.
Prefetch issues aside, this is the preferred situation. In the temporally non-adjacent scenario, data used in pass m is
displaced by pass (m+1), requiring data re-fetch into the first level cache and perhaps the second level cache if a
later pass reuses the data. If both data sets fit into the second-level cache, load operations in passes 3 and 4 become
less expensive.
Figure 9-9 shows how prefetch instructions and strip-mining can be applied to increase performance in both of these
scenarios.
[Figure body: temporally adjacent passes use PREFETCHNTA within each strip (SM1): prefetch Dataset A, reuse Dataset A, prefetch Dataset B, reuse Dataset B. Temporally non-adjacent passes use PREFETCHT0 (SM2): prefetch Dataset A, prefetch Dataset B, then reuse Dataset A and Dataset B from the second-level cache.]
Figure 9-9. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops
The left scenario shows a graphical implementation of using PREFETCHNTA to prefetch data into L1, minimizing
second-level cache pollution. Use PREFETCHNTA if the data is only touched once during the entire execution pass, to
minimize cache pollution in the higher-level caches. This provides instant availability when the read access is issued,
assuming the prefetch was issued far enough ahead.
In the scenario to the right (see Figure 9-9), the workload footprint is too large for the L1 cache. Therefore, use
PREFETCHT0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps
a copy of the data in second-level cache, which reduces memory traffic and latencies for passes 3 and 4. To further
reduce the latency, it might be worth considering extra PREFETCHNTA instructions prior to the memory references in
passes 3 and 4.
In Example 9-7, consider the data access patterns of a 3D geometry engine first without strip-mining and then
incorporating strip-mining.
Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second
pass, that is, the lighting loop. This causes under-utilization of cache lines fetched during transformation loop as well
as bandwidth wasted in the lighting loop.
Now consider the code in Example 9-8 where strip-mining has been incorporated into the loops.
With strip-mining, all vertex data can be kept in the cache (for example, one way of second-level cache) during the
strip-mined transformation loop and reused in the lighting loop. Keeping data in the cache reduces both bus traffic
and the number of prefetches used.
Table 9-4 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining.
The steps are:
• Do strip-mining: partition loops so that the dataset fits into second-level cache.
• Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of second-level cache).
Use PREFETCHT0 if the dataset exceeds 32 KBytes.
The above steps are platform-specific and provide an implementation example. The variables NUM_STRIP and
MAX_NUM_VX_PER_STRIP can be determined heuristically for peak performance for a specific application on a
specific platform.
Example 9-9 (b) shows how to apply tiling with optimal selection of tile size and tile width to take
advantage of hardware prefetching. With tiling, one can choose the size of two tiles to fit in the last-level cache.
Maximizing the width of each tile for memory read references enables the hardware prefetcher to initiate bus
requests to read some cache lines before the code actually references those linear addresses.
[Figure: single-pass vs. multi-pass 3D geometry engines. Single-pass: vertex processing (the inner loop) performs Transform and then Lighting per vertex. Multi-pass: the outer loop processes strips from a strip list (e.g., 80 visible, 60 invisible, 40 visible), applying Transform to a whole strip and then Lighting.]
The choice of single-pass or multi-pass can have several performance implications. For instance, in a multi-pass
pipeline, stages that are limited by bandwidth (either input or output) will reflect more of this performance limitation
in overall execution time. In contrast, for a single-pass approach, bandwidth limitations can be distributed/amortized
across other computation-intensive stages. The choice of which prefetch hints to use is also impacted by
whether a single-pass or multi-pass approach is used.
In Streaming SIMD Extensions implementation, when non-temporal stores are written into writeback or write-
combining memory regions, these stores are weakly-ordered and will be combined internally inside the processor’s
write-combining buffer and be written out to memory as a line burst transaction. To achieve the best possible
performance, it is recommended to align data along the cache line boundary and write them consecutively in a cache
line size while using non-temporal stores. If the consecutive writes are prohibitive due to programming constraints,
then software write-combining (SWWC) buffers can be used to enable line burst transaction.
You can declare small SWWC buffers (a cache line for each buffer) in your application to enable explicit write-
combining operations. Instead of writing to non-temporal memory space immediately, the program writes data into
the SWWC buffers and combines the data inside these buffers. The program only writes a SWWC buffer out using non-
temporal stores when the buffer is filled up, that is, when it holds a full cache line. Although the SWWC method requires
explicit instructions for performing temporary writes and reads, it ensures that the transaction on the front-side bus is a
line transaction rather than several partial transactions. Application performance gains considerably from
implementing this technique. These SWWC buffers can be maintained in the second-level cache and reused throughout
the program.
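A hedged sketch of the SWWC idea (the type, function names, and flush policy below are illustrative, not the manual's code):

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64

// One software write-combining buffer: accumulate a full cache line locally,
// then emit it to the (16-byte aligned) non-temporal destination as one burst.
typedef struct {
    _Alignas(64) uint8_t data[CACHE_LINE];
    int      fill;   // bytes currently buffered
    uint8_t *dst;    // destination of the next full line
} swwc_buf_t;

static void swwc_flush(swwc_buf_t *b)
{
    const __m128i *src = (const __m128i *)b->data;
    for (int i = 0; i < CACHE_LINE / 16; i++)
        _mm_stream_si128((__m128i *)(b->dst + 16 * i), src[i]);
    b->dst += CACHE_LINE;
    b->fill = 0;
}

// Assumes writes arrive in sizes that pack the line exactly (no overflow check).
static void swwc_write(swwc_buf_t *b, const void *p, int len)
{
    memcpy(b->data + b->fill, p, len);   // combine inside the SWWC buffer
    b->fill += len;
    if (b->fill == CACHE_LINE)           // only full lines go out, as a burst
        swwc_flush(b);
}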
by the processor to generate future data. The assumption is that the size of the reference data is too large to fit in the
processor’s caches. A streaming store is used to write the data around the cache, to avoid displacing other temporal
data held in the caches. Later, the processor re-reads the data using PREFETCHNTA, which ensures maximum
bandwidth yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of
prefetch.
This task can be optimized using various coding techniques. One technique uses software prefetch and streaming
store instructions. It is discussed in the following paragraph and a code example shown in Example 9-11.
The memory copy algorithm can be optimized using the Streaming SIMD Extensions with these considerations:
• Alignment of data.
• Proper layout of pages in memory.
• Cache size.
• Interaction of the transaction lookaside buffer (TLB) with memory accesses.
• Combining prefetch and streaming-store instructions.
In Example 9-11, eight _MM_LOAD_PS and _MM_STREAM_PS intrinsics are used so that all of the data prefetched (a
128-byte cache line) is written back. The prefetch and streaming-stores are executed in separate loops to minimize
the number of transitions between reading and writing data. This significantly improves the bandwidth of the
memory accesses.
The TEMP = A[KK+CACHESIZE] read is used, on older architectures, to ensure that the page table entry for array A is
brought into the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that
memory location by this access. Hence, the prefetching starts from KK+4 in this loop.
This example assumes that the destination of the copy is not temporally adjacent to the code. If the copied data is
destined to be reused in the near future, then the streaming store instructions should be replaced with regular 128-bit
stores (_MM_STORE_PS).
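Example 9-11 itself is not reproduced here; the following hedged intrinsics sketch follows the same structure (prefetch the next 128-byte block with PREFETCHNTA, then stream the current block out; the function name and block handling are simplified assumptions):

#include <xmmintrin.h>

// Copy 'len' floats (len a multiple of 32, both pointers 16-byte aligned)
// using software prefetch for the source and streaming stores for the
// destination, keeping reads and writes grouped per 128-byte block.
void copy_nta(float *dst, const float *src, int len)
{
    for (int kk = 0; kk < len; kk += 32) {             // 32 floats = 128 bytes
        _mm_prefetch((const char *)&src[kk + 32], _MM_HINT_NTA);
        _mm_prefetch((const char *)&src[kk + 48], _MM_HINT_NTA);
        for (int j = 0; j < 32; j += 4) {
            __m128 v = _mm_load_ps(&src[kk + j]);
            _mm_stream_ps(&dst[kk + j], v);            // write around the cache
        }
    }
    _mm_sfence();   // make the streaming stores globally visible
}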
Example 9-12. Memory Copy Using Hardware Prefetch and Bus Segmentation
void block_prefetch(void *dst,void *src)
{ _asm {
mov edi,dst
mov esi,src
mov edx,SIZE
align 16
main_loop:
xor ecx,ecx
align 16
prefetch_loop:
movaps xmm0, [esi+ecx]
movaps xmm0, [esi+ecx+64]
add ecx,128
cmp ecx,BLOCK_SIZE
jne prefetch_loop
xor ecx,ecx
align 16
cpy_loop:
movdqa xmm0,[esi+ecx]
movdqa xmm1,[esi+ecx+16]
movdqa xmm2,[esi+ecx+32]
movdqa xmm3,[esi+ecx+48]
movdqa xmm4,[esi+ecx+64]
movdqa xmm5,[esi+ecx+16+64]
movdqa xmm6,[esi+ecx+32+64]
movdqa xmm7,[esi+ecx+48+64]
movntdq [edi+ecx],xmm0
movntdq [edi+ecx+16],xmm1
movntdq [edi+ecx+32],xmm2
movntdq [edi+ecx+48],xmm3
movntdq [edi+ecx+64],xmm4
movntdq [edi+ecx+80],xmm5
movntdq [edi+ecx+96],xmm6
movntdq [edi+ecx+112],xmm7
add ecx,128
cmp ecx,BLOCK_SIZE
jne cpy_loop
add esi,ecx
add edi,ecx
sub edx,ecx
jnz main_loop
sfence
}
}
EAX[13:10]   Reserved
EAX[25:14]   Maximum number of logical processors sharing this cache (plus-1 encoded)
CPUID leaves > 3 and < 80000000H are only visible when IA32_CR_MISC_ENABLES.BOOT_NT4 (bit 22) is clear
(default).
The deterministic cache parameter leaf provides a means to implement software with a degree of forward
compatibility with respect to enumerating cache parameters. Deterministic cache parameters can be used in several
situations, including:
• Determine the size of a cache level.
• Adapt cache blocking parameters to different sharing topology of a cache-level across Intel HT Technology,
multicore and single-core processors.
• Determine multithreading resource topology in an MP system.
• Determine cache hierarchy topology in a platform using multicore processors.
• Manage threads and processor affinities.
• Determine prefetch stride.
The size of a given level of cache is given by:
(# of Ways) * (Partitions) * (Line_size) * (Sets) = (EBX[31:22] + 1) * (EBX[21:12] + 1) *
(EBX[11:0] + 1) * (ECX + 1)
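A hedged C sketch of evaluating this formula by enumerating the sub-leaves of CPUID leaf 4 (using GCC's <cpuid.h>; variable names are illustrative):

#include <cpuid.h>
#include <stdio.h>

// Print the size of each cache level reported by the deterministic cache
// parameter leaf (CPUID leaf 4).
int main(void)
{
    for (unsigned sub = 0; ; sub++) {
        unsigned eax, ebx, ecx, edx;
        __cpuid_count(4, sub, eax, ebx, ecx, edx);
        unsigned type = eax & 0x1f;               // 0 means no more caches
        if (type == 0)
            break;
        unsigned level      = (eax >> 5) & 0x7;
        unsigned ways       = ((ebx >> 22) & 0x3ff) + 1;
        unsigned partitions = ((ebx >> 12) & 0x3ff) + 1;
        unsigned line_size  = (ebx & 0xfff) + 1;
        unsigned sets       = ecx + 1;
        unsigned long long size =
            (unsigned long long)ways * partitions * line_size * sets;
        printf("L%u cache: %llu bytes\n", level, size);
    }
    return 0;
}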
CHAPTER 10
SUB-NUMA CLUSTERING
Sub-NUMA Clustering (SNC) is a mode for improving average latency from last level cache (LLC) to local memory. It
replaces the Cluster-on-Die (COD) implementation which was used in the previous generation of the Intel® Xeon®
processor E5 family.
Example 10-1. Code Using libnuma to Find the Maximum Number of NUMA Nodes
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
int main(void)
{ int max_node;
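  // Hedged completion (illustrative; not necessarily the manual's exact code):
  // report the highest NUMA node number known to libnuma.
  if (numa_available() < 0) {
      printf("NUMA is not available on this system.\n");
      return EXIT_FAILURE;
  }
  max_node = numa_max_node();   /* highest node number, i.e., number of nodes - 1 */
  printf("Maximum NUMA node: %d (%d nodes total)\n", max_node, max_node + 1);
  return EXIT_SUCCESS;
}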
numactl
In Linux* you can check the NUMA configuration with the numactl utility (the numactl-libs and numactl-devel
packages might also be required). For example:
$ numactl --hardware
NUMA disabled:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
109 110 111
node 0 size: 196045 MB
node 0 free: 190581 MB
node distances:
node 0
0: 10
SNC off:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 0 size: 96973 MB
hwloc
In Linux* you can also check the NUMA configuration with the lstopo utility (the hwloc package is required). For
example:
$ lstopo -p --of png --no-io --no-caches > numa_topology.png
Figure 10-5. Domain Example with One MPI Process Per Domain
Each MPI process can create some child threads to run within the corresponding domain. The process’ threads can
freely migrate from one logical processor to another within the particular domain.
For example, I_MPI_PIN_DOMAIN=numa may be a reasonable option for hybrid MPI/OpenMP* applications
with SNC mode enabled. In this case, each domain consists of logical processors that share a particular NUMA node.
The number of domains on a machine is equal to the number of NUMA nodes on the machine.
Please see the Intel MPI Library documentation for detailed information.
NOTE
It is challenging to measure memory latencies on modern Intel processors accurately as they have
sophisticated HW prefetchers. Intel MLC automatically disables these prefetchers while measuring
the latencies and restores them to their previous state on completion. The prefetcher control is
exposed through an MSR and MSR access requires root level permission. So, Intel MLC needs to be
run as ‘root’ on Linux.
The software configuration used for these measurements is Intel MLC v3.3-Beta2, Red Hat* Linux* 7.2.
NUMA disabled:
Using buffer size of 2000.000MB
Measuring idle latencies (in ns)...
            Memory node
Socket           0        1
     0       126.5    129.4
     1       123.1    122.6
SNC off:
Using buffer size of 2000.000MB
Measuring idle latencies (in ns)...
             Numa node
Numa node        0        1
        0     81.9    153.1
        1    153.7     82.0
SNC on:
Using buffer size of 2000.000MB
Measuring idle latencies (in ns)...
             Numa node
Numa node        0        1        2        3
        0     81.6     89.4    140.4    153.6
        1     86.5     78.5    144.3    162.8
        2    142.3    153.0     81.6     89.3
        3    144.5    162.8     85.5     77.4
CHAPTER 11
MULTICORE AND INTEL® HYPER-THREADING TECHNOLOGY
(INTEL® HT)
This chapter describes software optimization techniques for multithreaded applications running in an environment
using either multiprocessor (MP) systems or processors with hardware-based multithreading support.
Multiprocessor systems are systems with two or more sockets, each mated with a physical processor package. Intel®
64 and IA-32 processors that provide hardware multithreading support include dual-core processors, quad-core
processors and processors supporting Intel® Hyper-Threading Technology (Intel® HT Technology)1.
Computational throughput in a multithreading environment can increase as more hardware resources are added to
take advantage of thread-level or task-level parallelism. Hardware resources can be added in the form of more than
one physical-processor, processor-core-per-package, and/or logical-processor-per-core. Therefore, there are some
aspects of multithreading optimization that apply across MP, multicore, and Intel HT Technology. There are also some
specific microarchitectural resources that may be implemented differently in different hardware multithreading
configurations (for example: execution resources are not shared across different cores but shared by two logical
processors in the same core if HT Technology is enabled). This chapter covers guidelines that apply to these situations.
This chapter covers:
• Performance characteristics and usage models.
• Programming models for multithreaded applications.
• Software optimization techniques in five specific areas.
11.1.1 MULTITHREADING
When an application employs multithreading to exploit task-level parallelism in a workload, the control flow of the
multi-threaded software can be divided into two parts: parallel tasks and sequential tasks.
Amdahl’s Law describes an application’s performance gain as it relates to the degree of parallelism in the control flow.
It is a useful guide for selecting the code modules, functions, or instruction sequences that are most likely to realize
the most gains from transforming sequential tasks and control flows into parallel code to take advantage of
multithreading hardware support.
1. The presence of hardware multithreading support in Intel 64 and IA-32 processors can be detected by checking
the feature flag CPUID.01H:EDX[28]. A return value of 1 in bit 28 indicates that at least one form of hardware mul-
tithreading is present in the physical processor package. The number of logical processors present in each pack-
age can also be obtained from CPUID. The application must check how many logical processors are enabled and
made available to the application at runtime by making the appropriate operating system calls. See the Intel® 64
and IA-32 Architectures Software Developer’s Manual, Volume 2A for information.
Figure 11-1 illustrates how performance gains can be realized for any workload according to Amdahl’s Law. The bar in
Figure 11-1 represents an individual task unit or the collective workload of an entire application.
In general, the speed-up of running multiple threads on an MP systems with N physical processors, over single-
threaded execution, can be expressed as:
RelativeResponse = Tsequential / Tparallel = (1 - P) + P/N + O
where P is the fraction of workload that can be parallelized, and O represents the overhead of multithreading and may
vary between different operating systems. In this case, performance gain is the inverse of the relative response.
(Figure 11-1 illustration: Tsequential on a single thread consists of a serial portion (1-P) and a parallel portion (P);
Tparallel on an MP system consists of the serial portion (1-P), the divided parallel portion (P/2 on two processors),
and the multithreading overhead.)
When optimizing application performance in a multithreaded environment, control flow parallelism is likely to have
the largest impact on performance scaling with respect to the number of physical processors and to the number of
logical processors per physical processor.
If the control flow of a multi-threaded application contains a workload in which only 50% can be executed in parallel,
the maximum performance gain using two physical processors is only 33%, compared to using a single processor.
Using four processors can deliver no more than a 60% speed-up over a single processor. Thus, it is critical to maximize
the portion of control flow that can take advantage of parallelism. Improper implementation of thread
synchronization can significantly increase the proportion of serial control flow and further reduce the application’s
performance scaling.
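To make the arithmetic explicit (neglecting the overhead term O in the formula above):
RelativeResponse(N = 2) = (1 - 0.5) + 0.5/2 = 0.75, so the gain is 1/0.75 - 1 ≈ 33%.
RelativeResponse(N = 4) = (1 - 0.5) + 0.5/4 = 0.625, so the gain is 1/0.625 - 1 = 60%.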
In addition to maximizing the parallelism of control flows, interaction between threads in the form of thread
synchronization and imbalance of task scheduling can also impact overall processor scaling significantly.
Excessive cache misses are one cause of poor performance scaling. In a multithreaded execution environment, they
can occur from:
• Aliased stack accesses by different threads in the same process.
• Thread contentions resulting in cache line evictions.
• False-sharing of cache lines between different processors.
Techniques that address each of these situations (and many other areas) are described in sections in this chapter.
Efficient use of hardware resources between concurrent threads requires optimization techniques in specific areas to
prevent contentions of hardware resources. Coding techniques for optimizing thread synchronization and managing
other hardware resources are discussed in subsequent sections.
Parallel programming models are discussed next.
some destination (hopefully in the second-level cache) and another thread executing on the other core in the same
physical package subsequently reads data produced by the first thread.
The basic approach for implementing a producer-consumer model is to create two threads; one thread is the
producer and the other is the consumer. Typically, the producer and consumer take turns to work on a buffer and
inform each other when they are ready to exchange buffers. In a producer-consumer model, there is some thread
synchronization overhead when buffers are exchanged between the producer and consumer. To achieve optimal
scaling with the number of cores, the synchronization overhead must be kept low. This can be done by ensuring the
producer and consumer threads have comparable time constants for completing each incremental task prior to
exchanging buffers.
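As a minimal sketch of this buffer hand-off (using C11 atomics rather than the OS primitives of Example 11-2; the
function names, buffer sizes, and fill_buffer/drain_buffer placeholders are illustrative only):
#include <stdatomic.h>
#define NUM_BUFFERS  2
#define BUF_ELEMENTS 4096
void fill_buffer(int *buf);                          // placeholder: producer work
void drain_buffer(const int *buf);                   // placeholder: consumer work
static int buffers[NUM_BUFFERS][BUF_ELEMENTS];
static atomic_int owned_by_consumer[NUM_BUFFERS];    // 0 = producer owns, 1 = consumer owns
void producer(int iterations)
{
    for (int i = 0; i < iterations; i++) {
        int b = i % NUM_BUFFERS;
        while (atomic_load(&owned_by_consumer[b]) != 0)
            ;                                        // wait (a real loop would PAUSE or block)
        fill_buffer(buffers[b]);
        atomic_store(&owned_by_consumer[b], 1);      // hand the buffer to the consumer
    }
}
void consumer(int iterations)
{
    for (int i = 0; i < iterations; i++) {
        int b = i % NUM_BUFFERS;
        while (atomic_load(&owned_by_consumer[b]) != 1)
            ;                                        // wait for the producer's signal
        drain_buffer(buffers[b]);
        atomic_store(&owned_by_consumer[b], 0);      // return the buffer to the producer
    }
}
Keeping the per-buffer work of the two sides comparable in duration keeps the time spent in the two wait loops, and
hence the synchronization overhead, low.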
Example 11-1 illustrates the coding structure of single-threaded execution of a sequence of task units, where each
task unit (either the producer or consumer) executes serially (shown in Figure 11-2). In the equivalent scenario under
multi-threaded execution, each producer-consumer pair is wrapped as a thread function and two threads can be
scheduled on available processor resources simultaneously.
(Figure illustrations: under single-threaded execution the main thread runs P(1) C(1) P(1) C(1) P(1) ... serially; under
multi-threaded execution the main thread runs P(1) P(2) P(1) P(2) P(1) ... while the corresponding consumer tasks run
on a second thread.)
The basic structure to implement the producer and consumer thread functions with synchronization to communicate
buffer index is shown in Example 11-2.
It is possible to structure the producer-consumer model in an interlaced manner so it can minimize bus traffic and be
effective on multicore processors without shared second-level cache.
In this interlaced variation of the producer-consumer model, each scheduling quantum of an application thread
comprises a producer task and a consumer task. Two identical threads are created to execute in parallel. During
each scheduling quantum of a thread, the producer task starts first and the consumer task follows after the completion
of the producer task; both tasks work on the same buffer. As each task completes, one thread signals to the other
thread notifying its corresponding task to use its designated buffer. Thus, the producer and consumer tasks execute in
parallel in two threads. As long as the data generated by the producer reside in either the first or second level cache
of the same core, the consumer can access them without incurring bus traffic. The scheduling of the interlaced
producer-consumer model is shown in Figure 11-4.
(Figure 11-4 illustration: in the interlaced model, Thread 1 executes P(2) C(2) P(2) C(2) ... while the other thread
executes the corresponding P(1) C(1) sequence.)
Example 11-3 shows the basic structure of a thread function that can be used in this interlaced producer-consumer
model.
1. Intel Compiler 5.0 and later supports OpenMP directives. Visit http://software.intel.com for details.
2. Intel Compiler 6.0 supports auto-parallelization.
do {
_asm pause
// Ensure this loop is de-pipelined, i.e., prevent more than one
// load request to sync_var from being outstanding,
// avoiding the performance penalty incurred when the worker thread updates
// sync_var and the spinning thread exits the loop.
}
while( sync_var != constant_value);
(c) A spin-wait loop using a “test, test-and-set” technique to determine the availability of the synchronization
variable. This technique is recommended when writing spin-wait loops to run on Intel 64 and IA-32 architecture
processors.
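A minimal sketch of a “test, test-and-set” spin lock with PAUSE (assuming GCC/Clang __atomic builtins and the
_mm_pause intrinsic; the lock variable and function names are illustrative only):
#include <immintrin.h>   // _mm_pause
static int lock_var = 0; // 0 = free, 1 = held
void acquire_lock(void)
{
    for (;;) {
        // "test": read-only spin generates no bus writes while the lock is held
        while (__atomic_load_n(&lock_var, __ATOMIC_RELAXED) != 0)
            _mm_pause();
        // "test-and-set": attempt to take the lock with an atomic exchange
        if (__atomic_exchange_n(&lock_var, 1, __ATOMIC_ACQUIRE) == 0)
            return;
    }
}
void release_lock(void)
{
    __atomic_store_n(&lock_var, 0, __ATOMIC_RELEASE);
}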
User/Source Coding Rule 14. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the
number of loop repetitions to a minimum to improve overall system performance.
The penalty of exiting from a spin-wait loop can be avoided by inserting a PAUSE instruction in the loop. In spite of
the name, the PAUSE instruction improves performance by introducing a slight delay in the loop and effectively
causing the memory read requests to be issued at a rate that allows immediate detection of any store to the
synchronization variable. This prevents the occurrence of a long delay due to memory order violation.
One example of inserting the PAUSE instruction in a simplified spin-wait loop is shown in Example 11-4(b). The PAUSE
instruction is compatible with all Intel 64 and IA-32 processors. On IA-32 processors prior to Intel NetBurst
microarchitecture, the PAUSE instruction is essentially a NOP instruction. Inserting the PAUSE instruction has the
added benefit of significantly reducing the power consumed during the spin-wait because fewer system resources are
used.
• Applications should use an OS service to block the waiting thread; this can release the processor so that other
runnable threads can make use of the processor or available execution resources.
On processors supporting Intel HT Technology, operating systems should use the HLT instruction if one logical
processor is active and the other is not. HLT will allow an idle logical processor to transition to a halted state; this
allows the active logical processor to use all the hardware resources in the physical package. An operating system that
does not use this technique must still execute instructions on the idle logical processor that repeatedly check for
work. This “idle loop” consumes execution resources that could otherwise be used to make progress on the other
active logical processor.
If an application thread must remain idle for a long time, the application should use a thread blocking API or other
method to release the idle processor. The techniques discussed here apply to traditional MP system, but they have an
even higher impact on processors that support Intel HT Technology.
Typically, an operating system provides timing services, for example Sleep(dwMilliseconds)1; such services can be
used to prevent frequent checking of a synchronization variable.
Another technique to synchronize between worker threads and a control loop is to use a thread-blocking API
provided by the OS. Using a thread-blocking API allows the control thread to use less processor cycles for spinning and
waiting. This gives the OS more time quanta to schedule the worker threads on available processors. Furthermore,
using a thread-blocking API also benefits from the system idle loop optimization that OS implements using the HLT
instruction.
User/Source Coding Rule 16. (H impact, M generality) Use a thread-blocking API in a long idle loop to free up the
processor.
Using a spin-wait loop in a traditional MP system may be less of an issue when the number of runnable threads is less
than the number of processors in the system. If the number of threads in an application is expected to be greater than
the number of processors (either one processor or multiple processors), use a thread-blocking API to free up
processor resources. A multithreaded application adopting one control thread to synchronize multiple worker
threads may consider limiting worker threads to the number of processors in a system and use thread-blocking APIs
in the control thread.
1. The Sleep() API is not thread-blocking, because it does not guarantee the processor will be released.
Example 11-5(a) shows an example of using Sleep(0), which does not always release the processor to another
thread.
ResumeWorkThread(thread_handle);
While (!task_not_done ) {
Sleep(0) // Returns immediately back to spin loop.
…
}
(b) A polling loop frees up the processor correctly.
In general, OS function calls should be used with care when synchronizing threads. When using OS-supported thread
synchronization objects (critical section, mutex, or semaphore), preference should be given to the OS service that has
the least synchronization overhead, such as a critical section.
— Use the CLFLUSH line size, i.e. the integer value of CPUID.01H:EBX[15:8], as the “false-sharing threshold”.
2. If the CLFLUSH line size is unavailable, use CPUID leaf 4 as described below:
— Determine the “false-sharing threshold” by evaluating the largest system coherency line size among valid
cache types that are reported via the sub-leaves of CPUID leaf 4. For each sub-leaf n, its associated system
coherency line size is (CPUID.(EAX=4, ECX=n):EBX[11:0] + 1).
3. If neither the CLFLUSH line size nor CPUID leaf 4 is available, software may choose the “false-sharing
threshold” from one of the following:
— Query the descriptor tables of CPUID leaf 2 and choose from the available descriptor entries.
— A Family/Model-specific mechanism available in the platform or a Family/Model-specific known value.
— Default to a safe value of 64 bytes; a sketch combining this default with the CLFLUSH-based method of step 1
follows this list.
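A rough sketch of steps 1 and 3 above (assuming a GCC/Clang environment with <cpuid.h>; the function name is
illustrative, and the raw CPUID.01H:EBX[15:8] field reports the CLFLUSH line size in 8-byte units):
#include <cpuid.h>
// Derive the "false-sharing threshold" from the CLFLUSH line size,
// falling back to the safe default of 64 bytes.
static unsigned false_sharing_threshold(void)
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 19))) {
        // CPUID.01H:EDX.CLFSH[bit 19] indicates the CLFLUSH line-size field is valid;
        // EBX[15:8] is the line size in 8-byte units.
        return ((ebx >> 8) & 0xff) * 8;
    }
    return 64;    // safe default
}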
User/Source Coding Rule 17. (H impact, M generality) Beware of false sharing within a cache line or within a sector.
Allocate critical data or locks separately using alignment granularity not smaller than the “false-sharing threshold”.
When a common block of parameters is passed from a parent thread to several worker threads, it is desirable for each
work thread to create a private copy (each copy aligned to multiples of the “false-sharing threshold”) of frequently
accessed data in the parameter block.
On processors based on Intel Core microarchitecture, a synchronization variable should be placed alone in a
separate cache line to avoid false-sharing. Software must not allow a synchronization variable to span a page
boundary.
User/Source Coding Rule 18. (M impact, ML generality) Place each synchronization variable alone, separated by
128 bytes or in a separate cache line.
User/Source Coding Rule 19. (H impact, L generality) Do not place any spin lock variable to span a cache line
boundary.
At the code level, false sharing is a special concern in the following cases:
• Global data variables and static data variables that are placed in the same cache line and are written by different
threads.
• Objects allocated dynamically by different threads may share cache lines. Make sure that the variables used
locally by one thread are allocated in a manner to prevent sharing the cache line with other threads.
Another technique to enforce alignment of synchronization variables and to avoid a cacheline being shared is to use
compiler directives when declaring data structures. See Example 11-7.
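For instance, with GCC/Clang attributes (a sketch only, not reproducing Example 11-7; the 64-byte value and the
structure/array names are assumptions for illustration):
#define CACHE_LINE_SIZE 64          // assumed false-sharing threshold
#define MAX_THREADS     8           // illustrative only
// Pad and align each per-thread slot so that no two threads write the same cache line.
struct padded_counter {
    volatile long count;
    char pad[CACHE_LINE_SIZE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE_SIZE)));
static struct padded_counter counters[MAX_THREADS];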
Using a compiler that supports profile-guided optimization can improve code locality by keeping frequently used
code paths in the cache. This reduces instruction fetches. Loop blocking can also improve data locality. Other
locality enhancement techniques can also be applied in a multithreading environment to conserve bus bandwidth
(see Section 9.5).
Because the system bus is shared between many bus agents (logical processors or processor cores), software tuning
should recognize symptoms of the bus approaching saturation. One useful technique is to examine the queue depth
of bus read traffic. When the bus queue depth is high, locality enhancement to improve cache utilization will benefit
performance more than other techniques, such as inserting more software prefetches or masking memory latency
with overlapping bus reads. An approximate working guideline for software to operate below bus saturation is to
check if bus read queue depth is significantly below 5.
Some MP and workstation platforms may have a chipset that provides two system buses, with each bus servicing one
or more physical processors. The guidelines for conserving bus bandwidth described above also apply to each bus
domain.
User/Source Coding Rule 21. (M impact, L generality) Avoid excessive use of software prefetch instructions and
allow the automatic hardware prefetcher to work. Excessive use of software prefetches can significantly and
unnecessarily increase bus utilization.
consumer tasks are necessary to achieve optimal performance. This is because fetching data from L2 to L1 is much
faster than having a cache line in one core invalidated and fetched from the bus.
Figure 11-5 illustrates a batched producer-consumer model that can be used to overcome the drawback of using
small work buffers in a standard producer-consumer model. In a batched producer-consumer model, each scheduling
quantum batches two or more producer tasks, each producer working on a designated buffer. The number of tasks to
batch is determined by the criteria that the total working set be greater than the first-level cache but smaller than the
second-level cache.
(Figure 11-5 illustration: the main thread batches producer tasks P(1) P(2) P(3) P(4) P(5) P(6) ..., each working on its
designated buffer.)
Example 11-8 shows the batched implementation of the producer and consumer thread functions.
void producer_thread()
{ int mode1 = 0;
int iter_num = workamount - batchsize;
while (iter_num--)
{ Signal(&signal1,1);
produce(buffs[mode1],count); // placeholder function
WaitForSignal(&end1);
mode1++;
if (mode1 > batchsize)
mode1 = 0;
}
}
void consumer_thread()
{ int mode2 = 0;
int iter_num = workamount - batchsize;
while (iter_num--)
{ WaitForSignal(&signal1);
consume(buffs[mode2],count); // placeholder function
Signal(&end1,1);
mode2++;
if (mode2 > batchsize)
mode2 = 0;
}
}
Example 11-9 shows how a multi-threaded application can take advantage of NUMA hardware capability while
dealing with the least amount of OS-specific API code. This parallel approach to memory buffer initialization is
conducive to having each worker thread keep memory traffic local on NUMA systems.
Example 11-9. Parallel Memory Initialization Technique Using OpenMP and NUMA
#ifdef _LINUX // Linux implements malloc to commit physical page at first touch/access
buf1 = (char *) malloc(DIM*(sizeof (double))+1024);
buf2 = (char *) malloc(DIM*(sizeof (double))+1024);
buf3 = (char *) malloc(DIM*(sizeof (double))+1024);
#endif
#ifdef windows
// Windows implements malloc to commit physical page at allocation, so use VirtualAlloc
buf1 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
buf2 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
buf3 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
#endif
a = (double *) buf1;
b = (double *) buf2;
c = (double *) buf3;
#pragma omp parallel
{ // use OpenMP threads to execute each iteration of the loop
// number of OpenMP threads can be specified by default or via environment variable
#pragma omp for private(num)
// each loop iteration is dispatched to execute in different OpenMP threads using private iterator
for(num=0;num<len;num++)
{// each thread performs first-touches to its own subset of memory addresses; the physical pages are
// mapped to the local memory controller of the respective thread
a[num]=10.;
b[num]=10.;
c[num]=10.;
}
}
Note that the example shown in Example 11-9 implies that the memory buffers will be freed after the worker threads
created by OpenMP have ended. This avoids a potential issue with repeated use of malloc/free across different
application threads: if local memory that was initialized by one thread is subsequently freed by another thread, the
OS may have difficulty tracking and re-allocating memory pools in the linear address space relative to the NUMA
topology. In Linux, another API, numa_alloc_local, may be used.
Memory and Bandwidth    NUMA, three channels per socket to DDR3,    SMP, FSB, or dual FSB, up to 12.8
                        up to 32 GB/s per socket.                   GB/s per FSB.
For compute-bound workloads, the Intel HT opportunity in the Intel NetBurst microarchitecture tends to favor thread
contexts that execute with relatively high CPI (average cycles to retire consecutive instructions). At the hardware level,
this is in part due to the issue-port imbalance in the microarchitecture, as port 1 is shared by the fast ALU, the slow ALU
(heavier-duty integer operations), SIMD, and FP computations. At the software level, causes of high CPI that
may act as benign catalysts for Intel HT benefit include long latency instructions (port 1), some L2 hits,
occasional branch mispredictions, etc. However, the length of the pipeline in the NetBurst microarchitecture often imposes
additional internal hardware constraints that limit software’s ability to take advantage of Intel HT.
The microarchitectural enhancements listed in Table 11-3 are expected to provide broader software optimization
opportunities for compute-bound workloads. Whereas contention in the same execution unit by two compute-bound
threads might argue for choosing a functional-decomposition threading model over data decomposition, the
Nehalem microarchitecture is likely to be more accommodating, allowing the programmer to choose the optimal
threading decomposition model.
Memory intensive workloads can exhibit a wide range of performance characteristics, ranging from completely
parallel memory traffic (saturating system memory bandwidth, as in the well-known example of Stream), memory
traffic dominated by memory latency, or various mixtures of compute operations and memory traffic of either kind.
The Intel HT implementation in the Intel NetBurst microarchitecture may provide benefit to some of the latter two types
of workload characteristics. The Intel HT capability in the Nehalem microarchitecture can broaden the operating
envelope of the two latter types of workload characteristics to deliver higher system throughput, due to its support for
non-uniform memory access (NUMA), a more efficient link protocol, and system memory bandwidth that scales with
the number of physical processors.
Some cache levels of the cache hierarchy may be shared by multiple logical processors. Using the cache hierarchy is an
important means for software to improve the efficiency of memory traffic and avoid saturating the system memory
bandwidth. Multi-threaded applications employing cache-blocking technique may wish to partition a target cache
level to take advantage of Intel HT. Alternately two logical processors sharing the same L1 and L2, or logical processors
sharing the L3 may wish to manage the shared resources according to their relative topological relationship. A white
paper on processor topology enumeration and cache topology enumeration with companion reference code has
been published.
CHAPTER 12
INTEL® OPTANE™ DC PERSISTENT MEMORY
The Intel® Xeon® scalable processor family based on the Cascade Lake product introduces support for Intel® Optane™
DC Persistent Memory Modules. These Intel Optane DC Persistent Memory Modules are larger in size compared to
DRAM and are persistent, i.e., the content is maintained even when the system is powered down. However, latency is
higher and bandwidth is lower than DRAM DIMMs.
Once the persistent memory is mapped to the virtual address space of the application, reads and writes can be done
using load and store instructions; compared with going through a storage API, this app direct access has the following advantages:
• This completely avoids the system call.
• Instead of transferring a complete page (e.g. 4KB), only a cache line is transferred (64B).
• Only one copy of the data is needed as memory is not copied back and forth between persistent memory
and DRAM.
• Access is synchronous.
Note that the operating system and conventional OS memory reporting tools report only the DRAM that
is present in the system, since persistent memory is accessed via a file system. The two pools of memory are distinct
from one another, and one of the key differentiators of app direct mode is that software has full control over which
data is located in DRAM vs. the NVDIMM. For optimal performance, software can therefore place latency-sensitive data
structures in DRAM, either by keeping a copy there or by reconstructing them. Examples are index data structures, which are
typically accessed randomly but can be reconstructed after a restart.
Figure 12-1. In App Direct Mode, Data on the Intel® Optane™ DC Persistent Memory Module is
Accessed Directly with Loads and Stores
• For example, memory could be the final, durable destination for data instead of disk.
• Applications that are bound by disk latency or bandwidth can benefit from using memory for durability.
• What is the sensitivity of the application to memory latency?
— Intel Optane DC Persistent Memory Module latencies are higher than DRAM, typically around 3-4 times the
latency of DRAM.
• In the cases where an Intel Optane DC Persistent Memory Module is replacing memory, a lot depends on
how predictable those accesses are, and also how sensitive those memory accesses are to latency.
To illustrate these cases, let’s first consider the scenario where the application is reading a sequential array of
numbers that is several GB in size from an Intel Optane DC Persistent Memory Module. In this case, since the accesses
are spatially predictable, they are prefetchable by hardware and software prefetchers. As a result, the data can always
be in the processor caches before the application requests the data, and the latency of the Intel Optane DC Persistent
Memory Module is not seen by the application.
On the other hand, if the application was walking a linked list for example, it is not possible to identify the next node
in the linked list without first reading the current node (this is called “pointer chasing”). In this case, the latency of the
Intel Optane DC Persistent Memory Module is seen by the application.
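A minimal illustration of the two access patterns (the structure and function names are illustrative, not from this
manual):
#include <stddef.h>
struct node { struct node *next; char payload[56]; };
// Pointer chasing: each load address depends on the previous load's result,
// so hardware prefetchers cannot run ahead and the full media latency is exposed.
long count_list(const struct node *head)
{
    long n = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        n++;
    return n;
}
// Sequential streaming: addresses are known in advance and are prefetchable,
// so the module latency can be hidden behind the prefetch stream.
long sum_array(const long *a, size_t len)
{
    long s = 0;
    for (size_t i = 0; i < len; i++)
        s += a[i];
    return s;
}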
Another important consideration mentioned above is the sensitivity of the application to memory latency. In some
cases, the application is structured so that the processor cores can do other useful work while waiting for memory references to the
Intel Optane DC Persistent Memory Module to return; since useful work is being done, performance is often not
significantly impacted.
In other cases, the cores are stalled while waiting for memory references from the Intel Optane DC Persistent Memory
Module, which often impacts performance.
If the application as a whole is indeed sensitive to memory latency, an examination of which memory data structures
are sensitive is warranted. A good use for the Intel Optane DC Persistent Memory Module is large capacity data
structures that are not as sensitive to memory latency, based on the considerations outlined above. Smaller data
structures that are heavily accessed and/or are sensitive to memory latency are better suited to DRAM.
The chart below shows a pictorial flow based on the above considerations.
Table 12-1. Latencies for Accessing Intel® Optane™ DC Persistent Memory Modules
Latency Intel® Optane™ DC Persistent Memory Module DRAM
Idle sequential read latency ~170ns ~75ns
In the case of DRAM, the difference between sequential and random latencies is limited to a few nanoseconds; this is
due to sequential accesses resulting in greater hits in DRAM row buffers. However in the case of Intel Optane DC
Persistent Memory Modules, not only do the latencies differ overall from DRAM, they also differ significantly
between the sequential and random access cases.
The difference in access latency of Intel Optane DC Persistent Memory Modules from DRAM requires special
consideration for software developers from a performance perspective.
See Chapter 9, “Optimizing Cache Usage” for general guidelines on optimizing processor cache usage.
In memory mode, it is expected that the DRAM cache would absorb most of the accesses, and the application would
see DRAM-like latencies. Note that the latency to access Intel Optane DC Persistent Memory Modules in memory
mode is ~30-40 ns higher than in app direct mode, due to the overhead of first looking up the DRAM cache.
Performance in memory mode can be improved with traditional cache tiling and locality optimization techniques that
keep the working set within the size of the DRAM cache.
Further, each Intel Optane DC Persistent Memory Module features some form of buffering at 256 Byte granularity,
and this is one of the units at which we distinguish between sequential and random accesses. It is therefore beneficial
to collocate data inside 256 Bytes and read them together to get sequential access latencies as opposed to random, a
consideration for software data structure design.
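As a layout sketch of this idea (the structure name and field sizes are illustrative; the 256-byte value is the buffering
granularity described above):
#include <stdint.h>
// Keep fields that are typically read together within one 256-byte region,
// and align the record so it does not straddle two 256-byte granules.
struct record {
    uint64_t key;
    uint64_t hot_fields[31];          // 8 + 248 = 256 bytes of co-located hot data
} __attribute__((aligned(256)));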
Table 12-2. Bandwidths per DIMM for Intel® Optane™ DC Persistent Memory Modules and
DRAM
Per DIMM Bandwidths Intel® Optane™ DC Persistent Memory Module DRAM
Sequential read ~7.6 GB/s ~15 GB/s
Random read ~2.4 GB/s ~15 GB/s
Figure 12-3. Loaded Latency Curves for One Intel® Optane™ DC Persistent Memory Module DIMM:
Sequential Traffic (Left) and Random Traffic (Right)
When memory bandwidth is close to being saturated, the latencies tend to be very high and hurt application
performance. The bandwidth demand is typically a function of the number of cores driving memory accesses, and the
nature of the accesses, i.e., sequential vs. random access pattern as well as the read-write mix. On the other hand, the
bandwidth capability of the platform is a function of the number of channels and DIMMs available.
It is therefore important to balance the read and write traffic with the capabilities of the system; that is, to balance
the number of threads reading from and writing to the Intel Optane DC Persistent Memory Module against the
number of populated memory channels.
While writing to Intel Optane DC Persistent Memory Module, since bandwidth is more limited than for DRAM, it is
recommended to use non-temporal stores over regular stores in cases when it is not expected that the data written to
will be re-used in the near future, or while writing to very large buffers. (See Section 9.4.1.2 for details).
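A minimal sketch of such a copy using non-temporal stores (the function name is illustrative; it assumes a 16-byte-
aligned destination and a length that is a multiple of 16, and it does not cover persistence/flushing concerns):
#include <immintrin.h>
#include <stddef.h>
void copy_nt(void *dst, const void *src, size_t len)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));   // non-temporal store bypasses the cache
    _mm_sfence();   // order the streaming stores before subsequent stores
}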
Figure 12-5, Figure 12-6, and Figure 12-7 illustrate the differences in combining at 256B locality and how this is
impacted by the number of threads that are injecting references to the Intel Optane DC Persistent Memory Module.
It is important to keep this 256B locality in mind while selecting data structures, and the concurrency of accesses to
the Intel Optane DC Persistent Memory Module.
Figure 12-10. Read-Write Equivalence for Intel® Optane™ DC Persistent Memory Module
DIMMs within Different Power Budgets1
NOTES:
1. The bars on the left show the case for 100% reads. In this scenario, if we consider a power budget of 15W, then
~6.9GB/s of reads are possible. However, if we have the same power budget of 15W, only 2.1GB/s of writes are
possible.
Figure 12-11. Bandwidth Available to Software when There is No Locality at 256B Granularity
From Figure 12-11, we can infer that it is critical to choose data structures that have good access locality at 256B to
make good use of a given power budget from a bandwidth standpoint. More specifically, by comparing Figure 12-10
with Figure 12-11, we can observe that with access locality within a 256B window, the bandwidth improves by a factor
of up to 3-4.
CHAPTER 13
64-BIT MODE CODING GUIDELINES
13.1 INTRODUCTION
This chapter describes coding guidelines for application software written to run in 64-bit mode. Some coding
recommendations applicable to 64-bit mode are covered in Chapter 3, “General Optimization Guidelines”. The
guidelines in this chapter should be considered as an addendum to the coding guidelines described in Chapter 3
through Chapter 11, “Multicore and Intel® Hyper-Threading Technology (Intel® HT)”.
Software that runs in either compatibility mode or legacy non-64-bit modes should follow the guidelines described in
Chapter 3 through Chapter 11.
Assembly/Compiler Coding Rule 57. (M impact, MH generality) When they are needed to reduce register pressure,
use the 8 extra general purpose registers for integer code and 8 extra XMM registers for floating-point or SIMD code.
The 128-bit IDIV/DIV instructions restrict the range of divisor, quotient, and remainder to be within 64 bits to avoid
causing numerical exceptions. This presents a challenge for situations where any of the three has a value near the
upper bound of 64 bits and for dividend values nearing the upper bound of 128 bits.
This challenge can be overcome by choosing a larger shift count N and extending the (Dividend * Cx) operation
from the 128-bit range to the next computing-efficient range. For example, if (Dividend * Cx) is greater than 128 bits
and N is greater than 63 bits, one can take advantage of computing bits 191:64 of the 192-bit result using 128-bit
MUL without implementing a full 192-bit multiplication.
A convenient way to choose the congruent constant Cx is as follows:
• If the range of dividend is within 64 bits: Nmin ~ BSR(Divisor) + 63.
• In situations of disparate dynamic range of quotient/remainder relative to the range of divisor, raise N
accordingly so that quotient/remainder can be computed efficiently.
Consider the computation of the quotient and remainder for the divisor 10^16 on unsigned dividends near the
range of 64 bits. Example 13-1 illustrates using the “MUL r64” instruction to handle a 64-bit dividend with 64-bit
divisors.
Example 13-1. Compute 64-bit Quotient and Remainder with 64-bit Divisor
_Cx10to16: ; Congruent constant for 10^16 with shift count ‘N’ = 117
DD 0c44de15ch ; floor ( (2^117 / 10^16) + 1)
DD 0e69594beh ; Optimize length of Cx to reduce # of 128-bit multiplication
_tento16: ; 10^16
DD 6fc10000h
DD 002386f2h
Example 13-2 shows a similar technique to handle a 128-bit dividend with 64-bit divisors.
Example 13-2. Quotient and Remainder of 128-bit Dividend with 64-bit Divisor
mov rax, qword ptr [rcx] ; load bits 63:0 of 128-bit dividend from memory
mov rsi, _Cx10to16 ; Congruent Constant for 10^16 with shift count 117
mov r9, qword ptr [rsi] ; load Congruent Constant
mul r9 ; 128-bit multiplication
xor r11, r11 ; clear accumulator
mov rax, qword ptr 8[rcx] ; load bits 127:64 of 128-bit dividend
shr rdx, 53; ;
mov r10, rdx ; initialize bits 127:64 of 192-bit result
mul r9 ; Accumulate to bits 191:128
add rax, r10; ;
adc rdx, r11; ;
shr rax, 53; ;
shl rdx, 11; ;
or rdx, rax; ;
mov r8, qword ptr 8[rsi] ; load Divisor 10^16
mov r9, rdx; ; approximate quotient, may be off by 1
mov rax, r8
mul r9 ; will quotient * divisor > dividend?
sub rdx, qword ptr 8[rcx] ;
sbb rax, qword ptr [rcx] ;
jb remain
sub r9, 1 ; this may be off by one due to round up
mov rax, r8 ; retrieve Divisor 10^16
mul r9 ; final quotient * divisor
sub rax, qword ptr [rcx] ;
sbb rdx, qword ptr 8[rcx] ;
remain:
mov rdx, r9 ; quotient
neg rax ; remainder
The techniques illustrated in Example 13-1 and Example 13-2 can increase the speed of the remainder/quotient
calculation of 128-bit dividends to at or below the cost of a 32-bit integer division.
Extending the technique above to deal with a divisor greater than 64-bits is relatively straightforward. One
optimization worth considering is to choose a shift count N > 128 bits. This can reduce the number of 128-bit MUL
needed to compute the relevant upper bits of (Dividend * Cx).
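At the C level, the same idea for the 64-bit-dividend case can be sketched with the compiler's unsigned __int128 type
(GCC/Clang). The constant below combines the two DD values quoted in Example 13-1, the function name is
illustrative, and the result is the approximate quotient that Examples 13-1 and 13-2 then verify and adjust by one if
needed:
#include <stdint.h>
// Approximate quotient of dividend / 10^16 via multiply-high with the
// congruent constant Cx = floor(2^117 / 10^16) + 1 and shift count N = 117.
uint64_t approx_quotient_1e16(uint64_t dividend)
{
    const unsigned __int128 cx = ((unsigned __int128)0x0e69594beULL << 32) | 0xc44de15cULL;
    unsigned __int128 prod = (unsigned __int128)dividend * cx;
    return (uint64_t)(prod >> 117);
}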
For example, the following code sequence loads the 32-bit values sign-extended into the 64-bit registers and
performs a multiply:
movsx rax, DWORD PTR[x]
movsx rcx, DWORD PTR[y]
imul rax, rcx
The 64-bit version above is more efficient than using the following 32-bit version:
mov eax, DWORD PTR[x]
mov ecx, DWORD PTR[y]
imul ecx
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single
64-bit register.
Assembly/Compiler Coding Rule 60. (ML impact, M generality) Use the 64-bit versions of multiply for 32-bit integer
multiplies that require a 64-bit result.
To add two 64-bit numbers in 32-bit legacy mode, the add instruction followed by the adc instruction is used. For
example, to add two 64-bit variables (X and Y), the following four instructions could be used:
mov eax, DWORD PTR[X]
mov edx, DWORD PTR[X+4]
add eax, DWORD PTR[Y]
adc edx, DWORD PTR[Y+4]
The result will end up in the two-register EDX:EAX.
In 64-bit mode, the above sequence can be reduced to the following:
mov rax, QWORD PTR[X]
add rax, QWORD PTR[Y]
The result is stored in rax. One register is required instead of two.
Assembly/Compiler Coding Rule 61. (ML impact, M generality) Use the 64-bit versions of add for 64-bit adds.
CHAPTER 14
INTEL® SSE4.2 AND SIMD PROGRAMMING FOR TEXT-
PROCESSING/LEXING/PARSING
String/text processing spans a discipline that often employs techniques different from traditional SIMD integer vector
processing. Many traditional string/text algorithms are character based, where characters may be represented
by encodings (or code points) of fixed or variable byte sizes. Textual data represents a vast amount of raw data and
often carries contextual information.
The contextual information embedded in raw textual data often requires algorithmic processing dealing with a wide
range of attributes, such as:
• Character values.
• Character positions.
• Character encoding formats.
• Subsetting of character sets.
• Strings of explicit or implicit lengths.
• Tokens.
• Delimiters.
Contextual objects may be represented by sequential characters within a predefined character subset (e.g. decimal-
valued strings). Textual streams may contain embedded state transitions separating objects of different contexts (e.g.,
tag-delimited fields).
Traditional integer SIMD vector instructions may, in some simpler situations, be used successfully to speed up simple
string processing functions. Intel SSE4.2 includes four new instructions that offer advances to computational algorithms
targeting string/text processing, lexing, and parsing of either unstructured or structured textual data.
The processor’s support for Intel SSE4.2 is indicated by the feature flag value returned in ECX[bit 20] after executing
the CPUID instruction with an EAX input value of 1 (i.e., SSE4.2 is supported if CPUID.01H:ECX.SSE4_2[bit 20] = 1). Therefore,
software must verify that CPUID.01H:ECX.SSE4_2[bit 20] is set before using these four instructions. (Verifying
CPUID.01H:ECX.SSE4_2 = 1 is also required before using PCMPGTQ or CRC32. Verifying CPUID.01H:ECX.POPCNT[bit
23] = 1 is required before using the POPCNT instruction.)
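A minimal sketch of this check (assuming a GCC/Clang environment providing <cpuid.h>; the function name is
illustrative only):
#include <cpuid.h>
#include <stdbool.h>
// Returns true if Intel SSE4.2 (and hence the PCMPxSTRy and CRC32 instructions) is supported.
bool has_sse4_2(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 20)) != 0;   // CPUID.01H:ECX.SSE4_2[bit 20]
}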
These string/text processing instructions work by performing up to 256 comparison operations on text fragments.
Each text fragment can be up to sixteen bytes. They can handle fragments of different formats: either byte or word
elements. Each of these four instructions can be configured to perform one of four types of parallel comparison
operation on two text fragments.
The aggregated intermediate result of a parallel comparison of two text fragments becomes a bit pattern: sixteen bits
for processing byte elements or eight bits for word elements. These instructions provide additional flexibility, using bit
fields in the immediate operand of the instruction syntax, to configure a unary transformation (polarity) on the first
intermediate result.
Lastly, the instruction’s immediate operand offers an output selection control to further configure the flexibility of the
final result produced by the instruction. The rich configurability of these instructions is summarized in Figure 14-1.
The PCMPxSTRI instructions produce the final result as an integer index in ECX; the PCMPxSTRM instructions produce
the final result as a bit mask in the XMM0 register. The PCMPISTRy instructions support processing string/text fragments
using implicit length control via null termination, for handling string/text of unknown size. The PCMPESTRy
instructions support explicit length control via the EDX:EAX register pair to specify the lengths of the text fragments in
the source operands.
The first intermediate result, IntRes1, is an aggregated result of bit patterns from parallel comparison operations done
on pairs of data elements from each text fragment, according to the imm[3:2] bit field encoding, see Table 14-1.
Input data element format selection using imm[1:0] can support signed or unsigned byte/word elements.
The bit field imm[5:4] allows applying a unary transformation on IntRes1, see Table 14-2.
The comparison operation on each data element pair is defined in Table 14-4. This table defines the type of
comparison operation between valid data elements in the last row and the boundary conditions when the fragment in a
source operand contains invalid data elements (rows one through three). Arithmetic comparisons are performed
only if both data elements are valid elements in fragment1 and fragment2, as shown in row four:
Valid      Invalid    Force False    Force False    Force False    Force False
Valid      Valid      Compare        Compare        Compare        Compare
The string and text processing instructions provide several aids for handling end-of-string situations; see Table 14-5.
Additionally, the PCMPxSTRy instructions are designed not to require 16-byte alignment, to simplify text processing
requirements.
14.1.1 CRC32
The CRC32 instruction computes the 32-bit cyclic redundancy checksum signature for a byte/word/dword or qword stream
of data. It can also be used as a hash function. For example, a dictionary uses hash indices to de-reference strings, and the
CRC32 instruction can easily be adapted for use in this situation.
Example 14-1 shows a straightforward hash function that can be used to evaluate the hash index of a string to
populate a hash table. Typically, the hash index is derived from the hash value by taking the remainder of the hash
value modulo the size of the hash table.
The CRC32 instruction can be used to derive an alternate hash function. Example 14-2 takes advantage of the 32-bit granular
CRC32 instruction to update the signature value of the input data stream. For strings of small to moderate sizes, using the
hardware accelerated CRC32 can be twice as fast as Example 14-1.
else if(pDW[0] < 0x10000) { // finish the last two bytes that are non-zero
hVal = _mm_crc32_u16 (hVal, pDW[0]);
}
else { // finish the last three bytes that are non-zero
hVal = _mm_crc32_u32 (hVal, pDW[0]);
}
}
return hVal;
}
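As a simpler variant of the same idea (a sketch only, not the manual's Example 14-1 or 14-2; the function name is
illustrative), the buffer can be folded four bytes at a time with _mm_crc32_u32 and the tail one byte at a time:
#include <nmmintrin.h>   // Intel SSE4.2 CRC32 intrinsics
#include <stdint.h>
#include <stddef.h>
#include <string.h>
uint32_t crc32_hash(const void *buf, size_t len, uint32_t seed)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t h = seed;
    while (len >= 4) {                         // 32-bit granular CRC32 steps
        uint32_t v;
        memcpy(&v, p, sizeof(v));
        h = _mm_crc32_u32(h, v);
        p += 4; len -= 4;
    }
    while (len--)                              // fold any remaining tail bytes
        h = _mm_crc32_u8(h, *p++);
    return h;
}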
Unstructured, raw text/string data consist of characters, and have no natural alignment preferences. Therefore,
PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions are architected to not require the 16-Byte alignment
restrictions of other 128-bit SIMD integer vector processing instructions.
With respect to memory alignment, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM support unaligned memory
loads like other unaligned 128-bit memory access instructions, e.g. MOVDQU.
Unaligned memory accesses may encounter special situations that require additional coding techniques, depending
on whether the code is running in ring 3 application space or in privileged space. Specifically, an unaligned 16-byte load may
cross a page boundary. Section 14.2.1 discusses a technique that application code can use. Section 14.2.2 discusses the
situations that string library functions need to deal with. Section 14.3 gives detailed examples of using the
PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions to implement equivalent functionality of several
string library functions in situations where application code has control over memory buffer allocation.
For simplicity, we will consider string/text in byte data format in situations where caller functions have allocated
sufficient buffer size to support unaligned 128-bit SIMD loads from memory without encountering the side effects of
crossing page boundaries.
The equivalent functionality of EOS identification can be implemented using PCMPISTRI. Example 14-4 shows a
simplistic Intel SSE4.2 implementation to scan a text block by loading 16-byte text fragments and locate the null
termination character. Example 14-5 shows the optimized Intel SSE4.2 implementation that demonstrates the
effectiveness of memory disambiguation to improve instruction-level parallelism.
static char ssch2[16]= {0x1, 0xff, 0x00, }; // range values for non-null characters
int strlen_un_optimized(const char* s1)
{int len = 0;
_asm{
mov eax, s1
movdqu xmm2, ssch2 ; load character pair as range (0x01 to 0xff)
xor ecx, ecx ; initial offset to 0
_loopc:
add eax, ecx ; update addr pointer to start of text fragment
pcmpistri xmm2, [eax], 14h; unsigned bytes, ranges, invert, lsb index returned to ecx
; if there is a null char in the 16Byte fragment at [eax], zf will be set.
; if all 16 bytes of the fragment are non-null characters, ECX will return 16
jnz short _loopc; the fragment has no null code, ecx has 16, continue search
; we have a null code in the fragment, ecx has the offset of the null code
add eax, ecx ; add ecx to the address of the last fragment2/xmm1
mov edx, s1; retrieve effective address of the input string
sub eax, edx;the string length
mov len, eax; store result
}
return len;
}
The code sequence shown in Example 14-4 has a loop consisting of three instructions. From a performance tuning
perspective, the loop iteration has loop-carry dependency because address update is done using the result (ECX
value) of a previous loop iteration. This loop-carry dependency deprives the out-of-order engine’s capability to have
multiple iterations of the instruction sequence making forward progress. The latency of memory loads, the latency of
these instructions, any bypass delay could not be amortized by OOO execution in the presence of loop-carry
dependency.
A simple optimization technique to eliminate loop-carry dependency is shown in Example 14-5.
Using the memory disambiguation technique to eliminate the loop-carry dependency, the cumulative latency exposure of
the three-instruction sequence of Example 14-5 is amortized over multiple iterations, and the net cost of executing each
iteration (handling sixteen bytes) is less than three cycles. In contrast, handling four bytes of string data using eight
ALU instructions in Example 14-3 will also take a little less than three cycles per iteration, whereas each iteration of
the code sequence in Example 14-4 will take more than ten cycles because of the loop-carry dependency.
_loopc:
add eax, 16 ; adjust address pointer and disambiguate load address for each iteration
pcmpistri xmm2, [eax], 14h; unsigned bytes, ranges, invert, lsb index returned to ecx
; if there is a null char in [eax] fragment, zf will be set.
; if all 16 bytes of the fragment are non-null characters, ECX will return 16,
jnz short _loopc ; ECX will be 16 if there is no null byte in [eax], so we disambiguate
_endofstring:
add eax, ecx ; add ecx to the address of the last fragment
mov edx, s1; retrieve effective address of the input string
sub eax, edx;the string length
mov len, eax; store result
}
return len;
}
Tuning Suggestion 4. (H impact, H generality) Loop-carry dependency that depends on the ECX result of
PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM for address adjustment must be minimized. Isolate code paths
that expect the ECX result to be 16 (bytes) or 8 (words), and replace these values of ECX with constants in address
adjustment expressions to take advantage of the memory disambiguation hardware.
static char alphnrange[16]= {0x27, 0x27, 0x30, 0x39, 0x41, 0x5a, 0x61, 0x7a, 0x0};
static char alp_map8[32] = {0x0, 0x0, 0x0, 0x0, 0x80,0x0,0xff, 0x3,0xfe, 0xff, 0xff, 0x7, 0xfe, 0xff, 0xff, 0x7}; // 32
byte lookup table, 1s map to bit patterns of alpha numerics in alphnrange
int wordcnt_c(const char* s1)
{int i, j, cnt = 0;
char cc, cc2;
char flg[3]; // capture a wavelet to locate a falling edge
cc2 = cc = s1[0];
// use the compacted bit pattern to consolidate multiple comparisons into one look up
if( alp_map8[cc>>3] & ( 1<< ( cc & 7) ) )
{ flg[1] = 1; } // non-white-space char that is part of a word,
// we're including apostrophe in this example since counting the
// following 's' as a separate word would be kind of silly
else
{ flg[1] = 0; } // 0: whitespace, punctuations not be considered as part of a word
In Example 14-6, a 32-byte look-up table is constructed to represent the ASCII code values 0x0-0xff, and partitioned
with each bit of 1 corresponding to the specified subset of characters. While this bit-lookup technique simplifies the
comparison operations, data fetching remains byte-granular.
Example 14-7 shows an equivalent implementation of counting words using PCMPISTRM. The loop iteration is
performed at 16-byte granularity instead of byte granularity. Additionally, character set subsetting is easily expressed
using range value pairs and parallel comparisons between the range values and each byte in the text fragment are
performed by executing PCMPISTRI once.
psllw xmm1, 1
por xmm5, xmm1 ; combine MSB of last iter and the rest from current iter
pxor xmm5, xmm0; differentiate binary wave form into pattern of edges
pextrd edi, xmm5, 0 ; the edge pattern has (1 bit from last, 15 bits from this round)
jz _lastfragment; if xmm1 had a null, zf would be set
mov ecx, 16; xmm1, had no null char, advance 16 bytes
popcnt edi, edi; count both rising and trailing edges
add esi, edi; keep a running count of both edges
jmp short _loopc
_lastfragment:
popcnt edi, edi; count both rising and trailing edges
add esi, edi; keep a running count of both edges
shr esi, 1; word count corresponds to the trailing edges
mov len, esi
}
return len;
}
(Figure illustration: brute-force substring search of the reference string “BACAGMCM” in the target string
“BACACGCMBACAGMCM”; after the first four bytes partially match, a false match is detected and the search must
retrace three bytes in the target string before the byte-granular comparison restarts.)
The Knuth-Morris-Pratt algorithm1 (KMP) provides an elegant enhancement to overcome the retrace inefficiency of
brute-force substring searches. By deriving an overlap table that is used to manage the retrace distance when a partial
match leads to a false match, the KMP algorithm is very useful for applications that search relevant articles containing
keywords from a large corpus of documents.
Example 14-8 illustrates a C-code example of using KMP substring searches. Example 14-8 also includes the
calculation of the KMP overlap table.
int str_kmp_c(const char* s1, int cnt1, const char* s2, int cnt2 )
{ int i, j;
i = 0; j = 0;
while ( i+j < cnt1) {
if( s2[i] == s1[i+j]) {
i++;
if( i == cnt2) break; // found full match
}
else {
j = j+i - ovrlap_tbl[i]; // update the offset in s1 to start next round of string compare
if( i > 0) {
i = ovrlap_tbl[i]; // update the offset of s2 for next string compare should start at
}
}
}
return j; // offset of the match in s1 (a full match was found if i == cnt2)
}
void kmp_precalc(const char * s2, int cnt2)
{int i = 2;
char nch = 0;
ovrlap_tbl[0] = -1; ovrlap_tbl[1] = 0;
// pre-calculate KMP table
while( i < cnt2) {
if( s2[i-1] == s2[nch]) {
ovrlap_tbl[i] = nch +1;
i++; nch++;
}
else if ( nch > 0) nch = ovrlap_tbl[nch];
else {
ovrlap_tbl[i] = 0;
i++;
}
};
ovrlap_tbl[cnt2] = 0;
}
Typical usage of the KMP algorithm involves multiple invocations on the same reference string, so the overhead of
precalculating the overlap table is easily amortized. When a false match is determined at offset i of the reference
string, the overlap table predicts where the next round of string comparison should start (updating the offset j), and
the offset in the reference string at which byte-granular character comparison should resume.
1. Donald E. Knuth, James H. Morris, and Vaughan R. Pratt; SIAM J. Comput., Volume 6, Issue 2, pp. 323-350 (1977).
While the KMP algorithm provides an efficiency improvement over brute-force byte-granular substring search, its best
performance is still limited by the number of byte-granular operations. To demonstrate the versatility and built-in
lexical capability of PCMPISTRI, we show an Intel SSE4.2 implementation of substring search using a brute-force, 16-
byte granular approach in Example 14-9, and an implementation combining the KMP overlap table with PCMPISTRI-
based substring search in Example 14-10.
int strsubs_sse4_2i(const char* s1, int cnt1, const char* s2, int cnt2 )
{ int kpm_i=0, idx;
int ln1= 16, ln2=16, rcnt1 = cnt1, rcnt2= cnt2;
__m128i *p1 = (__m128i *) s1;
__m128i *p2 = (__m128i *) s2;
__m128i frag1, frag2;
int cmp, cmp2, cmp_s;
__m128i *pt = NULL;
if( cnt2 > cnt1 || !cnt1) return -1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
while(rcnt1 > 0)
{ cmp_s = _mm_cmpestrs(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
cmp = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
if( !cmp) { // we have a partial match that needs further analysis
if( cmp_s) { // if we're done with s2
if( pt)
{ idx = (int) ((char *) pt - (char *) s1) ; }
else
{ idx = (int) ((char *) p1 - (char *) s1) ; }
return idx;
}
// we do a round of string compare to verify full match till end of s2
if( pt == NULL) pt = p1;
cmp2 = 16;
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
while( cmp2 == 16 && rcnt2) { // each 16B frag matches,
rcnt1 = cnt1 - 16 -(int) ((char *)p1-(char *)s1);
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
if( rcnt1 <=0 || rcnt2 <= 0 ) break;
p1 = (__m128i *)(((char *)p1) + 16);
p2 = (__m128i *)(((char *)p2) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
cmp2 = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x18); // lsb, eq each
};
if( !rcnt2 || rcnt2 == cmp2) {
idx = (int) ((char *) pt - (char *) s1) ;
return idx;
}
else if ( rcnt1 <= 0) { // also cmp2 < 16, non match
if( cmp2 == 16 && ((rcnt1 + 16) >= (rcnt2+16) ) )
{idx = (int) ((char *) pt - (char *) s1) ;
return idx;
}
else return -1;
}
In Example 14-9, address adjustment using a constant to minimize loop-carry dependency is practiced in two places:
• In the inner while loop of string comparison to determine full match or false match (the result cmp2 is not used
for address adjustment, to avoid the dependency).
• In the last code block, when the outer loop executed PCMPISTRI to perform sixteen sets of ordered compares
between a target fragment and the first 16-byte fragment of the reference string, and all sixteen ordered
compare operations produced a false result (producing cmp with a value of 16).
Example 14-10 shows an equivalent intrinsic implementation of substring search using Intel SSE4.2 and the KMP overlap
table. When the inner loop of string comparison determines a false match, the KMP overlap table is consulted to
determine the address offsets for the target string fragment and the reference string fragment that minimize retrace.
Notably, a significant portion of retraces with a retrace distance of less than 15 bytes is avoided even in the brute-force
Intel SSE4.2 implementation of Example 14-9. This is due to the ordered-compare primitive of PCMPISTRI: “ordered
compare” performs sixteen sets of string fragment compares, and many false matches with fewer than fifteen bytes of
partial match can be filtered out in the same iteration that executed PCMPISTRI.
Retrace distances of greater than fifteen bytes are not filtered out by Example 14-9. By consulting the
KMP overlap table, Example 14-10 can eliminate retraces of greater than fifteen bytes.
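For reference, the overlap table itself can be built from the reference string with a simple byte-granular routine. The sketch below is illustrative; the array name is chosen to match the ovrlap_tbl consulted in Example 14-10, but the routine itself is not part of the listings.
/* Illustrative sketch: build a KMP overlap (prefix) table for the reference
   string s2 of length cnt2. ovrlap_tbl[i] is the length of the longest proper
   prefix of s2 that is also a suffix of s2[0..i]. */
void build_kmp_overlap(const char *s2, int cnt2, int *ovrlap_tbl)
{
    int i = 1, j = 0;
    ovrlap_tbl[0] = 0;
    while (i < cnt2) {
        if (s2[i] == s2[j]) {
            ovrlap_tbl[i++] = ++j;      /* extend the current prefix match */
        } else if (j > 0) {
            j = ovrlap_tbl[j - 1];      /* fall back to a shorter prefix   */
        } else {
            ovrlap_tbl[i++] = 0;        /* no prefix matches at this point */
        }
    }
}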
Example 14-10. Substring Search Using PCMPISTRI and KMP Overlap Table
int strkmp_sse4_2(const char* s1, int cnt1, const char* s2, int cnt2 )
{ int kpm_i=0, idx;
int ln1= 16, ln2=16, rcnt1 = cnt1, rcnt2= cnt2;
__m128i *p1 = (__m128i *) s1;
__m128i *p2 = (__m128i *) s2;
__m128i frag1, frag2;
int cmp, cmp2, cmp_s;
__m128i *pt = NULL;
if( cnt2 > cnt1 || !cnt1) return -1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
while(rcnt1 > 0)
{ cmp_s = _mm_cmpestrs(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
cmp = _mm_cmpestri(frag2, (rcnt2>ln2)? ln2: rcnt2, frag1, (rcnt1>ln1)? ln1: rcnt1, 0x0c);
if( !cmp) { // we have a partial match that needs further analysis
if( cmp_s) { // if we've reached the end with s2
if( pt)
{ idx = (int) ((char *) pt - (char *) s1) ; }
else
{ idx = (int) ((char *) p1 - (char *) s1) ; }
return idx;
}
// we do a round of string compare to verify full match till end of s2
if( pt == NULL) pt = p1;
cmp2 = 16;
rcnt2 = cnt2 - 16 -(int) ((char *)p2-(char *)s2);
Example 14-10. Substring Search Using PCMPISTRI and KMP Overlap Table (Contd.)
else{
if( kpm_i && ovrlap_tbl[kpm_i]) {
p2 = (__m128i *)(((char *)s2) );
frag2 = _mm_loadu_si128(p2);// load up to 16 bytes of fragment
//p1 = (__m128i *)(((char *)p1) );
The relative speedup of byte-granular KMP, brute-force Intel SSE4.2, and Intel SSE4.2 with the KMP overlap table over
byte-granular brute-force substring search is illustrated in the graph, which plots relative speedup against the percentage of
retrace for a reference string 55 bytes long. A retrace of 40% in the graph means that a false match is determined after a
partial match of the first 22 characters.
So when the brute-force, byte-granular code has to retrace, the other three implementations may be able to avoid the
need to retrace because:
• Example 14-8 can use the KMP overlap table to predict the start offset of the next round of string compare operations
after a partial-match/false-match, but forward movement after a first-character false match is still byte-granular.
• Example 14-9 can avoid retraces shorter than 15 bytes but is still subject to a retrace of 21 bytes after a partial-
match/false-match at byte 22 of the reference string. Forward movement after each ordered-compare false match
is 16-byte granular.
• Example 14-10 avoids the retrace of 21 bytes after a partial-match/false-match, but the KMP overlap table lookup incurs
some overhead. Forward movement after each ordered-compare false match is 16-byte granular.
(Figure: relative performance of the byte-granular brute-force (Brute), KMP, STTNI, and STTNI+KMP implementations plotted against the percentage of retrace, for retrace percentages from roughly 8.7% to 95%.)
p1 = (__m128i *) *pCtxt;
s1 = *pCtxt;
}
else p1 = (__m128i *) s1;
memset(&ws_map8[0], 0, 32);
while (sdlm[jj] ) {
ws_map8[ (sdlm[jj] >> 3) ] |= (1 << (sdlm[jj] & 7) ); jj ++;
}
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
stmpz = _mm_loadu_si128((__m128i *)sdelimiter);
// if the first char is not a delimiter , proceed to check non-delimiter,
// otherwise need to skip leading delimiter chars
if( ws_map8[s1[0]>>3] & (1 << (s1[0]&7)) ) {
start = s_idx = _mm_cmpistri(stmpz, frag1, 0x10);// unsigned bytes/equal any, invert, lsb
}
else start = s_idx = 0;
// check if we're dealing with short input string less than 16 bytes
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x10);
if( cmp_z) { // last fragment
if( !start) {
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);
if( endtok == 16) { // didn't find delimiter at the end, since it's null-terminated
// find where is the null byte
*pCtxt = s1+ 1+ _mm_cmpistri(frag1, frag1, 0x40);
return s1;
}
else { // found a delimiter that ends this word
s1[start+endtok] = 0;
*pCtxt = s1+start+endtok+1;
}
}
else {
if(!s1[start] ) {
*pCtxt = s1 + start +1;
return NULL;
}
p1 = (__m128i *)(((char *)p1) + start);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
if( endtok == 16) { // looking for delimiter, found none
*pCtxt = (char *)p1 + 1+ _mm_cmpistri(frag1, frag1, 0x40);
return s1+start;
}
else { // found delimiter before null byte
s1[start+endtok] = 0;
*pCtxt = s1+start+endtok+1;
}
}
}
else
{ while ( !cmp_z && s_idx == 16) {
p1 = (__m128i *)(((char *)p1) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
s_idx = _mm_cmpistri(stmpz, frag1, 0x10);// unsigned bytes/equal any, invert, lsb
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x10);
}
if(s_idx != 16) start = ((char *) p1 -s1) + s_idx;
else { // corner case if we ran to the end looking for a delimiter and never found a non-delimiter
*pCtxt = (char *)p1 +1+ _mm_cmpistri(frag1, frag1, 0x40);
return NULL;
}
if( !s1[start] ) { // in case a null byte follows delimiter chars
*pCtxt = s1 + start+1;
return NULL;
}
// now proceed to find how many non-delimiters are there
p1 = (__m128i *)(((char *)p1) + s_idx);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
endtok = ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
cmp_z = 0;
while ( !cmp_z && ldx == 16) {
p1 = (__m128i *)(((char *)p1) + 16);
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
ldx = _mm_cmpistri(stmpz, frag1, 0x00);// unsigned bytes/equal any, lsb
cmp_z = _mm_cmpistrz(stmpz, frag1, 0x00);
if(cmp_z) { endtok += ldx; }
}
An Intel SSE4.2 implementation of the equivalent functionality of strupr() using intrinsics is shown in Example 14-12.
static char uldelta[16]= {0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20, 0x20,
0x20, 0x20};
static char ranglc[6]= {0x61, 0x7a, 0x00, 0x00, 0x00, 0x00};
char * strup_sse4_2i( char* s1)
{int len = 0, res = 0;
__m128i *p1 = (__m128i *) s1;
__m128i frag1, ranglo, rmsk, stmpz, stmp1;
int cmp_c, cmp_z, cmp_s;
if( !s1[0]) return (char *) s1;
frag1 = _mm_loadu_si128(p1);// load up to 16 bytes of fragment
ranglo = _mm_loadu_si128((__m128i *)ranglc);// load up to 16 bytes of fragment
stmpz = _mm_loadu_si128((__m128i *)uldelta);
// This example demonstrates validation of surrogate pairs (32-bit code point) and
// tally the number of 16-bit and 32-bit code points in the text block
// Parameters: s1 is pointer to input utf-16 text block.
// pLen: store count of utf-16 code points
// return the number of 16-bit code point encoded in the surrogate range but do not form
// a properly encoded surrogate pair. if 0: s1 is a properly encoded utf-16 block,
// If return value >0 then s1 contains invalid encoding of surrogates
cc2 = cc = s1[0];
// map each word in s1 into bit patterns of 0, 1, or 2 using a table lookup
// the first half of a surrogate pair must be encoded between D800-DBFF and mapped as 2
// the 2nd half of a surrogate pair must be encoded between DC00-DFFF and mapped as 1
// regular 16-bit encodings are mapped to 0, except null code mapped to 3
flg[1] = utf16map[cc];
flg[0] = flg[1];
if(!flg[1]) cnt ++;
i = 1;
The VerStrlen() function for a UTF-16 encoded text block can be implemented using SSE4.2.
Example 14-14 shows the listing of the Intel SSE4.2 assembly implementation and Example 14-15 shows the listing of
the Intel SSE4.2 intrinsic implementation of VerStrlen().
// complementary range values for detecting either halves of 32-bit UTF-16 code point
static short ssch0[16]= {0x1, 0xd7ff, 0xe000, 0xffff, 0, 0};
// complementary range values for detecting the 1st half of 32-bit UTF-16 code point
static short ssch1[16]= {0x1, 0xd7ff, 0xdc00, 0xffff, 0, 0};
// complementary range values for detecting the 2nd half of 32-bit UTF-16 code point
static short ssch2[16]= {0x1, 0xdbff, 0xe000, 0xffff, 0, 0};
_loopc:
shl ecx, 1; pcmpistri with word processing return ecx in word granularity, multiply by 2 to get byte offset
add eax, ecx
movdqu xmm1, [eax] ; load a string fragment of up to 8 words
pcmpistri xmm2, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
; if there is a utf-16 null wchar in xmm1, zf will be set.
; if all 8 words in the comparison matched range,
; none of bits in the intermediate result will be set after polarity inversions,
; and ECX will return with a value of 8
jz short _lstfrag; if null code, handle last fragment
; if ecx < 8, ecx point to a word of either 1st or 2nd half of a 32-bit code point
cmp ecx, 8
jne _chksp
add ebx, ecx ; accumulate # of 16-bit non-null code points
mov ecx, 8 ; ecx must be 8 at this point, we want to avoid loop carry dependency
jmp _loopc
_chksp:; this fragment has word encodings in the surrogate value range
add ebx, ecx ; account for the 16-bit code points
shl ecx, 1; pcmpistri with word processing return ecx in word granularity, multiply by 2 to get byte offset
add eax, ecx
movdqu xmm1, [eax] ; ensure the fragment start with word encoding in either half
pcmpistri xmm3, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
jz short _lstfrag2; if null code, handle the last fragment
cmp ecx, 0 ; properly encoded 32-bit code point must start with 1st half
jg _invalidsp; some invalid s-p code point exists in the fragment
pcmpistri xmm4, xmm1, 15h; unsigned words, ranges, invert, lsb index returned to ecx
cmp ecx, 1 ; the 2nd half must follow the first half
jne _invalidsp
add edx, 1
mov ecx, 2
jmp _morept
_invalidsp2:
add edi, 1
mov ecx, 1
jmp _morept
_final:
add edx, ebx; add # of 16-bit and 32-bit code points
mov ecx, pLen; retrieve address of pointer provided by caller
mov [ecx], edx; store result of string length to memory
mov res, edi
}
return res;
}
else {
offset1 = 8; // increment address by 16 bytes to handle next fragment
cnt_16+= 8;
}
};
*pLen = cnt_16 + cnt_sp;
return cnt_invl;
}
Example 14-16. Replacement String Library Strcmp Using Intel® SSE4.2 (Contd.)
not_equal:
movzx eax, BYTE PTR[esi+edx]
movzx edx, BYTE PTR[edi+edx]
cmp eax, edx
cmova eax, ONE
cmovb eax, NEG_ONE
jmp ret_tag
too_close_pgb:
add edx, 1 ; do byte granular compare
movzx ecx, BYTE PTR[esi+edx-1]
movzx ebx, BYTE PTR[edi+edx-1]
cmp ecx, ebx
jne inequality
add ebx, ecx
jnz next
jmp ret_tag
inequality:
cmovb eax, NEG_ONE
cmova eax, ONE
ret_tag:
mov [val], eax
}
return(val);
}
In Example 14-16, eight instructions were added following the label “next” to perform 4-KByte boundary checking of the
addresses that will be used to load two string fragments into registers. If either address is found to be within sixteen
bytes of crossing over to the next page, the code branches to the byte-granular comparison path following the label
“too_close_pgb“.
The return values of Example 14-16 use the convention of returning 0, +1, -1 using CMOV. It is straightforward to
modify a few instructions to implement the convention of returning 0, a positive integer, or a negative integer.
presence/absence of sign, must be validated in mid-stream. The flexibility of the SSE4.2 primitives can handle
this kind of state-dependent validation well.
• Additionally, the exit condition that wraps up the arithmetic computation can occur in mid-stream, due either to invalid
characters or to the finite representable range of the data type (~10^19 for int64, no more than 10 non-zero-
leading digits for int32). This may lead one to believe that this type of data stream, consisting of short bursts, is not
suited for SIMD ISA and to settle for byte-granular solutions.
Because of the character-subset validation and the state-dependent nature, byte-granular implementations of the standard
library function tend to have a high start-up cost (for example, converting a single numerical digit to an integer may take
50 or 60 cycles) and low throughput (each additional numeric digit in the input character stream may take 6-8 cycles
per byte).
A high-level pseudo-operation flow for implementing a library replacement of atol() is described in Example 14-17.
Example 14-17. High-level flow of Character Subset Validation for String Conversion
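A sketch of the flow, inferred from the code in Example 14-18 and Example 14-19, is outlined below; this outline is illustrative rather than a reproduction of the original flow chart.
/* Illustrative outline of the character-subset validation flow for atol():
   1. Skip white space and an optional sign, using the bit-map of valid leading
      characters (white space, sign, numeric digits) such as BtMLValDecInt.
   2. Safely load up to 16 bytes (page-boundary aware) and use PCMPISTRI with a
      numeric-digit range to count consecutive valid digits.
   3. Skip leading zero digits, reload, and recount the significant digits.
   4. Convert the significant digits with SIMD arithmetic (merging digit pairs
      with PMADDWD), falling back when the count exceeds the int64 range.
   5. Apply the sign and return the 64-bit result. */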
Example 14-18 shows the code listing of an equivalent functionality of atol() capable of producing the int64 output range.
Auxiliary functions and data constants are listed in Example 14-19.
/* load up to 16 byte safely and check how many valid numeric digits we can do SIMD */
value0 = _mm_loadu_si128 ((__m128i *) rangenumint);
mask0 = __m128i_strloadu_page_boundary (p);
index = _mm_cmpistri (value0, mask0, 0x14);
zflag = _mm_cmpistrz (value0, mask0, 0x14);
/* index points to the first digit that is not a valid numeric digit */
if( !index) return 0;
else if (index == 16)
{ if( *p == '0') /* if all 16 bytes are numeric digits */
{ /* skip leading zero */
value1 = _mm_loadu_si128 ((__m128i *) rangenumintzr);
index = _mm_cmpistri (value1, mask0, 0x14);
zflag = _mm_cmpistrz (value1, mask0, 0x14);
while(index == 16 && !zflag )
{ p = ( char *) ((size_t) p + 16);
mask0 = __m128i_strloadu_page_boundary (p);
index = _mm_cmpistri (value1, mask0, 0x14);
zflag = _mm_cmpistrz (value1, mask0, 0x14);
}
/* now the 1st digit is non-zero, load up to 16 bytes and update index */
if( index < 16)
p = ( char *) ((size_t) p + index);
/* load up to 16 bytes of non-zero leading numeric digits */
mask0 = __m128i_strloadu_page_boundary (p);
/* update index to point to non-numeric character or indicate we may have more than 16 bytes */
index = _mm_cmpistri (value0, mask0, 0x14);
}
}
if( index == 0) return 0;
else if( index == 1) return (NegSgn? (long long) -(p[0]-48): (long long) (p[0]-48));
// Input digits in xmm are ordered in reverse order. the LS digit of output is next to eos
// least sig numeric digit aligned to byte 15 , and subtract 0x30 from each ascii code
mask0 = ShfLAlnLSByte( mask0, 16 -index);
w1_u8 = _mm_slli_si128 ( mask0, 1);
The general performance characteristics of an Intel SSE4.2-enhanced atol() replacement include a start-up cost that is
somewhat lower than that of byte-granular implementations generated from C code.
Example 14-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing
// bit lookup table of valid ascii code for decimal string conversion, white space, sign, numeric digits
static char BtMLValDecInt[32] = {0x0, 0x3e, 0x0, 0x0, 0x1, 0x28, 0xff, 0x03,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0};
// we use pmaddwd to merge two adjacent short integer pairs; this is the second step of merging each pair of 2-digit integers
static short MulplyPairBaseP2[8] =
{ 100, 1, 100, 1, 100, 1, 100, 1};
// Multiplier-pair for two adjacent short integer pair, this is the third step of merging each pair of 4-digit integers
static short MulplyPairBaseP4[8] =
{ 10000, 1, 10000, 1, 10000, 1, 10000, 1 };
Example 14-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing (Contd.)
case 13:
value = _mm_slli_si128 (value, 13); break;
case 14:
value = _mm_slli_si128 (value, 14); break;
case 15:
value = _mm_slli_si128 (value, 15); break;
}
return value;
}
For an input byte stream of no more than 16 non-zero-leading digits, the routine has constant performance. An input string
consisting of more than 16 bytes of non-zero-leading digits can be processed in about 100 cycles or less, compared
to around 200 cycles for a byte-granular solution. Even for shorter input strings of 9 non-zero-leading digits, the Intel
SSE4.2-enhanced replacement can achieve roughly 2X the performance of byte-granular solutions.
{
y = x;
while (y > 0)
{ r = (int) (y % base); // one digit at a time from least significant digit
y = y /base;
* --p_bkwd = digits[r];
len ++;
}
cnt = len;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
}
out[cnt] = 0;
return (int) cnt;
}
Example 14-20 employs an iterative sequence that processes one digit at a time using the hardware's native integer divide
instruction. The reliance on integer divide can be replaced by the fixed-point multiply technique discussed in Chapter 13,
"64-bit Mode Coding Guidelines". This is shown in Example 14-21.
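As background on the fixed-point multiply idea, a minimal scalar sketch for an unsigned divide by 10 is shown below; the magic constant is the standard 32-bit reciprocal (multiply, then shift right by 35) and is illustrative rather than taken from the listings, which use 64-bit variants of the same trick.
/* Illustrative sketch: replace y / 10 with a fixed-point multiply by a
   precomputed reciprocal. This computes floor(y/10) exactly for all 32-bit
   unsigned inputs; Example 14-21 applies the same idea to 64-bit values
   using a 128-bit intermediate product. */
#include <stdint.h>

static inline uint32_t div10_u32(uint32_t y)
{
    return (uint32_t)(((uint64_t)y * 0xCCCCCCCDull) >> 35);  /* = y / 10 */
}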
Example 14-21. Conversion of 64-bit Integer to ASCII without Integer Division (Contd.)
y =q;
* --p_bkwd = digits[r];
len ++;
}
*out++ = '-';
cnt = len +1;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
} else
{
y = x;
while (y > 0)
{ umul_64x64( &z128[0], y, cg_10_pms3);
q = z128[1] >> 3;
q = (y < q * (unsigned __int64) base)? q-1: q;
r = (int) (y - q * (unsigned __int64) base); // one digit at a time from least significant digit
y =q;
* --p_bkwd = digits[r];
len ++;
}
cnt = len;
while( len--) *out++ = *p_bkwd++; // copy each converted digit
}
out[cnt] = 0;
return cnt;
}
Example 14-21 provides a significant speed improvement by eliminating the reliance on integer division. However, the
numeric format conversion problem is still constrained by the dependent chain that processes one digit at a time.
SIMD techniques can be applied to this class of integer numeric conversion problem by noting that an unsigned 64-bit
integer spans a dynamic range of up to twenty digits. Such a wide dynamic range can be expressed as a
polynomial expression of the form:
a0 + a1*10^4 + a2*10^8 + a3*10^12 + a4*10^16, where
the dynamic range of each ai is [0, 9999].
Reduction of an unsigned 64-bit integer into up to 5 reduced-range coefficients can be computed using fixed-point
multiplies in stages. Once the dynamic range of the coefficients is reduced to no more than 4 digits, SIMD
techniques can be applied to compute the ASCII conversion in parallel.
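For illustration, the same reduction can be written in scalar form with ordinary division; the helper name below is hypothetical, and the SIMD listings that follow perform the equivalent reduction with fixed-point multiplies instead of the divisions used here.
/* Illustrative sketch (scalar): reduce an unsigned 64-bit value into up to five
   coefficients a[0..4], each in [0, 9999], such that
   x = a[0] + a[1]*10^4 + a[2]*10^8 + a[3]*10^12 + a[4]*10^16. */
static int reduce_base10k(unsigned long long x, unsigned int a[5])
{
    int n = 0;
    do {
        a[n++] = (unsigned int)(x % 10000);   /* next 4-digit coefficient */
        x /= 10000;
    } while (x != 0 && n < 5);
    return n;   /* number of coefficients produced */
}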
The SIMD technique to convert an unsigned 16-bit integer via radix 10 with input dynamic range [0, 9999] is shown in
Figure 14-4. This technique can also be generalized to other non-power-of-2 radixes that are less than 16.
(Figure 14-4: register layouts for parallel radix-10 conversion. A U32 value with range < 10^8 is decomposed as r0 + r1*10 + r2*100 + r3*1000 + r4*10^4 + r5*10^5 + r6*10^6 + r7*10^7, with the digits r0..r7 held in separate elements of a 128-bit register.)
To handle greater input dynamic ranges, the input is reduced into multiple unsigned short integers and converted
sequentially. The most significant U16 conversion is computed first, followed by the conversion of the next four
significant digits.
Example 14-22 shows the fixed-point multiply combined with parallel remainder computation using SSE4 instructions
for 64-bit integer conversion up to 19 digits.
/* macro to convert input parameter of short integer "hi4" into output variable "x3" which is __m128i;
the input value "hi4" is assumed to be less than 10^4;
the output is 4 single-digit integers between 0-9, located in the low byte of each dword,
most significant digit in the lowest DW.
implicit overwrites: locally allocated __m128i variable "x0", "x2"
*/
#define __ParMod10to4SSSE3( x3, hi4 ) \
{ \
x0 = _mm_shuffle_epi32( _mm_cvtsi32_si128( (hi4)), 0); \
x2 = _mm_mulhi_epu16(x0, _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d));\
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d)), 10); \
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
(x3) = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) (hi4), 1); \
(x3) = _mm_or_si128(x2, (x3));\
(x3) = _mm_madd_epi16((x3), _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;\
}
/* macro to convert input parameter of the 3rd dword element of "t5" ( __m128i type)
into output variable "x3" which is __m128i;
the third dword element of "t5" is assumed to be less than 10^4; the 4th dword must be 0;
the output is 4 single-digit integer between 0-9, located in the low byte of each dword,
MS digit in LS DW.
implicit overwrites: locally allocated __m128i variable "x0", "x2"
*/
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
__int64_t q;
t = b * (__int128_t)xx;
q = t>>(64 +26); // shift count associated with QWCG10to8
*pLo = xx - QWCONST10to8 * q;
return q;
}
/* convert integer between 2^63-1 and 0 to ASCII string */
int sse4i_q2a_u63 ( __int64_t xx, char *ps)
{int j, tmp, idx=0, cnt;
__int64_t lo8, hi8, abv16, temp;
__m128i x0, m0, x1, x2, x3, x4, x5, x6, m1;
long long w, u;
if ( xx < 10000 )
{ j = ubs_Lt10k_2s_i2 ( (unsigned ) xx, ps);
ps[j] = 0; return j;
}
if (xx < 100000000 ) // dynamic range of xx is less than 32-bits
{ m0 = _mm_cvtsi32_si128( xx);
x1 = _mm_shuffle_epi32(m0, 0x44); // broadcast to dw0 and dw2
x3 = _mm_mul_epu32(x1, _mm_loadu_si128( (__m128i *) pr_cg_10to4 ));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x1, 8), x3); // quotient in dw2, remainder in dw0
__ParMod10to4SSSE3v( x3, m0); // pack single digit from each dword to dw0
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8)); // move the remainder to dw2 first
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5); // pack digits in bytes 0-7 with leading 0
cnt = 8;
}
else
{ hi8 = u64mod10to8(&lo8, xx);
if ( hi8 < 10000) // decompose lo8 dword into quotient and remainder mod 10^4
{ m0 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3); // quotient in dw0
__ParMod10to4SSSE3( x3, hi8); // handle digits 11:8 first
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, m0); // handle digits 7:4
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw2) );
x4 = _mm_or_si128(x4, x5); // pack single digits in bytes 0-11 with leading 0
cnt = 12;
}
else
{ cnt = 0;
if ( hi8 >= 100000000) // handle input greater than 10^16
{ abv16 = u64mod10to8(&temp, (__int64_t)hi8);
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
hi8 = temp;
__ParMod10to4SSSE3( x3, abv16);
x6 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
cnt = 4;
} // start with handling digits 15:12
m0 = _mm_cvtsi32_si128( hi8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
m1 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m1, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m1 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
__ParMod10to4SSSE3v( x3, m0);
x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m0, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw1) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, m1);
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw2) );
x4 = _mm_or_si128(x4, x5);
__ParMod10to4SSSE3v( x3, _mm_slli_si128(m1, 8));
x5 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpkdw3) );
x4 = _mm_or_si128(x4, x5);
cnt += 16;
}
}
m0 = _mm_loadu_si128( (__m128i *) asc0reversebias);
if( cnt > 16)
{ tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(x6,_mm_setzero_si128()) );
x6 = _mm_sub_epi8(x6, m0);
} else {
tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(x4,_mm_setzero_si128()) );
}
#ifndef __USE_GCC__
__asm__ ("bsfl %1, %%ecx; movl %%ecx, %0;" :"=r"(idx) :"r"(tmp) : "%ecx");
#else
_BitScanForward(&idx, tmp);
#endif
x4 = _mm_sub_epi8(x4, m0);
cnt -= idx;
w = _mm_cvtsi128_si64(x4);
switch(cnt)
{ case 5: *ps++ = (char) (w >>24); *(unsigned *) ps = (w >>32);
break;
case 6: *(short *)ps = (short) (w >>16); *(unsigned *) (&ps[2]) = (w >>32);
break;
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
case 7: *ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(unsigned *) (&ps[3]) = (w >>32);
break;
case 8: *(long long *)ps = w;
break;
case 9: *ps++ = (char) (w >>24);
*(long long *) (&ps[0]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 10: *(short *)ps = (short) (w >>16);
*(long long *) (&ps[2]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 11: *ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(long long *) (&ps[3]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 12: *(unsigned *)ps = w;
*(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 4));
break;
case 13: *ps++ = (char) (w >>24); *(unsigned *) ps = (w >>32);
*(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 14: *(short *)ps = (short) (w >>16); *(unsigned *) (&ps[2]) = (w >>32);
*(long long *) (&ps[6]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 15: *ps = (char) (w >>8);
*(short *) (&ps[1]) = (short) (w >>16); *(unsigned *) (&ps[3]) = (w >>32);
*(long long *) (&ps[7]) = _mm_cvtsi128_si64( _mm_srli_si128(x4, 8));
break;
case 16: _mm_storeu_si128( (__m128i *) ps, x4);
break;
case 17: u = _mm_cvtsi128_si64(x6); *ps++ = (char) (u >>24);
_mm_storeu_si128( (__m128i *) &ps[0], x4);
break;
case 18: u = _mm_cvtsi128_si64(x6); *(short *)ps = (short) (u >>16);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
break;
case 19: u = _mm_cvtsi128_si64(x6); *ps = (char) (u >>8);
*(short *) (&ps[1]) = (short) (u >>16);
_mm_storeu_si128( (__m128i *) &ps[3], x4);
break;
case 20: u = _mm_cvtsi128_si64(x6); *(unsigned *)ps = (unsigned) (u);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
break;
}
return cnt;
}
/* convert input value into 4 single digits via parallel fixed-point arithmetic with each dword
element, and pack each digit into low dword element and write to buffer without leading
white space; input value must be < 10000 and > 9
*/
__inline int ubs_Lt10k_2s_i2(int x_Lt10k, char *ps)
{int tmp;
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
__m128i x0, m0, x2, x3, x4, compv;
// Use a set of scaling constants to compensate for the lack of a per-element shift count
compv = _mm_loadu_si128( (__m128i *) quo4digComp_mulplr_d);
// broadcast input value to each dword element
x0 = _mm_shuffle_epi32( _mm_cvtsi32_si128( x_Lt10k), 0);
// low to high dword in x0 : u16, u16, u16, u16
m0 = _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d); // load 4 congruent consts
x2 = _mm_mulhi_epu16(x0, m0); // parallel fixed-point multiply for base 10,100, 1000, 10000
x2 = _mm_srli_epi32( _mm_madd_epi16( x2, compv), 10);
// dword content in x2: u16/10, u16/100, u16/1000, u16/10000
x3 = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) x_Lt10k, 1);
//word content in x3: 0, u16, 0, u16/10, 0, u16/100, 0, u16/1000
x4 = _mm_or_si128(x2, x3);
// perform parallel remainder operation with each word pair to derive 4 unbiased single-digit result
x4 = _mm_madd_epi16(x4, _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;
x2 = _mm_add_epi32( x4, _mm_loadu_si128( (__m128i *) asc0bias) ) ;
// pack each ascii-biased digits from respective dword to the low dword element
x3 = _mm_shuffle_epi8(x2, _mm_loadu_si128( (__m128i *) bcstpklodw) );
// store ascii result to buffer without leading white space
if (x_Lt10k > 999 )
{ *(int *) ps = _mm_cvtsi128_si32( x3);
return 4;
}
else if (x_Lt10k > 99 )
{ tmp = _mm_cvtsi128_si32( x3);
*ps = (char ) (tmp >>8);
*((short *) (++ps)) = (short ) (tmp >>16);
return 3;
}
else if ( x_Lt10k > 9) // take advantage of reduced dynamic range > 9 to reduce branching
{ *((short *) ps) = (short ) _mm_extract_epi16( x3, 1);
return 2;
}
*ps = '0' + x_Lt10k;
return 1;
}
char lower_digits[] = "0123456789";
temp = -s1;
len ++;
beg[0] = '-';
if( temp < 10) beg[1] = digits[ (int) temp];
else len += sse4i_q2a_u63( temp, &buf[ 1]); // parallel conversion in 4-digit granular operation
}
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 (Contd.)
else {
if( s1 < 10) beg[ 0 ] = digits[(int)s1];
else len += sse4i_q2a_u63( s1, &buf[ 1] );
}
buf[len] = 0;
return len;
}
When an ltoa()-like utility implementation executes the native IDIV instruction to convert one digit at a time, it
produces output at a speed of about 45-50 cycles per digit. Using fixed-point multiply to replace IDIV (as in Example
14-21) can reduce the cost by 10-15 cycles per digit.
The range-reduction technique demonstrated in Example 14-22 reduces the up-to-19-level dependency chain down to
a five-level hierarchy and allows the parallel SIMD technique to perform 4-wide numeric conversion. This technique can
also be implemented with only Intel SSSE3, with similar speed improvement.
Support for conversion to wide-character strings can easily be adapted using the code snippet shown in Example
14-23.
Example 14-23. Conversion of 64-bit Integer to Wide Character String Using Intel® SSE4
else if ( x_Lt10k > 9){ // take advantage of reduced dynamic range > 9 to reduce branching
*(long long *) ps = _mm_cvtsi128_si64( _mm_srli_si128( x2, 8));
return 2;
}
*ps = L'0' + x_Lt10k;
return 1;
}
if ( xx < 10000 ) {
j = ubs_Lt10k_2wcs_i2 ( (unsigned ) xx, ps); ps[j] = 0; return j;
}
if (xx < 100000000 ) { // dynamic range of xx is less than 32-bits
m0 = _mm_cvtsi32_si128( xx);
x1 = _mm_shuffle_epi32(m0, 0x44); // broadcast to dw0 and dw2
x3 = _mm_mul_epu32(x1, _mm_loadu_si128( (__m128i *) pr_cg_10to4 ));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x1, 8), x3); // quotient in dw2, remainder in dw0
__ParMod10to4SSSE3v( x3, m0);
//x4 = _mm_shuffle_epi8(x3, _mm_loadu_si128( (__m128i *) bcstpklodw) );
x3 = _mm_shuffle_epi32(x3, 0x1b);
__ParMod10to4SSSE3v( x4, _mm_slli_si128(m0, 8)); // move the remainder to dw2 first
x4 = _mm_shuffle_epi32(x4, 0x1b);
cnt = 8;
} else {
hi8 = u64mod10to8(&lo8, xx);
if( hi8 < 10000) {
m0 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
Example 14-23. Conversion of 64-bit Integer to Wide Character String Using Intel® SSE4 (Contd.)
hi8 = temp;
__ParMod10to4SSSE3( x7, abv16);
x7 = _mm_shuffle_epi32(x7, 0x1b);
cnt = 4;
}
m0 = _mm_cvtsi32_si128( hi8);
x2 = _mm_shuffle_epi32(m0, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m0 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
m1 = _mm_cvtsi32_si128( lo8);
x2 = _mm_shuffle_epi32(m1, 0x44);
x3 = _mm_mul_epu32(x2, _mm_loadu_si128( (__m128i *)pr_cg_10to4));
x3 = _mm_mullo_epi32(_mm_srli_epi64(x3, 40), _mm_loadu_si128( (__m128i *)pr_1_m10to4));
m1 = _mm_add_epi32( _mm_srli_si128( x2, 8), x3);
__ParMod10to4SSSE3v( x3, m0);
x3 = _mm_shuffle_epi32(x3, 0x1b);
__ParMod10to4SSSE3v( x4, _mm_slli_si128(m0, 8));
x4 = _mm_shuffle_epi32(x4, 0x1b);
__ParMod10to4SSSE3v( x5, m1);
x5 = _mm_shuffle_epi32(x5, 0x1b);
__ParMod10to4SSSE3v( x6, _mm_slli_si128(m1, 8));
x6 = _mm_shuffle_epi32(x6, 0x1b);
cnt += 16;
}
}
Example 14-23. Conversion of 64-bit Integer to Wide Character String Using Intel® SSE4 (Contd.)
break;
case 8: _mm_storeu_si128( (__m128i *) &ps[0], x3);
_mm_storeu_si128( (__m128i *) &ps[4], x4);
break;
case9:*ps++ = (wchar_t) _mm_cvtsi128_si32( _mm_srli_si128( x3, 12));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) ps, x4);
_mm_storeu_si128( (__m128i *) &ps[4], x5);
break;
case10:*(long long *)ps = _mm_cvtsi128_si64( _mm_srli_si128( x3, 8));
x5 = _mm_add_epi32(x5, m0);
_mm_storeu_si128( (__m128i *) &ps[2], x4);
_mm_storeu_si128( (__m128i *) &ps[6], x5);
break;
Example 14-23. Conversion of 64-bit Integer to Wide Character String Using Intel® SSE4 (Contd.)
x6 = _mm_add_epi32(x6, m0);
_mm_storeu_si128( (__m128i *) &ps[8], x5);
_mm_storeu_si128( (__m128i *) &ps[12], x6);
break;
Using MULX to implement 128-bit integer output can be a useful building block for implementing library functions
ranging from atof/strtod to intermediate mantissa computation or mantissa/exponent normalization in 128-bit binary
decimal floating-point operations. Example 14-25 gives examples of building-block macros, used in 128-bit binary-
decimal floating-point operations, which can take advantage of MULX to calculate intermediate results of multiple-
precision integers with widths between 128 and 256 bits. Details of the binary-integer-decimal (BID) floating-point format
and a library implementation of BID operations can be found in the Intel® Decimal Floating-Point Math Library.
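As an illustration, a 64x64-to-128-bit multiply of this kind can be written with the MULX intrinsic (BMI2 required); the helper below is a sketch with an illustrative name, in the spirit of the umul_64x64() helper used in Example 14-21.
/* Sketch: 64x64 -> 128-bit unsigned multiply built on the MULX intrinsic. */
#include <immintrin.h>
#include <stdint.h>

static inline void mulx_64x64_128(uint64_t a, uint64_t b,
                                  uint64_t *lo, uint64_t *hi)
{
    unsigned long long high;
    *lo = _mulx_u64(a, b, &high);   /* returns the low 64 bits, writes the high 64 bits */
    *hi = high;
}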
Example 14-25. Building-block Macro Used in Binary Decimal Floating-point Operations (Contd.)
CHAPTER 15
OPTIMIZATIONS FOR INTEL® AVX, INTEL® AVX2,
& INTEL® FMA
Intel® Advanced Vector Extensions (Intel® AVX) is a major enhancement to Intel Architecture. It extends the
functionality of previous generations of 128-bit Intel® Streaming SIMD Extensions (Intel® SSE) vector instructions and
increases the vector register width to support 256-bit operations. The Intel AVX ISA enhancement is focused on floating-
point instructions. Some 256-bit integer vector operations are supported via floating-point to integer and integer to floating-
point conversions.
The Sandy Bridge microarchitecture implements the Intel AVX instructions, in most cases, on 256-bit hardware. Thus,
each core has 256-bit floating-point Add and Multiply units. The Divide and Square-root units are not widened to
256 bits, so Intel AVX instructions use the 128-bit hardware in two steps to complete these 256-bit operations.
Prior generations of Intel® SSE instructions generally use two-operand syntax, where one of the operands serves both
as source and as destination. Intel AVX instructions are encoded with a VEX prefix, which includes a bit field to encode
vector lengths and supports three-operand syntax. A typical instruction has two sources and one destination. Four-
operand instructions such as VBLENDVPS and VBLENDVPD exist as well. The added operand enables a non-destructive
source (NDS) and eliminates the need for register duplication using MOVAPS operations.
With the exception of MMX™ instructions, almost all legacy 128-bit Intel SSE instructions have Intel AVX equivalents
that support three-operand syntax. 256-bit Intel AVX instructions employ three-operand syntax, and some use four-
operand syntax.
The 256-bit vector register YMM extends the 128-bit XMM register to 256 bits; thus the lower 128 bits of each YMM are
aliased to the legacy XMM registers.
While 256-bit Intel AVX instructions write 256 bits of results to YMM, 128-bit Intel AVX instructions write 128 bits of
results into the XMM register and zero the upper bits above bit 128 of the corresponding YMM. Sixteen vector registers
are available in 64-bit mode; only the lower eight vector registers are available in non-64-bit modes.
Software can continue to use any mixture of legacy Intel SSE code, 128-bit Intel AVX code, and 256-bit Intel AVX code.
Later sections cover guidelines to deliver optimal performance across mixed-vector-length code modules without
experiencing transition delays between legacy Intel SSE and Intel AVX code. There are no transition delays when mixing
128-bit Intel AVX code and 256-bit Intel AVX code.
The optimal memory alignment of an Intel AVX 256-bit vector, stored in memory, is 32 bytes. Some 256-bit Intel AVX
data-movement instructions enforce 32-byte alignment and will signal a #GP fault if the memory operand is not properly
aligned. Most 256-bit Intel AVX instructions do not require address alignment; these instructions generally combine
load and compute operations, so any unaligned memory address can be used in them.
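For example, a minimal intrinsic sketch of the two load flavors (the function name is illustrative):
#include <immintrin.h>

/* VMOVAPS (_mm256_load_ps) enforces 32-byte alignment and faults on a
   misaligned address, while VMOVUPS (_mm256_loadu_ps) accepts any address. */
__m256 load_pair(const float *p)   /* p is assumed 32-byte aligned */
{
    __m256 a = _mm256_load_ps(p);        /* aligned load: #GP if p is not 32-byte aligned */
    __m256 u = _mm256_loadu_ps(p + 1);   /* unaligned load: no alignment requirement      */
    return _mm256_add_ps(a, u);
}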
For best performance, software should take care to align load and store addresses to 32 bytes whenever
possible. Most examples can be found in the Chapter 15 GitHub Repository; a link to each example there has been
added to its corresponding example in this chapter.
The major differences between using Intel AVX instructions and legacy Intel SSE instructions are summarized in
Table 15-1.
Table 15-1. Features Between 256-bit Intel® AVX, 128-bit Intel® AVX, and Legacy Intel® SSE Extensions
Features | 256-bit AVX | 128-bit AVX | Legacy SSE-AESNI
Functionality Scope | Floating-point operation, data movement. | Matches legacy SIMD ISA (except MMX). | 128-bit FP and integer SIMD ISA.
Register Operand | YMM. | XMM. | XMM.
Aligned Move Instructions | 32-byte alignment. | 16-byte alignment. | 16-byte alignment.
Non-destructive source operand | Yes | Yes | No
Intel® SSE code:
//SSE 128bit packed single load
__m128 Xmm_sin_cos = _mm_load_ps(sin_cos_theta_vec);
__m128 Xmm0, Xmm1, Xmm2, Xmm3;
loop1:
movsldup xmm0, [rax+rcx]
movshdup xmm1, [rax+rcx]
//example: mulps has 2 operands
mulps xmm0, xmm3
mulps xmm1, xmm4
addsubps xmm0, xmm1
// 16 byte store from an xmm register
movaps [rbx+rcx], xmm0
movsldup xmm0, [rax+rcx+16]
movshdup xmm1, [rax+rcx+16]
mulps xmm0, xmm3
mulps xmm1, xmm4
addsubps xmm0, xmm1
// offset of 16 bytes from previous store
movaps [rbx+rcx+16], xmm0
// Processed 32bytes in this loop
//(The code is unrolled twice)
add rcx, 32
cmp rcx, rdx
jl loop1
}
_mm_free(pInVector);
_mm_free(pOutVector);
return 0;
}

Intel® AVX code:
//AVX 256bit packed single load
__m256 Ymm_sin_cos = _mm256_load_ps(sin_cos_theta_vec);
__m256 Ymm0, Ymm1, Ymm2, Ymm3;
//processing 8 elements in an unrolled twice loop
loop1:
vmovsldup ymm0, [rax+rcx]
vmovshdup ymm1, [rax+rcx]
//example: vmulps has 3 operands
vmulps ymm0, ymm0, ymm3
vmulps ymm1, ymm1, ymm4
vaddsubps ymm0, ymm0, ymm1
// 32 byte store from an ymm register
vmovaps [rbx+rcx], ymm0
vmovsldup ymm0, [rax+rcx+32]
vmovshdup ymm1, [rax+rcx+32]
vmulps ymm0, ymm0, ymm3
vmulps ymm1, ymm1, ymm4
vaddsubps ymm0, ymm0, ymm1
// offset of 32 bytes from previous store
vmovaps [rbx+rcx+32], ymm0
// Processed 64bytes in this loop
//(The code is unrolled twice)
add rcx, 64
cmp rcx, rdx
jl loop1
}
_mm_free(pInVector);
_mm_free(pOutVector);
return 0;
}
The 128-bit Intel AVX code in this example takes advantage of NDS; the additional load and register copies of the Intel SSE
baseline are eliminated. This code uses eight micro-ops to process four elements and is about 30%
faster than the baseline above.
The 256-bit Intel AVX code in this example uses eight micro-ops to process eight elements.
Combining the NDS feature with the doubling of the vector width speeds up the baseline by more than 2x.
Intel® SSE code (baseline):
loop1:
//Load A
movups xmm0, [rax+r8*4]
//Copy A
movups xmm1, [rax+r8*4]
//A^2
mulps xmm1, xmm1
//Copy A^2
movups xmm2, xmm1
//A^3
mulps xmm2, xmm0
//A + A^2
addps xmm0, xmm1
//A + A^2 + A^3
addps xmm0, xmm2
//Store result
movups [rbx+r8*4], xmm0
sub r8, 4
jge loop1
}

128-bit Intel® AVX code:
loop1:
//Load A
vmovups xmm0, [rax+r8*4]
//A^2
vmulps xmm1, xmm0, xmm0
//A^3
vmulps xmm2, xmm1, xmm0
//A+A^2
vaddps xmm0, xmm0, xmm1
//A+A^2+A^3
vaddps xmm0, xmm0, xmm2
//Store result
vmovups [rbx+r8*4], xmm0
sub r8, 4
jge loop1
}

256-bit Intel® AVX code:
loop1:
//Load A
vmovups ymm0, [rax+r8*4]
//A^2
vmulps ymm1, ymm0, ymm0
//A^3
vmulps ymm2, ymm1, ymm0
//A+A^2
vaddps ymm0, ymm0, ymm1
//A+A^2+A^3
vaddps ymm0, ymm0, ymm2
//Store result
vmovups [rbx+r8*4], ymm0
sub r8, 8
jge loop1
}
• Intel AVX and Intel SSE code can co-exist and execute in the same run. This can happen if your application includes
third-party libraries with Intel SSE code, if a new DLL using Intel AVX code is deployed that calls other modules
running Intel SSE code, or if you cannot recompile all of your application at once. In these cases, the Intel AVX code
must use the VZEROUPPER instruction to avoid the AVX/SSE transition penalty.
Intel AVX instructions always modify the upper bits of YMM registers and Intel SSE instructions do not modify the
upper bits. From a hardware perspective, the upper bits of the YMM register collection can be considered to be in one
of three states:
• Clean: All upper bits of YMM are zero. This is the state when the processor starts from RESET.
• Modified and Unsaved (In Table 15-2, this is abbreviated as M/U): The execution of one Intel AVX instruction
(either 256-bit or 128-bit) modifies the upper bits of the destination YMM. This is also referred to as dirty upper
YMM state. In this state, bits 255:128 and bits 127:0 of a given YMM are related to the most recent 256-bit or 128-
bit AVX instruction that operated on that register.
• Preserved/Non_INIT Upper State (In Table 15-2, this is abbreviated as P/N): In this state, the upper YMM state is
not zero. The upper 128 bits of a YMM and the lower 128 bits may be unrelated to the last Intel AVX instruction
executed in the processor as a result of XRSTOR from a saved image with dirty upper YMM state.
If software intermixes Intel AVX and Intel SSE instructions without using VZEROUPPER properly, it can experience an
Intel AVX/Intel SSE transition penalty. The situations of executing Intel SSE or Intel AVX instructions, or managing the YMM state
using XSAVE/XRSTOR/VZEROUPPER/VZEROALL, are illustrated in Figure 15-1. The penalty associated with transitions
into or out of the processor state “Modified and Unsaved” is implementation specific, depending on the
microarchitecture.
Figure 15-1 depicts the situations in which a transition penalty occurs for earlier microarchitectures that support Intel
AVX. The transition penalties A and B occur with each instruction execution that causes the transition; the penalty is
largely the cost of copying the entire YMM state to internal storage.
To minimize the occurrence of YMM state transitions related to the “Preserved/Non_INIT Upper State”, software that
uses XSAVE/XRSTOR family of instructions to save/restore the YMM state should write a “Clean” upper YMM state to
the XSAVE region in memory. Restoring a dirty YMM image from memory into the YMM registers can experience a
penalty. This is illustrated in Figure 15-1.
The Skylake microarchitecture implements a different state machine than prior generations to manage the YMM
state transitions associated with mixing Intel SSE and Intel AVX instructions. It no longer saves the entire upper YMM
state when executing an Intel SSE instruction in the “Modified and Unsaved” state, but saves the upper bits of the
individual registers. As a result, mixing Intel SSE and Intel AVX instructions incurs a penalty associated with the
partial register dependency of the destination registers being used and an additional blend operation on the upper bits
of the destination registers. Figure 15-2 depicts the transition penalties applicable to the Skylake microarchitecture.
(Figure 15-1: Intel® AVX-Intel® SSE transition state diagram for earlier microarchitectures, showing the Clean Upper State and Dirty Upper State and the transitions caused by executing Intel SSE, 128-bit or 256-bit Intel AVX, VZEROUPPER/VZEROALL, XSAVE, and XRSTOR.)
Table 15-2 lists the effect of mixing Intel AVX and Intel SSE code, with the bottom row indicating the types of penalty
that might arise depending on the initial YMM state (the row marked ‘Begin’) and the ending state. Table 15-2 also
includes the effect of transition penalty (Type C and D) associated with restoring a dirty YMM state image stored in
memory.
(Figure 15-2 state diagram: in addition to the Clean Upper State and Dirty Upper State, an XSAVE'd dirty (Non-INIT) image in memory; executing SSE from the dirty state incurs Penalty A, and XRSTOR of a dirty image incurs Penalty C; transitions are caused by executing SSE, 128-bit or 256-bit AVX, VZEROUPPER/VZEROALL, XSAVE, and XRSTOR.)
Figure 15-2. Intel® AVX- Intel® SSE Transitions in the Skylake Microarchitecture
Penalty type (bottom row of Table 15-2): No, A, No, No, No, B, No, No, B, D, C, No.
The magnitude of each type of transition penalty can vary across different microarchitectures. In the Skylake
microarchitecture, some of the transition penalties are reduced. The transition diagram and associated penalties are
depicted in Figure 15-2 and Table 15-2. Table 15-3 gives the approximate order of magnitude of the different transition penalty types
across recent microarchitectures.
Table 15-3. Approximate Magnitude of Intel® AVX—Intel® SSE Transition Penalties in Different Microarchitectures
Type | Haswell Microarchitecture | Broadwell Microarchitecture | Skylake Microarchitecture | Ice Lake Client Microarchitecture
A | ~XSAVE | ~XSAVE | Partial Register Dependency + Blend | ~XSAVE
B | ~XSAVE | ~XSAVE | NA | ~XSAVE
To enable fast transitions between 256-bit Intel AVX and Intel SSE code blocks, use the VZEROUPPER instruction
before and after an AVX code block that needs to switch to executing SSE code. The VZEROUPPER instruction
resets the upper 128 bits of all Intel AVX registers. This instruction has zero latency. In addition, the processor changes
back to a Clean state, after which execution of SSE instructions or Intel AVX instructions has no transition penalty on
prior microarchitectures. On the Skylake microarchitecture, the SSE block can execute from a Clean state without the
penalty of the upper-bits dependency and blend operation.
128-bit Intel AVX instructions zero the upper 128-bits of the destination registers. Therefore, 128-bit and 256-bit Intel
AVX instructions can be mixed with no penalty.
Assembly/Compiler Coding Rule 63. (H impact, H generality) Whenever a 256-bit AVX code block and 128-bit SSE
code block might execute in sequence, use the VZEROUPPER instruction to facilitate a transition to a “Clean” state
for the next block to execute from.
Without VZEROUPPER (AVX/SSE transition penalty):
__asm vaddps ymm1, ymm2, ymm3
..
//penalty
SSE_function();
AVX_function_no_zeroupper();
//penalty
__asm addps xmm1, xmm2

With VZEROUPPER (no transition penalty):
__asm vaddps ymm1, ymm2, ymm3
//add vzeroupper before calling SSE function from AVX code
__asm vzeroupper //no penalty
SSE_function();
AVX_function_with_zeroupper();
//no penalty
__asm addps xmm1, xmm2
Table 15-4 summarizes a heuristic of the performance impact of using or not using VZEROUPPER to bridge transitions
for inter-function calls that change between AVX code and SSE code.
Table 15-4. Effect of VZEROUPPER with Inter-Function Calls Between AVX and SSE Code
Inter-Function Call | Prior Microarchitectures | Skylake Microarchitecture
With VZEROUPPER | 1X (baseline) | ~1
(Figure: source operand layout for in-lane operations; SRC1 holds elements X7..X0 and SRC2 holds elements Y7..Y0.)
Some vectorized algorithms implemented with SSE instructions cannot use the simple conversion described above. For
example, shuffles that move elements within 16 bytes cannot be naturally converted to 32-byte shuffles, since 32-byte
shuffles cannot cross lanes.
You can use the following instructions as building blocks for working with lanes:
• VINSERTF128 - insert packed floating-point values.
• VEXTRACTF128 - extract packed floating-point values.
• VPERM2F128 - permute floating-point values.
• VBROADCAST - load with broadcast.
The sections below describe two techniques: the strided loads and the cross register overlap. These methods
implement the in lane data arrangement described above and are useful in many algorithms that initially seem to
require cross lane calculations.
The values in the low lanes of Ymm1 and Ymm2 in the figure above correspond to iteration i in the SSE
implementation. Similarly, the values in the high lanes of Ymm1 and Ymm2 correspond to iteration i+1.
Example 15-5 demonstrates the strided load method in a conversion of an Array of Structures (AoS) to a Structure of
Arrays (SoA). In this example, the input buffer contains complex numbers in an AoS format. Each complex number is
made of a real and an imaginary float value. The output buffer is arranged as SoA. All the real components of the
complex numbers are located in the first half of the output buffer and all the imaginary components are located in the
second half of the buffer. The following pseudo code and figure illustrate the conversion:
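A minimal scalar sketch of this conversion, with illustrative names, is:
/* Scalar sketch of the AoS-to-SoA conversion: input is an array of complex
   numbers stored as interleaved (real, imag) float pairs; output places all
   real parts in the first half of the destination and all imaginary parts in
   the second half. */
void aos_to_soa(const float *in, float *re_out, float *im_out, int num_complex)
{
    for (int i = 0; i < num_complex; i++) {
        re_out[i] = in[2 * i];       /* real component      */
        im_out[i] = in[2 * i + 1];   /* imaginary component */
    }
}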
A simple extension of the Intel SSE algorithm from 16-byte to 32-byte operations would require a cross-lane data
transition, as shown in the following figure. However, this is not possible with the Intel AVX architecture, and a different
technique is required.
The challenge of the cross-lane shuffle can be overcome with Intel AVX for AoS to SoA conversion. Using VINSERTF128 to
load 16 bytes into the appropriate lane of the YMM registers obviates the need for a
cross-lane shuffle. Once the data is organized properly in the YMM registers for step 1, a 32-byte VSHUFPS can be used
to move the data within lanes, as shown in step 2.
The following code compares the Intel SSE implementation of AoS to SoA with the 256-bit Intel AVX implementation
and demonstrates the performance gained.
Example 15-6. AoS to SoA Conversion of Complex Numbers Using Intel® AVX
Intel® SSE Code:
xor rbx, rbx
xor rdx, rdx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr1
mov rax, outPtr2
loop1:
movups xmm0, [rdi+rbx] //i1 r1 i0 r0
movups xmm1, [rdi+rbx+16] // i3 r3 i2 r2
movups xmm2, xmm0
shufps xmm0, xmm1, 0xdd //i3 i2 i1 i0
shufps xmm2, xmm1, 0x88 //r3 r2 r1 r0
movups [rax+rdx], xmm0
movups [rsi+rdx], xmm2
add rdx, 16
add rbx, 32
cmp rcx, rbx
jnz loop1

Intel® AVX Code:
xor rbx, rbx
xor rdx, rdx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr1
mov rax, outPtr2
loop1:
vmovups xmm0, [rdi+rbx] //i1 r1 i0 r0
vmovups xmm1, [rdi+rbx+16] // i3 r3 i2 r2
vinsertf128 ymm0, ymm0, [rdi+rbx+32], 1 //i5 r5 i4 r4; i1 r1 i0 r0
vinsertf128 ymm1, ymm1, [rdi+rbx+48], 1 //i7 r7 i6 r6; i3 r3 i2 r2
vshufps ymm2, ymm0, ymm1, 0xdd //i7 i6 i5 i4; i3 i2 i1 i0
vshufps ymm3, ymm0, ymm1, 0x88 //r7 r6 r5 r4; r3 r2 r1 r0
vmovups [rax+rdx], ymm2
vmovups [rsi+rdx], ymm3
add rdx, 32
add rbx, 64
cmp rcx, rbx
jnz loop1
The Median3 code sample below demonstrates the register-overlap technique. The Median3 algorithm calculates the
median of every three consecutive elements in a vector:
Y[i] = Median( X[i], X[i+1], X[i+2] )
where Y is the output vector and X is the input vector. The following figure illustrates the calculation done by the
median algorithm.
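A scalar reference for this operation, with illustrative names, might look like:
/* Scalar reference for Median3: each output element is the median of three
   consecutive input elements. */
static inline float median3(float a, float b, float c)
{
    float lo = (a < b) ? a : b;
    float hi = (a < b) ? b : a;
    return (c < lo) ? lo : ((c > hi) ? hi : c);   /* clamp c into [lo, hi] */
}

void median3_vec(const float *x, float *y, int len)
{
    for (int i = 0; i + 2 < len; i++)
        y[i] = median3(x[i], x[i + 1], x[i + 2]);
}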
add rdi, 16
add rbx, 8 vshufps ymm2, ymm0, ymm3, 0x99
add rbx, 4
add rdi, 32 add rbx, 8
shufps xmm2, xmm4, 0x4e
vminps ymm4, ymm0, ymm1 vminps ymm5, ymm0, ymm2
shufps xmm1, xmm2, 0x99
vmaxps ymm0, ymm0, ymm1 vmaxps ymm0, ymm0, ymm2
minps xmm3, xmm1
vminps ymm3, ymm0, ymm2 vminps ymm4, ymm0, ymm3
maxps xmm0, xmm1
vmaxps ymm5, ymm3, ymm4 vmaxps ymm7, ymm4, ymm5
minps xmm0, xmm2
vmovaps [rsi], ymm5 vmovaps ymm0, ymm6
maxps xmm0, xmm3
add rsi, 32 vmovaps [rsi], ymm7
movaps [rsi], xmm0
vmovaps ymm0, [rdi] add rsi, 32
movaps xmm0, xmm4
cmp rbx, rcx cmp rbx, rcx
add rsi, 16
jl loop_start jl loop_start
cmp rbx, rcx
jl loop_start
Following are 3 implementations for the gather operation from an array of 4-byte elements.
• Alternative one is a scalar implementation using general purpose registers.
• Alternatives two and three use Intel AVX instructions.
The figure below shows code snippets from Example 15-8 assuming that it runs the first iteration on data from the
previous figure.
Performance of the Intel AVX examples is similar to the performance of a corresponding Intel SSE implementation.
The table below shows the three gather implementations.
loop1: loop1:
loop1:
mov rax, [rdx] mov rax, [rdx + 4*rcx]
mov rax, [rdx + 4*rcx]
movsxd rbx, eax movsxd rbx, eax
movsxd rbx, eax
sar rax, 32 sar rax, 32
sar rax, 32
mov ebx, [rdi + 4*rbx] vmovss xmm1, [rdi + 4*rbx]
vmovss xmm1, [rdi + 4*rbx]
mov [rsi], ebx vinsertps xmm1, xmm1, [rdi +
vinsertps xmm1, xmm1, [rdi + 4*rax],
mov eax, [rdi + 4*rax] 4*rax], 0x10
0x10
mov [rsi + 4], eax mov rax, [rdx + 8 + 4*rcx]
mov rax, [rdx + 8 + 4*rcx]
mov rax, [rdx + 8] movsxd rbx, eax
movsxd rbx, eax
movsxd rbx, eax sar rax, 32
sar rax, 32
sar rax, 32 vmovss xmm3, [rdi + 4*rbx]
vinsertps xmm1, xmm1, [rdi + 4*rbx],
mov ebx, [rdi + 4*rbx] vinsertps xmm3, xmm3, [rdi +
0x20
mov [rsi + 8], ebx 4*rax], 0x10
vinsertps xmm1, xmm1, [rdi + 4*rax],
mov eax, [rdi + 4*rax] vshufps xmm1, xmm1, xmm3,
0x30
mov [rsi + 12], eax 0x44
Example 15-8. Data Gather - Intel® AVX versus Scalar Code (Contd.)
The following example includes a scalar implementation and an Intel AVX implementation of a scatter sequence. The
Intel AVX examples consist mainly of 128-bit Intel AVX instructions. Performance of the Intel AVX examples is similar
to the performance of corresponding Intel SSE implementation.
movrdi, InBuf
movrdi, InBuf
movrsi, OutBuf
movrsi, OutBuf
movrdx, Index
movrdx, Index
xor rcx, rcx
xor rcx, rcx
loop1:
loop1:
vmovaps ymm0, [rdi + 4*rcx]
movsxd rax, [rdx]
movsxd rax, [rdx + 4*rcx]
mov ebx, [rdi]
movsxd rbx, [rdx + 4*rcx + 4]
mov [rsi + 4*rax], ebx
vmovss [rsi + 4*rax], xmm0
movsxd rax, [rdx + 4]
movsxd rax, [rdx + 4*rcx + 8]
mov ebx, [rdi + 4]
vpalignr xmm1, xmm0, xmm0, 4
mov [rsi + 4*rax], ebx
movsxd rax, [rdx + 8]
vmovss [rsi + 4*rbx], xmm1
movsxd rbx, [rdx + 4*rcx + 12]
mov ebx, [rdi + 8]
vpalignr xmm2, xmm0, xmm0, 8
mov [rsi + 4*rax], ebx
vmovss [rsi + 4*rax], xmm2
movsxd rax, [rdx + 12]
movsxd rax, [rdx + 4*rcx + 16]
mov ebx, [rdi + 12]
vpalignr xmm3, xmm0, xmm0, 12
mov [rsi + 4*rax], ebx
vmovss [rsi + 4*rbx], xmm3
movsxd rax, [rdx + 16]
movsxd rbx, [rdx + 4*rcx + 20]
mov ebx, [rdi + 16]
vextractf128 xmm0, ymm0, 1
mov [rsi + 4*rax], ebx
vmovss [rsi + 4*rax], xmm0
movsxd rax, [rdx + 20]
movsxd rax, [rdx + 4*rcx + 24]
add rdi, 64
cmp rdi, rdx
jl start_loop
SAXPY is a memory-intensive kernel that emphasizes the importance of data alignment. Optimal performance
requires both data source addresses and the destination address to be 32-byte aligned.
• If only one of the three addresses is not aligned to a 32-byte boundary, the performance may be halved.
• If all three addresses are misaligned relative to 32 bytes, the performance degrades further.
• In some cases, unaligned accesses may result in lower performance for Intel AVX code compared to Intel SSE
code.
Other Intel AVX kernels typically have more computation, which can reduce the effect of the data alignment penalty.
Assembly/Compiler Coding Rule 64. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer
store alignment over load alignment.
• You can use dynamic data alignment using the _mm_malloc intrinsic with the Intel® Compiler, or
_aligned_malloc of the Microsoft® Compiler.
For example:
//dynamically allocating 32byte aligned buffer with 2048 float elements.
InputBuffer = (float*) _mm_malloc (2048*sizeof(float), 32);
• You can use static data alignment using __declspec(align(32)).
For example:
//Statically allocating 32byte aligned buffer with 2048 float elements.
__declspec(align(32)) float InputBuffer[2048];
NOTE
Beginning with Skylake microarchitecture, this optimization is not necessary. The only case where
16-byte loads may be more efficient is when the data is 16-byte aligned but not 32-byte aligned. In
this case 16-byte loads might be preferable as no cache line split memory accesses are issued.
Consider replacing unaligned 32-byte memory accesses using a combination of VMOVUPS, VINSERTF128, and
VEXTRACTF128 instructions.
Example 15-11. Using 16-Byte Memory Operations for Unaligned 32-Byte Memory Operation
Convert 32-byte loads as follows:
vmovups ymm0, mem -> vmovups xmm0, mem
vinsertf128 ymm0, ymm0, mem+16, 1
Convert 32-byte stores as follows:
vmovups mem, ymm0 -> vmovups mem, xmm0
vextractf128 mem+16, ymm0, 1
The following intrinsics are available to handle unaligned 32-byte memory operations using 16-byte memory
accesses:
_mm256_loadu2_m128 ( float const * addr_hi, float const * addr_lo);
_mm256_loadu2_m128d ( double const * addr_hi, double const * addr_lo);
_mm256_loadu2_m128i ( __m128i const * addr_hi, __m128i const * addr_lo);
_mm256_storeu2_m128 ( float * addr_hi, float * addr_lo, __m256 a);
_mm256_storeu2_m128d ( double * addr_hi, double * addr_lo, __m256d a);
_mm256_storeu2_m128i ( __m128i * addr_hi, __m128i * addr_lo, __m256i a);
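For example, a minimal sketch using the first of these intrinsics (the wrapper name is illustrative):
#include <immintrin.h>

/* Form one 256-bit vector from two unaligned 16-byte loads with
   _mm256_loadu2_m128 (equivalent to VMOVUPS of the low half followed by
   VINSERTF128 of the high half). */
__m256 load_32B_as_two_16B(const float *p)
{
    return _mm256_loadu2_m128(p + 4, p);   /* high 128 bits from p+4, low 128 bits from p */
}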
Example 15-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32-byte loads
and alternative 2 uses 16-byte loads. These code samples are executed with two source buffers, src1 and src2, at a 4-byte
offset from 32-byte alignment, and a destination buffer, DST, that is 32-byte aligned. Using two 16-byte memory
operations in lieu of 32-byte memory accesses performs faster.
AVX with 32-byte memory operations:
mov rax, src1
mov rbx, src2
mov rcx, dst
mov rdx, len
xor rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups ymm1, [rax + rdi]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx + rdi]
vaddps ymm1, ymm1, ymm2
vmovups [rcx + rdi], ymm1
vmovups ymm1, [rax+rdi+32]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx+rdi+32]
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add rdi, 64
cmp rdi, rdx
jl start_loop

AVX using two 16-byte memory operations:
mov rax, src1
mov rbx, src2
mov rcx, dst
mov rdx, len
xor rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups xmm2, [rax+rdi]
vinsertf128 ymm2, ymm2, [rax+rdi+16], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [rbx + rdi]
vinsertf128 ymm2, ymm2, [rbx+rdi+16], 1
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi], ymm1
vmovups xmm2, [rax+rdi+32]
vinsertf128 ymm2, ymm2, [rax+rdi+48], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [rbx+rdi+32]
vinsertf128 ymm2, ymm2, [rbx+rdi+48], 1
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add rdi, 64
cmp rdi, rdx
jl start_loop
Assembly/Compiler Coding Rule 65. (M impact, H generality) Align data to a 32-byte boundary when possible. If it is
not possible to align both loads and stores, then prefer store alignment over load alignment.
When a load misses the L1D Cache, a cache line with the requested data is brought from a higher memory hierarchy
level. In memory intensive code where the L1D Cache is always active, replacing a cache line in the L1D Cache may
delay other loads. In Sandy Bridge and Ivy Bridge microarchitectures, the penalty for 32-Byte loads may be higher
than the penalty for 16-Byte loads. Therefore, memory intensive Intel AVX code with 32-Byte loads and with data set
larger than the L1D Cache may be slower than similar code with 16-Byte loads.
When Example 15-12 is run with a data set that resides in the L2 Cache, the 16-byte memory access implementation
is slightly faster than the 32-byte memory operation.
Be aware that the relative merit of 16-byte memory accesses versus 32-byte memory access is implementation
specific across generations of microarchitectures.
In Haswell microarchitecture, the L1D Cache can support two 32-byte fetches each cycle.
15.8 4K ALIASING
4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that it loads from a
different memory location with a 4-KByte offset between them. For example, a load to linear address 0x400020
follows a store to linear address 0x401020.
The load and store have the same value for bits 5 - 11 of their addresses and the accessed byte offsets should have
partial or complete overlap.
4K aliasing may have a five-cycle penalty on the load latency. This penalty may be significant when 4K aliasing
happens repeatedly and the loads are on the critical path. If the load spans two cache lines it might be delayed until
the conflicting store is committed to the cache. Therefore 4K aliasing that happens on repeated unaligned Intel AVX
loads incurs a higher performance penalty.
To detect 4K aliasing, use the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event that counts the number of times Intel AVX
loads were blocked due to 4K aliasing.
To resolve 4K aliasing, try the following methods in the following order:
• Align data to 32 bytes.
• Change offsets between input and output buffers, if possible (see the sketch after this list).
• Sandy Bridge and Ivy Bridge microarchitectures may benefit from using 16-byte memory accesses on memory
which is not 32-byte aligned.
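As a sketch of the buffer-offset check suggested above (the helper name is illustrative), the following test detects whether two streams share address bits 5-11; if it returns nonzero, padding one buffer by a cache line or two removes the aliasing:

#include <stdint.h>
/* Hypothetical helper: returns nonzero when a store to 'dst' and a load from 'src'
   share address bits 5-11, i.e., sit at a multiple-of-4-KByte offset and can 4K-alias. */
static int may_4k_alias(const void *src, const void *dst)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 0xFE0u) == 0;
}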
• Masked loads that include an illegal address range do not result in an exception if the range is under a zero mask
value. However, the processor may take a multi-hundred-cycle “assist” to determine that no part of the illegal
range has a one mask value. This assist may occur even when the mask is “zero” and it seems obvious to the
programmer that the load should not be executed.
When using VMASKMOV, consider the following:
• Use VMASKMOV only in cases where VMOVUPS cannot be used.
• Use VMASKMOV on 32-byte aligned addresses if possible.
• If possible, use a valid address range for masked loads, even if the illegal part is masked with zeros.
• Determine the mask as early as possible.
• Avoid store-forwarding issues by performing loads prior to a VMASKMOV store if possible.
• Be aware of mask values that would cause the VMASKMOV instruction to require an assist (if an assist is required,
the latency of VMASKMOV to load data increases dramatically):
— Loading data using VMASKMOV with a mask value selecting 0 elements from an illegal address requires an
assist.
— Loading data using VMASKMOV with a mask value selecting 0 elements from a legal address expressed in some
addressing forms (e.g., [base+index], disp[base+index]) requires an assist.
With processors based on the Skylake microarchitecture, the performance characteristics of the VMASKMOV
instructions have the following notable items:
• Loads that follow a masked store are no longer blocked until the mask value is known.
• Storing data using VMASKMOV with a mask value permitting 0 elements to be written to an illegal address
requires an assist.
Scalar code with data-dependent branches:
loop1:
vmovss xmm1, [rax+r9]
vcomiss xmm1, xmm8
jbe a_le
a_gt:
vmovss xmm4, [rcx+r9]
jmp mul
a_le:
vmovss xmm4, [rdx+r9]
mul:
vmulss xmm4, xmm4, [rsi+r9]
vmovss [rbx+r9], xmm4
add r9, 4
cmp r9, r8
jl loop1
}

Branchless code using VMASKMOV:
loop1:
vmovups ymm1, [rax+r9]
vcmpps ymm2, ymm8, ymm1, 1
vmaskmovps ymm4, ymm2, [rcx+r9]
vxorps ymm2, ymm2, ymm9
vmaskmovps ymm5, ymm2, [rdx+r9]
vorps ymm4, ymm4, ymm5
vmulps ymm4, ymm4, [rsi+r9]
vmovups [rbx+r9], ymm4
add r9, 32
cmp r9, r8
jl loop1
}
The performance of the first (scalar) version of Example 15-14 is sensitive to branch mispredictions and can be an
order of magnitude slower than the VMASKMOV version, which has no data-dependent branches.
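A minimal intrinsics sketch in the spirit of the VMASKMOV version above, selecting between two sources without data-dependent branches; the array names are illustrative and the element count is assumed to be a multiple of eight:

#include <immintrin.h>
/* out[i] = (A[i] > threshold ? B[i] : C[i]) * D[i], with no data-dependent branches. */
void cond_mul(const float *A, const float *B, const float *C, const float *D,
              float *out, int n, float threshold)
{
    __m256 t    = _mm256_set1_ps(threshold);
    __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    for (int i = 0; i < n; i += 8) {
        __m256 a  = _mm256_loadu_ps(A + i);
        __m256 gt = _mm256_cmp_ps(a, t, _CMP_GT_OQ);        /* mask: A[i] > threshold   */
        __m256 le = _mm256_xor_ps(gt, ones);                /* complement mask          */
        __m256 b  = _mm256_maskload_ps(B + i, _mm256_castps_si256(gt));
        __m256 c  = _mm256_maskload_ps(C + i, _mm256_castps_si256(le));
        __m256 v  = _mm256_or_ps(b, c);                     /* merge the two selections */
        v = _mm256_mul_ps(v, _mm256_loadu_ps(D + i));
        _mm256_storeu_ps(out + i, v);
    }
}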
Example 15-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
movss xmm2, [r15] // load coeff 0
shufps xmm2, xmm2, 0 // broadcast coeff 0
movss xmm1, [r15+4] // load coeff 1
shufps xmm1, xmm1, 0 // broadcast coeff 1
movss xmm0, [r15+8] // coeff 2
shufps xmm0, xmm0, 0 // broadcast coeff 2
movaps xmm5, [rdi] // xmm5={A[n+3],A[n+2],A[n+1],A[n]}
loop_start:
movaps xmm6, [rdi+16] // xmm6={A[n+7],A[n+6],A[n+5],A[n+4]}
movaps xmm7, xmm6
movaps xmm8, xmm6
add rdi, 16 // inPtr+=16
add rbx, 4 // loop counter
palignr xmm7, xmm5, 4 // xmm7={A[n+4],A[n+3],A[n+2],A[n+1]}
palignr xmm8, xmm5, 8 // xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
mulps xmm5, xmm2 //xmm5={C0*A[n+3],C0*A[n+2],C0*A[n+1], C0*A[n]}
The corresponding 256-bit Intel AVX implementation (Example 15-17) follows:
loop_start:
vmovaps ymm5, [rdi] // ymm5={A[n+7],A[n+6],A[n+5],A[n+4];
// A[n+3],A[n+2],A[n+1],A[n]}
vshufps ymm6, ymm5, [rdi+16], 0x4e // ymm6={A[n+9],A[n+8],A[n+7],A[n+6];
// A[n+5],A[n+4],A[n+3],A[n+2]}
vshufps ymm7, ymm5, ymm6, 0x99 // ymm7={A[n+8],A[n+7],A[n+6],A[n+5];
// A[n+4],A[n+3],A[n+2],A[n+1]}
vmulps ymm3, ymm5, ymm2 // ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5],C0*A[n+4];
// C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm9, ymm7, ymm1 // ymm9={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
// C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
vmulps ymm4, ymm6, ymm0 // ymm4={C2*A[n+9],C2*A[n+8],C2*A[n+7],C2*A[n+6];
// C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
vaddps ymm8, ymm3, ymm4
vaddps ymm10, ymm8, ymm9
vmovaps [rsi], ymm10
add rdi, 32 // inPtr+=32
add rbx, 8 // loop counter
add rsi, 32 // outPtr+=32
cmp rbx, rcx
jl loop_start
Example 15-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
vbroadcastss ymm2, [r15] // load and broadcast coeff 0
vbroadcastss ymm1, [r15+4] // load and broadcast coeff 1
vbroadcastss ymm0, [r15+8] // load and broadcast coeff 2
vmovaps xmm3, [rdi] // xmm3={A[n+3],A[n+2],A[n+1],A[n]}
loop_start:
vmovaps xmm4, [rdi+16] // xmm4={A[n+7],A[n+6],A[n+5],A[n+4]}
vmovaps xmm5, [rdi+32] // xmm5={A[n+11], A[n+10],A[n+9],A[n+8]}
vinsertf128 ymm3, ymm3, xmm4, 1 // ymm3={A[n+7],A[n+6],A[n+5],A[n+4];
// A[n+3], A[n+2],A[n+1],A[n]}
vpalignr xmm6, xmm4, xmm3, 4 // xmm6={A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm7, xmm5, xmm4, 4 // xmm7={A[n+8],A[n+7],A[n+6],A[n+5]}
vinsertf128 ymm6, ymm6, xmm7, 1 // ymm6={A[n+8],A[n+7],A[n+6],A[n+5];
// A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm8, xmm4, xmm3, 8 // xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
vpalignr xmm9, xmm5, xmm4, 8 // xmm9={A[n+9],A[n+8],A[n+7],A[n+6]}
vinsertf128 ymm8, ymm8, xmm9, 1 // ymm8={A[n+9],A[n+8],A[n+7],A[n+6];
// A[n+5],A[n+4],A[n+3],A[n+2]}
vmulps ymm3, ymm3, ymm2 // ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5], C0*A[n+4];
// C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm6, ymm6, ymm1 // ymm6={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
// C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
Example 15-17 uses 256-bit VSHUFPS to replace the PALIGNR in the 128-bit mixed SSE code. This speeds it up by
almost 70% over the 128-bit mixed SSE code of Example 15-16 and slightly ahead of Example 15-18.
For code that includes integer instructions and is written with 256-bit Intel AVX instructions, replace the integer
instruction with floating-point instructions that have similar functionality and performance. If there is no similar
floating-point instruction, consider using a 128-bit Intel AVX instruction to perform the required integer operation.
The following example shows two implementations of an 8x8 Matrix transpose. In both cases, the bottleneck is Port
5 pressure. Alternative 1 uses 12 vshufps instructions that are executed only on port 5. Alternative 2 replaces eight of
the vshufps instructions with the vblendps instruction which can be executed on Port 0.
In Example 15-19, replacing VSHUFPS with VBLENDPS relieved port 5 pressure and gained almost 40% speedup.
Assembly/Compiler Coding Rule 66. (M impact, M generality) Use blend instructions in lieu of shuffle instructions in
AVX whenever possible.
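A minimal intrinsics sketch of this rule: when the elements to merge are already in their final positions, a blend produces the same result as a shuffle while dispatching to more ports. Both functions below return {a0,a1,b2,b3,a4,a5,b6,b7}; the function names are illustrative:

#include <immintrin.h>
__m256 merge_shuffle(__m256 a, __m256 b)
{
    /* VSHUFPS executes only on port 5. */
    return _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}
__m256 merge_blend(__m256 a, __m256 b)
{
    /* VBLENDPS can also execute on port 0 (per the discussion above). */
    return _mm256_blend_ps(a, b, 0xCC);
}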
An alternative implementation uses VINSERTF128 with a memory source operand, which executes on the load ports,
rather than VPERM2F128, which executes only on port 5. Therefore, redesigning the algorithm to use VINSERTF128
reduces port 5 pressure and improves performance.
Figure 15-6 describes step 1 of the 8x8 matrix transpose with vinsertf128. Step 2 performs the same operations on
different columns.
In Example 15-20, this reduced port 5 pressure further than the combination of VSHUFPS with VBLENDPS in Example
15-19. It can gain a 70% speedup relative to relying on VSHUFPS alone in Example 15-19.
Example 15-21 includes two versions of the complex multiply. Both versions are unrolled twice.
Alternative 1 shuffles all the data in registers. Alternative 2 shuffles data while it is loaded from memory.
Alternative 1 (shuffles data in registers):
loop1:
vmovaps ymm0, [rax+8*rcx]
vmovaps ymm4, [rax+8*rcx+32]
vmovaps ymm3, [rbx+8*rcx]
vmovsldup ymm2, ymm3
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, ymm3
vmulps ymm1, ymm1, ymm0
vmovaps ymm7, [rbx+8*rcx+32]
vmovsldup ymm6, ymm7
vmulps ymm6, ymm6, ymm4
vaddsubps ymm3, ymm2, ymm1
vmovshdup ymm5, ymm7
vmovaps [rdx+8*rcx], ymm3
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm7, ymm6, ymm5
vmovaps [rdx+8*rcx+32], ymm7
add rcx, 8
cmp rcx, r8
jl loop1

Alternative 2 (shuffles data while loading from memory):
loop1:
vmovaps ymm0, [rax+8*rcx]
vmovaps ymm4, [rax+8*rcx+32]
vmovsldup ymm2, [rbx+8*rcx]
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, [rbx+8*rcx]
vmulps ymm1, ymm1, ymm0
vmovsldup ymm6, [rbx+8*rcx+32]
vmulps ymm6, ymm6, ymm4
vaddsubps ymm2, ymm2, ymm1
vmovshdup ymm5, [rbx+8*rcx+32]
vmovaps [rdx+8*rcx], ymm2
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm6, ymm6, ymm5
vmovaps [rdx+8*rcx+32], ymm6
add rcx, 8
cmp rcx, r8
jl loop1
With the approximate reciprocal instructions executing at seven-cycle latency and two-cycle throughput, a single
Newton-Raphson iteration or Taylor approximation can achieve almost the same precision as the (V)DIVPS and
(V)SQRTPS instructions. See the Intel® Architecture Instruction Set Extensions Programming Reference for more
information on these instructions.
In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency
of these operations, the approximation with Newton-Raphson can slow down execution, because the additional
instructions add more micro-ops that fill the pipeline.
With the Skylake microarchitecture, choosing between the approximate reciprocal instruction alternatives and
DIVPS/SQRTPS for optimal performance of simple algebraic computations depends on several factors. Table 15-5
shows, for several algebraic formulas, a throughput comparison of implementations with different numeric accuracy
tolerances. In each row, the 24-bit accurate implementations are IEEE-compliant and use the respective instructions
of the 128-bit or 256-bit ISA. The 22-bit and 11-bit accurate implementations use approximate reciprocal
instructions of the respective instruction set.
Z = (X+2Y+3)/(Z-2Y-3):
SSE: 1X (24-bit), 0.85X (22-bit), 1X (11-bit)
256-bit AVX: 1X (24-bit), 0.8X (22-bit), 1X (11-bit)
If targeting processors based on the Skylake microarchitecture, Table 15-5 can be summarized as:
• For 256- bit AVX code, Newton-Raphson approximation can be beneficial on Skylake microarchitecture when the
algorithm contains only operations executed on the divide unit. However, when single precision divide or square
root operations are part of a longer computation, the lower latency of the DIVPS or SQRTPS instructions can lead
to better overall performance.
• For SSE or 128-bit AVX implementation, consider use of approximation for divide and square root instructions
only for algorithms that do not require precision higher than 11-bit or algorithms that contain multiple
operations executed on the divide unit.
Table 15-6 summarizes recommended calculation methods of divisions or square root when using single-precision
instructions, based on the desired accuracy level across recent generations of Intel microarchitectures.
128-bit SSE using DIVPS:
loop1:
movups xmm0, [rax+rdx*1]
movups xmm1, [rbx+rdx*1]
divps xmm0, xmm1
movups [rcx+rdx*1], xmm0
add rdx, 16
cmp rdx, rsi
jl loop1

256-bit AVX using VDIVPS:
loop1:
vmovups ymm0, [rax+rdx*1]
vmovups ymm1, [rbx+rdx*1]
vdivps ymm0, ymm0, ymm1
vmovups [rcx+rdx*1], ymm0
add rdx, 32
cmp rdx, rsi
jl loop1
128-bit SSE using RCPPS:
loop1:
movups xmm0, [rax+rdx*1]
movups xmm1, [rbx+rdx*1]
rcpps xmm1, xmm1
mulps xmm0, xmm1
movups [rcx+rdx*1], xmm0
add rdx, 16
cmp rdx, rsi
jl loop1

256-bit AVX using VRCPPS:
loop1:
vmovups ymm0, [rax+rdx]
vmovups ymm1, [rbx+rdx]
vrcpps ymm1, ymm1
vmulps ymm0, ymm0, ymm1
vmovups [rcx+rdx], ymm0
add rdx, 32
cmp rdx, rsi
jl loop1
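A minimal intrinsics sketch of the Newton-Raphson refinement discussed earlier: one iteration, x1 = x0*(2 - d*x0), refines the ~11-bit VRCPPS estimate to roughly 22-bit accuracy (the function name is illustrative):

#include <immintrin.h>
static inline __m256 rcp_nr_ps(__m256 d)
{
    __m256 x0  = _mm256_rcp_ps(d);           /* ~11-bit approximate reciprocal */
    __m256 two = _mm256_set1_ps(2.0f);
    /* One Newton-Raphson step: x1 = x0 * (2 - d * x0). */
    return _mm256_mul_ps(x0, _mm256_sub_ps(two, _mm256_mul_ps(d, x0)));
}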
Example 15-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy
Example 15-27. Reciprocal Square Root Using RSQRTPS and Newton-Raphson Iteration

128-bit SSE:
loop1:
movups xmm5, [rax+rdx]
rsqrtps xmm0, xmm5
movaps xmm2, xmm0
mulps xmm0, xmm0
mulps xmm0, xmm5
subps xmm0, xmm3
mulps xmm0, xmm2
mulps xmm0, xmm4
movups [rbx+rdx], xmm0
add rdx, 16
cmp rdx, rcx
jl loop1
}

256-bit AVX:
loop1:
vmovups ymm5, [rax+rdx]
vrsqrtps ymm0, ymm5
vmulps ymm2, ymm0, ymm0
vmulps ymm2, ymm2, ymm5
vsubps ymm2, ymm3, ymm2
vmulps ymm0, ymm0, ymm2
vmulps ymm0, ymm0, ymm4
vmovups [rbx+rdx], ymm0
add rdx, 32
cmp rdx, rcx
jl loop1
}
Example 15-30. Square Root Using RSQRTPS and One Taylor Series Expansion
Both the 128-bit and 256-bit alternatives begin with the same setup:
__asm
{
mov rax, pIn
mov rbx, pOut
mov rcx, iLen
xor rdx, rdx
The figure below describes the Intel AVX implementation of the Array Sub Sums algorithm. PSLLDQ is an integer
SIMD instruction which does not have an AVX equivalent; it is replaced by VSHUFPS.
Figure 15-9. Intel® AVX Implementation of the Array Sub Sums Algorithm
Example 15-31 shows the SSE implementation of the array sub sums algorithm and the AVX implementation. The AVX
code is about 40% faster, though not on microarchitectures that have more compute ports than shuffle ports.
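A minimal 128-bit intrinsics sketch of the shift-and-add step the figure describes, for a single 4-element block; the function name is illustrative, and PSLLDQ appears here as _mm_slli_si128:

#include <immintrin.h>
/* Turns v = {x0,x1,x2,x3} into the running sub sums
   {x0, x0+x1, x0+x1+x2, x0+x1+x2+x3}. */
static inline __m128 subsum4(__m128 v)
{
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 4)));
    v = _mm_add_ps(v, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(v), 8)));
    return v;
}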
The code using VCVTPS2PH is approximately four times faster than the AVX-128 sequence. Although it is possible to
load 8 data elements at once with 256-bit AVX, most of the per-element conversion operations require packed integer
instructions which do not have 256-bit extensions yet. Using VCVTPS2PH is not only faster but also provides handling
of special cases that do not encode to normal half-precision floating-point values.
Single-precision version:
loop:
add rdi, 32
vmovaps ymm6, [rdi]
vperm2f128 ymm1, ymm0, ymm6, 0x21
vshufps ymm3, ymm0, ymm1, 0x4E
vshufps ymm2, ymm0, ymm3, 0x99
vminps ymm5, ymm0, ymm2
vmaxps ymm0, ymm0, ymm2
vminps ymm4, ymm0, ymm3
vmaxps ymm7, ymm4, ymm5
vmovaps ymm0, ymm6
vmovaps [rsi], ymm7
add rsi, 32
add rbx, 8
cmp rbx, rcx
jl loop

Half-precision version using VCVTPH2PS/VCVTPS2PH:
loop:
add rdi, 16
vcvtph2ps ymm6, [rdi]
vperm2f128 ymm1, ymm0, ymm6, 0x21
vshufps ymm3, ymm0, ymm1, 0x4E
vshufps ymm2, ymm0, ymm3, 0x99
vminps ymm5, ymm0, ymm2
vmaxps ymm0, ymm0, ymm2
vminps ymm4, ymm0, ymm3
vmaxps ymm7, ymm4, ymm5
vmovaps ymm0, ymm6
vcvtps2ph [rsi], ymm7, roundingCtrl
add rsi, 16
add rbx, 8
cmp rbx, rcx
jl loop
When the locality of the working set resides in memory, using half-precision format with processors based on Ivy
Bridge microarchitecture is about 30% faster than single-precision format, despite the conversion overhead. When
the locality resides in L3, using half-precision format is still ~15% faster. When the locality resides in L1, using single-
precision format is faster because the cache bandwidth of the L1 data cache is much higher than the rest of the
cache/memory hierarchy and the overhead of the conversion becomes a performance consideration.
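A minimal intrinsics sketch of the float-to-half conversion with VCVTPS2PH (F16C); the buffer names and the multiple-of-eight length are assumptions:

#include <immintrin.h>
/* Converts n single-precision floats to half precision; assumes F16C support. */
void to_half(const float *src, unsigned short *dst, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 v  = _mm256_loadu_ps(src + i);
        __m128i h = _mm256_cvtps_ph(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        _mm_storeu_si128((__m128i *)(dst + i), h);   /* 8 half-precision values */
    }
}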
The arrangement of identical latency and number of pipes allows software to increase the performance of situations
where floating-point calculations are limited by the floating-point add operations that follow FP multiplies. Consider
a vector operation A[n] = C1 + C2 * A[n-1]:
Cost per iteration without FMA: ~ FP MUL latency + FP ADD latency. Cost per iteration with FMA: ~ FMA latency.
The overall throughput of the code sequence on the LHS is limited by the combined latency of the FP MUL and FP ADD
instructions of specific microarchitecture. The overall throughput of the code sequence on the RHS is limited by the
throughput of the FMA instruction of the corresponding microarchitecture.
A common situation where the latency of the FP ADD operation dominates performance is the following C code:
for (int i = 0; i < arrLength; i++) result += arrToSum[i];
Example 15-35 shows two implementations, with and without unrolling.
Without unrolling (LHS of Example 15-35), the cost of summing every eight array elements is about proportional to
the latency of the FP ADD instruction, assuming the working set fits in L1. To use unrolling effectively, the number of
unrolled operations should be at least “latency of the critical operation” * “number of pipes”. The performance gain
of optimized unrolling versus no unrolling, for a given microarchitecture, can approach “number of pipes” * “Latency
of FP ADD”.
User/Source Coding Rule 31. Consider using the unrolling technique for loops containing back-to-back dependent
FMA, FP ADD, or vector MUL operations. The unrolling factor can be chosen by considering the latency of the critical
instruction of the dependency chain and the number of pipes available to execute that instruction.
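A minimal C sketch of this rule applied to the reduction above, using four independent 256-bit accumulators to hide FP ADD latency; the function name and the multiple-of-32 length are assumptions:

#include <immintrin.h>
float sum_unrolled(const float *arrToSum, int arrLength)
{
    /* Four independent accumulators break the single FP ADD dependency chain. */
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    for (int i = 0; i < arrLength; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(arrToSum + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(arrToSum + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(arrToSum + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(arrToSum + i + 24));
    }
    /* Horizontal reduction of the four partial sums. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}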
1. C. Yeo, Y. H. Tan, Z. Li and S. Rahardja, “Mode-Dependent Fast Separable KLT for Block-based Intra Coding,”
JCTVC-B024, Geneva, Switzerland, Jul 2010
The transform computes Y = (1/128) · L × ((1/128) · (B × R)), where B is a 4x4 block of input pixels and the constant
matrices are:

L = | 29  55  74  84 |        R = | 64  64  64  64 |
    | 74  74   0 -74 |            | 84  35 -35 -84 |
    | 84 -29 -74  55 |            | 64 -64 -64  64 |
    | 55 -84  74 -29 |            | 35 -84  84 -35 |
The same technique can be implemented using AVX2 instructions in a straightforward manner. The AVX2 sequence is
illustrated in Example 15-38 and Example 15-39.
Example 15-38. Macros for Separable KLT Intra-block Transformation Using AVX2
// b0: input row vector from 4 consecutive 4x4 image block of word pixels
// rmc0-3: columnar vector coefficient of the RHS matrix, repeated 4X for 256-bit
// min32km1: saturation constant vector to cap intermediate pixel to less than or equal to 32767
// w0: output row vector of garbled intermediate matrix, elements within each block are garbled
// e.g Low 128-bit of row 0 in descending order: y07, y05, y06, y04, y03, y01, y02, y00
In Example 15-39, the matrix multiplication of 1/128 * (B x R) is evaluated first in a four-wide manner by fetching from
four consecutive 4x4 image blocks of word pixels. The first macro shown in Example 15-38 produces an output vector
where each intermediate row result is in a garbled sequence between the two middle elements of each 4x4 block. In
Example 15-39, undoing the garbled elements and transposing the intermediate row vectors into column vectors are
implemented using blend primitives instead of shuffle/unpack primitives.
In Haswell microarchitecture, shuffle/pack/unpack primitives rely on the shuffle execution unit dispatched to port 5.
In some situations of heavy SIMD sequences, port 5 pressure may become a determining factor in performance.
If 128-bit SIMD code faces port 5 pressure when running on Haswell microarchitecture, porting 128-bit code to use
256-bit AVX2 can improve performance and alleviate port 5 pressure.
short __declspec(align(16)) cst_rmc0[8] = {64, 84, 64, 35, 64, 84, 64, 35};
short __declspec(align(16)) cst_rmc1[8] = {64, 35, -64, -84, 64, 35, -64, -84};
short __declspec(align(16)) cst_rmc2[8] = {64, -35, -64, 84, 64, -35, -64, 84};
short __declspec(align(16)) cst_rmc3[8] = {64, -84, 64, -35, 64, -84, 64, -35};
short __declspec(align(16)) cst_lmr0[8] = {29, 55, 74, 84, 29, 55, 74, 84};
short __declspec(align(16)) cst_lmr1[8] = {74, 74, 0, -74, 74, 74, 0, -74};
short __declspec(align(16)) cst_lmr2[8] = {84, -29, -74, 55, 84, -29, -74, 55};
short __declspec(align(16)) cst_lmr3[8] = {55, -84, 74, -29, 55, -84, 74, -29};
void Klt_256_d(short * Input, short * Output, int iWidth, int iHeight)
{int iX, iY;
__m256i rmc0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *) &cst_rmc0[0]));
__m256i rmc1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc1[0]));
__m256i rmc2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc2[0]));
__m256i rmc3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc3[0]));
__m256i lmr0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr0[0]));
__m256i lmr1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr1[0]));
__m256i lmr2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr2[0]));
__m256i lmr3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr3[0]));
__m256i min32km1 = _mm256_broadcastsi128_si256( _mm_setr_epi32( 0x7fff7fff, 0x7fff7fff, 0x7fff7fff, 0x7fff7fff));
__m256i b0, b1, b2, b3, t0, t1, t2, t3;
__m256i w0, w1, w2, w3;
short* pImage = Input;
short* pOutImage = Output;
int hgt = iHeight, wid= iWidth;
// We implement 1/128 * (Mat_L x (1/128 * (Mat_B x Mat_R))) from the inner most parenthesis
for( iY = 0; iY < hgt; iY+=4) {
for( iX = 0; iX < wid; iX+=16) {
//load row 0 of 4 consecutive 4x4 matrix of word pixels
b0 = _mm256_loadu_si256( (__m256i *) (pImage + iY*wid+ iX)) ;
// multiply row 0 with columnar vectors of the RHS matrix coefficients
__MyM_KIP_PxRMC_ROW_4x4Wx4(b0, w0, rmc0, rmc1, rmc2, rmc3, min32km1);
// low 128-bit of garbled row 0, from hi->lo: y07, y05, y06, y04, y03, y01, y02, y00
b1 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+1)*wid+ iX) );
__MyM_KIP_PxRMC_ROW_4x4Wx4(b1, w1, rmc0, rmc1, rmc2, rmc3, min32km1);
Although the 128-bit SIMD implementation is not shown here, it can be easily derived.
When running 128-bit SIMD code of this KLT intra-coding transformation on Sandy Bridge microarchitecture, the port
5 pressure is lower because there are two shuffle units, and the effective throughput for each 4x4 image block
transformation is around fifty cycles. Its speed-up relative to an optimized scalar implementation is about 2.5X.
When the 128-bit SIMD code runs on Haswell microarchitecture, micro-ops issued to port 5 account for slightly less
than 50% of all micro-ops, compared to about one third on the prior microarchitecture, resulting in about a 25%
performance regression. On the other hand, the AVX2 implementation can deliver an effective throughput of less than
thirty-five cycles per 4x4 block.
1. For details of modular exponentiation/multiplication and AVX2 implementation in OpenSSL, see: Software
Implementation of Modular Exponentiation, Using Advanced Vector Instructions Architectures.
Code written within the execution-port constraints of the previous microarchitecture generation will benefit
immediately on Haswell microarchitecture.
In some situations, there may be some intricate interactions between microarchitectural restrictions on the
instruction set that is worth some discussion. We consider two commonly used library functions memcpy() and
memset() and the optimal choice to implement them on the new microarchitecture.
With memcpy() on Haswell microarchitecture, using REP MOVSB to implement the memcpy operation for large copy
lengths can take advantage of the 256-bit store data path and deliver throughput of more than 20 bytes per cycle. For
copy lengths that are smaller than a few hundred bytes, the REP MOVSB approach is slower than the 128-bit SIMD
technique described in Section 15.16.3.1.
With memcpy() on Ice Lake microarchitecture, using in-lined REP MOVSB to implement memcpy is as fast as a 256-bit
AVX implementation for copy lengths that are variable and unknown at compile time. For lengths that are known at
compile time, REP MOVSB is almost as good as 256-bit AVX for short strings up to 128 bytes (nine cycles vs three to
seven cycles), and better for strings of 2K bytes and longer. For these cases we recommend using inline REP MOVSB.
That said, software should still branch away for zero byte copies.
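A hedged sketch of an inline REP MOVSB copy using the MSVC __movsb intrinsic; the function name is illustrative, and other compilers can use equivalent inline assembly:

#include <intrin.h>
#include <stddef.h>
void copy_rep_movsb(unsigned char *dst, const unsigned char *src, size_t n)
{
    if (n == 0)
        return;          /* branch away for zero-byte copies, per the guidance above */
    __movsb(dst, src, n); /* emits REP MOVSB */
}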
The net gain from this attempt to use the 256-bit ISA to take advantage of the 256-bit store data path was offset by
the four-instruction sequence and the cache-line split penalty.
• When intervening instruction streams execute between invocations of memset(), the state of the branch predictor
prior to a memset() invocation is not pre-trained for the branching sequence inside the memset()
implementation.
• Memset() count values are likely to be uncorrelated.
The proper measurement technique to compare memset() performance for more realistic memset() invocation
scenarios requires a per-invocation technique that wraps two RDTSC reads around each invocation of memset().
With the per-invocation RDTSC measurement technique, the overhead of RDTSC can be pre-calibrated and post-
validated outside of the measurement loop. The per-invocation technique may also account for cache-warming effects
by using a loop to wrap around the per-invocation measurements.
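A hypothetical sketch of the per-invocation measurement described above, using the __rdtsc() compiler intrinsic; the rdtsc_overhead parameter is assumed to be pre-calibrated separately:

#include <intrin.h>
#include <string.h>
unsigned long long time_one_memset(void *dst, int val, size_t count,
                                   unsigned long long rdtsc_overhead)
{
    unsigned long long t0 = __rdtsc();   /* timestamp before the invocation */
    memset(dst, val, count);
    unsigned long long t1 = __rdtsc();   /* timestamp after the invocation  */
    return (t1 - t0) - rdtsc_overhead;
}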
When the relevant skew factors of measurement techniques are taken into effect, the performance of memset()
using REP STOSB, for count values smaller than a few hundred bytes, is generally faster than the AVX2 version for the
common memset() invocation scenarios. Only in the extreme scenarios of hundreds of unrolled memset() calls, all
using count values less than a few hundred bytes and with no intervening instruction stream between each pair of
memset() can the AVX2 version of memset() take advantage of the training effect of the branch predictor.
Example 15-41 shows a helper utility and overall steps to reduce a 64-bit signed integer into a 63-bit unsigned range
with reduced-range integer quotient/remainder pairs using MULX. Note that this example relies on Example 15-40
and Example 15-42.
Example 15-42 shows the steps of numeric conversion of a 63-bit dynamic range into ascii format according to a
progressive range reduction technique using a vectorized Montgomery reduction scheme. Note that this example
relies on Example 15-40.
// pack 8 single-digit integer into first 8 bytes and set rest to zeros
x4 = _mm256_permutevar8x32_epi32( x4, _mm256_setr_epi32(0x4, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1) );
tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
BitScanForward((unsigned long *) &idx, tmp);
cnt = 8 -idx; // actual number non-zero-leading digits to write to output
} else { // conversion of 9-12 digits
lo64 = _mulx_u64(xx, (unsigned __int64) QWCG10to8, &hi64);
hi64 >>= 26;
xxi = _mulx_u64(hi64, (unsigned __int64)100000000, &xx2);
lo64 = (unsigned __int64)xx - xxi;
w = _mm_cvtsi128_si64( _mm256_castsi256_si128(x4));
switch(cnt) {
case 5:*ps++ = (char) (w >>24); *(unsigned *) ps = (w >>32);
break;
case 6:*(short *)ps = (short) (w >>16); *(unsigned *) (&ps[2]) = (w >>32);
break;
case 7:*ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(unsigned *) (&ps[3]) = (w >>32);
break;
case 8: *(long long *)ps = w;
break;
case 9:*ps++ = (char) (w >>24); *(long long *) (&ps[0]) = _mm_cvtsi128_si64(
_mm_srli_si128(_mm256_castsi256_si128(x4), 4));
break;
case 10:*(short *)ps = (short) (w >>16);
*(long long *) (&ps[2]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
break;
case 11:*ps = (char) (w >>8); *(short *) (&ps[1]) = (short) (w >>16);
*(long long *) (&ps[3]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
break;
case 19:u = (int) _mm_cvtsi128_si64(v0); *ps = (char) (u >>8); *(short *) (&ps[1]) = (short) (u >>16);
_mm_storeu_si128( (__m128i *) &ps[3], _mm256_castsi256_si128(x4));
break;
case 20:u = (int) _mm_cvtsi128_si64(v0); *(unsigned *)ps = (short) (u);
_mm_storeu_si128( (__m128i *) &ps[4], _mm256_castsi256_si128(x4));
break;
}
return cnt;
}
The AVX2 version of numeric conversion across the dynamic range of 3/9/17 output digits is approximately 23/57/54
cycles per input, compared to the standard library implementation's range of 85/260/560 cycles per input.
The techniques illustrated above can be extended to numeric conversion of other library, such as binary-integer-
decimal (BID) encoded IEEE-754-2008 Decimal floating-point format. For BID-128 format, Example 15-42 can be
adapted by adding another range-reduction stage using a pre-computed 256-bit constant to perform Montgomery
reduction at modulus 10^16. The technique to construct the 256-bit constant is covered in Chapter 14, “Intel® SSE4.2
and SIMD Programming For Text-Processing/Lexing/Parsing”.
Load once + shuffle/blend/logical to build data vectors in registers. In the case result[i] =
x[index[i]] + x[index[i+1]], the technique below may be preferable to using multiple
VGATHER instructions:
Redundant elements:
ymm0 <- VGATHER ( x[index[k]] ); // fetching 8 elements
ymm1 <- VBLEND( VPERM (ymm0), VBROADCAST ( x[index[k+8]] ) );
ymm2 <- VPADD( ymm0, ymm1 );
In other cases, using the VGATHER instruction can reduce code size and execute faster with techniques including, but
not limited to, amortizing the latency and throughput of VGATHER, or hoisting the fetch operations well in advance
of the consumer code of the destination register of those fetches. Example 15-44 lists some patterns that can benefit
from using VGATHER on Haswell microarchitecture.
General tips for using VGATHER:
• Gathering more elements with a VGATHER instruction helps amortize the latency and throughput of VGATHER,
and is more likely to provide performance benefit over an equivalent non-VGATHER flow. For example, the
latency of 256-bit VGATHER is less than twice the equivalent 128-bit VGATHER and therefore more likely to show
gains than two 128-bit equivalent ones. Also, using index size larger than data element size results in only half of
the register slots utilized but not a proportional latency reduction. Therefore the dword index form of VGATHER
is preferred over qword index if dwords or single-precision values are to be fetched.
• It is advantageous to hoist VGATHER well in advance of the consumer code.
• VGATHER merges the (unmasked) gathered elements with the previous value of the destination. Therefore, in
cases where the previous value of the destination does not need to be merged (for instance, when no elements are
masked off), it can be beneficial to break the dependency of the VGATHER instruction on the previous writer of
the destination register by zeroing out the register with a VXOR instruction (a minimal sketch follows this list).
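A minimal intrinsics sketch of this dependency-breaking tip; 'base' and 'index' are illustrative names and all eight elements are gathered:

#include <immintrin.h>
__m256 gather8(const float *base, const int *index)
{
    __m256i idx  = _mm256_loadu_si256((const __m256i *)index);
    __m256  dst  = _mm256_setzero_ps();                        /* zeroed: no stale dependency */
    __m256  mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1)); /* all 8 elements enabled      */
    return _mm256_mask_i32gather_ps(dst, base, idx, mask, 4);
}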
Performance of the VGATHER instruction compared to a multi-instruction gather equivalent flow can vary due to:
• Differences in the base algorithm.
• Different data organization.
• The effectiveness of the equivalent flow.
In performance critical applications it is advisable to evaluate both options before choosing one.
The throughput of GATHER instructions continues to improve from Broadwell to Skylake microarchitecture. This is
shown in Figure 15-11.
Example 15-45 gives the asm sequence of software implementation that is equivalent to the VPGATHERD instruction.
This can be used to compare the trade-off of using a hardware gather instruction or software gather sequence based
on inserting an individual element.
Figure 15-12 compares per-element throughput using the VPGATHERD instruction versus a software gather sequence
with Skylake microarchitecture as a function of cache locality of data supply. With the exception of using hardware
GATHER on two data elements per instruction, the gather instruction out-performs the software sequence on Skylake
microarchitecture.
If data supply locality is from memory, software sequences are likely to perform better than the hardware GATHER
instruction.
With strided access patterns, an AVX software sequence can load and shuffle multiple elements at a time and is the
better technique.
With a non-strided, regular access pattern of AOS to SOA, an Intel AVX software sequence that uses VINSERTF128 and
interleaved packing of multiple elements can be more efficient.
CHAPTER 16
POWER OPTIMIZATION FOR MOBILE USAGES
16.1 OVERVIEW
Mobile computing allows computers to operate anywhere, anytime. Battery life is a key factor in delivering this
benefit. Mobile applications require software optimization that considers both performance and power
consumption. This chapter provides background on power saving techniques in mobile processors1 and makes
recommendations that developers can leverage to provide longer battery life.
A microprocessor consumes power while actively executing instructions and doing useful work. It also consumes
power in inactive states (when halted). When a processor is active, its power consumption is referred to as active
power. When a processor is halted, its power consumption is referred to as static power.
ACPI 3.0 (ACPI stands for Advanced Configuration and Power Interface) provides a standard that enables intelligent
power management and consumption. It does this by allowing devices to be turned on when they are needed and by
allowing control of processor speed (depending on application requirements). The standard defines a number of P-
states to facilitate management of active power consumption; and several C-state types2 to facilitate management of
static power consumption.
Pentium M, Intel Core Solo, Intel Core Duo processors, and processors based on Intel Core microarchitecture
implement features designed to enable the reduction of active power and static power consumption. These include:
• Enhanced Intel SpeedStep® Technology enables the operating system (OS) to program a processor to transition to
lower frequency and/or voltage levels while executing a workload.
• Support for various activity states (for example: Sleep states, ACPI C-states) reduces static power consumption
by turning off power to sub-systems in the processor.
Enhanced Intel SpeedStep Technology provides low-latency transitions between operating points that support P-state
usages. In general, a high-numbered P-state operates at a lower frequency to reduce active power consumption.
High-numbered C-state types correspond to more aggressive static power reduction. The trade-off is that transitions
out of higher-numbered C-states have longer latency.
1. For Intel® Centrino® mobile technology and Intel® Centrino® Duo mobile technology, only processor-related
techniques are covered in this manual.
2. ACPI 3.0 specification defines four C-state types, known as C0, C1, C2, C3. Microprocessors supporting the ACPI
standard implement processor-specific states that map to each ACPI C-state type.
3. This chapter uses numerical values representing time constants (300 ms, 100 ms, etc.) on power management
decisions as examples to illustrate the order of magnitude or relative magnitude. Actual values vary by imple-
mentation and may vary between product releases from the same vendor.
Consider, for example, an application that changes processor utilization from 100% to a lower utilization and then
jumps back to 100%. The diagram in Figure 16-1 shows how the OS changes processor frequency to accommodate
demand and adapt power consumption. The interaction between the OS power management policy and
performance history is described below.
1. Demand is high and the processor works at its highest possible frequency (P0).
2. Demand decreases, which the OS recognizes after some delay; the OS sets the processor to a lower frequency
(P1).
3. The processor decreases frequency and processor utilization increases to the most effective level, 80-90% of the
highest possible frequency. The same amount of work is performed at a lower frequency.
4. Demand decreases and the OS sets the processor to the lowest frequency, sometimes called Low Frequency
Mode (LFM).
5. Demand increases and the OS restores the processor to the highest frequency.
• Architect an awareness of power consumption and/or dynamic power policy into your application for contextual
usages and optimal end user experiences. Specific detail may vary across OSes. For Microsoft Windows OS
consult http://www.microsoft.com/whdc/system/pnppwr/powermgmt/PMpolicy_Windows.mspx#.
• Characterize your application’s power consumption. There are various techniques available to measure the
power consumption of a platform:
— Use hardware instrumentation such as Fluke NetDAQ*. This provides power measurements for each
component such as CPU, HDD, and memory.
— Use C-state residency counters. See Chapter 2, “Model-Specific Registers (MSRs)” of Intel® 64 and IA-32
Architectures Software Developer’s Manual, Volume 4.
— Study parameters such as CPU usage, kernel time, and time interrupt rate, to gain insight into the behavior of
the software, which can then be related to platform power consumption if the hardware instrumentation is
unavailable.
Section 16.5 provides some examples of how to relate performance with power consumption and techniques for
optimizing software.
As shown in the illustration of Figure 16-2, the processor is in either active or idle (halted) state. ACPI defines four C-
state types (C0, C1, C2 and C3). Processor-specific C states can be mapped to an ACPI C-state type via ACPI standard
mechanisms. The C-state types are divided into two categories: active (C0), in which the processor consumes full
power; and idle (C1-3), in which the processor is idle and may consume significantly less power.
The index of a C-state type designates the depth of sleep. Higher numbers indicate a deeper sleep state and lower
power consumption. They also require more time to wake up (higher exit latency).
C-state types are described below:
• C0: The processor is active and performing computations and executing instructions.
• C1: This is the lowest-latency idle state, which has very low exit latency. In the C1 power state, the processor is
able to maintain the context of the system caches.
• C2: This level has improved power savings over the C1 state. The main improvements are provided at the
platform level.
• C3: This level provides greater power savings than C1 or C2. In C3, the processor stops clock generation and
snooping activity. It also allows system memory to enter self-refresh mode.
The basic technique to implement OS power management policy to reduce static power consumption is by evaluating
processor idle durations and initiating transitions to higher-numbered C-state types. This is similar to the technique of
reducing active power consumption by evaluating processor utilization and initiating P-state transitions. The OS looks
at history within a time window and then sets a target C-state type for the next time window, as illustrated in
Figure 16-3:
Consider that a processor is in lowest frequency (LFM- low frequency mode) and utilization is low. During the first
time slice window (Figure 16-3 shows an example that uses 100 ms time slice for C-state decisions), processor
utilization is low and the OS decides to go to C2 for the next time slice. After the second time slice, processor
utilization is still low and the OS decides to go into C3.
1. Pentium M processor can be detected by CPUID signature with family 6, model 9 or 13; Intel Core Solo and Intel
Core Duo processor has CPUID signature with family 6, model 14; processors based on Intel Core microarchitec-
ture has CPUID signature with family 6, model 15.
Table 16-1. ACPI C-State-Type Mappings to Processor Specific C-State of the Sandy Bridge
Microarchitecture
ACPI C-State Type Processor-Specific C-State
C0 C0
C1 C1
C2 C3
C3 C6/C7
The microarchitectural behavior of processor-specific deep C-states are implementation dependent. The following
summarizes some of their key power-saving and intelligent responsive characteristics:
• For mobile platforms, while the cores are already in C7, the last level cache (L3) is flushed.
• Auto-demotion: The processor can demote OS requests to a target C-state (core C6/C7 or C3 state) to a
numerically lower C-state (core C3 or C1 state) in the following cases:
— When history indicates that C6/C7 or C3 states are less energy efficient than C3 or C1 states.
— When history indicates that a deeper sleep state may impact performance.
— Energy inefficiency or performance degradation can occur due to the deeper C-state transition overhead
occurring too frequently. Sandy Bridge microarchitecture has an enhanced algorithm that improves power
gain from this feature.
• Un-demotion: An OS request to a deeper C-state can be demoted by auto-demotion, resulting in C1 or C3 states.
After long residency in the demoted state, the hardware returns control back to the OS. The expectation is that in
this case, the OS will repeat the deeper C-state request and hardware un-demotion will enter into the OS-
requested deeper C state.
1. Generally, energy measurements and power management decisions based on these MSR interfaces should oper-
ate within the same processor family/model and refrain from extrapolating across different family/models or
unsupported environmental conditions.
• Avoid frequent disk access. Each disk access forces the device to spin up and stay in high power mode for some
period after the last access. Buffer small disk reads and writes to RAM to consolidate disk operations over time.
Use the GetDevicePowerState() Windows API to test disk state and delay the disk access if it is not spinning.
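A minimal Win32 sketch of that check; the handle is assumed to have been opened to the disk device with CreateFile, and the helper name is illustrative:

#include <windows.h>
/* Returns nonzero if the disk device is powered up (spinning), so a deferred
   access can be flushed now instead of forcing a spin-up later. */
BOOL DiskIsSpinning(HANDLE hDisk)
{
    BOOL on = FALSE;
    return GetDevicePowerState(hDisk, &on) && on;
}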
Figure 16-4 illustrates the chronological profiles of coarse-grain (> 300 ms) task scheduling and its effect on operating
frequency and power consumption.
The same application can be written in such a way that work units are divided into smaller granularity, with the
scheduling of each work unit and the Sleep() call occurring at more frequent intervals (e.g., 100 ms) to deliver the
same QOS (operating at full performance 50% of the time). In this scenario, the OS observes that the workload does
not require full performance for each 300 ms sampling period. Its power management policy may then commence to
lower the processor's frequency and voltage while maintaining the level of QOS.
The relationship between active power consumption, frequency, and voltage is expressed by the equation:
Active Power ∝ α · V² · F
In the equation, 'V' is core voltage, 'F' is operating frequency, and 'α' is the activity factor. Typically, the quality of
service for 100% performance at 50% duty cycle can be met by 50% performance at 100% duty cycle. Because the
slope of frequency scaling efficiency of most workloads will be less than one, reducing the core frequency to 50% can
achieve more than 50% of the original performance level. At the same time, reducing the core frequency to 50%
allows for a significant reduction of the core voltage.
Because executing instructions at a higher-numbered P-state (lower power state) takes less energy per instruction
than at the P0 state, the energy saved relative to spending half of the duty cycle in the P0 state (Pmax/2) more than
compensates for the increase of the half of the duty cycle relative to inactive power consumption (Pmin/2). The non-
linear relationship of power consumption to frequency and voltage means that changing the task unit to finer
granularity will deliver substantial energy savings. This optimization is possible when processor demand is low (such
as with media streaming, playing a DVD, or running less resource-intensive applications like a word processor, email,
or web browsing).
An additional positive effect of continuously operating at a lower frequency is avoiding frequent changes in power
draw (from low to high in our case) and battery current, which eventually harm the battery and accelerate its
deterioration.
When the lowest possible operating point (highest P-state) is reached, there is no need for dividing computations.
Instead, use longer idle periods to allow the processor to enter a deeper low power mode.
Note that the decision to change frequency is made based on a larger window of time than the period used to decide to enter deep sleep.
If the processor is to enter a processor-specific C4 state to take advantage of aggressive static power reduction
features, the decision should be based on:
• Whether the QOS can be maintained in spite of the fact that the processor will be in a low-power, long-exit-
latency state for a long period.
• Whether the interval in which the processor stays in C4 is long enough to amortize the longer exit latency of this
low-power C state.
Eventually, if the interval is large enough, the processor will be able to enter deeper sleep and save a considerable
amount of power. The following guidelines can help applications take advantage of Intel® Enhanced Deeper Sleep:
• Avoid setting higher interrupt rates. Shorter periods between interrupts may keep OSes from entering lower
power states. This is because transition to/from a deep C-state consumes power, in addition to a latency penalty.
In some cases, the overhead may outweigh power savings.
• Avoid polling hardware. In an ACPI C3 type state, the processor may stop snooping, and each bus activity (including
DMA and bus mastering) requires moving the processor to a lower-numbered C-state type. The lower-numbered
state type is usually C2, but may even be C0. The situation is significantly improved in the Intel Core Solo
processor (compared to previous generations of the Pentium M processors), but polling will likely prevent the
processor from entering the highest-numbered, processor-specific C-state.
When one full-speed thread is migrated from one core to another core that has idled for a period of time, an OS
without a multicore-aware P-state coordination policy may mistakenly decide that each core demands only 50% of
processor resources (based on idle history). The processor frequency may be reduced by such multicore unaware P-
state coordination, resulting in a performance anomaly. See Figure 16-5.
When each core in a multicore processor meets the requirements necessary to enter a different C-state type,
multicore-unaware hardware coordination causes the physical processor to enter the lowest possible C-state type
(lower-numbered C state has less power saving). For example, if Core 1 meets the requirement to be in ACPI C1 and
Core 2 meets requirement for ACPI C3, multicore-unaware OS coordination takes the physical processor to ACPI C1.
See Figure 16-6.
2. Enabling Both Cores to Take Advantage of Intel® Enhanced Deeper Sleep.
To best utilize processor-specific C-state (e.g., Intel® Enhanced Deeper Sleep) to conserve battery life in multithreaded
applications, a multi-threaded application should synchronize threads to work simultaneously and sleep
simultaneously using OS synchronization primitives. By keeping the package in a fully idle state longer (satisfying ACPI
C3 requirement), the physical processor can transparently take advantage of processor-specific Deep C4 state if it is
available.
Multi-threaded applications need to identify and correct load-imbalances of its threaded execution before
implementing coordinated thread synchronization. Identifying thread imbalance can be accomplished using
performance monitoring events. Intel Core Duo processor provides an event for this purpose. The event
(Serial_Execution_Cycle) increments under the following conditions:
• Core actively executing code in C0 state.
• Second core in physical processor in idle state (C1-C4).
This event enables software developers to find code that is executing serially, by comparing Serial_Execution_Cycle
and Unhalted_Ref_Cycles. Changing sections of serialized code to execute into two parallel threads enables
coordinated thread synchronization to achieve better power savings.
Although Serial_Execution_Cycle is available only on Intel Core Duo processors, load-imbalance situations usually
remain the same for symmetric application threads on symmetrically configured multicore processors, irrespective of
differences in their underlying microarchitecture. For this reason, the technique to identify load-imbalance situations
can be applied to multi-threaded applications in general, and is not specific to Intel Core Duo processors.
Figure 16-7 legend: a lower bar means less energy consumed; the bars compare single-threaded and multithreaded implementations.
Figure 16-7 above shows the result of a study that compares processor energy consumption of single threaded
workloads with their corresponding performance-optimized implementations, using three sets of applications across
different application domains. In this particular study, the optimization effort in application 1 (Cryptography)
achieved a 2X gain in performance alone. At the same time, its energy consumption was reduced by about 12%. In
application 3 (a media application), performance optimization efforts including multi-threading and other techniques
achieved a 10X performance gain. Its energy consumption was reduced by about 60%.
16.5.1.2 Vectorization
Using SIMD instructions can reduce the path length of completing a given computational task, often reducing active
cycles. Code that performs the same operation on multiple independent data elements is a good candidate for
vectorization. Vectorization techniques are typically applied to applications with loops whose elements can be
processed with a single instruction. Typically, the slight power increase per unit time of using SIMD instructions is
compensated by a much greater reduction of active cycles. The net effect is improved energy consumption.
Figure 16-7 shows the result of a study of the energy-saving effect of vectorization. A media playback workload
achieved a 2.15X speedup by using the SSE2 and SSE4 instruction sets. Another audio processing workload increased
performance to ~5X by using Intel AVX instruction sets. At the same time, the latter also achieved greater energy savings.
This construct of sitting in a tight loop and calling Sleep() service with a parameter of 0 is actually a polling loop with
side effects:
• Each call to Sleep() experiences the expensive cost of a context switch, which can be 10000+ cycles.
• It also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles.
• When there is no other thread waiting to take possession of control, this sleep loop appears to the OS as a highly
active task demanding CPU resources, preventing the OS from putting the CPU into a low-power state.
Example 16-2 shows the technique of using PAUSE instruction to make the sleep loop power friendly.
By slowing down the “spin-wait” with the PAUSE instruction, the multi-threading software gains:
• Performance by facilitating the waiting tasks to acquire resources more easily from a busy wait.
• Power-savings by both using fewer parts of the pipeline while spinning.
• Elimination of great majority of unnecessarily executed instructions caused by the overhead of a Sleep(0) call.
In one case study, this technique achieved 4.3x of performance gain, which translated to 21% power savings at the
processor and 13% power savings at platform level.
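A minimal sketch in the spirit of this technique, using the _mm_pause() intrinsic; the predicate name is illustrative:

#include <immintrin.h>
/* Power-friendly spin-wait: PAUSE reduces pipeline activity and power while spinning,
   and avoids the Sleep(0) context-switch and ring transition overhead. */
static inline void spin_wait(volatile int *lock_is_free)
{
    while (!*lock_is_free)
        _mm_pause();
}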
To wait on a lock that is expected to be held only briefly, you can use EnterCriticalSection() with a spin count. The
advantage of this API over WaitForSingleObject() is that it does not enter kernel mode unless there is contention on
the lock. Hence, when there is no contention, EnterCriticalSection() with a spin count is much cheaper to use and
reduces the time spent in privileged mode.
Studies were done by taking a small test application which has four active threads on a Sandy Bridge
microarchitecture-based system. The locks in the test case were implemented by using WaitForSingleObject and
EnterCriticalSection. There was no contention on the lock, so each thread acquired the lock at the first attempt. As
shown in the graph below, when there is no contention, using WaitForSingleObject() has a negative impact on both
power and performance compared to using EnterCriticalSection().
As indicated in the following graph, using WaitForSingleObject() on an uncontended lock uses more power. Using
EnterCriticalSection() provides a 10x performance gain and a 60% energy reduction.
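A minimal Win32 sketch of this approach; the spin count of 4000 and the function names are illustrative:

#include <windows.h>
CRITICAL_SECTION cs;   /* assumed to protect some shared data */

void init_lock(void)
{
    /* Spin up to 4000 iterations in user mode before falling back to a kernel wait. */
    InitializeCriticalSectionAndSpinCount(&cs, 4000);
}

void update_shared_data(void)
{
    EnterCriticalSection(&cs);   /* no kernel transition when uncontended */
    /* ... touch shared data ... */
    LeaveCriticalSection(&cs);
}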
Table 16-2 and Table 16-3 list package C-state entry/exit latency for processors with CPUID
DisplayFamily_DisplayModel signature of 06_2AH, for two configurations of voltage regulator (VR) slew rate
capabilities. Table 16-2 applies to the slow VR configuration, and Table 16-3 applies to the fast VR configuration. For
each configuration, the VR device can operate with either fast interrupt break mode enabled or slow interrupt break
mode, depending on the setting of MSR_POWER_CTL.[bit 4]. These C-state entry/exit latencies are not processor
specifications but estimates derived from empirical measurements. There may be some situations where the exit
latency from a core is higher than those listed in Table 16-2 and Table 16-3.
Table 16-2. C-State Total Processor Exit Latency for Client Systems with Slow VR
C-State1   Typical Exit Latency2 (MSR_POWER_CTL.[4] = 0)   Worst-Case Exit Latency (MSR_POWER_CTL.[4] = 1)
C1         1 µs                                            1 µs
C3         156 µs                                          80 µs
C6         181 µs                                          104 µs
C7         199 µs                                          109 µs
NOTES:
1. These C-state entry/exit latencies are Intel estimates only and not processor specifications.
2. It is assumed that the package is in C0 when one of the cores is active.
3. Fast interrupt break mode is enabled if MSR_POWER_CTL.[4] = 1.
4. A device that connects to the PCH may result in latencies equivalent to those of the slow interrupt break mode.
Table 16-3. C-State Total Processor Exit Latency for Client Systems (Core + Package Exit Latency) with Fast VR
C-State1   Typical Worst-Case Exit Latency Time (All SKUs)2   Typical Worst-Case Exit Latency Time (All SKUs)3
           (MSR_POWER_CTL.[4] = 0)                            (MSR_POWER_CTL.[4] = 1)
C1         1 µs                                               1 µs
C3         156 µs                                             80 µs
C6         181 µs                                             104 µs
C7         199 µs                                             109 µs
NOTES:
1. These C-state entry/exit latencies are Intel estimates only and not processor specifications.
2. It is assumed that the package is in C0 when one of the cores is active.
3. If the package is in a deeper C-state, the exit latency of a Local APIC timer wakeup depends on the typical core-level
exit latency; if the package is in C0, it may vary between the typical or worst case of the respective core-level exit
latency.
Table 16-4 lists Core-only C-State entry/exit latency for processors with CPUID DisplayFamily_DisplayModel signature
of 06_2AH, and for two configurations of voltage regulator slew rate capabilities. Core-only exit latency is not affected
by MSR_POWER_CTL.[4].
Table 16-4. C-State Core-Only Exit Latency for Client Systems with Slow VR
C-State1   Typical Exit Latency   Worst-Case Exit Latency Time (All SKUs)2
C1         1 µs                   1 µs
C3         21 µs                  240 µs
C6         46 µs                  250 µs
C7         46 µs                  250 µs
NOTES:
1. These C-state entry/exit latencies are Intel estimates only and not processor specifications.
2. A slow VR device refers to a device with a ramp time of 10 mV/µs in fast mode and 2.5 mV/µs in slow mode.
CHAPTER 17
SOFTWARE OPTIMIZATION FOR INTEL® AVX-512 INSTRUCTIONS
As part of the family of Intel® Accelerator Engines in Intel® Xeon® Scalable processors, Intel® Advanced Vector
Extensions 512 (Intel® AVX-512) provides built-in acceleration for demanding workloads that involve heavy vector-
based processing. Intel AVX-512 comprises the following set of 512-bit instruction set extensions:
• Intel® AVX-512 Foundation (F)
— 512-bit vector width.
— 32 512-bit long vector registers.
— Data expand and data compress instructions.
— Ternary logic instruction.
— 8 new 64-bit long mask registers.
— Two source cross-lane permute instructions.
— Scatter instructions.
— Embedded broadcast/rounding.
— Transcendental support.
• Intel® AVX-512 Conflict Detection Instructions (CD)
• Intel® AVX-512 Exponential and Reciprocal Instructions (ER)
• Intel® AVX-512 Prefetch Instructions (PF)
• Intel® AVX-512 Byte and Word Instructions (BW)
• Intel® AVX-512 Double Word and Quad Word Instructions (DQ)
— New QWORD and Compute and Convert Instructions.
• Intel® AVX-512 Vector Length Extensions (VL)
Performance reports in this chapter are based on Data Cache Unit (DCU) resident data measurements on a Skylake
Server system with Intel® Turbo Boost Technology disabled, Intel® SpeedStep® Technology disabled, and core and uncore
frequency set to 1.8 GHz, unless otherwise specified. This fixed-frequency configuration is used to isolate the impact of
code changes from other factors.
Table 17-1. Intel® AVX-512 Feature Flags Across Intel® Xeon® Processor Generations

Processor families: (1) Intel® Core™ Processors, (2) Intel® Xeon® Scalable Processors, (3) 2nd Generation Intel® Xeon®
Scalable Processors, (4) 3rd Generation Intel® Xeon® Scalable Processors, (5) 4th/5th Generation Intel® Xeon® Scalable
Processors.

Feature flags                                                             Supported on
AVX/AVX2                                                                  (1) (2) (3) (4) (5)
AVX512F, AVX512CD, AVX512BW, AVX512DQ                                     (1) (2) (3) (4) (5)
AVX512_BF16                                                               (4) (5)
AVX512_VPOPCNTDQ, AVX512_VBMI2, VAES, GFNI, VPCLMULQDQ, AVX512_BITALG     (5)
AVX512_FP16                                                               (5)
Figure (SOM00002): rotation of a point (X, Y) by an angle θ:
x’ = x·cosθ − y·sinθ
y’ = x·sinθ + y·cosθ
The input buffer holds interleaved Xn, Yn coordinate pairs; each output element is formed by combining c·Yn with ±s·Xn,
where c = cosθ and s = sinθ.
Intel® AVX2 version (8 floats, 32-byte alignment):
//Static memory allocation of 8 floats with 32byte alignment
__declspec(align(32)) float cos_sin_theta_vec[8] = {cos_theta, sin_theta, cos_theta, sin_theta,
cos_theta, sin_theta, cos_theta, sin_theta};
_mm_free(pInVector);
_mm_free(pOutVector);
return 0;
}

Intel® AVX-512 version (16 floats, 64-byte alignment):
//Static memory allocation of 16 floats with 64byte alignment
__declspec(align(64)) float cos_sin_theta_vec[16] = {cos_theta, sin_theta, cos_theta, sin_theta,
cos_theta, sin_theta, cos_theta, sin_theta, cos_theta, sin_theta, cos_theta, sin_theta,
cos_theta, sin_theta, cos_theta, sin_theta};
_mm_free(pInVector);
_mm_free(pOutVector);
return 0;
}
17.2 MASKING
Intel AVX-512 instructions which use the Extended VEX coding scheme (EVEX) encode a predicate operand to
conditionally control per-element computational operation and update the result to the destination operand. The
predicate operand is known as the opmask register. The opmask is a set of eight architectural registers, 64 bits each.
From this set of 8 architectural registers, only k1 through k7 can be addressed as the predicate operand; k0 can be
used as a regular source or destination but cannot be encoded as a predicate operand.
A predicate operand can be used to enable memory fault-suppression for some instructions with a memory source
operand.
As a predicate operand, the opmask registers contain one bit to govern the operation / update of each data element
of a vector register. Masking is supported on Skylake microarchitecture for instructions with all data sizes:
• Byte (int8)
• Word (int16)
• Single-precision floating-point (float32)
• Integer doubleword (int32)
The destination register before instruction execution is shown in Figures 17-2, 17-3, and 17-4.
Figures 17-2 and 17-3 (SOM00003, SOM00004): the destination register before execution and the opmask register
K1 = 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 1, which selects the doubleword elements to be updated.
The result of the execution with zeroing masking is (notice the {z} in the instruction):
vmovaps zmm1 {k1}{z}, zmm0
.
Figure 17-4 (SOM00005): ZMM1 after execution with zeroing masking — elements whose K1 bit is 0 are zeroed:
0 a14 0 0 0 0 0 0 a7 a6 a5 a4 0 0 a1 a0
Notice that merging masking operations has a dependency on the destination, but zeroing masking is free of such
dependency.
The following example shows how masking could be done with Intel AVX-512 in contrast to Intel AVX2.
C Code:
const int N = miBufferWidth;
const double* restrict a = A;
const double* restrict b = B;
double* restrict c = Cref;
With no masking, the processor executes 2 multiplies per cycle on a 2 FMA server.
With merge masking, the processor executes 2 multiplies every 4 cycles as the multiplies in iteration N depend on the
output of the multiplies in iteration N-1.
Zero masking does not have a dependency on the destination register and therefore can execute 2 multiplies per cycle
on a 2 FMA server.
Recommendation: Masking has a cost, so use it only when necessary. When possible, use zero masking rather than
merge masking.
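As an illustration (a minimal sketch using intrinsics, not taken from the surrounding example; the function and variable names are hypothetical), merge masking reads the previous destination value while zero masking does not:

#include <immintrin.h>
/* Multiply only the elements selected by 'k'.
   Merge masking: unselected lanes keep the previous contents of 'acc',
   so each use depends on the prior value of 'acc'. */
__m512d mul_merge(__m512d acc, __mmask8 k, __m512d a, __m512d b)
{
    return _mm512_mask_mul_pd(acc, k, a, b);
}
/* Zero masking: unselected lanes are zeroed, no dependency on a destination. */
__m512d mul_zero(__mmask8 k, __m512d a, __m512d b)
{
    return _mm512_maskz_mul_pd(k, a, b);
}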
In Alternative 1, there is a dependency between instructions (1) and (2), and (2) and (3). That means that instruction
(2) has to wait for the result of the blending of instruction (1), before starting execution, and instruction (3) needs to
wait for instruction (2).
In Alternative 2, there is only one such dependency because each branch of conditional code is executed in parallel on
all the data, and a mask is used for blending back to one register only before writing data back to the memory.
Blending is faster, but it does not mask exceptions, which may occur on the unmasked data.
Alternative 2 executes 11% more instructions; it provides 23% speedup in overall execution. Alternative 2 uses an
extra register (zmm3). This extra register usage may cause extra latency in case of register pressure (freeing register
to memory and loading it afterwards).
The following code is another example of masking vs. blending.
for (int i = 0;i<len;i++)
{
if (a[i] > b[i]){
a[i] += b[i];
}
}
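One possible Intel AVX-512 formulation of this loop with masking is sketched below (an illustration only; it assumes 32-bit integer elements and a length that is a multiple of 16):

#include <immintrin.h>
void cond_add(int *a, const int *b, int len)   /* assumes len % 16 == 0 */
{
    for (int i = 0; i < len; i += 16) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        __mmask16 m = _mm512_cmpgt_epi32_mask(va, vb);   /* a[i] > b[i] */
        va = _mm512_mask_add_epi32(va, m, va, vb);       /* merge-masked add */
        _mm512_storeu_si512(a + i, va);
    }
}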
In Alternative 1, there is a dependency between instructions (1) and (2), and (2) and (3).
In Alternative 2, there are only 2 instructions in the dependency chain: (1) and (2).
pRefImage++;
pInImage++;
}
Table 17-2. Cache Comparison Between Skylake Server Microarchitecture and Broadwell Microarchitecture

Item 1
• Broadwell microarchitecture: The address of a vmaskmov store is considered resolved only after the mask is known.
Loads following a masked store may be blocked, depending on the memory disambiguation predictor, until the mask
value is known.
• Skylake Server microarchitecture: The issue is resolved; the address of a vmaskmov store can be resolved before the
mask is known.

Item 2
• Broadwell microarchitecture: If the mask is not all ones or all zeroes, loads that depend on the masked store must wait
until the store data is written to the cache. If the mask is all ones, the data can be forwarded from the masked store to
the dependent loads. If the mask is all zeroes, loads do not depend on the masked store.
• Skylake Server microarchitecture: Same behavior — if the mask is not all ones or all zeroes, dependent loads must wait
until the store data is written to the cache; if the mask is all ones, data can be forwarded from the masked store to the
dependent loads; if the mask is all zeroes, loads do not depend on the masked store.
The table below shows the difference in implementation and execution speed of two versions of the code, both
working on unaligned output data array.
Figure (SOM 00006): a matrix of Y/N entries showing, for each combination of element size (1 to 64 bytes) and element
position within the store, whether a subsequent load can be forwarded from a masked store.
There are two important points to consider when using data forwarding.
1. Data forwarding to a GPR is possible only from the lower 256 bits of the store. Note this when loading a GPR
with data that has recently been written.
2. Use masks with care, as forwarding is supported only for certain mask values. For example:
— st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
In summary, a masked store should be used carefully. For example, if the remainder size is known at compile time to
be 1, and there is a load operation from the same cache line after it (or there is an overlap between the addresses plus
vector lengths), it may be better to use scalar remainder processing rather than a masked remainder block.
Figure (SOM00007): data compress operation — the set bits of the mask register (… 0 0 1 1 1 0 0 1 1 0 0 0 1 0) select
which elements of the input buffer (… a13 a12 a11 a10 a9 a8 a7 a6 a5 a4 a3 a2 a1 a0) are gathered contiguously into
the destination.
Alternative 1: Scalar
mov rsi, source
mov rdi, dest
mov r9, len
xor r8, r8
xor r10, r10
mainloop:
mov r11d, dword ptr [rsi+r8*4]
test r11d, r11d
jle m1
mov dword ptr [rdi+r10*4], r11d
inc r10
m1:
inc r8
cmp r8, r9
jne mainloop
Baseline 1x
Speedup: 2.87x
xor r8, r8
xor r11, r11
vpxor xmm0, xmm0, xmm0
mainloop:
vmovdqa xmm1, [rsi+r8*4]
vpcmpgtd xmm2, xmm1, xmm0
mov r10, 4
vmovmskps r13, xmm2
shl r13, 4
vmovdqu xmm3, [r14+r13]
vpshufb xmm2, xmm1, xmm3
popcnt r13, r13
sub r10, r13
vmovdqu xmm3, [r15+r10*4]
vmaskmovps [rdi+r11*4], xmm3, xmm2
add r11, r13
add r8, 4
cmp r8, r9
jne mainloop
shuffle_LUT:
.int 0x80808080, 0x80808080, 0x80808080, 0x80808080
.int 0x03020100, 0x80808080, 0x80808080, 0x80808080
.int 0x07060504, 0x80808080, 0x80808080, 0x80808080
.int 0x03020100, 0x07060504, 0x80808080, 0x80808080
.int 0x0b0A0908, 0x80808080, 0x80808080, 0x80808080
.int 0x03020100, 0x0b0A0908, 0x80808080, 0x80808080
.int 0x07060504, 0x0b0A0908, 0x80808080, 0x80808080
.int 0x03020100, 0x07060504, 0x0b0A0908, 0x80808080
.int 0x0F0E0D0C, 0x80808080, 0x80808080, 0x80808080
.int 0x03020100, 0x0F0E0D0C, 0x80808080, 0x80808080
.int 0x07060504, 0x0F0E0D0C, 0x80808080, 0x80808080
.int 0x03020100, 0x07060504, 0x0F0E0D0C, 0x80808080
.int 0x0b0A0908, 0x0F0E0D0C, 0x80808080, 0x80808080
.int 0x03020100, 0x0b0A0908, 0x0F0E0D0C, 0x80808080
.int 0x07060504, 0x0b0A0908, 0x0F0E0D0C, 0x80808080
.int 0x03020100, 0x07060504, 0x0b0A0908, 0x0F0E0D0C
write_mask:
.int 0x80000000, 0x80000000, 0x80000000, 0x80000000
.int 0x00000000, 0x00000000, 0x00000000, 0x00000000
Speedup: 2.87x
xor r8, r8
xor r11, r11
vpxor ymm0, ymm0, ymm0
mainloop:
vmovdqa ymm1, [rsi+r8*4]
vpcmpgtd ymm2, ymm1, ymm0
mov r10, 8
vmovmskps r13, ymm2
shl r13, 5
vmovdqu ymm3, [r14+r13]
vpermd ymm2, ymm3, ymm1
popcnt r13, r13
sub r10, r13
vmovdqu ymm3, [r15+r10*4]
vmaskmovps [rdi+r11*4], ymm3, ymm2
add r11, r13
add r8, 8
cmp r8, r9
jne mainloop
// The lookup table is too large to reproduce in the document. It consists of 256 rows of eight 32-bit integers.
//The first 8 and the last 8 rows are shown below.
shuffle_LUT:
.int 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x0, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0
.int 0x0, 0x1, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0
// Skipping 240 lines
.int 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x0, 0x0
.int 0x0, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x0
.int 0x1, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x0
.int 0x0, 0x1, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0
.int 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0, 0x0
.int 0x0, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0
.int 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x0
.int 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
write_mask:
.int 0x80000000, 0x80000000, 0x80000000, 0x80000000
.int 0x80000000, 0x80000000, 0x80000000, 0x80000000
Example 17-12. Comparing Intel® AVX-512 Data Compress with Alternative 3 (Contd.)
Speedup: 5.27x
xor r8, r8
xor r10, r10
vpxord zmm0, zmm0, zmm0
mainloop:
vmovdqa32 zmm1, [rsi+r8*4]
vpcmpgtd k1, zmm1, zmm0
vpcompressd zmm2 {k1}, zmm1
vmovdqu32 [rdi+r10*4], zmm2
kmovd r11d, k1
popcnt r12, r11
add r8, 16
add r10, r12
cmp r8, r9
jne mainloop
Speedup: 11.9x
Data expand operations read contiguous elements from the source array (register) and place them in the destination
register at the positions indicated by the enabled bits of the mask register, as in the following pseudocode. If the number
of enabled bits is smaller than the number of source elements, the extra source values are ignored.
int a = 0;
for (int i = 0; i < 16; i++)
{
if (k[i] == 1)
{
dest[i] = src[a];
a++;
}
}
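The same operation can be expressed with intrinsics; the following is a minimal sketch (illustrative only, the names are not from the example):

#include <immintrin.h>
/* Place the first popcount(k) elements of 'src' into the positions of the set
   bits of 'k'; the remaining destination elements are zeroed. */
__m512i expand_example(__mmask16 k, __m512i src)
{
    return _mm512_maskz_expand_epi32(k, src);
}
/* The load form reads only the elements it needs directly from memory. */
__m512i expand_load_example(__mmask16 k, const int *mem)
{
    return _mm512_maskz_expandloadu_epi32(k, mem);
}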
Figure (SOM00008): data expand operation — the source elements a0 … a5 of the input buffer are written to the
destination at the positions of the set mask bits (mask = … 0 0 1 1 1 0 0 1 1 0 0 0 1 0), producing
… 0 0 a5 a4 a3 0 0 a2 a1 0 0 0 a0 0.
Example 17-14. Comparing Intel® AVX-512 Data Expand Operation with Other Alternatives
The truth table of the logic function, and the immediate value it defines (SOM00009):
X           1 1 1 1 0 0 0 0
Y           1 1 0 0 1 1 0 0
Z           1 0 1 0 1 0 1 0
f(X, Y, Z)  1 0 0 1 0 0 1 0   → immediate value 0x92
Using Karnaugh maps on this truth table, we can define the function as:
f(X, Y, Z) = (¬Y ∧ (X ⊕ Z)) ∨ (X ∧ Y ∧ Z)
The C code for the function above is as follows:
for (int i=0; i<SIZE; i++)
{
Dst[i] = ((~Src2[i]) & (Src1[i] ^ Src3[i])) | (Src1[i] & Src2[i] & Src3[i]);
}
The value of the function for each combination of X, Y and Z gives an immediate value that is used in the instruction.
Here are three implementations for this logical function applied to all values in X, Y and Z arrays.
• Alternative 1: an Intel AVX2 256-bit vector computation, using bitwise logical functions available in Intel AVX2.
• Alternative 2: a 512-bit vector computation, using bitwise logical functions available in Intel AVX-512, without
using the vpternlog instruction.
• Alternative 3: an Intel AVX-512 512-bit vector computation, using the vpternlog instruction.
All alternatives in the table are unrolled by a factor of two.
Speedup: 2.36x
Speedup: 1.94x
(1.22x vs Intel® AVX-512 with logic instructions)
The truth table of the partial function, and the immediate value used in the vpternlog instruction (SOM00010):
X           1 1 1 1 0 0 0 0
Y           1 1 0 0 1 1 0 0
Z           1 0 1 0 1 0 1 0
f(X, Y, Z)  0 1 1 1 1 0 0 0   → immediate value 0x78
Therefore, one vpternlog instruction can be used instead of two logic instructions (vpand and vpxor):
vpternlog x,y,z,0x78
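The same technique is available through intrinsics. The sketch below (illustrative only; the function name is hypothetical) computes the complete function from the first truth table in a single instruction, using immediate 0x92 and mapping X to the first source operand, which supplies the most significant truth-table input:

#include <immintrin.h>
/* Per bit: ((~y) & (x ^ z)) | (x & y & z), i.e., the 0x92 truth table. */
__m512i ternary_example(__m512i x, __m512i y, __m512i z)
{
    return _mm512_ternarylogic_epi32(x, y, z, 0x92);
}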
Figure (SOM00011): in a two-source permute, the index register selects from both sources — for dword elements,
index values 0–15 address the elements of the first source and index values 16–31 address the elements of the
second source.
Note that the index register values must have the same resolution as the instruction and source registers (word when
working on words, dword when working on dwords, etc.).
Figure 17-11. Two-Source Permute Instructions in a Matrix Transpose Operation
The corresponding C code is as follows (assuming each matrix occupies a contiguous block of 8*8*2 = 128 bytes):
for(int iY = 0; iY < 8; iY++)
{
for(int iX = 0; iX < 8; iX++)
{
transposedMatrix[iY*8+iX] = originalMatrix[iX*8+iY];
}
}
Here are three implementations for this matrix transpose.
• Alternative 1 is scalar code, which accesses each element of the source matrix and puts it to the corresponding
place in the destination matrix. This code does 64 (8x8) iterations per 1 matrix.
• Alternative 2 is Intel AVX2 code, which uses Intel AVX2 permutation and shuffle (unpack) instructions. Only 1
iteration per 8x8 matrix is required.
• Alternative 3 is Intel AVX-512 code that uses the two-source permute instructions. Note that this code first
loads permutation masks, and then matrix data. The mask used to perform the permutation is stored in the
following array:
Speedup: 37.3x
Baseline 1x Speedup: 13.7x
(2.7x vs Intel® AVX2 code)
17.9 BROADCAST
Executing the broadcast on the load ports reduces the workload on port 5 and increases performance. Alternative 3
shows how an embedded broadcast benefits from both executing the broadcast on the load ports and micro-fusion.
Alternative 1: 32-bit Load and Register Broadcast
loop:
vmovd xmm0, [rax]
vpbroadcastd zmm0, xmm0
vpaddd zmm2, zmm1, zmm0
vpermd zmm2, zmm3, zmm2
add rax, 0x4
sub rdx, 0x1
jnz loop

Alternative 2: Broadcast with a 32-bit Memory Operand
loop:
vpbroadcastd zmm0, [rax]
vpaddd zmm2, zmm1, zmm0
vpermd zmm2, zmm3, zmm2
add rax, 0x4
sub rdx, 0x1
jnz loop

Alternative 3: 32-bit Embedded Broadcast
loop:
vpaddd zmm2, zmm1, [rax]{1to16}
vpermd zmm2, zmm3, zmm2
add rax, 0x4
sub rdx, 0x1
jnz loop
The following example shows that on Skylake Server microarchitecture, 16-bit broadcast is executed on port 5 and
therefore does not gain from the memory operand broadcast.
Notice that embedded broadcast is not supported for 16-bit memory operands.
Some instructions, such as VRNDSCALE, already specify the rounding mode in their immediate bits. In such a case, the
immediate bits take precedence over the embedded rounding mode, in the same way that they take precedence over the
bits in MXCSR.RM.
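For reference, an instruction of the following form selects round-toward-plus-infinity with suppress-all-exceptions (shown here as an illustrative sketch of the instruction described in the next paragraph):
vaddps zmm7 {k6}, zmm2, zmm4 {ru-sae}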
This piece of code would perform the single-precision floating-point addition of vectors zmm2 and zmm4 with round-
toward-plus-infinity, leaving the result in vector zmm7 using k6 as a conditional writemask. Note that the MXCSR.RM bits
are ignored and are unaffected by the outcome of this instruction.
The following are examples of instructions instances where the static rounding-mode is not allowed.
; rounding-mode already specified in the instruction immediate
vrndscaleps zmm7 {k6}, zmm2 {rd}, 0x00
; instructions with vector length different than maximal vector length (512-bit)
vaddps ymm7 {k6}, ymm2, ymm4 {rd}
Figure: gather/scatter addressing — the address of each element is formed from a base address (BA) held in a GPR plus
a scaled element of the vector index register.
NOTE
A hardware scatter operation issues as many store operations as there are elements in the
vector. Do not use a scatter operation to store sequential elements, which can be stored with one
vmov instruction.
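For reference, a scatter can be written with intrinsics as in the following sketch (illustrative names only):

#include <immintrin.h>
/* Store 16 floats from 'vals' to base[idx[0]] ... base[idx[15]] (scale = 4 bytes). */
void scatter_example(float *base, __m512i idx, __m512 vals)
{
    _mm512_i32scatter_ps(base, idx, vals, 4);
}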
xor r9, r9
mainloop:
mov r9d, [rbx+rdx-0x4]
vcvtsi2ss xmm0, xmm0, qword ptr [rax+rdx*2-0x8]
vmovss [rcx+r9*4], xmm0
sub rdx, 4
jnz mainloop
Baseline 1x
Example 17-22. QWORD Example, Intel® AVX2 vs. Intel® AVX-512 Intrinsics
Example 17-23. QWORD Example, Intel® AVX2 vs. Intel® AVX-512 Assembly
Intel® AVX2 Assembly
loop:
vmovdqu32 ymm28, ymmword ptr [rax+rcx*8+0x20]
inc r9d
vmovdqu32 ymm26, ymmword ptr [r11+rcx*8+0x20]
vmovdqu32 ymm17, ymmword ptr [r11+rcx*8]
vmovdqu32 ymm19, ymmword ptr [rax+rcx*8]
vmovdqu ymm13, ymmword ptr [rax+rcx*8+0x40]
vmovdqu ymm11, ymmword ptr [r11+rcx*8+0x40]
vpsrlq ymm25, ymm28, 0x20
vpsrlq ymm27, ymm26, 0x20
vpsrlq ymm16, ymm19, 0x20
vpsrlq ymm18, ymm17, 0x20
vpaddq ymm6, ymm28, ymm26
vpsrlq ymm10, ymm13, 0x20
vpsrlq ymm12, ymm11, 0x20
vpaddq ymm0, ymm19, ymm17
vpmuludq ymm29, ymm25, ymm26
vpmuludq ymm30, ymm27, ymm28
vpaddd ymm31, ymm29, ymm30
vmovdqu32 ymm29, ymmword ptr [r11+rcx*8+0x80]
vpsllq ymm5, ymm31, 0x20
vmovdqu32 ymm31, ymmword ptr [rax+rcx*8+0x80]
vpsrlq ymm30, ymm29, 0x20
vpmuludq ymm20, ymm16, ymm17
vpmuludq ymm21, ymm18, ymm19
vpmuludq ymm4, ymm28, ymm26
vpaddd ymm22, ymm20, ymm21
vpaddq ymm7, ymm4, ymm5
vpsrlq ymm28, ymm31, 0x20
vmovdqu32 ymm20, ymmword ptr [r11+rcx*8+0x60]
vpsllq ymm24, ymm22, 0x20
vmovdqu32 ymm22, ymmword ptr [rax+rcx*8+0x60]
vpsrlq ymm21, ymm20, 0x20
vpaddq ymm4, ymm22, ymm20
vpcmpgtq ymm8, ymm7, ymm6
vblendvpd ymm9, ymm6, ymm7, ymm8
vmovups ymmword ptr [rsi+0x20], ymm9
vpmuludq ymm14, ymm10, ymm11
vpmuludq ymm15, ymm12, ymm13
vpmuludq ymm8, ymm28, ymm29
vpmuludq ymm9, ymm30, ymm31
vpmuludq ymm23, ymm19, ymm17
vpaddd ymm16, ymm14, ymm15
vpsrlq ymm19, ymm22, 0x20
vpaddd ymm10, ymm8, ymm9
vpaddq ymm1, ymm23, ymm24
vpsllq ymm18, ymm16, 0x20
vmovdqu32 ymm28, ymmword ptr [rax+rcx*8+0xc0]
vpsllq ymm12, ymm10, 0x20
vpmuludq ymm23, ymm19, ymm20
vpmuludq ymm24, ymm21, ymm22
vpaddd ymm25, ymm23, ymm24
vmovdqu32 ymm19, ymmword ptr [rax+rcx*8+0xa0]
vpsllq ymm27, ymm25, 0x20
vpsrlq ymm25, ymm28, 0x20
vpsrlq ymm16, ymm19, 0x20
vpcmpgtq ymm2, ymm1, ymm0
vblendvpd ymm3, ymm0, ymm1, ymm2
vpaddq ymm0, ymm13, ymm11
vmovups ymmword ptr [rsi], ymm3
vpmuludq ymm17, ymm13, ymm11
vpmuludq ymm11, ymm31, ymm29
vpaddq ymm1, ymm17, ymm18
vpaddq ymm13, ymm31, ymm29
vpaddq ymm14, ymm11, ymm12
vmovdqu32 ymm17, ymmword ptr [r11+rcx*8+0xa0]
vmovdqu ymm12, ymmword ptr [r11+rcx*8+0xe0]
vpsrlq ymm18, ymm17, 0x20
vpcmpgtq ymm2, ymm1, ymm0
vpmuludq ymm26, ymm22, ymm20
vpcmpgtq ymm15, ymm14, ymm13
vblendvpd ymm3, ymm0, ymm1, ymm2
vblendvpd ymm0, ymm13, ymm14, ymm15
vmovdqu ymm14, ymmword ptr [rax+rcx*8+0xe0]
vmovups ymmword ptr [rsi+0x40], ymm3
vmovups ymmword ptr [rsi+0x80], ymm0
vpaddq ymm5, ymm26, ymm27
vpsrlq ymm11, ymm14, 0x20
vpsrlq ymm13, ymm12, 0x20
vpaddq ymm1, ymm19, ymm17
vpaddq ymm0, ymm14, ymm12
vmovdqu32 ymm26, ymmword ptr [r11+rcx*8+0xc0]
vpmuludq ymm20, ymm16, ymm17
add rcx, 0x20
vpmuludq ymm21, ymm18, ymm19
vpaddd ymm22, ymm20, ymm21
vpsrlq ymm27, ymm26, 0x20
vpsllq ymm24, ymm22, 0x20
vpmuludq ymm29, ymm25, ymm26
vpmuludq ymm30, ymm27, ymm28
vpmuludq ymm15, ymm11, ymm12
vpmuludq ymm16, ymm13, ymm14
vpmuludq ymm23, ymm19, ymm17
vpaddd ymm31, ymm29, ymm30
vpaddd ymm17, ymm15, ymm16
vpaddq ymm2, ymm23, ymm24
vpsllq ymm19, ymm17, 0x20
vpcmpgtq ymm6, ymm5, ymm4
vblendvpd ymm7, ymm4, ymm5, ymm6
vpsllq ymm6, ymm31, 0x20
vmovups ymmword ptr [rsi+0x60], ymm7
vpaddq ymm7, ymm28, ymm26
vpcmpgtq ymm3, ymm2, ymm1
vpmuludq ymm5, ymm28, ymm26
vpmuludq ymm18, ymm14, ymm12
vblendvpd ymm4, ymm1, ymm2, ymm3
vpaddq ymm8, ymm5, ymm6
vpaddq ymm1, ymm18, ymm19
vmovups ymmword ptr [rsi+0xa0], ymm4
vpcmpgtq ymm9, ymm8, ymm7
vpcmpgtq ymm2, ymm1, ymm0
vblendvpd ymm10, ymm7, ymm8, ymm9
vblendvpd ymm3, ymm0, ymm1, ymm2
vmovups ymmword ptr [rsi+0xc0], ymm10
vmovups ymmword ptr [rsi+0xe0], ymm3
add rsi, 0x100
cmp r9d, r8d
jb loop

Intel® AVX-512 Assembly
loop:
vmovups zmm0, zmmword ptr [rax+rcx*8]
inc r9d
vmovups zmm5, zmmword ptr [rax+rcx*8+0x40]
vmovups zmm10, zmmword ptr [rax+rcx*8+0x80]
vmovups zmm15, zmmword ptr [rax+rcx*8+0xc0]
vmovups zmm1, zmmword ptr [r11+rcx*8]
vmovups zmm6, zmmword ptr [r11+rcx*8+0x40]
vmovups zmm11, zmmword ptr [r11+rcx*8+0x80]
vmovups zmm16, zmmword ptr [r11+rcx*8+0xc0]
vpaddq zmm2, zmm0, zmm1
vpmullq zmm3, zmm0, zmm1
vpaddq zmm7, zmm5, zmm6
vpmullq zmm8, zmm5, zmm6
vpaddq zmm12, zmm10, zmm11
vpmullq zmm13, zmm10, zmm11
vpaddq zmm17, zmm15, zmm16
vpmullq zmm18, zmm15, zmm16
vpmaxsq zmm4, zmm2, zmm3
vpmaxsq zmm9, zmm7, zmm8
vpmaxsq zmm14, zmm12, zmm13
vpmaxsq zmm19, zmm17, zmm18
vmovups zmmword ptr [rsi], zmm4
vmovups zmmword ptr [rsi+0x40], zmm9
vmovups zmmword ptr [rsi+0x80], zmm14
vmovups zmmword ptr [rsi+0xc0], zmm19
add rcx, 0x20
add rsi, 0x100
cmp r9d, r8d
jb loop
This reduction allows for a branch-free implementation of divide that covers overflow, underflow, and special inputs
(zeros, infinities, or denormals).
|VGETMANT(x,0)| is in [1,2) for all non-NaN inputs.
VGETMANT(a,0)/VGETMANT(b,0) can be computed to the desired accuracy.
The suppress-all-exceptions (SAE) feature available in Intel AVX-512 can help ensure spurious flag settings do not
occur. Flags can be set correctly as part of the computation (except for divide-by-zero, which requires an additional
step).
For high accuracy or IEEE compliance, the hardware instruction typically provides better performance, especially in
terms of latency.
Branching to a scalar version of the loop on any duplicate indices can work well if duplicates are extremely rare.
However, if the chance of getting even one duplicate in a given iteration of the vectorized loop is large enough, then
it is better to use SIMD as much as possible, to exploit as much parallelism as possible.
Figure: ZMM1 holds the index values 3 10 3 9 4 6 7 0 1 50 2 8 1 3 3 5 (elements 15 down to 0). VPCONFLICTD compares
each element with all preceding (less significant) elements and writes, for each element, a bit mask of the preceding
elements that hold the same index:
ZMM0 = 8198 0 6 0 0 0 0 0 8 0 0 0 0 2 0 0
For loops performing updates to memory locations, such as in the histogram example, minimize store-load
forwarding by merging the updates to each distinct index while the data is in registers, and only perform a single write
to each memory location. Further, the merge can be performed in a parallel fashion.
Figure: tree reduction of the example indices 3 10 3 9 4 6 7 0 1 50 2 8 1 3 3 5 — Step 1 merges pairs of elements with
matching indices; Step 2 merges the remaining intermediate results.
The figure above shows the merging process for the example set of indices. While the figure shows only the indices, it
actually merges the values. Most of the indices are unique, and thus require no merging. Step 1 combines three pairs
of indices: two pairs of '3's and one pair of '1's. Step 2 combines the intermediate results for the '3's from step 1, so
that there is now a single value for each distinct index. Notice that in only two steps, the four elements with an index
value of 3 are merged, because we performed a tree reduction; we merged pairs of results or intermediate results at
each step.
The merging (combining or reduction) process shown above is done with a set of permute operations. The initial
permute control is generated with a VPLZCNT+VPSUB sequence. VPLZCNT provides the number of leading zeros for
each vector element (i.e., contiguous zeros in the most significant bit positions). Subtracting the results of VPLZCNT
from the number of bits in each vector element, minus one, provides the bit position of the most significant '1' bit in
the result of the VPCONFLICT instruction, or results in a '-1' for an element if it has no conflicts. In the example above
this sequence results in the following permute control.
13 -1 2 -1 -1 -1 -1 -1 3 -1 -1 -1 -1 1 -1 -1
The permute loop for merging matching indices and generating the next set of permute indices repeats until all values
in the permute control become equal to ‘-1’.
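A sketch of this merge loop using intrinsics is shown below (illustrative only; 'indices' and 'values' are assumed to hold 16 dword elements):

#include <immintrin.h>
/* Tree-reduce 'values' so that each element accumulates all earlier elements
   that have the same index, following the VPCONFLICT + VPLZCNT sequence above. */
__m512i merge_conflicts(__m512i indices, __m512i values)
{
    __m512i conflicts = _mm512_conflict_epi32(indices);              /* mask of earlier equal indices */
    __m512i lz        = _mm512_lzcnt_epi32(conflicts);
    __m512i permctl   = _mm512_sub_epi32(_mm512_set1_epi32(31), lz); /* MSB position, or -1 */
    __mmask16 todo    = _mm512_cmpneq_epi32_mask(permctl, _mm512_set1_epi32(-1));
    while (todo) {
        /* Add the partner's current value, then follow the partner's own control. */
        __m512i partner = _mm512_mask_permutexvar_epi32(values, todo, permctl, values);
        values  = _mm512_mask_add_epi32(values, todo, values, partner);
        permctl = _mm512_mask_permutexvar_epi32(permctl, todo, permctl, permctl);
        todo    = _mm512_cmpneq_epi32_mask(permctl, _mm512_set1_epi32(-1));
    }
    return values;
}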
The assembly code below shows both the scalar version of a histogram loop, and the vectorized version with a tree
reduction. Speedups are modest because the loop contains little computation; the SIMD benefit comes almost
entirely from vectorizing just the logical AND operation and the increment. SIMD speedups can be much higher for
loops containing more vectorizable computation.
update:
vpaddd zmm0, zmm2, zmm1
kxnorw k1, k1, k1
add rcx, 16
vpscatterdd [r15+zmm3*4]{k1}, zmm0
cmp ecx, ebx
jb histogram_loop
Notice that the end result of the conflict loop (i.e., the resulting vector after all merging is done, ZMM2 in the above
sequence) holds the complete set of partial sums. That is, for each element, the result contains the value of that
element merged with all earlier elements with the same index value. Using the earlier example values, ZMM2
contains the result shown in Figure 17-16.
Figure 17-16: ZMM2 = 4 1 3 1 1 1 1 1 2 1 1 1 1 2 1 1 (elements 15 down to 0)
While the above sequence does not take advantage of this, other use cases might.
Figure (SOM00017): the index array of a sparse vector, A_index = {0, 3, 4, 10, 15, 32, 41, 87}.
To perform a dot product of two sparse vectors efficiently, we need to find elements with matching indices; those are
the only ones on which we should perform the multiply and accumulation. The scalar method for doing this is to start
at the beginning of the two index arrays, compare those indices, and if there is a match, do the multiply and
accumulate, then advance the indices of both vectors. If there is no match, we advance the index of the lagging vector.
A_offset = 0; B_offset = 0; sum = 0;
while ((A_offset < A_length) && (B_offset < B_length))
{
if (A_index[A_offset] == B_index[B_offset]) // match
{
sum += A_value[A_offset] * B_value[B_offset];
A_offset++;
B_offset++;
}
else if (A_index[A_offset] < B_index[B_offset]) // A is lagging
A_offset++;
else // B is lagging
B_offset++;
}
VPERMB Operation:
// vpermb zmm Dst {k1}, zmm Src1, zmm Src2
bool zero_masking=false;
unsigned char *Dst, *Src1, *Src2;
for(int i=0;i<64;i++){
if(k1[i]){
Dst[i]= Src2[Src1[i]];
}else{
Dst[i]= zero_masking? 0 : Dst[i];
}
}
The following example shows a 64-byte lookup table implementation.
Scalar code:
void lookup(unsigned char* in_bytes, unsigned char* out_bytes, unsigned char* dictionary_bytes, int
numOfElements){
for(int i = 0; i < numOfElements; i++) {
out_bytes[i] = dictionary_bytes[in_bytes[i] & 63];
}
}
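With AVX512_VBMI the body of this loop maps to one VPERMB per 64 bytes; the following intrinsics version is a minimal sketch (it assumes numOfElements is a multiple of 64):

#include <immintrin.h>
void lookup64(const unsigned char *in_bytes, unsigned char *out_bytes,
              const unsigned char *dictionary_bytes, int numOfElements)
{
    __m512i dict = _mm512_loadu_si512(dictionary_bytes);     /* 64-byte table */
    for (int i = 0; i < numOfElements; i += 64) {
        __m512i idx = _mm512_loadu_si512(in_bytes + i);
        /* vpermb uses only the low six bits of each index byte, matching "& 63". */
        __m512i res = _mm512_permutexvar_epi8(idx, dict);
        _mm512_storeu_si512(out_bytes + i, res);
    }
}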
VPERMI2B Operation:
/// vpermi2b Dst{k1}, Src1, Src2
bool zero_masking=false;
unsigned char *Dst, *Src1, *Src2;
for(int i=0;i<64;i++){
if(k1[i]){
Dst[i]= Dst [i]>63 ? Src1[Dst [i] & 63] : Src2[Dst [i] & 63] ;
}else{
Dst[i]= zero_masking? 0 : Dst[i];
}
}
Figure: VPERMI2B treats its two source registers as a single 128-byte table — byte values A0 … A63 of one source occupy
table indices 0–63 and byte values B0 … B63 of the other source occupy indices 64–127; each index byte selects one
element from this combined table.
VPERMT2B Operation:
// vpermt2b Dst{k1}, Src1, Src2
bool zero_masking=false;
unsigned char *Dst, *Src1, * Src2;
data2= copy(Dst);
for(int i=0;i<64;i++){
if(k1[i]){
Dst[i]= Src2[i]>63 ? Src1[Src2 [i] & 63] : Dst[Src2[i] & 63] ;
}else{
Dst[i]= zero_masking? 0 : Dst[i];
}
}
C Code:
void lookup(unsigned char* in_bytes, unsigned char* out_bytes, unsigned char* dictionary_bytes, int
numOfElements){
for(int i = 0; i < numOfElements; i++) {
out_bytes[i] = dictionary_bytes[in_bytes[i] & 127];
}
}
Figure: VPMULTISHIFTQB result — each destination byte receives an 8-bit field extracted from the corresponding source
qword at the bit offset given by the control byte (for example, bits 1–8, 8–15, and so on).
VPMULTISHIFTQB Operation:
// vpmultishiftqb Dst{k1},Src1,Src2
bool zero_masking=false;
unsigned char *Dst, * Src1;
unsigned __int64 *Src2;
bit * k1;
for(int i=0;i<8;i++){
for(int j=0;j<8;j++){
if(k1[i*8 +j]){
Dst[i*8 +j]= (Src2[i] >> Src1[i*8 +j]) & 0xFF;
}else{
Dst[i*8 +j]= zero_masking? 0 : Dst[i*8 +j];
}
}
}
The following example converts a 5-bit unsigned integer array to a 1-byte unsigned integer array.
C code:
void decompress (unsigned char* compressedData, unsigned char* decompressedData, int
numOfElements){
for(int i = 0; i < numOfElements; i += 8){
unsigned __int64 * data = (unsigned __int64 * )compressedData;
decompressedData[i+0] = * data & 0x1f;
decompressedData[i+1] = (*data >> 5 ) & 0x1f;
decompressedData[i+2] = (*data >> 10 ) & 0x1f;
decompressedData[i+3] = (*data >> 15 ) & 0x1f;
decompressedData[i+4] = (*data >> 20 ) & 0x1f;
decompressedData[i+5] = (*data >> 25 ) & 0x1f;
decompressedData[i+6] = (*data >> 30 ) & 0x1f;
decompressedData[i+7] = (*data >> 35 ) & 0x1f;
compressedData += 5;
}
}
Scalar code:
loop:
shr rcx, 0x23
and r10, 0x1f
and rcx, 0x1f
mov byte ptr [r9+rsi*8+0x1], r11b
mov byte ptr [r9+rsi*8+0x6], r10b
mov byte ptr [r9+rsi*8+0x7], cl
inc rsi
cmp rsi, rax
jb loop

Intel® AVX-512 code:
loop:
vmovdqu32 zmm1, [rsi]
vpermb zmm2, zmm10, zmm1
vpmultishiftqb zmm2, zmm11, zmm2
vpandq zmm2, zmm12, zmm2
vmovdqu32 [rdi], zmm2
add rdi, 64
add rsi, 40
cmp rdi, r8
jl loop
Figure 17-22. Fast Bypass When All Sources Come from the FMA Unit
The figure shows an FMA operation on port 0 occupying four cycles and an FMA operation on port 5 occupying six cycles;
the gray boxes represent compute cycles, and the white boxes represent the data transfer to and from the port 5 FMA
unit.
If fast bypass is not used, that is, when not all sources come from the FMA unit, group A instructions have a latency of
four cycles on Port0 and six cycles on port5, while group B instructions have an additional cycle and hence have a
latency of five cycles on Port0 and seven cycles on port5.
The following table summarizes the FMA unit latency for the various options.
Figure 17-23. Mixing Intel AVX Instructions or Intel AVX-512 Instructions with Intel SSE Instructions
Recommendations:
• When mixing group B instructions with Intel SSE instructions, or when such a mixture might occur, use the
VZEROUPPER instruction whenever a transition is expected.
• Add VZEROUPPER after group B instructions have executed and before any function call that might lead to
Intel SSE instruction execution.
• Add VZEROUPPER at the end of any function that uses group B instructions.
• Add VZEROUPPER before thread creation if the registers are not already in a clean state, so that the thread does not
inherit a dirty upper state.
Example 17-29. 256-bit Code vs. 256-bit Code Mixed with 512-bit Code
256-bit Code Only
Loop:
vpbroadcastd ymm0, dword ptr [rsp]
vfmadd213ps ymm7, ymm7, ymm7
vfmadd213ps ymm8, ymm8, ymm8
vfmadd213ps ymm9, ymm9, ymm9
vfmadd213ps ymm10, ymm10, ymm10
vfmadd213ps ymm11, ymm11, ymm11
vfmadd213ps ymm12, ymm12, ymm12
vfmadd213ps ymm13, ymm13, ymm13
vfmadd213ps ymm14, ymm14, ymm14
vfmadd213ps ymm15, ymm15, ymm15
vfmadd213ps ymm16, ymm16, ymm16
vfmadd213ps ymm17, ymm17, ymm17
vfmadd213ps ymm18, ymm18, ymm18
vpermd ymm1, ymm1, ymm1
vpermd ymm2, ymm2, ymm2
vpermd ymm3, ymm3, ymm3
vpermd ymm4, ymm4, ymm4
vpermd ymm5, ymm5, ymm5
vpermd ymm6, ymm6, ymm6
dec rdx
jnle Loop

256-bit Code Mixed with 512-bit Code
Loop:
vpbroadcastd zmm0, dword ptr [rsp]
(the twelve vfmadd213ps and six vpermd instructions are identical to the 256-bit-only loop)
dec rdx
jnle Loop
In the 256-bit-only example, the FMAs are dispatched to ports 0 and 1 and vpermd is dispatched to port 5, because the
broadcast instruction is 256 bits wide. In the mixed 256-bit and 512-bit example, the broadcast is 512 bits wide;
therefore, the processor uses the 512-bit port scheme, where the FMAs dispatch to ports 0 and 5 and vpermd dispatches
to port 5, increasing the pressure on port 5.
The differentiation between the two processors is based on the ratio between the two throughput tests. Processors
with two FMA units are able to run the FMA-only test twice as fast as the FMA and shuffle test. However, a processor
with one FMA unit will run both tests at the same speed.
Example 17-30. Identifying One or Two FMA Units in a Processor Based on Skylake Microarchitecture
#include <string.h>
#include <stdlib.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
vmovups zmm22, [shuf_vec]
vmovups zmm23, [shuf_vec]
vmovups zmm30, [shuf_vec]
mov rdx, loops
loop1:
vfmadd231pd zmm0, zmm0, zmm0
vfmadd231pd zmm1, zmm1, zmm1
vfmadd231pd zmm2, zmm2, zmm2
vfmadd231pd zmm3, zmm3, zmm3
vfmadd231pd zmm4, zmm4, zmm4
vfmadd231pd zmm5, zmm5, zmm5
vfmadd231pd zmm6, zmm6, zmm6
vfmadd231pd zmm7, zmm7, zmm7
vfmadd231pd zmm8, zmm8, zmm8
vfmadd231pd zmm9, zmm9, zmm9
vfmadd231pd zmm10, zmm10, zmm10
vfmadd231pd zmm11, zmm11, zmm11
vpermd zmm12, zmm30, zmm30
vpermd zmm13, zmm30, zmm30
vpermd zmm14, zmm30, zmm30
vpermd zmm15, zmm30, zmm30
vpermd zmm16, zmm30, zmm30
vpermd zmm17, zmm30, zmm30
vpermd zmm18, zmm30, zmm30
vpermd zmm19, zmm30, zmm30
vpermd zmm20, zmm30, zmm30
vpermd zmm21, zmm30, zmm30
vpermd zmm22, zmm30, zmm30
vpermd zmm23, zmm30, zmm30
dec rdx
jg loop1
}
}
uint64_t fma_only_tpt(int loop_cnt){
uint64_t loops = loop_cnt;
__declspec(align(64)) double one_vec[8] = {1, 1, 1, 1,1, 1, 1, 1};
__asm
{
vmovups zmm0, [one_vec]
vmovups zmm1, [one_vec]
vmovups zmm2, [one_vec]
vmovups zmm3, [one_vec]
vmovups zmm4, [one_vec]
vmovups zmm5, [one_vec]
vmovups zmm6, [one_vec]
vmovups zmm7, [one_vec]
vmovups zmm8, [one_vec]
vmovups zmm9, [one_vec]
vmovups zmm10, [one_vec]
vmovups zmm11, [one_vec]
mov rdx, loops
loop1:
vfmadd231pd zmm0, zmm0, zmm0
vfmadd231pd zmm1, zmm1, zmm1
vfmadd231pd zmm2, zmm2, zmm2
vfmadd231pd zmm3, zmm3, zmm3
vfmadd231pd zmm4, zmm4, zmm4
vfmadd231pd zmm5, zmm5, zmm5
vfmadd231pd zmm6, zmm6, zmm6
vfmadd231pd zmm7, zmm7, zmm7
vfmadd231pd zmm8, zmm8, zmm8
vfmadd231pd zmm9, zmm9, zmm9
vfmadd231pd zmm10, zmm10, zmm10
vfmadd231pd zmm11, zmm11, zmm11
dec rdx
jg loop1
}
}
int main()
{
int i;
uint64_t fma_shuf_tpt_test[3];
uint64_t fma_shuf_tpt_test_min;
uint64_t fma_only_tpt_test[3];
uint64_t fma_only_tpt_test_min;
uint64_t start = 0;
uint64_t number_of_fma_units_per_core = 2;
/*********************************************************/
/* Step 1: Warmup */
/*********************************************************/
fma_only_tpt(100000);
/*********************************************************/
/* Step 2: Execute FMA and Shuffle TPT Test */
/*********************************************************/
for(i = 0; i < 3; i++){
start = rdtsc();
fma_shuf_tpt(1000);
fma_shuf_tpt_test[i] = rdtsc() - start;
}
/*********************************************************/
/* Step 3: Execute FMA only TPT Test */
/*********************************************************/
for(i = 0; i < 3; i++){
start = rdtsc();
fma_only_tpt(1000);
fma_only_tpt_test[i] = rdtsc() - start;
}
/*********************************************************/
/* Step 4: Decide if 1 FMA server or 2 FMA server */
/*********************************************************/
fma_shuf_tpt_test_min = fma_shuf_tpt_test[0];
fma_only_tpt_test_min = fma_only_tpt_test[0];
for(i = 1; i < 3; i++){
if ((int)fma_shuf_tpt_test[i] < (int)fma_shuf_tpt_test_min) fma_shuf_tpt_test_min = fma_shuf_tpt_test[i];
if ((int)fma_only_tpt_test[i] < (int)fma_only_tpt_test_min) fma_only_tpt_test_min = fma_only_tpt_test[i];
}
The following constants were loaded into zmm registers and used as gather and permute indices:
Zmm0 (Alternative 1), zmm6 (Alternative 2)
__declspec (align(64)) const __int32 gather_imag_index[16] = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27,
29, 31};
Zmm1 (Alternative 1), zmm7 (Alternative 2)
__declspec (align(64)) const __int32 gather_real_index[16] = {0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26,
28, 30};
Recommendation: For best performance, replace strided loads that have a short stride with a sequence of full-width
loads and permutes.
Table 17-10 summarizes the data alignment effects on SAXPY performance, with speedup values for the various
options.
Table 17-10. Data Alignment Effects on SAXPY Performance vs. Speedup Value

Data Alignment                                                                                        Speedup
Alternative 1: Both sources and the destination are 64-byte aligned.                                  Baseline, 1.0
Alternative 2: Both sources are 64-byte aligned; the destination has a 4-byte offset from the
alignment.                                                                                            0.66x
Alternative 3: Both sources and the destination have a 4-byte offset from the alignment.              0.59x
Alternative 4: One source has a 4-byte offset from the alignment; the other source and the
destination are 64-byte aligned.                                                                      0.77x
• Use dynamic data alignment.
— For example:
//dynamically allocating 64byte aligned buffer with 2048 float elements.
InputBuffer = (float*) _mm_malloc (2048*sizeof(float), 64);
• Use static data alignment using __declspec(align(64)).
— For example:
//Statically allocating 64byte aligned buffer with 2048 float elements.
__declspec(align(64)) float InputBuffer[2048];
NOTE
In some cases, when the divide or square root operation is part of a larger algorithm that hides some of the latency of
these operations, the approximation with Newton-Raphson iterations can slow execution down, because the additional
instructions produce more micro-ops that fill the pipeline.
The following sections show the operations with recommended calculation methods depending on the desired
accuracy level.
NOTE
There are two definitions for the approximation error of a value x and its approximation x_approx:
Absolute error = |x − x_approx|
Relative error = |x − x_approx| / |x|
In this chapter, the error expressed as a “number of bits” is the relative error, not the absolute error.
The reference value to which the approximation is compared should be as accurate as possible, preferably computed
in double precision or better.
The RSQRT14PS (single precision) and RSQRT14PD (double precision) instructions provide reciprocal square root
approximations accurate to 14 bits; the performance of these and the other methods is shown in the tables below.
Table 17-13. 256-bit Intel® AVX2 Divide and Square Root Instruction Performance

                               DIVPS    SQRTPS    DIVPD    SQRTPD
Broadwell microarchitecture
  Latency                       17       21        23       35
  Throughput                    10       14        16       28
Skylake microarchitecture
  Latency                       11       12        14       18
  Throughput                     5        6         8       12
Table 17-14. 512-bit Intel AVX-512 Divide and Square Root Instruction Performance
Skylake Microarchitecture DIVPS SQRTPS DIVPD SQRTPD
Latency 17 19 23 31
Throughput 10 12 16 24
The table below shows the latency and throughput of single precision Intel AVX-512 divide and square root
instructions, compared to the approximation methods on Skylake microarchitecture.
Table 17-15. Latency/Throughput of Different Methods of Computing Divide and Square Root
on Skylake Microarchitecture for Different Vector Widths, Single Precision

                                                                    256-bit Intel® AVX-512     512-bit Intel® AVX-512
Operation      Method                              Accuracy         Throughput   Latency       Throughput   Latency
Divide (a/b)   DIVPS                               24 bits (IEEE)   5            11            10           17
               RCP14PS + MULPS + 1 Newton-
               Raphson iteration                   23 bits          2            16            3            20
               RCP14PS + MULPS                     14 bits          1            8             2            10-12
Square root    SQRTPS                              24 bits (IEEE)   6            12            12           19
               RSQRT14PS + MULPS + 1 Newton-
               Raphson iteration                   23 bits          3            16            5            20
               RSQRT14PS + MULPS                   14 bits          2            9             3            12
               RSQRT14PS                           14 bits          1            4             2            6
Table 17-16. Latency/Throughput of Different Methods of Computing Divide and Square Root
on Skylake Microarchitecture for Different Vector Widths, Double Precision

                                                                    256-bit Intel® AVX-512     512-bit Intel® AVX-512
Operation      Method                              Accuracy         Throughput   Latency       Throughput   Latency
Divide (a/b)   DIVPD                               53 bits (IEEE)   8            14            16           23
               RCP14PD + MULPD + 2 Newton-
               Raphson iterations                  22 bits          3.2          27            4.7          28.4
               RCP14PD + MULPD + 1 Newton-
               Raphson iteration                   26 bits          2            16            3            20
Square root    SQRTPD                              53 bits (IEEE)   12           18            24           31
               RSQRT14PD + MULPD + polynomial
               approximation                       22 bits          4.82         24.54 (1)     6.4          28.48 (1)
               RSQRT14PD + MULPD + 1 Newton-
               Raphson iteration                   26 bits          3.76         17            5            20
               RSQRT14PD + 2 Newton-Raphson
               iterations + error correction       52 bits          5            29.38         6.53         34
               RSQRT14PD                           14 bits          1            4             2            6
NOTES:
1. These numbers are not rounded because their code sequence contains several FMA (Fused-multiply-add) instruc-
tions, which have a varying latency of 4/6. Therefore the latency for these sequences is not necessarily fixed.
__asm {
vbroadcastss zmm0, a// fill zmm0 with 16 elements of a
vbroadcastss zmm1, b// fill zmm1 with 16 elements of b
vdivps zmm2, zmm0, zmm1// zmm2 = 16 elements of a/b
}
__asm {
vbroadcastss zmm1, one// zmm1 = vector of 16 1’s
vsqrtps zmm2, zmm0
vdivps zmm2, zmm1, zmm2
}
Single Precision, Reciprocal Square Root, 23 Bits Single Precision, Reciprocal Square Root, 14 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm2 = vector of 1/sqrt (a) /* Input:
*/ zmm0 = vector of a’s
Output:
float half = 0.5; zmm2 = vector of 1/sqrt (a)
*/
__asm {
vbroadcastss zmm1, half// zmm1 = vector of 16 0.5’s __asm {
vrsqrt14ps zmm2, zmm0 vrsqrt14ps zmm2, zmm0
vmulps zmm3, zmm0, zmm2 }
vmulps zmm4, zmm1, zmm2
vfnmadd231ps zmm1, zmm3, zmm4
vfmsub231ps zmm3, zmm0, zmm2
__asm {
vsqrtps zmm2, zmm0
}
Single Precision, Square Root, 23 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm0 = vector of sqrt (a)
*/
float half = 0.5;
__asm {
vbroadcastss zmm3, half
vrsqrt14ps zmm1, zmm0
vfpclassps k2, zmm0, 0xe
vmulps zmm2, zmm0, zmm1, {rn-sae}
vmulps zmm1, zmm1, zmm3
knotw k3, k2
vfnmadd231ps zmm0{k3}, zmm2, zmm2
vfmadd213ps zmm0{k3}, zmm1, zmm2
}

Single Precision, Square Root, 14 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm0 = vector of sqrt (a)
*/
__asm {
vrsqrt14ps zmm1, zmm0
vfpclassps k2, zmm0, 0xe
knotw k3, k2
vmulps zmm0{k3}, zmm0, zmm1
}
Double Precision, Square Root, 53 Bits (IEEE) Double Precision, Square Root, 52 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm0 = vector of sqrt (a)
*/
Double Precision, Square Root, 26 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm0 = vector of sqrt (a)
*/
// duplicates x eight times
#define DUP8_DECL(x) x, x, x, x, x, x, x, x
// used for aligning data structures to n bytes
#define ALIGNTO(n) __declspec(align(n))
ALIGNTO(64) __int64 OneHalf[ ] = {DUP8_DECL(0X3FE0000000000000)};
__asm {
vrsqrt14pd zmm1, zmm0
vfpclasspd k2, zmm0, 0xe
knotw k3, k2
vmulpd zmm0 {k3}, zmm0, zmm1
vmulpd zmm1, zmm1, ZMMWORD PTR [OneHalf]
vfnmadd213pd zmm1, zmm0, ZMMWORD PTR [OneHalf]
vfmadd213pd zmm0 {k3}, zmm1, zmm0
}

Double Precision, Square Root, 14 Bits
/* Input:
zmm0 = vector of a’s
Output:
zmm0 = vector of sqrt (a)
*/
__asm {
vrsqrt14pd zmm1, zmm0
vfpclasspd k2, zmm0, 0xe
knotw k3, k2
vmulpd zmm0 {k3}, zmm0, zmm1
}
17.26 CLDEMOTE
Using the CLDEMOTE instruction, a processor can demote a cache line from its private caches into the last shared level
of the cache hierarchy, so that other cores find the line there and an expensive cross-core snoop is avoided. The most
significant advantage of CLDEMOTE is that multiple consumers can access the shared cache line, amortizing the cost of
the single demotion across all of their accesses.
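A minimal producer-side sketch using the _cldemote intrinsic (the buffer and function names are illustrative, not from the text):

#include <stddef.h>
#include <immintrin.h>
/* Demote every cache line of a buffer the producer has just written, so that
   consumers on other cores find the lines in the shared last-level cache. */
void demote_buffer(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i += 64)   /* one demote per 64-byte cache line */
        _cldemote(buf + i);
}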
The option -qopt-report-phase (/Qopt-report-phase on Windows) controls report generation from various compiler
phases, but it is recommended to use the default setting where the report is generated for all compiler phases.
• The report is a useful tool to gain insight into the performance optimizations performed, or not performed, by the
compiler, and also to understand the interactions between multiple optimizations such as inlining, OpenMP*
parallelization, loop optimizations (such as loop distribution or loop unrolling) and vectorization.
• The report is based on static compiler analysis. Hence the reports are most useful when correlated with dynamic
performance analysis tools, such as Intel® VTune™ Amplifier or Vectorization Advisor (part of Intel® Advisor XE),
that do hotspot analysis and provide other dynamic information.
• Once this information is available, the optimization information can be studied for hotspots
(functions/loopnests) in compiler reports.
— The compiler can generate multiple versions of loop-nests, so it is useful to correlate the analysis with the
version actually executed at runtime.
• The phase ordering of the compiler loop optimizations is intended to enable optimal vectorization.
Often, understanding the loop optimization parameters helps to further tune performance.
• Finer control of these loop optimizations is often available via pragmas, directives, and options.
If the application contains OpenMP pragmas or directives, it can be compiled with -qopenmp (/Qopenmp on
Windows) to enable full OpenMP based multi-threading and vectorization. Alternatively, the SIMD vectorization
features of OpenMP alone can be enabled by using the option -qopenmp-simd (/Qopenmp-simd on Windows).
For doing studies where compiler-based vectorization has to be turned off completely, use the options
-no-vec -no-simd -qno-openmp-simd (/Qvec- /Qsimd- /Qopenmp-simd- on Windows).
Data alignment plays an important role in improving the efficiency of vectorization. This usually involves two distinct
steps from the user or application:
• Align the data.
— When compiling a Fortran program, it is possible to use the option -align array64byte (/align:array64byte on
Windows) to align the start of most arrays at a memory address that is divisible by 64.
— For C/C++ programs, data allocation can be done using routines such as _mm_malloc(…, 64) to align the
return-value pointer at 64 bytes. For more information on data alignment, see
https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization.
• Convey the alignment information to the compiler using appropriate clauses, pragmas, and directives.
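For example, the following C sketch combines both steps (illustrative only; __assume_aligned is an Intel compiler extension, and the function and variable names are hypothetical):

#include <immintrin.h>
/* Tell the compiler that the pointers are 64-byte aligned. */
void scale(float *restrict out, const float *restrict in, int n, float s)
{
    __assume_aligned(in, 64);
    __assume_aligned(out, 64);
    for (int i = 0; i < n; i++)
        out[i] = s * in[i];
}
void example(void)
{
    /* 64-byte aligned allocations, matching the promise made in scale(). */
    float *in  = (float *)_mm_malloc(2048 * sizeof(float), 64);
    float *out = (float *)_mm_malloc(2048 * sizeof(float), 64);
    scale(out, in, 2048, 2.0f);
    _mm_free(in);
    _mm_free(out);
}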
Compiler-based software data prefetching can be enabled with the options -O3 -xcore-avx512 -qopt-prefetch[=n] (-
O3 /QxCORE-AVX512 /Qopt-prefetch[=n] on Windows), for n=0 (no prefetching) to 5 (maximal prefetching). Using a
value of n=5 enables aggressive compiler prefetching, disregarding any hardware prefetching, for strided loads/stores
and indexed loads/stores that appear inside loops. Using a value of n=2 reduces the amount of compiler prefetching
and restricts it to direct memory accesses that the compiler's heuristics determine the hardware prefetcher may not be
able to handle well. It is recommended to try values of n=2 to 5 to determine the best prefetching strategy
for a particular application. It is also possible to use the -qopt-prefetch-distance=n1[,n2] (/Qopt-prefetch-
distance=n1[,n2] on Windows) option to fine-tune application performance.
• Useful values to try for n1: 0,4,8,16,32,64.
• Useful values to try for n2: 0,1,2,4,8.
Loop-nests that have a relatively low trip-count value at runtime in hotspots can sometimes lead to sub-optimal AVX-
512 performance unless the trip-count is conveyed to the compiler. In many such cases, the compiler will be able to
generate better code and deliver better performance if values of loop trip-counts, loop-strides, and array extents
(such as for Fortran multi-dimensional arrays) are all known to the compiler. If that is not possible, it may be useful to
add appropriate loop_count pragmas to such loops.
Interprocedural optimization (IPO) is enabled using the option -ipo (/Qipo on Windows). This option can be enabled
on all the source-files of the application or it can be applied selectively to the source files containing the application
hot-spots. IPO permits inlining and other inter-procedural optimizations to happen across these multiple source files.
In some cases, this option can significantly increase compile time and code size. Using the option -inline-factor=n
(/Qinline-factor:n on Windows) controls the amount of inlining done by the compiler. The default value of n is 100,
indicating 100%, or a scale factor of 1. For example, if a value of 200 is specified, all inlining options that define upper
limits are multiplied by a factor of 2, thus enabling more inlining than the default.
Profile-guided optimizations (PGO) are enabled using the options -prof-gen and -prof-use (/Qprof-gen and /Qprof-use
on Windows). Typically, using PGO increases the effectiveness of using IPO.
The option -fp-model name (/fp:name on Windows) controls tradeoffs between performance, accuracy and
reproducibility of floating-point results at a high level. The default value for name is fast=1. Changing it to fast=2
enables more aggressive optimizations at a slight cost in accuracy or reproducibility. Using the value precise for name
disallows optimizations that might produce slight variations in floating-point results. When name is double, extended
or source, intermediate results are computed in the corresponding precision. In most situations where enhanced
floating-point consistency and reproducibility are needed -fp-model precise -fp-model source (/fp:precise /fp:source
on Windows) are recommended.
The option -fimf-precision=name (/Qimf-precision=name on Windows) is used to set the accuracy for math library
functions. The default is OFF, which means that the compiler uses its own default heuristics. Possible values of name
are high, medium, and low. Reduced precision might lead to increased performance and vice versa, particularly for
vectorized code. The options -[no-]prec-div and -[no-]prec-sqrt improve [reduce] the precision of floating-point divides
and square root computations, which may slightly degrade [improve] performance. For more details on floating-point
options, see Consistency of Floating-Point Results using the Intel® Compiler (2018).
The option -[no-]ansi-alias (/Qansi-alias[-] on Windows) enables [disables] ANSI and ISO C Standard aliasing rules. By
default, this option is enabled on Linux, but disabled on Windows. On Windows, especially for C++ programs, adding
/Qansi-alias to the compilation options enables the compiler to perform additional optimizations, particularly taking
advantage of the type-based disambiguation rules of the ANSI standard, which state, for example, that pointer and
float variables do not overlap.
If the optimization report specifies that compiler optimizations may have been disabled to reduce compile-time, use
the option -qoverride-limits to override such disabling in the compiler and ensure optimization is applied. This can
sometimes be important for applications, especially ones with functions that have big bodies. Note that using this
additional option may increase compile time and compiler memory usage significantly in some cases.
The list below shows a sampling of loop-level controls available for fine-tuning optimizations - including a way to turn
off a particular transformation reported by the compiler.
• #pragma simd reduction(+:sum)
— The loop is transformed as is, no other loop-optimizations will change the simd-loop.
• #pragma loop_count min(220) avg (300) max (380)
— Fortran syntax: !dir$ loop count(16)
• #pragma vector aligned nontemporal
• #pragma novector // to suppress vectorization
• #pragma unroll(4)
• #pragma unroll(0) // to suppress loop unrolling
• #pragma unroll_and_jam(2) // before an outer loop
• #pragma nofusion
• #pragma distribute_point
— If placed as the first statement right after the for-loop, distribution will be suppressed for that loop.
— Fortran syntax: !dir$ distribute point
• #pragma prefetch *:<hint>:<distance>
— Apply uniform prefetch distance for all arrays in a loop.
• #pragma prefetch <var>:<hint>:<distance>
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are
not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimization in the product are intended for
use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding
the specific instruction sets covered by this notice.
Notice revision #20110804
CHAPTER 18
INTEL® ADVANCED VECTOR EXTENSIONS 512 - FP16
INSTRUCTION SET FOR INTEL® XEON® PROCESSORS
18.1 INTRODUCTION
The Intel® AVX-512 FP16 instruction set architecture supports a wide range of general purpose numeric operations for
16-bit half-precision IEEE-754 floating-point and complements the existing 32-bit and 64-bit floating-point
instructions already available in Intel Xeon processors. This instruction set architecture also provides complex-valued
native hardware support.
This instruction set architecture is ideal for numeric operations where reduced precision can be used, such as signal
and media processing. For example, wireless signal processing operations such as beam-forming, precoding, and
minimum mean squared error (MMSE) perform well with this ISA. Furthermore, traditional signal processing such as
real or complex-valued fast Fourier transform (FFTs) also works well with this instruction set. The advantage of using
reduced precision in these cases is that because fewer bits are processed for each element, the overall compute
throughput can be increased, allowing precision and speed to be traded against each other.
18.3 OVERVIEW
In this chapter, we describe the addition of the FP16 ISA for Intel AVX-512 into the Intel Xeon processor family to
handle IEEE 754-2019-compliant half-precision floating-point operations (also known officially as binary16, or
unofficially as FP16). This instruction set is general-purpose and can be used for all numeric operations that could be
reasonably expected, including numeric operations (add, subtract, divide, multiply), fused operations (for example,
fused multiply-add), comparisons, conversions to and from other data types, and many more. Broadly, the FP16
instruction set mirrors the floating-point support that is already available in Intel Xeon processors for 32-bit (FP32)
and 64-bit (FP64), although there are a few exceptions to this, which will be noted where appropriate. There is one
notable new feature of FP16 when compared to existing FP32 and FP64 instruction sets: the addition of native
complex-value support for interleaved FP16 data, which is useful in scientific computing and signal processing.
The two major advantages of using the FP16 instruction set compared to other floating-point formats are increased
execution throughput and reduced storage requirements. Half-precision floating-point values only require 16 bits for
storing each value, as opposed to the 32 or 64 bits needed for other common IEEE floating-point formats. This allows
FP16 to handle twice as many operations per each clock cycle compared to FP32, and four times as many compared
to FP64. Similarly, the reduced size means that more values can be stored in a given memory region compared to the
other formats, increasing the effectiveness of the registers and the cache hierarchy. The disadvantages are the
reduced range and precision. It is the responsibility of the programmer to decide whether this floating-point format
is suitable for a certain application.
Half-precision floating-point is useful for building systems where the dynamic range of floating-point is required but a
lower numeric precision can be easily tolerated and traded for higher compute performance. Typical applications for
half-precision floating-point include signal processing, media or video processing, artificial intelligence, and machine
learning.
Historically, some limited support for half-precision data types was available in processors from the 3rd generation
Intel® Core™ processor onwards, but the operations were restricted to converting between half-precision and FP32
values. On older platforms, all numeric operations had to be implemented using higher precision formats and down-
converted on completion. Those instructions were useful for compatibility with other platforms (for example, Intel®
GPUs), but did not realize the higher compute performance benefits brought about by FP16.
IEEE FP16 is not the only 16-bit floating-point format. Another common type is bfloat16, which is primarily used in
artificial intelligence and machine learning. Intel Xeon processors support some bfloat16 operations, including type
conversions and a few limited numeric operations, but not the full range of general-purpose operations that are
supported in FP16 for Intel AVX-512. This chapter describes only the instruction set relating to IEEE 754-2019.
This chapter covers both the general-purpose instruction set as well as the new complex-valued instructions. We then
look at the numeric implications of using FP16 and discuss how to write optimal code sequences for some common
operations.
The examples provided in this document use the intrinsic and data type support provided as part of the Intel® OneAPI
DPC++ Compiler.
256-bit AVX2 register __m256h 16 x FP16 values, or 8 x complex FP16 values (CFP16)
512-bit AVX-512 register __m512h 32 x FP16 values, or 16 x complex FP16 values (CFP16)
The complex instructions operate on standard SIMD vector types, such as __m128h, but internally those instructions
treat the register as sets of complex-valued pairs, as shown in Figure 18-1. We refer to a complex pair of FP16 values
as `CFP16'. The CFP16 type is laid out as though it were an array of two FP16 values, or a C++ type such as
std::complex<_Float16>.
Figure 18-1. Layout of a 128-Bit Register Representing Four Complex FP16 (CFP16) Values
In the latest Intel OneAPI compilers, 16-bit floating-point literals can be created by suffixing a value with f16. For
example:
_Float16 value = 12.34f16;
_mm256_add_ph    Add a pair of 16xFP16 vector registers to form a result containing 16xFP16 outputs.
_mm512_add_ph    Add a pair of 32xFP16 vector registers to form a result containing 32xFP16 outputs.
_mm256_fmadd_ph  Multiply a pair of 16xFP16 vector registers and add the result to a third vector register of 16xFP16 values, forming a result containing 16xFP16 vector elements.
_mm512_rcp_ph    Compute the reciprocal of a vector register containing 32xFP16 values, generating an output of another vector register containing 32xFP16 values.
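A minimal sketch of how these element-wise intrinsics are used (assuming <immintrin.h>, a compiler with Intel AVX-512 FP16 support, and array lengths that are multiples of 32; the function name is illustrative) computes out = a*x + y with the 512-bit fused multiply-add:

void axpy_ph(const _Float16* x, const _Float16* y, _Float16* out, _Float16 a, int n)
{
    __m512h va = _mm512_set1_ph(a);                 // broadcast the scalar a to 32 FP16 lanes
    for (int i = 0; i < n; i += 32) {
        __m512h vx = _mm512_loadu_ph(x + i);
        __m512h vy = _mm512_loadu_ph(y + i);
        _mm512_storeu_ph(out + i, _mm512_fmadd_ph(va, vx, vy));   // a*x + y per element
    }
}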
Note that complex-operation (_pch) intrinsics are only provided for multiply and fused multiply-add operations, since
these require special hardware support. No intrinsics are provided for operations like addition, since the existing
add_ph intrinsic behaves correctly for those without extra support requirements.
For a complete list of all the intrinsics provided as part of FP16, refer to the Intel® Intrinsics Guide.
In the remainder of this chapter where the names of intrinsics are given, typically only the 128-bit variant is shown.
Because AVX-512-FP16 supports VL encoding, all three length variants of the intrinsics are available (i.e., 128-bit, 256-
bit, 512-bit).
The conjugate of a complex number is formed by negating its imaginary component. A common operation with
conjugation is to multiply a complex number with a conjugate of another complex number. Conjugation in FP16 is
supported using three classes of intrinsic, as illustrated in Table 18-3.
_mm_fcmul_pch Compute the multiplication of a conjugated value with another complex value.
Both the multiply and the FMA are able to perform the conjugation as part of the instruction operation itself. It is not
necessary to conjugate the value first. For example, an _mm_fcmul_pch intrinsic is functionally equivalent to:
_mm_mul_pch(_mm_conj_pch(lhs), rhs)
However, _mm_fcmul_pch will execute in fewer cycles than calling that sequence explicitly. When the compiler
notices separate conjugate and multiply intrinsics being used, it fuses them into a single conjugate-multiply.
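A minimal sketch of a conjugate multiply-accumulate loop using the complex intrinsics (assuming <immintrin.h>, interleaved real/imaginary FP16 data, and lengths that are multiples of 16 FP16 values; the function name is illustrative):

void conj_multiply_accumulate(const _Float16* a, const _Float16* b, _Float16* acc, int n)
{
    for (int i = 0; i < n; i += 16) {      // 16 FP16 values = 8 CFP16 values per 256-bit register
        __m256h va   = _mm256_loadu_ph(a + i);
        __m256h vb   = _mm256_loadu_ph(b + i);
        __m256h vacc = _mm256_loadu_ph(acc + i);
        // Fused conjugate multiply-add; see the Intel Intrinsics Guide for the exact
        // conjugation convention of the operands.
        vacc = _mm256_fcmadd_pch(va, vb, vacc);
        _mm256_storeu_ph(acc + i, vacc);
    }
}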
Note that masks also control whether faults within an instruction are suppressed. If an operation generates a fault in
a particular element, but the element's operation has been disabled by a zero bit in the mask, then the fault is not
reported.
Some instructions in Intel AVX-512 can generate mask registers, and with FP16, these are normally the result of a
comparison operation. For example, consider the following code snippet:
whichElementsAreLess = _mm512_cmp_ph_mask(lhs, rhs, _CMP_LT_OS);
In this example, every element of the left-hand vector is compared to see if it is less than the corresponding element
in the right-hand vector. If the left element is less than the right element, then a 1 is generated in the mask bit output,
otherwise a zero is emitted. This comparison instruction allows all the major binary comparison operations to be
performed between two vectors.
The FP16 instruction set also provides support to test for special values using the _mm_fpclass_ph_mask instruction.
This instruction takes a special immediate value that directs the instruction to the numeric classes to look for in the
vector register (for example, infinities, NaN, zero, denormal). This instruction is often used in combination with other
instructions to remove special case values from a register and replace them with something different. For example,
the following code snippet removes NaN values and replaces them with zero:
__mmask8 whichAreNan = _mm_fpclass_ph_mask(values, QUIET_NAN | SIGNAL_NAN);
__m128h valuesWithNoNan = _mm_mask_blend_ph(whichAreNan, values, __m128h());
In Intel AVX-512, there is a special instruction that does direct replacement of special values with known constants
called _mm_fixupimm_ps/pd, but this is unavailable in the FP16 instruction set.
No direct numeric support is provided for complex operations such as addition, subtraction, and real-valued scaling,
but their standard real-valued equivalent instructions can be used instead. However, if such an operation has to be
masked on a per-complex-element basis, then the incoming complex-valued mask needs to be expanded into pairs of
identical bits, one pair per complex-element. An example of this is illustrated in Figure 18-4. Note that the incoming
mask bit, which is per CFP16 element, needs to be expanded to duplicate each bit for the real-valued intrinsic.
Figure 18-4. Using a Real-Valued FP16 Vector Operation for Implementing a Masked Complex Addition
The operation to expand the incoming complex-mask-bits to generate real-valued mask could be performed in
numerous different ways, but one efficient way to achieve this operation is shown in Example 18-1. This code
fragment uses the fast mask-to-vector and vector-to-mask instructions to effect the upscaling of the bit-mask
elements.
Example 18-1. Converting a Complex-Valued Mask to a Real-Valued Mask by Duplicating Adjacent Bits
__mmask8 getRealMaskFromComplexMask(__mmask8 m)
{
// 4 incoming bits representing the 4 complex elements in a 128-bit register.
// Each mask bit is converted into an entire element in a vector register
// where a 0 mask bit generates 32 zero bits and a 1 mask bit generates 32 one bits. For example
// 0010 -> [000....000], [000...000], [111....111], [000....000]
auto wholeElements = _mm_movm_epi32(m);
// Each complex element can now be treated as a pair of 16-bit elements instead,
// and the MSB of each 16-bit unit can be extracted as a mask bit in its own right.
return _mm_movepi16_mask(wholeElements);
}
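A minimal sketch of the masked complex addition from Figure 18-4, built on the helper above (the function name is illustrative):

__m128h masked_complex_add(__m128h src, __mmask8 complexMask, __m128h lhs, __m128h rhs)
{
    // Duplicate each complex-element mask bit so that it covers both FP16 halves.
    __mmask8 realMask = getRealMaskFromComplexMask(complexMask);
    // The standard real-valued masked add now behaves as a per-complex-element add.
    return _mm_mask_add_ph(src, realMask, lhs, rhs);
}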
It may also be necessary to perform a similar operation in reverse, where pairs of bits representing adjacent FP16
values need to be reduced in some way into a single bit representing the complete complex element (e.g., AND, OR).
For example, if two complex vectors must be compared for equality then the individual FP16 elements must be
compared for equality first, and then if two adjacent mask bits are both set (that is, the logical AND of those bits), then
the complex element as a whole must be equal. This comparison test is illustrated in Figure 18-5. Note that some of
the sub-elements in each CFP16 do compare equal when using the _mm_cmp_ph_mask intrinsic, but both elements
in each CFP16 value must be equal for the complex values to be truly equal.
One implementation of the function to combine adjacent mask bits using an AND operation is shown in Example 18-2.
Like the example above, it uses the mask-to-vector and vector-to-mask instructions to good effect.
Example 18-2. Function for Converting from a Real-Valued Mask to a Complex-Valued Mask By AND-
Combining Adjacent Bits
__mmask8 getComplexMaskFromRealMask_AND(__mmask8 m)
{
// 8 incoming bits representing the 8 real-valued elements in a 128-bit register.
// Broadcast the bits into 8-bit elements of all 1's or all 0's.
auto wholeElements = _mm_movm_epi8(m);
// Extract a single mask bit from each 16-bit element, which is the logical AND of the
// MSBs of its two incoming 8-bit sub-elements. Because the movm above generated all-zero
// or all-one bytes, the only combinations of values in each 16-bit unit are both all
// zero, both all one, or one of each. The AND of the MSBs can only be 1 when both 8-bit
// sub-elements are all ones, which is equivalent to comparing the 16-bit block for
// equality with an all-ones value.
const auto allOnes = _mm_set1_epi16(-1);
return _mm_cmp_epi16_mask(wholeElements, allOnes, _MM_CMPINT_EQ);
}
Note that the individual mask bits are expanded to 8-bit elements and then compared for equality as 16-bit elements
to combine adjacent elements. There is no need to expand to the same size as the data being processed (that is,
16/32-bit respectively in this case), since the bitwise pairing is independent of the original data element sizes. By using
smaller registers, efficiency is very slightly improved compared to using wider registers.
The adjacent mask bits could also be combined using an OR operation, which might be useful if testing whether a
complex value is NaN (that is, a complex value is NaN if either of its individual elements is NaN). A sequence for
determining an OR of adjacent mask bits is shown in Example 18-3.
Example 18-3. Function for Converting from a Real-Valued Mask to a Complex-Valued Mask by OR-
Combining Adjacent Bits
__mmask8 getComplexMaskFromRealMask_OR(__mmask8 m)
{
auto wholeElements = _mm_movm_epi16(m);
// Similar logic to the AND variant above, but now any 32-bit element which
// isn't zero represents the logical OR of the two adjacent 16-bit elements
// in that 32-bit block.
return _mm_cmp_epi32_mask(wholeElements, __m128i(), _MM_CMPINT_NE);
}
18.5 NUMERICS
Using FP16 instead of the more conventional and widely used FP32 and FP64 formats introduces a number of
interesting numeric behaviors. It is beyond the scope of this chapter to discuss these fully or to describe the numeric
methods required to build FP16 algorithms, but in this section, we highlight a few of the properties and behaviors of
the FP16 number format and the consequences that arise from this.
Certain bitwise operations can be used to manipulate the floating-point numbers without requiring special hardware
support. For example, an absolute operation (that is, convert the value to its positive equivalent) can be implemented
as a bitwise AND of the lower 15 bits of the value, thereby stripping off any sign bits. Similarly, functions like negate,
negative-absolute (nabs), copy-sign, test-sign, and so on can also be implemented using existing Intel AVX-512 bitwise
intrinsics.
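For example, a minimal sketch of an absolute-value operation built this way (assuming <immintrin.h>; the function name is illustrative):

__m128h abs_ph(__m128h v)
{
    // Clear the sign bit (bit 15) of each FP16 element and keep the lower 15 bits.
    const __m128i signMask = _mm_set1_epi16(0x7FFF);
    return _mm_castsi128_ph(_mm_and_si128(_mm_castph_si128(v), signMask));
}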
Value    Encoding    Description
1        0x3c00      One
Inf      0x7c00      Positive Infinity
1. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS, VOL. 63, NO. 6, JUNE 2016: Quantization
Noise Power Estimation for Floating-Point DSP Circuits Gabriel Caffarena, Senior Member, IEEE, and Daniel
Menard, Member, IEEE: https://ieeexplore.ieee.org/document/7407669.
number format. If the result is not representable, the distance between the next lowest representable value and the
next highest representable value is called the unit-in-last-place, or ULP (and less commonly but equivalently, the unit-
of-least-precision), and the actual answer will lie somewhere between the two. When the output value is rounded up
or down to the nearest representable value, it therefore follows that the error in that calculation is no more than 0.5
ULP.
In IEEE 754 floating-point arithmetic, the standard mandates that the result of any hardware implementation will
generate a correctly rounded result that has no more than 0.5 ULP of error when rounding `to nearest' for the
following operations:
• Addition
• Subtraction
• Multiplication
• Division
• Square root
• Fused multiply-add
When rounding up, down, or toward zero, the error is less than 1 ULP.
The fused multiply-add guarantees that the intermediate result of the multiplication is kept in a higher precision form
internally before being added. This means that the result of an FMA operation can have less overall error than doing
a sequence of individual multiply and add instructions.
The Intel AVX-512 FP16 instruction set is compliant with IEEE Standard 754-2019, and arithmetic operations on it are
implemented in the spirit of the Standard (which does not require arithmetic operations for binary16). Consequently,
all the operations listed above yield correctly rounded results. FP16 also contains a few instructions (not defined in
IEEE 754-2019) that produce approximate results to within 0.5625 ULP error. These include:
• Reciprocal (rcp)
• Reciprocal square-root (rsqrt)
Further examination of these special cases is given in later sections.
Note also that complex multiplications (and fused multiplications) have an intermediate quantization to FP16
because, as described earlier, the hardware implements these operations as a sequence of FMAs. Each step of that
sequence introduces quantizing, so the overall effect of the complete complex multiply has some small error.
Whereas the operations listed above are correctly rounded to within 0.5 ULP, as guaranteed by IEEE 754, the
approximation instructions give very slightly less accurate results, but these are still useful, especially when
compared with their equivalents in FP32 and FP64.
In FP32 and FP64 the approximation instructions are quite rough (that is, have a very high ULP error) and can only be
used as a substitute for full-precision operations if combined with one or two Newton-Raphson iterations to refine
the initial approximation to a point where it becomes sufficiently accurate. However, in FP16 the approximation
functions give results that are so close to their full precision results - within 0.5 ULP for 98% of the possible values and
within 0.5625 ULP for the remaining 2% of values - that there is no need to add Newton-Raphson iterations. This
makes the approximation instructions very useful. They give virtually the correct answer, but with substantial benefits
in performance over their full-accuracy counterparts. The following sections examine each approximation instruction
in more detail.
Figure 18-8. Heat-map Showing Relative ULP Error for Different Combinations of Divisor and Dividend
Value Ranges
A division instruction is relatively expensive, taking 24 cycles with a throughput of 16 in 512-bit mode. In contrast,
both multiply and reciprocal are cheap instructions, even when used in sequence, and consequently the
approximation to division is ~3x faster. This speed, coupled with the low error for most FP16 values, means that well-
designed algorithms could use the approximation sequence with little disadvantage.
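A minimal sketch of the approximation sequence (assuming <immintrin.h>; the function name is illustrative):

__m512h approx_div_ph(__m512h dividend, __m512h divisor)
{
    // Multiply by the reciprocal approximation instead of issuing the full division.
    return _mm512_mul_ph(dividend, _mm512_rcp_ph(divisor));
}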
Example 18-4. Function to Implement the 16-Bit Compress Operation on FP16 Vector Elements
__m512h compress_ph(__m512h src, __mmask32 mask, __m512h value)
{
const auto asInt16 = _mm512_castph_si512(value);
const auto src16 = _mm512_castph_si512(src);
const auto comp = _mm512_mask_compress_epi16(src16, mask, asInt16);
return _mm512_castsi512_ph(comp);
}
The strategy followed in the example is to take the incoming vector of FP16 values and recast to a vector of int16_t
values using the castph_si512 intrinsic. The newly cast values are then processed as though they are a vector of
int16_t elements instead. This step does not change the individual 16-bit element blocks; it just moves them around
within the register. On completion, the value is recast back to its original type as a vector of FP16 elements.
Note that the cast operations have no runtime impact and are purely used to inform the compiler that the
programmer is treating the underlying bits in the incoming register as though they were a different type. No type
conversion takes place. In practice, the entire code sequence in the example function collapses into a single compress
instruction.
This same strategy can be used to apply any sort of data movement instructions as a method of moving data around
within FP16 vector registers. The code is somewhat verbose but can be easily hidden away as a library function.
Furthermore, many common utility intrinsics of this sort have already been implemented in the Intel OneAPI compiler
and can be used directly. It should only be necessary to build additional intrinsic support functions for more unusual
operations.
In addition to data movement instructions, other bitwise operations like abs, nabs, negate, copy-sign, and so on, can
also be implemented using the underlying Intel AVX-512 foundation instructions.
does not work when both values are negative. The fast-integer property can be exploited to give a low-latency
minimum (or maximum) function as shown in the code fragment in Example 18-5.
Example 18-5. Function that Performs Fast Floating-Point Minimum Using Integer Instructions
// Assume the inputs are sane values, and either both positive or opposite signs.
__m128h fast_special_min(__m128h lhs, __m128h rhs)
{
const auto lhsInt16 = _mm_castph_si128(lhs);
const auto rhsInt16 = _mm_castph_si128(rhs);
const auto smallest = _mm_min_epi16(lhsInt16, rhsInt16);
return _mm_castsi128_ph(smallest);
}
By using the int16_t minimum instead, the instruction takes only 1 cycle to execute, which is faster than the
equivalent FP16 minimum, and can be used to accelerate latency or dependency-sensitive code. Note however that
the throughput is lower than the equivalent FP16 minimum instruction, so code that is exclusively performing
minimum operations may do better using the FP16 minimum.1
For comparison operations, all data types take the same number of cycles to compare, so using the equivalent int16_t
form of the instruction to perform a comparison makes no difference to performance.
immediate value of 8 returns the normalized mantissa (in the [1,2) range) for positive inputs, and QNaN_Indefinite for
negative inputs (which helps with special case handling).
VSCALEF(a,b) = a*2^floor(b) is used in exponential and power functions; other possible applications include software
division. This operation helps with correct overflow and underflow treatment in the main path. It also includes
support for special exp() cases, thus eliminating the need for branches or other fixup code for this function family.
VFPCLASS is used to test for multiple special case categories (sNaN, negative finite, denormal, -Infinity, +Infinity, -0,
+0, qNAN). This helps when redirecting special inputs to a secondary path (for example, in the pow() function), or to
generate a fixup mask for setting special case results in the main path.
VRNDSCALE (round to specified number of fractional bits, using specified rounding mode) is used in function
argument reduction, and to help generate lookup table indices (also as part of argument reduction). VRNDSCALE is a
generalized form of round-to-integral, so it provides ceil/floor/trunc functionality, and also helps with floating-point
remainder operations.
VREDUCE is closely related to VRNDSCALE: VREDUCE(x, imm) = x - VRNDSCALE(x,imm). This instruction helps further
speed up argument reduction for certain functions (for example, exp2, pow).
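As a hedged illustration of the round/reduce pair in argument reduction (assuming <immintrin.h>; the immediate 0x09 selects zero fraction bits, suppressed precision exceptions, and rounding toward negative infinity):

void split_integral_ph(__m512h x, __m512h& integral, __m512h& remainder)
{
    integral  = _mm512_roundscale_ph(x, 0x09);    // floor(x) via VRNDSCALE
    remainder = _mm512_sub_ph(x, integral);       // x - floor(x); VREDUCE with the same immediate computes this directly
}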
The existing Intel AVX-512 permute operations (VPERM, VPERMT2, VPERMI2 for 16-bit and 32-bit data) provide fast
vector gather support for those implementations that need lookup tables (up to 32 16-bit entries for VPERMW, up to
64 16-bit entries for VPERMT2W/VPERMI2W operations).
CHAPTER 19
CRYPTOGRAPHY & FINITE FIELD ARITHMETIC ENHANCEMENTS
Several instruction extensions designated for acceleration of cryptography flows and finite field arithmetic are
available, beginning with the Ice Lake Client microarchitecture. The ISA extensions include Vector AES,
VPCLMULQDQ, Galois Field New Instructions (GFNI), and AVX512_IFMA, also known as VPMADD52. The following
sections describe these extensions and provide examples and simple comparisons to previous implementation code.
See the Intel® 64 and IA-32 Architectures Software Developer’s Manual for the complete instruction definitions.
Intel implements support for the most common cryptography algorithms, supporting main public libraries and
commonly used software. The following sections describe the instructions briefly.
Legacy Intel® AES-NI - AES ECB Encryption:

// rcx = pointer to Key Expansion
movdqa 16*1(%rcx), %xmm9
// xmm1 - xmm8 - 8 blocks of AES
// 1st round of AES for 8 blocks
aesenc %xmm9, %xmm1
aesenc %xmm9, %xmm2
aesenc %xmm9, %xmm3
aesenc %xmm9, %xmm4
aesenc %xmm9, %xmm5
aesenc %xmm9, %xmm6
aesenc %xmm9, %xmm7
aesenc %xmm9, %xmm8
movdqa 16*2(%rcx), %xmm9
// 2nd Round of AES for 8 Blocks
aesenc %xmm9, %xmm1
aesenc %xmm9, %xmm2
aesenc %xmm9, %xmm3
aesenc %xmm9, %xmm4
aesenc %xmm9, %xmm5
aesenc %xmm9, %xmm6
aesenc %xmm9, %xmm7
aesenc %xmm9, %xmm8

Vector AES - AES ECB Encryption:

// rcx = pointer to Key Expansion
// broadcasting key to zmm0
vbroadcasti64x2 1*16(%rcx), %zmm0
// 1st round of AES for 8 blocks
vaesenc %zmm0, %zmm1, %zmm1
vaesenc %zmm0, %zmm2, %zmm2
vbroadcasti64x2 2*16(%rcx), %zmm0
// 2nd Round of AES for 8 Blocks
vaesenc %zmm0, %zmm1, %zmm1
vaesenc %zmm0, %zmm2, %zmm2

1. https://www.intel.com/content/dam/doc/white-paper/advanced-encryption-standard-new-instructions-set-paper.pdf
The code above demonstrates AES encryption in ECB mode of operation on 8 parallel buffers, implemented with
legacy Intel AES-NI vs. Vector AES. The same acceleration can be applied to other modes of operation, such as
AES-CTR and AES-CBC, and also to more elaborate schemes such as AES-GCM. The latter requires fast computation of
a hash function, namely GHASH, which can be accelerated using the new VPCLMULQDQ instruction.
19.2 VPCLMULQDQ
Carry-less multiplication (PCLMULQDQ) was previously introduced on the Intel® Core™ processor family
based on Westmere microarchitecture1. In newer architectures, beginning with the Ice Lake Client microarchitecture,
Intel introduces a vectorized form of PCLMULQDQ, namely VPCLMULQDQ, offering up to 4x acceleration compared
to the legacy instruction. The new instruction performs polynomial multiplication over binary fields, which is used in
current cryptography algorithms such as AES-GCM. The new instruction may also be useful for Post-Quantum
Cryptography submissions such as BIKE, which further emphasizes the importance of VPCLMULQDQ. A common use
case is GHASH computation, where four different carry-less multiplications are performed within a single instruction
using the wide 512-bit registers. This use case improves the performance of AES-GCM, which is the main mode of
operation used with AES.
1. https://software.intel.com/en-us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-comput-
ing-the-gcm-mode
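As a hedged illustration (assuming <immintrin.h> and VPCLMULQDQ support; the function names are illustrative), the intrinsic below performs four independent 64x64-bit carry-less multiplications, one per 128-bit lane of the 512-bit registers, with the immediate selecting which 64-bit half of each lane is used, exactly as in the legacy PCLMULQDQ:

__m512i clmul_lo(__m512i a, __m512i b) { return _mm512_clmulepi64_epi128(a, b, 0x00); }   // low halves of both operands
__m512i clmul_hi(__m512i a, __m512i b) { return _mm512_clmulepi64_epi128(a, b, 0x11); }   // high halves of both operands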
.LAFFINE_OUT:
.byte 0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7,0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7
.byte 0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7,0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7
.byte 0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7,0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7
.byte 0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7,0x19,0x8b,0x6c,0x1e,0x51,0x8e,0x2d,0xd7
.globl SM4_ENC_ECB_AVX512_GFNI
SM4_ENC_ECB_AVX512_GFNI:
vmovdqa64 .LAFFINE_IN(%rip), %zmm10
vmovdqa64 .LAFFINE_OUT(%rip), %zmm11
…
/* Load data swapped LE-BE in transposed way - each block's double word is found on
different AVX512 register
*/
.Rounds:
// Initial xor between the 2nd, 3rd, 4th double word to key
vpbroadcastd 4*0(key), %zmm6
vpternlogd $0x96, %zmm1, %zmm2, %zmm3
vpxorq %zmm3, %zmm6, %zmm6
/* Sbox phase */
vgf2p8affineqb $0x65, %zmm10, %zmm6, %zmm6
vgf2p8affineinvqb $0xd3, %zmm11, %zmm6, %zmm6
/* Done Sbox , Linear rotations start xor with 1st double word input*/
vprold $2, %zmm6, %zmm12
vprold $10, %zmm6, %zmm13
vprold $18, %zmm6, %zmm7
vprold $24, %zmm6, %zmm14
vpternlogd $0x96, %zmm6, %zmm12, %zmm0
vpternlogd $0x96, %zmm13, %zmm7, %zmm14
vpxord %zmm14, %zmm0, %zmm0
/* Linear part done - round complete */
1. https://www.openssl.org/
CHAPTER 20
INTEL® ADVANCED MATRIX EXTENSIONS (INTEL® AMX)
This chapter aims to help low-level DL programmers optimally code to the metal on Intel® Xeon®
Processors based on Sapphire Rapids SP and Emerald Rapids microarchitectures. It extends the public documentation
on optimizing DL code with DL Boost instructions in Section 20.8.
It explains how to detect processor support in Intel® Advanced Matrix Extensions (Intel® AMX)
architecture (Section 20.1). It provides an overview of the architecture (Section 20.2) and presents Intel AMX
instruction throughput and latency (Section 20.3). It also discusses software optimization
opportunities for Intel AMX (Section 20.5 through Section 20.18), TILECONFIG/TILERELEASE and compiler ABI
(Section 20.19), Intel AMX state management and system software aspects (Section 20.20), and the use of Intel AMX
for higher precision GEMMs (Section 20.21).
TILELOADD 8 45
TILELOADDT1 33 48
NOTE
Due to the high latency of the LDTILECFG instruction, we recommend issuing a single pair of
LDTILECFG and TILERELEASE operations per Intel AMX-based DL layer implementation.
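A minimal sketch of this recommendation (assuming AMX-TILE intrinsic support; the TileConfig structure follows the 64-byte LDTILECFG memory format, and the tile dimensions shown are illustrative):

#include <immintrin.h>
#include <cstdint>

struct alignas(64) TileConfig {
    uint8_t  palette_id;      // 1 selects the standard 8-tile palette
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];       // bytes per row for each tile
    uint8_t  rows[16];        // rows for each tile
};

void run_amx_layer(/* layer arguments */)
{
    TileConfig cfg{};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   // e.g., tmm0 configured as 16 rows x 64 bytes
    // ... configure the remaining tiles used by the layer's kernel ...
    _tile_loadconfig(&cfg);                // single LDTILECFG per DL layer implementation

    // ... the Intel AMX compute kernel for the entire layer ...

    _tile_release();                       // single TILERELEASE when the layer is done
}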
20.5.1 NOTATION
The following notation is used for the matrices (A, B, C) and the dimensions (M, K, N) in matrix
multiplication (GEMM).
Lines 6—10 in Example 20-1 illustrate how a tile is loaded from memory.
Example 20-1. Pseudo-Code for the TILEZERO, TILELOAD, and TILESTORE Instructions
template<size_t rows, size_t bytes_cols> class tile {
public:
friend void TILEZERO(tile& t) {
memset(t.v, 0, sizeof(t.v));
}
friend void TILELOAD(tile& t, void* src, size_t bytes_stride) {
for (size_t row = 0; row < rows; ++row)
for (size_t bcol = 0; bcol < bytes_cols; ++bcol)
t.v[row][bcol] = static_cast<int8_t*>(src)[row*bytes_stride + bcol];
}
friend void TILESTORE(tile& t, void* dst, size_t bytes_stride) {
for (size_t row = 0; row < rows; ++row)
for (size_t bcol = 0; bcol < bytes_cols; ++bcol)
static_cast<int8_t*>(dst)[row*bytes_stride + bcol] = t.v[row][bcol];
}
template <class TC, class TA, class TB>
friend void tdp(TC &tC, TA &tA, TB &tB);
private:
int8_t v[rows][bytes_cols];
};
template <class TC, class TA, class TB> void tdp(TC &tC, TA &tA, TB &tB); // tC = tC + tA x tB (definition elided)
For the sake of readability, a tile template class abstraction is introduced. The number of rows in the tile and the
number of column bytes per row parametrize the abstraction.
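A minimal usage sketch of this abstraction (A_mem, B_mem, and C_mem are assumed pointers to appropriately sized, re-laid-out buffers; the tile dimensions and 64-byte row strides are illustrative):

void single_tile_multiply(void* A_mem, void* B_mem, void* C_mem)
{
    tile<16, 64> tA;   // 16 rows x 64 bytes: a 16x64 block of int8 A elements
    tile<16, 64> tB;   // 16 rows x 64 bytes: a re-laid-out 64x16 block of int8 B elements
    tile<16, 64> tC;   // 16 rows x 64 bytes: a 16x16 block of int32 accumulators
    TILEZERO(tC);
    TILELOAD(tA, A_mem, 64);    // 64-byte stride between consecutive rows
    TILELOAD(tB, B_mem, 64);
    tdp(tC, tA, tB);            // tC = tC + tA x tB
    TILESTORE(tC, C_mem, 64);
}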
The following tables illustrate the data re-layout process for a 64x16 int8 B matrix and a 32x16 bfloat16 B matrix
(corresponding to the maximum-sized B-tile):
Example 1-4
Example 1-5
Example 1-6
Example 1-7
Row0: {0, 16, 32, 48, 1, 17, 33, 49, 2, 18, 34, 50, 3, 19, … 15, 31, 47, 63}
Row1: {64, 80, 96, 112, 65, 81, 97, 113, … 79, 95, 111, 127}
Row2: {128, 144, 160, 176, 129, 145, … 143, 159, 175, 191}
Row3: {192, 208, 224, 240, 193, … 207, 223, 239, 255}
Row4: {256, 272, 288, 304, 257, … 271, 287, 303, 319}
Row5: {320, 336, 352, 368, 321, … 335, 351, 367, 383}
Row6: {384, 400, 416, 432, 385, … 399, 415, 431, 447}
Row7: {448, 464, 480, 496, 449, … 463, 479, 495, 511}
Row8: {512, 528, 544, 560, 513, … 527, 543, 559, 575}
Row9: {576, 592, 608, 624, 577, … 591, 607, 623, 639}
...
Row14: {896, 912, 928, 944, 897, … 911, 927, 943, 959}
Row15: {960, 976, 992, 1008, 961, … 975, 991, 1007, 1023}
Data type_t is the type being operated upon, i.e., signed/unsigned int8 or bfloat16. For the description of KPACK, see
Section 20.5.5. The tile template class and the three functions that operate on it are the same as the ones introduced
in Example 20-7. TILEZERO(t) resets the contents of tile t to 0. TILELOAD(t, src, stride) loads tile t with the
contents of data at src, with a stride of stride between consecutive rows. TILESTORE(t, dst, stride) stores the contents
of tile t to dst, with a stride of stride between consecutive rows. TDP(tC, tA, tB) performs a matrix multiplication
equivalent to tC=tC+tA×tB. In reality, tiles are defined by known compile-time integers, and the actual code operating
on tiles looks slightly different. Please visit the GitHub Repository for proper usage.
Example 20-8 is a simple implementation of the GEMM of the matrices stored in A_mem and B_mem.
20.5.5 OPTIMIZATIONS
While both approaches yield correct results, there are K/TILE_K×N_ACC B tile loads in the reference
implementation, versus K/TILE_K×N_ACC×M_ACC B tile loads in the implementation presented in this section.
The number of A tile loads is identical.
This approach is also characterized by excessive pressure on the memory and an increased number of tile loads.
Suppose the B_mem data resides in main memory.
• In the reference implementation, a new chunk of TILE_K×TILE_N B data is read every M_ACC iteration of the inner
loop.
• The inner loop then reuses the read data.
In the current implementation:
• When n_acc == m_acc == 0, a new chunk of TILE_K×TILE_N B data is read every iteration of the inner loop.
• Then the same data is read (presumably from caches) on subsequent iterations of n_acc, m_acc.
• This burst access pattern of reads from main memory results in increased data latency and decreased
performance.
Hence, keeping the K-dimension loop outside the M_ACC and N_ACC loops is recommended.
• The A-tile has been extended to an array of A-tiles (line 2) and pre-read the A tiles for the current K loop iteration
(lines 3–4).
• A pre-read A-tile is used in the tile multiplication (line 9).
• There were K/TILE_K×N_ACC×M_ACC A-tile reads in the reference implementation, while there are only
K/TILE_K×M_ACC A-tile reads in the current implementation.
Hence, preallocation and pre-reading the tiles of the innermost loop (tA[M_ACC] in this case) is
recommended.
The maximum number of tiles used at any given time in this scenario is N_ACC×M_ACC+M_ACC+1 instead of
N_ACC×M_ACC+2 in the reference implementation. Since this optimization requires the
preallocation of additional M_ACC-1 tiles and tiles are a scarce resource, if N_ACC<M_ACC, switching the order of the
N_ACC and M_ACC loops might be beneficial.
This way, it is possible to allocate N_ACC-1<M_ACC-1 additional tiles:
• N_ACC=2, M_ACC=2 (2D accumulator array)
• N_ACC=4, M_ACC=1 (1D accumulator array)
As stated before, the number of A tile loads in lines 3–11 is M_ACC, and the number of B tile loads is N_ACC. Thus, the
total number of tile loads (M_ACC+N_ACC) is 4 in the first scenario vs. 5 in the second one (an increase of 25%), even
though both scenarios perform the same amount of work.
Hence, using 2D accumulator arrays is recommended. Selecting dimensions close to square is strongly recommended
(since x=y minimizes f(x,y)=x+y under the constraint x×y=const).
While the tile loads and stores are placed under conditions inside the main loop (lines 13, 16, 20), these conditions
can be eliminated by sufficiently unrolling the loops.
The rest of this section presents a specific example of GEMM implemented in low-level Intel AMX
instructions, to show the full performance potential of the Intel AMX extensions.
Lines 1-12 in Example 20-14 define the tile configuration for this example and contain information about tile sizes. Tile
configuration should be loaded before executing Intel AMX instructions (line 16). Tile sizes are defined by the
configuration at load time and can’t be changed dynamically (unless TILERELEASE is called). The ‘palette_id’ field in
the configuration specifies the number of logical tiles available; palette_id == 1 means that 8 logical tiles,
named tmm0 through tmm7, are available. This example uses seven logical tiles (tmm4, tmm5 for A, tmm6 for B,
tmm0-tmm3 for C).
The K-loop consists of two iterations (cf. code listing 8.1, line 11) according to the dimensions specified in the
example. Lines 23-34 implement the first iteration, and lines 35-46 the second iteration. Note the interleaving of the
tdp and TILESTORE instructions to hide the high cost of the TILESTORE operation.
Notably, in use cases with a variable M dimension there is an advantage to 1D accumulators. Dimensions up to
N_ACC=6, M_ACC=1 are possible if N is 96 or larger: one tile for A, one for B, and six for the
accumulators.
Activations Layout
Similar to the Intel DL Boost use case, the activations are laid out in a layout obtained from the original layout by the
following procedure:
The procedure, shown on the left side of the diagram below, converts a 3-dimensional tensor into a
2-dimensional matrix.
The procedure shown on the right is identical for the outputs (for example, the activations of the next layer in the
topology).
Weights Layout
Similar to the Intel DL Boost use case, the weights are re-laid out by the following procedure:
The procedure transforms the original 4-dimensional tensor into a series of 2-dimensional matrices (a single matrix is
highlighted in orange in Example 20-16) as illustrated in Figure 20-8 for KH=KW=3, resulting in a series of nine B-
matrices:
The A-matrix subset participating in the matrix-like multiplication depends on the spatial weight element in question
(i.e., the kh,kw coordinates, or the index in the range 0–8 in the previous example). For each weight element, the A-
matrix’s participating rows will interact with the weight element when the filter is slid over the activations. For
example, when sliding the filter over the activations in the previous example, weight element 0 will only interact with
activation elements 0, 1, 2, 5, 6, 7, 10, 11, and 12. It will not, for instance, interact with activation element 4,
because when the filter is applied in such a manner (i.e., weight element 0 interacts with activation element 4),
weight elements 2, 5, and 8 leave the activation frame entirely. The A-matrix subsets for several weight elements are
illustrated in the following figure.
• Use different sized A tiles (and correspondingly C tiles) depending on the current position in A, provided there are
enough free tiles (performing TILECONFIG during the convolution is highly discouraged).
• Define TILE_M without consideration for WC and remove/disregard the “junk” data from the results at the post-
processing stage (code not shown). Care should be taken in this case concerning the advancement of the m index
(line 2) since the current assumption is that every row of every tile is valid (corresponds to a row in the C matrix).
This is no longer true if “junk” data is loaded: a C-tile will have less than TILE_M rows of C.
The loops over the N, MC, and K-dimensions are replaced by loops over cache blocks of N, MC, and K.
Additional loops over the entire N, MC, and K dimensions are added at the outermost level. These loops have a step
size equal to N, MC, and K cache blocks.
In the case of cache-blocking along the K dimension, additional calls to TILELOAD and TILESTORE are required to load
and store intermediate accumulation results. Note that this adds additional memory traffic, especially for int8 output
data types (as Accumulation data type is either int32_t or float). For this reason, it is generally not advisable to block
along the K dimension.
For simplicity, assume the following relationships:
• N is an integer multiple of N_CACHE: an integer multiple of N_ACC*TILE_N.
• MC is an integer multiple of MC_CACHE: an integer multiple of M_ACC*TILE_M. As before, the condition
WC%(TILE_M*M_ACC)==0 still holds.
• K is an integer multiple of K_CACHE: an integer multiple of TILE_K.
Define the following set of operations as the compute kernel of the optimized convolution implementation. First,
initialize the accumulation tiles to zero (line 13) for a M_ACC*TILE_M x N_ACC*TILE_N chunk of the C-matrix. Next,
for each of the KH*KW B-matrices, the matrix multiplication of the corresponding M_ACC*TILE_M x K chunk of the A-
matrix by a K x N_ACC*TILE_N chunk of the B-matrix is performed, each time accumulating to the same set of
accumulation tiles (lines 18–30). Finally, the results are stored in the C-matrix (line 32).
Continue with the computation of a full cache block of C-matrix, ignoring any blocking along the K dimension. First,
the kernel is performed for the first chunks of the A, B, and C cache blocks. Next, the chunks of A and C advance along
the M dimension, and the kernel is repeated with the same chunk set of the B-matrices. The above step is repeated
until the last chunks of A and C in the current cache block have been accessed. Next, the chunks of B and C are
advanced along the N-dimension by N_ACC*TILE_N, and the chunk of A returns to the beginning of its cache block.
Observe the following from the above description of the computation of a full cache block of the C-matrix:
• For each kernel iteration, it is better if the current chunk of matrix A (roughly
KH*M_ACC*TILE_M*K*sizeof(type_t)) fits into the DCU. This allows for maximal data reuse between the partially
overlapping regions of A that need to be accessed by the different B matrices.
• Advancing from one chunk of matrix A to the next, it is better if the current chunk set of the B matrices (in total,
KH*KW*K*N_ACC*TILE_N*sizeof(type_t)) fits into the DCU.
• Advancing from one chunk set of the B matrices to the next, it is better if the current cache block of matrix A fits
into the MLC.
• Advancing from one cache block of matrix A to the next, it is better if the current cache block of the B matrices (in
total, KH*KW*K*N_CACHE*sizeof(type_t)) fits into the MLC.
From these observations, a general cache-blocking strategy is choosing MC_CACHE and N_CACHE to be as large as
possible while keeping the A, B, and C cache blocks in the MLC.
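A minimal sketch of the resulting loop structure (kernel() is a hypothetical stand-in for the compute kernel defined above; all parameters are illustrative, and K-blocking is omitted as recommended):

template <class Kernel>
void blocked_convolution(int MC, int N, int MC_CACHE, int N_CACHE,
                         int M_STEP /* M_ACC*TILE_M */, int N_STEP /* N_ACC*TILE_N */,
                         Kernel kernel)
{
    for (int nb = 0; nb < N; nb += N_CACHE)                    // cache block along N
        for (int mb = 0; mb < MC; mb += MC_CACHE)              // cache block along MC
            for (int n = nb; n < nb + N_CACHE; n += N_STEP)
                for (int m = mb; m < mb + MC_CACHE; m += M_STEP)
                    kernel(m, n);  // zero accumulators, multiply by the KH*KW B-matrices over K, store C
}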
converting between data types) must occur by way of vector registers. Thus, a buffer is needed to store results from
the accumulation tiles and load them into vector registers for
post-processing. Note that if acc_type_t is the same as res_type_t, the C-matrix can store
intermediate results. However, the buffer is small (at most 4KB for the accumulation strategies described in “2D
Accumulator Array vs. 1D Accumulator Array”) and easily fits into the DCU. While it should still be considered when
determining the optimal cache block partitioning, it is unlikely to influence kernel performance strongly.
Figure 20-7. Batching Execution Using Six Layers with Four Instances Per Thread
2 m [0:M:M_ACC×TILE_M] M×K
K×N_ACC×TILE_N
8 k [0:K:TILE_K] MC_CACHE×K
9 n acc [0:N_ACC:1]
M_ACC×TILE_M×TILE_K TILE_K×N_ACC×TILE_N
12 m ac [0:M_ACC:1]
Scenario One:
Consider the following scenario, including M=256, K=1024, and N=256.
Table 20-4 illustrates accessed data sizes:
At the k loop level, the combined sizes of the A and B accessed data will overflow the L1 cache by a factor of two.
Proceeding to the m-level, since m is progressing, new A-data are constantly read (a total of 256kB-32kB=224kB of new
A data), while the same 32kB of B data are accessed repeatedly. Thus, a priority inversion occurs: new A-data
repeatedly placed in the L1 cache are accessed only once, yet they evict the 32kB of B data that are accessed eight
times. Placement of A data in the L1 cache is not beneficial: the next time the same data are accessed will be in the
n loop, after 256kB (8x the L1 cache size) of A data has been read. It is, additionally, detrimental because it causes
repeated eviction of the 32kB of B data that could have been read from the L1 cache eight times.
Scenario Two:
Consider the following scenario, including M=32, K=1024, and N=256. Here, the M-dimension is covered in the m_acc
loop, and the loop over m is redundant. The priority inversion is: as n advances, new B-data (accessed only once)
repeatedly evict 32kB of A-data that could have been read (8 times) from the L1 cache had it not been pushed out by
B-data.
These two basic scenarios can be readily extended to the blocked code in Example 20-20.
2 mb [0:MC:MC_CACHE] M×K
3 kb [0:K:K_CACHE] MC_CACHE×K
4 n [nb:nb+N_CACHE:N_ACC×TILE_N]
[mb:mb+MC_CACHE:M_ACC×TILE K_CACHE×KH×KW×N_ACC×TILE_
5 m MC_CACHE×K_CACHE
_M] N
18 k [kb:kb+K_CACHE:TILE_K]
19 kh [0:KH:1]
TILE_K×KH×KW×N_ACC×TILE_N
20 kw [0:KW:1]
M_ACC×TILE_M×TILE_
21 n acc [0:N_ACC:1]
K TILE_K×N_ACC×TILE_N
24 m ac [0:M_ACC:1]
NOTE
Due to the nature of convolution, the loops over kh, kw reuse most of the A-data.
The innermost loops m_acc, n_acc, kh,kw will access at most M_ACC kB of A data and KH×KW×N_ACC kB of B-data,
which, in some cases (e.g., KH=KW=3, N_ACC=4) might already overflow the L1 cache size. Thus, several opportunities
for priority inversions exist in this more complex loop structure, depending on the parameters in the table above:
• B-data evicting reusable A-data at the kh,kw loops level.
• A-data evicting reusable B-data at the m loop level.
• B-data evicting reusable A-data at the n loop level.
• A-data evicting reusable B-data at the mb loop level.
• B-data evicting reusable A-data at the nb loop level.
Thus, the yellow parts of the input frame are the only ones that should be loaded into A-tiles when processing weight
element kh,kw=0,0. The white parts of the input frame should be ignored. This requires the number of tile rows to be
set to seven, utilizing less than half of the A-tile and reducing B (weights) data reuse by a factor of two. Each A-tile is
now half the size, and seven tiles are required to cover the spatial dimension. Because there are not seven tiles
available, B-tiles must be loaded twice as many times, potentially leading to significant performance degradation,
depending on the size of the weights (this effect is usually inversely proportional to the spatial size of the input frame).
Figure 20-9 shows three A-tiles with sixteen rows and one tile with seven rows to cover the entire spatial dimension
of the convolution.
Each tile is highlighted differently. The green, blue, and orange tiles now load those two “extra” pieces previously
ignored. Those pieces will waste compute resources and take up two rows in the accumulator tiles. The user may
ignore those rows in subsequent computations (e.g., int8-quantization, RELU, etc.), complicating the
implementation. The potential benefit of increased B-data reuse could be dramatic, however.
A temporary buffer is required because the size of the post-processed data is four times smaller; hence, the
convolutional output cannot be written directly to the output buffer.
The GPRs r8, r9, r10, r11, and r14 point to the current location in the A, B, C, temporary_C, and q_bias (which holds
the quantization factors and biases) buffers, respectively.
The macros A_OFFSET(m,k), B_OFFSET(k,n), C_OFFSET(m,n), C_TMP_OFFSET(m,n), Q_OFFSET(n), and
BIAS_OFFSET(n) receive as arguments m,k,n tile indices and return the offset of the data from r8,r9,r10, r11, and r14,
respectively.
In Example 20-22, the code deviates from the previous example by interleaving the current iteration’s
convolutional code with the previous iteration’s post-convolutional code. Temporary_C is
double-buffered, with r11 pointing to the buffer of the current iteration and r12 pointing to the previous iteration’s
buffer. They are exchanged at the end of the iteration.
Example 20-22. An Example of a Short GEMM Fused and Pipelined with Quantization and ReLU
1 ldtilecfg tc // Load tile config
2 mov r15, 192 // A stride
3 mov r13, 128 // B, C_TMP stride
4 TILELOADD tmm5, [r9 + r13*1 + B_OFFSET(0,0)] // Load B [k,n] = [0,0]
5 TILELOADD tmm4, [r8 + r15*1 + A_OFFSET(0,0)] // Load A [m,k] = [0,0]
6 TILEZERO tmm0 // Zero acc [m,n] = [0,0]
7 vcvtdq2ps zmm0, [r12 + C_TMP_OFFSET(0,0) + 0*TILE_N_B] // int32 -> float
8 vmovups zmm1, [r14 + Q_OFFSET(0)] // q-factors for N=0
9 vmovups zmm2, [r14 + BIAS_OFFSET(0)] // biases for N=0
10 vfmadd213ps zmm0, zmm1, zmm2 // zmm0 = zmm0 * q + b
11 vcvtps2dq zmm0, zmm0 // float -> int32
12 vpxord zmm3, zmm3, zmm3 // Prepare zero ZMM
13 vpmaxsd zmm0, zmm0, zmm3 // RELU (int32)
14 tdpbusd tmm0, tmm4, tmm5
15 TILELOADD tmm6, [r9 + r13*1 + B_OFFSET(0,1)] // Load B [k,n] = [0,1]
16 TILEZERO tmm2 // Zero acc [m,n] = [0,1]
17 vpmovusdb [r10 + C_OFFSET(0,0) + 0*TILE_N_B], zmm0 // uint32 -> uint8
18 vcvtdq2ps zmm4 , [r12 + C_TMP_OFFSET(0,0) + 4*TILE_N_B] // int32 -> float
19 vfmadd213ps zmm4 , zmm1 , zmm2 // zmm4 = zmm4 * q + b
20 tdpbusd tmm2, tmm4, tmm6
21 TILELOADD tmm4, [r8 + r15*1 + A_OFFSET(1,0)] // Load A [m,k] = [1,0]
22 TILEZERO tmm1 // Zero acc [m,n] = [1,0]
23 vcvtps2dq zmm4 , zmm4 // float -> int32
24 vpmaxsd zmm4 , zmm4 , zmm3 // RELU (int32)
25 vpmovusdb [r10 + C_OFFSET(0,0) + 1*TILE_N_B], zmm4 // uint32 -> uint8
26 tdpbusd tmm1, tmm4, tmm5
27 TILEZERO tmm3 // Zero acc [m,n] = [1,1]
28 vcvtdq2ps zmm5 , [r12 + C_TMP_OFFSET(1,0) + 0*TILE_N_B] // int32 -> float
29 vfmadd213ps zmm5 , zmm1 , zmm2 // zmm5 = zmm5 * q + b
30 vcvtps2dq zmm5 , zmm5 // float -> int32
31 vpmaxsd zmm5 , zmm5 , zmm3 // RELU (int32)
32 tdpbusd tmm3, tmm4, tmm6
33 TILELOADD tmm5 , [r9 + r13*1 + B_OFFSET(1,0)] // Load B [k,n] = [1,0]
34 TILELOADD tmm4 , [r8 + r15*1 + A_OFFSET(0,1)] // Load A [m,k] = [0,1]
35 vpmovusdb [r10 + C_OFFSET(1,0) + 0*TILE_N_B], zmm5 // uint32 -> uint8
36 vcvtdq2ps zmm6 , [r12 + C_TMP_OFFSET(1,0) + 4*TILE_N_B] // int32 -> float
37 vfmadd213ps zmm6 , zmm1 , zmm2 // zmm6 = zmm6 * q + b
38 tdpbusd tmm0, tmm4, tmm5
39 TILELOADD tmm6, [r9 + r13*1 + B_OFFSET(1,1)] // Load B [k,n] = [1,1]
40 vcvtps2dq zmm6 , zmm6 // float -> int32
41 vpmaxsd zmm6 , zmm6 , zmm3 // RELU (int32)
42 vpmovusdb [r10 + C_OFFSET(1,0) + 1*TILE_N_B], zmm6 // uint32 -> uint8
43 tdpbusd tmm2 , tmm4, tmm6
44 TILELOADD tmm4 , [r8 + r15*1 + A_OFFSET(1,1)] // Load A [m,k] = [1,1]
45 vcvtdq2ps zmm7 , [r12 + C_TMP_OFFSET(0,1) + 0*TILE_N_B] // int32 -> float
46 vmovups zmm8 , [r14 + Q_OFFSET(1)] // q-factors for N=1
47 vmovups zmm9 , [r14 + BIAS_OFFSET(1)] // biases for N=1
48 vfmadd213ps zmm7 , zmm8 , zmm9 // zmm7 = zmm7 * q + b
49 vcvtps2dq zmm7 , zmm7 // float -> int32
50 vpmaxsd zmm7 , zmm7 , zmm3 // RELU (int32)
51 tdpbusd tmm1 , tmm4, tmm5
52 vpmovusdb [r10 + C_OFFSET(0,1) + 0*TILE_N_B], zmm7 // uint32 -> uint8
53 vcvtdq2ps zmm10 , [r12 + C_TMP_OFFSET(0,1) + 4*TILE_N_B] // int32 -> float
54 vfmadd213ps zmm10 , zmm8 , zmm9 // zmm10 = zmm10 * q + b
55 tdpbusd tmm3 , tmm4, tmm6
56 TILELOADD tmm5 , [r9 + r13*1 + B_OFFSET(2,0)] // Load B [k,n] = [2,0]
57 TILELOADD tmm4 , [r8 + r15*1 + A_OFFSET(0,2)] // Load A [m,k] = [0,2]
Application of this algorithm with the parameters laid out in Section 20.8.1, except for a larger TILE_M
(N_ACC=M_ACC=2, TILE_M=16, TILE_K=64, TILE_N=16), on a [256x192] x [192x256] GEMM yielded an 18.5%
improvement in running time vs. the non-interleaved code described in Section 20.11.1.
NOTE
The TILEZERO instruction is considered an Intel AMX compute instruction in this regard.
Suppose a single high-precision cache line (512-bit) is processed for conversion at a time. In that case, there will be
four or two rounds of processing until a single low-precision cache line is generated for 8- or 16-bit inputs,
respectively. Potential problems include:
• the number of loads and stores of the same cache line increases 4X or 2X, respectively.
• the next round of processing of the same cache line may occur after this cache line is evicted from DCU.
One of the optimizations mitigating these performance issues is to collect enough high-precision outputs to convert
the full low-precision cache line in a single round.
The following drawing shows the conversion flow of 32-bit integers to 8-bit integers. Each colored block at the top
represents a single full TILE output (the horizontal dimension is OFMs; the vertical dimension is spatial).
To generate full 512-bit cache lines of 8-bit inputs (bottom), a multiple of 64 OFMs should be collected before
conversion. Accordingly, to generate full cache lines with 16-bit inputs, a multiple of 32 OFMs should be collected.
This often produces better performance results, though it may be viewed as a restriction to convolution-blocking
parameters (in particular, N_ACC).
Example 20-23 shows the conversion code for two blocks of sixteen cache lines of 32-bit floats converted to a single
block of sixteen cache lines of 16-bit bfloats. TMUL outputs are assumed to be placed into a scratchpad spad, and the
conversion result is placed in the next_inputs buffer.
Example 20-23. Two Blocks of 16 Cache Lines of 32-bit Floats Converted to One Block of 16 Cache Lines
of 16-bit BFloat
float* spad;
bfloat_16* next_inputs;
inline unsigned inputs_spatial_dim( void ) {
return /* number of pixels in map */
}
for (int i = 0; i < 16; i++)
{
__m512 f32_0 = _mm512_load_ps(spad);
__m512 f32_1 = _mm512_load_ps(spad + 16*16);
__m512i bf16 = (__m512i)_mm512_cvtne2ps_pbh(f32_1, f32_0);  // reinterpret the __m512bh result as __m512i
_mm512_storeu_si512(next_inputs, bf16);
// Advancement of the spad and next_inputs pointers between iterations is omitted in this excerpt.
}
Figure 20-12 illustrates a trivial topology in which the previous layer feeds the next layer (left).
Figure 20-12. Trivial Deep Learning Topology with Naive Buffer Allocation
A straightforward buffer allocation scheme is illustrated in Figure 20-12, in which the output of layer N is placed into
a dedicated memory buffer, which is then consumed as input by layer N+1. In this scheme, such a topology with L
layers would require L+1 memory buffers, of which only the last is valuable (containing the final results). The rest of
the L memory buffers are single-use and disposable, significantly increasing the application’s memory footprint.
The allocation scheme in Figure 20-13 offers an improved scheme whereby the entire topology only requires two
reusable memory buffers.
Figure 20-13. Minimal Memory Footprint Buffer Allocation Scheme for Trivial Deep Learning Topology
A more complex topology would require more reusable buffers, but this number is significantly smaller than in the
naïve approach. ResNet-50, for example, requires only three reusable buffers (instead of 55). Inception-ResNet-V2
requires only five reusable buffers (instead of over 250). This optimization resulted in a 25% performance
improvement on the int8 end-to-end large-batch throughput run of ResNet-50 v1.5.
• The optimum is probably closer to 100 TMUL ops. At any rate, the developer must check the current CPU
architecture and make sure that the MLC will not overflow.
indices that benefit from CPU caching, the pattern is often considered random or
semi-random. This can make the HW prefetcher less efficient. Since the entire content of the index buffer is already
known, rows soon to be encountered can be prefetched to the DCU.
TILESTORE forwarding to non-TILELOAD instructions via store buffers is supported under one restriction: both
accesses must be of cache-line size (64 bytes).
Forwarding is generally not advised because this mechanism has outliers. To avoid store-to-load forwarding, ensure
enough distance between the two operations, on the order of tens of cycles.
Implementation discussion:
• Lines 1-6 set mask registers k1, k2, k3.
• Lines 7 and 8 put trip counts for primitive blocks in N- and M-dimensions, respectively.
• Lines 9-72 implement the transpose of a primitive block 32x8. It uses 16 ZMM registers (zmm0-zmm15).
• Lines 9-40 implement loading 32 quarter-cache lines into 8 ZMM registers, according to Table 20-7 (numbers are
in bytes):
Each of zmm0 through zmm7 is loaded with four 16-byte quarter-cache lines using broadcast32x4, merged under the
write-masks 0xf0, 0xf00, and 0xf000 for its second, third, and fourth quarters, respectively.
• Lines 41-64 are transpose flow proper. It creates a transposed block 8x32 in registers zmm8-zmm15.
• Lines 65-72 store transposed block 8x32 to the output buffer.
• Lines 17–20 implement simultaneous transpose of four 2x2 blocks of QWORDs (i.e., 2x8 blocks of BF16). It
Table 20-8. Loading Eight Quarter-Cache Lines into Two ZMM Registers
Each of zmm0 and zmm1 is loaded with four 16-byte quarter-cache lines using broadcast32x4, merged under the
write-masks 0xf0, 0xf00, and 0xf000 for its second, third, and fourth quarters, respectively.
The primitive block transposed in this algorithm is 16x8 (i.e., 16 rows, 8 BF16 numbers each), which is transformed
into a 4x32 block (i.e., four rows of 32 BF16 numbers each).
The implementation uses eight ZMM registers and three mask registers.
Input parameters:
• MxN, sizes of the rectangular block to be transposed; it is assumed that M is a multiple of 16 and N is a multiple of eight.
• I_STRIDE is the row size of the input matrix in bytes.
• O_STRIDE is the row size of the output buffer in bytes.
• r8 contains the starting address for the input matrix.
• r9 contains the starting address for the output buffer.
Implementation Discussion:
• Lines 1–6 set mask registers k1, k2, k3.
• Lines 7 and 8 put trip counts for primitive blocks in N- and M-dimensions, respectively.
• Lines 9–36 implement the transpose of a primitive block 16x8. It uses eight ZMM registers (zmm0–zmm7).
• Lines 9–24 implement loading 16 quarter-cache lines into four ZMM registers zmm0-zmm3, according to
Table 20-9 (numbers are in bytes):
Each of zmm0 through zmm3 is loaded with four 16-byte quarter-cache lines using broadcast32x4, merged under the
write-masks 0xf0, 0xf00, and 0xf000 for its second, third, and fourth quarters, respectively.
• Lines 25–32 are the transpose flow proper. It creates a transposed block 4x32 in registers zmm0–zmm3.
• Lines 33–36 store transposed block 4x32 to the output buffer.
Implementation Discussion:
• Lines 1 and 2 put trip counts for primitive blocks in N- and M-dimensions, respectively.
• Lines 3 and 4 implement loading two full cache lines into two ZMM registers, zmm0-zmm1, from consecutive
rows of the input matrix.
• Lines 5—7 implement the re-layout of a primitive block 2x32. It uses five ZMM registers (zmm0–zmm2, zmm30-
zmm31).
• Lines 8 and 9 implement storing two full cache lines in two ZMM registers, zmm1-zmm2, into consecutive
columns of the output matrix.
Figure 20-15. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the M-Dimension
Here the data read and written by each of the three cores is bound by a black rectangle.
It should be noted that in the case of convolutions, limited overlap in the M-dimension of the activations occurs
between neighboring cores. Due to the convolutions, a finite-sized filter is slid over the activations. Thus, the
M-dimension overlap between two neighboring cores is (KH-1)×W (refer to Example 20-30).
• Advantages: When multiple layers in a chain are partitioned by the M-dimension between the same number of
cores, each core has its data in its local cache.
• Disadvantages: All the cores read the B-matrix (or weights in convolutions) entirely, which might pose a
bandwidth problem if the B-matrix is large.
Figure 20-16. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the N-Dimension
Unfortunately, the layer’s output is also partitioned by the N-dimension between the cores, which is incompatible
with the M and N partitioning of the subsequent layer. For visualization, compare the right side of Figure 20-16 to the left sides of Figures 20-15 and 20-16. In this scenario, a core in the subsequent layer is guaranteed to have most of its data come from outside its local caches. This is not the case with K-dimension partitioning (see Section 20.17.3.3), but that also comes at a price.
• Advantages: It may reduce read bandwidth significantly in case of large B / large weights.
• Disadvantages: If the next layer is partitioned by M or N, most of the activations in the next layer will not reside
in the local caches of the corresponding cores.
Figure 20-17. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the K-Dimension
Suppose a layer is partitioned by the N-dimension, and the K-dimension partitions the subsequent layer. In that case,
the activation data will reside in the local caches of the cores in the layer partitioned by the K-dimension. For
visualization, compare the right side of Figure 20-16 with the left side of Figure 20-17. Unfortunately, this comes at a
price: each core prepares partial results of the entire C-matrix.
To obtain final results, either a mutex (or several mutexes) is required to guard the write operations into the C-matrix,
or a reduction operation is needed at the end of the layer. The mutex solution is not advised because threads will be
blocked for a significant time. A reduction runs the risk of being costly since it entails the following:
• A synchronization barrier is required before the reduction.
• Reading a potentially large amount of data during the reduction:
— There are T copies of the C-matrix, where T is the number of threads (the example has three).
— The matrices before the reduction are 2x (bfloat16 datatype) or 4x (int8 datatype) larger than the output C-matrix, because the partial results are kept in the wider accumulation datatype.
— During the reduction, most of each core's data will come from outside its local cache hierarchy.
The collapse clause specifies how many loops within a nested loop should be collapsed into a single
iteration space and divided between the threads. The order of the iterations in the collapsed iteration space is the
same as though they were executed sequentially.
OpenMP automatically uses schedule(static,1) if there is no specified schedule, resulting in the sequential assignment
of loop iterations to threads.
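As a hedged illustration of the collapsed-loop scheme described above (not Example 20-30 itself), the following C/OpenMP sketch uses assumed values for TILE_N, TILE_M, N_ACC, and M_ACC and an explicit schedule(static,1) clause to make the sequential assignment of iterations to threads visible:

#include <omp.h>
#include <stdio.h>

/* Assumed blocking parameters, chosen so that N = 4*N_ACC*TILE_N and M = 4*M_ACC*TILE_M. */
enum { TILE_N = 16, TILE_M = 16, N_ACC = 2, M_ACC = 2,
       N = 4 * N_ACC * TILE_N, M = 4 * M_ACC * TILE_M };

static void compute_block(int n, int m)          /* hypothetical per-block GEMM kernel */
{
    printf("thread %d: n'=%d m'=%d\n", omp_get_thread_num(),
           n / (N_ACC * TILE_N), m / (M_ACC * TILE_M));
}

int main(void)
{
    /* Collapse the two nested loops into a single 4x4 = 16-iteration space and
       divide the iterations between the threads, one iteration at a time. */
    #pragma omp parallel for collapse(2) schedule(static, 1)
    for (int n = 0; n < N; n += N_ACC * TILE_N)
        for (int m = 0; m < M; m += M_ACC * TILE_M)
            compute_block(n, m);
    return 0;
}

With three threads, the round-robin assignment of the 16 collapsed iterations corresponds to the partitioning discussed next.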
If we assume N=4*N_ACC*TILE_N and M=4*M_ACC*TILE_M (the K-dimension is deliberately excluded from consideration due to its problematic nature), there are 4*4=16 iterations in the two nested loops. Now assume the division of iterations between three threads. Table 20-10 shows how the code in Example 20-30 partitions the iterations between threads.
Every cell of the form n',m' contains the indices n'=n/(N_ACC*TILE_N) and m'=m/(M_ACC*TILE_M) from the loops in Example 20-30.
It is clear from Table 20-10 that each of the three threads executes at least one iteration with n’=0,1,2,3 and at least
one iteration with m’=0,1,2,3. This means that every thread reads all of A and B.
By rearranging the work between threads as in the following partitioning, the amount of B read by each thread is reduced by 50%, which might be significant in workloads where B is large. Similarly, the amount of A read can be reduced by 50% by swapping the m' and n' indices for workloads with a large A.
[Table fragment: the rearranged assignment gives Thread 1 iterations 1.0, 1.1, 1.2, 1.3, 3.2, and 3.3.]
The code in Example 20-31 uses Intel AVX-512 to generate num rows of decompressed data, assuming 8-bit elements
and 64 elements per tile row.
The matrix multiplication code will load the decompressed matrix to tiles from decompressed[], an array containing
the decompressed matrix data.
The decompression code uses the Intel AVX-512 data expand operation, as shown in Figure 20-19.
Decompression code for 16-bit elements can be designed in the same way.
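A minimal sketch of the expand-based decompression for one 64-element row of 8-bit data follows (this is not Example 20-31; it assumes the AVX512_VBMI2 VPEXPANDB form of the expand operation, and the packed, mask, and decompressed names are hypothetical):

#include <immintrin.h>
#include <stdint.h>

/* Decompress one row: 'mask' has one bit per output element; set bits receive
   the next packed non-zero byte, clear bits become zero. Returns the number of
   packed bytes consumed so the caller can advance the packed pointer. */
static int decompress_row_8bit(const uint8_t *packed, __mmask64 mask,
                               uint8_t *decompressed)
{
    __m512i row = _mm512_maskz_expandloadu_epi8(mask, packed);
    _mm512_storeu_si512((void *)decompressed, row);
    return (int)_mm_popcnt_u64(mask);
}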
For the best performance, apply the following optimizations:
• Interleaving: Fine-grained interleaving of decompression code and matrix multiplication to overlap Intel AVX-512
decompression with Intel AMX computation.
• Decompress Early: Prepare the decompressed buffer sufficiently ahead of its use by Intel AMX to avoid store-forwarding issues when the tile loads read recently written data.
• Buffer Reuse: Decompressing the full sparse matrix could overflow the CPU caches. For best cache reuse, it is recommended to have a decompressed buffer that can hold two decompressed panels of the sparse matrix. While the matrix multiplication works on one panel, decompress the next panel for the subsequent iteration. In the subsequent iteration, decompress the next panel into the half of the decompressed buffer that is no longer used, and so on (see the sketch after this list).
• Decompress Once: Coordinate the matrix multiplication blocking and loop structure with the decompression
scheme to minimize the number of times the same portion of the sparse matrix is decompressed. For example, if
the B-matrix is sparse, traversing the entire vertical M-dimension decompresses every vertical panel of the B-matrix only once.
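The Buffer Reuse recommendation above can be sketched as a simple ping-pong scheme (decompress_panel and multiply_panel are hypothetical helpers, and PANEL_BYTES is an assumed panel size):

#include <stdint.h>

#define PANEL_BYTES (64 * 1024)                     /* assumed size of one decompressed panel */

void decompress_panel(int panel, uint8_t *dst);     /* hypothetical: AVX-512 expand-based decompression */
void multiply_panel(int panel, const uint8_t *src); /* hypothetical: Intel AMX multiplication with this panel */

void sparse_gemm(int num_panels)
{
    static uint8_t decompressed[2][PANEL_BYTES];    /* holds two decompressed panels */
    if (num_panels <= 0)
        return;
    decompress_panel(0, decompressed[0]);
    for (int p = 0; p < num_panels; p++) {
        if (p + 1 < num_panels)                     /* refill the half that is no longer in use */
            decompress_panel(p + 1, decompressed[(p + 1) & 1]);
        multiply_panel(p, decompressed[p & 1]);     /* multiply with the current panel */
    }
}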
20.19.1 ABI
The tile data registers (tmm0 – tmm7) are volatile. Their contents are passed back and forth between functions
through memory. No tile register is saved and restored by the callee. Tile configuration is also volatile. The compiler
saves and restores tile configuration and tile register contents if the register(s) need to live across the function call.
The compiler can eliminate the save instruction when the configuration content already saved on the stack is unchanged, and reuse the configured content saved on the stack before the call. All functions must configure the tile registers themselves; the tile configuration cannot be assumed to persist across function calls.
For details, refer to the System V Application Binary Interface: Intel386 Architecture Processor Supplement, Version 1.0.
20.19.2 INTRINSICS
Example 20-32. Identification of Tile Shape Using Parameter m, n, k
typedef int _tile1024i __attribute__((__vector_size__(1024), __aligned__(64)));
_tile1024i _tile_loadd_internal(unsigned short m, unsigned short n, const void*base, __SIZE_TYPE__ stride);
_tile1024i _tile_loaddt1_internal(unsigned short m, unsigned short n, const void*base, __SIZE_TYPE__ stride);
_tile1024i _tile_dpbssd_internal(unsigned short m, unsigned short n, unsigned short k, _tile1024i dst, _tile1024i
src1, _tile1024i src2);
_tile1024i _tile_dpbsud_internal(unsigned short m, unsigned short n, unsigned short k, _tile1024i dst, _tile1024i
src1, _tile1024i src2);
_tile1024i _tile_dpbusd_internal(unsigned short m, unsigned short n, unsigned short k, _tile1024i dst, _tile1024i
src1, _tile1024i src2);
_tile1024i _tile_dpbuud_internal(unsigned short m, unsigned short n, unsigned short k, _tile1024i dst, _tile1024i
src1, _tile1024i src2);
_tile1024i _tile_dpbf16ps_internal(unsigned short m, unsigned short n, unsigned short k, _tile1024i dst, _tile1024i
src1, _tile1024i src2);
void _tile_stored_internal(unsigned short m, unsigned short n, void*base, __SIZE_TYPE__ stride, _tile1024i tile);
/// Load tile rows from memory specified by "base" address and "stride" into destination tile "dst".
///
/// \headerfile <immintrin.h>
///
/// This intrinsic corresponds to the <c> TILELOADD </c> instruction.
///
/// \param dst
/// A destination tile. Max size is 1024 Bytes.
/// \param base
/// A pointer to base address.
/// \param stride
/// The stride between the rows' data to be loaded in memory.
void __tile_loadd(__tile1024i *dst, const void *base, __SIZE_TYPE__ stride);
/// Load tile rows from memory specified by "base" address and "stride" into destination tile "dst".
/// This intrinsic provides a hint to the implementation that the data will likely not be reused in the near future
/// and the data caching can be optimized accordingly.
/// \headerfile <immintrin.h>
///
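As a hedged usage sketch of the intrinsics model above, the following assumes the __tile1024i wrapper type that carries the tile shape (row/col) together with the tile data, plus the __tile_zero, __tile_dpbssd, and __tile_stored counterparts of __tile_loadd; compile with Intel AMX support (for example, -mamx-tile -mamx-int8):

#include <immintrin.h>
#include <stdint.h>

#define STRIDE 64

static int8_t  bufA[16 * 64];                /* A: 16 rows x 64 int8 values               */
static int8_t  bufB[16 * 64];                /* B: 16 rows x 64 int8 values (VNNI layout) */
static int32_t bufC[16 * 16];                /* C: 16x16 int32 results                    */

void tile_dot_int8(void)
{
    __tile1024i a = {16, 64};                /* shape: 16 rows, 64 bytes per row */
    __tile1024i b = {16, 64};
    __tile1024i c = {16, 64};

    __tile_loadd(&a, bufA, STRIDE);
    __tile_loadd(&b, bufB, STRIDE);
    __tile_zero(&c);                         /* clear the accumulator tile */
    __tile_dpbssd(&c, a, b);                 /* c += a * b, signed int8 with int32 accumulation */
    __tile_stored(bufC, STRIDE, c);
}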
• State component 17 is used for the 64-byte TILECFG register (XTILECFG state).
• State component 18 is used for the 8192 bytes of tile data (XTILEDATA state).
These are both user-state components, meaning the entire XSAVE feature set can manage them. In addition, it implies
that setting bits 18:17 of extended control register XCR0 by system software enables Intel AMX. If those bits are zero,
an Intel AMX instruction execution results in an invalid opcode exception (#UD).
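A hedged sketch of the corresponding runtime check follows; it assumes CPUID has already confirmed OSXSAVE and Intel AMX support and that the compiler provides _xgetbv (for example, with -mxsave):

#include <immintrin.h>
#include <stdbool.h>

/* Return true only if the OS has enabled both Intel AMX state components
   (XTILECFG = XCR0 bit 17, XTILEDATA = XCR0 bit 18). */
static bool amx_state_enabled(void)
{
    unsigned long long xcr0 = _xgetbv(0);                        /* read XCR0 */
    const unsigned long long amx_bits = (1ULL << 17) | (1ULL << 18);
    return (xcr0 & amx_bits) == amx_bits;
}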
With respect to the XSAVE feature set's INIT optimization, the Intel AMX state is in its initial configuration if the TILECFG register is zero and all tile data are zero.
Enumeration and feature-enabling documentation can be found in Section 20.2.
An execution of XRSTOR or XRSTORS initializes the TILECFG register (resulting in TILES_CONFIGURED = 0) in response
to an attempt to load it with an illegal value. Moreover, an execution of XRSTOR or XRSTORS that is not directed to
load XTILEDATA leaves it unmodified, even if the execution is loading XTILECFG.
It is highly recommended that developers execute TILERELEASE to initialize the tiles at the end of the Intel AMX
instructions code region. More on this is in Section 20.19.
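A minimal sketch of this recommendation, assuming the _tile_release() intrinsic that maps to TILERELEASE:

#include <immintrin.h>

/* Call at the end of an Intel AMX code region so the tile state returns to its
   initial (INIT) configuration before any subsequent state save or thread switch. */
static void end_of_amx_region(void)
{
    _tile_release();            /* executes TILERELEASE */
}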
It is recommended that system software initialize the Intel AMX state first (by executing TILERELEASE, for example) before it disables Intel AMX by clearing XCR0[18:17], by clearing CR4.OSXSAVE, or by setting IA32_XFD[18].
A higher-precision matrix can be written as s1*A1 + s2*A2 + … + sn*An, where each matrix Ai is lower precision and each si is a constant scaling factor.
For Bfloat16 decomposition of FP32, consider the following:
• Let A be a matrix of FP32 values.
• Let A1 = bfloat16(A), a matrix containing RNE-rounded Bfloat16 conversions of A.
• Let A2 = bfloat16(A – fp32(A1)).
• Let A3 = bfloat16(A – fp32(A1) – fp32(A2)).
• Now A is approximately A1 + A2 + A3 (a sketch of this decomposition follows).
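A minimal sketch of this BF16x3 decomposition for a single value follows (scalar C, RNE rounding via bit manipulation; NaN/Inf handling is omitted):

#include <stdint.h>
#include <string.h>

/* Round an FP32 value to Bfloat16 (round-to-nearest, ties to even) and return it as FP32. */
static float to_bf16_rne(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof(u));
    u += 0x7FFF + ((u >> 16) & 1);   /* round to nearest, ties to even */
    u &= 0xFFFF0000u;                /* keep the 8-bit exponent and top 7 mantissa bits */
    float r;
    memcpy(&r, &u, sizeof(r));
    return r;
}

/* Decompose a into three Bfloat16 terms with a1 + a2 + a3 ~= a. */
static void bf16x3_decompose(float a, float *a1, float *a2, float *a3)
{
    *a1 = to_bf16_rne(a);
    *a2 = to_bf16_rne(a - *a1);
    *a3 = to_bf16_rne(a - *a1 - *a2);
}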
Once one has written two matrices as a sum of lower precision matrices, one can run AMX/TMUL on the product to
approximate the higher precision. But to do this effectively, one needs to have higher precision accumulation. There
are tricks in the literature for doing higher precision all in a lower precision, such as works on so-called double-double
arithmetic. Still, these tend to vary too much from standard
matrix-matrix multiplication to be helpful with TMUL. In the case of Bfloat16, having 32-bit accumulation in the
product allows one to use Bfloat16 to approximate FP32 accuracy.
Therefore, if A = s1*A1 + s2*A2 + s3*A3, and B = t1*B1 + t2*B2 + t3*B3, then A*B can be computed using AMX/TMUL on the products Ai*Bj for 1<=i,j<=3, assuming scaling is done carefully to avoid
denormals. Assuming FP32 accumulation, the FP32 approximation of A*B can be made by writing out these lower
precision multiplies. Scaling factors can be chosen to avoid denormals at times, but they can also be picked in a way
that simplifies further steps in the algorithm. In some cases, scaling factors can be chosen to be a power of two, for
instance, without significantly reducing the accuracy of the resulting matrix-matrix multiply.
The number of matrices for A or B is picked depending on the mantissa range to cover. When trying to emulate FP32, which has 24 bits of mantissa (including the implicit mantissa bit), it is possible with three Bfloat16 matrices (because each of the three terms has 8 bits of mantissa, including the implicit bit). Here, the range is less important because Bfloat16 and FP32 have the same exponent range. This use of three Bfloat16 matrices to approximate FP32 precision is referred to as BF16x3. Range issues may still come up in BF16x3 cases where A has values close to the maximum or minimum exponent for FP32, but that too can be circumvented by scaling constants. A scaling factor of 2^24 or 2^(-24), depending on which end of the exponent range is closest, suffices to push the values far enough away from the boundary to make the computation feasible again.
A few terms from an expansion can also be dropped. For instance, in the BF16x3 case, where there are three As and
three Bs, nine products may result. That is:
A*B = (A1+A2+A3)*(B1+B2+B3) = (A1*B1) + (A1*B2 + A2*B1) + (A1*B3 + A2*B2 + A3*B1) + (A2*B3 + A3*B2)
+(A3*B3).
The parentheses in the last equation are intentionally grouped so that all entries in the same “bin” are put together; there are nine entries of the form Ai*Bj and five bins, each with its own set of parentheses. In the Bfloat16 case, |Ai| <= |Ai-1| / 256. This shows that the last two bins (with A2*B3, A3*B2, and A3*B3) are too small to contribute significantly to the answer, which is why if there are Y terms on each side of A*B, only (Y+1)*Y/2 multiplies are required, not Y*Y. In this case, the last three of the nine multiplies (the difference Y*Y – (Y+1)*Y/2 when Y=3) are dropped; the multiplies in the last two bins have terms less than 2^(-24) times as big as the first term. So, A*B can be approximated (ignoring the scaling terms for now) as the sum of the three most significant bins: A1*B1 + (A1*B2 + A2*B1) + (A1*B3 + A2*B2 + A3*B1). In this case, adding from the least significant bin toward the most significant bin (A1*B1) is recommended.
Whenever A and B are each expanded out to Y-terms, computing only Y*(Y+1)/2 products works under the condition
that each term has the same number of mantissa bits. If some terms have a different number of bits, then this
guideline no longer applies. But for BF16x3, each term covers eight mantissa bits and Y=3, so six products are needed.
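A scalar reference sketch of the resulting six-product BF16x3 scheme follows; the real kernel would issue one TDPBF16PS-based GEMM per product, but the binning and accumulation order are the same. The a1..a3 and b1..b3 arrays hold the Bfloat16 terms of A and B (stored as FP32 here for simplicity); c must be initialized (for example, to zero) before the call and accumulates in FP32.

#include <stddef.h>

/* Reference GEMM: c += a * b, with a MxK, b KxN, c MxN, all row-major FP32. */
static void gemm_acc(const float *a, const float *b, float *c,
                     size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = c[i * n + j];
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Six-product BF16x3 approximation of A*B, accumulating bins from least to most significant. */
static void bf16x3_gemm(const float *a1, const float *a2, const float *a3,
                        const float *b1, const float *b2, const float *b3,
                        float *c, size_t m, size_t n, size_t k)
{
    gemm_acc(a1, b3, c, m, n, k);    /* bin 3: A1*B3 + A2*B2 + A3*B1 */
    gemm_acc(a2, b2, c, m, n, k);
    gemm_acc(a3, b1, c, m, n, k);
    gemm_acc(a1, b2, c, m, n, k);    /* bin 2: A1*B2 + A2*B1 */
    gemm_acc(a2, b1, c, m, n, k);
    gemm_acc(a1, b1, c, m, n, k);    /* bin 1: A1*B1 (most significant) */
}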
Regarding accuracy, the worst-case relative error for BF16x3 may be worse than FP32. However, BF16x3 tends to
cover a larger mantissa range due to implicit bits, which can be more accurate in many cases. Nevertheless, accuracy
is not guaranteed by matrix-matrix multiplication itself; even FP64 or FP128 can exhibit bad component-wise relative errors. Take A = [1, -1] and B = [1; 1]: A*B is zero, and a small perturbation eps to A and/or B can make the result arbitrarily bad in terms of relative error. In general, assume that
the same mantissa range and exponent range is covered as a given higher-precision floating point format, and the
accumulation is at least as good as the higher-precision format. With such an assumption, the answer will be
approximately the same as the higher-precision floating point format. It may or may not be identical. Performing the
same operation in the higher precision format but changing the order of the computations could yield slightly
different results. In terms of matrix-matrix
multiplication, it could yield vast differences in relative error.
Things get slightly more complicated if low precision is used to approximate matrix-matrix multiplication at FP64 or FP128 accuracy. Here the scalars aren't just for avoiding denormals but are necessary to do the initial
matrix conversion. Nevertheless, converting to an integer is recommended in this case because the FP32-rounded
errors in each of the seven or fewer bins may introduce too many errors. An integer is easier to get right because there
are no floating-point errors in each bin.
Conversion to integer works in the same way as the previous Bfloat16 examples. The quantization literature explains how to map floating-point numbers into integers. The only difference is that these integers are further broken down into 8-bit pieces for use with Intel AMX. Constant factors are still needed, but in this case they are primarily defined by the conversion itself.
One difficulty with quantization to integers is the notion of a shared exponent. All the numbers quantized together
with shared exponents must share the same range. The assumption is that all of A shares a joint exponent range.
Since this will also be true for B, each row of A and column of B can be quantized
separately.
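A hedged sketch of such a per-row quantization with a shared scale follows (scalar C; the returned scale is the constant factor that must be applied when converting the Integer32 results back):

#include <math.h>
#include <stdint.h>

/* Quantize one row of FP32 values to int8 with a single shared scale.
   The caller rescales the int32 GEMM results by s_A * s_B per row/column pair. */
static float quantize_row_int8(const float *row, int n, int8_t *q)
{
    float amax = 0.0f;
    for (int i = 0; i < n; i++)                 /* shared range for the whole row */
        amax = fmaxf(amax, fabsf(row[i]));
    float s = (amax > 0.0f) ? amax / 127.0f : 1.0f;
    for (int i = 0; i < n; i++)
        q[i] = (int8_t)lrintf(row[i] / s);      /* round to nearest representable int8 */
    return s;
}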
Assuming that there is Integer32 accumulation with the Integer8 multiplies, a matrix may be broken down into far
more bits than required. This may significantly reduce the inaccuracy impact of picking a shared exponent. Because
Integer32 arithmetic will be precise, modulo overflow/underflow concerns, one can break up A or B into a huge number of 8-bit integer matrices, do all the matrix-matrix work with Intel AMX, and then convert back all the results to get accuracies even up to quad precision.
Consider an extreme case of trying to get over 100 bits of accuracy in a matrix-matrix multiply. All A-values can be quantized into 128-bit integers. The same holds true for B. Once broken down into 8-bit quantities, this will have a significant expansion like: A = s1*A1 + s2*A2 + … + s14*A14 when attempting 112 bits of mantissa. The same can be done with B = t1*B1 + t2*B2 + … + t14*B14. A*B is potentially 14*14=196 products, but only 105 products are needed because the last few products may have scaling factors less than 2^(-112) times the most important terms. Each product term should be accumulated separately, computing into C from the least significant bins forward:
• C15 = (s1*t14)*A1*B14 + (s2*t13)*A2*B13 + … + (s14*t1)*A14*B1
• C14 = (s1*t13)*A1*B13 + (s2*t12)*A2*B12 + … + (s13*t1)*A13*B1
• C13 = (s1*t12)*A1*B12 + (s2*t11)*A2*B11 + … + (s12*t1)*A12*B1
• …
• C02 = (s1*t1)*A1*B1
Sometimes scaling factors can be chosen so that all the products in a given row can be computed with the same scratch array. The converted sum of C02 through C15 gives the final product, where terms like C15 should be computed first.
Writing matrix-matrix multiplies in terms of an expansion like (A1+A2+A3)*(B1+B2+B3) is referred to as “cascading
GEMM.” Performance will vary depending on the TMUL/Intel AMX specification, and may vary from generation to
generation. Note that some computations may become bandwidth-bound. Since there is no quad floating-point
precision in hardware for Intel Architecture, the above algorithm may be competitive performance-wise with other
approaches like doing software double-double optimizations or software-based quad precision.
CHAPTER 21
INTEL® QUICKASSIST TECHNOLOGY (INTEL® QAT)
The Intel® QuickAssist Technology (Intel® QAT) API supports two acceleration services:
• Cryptographic
• Data Compression.
The acceleration driver interfaces with the hardware via hardware-assisted rings. These rings are used as request and
response rings. The driver uses request rings to submit requests to the accelerator and response rings to retrieve
responses from the accelerator. The availability of responses can be indicated to the driver using either interrupts or
by having software poll the response rings.
At the Intel QAT API, services are accessed via “instances.” A set of rings is assigned to an instance, and any operations
performed on a service instance will involve communication over the rings assigned to that instance.
• If the polling frequency is too low, latency increases, and throughput may suffer if the response rings fill, causing the accelerator to stall.
The choice of threading model has an impact on performance when using a polling approach. There are two main
threading approaches when polling:
• Creating a polling thread that periodically calls the polling API. This model is often the simplest to implement and allows for maximum throughput, but it can lead to increased offload cost due to the overhead associated with context switching to/from the polling thread.
• Invoking the polling API and submitting new requests from within the same thread. This model is characterized by
having a “dispatch loop” that alternates between submitting new requests and polling for responses. Additional
steps are often included in the loop such as checking for received network packets or transmitting network
packets. This approach often leads to the best performance since the polling rate can be easily tuned to match the
submission rate, so throughput is maximized and offload cost is minimized (a sketch of such a loop follows).
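A hedged sketch of such a dispatch loop follows; the app_* helpers are hypothetical placeholders, not Intel QAT API functions:

#include <stdbool.h>

static bool app_has_work(void)       { return false; }   /* hypothetical: new requests available? */
static void app_submit_request(void) { /* build and enqueue one request to the instance */ }
static void app_poll_responses(void) { /* invoke the instance's polling API */ }

static void dispatch_loop(volatile bool *running)
{
    while (*running) {
        if (app_has_work())
            app_submit_request();    /* submit new requests ...           */
        app_poll_responses();        /* ... then poll for their responses */
    }
}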
21.1.1.3 Recommendations
Polling mode tends to be preferred in cases where traffic is steady (like packet processing applications) and can result
in a minimal offload cost. Epoll mode is preferred for cases where traffic is bursty, as the application can sleep until
there is a response to process.
Considerations when using polling mode:
• Fine-tuning the polling interval is critical to achieving optimal performance.
• The preference is for invoking the polling API and submitting new requests from within the same thread rather
than having a separate polling thread.
Considerations when using epoll mode:
• CPU usage will be at 0% in the idle state in epoll mode versus a non-zero value in standard poll mode. However, under high load, standard poll mode should outperform epoll mode.
cpaDcDpEnqueueOpBatch() API calls for the compression data plane API. In all cases, requests are only submitted to
the accelerator when the performOpNow parameter is set to CPA_TRUE.
It is recommended to use the batch submission mode of operation where possible to reduce offload cost.
NOTE
Specific performance numbers are not given in this document since exact performance numbers
depend on a variety of factors and tend to be specific to a given workload, software and platform
configuration.
When using the Data Plane API, it is possible to pass a flat buffer to the API instead of a buffer list. This is the most
efficient usage of system resources (mainly PCIe bandwidth) and can lead to lower latencies compared to using buffer
lists.
In summary, the recommendations for using buffer lists are:
• If using the Data Plane API, use a flat buffer instead of a buffer list.
• If using a buffer list, a single buffer per buffer list leads to highest throughput performance.
• If using a buffer list, keep the number of buffers in the list to a minimum.
• As the maximum number of concurrent requests is increased in the configuration file, the memory required to support them also increases.
• If the number of concurrent requests is set too low, there may not be enough outstanding requests to keep the
accelerator busy and throughput will degrade. The minimum number of concurrent requests required to keep
the accelerator busy is a function of the size of the requests and of the rate at which responses are processed via
either polling or interrupts (see Section 21.1.1 for additional details).
• If the number of concurrent requests is set too high, the maximum latency will increase.
It is recommended that the maximum number of concurrent requests be tuned to achieve the correct balance between memory usage, throughput, and latency for a given application. As a guide, the maximum number configured should reflect the peak request rate that the accelerator must handle.
CHAPTER 22
USING PERFORMANCE MONITORING EVENTS
Performance monitoring (Perfmon) provides means to characterize the interaction between programmed sequences
of instructions and microarchitectural sub-systems. Performance monitoring facilities are described in Chapter 19,
“Architectural Last Branch Records” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume
3B. Performance-monitoring events are described at https://perfmon-events.intel.com/.
The first section of this appendix (Top-Down Analysis Method) provides information on the Top-Down
Microarchitecture Analysis (TMA) method for analyzing performance bottlenecks when tuning for Intel
microarchitectures. Sections 22.1.1 through 22.1.7 present a generalized formalism that can adapt to several recent
Intel® microarchitectures.
The remaining subsections have instantiations of TMA for:
• Golden Cove microarchitecture
• Ice Lake microarchitecture
• Cascade Lake microarchitecture
• Skylake microarchitecture
The instantiations include examples where applicable.
The rest of this chapter has performance monitoring information for previous generations of Intel microarchitectures.
For example, if instruction fetch issues significantly hurt an application, TMA categorizes it as Frontend Bound at the
tree’s top level. A user/tool can drill down and focus only on the Frontend sub-tree. The drill down is recursively
performed until a tree-leaf is reached. A leaf can point to a specific stall of the workload or denote a subset of issues
with a common micro-architectural symptom likely to limit the application’s performance.
TMA was first developed1 in conjunction with the performance monitoring capability of the Sandy Bridge
microarchitecture. The methodology is refined with subsequent generations to support multiple microarchitecture
generations and enhanced by subsequent PMU capabilities. Please refer to the TMA electronic file at
https://download.01.org/perfmon/ for details on the complete hierarchy and its nodes, additional useful,
informative metrics, metric descriptions, event ratios per generation, or specific events.
22.1.1 TOP-LEVEL
At the top-level, TMA classifies pipeline slots into four main categories:
• Frontend Bound
• Backend Bound
• Bad Speculation
• Retiring
The latter two denote non-stalled slots while the former two indicate stalls, as illustrated in Figure 22-1 above.
Figure 22-2 depicts a simple decision tree to start the drill-down process. If some operation utilizes a slot, it will be
classified as Retiring or Bad Speculation, depending on whether it eventually gets retired (committed).
1. A Top-Down Method for Performance Analysis and Counters Architecture, Ahmad Yasin. In IEEE International
Symposium on Performance Analysis of Systems and Software, ISPASS 2014. http://bit.ly/tma-ispass14 .
— Unused slots are classified as Backend Bound if the back-end portion of the pipeline is unable to accept more operations (a.k.a. back-end stall1), or
— as Frontend Bound, indicating that no operations (μops) were delivered while there was no back-end stall.
[Figure 22-2: decision tree keyed on whether a μop is allocated in a slot, whether it eventually retires, and whether the back end is stalled.]
The single entry point of division at a pipeline’s issue stage (allocation stage) makes the four categories additive to the
total possible slots. The classification at slots granularity (sub-cycle) makes the breakdown very accurate and robust
for superscalar cores, which is necessary at the top level.
Retiring denotes slots utilized by “good operations.” Ideally, you want to see all slots attributed here since it correlates
with Instructions Per Cycle (IPC). Nevertheless, a high Retiring fraction does not necessarily mean there is no room for
speedup.
Bad Speculation denotes slots wasted due to all aspects of incorrect speculations. It includes: (a) slots of operations
that do not eventually retire and (b) slots where the issue pipeline was blocked due to recovery from earlier mis-
speculations. Note there is a third portion covered by Branch_Resteers1. This category can be split per type of
speculation. For example, Branch Mispredicts and Machine Clears cover control-flow and data mis-speculation,
respectively.
Frontend Bound denotes when the pipeline’s Frontend under-supplies the Backend. The Frontend is the pipeline
portion responsible for delivering operations to be executed later by the Backend. This category is further classified
into Fetch Latency (for example, ICache or ITLB misses) and Fetch Bandwidth (for instance, sub-optimal decoding).
Backend Bound denotes remaining stalled slots due to a lack of required Backend resources. It is split into Memory
Bound, which reflects execution stalls due to the memory subsystem, and Core Bound, which demonstrates either
pressure on the execution units (compute bound) or lack of Instructions-Level-Parallelism (ILP).
1. ibid.
The following sections provide more details on these categories and nodes in subsequent levels of the hierarchy.
The out-of-order scheduler can dispatch micro-ops into multiple execution units for execution. While these micro-ops
were executing in-flight, some of the memory access latency exposure for data can be hidden by keeping the
execution units busy with useful micro-ops that do not depend on pending memory accesses. Thus for common
cases, the real penalty for memory access is when the scheduler has nothing ready to feed the execution units. It is
likely that further micro-ops are either waiting for the pending memory access or depend on other unready micro-
ops.
ExecutionStalls span several sub-categories, each associated with a particular cache level, depending on which cache level satisfies the demanded data. In some situations, an ExecutionStall can experience
significant delay, greater than the nominal latency of the corresponding cache level, while no demand-load is missing
that cache level.
For example, the L1D cache often has short latency which is comparable to ALU stalls (or waiting for completion of
some commonly-used execution units like floating-point adds/multiplies or integer multiplies). Yet in certain
scenarios, like a load blocked from forwarding data from an earlier store to an overlapping address, this load might suffer
a high effective latency while eventually being satisfied by L1D. In such a scenario, the in-flight load will last a long
time without missing L1D. Hence, it gets tagged under L1 Bound. Load blocks due to 4K Aliasing is another scenario
with the same symptom.
ExecutionStalls related to store operations are also treated in the Store Bound category. Store operations are
buffered and executed post-retirement due to memory ordering requirements. Typically, store operations have little
impact on performance, but they cannot be neglected entirely. TMA defines Stores Bound as a fraction of cycles with
low execution port utilization and a high number of stores consuming the resources needed to buffer the stores.
Data TLB misses are categorized under various Memory Bound sub-nodes. For example, if a TLB translation is satisfied
by L1D, it is tagged under L1 Bound.
A simple heuristic is used to distinguish MEM Bandwidth from MEM Latency under DRAM Bound. The heuristic uses the occupancy of requests pending on data return from the memory controller. Whenever the occupancy exceeds a high threshold, say 70% of the maximum number of requests the memory controller can serve simultaneously, TMA flags this as potentially limited by memory bandwidth. The remaining fraction is attributed to memory latency.
Having a Bad Speculation category at the Top Level is a crucial principle in TMA. It determines the fraction of the
workload under analysis that is affected by incorrect execution paths, which in turn dictates the accuracy of
observations listed in other categories. Furthermore, this permits nodes at lower levels to make use of some of the many traditional performance counter events, despite most of those events counting speculatively. Hence, it would
be best if you treated a high value in Bad Speculation as a “red flag” that needs to be investigated before looking at
other categories. In other words, minimizing Bad Speculation improves the processor resource utilization and
increases confidence in metrics reported throughout the hierarchy.
TMA further classifies Bad Speculation into Branch Mispredicts and Machine Clears, with similar symptoms where
the pipeline is flushed. Branch misprediction applies when the BPU incorrectly predicts the branch direction and
target. Memory Order Machine Clears (for example, due to memory disambiguation) are a subset of Machine Clears.
The next steps to the analysis of these issues can be completely different—the first deals with making the program
control flow friendlier to the branch predictor. The latter often points to unexpected situations such as memory
ordering machine clears or self-modifying code.
22.1.7 RETIRING
This category reflects slots utilized by “good micro-ops” – issued micro-ops that get retired expeditiously without performance bottlenecks. Ideally, you want to see all slots attributed to the Retiring category; that is, Retiring 100% of the slots corresponds to hitting the maximal micro-ops retired per cycle of the given microarchitecture. For example, assuming one instruction is decoded into one micro-op, Retiring of 50% of the slots means an IPC of 2 is achieved on a four-wide machine (0.5 x 4 = 2). In other words, maximizing the Retiring category increases the IPC of your program.
Nevertheless, a high Retiring value does not necessarily mean there is no room for more performance. Heavy
Operations, like Floating Point (FP) assists, typically hurt performance and can be avoided. They are isolated under the Microcode Sequencer node to bring them to your attention.
A high Retiring value for non-vectorized code may be an excellent hint to vectorize the code. Doing so lets more operations be completed by a single instruction/micro-op, hence improving performance. TMA further breaks down
the Retiring->Light Operations category into FP Arith, with FP Scalar and FP Vector operations distinction in level 4
(omitted in Figure B-1). For more details see the Matrix-Multiply use-case of the paper2.
Figure 22-3. TMA Hierarchy and Precise Events in the Skylake Microarchitecture
Figure 22-4. System Topology Supported by Intel® Xeon® Processor 5500 Series
The performance monitoring events on Intel Xeon processor 5500 series can be used to analyze the interaction
between software (code and data) and microarchitectural units hierarchically:
• Per-core PMU: Each processor core provides four programmable counters and three fixed counters. The
programmable per-core counters can be configured to investigate Frontend/micro-op flow issues and stalls
inside a processor core. Additionally, a subset of per-core PMU events supports precise event-based sampling
(PEBS). Load latency measurement facility is new in Intel Core i7 processor and Intel Xeon processor 5500.
• Uncore PMU: The uncore PMU provides eight programmable counters and one fixed counter. The programmable uncore counters can be configured to characterize L3 and Intel QPI operations and local and remote data
memory accesses.
The number and variety of performance counters and the breadth of programmable performance events available in
Intel Xeon processor 5500 offer software tuning engineers the ability to analyze performance issues and achieve
higher performance. Using performance events to analyze performance issues can be grouped into the following
subjects:
• Cycle Accounting and Uop Flow
• Stall Decomposition and Core Memory Access Events (non-PEBS)
• Precise Memory Access Events (PEBS)
• Precise Branch Events (PEBS, LBR)
• Core Memory Access Events (non-PEBS)
• Other Core Events (non-PEBS)
• Frontend Issues
• Uncore Events
Cycle accounting of executed micro-ops is an effective technique to identify stalled cycles for performance tuning.
Within the microarchitecture pipeline, the meaning of micro-ops being “issued,” “dispatched,” “executed,” “retired”
has a precise definition. This is illustrated in Figure 22-5.
Cycles are divided into those where micro-ops are dispatched to the execution units and those where no micro-ops
are dispatched, which are considered execution stalls.
“Total cycles” of execution for the code under test can be directly measured with CPU_CLK_UNHALTED.THREAD
(event code 3CH, Umask= 1) and setting CMask = 2 and INV=1 in IA32_PERFEVTSELCn.
The signals used to count the memory access μops executed (ports 2, 3 and 4) are the only core events that cannot be
counted per-logical processor. Thus, Event code B1H with Umask=3FH only counts on a per-core basis, and the entire
execution stall cycles can only be evaluated on a per-core basis. If HT is disabled, conducting a per-thread analysis of
micro-op flow cycle accounting presents no difficulty.
[Figure 22-5: pipeline diagram showing IFetch/BPU, Decoder, Resource Allocator, RS, Dispatch, Execution Units, ROB, and Retirement/Writeback, annotated with the UOPS_ISSUED, UOPS_EXECUTED, UOPS_RETIRED, and RESOURCE_STALLS measurement points.]
The PMU signals to count μops_executed in ports 0, 1, 5 can count on a per-thread basis even when HT is active. This
provides an alternate cycle accounting technique when the workload under test interacts with HT.
The alternate metric is built from UOPS_EXECUTED.PORT015_STALL_CYCLES, using appropriate CMask, Inv, and Edge
settings. Details of performance events are shown in Table 22-3.
UOPS_EXECUTED.PORT015_STALL_CYCLES    40H    B1H    1    1    0    0
UOPS_RETIRED.STALL_CYCLES             1H     C2H    1    1    0    0
UOPS_RETIRED.ACTIVE_CYCLES            1H     C2H    1    0    0    0
In processors based on Nehalem microarchitecture, there are no execution stalls associated with clearing the pipeline
of mispredicted μops (component 2). These μops are simply removed from the pipeline without stalling executions or
dispatch. This typically lowers the penalty for mispredicted branches. Further, the penalty associated with instruction
starvation (component 3) can be measured.
The wasted work within executed μops consists of those μops that will never be retired. This is part of the cost associated with mispredicted branches. It can be found by monitoring the flow of μops through the pipeline. The μop flow can be measured at three points in Figure 22-5: going into the RS with the event UOPS_ISSUED, going into the execution units with UOPS_EXECUTED, and at retirement with UOPS_RETIRED. The differences between the upstream measurements and the retirement measurement give the wasted work associated with these mispredicted μops.
As UOPS_EXECUTED must be measured per core, rather than per thread, the wasted work per core is evaluated as:
Wasted Work = UOPS_EXECUTED.PORT234_CORE + UOPS_EXECUTED.PORT015_All_Thread - UOPS_RE-
TIRED.ANY_ALL_THREAD.
The ratio above can be converted to cycles by dividing by the average issue rate of μops. The events above were designed to be used in this manner without corrections for micro fusion or macro fusion.
A “per thread” measurement can be made from the difference between the μops issued and μops retired as the latter
two of the above events can be counted per thread. It overcounts slightly, by the mispredicted μops that are
eliminated in the RS before they can waste cycles being executed, but this is usually a small correction:
Wasted Work/thread = (UOPS_ISSUED.ANY + UOPS_ISSUED.FUSED) - UOPS_RETIRED.ANY.
Objective: Evaluate μops that executed but did not retire due to misprediction
The third component of the misprediction penalty, instruction starvation, occurs when the instructions associated
with the correct path are far away from the core and execution is stalled due to a lack of μops in the RAT. The two primary causes of μops not being issued are:
• Frontend starvation.
• Resources unavailable in the back end.
Consequently, the output of the resource allocation stage can be measured as follows:
• Count the total number of cycles where no μops were issued to the OOO engine.
• Count the cycles where resources (RS, ROB entries, load buffer, store buffer, etc.) are unavailable for allocation.
Method: Examine cycle differences between μops issuing and resource allocation
PMU-Pipeline Focus: Micro-op issue and resource allocation
Event code/Umask: Event code 0EH, Umask = 1, for μops issued.
Event code A2H, Umask = 1, for resource allocation stall cycles.
UOPS_RETIRED.ANY_ALL_THREAD 1H C2H 0 0 0 1
RESOURCE_STALLS.ANY 1H A2H 0 0 0 0
UOPS_ISSUED.ANY 1H 0EH 0 0 0 0
UOPS_ISSUED.STALL_CYCLES 1H 0EH 1 1 0 0
UOPS_ISSUED.ACTIVE_CYCLES 1H 0EH 1 0 0 0
UOPS_ISSUED.CORE_CYCLES_ACTIVE 1H 0EH 1 0 0 1
techniques less practical. Using precise-event-based sampling (PEBS) is the preferred technique for processors based
on Nehalem microarchitecture.
Profiling the penalty by sampling (to localize the measurement by IP) is likely to have accuracy difficulties. Since the latencies for L2 misses can vary from 40 to 400 cycles, collecting the required number of samples tends to be invasive.
The use of the precise latency event, which will be discussed later, provides a more accurate and flexible measurement technique when sampling is used. As each sample records both a load-to-use latency and a data source, the average
latency per data source can be evaluated. Further as the PEBS hardware supports buffering the events without
generating a PMI until the buffer is full, it is possible to make such an evaluation efficient without perturbing the
workload intrusively.
A number of performance events in core PMU can be used to measure the costs of memory accesses that originated
in the core and experienced delays due to various conditions, locality, or traffic due to cache coherence requirements.
The latency of memory accesses varies depending on locality (L3, DRAM attached to the local memory controller or a remote controller) and on cache coherency factors. Some examples of the approximate latency values are shown in
Table 22-7.
Local DRAM ~ 50 ns
Remote DRAM ~ 90 ns
One of the three fields written to each PEBS record by the PEBS assist mechanism of the load latency event, encodes
the data source locality information.
Table 22-9. Data Source Encoding for Load Latency PEBS Record
Encoding  Description
0x0       Unknown L3 cache miss.
0x1       Minimal latency core cache hit. This request was satisfied by the L1 data cache.
0x2       Pending core cache HIT. An outstanding core cache miss to the same cache-line address was already underway. The data is not yet in the data cache, but is located in a fill buffer that will soon be committed to cache.
0x3       This data request was satisfied by the L2.
0x4       L3 HIT. Local or remote home requests that hit the L3 cache in the uncore with no coherency actions required (snooping).
0x5       L3 HIT (other core hit snoop). Local or remote home requests that hit the L3 cache and were serviced by another processor core with a cross-core snoop where no modified copies were found (clean).
0x6       L3 HIT (other core HITM). Local or remote home requests that hit the L3 cache and were serviced by another processor core with a cross-core snoop where modified copies were found (HITM).
0x7       Reserved.
0x8       L3 MISS (remote cache forwarding). Locally homed requests that missed the L3 cache and were serviced by forwarded data following a cross-package snoop where no modified copies were found. (Remote home requests are not counted.)
0x9       Reserved.
0xA       L3 MISS (local DRAM go to S). Local home requests that missed the L3 cache and were serviced by local DRAM (go to shared state).
0xB       L3 MISS (remote DRAM go to S). Remote home requests that missed the L3 cache and were serviced by remote DRAM (go to shared state).
0xC       L3 MISS (local DRAM go to E). Local home requests that missed the L3 cache and were serviced by local DRAM (go to exclusive state).
0xD       L3 MISS (remote DRAM go to E). Remote home requests that missed the L3 cache and were serviced by remote DRAM (go to exclusive state).
0xE       I/O. Request of input/output operation.
0xF       The request was to uncacheable memory.
The latency event is the recommended method to measure the penalties for a cycle accounting decomposition. Each time a PMI is raised by this PEBS event, a load-to-use latency and a data source for the cacheline are recorded in the PEBS buffer. The data source for the cacheline can be deduced from the low-order four bits of the data source field and the table shown above. Thus an average latency for each of the 16 sources can be evaluated from the collected data. As only one minimum latency at a time can be collected, it may be awkward to evaluate the latency for an MLC hit and a remote socket DRAM access; a minimum latency of 32 cycles should give a reasonable distribution for all the off-core sources, however. The Intel® PTU version 3.2 performance tool can display the latency distribution in the data profiling mode and allows sophisticated event filtering capabilities for this event.
end of short basic blocks to not be the last entry in the LBR, distorting the measurement, even though all the instructions in a basic block are by definition executed the same number of times.
The shadowing effect on call counts and basic block execution counts can be alleviated to a large degree by averaging
over the entries in the LBR. This will be discussed in the section on LBRs.
Typically, branches account for more than 10% of all instructions in a workload, so loop optimization must focus on those loops with high tripcounts. For counted loops, it is very common for the induction variable to be compared to the tripcount in the termination condition evaluation. This is particularly true if the induction variable is used within the body of the loop, even in the face of heavy optimization. Thus the closing sequence of a loop unrolled by eight may resemble:
add rcx, 8
cmp rcx, rax
jnge triad+0x27
In this case, the two registers rax and rcx are the tripcount and the induction variable. If the PEBS buffer is captured for the conditional branches retired event, the average values of the two registers in the compare can be evaluated. The one
with the larger average will be the tripcount. Thus the average, RMS, min and max can be evaluated and even a
distribution of the recorded values.
[Figure: an LBR entry with "From" = Branch_1 and "To" = Target_1. All instructions between Target_0 and Branch_1 are retired one time for each event count; all basic blocks between Target_0 and Branch_1 are executed one time for each event count; and all branch instructions between Target_0 and Branch_1 are not taken.]
The 16 sets of LBR records can help rectify the artifact of PEBS samples aggregating disproportionately to certain
instructions in the sampling process. The situation of the skewed distribution of the PEBS sample is illustrated in
Figure 22-7.
Consider several basic blocks in the flow of normal execution; some basic blocks take twenty cycles to execute, others
take two, and shadowing takes ten cycles. Each time an overflow condition occurs, the delay of PEBS being armed is
at least ten cycles. Once the PEBS is armed, the PEBS record is captured on the next eventing condition. The distribution of the sampled instruction addresses using the PEBS record will therefore be skewed, as shown in Figure 22-7. In this
conceptual example, every branch is assumed to be taken in these basic blocks.
In the skewed distribution of PEBS samples, the branch IP of the last basic block will be recorded five times as much as
the least sampled branch IP address (the second basic block).
[Figure 22-7: conceptual example of the skewed branch IP distribution of PEBS samples across a chain of basic blocks. Legend: O = overflow; P = PEBS armed; C = interrupt occurs in the LBR trajectory.]
This results in a situation where some basic blocks appear to never get samples and some appear to get many times too many. Weighting each entry by 1/(number of basic blocks in the LBR trajectory), in this example, would result in dividing the numbers in the rightmost table by 16. Far more accurate execution counts are thus achieved ((1.25 -> 1.0) * N) in all of the basic blocks, even those that never directly caused a PEBS sample.
As on Intel® Core™2 processors, there is a precise instructions retired event that can be used in a wide variety of ways. In addition, there are precise events for uops_retired, various SSE instruction classes, and FP assists. It should be noted
that the FP assist events only detect x87 FP assists, not those involving SSE FP instructions. Detecting all assists will be
discussed in the section on the pipeline Frontend.
The instructions retired event has a few special uses. While its distribution is not uniform, the totals are correct. If the
values recorded for all the instructions in a basic block are averaged, a measure of the basic block execution count can
be extracted. The ratios of basic block executions can be used to estimate loop tripcounts when the counted loop
technique discussed above cannot be applied.
The PEBS version (general counter) instructions retired event can further be used to profile OS execution accurately
even in the face of STI/CLI semantics, because the PEBS interrupt then occurs after the critical section has completed,
but the data was frozen correctly. If the CMask value is set to some very high value and the invert condition is applied,
the result is always true, and the event will count core cycles (halted + unhalted).
Consequently, both cycles and instructions retired can be accurately profiled. The UOPS_RETIRED.ANY event, which is also precise, can be used to profile Ring 0 execution and gives a more accurate picture of execution. The
precise events available for this purpose are listed under event code C0H, C2H, C7H, F7H at: https://perfmon-
events.intel.com/.
Measuring Core Memory Access Latency
Drilling down performance issues associated with locality or cache coherence issues will require using performance
monitoring events. In each processor core, there is a super queue that allocates entries to buffer requests of memory
access traffic due to an L2 miss to the uncore sub-system. Table 22-10 lists various performance events available in
the core PMU that can drill down performance issues related to L2 misses.
NOTES:
1. The *DEMAND* events also include any requests made by the L1D cache hardware prefetchers.
Table 22-11 lists various performance events available in the core PMU that can drill down performance issues
related to super queue operation.
Additionally, L2 misses can be drilled down further by data origin attributes and response attributes. The matrix to
specify data origin and response type attributes is done by a dedicated MSR OFFCORE_RSP_0 at address 1A6H. See
Table 22-12 and Table 22-13.
Although Table 22-13 allows 2^16 combinations of settings in MSR_OFFCORE_RSP_0 in theory, it is more useful to consider combining subsets of 8-bit values to specify the “Request type” and the “Response type”. The more common 8-
bit mask values are listed in Table 22-14.
Table 22-14. Common Request and Response Types for OFFCORE_RSP_0 MSR
Request Type    Mask     Response Type         Mask
ANY_DATA        xx11H    ANY_CACHE_DRAM        7FxxH
ANY_IFETCH      xx44H    ANY_DRAM              60xxH
DEMAND_DATA     xx03H    L3_OTHER_CORE_HITM    04xxH
PREFETCH        xx70H
NOTES:
1. The PMU may report incorrect counts when MSR_OFFCORE_RSP_0 is set to the value 4080H. Non-temporal stores to local DRAM are not reported in the count.
The “CACHE_DRAM” encoding is provided to work around the defect noted in the footnote of Table 22-14. Note that none of the above includes the bandwidth associated with writebacks of modified cacheable lines.
The evictions of modified lines in the L1D result in writebacks to the L2. These are counted with the L1D_WB_L2
events. The Umask values break these down by the MESI state of the version of the line in the L2.
The locked references can be counted also with the L1D_CACHE_LOCK events. Again these are broken down by MES
states for the lines in L1D.
The total number of lines brought into L1D, the number that arrived in an M state and the number of modified lines
that get evicted due to receiving a snoop are counted with the L1D event and its Umask variations.
The L1D events are listed under event codes 28H, 40H, 41H, 42H, 43H, 48H, 4EH, 51H, 52H, 53H, 80H, and 83H at:
https://perfmon-events.intel.com/.
There are a few cases of loads not being able to forward from active store buffers. The predominant situations have to do with larger loads overlapping smaller stores. There is no event that detects when this occurs. There is also a “false store forwarding” case where the addresses only match in the lower 12 address bits. This is sometimes referred to as 4K aliasing. It can be detected with the event PARTIAL_ADDRESS_ALIAS, which has event code 07H and Umask 01H.
Other situations that can trigger this event are due to FP assists, like performing a numeric operation on denormalized
FP values or QNaNs. In such cases the penalty is essentially the μops required for the assist plus the pipeline clearing
required to ensure the correct state.
Consequently, this situation has a very clear signature consisting of MACHINE_CLEAR.CYCLES and μops being inserted by the microcode sequencer (UOPS_DECODED.MS); the execution penalty is the sum of these two contributions. The event codes for these are listed under D1H and C3H.
Latency can be measured by the average duration of the queue occupancy, if the occupancy stops as soon as the data has been delivered. Thus the ratio UNC_GQ_TRACKER_OCCUP.X/UNC_GQ_ALLOC.X measures an average duration of queue occupancy, where ‘X’ represents a specific Umask value. The total occupancy period of the read tracker, as measured by:
Total Read Period = UNC_GQ_TRACKER_OCCUP.RT/UNC_GQ_ALLOC.RT
is longer than the data delivery latency because it includes time for extra bookkeeping and cleanup. The measurement:
LLC response Latency = UNC_GQ_TRACKER_OCCUP.RT_TO_LLC_RESP / UNC_GQ_ALLOC.RT_TO_LLC_RESP
is essentially a constant. It does not include the total time to snoop and retrieve a modified line from another core for
example, just the time to scan the L3 and see if the line is or is not present in this socket.
An overall latency for an L3 hit is the weighted average of three terms:
• The latency of a simple hit, where the line has only been used by the core making the request.
• The latencies for accessing clean lines by multiple cores.
• The latencies for accessing dirty lines that have been accessed by multiple cores.
These three components of the L3 hit for loads can be decomposed using the derivative events of
OFFCORE_RESPONSE:
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_NO_OTHER_CORE.
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_OTHER_CORE_HIT.
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_OTHER_CORE_HITM.
The event OFFCORE_RESPONSE_0.DEMAND_DATA.LOCAL_CACHE should be used as the denominator to obtain
latencies. The individual latencies would have to be measured with microbenchmarks, but the use of the precise latency event is far more effective, as any bandwidth loading effects are included.
The L3 miss component is the weighted average over three terms:
• The latencies of L3 hits in a cache on another socket (this is described in the previous paragraph).
• The latencies to local DRAM.
• The latencies to remote DRAM.
The local dram access and the remote socket access can be decomposed with more uncore events.
Miss to fill latency = UNC_GQ_TRACKER_OCCUP.RT_LLC_MISS / UNC_GQ_ALLOC.RT_LLC_MISS
The uncore GQ events using Umask values associated with the *RTID* mnemonic allow monitoring of a subcomponent of the miss-to-fill latency associated with the communications between the GQ and the QHL.
There are uncore PMU events which monitor cycles when the three trackers are not empty (>= 1 entry) or full. These
events are listed under the event code 00H and 01H at: https://perfmon-events.intel.com/.
Because the uncore PMU generally does not differentiate which processor core causes a particular eventing
condition, the technique of dividing the latencies by the average queue occupancy to determine a penalty does not
work for the uncore. Overlapping entries from different cores do not result in overlapping penalties and thus a
reduction in stalled cycles. Each core suffers the full latency independently.
To evaluate the correction on a per-core basis, the number of cycles during which there is an entry from the core in question is required. A *NOT_EMPTY_CORE_N type event would be needed; however, there is no such event. Consequently, in the cycle decomposition one must use the full latency for the estimate of the penalty. As has been stated before, it is best to use the PEBS latency event, as the data sources are also collected with the latency for the individual sample.
The individual components of the read tracker, discussed above, can also be monitored as busy or full by setting the
CMask value to 1 or 32 and applying it to the assorted read tracker occupancy events.
The snoop responses are divided into requests for locally homed data and remotely homed data. If the line is in a
modified state and the GQ is responding to a read request, the line also must be written back to memory. This would
be a wasted effort for a response to a RFO as the line will just be modified again, so no Writeback is done for RFOs.
The snoop responses of local home events that can be monitored by an uncore PMU are listed under event code 06H
at: https://perfmon-events.intel.com/. The snoop responses of remotely home events are listed under event code
07H.
Some related events count the MESI transitions in response to snoops from other caching agents (processors or IOH).
Some of these rely on programming MSR so they can only be measured one at a time, as there is only one MSR. The
Intel performance tools will schedule this correctly by restricting these events to a single general uncore counter.
22.4.5.4 L3 Events
Although the number of L3 hits and misses can be determined from the GQ tracker allocation events, several uncore PMU events are simpler to use. They are listed under event codes 08H and 09H in the uncore event list at:
https://perfmon-events.intel.com/.
The MESI states breakdown of lines allocated and victimized can also be monitored with LINES_IN, LINES_OUT events
in the uncore using event code 0AH and 0BH.
The following predefined opcode match encodings are available (opcode match value, Umask, event code):
• UNC_ADDR_OPCODE_MATCH.REMOTE.NONE: match value 0, Umask 2H, event code 35H
• UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDI: match value 40001900_00000000, Umask 2H, event code 35H
• UNC_ADDR_OPCODE_MATCH.LOCAL.NONE: match value 0, Umask 4H, event code 35H
• UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDI: match value 40001900_00000000, Umask 1H, event code 35H
These predefined opcode match encodings can be used to monitor HITM accesses. It is the only event that allows
profiling the code requesting HITM transfers.
Figure 22-8 through Figure 22-15 show a series of Intel QPI protocol exchanges associated with Data Reads and Reads
for Ownership (RFO) after an L3 miss, under a variety of combinations of the local home of the cacheline and the MESI
state in the remote cache. Of particular note are the cases where the data comes from the remote QHL even though the
data was in the remote L3: these are the RdData requests where the remote L3 has the line in an M state.
Figure 22-8. RdData Request after LLC Miss to Local Home (Clean Rsp)
Figure 22-9. RdData Request after LLC Miss to Remote Home (Clean Rsp)
Figure 22-10. RdData Request after LLC Miss to Remote Home (Hitm Response)
Figure 22-11. RdData Request after LLC Miss to Local Home (Hitm Response)
Figure 22-12. RdData Request after LLC Miss to Local Home (Hit Response)
Figure 22-13. RdInvOwn Request after LLC Miss to Remote Home (Clean Res)
Figure 22-14. RdInvOwn Request after LLC Miss to Remote Home (Hitm Res)
Figure 22-15. RdInvOwn Request after LLC Miss to Local Home (Hit Res)
Whether the line is locally or remotely “homed,” it has to be written back to DRAM before the originating GQ receives
the line, so it always appears to come from a QHL. The RFO does not do this. However, when responding to a remote
RFO (SnpInvOwn) and the line is in an S or F state, the cacheline gets invalidated and the line is sent from the QHL. The
point is that the data source might not always be so obvious.
— On the other hand, samples of loads that forwarded successfully with no penalty will have much larger skids
and will be less helpful for performance tuning.
• The closer the eventing condition is to the retirement of the instruction, the shorter the skid. The events in the
Frontend of the pipeline tend to tag to instructions further from the responsible instruction than events that are
taken at execution or retirement.
• Cycles counted with the event CPU_CLK_UNHALTED.THREAD often tag in greater counts on the instruction after
larger bottlenecks in the pipeline. If cycles are accumulated on an instruction, this is probably due to a bottleneck
at the previous instruction.
• It is very difficult to determine the source of low-cost issues that occur in the Frontend. Frontend events
can also skid to IPs that precede the actual instructions that are causing the issue.
It is possible to estimate the amount of execution slots spent in each category using the following formulas in
conjunction with core PMU performance events in Sandy Bridge microarchitecture:
%FE_Bound =
100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N);
%Bad_Speculation =
100 * ((UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 *
INT_MISC.RECOVERY_CYCLES) / N);
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS/ N);
%BE_Bound = 100 * (1 – (FE_Bound + Retiring + Bad_Speculation));
N represents total execution slots opportunities. Execution opportunities are the number of cycles multiplied by four.
N = 4*CPU_CLK_UNHALTED.THREAD
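For illustration, the slot decomposition above can be computed with a few lines of code. The following C sketch assumes the counter totals (hypothetical values below) were collected over the same interval on a core based on Sandy Bridge microarchitecture; it only evaluates the formulas as written.

#include <stdio.h>

int main(void)
{
    /* Hypothetical raw counts for one measurement interval. */
    double clocks        = 1.0e9;  /* CPU_CLK_UNHALTED.THREAD     */
    double uops_issued   = 2.8e9;  /* UOPS_ISSUED.ANY             */
    double uops_retired  = 2.6e9;  /* UOPS_RETIRED.RETIRE_SLOTS   */
    double recovery_cyc  = 2.0e7;  /* INT_MISC.RECOVERY_CYCLES    */
    double idq_not_deliv = 4.0e8;  /* IDQ_UOPS_NOT_DELIVERED.CORE */

    double N = 4.0 * clocks;       /* total execution slot opportunities */

    double fe_bound = 100.0 * idq_not_deliv / N;
    double bad_spec = 100.0 * (uops_issued - uops_retired + 4.0 * recovery_cyc) / N;
    double retiring = 100.0 * uops_retired / N;
    double be_bound = 100.0 - (fe_bound + retiring + bad_spec);

    printf("%%FE_Bound=%.1f %%Bad_Speculation=%.1f %%Retiring=%.1f %%BE_Bound=%.1f\n",
           fe_bound, bad_spec, retiring, be_bound);
    return 0;
}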
The following sections explain the source for penalty cycles in three categories: back end stalls, Frontend stalls and
bad speculation. They use formulas that can be applied to process, module, function, and instruction granularity.
The metric “%Memory_Bound” is described in Section 22.5.2.3. Once a workload is identified as “core bound”, the
user may want to drill down into OOO or execution-related issues through targeted performance counters, for
example, execution port pressure or the use of FP-chained long-latency arithmetic operations.
RESOURCE_STALLS.ROB - Counts cycles when allocation stalls because all the reorder buffer (ROB) entries are taken.
This event occurs less frequently than the RESOURCE_STALLS.RS and typically indicates that the pipeline is being
backed up by a micro-op that is holding all younger micro-ops from retiring because they have to retire in order.
%RESOURCE.STALLS.ROB.COST = 100 * RESOURCE_STALLS.ROB/ CPU_CLK_UNHALTED.THREAD;
RESOURCE_STALLS2.BOB_FULL - Counts when allocation is stalled due to a branch micro-op that is ready for
allocation, but the number of branches in progress in the processor has reached the limit.
%RESOURCE.STALLS.BOB.COST = 100 * RESOURCE_STALLS2.BOB_FULL / CPU_CLK_UNHALTED.THREAD;
$SumOf_PRECISE_LOADS =
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS + MEM_LOAD_UOPS_RETIRED.L1_HIT_PS +
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS + MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS +
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS +
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS +
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS +
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS;
Estimated Load Penalty
The formulas below help estimate to what degree loads from a certain level of the memory hierarchy are responsible
for a slowdown. The CPU_CLK_UNHALTED.THREAD programmable event represents the penalty in cycles tagged at the
same granularity. At the instruction level, the cycle cost of an expensive load tends to only skid one IP, similar to the
precise event. The calculations below apply to any granularity (process, module, function, or instruction), since the
events are precise. Anything representing 10% or more of the total clocks in a granularity of interest should be
investigated.
If the code has highly dependent loads, you can use the MEM_LOAD_UOPS_RETIRED.L1_HIT_PS event to determine if
the loads are being impacted by the five-cycle latency of the L1 DCache.
Estimated cost of L2 latency
%L2.COST =
12 * MEM_LOAD_UOPS_RETIRED.L2_HIT_PS / CPU_CLK_UNHALTED.THREAD;
Estimated cost of L3 hits
%L3.COST =
26 * MEM_LOAD_UOPS_RETIRED.L3_HIT_PS / CPU_CLK_UNHALTED.THREAD;
Estimated cost of hits in the cache of other cores
%HIT.COST =
43* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS /
CPU_CLK_UNHALTED.THREAD;
Estimated cost of memory latency
%MEMORY.COST =
200 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS /
CPU_CLK_UNHALTED.THREAD;
Actual memory latency can vary greatly depending on memory parameters. The amount of concurrent memory
traffic often reduces the effective cost of a given memory hierarchy level. The estimates above are therefore typically
on the pessimistic side; in situations with little memory-level parallelism, such as pointer chasing, they are closer to
the actual cost.
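As a worked example under hypothetical counter totals, the estimated load-penalty metrics above can be computed as follows; the values are illustrative only, and the fixed per-level costs are the estimates given in the formulas, not measured latencies.

#include <stdio.h>

int main(void)
{
    /* Hypothetical precise-event totals for one granularity of interest. */
    double clocks   = 1.0e9;  /* CPU_CLK_UNHALTED.THREAD                   */
    double l2_hit   = 4.0e6;  /* MEM_LOAD_UOPS_RETIRED.L2_HIT_PS           */
    double l3_hit   = 1.5e6;  /* MEM_LOAD_UOPS_RETIRED.L3_HIT_PS           */
    double xsnp_hit = 3.0e5;  /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS */
    double llc_miss = 2.0e5;  /* MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS    */

    /* Each metric mirrors the corresponding formula in the text; multiply
     * by 100 to express the estimated share of total clocks as a percentage. */
    printf("L2.COST     = %.4f\n", 12.0  * l2_hit   / clocks);
    printf("L3.COST     = %.4f\n", 26.0  * l3_hit   / clocks);
    printf("HIT.COST    = %.4f\n", 43.0  * xsnp_hit / clocks);
    printf("MEMORY.COST = %.4f\n", 200.0 * llc_miss / clocks);
    return 0;
}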
Often, cache misses will manifest as delaying and bunching on the retirement of instructions. The precise loads
breakdown can provide estimates of the distribution of hierarchy levels where the load is satisfied.
Given a significant impact from a particular cache level, the first step is to find where heavy cache line replacements
are occurring in the code. This could coincide with your hot portions of code detected by the memory hierarchy
breakdown, but often does not. For instance, regular traversal of a large data structure can unintentionally clear out
levels of cache.
If hits of non-modified or modified data in another core have a high estimated cost and are hot at locations in the code,
this can be due to locking, sharing, or false sharing issues between threads.
If load latency in memory hierarchy levels further from the L1 DCache does not justify the amount of cycles spent on
a load, try one of the following:
• Eliminate superfluous load operations, for example by spilling general-purpose registers to XMM registers rather
than to memory.
• Continue searching for issues impacting load instructions described in Section 22.5.4.4.
involving a memory address, or one of the following instructions with a memory destination and a LOCK prefix: ADD, ADC,
AND, BTC, BTR, BTS, CMPXCHG, CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, or XADD. Precise events enable
you to get an idea of the contention on any lock. Many locking APIs start with an atomic instruction in ring 3 and back off
a contended lock by jumping into ring 0. This means many locking APIs can be very costly in low contention scenarios.
To estimate the amount of contention on a locked instruction, you can measure the number of times the cache line
containing the memory destination of an atomic instruction is found modified in another core.
Required events:
• MEM_UOPS_RETIRED.LOCK_LOADS_PS - Counts the number of atomic instructions which are retired with a
precise skid of IP+1.
• MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS - Counts the occurrences that the load hits a modified
cache line in another core. This event is important for many performance bottlenecks that can occur in multi-core
systems, such as lock contention, and false sharing.
Usages of events:
The lock contention factor gives the percentage of locked operations executed that contend with another core and
therefore have a high penalty. Usually a lock contention factor over 5% is worth investigating on a hot lock. A heavily
contended lock may impact the performance of multiple threads.
%LOCK.CONTENTION =
100 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS /
MEM_UOPS_RETIRED.LOCK_LOADS_PS;
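A minimal C sketch of the lock contention factor, assuming hypothetical totals for the two precise events at a hot lock site:

#include <stdio.h>

int main(void)
{
    double xsnp_hitm  = 1.2e4;  /* MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS */
    double lock_loads = 1.5e5;  /* MEM_UOPS_RETIRED.LOCK_LOADS_PS             */

    /* Values above roughly 5% on a hot lock are worth investigating. */
    double lock_contention = 100.0 * xsnp_hitm / lock_loads;

    printf("%%LOCK.CONTENTION = %.2f\n", lock_contention);
    return 0;
}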
Split store penalty is more difficult to find using an estimated cost because, in typical cases, stores do not push out the
retirement of instructions. To detect a significant amount of split stores, divide their number by the total number of
stores retired at that IP.
SPLIT.STORE.RATIO =
MEM_UOPS_RETIRED.SPLIT_STORES_PS / MEM_UOPS_RETIRED.ANY_STORES_PS;
4k Aliasing
A 4k aliasing conflict between loads and stores causes a reissue on the load. Five cycles is used as an estimate in the
model below.
Required Events:
• LD_BLOCKS_PARTIAL.ADDRESS_ALIAS - Counts the number of loads that have partial address match with
preceding stores, causing the load to be reissued.
Usages of events:
%4KALIAS.COST =
100 * LD_BLOCKS_PARTIAL.ADDRESS_ALIAS * 5 / CPU_CLK_UNHALTED.THREAD;
Large walk durations, of hundreds of cycles, are an indication that the page tables have been thrown out of the LLC.
To determine the average cost of a page walk use the following ratio:
STLB.LOAD.MISS.AVGCOST =
DTLB_LOAD_MISSES.WALK_DURATION /
DTLB_LOAD_MISSES.WALK_COMPLETED;
To a lesser extent than loads, STLB misses on stores can be a bottleneck. If the store itself is a large bottleneck, cycles
will tag to the next IP after the store.
%STLB.STORE.MISS =
100 * MEM_UOPS_RETIRED.STLB_MISS_STORES_PS/
MEM_UOPS_RETIRED.ANY_STORES_PS;
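For illustration, the two STLB ratios above can be evaluated together; the counter totals in the following C sketch are hypothetical.

#include <stdio.h>

int main(void)
{
    double walk_duration    = 9.0e6;  /* DTLB_LOAD_MISSES.WALK_DURATION       */
    double walk_completed   = 4.5e4;  /* DTLB_LOAD_MISSES.WALK_COMPLETED      */
    double stlb_miss_stores = 2.0e4;  /* MEM_UOPS_RETIRED.STLB_MISS_STORES_PS */
    double any_stores       = 5.0e7;  /* MEM_UOPS_RETIRED.ANY_STORES_PS       */

    /* Average cost of a load page walk, in cycles per completed walk. */
    printf("STLB.LOAD.MISS.AVGCOST = %.1f cycles/walk\n",
           walk_duration / walk_completed);

    /* Share of retired stores that missed the STLB, as a percentage. */
    printf("%%STLB.STORE.MISS = %.2f\n",
           100.0 * stlb_miss_stores / any_stores);
    return 0;
}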
Increasing data locality reduces DTLB/STLB misses. One may consider using a commercial-grade memory allocator
to improve data locality. Compilers which offer profile-guided optimizations may reorder global variables to increase
data locality, if the compiler can operate on the whole module. For issues with a large amount of time spent in page
walks, server and HPC applications may be able to use large pages for data.
22.5.5.2 Assists
Assists usually involve the microcode sequencer, which helps handle the assist. Determining the number of cycles
where micro-ops are delivered from the microcode sequencer is often a good methodology to determine the total
cost of the assist. If the overall cost of assists is high, a breakdown of assists into specific types will be useful.
• Estimating the total cost of assists using microcode sequencer cycles:
%ASSISTS.COST =
100 * IDQ.MS_CYCLES / CPU_CLK_UNHALTED.THREAD;
Floating-point assists
Denormal inputs for X87 instructions require an FP assist, potentially costing hundreds of cycles.
%FP.ASSISTS =
100 * FP_ASSIST.ANY / INST_RETIRED.ANY;
after the branch misprediction. This study can be performed at the process, module, function or instruction
granularity.
Usages of Events:
Use the following ratio to estimate the cost of mispredicted branches:
%BR.MISP.COST =
20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD;
Usages of Counters
The event IDQ_UOPS_NOT_DELIVERED counts when the maximum of four micro-ops are not delivered to the rename
stage while it is requesting micro-ops. When the pipeline is backed up, the rename stage does not request any further
micro-ops from the front end. The diagram above shows how this event tracks micro-ops between the micro-op
queue and the rename stage.
You can use the IDQ_UOPS_NOT_DELIVERED event to break down the distribution of cycles when 0, 1, 2, or 3 micro-ops
are delivered from the front end.
Percentage of cycles the front end is effective, or execution is back end bound:
%FE.DELIVERING =
100 * (CPU_CLK_UNHALTED.THREAD -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering three micro-ops per cycle:
%FE.DELIVER.3UOPS =
100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering two micro-ops per cycle:
%FE.DELIVER.2UOPS =
100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering one micro-op per cycle:
%FE.DELIVER.1UOPS =
100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE -
IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering zero micro-ops per cycle:
%FE.DELIVER.0UOPS =
100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) /
CPU_CLK_UNHALTED.THREAD;
Average Micro-ops Delivered per Cycle: This ratio assumes that the front end could potentially deliver four micro-ops
per cycle when bound in the back end.
AVG.uops.per.cycle =
(4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) +
(%FE.DELIVER.1UOPS)) / 100
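The following C sketch evaluates the delivery-distribution ratios and the average micro-ops delivered per cycle from hypothetical IDQ_UOPS_NOT_DELIVERED cycle counts collected over one interval; it mirrors the formulas above.

#include <stdio.h>

int main(void)
{
    double clocks = 1.0e9;  /* CPU_CLK_UNHALTED.THREAD                           */
    double le3    = 3.0e8;  /* IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE */
    double le2    = 2.2e8;  /* IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE */
    double le1    = 1.5e8;  /* IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE */
    double zero   = 0.9e8;  /* IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE   */

    double delivering = 100.0 * (clocks - le3) / clocks; /* 4 uops or BE bound */
    double d3 = 100.0 * (le3 - le2) / clocks;
    double d2 = 100.0 * (le2 - le1) / clocks;
    double d1 = 100.0 * (le1 - zero) / clocks;
    double d0 = 100.0 * zero / clocks;

    double avg_uops_per_cycle =
        (4.0 * delivering + 3.0 * d3 + 2.0 * d2 + 1.0 * d1) / 100.0;

    printf("4:%.1f%% 3:%.1f%% 2:%.1f%% 1:%.1f%% 0:%.1f%%  avg=%.2f uops/cycle\n",
           delivering, d3, d2, d1, d0, avg_uops_per_cycle);
    return 0;
}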
Seeing the distribution of the micro-ops being delivered in a cycle is a hint at the front end bottlenecks that might be
occurring. Issues such as LCPs and penalties from switching from the decoded ICache to the legacy decode pipeline
tend to result in zero micro-ops being delivered for several cycles. Fetch bandwidth issues and decoder stalls result in
less than four micro-ops delivered per cycle.
A typical distribution is approximately 80% of the micro-ops coming from the Decoded ICache, 15% coming from
legacy decode pipeline and 5% coming from the microcode sequencer. Excessive micro-ops coming from the legacy
decode pipeline can be a warning sign that the Decoded ICache is not working effectively. A large portion of micro-ops
coming from the microcode sequencer may be benign, such as complex instructions, or string operations, but can
also be due to code assists handling undesired situations like Intel SSE to Intel AVX code transitions.
Usage of Counters
Percentage of micro-ops coming from Decoded ICache
%UOPS.DSB =
IDQ.DSB_UOPS / ALL_IDQ_UOPS;
Percentage of micro-ops coming from legacy decoder pipeline:
%UOPS.MITE =
IDQ.MITE_UOPS / ALL_IDQ_UOPS;
Percentage of micro-ops coming from micro-sequencer:
%UOPS.MS =
IDQ.MS_UOPS / ALL_IDQ_UOPS;
ALL_IDQ_UOPS = (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS);
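A minimal C sketch of the micro-op source breakdown, assuming hypothetical totals for the three IDQ delivery counters; the results are printed as percentages of all IDQ micro-ops.

#include <stdio.h>

int main(void)
{
    double dsb_uops  = 8.0e8;  /* IDQ.DSB_UOPS  */
    double mite_uops = 1.5e8;  /* IDQ.MITE_UOPS */
    double ms_uops   = 0.5e8;  /* IDQ.MS_UOPS   */

    double all_idq_uops = dsb_uops + mite_uops + ms_uops;

    printf("UOPS.DSB  = %.1f%%\n", 100.0 * dsb_uops  / all_idq_uops);
    printf("UOPS.MITE = %.1f%%\n", 100.0 * mite_uops / all_idq_uops);
    printf("UOPS.MS   = %.1f%%\n", 100.0 * ms_uops   / all_idq_uops);
    return 0;
}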
If your application is not bound in the front end then whether micro-ops are coming from the legacy decode pipeline
or Decoded ICache is of lesser importance. Excessive micro-ops coming from the microcode sequencer are worth
investigating further to see if assists might be a problem.
Cases to investigate are listed below:
• (%FE_BOUND > 30%) and (%UOPS.DSB < 70%)
A threshold of 30% defines a “front end bound” case. This threshold may be applicable to many
situations, but may also vary somewhat across different workloads.
— Investigate why micro-ops are not coming from the Decoded ICache.
— Investigate issues which can impact the legacy decode pipeline.
• (%FE_BOUND > 30%) and (%UOPS.DSB > 70%)
— Investigate switches from Decoded ICache to legacy decode pipeline since it may be switching to run portions
of code that are too small to be effective.
— Look at the amount of bad speculation, since branch mispredictions still impact FE performance.
— Determine the average number of micro-ops being delivered per 32-byte chunk hit. If there are many taken
branches from one 32-byte chunk into another, it impacts the micro-ops being delivered per cycle.
— Micro-op delivery from the Decoded ICache may be an issue which is not covered.
• (%FE_BOUND < 20%) and (%UOPS.MS > 25%)
A threshold of 20% defines a “front end not bound” case. This threshold may be applicable to many
situations, but may also vary somewhat across different workloads.
The following steps can help determine why micro-ops came from the microcode, in order of most common to
least common.
— Long latency instructions - Any instruction over four micro-ops starts the microcode sequencer. Some
instructions such as transcendentals can generate many micro-ops from the microcode.
— String operations - string operations can produce a large amount of microcode. In some cases there are
assists which can occur due to string operations such as REP MOVSB with trip count greater than 3, which
costs 70+ cycles.
— Assists - See Section 22.5.5.2.
Required Events
The Decoded ICache events all have large skids and the exact instruction where they are tagged is usually not the
source of the problem so only look for this issue at the process, module and function granularities.
• DSB2MITE_SWITCHES.PENALTY_CYCLES - Counts the cycles attributed to the switch from the Decoded ICache to
the legacy decode pipeline, excluding cycles when the micro-op queue cannot accept micro-ops because it is
back end bound.
• DSB2MITE_SWITCHES.COUNT - Counts the number of switches between the Decoded ICache and the legacy
decode pipeline.
• DSB_FILL.ALL_CANCEL - Counts when fills to the Decoded ICache are canceled.
• DSB_FILL.EXCEED_DSB_LINES - Counts when a fill is canceled because the allocated lines for the Decoded ICache
have exceeded three for the 32-byte chunk.
Usage of Events
Since these studies involve front end events, do not try to tag the event to a specific instruction.
• Determining cost of switches from the Decoded ICache to the legacy decode pipeline.
%DSB2MITE.SWITCH.COST =
100 * DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD;
• Determining the average cost per Decoded ICache switch to the legacy front end:
AVG.DSB2MITE.SWITCH.COST =
DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT;
• Determining Causes of Misses in the Decoded ICache
— There are no partial hits in the Decoded ICache. If any micro-op that is part of that lookup on the 32-byte
chunk is missing, a Decoded ICache miss occurs on all micro-ops for that transaction.
— There are three primary reasons for missing micro-ops in the Decoded ICache:
• Portions of a 32-byte chunk of code were not able to fit within three ways of the Decoded ICache.
• A frequently run portion of your code section is too large for the Decoded ICache. This case is more
common on server applications since client applications tend to have a smaller set of code which is
“hot”.
• The Decoded ICache is getting flushed for example when an ITLB entry is evicted.
— To determine if a portion of the 32-byte code is unable to fit into three lines within the Decoded ICache, use
the DSB_FILL.EXCEED_DSB_LINES event at the process, module, or function granularities.
%DSB.EXCEED.WAY.LIMIT =
100 * DSB_FILL.EXCEED_DSB_LINES / DSB_FILL.ALL_CANCEL;
Required Events
ICACHE.MISSES - Counts the number of instruction byte fetches that miss the ICache.
Usage of Events
To determine whether ICache misses are causing the issue, compare them to the instructions retired event count,
using the same granularity (process, module, or function). Anything over 1% of instructions retired can be a significant
issue.
ICACHE.PER.INST.RET =
ICACHE.MISSES / INST_RETIRED.ANY;
If ICache misses are causing a significant problem, try to reduce the size of your hot code section using profile-guided
optimizations. Most compilers have options for text reordering, which helps reduce the number of pages crossed by
hot code paths and, to a lesser extent, the total number of pages your application covers.
If the application makes significant use of macros, try to either convert them to functions, or use intelligent linking to
eliminate repeated code.
https://perfmon-events.intel.com/.
Although the methodology can be applied to different microarchitectures, this section uses performance events
available in processors based on Intel Core microarchitecture for simplicity.
Performance tuning usually centers around reducing the time it takes to complete a well-defined workload.
Performance events can be used to measure the elapsed time between the start and end of a workload. Thus,
reducing elapsed time of completing a workload is equivalent to reducing measured processor cycles.
The drill-down methodology can be summarized as four phases of performance event measurements to help
characterize interactions of the code with key pipe stages or sub-systems of the microarchitecture. The relation of the
performance event drill-down methodology to the software tuning feedback loop is illustrated in Figure 22-16.
Figure 22-16. Performance Events Drill-Down and Software Tuning Feedback Loop
Typically, the logic in performance monitoring hardware measures microarchitectural conditions that vary across
different counting domains: cycles, micro-ops, address references, instances, and so on. The drill-down
methodology attempts to provide an intuitive, cycle-based view across different phases by making suitable
approximations that are described below:
• Total cycle measurement — This is the start-to-finish view of the total number of cycles to complete the
workload of interest. In typical performance tuning situations, the metric Total_cycles can be measured by the
event CPU_CLK_UNHALTED.CORE. See: https://perfmon-events.intel.com/.
• Cycle composition at issue port — The reservation station (RS) dispatches micro-ops for execution so that
the program can make forward progress. Hence the metric Total_cycles can be decomposed into two
exclusive components: Cycles_not_issuing_uops, representing cycles that the RS is not issuing micro-ops for
execution, and Cycles_issuing_uops, representing cycles that the RS is issuing micro-ops for execution. The latter
component includes micro-ops in the architected code path or in the speculative code path.
• Cycle composition of OOO execution — The out-of-order engine provides multiple execution units that
can execute micro-ops in parallel. If one execution unit stalls, it does not necessarily imply the program execution
is stalled. Our methodology attempts to construct a cycle-composition view that approximates the progress of
program execution. The three relevant metrics are: Cycles_stalled, Cycles_not_retiring_uops, and
Cycles_retiring_uops.
• Execution stall analysis — From the cycle compositions of overall program execution, the programmer can
narrow down the selection of performance events to further pin-point unproductive interaction between the
workload and a micro-architectural sub-system.
When cycles lost to a stalled microarchitectural sub-system or to unproductive speculative execution are identified,
the programmer can use the VTune Analyzer to correlate each significant performance impact to a source code location.
If the performance impact of stalls or mispredictions is insignificant, VTune can also identify the source locations of hot
functions, so the programmer can evaluate the benefits of vectorization on those hot functions.
An estimation of overall L2 hit impact obtained by multiplying the L2 hit latency with the number of L2 hit references
ignores the OOO engine’s ability to handle multiple outstanding load misses.
• L1 DTLB Miss Impact — The cost of a DTLB lookup miss is about 10 cycles. The event
MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
• LCP Impact — The overall impact of LCP stalls can be directly measured by the event ILD_STALLS. The event
ILD_STALLS measures the number of times the slow decoder was triggered; the cost of each instance is 6 cycles.
• Store forwarding stall Impact — When a store forwarding situation does not meet address or size
requirements imposed by hardware, a stall occurs. The delay varies for different store forwarding stall situations.
Consequently, there are several performance events that provide fine-grain specificity to detect different store-
forwarding stall conditions. These include:
— A load blocked by a preceding store to an unknown address: This situation can be measured by the event
Load_Blocks.Sta. The per-instance cost is about 5 cycles.
— A load partially overlapping with a preceding store, or a 4-KByte aliased address between a load and a preceding
store: these two situations can be measured by the event Load_Blocks.Overlap_store.
— A load spanning across a cache line boundary: This can be measured by Load_Blocks.Until_Retire. The per-
instance cost is about 20 cycles.
Code Locality
• Instruction Fetch Stall: CYCLES_L1I_MEM_STALLED / CPU_CLK_UNHALTED.CORE * 100
— The Instruction Fetch Stall ratio is the percentage of cycles during which the Instruction Fetch Unit (IFU)
cannot provide cache lines for decoding due to cache and Instruction TLB (ITLB) misses. A high value for this
ratio indicates potential opportunities to improve performance by reducing the working set size of code
pages and instructions being executed, hence improving code locality.
• ITLB Miss Rate: ITLB_MISS_RETIRED / INST_RETIRED.ANY
— A high ITLB Miss Rate indicates that the executed code is spread over too many pages and causes many
instruction TLB misses. Retired ITLB misses cause the pipeline to naturally drain, while the miss stalls
fetching of more instructions.
• L1 Instruction Cache Miss Rate: L1I_MISSES / INST_RETIRED.ANY
— A high value for L1 Instruction Cache Miss Rate indicates that the code working set is bigger than the L1
instruction cache. Reducing the code working set may improve performance.
• L2 Instruction Cache Line Miss Rate: L2_IFETCH.SELF.I_STATE / INST_RETIRED.ANY
— An L2 Instruction Cache Line Miss Rate higher than zero indicates that instruction cache line misses from the L2
cache may have a noticeable impact on program performance.
22.8.1.4 Macro-fusion
• Macro-Fusion: UOPS_RETIRED.MACRO_FUSION / INST_RETIRED.ANY
— The Macro-Fusion ratio calculates how many of the retired instructions were fused to a single micro-op. You
may find this ratio is high for a 32-bit binary executable but significantly lower for the equivalent 64-bit
binary, and the 64-bit binary performs slower than the 32-bit binary. A possible reason is the 32-bit binary
benefited from macro-fusion significantly.
— Floating-point assist is activated for non-regular FP values like denormals and NaNs. FP assist is extremely
slow compared to regular FP execution. Different assists incur different penalties. FP Assist Performance
Impact estimates the overall impact.
• Divider Busy: IDLE_DURING_DIV / CPU_CLK_UNHALTED.CORE * 100
— A high value for the Divider Busy ratio indicates that the divider is busy and no other execution unit or load
operation is in progress for many cycles. Using this ratio ignores L1 data cache misses and L2 cache misses
that can be executed in parallel and hide the divider penalty.
• Floating-Point Control Word Stall Ratio: RESOURCE_STALLS.FPCW / CPU_CLK_UNHALTED.CORE * 100
— Frequent modifications to the Floating-Point Control Word (FPCW) might significantly decrease perfor-
mance. The main reason for changing FPCW is for changing rounding mode when doing FP to integer conver-
sions.
Data Bus Utilization is the percentage of bus cycles used for transferring data among all bus agents in the system,
including processors and memory. High bus utilization indicates heavy traffic between the processor(s) and memory.
Memory sub-system latency can impact the performance of the program. For compute-intensive applications with
high bus utilization, look for opportunities to improve data and code locality. For other types of applications (for
example, copying large amounts of data from one memory area to another), try to maximize bus utilization.
29. Bus Not Ready Ratio: BUS_BNR_DRV.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
Bus Not Ready Ratio estimates the percentage of bus cycles during which new bus transactions cannot start. A high
value for Bus Not Ready Ratio indicates that the bus is highly loaded. As a result of the Bus Not Ready (BNR) signal,
new bus transactions might defer and their latency will have higher impact on program performance.
30. Burst Read in Bus Utilization: BUS_TRANS_BRD.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for Burst Read in Bus Utilization indicates that bus and memory latency of burst read operations may
impact the performance of the program.
31. RFO in Bus Utilization: BUS_TRANS_RFO.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for RFO in Bus Utilization indicates that latency of Read For Ownership (RFO) transactions may impact the
performance of the program. RFO transactions may have a higher impact on the program performance compared to
other burst read operations (for example, as a result of loads that missed the L2). See also Ratio 31.
APPENDIX A
APPLICATION PERFORMANCE TOOLS
Intel offers an array of application performance tools that are optimized to take advantage of the Intel architecture
(IA)-based processors. This appendix introduces these tools and explains their capabilities for developing the most
efficient programs without having to write assembly code.
The following performance tools are available.
• Compilers
— Intel® C++ Compiler: a high-performance, optimized C and C++ cross compiler with the capability of
offloading compute-intensive code to Intel® Many Integrated Core Architecture (Intel® MIC Architecture) as
well as Intel® HD Graphics, and executing on multiple execution units by using Intel® Cilk™ parallel extensions.
— Intel® Fortran Compiler: a high-performance, optimized Fortran compiler.
• Performance Libraries — a set of software libraries optimized for Intel architecture processors.
— Intel® Integrated Performance Primitives (Intel® IPP): performance building blocks to boost embedded
system performance.
— Intel® Math Kernel Library (Intel® MKL): a set of highly optimized linear algebra, Fast Fourier Transform (FFT),
vector math, and statistics functions.
— Intel® Threading Building Blocks (Intel® TBB): a C++ template library for creating high performance,
scalable parallel applications.
— Intel® Data Analytics Acceleration Library (Intel® DAAL): C++ and Java API library of optimized analytics
building blocks for all data analysis stages, from data acquisition to data mining and machine learning.
Essential for engineering high performance Big Data applications.
• Performance Profilers — performance profilers collect, analyze, and display software performance data for
tuning CPU, GPU, threading, vectorization and MPI parallelism from the system-wide view down to a specific line
of code.
— Intel® VTune™ Amplifier XE: performance profiler.
— Intel® Graphics Performance Analyzers (Intel® GPA) - a set of performance analyzers for graphics applications.
— Intel® Advisor: vectorization optimization and thread prototyping.
— Intel® Trace Analyzer and Collector: MPI communications performance profiler and correctness checker.
• Debuggers
— Intel® Inspector: memory and thread debugger.
— Intel® Application Debugger.
— Intel® JTAG Debugger.
— Intel® System Debugger.
• Cluster Tools
— Intel® MPI Library: high-performance MPI library.
— Intel® MPI Benchmarks: a set of MPI kernel tests to verify the performance of your cluster or MPI implemen-
tation.
The performance tools listed above can be found in the following product suites.
• Intel® Parallel Studio XE1
— Intel® Media Server Studio.
— Intel® Systems Studio.
A.1 COMPILERS
Intel compilers support several general optimization settings, including /O1, /O2, /O3, and /fast. Each of them
enables a number of specific optimization options. In most cases, /O2 is recommended over /O1 because the /O2
option enables function expansion, which helps programs that have many calls to small functions. The /O1 may
sometimes be preferred when code size is a concern. The /O2 option is on by default.
The /Od (-O0 on Linux) option disables all optimizations. The /O3 option enables more aggressive optimizations, most
of which are effective only in conjunction with processor-specific optimizations described below.
The /fast option maximizes speed across the entire program. For most Intel 64 and IA-32 processors, the “/fast”
option is equivalent to “/O3 /Qipo /Qprec-div- /fp:fast=2 /QxHost” on Windows*, “-ipo -O3 -no-prec-div -static -fp-
model fast=2 -xHost” on Linux, and “-ipo -mdynamic-no-pic -O3 -no-prec-div -fp-model fast=2 -xHost” on OS X*.
All the command-line options are described in Intel® C++ Compiler documentation.
NOTE
The compiler issues a warning that the dynamic information corresponds to a modified function.
• Repeat the instrumentation compilation if you make many changes to your source files after execution and
before feedback compilation.
For more on code optimization options, see the Intel C++ Compiler documentation.
new architectural features of future generations of Intel processors simply by relinking the application with upgraded
versions of the libraries.
The library set includes the Intel Integrated Performance Primitives (Intel IPP), Intel Math Kernel Library (Intel MKL)
and Intel Threading Building Blocks (Intel TBB).
1. Intel® Trace Analyzer and Collector is only available as part of Intel® Cluster Studio or Intel® Cluster Studio XE.
Intel® architecture-based cluster systems, features a high degree of compatibility with current standards, and
includes trace file idealization and comparison, counter data displays, performance assistant and an MPI correctness
checking library. Analyze MPI performance, speed up parallel application runs, locate hotspots and bottlenecks, and
compare trace files with graphics providing extensively detailed analysis and aligned timelines.
APPENDIX B
RUNTIME PERFORMANCE OPTIMIZATION BLUEPRINT: INTEL®
ARCHITECTURE OPTIMIZATION WITH LARGE CODE PAGES
This appendix provides a runtime optimization blueprint illustrating how the performance of runtimes can be
improved by using large code pages.
B.1 OVERVIEW
Modern microprocessors support multiple page sizes for program code. Intel platforms have supported 4 KB and 2
MB pages for instructions as far back as 2011 in the Intel® Xeon® E5 processor (based on Ivy Bridge microarchitecture).
Nevertheless, most programs use only one page size, which is the default of 4 KB. On Linux*, all applications are
loaded into 4 KB memory pages by default. When examining performance bottlenecks for workloads on language
runtimes, high stalls due to ITLB misses are found. This is largely due to the runtimes using only 4 KB pages for
instructions.
When the processor does not find an entry in the ITLB, it has to do a page walk and populate the entry. A miss in the
L1 (first level) ITLBs results in a very small penalty that can usually be hidden by the Out of Order (OOO) execution. A
miss in the STLB results in the page walker being invoked; this penalty can be noticeable in the execution. During this
process, the processor is stalled. The following table lists the TLB sizes across different Intel product generations.
Table B-1. Core TLB Structure Size and Organization Across Multiple Intel Product Generations
L2 (Unified) STLB:
• Sandy Bridge / Ivy Bridge microarchitecture: 4K: 512 entries, 4-way.
• Haswell microarchitecture: 4K+2M shared: 1024 entries, 8-way.
• Broadwell microarchitecture: 4K+2M shared: 1536 entries, 6-way; 1G: 16 entries, 4-way.
• Skylake / Cascade Lake microarchitecture: 4K+2M shared: 1536 entries, 12-way; 1G: 16 entries, 4-way.
From Table B-1 we can see that 2M page entries are shared in the L2 Unified TLB from Haswell microarchitecture
onwards.
Let us look at a concrete example. The Ghost.js workload has an ITLB miss stall of 10.6% when run in cluster mode
across the whole system. A sampling of these two counters, along with Equation 1, enables us to determine the
percentage of ITLB miss stall. A 10.6% stall due to ITLB misses is significant for this workload.
Measuring ITLB miss stall is critical to determine if your workload on a runtime has an ITLB performance issue.
In Section B.6, we show that even while running in single instance mode, Ghost.js has a 6.47% stall due to ITLB misses.
When large pages are implemented, the performance improves by 5%, the ITLB misses are reduced by 30%, and
the ITLB miss stall is reduced from 6.47% to 2.87%.
Another key metric is the ITLB Misses Per Kilo Instructions (MPKI). This metric is a normalization of the ITLB misses
against number of instructions, and it allows comparison between different systems. This metric is calculated using
two PMU counters: ITLB_MISSES.WALK_COMPLETED and INST_RETIRED.ANY, as described in Equation 2. There are
distinct PMU counters for large pages and 4KB pages, so Equation 2 shows the calculation for each PMU counter,
respectively.
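As an illustration of the MPKI calculation, the following C sketch computes the overall ITLB MPKI and, assuming the 4 KB walk counter ITLB_MISSES.WALK_COMPLETED_4K is available on the target core, the 4K-only MPKI; the counter totals are hypothetical.

#include <stdio.h>

int main(void)
{
    double walks_completed    = 6.0e5;   /* ITLB_MISSES.WALK_COMPLETED    */
    double walks_completed_4k = 5.7e5;   /* ITLB_MISSES.WALK_COMPLETED_4K */
    double inst_retired       = 1.0e10;  /* INST_RETIRED.ANY              */

    /* Misses per kilo (1000) retired instructions. */
    double itlb_mpki    = walks_completed    / inst_retired * 1000.0;
    double itlb_4k_mpki = walks_completed_4k / inst_retired * 1000.0;

    printf("ITLB MPKI = %.4f, ITLB 4K MPKI = %.4f\n", itlb_mpki, itlb_4k_mpki);
    return 0;
}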
Upon calculating the MPKI for the runtime workloads in Figure B-1, we find that the ITLB MPKI and the ITLB 4K MPKI
are very close to each other across the workloads. We can thus infer that most of the misses are from 4KB page walks.
Another observation is that the benchmarks have lower ITLB MPKI than large real world software, which means that
optimization decisions made on benchmarks might not translate to open source software.
Having ITLB MPKI also enables us to do comparisons across different systems and workloads. Table B-3 compiles the
ITLB MPKI across various workloads published1,2. We can observe that there is not a direct correlation of binary size
to ITLB MPKI. Some smaller binaries, such as MySQL, have one of the largest ITLB MPKI. When multiple threads are
active, the ITLB MPKI almost doubles for both Ghost.js (single instance vs. multi instance) and Clang (-j1 vs -j4). The
1. Ottoni, G., & Bertrand, M. (2017). Optimizing Function Placement for Large-Scale Data-Center Applications. CGO
2017.
2. Lavaee, R., Criswell, J., & Ding, C. (Oct 2018). Codestitcher: Inter-Procedural Basic Block Layout Optimization.
arXiv:1810.00905v1 [cs.PL].
ITLB MPKI is much lower on newer servers (using Intel® Xeon® 8180 processors) as compared to older generation
servers (using Intel® Xeon® E5 processors).
Table B-3. ITLB MPKI and Executable Sizes Across Various Workloads
System: Dual 2.8 GHz Intel® Xeon® E5-2680 v2 (based on Ivy Bridge microarchitecture) server platform, with 10 cores
and 25 MB LLC per processor.
• AdIndexer: Text 186 MB, ITLB MPKI 0.48
• HHVM: Text 133 MB, ITLB MPKI 1.28
• Multifeed: Text 199 MB, ITLB MPKI 0.40
• TAO: Text 70 MB, ITLB MPKI 3.08
System: Two dual-core Intel® Core™ i5-4278U (based on Haswell microarchitecture) processors running at 2.60 GHz.
32 KB instruction cache and a 256 KB second-level unified cache private to each core; both caches are 8-way set
associative. The last-level cache is 3 MB, 12-way set associative, and is shared by all cores.
• MySQL: Text 15 MB, ITLB MPKI 9.35
• Clang -j4: Text 50 MB, ITLB MPKI 2.23
• Clang -j1: Text 50 MB, ITLB MPKI 1.01
• Firefox: Text 81 MB, ITLB MPKI 1.54
• Apache PHP (w opcache): Text 16 MB, ITLB MPKI 0.33
• Apache PHP (w/o opcache): Text 16 MB, ITLB MPKI 0.96
• Python: Text 2 MB, ITLB MPKI 0.19
System: Intel® Xeon® Platinum 8180 (based on Skylake Server microarchitecture) with 112 cores @ 2.5 GHz (except
MediaWiki/HHVM, which is on a SKX-D with 18 cores).
• SPECjEnterprise2018-WP-VM: ITLB MPKI 0.23
• SPECjEnterprise2018-WP-Native: ITLB MPKI 0.18
• SPECjbb2015: ITLB MPKI 0.09
• Wordpress/PHP: ITLB MPKI 0.49
• MediaWiki/HHVM: ITLB MPKI 0.59
• Ghost.js/Multi: ITLB MPKI 0.56
• Ghost.js/Single: ITLB MPKI 0.23
• Python Django (Instagram): ITLB MPKI 0.14
Figure B-2. measure-perf-metric.sh Tool Usage for Process ID 69772 for 30 Seconds
Use the command “measure-perf-metric.sh –h” to display help messages for using the tool. Refer to the README.md
file, which describes how to add new metrics to the tool.
Figure B-3. Using measure-perf-metric.sh with -r to Determine Where TLB Misses are Coming From
While Figure B-3 shows where the TLB miss overheads are coming from in terms of stalled cycles, we further analyze
the latest upstream node.js (14.0.0-pre) to find the overhead in terms of ITLB miss counts using the perf record -e
frontend_retired.tlb_miss command. We extract the report using the perf script command and filter it based
on the ITLB miss addresses. We find that 17.6% of ITLB misses are from JITted code and 72.8% from the node
binary. We also find that “built-in” functions, which are part of the node binary, account for 19.5% of the total ITLB misses.
1. NodeJS Foundation. (2019, August). Node.js JavaScript Runtime. Retrieved from Node.js JavaScript Runtime:
https://nodejs.org/en
On Linux and Windows systems, applications are loaded into memory in 4 KB pages, which is the default on most
systems. One way to reduce ITLB misses is to use a larger page size, which has two benefits. The first benefit is
that fewer translations are required, leading to fewer page walks. The second benefit is that less space is used in the
cache for storing translations, allowing more space to be available for the application code. Some older systems, such
as those using Intel® Xeon® E5-2680 v2 processors (based on Ivy Bridge microarchitecture), have only 8 huge-page ITLB
entries that are organized in a single level, so mapping all the text to large pages could cause a regression. However, on
Intel® Xeon® Platinum 8180 processors (based on Skylake Server microarchitecture), the STLB is shared by both 4KB and
2MB pages and has 1536 entries.
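For illustration only (this is not the blueprint's reference implementation), the following minimal C sketch shows one way a runtime on Linux can request 2 MB pages for a region it manages: it reserves a 2 MB-aligned anonymous mapping and asks for transparent huge pages with madvise(MADV_HUGEPAGE). Remapping an existing .text segment additionally requires copying the code and restoring execute permissions, which is omitted here.

#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define TWO_MB (2UL * 1024 * 1024)

int main(void)
{
    size_t len = 8 * TWO_MB;

    /* Over-allocate so the region can be aligned to a 2 MB boundary. */
    void *raw = mmap(NULL, len + TWO_MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    void *aligned = (void *)(((uintptr_t)raw + TWO_MB - 1) & ~(TWO_MB - 1));

    /* Ask the kernel to back this region with 2 MB pages where possible. */
    if (madvise(aligned, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* ... copy hot code here, then mprotect() the region to PROT_READ|PROT_EXEC ... */

    munmap(raw, len + TWO_MB);
    return EXIT_SUCCESS;
}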
B.3 SOLUTION
• Using an explicit option or flag in the runtime: The Node.js runtime has an implementation that is exposed using
--enable-largepages=on when you run Node.js. The PHP runtime has a flag that can be added to the .ini file. For
details, see: https://www.php.net/manual/en/opcache.configuration.php.
1. Aleksey Shipilev, Redhat. (2019, 03 03). Transparent Huge Pages. Retrieved from JVM Anatomy Quarks:
https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/.
1. Google V8 JavaScript. (2019, August). V8 JavaScript Engine. Retrieved from V8 JavaScript Engine: https://v8.dev/
B.5 LIMITATIONS
There are several limitations to be aware of when using large pages:
• Fragmentation is an issue that is introduced when using large pages. If there is insufficient contiguous memory to
assemble the large page, the operating system tries to reorganize the memory to satisfy the large page request,
which can result in longer latencies. This can be mitigated by allocating large pages explicitly ahead of time. The
reference code does not have support for explicit huge pages.
• Another issue is the additional execution time it takes to perform the algorithm in the Intel reference code. For
short running programs, it adds additional execution time and might result in a slowdown rather than a speedup.
• We have recently encountered an issue when the current implementation is used with multiple instances of the
same application. We have a report that it increases the LLC misses. We think this is due to the kernel not sharing
the code after the remapping. We are investigating and working on a solution.
Tools like perf are no longer able to follow the symbols after the .text is remapped (Figure B-6) and the perf output will
not have the symbols. You will need to provide the static symbols to perf in
/tmp/perf-PID.map at startup.
Figure B-6. perf Output Will Not Have the Proper Symbols After Large Page Mapping
1. Ghost Team. (n.d.). Ghost: The professional publishing platform. Retrieved from Ghost Non Profit Web Site:
https://ghost.org/.
2. Google Web Tooling. (2019, August). Web Tooling Benchmark. Retrieved from Web Tooling Benchmark:
https://github.com/v8/web-tooling-benchmark.
3. WikiMedia Foundation. (2019, August). Mediawiki Software. Retrieved from Mediawiki Software:
http://mediawiki.org/wiki/MediaWiki.
4. Xeon-D, I. (n.d.). Xeon-D. Retrieved from intel.com:
https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html.
Table B-4. Key Metrics for Ghost.js With and Without Large Pages
• Throughput: Requests Per Second (RPS): 134.32 with large pages, 127.23 without large pages (default); ratio 1.05.
• metric_ITLB_Misses(%): 2.87 with large pages, 6.47 without large pages (default); ratio 0.44.
Table B-5. Key Metrics for Web Tooling across Clear Linux and Ubuntu 18.04
• Throughput: 10.91 on Clear Linux, 10.80 on Ubuntu 18.04; ratio 1.01.
• metric_ITLB_Misses(%): 0.91 on Clear Linux, 2.21 on Ubuntu 18.04; ratio 0.41.
• ITLB_MISSES.WALK_COMPLETED_2M_4M: 241,215 on Clear Linux, 142,356 on Ubuntu 18.04; ratio 1.69.
• metric ITLB MPKI: 0.0298 on Clear Linux, 0.0604 on Ubuntu 18.04; ratio 0.49.
FRONTEND_RETIRED.ITLB_MISS counts retired instructions that experienced an ITLB (Instruction TLB) true miss, and we
can see that all of these counts are lower with large pages, with STLB_MISS reduced by 23%.
Figure B-7. Using Perf Record with -e frontend_retired.itlb_miss to Determine ITLB Misses
The output of perf script can be imported into FlameScope and we can visualize the ITLB misses. We can see that
some portions of the workload have much more ITLB misses than others. When we compare Figure B-8 and
Figure B-9 we can see that the heatmap is much sparser for the ITLB misses when we are using large pages in Node.js.
Figure B-8. Using FlameScope to Visualize the ITLB Misses Heatmap from the WebTooling Workload
Figure B-9. Using FlameScope to Visualize the ITLB Misses Heatmap from the WebTooling Workload When Run With Large Pages
We also visualize the ITLB miss counts for the v8 “Built-in” functions by extracting the ITLB misses associated with the
“Built-in” function from the perf script output. We plot the graph with the virtual address of the “Built-in”
functions on y-axis and time on the x-axis (Figure B-10 and Figure B-11). Similar to FlameScope graphs, ITLB misses
are sparser when we are using large pages in Node.js.
Figure B-10. Visualizing ITLB Miss Trends for “Built-in” Functions from the Ghost.js Workload
Figure B-11. Visualizing ITLB Miss Trends for “Built-in” Functions from the Ghost.js Workload
When Run With Large Pages
B.7 SUMMARY
This runtime optimization blueprint described the problem that runtimes have with high ITLB miss stalls, and
discussed how to diagnose the problem, as well as techniques and a reference implementation to solve the problem.
A case study showed the benefits of integrating the solution into a new runtime. The three examples in the case study
demonstrated that the use of 2M pages has the potential to improve ITLB Miss Stalls by 43%, ITLB Walks by 45%, and
ITLB MPKI by 46%.
System configuration used for the measurements (excerpt):
Kernel: 4.15.0-58-generic
Microcode: 0x200005e
Sockets: 2
NUMA Nodes: 2
L2 Cache: 1024K
L3 Cache: 39424K
CVE-2018-3639: OK (Mitigation: Speculative Store Bypass disabled via prctl and seccomp)
CVE-2018-3615: OK (your CPU vendor reported your CPU model as not vulnerable)
3.4.2.5 Optimization for Decoded ICache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18
3.4.2.6 Other Decoding Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5 OPTIMIZING THE EXECUTION CORE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5.1 Instruction Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
3.5.1.1 Integer Divide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
3.5.1.2 Using LEA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
3.5.1.3 ADC and SBB in Sandy Bridge Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
3.5.1.4 Bitwise Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
3.5.1.5 Variable Bit Count Rotation and Shift. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-24
3.5.1.6 Address Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
3.5.1.7 Clearing Registers and Dependency Breaking Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
3.5.1.8 Compares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-27
3.5.1.9 Using NOPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-27
3.5.1.10 Mixing SIMD Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
3.5.1.11 Spill Scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
3.5.1.12 Zero-Latency MOV Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
3.5.2 Avoiding Stalls in Execution Core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
3.5.2.1 Writeback Bus Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
3.5.2.2 Bypass Between Execution Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
3.5.2.3 Partial Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
3.5.2.4 Partial XMM Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
3.5.2.5 Partial Flag Register Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
3.5.2.6 Floating-Point/SIMD Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-34
3.5.3 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
3.5.4 Optimization of Partially Vectorizable Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
3.5.4.1 Alternate Packing Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
3.5.4.2 Simplifying Result Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
3.5.4.3 Stack Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
3.5.4.4 Tuning Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
3.6 OPTIMIZING MEMORY ACCESSES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
3.6.1 Load and Store Execution Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41
3.6.1.1 Making Use of Load Bandwidth in Sandy Bridge Microarchitecture . . . . . . . . . . . . . . . . . . . . . 3-41
3.6.1.2 L1D Cache Latency in Sandy Bridge Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
3.6.1.3 Handling L1D Cache Bank Conflict . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43
3.6.2 Minimize Register Spills . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
3.6.3 Enhance Speculative Execution and Memory Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
3.6.4 Store Forwarding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
3.6.4.1 Store-to-Load-Forwarding Restriction on Size and Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
3.6.4.2 Store-Forwarding Restriction on Data Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
3.6.5 Data Layout Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-50
3.6.6 Stack Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
3.6.7 Capacity Limits and Aliasing in Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-53
3.6.8 Mixing Code and Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-53
3.6.8.1 Self-Modifying Code (SMC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
3.6.8.2 Position Independent Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
3.6.9 Write Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-54
3.6.10 Locality Enhancement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55
3.6.11 Non-Temporal Store Bus Traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-56
3.7 PREFETCHING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
3.7.1 Hardware Instruction Fetching and Software Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-57
3.7.2 Hardware Prefetching for First-Level Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-58
3.7.3 Hardware Prefetching for Second-Level Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
3.7.4 Cacheability Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-60
3.7.5 REP Prefix and Data Movement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-61
3.7.6 Enhanced REP MOVSB and STOSB Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63
3.7.6.1 Fast Short REP MOVSB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63
3.7.6.2 Memcpy Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-63
3.7.6.3 Memmove Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
3.7.6.4 Memset Considerations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-65
3.8 REP STRING OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
3.8.1 Fast Zero Length REP MOVSB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
3.8.2 Fast Short REP STOSB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
3.8.3 Fast Short REP CMPSB and SCASB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
3.9 FLOATING-POINT CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-66
3.9.1 Guidelines for Optimizing Floating-Point Code. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
3.9.2 Floating-Point Modes and Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
3.9.2.1 Floating-Point Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-67
3.9.2.2 Dealing with Floating-Point Exceptions in x87 FPU Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-68
3.9.2.3 Floating-Point Exceptions in SSE/SSE2/SSE3 Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-68
3.9.3 Floating-Point Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
3.9.3.1 Rounding Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-69
3.9.3.2 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
3.9.4 x87 vs. Scalar SIMD Floating-Point Trade-Offs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-71
3.9.4.1 Scalar Intel® SSE/Intel® SSE2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-72
3.9.4.2 Transcendental Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-72
3.10 MAXIMIZING PCIE PERFORMANCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-72
3.10.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory and MMIO
Regions (P2P). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
3.11 SCALABILITY WITH CONTENDED LINE ACCESS IN 4TH GENERATION INTEL® XEON® SCALABLE
PROCESSORS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
3.11.1 Causes of Performance Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
3.11.2 Performance Bottleneck Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-73
3.11.3 Solutions for Performance Bottlenecks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-75
3.11.4 Case Study: SysBench/MariaDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-76
3.11.5 Scalability With False Sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
3.11.5.1 Causes of False Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
3.11.5.2 Detecting False Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-77
3.11.5.3 Fixing False Sharing and Additional Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-78
3.11.5.4 Case Study: DeathStarBench/hotelReservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-79
3.11.6 Instruction Sequence Slowdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-80
3.11.6.1 Causes of Instruction Sequence Slowdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-80
3.11.6.2 Detecting Instruction Sequence Slowdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81
3.11.6.3 Fixing Instruction Sequence Slowdowns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-81
3.11.7 Misprediction for Branches >2GB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-82
3.11.7.1 Causes of Branch Misprediction >2GB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-82
3.11.7.2 Detecting Branch Mispredictions >2GB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-82
3.11.7.3 Fixing Branch Mispredictions >2GB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-83
3.12 OPTIMIZING COMMUNICATION WITH PCI DEVICES ON 4TH GENERATION INTEL® XEON®
SCALABLE PROCESSORS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
3.12.1 Signaling Devices with Direct Move. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-84
3.12.1.1 MOVDIR64B: Additional Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-85
3.12.1.2 Streaming Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-85
3.13 SYNCHRONIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-85
3.13.1 User-Level Monitor, User-Level MWAIT, and TPAUSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-85
3.13.1.1 Checking for User-Level Monitor, MWAIT, and TPAUSE Support . . . . . . . . . . . . . . . . . . . . . . . 3-86
3.13.1.2 User-Level Monitor, User-Level MWAIT, and TPAUSE Operations . . . . . . . . . . . . . . . . . . . . . . 3-86
3.13.1.3 Recommended Usage of Monitor, MWAIT, and TPAUSE Operations . . . . . . . . . . . . . . . . . . . 3-86
CHAPTER 4
INTEL ATOM® PROCESSOR ARCHITECTURES
4.1 THE CRESTMONT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.1.1 Crestmont Microarchitecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
4.1.2 Predict and Fetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
4.1.3 Dynamic Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
4.1.4 Instruction Decode and the On-Demand Instruction Length Decoder (OD-ILD) . . . . . . . . . . . . . . . 4-4
4.1.5 Allocation and Retirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.1.6 The Out-of-Order and Execution Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.1.7 Cache and Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
4.1.8 Crestmont New Instruction Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.1 AVX-NE-CONVERT Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.2 AVX-IFMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.8.3 AVX-VNNI-INT8 Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.9 Legacy Intel® AVX1/Intel® AVX2 Instruction Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.9.1 256-bit Permute Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
4.1.9.2 256-bit Broadcast with 128-bit Memory Operand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-9
4.1.9.3 256-bit Insertion, Up-Conversion Instructions with 128-bit Memory Operand . . . . . . . . . . . . . .4-9
4.1.9.4 256-bit Variable Blend Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-9
4.1.9.5 256-bit Vector TEST Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-10
4.1.9.6 The GATHER Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-10
4.1.9.7 Masked Load and Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-10
4.1.9.8 ADX Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-10
4.2 THE TREMONT MICROARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
4.2.1 Tremont Microarchitecture Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-12
4.2.2 The Front End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-12
4.2.3 The Out-of-Order and Execution Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-13
4.2.4 Cache and Memory Subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-14
4.2.5 New Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-15
4.2.6 Tremont Microarchitecture Power Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-15
CHAPTER 5
CODING FOR SIMD ARCHITECTURES
5.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.1.1 Checking for MMX Technology Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
5.1.2 Checking for Intel® Streaming SIMD Extensions (Intel® SSE) Support . . . . . . . . . . . . . . . . . . . . . . . .5-2
5.1.3 Checking for Intel® Streaming SIMD Extensions 2 (Intel® SSE2) Support. . . . . . . . . . . . . . . . . . . . . .5-2
5.1.4 Checking for Intel® Streaming SIMD Extensions 3 (Intel® SSE3) Support. . . . . . . . . . . . . . . . . . . . . .5-3
5.1.5 Checking for Intel® Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) Support . . . . . . . .5-3
5.1.6 Checking for Intel® SSE4.1 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
5.1.7 Checking for Intel® SSE4.2 Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
5.1.8 Detection of PCLMULQDQ and AESNI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.1.9 Detection of Intel® AVX Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-7
5.1.11 Detection of F16C Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-8
5.1.12 Detection of FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
5.1.13 Detection of Intel® AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
5.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING . . . . . . . . . . . . . . . . . . . . . . . 5-11
5.2.1 Identifying Hot Spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.2.2 Determine If Code Benefits by Conversion to SIMD Execution. . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-13
5.3 CODING TECHNIQUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
5.3.1 Coding Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-14
5.3.1.1 Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.3.1.2 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-15
5.3.1.3 Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
5.3.1.4 Automatic Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-17
5.4 STACK AND DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
5.4.1 Alignment and Contiguity of Data Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.1.1 Using Padding to Align Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.1.2 Using Arrays to Make Data Contiguous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-18
5.4.2 Stack Alignment for 128-bit SIMD Technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-19
5.4.3 Data Alignment for MMX™ Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.4.4 Data Alignment for 128-bit Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.4.4.1 Compiler-Supported Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-20
5.5 IMPROVING MEMORY UTILIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-22
5.5.1 Data Structure Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-22
5.5.2 Strip-Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-25
5.5.3 Loop Blocking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-26
5.6 INSTRUCTION SELECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-28
5.7 TUNING THE FINAL APPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-30
CHAPTER 7
OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
7.1 GENERAL RULES FOR SIMD FLOATING-POINT CODE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.2 PLANNING CONSIDERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.3 USING SIMD FLOATING-POINT WITH X87 FLOATING-POINT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.4 SCALAR FLOATING-POINT CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5 DATA ALIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5.1 Data Arrangement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
7.5.1.1 Vertical versus Horizontal Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.5.1.2 Data Swizzling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.5.1.3 Data Deswizzling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
7.5.1.4 Horizontal ADD Using SSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.5.3 Flush-to-Zero and Denormals-are-Zero Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6.1 Dot Product and Horizontal SIMD Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
7.6.2 Vector Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
7.6.3 Using Horizontal SIMD Instruction Sets and Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
7.6.3.1 SOA and Vector Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
CHAPTER 8
INT8 DEEP LEARNING INFERENCE
8.1 INTRODUCING INT8 AS DATA TYPE FOR DEEP LEARNING INFERENCE. . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2 INTRODUCING INTEL® DL BOOST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2.1 Multiply and Add Unsigned and Signed Bytes (VPDPBUSD Instruction). . . . . . . . . . . . . . . . . . . . . . 8-2
8.2.2 Multiply and Add Signed Word Integers (VPDPWSSD Instruction) . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3 GENERAL OPTIMIZATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3.1 Memory Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
8.3.2 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.1 Quantization of Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.2 Quantization of Activations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.3.2.3 Quantizing Negative Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3 Multicore Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3.1 Large Batch (Throughput Workload) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
8.3.3.2 Small Batch (Throughput at Latency Workload) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.3.3.3 NUMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4 CNNS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1 Convolutional Layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1.1 Direct Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.4.1.2 Convolutional Layers with Low OFM Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
8.4.2 Post Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
8.4.2.1 Fused Quantization/Dequantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
8.4.2.2 ReLU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.4.2.3 EltWise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
8.4.2.4 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
8.4.2.5 Pixel Shuffler. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.5 LSTM NETWORKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.5.1 Fused LSTM Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.5.2 Fused post GEMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
8.5.3 Dynamic Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
8.5.4 NMT Example: Beam Search Decoder Get Top K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
CHAPTER 9
OPTIMIZING CACHE USAGE
9.1 GENERAL PREFETCH CODING GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2 PREFETCH AND CACHEABILITY INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3 PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3.1 Software Data Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.3.2 Prefetch Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
9.3.3 Prefetch and Load Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4 CACHEABILITY CONTROL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4.1 The Non-temporal Store Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.4.1.1 Fencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.2 Streaming Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.3 Memory Type and Non-temporal Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.1.4 Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
9.4.2 Streaming Store Usage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.2.1 Coherent Requests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.2.2 Non-Coherent Requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.4.3 Streaming Store Instruction Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.4 The Streaming Load Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5 FENCE Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5.1 SFENCE Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.4.5.2 LFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.5.3 MFENCE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.6 CLFLUSH Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.4.7 CLFLUSHOPT Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
9.5 MEMORY OPTIMIZATION USING PREFETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.1 Software-Controlled Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.2 Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.5.3 Example of Effective Latency Reduction with Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
9.5.4 Example of Latency Hiding with S/W Prefetch Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
9.5.5 Software Prefetching Usage Checklist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
9.5.6 Software Prefetch Scheduling Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
9.5.7 Software Prefetch Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
9.5.8 Minimize Number of Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-17
9.5.9 Mix Software Prefetch with Computation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-19
9.5.10 Software Prefetch and Cache Blocking Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-20
9.5.11 Hardware Prefetching and Cache Blocking Techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-23
9.5.12 Single-Pass versus Multi-Pass Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-24
9.6 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.6.1 Non-Temporal Stores and Software Write-Combining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.6.2 Cache Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.1 Video Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.2 Video Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.6.2.3 Conclusions from Video Encoder and Decoder Implementation . . . . . . . . . . . . . . . . . . . . . . . . 9-27
9.6.2.4 Optimizing Memory Copy Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-27
9.6.2.5 Using the 8-byte Streaming Stores and Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-28
9.6.2.6 Using 16-byte Streaming Stores and Hardware Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-29
9.6.2.7 Performance Comparisons of Memory Copy Routines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-30
9.6.3 Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-30
9.6.3.1 Cache Sharing Using Deterministic Cache Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
9.6.3.2 Cache Sharing in Single-Core or Multicore. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
9.6.3.3 Determine Prefetch Stride. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
CHAPTER 10
SUB-NUMA CLUSTERING
10.1 SUB-NUMA CLUSTERING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1
10.2 COMPARISON WITH CLUSTER-ON-DIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3 SNC USAGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.3.1 How to Check NUMA Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
10.3.2 MPI Optimizations for SNC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-7
10.3.3 SNC Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-8
CHAPTER 11
MULTICORE AND INTEL® HYPER-THREADING TECHNOLOGY (INTEL® HT)
11.1 PERFORMANCE AND USAGE MODELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1
11.1.1 Multithreading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-1
11.1.2 Multitasking Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-3
11.2 PROGRAMMING MODELS AND MULTITHREADING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
11.2.1 Parallel Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.1.1 Domain Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.2 Functional Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.3 Specialized Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
11.2.3.1 Producer-Consumer Threading Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-5
11.2.4 Tools for Creating Multithreaded Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.1 Programming with OpenMP Directives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.2 Automatic Parallelization of Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.2.4.3 Supporting Development Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-8
11.3 OPTIMIZATION GUIDELINES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
11.3.1 Key Practices of Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.2 Key Practices of System Bus Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.3 Key Practices of Memory Optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
11.3.4 Key Practices of Execution Resource Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-10
11.3.5 Generality and Performance Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-10
11.4 THREAD SYNCHRONIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
11.4.1 Choice of Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-11
11.4.2 Synchronization for Short Periods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-12
11.4.3 Optimization with Spin-Locks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.4.4 Synchronization for Longer Periods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-13
11.4.4.1 Avoid Coding Pitfalls in Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
11.4.5 Prevent Sharing of Modified Data and False-Sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-15
11.4.6 Placement of Shared Synchronization Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-16
11.5 SYSTEM BUS OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
11.5.1 Conserve Bus Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-17
11.5.2 Understand the Bus and Cache Interactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-18
11.5.3 Avoid Excessive Software Prefetches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-18
11.5.4 Improve Effective Latency of Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-19
11.5.5 Use Full Write Transactions to Achieve Higher Data Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-19
11.6 MEMORY OPTIMIZATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
11.6.1 Cache Blocking Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-20
11.6.2 Shared-Memory Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-20
11.6.2.1 Minimize Sharing of Data between Physical Processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-20
11.6.2.2 Batched Producer-Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-20
11.6.3 Eliminate 64-KByte Aliased Data Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-22
11.7 FRONT END OPTIMIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.7.1 Avoid Excessive Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-22
11.8 AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
11.8.1 Topology Enumeration of Shared Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-24
11.8.2 Non-Uniform Memory Access (NUMA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-24
11.9 OPTIMIZATION OF OTHER SHARED RESOURCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
11.9.1 Expanded Opportunity for Intel® HT Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-26
CHAPTER 12
INTEL® OPTANE™ DC PERSISTENT MEMORY
12.1 MEMORY MODE AND APP-DIRECT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1
12.1.1 Memory Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-1
12.1.2 App Direct Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-1
12.1.3 Selecting a Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-2
12.2 DEVICE CHARACTERISTICS OF INTEL® OPTANE™ DC PERSISTENT MEMORY MODULE . . . . . . . . . . . 12-4
12.2.1 Intel® Optane™ DC Persistent Memory Module Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
12.2.2 Read vs. Write Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
12.2.3 Number of Threads for Optimal Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-6
12.3 PLATFORM IMPLICATIONS OF HANDLING A SECOND TYPE OF MEMORY. . . . . . . . . . . . . . . . . . . . . . 12-8
12.3.1 Multi-Processor Cache Coherence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-8
12.3.2 Shared Queues in the Memory Hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-9
12.4 IMPLEMENTING PERSISTENCE FOR MEMORY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
12.5 POWER CONSUMPTION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
12.5.1 Read-Write Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-11
12.5.2 Spatial and Temporal Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-12
CHAPTER 13
64-BIT MODE CODING GUIDELINES
13.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2 CODING RULES AFFECTING 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.1 Use Legacy 32-Bit Instructions When Data Size Is 32 Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.2 Use Extra Registers to Reduce Register Pressure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1
13.2.3 Effective Use of 64-Bit by 64-Bit Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.2.4 Replace 128-bit Integer Division with 128-bit Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
13.2.5 Sign Extension to Full 64-Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.3 ALTERNATE CODING RULES FOR 64-BIT MODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
13.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result . . . . . . . . . . . . 13-5
13.3.2 Using Software Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
CHAPTER 14
INTEL® SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING
14.1 INTEL® SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-1
14.1.1 CRC32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-4
14.2 USING INTEL® SSE4.2 STRING AND TEXT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-5
14.2.1 Unaligned Memory Access and Buffer Size Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.2.2 Unaligned Memory Access and String Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3 INTEL® SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-6
14.3.1 Null Character Identification (Strlen equivalent) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-7
14.3.2 White-Space-Like Character Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-11
14.3.3 Substring Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-13
14.3.4 String Token Extraction and Case Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-19
14.3.5 Unicode Processing and PCMPxSTRy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-22
14.3.6 Replacement String Library Function Using Intel® SSE4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-27
14.4 INTEL® SSE4.2-ENABLED NUMERICAL AND LEXICAL COMPUTATION . . . . . . . . . . . . . . . . . . . . . . . . 14-28
14.5 NUMERICAL DATA CONVERSION TO ASCII FORMAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
14.5.1 Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
14.5.1.1 MULX Instruction and Large Integer Numeric Computation . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
CHAPTER 17
SOFTWARE OPTIMIZATION FOR INTEL® AVX-512 INSTRUCTIONS
17.1 BASIC INTEL® AVX-512 VS. INTEL® AVX2 CODING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
17.1.1 Intrinsic Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-3
17.1.2 Assembly Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-5
17.2 MASKING. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-7
17.2.1 Masking Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-8
17.2.2 Masking Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-12
17.2.3 Masking vs. Blending. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-13
17.2.4 Nested Conditions / Mask Aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-15
17.2.5 Memory Masking Microarchitecture Improvements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-16
17.2.6 Peeling and Remainder Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-16
17.3 FORWARDING AND UNMASKED OPERATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-19
17.4 FORWARDING AND MEMORY MASKING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-19
17.5 DATA COMPRESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-20
17.5.1 Data Compress Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-21
17.6 DATA EXPAND. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-24
17.6.1 Data Expand Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-25
17.7 TERNARY LOGIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-27
17.7.1 Ternary Logic Example 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-27
CHAPTER 18
INTEL® ADVANCED VECTOR EXTENSIONS 512 - FP16 INSTRUCTION SET FOR INTEL®
XEON® PROCESSORS
18.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.2 TERM DEFINITIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.3 OVERVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.4 FP16 NUMERIC INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-2
18.4.1 Data Type Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-2
18.4.2 Overview of Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-3
18.4.3 Fundamental Complex-Valued Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-4
18.4.4 Using Intel® AVX-512 Bit Masks for Real-Valued Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-5
18.4.5 Using Intel® AVX-512 Bit Masks for Complex-Valued Operations . . . . . . . . . . . . . . . . . . . . . . . . . 18-6
18.5 NUMERICS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-9
18.5.1 Introduction to FP16 Number Format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-9
18.5.2 Observations on Representing Numbers in FP16 Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-10
18.5.3 Numeric Accuracy Guarantees. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-11
18.5.4 Handling Denormal Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-12
18.5.5 Embedded Rounding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-13
18.5.6 Legacy FP16 Data Type Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-13
18.5.7 FP16 Conversions to and from Other Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-14
18.5.8 Approximation Instructions and Their Uses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-14
18.5.8.1 Approximate Reciprocal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-15
18.5.8.2 Approximate Division. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-15
18.5.8.3 Approximate Reciprocal Square Root. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-16
18.5.9 Approximate Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-17
18.6 USING EXISTING INTEL® AVX-512 INSTRUCTIONS TO AUGMENT FP16 SUPPORT . . . . . . . . . . . . . . 18-17
18.6.1 Using Existing Instructions to Extend Intel® AVX-512 FP16 Intrinsics. . . . . . . . . . . . . . . . . . . . . . 18-17
18.6.2 Common Convenience Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-18
18.6.3 Using Integer Comparisons for Fast Floating-Point Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . 18-18
18.7 MATH LIBRARY SUPPORT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-19
CHAPTER 19
CRYPTOGRAPHY & FINITE FIELD ARITHMETIC ENHANCEMENTS
19.1 VECTOR AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-1
19.2 VPCLMULQDQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
19.3 GALOIS FIELD NEW INSTRUCTIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-2
19.4 INTEGER FUSED MULTIPLY ACCUMULATE OPERATIONS (AVX512_IFMA - VPMADD52) . . . . . . . . . . 19-4
CHAPTER 20
INTEL® ADVANCED MATRIX EXTENSIONS (INTEL® AMX)
20.1 DETECTING INTEL® AMX SUPPORT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.2 INTEL® AMX MICROARCHITECTURE OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.2.1 Intel® AMX Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
20.3 INTEL® AMX INSTRUCTIONS THROUGHPUT AND LATENCY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.4 DATA STRUCTURE ALIGNMENT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5 GEMMS / CONVOLUTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-3
20.5.2 Tiles in the Intel® AMX Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-4
20.5.3 B Matrix Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-6
20.5.4 Straightforward GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-9
20.5.5 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
20.5.5.1 Minimizing Tile Loads. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
20.5.5.2 Software Pipelining of Tile Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-13
20.5.5.3 Optimized GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-13
20.5.5.4 Direct Convolution with Intel® AMX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-15
CHAPTER 21
INTEL® QUICKASSIST TECHNOLOGY (INTEL® QAT)
21.1 SOFTWARE DESIGN GUIDELINES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-1
21.1.1 Polling vs. Interrupts (If Supported). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-1
21.1.1.1 Interrupt Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
21.1.1.2 Polling Mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-2
21.1.1.3 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-3
CHAPTER 22
USING PERFORMANCE MONITORING EVENTS
22.1 TOP-DOWN ANALYSIS METHOD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-1
22.1.1 Top-Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-2
22.1.2 Frontend Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.3 Backend Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.4 Memory Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4
22.1.5 Core Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-5
22.1.6 Bad Speculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-5
22.1.7 Retiring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.8 Golden Cove Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.9 Ice Lake Microarchitecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.10 Optane Persistent Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.11 Skylake Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-6
22.1.11.1 TMA Use Case 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-7
22.1.11.2 TMA Use Case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-7
22.2 PERFORMANCE MONITORING AND MICROARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-8
22.3 INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-15
22.4 PERFORMANCE ANALYSIS TECHNIQUES FOR INTEL® XEON® PROCESSOR 5500 SERIES . . . . . . . . . 22-16
22.4.1 Cycle Accounting and Uop Flow Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-17
22.4.1.1 Cycle Drill Down and Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-18
22.4.1.2 Basic Block Drill Down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-21
22.4.2 Stall Cycle Decomposition and Core Memory Accesses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-22
22.4.2.1 Measuring Costs of Microarchitectural Conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-22
22.4.3 Core PMU Precise Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-23
22.4.3.1 Precise Memory Access Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-24
22.4.3.2 Load Latency Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-25
22.4.3.3 Precise Execution Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-27
22.4.3.4 Last Branch Record (LBR). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-28
22.4.3.5 Measuring Per-Core Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-33
22.4.3.6 Miscellaneous L1 and L2 Events for Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.3.7 TLB Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.3.8 L1 Data Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-34
22.4.4 Front End Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.4.1 Branch Mispredictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.4.2 Front End Code Generation Metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-35
22.4.5 Uncore Performance Monitoring Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-36
22.4.5.1 Global Queue Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-36
22.4.5.2 Global Queue Port Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-38
22.4.5.3 Global Queue Snoop Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-38
22.4.5.4 L3 Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-39
22.4.6 Intel QuickPath Interconnect Home Logic (QHL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-39
22.4.7 Measuring Bandwidth From the Uncore. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-45
22.5 PERFORMANCE TUNING TECHNIQUES FOR SANDY BRIDGE MICROARCHITECTURE . . . . . . . . . . . . 22-45
22.5.1 Correlating Performance Bottleneck to Source Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-45
APPENDIX A
APPLICATION PERFORMANCE TOOLS
A.1 COMPILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
A.1.1 Recommended Optimization Settings for Intel® 64 and IA-32 Processors. . . . . . . . . . . . . . . . . . . . A-2
A.1.2 Vectorization and Loop Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.2.1 Multithreading with OpenMP* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.2.2 Automatic Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.3 Inline Expansion of Library Functions (/Oi, /Oi-). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4 Interprocedural and Profile-Guided Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
A.1.4.1 Interprocedural Optimization (IPO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.1.4.2 Profile-Guided Optimization (PGO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.1.5 Intel® Cilk™ Plus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2 PERFORMANCE LIBRARIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-4
A.2.1 Intel® Integrated Performance Primitives (Intel® IPP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.2 Intel® Math Kernel Library (Intel® MKL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.3 Intel® Threading Building Blocks (Intel® TBB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.2.4 Benefits Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
A.3 PERFORMANCE PROFILERS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1 Intel® VTune™ Amplifier XE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.1 Hardware Event-Based Sampling Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.2 Algorithm Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
A.3.1.3 Platform Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.4 THREAD AND MEMORY CHECKERS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.4.1 Intel® Inspector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.5 VECTORIZATION ASSISTANT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.5.1 Intel® Advisor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6 CLUSTER TOOLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1 Intel® Trace Analyzer and Collector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
A.6.1.1 MPI Performance Snapshot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.6.2 Intel® MPI Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.6.3 Intel® MPI Benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
A.7 INTEL® COMMUNITIES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
APPENDIX B
RUNTIME PERFORMANCE OPTIMIZATION BLUEPRINT: INTEL® ARCHITECTURE
OPTIMIZATION WITH LARGE CODE PAGES
B.1 OVERVIEW. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
B.1.1 TLBs and Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-1
B.1.2 Large Pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-2
B.2 DIAGNOSING THE PROBLEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
B.2.1 ITLB Misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-2
B.2.2 Measuring the ITLB Miss Stall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-5
B.2.3 Source of ITLB Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .B-6
B.3 SOLUTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7
B.3.1 Linux* and Large Pages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7
B.3.2 Large Pages for .text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-7
B.3.3 Reference Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-8
B.3.4 Large Pages for the Heap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-9
B.4 SOLUTION INTEGRATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
B.4.1 V8 Integration with the Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
B.4.2 JAVA JVM Integration with the Reference Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-10
B.5 LIMITATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
B.6 CASE STUDY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-11
B.6.1 Ghost.js Workload. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
B.6.2 Web Tooling Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
B.6.2.1 Node Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-12
B.6.2.2 Web Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
B.6.2.3 Comparing Clear Linux* OS and Ubuntu* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
B.6.3 MediaWiki Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-13
B.6.4 Visualization of Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
B.6.4.1 Precise Events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
B.6.4.2 Visualizing Precise ITLB Miss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-14
B.7 SUMMARY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-17
B.8 TEST CONFIGURATION DETAILS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-17
B.9 ADDITIONAL REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-19
Example 14-17. High-level flow of Character Subset Validation for String Conversion . . . . . . . . . . . . . . . . 14-29
Example 14-18. Intrinsic Listings of atol() Replacement Using PCMPISTRI . . . . . . . . . . . . . . . . . . . . . . . . . . 14-29
Example 14-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing . . . . . . . . . . . . . . . . . . 14-32
Example 14-20. Conversion of 64-bit Integer to ASCII . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-34
Example 14-21. Conversion of 64-bit Integer to ASCII without Integer Division . . . . . . . . . . . . . . . . . . . . . 14-35
Example 14-22. Conversion of 64-bit Integer to ASCII Using Intel® SSE4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-37
Example 14-23. Conversion of 64-bit Integer to Wide Character String Using Intel® SSE4 . . . . . . . . . . . . . 14-43
Example 14-24. MULX and Carry Chain in Large Integer Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14-48
Example 14-25. Building-block Macro Used in Binary Decimal Floating-point Operations . . . . . . . . . . . . 14-48
Example 15-1. Cartesian Coordinate Transformation with Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-3
Example 15-2. Cartesian Coordinate Transformation with Assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-5
Example 15-3. Direct Polynomial Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-7
Example 15-4. Function Calls and Intel® AVX/Intel® SSE transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-11
Example 15-5. AoS to SoA Conversion of Complex Numbers in C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-14
Example 15-6. AoS to SoA Conversion of Complex Numbers Using Intel® AVX . . . . . . . . . . . . . . . . . . . . . . 15-15
Example 15-7. Register Overlap Method for Median of 3 Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-17
Example 15-8. Data Gather - Intel® AVX versus Scalar Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-19
Example 15-9. Scatter Operation Using Intel® AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-21
Example 15-10. SAXPY using Intel® AVX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-22
Example 15-11. Using 16-Byte Memory Operations for Unaligned 32-Byte Memory Operation . . . . . . . . 15-24
Example 15-12. SAXPY Implementations for Unaligned Data Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-24
Example 15-13. Loop with Conditional Expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-28
Example 15-14. Handling Loop Conditional with VMASKMOV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-28
Example 15-15. Three-Tap Filter in C Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-29
Example 15-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD . . . . . . . . . . . . . . . . . . . . . . . . . 15-29
Example 15-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-30
Example 15-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code . . . . . . . . . . . . . . . 15-31
Example 15-19. 8x8 Matrix Transpose - Replace Shuffles with Blends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-33
Example 15-20. 8x8 Matrix Transpose Using VINSERTPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-36
Example 15-21. Port 5 versus Load Port Shuffles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-38
Example 15-22. Divide Using DIVPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-41
Example 15-23. Divide Using RCPPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-41
Example 15-24. Divide Using RCPPS and Newton-Raphson Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-41
Example 15-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . 15-42
Example 15-26. Reciprocal Square Root Using RSQRTPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . 15-43
Example 15-27. Reciprocal Square Root Using RSQRTPS and Newton-Raphson Iteration . . . . . . . . . . . . . 15-43
Example 15-28. Square Root Using SQRTPS for 24-bit Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-44
Example 15-29. Square Root Using RSQRTPS 11-bit Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-44
Example 15-30. Square Root Using RSQRTPS and One Taylor Series Expansion . . . . . . . . . . . . . . . . . . . . . 15-45
Example 15-31. Array Sub Sums Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-47
Example 15-32. Single-Precision to Half-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-48
Example 15-33. Half-Precision to Single-Precision Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-50
Example 15-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision . . . . . . . 15-51
Example 15-35. FP Mul/FP Add Versus FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-53
Example 15-36. Unrolling to Hide Dependent FP Add Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-53
Example 15-37. FP Mul/FP Add Versus FMA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-55
Example 15-38. Macros for Separable KLT Intra-block Transformation Using AVX2 . . . . . . . . . . . . . . . . . . 15-56
Example 15-39. Separable KLT Intra-block Transformation Using AVX2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-58
Example 15-40. Macros for Parallel Moduli/Remainder Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-64
Example 15-41. Signed 64-bit Integer Conversion Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-65
Example 15-42. Unsigned 63-bit Integer Conversion Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-66
Example 15-43. Access Patterns Favoring Non-VGATHER Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-70
Example 15-44. Access Patterns Likely to Favor VGATHER Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-71
Example 15-45. Software AVX Sequence Equivalent to Full-Mask VPGATHERD . . . . . . . . . . . . . . . . . . . . . 15-72
Example 18-2. Function for Converting from a Real-Valued Mask to a Complex-Valued Mask By
AND-Combining Adjacent Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-8
Example 18-3. Function for Converting from a Real-Valued Mask to a Complex-Valued Mask by
OR-Combining Adjacent Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-9
Example 18-4. Function to Implement the 16-Bit Compress Operation on FP16 Vector Elements . . . . . . 18-17
Example 18-5. Function that Performs Fast Floating-Point Minimum Using Integer Instructions . . . . . . . 18-19
Example 19-1. Legacy Intel® AES-NI vs. Vector AES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-1
Example 19-2. SM4 GFNI Encryption Round Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-3
Example 20-1. Pseudo-Code for the TILEZERO, TILELOAD, and TILESTORE Instructions . . . . . . . . . . . . . . . . 20-6
Example 20-2. B Matrix Re-Layout Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-6
Example 20-3. Original Layout of 32x16 bfloat16 B-Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-7
Example 20-4. Re-Layout of 32x16 bfloat16 B-Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-7
Example 20-5. Original Layout of 64 x 16 int8 B-Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-8
Example 20-6. Re-Layout of 64x16 int8 B-Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-8
Example 20-7. Common Defines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-9
Example 20-8. Reference GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
Example 20-9. K-Dimension Loop as Innermost Loop, a Highly Inefficient Approach . . . . . . . . . . . . . . . . . 20-11
Example 20-10. Innermost Loop Tile Pre-Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-12
Example 20-11. Switched Order of M_ACC and N_ACC Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-12
Example 20-12. Optimized GEMM Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-13
Example 20-13. Dimension of Matrices, Data Types, and Tile Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-14
Example 20-14. Optimized GEMM Assembly Language Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-14
Example 20-15. Activations Layout Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-15
Example 20-16. Weights Re-Layout Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-16
Example 20-17. Common Defines for Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-19
Example 20-18. Optimized Direct Convolution Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-20
Example 20-19. Additional Defines for Convolution with Cache Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-21
Example 20-20. Optimized Convolution Implementation with Cache Blocking . . . . . . . . . . . . . . . . . . . . . . . 20-22
Example 20-21. Convolution Code Fused with Post-Convolution Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-30
Example 20-22. An Example of a Short GEMM Fused and Pipelined with Quantization and ReLU . . . . . . . 20-33
Example 20-23. Two Blocks of 16 Cache Lines of 32-bit Floats Converted to One Block of 16 Cache Lines
of 16-bit BFloat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-37
Example 20-24. Using Unsigned Saturation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-37
Example 20-25. Prefetching Rows to the DCU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-41
Example 20-26. BF16 Matrix Transpose (32x8 to 8x32) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-42
Example 20-27. BF16 VNNI-to-VNNI Transpose (8x8 to 2x32) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-46
Example 20-28. BF16 Flat-to-VNNI Transpose (16x8 to 4x32) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-49
Example 20-29. BF16 Flat-to-VNNI Re-Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-51
Example 20-30. GEMM Parallelized with omp Parallel for Collapse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-55
Example 20-31. Byte Decompression Code with Intel® AVX-512 Intrinsics . . . . . . . . . . . . . . . . . . . . . . . . . . 20-58
Example 20-32. Identification of Tile Shape Using Parameter m, n, k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-60
Example 20-33. Intel® AMX Intrinsics Header File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-60
Example 20-34. Intel® AMX Intrinsics Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-63
Example 20-35. Compiler-Generated Assembly-Level Code from Example 20-30 . . . . . . . . . . . . . . . . . . . . 20-64
Example 20-36. Compiler-Generated Assembly-Level Code Where Tile Register Save/Restore is
Optimized Away . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-65
FIGURES
PAGE
Figure 9-5. Prefetch and Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-18
Figure 9-6. Memory Access Latency and Execution With Prefetch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-18
Figure 9-7. Spread Prefetch Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-19
Figure 9-8. Cache Blocking – Temporally Adjacent and Non-adjacent Passes . . . . . . . . . . . . . . . . . . . . . . . . .9-20
Figure 9-9. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes
Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-21
Figure 9-10. Single-Pass vs. Multi-Pass 3D Geometry Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-25
Figure 10-1. Example of SNC Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-1
Figure 10-2. NUMA Disabled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5
Figure 10-3. SNC Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-6
Figure 10-4. SNC On . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-7
Figure 10-5. Domain Example with One MPI Process Per Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-8
Figure 11-1. Amdahl’s Law and MP Speed-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-2
Figure 11-2. Single-threaded Execution of Producer-consumer Threading Model . . . . . . . . . . . . . . . . . . . . . .11-5
Figure 11-3. Execution of Producer-consumer Threading Model
on a Multicore Processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-5
Figure 11-4. Interlaced Variation of the Producer Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-7
Figure 11-5. Batched Approach of Producer Consumer Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-21
Figure 12-1. In App Direct Mode, Data on the Intel® Optane™ DC Persistent Memory Module is
Accessed Directly with Loads and Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-2
Figure 12-2. Decision Flow for Determining When to Use
Intel® Optane™ DC Persistent Memory Module vs. DRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-3
Figure 12-3. Loaded Latency Curves for One Intel® Optane™ DC Persistent Memory Module DIMM:
Sequential Traffic (Left) and Random Traffic (Right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-5
Figure 12-4. Number of Threads vs. Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-6
Figure 12-5. Combining with Two Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-7
Figure 12-6. Combining with Four Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-7
Figure 12-7. Combining with Eight Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-8
Figure 12-8. PMDK vs. MSYNC Flushing Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-10
Figure 12-9. Bandwidth vs. Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-10
Figure 12-10. Read-Write Equivalence for Intel® Optane™ DC Persistent Memory Module
DIMMs within Different Power Budgets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-11
Figure 12-11. Bandwidth Available to Software when There is No Locality at 256B Granularity. . . . . . . . . . .12-12
Figure 14-1. Intel® SSE4.2 String/Text Instruction Immediate Operand Control . . . . . . . . . . . . . . . . . . . . . . . .14-2
Figure 14-2. Retrace Inefficiency of Byte-Granular, Brute-Force Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-13
Figure 14-3. Intel® SSE4.2 Speedup of SubString Searches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-19
Figure 14-4. Compute Four Remainders of Unsigned Short Integer in Parallel . . . . . . . . . . . . . . . . . . . . . . . .14-37
Figure 15-1. Intel® AVX—Intel® SSE Transitions in Earlier Microarchitectures . . . . . . . . . . . . . . . . . . . . . . . . .15-9
Figure 15-2. Intel® AVX- Intel® SSE Transitions in the Skylake Microarchitecture. . . . . . . . . . . . . . . . . . . . . . .15-9
Figure 15-3. Source Location of Each Destination Element in a 128-bit Lane . . . . . . . . . . . . . . . . . . . . . . . . .15-12
Figure 15-4. Gather Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-18
Figure 15-5. Using VPERM2F128 to Swap Elements Between Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-35
Figure 15-6. Step One of 8 x 8 Matrix Transpose Using VINSERTF128 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-35
Figure 15-7. Intel AVX Implementation of the Complex Multiply Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .15-37
Figure 15-8. SSE Implementation Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-46
Figure 15-9. Intel® AVX Implementation of the Array Sub Sums Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . .15-47
Figure 15-10. 4x4 Image Block Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-56
Figure 15-11. Throughput Comparison of Gather Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-72
Figure 15-12. Comparison of HW GATHER Versus Software Sequence in Skylake Microarchitecture . . . . . .15-73
Figure 16-1. Performance History and State Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
Figure 16-2. Active Time Versus Halted Time of a Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-3
Figure 16-3. Application of C-States to Idle Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-4
Figure 16-4. Profiles of Coarse Task Scheduling and Power Consumption. . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-9
Figure 16-5. Thread Migration in a Multicore Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-11
Figure 16-6. Progression to Deeper Sleep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-11
Figure 16-7. Energy Savings Due to Performance Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-13
Figure 16-8. Energy Savings Due to Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-13
Figure 16-9. Energy-Saving Comparison of Synchronization Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-16
Figure 16-10. Power-Saving Comparison of Power-Source-Aware Frame Rate Configurations. . . . . . . . . . . .16-17
Figure 17-1. Cartesian Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-3
Figure 17-2. Mask Move When Using Merging Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-9
Figure 17-3. Mask Move Operation When Using Merging Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-9
Figure 17-4. Result of Execution with Zeroing Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-9
Figure 17-5. Data Forwarding Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-19
Figure 17-6. Data Compress Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-20
Figure 17-7. Data Expand Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-25
Figure 17-8. Ternary Logic Example 1 Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-27
Figure 17-9. Ternary Logic Example 2 Truth Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-30
Figure 17-10. VPERMI2PS Instruction Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-31
Figure 17-11. Two-Source Permute Instructions in a Matrix Transpose Operation . . . . . . . . . . . . . . . . . . . . .17-32
Figure 17-12. VSCATTERDPD Instruction Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-38
Figure 17-13. VPCONFLICTD Instruction Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-51
Figure 17-14. VPCONFLICTD Merging Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-51
Figure 17-15. VPCONFLICTD Permute Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-52
Figure 17-16. VPCONFLICTD ZMM2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-54
Figure 17-17. Sparse Vector Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-54
Figure 17-18. VPERMB Instruction Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-56
Figure 17-19. VPERMI2B Instruction Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-58
Figure 17-20. VPERMT2B Instruction Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-59
Figure 17-21. VPMULTISHIFTQB Instruction Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-61
Figure 17-22. Fast Bypass When All Sources Come from FMA Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17-64
Figure 17-23. Mixing Intel AVX Instructions or Intel AVX-512 Instructions with Intel SSE Instructions . . . . . .17-65
Figure 18-1. Layout of a 128-Bit Register Representing Four Complex FP16 (CFP16) Values. . . . . . . . . . . . . .18-3
Figure 18-2. A Zero-Masked FP16 Add On Two 128-Bit Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18-5
Figure 18-3. A Masked Complex Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18-6
Figure 18-4. Using a Real-Valued FP16 Vector Operation for Implementing a Masked Complex Addition . .18-7
Figure 18-5. Comparison Operation Between Two Complex-Valued Vectors . . . . . . . . . . . . . . . . . . . . . . . . . .18-8
Figure 18-6. Bit Layout of Three Types of Floating-Point Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18-10
Figure 18-7. Landmark Numbers on the Real-Valued FP16 Axis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18-10
Figure 18-8. Heat-map Showing Relative ULP Error for Different Combinations of Divisor and Dividend
Value Ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18-16
Figure 20-1. Matrix Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-4
Figure 20-2. Intel® AMX Multiplication with Max-sized int8 Tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-5
Figure 20-3. Activations Layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-16
Figure 20-4. Weights Re-Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-17
Figure 20-5. Convolution-Matrix Multiplication and Summation Equivalence . . . . . . . . . . . . . . . . . . . . . . . .20-17
Figure 20-6. Matrix-Like Multiplications Part of a Convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-18
Figure 20-7. Batching Execution Using Six Layers with Four Instances Per Thread . . . . . . . . . . . . . . . . . . . . .20-25
Figure 20-8. A Convolution Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-28
Figure 20-9. A Convolution Example with Large Tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-28
Figure 20-10. Using TILEZERO to Solve Performance Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-35
Figure 20-11. A Conversion Flow of 32-bit Integers to 8-bit Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20-36
Figure 20-12. Trivial Deep Learning Topology with Naive Buffer Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . .20-38
Figure 20-13. Minimal Memory Footprint Buffer Allocation Scheme for Trivial Deep Learning Topology . . .20-38
Figure 20-14. Flat-to-VNNI Transpose of WORDs Equivalence to Flat-to-Flat Transpose of DWORDs . . . . . .20-48
Figure 20-15. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the M-Dimension . .20-53
Figure 20-16. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the N-Dimension . .20-54
Figure 20-17. GEMM Data Partitioning Between Three Cores in a Layer Partitioned by the K-Dimension. . .20-54
INDEX
USER RULES
User/Source Coding Rule 18. (M impact, ML generality) Place each synchronization variable alone, separated by 128
bytes or in a separate cache line. .............................................................................................................11-18
User/Source Coding Rule 19. (H impact, L generality) Do not allow any spin lock variable to span a cache line
boundary. ..................................................................................................................................................11-18
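The following sketch (not taken from the manual; the name padded_lock_t is illustrative) shows one way to satisfy Rules 18 and 19 in C11: each spin-lock variable is aligned to the start of a cache line and padded to 128 bytes, so it neither shares a line with other data nor spans a line boundary.

    /* Illustrative sketch for Rules 18-19: isolate each synchronization variable. */
    #include <stdalign.h>
    #include <stdatomic.h>

    #define CACHE_LINE 64

    typedef struct {
        alignas(2 * CACHE_LINE) atomic_int lock;         /* starts on its own cache line */
        char pad[2 * CACHE_LINE - sizeof(atomic_int)];   /* pads the entry to 128 bytes  */
    } padded_lock_t;

    static padded_lock_t locks[4];                       /* locks are 128 bytes apart */

    static void spin_lock(padded_lock_t *l)
    {
        int expected = 0;
        while (!atomic_compare_exchange_weak(&l->lock, &expected, 1))
            expected = 0;                                /* retry; no unrelated data shares this line */
    }

    static void spin_unlock(padded_lock_t *l)
    {
        atomic_store(&l->lock, 0);
    }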
User/Source Coding Rule 20. (M impact, H generality) Improve data and code locality to conserve bus command
bandwidth. ................................................................................................................................................11-19
User/Source Coding Rule 21. (M impact, L generality) Avoid excessive use of software prefetch instructions and
allow the automatic hardware prefetcher to work. Excessive software prefetching can significantly and
unnecessarily increase bus utilization. ......................................................................................................11-20
User/Source Coding Rule 22. (M impact, M generality) Consider overlapping multiple back-to-back memory
reads to improve effective cache miss latencies. ......................................................................................11-20
User/Source Coding Rule 23. (M impact, M generality) Consider adjusting the sequencing of memory references
such that the distribution of distances between successive last-level cache misses peaks toward 64 bytes. 11-21
User/Source Coding Rule 24. (M impact, M generality) Use full write transactions to achieve higher data throughput.
...................................................................................................................................................................11-21
User/Source Coding Rule 25. (H impact, H generality) Use cache blocking to improve locality of data access. Target
one quarter to one half of the cache size when targeting Intel processors supporting HT Technology, or target a
block size that allows all the logical processors serviced by a cache to share that cache simultaneously. 11-22
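A minimal cache-blocking sketch in the spirit of Rule 25 is shown below; it is illustrative only, and the tile size BLOCK is an assumption that should be tuned so the working set of a tile occupies roughly one quarter to one half of the cache level being targeted.

    /* Illustrative loop-blocking sketch for Rule 25: operate on BLOCK x BLOCK tiles
     * so the data touched by the inner loops stays resident in the cache.
     * N is assumed to be divisible by BLOCK for brevity. */
    #define N     1024
    #define BLOCK 64                                     /* assumed tile size; tune per cache size */

    void transpose_blocked(const float a[N][N], float b[N][N])
    {
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)    /* one tile at a time */
                    for (int j = jj; j < jj + BLOCK; j++)
                        b[j][i] = a[i][j];
    }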
User/Source Coding Rule 26. (H impact, M generality) Minimize the sharing of data between threads that execute
on different bus agents sharing a common bus. On a platform consisting of multiple bus domains, also minimize
data sharing across bus domains. ..............................................................................................................11-22
User/Source Coding Rule 27. (H impact, H generality) Minimize data access patterns that are offset by multiples of
64 KBytes in each thread. .........................................................................................................................11-24
User/Source Coding Rule 28. (M impact, L generality) Avoid excessive loop unrolling to ensure the LSD is operating
efficiently ..................................................................................................................................................11-24
User/Source Coding Rule 29. Factor in the precision and rounding characteristics of FMA instructions when
replacing multiply/add operations that execute as non-FMA instructions. ...............................................15-53
User/Source Coding Rule 30. Factor in result dependency and the latency of FP add vs. FMA instructions when
replacing FP add operations with FMA instructions. ..................................................................................15-53
User/Source Coding Rule 31. Consider using an unrolling technique for loops containing back-to-back dependent
FMA, FP add, or vector multiply operations. The unrolling factor can be chosen by considering the latency of the
critical instruction in the dependency chain and the number of pipes available to execute that instruction. 15-55
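The sketch below (illustrative, not the manual's code) shows the unrolling idea behind Rules 30 and 31 on a simple dot product: the naive loop carries a single dependent accumulation, while the unrolled form uses four independent accumulators so that back-to-back FMA or FP-add latency can be overlapped. The unroll factor of four is an assumption; per Rule 31 it should be derived from the latency of the critical instruction and the number of pipes that can execute it. The different summation order also changes rounding, which is the concern raised by Rule 29.

    /* Illustrative sketch for Rules 29-31: break a reduction's dependency chain
     * with multiple independent accumulators. */
    float dot_naive(const float *x, const float *y, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];                  /* every iteration waits on 'sum' */
        return sum;
    }

    float dot_unrolled(const float *x, const float *y, int n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        int i = 0;
        for (; i + 4 <= n; i += 4) {             /* four independent dependency chains */
            s0 += x[i + 0] * y[i + 0];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        for (; i < n; i++)                       /* remainder iterations */
            s0 += x[i] * y[i];
        return (s0 + s1) + (s2 + s3);            /* summation order differs; see Rule 29 */
    }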
Assembler/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to make basic blocks contiguous and
eliminate unnecessary branches. ..................................................................................................................3-5
Assembler/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV instructions to eliminate
unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these
instructions to eliminate all unpredictable conditional branches (because using these instructions will incur
execution overhead due to the requirement for executing both paths of a conditional branch). In addition,
converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence
and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors
usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these
instructions only if the increase in computation time is less than the expected cost of a mispredicted branch. ....3-5
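A minimal C illustration of the source-level shape that Rule 2 targets: a conditional expression with simple, side-effect-free arms is a candidate for SETCC/CMOV code generation when the condition is hard to predict. Whether CMOV is actually emitted depends on the compiler and options; the function name is illustrative.
int select_min(int a, int b)
{
    /* Data dependence replaces an unpredictable conditional branch. */
    return (a < b) ? a : b;
}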
Assembler/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch
prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch
with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a
branch with a backward target......................................................................................................................3-6
Assembler/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and
far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine
to be called is not recommended since it creates a mismatch in calls and returns. .....................................3-7
Assembler/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases
code size or if the function is small and the call site is frequently executed.................................................3-8
Assembler/Compiler Coding Rule 6. (ML impact, ML generality) If there are more than 16 nested calls and returns
in rapid succession, consider transforming the program with inlining to reduce the call depth.....................3-8
Assembler/Compiler Coding Rule 7. (ML impact, ML generality) Favor inlining small functions that contain branches
with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken,
a performance penalty may be incurred. ......................................................................................................3-8
Assembler/Compiler Coding Rule 8. (L impact, L generality) If the last statement in a function is a call to another
function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the
return stack buffer.........................................................................................................................................3-8
Assembler/Compiler Coding Rule 9. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
....................................................................................................................................................................3-8
Assembler/Compiler Coding Rule 10. (M impact, L generality) Do not put more than two end loop branches in a 16-
byte chunk. ....................................................................................................................................................3-8
Assembler/Compiler Coding Rule 11. (M impact, H generality) When executing code from the Decoded ICache,
direct branches that are mostly taken should have all their instruction bytes in a 64B cache line and nearer the
end of that cache line. Their targets should be at or near the beginning of a 64B cache line. .....................3-8
Assembler/Compiler Coding Rule 12. (M impact, H generality) If the body of a conditional is not likely to be
executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code
locality is an issue, it should be placed on a different code page.................................................................. 3-8
Assembler/Compiler Coding Rule 13. (M impact, L generality) When indirect branches are present, try to put the
most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect
branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect
branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path. 3-8
Assembler/Compiler Coding Rule 14. (H impact, M generality) Unroll small loops until the overhead of the branch
and induction variable accounts (generally) for less than 10% of the execution time of the loop. ............3-10
Assembler/Compiler Coding Rule 15. (M impact, M generality) Unroll loops that are frequently executed and have
a predictable number of iterations to reduce the number of iterations to 16 or fewer. Do this unless it increases
code size so that the working set no longer fits in the instruction cache. If the loop body contains more than one
conditional branch, then unroll so that the number of iterations is 16/(# conditional branches)..............3-10
Assembler/Compiler Coding Rule 16. (ML impact, M generality) To improve fetch/decode throughput, give
preference to the memory flavor of an instruction over the register-only flavor of the same instruction, if the
instruction can benefit from micro-fusion...................................................................................3-11
Assembler/Compiler Coding Rule 17. (M impact, ML generality) Employ macrofusion where possible using
instruction pairs that support macrofusion. Prefer TEST over CMP if possible. Use unsigned variables and
unsigned jumps when possible. Try to logically verify that a variable is non-negative at the time of comparison.
Avoid CMP or TEST of MEM-IMM flavor when possible. However, do not add other instructions to avoid using the
MEM-IMM flavor. ........................................................................................................................................3-15
Assembler/Compiler Coding Rule 18. (M impact, ML generality) Software can enable macro fusion when it can be
logically determined that a variable is non-negative at the time of comparison; use TEST appropriately to enable
macrofusion when comparing a variable with 0. ........................................................................................3-16
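A hedged C sketch of the macro-fusion guidance in Rules 17 and 18: comparing an unsigned count against zero lets the compiler emit TEST reg,reg fused with the following conditional jump rather than a CMP with an immediate. Code generation is compiler-dependent; the function and parameter names are illustrative.
#include <stddef.h>

size_t count_nonzero(const unsigned char *p, size_t len)
{
    size_t hits = 0;
    while (len != 0) {            /* unsigned "!= 0": a TEST + JNZ candidate */
        hits += (*p++ != 0);
        len--;
    }
    return hits;
}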
Assembler/Compiler Coding Rule 19. (MH impact, MH generality) Favor generating code using imm8 or imm32
values instead of imm16 values...................................................................................................................3-17
Assembler/Compiler Coding Rule 20. (M impact, ML generality) Ensure that instructions using the 0xF7 opcode byte do
not start at offset 14 of a fetch line; avoid using these instructions to operate on 16-bit data by upcasting short data
to 32 bits. .....................................................................................................................................3-19
Assembler/Compiler Coding Rule 21. (MH impact, MH generality) Break up a loop body with a long sequence of
instructions into loops of shorter instruction blocks of no more than the size of the LSD. ........................3-19
Assembler/Compiler Coding Rule 22. (M impact, M generality) Avoid putting explicit references to ESP in a
sequence of stack operations (POP, PUSH, CALL, RET)................................................................................3-21
Assembler/Compiler Coding Rule 23. (ML impact, L generality) Use simple instructions that are less than eight bytes
in length. ......................................................................................................................................................3-21
Assembler/Compiler Coding Rule 24. (M impact, MH generality) Avoid using prefixes to change the size of
immediate and displacement. .....................................................................................................................3-21
Assembler/Compiler Coding Rule 25. (M impact, H generality) Favor single-micro-operation instructions. Also favor
instruction with shorter latencies................................................................................................................3-22
Assembler/Compiler Coding Rule 26. (M impact, L generality) Avoid prefixes, especially multiple non-0F-prefixed
opcodes........................................................................................................................................................3-22
Assembler/Compiler Coding Rule 27. (M impact, L generality) Do not use many segment registers. ...........3-22
Assembler/Compiler Coding Rule 28. (M impact, M generality) Avoid using complex instructions (for example,
enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of
simple instructions instead..........................................................................................................................3-22
Assembler/Compiler Coding Rule 29. (MH impact, M generality) Use push/pop to manage stack space and address
adjustments between function calls/returns instead of enter/leave. Using the ENTER instruction with non-zero
immediates can incur significant pipeline delays in addition to mispredictions.......................3-22
Assembler/Compiler Coding Rule 30. (ML impact, L generality) If an LEA instruction using the scaled index is on the
critical path, a sequence with ADDs may be better.....................................................................................3-24
Assembler/Compiler Coding Rule 31. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate
instructions. If possible, replace with a ROTATE by 1 instruction. ..............................................................3-26
Assembler/Compiler Coding Rule 32. (M impact, ML generality) Use dependency-breaking-idiom instructions to set
a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the
condition codes must be preserved, move 0 into the register instead. This requires more code space than using
XOR and SUB, but avoids setting the condition codes...............................................................................3-27
Assembler/Compiler Coding Rule 33. (M impact, MH generality) Break dependences on portions of registers
between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be
accomplished with 32-bit moves or by using MOVZX. ................................................................................3-28
Assembler/Compiler Coding Rule 34. (M impact, M generality) Try to use zero extension or operate on 32-bit
operands instead of using moves with sign extension. ...............................................................................3-28
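A minimal C sketch of the zero-extension guidance in Rules 33 and 34: loading a byte through an unsigned type and widening it to 32 bits lets the compiler use MOVZX, which writes the full register and breaks partial-register dependences. The emitted instruction ultimately depends on the compiler; the function name is illustrative.
#include <stdint.h>

uint32_t widen_byte(const uint8_t *p)
{
    /* Typically compiles to MOVZX r32, byte ptr [mem]: a full 32-bit write. */
    return (uint32_t)*p;
}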
Assembler/Compiler Coding Rule 35. (ML impact, L generality) Avoid placing instructions that use 32-bit
immediates which cannot be encoded as sign-extended 16-bit immediates near each other. Try to schedule µops
that have no immediate immediately before or after µops with 32-bit immediates. ................................3-29
Assembler/Compiler Coding Rule 36. (ML impact, M generality) Use the TEST instruction instead of AND when the
result of the logical AND is not used; this saves µops in execution. Use a test of a register with itself instead of a
CMP of the register to zero; this avoids the need to encode the zero and saves encoding space. Avoid comparing a
constant to a memory operand. It is preferable to load the memory operand and compare the constant to a
register.........................................................................................................................................3-29
Assembler/Compiler Coding Rule 37. (ML impact, M generality) Eliminate unnecessary compare with zero
instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding
arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be certain that any code
transformations made do not introduce problems with overflow.............................................................. 3-29
Assembler/Compiler Coding Rule 38. (H impact, MH generality) For small loops, placing loop invariants in memory
is better than spilling loop-carried dependencies. ......................................................................................3-31
Assembler/Compiler Coding Rule 39. (M impact, ML generality) Avoid introducing dependences with partial
floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD
XMMREG1, XMMREG2 instruction instead. ..............................................................................................3-36
Assembler/Compiler Coding Rule 40. (H impact, M generality) Pass parameters in registers instead of on the stack
where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is
optimized in hardware by providing the value to the load directly from the memory order buffer without the need
to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant
latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long
latency operation.........................................................................................................................................3-48
Assembler/Compiler Coding Rule 41. (H impact, M generality) A load that forwards from a store must have the same
address start point and therefore the same alignment as the store data...................................................3-48
Assembler/Compiler Coding Rule 42. (H impact, M generality) The data of a load which is forwarded from a store
must be completely contained within the store data. ................................................................................3-48
Assembler/Compiler Coding Rule 43. (H impact, ML generality) If it is necessary to extract a non-aligned portion of
stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as
necessary. This is better than incurring the penalties of a failed store-forward.........................................3-48
Assembler/Compiler Coding Rule 44. (MH impact, ML generality) Avoid several small loads after large stores to the
same area of memory by using a single large read and register copies as needed.....................................3-48
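A hedged C sketch related to the store-forwarding guidance in Rules 41 through 44: after two 32-bit stores, read the data back with loads that are contained within and aligned to the individual stores, then combine the values in registers, instead of issuing one 64-bit load that spans both stores and can block forwarding. The struct and field names are illustrative.
#include <stdint.h>

struct pair { uint32_t lo, hi; };

uint64_t write_then_read(struct pair *p, uint32_t a, uint32_t b)
{
    p->lo = a;                 /* 32-bit store                               */
    p->hi = b;                 /* 32-bit store                               */
    uint64_t lo = p->lo;       /* 32-bit load forwarded from the first store */
    uint64_t hi = p->hi;       /* 32-bit load forwarded from the second store*/
    return lo | (hi << 32);    /* combine in registers rather than loading
                                  the whole struct with one 64-bit access    */
}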
Assembler/Compiler Coding Rule 45. (H impact, MH generality) Where it is possible to do so without incurring other
penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to
minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from
a long latency instruction - for example, MUL or DIV. Avoid store-forwarding data for variables with the shortest
store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and
especially avoid including a store forward on a loop-carried dependence chain. ......................................3-51
Assembler/Compiler Coding Rule 46. (M impact, MH generality) Calculate store addresses as early as possible to
avoid having stores block loads. ..................................................................................................................3-52
Assembler/Compiler Coding Rule 47. (H impact, M generality) Try to arrange data structures so they permit
sequential access. ........................................................................................................................................3-54
Assembler/Compiler Coding Rule 48. (H impact, M generality) Make sure that the stack is aligned at the largest
multi-byte granular data type boundary matching the register width........................................................3-54
Assembler/Compiler Coding Rule 49. (M impact, L generality) If (hopefully read-only) data must occur on the same
page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its
most likely target, and place the data after an unconditional branch......................................................3-55
Assembler/Compiler Coding Rule 50. (H impact, L generality) Always put code and data on separate pages. Avoid
self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code
that performs the modifications and the code being modified are on separate 4-KByte pages or on separate
aligned 1-KByte sub-pages...........................................................................................................................3-55
Assembler/Compiler Coding Rule 51. (H impact, L generality) If an inner loop writes to more than four arrays (four
distinct cache lines), apply loop fission to break up the body of the loop so only four arrays are being written to
in each iteration of each of the resulting loops...........................................................................................3-56
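A hedged C sketch of the loop fission in Rule 51: a loop that writes six distinct arrays per iteration is split into two loops that each write no more than four. The array names and loop bodies are illustrative assumptions.
#include <stddef.h>

void fissioned(float *a, float *b, float *c, float *d, float *e, float *f,
               const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {   /* first loop: four written arrays */
        a[i] = x[i] + 1.0f;
        b[i] = x[i] + 2.0f;
        c[i] = x[i] + 3.0f;
        d[i] = x[i] + 4.0f;
    }
    for (size_t i = 0; i < n; i++) {   /* second loop: remaining arrays   */
        e[i] = x[i] + 5.0f;
        f[i] = x[i] + 6.0f;
    }
}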
Assembler/Compiler Coding Rule 52. (H impact, M generality) Minimize changes to bits 8-12 of the floating-point
control word. Changes to more than two values (each value being a combination of the following bits: precision,
rounding and infinity control, and the rest of the bits in the FCW) lead to delays that are on the order of the pipeline
depth............................................................................................................................................3-72
Assembler/Compiler Coding Rule 53. (H impact, L generality) Minimize the number of changes to the rounding
mode. Do not use changes in the rounding mode to implement the floor and ceiling functions if this involves a
total of more than two values of the set of rounding, precision, and infinity bits......................................3-73
Assembler/Compiler Coding Rule 54. (H impact, L generality) Minimize the number of changes to the precision
mode............................................................................................................................................................3-73
Assembler/Compiler Coding Rule 55. (M impact, M generality) Use Streaming SIMD Extensions 2 or Streaming SIMD
Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their x87
counterparts, and they eliminate the overhead associated with the management of the x87 register stack. ....3-74
Assembler/Compiler Coding Rule 56. (H impact, M generality) Use the 32-bit versions of instructions in 64-bit mode
to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers. ..13-1
Assembler/Compiler Coding Rule 57. (M impact, MH generality) When they are needed to reduce register pressure,
use the 8 extra general purpose registers for integer code and 8 extra XMM registers for floating-point or SIMD
code. ............................................................................................................................................................ 13-2
Assembler/Compiler Coding Rule 58. (ML impact, M generality) Prefer 64-bit by 64-bit integer multiplication that
produces 64-bit results over multiplication that produces 128-bit results. ................................................13-2
Assembler/Compiler Coding Rule 59. (ML impact, M generality) Stagger accessing the high 64-bit result of a 128-bit
multiplication after accessing the low 64-bit result...................................................................13-2
Assembler/Compiler Coding Rule 60. (ML impact, M generality) Use the 64-bit versions of multiply for 32-bit integer
multiplies that require a 64-bit result. .........................................................................................13-6
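A minimal C illustration of Rule 60: widening one 32-bit operand before the multiply yields the full 64-bit product with a single 64-bit multiply in 64-bit mode, avoiding a 32x32 multiply plus fix-up code. The function name is illustrative.
#include <stdint.h>

uint64_t mul_widen(uint32_t a, uint32_t b)
{
    /* The cast promotes the multiply to 64x64 -> 64, capturing all 64
     * product bits of the 32-bit inputs. */
    return (uint64_t)a * b;
}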
Assembler/Compiler Coding Rule 61. (ML impact, M generality) Use the 64-bit versions of add for 64-bit adds. ..13-6
Assembler/Compiler Coding Rule 62. (L impact, L generality) If software prefetch instructions are necessary, use the
prefetch instructions provided by SSE. ........................................................................................................13-6
Assembler/Compiler Coding Rule 63. (H impact, H generality) Whenever a 256-bit AVX code block and 128-bit SSE
code block might execute in sequence, use the VZEROUPPER instruction to facilitate a transition to a “Clean”
state for the next block to execute from. ..................................................................................................15-11
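A hedged C/intrinsics sketch of Rule 63, assuming the file is built with AVX enabled: _mm256_zeroupper() places the upper YMM state in a clean condition after a 256-bit AVX block and before any 128-bit SSE code runs. Many compilers insert VZEROUPPER automatically; the explicit call is shown only for illustration, and the function name is an assumption.
#include <immintrin.h>

void scale_then_finish(float *dst, const float *src, float k, int n256)
{
    __m256 vk = _mm256_set1_ps(k);
    for (int i = 0; i < n256; i++) {           /* 256-bit AVX block        */
        __m256 v = _mm256_loadu_ps(src + 8 * i);
        _mm256_storeu_ps(dst + 8 * i, _mm256_mul_ps(v, vk));
    }
    _mm256_zeroupper();                        /* transition to a clean
                                                  upper state              */
    /* ... 128-bit SSE code may follow without a transition penalty ...    */
}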
Assembler/Compiler Coding Rule 64. (H impact, M generality) Align data to 32-byte boundary when possible. Prefer
store alignment over load alignment. .......................................................................................................15-24
Assembler/Compiler Coding Rule 65. (M impact, H generality) Align data to 32-byte boundary when possible. If it is
not possible to align both loads and stores, then prefer store alignment over load alignment. ..............15-26
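A minimal C sketch of the 32-byte alignment in Rules 64 and 65, using C11 aligned_alloc; the rounding keeps the size a multiple of the alignment, as that function requires. The helper name is illustrative.
#include <stdlib.h>

float *alloc_aligned_floats(size_t n)
{
    /* Round the byte count up to a multiple of 32 so aligned_alloc accepts it
     * and subsequent 256-bit loads/stores can be aligned. */
    size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return (float *)aligned_alloc(32, bytes);
}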
Assembler/Compiler Coding Rule 66. (M impact, M generality) Use blend instructions in lieu of shuffle instructions
in AVX whenever possible..........................................................................................................15-35
TUNING SUGGESTIONS
Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a
code page as instructions. This is very likely to happen when execution is following an indirect branch
that is not resident in the trace cache. If this is clearly causing a performance problem, try moving
the data elsewhere, or inserting an illegal opcode or a pause instruction immediately after the indirect
branch. Note that the latter two alternatives may degrade performance in some circumstances. 3-55
Tuning Suggestion 2. Optimize single-threaded code to maximize execution throughput first. 11-28
Tuning Suggestion 3. Employ an efficient threading model and leverage available tools (such as Intel
Threading Building Blocks, Intel Thread Checker, and Intel Thread Profiler) to achieve optimal processor
scaling with respect to the number of physical processors or processor cores. ................ 11-28