Chapter 13 Memory Ordering



If your code interacts directly with the hardware or with code executing on other cores, or if it directly loads or stores instructions that are to be executed, or modifies page tables, then you need to be aware of memory ordering issues. In most other cases, memory ordering is taken care of for you.
If you are writing an operating system kernel or device drivers, or implementing a hypervisor, a JIT compiler, or a multi-threaded library, then you must have a good understanding of the memory ordering rules of the ARM architecture.

The ARMv8 architecture uses a weakly-ordered memory model. In general terms, this means that the order of memory accesses is not required to match the program order of the load and store instructions. Reads and writes to Normal memory can be reordered by the hardware, constrained only by data dependencies and by explicit memory barrier instructions. Certain situations require stronger ordering rules.

Very high performance systems may support techniques such as speculative memory reads, multiple instruction issue, and out-of-order execution. These and other techniques offer further possibilities for the hardware to reorder memory accesses:

  1. Multiple issue of instructions
    A processor can issue and execute multiple instructions per cycle, so instructions that follow one another in program order can execute at the same time.
  2. Out-of-order execution
    Many processors support out-of-order execution of non-dependent instructions.
    Whenever an instruction stalls while waiting for the result of a preceding instruction, the processor can execute subsequent instructions that have no dependency on it.
  3. Speculation
    When the processor encounters a conditional instruction, such as a branch, it can speculatively begin executing instructions before it knows for certain whether that instruction must be executed. If the speculation turns out to be correct, the results are therefore available sooner.
  4. Speculative loads
    If a load instruction that reads from a cacheable location is executed speculatively, this can result in a cache line fill and the potential eviction of an existing cache line.
  5. Load and store optimizations
    Because reads and writes to external memory can have long latencies, processors can reduce the number of transfers, for example by merging several stores into one larger transaction.
  6. External memory systems
    In many complex System-on-Chip (SoC) devices, there are several agents capable of initiating transfers, and multiple routes to the slave devices being read or written. Some of these devices, such as a DRAM controller, may be able to accept simultaneous requests from different masters. Transactions can be buffered, or reordered, by the interconnect. This means that accesses from different masters can take different numbers of cycles to complete and can overtake each other.
  7. Cache coherent multi-core processing
    In a multi-core processor, hardware cache coherency can migrate cache lines between cores. Different cores can therefore see updates to cached memory locations in different orders.
  8. Optimizing compilers
    An optimizing compiler can reorder instructions to hide latencies or make best use of hardware features. It can often move a memory access forward, making it earlier, to give it more time to complete before the value is required.

If you have multiple cores that communicate through shared memory, or that share data in other ways, then memory ordering considerations become more important. This chapter discusses several topics that relate to multiprocessing (MP) operation and the synchronization of multiple execution threads. It also discusses the memory types and rules defined by the architecture, and how they can be controlled.

13.1 Memory types
The ARMv8 architecture defines two mutually exclusive memory types: Normal and Device.
In addition to the memory type, attributes also provide control over cacheability, shareability, access permission, and execute permission. Shareable and cacheable attributes apply only to Normal memory. Device regions are always deemed non-cacheable and outer-shareable. For cacheable locations, attributes can be used to indicate the cache allocation policy to the processor.
13.1.1 Normal memory
All code and most data regions in memory use Normal memory. Normal memory includes RAM, Flash, or ROM regions of physical memory. This memory type gives the best processor performance, because it is weakly ordered and places fewer restrictions on the processor. The processor can reorder, repeat, and merge accesses to Normal memory.
However, the processor must always handle hazards caused by address dependencies.
13.1.2 Device memory
You use Device memory for all memory regions where an access might have a side effect. The Device memory type imposes more restrictions on the core. Speculative data accesses cannot be performed to Device memory, with one rare exception: if NEON operations are used to read bytes from Device memory, the processor might read bytes that were not explicitly referenced, provided they lie within an aligned 16-byte block that contains one or more of the explicitly referenced bytes.
Trying to execute code from a region marked as Device is generally UNPREDICTABLE. The implementation might handle the instruction fetch as if it were to a memory location with the Normal non-cacheable attribute, or it might take a permission fault.
There are four different types of Device memory, to which different rules apply:
•Device-nGnRnE, the most restrictive (equivalent to Strongly Ordered memory in the ARMv7 architecture).
•Device-nGnRE
•Device-nGRE
•Device-GRE, the least restrictive
The letter suffixes refer to the following three properties:

  1. Gathering or non-Gathering (G or nG)
    This property determines whether multiple accesses to this memory region can be merged into a single bus transaction. If the address is marked as non-Gathering (nG), then the number and size of the accesses performed on the memory bus to that location must exactly match the number and size of the explicit accesses in the code. If the address is marked as Gathering (G), then the processor can, for example, merge two byte writes into a single halfword write.
  2. Re-ordering (R or nR)
    This determines whether accesses to the same device can be reordered with respect to each other. If the address is marked as non-Reordering (nR), then accesses within the same block always appear on the bus in program order. The size of this block is IMPLEMENTATION DEFINED. If the block size is large, it could span several table entries. In that case, the ordering rule is observed with respect to any other accesses that are also marked as nR.
  3. Early Write Acknowledgement (E or nE)
    This determines whether an intermediate write buffer between the processor and the slave device being accessed is allowed to send an acknowledgement of write completion. If the address is marked as non-Early Write Acknowledgement (nE), then the write response must come from the peripheral itself. If the address is marked as Early Write Acknowledgement (E), then buffering in the interconnect logic is allowed to signal acceptance of the write before the write actually reaches the end device. This is essentially a message to the external memory system.

13.2 Barriers
The ARM architecture includes barrier instructions that force access ordering and access completion at a specific point.

Instruction Synchronization Barrier (ISB). The strictest barrier: it flushes the pipeline, so that all instructions before the ISB are guaranteed to have completed before any instruction after it is fetched and executed.
Data Synchronization Barrier (DSB). Stricter than DMB: no instruction after the DSB executes until all memory accesses before it have completed. That is, every subsequent instruction, not just memory accesses, must wait for the earlier memory accesses.
Data Memory Barrier (DMB). The DMB instruction guarantees that memory accesses after it are not committed until all memory accesses before it have completed.

13.2.1 One-way barriers
AArch64 adds load and store instructions with implicit barrier semantics, which constrain the order in which the loads and stores around them can be observed.
Load-Acquire (LDAR)
All loads and stores that come after the LDAR in program order, and that match the shareability domain of the target address, must be observed after the LDAR.
Store-Release (STLR)
All loads and stores that precede the STLR, and that match the shareability domain of the target address, must be observed before the STLR.
Exclusive versions of the above, LDAXR and STLXR, are also available.

The data barrier instructions take a qualifier to control which shareability domains see the effect of the barrier, whereas the LDAR and STLR instructions use the attributes of the address being accessed.
13.2.2 ISB in more detail
The ARMv8 architecture defines context as the state of the system registers (compare the register state that is saved and restored during Cortex-M3 task scheduling), and context-changing operations as cache or TLB maintenance, branch predictor maintenance operations, or changes to system control registers (for example SCTLR_EL1, TCR_EL1, or TTBRn_EL1). The effect of such a context-changing operation is only guaranteed to be visible after a context synchronization event.

There are three kinds of context synchronization event:
•Taking an exception.
•Returning from an exception.
•An Instruction Synchronization Barrier (ISB).

An ISB flushes the pipeline and refetches the instructions from cache or memory, and it ensures that the effects of any context-changing operation completed before the ISB are visible to any instruction after the ISB. It also ensures that any context-changing operation after the ISB instruction only takes effect after the ISB has executed, and is not seen by instructions before the ISB. However, not every operation on a system register requires an ISB.

Example: writing bits [21:20] of the CPACR_EL1 register (the FPEN field). The ISB is a context synchronization event that ensures the instructions after it only execute once the write to CPACR_EL1 has taken effect.
MRS X1, CPACR_EL1          // read the current value
ORR X1, X1, #(0x3 << 20)   // set bits [21:20] (FPEN)
MSR CPACR_EL1, X1          // write it back
ISB                        // make the change visible to later instructions

13.2.3 Use of barriers in C code
On a single core, the sequence points of C11/C++11 constrain the compiler, but they say nothing about the order in which other cores observe your accesses. For multi-core code, use the C11/C++11 atomics and fences, or locks; their implementations include the necessary barrier instructions.

13.2.4 Non-temporal load and store pair
A new concept in ARMv8 is the non-temporal load and store. These are the LDNP and STNP instructions, which read or write a pair of register values while giving the memory system a hint that caching the data is not useful, for example when streaming through a large buffer once.

13.3 Memory attributes
The memory map of a system is divided into a number of regions. Each region may need different memory attributes, such as access permissions (including read and write permissions for different privilege levels), the memory type, and cache policies. Functional pieces of code and data are typically grouped together in the memory map, and the attributes of each region are controlled separately. This function is performed by the Memory Management Unit. Translation table entries enable the MMU hardware to translate virtual addresses to physical addresses. In addition, they specify a number of attributes that are associated with each page.
[Figure: translation table entry, showing the attribute fields]
•UXN and PXN are the execute permissions
•AF is the access flag
•SH is the shareability attribute
•AP is the access permission
•NS is the security bit, used only at EL3 and Secure EL1
•Indx is the index into the Memory Attribute Indirection Register, MAIR_ELn
For clarity, not all bits are shown in the figure.
13.3.1 Cacheable and shareable memory attributes
See Chapter 14 Multi-core processors for more information about cacheable memory.
The shareable attribute is used to define whether a location is shared with multiple cores. Marking a region as Non-shareable means that it is used only by this core, whereas marking it as Inner Shareable or Outer Shareable, or both, means that the location is shared with other observers; for example, a GPU or a DMA device might be considered another observer. The division between inner and outer is likewise IMPLEMENTATION DEFINED. The architectural definition of these attributes is that they enable the definition of sets of observers for which the shareability attributes make the data or unified caches transparent for data accesses.
This means that the system provides hardware coherency management, so that two cores in an inner shareable domain must see a coherent copy of any location marked as Inner Shareable. If a processor or other master in the system does not support coherency, then it must treat the shareable region as non-cacheable.
Cache coherency hardware comes at a cost: data memory accesses can take longer and consume more energy than they otherwise would. This overhead can be minimized by maintaining coherency between a smaller number of masters and by ensuring that they are physically close to each other in silicon. For this reason, the architecture splits the system into multiple domains, so that the overhead can be limited to just those locations where coherency is required. (The concept of a domain also appears in operating systems.)
Non-shareable
This represents memory accessible only by a single processor or other agent, so memory accesses never need to be synchronized with other processors. This domain is not typically used in SMP systems.
Inner shareable
This represents a shareability domain that can be shared by multiple processors, but not necessarily by all of the agents in the system. A system might have multiple inner shareable domains. An operation that affects one inner shareable domain does not affect other inner shareable domains in the system. An example of such a domain might be a quad-core Cortex-A57 cluster.
Outer shareable
An outer shareable (OSH) domain is shared by multiple agents and can consist of one or more inner shareable domains. An operation that affects an outer shareable domain also implicitly affects all of the inner shareable domains inside it.
However, it does not otherwise behave as an inner shareable operation.
Full system
An operation on the full system (SY) affects all observers in the system.
