The RISC-V Instruction Set Manual for CV64A6_MMU: Volume II: Privileged Architecture
- Preface
- 1. Introduction
- 2. Control and Status Registers (CSRs)
- 3. Machine-Level ISA, Version 1.13
    - 3.1. Machine-Level CSRs
        - 3.1.1. Machine ISA (misa) Register
        - 3.1.2. Machine Vendor ID (mvendorid) Register
        - 3.1.3. Machine Architecture ID (marchid) Register
        - 3.1.4. Machine Implementation ID (mimpid) Register
        - 3.1.5. Hart ID (mhartid) Register
        - 3.1.6. Machine Status (mstatus) Register
            - 3.1.6.1. Privilege and Global Interrupt-Enable Stack in mstatus register
            - 3.1.6.2. Double Trap Control in mstatus Register
            - 3.1.6.3. Base ISA Control in mstatus Register
            - 3.1.6.4. Memory Privilege in mstatus Register
            - 3.1.6.5. Endianness Control in mstatus and mstatush Registers
            - 3.1.6.6. Virtualization Support in mstatus Register
            - 3.1.6.7. Extension Context Status in mstatus Register
            - 3.1.6.8. Previous Expected Landing Pad (ELP) State in mstatus Register
        - 3.1.7. Machine Trap-Vector Base-Address (mtvec) Register
        - 3.1.8. Machine Trap Delegation (medeleg and mideleg) Registers
        - 3.1.9. Machine Interrupt (mip and mie) Registers
        - 3.1.10. Hardware Performance Monitor
        - 3.1.11. Machine Counter-Enable (mcounteren) Register
        - 3.1.12. Machine Counter-Inhibit (mcountinhibit) Register
        - 3.1.13. Machine Scratch (mscratch) Register
        - 3.1.14. Machine Exception Program Counter (mepc) Register
        - 3.1.15. Machine Cause (mcause) Register
        - 3.1.16. Machine Trap Value (mtval) Register
        - 3.1.17. Machine Configuration Pointer (mconfigptr) Register
        - 3.1.18. Machine Environment Configuration (menvcfg) Register
        - 3.1.19. Machine Security Configuration (mseccfg) Register
    - 3.2. Machine-Level Memory-Mapped Registers
    - 3.3. Machine-Mode Privileged Instructions
    - 3.4. Reset
    - 3.5. Non-Maskable Interrupts
    - 3.6. Physical Memory Attributes
    - 3.7. Physical Memory Protection
- 4. "Smstateen/Ssstateen" Extensions, Version 1.0
- 5. "Smcsrind/Sscsrind" Indirect CSR Access, Version 1.0
- 6. "Smepmp" Extension for PMP Enhancements for memory access and execution prevention in Machine mode, Version 1.0
- 7. "Smcntrpmf" Cycle and Instret Privilege Mode Filtering, Version 1.0
- 8. "Smrnmi" Extension for Resumable Non-Maskable Interrupts, Version 0.5
- 9. "Smcdeleg" Counter Delegation Extension, Version 1.0
- 10. "Smdbltrp" Double Trap Extension, Version 1.0
- 11. Supervisor-Level ISA, Version 1.13
    - 11.1. Supervisor CSRs
        - 11.1.1. Supervisor Status (sstatus) Register
        - 11.1.2. Supervisor Trap Vector Base Address (stvec) Register
        - 11.1.3. Supervisor Interrupt (sip and sie) Registers
        - 11.1.4. Supervisor Timers and Performance Counters
        - 11.1.5. Counter-Enable (scounteren) Register
        - 11.1.6. Supervisor Scratch (sscratch) Register
        - 11.1.7. Supervisor Exception Program Counter (sepc) Register
        - 11.1.8. Supervisor Cause (scause) Register
        - 11.1.9. Supervisor Trap Value (stval) Register
        - 11.1.10. Supervisor Environment Configuration (senvcfg) Register
        - 11.1.11. Supervisor Address Translation and Protection (satp) Register
    - 11.2. Supervisor Instructions
    - 11.3. Sv39: Page-Based 39-bit Virtual-Memory System
- 12. "Sstc" Extension for Supervisor-mode Timer Interrupts, Version 1.0
- 13. "Sscofpmf" Extension for Count Overflow and Mode-Based Filtering, Version 1.0
- 14. "H" Extension for Hypervisor Support, Version 1.0
- 15. Control-flow Integrity (CFI)
- 16. "Ssdbltrp" Double Trap Extension, Version 1.0
- 17. RISC-V Privileged Instruction Set Listings
- 18. History
- Bibliography
This document describes the RISC-V privileged architecture tailored for OpenHW Group CV64A6_MMU. Parts of the original specification that are not relevant (e.g. unsupported extensions) are replaced by placeholders.

Contributors to all versions of the spec in alphabetical order (please contact editors to suggest corrections): Krste Asanović, Peter Ashenden, Rimas Avižienis, Jacob Bachmeyer, Allen J. Baum, Jonathan Behrens, Paolo Bonzini, Ruslan Bukin, Christopher Celio, Chuanhua Chang, David Chisnall, Anthony Coulter, Palmer Dabbelt, Monte Dalrymple, Paul Donahue, Greg Favor, Dennis Ferguson, Marc Gauthier, Andy Glew, Gary Guo, Mike Frysinger, John Hauser, David Horner, Olof Johansson, David Kruckemyer, Yunsup Lee, Daniel Lustig, Andrew Lutomirski, Prashanth Mundkur, Jonathan Neuschäfer, Rishiyur Nikhil, Stefan O’Rear, Albert Ou, John Ousterhout, David Patterson, Dmitri Pavlov, Kade Phillips, Josh Scheid, Colin Schmidt, Michael Taylor, Wesley Terpstra, Matt Thomas, Tommy Thorn, Ray VanDeWalker, Megan Wachs, Steve Wallach, Andrew Waterman, Claire Wolf, and Reinoud Zandijk.

This document is released under a Creative Commons Attribution 4.0 International License.

This document is a derivative of the RISC-V privileged specification version 1.9.1 released under the following license: ©2010-2017 Andrew Waterman, Yunsup Lee, Rimas Avižienis, David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License.

Contributors to CV64A6_MMU versions of the spec in alphabetical order: Jean-Roch Coulon, André Sintzoff.
Preface

Preface to Version for CV64A6_MMU

This document describes the RISC-V privileged architecture tailored for OpenHW Group CV64A6_MMU.

Preface to Version 20240612

This document describes the RISC-V privileged architecture. This release, version 20240612, contains the following versions of the RISC-V ISA modules:
Module | Version | Status |
---|---|---|
Machine ISA | 1.13 | Draft |
The following changes have been made since version 1.12 of the Machine and Supervisor ISAs, which, while not strictly backwards compatible, are not anticipated to cause software portability problems in practice:

- Redefined misa.MXL to be read-only, making MXLEN a constant.
- Added the constraint that SXLEN≥UXLEN.
Additionally, the following compatible changes have been made to the Machine and Supervisor ISAs since version 1.12:

- Defined the misa.B field to reflect that the B extension has been implemented.
- Defined the misa.V field to reflect that the V extension has been implemented.
- Defined the RV32-only medelegh and hedelegh CSRs.
- Defined the misaligned atomicity granule PMA, superseding the proposed Zam extension.
- Allocated interrupt 13 for Sscofpmf LCOFI interrupt.
- Defined hardware error and software check exception codes.
- Specified synchronization requirements when changing the PBMTE fields in menvcfg and henvcfg.
- Exposed count-overflow interrupts to VS-mode via the Shlcofideleg extension.
Finally, the following clarifications and document improvements have been made since the last document release:

- Transliterated the document from LaTeX into AsciiDoc.
- Included all ratified extensions through March 2024.
- Clarified that "platform- or custom-use" interrupts are actually "platform-use interrupts", where the platform can choose to make some custom.
- Clarified semantics of explicit accesses to CSRs wider than XLEN bits.
- Clarified that MXLEN≥SXLEN.
- Clarified that WFI is not a HINT instruction.
- Clarified that VS-stage page-table accesses set G-stage A/D bits.
- Clarified ordering rules when PBMT=IO is used on main-memory regions.
- Clarified ordering rules for hardware A/D bit updates.
- Clarified that, for a given exception cause, xtval might sometimes be set to a nonzero value but sometimes not.
- Clarified exception behavior of unimplemented or inaccessible CSRs.
- Clarified that Svpbmt allows implementations to override additional PMAs.
- Replaced the concept of vacant memory regions with inaccessible memory or I/O regions.
Preface to Version 20211203

This document describes the RISC-V privileged architecture. This release, version 20211203, contains the following versions of the RISC-V ISA modules:

Module | Version | Status |
---|---|---|
Machine ISA | 1.12 | Ratified |
The following changes have been made since version 1.11, which, while not strictly backwards compatible, are not anticipated to cause software portability problems in practice:

- Changed MRET and SRET to clear mstatus.MPRV when leaving M-mode.
- Reserved additional satp patterns for future use.
- Stated that the scause Exception Code field must implement bits 4–0 at minimum.
- Relaxed I/O regions have been specified to follow RVWMO. The previous specification implied that PPO rules other than fences and acquire/release annotations did not apply.
- Constrained the LR/SC reservation set size and shape when using page-based virtual memory.
- PMP changes require an SFENCE.VMA on any hart that implements page-based virtual memory, even if VM is not currently enabled.
- Allowed for speculative updates of page table entry A bits.
- Clarified that if the address-translation algorithm non-speculatively reaches a PTE in which a bit reserved for future standard use is set, a page-fault exception must be raised.
Additionally, the following compatible changes have been made since version 1.11:

- Removed the N extension.
- Defined the mandatory RV32-only CSR mstatush, which contains most of the same fields as the upper 32 bits of RV64’s mstatus.
- Defined the mandatory CSR mconfigptr, which if nonzero contains the address of a configuration data structure.
- Defined optional mseccfg and mseccfgh CSRs, which control the machine’s security configuration.
- Defined menvcfg, henvcfg, and senvcfg CSRs (and RV32-only menvcfgh and henvcfgh CSRs), which control various characteristics of the execution environment.
- Designated part of SYSTEM major opcode for custom use.
- Permitted the unconditional delegation of less-privileged interrupts.
- Added optional big-endian and bi-endian support.
- Made priority of load/store/AMO address-misaligned exceptions implementation-defined relative to load/store/AMO page-fault and access-fault exceptions.
- PMP reset values are now platform-defined.
- An additional 48 optional PMP registers have been defined.
- Slightly relaxed the atomicity requirement for A and D bit updates performed by the implementation.
- Clarified the architectural behavior of address-translation caches.
- Added Sv57 and Sv57x4 address translation modes.
- Software breakpoint exceptions are permitted to write either 0 or the pc to xtval.
- Clarified that bare S-mode need not support the SFENCE.VMA instruction.
- Specified relaxed constraints for implicit reads of non-idempotent regions.
- Added the Svnapot Standard Extension, along with the N bit in Sv39, Sv48, and Sv57 PTEs.
- Added the Svpbmt Standard Extension, along with the PBMT bits in Sv39, Sv48, and Sv57 PTEs.
- Added the Svinval Standard Extension and associated instructions.
Finally, the hypervisor architecture proposal has been extensively revised.

Preface to Version 1.11

This is version 1.11 of the RISC-V privileged architecture. The document contains the following versions of the RISC-V ISA modules:

Module | Version | Status |
---|---|---|
Machine ISA | 1.11 | Ratified |
Changes from version 1.10 include:

- Moved Machine and Supervisor spec to Ratified status.
- Improvements to the description and commentary.
- Added a draft proposal for a hypervisor extension.
- Specified which interrupt sources are reserved for standard use.
- Allocated some synchronous exception causes for custom use.
- Specified the priority ordering of synchronous exceptions.
- Added specification that xRET instructions may, but are not required to, clear LR reservations if A extension present.
- The virtual-memory system no longer permits supervisor mode to execute instructions from user pages, regardless of the SUM setting.
- Clarified that ASIDs are private to a hart, and added commentary about the possibility of a future global-ASID extension.
- SFENCE.VMA semantics have been clarified.
- Made the mstatus.MPP field WARL, rather than WLRL.
- Made the unused xip fields WPRI, rather than WIRI.
- Made the unused misa fields WARL, rather than WIRI.
- Made the unused pmpaddr and pmpcfg fields WARL, rather than WIRI.
- Required all harts in a system to employ the same PTE-update scheme as each other.
- Rectified an editing error that misdescribed the mechanism by which mstatus.xIE is written upon an exception.
- Described scheme for emulating misaligned AMOs.
- Specified the behavior of the misa and xepc registers in systems with variable IALIGN.
- Specified the behavior of writing self-contradictory values to the misa register.
- Defined the mcountinhibit CSR, which stops performance counters from incrementing to reduce energy consumption.
- Specified semantics for PMP regions coarser than four bytes.
- Specified contents of CSRs across XLEN modification.
- Moved PLIC chapter into its own document.
Preface to Version 1.10

This is version 1.10 of the RISC-V privileged architecture proposal. Changes from version 1.9.1 include:

- The previous version of this document was released under a Creative Commons Attribution 4.0 International License by the original authors, and this and future versions of this document will be released under the same license.
- The explicit convention on shadow CSR addresses has been removed to reclaim CSR space. Shadow CSRs can still be added as needed.
- The mvendorid register now contains the JEDEC code of the core provider as opposed to a code supplied by the Foundation. This avoids redundancy and offloads work from the Foundation.
- The interrupt-enable stack discipline has been simplified.
- An optional mechanism to change the base ISA used by supervisor and user modes has been added to the mstatus CSR, and the field previously called Base in misa has been renamed to MXL for consistency.
- Clarified expected use of XS to summarize additional extension state status fields in mstatus.
- Optional vectored interrupt support has been added to the mtvec and stvec CSRs.
- The SEIP and UEIP bits in the mip CSR have been redefined to support software injection of external interrupts.
- The mbadaddr register has been subsumed by a more general mtval register that can now capture bad instruction bits on an illegal instruction fault to speed instruction emulation.
- The machine-mode base-and-bounds translation and protection schemes have been removed from the specification as part of moving the virtual memory configuration to sptbr (now satp). Some of the motivation for the base and bound schemes are now covered by the PMP registers, but space remains available in mstatus to add these back at a later date if deemed useful.
- In systems with only M-mode, or with both M-mode and U-mode but without U-mode trap support, the medeleg and mideleg registers now do not exist, whereas previously they returned zero.
- Virtual-memory page faults now have mcause values distinct from physical-memory access faults. Page-fault exceptions can now be delegated to S-mode without delegating exceptions generated by PMA and PMP checks.
- An optional physical-memory protection (PMP) scheme has been proposed.
- The supervisor virtual memory configuration has been moved from the mstatus register to the sptbr register. Accordingly, the sptbr register has been renamed to satp (Supervisor Address Translation and Protection) to reflect its broadened role.
- The SFENCE.VM instruction has been removed in favor of the improved SFENCE.VMA instruction.
- The mstatus bit MXR has been exposed to S-mode via sstatus.
- The polarity of the PUM bit in sstatus has been inverted to shorten code sequences involving MXR. The bit has been renamed to SUM.
- Hardware management of page-table entry Accessed and Dirty bits has been made optional; simpler implementations may trap to software to set them.
- The counter-enable scheme has changed, so that S-mode can control availability of counters to U-mode.
- H-mode has been removed, as we are focusing on recursive virtualization support in S-mode. The encoding space has been reserved and may be repurposed at a later date.
- A mechanism to improve virtualization performance by trapping S-mode virtual-memory management operations has been added.
- The Supervisor Binary Interface (SBI) chapter has been removed, so that it can be maintained as a separate specification.
Preface to Version 1.9.1

This is version 1.9.1 of the RISC-V privileged architecture proposal. Changes from version 1.9 include:

- Numerous additions and improvements to the commentary sections.
- Changed the configuration string proposal to use a search process that supports various formats including Device Tree String and flattened Device Tree.
- Made misa optionally writable to support modifying base and supported ISA extensions. CSR address of misa changed.
- Added description of debug mode and debug CSRs.
- Added a hardware performance monitoring scheme. Simplified the handling of existing hardware counters, removing privileged versions of the counters and the corresponding delta registers.
- Fixed description of SPIE in presence of user-level interrupts.
1. Introduction

This document describes the RISC-V privileged architecture, which covers all aspects of RISC-V systems beyond the unprivileged ISA, including privileged instructions as well as additional functionality required for running operating systems and attaching external devices.

> Commentary on our design decisions is formatted as in this paragraph, and can be skipped if the reader is only interested in the specification itself.
>
> We briefly note that the entire privileged-level design described in this document could be replaced with an entirely different privileged-level design without changing the unprivileged ISA, and possibly without even changing the ABI. In particular, this privileged specification was designed to run existing popular operating systems, and so embodies the conventional level-based protection model. Alternate privileged specifications could embody other more flexible protection-domain models. For simplicity of expression, the text is written as if this was the only possible privileged architecture.
1.1. RISC-V Privileged Software Stack Terminology
+This section describes the terminology we use to describe components of +the wide range of possible privileged software stacks for RISC-V.
+Figure 1 shows some of the possible software stacks +that can be supported by the RISC-V architecture. The left-hand side +shows a simple system that supports only a single application running on +an application execution environment (AEE). The application is coded to +run with a particular application binary interface (ABI). The ABI +includes the supported user-level ISA plus a set of ABI calls to +interact with the AEE. The ABI hides details of the AEE from the +application to allow greater flexibility in implementing the AEE. The +same ABI could be implemented natively on multiple different host OSs, +or could be supported by a user-mode emulation environment running on a +machine with a different native ISA.
> Our graphical convention represents abstract interfaces using black boxes with white text, to separate them from concrete instances of components implementing the interfaces.
The middle configuration shows a conventional operating system (OS) that +can support multiprogrammed execution of multiple applications. Each +application communicates over an ABI with the OS, which provides the +AEE. Just as applications interface with an AEE via an ABI, RISC-V +operating systems interface with a supervisor execution environment +(SEE) via a supervisor binary interface (SBI). An SBI comprises the +user-level and supervisor-level ISA together with a set of SBI function +calls. Using a single SBI across all SEE implementations allows a single +OS binary image to run on any SEE. The SEE can be a simple boot loader +and BIOS-style IO system in a low-end hardware platform, or a +hypervisor-provided virtual machine in a high-end server, or a thin +translation layer over a host operating system in an architecture +simulation environment.
> Most supervisor-level ISA definitions do not separate the SBI from the execution environment and/or the hardware platform, complicating virtualization and bring-up of new hardware platforms.
The rightmost configuration shows a virtual machine monitor +configuration where multiple multiprogrammed OSs are supported by a +single hypervisor. Each OS communicates via an SBI with the hypervisor, +which provides the SEE. The hypervisor communicates with the hypervisor +execution environment (HEE) using a hypervisor binary interface (HBI), +to isolate the hypervisor from details of the hardware platform.
> The ABI, SBI, and HBI are still a work-in-progress, but we are now prioritizing support for Type-2 hypervisors where the SBI is provided recursively by an S-mode OS.
Hardware implementations of the RISC-V ISA will generally require +additional features beyond the privileged ISA to support the various +execution environments (AEE, SEE, or HEE).
+1.2. Privilege Levels
+At any time, a RISC-V hardware thread (hart) is running at some +privilege level encoded as a mode in one or more CSRs (control and +status registers). Three RISC-V privilege levels are currently defined +as shown in Table 1.
Table 1. RISC-V privilege levels.

Level | Encoding | Name | Abbreviation |
---|---|---|---|
0 | 00 | User/Application | U |
1 | 01 | Supervisor | S |
3 | 11 | Machine | M |
Privilege levels are used to provide protection between different +components of the software stack, and attempts to perform operations not +permitted by the current privilege mode will cause an exception to be +raised. These exceptions will normally cause traps into an underlying +execution environment.
> In the description, we try to separate the privilege level for which code is written, from the privilege mode in which it runs, although the two are often tied. For example, a supervisor-level operating system can run in supervisor-mode on a system with three privilege modes, but can also run in user-mode under a classic virtual machine monitor on systems with two or more privilege modes. In both cases, the same supervisor-level operating system binary code can be used, coded to a supervisor-level SBI and hence expecting to be able to use supervisor-level privileged instructions and CSRs. When running a guest OS in user mode, all supervisor-level actions will be trapped and emulated by the SEE running in the higher-privilege level.
The machine level has the highest privileges and is the only mandatory +privilege level for a RISC-V hardware platform. Code run in machine-mode +(M-mode) is usually inherently trusted, as it has low-level access to +the machine implementation. M-mode can be used to manage secure +execution environments on RISC-V. User-mode (U-mode) and supervisor-mode +(S-mode) are intended for conventional application and operating system +usage respectively.
+Each privilege level has a core set of privileged ISA extensions with +optional extensions and variants. For example, machine-mode supports an +optional standard extension for memory protection. Also, supervisor mode +can be extended to support Type-2 hypervisor execution as described in +Chapter 14.
+Implementations might provide anywhere from 1 to 3 privilege modes +trading off reduced isolation for lower implementation cost, as shown in +Table 2.
Table 2. Supported combinations of privilege modes.

Number of levels | Supported Modes | Intended Usage |
---|---|---|
1 | M | Simple embedded systems |
2 | M, U | Secure embedded systems |
3 | M, S, U | Systems running Unix-like operating systems |
All hardware implementations must provide M-mode, as this is the only +mode that has unfettered access to the whole machine. The simplest +RISC-V implementations may provide only M-mode, though this will provide +no protection against incorrect or malicious application code.
> The lock feature of the optional PMP facility can provide some limited protection even with only M-mode implemented.
Many RISC-V implementations will also support at least user mode +(U-mode) to protect the rest of the system from application code. +Supervisor mode (S-mode) can be added to provide isolation between a +supervisor-level operating system and the SEE.
+A hart normally runs application code in U-mode until some trap (e.g., a +supervisor call or a timer interrupt) forces a switch to a trap handler, +which usually runs in a more privileged mode. The hart will then execute +the trap handler, which will eventually resume execution at or after the +original trapped instruction in U-mode. Traps that increase privilege +level are termed vertical traps, while traps that remain at the same +privilege level are termed horizontal traps. The RISC-V privileged +architecture provides flexible routing of traps to different privilege +layers.
> Horizontal traps can be implemented as vertical traps that return control to a horizontal trap handler in the less-privileged mode.
1.3. Debug Mode
+Implementations may also include a debug mode to support off-chip +debugging and/or manufacturing test. Debug mode (D-mode) can be +considered an additional privilege mode, with even more access than +M-mode. The separate debug specification proposal describes operation of +a RISC-V hart in debug mode. Debug mode reserves a few CSR addresses +that are only accessible in D-mode, and may also reserve some portions +of the physical address space on a platform.
+2. Control and Status Registers (CSRs)
+The SYSTEM major opcode is used to encode all privileged instructions in +the RISC-V ISA. These can be divided into two main classes: those that +atomically read-modify-write control and status registers (CSRs), which +are defined in the Zicsr extension, and all other privileged +instructions. The privileged architecture requires the Zicsr extension; +which other privileged instructions are required depends on the +privileged-architecture feature set.
+In addition to the unprivileged state described in Volume I of this +manual, an implementation may contain additional CSRs, accessible by +some subset of the privilege levels using the CSR instructions described +in Volume I. In this chapter, we map out the CSR address space. The +following chapters describe the function of each of the CSRs according +to privilege level, as well as the other privileged instructions which +are generally closely associated with a particular privilege level. Note +that although CSRs and instructions are associated with one privilege +level, they are also accessible at all higher privilege levels.
+Standard CSRs do not have side effects on reads but may have side +effects on writes.
+2.1. CSR Address Mapping Conventions
+The standard RISC-V ISA sets aside a 12-bit encoding space (csr[11:0])
for up to 4,096 CSRs. By convention, the upper 4 bits of the CSR address (csr[11:8]) are used to encode the read and write accessibility of the CSRs according to privilege level as shown in Table 3. The top two bits (csr[11:10]) indicate whether the register is read/write (00, 01, or 10) or read-only (11). The next two bits (csr[9:8]) encode the lowest privilege level that can access the CSR.
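A minimal C sketch of this address-decoding convention (the helper names below are illustrative, not part of the specification):

```c
#include <stdbool.h>
#include <stdint.h>

/* csr[11:10]: 00, 01, or 10 means read/write; 11 means read-only. */
static bool csr_is_read_only(uint16_t csr_addr)
{
    return ((csr_addr >> 10) & 0x3) == 0x3;
}

/* csr[9:8]: lowest privilege level that can access the CSR
 * (0 = U, 1 = S, 2 = hypervisor/VS, 3 = M). */
static unsigned csr_min_privilege(uint16_t csr_addr)
{
    return (csr_addr >> 8) & 0x3;
}
```

For example, csr_is_read_only(0xF11) returns true and csr_min_privilege(0xF11) returns 3, matching mvendorid being a machine-level read-only CSR.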
> The CSR address convention uses the upper bits of the CSR address to encode default access privileges. This simplifies error checking in the hardware and provides a larger CSR space, but does constrain the mapping of CSRs into the address space.
>
> Implementations might allow a more-privileged level to trap otherwise permitted CSR accesses by a less-privileged level to allow these accesses to be intercepted. This change should be transparent to the less-privileged software.
Instructions that access a non-existent CSR are reserved. +Attempts to access a CSR without appropriate privilege level +raise illegal-instruction exceptions or, as described in +[sec:hcauses], virtual-instruction exceptions. +Attempts to write a read-only register raise illegal-instruction exceptions. +A read/write register might also contain some bits that are +read-only, in which case writes to the read-only bits are ignored.
+Table 3 also indicates the convention to +allocate CSR addresses between standard and custom uses. The CSR +addresses designated for custom uses will not be redefined by future +standard extensions.
+Machine-mode standard read-write CSRs 0x7A0
-0x7BF are reserved for use by the debug system. Of these CSRs, 0x7A0-0x7AF are accessible to machine mode, whereas 0x7B0-0x7BF are only visible to debug mode. Implementations should raise illegal-instruction exceptions on machine-mode access to the latter set of registers.
> Effective virtualization requires that as many instructions run natively as possible inside a virtualized environment, while any privileged accesses trap to the virtual machine monitor. (Goldberg, 1974) CSRs that are read-only at some lower privilege level are shadowed into separate CSR addresses if they are made read-write at a higher privilege level. This avoids trapping permitted lower-privilege accesses while still causing traps on illegal accesses. Currently, the counters are the only shadowed CSRs.
2.2. CSR Listing
+Table 4-Table 8 list the CSRs that +have currently been allocated CSR addresses. The timers, counters, and +floating-point CSRs are standard unprivileged CSRs. The other registers +are used by privileged code, as described in the following chapters. +Note that not all registers are required on all implementations.
Table 3. Allocation of RISC-V CSR address ranges.

csr[11:10] | csr[9:8] | csr[7:4] | Hex | Use and Accessibility |
---|---|---|---|---|
Unprivileged and User-Level CSRs | | | | |
00 | 00 | XXXX | 0x000-0x0FF | Standard read/write |
01 | 00 | XXXX | 0x400-0x4FF | Standard read/write |
10 | 00 | XXXX | 0x800-0x8FF | Custom read/write |
11 | 00 | 0XXX | 0xC00-0xC7F | Standard read-only |
11 | 00 | 10XX | 0xC80-0xCBF | Standard read-only |
11 | 00 | 11XX | 0xCC0-0xCFF | Custom read-only |
Supervisor-Level CSRs | | | | |
00 | 01 | XXXX | 0x100-0x1FF | Standard read/write |
01 | 01 | 0XXX | 0x500-0x57F | Standard read/write |
01 | 01 | 10XX | 0x580-0x5BF | Standard read/write |
01 | 01 | 11XX | 0x5C0-0x5FF | Custom read/write |
10 | 01 | 0XXX | 0x900-0x97F | Standard read/write |
10 | 01 | 10XX | 0x980-0x9BF | Standard read/write |
10 | 01 | 11XX | 0x9C0-0x9FF | Custom read/write |
11 | 01 | 0XXX | 0xD00-0xD7F | Standard read-only |
11 | 01 | 10XX | 0xD80-0xDBF | Standard read-only |
11 | 01 | 11XX | 0xDC0-0xDFF | Custom read-only |
Hypervisor and VS CSRs | | | | |
00 | 10 | XXXX | 0x200-0x2FF | Standard read/write |
01 | 10 | 0XXX | 0x600-0x67F | Standard read/write |
01 | 10 | 10XX | 0x680-0x6BF | Standard read/write |
01 | 10 | 11XX | 0x6C0-0x6FF | Custom read/write |
10 | 10 | 0XXX | 0xA00-0xA7F | Standard read/write |
10 | 10 | 10XX | 0xA80-0xABF | Standard read/write |
10 | 10 | 11XX | 0xAC0-0xAFF | Custom read/write |
11 | 10 | 0XXX | 0xE00-0xE7F | Standard read-only |
11 | 10 | 10XX | 0xE80-0xEBF | Standard read-only |
11 | 10 | 11XX | 0xEC0-0xEFF | Custom read-only |
Machine-Level CSRs | | | | |
00 | 11 | XXXX | 0x300-0x3FF | Standard read/write |
01 | 11 | 0XXX | 0x700-0x77F | Standard read/write |
01 | 11 | 100X | 0x780-0x79F | Standard read/write |
01 | 11 | 1010 | 0x7A0-0x7AF | Standard read/write debug CSRs |
01 | 11 | 1011 | 0x7B0-0x7BF | Debug-mode-only CSRs |
01 | 11 | 11XX | 0x7C0-0x7FF | Custom read/write |
10 | 11 | 0XXX | 0xB00-0xB7F | Standard read/write |
10 | 11 | 10XX | 0xB80-0xBBF | Standard read/write |
10 | 11 | 11XX | 0xBC0-0xBFF | Custom read/write |
11 | 11 | 0XXX | 0xF00-0xF7F | Standard read-only |
11 | 11 | 10XX | 0xF80-0xFBF | Standard read-only |
11 | 11 | 11XX | 0xFC0-0xFFF | Custom read-only |
Table 4. Currently allocated RISC-V unprivileged CSR addresses.

Number | Privilege | Name | Description |
---|---|---|---|
Unprivileged Floating-Point CSRs | | | |
0x001 | URW | fflags | Floating-Point Accrued Exceptions. |
Unprivileged Zicfiss extension CSR | | | |
0x011 | URW | ssp | Shadow Stack Pointer. |
Unprivileged Counter/Timers | | | |
0xC00 | URO | cycle | Cycle counter for RDCYCLE instruction. |
Table 5. Currently allocated RISC-V supervisor-level CSR addresses.

Number | Privilege | Name | Description |
---|---|---|---|
Supervisor Trap Setup | | | |
0x100 | SRW | sstatus | Supervisor status register. |
Supervisor Configuration | | | |
0x10A | SRW | senvcfg | Supervisor environment configuration register. |
Supervisor Counter Setup | | | |
0x120 | SRW | scountinhibit | Supervisor counter-inhibit register. |
Supervisor Trap Handling | | | |
0x140 | SRW | sscratch | Scratch register for supervisor trap handlers. |
Supervisor Protection and Translation | | | |
0x180 | SRW | satp | Supervisor address translation and protection. |
Debug/Trace Registers | | | |
0x5A8 | SRW | scontext | Supervisor-mode context register. |
Supervisor State Enable Registers | | | |
0x10C | SRW | sstateen0 | Supervisor State Enable 0 Register. |
Table 6. Currently allocated RISC-V hypervisor and VS CSR addresses.

Number | Privilege | Name | Description |
---|---|---|---|
Hypervisor Trap Setup | | | |
0x600 | HRW | hstatus | Hypervisor status register. |
Hypervisor Trap Handling | | | |
0x643 | HRW | htval | Hypervisor bad guest physical address. |
Hypervisor Configuration | | | |
0x60A | HRW | henvcfg | Hypervisor environment configuration register. |
Hypervisor Protection and Translation | | | |
0x680 | HRW | hgatp | Hypervisor guest address translation and protection. |
Debug/Trace Registers | | | |
0x6A8 | HRW | hcontext | Hypervisor-mode context register. |
Hypervisor Counter/Timer Virtualization Registers | | | |
0x605 | HRW | htimedelta | Delta for VS/VU-mode timer. |
Hypervisor State Enable Registers | | | |
0x60C | HRW | hstateen0 | Hypervisor State Enable 0 Register. |
Virtual Supervisor Registers | | | |
0x200 | HRW | vsstatus | Virtual supervisor status register. |
Table 7. Currently allocated RISC-V machine-level CSR addresses.

Number | Privilege | Name | Description |
---|---|---|---|
Machine Information Registers | | | |
0xF11 | MRO | mvendorid | Vendor ID. |
Machine Trap Setup | | | |
0x300 | MRW | mstatus | Machine status register. |
Machine Trap Handling | | | |
0x340 | MRW | mscratch | Scratch register for machine trap handlers. |
Machine Configuration | | | |
0x30A | MRW | menvcfg | Machine environment configuration register. |
Machine Memory Protection | | | |
0x3A0 | MRW | pmpcfg0 | Physical memory protection configuration. |
Machine State Enable Registers | | | |
0x30C | MRW | mstateen0 | Machine State Enable 0 Register. |
Table 8. Currently allocated RISC-V machine-level CSR addresses (continued).

Number | Privilege | Name | Description |
---|---|---|---|
Machine Non-Maskable Interrupt Handling | | | |
0x740 | MRW | mnscratch | Resumable NMI scratch register. |
Machine Counter/Timers | | | |
0xB00 | MRW | mcycle | Machine cycle counter. |
Machine Counter Setup | | | |
0x320 | MRW | mcountinhibit | Machine counter-inhibit register. |
Debug/Trace Registers (shared with Debug Mode) | | | |
0x7A0 | MRW | tselect | Debug/Trace trigger register select. |
Debug Mode Registers | | | |
0x7B0 | DRW | dcsr | Debug control and status register. |
2.3. CSR Field Specifications
+The following definitions and abbreviations are used in specifying the +behavior of fields within the CSRs.
+2.3.1. Reserved Writes Preserve Values, Reads Ignore Values (WPRI)
+Some whole read/write fields are reserved for future use. Software +should ignore the values read from these fields, and should preserve the +values held in these fields when writing values to other fields of the +same register. For forward compatibility, implementations that do not +furnish these fields must make them read-only zero. These fields are +labeled WPRI in the register descriptions.
> To simplify the software model, any backward-compatible future definition of previously reserved fields within a CSR must cope with the possibility that a non-atomic read/modify/write sequence is used to update other fields in the CSR. Alternatively, the original CSR definition must specify that subfields can only be updated atomically, which may require a two-instruction clear bit/set bit sequence in general that can be problematic if intermediate values are not legal.
2.3.2. Write/Read Only Legal Values (WLRL)
+Some read/write CSR fields specify behavior for only a subset of +possible bit encodings, with other bit encodings reserved. Software +should not write anything other than legal values to such a field, and +should not assume a read will return a legal value unless the last write +was of a legal value, or the register has not been written since another +operation (e.g., reset) set the register to a legal value. These fields +are labeled WLRL in the register descriptions.
> Hardware implementations need only implement enough state bits to differentiate between the supported values, but must always return the complete specified bit-encoding of any supported value when read.
Implementations are permitted but not required to raise an +illegal-instruction exception if an instruction attempts to write a +non-supported value to a WLRL field. Implementations can return arbitrary +bit patterns on the read of a WLRL field when the last write was of an +illegal value, but the value returned should deterministically depend on +the illegal written value and the value of the field prior to the write.
+2.3.3. Write Any Values, Reads Legal Values (WARL)
+Some read/write CSR fields are only defined for a subset of bit +encodings, but allow any value to be written while guaranteeing to +return a legal value whenever read. Assuming that writing the CSR has no +other side effects, the range of supported values can be determined by +attempting to write a desired setting then reading to see if the value +was retained. These fields are labeled WARL in the register descriptions.
+Implementations will not raise an exception on writes of unsupported +values to a WARL field. Implementations can return any legal value on the +read of a WARL field when the last write was of an illegal value, but the +legal value returned should deterministically depend on the illegal +written value and the architectural state of the hart.
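As a concrete illustration of this discovery method, the hedged sketch below probes whether a given satp.MODE encoding is supported on an RV64 hart by writing it, reading it back, and restoring the original value (the function name is illustrative; in practice such a probe is done early in S-mode boot, before address translation is relied upon):

```c
#include <stdint.h>

static int satp_mode_supported(uint64_t mode)
{
    uint64_t saved, probe = mode << 60, readback;       /* MODE is satp[63:60] */

    asm volatile("csrr %0, satp" : "=r"(saved));
    asm volatile("csrw satp, %0" :: "r"(probe) : "memory");
    asm volatile("csrr %0, satp" : "=r"(readback));
    asm volatile("csrw satp, %0" :: "r"(saved) : "memory");  /* restore original */

    return ((readback >> 60) & 0xf) == mode;            /* retained => supported */
}
```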
+2.4. CSR Field Modulation
+If a write to one CSR changes the set of legal values allowed for a
+field of a second CSR, then unless specified otherwise, the second CSR’s
+field immediately gets an UNSPECIFIED
value from among its new legal values. This
+is true even if the field’s value before the write remains legal after
+the write; the value of the field may be changed in consequence of the
+write to the controlling CSR.
> As a special case of this rule, the value written to one CSR may control whether a field of a second CSR is writable (with multiple legal values) or is read-only. When a write to the controlling CSR causes the second CSR’s field to change from previously read-only to now writable, that field immediately gets an UNSPECIFIED value from among its new legal values.
>
> Some CSR fields are, when writable, defined as aliases of other CSR fields. Let x be such a CSR field, and let y be the CSR field it aliases when writable. If a write to a controlling CSR causes field x to change from previously read-only to now writable, the new value of x is not UNSPECIFIED but instead immediately reflects the value of its alias y.
A change to the value of a CSR for this reason is not a write to the +affected CSR and thus does not trigger any side effects specified for +that CSR.
+2.5. Implicit Reads of CSRs
Implementations sometimes perform implicit reads of CSRs. (For example, all S-mode instruction fetches implicitly read the satp CSR.) Unless otherwise specified, the value returned by an implicit read of a CSR is the same value that would have been returned by an explicit read of the CSR, using a CSR-access instruction in a sufficient privilege mode.
2.6. CSR Width Modulation
+If the width of a CSR is changed (for example, by changing SXLEN or +UXLEN, as described in Section 3.1.6.3), the +values of the writable fields and bits of the new-width CSR are, +unless specified otherwise, determined from the previous-width CSR as +though by this algorithm:
+-
+
-
+
The value of the previous-width CSR is copied to a temporary register +of the same width.
+
+ -
+
For the read-only bits of the previous-width CSR, the bits at the same +positions in the temporary register are set to zeros.
+
+ -
+
The width of the temporary register is changed to the new width. If +the new width W is narrower than the previous width, the +least-significant W bits of the temporary register are +retained and the more-significant bits are discarded. If the new width +is wider than the previous width, the temporary register is +zero-extended to the wider width.
+
+ -
+
Each writable field of the new-width CSR takes the value of the bits +at the same positions in the temporary register.
+
+
Changing the width of a CSR is not a read or write of the CSR and thus +does not trigger any side effects.
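A short C model of this width-change algorithm, for the narrowing case (the mask argument marking read-only bit positions is an assumption of this sketch, not an architectural interface):

```c
#include <stdint.h>

static uint32_t csr_narrow_64_to_32(uint64_t prev_value, uint64_t readonly_mask)
{
    uint64_t tmp = prev_value;    /* step 1: copy the previous-width value     */
    tmp &= ~readonly_mask;        /* step 2: zero the bits that were read-only */
    return (uint32_t)tmp;         /* steps 3-4: keep the least-significant bits */
}
```

Widening works the same way, except that step 3 zero-extends the temporary register to the wider width.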
+2.7. Explicit Accesses to CSRs Wider than XLEN
+If a standard CSR is wider than XLEN bits, then an explicit read +of the CSR returns the register’s least-significant XLEN bits, +and an explicit write to the CSR modifies only the register’s +least-significant XLEN bits, leaving the upper bits unchanged.
Some standard CSRs, such as the counter CSRs of extension Zicntr, are always 64 bits, even when XLEN=32 (RV32). For each such 64-bit CSR (for example, counter time), a corresponding 32-bit high-half CSR is usually defined with the same name but with the letter ‘h’ appended at the end (timeh). The high-half CSR aliases bits 63:32 of its namesake 64-bit CSR, thus providing a way for RV32 software to read and modify the otherwise-unreachable 32 bits.

Standard high-half CSRs are accessible only when the base RISC-V instruction set is RV32 (XLEN=32). For RV64 (when XLEN=64), the addresses of all standard high-half CSRs are reserved, so an attempt to access a high-half CSR typically raises an illegal-instruction exception.
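Although CV64A6_MMU is an RV64 core, the RV32 idiom implied by this high-half scheme is worth a brief sketch: the function below reads the 64-bit time counter through the time/timeh pair, re-reading the high half to guard against a carry between the two accesses.

```c
#include <stdint.h>

static uint64_t rv32_read_time64(void)
{
    uint32_t hi, lo, hi2;
    do {
        asm volatile("csrr %0, timeh" : "=r"(hi));
        asm volatile("csrr %0, time"  : "=r"(lo));
        asm volatile("csrr %0, timeh" : "=r"(hi2));
    } while (hi != hi2);             /* retry if time rolled over into timeh */
    return ((uint64_t)hi << 32) | lo;
}
```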
+3. Machine-Level ISA, Version 1.13
This chapter describes the machine-level operations available in machine-mode (M-mode), which is the highest privilege mode in a RISC-V hart. M-mode is used for low-level access to a hardware platform and is the first mode entered at reset. M-mode can also be used to implement features that are too difficult or expensive to implement in hardware directly. The RISC-V machine-level ISA contains a common core that is extended depending on which other privilege levels are supported and other details of the hardware implementation.

3.1. Machine-Level CSRs

3.1.1. Machine ISA (misa) Register

The misa CSR is a WARL read-write register reporting the ISA supported by the hart.

[CVA6] The MXL (Machine XLEN) field encodes the native base integer ISA width as shown in Table 9. The MXL field is read-only. In CVA6, the MXL field of the misa register indicates the effective XLEN in M-mode, a constant termed MXLEN.
Table 9. Encoding of the MXL field in misa.

MXL | XLEN |
---|---|
1 | 32 |
2 | 64 |
3 | 128 |

The misa CSR is MXLEN bits wide.

[CVA6] The Extensions field encodes the presence of the standard extensions, with a single bit per letter of the alphabet (bit 0 encodes presence of extension "A", bit 1 encodes presence of extension "B", through to bit 25 which encodes "Z"). The "I" bit will be set for RV32I, RV64I, and RV128I base ISAs, and the "E" bit will be set for RV32E and RV64E. In CVA6, the Extensions field is not writable; the set of standard extensions reflects the hardware reset value and cannot be modified by writing to the register.
Bit | Character | Description |
---|---|---|
0 | A | Atomic extension |

The "U" and "S" bits will be set if there is support for user and supervisor modes respectively.

The "X" bit will be set if there are any non-standard extensions.

When the "B" bit is 1, the implementation supports the instructions provided by the Zba, Zbb, and Zbs extensions. When the "B" bit is 0, the implementation may not support one or more of the Zba, Zbb, or Zbs extensions.
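A small C sketch of decoding the Extensions field as described above (bit 0 is "A", bit 1 is "B", and so on through bit 25 for "Z"):

```c
static int misa_has_extension(char ext)      /* ext in 'A'..'Z' */
{
    unsigned long misa_val;
    asm volatile("csrr %0, misa" : "=r"(misa_val));
    return (misa_val >> (ext - 'A')) & 1;
}
```

For example, misa_has_extension('S') reports whether supervisor mode is present, and misa_has_extension('C') whether the compressed-instruction extension is implemented.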
+3.1.2. Machine Vendor ID (mvendorid
) Register
+[CVA6] The mvendorid
CSR is a 32-bit read-only register providing the JEDEC
+manufacturer ID of the provider of the core.
+In CVA6, mvendorid
is implemented and returns the commercial implementation
+id supplied to OpenHW Group organization, 0x602.
mvendorid
)3.1.3. Machine Architecture ID (marchid
) Register
+[CVA6] The marchid
CSR is an MXLEN-bit read-only register encoding the base
+microarchitecture of the hart.
+In CVA6, marchid
is implemented and returns the base microarchitecture
+of the hart supplied to CVA6, 0x3.
marchid
) register3.1.4. Machine Implementation ID (mimpid
) Register
+The mimpid
CSR provides a unique encoding of the version of the
+processor implementation.
[CVA6] The mimpid
register is implemented and the return value is TODO.
+The Implementation value should reflect the design of the RISC-V
+processor itself and not any surrounding system.
mimpid
) register3.1.5. Hart ID (mhartid
) Register
+[CV64A6_MMU] The mhartid
CSR is an MXLEN-bit read-only register containing the
+integer ID of the hardware thread running the code. This register is
+readable. In CV64A6_MMU-based system, only one hart is implemented.
+Hart ID is zero.
mhartid
) register3.1.6. Machine Status (mstatus
) Register
+[CV64A6_MMU] The mstatus
register is an MXLEN-bit read/write register formatted as
+shown in Figure 7. The mstatus
register
+keeps track of and controls the hart’s current operating state.
mstatus
) register for RV643.1.6.1. Privilege and Global Interrupt-Enable Stack in mstatus
register
+Global interrupt-enable bits, MIE and SIE, are provided for M-mode and +S-mode respectively. These bits are primarily used to guarantee +atomicity with respect to interrupt handlers in the current privilege +mode.
+When a hart is executing in privilege mode x, interrupts are globally +enabled when xIE=1 and globally disabled when xIE=0. Interrupts for +lower-privilege modes, w<x, are always globally +disabled regardless of the setting of any global wIE bit for the +lower-privilege mode. Interrupts for higher-privilege modes, +y>x, are always globally enabled regardless of the +setting of the global yIE bit for the higher-privilege mode. +Higher-privilege-level code can use separate per-interrupt enable bits +to disable selected higher-privilege-mode interrupts before ceding +control to a lower-privilege mode.
+TODO
+An MRET or SRET instruction is used to return from a trap in M-mode or +S-mode respectively. When executing an xRET instruction, supposing +xPP holds the value y, xIE is set to xPIE; the privilege mode is +changed to y; xPIE is set to 1; and xPP is set to the +least-privileged supported mode (U if U-mode is implemented, else M). If +y≠M, xRET also sets MPRV=0.
+xPP fields are WARL fields that can hold only privilege mode x and any implemented privilege mode lower than x. If privilege mode x is not implemented, then xPP must be read-only 0.
+3.1.6.2. Double Trap Control in mstatus
Register
+[CV64A6_MMU] As Double Trap Control (Smdbltrp extension) is not implemented, +MDT field is read-only 0.
+3.1.6.3. Base ISA Control in mstatus
Register
+[CV64A6_MMU] The SXL and UXL fields are read-only fields that encode the
+value of XLEN for S-mode and U-mode, respectively. The encoding of these
+fields is the same as the MXL field of misa
, shown in Table 9.
+The effective XLEN in S-mode and U-mode are termed SXLEN and UXLEN, respectively.
+Their values are set to UXLEN=SXLEN=MXLEN.
3.1.6.4. Memory Privilege in mstatus
Register
+The MPRV (Modify PRiVilege) bit modifies the effective privilege mode, +i.e., the privilege level at which loads and stores execute. When +MPRV=0, loads and stores behave as normal, using the translation and +protection mechanisms of the current privilege mode. When MPRV=1, load +and store memory addresses are translated and protected, and endianness +is applied, as though the current privilege mode were set to MPP. +Instruction address-translation and protection are unaffected by the +setting of MPRV.
+An MRET or SRET instruction that changes the privilege mode to a mode +less privileged than M also sets MPRV=0.
+The MXR (Make eXecutable Readable) bit modifies the privilege with which +loads access virtual memory. When MXR=0, only loads from pages marked +readable (R=1 in [sv32pte]) will succeed. When +MXR=1, loads from pages marked either readable or executable (R=1 or +X=1) will succeed. MXR has no effect when page-based virtual memory is +not in effect.
+The SUM (permit Supervisor User Memory access) bit modifies the +privilege with which S-mode loads and stores access virtual memory. When +SUM=0, S-mode memory accesses to pages that are accessible by U-mode +(U=1 in [sv32pte]) will fault. When SUM=1, these +accesses are permitted. SUM has no effect when page-based virtual memory +is not in effect. Note that, while SUM is ordinarily ignored when not +executing in S-mode, it is in effect when MPRV=1 and MPP=S.
+The MXR and SUM mechanisms only affect the interpretation of permissions +encoded in page-table entries. In particular, they have no impact on +whether access-fault exceptions are raised due to PMAs or PMP.
+3.1.6.5. Endianness Control in mstatus
and mstatush
Registers
+The MBE, SBE, and UBE bits in mstatus
and mstatush
are WARL fields that
+control the endianness of memory accesses other than instruction
+fetches. Instruction fetches are always little-endian.
MBE controls whether non-instruction-fetch memory accesses made from
+M-mode (assuming mstatus
.MPRV=0) are little-endian (MBE=0) or
+big-endian (MBE=1).
SBE controls whether explicit load and store memory accesses made from S-mode are +little-endian (SBE=0) or big-endian (SBE=1).
+UBE controls whether explicit load and store memory accesses made from U-mode are +little-endian (UBE=0) or big-endian (UBE=1).
+It is always little-endian in M-Mode, the MBE is read-only zero.
+It is always little-endian in S-Mode, the SBE is read-only zero.
+It is always little-endian in U-Mode, the UBE is read-only zero.
+3.1.6.6. Virtualization Support in mstatus
Register
+The TVM (Trap Virtual Memory) bit is a WARL field that supports intercepting
+supervisor virtual-memory management operations. When TVM=1, attempts to
+read or write the satp
CSR or execute an SFENCE.VMA or SINVAL.VMA
+instruction while executing in S-mode will raise an illegal-instruction
+exception. When TVM=0, these operations are permitted in S-mode.
The TW (Timeout Wait) bit is a WARL field that supports intercepting the WFI +instruction (see Section 3.3.3). When TW=0, the WFI +instruction may execute in lower privilege modes when not prevented for +some other reason. When TW=1, then if WFI is executed in any +less-privileged mode, and it does not complete within an +implementation-specific, bounded time limit, the WFI instruction causes +an illegal-instruction exception. An implementation may have WFI always +raise an illegal-instruction exception in less-privileged modes when +TW=1, even if there are pending globally-disabled interrupts when the +instruction is executed.
+The TSR (Trap SRET) bit is a WARL field that supports intercepting the +supervisor exception return instruction, SRET. When TSR=1, attempts to +execute SRET while executing in S-mode will raise an illegal-instruction +exception. When TSR=0, this operation is permitted in S-mode.
+3.1.6.7. Extension Context Status in mstatus
Register
+Supporting substantial extensions is one of the primary goals of RISC-V, +and hence we define a standard interface to allow unchanged +privileged-mode code, particularly a supervisor-level OS, to support +arbitrary user-mode state extensions.
+[CV64A6_MMU] The FS[1:0] and VS[1:0] WARL fields and the XS[1:0] read-only field are used +to reduce the cost of context save and restore by setting and tracking +the current state of the floating-point unit and any other user-mode +extensions respectively.
+As the F extension is not implemented, then +FS is read-only zero.
+As the v
registers is not implemented, then
+VS is read-only zero.
As no additional user extensions require new state, the +XS field is read-only zero. TODO
+[CV64A6_MMU] The SD bit is a read-only bit that summarizes whether either the FS, VS, +or XS fields signal the presence of some dirty state that will require +saving extended user context to memory.
+[CV64A6_MMU] As FS, XS, and VS are all read-only zero, SD is also always +zero.
+[CV64A6_MMU] When an extension’s status is set to Off, any instruction that attempts +to read or write the corresponding state will cause an +illegal-instruction exception.
+3.1.6.8. Previous Expected Landing Pad (ELP) State in mstatus
Register
+[CV64A6_MMU] As the Zicfilp extension is not supported,
+the SPELP
and MPELP
fields are read-only zero.
3.1.7. Machine Trap-Vector Base-Address (mtvec
) Register
+The mtvec
register is an MXLEN-bit WARL read/write register that holds
+trap vector configuration, consisting of a vector base address (BASE)
+and a vector mode (MODE).
[CV64A6_MMU] The mtvec
register is writable. The value in the BASE field must
+always be aligned on a 4-byte boundary. mtvec
is always accessed in
+Mode=Direct.
Table 11. Encoding of the mtvec MODE field.

Value | Name | Description |
---|---|---|
0 | Direct | All traps set pc to BASE. |
The encoding of the MODE field is shown in
+Table 11. When MODE=Direct, all traps into
+machine mode cause the pc
to be set to the address in the BASE field.
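A minimal M-mode sketch of installing a direct-mode trap vector consistent with the constraints above (the handler symbol is an assumption of this sketch and must be 4-byte aligned):

```c
extern void machine_trap_handler(void);      /* assumed 4-byte-aligned handler */

static void install_mtvec_direct(void)
{
    unsigned long base = (unsigned long)&machine_trap_handler & ~0x3UL;
    asm volatile("csrw mtvec, %0" :: "r"(base));   /* MODE bits [1:0] = 0 (Direct) */
}
```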
3.1.8. Machine Trap Delegation (medeleg
and mideleg
) Registers
+By default, all traps at any privilege level are handled in machine
+mode, though a machine-mode handler can redirect traps back to the
+appropriate level with the MRET instruction
+(Section 3.3.2).
+The machine exception
+delegation register (medeleg
) is a 64-bit read/write register.
+The machine interrupt delegation (mideleg
) register is an MXLEN-bit
+read/write register.
+Setting a bit in medeleg
or mideleg
will delegate the
+corresponding trap, when occurring in S-mode or U-mode, to the S-mode
+trap handler.
medeleg
) register.medeleg
has a bit position allocated for every synchronous exception
+shown in Table 12, with the index of the
+bit position equal to the value returned in the mcause
register (i.e.,
+setting bit 8 allows user-mode environment calls to be delegated to a
+lower-privilege trap handler).
The medelegh
register does not exist when XLEN=64.
mideleg
) Register.mideleg
holds trap delegation bits for individual interrupts, with the
+layout of bits matching those in the mip
register (i.e., STIP
+interrupt delegation control is located in bit 5).
3.1.9. Machine Interrupt (mip
and mie
) Registers
+The mip
register is an MXLEN-bit read/write register containing
+information on pending interrupts, while mie
is the corresponding
+MXLEN-bit read/write register containing interrupt enable bits.
+Interrupt cause number i (as reported in CSR mcause
,
+Section 3.1.15) corresponds with bit i in both mip
and
+mie
. Bits 15:0 are allocated to standard interrupt causes only, while
+bits 16 and above are designated for platform use.
mip
) register.mie
) registerAn interrupt i will trap to M-mode (causing the privilege mode to
+change to M-mode) if all of the following are true: (a) either the
+current privilege mode is M and the MIE bit in the mstatus
register is
+set, or the current privilege mode has less privilege than M-mode;
+(b) bit i is set in both mip
and mie
; and (c) bit i is not set in mideleg
.
These conditions for an interrupt trap to occur must be evaluated in a
+bounded amount of time from when an interrupt becomes, or ceases to be,
+pending in mip
, and must also be evaluated immediately following the
+execution of an xRET instruction or an explicit write to a CSR on
+which these interrupt trap conditions expressly depend (including mip
,
+mie
, mstatus
, and mideleg
).
Interrupts to M-mode take priority over any interrupts to lower +privilege modes.
+[CV64A6_MMU] Each individual bit in register mip
is read-only. If interrupt i
+can become pending but bit i in mip
is read-only, the implementation
+must provide some other mechanism for clearing the pending interrupt.
[CV64A6_MMU] TODO: A bit in mie
must be writable if the corresponding interrupt can ever
+become pending. Bits of mie
that are not writable must be read-only
+zero.
[CV64A6_MMU] The standard portions (bits 15:0) of registers mip
and mie
are
+formatted as shown in Figure 13 and Figure 14 respectively.
mip
.mie
.Bits mip
.MEIP and mie
.MEIE are the interrupt-pending and
+interrupt-enable bits for machine-level external interrupts. MEIP is
+read-only in mip
, and is set and cleared by a platform-specific
+interrupt controller.
Bits mip
.MTIP and mie
.MTIE are the interrupt-pending and
+interrupt-enable bits for machine timer interrupts. MTIP is read-only in
+mip
, and is cleared by writing to the memory-mapped machine-mode timer
+compare register.
As the system has only one hart, mip
.MSIP and mie
.MSIE are
+read-only zeros.
Bits mip
.SEIP and mie
.SEIE are
+the interrupt-pending and interrupt-enable bits for supervisor-level
+external interrupts. SEIP is writable in mip
, and may be written by
+M-mode software to indicate to S-mode that an external interrupt is
+pending.
Bits mip
.STIP and mie
.STIE are
+the interrupt-pending and interrupt-enable bits for supervisor-level
+timer interrupts. STIP is writable in mip
, and may be written by
+M-mode software to deliver timer interrupts to S-mode.
Bits mip
.SSIP and mie
.SSIE are
+the interrupt-pending and interrupt-enable bits for supervisor-level
+software interrupts. SSIP is writable in mip
and may also be set to 1
+by a platform-specific interrupt controller.
As the Sscofpmf extension is not implemented, mip
.LCOFIP and mie
.LCOFIE are read-only zeros.
Multiple simultaneous interrupts destined for M-mode are handled in the +following decreasing priority order: MEI, MSI, MTI, SEI, SSI, STI.
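As a sketch of how software satisfies conditions (a)-(c) above for a machine timer interrupt: MTIE (mie bit 7) provides the individual enable, mstatus.MIE (bit 3) the global enable in M-mode, and machine-level interrupts cannot be delegated, so condition (c) holds. The inline-assembly helpers are toolchain assumptions, not part of this specification.

```c
#include <stdint.h>

#define MIE_MTIE     (1ULL << 7)   /* machine timer interrupt enable */
#define MSTATUS_MIE  (1ULL << 3)   /* global M-mode interrupt enable */

static inline void mie_set(uint64_t mask)     { __asm__ volatile ("csrs mie, %0"     :: "r"(mask)); }
static inline void mstatus_set(uint64_t mask) { __asm__ volatile ("csrs mstatus, %0" :: "r"(mask)); }

void enable_machine_timer_irq(void) {
    mie_set(MIE_MTIE);          /* condition (b): bit set in mie (MTIP follows the timer)   */
    mstatus_set(MSTATUS_MIE);   /* condition (a): global enable while running in M-mode     */
    /* condition (c): mideleg bit 7 is zero, machine-level interrupts are never delegated.  */
}
```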
+3.1.10. Hardware Performance Monitor
+M-mode includes a basic hardware performance-monitoring facility. The
+mcycle
CSR counts the number of clock cycles executed by the processor
+core on which the hart is running. The minstret
CSR counts the number
+of instructions the hart has retired. The mcycle
and minstret
+registers have 64-bit precision on all RV32 and RV64 harts.
The counter registers have an arbitrary value after the hart is reset,
+and can be written with a given value. Any CSR write takes effect after
+the writing instruction has otherwise completed. The mcycle
CSR may be
+shared between harts on the same core, in which case writes to mcycle
+will be visible to those harts. The platform should provide a mechanism
+to indicate which harts share an mcycle
CSR.
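For example, M-mode software can sample the two counters around a region of interest; a minimal sketch assuming GCC inline assembly (the function names are illustrative):

```c
#include <stdint.h>

static inline uint64_t read_mcycle(void)   { uint64_t v; __asm__ volatile ("csrr %0, mcycle"   : "=r"(v)); return v; }
static inline uint64_t read_minstret(void) { uint64_t v; __asm__ volatile ("csrr %0, minstret" : "=r"(v)); return v; }

uint64_t elapsed_cycles, retired_instructions;

void profile_region(void (*region)(void)) {
    uint64_t c0 = read_mcycle(), i0 = read_minstret();
    region();                                    /* code being measured */
    elapsed_cycles       = read_mcycle()   - c0;
    retired_instructions = read_minstret() - i0;
}
```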
[CV64A6_MMU] The hardware performance monitor includes 29 additional 64-bit event
+counters, mhpmcounter3
-mhpmcounter31
. The event selector CSRs,
+mhpmevent3
-mhpmevent31
, are 64-bit WARL registers that control which
+event causes the corresponding counter to increment. The meaning of
+these events is defined by the platform, but event 0 is defined to mean
+"no event." In CV64A6_MMU all counters are implemented, but both the counter and its corresponding event
+selector are read-only 0.
The mhpmcounters
are WARL registers that support up to 64 bits of
+precision on RV32 and RV64.
As XLEN=64, mcycleh
, minstreth
, and mhpmcounternh
+do not exist.
3.1.11. Machine Counter-Enable (mcounteren
) Register
+The counter-enable mcounteren
register is a 32-bit register that
+controls the availability of the hardware performance-monitoring
+counters to the next-lower privileged mode.
The settings in this register only control accessibility. The act of reading or writing this register does not affect the underlying counters, which continue to increment even when not accessible.
+When the CY, TM, IR, or HPMn bit in the mcounteren
register is
+clear, attempts to read the cycle
, time
, instret
, or
+hpmcountern
register while executing in S-mode or U-mode will cause an
+illegal-instruction exception. When one of these bits is set, access to
+the corresponding register is permitted in the next implemented
+privilege mode (S-mode if implemented, otherwise U-mode).
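A minimal sketch of the accessibility control just described, assuming the standard bit layout (CY in bit 0, TM in bit 1, IR in bit 2, HPMn in bits 31:3): M-mode firmware grants S-mode access to the base counters while leaving the hpmcountern registers inaccessible.

```c
#include <stdint.h>

#define MCOUNTEREN_CY  (1u << 0)   /* cycle   */
#define MCOUNTEREN_TM  (1u << 1)   /* time    */
#define MCOUNTEREN_IR  (1u << 2)   /* instret */

void expose_base_counters_to_smode(void) {
    uint32_t en = MCOUNTEREN_CY | MCOUNTEREN_TM | MCOUNTEREN_IR;   /* HPM bits stay clear */
    __asm__ volatile ("csrw mcounteren, %0" :: "r"((uint64_t)en));
}
```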
3.1.12. Machine Counter-Inhibit (mcountinhibit
) Register
[CV64A6_MMU] The mcountinhibit register is not implemented; the implementation behaves as though the register were hardwired to zero.
3.1.13. Machine Scratch (mscratch
) Register
+The mscratch
register is an MXLEN-bit read/write register dedicated
+for use by machine mode. Typically, it is used to hold a pointer to a
+machine-mode hart-local context space and swapped with a user register
+upon entry to an M-mode trap handler.
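As an illustration of that usage, M-mode boot code might stash a pointer to a per-hart context block in mscratch so that the trap handler can recover it later with a csrrw swap; the hart_context structure below is hypothetical.

```c
#include <stdint.h>

struct hart_context { uint64_t saved_regs[32]; };   /* hypothetical M-mode scratch area */
static struct hart_context m_ctx;

void init_mscratch(void) {
    __asm__ volatile ("csrw mscratch, %0" :: "r"((uint64_t)(uintptr_t)&m_ctx));
    /* A trap handler prologue would then typically start with:  csrrw sp, mscratch, sp  */
}
```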
3.1.14. Machine Exception Program Counter (mepc
) Register
+mepc
is an MXLEN-bit read/write register formatted as shown in
+Figure 19. The low bit of mepc
(mepc[0]
) is
+always zero.
mepc
is a WARL register that must be able to hold all valid virtual
+addresses. It need not be capable of holding all possible invalid
+addresses. Prior to writing mepc
, implementations may convert an
+invalid address into some other invalid address that mepc
is capable
+of holding.
When a trap is taken into M-mode, mepc
is written with the virtual
+address of the instruction that was interrupted or that encountered the
+exception. Otherwise, mepc
is never written by the implementation,
+though it may be explicitly written by software.
3.1.15. Machine Cause (mcause
) Register
+The mcause
register is an MXLEN-bit read-write register formatted as
+shown in Figure 20. When a trap is taken into
+M-mode, mcause
is written with a code indicating the event that
+caused the trap. Otherwise, mcause
is never written by the
+implementation, though it may be explicitly written by software.
The Interrupt bit in the mcause
register is set if the trap was caused
+by an interrupt. The Exception Code field contains a code identifying
+the last exception or interrupt. Table 12 lists
+the possible machine-level exception codes. The Exception Code is a
+WLRL field, so is only guaranteed to hold supported exception codes.
Note that load and load-reserved instructions generate load exceptions, whereas store, store-conditional, and AMO instructions generate store/AMO exceptions.
+[CV64A6_MMU] Note that load and load-reserved instructions generate load exceptions, +whereas store and store-conditional instructions generate +store exceptions.
+[CVA6] If an instruction may raise multiple synchronous exceptions, the
+decreasing priority order of
+Table 13 indicates which
+exception is taken and reported in mcause
. The priority of any custom
+synchronous exceptions is implementation-defined. TODO
| Interrupt | Exception Code | Description |
|---|---|---|
| 1 | 0 | Reserved |
| 1 | 4 | Reserved |
| 1 | 8 | Reserved |
| 1 | 12 | Reserved |
| 0 | 0 | Instruction address misaligned |
| Priority | Exc. Code | Description |
|---|---|---|
| Highest | 3 | Instruction address breakpoint |
| | 12, 1 | During instruction address translation: |
| | 1 | With physical address for instruction: |
| | 2 | Illegal instruction |
| | 4, 6 | Optionally: |
| | 13, 15, 5, 7 | During address translation for an explicit memory access: |
| | 5, 7 | With physical address for an explicit memory access: |
| Lowest | 4, 6 | If not higher priority: |
[CV64A6_MMU] Load/store address-misaligned exceptions may have either higher or +lower priority than load/store access-fault +exceptions. TODO
+3.1.16. Machine Trap Value (mtval
) Register
+[CV64A6_MMU] The mtval
register is an MXLEN-bit read-write register
+holding constant value zero.
3.1.17. Machine Configuration Pointer (mconfigptr
) Register
+The mconfigptr
register is an MXLEN-bit read-only CSR that holds the physical
+address of a configuration data structure.
[CV64A6_MMU] The mconfigptr
register is implemented, but it is read-only 0 to indicate the
+configuration data structure does not exist.
3.1.18. Machine Environment Configuration (menvcfg
) Register
+The menvcfg
CSR is a 64-bit read/write register, formatted
+as shown in Figure 21, that controls
+certain characteristics of the execution environment for modes less
+privileged than M.
If bit FIOM (Fence of I/O implies Memory) is set to one in menvcfg
,
+FENCE instructions executed in modes less privileged than M are modified
+so the requirement to order accesses to device I/O implies also the
+requirement to order main memory accesses. Table 14
+details the modified interpretation of FENCE instruction bits PI, PO,
+SI, and SO for modes less privileged than M when FIOM=1.
Similarly, for modes less privileged than M when FIOM=1, if an atomic +instruction that accesses a region ordered as device I/O has its aq +and/or rl bit set, then that instruction is ordered as though it +accesses both device I/O and memory.
+If S-mode is not supported, or if satp
.MODE is read-only zero (always
+Bare), the implementation may make FIOM read-only zero.
| Instruction bit | Meaning when set |
|---|---|
| PI | Predecessor device input and memory reads (PR implied) |
| SI | Successor device input and memory reads (SR implied) |
The PBMTE bit controls whether the Svpbmt extension is available for use
+in S-mode and G-stage address translation (i.e., for page tables pointed
+to by satp
or hgatp
).
[CV64A6_MMU] As Svpbmt is not implemented, PBMTE is always 0.
The ADUE bit controls whether hardware updating of PTE A/D bits is enabled for S-mode and G-stage address translations.
[CV64A6_MMU] As Svadu is not implemented, ADUE is always 0.
The CDE (Counter Delegation Enable) bit controls whether Zicntr and Zihpm counters can be delegated to S-mode.
[CV64A6_MMU] As Smcdeleg is not implemented, CDE is always 0.
The definition of the STCE field is furnished by the Sstc extension.
[CV64A6_MMU] As Sstc is not implemented, STCE is always 0.
The definition of the CBZE field is furnished by the Zicboz extension.
[CV64A6_MMU] As Zicboz is not implemented, CBZE is always 0.
The definitions of the CBCFE and CBIE fields are furnished by the Zicbom extension.
[CV64A6_MMU] As Zicbom is not implemented, the CBCFE and CBIE fields are always 0.
The definition of the PMM field will be furnished by the forthcoming Smnpm extension. Its allocation within menvcfg may change prior to the ratification of that extension.
[CV64A6_MMU] As Smnpm is not implemented, the PMM field is always 0.
[CV64A6_MMU] As Zicfilp is not implemented, the LPE field is always 0.
[CV64A6_MMU] As Zicfiss is not implemented, the SSE field is always 0.
+3.1.19. Machine Security Configuration (mseccfg
) Register
+mseccfg
is an optional 64-bit read/write register that controls security features.
As XLEN=64, register mseccfgh
does not exist.
[CV64A6_MMU] As the Zkr, Smepmp, and Smmpm extensions are not implemented,
+mseccfg
and mseccfgh
do not exist. TODO.
3.2. Machine-Level Memory-Mapped Registers
+3.2.1. Machine Timer (mtime
and mtimecmp
) Registers
+Platforms provide a real-time counter, exposed as a memory-mapped
+machine-mode read-write register, mtime
. mtime
must increment at
+constant frequency, and the platform must provide a mechanism for
+determining the period of an mtime
tick. The mtime
register will
+wrap around if the count overflows.
The mtime
register has 64-bit precision on all RV32 and RV64
+systems. Platforms provide a 64-bit memory-mapped machine-mode timer
+compare register (mtimecmp
). A machine timer interrupt becomes pending
+whenever mtime
contains a value greater than or equal to mtimecmp
,
+treating the values as unsigned integers. The interrupt remains posted
+until mtimecmp
becomes greater than mtime
(typically as a result of
+writing mtimecmp
). The interrupt will only be taken if interrupts are
+enabled and the MTIE bit is set in the mie
register.
Writes to mtime
and mtimecmp
are guaranteed to be reflected in MTIP
+eventually, but not necessarily immediately.
For RV64, naturally aligned 64-bit memory accesses to the mtime
and
+mtimecmp
registers are additionally supported and are atomic.
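A sketch of scheduling the next timer interrupt under the comparison rule above. The memory-mapped addresses are platform-specific; the CLINT-style addresses below are assumptions used only for illustration.

```c
#include <stdint.h>

/* Hypothetical CLINT-style layout; consult the platform memory map for the real addresses. */
#define MTIME_ADDR     0x0200BFF8UL
#define MTIMECMP_ADDR  0x02004000UL

void schedule_timer_irq(uint64_t delta_ticks) {
    volatile uint64_t *mtime    = (volatile uint64_t *)MTIME_ADDR;
    volatile uint64_t *mtimecmp = (volatile uint64_t *)MTIMECMP_ADDR;

    /* MTIP is pending while mtime >= mtimecmp (unsigned compare), so writing a value
       in the future both clears the current interrupt and arms the next one. */
    *mtimecmp = *mtime + delta_ticks;
}
```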
3.3. Machine-Mode Privileged Instructions
+3.3.1. Environment Call and Breakpoint
+The ECALL instruction is used to make a request to the supporting +execution environment. When executed in U-mode, S-mode, or M-mode, it +generates an environment-call-from-U-mode exception, +environment-call-from-S-mode exception, or environment-call-from-M-mode +exception, respectively, and performs no other operation.
+The EBREAK instruction is used by debuggers to cause control to be +transferred back to a debugging environment. +Unless overridden by an external debug environment, EBREAK raises +a breakpoint exception and performs no other operation.
+ECALL and EBREAK cause the receiving privilege mode’s epc
register to
+be set to the address of the ECALL or EBREAK instruction itself, not
+the address of the following instruction. As ECALL and EBREAK cause
+synchronous exceptions, they are not considered to retire, and should
+not increment the minstret
CSR.
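For example, U-mode or S-mode software raises the corresponding environment-call exception with a single ECALL. The use of a7/a0 for the request number and return value follows a common calling convention and is an assumption, not something mandated by this chapter.

```c
/* Minimal sketch of an environment call with one argument and one return value. */
static inline long environment_call(long number, long arg0) {
    register long a7 __asm__("a7") = number;   /* request number for the handler */
    register long a0 __asm__("a0") = arg0;     /* first argument / return value  */
    __asm__ volatile ("ecall" : "+r"(a0) : "r"(a7) : "memory");
    return a0;
}
```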
3.3.2. Trap-Return Instructions
+Instructions to return from trap are encoded under the PRIV minor +opcode.
+To return after handling a trap, there are separate trap return
+instructions per privilege level, MRET and SRET. MRET is always
+provided. SRET must be provided if supervisor mode is supported, and
+should raise an illegal-instruction exception otherwise. SRET should
+also raise an illegal-instruction exception when TSR=1 in mstatus
, as
+described in Section 3.1.6.6. An xRET instruction
+can be executed in privilege mode x or higher, where executing a
+lower-privilege xRET instruction will pop the relevant lower-privilege
+interrupt enable and privilege mode stack. In addition to manipulating
+the privilege stack as described in Section 3.1.6.1,
+xRET sets the pc
to the value stored in the xepc
register.
If the A extension is supported, the xRET instruction is allowed to +clear any outstanding LR address reservation but is not required to. +Trap handlers should explicitly clear the reservation if required (e.g., +by using a dummy SC) before executing the xRET.
+3.3.3. Wait for Interrupt
+The Wait for Interrupt instruction (WFI) informs the
+implementation that the current hart can be stalled until an interrupt
+might need servicing. Execution of the WFI instruction can also be used
+to inform the hardware platform that suitable interrupts should
+preferentially be routed to this hart. WFI is available in all
+privileged modes, and optionally available to U-mode. This instruction
+may raise an illegal-instruction exception when TW=1 in mstatus
, as
+described in Section 3.1.6.6.
If an enabled interrupt is present or later becomes present while the
+hart is stalled, the interrupt trap will be taken on the following
+instruction, i.e., execution resumes in the trap handler and mepc
=
+pc
+ 4.
Implementations are permitted to resume execution for any reason, even if an +enabled interrupt has not become pending. Hence, a legal implementation is to +simply implement the WFI instruction as a NOP.
+The WFI instruction can also be executed when interrupts are disabled.
+The operation of WFI must be unaffected by the global interrupt bits in
+mstatus
(MIE and SIE) and the delegation register mideleg
(i.e.,
+the hart must resume if a locally enabled interrupt becomes pending,
+even if it has been delegated to a less-privileged mode), but should
honor the individual interrupt enables (e.g., MTIE) (i.e.,
+implementations should avoid resuming the hart if the interrupt is
+pending but not individually enabled). WFI is also required to resume
+execution for locally enabled interrupts pending at any privilege level,
+regardless of the global interrupt enable at each privilege level.
If the event that causes the hart to resume execution does not cause an
+interrupt to be taken, execution will resume at pc
+ 4, and software
+must determine what action to take, including looping back to repeat the
+WFI if there was no actionable event.
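The resume-at-pc+4 behaviour leads to the usual idle-loop idiom sketched below; has_work() stands for any platform-specific check for an actionable event and is hypothetical.

```c
#include <stdbool.h>

extern bool has_work(void);   /* hypothetical check for an actionable event */

void idle_until_work(void) {
    /* WFI may resume for any reason (it may even behave as a NOP), so always re-check. */
    while (!has_work()) {
        __asm__ volatile ("wfi");
    }
}
```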
3.3.4. Custom SYSTEM Instructions
+The subspace of the SYSTEM major opcode shown in Figure 24 is designated for custom use. It is recommended that these instructions use bits 29:28 to designate the +minimum required privilege mode, as do other SYSTEM instructions.
+3.4. Reset
[CV64A6_MMU] Upon reset, a hart’s privilege mode is set to M. The mstatus fields MIE and MPRV are reset to 0.
As only little-endian memory accesses are supported, the mstatus field MBE is reset to 0.
The misa register is set as described in Section 3.1.1.
The pc is set to the reset vector 0x80000000. TODO
The mcause register is set to a value indicating the cause of the reset.
Writable PMP registers’ A and L fields are set to 0.
No WARL field contains an illegal value. All other hart state is UNSPECIFIED.
As CV64A6_MMU does not distinguish different reset conditions, the mcause register returns 0 after reset.
3.5. Non-Maskable Interrupts
+Non-maskable interrupts (NMIs) are only used for hardware error
+conditions, and cause an immediate jump to an implementation-defined NMI
+vector running in M-mode regardless of the state of a hart’s interrupt
+enable bits. The mepc
register is written with the virtual address of
+the instruction that was interrupted, and mcause
is set to a value
+indicating the source of the NMI. The NMI can thus overwrite state in an
+active machine-mode interrupt handler.
[CV64A6_MMU] Upon NMI, the high Interrupt bit of mcause
is set to indicate
+that this was an interrupt. As CV64A6_MMU does not distinguish sources
+of NMIs, the mcause
register returns 0 in the Exception Code.
Unlike resets, NMIs do not reset processor state, enabling diagnosis, +reporting, and possible containment of the hardware error.
+3.6. Physical Memory Attributes
+The physical memory map for a complete system includes various address +ranges, some corresponding to memory regions and some to memory-mapped +control registers, portions of which might not be accessible. Some +memory regions might not support reads, writes, or execution; some might +not support subword or subblock accesses; some might not support atomic +operations; and some might not support cache coherence or might have +different memory models. Similarly, memory-mapped control registers vary +in their supported access widths, support for atomic operations, and +whether read and write accesses have associated side effects. In RISC-V +systems, these properties and capabilities of each region of the +machine’s physical address space are termed physical memory attributes +(PMAs). This section describes RISC-V PMA terminology and how RISC-V +systems implement and check PMAs.
+[CV64A6_MMU] PMAs are inherent properties of the underlying hardware. The PMAs of +some memory regions are fixed at chip design time.
+[CV64A6_MMU] Some PMAs are dynamically +checked in hardware later in the execution pipeline after the physical +address is known, as some operations will not be supported at all +physical memory addresses, and some operations require knowing the +setting of a PMA attribute.
+[CV64A6_MMU] For RISC-V, we separate out specification and checking of PMAs into a +separate hardware structure, the PMA checker. In CV64A6_MMU, the +attributes are known at system design time for each physical address +region, and are hardwired into the PMA checker. +PMAs are checked for any access to physical memory, including accesses +that have undergone virtual to physical memory translation. To aid in +system debugging, we strongly recommend that, where possible, RISC-V +processors precisely trap physical memory accesses that fail PMA checks. +Precisely trapped PMA violations manifest as instruction, load, or store +access-fault exceptions, distinct from virtual-memory page-fault +exceptions. Precise PMA traps might not always be possible, for example, +when probing a legacy bus architecture that uses access failures as part +of the discovery mechanism. In this case, error responses from +peripheral devices will be reported as imprecise bus-error interrupts.
+[CV64A6_MMU] PMAs are not readable by software.
+3.6.1. Main Memory versus I/O Regions
+The most important characterization of a given memory address range is +whether it holds regular main memory or I/O devices. +Regular main memory is required to have a number of properties, +specified below, whereas I/O devices can have a much broader range of +attributes. Memory regions that do not fit into regular main memory, for +example, device scratchpad RAMs, are categorized as I/O regions.
++ + | ++What previous versions of this specification termed vacant regions are +no longer a distinct category; they are now described as I/O regions that are +not accessible (i.e. lacking read, write, and execute permissions). +Main memory regions that are not accessible are also allowed. + | +
3.6.2. Supported Access Type PMAs
+Access types specify which access widths, from 8-bit byte to long +multi-word burst, are supported, and also whether misaligned accesses +are supported for each access width.
+Main memory regions always support read and write of all access widths +required by the attached devices, and can specify whether instruction +fetch is supported.
+I/O regions can specify which combinations of read, write, or execute +accesses to which data widths are supported.
+For systems with page-based virtual memory, I/O and memory regions can +specify which combinations of hardware page-table reads and hardware +page-table writes are supported.
+3.6.3. Atomicity PMAs
+[CV64A6_MMU] Atomic extension is not implemented.
+3.6.3.1. AMO PMA
+[CV64A6_MMU] Atomic extension is not implemented.
+3.6.3.2. Reservability PMA
+[CV64A6_MMU] Atomic extension is not implemented.
+3.6.4. Misaligned Atomicity Granule PMA
+[CV64A6_MMU] Atomic extension is not implemented.
+3.6.5. Memory-Ordering PMAs
[CV64A6_MMU] As CV64A6_MMU is dedicated to a single-hart platform without any DMA, no memory-ordering mechanism is implemented.
+3.6.6. Coherence and Cacheability PMAs
+[CV64A6_MMU] Write accesses are not cached. No cache-coherence scheme +is implemented.
+If a PMA indicates non-cacheability, then accesses to that region must +be satisfied by the memory itself, not by any caches.
+3.6.7. Idempotency PMAs
+Idempotency PMAs describe whether reads and writes to an address region +are idempotent. Main memory regions are assumed to be idempotent. For +I/O regions, idempotency on reads and writes can be specified separately +(e.g., reads are idempotent but writes are not). If accesses are +non-idempotent, i.e., there is potentially a side effect on any read or +write access, then speculative or redundant accesses must be avoided.
+For the purposes of defining the idempotency PMAs, changes in observed +memory ordering created by redundant accesses are not considered a side +effect.
For non-idempotent regions, implicit reads and writes must not be performed early or speculatively, with the following exceptions. When a non-speculative implicit read is performed, an implementation is permitted to additionally read any of the bytes within a naturally aligned power-of-2 region containing the address of the non-speculative implicit read. Furthermore, when a non-speculative instruction fetch is performed, an implementation is permitted to additionally read any of the bytes within the next naturally aligned power-of-2 region of the same size (with the address of the region taken modulo 2^XLEN). The results of these additional reads may be used to satisfy subsequent early or speculative implicit reads. The size of these naturally aligned power-of-2 regions is implementation-defined, but, for systems with page-based virtual memory, must not exceed the smallest supported page size.
+3.7. Physical Memory Protection
+To support secure processing and contain faults, it is desirable to +limit the physical addresses accessible by software running on a hart. +An optional physical memory protection (PMP) unit provides per-hart +machine-mode control registers to allow physical memory access +privileges (read, write, execute) to be specified for each physical +memory region. The PMP values are checked in parallel with the PMA +checks described in Section 3.6.
The granularity of PMP access control settings is platform-specific, but the standard PMP encoding supports regions as small as four bytes. Certain regions’ privileges can be hardwired; for example, some regions might only ever be visible in machine mode but in no lower-privilege layers.
+PMP checks are applied to all accesses whose effective privilege mode is
+S or U, including instruction fetches and data accesses in S and U mode,
+and data accesses in M-mode when the MPRV bit in mstatus
is set and
+the MPP field in mstatus
contains S or U. PMP checks are also applied
+to page-table accesses for virtual-address translation, for which the
+effective privilege mode is S. Optionally, PMP checks may additionally
+apply to M-mode accesses, in which case the PMP registers themselves are
+locked, so that even M-mode software cannot change them until the hart
+is reset. In effect, PMP can grant permissions to S and U modes, which
+by default have none, and can revoke permissions from M-mode, which by
+default has full permissions.
PMP violations are always trapped precisely at the processor.
+3.7.1. Physical Memory Protection CSRs
+PMP entries are described by an 8-bit configuration register and one +MXLEN-bit address register. Some PMP settings additionally use the +address register associated with the preceding PMP entry. 16 PMP +entries are implemented. The lowest-numbered PMP entries must be +implemented first. All PMP CSR fields are WARL and 8 upper entries are +read-only zero. PMP CSRs are only accessible to M-mode.
+[CV64A6_MMU] The PMP configuration registers are densely packed into CSRs to minimize
context-switch time. For CV64A6_MMU with sixteen CSRs, pmpcfg0–pmpcfg3 hold the configurations as shown in Figure 25.
+The 2 upper entries are read-only zero.
[CV64A6_MMU] The PMP address registers are CSRs named pmpaddr0
-pmpaddr15
. Each
PMP address register encodes bits 55-2 of a 56-bit physical address for RV64, as shown in Figure 26. Not all
+physical address bits may be implemented, and so the pmpaddr
registers
+are WARL.
Figure 27 shows the layout of a PMP configuration +register. The R, W, and X bits, when set, indicate that the PMP entry +permits read, write, and instruction execution, respectively. When one +of these bits is clear, the corresponding access type is denied. The R, +W, and X fields form a collective WARL field for which the combinations with R=0 and W=1 are reserved. The remaining two fields, A and L, are described in the following sections.
+Attempting to fetch an instruction from a PMP region that does not have +execute permissions raises an instruction access-fault exception. +Attempting to execute a load or load-reserved instruction which accesses +a physical address within a PMP region without read permissions raises a +load access-fault exception. Attempting to execute a store, +store-conditional, or AMO instruction which accesses a physical address +within a PMP region without write permissions raises a store +access-fault exception.
+3.7.1.1. Address Matching
+The A field in a PMP entry’s configuration register encodes the +address-matching mode of the associated PMP address register. The +encoding of this field is shown in [pmpcfg-a].
+When A=0, this PMP entry is disabled and matches no addresses. Two other +address-matching modes are supported: naturally aligned power-of-2 +regions (NAPOT), including the special case of naturally aligned +four-byte regions (NA4); and the top boundary of an arbitrary range +(TOR). These modes support four-byte granularity.
+[CV64A6_MMU] Two address-matching modes are supported: disabled and TOR.
+If TOR is selected, the associated address register forms the top of the
+address range, and the preceding PMP address register forms the bottom
+of the address range. If PMP entry i's A field is set to
+TOR, the entry matches any address y such that pmpaddri-1
≤y<pmpaddri
(irrespective of the value of pmpcfgi-1
). If PMP entry 0’s A field is set to TOR, zero is used for the lower bound, and so it matches
+any address y<pmpaddr0
.
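A sketch of programming entry 0 as a TOR region covering addresses below top_phys_addr with read, write, and execute permission for S/U-mode. It assumes the standard pmpcfg field layout (R=bit 0, W=bit 1, X=bit 2, A=bits 4:3 with TOR=1) and the pmpaddr encoding of the physical address shifted right by two; the helper is illustrative only.

```c
#include <stdint.h>

#define PMPCFG_R      (1u << 0)
#define PMPCFG_W      (1u << 1)
#define PMPCFG_X      (1u << 2)
#define PMPCFG_A_TOR  (1u << 3)   /* A field (bits 4:3) = 1 selects TOR */

void pmp_allow_below(uint64_t top_phys_addr) {
    /* pmpaddr0 holds the physical address shifted right by 2. */
    __asm__ volatile ("csrw pmpaddr0, %0" :: "r"(top_phys_addr >> 2));

    /* Entry 0 with A=TOR uses zero as the lower bound, so it matches any y < top_phys_addr.
       Note: this full write clears the other configuration bytes of pmpcfg0; real code
       would read-modify-write to preserve them. */
    uint64_t cfg = PMPCFG_R | PMPCFG_W | PMPCFG_X | PMPCFG_A_TOR;
    __asm__ volatile ("csrw pmpcfg0, %0" :: "r"(cfg));
}
```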
[CV64A6_MMU] Although the PMP mechanism supports regions as small as four bytes, platforms may specify coarser PMP regions. In general, the PMP grain is 2^(G+2) bytes and must be the same across all PMP regions. When G ≥ 1 and pmpcfgi.A[1] is clear, i.e. the mode is OFF or TOR, then bits pmpaddri[G-1:0] read as all zeros. Bits pmpaddri[G-1:0] do not affect the TOR address-matching logic.
+If the current XLEN is greater than MXLEN, the PMP address registers are +zero-extended from MXLEN to XLEN bits for the purposes of address +matching.
+3.7.1.2. Locking and Privilege Mode
+The L bit indicates that the PMP entry is locked, i.e., writes to the
+configuration register and associated address registers are ignored.
+Locked PMP entries remain locked until the hart is reset. If PMP entry
i is locked, writes to pmpicfg and pmpaddri are ignored. Additionally, if PMP entry i is locked and pmpicfg.A is set to TOR, writes to pmpaddri-1 are ignored.
In addition to locking the PMP entry, the L bit indicates whether the +R/W/X permissions are enforced on M-mode accesses. When the L bit is +set, these permissions are enforced for all privilege modes. When the L +bit is clear, any M-mode access matching the PMP entry will succeed; the +R/W/X permissions apply only to S and U modes.
+3.7.1.3. Priority and Matching Logic
+PMP entries are statically prioritized. The lowest-numbered PMP entry
+that matches any byte of an access determines whether that access
+succeeds or fails. The matching PMP entry must match all bytes of an
+access, or the access fails, irrespective of the L, R, W, and X bits.
+For example, if a PMP entry is configured to match the four-byte range
+0xC
–0xF
, then an 8-byte access to the range 0x8
–0xF
will fail,
+assuming that PMP entry is the highest-priority entry that matches those
+addresses.
If a PMP entry matches all bytes of an access, then the L, R, W, and X +bits determine whether the access succeeds or fails. If the L bit is +clear and the privilege mode of the access is M, the access succeeds.
+Otherwise, if the L bit is set or the privilege mode of the access is S +or U, then the access succeeds only if the R, W, or X bit corresponding +to the access type is set.
+If no PMP entry matches an M-mode access, the access succeeds. If no PMP +entry matches an S-mode or U-mode access, but at least one PMP entry is +implemented, the access fails.
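The priority and matching rules above can be summarized by the following software model (TOR-only, matching this implementation's address-matching modes). The structure and helper are hypothetical models for reasoning, not architectural state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one entry: top = pmpaddr << 2 (byte address); entries in priority order. */
typedef struct { bool tor; bool l, r, w, x; uint64_t top; } pmp_entry_t;
typedef enum { ACC_R, ACC_W, ACC_X } acc_t;

/* True if an access of type 'acc' to [addr, addr+len) at the given privilege succeeds. */
bool pmp_check(const pmp_entry_t *e, int n, uint64_t addr, uint64_t len, acc_t acc, bool m_mode) {
    uint64_t lo = 0;                                   /* entry 0 uses zero as its lower bound */
    for (int i = 0; i < n; lo = e[i].top, i++) {       /* lower bound is the previous pmpaddr  */
        if (!e[i].tor) continue;                       /* A=OFF matches no addresses           */
        bool any = addr < e[i].top && addr + len > lo; /* lowest-numbered overlapping entry    */
        if (!any) continue;
        bool all = addr >= lo && addr + len <= e[i].top;
        if (!all) return false;                        /* partial match fails regardless of L/R/W/X */
        if (m_mode && !e[i].l) return true;            /* unlocked entries do not bind M-mode  */
        return (acc == ACC_R) ? e[i].r : (acc == ACC_W) ? e[i].w : e[i].x;
    }
    return m_mode;   /* no match: M-mode succeeds; S/U-mode fails when entries are implemented */
}
```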
+Failed accesses generate an instruction, load, or store access-fault +exception. Note that a single instruction may generate multiple +accesses, which may not be mutually atomic. An access-fault exception is +generated if at least one access generated by an instruction fails, +though other accesses generated by that instruction may succeed with +visible side effects. Notably, instructions that reference virtual +memory are decomposed into multiple accesses.
+On some implementations, misaligned loads, stores, and instruction +fetches may also be decomposed into multiple accesses, some of which may +succeed before an access-fault exception occurs. In particular, a +portion of a misaligned store that passes the PMP check may become +visible, even if another portion fails the PMP check. The same behavior +may manifest for stores wider than XLEN bits (e.g., the FSD instruction +in RV32D), even when the store address is naturally aligned.
+3.7.2. Physical Memory Protection and Paging
+ +4. "Smstateen/Ssstateen" Extensions, Version 1.0
+CV64A6_MMU: This extension is not supported.
+5. "Smcsrind/Sscsrind" Indirect CSR Access, Version 1.0
+CV64A6_MMU: This extension is not supported.
+6. "Smepmp" Extension for PMP Enhancements for memory access and execution prevention in Machine mode, Version 1.0
+CV64A6_MMU: This extension is not supported.
+7. "Smcntrpmf" Cycle and Instret Privilege Mode Filtering, Version 1.0
+CV64A6_MMU: This extension is not supported.
+8. "Smrnmi" Extension for Resumable Non-Maskable Interrupts, Version 0.5
+CV64A6_MMU: This extension is not supported.
+9. "Smcdeleg" Counter Delegation Extension, Version 1.0
+CV64A6_MMU: This extension is not supported.
+10. "Smdbltrp" Double Trap Extension, Version 1.0
+11. Supervisor-Level ISA, Version 1.13
+This chapter describes the RISC-V supervisor-level architecture, which +contains a common core that is used with various supervisor-level +address translation and protection schemes.
+11.1. Supervisor CSRs
+A number of CSRs are provided for the supervisor.
+11.1.1. Supervisor Status (sstatus
) Register
+The sstatus
register is an SXLEN-bit read/write register formatted as
+shown in Figure 28. The sstatus
+register keeps track of the processor’s current operating state.
Figure 28: Supervisor status (sstatus) register when SXLEN=64.
The SPP bit indicates the privilege level at which a hart was executing before entering supervisor mode. When a trap is taken, SPP is set to 0 if the trap originated from user mode, or 1 otherwise. When an SRET instruction (see Section 3.3.2) is executed to return from the trap handler, the privilege level is set to user mode if the SPP bit is 0, or supervisor mode if the SPP bit is 1; SPP is then set to 0.
+The SIE bit enables or disables all interrupts in supervisor mode. When
+SIE is clear, interrupts are not taken while in supervisor mode. When
+the hart is running in user-mode, the value in SIE is ignored, and
+supervisor-level interrupts are enabled. The supervisor can disable
+individual interrupt sources using the sie
CSR.
The SPIE bit indicates whether supervisor interrupts were enabled prior +to trapping into supervisor mode. When a trap is taken into supervisor +mode, SPIE is set to SIE, and SIE is set to 0. When an SRET instruction +is executed, SIE is set to SPIE, then SPIE is set to 1.
+The sstatus
register is a subset of the mstatus
register.
11.1.1.1. Base ISA Control in sstatus
Register
[CV64A6_MMU] The UXL field is a read-only field that encodes the
+value of XLEN for S-mode. The encoding of this
+field is the same as the MXL field of misa
, shown in Table 9.
+The effective XLEN in S-mode is termed SXLEN.
+Its value is set to SXLEN=MXLEN.
11.1.1.2. Memory Privilege in sstatus
Register
+The MXR (Make eXecutable Readable) bit modifies the privilege with which +loads access virtual memory. When MXR=0, only loads from pages marked +readable (R=1 in [sv32pte]) will succeed. When +MXR=1, loads from pages marked either readable or executable (R=1 or +X=1) will succeed. MXR has no effect when page-based virtual memory is +not in effect.
+The SUM (permit Supervisor User Memory access) bit modifies the +privilege with which S-mode loads and stores access virtual memory. When +SUM=0, S-mode memory accesses to pages that are accessible by U-mode +(U=1 in [sv32pte]) will fault. When SUM=1, these +accesses are permitted. SUM has no effect when page-based virtual memory +is not in effect, nor when executing in U-mode. Note that S-mode can +never execute instructions from user pages, regardless of the state of +SUM.
+11.1.1.3. Endianness Control in sstatus
Register
+UBE controls whether explicit load and store memory accesses made from +U-mode are little-endian (UBE=0) or big-endian (UBE=1).
As memory accesses are always little-endian in U-mode, UBE is read-only zero.
+11.1.1.4. Previous Expected Landing Pad (ELP) State in sstatus
Register
+Access to the SPELP
field, added by Zicfilp, accesses the homonymous
+fields of mstatus
when V=0
, and the homonymous fields of vsstatus
+when V=1
.
11.1.2. Supervisor Trap Vector Base Address (stvec
) Register
+The stvec
register is an SXLEN-bit read/write register that holds trap
+vector configuration, consisting of a vector base address (BASE) and a
+vector mode (MODE).
The BASE field in stvec
is a field that can hold any valid virtual or
+physical address, subject to the following alignment constraints: the
+address must be 4-byte aligned, and MODE settings other than Direct
+might impose additional alignment constraints on the value in the BASE
+field.
| Value | Name | Description |
|---|---|---|
| 0 | Direct | All exceptions set pc to BASE. |
The encoding of the MODE field is shown in
+Table 15. When MODE=Direct, all traps into
+supervisor mode cause the pc
to be set to the address in the BASE
+field. When MODE=Vectored, all synchronous exceptions into supervisor
+mode cause the pc
to be set to the address in the BASE field, whereas
+interrupts cause the pc
to be set to the address in the BASE field
+plus four times the interrupt cause number. For example, a
+supervisor-mode timer interrupt (see Table 16)
+causes the pc
to be set to BASE+0x14
. Setting MODE=Vectored may
+impose a stricter alignment constraint on BASE.
11.1.3. Supervisor Interrupt (sip
and sie
) Registers
+The sip
register is an SXLEN-bit read/write register containing
+information on pending interrupts, while sie
is the corresponding
+SXLEN-bit read/write register containing interrupt enable bits.
+Interrupt cause number i (as reported in CSR scause
,
+Section 11.1.8) corresponds with bit i in both sip
and
+sie
. Bits 15:0 are allocated to standard interrupt causes only, while
+bits 16 and above are designated for platform use.
An interrupt i will trap to S-mode if both of the following are true:
+(a) either the current privilege mode is S and the SIE bit in the
+sstatus
register is set, or the current privilege mode has less
+privilege than S-mode; and (b) bit i is set in both sip
and sie
.
These conditions for an interrupt trap to occur must be evaluated in a
+bounded amount of time from when an interrupt becomes, or ceases to be,
+pending in sip
, and must also be evaluated immediately following the
+execution of an SRET instruction or an explicit write to a CSR on which
+these interrupt trap conditions expressly depend (including sip
, sie
+and sstatus
).
Interrupts to S-mode take priority over any interrupts to lower +privilege modes.
+Each individual bit in register sip
may be writable or may be
+read-only. When bit i in sip
is writable, a pending interrupt i
+can be cleared by writing 0 to this bit. If interrupt i can become
+pending but bit i in sip
is read-only, the implementation must
+provide some other mechanism for clearing the pending interrupt (which
+may involve a call to the execution environment).
A bit in sie
must be writable if the corresponding interrupt can ever
+become pending. Bits of sie
that are not writable are read-only zero.
The standard portions (bits 15:0) of registers sip
and sie
are
formatted as shown in Figure 32 and Figure 33, respectively.
Bits sip.SEIP and sie.SEIE are the interrupt-pending and
+interrupt-enable bits for supervisor-level external interrupts. If
+implemented, SEIP is read-only in sip
, and is set and cleared by the
+execution environment, typically through a platform-specific interrupt
+controller.
Bits sip
.STIP and sie
.STIE are the interrupt-pending and
+interrupt-enable bits for supervisor-level timer interrupts. If
+implemented, STIP is read-only in sip
, and is set and cleared by the
+execution environment.
Bits sip
.SSIP and sie
.SSIE are the interrupt-pending and
+interrupt-enable bits for supervisor-level software interrupts. If
+implemented, SSIP is writable in sip
and may also be set to 1 by a
+platform-specific interrupt controller.
Each standard interrupt type (SEI, STI, SSI, or LCOFI) may not be implemented,
+in which case the corresponding interrupt-pending and interrupt-enable
+bits are read-only zeros. All bits in sip
and sie
are WARL fields. The
+implemented interrupts may be found by writing one to every bit location
+in sie
, then reading back to see which bit positions hold a one.
11.1.4. Supervisor Timers and Performance Counters
+Supervisor software uses the same hardware performance monitoring
+facility as user-mode software, including the time
, cycle
, and
+instret
CSRs. The implementation should provide a mechanism to modify
+the counter values.
The implementation must provide a facility for scheduling timer
+interrupts in terms of the real-time counter, time
.
11.1.5. Counter-Enable (scounteren
) Register
The counter-enable (scounteren
) CSR is a 32-bit register that
+controls the availability of the hardware performance monitoring
+counters to U-mode.
When the CY, TM, IR, or HPMn bit in the scounteren
register is
+clear, attempts to read the cycle
, time
, instret
, or hpmcountern
+register while executing in U-mode will cause an illegal-instruction
+exception. When one of these bits is set, access to the corresponding
+register is permitted.
11.1.6. Supervisor Scratch (sscratch
) Register
+The sscratch
CSR is an SXLEN-bit read/write register, dedicated
+for use by the supervisor. Typically, sscratch
is used to hold a
+pointer to the hart-local supervisor context while the hart is executing
+user code. At the beginning of a trap handler, sscratch
is swapped
+with a user register to provide an initial working register.
11.1.7. Supervisor Exception Program Counter (sepc
) Register
+sepc
is an SXLEN-bit read/write CSR formatted as shown in
+Figure 36. The low bit of sepc
(sepc[0]
) is always zero. On implementations that support only IALIGN=32, the two low bits (sepc[1:0]
) are always zero.
sepc
is a WARL register that must be able to hold all valid virtual
+addresses. It need not be capable of holding all possible invalid
+addresses. Prior to writing sepc
, implementations may convert an
+invalid address into some other invalid address that sepc
is capable
+of holding.
When a trap is taken into S-mode, sepc
is written with the virtual
+address of the instruction that was interrupted or that encountered the
+exception. Otherwise, sepc
is never written by the implementation,
+though it may be explicitly written by software.
11.1.8. Supervisor Cause (scause
) Register
+The scause
CSR is an SXLEN-bit read-write register formatted as
+shown in Figure 37. When a trap is taken into
+S-mode, scause
is written with a code indicating the event that
+caused the trap. Otherwise, scause
is never written by the
+implementation, though it may be explicitly written by software.
The Interrupt bit in the scause
register is set if the trap was caused
+by an interrupt. The Exception Code field contains a code identifying
+the last exception or interrupt. Table 16 lists
+the possible exception codes for the current supervisor ISAs. The
+Exception Code is a WLRL field. It is required to hold the values 0–31
+(i.e., bits 4–0 must be implemented), but otherwise it is only
+guaranteed to hold supported exception codes.
| Interrupt | Exception Code | Description |
|---|---|---|
| 1 | 0 | Reserved |
| 0 | 0 | Instruction address misaligned |
11.1.9. Supervisor Trap Value (stval
) Register
+[CV64A6_MMU] The stval
register is an MXLEN-bit read-only 0 register.
11.1.10. Supervisor Environment Configuration (senvcfg
) Register
+The senvcfg
CSR is an SXLEN-bit read/write register, formatted as
+shown in Figure 38, that controls certain
+characteristics of the U-mode execution environment.
Figure 38: Supervisor environment configuration (senvcfg) register for RV64.
If bit FIOM (Fence of I/O implies Memory) is set to one in senvcfg
,
+FENCE instructions executed in U-mode are modified so the requirement to
+order accesses to device I/O implies also the requirement to order main
+memory accesses. Table 17 details the modified
+interpretation of FENCE instruction bits PI, PO, SI, and SO in U-mode
+when FIOM=1.
Similarly, for U-mode when FIOM=1, if an atomic instruction that +accesses a region ordered as device I/O has its aq and/or rl bit +set, then that instruction is ordered as though it accesses both device +I/O and memory.
| Instruction bit | Meaning when set |
|---|---|
| PI | Predecessor device input and memory reads (PR implied) |
| SI | Successor device input and memory reads (SR implied) |
[CV64A6_MMU] CBZE, CBCFE, CBIE, PMM, LPE, ELP, and SSE are always 0 because their corresponding extensions are not implemented.
+11.1.11. Supervisor Address Translation and Protection (satp
) Register
+The satp
CSR is an SXLEN-bit read/write register, formatted as
+shown in Figure 39, which controls
+supervisor-mode address translation and protection. This register holds
+the physical page number (PPN) of the root page table, i.e., its
+supervisor physical address divided by 4 KiB; an address space identifier
+(ASID), which facilitates address-translation fences on a
+per-address-space basis; and the MODE field, which selects the current
+address-translation scheme. Further details on the access to this
+register are described in Section 3.1.6.6.
Figure 39: Supervisor address translation and protection (satp) register when SXLEN=64, for MODE values Bare, Sv39, Sv48, and Sv57.
Table 18 shows the encodings of the MODE field when SXLEN=32 and SXLEN=64. When MODE=Bare, supervisor virtual addresses are equal to supervisor physical addresses, and there is no additional memory protection beyond the physical memory protection scheme described in Section 3.7.
+[CV64A6_MMU] When SXLEN=64, the only other valid setting for MODE is Sv39, a paged +virtual-memory scheme described in Section 11.3.
+The number of ASID bits is UNSPECIFIED and may be zero. The number of implemented
+ASID bits, termed ASIDLEN, may be determined by writing one to every
+bit position in the ASID field, then reading back the value in satp
to
+see which bit positions in the ASID field hold a one. The
+least-significant bits of ASID are implemented first: that is, if
ASIDLEN > 0, ASID[ASIDLEN-1:0] is writable. The maximal
+value of ASIDLEN, termed ASIDMAX, is 9 for Sv32 or 16 for Sv39, Sv48,
+and Sv57.
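A sketch of the ASIDLEN probe just described, using the standard satp layout for SXLEN=64 (ASID in bits 59:44) and GCC inline assembly; real code would normally do this once during boot.

```c
#include <stdint.h>

#define SATP64_ASID_SHIFT  44
#define SATP64_ASID_MASK   (0xFFFFULL << SATP64_ASID_SHIFT)

/* Write ones to every ASID bit, read back, and count how many stuck at one. */
unsigned probe_asidlen(void) {
    uint64_t old, back;
    __asm__ volatile ("csrr %0, satp" : "=r"(old));
    __asm__ volatile ("csrw satp, %0" :: "r"(old | SATP64_ASID_MASK));
    __asm__ volatile ("csrr %0, satp" : "=r"(back));
    __asm__ volatile ("csrw satp, %0" :: "r"(old));   /* restore the original value */

    uint64_t asid = (back & SATP64_ASID_MASK) >> SATP64_ASID_SHIFT;
    unsigned len = 0;
    while (asid & 1) { len++; asid >>= 1; }
    return len;
}
```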
| SXLEN=32 | | |
|---|---|---|
| Value | Name | Description |
| 0 | Bare | No translation or protection. |
| SXLEN=64 | | |
| Value | Name | Description |
| 0 | Bare | No translation or protection. |
The satp
CSR is considered active when the effective privilege
+mode is S-mode or U-mode. Executions of the address-translation
+algorithm may only begin using a given value of satp
when satp
is
+active.
Note that writing satp
does not imply any ordering constraints between
+page-table updates and subsequent address translations, nor does it
+imply any invalidation of address-translation caches. If the new address
+space’s page tables have been modified, or if an ASID is reused, it may
+be necessary to execute an SFENCE.VMA instruction (see
+Section 11.2.1) after, or in some cases before, writing
+satp
.
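For example, switching to a new address space typically pairs the satp write with an SFENCE.VMA, per the note above. The sketch assumes the standard MODE encoding of 8 for Sv39 and the SXLEN=64 field layout; it is illustrative, not normative.

```c
#include <stdint.h>

#define SATP_MODE_SV39  (8ULL << 60)

/* root_ppn: physical page number of the root page table; asid: new address-space identifier. */
void switch_address_space(uint64_t root_ppn, uint64_t asid) {
    uint64_t satp = SATP_MODE_SV39 | ((asid & 0xFFFF) << 44) | (root_ppn & 0xFFFFFFFFFFFULL);
    __asm__ volatile ("csrw satp, %0" :: "r"(satp));
    /* Invalidate cached translations for this ASID (rs1 = x0, rs2 = asid). */
    __asm__ volatile ("sfence.vma x0, %0" :: "r"(asid) : "memory");
}
```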
11.2. Supervisor Instructions
+In addition to the SRET instruction defined in Section 3.3.2, one new supervisor-level instruction is provided.
+11.2.1. Supervisor Memory-Management Fence Instruction
+The supervisor memory-management fence instruction SFENCE.VMA is used to +synchronize updates to in-memory memory-management data structures with +current execution. Instruction execution causes implicit reads and +writes to these data structures; however, these implicit references are +ordinarily not ordered with respect to explicit loads and stores. +Executing an SFENCE.VMA instruction guarantees that any previous stores +already visible to the current RISC-V hart are ordered before certain +implicit references by subsequent instructions in that hart to the +memory-management data structures. The specific set of operations +ordered by SFENCE.VMA is determined by rs1 and rs2, as described +below. SFENCE.VMA is also used to invalidate entries in the +address-translation cache associated with a hart (see [sv32algorithm]). Further details on the behavior of this instruction are described in Section 3.1.6.6 and Section 3.7.2.
+SFENCE.VMA orders only the local hart’s implicit references to the +memory-management data structures.
+11.3. Sv39: Page-Based 39-bit Virtual-Memory System
+This section describes a simple paged virtual-memory system for +SXLEN=64, which supports 39-bit virtual address spaces. The design of +Sv39 follows the overall scheme of Sv32, and this section details only +the differences between the schemes.
+11.3.1. Addressing and Memory Protection
+Sv39 implementations support a 39-bit virtual address space, divided +into pages. An Sv39 address is partitioned as shown in +Figure 40. Instruction fetch addresses and load and +store effective addresses, which are 64 bits, must have bits 63–39 all +equal to bit 38, or else a page-fault exception will occur. The 27-bit +VPN is translated into a 44-bit PPN via a three-level page table, while +the 12-bit page offset is untranslated.
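To make the partitioning concrete, the helper below (a sketch, not part of the specification) splits a 64-bit virtual address into its Sv39 fields and checks the required sign-extension of bits 63-39:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t vpn2, vpn1, vpn0, offset; } sv39_va_t;

/* Returns false if bits 63:39 are not all copies of bit 38 (such an access page-faults). */
bool sv39_split(uint64_t va, sv39_va_t *out) {
    int64_t sext = (int64_t)(va << 25) >> 25;   /* sign-extend from bit 38 */
    if ((uint64_t)sext != va) return false;

    out->offset = va & 0xFFF;                   /* 12-bit page offset */
    out->vpn0   = (va >> 12) & 0x1FF;           /* 9 bits per level   */
    out->vpn1   = (va >> 21) & 0x1FF;
    out->vpn2   = (va >> 30) & 0x1FF;
    return true;
}
```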
Sv39 page tables contain 2^9 page table entries (PTEs),
+eight bytes each. A page table is exactly the size of a page and must
+always be aligned to a page boundary. The physical page number of the
+root page table is stored in the satp
register’s PPN field.
The PTE format for Sv39 is shown in Figure 42.
+The V bit indicates whether the PTE is valid; if it is 0, all other bits +in the PTE are don’t-cares and may be used freely by software. The +permission bits, R, W, and X, indicate whether the page is readable, +writable, and executable, respectively. When all three are zero, the PTE +is a pointer to the next level of the page table; otherwise, it is a +leaf PTE. Writable pages must also be marked readable; the contrary +combinations are reserved for future use. Table 19 +summarizes the encoding of the permission bits.
| X | W | R | Meaning |
|---|---|---|---|
| 0 | 0 | 0 | Pointer to next level of page table. |
Attempting to fetch an instruction from a page that does not have +execute permissions raises a fetch page-fault exception. Attempting to +execute a load or load-reserved instruction whose effective address lies +within a page without read permissions raises a load page-fault +exception. Attempting to execute a store, store-conditional, or AMO +instruction whose effective address lies within a page without write +permissions raises a store page-fault exception.
+The U bit indicates whether the page is accessible to user mode. U-mode
+software may only access the page when U=1. If the SUM bit in the
+sstatus
register is set, supervisor mode software may also access
+pages with U=1. However, supervisor code normally operates with the SUM
+bit clear, in which case, supervisor code will fault on accesses to
+user-mode pages.
The G bit designates a global mapping. Global mappings are those that +exist in all address spaces. For non-leaf PTEs, the global setting +implies that all mappings in the subsequent levels of the page table are +global.
The RSW field is reserved for use by supervisor software and is ignored by the implementation.
[CV64A6_MMU] As Svnapot is not implemented, bit 63 remains reserved and must be zeroed by software for forward compatibility, or else a page-fault exception is raised.
[CV64A6_MMU] As Svpbmt is not implemented, bits 62-61 remain reserved and must be zeroed by software for forward compatibility, or else a page-fault exception is raised.
+Bits 60-54 are reserved for +future standard use and, until their use is defined by some standard +extension, must be zeroed by software for forward compatibility. If any +of these bits are set, a page-fault exception is raised.
+12. "Sstc" Extension for Supervisor-mode Timer Interrupts, Version 1.0
+CV64A6_MMU: This extension is not supported.
+13. "Sscofpmf" Extension for Count Overflow and Mode-Based Filtering, Version 1.0
+CV64A6_MMU: This extension is not supported.
+14. "H" Extension for Hypervisor Support, Version 1.0
+CV64A6_MMU: This extension is not supported.
+15. Control-flow Integrity (CFI)
+CV64A6_MMU: The Zicfiss extension is not supported.
+CV64A6_MMU: The Zicfilp extension is not supported.
+16. "Ssdbltrp" Double Trap Extension, Version 1.0
+17. RISC-V Privileged Instruction Set Listings
+This chapter presents instruction-set listings for all instructions +defined in the RISC-V Privileged Architecture.
+The instruction-set listings for unprivileged instructions, including +the ECALL and EBREAK instructions, are provided in Volume I of this +manual.
+18. History
+18.1. Research Funding at UC Berkeley
+Development of the RISC-V architecture and implementations has been +partially funded by the following sponsors.
+-
+
-
+
Par Lab: Research supported by Microsoft (Award #024263) and Intel +(Award #024894) funding and by matching funding by U.C. Discovery (Award +#DIG07-10227). Additional support came from Par Lab affiliates Nokia, +NVIDIA, Oracle, and Samsung.
+
+ -
+
Project Isis: DoE Award DE-SC0003624.
+
+ -
+
ASPIRE Lab: DARPA PERFECT program, Award HR0011-12-2-0016. DARPA +POEM program Award HR0011-11-C-0100. The Center for Future Architectures +Research (C-FAR), a STARnet center funded by the Semiconductor Research +Corporation. Additional support from ASPIRE industrial sponsor, Intel, +and ASPIRE affiliates, Google, Huawei, Nokia, NVIDIA, Oracle, and +Samsung.
+
+
The content of this paper does not necessarily reflect the position or +the policy of the US government and no official endorsement should be +inferred.
+The RISC-V Instruction Set Manual for CV64A6_MMU: Volume I - Unprivileged Architecture
+-
+
- Preface +
- 1. Introduction + + +
- 2. RV32I Base Integer Instruction Set, Version 2.1
+
-
+
- 2.1. Programmers' Model for Base Integer ISA +
- 2.2. Base Instruction Formats +
- 2.3. Immediate Encoding Variants +
- 2.4. Integer Computational Instructions + + +
- 2.5. Control Transfer Instructions + + +
- 2.6. Load and Store Instructions +
- 2.7. Memory Ordering Instructions +
- 2.8. Environment Call and Breakpoints +
- 2.9. HINT Instructions +
+ - 3. RV32E and RV64E Base Integer Instruction Sets, Version 2.0 +
- 4. RV64I Base Integer Instruction Set, Version 2.1 + + +
- 5. RV128I Base Integer Instruction Set, Version 1.7 +
- 6. "Zifencei" Extension for Instruction-Fetch Fence, Version 2.0 +
- 7. "Zicsr", Extension for Control and Status Register (CSR) Instructions, Version 2.0 + + +
- 8. "Zicntr" and "Zihpm" Extensions for Counters, Version 2.0 + + +
- 9. "Zihintntl" Extension for Non-Temporal Locality Hints, Version 1.0 +
- 10. "Zihintpause" Extension for Pause Hint, Version 2.0 +
- 11. "Zimop" Extension for May-Be-Operations, Version 1.0 + + +
- 12. "Zicond" Extension for Integer Conditional Operations, Version 1.0.0 +
- 13. "M" Extension for Integer Multiplication and Division, Version 2.0 + + +
- 14. "A" Extension for Atomic Instructions, Version 2.1 +
- 15. "Zawrs" Extension for Wait-on-Reservation-Set instructions, Version 1.01 +
- 16. "Zacas" Extension for Atomic Compare-and-Swap (CAS) Instructions, Version 1.0.0 +
- 17. "Zabha" Extension for Byte and Halfword Atomic Memory Operations, Version 1.0.0 +
- 18. RVWMO Memory Consistency Model, Version 2.0 + + +
- 19. "Ztso" Extension for Total Store Ordering, Version 1.0 +
- 20. "CMO" Extensions for Base Cache Management Operation ISA, Version 1.0.0 +
- 21. "F" Extension for Single-Precision Floating-Point, Version 2.2 +
- 22. "D" Extension for Double-Precision Floating-Point, Version 2.2 +
- 23. "Q" Extension for Quad-Precision Floating-Point, Version 2.2 +
- 24. "Zfh" and "Zfhmin" Extensions for Half-Precision Floating-Point, Version 1.0 +
- 25. "BF16" Extensions for for BFloat16-precision Floating-Point, Version 1.0 +
- 26. "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0 +
- 27. "Zfinx", "Zdinx", "Zhinx", "Zhinxmin" Extensions for Floating-Point in Integer Registers, Version 1.0 +
- 28. "C" Extension for Compressed Instructions, Version 2.0 + + +
- 29. "Zc*" Extension for Code Size Reduction, Version 1.0.0
+
-
+
- 29.1. Zc* Overview +
- 29.2. C +
- 29.3. Zce +
- 29.4. MISA.C +
- 29.5. Zca +
- 29.6. Zcf (RV32 only) +
- 29.7. Zcd +
- 29.8. Zcb +
- 29.9. Zcmp +
- 29.10. Zcmt +
- 29.11. Zc instruction formats +
- 29.12. Zcb instructions + + +
- 29.13. PUSH/POP register instructions
+
-
+
- 29.13.1. PUSH/POP functional overview +
- 29.13.2. Example usage + + +
- 29.13.3. PUSH/POP Fault handling +
- 29.13.4. Software view of execution + + +
- 29.13.5. Non-idempotent memory handling +
- 29.13.6. Example RV32I PUSH/POP sequences + + +
- 29.13.7. cm.push +
- 29.13.8. cm.pop +
- 29.13.9. cm.popretz +
- 29.13.10. cm.popret +
- 29.13.11. cm.mvsa01 +
- 29.13.12. cm.mva01s +
+ - 29.14. Table Jump Overview + + +
+ - 30. "B" Extension for Bit Manipulation, Version 1.0.0
+
-
+
- 30.1. Zb* Overview +
- 30.2. Word Instructions +
- 30.3. Pseudocode for instruction semantics +
- 30.4. Extensions + + +
- 30.5. Instructions (in alphabetical order)
+
-
+
- 30.5.1. add.uw +
- 30.5.2. andn +
- 30.5.3. bclr +
- 30.5.4. bclri +
- 30.5.5. bext +
- 30.5.6. bexti +
- 30.5.7. binv +
- 30.5.8. binvi +
- 30.5.9. bset +
- 30.5.10. bseti +
- 30.5.11. clmul +
- 30.5.12. clmulh +
- 30.5.13. clmulr +
- 30.5.14. clz +
- 30.5.15. clzw +
- 30.5.16. cpop +
- 30.5.17. cpopw +
- 30.5.18. ctz +
- 30.5.19. ctzw +
- 30.5.20. max +
- 30.5.21. maxu +
- 30.5.22. min +
- 30.5.23. minu +
- 30.5.24. orc.b +
- 30.5.25. orn +
- 30.5.26. pack +
- 30.5.27. packh +
- 30.5.28. packw +
- 30.5.29. rev8 +
- 30.5.30. rev.b +
- 30.5.31. rol +
- 30.5.32. rolw +
- 30.5.33. ror +
- 30.5.34. rori +
- 30.5.35. roriw +
- 30.5.36. rorw +
- 30.5.37. sext.b +
- 30.5.38. sext.h +
- 30.5.39. sh1add +
- 30.5.40. sh1add.uw +
- 30.5.41. sh2add +
- 30.5.42. sh2add.uw +
- 30.5.43. sh3add +
- 30.5.44. sh3add.uw +
- 30.5.45. slli.uw +
- 30.5.46. unzip +
- 30.5.47. xnor +
- 30.5.48. xperm.b +
- 30.5.49. xperm.n +
- 30.5.50. zext.h +
- 30.5.51. zip +
+ - 30.6. Software optimization guide + + +
+ - 31. "J" Extension for Dynamically Translated Languages, Version 0.0 +
- 32. "P" Extension for Packed-SIMD Instructions, Version 0.2 +
- 33. "V" Standard Extension for Vector Operations, Version 1.0 +
- 34. Cryptography Extensions: Scalar & Entropy Source Instructions, Version 1.0.1 +
- 35. Cryptography Extensions: Vector Instructions, Version 1.0 +
- 36. Control-flow Integrity (CFI) +
- 37. RV32/64G Instruction Set Listings +
- 38. Extending RISC-V + + +
- 39. ISA Extension Naming Conventions
- 39.1. Case Sensitivity +
- 39.2. Base Integer ISA +
- 39.3. Instruction-Set Extension Names +
- 39.4. Version Numbers +
- 39.5. Underscores +
- 39.6. Additional Standard Unprivileged Extension Names +
- 39.7. Supervisor-level Instruction-Set Extensions +
- 39.8. Hypervisor-level Instruction-Set Extensions +
- 39.9. Machine-level Instruction-Set Extensions +
- 39.10. Non-Standard Extension Names +
- 39.11. Subset Naming Convention +
- 40. History and Acknowledgments +
- 40.1. "Why Develop a new ISA?" Rationale from Berkeley Group +
- 40.2. History from Revision 1.0 of ISA manual +
- 40.3. History from Revision 2.0 of ISA manual +
- 40.4. Acknowledgments +
- 40.5. History from Revision 2.1 +
- 40.6. Acknowledgments +
- 40.7. History from Revision 2.2 +
- 40.8. Acknowledgments +
- 40.9. History for Revision 2.3 +
- 40.10. Funding +
- Appendix A: RVWMO Explanatory Material, Version 0.1 +
- A.1. Why RVWMO? +
- A.2. Litmus Tests +
- A.3. Explaining the RVWMO Rules +
- A.3.1. Preserved Program Order and Global Memory Order +
- A.3.2. Load value axiom +
- A.3.3. Atomicity axiom +
- A.3.4. Progress axiom +
- A.3.5. Overlapping-Address Orderings (Rules 1-3) +
- A.3.6. Fences (Rule 4) +
- A.3.7. Explicit Synchronization (Rules 5-8) +
- A.3.8. Syntactic Dependencies (Rules 9-11) +
- A.3.9. Pipeline Dependencies (Rules 12-13) +
- A.4. Beyond Main Memory +
- A.5. Code Porting and Mapping Guidelines +
- A.6. Implementation Guidelines + + +
- A.7. Known Issues + + +
- Appendix B: Formal Memory Model Specifications, Version 0.1 +
- B.1. Formal Axiomatic Specification in Alloy +
- B.2. Formal Axiomatic Specification in Herd +
- B.3. An Operational Memory Model +
- B.3.1. Intra-instruction Pseudocode Execution +
- B.3.2. Instruction Instance State +
- B.3.3. Hart State +
- B.3.4. Shared Memory State +
- B.3.5. Transitions +
- Fetch instruction +
- Initiate memory load operations +
- Satisfy memory load operation by forwarding from unpropagated stores +
- Satisfy memory load operation from memory +
- Complete load operations +
- Early sc fail +
- Paired sc +
- Initiate memory store operation footprints +
- Instantiate memory store operation values +
- Commit store instruction +
- Propagate store operation +
- Commit and propagate store operation of an sc +
- Late sc fail +
- Complete store operations +
- Satisfy, commit and propagate operations of an AMO +
- Commit fence +
- Register read +
- Register write +
- Pseudocode internal step +
- Finish instruction +
- B.3.6. Limitations +
- Appendix C: Vector Assembly Code Examples +
- Appendix D: Calling Convention for Vector State (Not authoritative - Placeholder Only) +
- Index +
- Bibliography +
This document describes the RISC-V unprivileged architecture tailored for OpenHW Group CV64A6_MMU. Parts of the original specification that are not relevant (e.g., unsupported extensions) are replaced by placeholders.
+Contributors to all versions of the spec in alphabetical order (please contact editors to suggest +corrections): Derek Atkins, +Arvind, +Krste Asanović, +Rimas Avižienis, +Jacob Bachmeyer, +Christopher F. Batten, +Allen J. Baum, +Abel Bernabeu, +Alex Bradbury, +Scott Beamer, +Hans Boehm, +Preston Briggs, +Christopher Celio, +Chuanhua Chang, +David Chisnall, +Paul Clayton, +Palmer Dabbelt, +L Peter Deutsch, +Ken Dockser, +Paul Donahue, +Aaron Durbin, +Roger Espasa, +Greg Favor, +Andy Glew, +Shaked Flur, +Stefan Freudenberger, +Marc Gauthier, +Andy Glew, +Jan Gray, +Gianluca Guida, +Michael Hamburg, +John Hauser, +John Ingalls, +David Horner, +Bruce Hoult, +Bill Huffman, +Alexandre Joannou, +Olof Johansson, +Ben Keller, +David Kruckemyer, +Tariq Kurd, +Yunsup Lee, +Paul Loewenstein, +Daniel Lustig, +Yatin Manerkar, +Luc Maranget, +Ben Marshall, +Margaret Martonosi, +Phil McCoy, +Nathan Menhorn, +Christoph Müllner, +Joseph Myers, +Vijayanand Nagarajan, +Rishiyur Nikhil, +Jonas Oberhauser, +Stefan O’Rear, +Markku-Juhani O. Saarinen, +Albert Ou, +John Ousterhout, +Daniel Page, +David Patterson, +Christopher Pulte, +Jose Renau, +Josh Scheid, +Colin Schmidt, +Peter Sewell, +Susmit Sarkar, +Ved Shanbhogue, +Brent Spinney, +Brendan Sweeney, +Michael Taylor, +Wesley Terpstra, +Matt Thomas, +Tommy Thorn, +Philipp Tomsich, +Caroline Trippel, +Ray VanDeWalker, +Muralidaran Vijayaraghavan, +Megan Wachs, +Paul Wamsley, +Andrew Waterman, +Robert Watson, +David Weaver, +Derek Williams, +Claire Wolf, +Andrew Wright, +Reinoud Zandijk, +and Sizhuo Zhang.
+This document is released under a Creative Commons Attribution 4.0 International License.
+This document is a derivative of “The RISC-V Instruction Set Manual, Volume I: User-Level ISA +Version 2.1” released under the following license: ©2010-2017 Andrew Waterman, Yunsup Lee, +David Patterson, Krste Asanović. Creative Commons Attribution 4.0 International License. +Please cite as: “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Document +Version 20191214-draft”, Editors Andrew Waterman and Krste Asanović, RISC-V Foundation, +December 2019.
+Contributors to CV64A6_MMU versions of the spec in alphabetical order: +Jean-Roch Coulon, André Sintzoff.
+Preface
+Preface to Document Version for CV64A6_MMU
+This document describes the RISC-V unprivileged architecture tailored for +OpenHW Group CV64A6_MMU.
+Preface to Document Version 20240612
+This document describes the RISC-V unprivileged architecture.
+The ISA modules marked Ratified have been ratified at this time. The +modules marked Frozen are not expected to change significantly before +being put up for ratification. The modules marked Draft are expected +to change before ratification.
+The document contains the following versions of the RISC-V ISA modules:
| Base | Version | Status |
|---|---|---|
| RV32I | 2.1 | Ratified |
| RV32E | 2.0 | Ratified |
| RV64E | 2.0 | Ratified |
| RV64I | 2.1 | Ratified |
| RV128I | 1.7 | Draft |

| Extension | Version | Status |
|---|---|---|
| Zifencei | 2.0 | Ratified |
| Zicsr | 2.0 | Ratified |
| Zicntr | 2.0 | Ratified |
| Zihintntl | 1.0 | Ratified |
| Zihintpause | 2.0 | Ratified |
| Zimop | 1.0 | Ratified |
| Zicond | 1.0 | Ratified |
| M | 2.0 | Ratified |
| Zmmul | 1.0 | Ratified |
| A | 2.1 | Ratified |
| Zawrs | 1.01 | Ratified |
| Zacas | 1.0 | Ratified |
| RVWMO | 2.0 | Ratified |
| Ztso | 1.0 | Ratified |
| CMO | 1.0 | Ratified |
| F | 2.2 | Ratified |
| D | 2.2 | Ratified |
| Q | 2.2 | Ratified |
| Zfh | 1.0 | Ratified |
| Zfhmin | 1.0 | Ratified |
| Zfa | 1.0 | Ratified |
| Zfinx | 1.0 | Ratified |
| Zdinx | 1.0 | Ratified |
| Zhinx | 1.0 | Ratified |
| Zhinxmin | 1.0 | Ratified |
| C | 2.0 | Ratified |
| Zce | 1.0 | Ratified |
| B | 1.0 | Ratified |
| P | 0.2 | Draft |
| V | 1.0 | Ratified |
| Zbkb | 1.0 | Ratified |
| Zbkc | 1.0 | Ratified |
| Zbkx | 1.0 | Ratified |
| Zk | 1.0 | Ratified |
| Zks | 1.0 | Ratified |
| Zvbb | 1.0 | Ratified |
| Zvbc | 1.0 | Ratified |
| Zvkg | 1.0 | Ratified |
| Zvkned | 1.0 | Ratified |
| Zvknhb | 1.0 | Ratified |
| Zvksed | 1.0 | Ratified |
| Zvksh | 1.0 | Ratified |
| Zvkt | 1.0 | Ratified |
The changes in this version of the document include:
- The inclusion of all ratified extensions through March 2024.
- The draft Zam extension has been removed, in favor of the definition of a misaligned atomicity granule PMA.
- The concept of vacant memory regions has been superseded by inaccessible memory or I/O regions.
Preface to Document Version 20191213-Base-Ratified
+This document describes the RISC-V unprivileged architecture.
+The ISA modules marked Ratified have been ratified at this time. The +modules marked Frozen are not expected to change significantly before +being put up for ratification. The modules marked Draft are expected +to change before ratification.
+The document contains the following versions of the RISC-V ISA modules:
| Base | Version | Status |
|---|---|---|
| RVWMO | 2.0 | Ratified |
| RV32I | 2.1 | Ratified |
| RV64I | 2.1 | Ratified |
| RV32E | 1.9 | Draft |
| RV128I | 1.7 | Draft |

| Extension | Version | Status |
|---|---|---|
| M | 2.0 | Ratified |
| A | 2.1 | Ratified |
| F | 2.2 | Ratified |
| D | 2.2 | Ratified |
| Q | 2.2 | Ratified |
| C | 2.0 | Ratified |
| Counters | 2.0 | Draft |
| L | 0.0 | Draft |
| B | 0.0 | Draft |
| J | 0.0 | Draft |
| T | 0.0 | Draft |
| P | 0.2 | Draft |
| V | 0.7 | Draft |
| Zicsr | 2.0 | Ratified |
| Zifencei | 2.0 | Ratified |
| Zam | 0.1 | Draft |
| Ztso | 0.1 | Frozen |
The changes in this version of the document include:
- The A extension, now version 2.1, was ratified by the board in December 2019.
- Defined big-endian ISA variant.
- Moved N extension for user-mode interrupts into Volume II.
- Defined PAUSE hint instruction.
Preface to Document Version 20190608-Base-Ratified
+This document describes the RISC-V unprivileged architecture.
+The RVWMO memory model has been ratified at this time. The ISA modules +marked Ratified, have been ratified at this time. The modules marked +Frozen are not expected to change significantly before being put up +for ratification. The modules marked Draft are expected to change +before ratification.
+The document contains the following versions of the RISC-V ISA modules:
| Base | Version | Status |
|---|---|---|
| RVWMO | 2.0 | Ratified |
| RV32I | 2.1 | Ratified |
| RV64I | 2.1 | Ratified |
| RV32E | 1.9 | Draft |
| RV128I | 1.7 | Draft |

| Extension | Version | Status |
|---|---|---|
| Zifencei | 2.0 | Ratified |
| Zicsr | 2.0 | Ratified |
| M | 2.0 | Ratified |
| A | 2.0 | Frozen |
| F | 2.2 | Ratified |
| D | 2.2 | Ratified |
| Q | 2.2 | Ratified |
| C | 2.0 | Ratified |
| Ztso | 0.1 | Frozen |
| Counters | 2.0 | Draft |
| L | 0.0 | Draft |
| B | 0.0 | Draft |
| J | 0.0 | Draft |
| T | 0.0 | Draft |
| P | 0.2 | Draft |
| V | 0.7 | Draft |
| Zam | 0.1 | Draft |
The changes in this version of the document include:
- Moved description to Ratified for the ISA modules ratified by the board in early 2019.
- Removed the A extension from ratification.
- Changed document version scheme to avoid confusion with versions of the ISA modules.
- Incremented the version numbers of the base integer ISA to 2.1, reflecting the presence of the ratified RVWMO memory model and exclusion of FENCE.I, counters, and CSR instructions that were in the previous base ISA.
- Incremented the version numbers of the F and D extensions to 2.2, reflecting that version 2.1 changed the canonical NaN, and version 2.2 defined the NaN-boxing scheme and changed the definition of the FMIN and FMAX instructions.
- Changed name of document to refer to "unprivileged" instructions as part of move to separate ISA specifications from platform profile mandates.
- Added clearer and more precise definitions of execution environments, harts, traps, and memory accesses.
- Defined instruction-set categories: standard, reserved, custom, non-standard, and non-conforming.
- Removed text implying operation under alternate endianness, as alternate-endianness operation has not yet been defined for RISC-V.
- Changed description of misaligned load and store behavior. The specification now allows visible misaligned address traps in execution environment interfaces, rather than just mandating invisible handling of misaligned loads and stores in user mode. Also, now allows access-fault exceptions to be reported for misaligned accesses (including atomics) that should not be emulated.
- Moved FENCE.I out of the mandatory base and into a separate extension, with Zifencei ISA name. FENCE.I was removed from the Linux user ABI and is problematic in implementations with large incoherent instruction and data caches. However, it remains the only standard instruction-fetch coherence mechanism.
- Removed prohibitions on using RV32E with other extensions.
- Removed platform-specific mandates that certain encodings produce illegal-instruction exceptions in RV32E and RV64I chapters.
- Counter/timer instructions are now not considered part of the mandatory base ISA, and so CSR instructions were moved into a separate chapter and marked as version 2.0, with the unprivileged counters moved into another separate chapter. The counters are not ready for ratification as there are outstanding issues, including counter inaccuracies.
- A CSR-access ordering model has been added.
- Explicitly defined the 16-bit half-precision floating-point format for floating-point instructions in the 2-bit fmt field.
- Defined the signed-zero behavior of FMIN.fmt and FMAX.fmt, and changed their behavior on signaling-NaN inputs to conform to the minimumNumber and maximumNumber operations in the proposed IEEE 754-201x specification.
- The memory consistency model, RVWMO, has been defined.
- The "Zam" extension, which permits misaligned AMOs and specifies their semantics, has been defined.
- The "Ztso" extension, which enforces a stricter memory consistency model than RVWMO, has been defined.
- Improvements to the description and commentary.
- Defined the term IALIGN as shorthand to describe the instruction-address alignment constraint.
- Removed text of P extension chapter as now superseded by active task group documents.
- Removed text of V extension chapter as now superseded by separate vector extension draft document.
Preface to Document Version 2.2
+This is version 2.2 of the document describing the RISC-V user-level +architecture. The document contains the following versions of the RISC-V +ISA modules:
| Base | Version | Frozen? |
|---|---|---|
| RV32I | 2.0 | Y |
| RV32E | 1.9 | N |
| RV64I | 2.0 | Y |
| RV128I | 1.7 | N |

| Extension | Version | Frozen? |
|---|---|---|
| M | 2.0 | Y |
| A | 2.0 | Y |
| F | 2.0 | Y |
| D | 2.0 | Y |
| Q | 2.0 | Y |
| L | 0.0 | N |
| C | 2.0 | Y |
| B | 0.0 | N |
| J | 0.0 | N |
| T | 0.0 | N |
| P | 0.1 | N |
| V | 0.7 | N |
| N | 1.1 | N |
To date, no parts of the standard have been officially ratified by the +RISC-V Foundation, but the components labeled "frozen" above are not +expected to change during the ratification process beyond resolving +ambiguities and holes in the specification.
+The major changes in this version of the document include:
- The previous version of this document was released under a Creative Commons Attribution 4.0 International License by the original authors, and this and future versions of this document will be released under the same license.
- Rearranged chapters to put all extensions first in canonical order.
- Improvements to the description and commentary.
- Modified implicit hinting suggestion on JALR to support more efficient macro-op fusion of LUI/JALR and AUIPC/JALR pairs.
- Clarification of constraints on load-reserved/store-conditional sequences.
- A new table of control and status register (CSR) mappings.
- Clarified purpose and behavior of high-order bits of fcsr.
- Corrected the description of the FNMADD.fmt and FNMSUB.fmt instructions, which had suggested the incorrect sign of a zero result.
- Instructions FMV.S.X and FMV.X.S were renamed to FMV.W.X and FMV.X.W respectively to be more consistent with their semantics, which did not change. The old names will continue to be supported in the tools.
- Specified behavior of narrower (<FLEN) floating-point values held in wider f registers using the NaN-boxing model.
- Defined the exception behavior of FMA(∞, 0, qNaN).
- Added note indicating that the P extension might be reworked into an integer packed-SIMD proposal for fixed-point operations using the integer registers.
- A draft proposal of the V vector instruction-set extension.
- An early draft proposal of the N user-level traps extension.
- An expanded pseudoinstruction listing.
- Removal of the calling convention chapter, which has been superseded by the RISC-V ELF psABI Specification (RISC-V ELF PsABI Specification, n.d.).
- The C extension has been frozen and renumbered version 2.0.
Preface to Document Version 2.1
This is version 2.1 of the document describing the RISC-V user-level architecture. Note the frozen user-level ISA base and extensions IMAFDQ version 2.0 have not changed from the previous version of this document (Waterman et al., 2014), but some specification holes have been fixed and the documentation has been improved. Some changes have been made to the software conventions.

- Numerous additions and improvements to the commentary sections.
- Separate version numbers for each chapter.
- Modification to long instruction encodings >64 bits to avoid moving the rd specifier in very long instruction formats.
- CSR instructions are now described in the base integer format where the counter registers are introduced, as opposed to only being introduced later in the floating-point section (and the companion privileged architecture manual).
- The SCALL and SBREAK instructions have been renamed to ECALL and EBREAK, respectively. Their encoding and functionality are unchanged.
- Clarification of floating-point NaN handling, and a new canonical NaN value.
- Clarification of values returned by floating-point to integer conversions that overflow.
- Clarification of LR/SC allowed successes and required failures, including use of compressed instructions in the sequence.
- A new RV32E base ISA proposal for reduced integer register counts, supports MAC extensions.
- A revised calling convention.
- Relaxed stack alignment for soft-float calling convention, and description of the RV32E calling convention.
- A revised proposal for the C compressed extension, version 1.9.
Preface to Version 2.0
+This is the second release of the user ISA specification, and we intend +the specification of the base user ISA plus general extensions (i.e., +IMAFD) to remain fixed for future development. The following changes +have been made since Version 1.0 (Waterman et al., 2011) of this ISA specification.
- The ISA has been divided into an integer base with several standard extensions.
- The instruction formats have been rearranged to make immediate encoding more efficient.
- The base ISA has been defined to have a little-endian memory system, with big-endian or bi-endian as non-standard variants.
- Load-Reserved/Store-Conditional (LR/SC) instructions have been added in the atomic instruction extension.
- AMOs and LR/SC can support the release consistency model.
- The FENCE instruction provides finer-grain memory and I/O orderings.
- An AMO for fetch-and-XOR (AMOXOR) has been added, and the encoding for AMOSWAP has been changed to make room.
- The AUIPC instruction, which adds a 20-bit upper immediate to the PC, replaces the RDNPC instruction, which only read the current PC value. This results in significant savings for position-independent code.
- The JAL instruction has now moved to the U-Type format with an explicit destination register, and the J instruction has been dropped, being replaced by JAL with rd=x0. This removes the only instruction with an implicit destination register and removes the J-Type instruction format from the base ISA. There is an accompanying reduction in JAL reach, but a significant reduction in base ISA complexity.
- The static hints on the JALR instruction have been dropped. The hints are redundant with the rd and rs1 register specifiers for code compliant with the standard calling convention.
- The JALR instruction now clears the lowest bit of the calculated target address, to simplify hardware and to allow auxiliary information to be stored in function pointers.
- The MFTX.S and MFTX.D instructions have been renamed to FMV.X.S and FMV.X.D, respectively. Similarly, MXTF.S and MXTF.D instructions have been renamed to FMV.S.X and FMV.D.X, respectively.
- The MFFSR and MTFSR instructions have been renamed to FRCSR and FSCSR, respectively. FRRM, FSRM, FRFLAGS, and FSFLAGS instructions have been added to individually access the rounding mode and exception flags subfields of the fcsr.
- The FMV.X.S and FMV.X.D instructions now source their operands from rs1, instead of rs2. This change simplifies datapath design.
- FCLASS.S and FCLASS.D floating-point classify instructions have been added.
- A simpler NaN generation and propagation scheme has been adopted.
- For RV32I, the system performance counters have been extended to 64 bits wide, with separate read access to the upper and lower 32 bits.
- Canonical NOP and MV encodings have been defined.
- Standard instruction-length encodings have been defined for 48-bit, 64-bit, and >64-bit instructions.
- Description of a 128-bit address space variant, RV128, has been added.
- Major opcodes in the 32-bit base instruction format have been allocated for user-defined custom extensions.
- A typographical error that suggested that stores source their data from rd has been corrected to refer to rs2.
1. Introduction
+RISC-V (pronounced "risk-five") is a new instruction-set architecture +(ISA) that was originally designed to support computer architecture +research and education, but which we now hope will also become a +standard free and open architecture for industry implementations. Our +goals in defining RISC-V include:
- A completely open ISA that is freely available to academia and industry.
- A real ISA suitable for direct native hardware implementation, not just simulation or binary translation.
- An ISA that avoids "over-architecting" for a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-order) or implementation technology (e.g., full-custom, ASIC, FPGA), but which allows efficient implementation in any of these.
- An ISA separated into a small base integer ISA, usable by itself as a base for customized accelerators or for educational purposes, and optional standard extensions, to support general-purpose software development.
- Support for the revised 2008 IEEE-754 floating-point standard. (ANSI/IEEE Std 754-2008, IEEE Standard for Floating-Point Arithmetic, 2008)
- An ISA supporting extensive ISA extensions and specialized variants.
- Both 32-bit and 64-bit address space variants for applications, operating system kernels, and hardware implementations.
- An ISA with support for highly parallel multicore or manycore implementations, including heterogeneous multiprocessors.
- Optional variable-length instructions to both expand available instruction encoding space and to support an optional dense instruction encoding for improved performance, static code size, and energy efficiency.
- A fully virtualizable ISA to ease hypervisor development.
- An ISA that simplifies experiments with new privileged architecture designs.
> Commentary on our design decisions is formatted as in this paragraph. This non-normative text can be skipped if the reader is only interested in the specification itself.

> The name RISC-V was chosen to represent the fifth major RISC ISA design from UC Berkeley (RISC-I (Patterson & Séquin, 1981), RISC-II (Katevenis et al., 1983), SOAR (Ungar et al., 1984), and SPUR (Lee et al., 1989) were the first four). We also pun on the use of the Roman numeral "V" to signify "variations" and "vectors", as support for a range of architecture research, including various data-parallel accelerators, is an explicit goal of the ISA design.

The RISC-V ISA is defined avoiding implementation details as much as possible (although commentary is included on implementation-driven decisions) and should be read as the software-visible interface to a wide variety of implementations rather than as the design of a particular hardware artifact. The RISC-V manual is structured in two volumes. This volume covers the design of the base unprivileged instructions, including optional unprivileged ISA extensions. Unprivileged instructions are those that are generally usable in all privilege modes in all privileged architectures, though behavior might vary depending on privilege mode and privilege architecture. The second volume provides the design of the first ("classic") privileged architecture. The manuals use IEC 80000-13:2008 conventions, with a byte of 8 bits.

> In the unprivileged ISA design, we tried to remove any dependence on particular microarchitectural features, such as cache line size, or on privileged architecture details, such as page translation. This is both for simplicity and to allow maximum flexibility for alternative microarchitectures or alternative privileged architectures.
1.1. RISC-V Hardware Platform Terminology
+A RISC-V hardware platform can contain one or more RISC-V-compatible +processing cores together with other non-RISC-V-compatible cores, +fixed-function accelerators, various physical memory structures, I/O +devices, and an interconnect structure to allow the components to +communicate. +
+A component is termed a core if it contains an independent instruction +fetch unit. A RISC-V-compatible core might support multiple +RISC-V-compatible hardware threads, or harts, through multithreading. +
+A RISC-V core might have additional specialized instruction-set +extensions or an added coprocessor. We use the term coprocessor to +refer to a unit that is attached to a RISC-V core and is mostly +sequenced by a RISC-V instruction stream, but which contains additional +architectural state and instruction-set extensions, and possibly some +limited autonomy relative to the primary RISC-V instruction stream.
+We use the term accelerator to refer to either a non-programmable +fixed-function unit or a core that can operate autonomously but is +specialized for certain tasks. In RISC-V systems, we expect many +programmable accelerators will be RISC-V-based cores with specialized +instruction-set extensions and/or customized coprocessors. An important +class of RISC-V accelerators are I/O accelerators, which offload I/O +processing tasks from the main application cores. +
+The system-level organization of a RISC-V hardware platform can range +from a single-core microcontroller to a many-thousand-node cluster of +shared-memory manycore server nodes. Even small systems-on-a-chip might +be structured as a hierarchy of multicomputers and/or multiprocessors to +modularize development effort or to provide secure isolation between +subsystems. +
+1.2. RISC-V Software Execution Environments and Harts
+The behavior of a RISC-V program depends on the execution environment in +which it runs. A RISC-V execution environment interface (EEI) defines +the initial state of the program, the number and type of harts in the +environment including the privilege modes supported by the harts, the +accessibility and attributes of memory and I/O regions, the behavior of +all legal instructions executed on each hart (i.e., the ISA is one +component of the EEI), and the handling of any interrupts or exceptions +raised during execution including environment calls. Examples of EEIs +include the Linux application binary interface (ABI), or the RISC-V +supervisor binary interface (SBI). The implementation of a RISC-V +execution environment can be pure hardware, pure software, or a +combination of hardware and software. For example, opcode traps and +software emulation can be used to implement functionality not provided +in hardware. Examples of execution environment implementations include:
+-
+
-
+
"Bare metal" hardware platforms where harts are directly implemented +by physical processor threads and instructions have full access to the +physical address space. The hardware platform defines an execution +environment that begins at power-on reset.
+
+ -
+
RISC-V operating systems that provide multiple user-level execution +environments by multiplexing user-level harts onto available physical +processor threads and by controlling access to memory via virtual +memory.
+
+ -
+
RISC-V hypervisors that provide multiple supervisor-level execution +environments for guest operating systems.
+
+ -
+
RISC-V emulators, such as Spike, QEMU or rv8, which emulate RISC-V +harts on an underlying x86 system, and which can provide either a +user-level or a supervisor-level execution environment.
+
+
+ + | +
+
+
+A bare hardware platform can be considered to define an EEI, where the +accessible harts, memory, and other devices populate the environment, +and the initial state is that at power-on reset. Generally, most +software is designed to use a more abstract interface to the hardware, +as more abstract EEIs provide greater portability across different +hardware platforms. Often EEIs are layered on top of one another, where +one higher-level EEI uses another lower-level EEI. + |
+
+From the perspective of software running in a given execution +environment, a hart is a resource that autonomously fetches and executes +RISC-V instructions within that execution environment. In this respect, +a hart behaves like a hardware thread resource even if time-multiplexed +onto real hardware by the execution environment. Some EEIs support the +creation and destruction of additional harts, for example, via +environment calls to fork new harts.
+The execution environment is responsible for ensuring the eventual +forward progress of each of its harts. For a given hart, that +responsibility is suspended while the hart is exercising a mechanism +that explicitly waits for an event, such as the wait-for-interrupt +instruction defined in Volume II of this specification; and that +responsibility ends if the hart is terminated. The following events +constitute forward progress:
+-
+
-
+
The retirement of an instruction.
+
+ -
+
A trap, as defined in Section 1.6.
+
+ -
+
Any other event defined by an extension to constitute forward +progress.
+
+
+ + | +
+
+
+The term hart was introduced in the work on Lithe (Pan et al., 2009) and (Pan et al., 2010) to provide a term to +represent an abstract execution resource as opposed to a software thread +programming abstraction. +
+
+The important distinction between a hardware thread (hart) and a +software thread context is that the software running inside an execution +environment is not responsible for causing progress of each of its +harts; that is the responsibility of the outer execution environment. So +the environment’s harts operate like hardware threads from the +perspective of the software inside the execution environment. +
+
+An execution environment implementation might time-multiplex a set of +guest harts onto fewer host harts provided by its own execution +environment but must do so in a way that guest harts operate like +independent hardware threads. In particular, if there are more guest +harts than host harts then the execution environment must be able to +preempt the guest harts and must not wait indefinitely for guest +software on a guest hart to "yield" control of the guest hart. + |
+
1.3. RISC-V ISA Overview
+A RISC-V ISA is defined as a base integer ISA, which must be present in +any implementation, plus optional extensions to the base ISA. The base +integer ISAs are very similar to that of the early RISC processors +except with no branch delay slots and with support for optional +variable-length instruction encodings. A base is carefully restricted to +a minimal set of instructions sufficient to provide a reasonable target +for compilers, assemblers, linkers, and operating systems (with +additional privileged operations), and so provides a convenient ISA and +software toolchain "skeleton" around which more customized processor +ISAs can be built.
+Although it is convenient to speak of the RISC-V ISA, RISC-V is +actually a family of related ISAs, of which there are currently four +base ISAs. Each base integer instruction set is characterized by the +width of the integer registers and the corresponding size of the address +space and by the number of integer registers. There are two primary base +integer variants, RV32I and RV64I, described in +Chapter 2 and Chapter 4, which provide 32-bit +or 64-bit address spaces respectively. We use the term XLEN to refer to +the width of an integer register in bits (either 32 or 64). +Chapter 6 describes the RV32E and RV64E subset variants of the +RV32I or RV64I base instruction sets respectively, which have been added to support small +microcontrollers, and which have half the number of integer registers. +Chapter 8 sketches a future RV128I variant of the +base integer instruction set supporting a flat 128-bit address space +(XLEN=128). The base integer instruction sets use a two’s-complement +representation for signed integer values.
++ + | +
+
+
+Although 64-bit address spaces are a requirement for larger systems, we +believe 32-bit address spaces will remain adequate for many embedded and +client devices for decades to come and will be desirable to lower memory +traffic and energy consumption. In addition, 32-bit address spaces are +sufficient for educational purposes. A larger flat 128-bit address space +might eventually be required, so we ensured this could be accommodated +within the RISC-V ISA framework. + |
+
+ + | +
+
+
+The four base ISAs in RISC-V are treated as distinct base ISAs. A common +question is why is there not a single ISA, and in particular, why is +RV32I not a strict subset of RV64I? Some earlier ISA designs (SPARC, +MIPS) adopted a strict superset policy when increasing address space +size to support running existing 32-bit binaries on new 64-bit hardware. +
+
+The main advantage of explicitly separating base ISAs is that each base +ISA can be optimized for its needs without requiring to support all the +operations needed for other base ISAs. For example, RV64I can omit +instructions and CSRs that are only needed to cope with the narrower +registers in RV32I. The RV32I variants can use encoding space otherwise +reserved for instructions only required by wider address-space variants. +
+
+The main disadvantage of not treating the design as a single ISA is that +it complicates the hardware needed to emulate one base ISA on another +(e.g., RV32I on RV64I). However, differences in addressing and +illegal-instruction traps generally mean some mode switch would be required in +hardware in any case even with full superset instruction encodings, and +the different RISC-V base ISAs are similar enough that supporting +multiple versions is relatively low cost. Although some have proposed +that the strict superset design would allow legacy 32-bit libraries to +be linked with 64-bit code, this is impractical in practice, even with +compatible encodings, due to the differences in software calling +conventions and system-call interfaces. +
+
+The RISC-V privileged architecture provides fields in
+
+A related question is why there is a different encoding for 32-bit adds +in RV32I (ADD) and RV64I (ADDW)? The ADDW opcode could be used for +32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of the +existing design which uses the same opcode ADD for 32-bit adds in RV32I +and 64-bit adds in RV64I with a different opcode ADDW for 32-bit adds in +RV64I. This would also be more consistent with the use of the same LW +opcode for 32-bit load in both RV32I and RV64I. The very first versions +of RISC-V ISA did have a variant of this alternate design, but the +RISC-V design was changed to the current choice in January 2011. Our +focus was on supporting 32-bit integers in the 64-bit ISA not on +providing compatibility with the 32-bit ISA, and the motivation was to +remove the asymmetry that arose from having not all opcodes in RV32I +have a *W suffix (e.g., ADDW, but AND not ANDW). In hindsight, this was +perhaps not well-justified and a consequence of designing both ISAs at +the same time as opposed to adding one later to sit on top of another, +and also from a belief we had to fold platform requirements into the ISA +spec which would imply that all the RV32I instructions would have been +required in RV64I. It is too late to change the encoding now, but this +is also of little practical consequence for the reasons stated above. +
+
+It has been noted we could enable the *W variants as an extension to +RV32I systems to provide a common encoding across RV64I and a future +RV32 variant. + |
+
RISC-V has been designed to support extensive customization and +specialization. Each base integer ISA can be extended with one or more +optional instruction-set extensions. An extension may be categorized as +either standard, custom, or non-conforming. For this purpose, we divide +each RISC-V instruction-set encoding space (and related encoding spaces +such as the CSRs) into three disjoint categories: standard, +reserved, and custom. Standard extensions and encodings are defined +by RISC-V International; any extensions not defined by RISC-V International are +non-standard. Each base ISA and its standard extensions use only +standard encodings, and shall not conflict with each other in their uses +of these encodings. Reserved encodings are currently not defined but are +saved for future standard extensions; once thus used, they become +standard encodings. Custom encodings shall never be used for standard +extensions and are made available for vendor-specific non-standard +extensions. Non-standard extensions are either custom extensions, that +use only custom encodings, or non-conforming extensions, that use any +standard or reserved encoding. Instruction-set extensions are generally +shared but may provide slightly different functionality depending on the +base ISA. Chapter 38 describes various ways +of extending the RISC-V ISA. We have also developed a naming convention +for RISC-V base instructions and instruction-set extensions, described +in detail in Chapter 39.
+To support more general software development, a set of standard +extensions are defined to provide integer multiply/divide, atomic +operations, and single and double-precision floating-point arithmetic. +The base integer ISA is named "I" (prefixed by RV32 or RV64 depending +on integer register width), and contains integer computational +instructions, integer loads, integer stores, and control-flow +instructions. The standard integer multiplication and division extension +is named "M", and adds instructions to multiply and divide values held +in the integer registers. The standard atomic instruction extension, +denoted by "A", adds instructions that atomically read, modify, and +write memory for inter-processor synchronization. The standard +single-precision floating-point extension, denoted by "F", adds +floating-point registers, single-precision computational instructions, +and single-precision loads and stores. The standard double-precision +floating-point extension, denoted by "D", expands the floating-point +registers, and adds double-precision computational instructions, loads, +and stores. The standard "C" compressed instruction extension provides +narrower 16-bit forms of common instructions.
+Beyond the base integer ISA and these standard extensions, we believe +it is rare that a new instruction will provide a significant benefit for +all applications, although it may be very beneficial for a certain +domain. As energy efficiency concerns are forcing greater +specialization, we believe it is important to simplify the required +portion of an ISA specification. Whereas other architectures usually +treat their ISA as a single entity, which changes to a new version as +instructions are added over time, RISC-V will endeavor to keep the base +and each standard extension constant over time, and instead layer new +instructions as further optional extensions. For example, the base +integer ISAs will continue as fully supported standalone ISAs, +regardless of any subsequent extensions.
+1.4. Memory
A RISC-V hart has a single byte-addressable address space of 2^XLEN bytes for all memory accesses. A word of memory is defined as 32 bits (4 bytes). Correspondingly, a halfword is 16 bits (2 bytes), a doubleword is 64 bits (8 bytes), and a quadword is 128 bits (16 bytes). The memory address space is circular, so that the byte at address 2^XLEN−1 is adjacent to the byte at address zero. Accordingly, memory address computations done by the hardware ignore overflow and instead wrap around modulo 2^XLEN.
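As a minimal sketch of this modulo arithmetic (illustrative only, assuming XLEN=32; the function name below is ours and is not defined by the ISA), an effective address computed in unsigned XLEN-bit arithmetic wraps around the address space automatically:

```c
#include <stdint.h>

/* Illustrative sketch, assuming XLEN=32: unsigned 32-bit arithmetic in C
 * wraps modulo 2^32, matching the circular address space described above. */
static uint32_t rv32_effective_address(uint32_t base, int32_t offset)
{
    /* e.g., base = 0xFFFFFFFF and offset = 1 yield address 0 */
    return base + (uint32_t)offset;
}
```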
+The execution environment determines the mapping of hardware resources +into a hart’s address space. Different address ranges of a hart’s +address space may (1) contain main memory, or +(2) contain one or more I/O devices. Reads and writes of I/O devices +may have visible side effects, but accesses to main memory cannot. +Vacant address ranges are not a separate category but can be represented as +either main memory or I/O regions that are not accessible. +Although it is possible for the execution environment to call everything +in a hart’s address space an I/O device, it is usually expected that +some portion will be specified as main memory.
+When a RISC-V platform has multiple harts, the address spaces of any two +harts may be entirely the same, or entirely different, or may be partly +different but sharing some subset of resources, mapped into the same or +different address ranges.
++ + | +
+
+
+For a purely "bare metal" environment, all harts may see an identical +address space, accessed entirely by physical addresses. However, when +the execution environment includes an operating system employing address +translation, it is common for each hart to be given a virtual address +space that is largely or entirely its own. + |
+
Executing each RISC-V machine instruction entails one or more memory +accesses, subdivided into implicit and explicit accesses. For each +instruction executed, an implicit memory read (instruction fetch) is +done to obtain the encoded instruction to execute. Many RISC-V +instructions perform no further memory accesses beyond instruction +fetch. Specific load and store instructions perform an explicit read +or write of memory at an address determined by the instruction. The +execution environment may dictate that instruction execution performs +other implicit memory accesses (such as to implement address +translation) beyond those documented for the unprivileged ISA.
+The execution environment determines what portions of the +address space are accessible for each kind of memory access. For +example, the set of locations that can be implicitly read for +instruction fetch may or may not have any overlap with the set of +locations that can be explicitly read by a load instruction; and the set +of locations that can be explicitly written by a store instruction may +be only a subset of locations that can be read. Ordinarily, if an +instruction attempts to access memory at an inaccessible address, an +exception is raised for the instruction.
+Except when specified otherwise, implicit reads that do not raise an +exception and that have no side effects may occur arbitrarily early and +speculatively, even before the machine could possibly prove that the +read will be needed. For instance, a valid implementation could attempt +to read all of main memory at the earliest opportunity, cache as many +fetchable (executable) bytes as possible for later instruction fetches, +and avoid reading main memory for instruction fetches ever again. To +ensure that certain implicit reads are ordered only after writes to the +same memory locations, software must execute specific fence or +cache-control instructions defined for this purpose (such as the FENCE.I +instruction defined in Chapter 6). +
+The memory accesses (implicit or explicit) made by a hart may appear to +occur in a different order as perceived by another hart or by any other +agent that can access the same memory. This perceived reordering of +memory accesses is always constrained, however, by the applicable memory +consistency model. The default memory consistency model for RISC-V is +the RISC-V Weak Memory Ordering (RVWMO), defined in +Chapter 18 and in appendices. Optionally, +an implementation may adopt the stronger model of Total Store Ordering, +as defined in Chapter 19. The execution environment +may also add constraints that further limit the perceived reordering of +memory accesses. Since the RVWMO model is the weakest model allowed for +any RISC-V implementation, software written for this model is compatible +with the actual memory consistency rules of all RISC-V implementations. +As with implicit reads, software must execute fence or cache-control +instructions to ensure specific ordering of memory accesses beyond the +requirements of the assumed memory consistency model and execution +environment.
+1.5. Base Instruction-Length Encoding
+The base RISC-V ISA has fixed-length 32-bit instructions that must be +naturally aligned on 32-bit boundaries. However, the standard RISC-V +encoding scheme is designed to support ISA extensions with +variable-length instructions, where each instruction can be any number +of 16-bit instruction parcels in length and parcels are naturally +aligned on 16-bit boundaries. The standard compressed ISA extension +described in Chapter 28 reduces code size by +providing compressed 16-bit instructions and relaxes the alignment +constraints to allow all instructions (16 bit and 32 bit) to be aligned +on any 16-bit boundary to improve code density.
+We use the term IALIGN (measured in bits) to refer to the +instruction-address alignment constraint the implementation enforces. +IALIGN is 32 bits in the base ISA, but some ISA extensions, including +the compressed ISA extension, relax IALIGN to 16 bits. IALIGN may not +take on any value other than 16 or 32. +
+We use the term ILEN (measured in bits) to refer to the maximum +instruction length supported by an implementation, and which is always a +multiple of IALIGN. For implementations supporting only a base +instruction set, ILEN is 32 bits. Implementations supporting longer +instructions have larger values of ILEN.
Table 1 illustrates the standard RISC-V instruction-length encoding convention. All the 32-bit instructions in the base ISA have their lowest two bits set to 11. The optional compressed 16-bit instruction-set extensions have their lowest two bits equal to 00, 01, or 10.
1.5.1. Expanded Instruction-Length Encoding
+A portion of the 32-bit instruction-encoding space has been tentatively +allocated for instructions longer than 32 bits. The entirety of this +space is reserved at this time, and the following proposal for encoding +instructions longer than 32 bits is not considered frozen. +
Standard instruction-set extensions encoded with more than 32 bits have additional low-order bits set to 1, with the conventions for 48-bit and 64-bit lengths shown in Table 1. Instruction lengths between 80 bits and 176 bits are encoded using a 3-bit field in bits [14:12] giving the number of 16-bit words in addition to the first 5×16-bit words. The encoding with bits [14:12] set to "111" is reserved for future longer instruction encodings.

| Byte address: base+4 | base+2 | base | Instruction length |
|---|---|---|---|
|  |  | xxxxxxxxxxxxxxaa | 16-bit (aa≠11) |
|  | xxxxxxxxxxxxxxxx | xxxxxxxxxxxbbb11 | 32-bit (bbb≠111) |
| xxxx | xxxxxxxxxxxxxxxx | xxxxxxxxxx011111 | 48-bit |
| xxxx | xxxxxxxxxxxxxxxx | xxxxxxxxx0111111 | 64-bit |
| xxxx | xxxxxxxxxxxxxxxx | xnnnxxxxx1111111 | (80+16*nnn)-bit, nnn≠111 |
| xxxx | xxxxxxxxxxxxxxxx | x111xxxxx1111111 | Reserved for ≥192-bits |
> Given the code size and energy savings of a compressed format, we wanted to build in support for a compressed format to the ISA encoding scheme rather than adding this as an afterthought, but to allow simpler implementations we didn’t want to make the compressed format mandatory. We also wanted to optionally allow longer instructions to support experimentation and larger instruction-set extensions. Although our encoding convention required a tighter encoding of the core RISC-V ISA, this has several beneficial effects.
>
> An implementation of the standard IMAFD ISA need only hold the most-significant 30 bits in instruction caches (a 6.25% saving). On instruction cache refills, any instructions encountered with either low bit clear should be recoded into illegal 30-bit instructions before storing in the cache to preserve illegal-instruction exception behavior.
>
> Perhaps more importantly, by condensing our base ISA into a subset of the 32-bit instruction word, we leave more space available for non-standard and custom extensions. In particular, the base RV32I ISA uses less than 1/8 of the encoding space in the 32-bit instruction word. As described in Chapter 38, an implementation that does not require support for the standard compressed instruction extension can map 3 additional non-conforming 30-bit instruction spaces into the 32-bit fixed-width format, while preserving support for standard ≥32-bit instruction-set extensions. Further, if the implementation also does not need instructions >32-bits in length, it can recover a further four major opcodes for non-conforming extensions.
Encodings with bits [15:0] all zeros are defined as illegal +instructions. These instructions are considered to be of minimal length: +16 bits if any 16-bit instruction-set extension is present, otherwise 32 +bits. The encoding with bits [ILEN-1:0] all ones is also illegal; this +instruction is considered to be ILEN bits long.
++ + | +
+
+
+We consider it a feature that any length of instruction containing all +zero bits is not legal, as this quickly traps erroneous jumps into +zeroed memory regions. Similarly, we also reserve the instruction +encoding containing all ones to be an illegal instruction, to catch the +other common pattern observed with unprogrammed non-volatile memory +devices, disconnected memory buses, or broken memory devices. +
+
+Software can rely on a naturally aligned 32-bit word containing zero to +act as an illegal instruction on all RISC-V implementations, to be used +by software where an illegal instruction is explicitly desired. Defining +a corresponding known illegal value for all ones is more difficult due +to the variable-length encoding. Software cannot generally use the +illegal value of ILEN bits of all 1s, as software might not know ILEN +for the eventual target machine (e.g., if software is compiled into a +standard binary library used by many different machines). Defining a +32-bit word of all ones as illegal was also considered, as all machines +must support a 32-bit instruction size, but this requires the +instruction-fetch unit on machines with ILEN >32 report an +illegal-instruction exception rather than an access-fault exception when +such an instruction borders a protection boundary, complicating +variable-instruction-length fetch and decode. + |
+
RISC-V base ISAs have either little-endian or big-endian memory systems, with the privileged architecture further defining bi-endian operation. Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness. Parcels forming one instruction are stored at increasing halfword addresses, with the lowest-addressed parcel holding the lowest-numbered bits in the instruction specification.
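As a concrete sketch of this layout (illustrative only; the helper name and the choice of example instruction are ours), a 32-bit instruction is written to memory as two 16-bit little-endian parcels at increasing halfword addresses:

```c
#include <stdint.h>

/* Illustrative sketch: store one 32-bit instruction as two 16-bit
 * little-endian parcels, lowest-addressed parcel first, independent of
 * the data endianness of the memory system. */
static void store_rv32_insn(uint8_t *mem, uint32_t insn)
{
    uint16_t lo = (uint16_t)(insn & 0xFFFFu);   /* parcel holding bits [15:0]  */
    uint16_t hi = (uint16_t)(insn >> 16);       /* parcel holding bits [31:16] */

    mem[0] = (uint8_t)(lo & 0xFF);              /* byte at base                */
    mem[1] = (uint8_t)(lo >> 8);                /* byte at base+1              */
    mem[2] = (uint8_t)(hi & 0xFF);              /* byte at base+2              */
    mem[3] = (uint8_t)(hi >> 8);                /* byte at base+3              */
}
```

For example, storing 0x00000093 (ADDI x1, x0, 0) yields the byte sequence 93 00 00 00 starting at the base address, regardless of whether the surrounding memory system is little- or big-endian for data.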
++ + | +
+
+
+We originally chose little-endian byte ordering for the RISC-V memory +system because little-endian systems are currently dominant commercially +(all x86 systems; iOS, Android, and Windows for ARM). A minor point is +that we have also found little-endian memory systems to be more natural +for hardware designers. However, certain application areas, such as IP +networking, operate on big-endian data structures, and certain legacy +code bases have been built assuming big-endian processors, so we have +defined big-endian and bi-endian variants of RISC-V. +
+
+We have to fix the order in which instruction parcels are stored in +memory, independent of memory system endianness, to ensure that the +length-encoding bits always appear first in halfword address order. This +allows the length of a variable-length instruction to be quickly +determined by an instruction-fetch unit by examining only the first few +bits of the first 16-bit instruction parcel. +
+
+We further make the instruction parcels themselves little-endian to +decouple the instruction encoding from the memory system endianness +altogether. This design benefits both software tooling and bi-endian +hardware. Otherwise, for instance, a RISC-V assembler or disassembler +would always need to know the intended active endianness, despite that +in bi-endian systems, the endianness mode might change dynamically +during execution. In contrast, by giving instructions a fixed +endianness, it is sometimes possible for carefully written software to +be endianness-agnostic even in binary form, much like +position-independent code. +
+
+The choice to have instructions be only little-endian does have +consequences, however, for RISC-V software that encodes or decodes +machine instructions. Big-endian JIT compilers, for example, must swap +the byte order when storing to instruction memory. +
+
+Once we had decided to fix on a little-endian instruction encoding, this +naturally led to placing the length-encoding bits in the LSB positions +of the instruction format to avoid breaking up opcode fields. + |
+
1.6. Exceptions, Traps, and Interrupts
+We use the term exception to refer to an unusual condition occurring +at run time associated with an instruction in the current RISC-V hart. +We use the term interrupt to refer to an external asynchronous event +that may cause a RISC-V hart to experience an unexpected transfer of +control. We use the term trap to refer to the transfer of control to a +trap handler caused by either an exception or an interrupt. + + +
+The instruction descriptions in following chapters describe conditions +that can raise an exception during execution. The general behavior of +most RISC-V EEIs is that a trap to some handler occurs when an exception +is signaled on an instruction (except for floating-point exceptions, +which, in the standard floating-point extensions, do not cause traps). +The manner in which interrupts are generated, routed to, and enabled by +a hart depends on the EEI.
++ + | +
+
+
+Our use of "exception" and "trap" is compatible with that in the +IEEE-754 floating-point standard. + |
+
How traps are handled and made visible to software running on the hart +depends on the enclosing execution environment. From the perspective of +software running inside an execution environment, traps encountered by a +hart at runtime can have four different effects:
- Contained Trap: The trap is visible to, and handled by, software running inside the execution environment. For example, in an EEI providing both supervisor and user mode on harts, an ECALL by a user-mode hart will generally result in a transfer of control to a supervisor-mode handler running on the same hart. Similarly, in the same environment, when a hart is interrupted, an interrupt handler will be run in supervisor mode on the hart.
- Requested Trap: The trap is a synchronous exception that is an explicit call to the execution environment requesting an action on behalf of software inside the execution environment. An example is a system call. In this case, execution may or may not resume on the hart after the requested action is taken by the execution environment. For example, a system call could remove the hart or cause an orderly termination of the entire execution environment.
- Invisible Trap: The trap is handled transparently by the execution environment and execution resumes normally after the trap is handled. Examples include emulating missing instructions, handling non-resident page faults in a demand-paged virtual-memory system, or handling device interrupts for a different job in a multiprogrammed machine. In these cases, the software running inside the execution environment is not aware of the trap (we ignore timing effects in these definitions).
- Fatal Trap: The trap represents a fatal failure and causes the execution environment to terminate execution. Examples include failing a virtual-memory page-protection check or allowing a watchdog timer to expire. Each EEI should define how execution is terminated and reported to an external environment.
Table 2 shows the characteristics of each kind of trap.
|                        | Contained | Requested | Invisible | Fatal |
|------------------------|-----------|-----------|-----------|-------|
| Execution terminates   | No        | No^1      | No        | Yes   |
| Software is oblivious  | No        | No        | Yes       | Yes^2 |
| Handled by environment | No        | Yes       | Yes       | Yes   |

^1 Termination may be requested.
^2 Imprecise fatal traps might be observable by software.
The EEI defines for each trap whether it is handled precisely, though +the recommendation is to maintain preciseness where possible. Contained +and requested traps can be observed to be imprecise by software inside +the execution environment. Invisible traps, by definition, cannot be +observed to be precise or imprecise by software running inside the +execution environment. Fatal traps can be observed to be imprecise by +software running inside the execution environment, if known-errorful +instructions do not cause immediate termination.
+Because this document describes unprivileged instructions, traps are +rarely mentioned. Architectural means to handle contained traps are +defined in the privileged architecture manual, along with other features +to support richer EEIs. Unprivileged instructions that are defined +solely to cause requested traps are documented here. Invisible traps +are, by their nature, out of scope for this document. Instruction +encodings that are not defined here and not defined by some other means +may cause a fatal trap.
+1.7. UNSPECIFIED Behaviors and Values
+The architecture fully describes what implementations must do and any +constraints on what they may do. In cases where the architecture +intentionally does not constrain implementations, the term UNSPECIFIED is +explicitly used. + +
+The term UNSPECIFIED refers to a behavior or value that is intentionally +unconstrained. The definition of these behaviors or values is open to +extensions, platform standards, or implementations. Extensions, platform +standards, or implementation documentation may provide normative content +to further constrain cases that the base architecture defines as UNSPECIFIED.
+Like the base architecture, extensions should fully describe allowable +behavior and values and use the term UNSPECIFIED for cases that are intentionally +unconstrained. These cases may be constrained or defined by other +extensions, platform standards, or implementations.
+2. RV32I Base Integer Instruction Set, Version 2.1
+This chapter describes the RV32I base integer instruction set.
++ + | +
+
+
+RV32I was designed to be sufficient to form a compiler target and to +support modern operating system environments. The ISA was also designed +to reduce the hardware required in a minimal implementation. RV32I +contains 40 unique instructions, though a simple implementation might +cover the ECALL/EBREAK instructions with a single SYSTEM hardware +instruction that always traps and might be able to implement the FENCE +instruction as a NOP, reducing base instruction count to 38 total. RV32I +can emulate almost any other ISA extension (except the A extension, +which requires additional hardware support for atomicity). +
+
+In practice, a hardware implementation including the machine-mode +privileged architecture will also require the 6 CSR instructions. +
+
+Subsets of the base integer ISA might be useful for pedagogical +purposes, but the base has been defined such that there should be little +incentive to subset a real hardware implementation beyond omitting +support for misaligned memory accesses and treating all SYSTEM +instructions as a single trap. + |
+
+ + | +
+
+
+The standard RISC-V assembly language syntax is documented in the +Assembly Programmer’s Manual (RISC-V Assembly Programmer’s Manual, n.d.). + |
+
+ + | +
+
+
+Most of the commentary for RV32I also applies to the RV64I base. + |
+
2.1. Programmers' Model for Base Integer ISA
+Table 3 shows the unprivileged state for the base
+integer ISA. For RV32I, the 32 x
registers are each 32 bits wide,
+i.e., XLEN=32
. Register x0
is hardwired with all bits equal to 0.
+General purpose registers x1-x31
hold values that various
+instructions interpret as a collection of Boolean values, or as two’s
+complement signed binary integers or unsigned binary integers.
There is one additional unprivileged register: the program counter pc
+holds the address of the current instruction.
Table 3. RV32I base unprivileged integer register state (the original figure depicts each register as an XLEN-bit-wide box):

| Registers                | Width |
|--------------------------|-------|
| x0/zero, x1, x2, ... x31 | XLEN  |
| pc                       | XLEN  |
+ + | +
+
+
There is no dedicated stack pointer or subroutine return address link
register in the Base Integer ISA; the instruction encoding allows any
x register to be used as a link register, but the standard software
calling convention uses register x1 to hold the return address for a
call, with register x5 available as an alternate link register. The
standard calling convention uses register x2 as the stack pointer.

Hardware might choose to accelerate function calls and returns that use
x1 or x5. See the descriptions of the JAL and JALR instructions.

The optional compressed 16-bit instruction format is designed around the
assumption that x1 is the return address register and x2 is the stack
pointer.
+The number of available architectural registers can have large impacts +on code size, performance, and energy consumption. Although 16 registers +would arguably be sufficient for an integer ISA running compiled code, +it is impossible to encode a complete ISA with 16 registers in 16-bit +instructions using a 3-address format. Although a 2-address format would +be possible, it would increase instruction count and lower efficiency. +We wanted to avoid intermediate instruction sizes (such as Xtensa’s +24-bit instructions) to simplify base hardware implementations, and once +a 32-bit instruction size was adopted, it was straightforward to support +32 integer registers. A larger number of integer registers also helps +performance on high-performance code, where there can be extensive use +of loop unrolling, software pipelining, and cache tiling. +
+
+For these reasons, we chose a conventional size of 32 integer registers +for RV32I. Dynamic register usage tends to be dominated by a few +frequently accessed registers, and regfile implementations can be +optimized to reduce access energy for the frequently accessed +registers (Tseng & Asanović, 2000). The optional compressed 16-bit instruction format mostly +only accesses 8 registers and hence can provide a dense instruction +encoding, while additional instruction-set extensions could support a +much larger register space (either flat or hierarchical) if desired. +
+
+For resource-constrained embedded applications, we have defined the +RV32E subset, which only has 16 registers +(Chapter 3). + |
+
2.2. Base Instruction Formats
+In the base RV32I ISA, there are four core instruction formats
+(R/I/S/U), as shown in Base instruction formats. All are a fixed 32
+bits in length. The base ISA has IALIGN=32
, meaning that instructions must be aligned on a four-byte boundary in memory. An
+instruction-address-misaligned exception is generated on a taken branch
+or unconditional jump if the target address is not IALIGN-bit
aligned.
+This exception is reported on the branch or jump instruction, not on the
+target instruction. No instruction-address-misaligned exception is
+generated for a conditional branch that is not taken.
+ + | +
+
+
+The alignment constraint for base ISA instructions is relaxed to a +two-byte boundary when instruction extensions with 16-bit lengths or +other odd multiples of 16-bit lengths are added (i.e., IALIGN=16). +
+
+Instruction-address-misaligned exceptions are reported on the branch or +jump that would cause instruction misalignment to help debugging, and to +simplify hardware design for systems with IALIGN=32, where these are the +only places where misalignment can occur. + |
+
The behavior upon decoding a reserved instruction is UNSPECIFIED.
++ + | +
+
+
+Some platforms may require that opcodes reserved for standard use raise +an illegal-instruction exception. Other platforms may permit reserved +opcode space be used for non-conforming extensions. + |
+
The RISC-V ISA keeps the source (rs1 and rs2) and destination (rd) +registers at the same position in all formats to simplify decoding. +Except for the 5-bit immediates used in CSR instructions +(Chapter 7), immediates are always +sign-extended, and are generally packed towards the leftmost available +bits in the instruction and have been allocated to reduce hardware +complexity. In particular, the sign bit for all immediates is always in +bit 31 of the instruction to speed sign-extension circuitry.
+RISC-V base instruction formats. Each immediate subfield is labeled with the bit position (imm[x]) in the immediate value being produced, rather than the bit position within the instruction’s immediate field as is usually done.
++ + | +
+
+
+Decoding register specifiers is usually on the critical paths in +implementations, and so the instruction format was chosen to keep all +register specifiers at the same position in all formats at the expense +of having to move immediate bits across formats (a property shared with +RISC-IV aka. SPUR (Lee et al., 1989)). +
+
+In practice, most immediates are either small or require all XLEN bits. +We chose an asymmetric immediate split (12 bits in regular instructions +plus a special load-upper-immediate instruction with 20 bits) to +increase the opcode space available for regular instructions. +
+
+Immediates are sign-extended because we did not observe a benefit to +using zero extension for some immediates as in the MIPS ISA and wanted +to keep the ISA as simple as possible. + |
+
2.3. Immediate Encoding Variants
+There are a further two variants of the instruction formats (B/J) based +on the handling of immediates, as shown in Base instruction formats immediate variants..
+The only difference between the S and B formats is that the 12-bit +immediate field is used to encode branch offsets in multiples of 2 in +the B format. Instead of shifting all bits in the instruction-encoded +immediate left by one in hardware as is conventionally done, the middle +bits (imm[10:1]) and sign bit stay in fixed positions, while the lowest +bit in S format (inst[7]) encodes a high-order bit in B format.
+Similarly, the only difference between the U and J formats is that the +20-bit immediate is shifted left by 12 bits to form U immediates and by +1 bit to form J immediates. The location of instruction bits in the U +and J format immediates is chosen to maximize overlap with the other +formats and with each other.
+Immediate types shows the immediates produced by +each of the base instruction formats, and is labeled to show which +instruction bit (inst[y]) produces each bit of the immediate value.
The fields are labeled with the instruction bits used to construct their value. Sign extension always uses inst[31].
++ + | +
+
+
+Sign extension is one of the most critical operations on immediates +(particularly for XLEN>32), and in RISC-V the sign bit for +all immediates is always held in bit 31 of the instruction to allow +sign extension to proceed in parallel with instruction decoding. +
+
+Although more complex implementations might have separate adders for +branch and jump calculations and so would not benefit from keeping the +location of immediate bits constant across types of instruction, we +wanted to reduce the hardware cost of the simplest implementations. By +rotating bits in the instruction encoding of B and J immediates instead +of using dynamic hardware muxes to multiply the immediate by 2, we +reduce instruction signal fanout and immediate mux costs by around a +factor of 2. The scrambled immediate encoding will add negligible time +to static or ahead-of-time compilation. For dynamic generation of +instructions, there is some small additional overhead, but the most +common short forward branches have straightforward immediate encodings. + |
+
2.4. Integer Computational Instructions
+Most integer computational instructions operate on XLEN
bits of values
+held in the integer register file. Integer computational instructions
+are either encoded as register-immediate operations using the I-type
+format or as register-register operations using the R-type format. The
+destination is register rd for both register-immediate and
+register-register instructions. No integer computational instructions
+cause arithmetic exceptions.
+ + | +
+
+
+We did not include special instruction-set support for overflow checks
+on integer arithmetic operations in the base instruction set, as many
+overflow checks can be cheaply implemented using RISC-V branches.
+Overflow checking for unsigned addition requires only a single
+additional branch instruction after the addition:
add t0, t1, t2; bltu t0, t1, overflow.
+For signed addition, if one operand’s sign is known, overflow checking
+requires only a single branch after the addition:
addi t0, t1, +imm; blt t0, t1, overflow.
For general signed addition, three additional instructions after the
addition are required, leveraging the observation that the sum should be
less than one of the operands if and only if the other operand is
negative; see the sketch following this note.
+
+In RV64I, checks of 32-bit signed additions can be optimized further by +comparing the results of ADD and ADDW on the operands. + |
+
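For illustration (non-normative), the general signed-addition check described in the note above can be written as the following sequence; the temporary registers t0-t4 and the label are arbitrary choices:

    add  t0, t1, t2        # t0 = t1 + t2 (low XLEN bits, overflow ignored)
    slti t3, t2, 0         # t3 = 1 if the second operand is negative
    slt  t4, t0, t1        # t4 = 1 if the sum is (signed) less than the first operand
    bne  t3, t4, overflow  # overflow occurred iff the two tests disagree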
2.4.1. Integer Register-Immediate Instructions
+ADDI adds the sign-extended 12-bit immediate to register rs1. +Arithmetic overflow is ignored and the result is simply the low XLEN +bits of the result. ADDI rd, rs1, 0 is used to implement the MV rd, +rs1 assembler pseudoinstruction.
+SLTI (set less than immediate) places the value 1 in register rd if +register rs1 is less than the sign-extended immediate when both are +treated as signed numbers, else 0 is written to rd. SLTIU is similar +but compares the values as unsigned numbers (i.e., the immediate is +first sign-extended to XLEN bits then treated as an unsigned number). +Note, SLTIU rd, rs1, 1 sets rd to 1 if rs1 equals zero, otherwise +sets rd to 0 (assembler pseudoinstruction SEQZ rd, rs).
+ANDI, ORI, XORI are logical operations that perform bitwise AND, OR, and +XOR on register rs1 and the sign-extended 12-bit immediate and place +the result in rd. Note, XORI rd, rs1, -1 performs a bitwise logical +inversion of register rs1 (assembler pseudoinstruction NOT rd, rs).
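For illustration (non-normative), the pseudoinstruction expansions mentioned above, with arbitrary register choices:

    mv   t0, t1   # expands to: addi  t0, t1, 0
    seqz t0, t1   # expands to: sltiu t0, t1, 1    (t0 = 1 if t1 == 0, else 0)
    not  t0, t1   # expands to: xori  t0, t1, -1   (bitwise inversion of t1)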
+Shifts by a constant are encoded as a specialization of the I-type +format. The operand to be shifted is in rs1, and the shift amount is +encoded in the lower 5 bits of the I-immediate field. The right shift +type is encoded in bit 30. SLLI is a logical left shift (zeros are +shifted into the lower bits); SRLI is a logical right shift (zeros are +shifted into the upper bits); and SRAI is an arithmetic right shift (the +original sign bit is copied into the vacated upper bits).
+LUI (load upper immediate) is used to build 32-bit constants and uses +the U-type format. LUI places the 32-bit U-immediate value into the +destination register rd, filling in the lowest 12 bits with zeros.
+AUIPC (add upper immediate to pc
) is used to build pc
-relative
+addresses and uses the U-type format. AUIPC forms a 32-bit offset from
+the U-immediate, filling in the lowest 12 bits with zeros, adds this
+offset to the address of the AUIPC instruction, then places the result
+in register rd.
+ + | +
+
+
The assembly syntax for lui and auipc does not represent the lower 12
bits of the U-immediate, which are always zero.
+The AUIPC instruction supports two-instruction sequences to access +arbitrary offsets from the PC for both control-flow transfers and data +accesses. The combination of an AUIPC and the 12-bit immediate in a JALR +can transfer control to any 32-bit PC-relative address, while an AUIPC +plus the 12-bit immediate offset in regular load or store instructions +can access any 32-bit PC-relative data address. +
+
+The current PC can be obtained by setting the U-immediate to 0. Although +a JAL +4 instruction could also be used to obtain the local PC (of the +instruction following the JAL), it might cause pipeline breaks in +simpler microarchitectures or pollute BTB structures in more complex +microarchitectures. + |
+
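For illustration (non-normative), the two-instruction idioms described above; the constant, the symbol name var, and the register choices are arbitrary, and %pcrel_hi/%pcrel_lo are the usual assembler relocation operators:

    # Build the 32-bit constant 0x12345678 in t0
    lui   t0, 0x12345               # t0 = 0x12345000
    addi  t0, t0, 0x678             # t0 = 0x12345678

    # PC-relative load of a word from symbol var
    1: auipc t1, %pcrel_hi(var)     # t1 = pc + (upper 20 bits of the offset to var)
       lw    t2, %pcrel_lo(1b)(t1)  # add the remaining 12 bits as the load offset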
2.4.2. Integer Register-Register Operations
+RV32I defines several arithmetic R-type operations. All operations read +the rs1 and rs2 registers as source operands and write the result +into register rd. The funct7 and funct3 fields select the type of +operation.
+ADD performs the addition of rs1 and rs2. SUB performs the +subtraction of rs2 from rs1. Overflows are ignored and the low XLEN +bits of results are written to the destination rd. SLT and SLTU +perform signed and unsigned compares respectively, writing 1 to rd if +rs1 < rs2, 0 otherwise. Note, SLTU rd, x0, rs2 sets rd to 1 if +rs2 is not equal to zero, otherwise sets rd to zero (assembler +pseudoinstruction SNEZ rd, rs). AND, OR, and XOR perform bitwise +logical operations.
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic +right shifts on the value in register rs1 by the shift amount held in +the lower 5 bits of register rs2.
+2.4.3. NOP Instruction
+The NOP instruction does not change any architecturally visible state,
+except for advancing the pc
and incrementing any applicable
+performance counters. NOP is encoded as ADDI x0, x0, 0.
+ + | +
+
+
+NOPs can be used to align code segments to microarchitecturally +significant address boundaries, or to leave space for inline code +modifications. Although there are many possible ways to encode a NOP, we +define a canonical NOP encoding to allow microarchitectural +optimizations as well as for more readable disassembly output. The other +NOP encodings are made available for HINT Instructions. +
+
+ADDI was chosen for the NOP encoding as this is most likely to take +fewest resources to execute across a range of systems (if not optimized +away in decode). In particular, the instruction only reads one register. +Also, an ADDI functional unit is more likely to be available in a +superscalar design as adds are the most common operation. In particular, +address-generation functional units can execute ADDI using the same +hardware needed for base+offset address calculations, while +register-register ADD or logical/shift operations require additional +hardware. + |
+
2.5. Control Transfer Instructions
+RV32I provides two types of control transfer instructions: unconditional +jumps and conditional branches. Control transfer instructions in RV32I +do not have architecturally visible delay slots.
+If an instruction access-fault or instruction page-fault exception +occurs on the target of a jump or taken branch, the exception is +reported on the target instruction, not on the jump or branch +instruction.
+2.5.1. Unconditional Jumps
The jump and link (JAL) instruction uses the J-type format, where the J-immediate encodes a signed offset in multiples of 2 bytes. The offset is sign-extended and added to the address of the jump instruction to form the jump target address. Jumps can therefore target a ±1 MiB range. JAL stores the address of the instruction following the jump (pc+4) into register rd. The standard software calling convention uses x1 as the return address register and x5 as an alternate link register.
++ + | +
+
+
+The alternate link register supports calling millicode routines (e.g.,
+those to save and restore registers in compressed code) while preserving
the regular return address register. The register x5 was chosen as the
alternate link register as it maps to a temporary in the standard
calling convention and has an encoding that is only one bit different
from the regular link register.
+
Plain unconditional jumps (assembler pseudoinstruction J) are encoded as
+a JAL with rd=x0
.
The indirect jump instruction JALR (jump and link register) uses the
+I-type encoding. The target address is obtained by adding the
+sign-extended 12-bit I-immediate to the register rs1, then setting the
+least-significant bit of the result to zero. The address of the
+instruction following the jump (pc
+4) is written to register rd.
+Register x0
can be used as the destination if the result is not
+required.
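For illustration (non-normative), typical call and return sequences; the labels and the use of a temporary register for the far call are arbitrary choices:

    jal   ra, func                    # near call: ra (x1) = pc+4, pc-relative target within ±1 MiB
    ...
    func:
    ...
    jalr  x0, 0(ra)                   # return (the ret pseudoinstruction)

    # Far call beyond the JAL range, using AUIPC then JALR
    1: auipc t1, %pcrel_hi(far_func)
       jalr  ra, %pcrel_lo(1b)(t1)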
+ + | +
+
+
+The unconditional jump instructions all use PC-relative addressing to
+help support position-independent code. The JALR instruction was defined
+to enable a two-instruction sequence to jump anywhere in a 32-bit
+absolute address range. A LUI instruction can first load rs1 with the
+upper 20 bits of a target address, then JALR can add in the lower bits.
Similarly, AUIPC then JALR can jump anywhere in a 32-bit pc-relative
address range.
+Note that the JALR instruction does not treat the 12-bit immediate as +multiples of 2 bytes, unlike the conditional branch instructions. This +avoids one more immediate format in hardware. In practice, most uses of +JALR will have either a zero immediate or be paired with a LUI or AUIPC, +so the slight reduction in range is not significant. +
+
+Clearing the least-significant bit when calculating the JALR target +address both simplifies the hardware slightly and allows the low bit of +function pointers to be used to store auxiliary information. Although +there is potentially a slight loss of error checking in this case, in +practice jumps to an incorrect instruction address will usually quickly +raise an exception. +
+
When used with a base rs1=x0, JALR can be used to implement a single
instruction subroutine call to the lowest 2 KiB or highest 2 KiB address
region from anywhere in the address space, which could be used to
implement fast calls to a small runtime library.
+
The JAL and JALR instructions will generate an +instruction-address-misaligned exception if the target address is not +aligned to a four-byte boundary.
++ + | +
+
+
+Instruction-address-misaligned exceptions are not possible on machines +that support extensions with 16-bit aligned instructions, such as the +compressed instruction-set extension, C. + |
+
Return-address prediction stacks are a common feature of
+high-performance instruction-fetch units, but require accurate detection
+of instructions used for procedure calls and returns to be effective.
+For RISC-V, hints as to the instructions' usage are encoded implicitly
+via the register numbers used. A JAL instruction should push the return
address onto a return-address stack (RAS) only when rd is x1 or
x5. JALR instructions should push/pop a RAS as shown in Table 4.
| rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action     |
|-------------|--------------|--------|----------------|
| No          | No           | -      | None           |
| No          | Yes          | -      | Pop            |
| Yes         | No           | -      | Push           |
| Yes         | Yes          | No     | Pop, then push |
| Yes         | Yes          | Yes    | Push           |
+ + | +
+
+
+Some other ISAs added explicit hint bits to their indirect-jump +instructions to guide return-address stack manipulation. We use implicit +hinting tied to register numbers and the calling convention to reduce +the encoding space used for these hints. +
+
When two different link registers (x1 and x5) are given as rs1 and rd,
then the RAS is both popped and pushed to support coroutines. If rs1 and
rd are the same link register (either x1 or x5), the RAS is only pushed,
to enable macro-op fusion of the sequences lui ra, imm20; jalr ra,
imm12(ra) and auipc ra, imm20; jalr ra, imm12(ra).
+
2.5.2. Conditional Branches
+All branch instructions use the B-type instruction format. The 12-bit +B-immediate encodes signed offsets in multiples of 2 bytes. The offset +is sign-extended and added to the address of the branch instruction to +give the target address. The conditional branch range is +±4 KiB.
+Branch instructions compare two registers. BEQ and BNE take the branch +if registers rs1 and rs2 are equal or unequal respectively. BLT and +BLTU take the branch if rs1 is less than rs2, using signed and +unsigned comparison respectively. BGE and BGEU take the branch if rs1 +is greater than or equal to rs2, using signed and unsigned comparison +respectively. Note, BGT, BGTU, BLE, and BLEU can be synthesized by +reversing the operands to BLT, BLTU, BGE, and BGEU, respectively.
++ + | +
+
+
+Signed array bounds may be checked with a single BLTU instruction, since +any negative index will compare greater than any nonnegative bound. + |
+
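For illustration (non-normative), the single-comparison bounds check mentioned in the note, with the signed index in a0 and the nonnegative bound in a1; the register choices and label are arbitrary:

    bltu a0, a1, in_bounds   # not taken when a0 >= a1 unsigned: index too large,
                             # or negative and hence huge when reinterpreted as unsigned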
Software should be optimized such that the sequential code path is the +most common path, with less-frequently taken code paths placed out of +line. Software should also assume that backward branches will be +predicted taken and forward branches as not taken, at least the first +time they are encountered. Dynamic predictors should quickly learn any +predictable branch behavior.
+Unlike some other architectures, the RISC-V jump (JAL with rd=x0
)
+instruction should always be used for unconditional branches instead of
+a conditional branch instruction with an always-true condition. RISC-V
+jumps are also PC-relative and support a much wider offset range than
+branches, and will not pollute conditional-branch prediction tables.
+ + | +
+
+
+The conditional branches were designed to include arithmetic comparison +operations between two registers (as also done in PA-RISC, Xtensa, and +MIPS R6), rather than use condition codes (x86, ARM, SPARC, PowerPC), or +to only compare one register against zero (Alpha, MIPS), or two +registers only for equality (MIPS). This design was motivated by the +observation that a combined compare-and-branch instruction fits into a +regular pipeline, avoids additional condition code state or use of a +temporary register, and reduces static code size and dynamic instruction +fetch traffic. Another point is that comparisons against zero require +non-trivial circuit delay (especially after the move to static logic in +advanced processes) and so are almost as expensive as arithmetic +magnitude compares. Another advantage of a fused compare-and-branch +instruction is that branches are observed earlier in the front-end +instruction stream, and so can be predicted earlier. There is perhaps an +advantage to a design with condition codes in the case where multiple +branches can be taken based on the same condition codes, but we believe +this case to be relatively rare. +
+
+We considered but did not include static branch hints in the instruction +encoding. These can reduce the pressure on dynamic predictors, but +require more instruction encoding space and software profiling for best +results, and can result in poor performance if production runs do not +match profiling runs. +
+
+We considered but did not include conditional moves or predicated +instructions, which can effectively replace unpredictable short forward +branches. Conditional moves are the simpler of the two, but are +difficult to use with conditional code that might cause exceptions +(memory accesses and floating-point operations). Predication adds +additional flag state to a system, additional instructions to set and +clear flags, and additional encoding overhead on every instruction. Both +conditional move and predicated instructions add complexity to +out-of-order microarchitectures, adding an implicit third source operand +due to the need to copy the original value of the destination +architectural register into the renamed destination physical register if +the predicate is false. Also, static compile-time decisions to use +predication instead of branches can result in lower performance on +inputs not included in the compiler training set, especially given that +unpredictable branches are rare, and becoming rarer as branch prediction +techniques improve. +
+
+We note that various microarchitectural techniques exist to dynamically +convert unpredictable short forward branches into internally predicated +code to avoid the cost of flushing pipelines on a branch mispredict (Heil & Smith, 1996), (Klauser et al., 1998), (Kim et al., 2005) and +have been implemented in commercial processors (Sinharoy et al., 2011). The simplest techniques +just reduce the penalty of recovering from a mispredicted short forward +branch by only flushing instructions in the branch shadow instead of the +entire fetch pipeline, or by fetching instructions from both sides using +wide instruction fetch or idle instruction fetch slots. More complex +techniques for out-of-order cores add internal predicates on +instructions in the branch shadow, with the internal predicate value +written by the branch instruction, allowing the branch and following +instructions to be executed speculatively and out-of-order with respect +to other code. + |
+
The conditional branch instructions will generate an +instruction-address-misaligned exception if the target address is not +aligned to a four-byte boundary and the branch condition evaluates to +true. If the branch condition evaluates to false, the +instruction-address-misaligned exception will not be raised.
++ + | +
+
+
+Instruction-address-misaligned exceptions are not possible on machines +that support extensions with 16-bit aligned instructions, such as the +compressed instruction-set extension, C. + |
+
2.6. Load and Store Instructions
+RV32I is a load-store architecture, where only load and store
+instructions access memory and arithmetic instructions only operate on
+CPU registers. RV32I provides a 32-bit address space that is
+byte-addressed. The EEI will define what portions of the address space
+are legal to access with which instructions (e.g., some addresses might
+be read only, or support word access only). Loads with a destination of
+x0
must still raise any exceptions and cause any other side effects
+even though the load value is discarded.
The EEI will define whether the memory system is little-endian or +big-endian. In RISC-V, endianness is byte-address invariant.
++ + | +
+
+
+In a system for which endianness is byte-address invariant, the +following property holds: if a byte is stored to memory at some address +in some endianness, then a byte-sized load from that address in any +endianness returns the stored value. +
+
+In a little-endian configuration, multibyte stores write the +least-significant register byte at the lowest memory byte address, +followed by the other register bytes in ascending order of their +significance. Loads similarly transfer the contents of the lesser memory +byte addresses to the less-significant register bytes. +
+
+In a big-endian configuration, multibyte stores write the +most-significant register byte at the lowest memory byte address, +followed by the other register bytes in descending order of their +significance. Loads similarly transfer the contents of the greater +memory byte addresses to the less-significant register bytes. + |
+
Load and store instructions transfer a value between the registers and +memory. Loads are encoded in the I-type format and stores are S-type. +The effective address is obtained by adding register rs1 to the +sign-extended 12-bit offset. Loads copy a value from memory to register +rd. Stores copy the value in register rs2 to memory.
+The LW instruction loads a 32-bit value from memory into rd. LH loads +a 16-bit value from memory, then sign-extends to 32-bits before storing +in rd. LHU loads a 16-bit value from memory but then zero extends to +32-bits before storing in rd. LB and LBU are defined analogously for +8-bit values. The SW, SH, and SB instructions store 32-bit, 16-bit, and +8-bit values from the low bits of register rs2 to memory.
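For illustration (non-normative), a few load/store forms with arbitrary base registers and offsets:

    lw  t0, 8(a0)    # load the 32-bit word at address a0+8 into t0
    lhu t1, -2(a0)   # load the halfword at address a0-2, zero-extended
    lb  t2, 3(a0)    # load the byte at address a0+3, sign-extended
    sh  t3, 0(a1)    # store the low 16 bits of t3 to address a1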
+Regardless of EEI, loads and stores whose effective addresses are +naturally aligned shall not raise an address-misaligned exception. Loads +and stores whose effective address is not naturally aligned to the +referenced datatype (i.e., the effective address is not divisible by the +size of the access in bytes) have behavior dependent on the EEI.
+An EEI may guarantee that misaligned loads and stores are fully +supported, and so the software running inside the execution environment +will never experience a contained or fatal address-misaligned trap. In +this case, the misaligned loads and stores can be handled in hardware, +or via an invisible trap into the execution environment implementation, +or possibly a combination of hardware and invisible trap depending on +address.
+An EEI may not guarantee misaligned loads and stores are handled +invisibly. In this case, loads and stores that are not naturally aligned +may either complete execution successfully or raise an exception. The +exception raised can be either an address-misaligned exception or an +access-fault exception. For a memory access that would otherwise be able +to complete except for the misalignment, an access-fault exception can +be raised instead of an address-misaligned exception if the misaligned +access should not be emulated, e.g., if accesses to the memory region +have side effects. When an EEI does not guarantee misaligned loads and +stores are handled invisibly, the EEI must define if exceptions caused +by address misalignment result in a contained trap (allowing software +running inside the execution environment to handle the trap) or a fatal +trap (terminating execution).
++ + | +
+
+
+Misaligned accesses are occasionally required when porting legacy code, +and help performance on applications when using any form of packed-SIMD +extension or handling externally packed data structures. Our rationale +for allowing EEIs to choose to support misaligned accesses via the +regular load and store instructions is to simplify the addition of +misaligned hardware support. One option would have been to disallow +misaligned accesses in the base ISAs and then provide some separate ISA +support for misaligned accesses, either special instructions to help +software handle misaligned accesses or a new hardware addressing mode +for misaligned accesses. Special instructions are difficult to use, +complicate the ISA, and often add new processor state (e.g., SPARC VIS +align address offset register) or complicate access to existing +processor state (e.g., MIPS LWL/LWR partial register writes). In +addition, for loop-oriented packed-SIMD code, the extra overhead when +operands are misaligned motivates software to provide multiple forms of +loop depending on operand alignment, which complicates code generation +and adds to loop startup overhead. New misaligned hardware addressing +modes take considerable space in the instruction encoding or require +very simplified addressing modes (e.g., register indirect only). + |
+
Even when misaligned loads and stores complete successfully, these +accesses might run extremely slowly depending on the implementation +(e.g., when implemented via an invisible trap). Furthermore, whereas +naturally aligned loads and stores are guaranteed to execute atomically, +misaligned loads and stores might not, and hence require additional +synchronization to ensure atomicity.
++ + | +
+
+
+We do not mandate atomicity for misaligned accesses so execution +environment implementations can use an invisible machine trap and a +software handler to handle some or all misaligned accesses. If hardware +misaligned support is provided, software can exploit this by simply +using regular load and store instructions. Hardware can then +automatically optimize accesses depending on whether runtime addresses +are aligned. + |
+
2.7. Memory Ordering Instructions
+The FENCE instruction is used to order device I/O and memory accesses as +viewed by other RISC-V harts and external devices or coprocessors. Any +combination of device input (I), device output (O), memory reads (R), +and memory writes (W) may be ordered with respect to any combination of +the same. Informally, no other RISC-V hart or external device can +observe any operation in the successor set following a FENCE before +any operation in the predecessor set preceding the FENCE. +Chapter 18 provides a precise description +of the RISC-V memory consistency model.
+The FENCE instruction also orders memory reads and writes made by the +hart as observed by memory reads and writes made by an external device. +However, FENCE does not order observations of events made by an external +device using any other signaling mechanism.
++ + | +
+
+
+A device might observe an access to a memory location via some external +communication mechanism, e.g., a memory-mapped control register that +drives an interrupt signal to an interrupt controller. This +communication is outside the scope of the FENCE ordering mechanism and +hence the FENCE instruction can provide no guarantee on when a change in +the interrupt signal is visible to the interrupt controller. Specific +devices might provide additional ordering guarantees to reduce software +overhead but those are outside the scope of the RISC-V memory model. + |
+
The EEI will define what I/O operations are possible, and in particular, +which memory addresses when accessed by load and store instructions will +be treated and ordered as device input and device output operations +respectively rather than memory reads and writes. For example, +memory-mapped I/O devices will typically be accessed with uncached loads +and stores that are ordered using the I and O bits rather than the R and +W bits. Instruction-set extensions might also describe new I/O +instructions that will also be ordered using the I and O bits in a +FENCE.
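For illustration (non-normative), a driver-style sequence in which a descriptor written to ordinary memory must be ordered before a write to a memory-mapped doorbell register; the addresses in a0 (memory) and a1 (device) are hypothetical:

    sw    t0, 0(a0)   # write descriptor to ordinary memory
    fence w, o        # order prior memory writes before subsequent device output
    sw    t1, 0(a1)   # write the doorbell register (treated as device output by the EEI)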
| fm field | Mnemonic | Meaning                                                                              |
|----------|----------|--------------------------------------------------------------------------------------|
| 0000     | none     | Normal Fence                                                                         |
| 1000     | TSO      | With FENCE RW,RW: exclude write-to-read ordering; otherwise: Reserved for future use |
| other    |          | Reserved for future use                                                              |
The fence mode field fm defines the semantics of the FENCE
. A FENCE
+with fm=0000
orders all memory operations in its predecessor set
+before all memory operations in its successor set.
The FENCE.TSO
instruction is encoded as a FENCE
instruction
+with fm=1000
, predecessor=RW
, and successor=RW
. FENCE.TSO
orders
+all load operations in its predecessor set before all memory operations
+in its successor set, and all store operations in its predecessor set
+before all store operations in its successor set. This leaves non-AMO
+store operations in the FENCE.TSO’s
predecessor set unordered with
+non-AMO
loads in its successor set.
+ + | +
+
+
+Because FENCE RW,RW imposes a superset of the orderings that FENCE.TSO +imposes, it is correct to ignore the fm field and implement FENCE.TSO as FENCE RW,RW. + |
+
The unused fields in the FENCE
instructions--rs1 and rd--are reserved
+for finer-grain fences in future extensions. For forward compatibility,
+base implementations shall ignore these fields, and standard software
+shall zero these fields. Likewise, many fm and predecessor/successor
+set settings in Table 5 are also reserved for future use.
+Base implementations shall treat all such reserved configurations as
+normal fences with fm=0000, and standard software shall use only
+non-reserved configurations.
+ + | +
+
+
+We chose a relaxed memory model to allow high performance from simple +machine implementations and from likely future coprocessor or +accelerator extensions. We separate out I/O ordering from memory R/W +ordering to avoid unnecessary serialization within a device-driver hart +and also to support alternative non-memory paths to control added +coprocessors or I/O devices. Simple implementations may additionally +ignore the predecessor and successor fields and always execute a +conservative fence on all operations. + |
+
2.8. Environment Call and Breakpoints
+SYSTEM
instructions are used to access system functionality that might
+require privileged access and are encoded using the I-type instruction
+format. These can be divided into two main classes: those that
+atomically read-modify-write control and status registers (CSRs), and
+all other potentially privileged instructions. CSR instructions are
+described in Chapter 7, and the base
+unprivileged instructions are described in the following section.
+ + | +
+
+
+The SYSTEM instructions are defined to allow simpler implementations to +always trap to a single software trap handler. More sophisticated +implementations might execute more of each system instruction in +hardware. + |
+
These two instructions cause a precise requested trap to the supporting +execution environment.
+The ECALL
instruction is used to make a service request to the execution
+environment. The EEI
will define how parameters for the service request
+are passed, but usually these will be in defined locations in the
+integer register file.
The EBREAK
instruction is used to return control to a debugging
+environment.
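For illustration (non-normative), a service request using ECALL under one possible ABI; the use of a7 for the service number and a0 for the first argument and return value follows a common Linux-style RISC-V convention, which is an EEI choice rather than part of the ISA:

    li    a7, 93    # service number (e.g. "exit" under that convention)
    li    a0, 0     # first argument
    ecall           # request the service from the execution environment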
+ + | +
+
+
+ECALL and EBREAK were previously named SCALL and SBREAK. The +instructions have the same functionality and encoding, but were renamed +to reflect that they can be used more generally than to call a +supervisor-level operating system or debugger. + |
+
+ + | +
+
+
+EBREAK was primarily designed to be used by a debugger to cause +execution to stop and fall back into the debugger. EBREAK is also used +by the standard gcc compiler to mark code paths that should not be +executed. +
+
+Another use of EBREAK is to support "semihosting", where the execution +environment includes a debugger that can provide services over an +alternate system call interface built around the EBREAK instruction. +Because the RISC-V base ISAs do not provide more than one EBREAK +instruction, RISC-V semihosting uses a special sequence of instructions +to distinguish a semihosting EBREAK from a debugger inserted EBREAK. +
+
    slli x0, x0, 0x1f   # Entry NOP
    ebreak              # Break to debugger
    srai x0, x0, 7      # NOP encoding the semihosting call number 7
+Note that these three instructions must be 32-bit-wide instructions, +i.e., they mustn’t be among the compressed 16-bit instructions described +in Chapter 28. +
+
+The shift NOP instructions are still considered available for use as +HINTs. +
+
+Semihosting is a form of service call and would be more naturally +encoded as an ECALL using an existing ABI, but this would require the +debugger to be able to intercept ECALLs, which is a newer addition to +the debug standard. We intend to move over to using ECALLs with a +standard ABI, in which case, semihosting can share a service ABI with an +existing standard. +
+
+We note that ARM processors have also moved to using SVC instead of BKPT +for semihosting calls in newer designs. + |
+
2.9. HINT Instructions
+RV32I reserves a large encoding space for HINT instructions, which are
+usually used to communicate performance hints to the microarchitecture.
+Like the NOP instruction, HINTs do not change any architecturally
+visible state, except for advancing the pc
and any applicable
+performance counters. Implementations are always allowed to ignore the
+encoded hints.
Most RV32I HINTs are encoded as integer computational instructions with +rd=x0. The other RV32I HINTs are encoded as FENCE instructions with +a null predecessor or successor set and with fm=0.
++ + | +
+
+
+These HINT encodings have been chosen so that simple implementations can
+ignore HINTs altogether, and instead execute a HINT as a regular
+instruction that happens not to mutate the architectural state. For
example, ADD is a HINT if the destination register is x0; the rs1 and
rs2 fields encode arguments to the HINT. A simple implementation can
simply execute the HINT as an ADD of rs1 and rs2 that writes x0, which
has no architecturally visible effect.
+As another example, a FENCE instruction with a zero pred field and a +zero fm field is a HINT; the succ, rs1, and rd fields encode the +arguments to the HINT. A simple implementation can simply execute the +HINT as a FENCE that orders the null set of prior memory accesses before +whichever subsequent memory accesses are encoded in the succ field. +Since the intersection of the predecessor and successor sets is null, +the instruction imposes no memory orderings, and so it has no +architecturally visible effect. + |
+
Table 6 lists all RV32I HINT code points. 91% of the +HINT space is reserved for standard HINTs. The remainder of the HINT +space is designated for custom HINTs: no standard HINTs will ever be +defined in this subspace.
++ + | +
+
+
+We anticipate standard hints to eventually include memory-system spatial +and temporal locality hints, branch prediction hints, thread-scheduling +hints, security tags, and instrumentation flags for simulation/emulation. + |
+
| Instruction | Constraints                                      | Code Points | Purpose                            |
|-------------|--------------------------------------------------|-------------|------------------------------------|
| LUI         | rd=x0                                            |             | Designated for future standard use |
| AUIPC       | rd=x0                                            |             |                                    |
| ADDI        | rd=x0, and either rs1≠x0 or imm≠0                |             |                                    |
| ANDI        | rd=x0                                            |             |                                    |
| ORI         | rd=x0                                            |             |                                    |
| XORI        | rd=x0                                            |             |                                    |
| ADD         | rd=x0, rs1≠x0                                    |             |                                    |
| ADD         | rd=x0, rs1=x0, rs2≠x2-x5                         | 28          |                                    |
| ADD         | rd=x0, rs1=x0, rs2=x2-x5                         | 4           | (rs2=x2) NTL.P1, (rs2=x3) NTL.PALL, (rs2=x4) NTL.S1, (rs2=x5) NTL.ALL |
| SUB         | rd=x0                                            |             | Designated for future standard use |
| AND         | rd=x0                                            |             |                                    |
| OR          | rd=x0                                            |             |                                    |
| XOR         | rd=x0                                            |             |                                    |
| SLL         | rd=x0                                            |             |                                    |
| SRL         | rd=x0                                            |             |                                    |
| SRA         | rd=x0                                            |             |                                    |
| FENCE       | rd=x0, rs1≠x0, fm=0, and either pred=0 or succ=0 |             |                                    |
| FENCE       | rd≠x0, rs1=x0, fm=0, and either pred=0 or succ=0 |             |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred=0, succ≠0                  | 15          |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred≠W, succ=0                  | 15          |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred=W, succ=0                  | 1           | PAUSE                              |
| SLTI        | rd=x0                                            |             | Designated for custom use          |
| SLTIU       | rd=x0                                            |             |                                    |
| SLLI        | rd=x0                                            |             |                                    |
| SRLI        | rd=x0                                            |             |                                    |
| SRAI        | rd=x0                                            |             |                                    |
| SLT         | rd=x0                                            |             |                                    |
| SLTU        | rd=x0                                            |             |                                    |
3. RV32E and RV64E Base Integer Instruction Sets, Version 2.0
+4. RV64I Base Integer Instruction Set, Version 2.1
+This chapter describes the RV64I base integer instruction set, which +builds upon the RV32I variant described in Chapter 2. +This chapter presents only the differences with RV32I, so should be read +in conjunction with the earlier chapter.
+4.1. Register State
+RV64I widens the integer registers and supported user address space to +64 bits (XLEN=64 in Table 3).
+4.2. Integer Computational Instructions
+Most integer computational instructions operate on XLEN-bit values. +Additional instruction variants are provided to manipulate 32-bit values +in RV64I, indicated by a 'W' suffix to the opcode. These "*W" +instructions ignore the upper 32 bits of their inputs and always produce +32-bit signed values, sign-extending them to 64 bits, i.e. bits XLEN-1 +through 31 are equal.
++
+4.2.1. Integer Register-Immediate Instructions
+ADDIW is an RV64I instruction that adds the sign-extended 12-bit +immediate to register rs1 and produces the proper sign extension of a +32-bit result in rd. Overflows are ignored and the result is the low +32 bits of the result sign-extended to 64 bits. Note, ADDIW rd, rs1, 0 +writes the sign extension of the lower 32 bits of register rs1 into +register rd (assembler pseudoinstruction SEXT.W).
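For illustration (non-normative), the sign-extension behavior with an arbitrary starting value:

    # assume t1 = 0x0000_0000_7FFF_FFFF
    addiw  t0, t1, 1    # 32-bit sum 0x8000_0000, sign-extended:  t0 = 0xFFFF_FFFF_8000_0000
    addi   t2, t1, 1    # full 64-bit add:                        t2 = 0x0000_0000_8000_0000
    sext.w t3, t1       # pseudoinstruction for addiw t3, t1, 0:  t3 = 0x0000_0000_7FFF_FFFF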
+Shifts by a constant are encoded as a specialization of the I-type +format using the same instruction opcode as RV32I. The operand to be +shifted is in rs1, and the shift amount is encoded in the lower 6 bits +of the I-immediate field for RV64I. The right shift type is encoded in +bit 30. SLLI is a logical left shift (zeros are shifted into the lower +bits); SRLI is a logical right shift (zeros are shifted into the upper +bits); and SRAI is an arithmetic right shift (the original sign bit is +copied into the vacated upper bits). + + + +
+SLLIW, SRLIW, and SRAIW are RV64I-only instructions that are analogously +defined but operate on 32-bit values and sign-extend their 32-bit +results to 64 bits. SLLIW, SRLIW, and SRAIW encodings with +imm[5] ≠ 0 are reserved.
+LUI (load upper immediate) uses the same opcode as RV32I. LUI places the +32-bit U-immediate into register rd, filling in the lowest 12 bits +with zeros. The 32-bit result is sign-extended to 64 bits. +
+AUIPC (add upper immediate to pc
) uses the same opcode as RV32I. AUIPC
+is used to build pc
-relative addresses and uses the U-type format.
+AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
+12 bits with zeros, sign-extends the result to 64 bits, adds it to the
+address of the AUIPC instruction, then places the result in register
+rd.
4.2.2. Integer Register-Register Operations
+ADDW and SUBW are RV64I-only instructions that are defined analogously +to ADD and SUB but operate on 32-bit values and produce signed 32-bit +results. Overflows are ignored, and the low 32-bits of the result is +sign-extended to 64-bits and written to the destination register. + +
+SLL, SRL, and SRA perform logical left, logical right, and arithmetic +right shifts on the value in register rs1 by the shift amount held in +register rs2. In RV64I, only the low 6 bits of rs2 are considered +for the shift amount.
+SLLW, SRLW, and SRAW are RV64I-only instructions that are analogously +defined but operate on 32-bit values and sign-extend their 32-bit +results to 64 bits. The shift amount is given by rs2[4:0]. + + +
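For illustration (non-normative), the difference in shift-amount handling, using arbitrary values:

    # assume t1 = 1 and t2 = 32
    sll  t0, t1, t2     # shift amount = t2[5:0] = 32:  t0 = 0x0000_0001_0000_0000
    sllw t3, t1, t2     # shift amount = t2[4:0] = 0; 32-bit result 1, sign-extended: t3 = 1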
+4.3. Load and Store Instructions
+RV64I extends the address space to 64 bits. The execution environment +will define what portions of the address space are legal to access.
+The LD instruction loads a 64-bit value from memory into register rd +for RV64I. +
+The LW instruction loads a 32-bit value from memory and sign-extends +this to 64 bits before storing it in register rd for RV64I. The LWU +instruction, on the other hand, zero-extends the 32-bit value from +memory for RV64I. LH and LHU are defined analogously for 16-bit values, +as are LB and LBU for 8-bit values. The SD, SW, SH, and SB instructions +store 64-bit, 32-bit, 16-bit, and 8-bit values from the low bits of +register rs2 to memory respectively.
+4.4. HINT Instructions
+All instructions that are microarchitectural HINTs in RV32I (see +Chapter 2) are also HINTs in RV64I. +The additional computational instructions in RV64I expand both the +standard and custom HINT encoding spaces. +
+Table 7 lists all RV64I HINT code points. 91% of the +HINT space is reserved for standard HINTs, but none are presently +defined. The remainder of the HINT space is designated for custom HINTs; +no standard HINTs will ever be defined in this subspace.
| Instruction | Constraints                                      | Code Points | Purpose                            |
|-------------|--------------------------------------------------|-------------|------------------------------------|
| LUI         | rd=x0                                            |             | Designated for future standard use |
| AUIPC       | rd=x0                                            |             |                                    |
| ADDI        | rd=x0, and either rs1≠x0 or imm≠0                |             |                                    |
| ANDI        | rd=x0                                            |             |                                    |
| ORI         | rd=x0                                            |             |                                    |
| XORI        | rd=x0                                            |             |                                    |
| ADDIW       | rd=x0                                            |             |                                    |
| ADD         | rd=x0, rs1≠x0                                    |             |                                    |
| ADD         | rd=x0, rs1=x0, rs2≠x2-x5                         | 28          |                                    |
| ADD         | rd=x0, rs1=x0, rs2=x2-x5                         | 4           | (rs2=x2) NTL.P1, (rs2=x3) NTL.PALL, (rs2=x4) NTL.S1, (rs2=x5) NTL.ALL |
| SUB         | rd=x0                                            |             | Designated for future standard use |
| AND         | rd=x0                                            |             |                                    |
| OR          | rd=x0                                            |             |                                    |
| XOR         | rd=x0                                            |             |                                    |
| SLL         | rd=x0                                            |             |                                    |
| SRL         | rd=x0                                            |             |                                    |
| SRA         | rd=x0                                            |             |                                    |
| ADDW        | rd=x0                                            |             |                                    |
| SUBW        | rd=x0                                            |             |                                    |
| SLLW        | rd=x0                                            |             |                                    |
| SRLW        | rd=x0                                            |             |                                    |
| SRAW        | rd=x0                                            |             |                                    |
| FENCE       | rd=x0, rs1≠x0, fm=0, and either pred=0 or succ=0 |             |                                    |
| FENCE       | rd≠x0, rs1=x0, fm=0, and either pred=0 or succ=0 |             |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred=0, succ≠0                  | 15          |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred≠W, succ=0                  | 15          |                                    |
| FENCE       | rd=rs1=x0, fm=0, pred=W, succ=0                  | 1           | PAUSE                              |
| SLTI        | rd=x0                                            |             | Designated for custom use          |
| SLTIU       | rd=x0                                            |             |                                    |
| SLLI        | rd=x0                                            |             |                                    |
| SRLI        | rd=x0                                            |             |                                    |
| SRAI        | rd=x0                                            |             |                                    |
| SLLIW       | rd=x0                                            |             |                                    |
| SRLIW       | rd=x0                                            |             |                                    |
| SRAIW       | rd=x0                                            |             |                                    |
| SLT         | rd=x0                                            |             |                                    |
| SLTU        | rd=x0                                            |             |                                    |
+
5. RV128I Base Integer Instruction Set, Version 1.7
+CV64A6_MMU: This instruction set is not supported.
+6. "Zifencei" Extension for Instruction-Fetch Fence, Version 2.0
+7. "Zicsr", Extension for Control and Status Register (CSR) Instructions, Version 2.0
+RISC-V defines a separate address space of 4096 Control and Status +registers associated with each hart. This chapter defines the full set +of CSR instructions that operate on these CSRs.
++ + | +
+
+
+While CSRs are primarily used by the privileged architecture, there are +several uses in unprivileged code including for counters and timers, and +for floating-point status. +
+
+The counters and timers are no longer considered mandatory parts of the +standard base ISAs, and so the CSR instructions required to access them +have been moved out of Chapter 2 into this separate +chapter. + |
+
7.1. CSR Instructions
+All CSR instructions atomically read-modify-write a single CSR, whose +CSR specifier is encoded in the 12-bit csr field of the instruction +held in bits 31-20. The immediate forms use a 5-bit zero-extended +immediate encoded in the rs1 field.
+The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values in
+the CSRs and integer registers. CSRRW reads the old value of the CSR,
+zero-extends the value to XLEN bits, then writes it to integer register
+rd. The initial value in rs1 is written to the CSR. If rd=x0
,
+then the instruction shall not read the CSR and shall not cause any of
+the side effects that might occur on a CSR read.
The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the value +of the CSR, zero-extends the value to XLEN bits, and writes it to +integer register rd. The initial value in integer register rs1 is +treated as a bit mask that specifies bit positions to be set in the CSR. +Any bit that is high in rs1 will cause the corresponding bit to be set +in the CSR, if that CSR bit is writable.
+The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the +value of the CSR, zero-extends the value to XLEN bits, and writes it to +integer register rd. The initial value in integer register rs1 is +treated as a bit mask that specifies bit positions to be cleared in the +CSR. Any bit that is high in rs1 will cause the corresponding bit to +be cleared in the CSR, if that CSR bit is writable.
+For both CSRRS and CSRRC, if rs1=x0
, then the instruction will not
+write to the CSR at all, and so shall not cause any of the side effects
+that might otherwise occur on a CSR write, nor raise illegal-instruction
+exceptions on accesses to read-only CSRs. Both CSRRS and CSRRC always
+read the addressed CSR and cause any read side effects regardless of
+rs1 and rd fields.
+Note that if rs1 specifies a register other than x0
, and that register
+holds a zero value, the instruction will not action any attendant per-field
+side effects, but will action any side effects caused by writing to the entire
+CSR.
A CSRRW with rs1=x0
will attempt to write zero to the destination CSR.
The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, and
+CSRRC respectively, except they update the CSR using an XLEN-bit value
+obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field
+encoded in the rs1 field instead of a value from an integer register.
+For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then these
+instructions will not write to the CSR, and shall not cause any of the
+side effects that might otherwise occur on a CSR write, nor raise
+illegal-instruction exceptions on accesses to read-only CSRs. For
+CSRRWI, if rd=x0
, then the instruction shall not read the CSR and
+shall not cause any of the side effects that might occur on a CSR read.
+Both CSRRSI and CSRRCI will always read the CSR and cause any read side
+effects regardless of rd and rs1 fields.
Register operand

| Instruction | rd is x0 | rs1 is x0 | Reads CSR | Writes CSR |
|-------------|----------|-----------|-----------|------------|
| CSRRW       | Yes      | -         | No        | Yes        |
| CSRRW       | No       | -         | Yes       | Yes        |
| CSRRS/CSRRC | -        | Yes       | Yes       | No         |
| CSRRS/CSRRC | -        | No        | Yes       | Yes        |

Immediate operand

| Instruction   | rd is x0 | uimm=0 | Reads CSR | Writes CSR |
|---------------|----------|--------|-----------|------------|
| CSRRWI        | Yes      | -      | No        | Yes        |
| CSRRWI        | No       | -      | Yes       | Yes        |
| CSRRSI/CSRRCI | -        | Yes    | Yes       | No         |
| CSRRSI/CSRRCI | -        | No     | Yes       | Yes        |
Table 8 summarizes the behavior of the CSR +instructions with respect to whether they read and/or write the CSR.
In addition to side effects that occur as a consequence of reading or
writing a CSR, individual fields within a CSR might have side effects
when written. The CSRRW[I] instructions action side effects for all
such fields within the written CSR. The CSRRS[I] and CSRRC[I] instructions
only action side effects for fields for which the rs1 or uimm argument
has at least one bit set corresponding to that field.
++ + | +
+
+
+As of this writing, no standard CSRs have side effects on field writes. +Hence, whether a standard CSR access has any side effects can be determined +solely from the opcode. +
+
+Defining CSRs with side effects on field writes is not recommended. + |
+
For any event or consequence that occurs due to a CSR having a +particular value, if a write to the CSR gives it that value, the +resulting event or consequence is said to be an indirect effect of the +write. Indirect effects of a CSR write are not considered by the RISC-V +ISA to be side effects of that write.
++ + | +
+
+
+An example of side effects for CSR accesses would be if reading from a +specific CSR causes a light bulb to turn on, while writing an odd value +to the same CSR causes the light to turn off. Assume writing an even +value has no effect. In this case, both the read and write have side +effects controlling whether the bulb is lit, as this condition is not +determined solely from the CSR value. (Note that after writing an odd +value to the CSR to turn off the light, then reading to turn the light +on, writing again the same odd value causes the light to turn off again. +Hence, on the last write, it is not a change in the CSR value that turns +off the light.) +
+
+On the other hand, if a bulb is rigged to light whenever the value of a +particular CSR is odd, then turning the light on and off is not +considered a side effect of writing to the CSR but merely an indirect +effect of such writes. +
+
+More concretely, the RISC-V privileged architecture defined in Volume II +specifies that certain combinations of CSR values cause a trap to occur. +When an explicit write to a CSR creates the conditions that trigger the +trap, the trap is not considered a side effect of the write but merely +an indirect effect. +
+
+Standard CSRs do not have any side effects on reads. Standard CSRs may +have side effects on writes. Custom extensions might add CSRs for which +accesses have side effects on either reads or writes. + |
+
Some CSRs, such as the instructions-retired counter, instret
, may be
+modified as side effects of instruction execution. In these cases, if a
+CSR access instruction reads a CSR, it reads the value prior to the
+execution of the instruction. If a CSR access instruction writes such a
+CSR, the explicit write is done instead of the update from the side effect.
+In particular, a value
+written to instret
by one instruction will be the value read by the
+following instruction.
The assembler pseudoinstruction to read a CSR, CSRR rd, csr, is +encoded as CSRRS rd, csr, x0. The assembler pseudoinstruction to write +a CSR, CSRW csr, rs1, is encoded as CSRRW x0, csr, rs1, while CSRWI +csr, uimm, is encoded as CSRRWI x0, csr, uimm.
+Further assembler pseudoinstructions are defined to set and clear bits +in the CSR when the old value is not required: CSRS/CSRC csr, rs1; +CSRSI/CSRCI csr, uimm.
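For illustration, a minimal sketch of how these pseudoinstructions expand (the particular CSRs named here, mscratch and mstatus, are examples only):

    csrr  t0, mscratch     # expands to csrrs  t0, mscratch, x0  (read, no write)
    csrw  mscratch, t1     # expands to csrrw  x0, mscratch, t1  (write, no read)
    csrwi mscratch, 7      # expands to csrrwi x0, mscratch, 7
    csrs  mstatus, t2      # expands to csrrs  x0, mstatus, t2   (set the bits given in t2)
    csrci mstatus, 0x8     # expands to csrrci x0, mstatus, 0x8  (clear bit 3, MIE)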
+7.1.1. CSR Access Ordering
+Each RISC-V hart normally observes its own CSR accesses, including its +implicit CSR accesses, as performed in program order. In particular, +unless specified otherwise, a CSR access is performed after the +execution of any prior instructions in program order whose behavior +modifies or is modified by the CSR state and before the execution of any +subsequent instructions in program order whose behavior modifies or is +modified by the CSR state. Furthermore, an explicit CSR read returns the +CSR state before the execution of the instruction, while an explicit CSR +write suppresses and overrides any implicit writes or modifications to +the same CSR by the same instruction.
+Likewise, any side effects from an explicit CSR access are normally +observed to occur synchronously in program order. Unless specified +otherwise, the full consequences of any such side effects are observable +by the very next instruction, and no consequences may be observed +out-of-order by preceding instructions. (Note the distinction made +earlier between side effects and indirect effects of CSR writes.)
+For the RVWMO memory consistency model (Chapter 18), CSR accesses are weakly +ordered by default, so other harts or devices may observe CSR accesses +in an order different from program order. In addition, CSR accesses are +not ordered with respect to explicit memory accesses, unless a CSR +access modifies the execution behavior of the instruction that performs +the explicit memory access or unless a CSR access and an explicit memory +access are ordered by either the syntactic dependencies defined by the +memory model or the ordering requirements defined by the Memory-Ordering +PMAs section in Volume II of this manual. To enforce ordering in all +other cases, software should execute a FENCE instruction between the +relevant accesses. For the purposes of the FENCE instruction, CSR read +accesses are classified as device input (I), and CSR write accesses are +classified as device output (O).
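As a sketch of this classification (the CSR address 0x800 is a hypothetical custom CSR used only for illustration), a FENCE can order a main-memory store before a subsequent CSR write:

    sw    t0, 0(a0)    # store to main memory (predecessor set: W)
    fence w, o         # CSR writes are classified as device output (O)
    csrw  0x800, t1    # hypothetical custom CSR; its write is now ordered after the store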
Informally, the CSR space acts as a weakly ordered memory-mapped I/O region, as defined by the Memory-Ordering PMAs section in Volume II of this manual. As a result, the order of CSR accesses with respect to all other accesses is constrained by the same mechanisms that constrain the order of memory-mapped I/O accesses to such a region.

These CSR-ordering constraints are imposed to support ordering main memory and memory-mapped I/O accesses with respect to CSR accesses that are visible to, or affected by, devices or other harts. Examples include the time, cycle, and mcycle CSRs, in addition to CSRs that affect memory access behavior, such as satp.

Most CSRs (including, e.g., the fcsr) are not visible to other harts; their accesses can be freely reordered in the global memory order with respect to FENCE instructions without violating this specification.
The hardware platform may define that accesses to certain CSRs are +strongly ordered, as defined by the Memory-Ordering PMAs section in +Volume II of this manual. Accesses to strongly ordered CSRs have +stronger ordering constraints with respect to accesses to both weakly +ordered CSRs and accesses to memory-mapped I/O regions.
The rules for the reordering of CSR accesses in the global memory order should probably be moved to Chapter 18 concerning the RVWMO memory consistency model.
8. "Zicntr" and "Zihpm" Extensions for Counters, Version 2.0
RISC-V ISAs provide a set of up to thirty-two 64-bit performance counters and timers that are accessible via unprivileged XLEN-bit read-only CSR registers 0xC00–0xC1F (when XLEN=32, the upper 32 bits are accessed via CSR registers 0xC80–0xC9F). These counters are divided between the "Zicntr" and "Zihpm" extensions.
8.1. "Zicntr" Extension for Base Counters and Timers
+The Zicntr standard extension comprises the first three of these +counters (CYCLE, TIME, and INSTRET), which have dedicated functions +(cycle count, real-time clock, and instructions retired, respectively). +The Zicntr extension depends on the Zicsr extension.
+We recommend provision of these basic counters in implementations as +they are essential for basic performance analysis, adaptive and dynamic +optimization, and to allow an application to work with real-time +streams. Additional counters in the separate Zihpm extension can help +diagnose performance problems and these should be made accessible from +user-level application code with low overhead. +
+
+Some execution environments might prohibit access to counters, for +example, to impede timing side-channel attacks. + |
+
For base ISAs with XLEN≥64, CSR instructions can access the full 64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and RDINSTRET pseudoinstructions read the full 64 bits of the cycle, time, and instret counters.
The counter pseudoinstructions are mapped to the read-only csrrs rd, counter, x0 canonical form, but the other read-only CSR instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also accepted.
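For example, on an RV64 hart where counter access is permitted, a region of code can be measured directly with the full-width counters (a sketch; the measured region is arbitrary):

    rdcycle   s0          # cycles before the region
    rdinstret s1          # instructions retired before the region
    # ... code under measurement ...
    rdcycle   t0
    rdinstret t1
    sub       t0, t0, s0  # elapsed cycles
    sub       t1, t1, s1  # instructions retired; t0/t1 gives CPI for the region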
For base ISAs with XLEN=32, the Zicntr extension enables the three +64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE, +RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 bits, and +the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide the +upper 32 bits of the respective counters.
+We required the counters be 64 bits wide, even when XLEN=32, as +otherwise it is very difficult for software to determine if values have +overflowed. For a low-end implementation, the upper 32 bits of each +counter can be implemented using software counters incremented by a trap +handler triggered by overflow of the lower 32 bits. The sample code +given below shows how the full 64-bit width value can be safely read +using the individual 32-bit width pseudoinstructions. + |
+
The RDCYCLE pseudoinstruction reads the low XLEN bits of the cycle
+CSR which holds a count of the number of clock cycles executed by the
+processor core on which the hart is running from an arbitrary start time
+in the past. RDCYCLEH is only present when XLEN=32 and reads bits 63-32
+of the same cycle counter. The underlying 64-bit counter should never
+overflow in practice. The rate at which the cycle counter advances will
+depend on the implementation and operating environment. The execution
+environment should provide a means to determine the current rate
+(cycles/second) at which the cycle counter is incrementing.
+RDCYCLE is intended to return the number of cycles executed by the +processor core, not the hart. Precisely defining what is a "core" is +difficult given some implementation choices (e.g., AMD Bulldozer). +Precisely defining what is a "clock cycle" is also difficult given the +range of implementations (including software emulations), but the intent +is that RDCYCLE is used for performance monitoring along with the other +performance counters. In particular, where there is one hart/core, one +would expect cycle-count/instructions-retired to measure CPI for a hart. +
+
+Cores don’t have to be exposed to software at all, and an implementor +might choose to pretend multiple harts on one physical core are running +on separate cores with one hart/core, and provide separate cycle +counters for each hart. This might make sense in a simple barrel +processor (e.g., CDC 6600 peripheral processors) where inter-hart timing +interactions are non-existent or minimal. +
+
+Where there is more than one hart/core and dynamic multithreading, it is +not generally possible to separate out cycles per hart (especially with +SMT). It might be possible to define a separate performance counter that +tried to capture the number of cycles a particular hart was running, but +this definition would have to be very fuzzy to cover all the possible +threading implementations. For example, should we only count cycles for +which any instruction was issued to execution for this hart, and/or +cycles any instruction retired, or include cycles this hart was +occupying machine resources but couldn’t execute due to stalls while +other harts went into execution? Likely, "all of the above" would be +needed to have understandable performance stats. This complexity of +defining a per-hart cycle count, and also the need in any case for a +total per-core cycle count when tuning multithreaded code led to just +standardizing the per-core cycle counter, which also happens to work +well for the common single hart/core case. +
+
+Standardizing what happens during "sleep" is not practical given that +what "sleep" means is not standardized across execution environments, +but if the entire core is paused (entirely clock-gated or powered-down +in deep sleep), then it is not executing clock cycles, and the cycle +count shouldn’t be increasing per the spec. There are many details, +e.g., whether clock cycles required to reset a processor after waking up +from a power-down event should be counted, and these are considered +execution-environment-specific details. +
+
+Even though there is no precise definition that works for all platforms, +this is still a useful facility for most platforms, and an imprecise, +common, "usually correct" standard here is better than no standard. +The intent of RDCYCLE was primarily performance monitoring/tuning, and +the specification was written with that goal in mind. + |
+
The RDTIME pseudoinstruction reads the low XLEN bits of the "time" CSR, +which counts wall-clock real time that has passed from an arbitrary +start time in the past. RDTIMEH is only present when XLEN=32 and reads +bits 63-32 of the same real-time counter. The underlying 64-bit counter +increments by one with each tick of the real-time clock, and, for +realistic real-time clock frequencies, should never overflow in +practice. The execution environment should provide a means of +determining the period of a counter tick (seconds/tick). The period +should be constant within a small error bound. The environment should +provide a means to determine the accuracy of the clock (i.e., the +maximum relative error between the nominal and actual real-time clock +periods).
+On some simple platforms, cycle count might represent a valid +implementation of RDTIME, in which case RDTIME and RDCYCLE may return +the same result. +
+
+It is difficult to provide a strict mandate on clock period given the +wide variety of possible implementation platforms. The maximum error +bound should be set based on the requirements of the platform. + |
+
The real-time clocks of all harts must be synchronized to within one +tick of the real-time clock.
+As with other architectural mandates, it suffices to appear "as if" +harts are synchronized to within one tick of the real-time clock, i.e., +software is unable to observe that there is a greater delta between the +real-time clock values observed on two harts. + |
+
The RDINSTRET pseudoinstruction reads the low XLEN bits of the
+instret
CSR, which counts the number of instructions retired by this
+hart from some arbitrary start point in the past. RDINSTRETH is only
+present when XLEN=32 and reads bits 63-32 of the same instruction
+counter. The underlying 64-bit counter should never overflow in
+practice.
Instructions that cause synchronous exceptions, including ECALL and EBREAK, are not considered to retire and hence do not increment the instret CSR.
+
The following code sequence will read a valid 64-bit cycle counter value
+into x3:x2
, even if the counter overflows its lower half between
+reading its upper and lower halves.
again:
    rdcycleh x3            # read the upper 32 bits
    rdcycle  x2            # read the lower 32 bits
    rdcycleh x4            # re-read the upper 32 bits
    bne x3, x4, again      # retry if the upper half changed while reading
+8.2. "Zihpm" Extension for Hardware Performance Counters
+ +9. "Zihintntl" Extension for Non-Temporal Locality Hints, Version 1.0
+10. "Zihintpause" Extension for Pause Hint, Version 2.0
+11. "Zimop" Extension for May-Be-Operations, Version 1.0
+11.1. "Zcmop" Compressed May-Be-Operations Extension, Version 1.0
+ +12. "Zicond" Extension for Integer Conditional Operations, Version 1.0.0
+13. "M" Extension for Integer Multiplication and Division, Version 2.0
+This chapter describes the standard integer multiplication and division +instruction extension, which is named "M" and contains instructions +that multiply or divide values held in two integer registers.
+We separate integer multiply and divide out from the base to simplify +low-end implementations, or for applications where integer multiply and +divide operations are either infrequent or better handled in attached +accelerators. + |
+
13.1. Multiplication Operations
MUL performs an XLEN-bit×XLEN-bit multiplication of rs1 by rs2 and places the lower XLEN bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper XLEN bits of the full 2×XLEN-bit product, for signed×signed, unsigned×unsigned, and signed rs1×unsigned rs2 multiplication, respectively. If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
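For example, a full 128-bit signed product on RV64 can be formed with the recommended sequence (a sketch):

    mulh a2, a0, a1   # upper 64 bits of the 128-bit signed product
    mul  a3, a0, a1   # lower 64 bits; same operand order, and a2 differs from a0/a1,
                      # so a microarchitecture may fuse the pair into one multiply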
+MULHSU is used in multi-word signed multiplication to multiply the +most-significant word of the multiplicand (which contains the sign bit) +with the less-significant words of the multiplier (which are unsigned). + |
+
MULW is an RV64 instruction that multiplies the lower 32 bits of the +source registers, placing the sign extension of the lower 32 bits of the +result into the destination register.
+In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit +product, but signed arguments must be proper 32-bit signed values, +whereas unsigned arguments must have their upper 32 bits clear. If the +arguments are not known to be sign- or zero-extended, an alternative is +to shift both arguments left by 32 bits, then use MULH[[S]U]. + |
+
13.2. Division Operations
++
+DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned +integer division of rs1 by rs2, rounding towards zero. REM and REMU +provide the remainder of the corresponding division operation. For REM, +the sign of a nonzero result equals the sign of the dividend.
For both signed and unsigned division, except in the case of overflow, it holds that dividend = divisor × quotient + remainder.
+
If both the quotient and remainder are required from the same division, +the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr, +rs1, rs2 (rdq cannot be the same as rs1 or rs2). +Microarchitectures can then fuse these into a single divide operation +instead of performing two separate divides.
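A corresponding sketch of the fused quotient/remainder sequence:

    div a2, a0, a1    # quotient of a0 / a1
    rem a3, a0, a1    # remainder; same operands in the same order, so the pair may be fused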
+DIVW and DIVUW are RV64 instructions that divide the lower 32 bits of +rs1 by the lower 32 bits of rs2, treating them as signed and +unsigned integers respectively, placing the 32-bit quotient in rd, +sign-extended to 64 bits. REMW and REMUW are RV64 instructions that +provide the corresponding signed and unsigned remainder operations +respectively. Both REMW and REMUW always sign-extend the 32-bit result +to 64 bits, including on a divide by zero. +
The semantics for division by zero and division overflow are summarized in Table 9. The quotient of division by zero has all bits set, and the remainder of division by zero equals the dividend. Signed division overflow occurs only when the most-negative integer is divided by −1. The quotient of a signed division with overflow is equal to the dividend, and the remainder is zero. Unsigned division overflow cannot occur.
Condition              | Dividend | Divisor | DIVU[W] | REMU[W] | DIV[W]   | REM[W]
-----------------------|----------|---------|---------|---------|----------|-------
Division by zero       | x        | 0       | 2^L − 1 | x       | −1       | x
Overflow (signed only) | −2^(L−1) | −1      | -       | -       | −2^(L−1) | 0

(L is 32 for the W-suffixed instructions and XLEN otherwise.)
+We considered raising exceptions on integer divide by zero, with these +exceptions causing a trap in most execution environments. However, this +would be the only arithmetic trap in the standard ISA (floating-point +exceptions set flags and write default values, but do not cause traps) +and would require language implementors to interact with the execution +environment’s trap handlers for this case. Further, where language +standards mandate that a divide-by-zero exception must cause an +immediate control flow change, only a single branch instruction needs to +be added to each divide operation, and this branch instruction can be +inserted after the divide and should normally be very predictably not +taken, adding little runtime overhead. +
+
+The value of all bits set is returned for both unsigned and signed +divide by zero to simplify the divider circuitry. The value of all 1s is +both the natural value to return for unsigned divide, representing the +largest unsigned number, and also the natural result for simple unsigned +divider implementations. Signed division is often implemented using an +unsigned division circuit and specifying the same overflow result +simplifies the hardware. + |
+
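Where a language requires an immediate control-flow change on division by zero, a single branch around (or after) the divide suffices, as noted above; a sketch with a hypothetical handler label:

    beqz a1, handle_divide_by_zero   # hypothetical handler; normally predicted not taken
    div  a2, a0, a1                  # the divide itself never traps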
13.3. Zmmul Extension, Version 1.0
+The Zmmul extension implements the multiplication subset of the M +extension. It adds all of the instructions defined in +Section 13.1, namely: MUL, MULH, MULHU, +MULHSU, and (for RV64 only) MULW. The encodings are identical to those +of the corresponding M-extension instructions. M implies Zmmul. +
+The Zmmul extension enables low-cost implementations that require +multiplication operations but not division. For many microcontroller +applications, division operations are too infrequent to justify the cost +of divider hardware. By contrast, multiplication operations are more +frequent, making the cost of multiplier hardware more justifiable. +Simple FPGA soft cores particularly benefit from eliminating division +but retaining multiplication, since many FPGAs provide hardwired +multipliers but require dividers be implemented in soft logic. + |
+
14. "A" Extension for Atomic Instructions, Version 2.1
+CV64A6_MMU: This extension is not supported.
+15. "Zawrs" Extension for Wait-on-Reservation-Set instructions, Version 1.01
+16. "Zacas" Extension for Atomic Compare-and-Swap (CAS) Instructions, Version 1.0.0
+17. "Zabha" Extension for Byte and Halfword Atomic Memory Operations, Version 1.0.0
+18. RVWMO Memory Consistency Model, Version 2.0
+This chapter defines the RISC-V memory consistency model. A memory +consistency model is a set of rules specifying the values that can be +returned by loads of memory. RISC-V uses a memory model called "RVWMO" +(RISC-V Weak Memory Ordering) which is designed to provide flexibility +for architects to build high-performance scalable designs while +simultaneously supporting a tractable programming model. + +
+Under RVWMO, code running on a single hart appears to execute in order +from the perspective of other memory instructions in the same hart, but +memory instructions from another hart may observe the memory +instructions from the first hart being executed in a different order. +Therefore, multithreaded code may require explicit synchronization to +guarantee ordering between memory instructions from different harts. The +base RISC-V ISA provides a FENCE instruction for this purpose, described +in Section 2.7, while the atomics extension "A" additionally defines load-reserved/store-conditional and atomic read-modify-write instructions. +
+The standard ISA extension for total store ordering "Ztso" (Chapter 19) augments +RVWMO with additional rules specific to those extensions.
+The appendices to this specification provide both axiomatic and +operational formalizations of the memory consistency model as well as +additional explanatory material. + +
+This chapter defines the memory model for regular main memory +operations. The interaction of the memory model with I/O memory, +instruction fetches, FENCE.I, page table walks, and SFENCE.VMA is not +(yet) formalized. Some or all of the above may be formalized in a future +revision of this specification. The RV128 base ISA and future ISA +extensions such as the V vector and J JIT extensions will need +to be incorporated into a future revision as well. +
+
+Memory consistency models supporting overlapping memory accesses of +different widths simultaneously remain an active area of academic +research and are not yet fully understood. The specifics of how memory +accesses of different sizes interact under RVWMO are specified to the +best of our current abilities, but they are subject to revision should +new issues be uncovered. + |
+
18.1. Definition of the RVWMO Memory Model
+The RVWMO memory model is defined in terms of the global memory order, +a total ordering of the memory operations produced by all harts. In +general, a multithreaded program has many different possible executions, +with each execution having its own corresponding global memory order. +
+The global memory order is defined over the primitive load and store +operations generated by memory instructions. It is then subject to the +constraints defined in the rest of this chapter. Any execution +satisfying all of the memory model constraints is a legal execution (as +far as the memory model is concerned).
+18.1.1. Memory Model Primitives
+The program order over memory operations reflects the order in which +the instructions that generate each load and store are logically laid +out in that hart’s dynamic instruction stream; i.e., the order in which +a simple in-order processor would execute the instructions of that hart.
+Memory-accessing instructions give rise to memory operations. A memory +operation can be either a load operation, a store operation, or both +simultaneously. All memory operations are single-copy atomic: they can +never be observed in a partially complete state. +
+Among instructions in RV32GC and RV64GC, each aligned memory instruction +gives rise to exactly one memory operation, with two exceptions. First, +an unsuccessful SC instruction does not give rise to any memory +operations. Second, FLD and FSD instructions may each give rise to +multiple memory operations if XLEN<64, as stated in +[fld_fsd] and clarified below. An aligned AMO +gives rise to a single memory operation that is both a load operation +and a store operation simultaneously.
+Instructions in the RV128 base instruction set and in future ISA +extensions such as V (vector) and P (SIMD) may give rise to multiple +memory operations. However, the memory model for these extensions has +not yet been formalized. + |
+
A misaligned load or store instruction may be decomposed into a set of +component memory operations of any granularity. An FLD or FSD +instruction for which XLEN<64 may also be decomposed into +a set of component memory operations of any granularity. The memory +operations generated by such instructions are not ordered with respect +to each other in program order, but they are ordered normally with +respect to the memory operations generated by preceding and subsequent +instructions in program order. +The atomics extension "A" does not require execution environments to support +misaligned atomic instructions at all. +However, if misaligned atomics are supported via the misaligned atomicity +granule PMA, then AMOs within an atomicity granule are not decomposed, nor are +loads and stores defined in the base ISAs, nor are loads and stores of no more +than XLEN bits defined in the F, D, and Q extensions. +
+The decomposition of misaligned memory operations down to byte +granularity facilitates emulation on implementations that do not +natively support misaligned accesses. Such implementations might, for +example, simply iterate over the bytes of a misaligned access one by +one. + |
+
An LR instruction and an SC instruction are said to be paired if the +LR precedes the SC in program order and if there are no other LR or SC +instructions in between; the corresponding memory operations are said to +be paired as well (except in case of a failed SC, where no store +operation is generated). The complete list of conditions determining +whether an SC must succeed, may succeed, or must fail is defined in +[sec:lrsc].
+Load and store operations may also carry one or more ordering +annotations from the following set: "acquire-RCpc", "acquire-RCsc", +"release-RCpc", and "release-RCsc". An AMO or LR instruction with +aq set has an "acquire-RCsc" annotation. An AMO or SC instruction +with rl set has a "release-RCsc" annotation. An AMO, LR, or SC +instruction with both aq and rl set has both "acquire-RCsc" and +"release-RCsc" annotations.
+For convenience, we use the term "acquire annotation" to refer to an +acquire-RCpc annotation or an acquire-RCsc annotation. Likewise, a +"release annotation" refers to a release-RCpc annotation or a +release-RCsc annotation. An "RCpc annotation" refers to an +acquire-RCpc annotation or a release-RCpc annotation. An RCsc +annotation refers to an acquire-RCsc annotation or a release-RCsc +annotation.
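As an illustration of pairing and of the RCsc annotations (this sketch assumes the A extension, which the present core does not implement; see Chapter 14), an atomic increment might be written:

retry:
    lr.w.aq t0, (a0)       # load-reserved; aq gives it an acquire-RCsc annotation
    addi    t0, t0, 1
    sc.w.rl t1, t0, (a0)   # paired store-conditional; rl gives it a release-RCsc annotation
    bnez    t1, retry      # a failed SC generates no store operation, so simply retry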
+In the memory model literature, the term "RCpc" stands for release +consistency with processor-consistent synchronization operations, and +the term "RCsc" stands for release consistency with sequentially +consistent synchronization operations. +
+
+While there are many different definitions for acquire and release +annotations in the literature, in the context of RVWMO these terms are +concisely and completely defined by Preserved Program Order rules 5-7. +
+
+"RCpc" annotations are currently only used when implicitly assigned to +every memory access per the standard extension "Ztso" +(Chapter 19). Furthermore, although the ISA does not +currently contain native load-acquire or store-release instructions, nor +RCpc variants thereof, the RVWMO model itself is designed to be +forwards-compatible with the potential addition of any or all of the +above into the ISA in a future extension. + |
+
18.1.2. Syntactic Dependencies
+The definition of the RVWMO memory model depends in part on the notion +of a syntactic dependency, defined as follows.
+In the context of defining dependencies, a register refers either to +an entire general-purpose register, some portion of a CSR, or an entire +CSR. The granularity at which dependencies are tracked through CSRs is +specific to each CSR and is defined in +Section 18.2.
+Syntactic dependencies are defined in terms of instructions' source +registers, instructions' destination registers, and the way +instructions carry a dependency from their source registers to their +destination registers. This section provides a general definition of all +of these terms; however, Section 18.3 provides a +complete listing of the specifics for each instruction.
+In general, a register r other than x0
is a source
+register for an instruction i if any of the following
+hold:
- In the opcode of i, rs1, rs2, or rs3 is set to r
- i is a CSR instruction, and in the opcode of i, csr is set to r, unless i is CSRRW or CSRRWI and rd is set to x0
- r is a CSR and an implicit source register for i, as defined in Section 18.3
- r is a CSR that aliases with another source register for i
Memory instructions also further specify which source registers are +address source registers and which are data source registers.
+In general, a register r other than x0
is a destination
+register for an instruction i if any of the following
+hold:
- In the opcode of i, rd is set to r
- i is a CSR instruction, and in the opcode of i, csr is set to r, unless i is CSRRS or CSRRC and rs1 is set to x0 or i is CSRRSI or CSRRCI and uimm[4:0] is set to zero
- r is a CSR and an implicit destination register for i, as defined in Section 18.3
- r is a CSR that aliases with another destination register for i
Most non-memory instructions carry a dependency from each of their +source registers to each of their destination registers. However, there +are exceptions to this rule; see Section 18.3.
+Instruction j has a syntactic dependency on instruction +i via destination register s of +i and source register r of j +if either of the following hold:
+-
+
-
+
s is the same as r, and no instruction +program-ordered between i and j has +r as a destination register
+
+ -
+
There is an instruction m program-ordered between +i and j such that all of the following hold:
+++-
+
-
+
j has a syntactic dependency on m via +destination register q and source register r
+
+ -
+
m has a syntactic dependency on i via +destination register s and source register p
+
+ -
+
m carries a dependency from p to +q
+
+
+ -
+
Finally, in the definitions that follow, let a and +b be two memory operations, and let i and +j be the instructions that generate a and +b, respectively.
+b has a syntactic address dependency on a +if r is an address source register for j and +j has a syntactic dependency on i via source +register r
+b has a syntactic data dependency on a if +b is a store operation, r is a data source +register for j, and j has a syntactic +dependency on i via source register r
+b has a syntactic control dependency on a +if there is an instruction m program-ordered between +i and j such that m is a +branch or indirect jump and m has a syntactic dependency +on i.
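A short sketch of the three kinds of dependencies (the register names and the skip label are arbitrary):

    lw   t0, 0(a0)    # load a (instruction i)
    add  t1, t0, x0   # carries a dependency from t0 to t1
    sw   t2, 0(t1)    # this store has a syntactic address dependency on a (through t1)
    sw   t0, 0(a1)    # this store has a syntactic data dependency on a (t0 is its data source)
    beqz t0, skip     # branch with a syntactic dependency on a
    sw   t3, 0(a2)    # this store has a syntactic control dependency on a
skip: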
+Generally speaking, non-AMO load instructions do not have data source +registers, and unconditional non-AMO store instructions do not have +destination registers. However, a successful SC instruction is +considered to have the register specified in rd as a destination +register, and hence it is possible for an instruction to have a +syntactic dependency on a successful SC instruction that precedes it in +program order. + |
+
18.1.3. Preserved Program Order
+The global memory order for any given execution of a program respects +some but not all of each hart’s program order. The subset of program +order that must be respected by the global memory order is known as +preserved program order.
+The complete definition of preserved program order is as follows (and +note that AMOs are simultaneously both loads and stores): memory +operation a precedes memory operation b in +preserved program order (and hence also in the global memory order) if +a precedes b in program order, +a and b both access regular main memory +(rather than I/O regions), and any of the following hold:
+-
+
-
+
Overlapping-Address Orderings:
+++-
+
-
+
b is a store, and +a and b access overlapping memory addresses
+
+ -
+
a and b are loads, +x is a byte read by both a and +b, there is no store to x between +a and b in program order, and +a and b return values for x +written by different memory operations
+
+ -
+
a is +generated by an AMO or SC instruction, b is a load, and +b returns a value written by a
+
+
+ -
+
-
+
Explicit Synchronization
+++-
+
-
+
There is a FENCE instruction that +orders a before b
+
+ -
+
a has an acquire +annotation
+
+ -
+
b has a release annotation
+
+ -
+
a and b both have +RCsc annotations
+
+ -
+
a is paired with +b
+
+
+ -
+
-
+
Syntactic Dependencies
+++-
+
-
+
b has a syntactic address +dependency on a
+
+ -
+
b has a syntactic data +dependency on a
+
+ -
+
b is a store, and +b has a syntactic control dependency on a
+
+
+ -
+
-
+
Pipeline Dependencies
+++-
+
-
+
b is a +load, and there exists some store m between +a and b in program order such that +m has an address or data dependency on a, +and b returns a value written by m
+
+ -
+
b is a store, and +there exists some instruction m between a +and b in program order such that m has an +address dependency on a
+
+
+ -
+
18.1.4. Memory Model Axioms
+An execution of a RISC-V program obeys the RVWMO memory consistency +model only if there exists a global memory order conforming to preserved +program order and satisfying the load value axiom, the atomicity +axiom, and the progress axiom.
+Load Value Axiom
+Each byte of each load i returns the value written to that +byte by the store that is the latest in global memory order among the +following stores:
+-
+
-
+
Stores that write that byte and that precede i in the +global memory order
+
+ -
+
Stores that write that byte and that precede i in +program order
+
+
Atomicity Axiom
+If r and w are paired load and store +operations generated by aligned LR and SC instructions in a hart +h, s is a store to byte x, and +r returns a value written by s, then +s must precede w in the global memory order, +and there can be no store from a hart other than h to byte +x following s and preceding w +in the global memory order.
+The Atomicity Axiom theoretically supports LR/SC pairs of different widths and to +mismatched addresses, since implementations are permitted to allow SC +operations to succeed in such cases. However, in practice, we expect +such patterns to be rare, and their use is discouraged. + |
+
Progress Axiom
+No memory operation may be preceded in the global memory order by an +infinite sequence of other memory operations.
+18.2. CSR Dependency Tracking Granularity
Name   | Portions Tracked as Independent Units | Aliases
-------|---------------------------------------|------------
fflags | Bits 4, 3, 2, 1, 0                    | fcsr
frm    | entire CSR                            | fcsr
fcsr   | Bits 7-5, 4, 3, 2, 1, 0               | fflags, frm
Note: read-only CSRs are not listed, as they do not participate in the +definition of syntactic dependencies.
+18.3. Source and Destination Register Listings
+This section provides a concrete listing of the source and destination +registers for each instruction. These listings are used in the +definition of syntactic dependencies in +Section 18.1.2.
+The term "accumulating CSR" is used to describe a CSR that is both a +source and a destination register, but which carries a dependency only +from itself to itself.
+Instructions carry a dependency from each source register in the +"Source Registers" column to each destination register in the +"Destination Registers" column, from each source register in the +"Source Registers" column to each CSR in the "Accumulating CSRs" +column, and from each CSR in the "Accumulating CSRs" column to itself, +except where annotated otherwise.
+Key:
+-
+
-
+
AAddress source register
+
+ -
+
DData source register
+
+ -
+
† The instruction does not carry a dependency from +any source register to any destination register
+
+ -
+
‡ The instruction carries dependencies from source +register(s) to destination register(s) as specified
+
+
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
LUI |
++ | rd |
++ | + |
AUIPC |
++ | rd |
++ | + |
JAL |
++ | rd |
++ | + |
JALR† |
+rs1 |
+rd |
++ | + |
BEQ |
+rs1, rs2 |
++ | + | + |
BNE |
+rs1, rs2 |
++ | + | + |
BLT |
+rs1, rs2 |
++ | + | + |
BGE |
+rs1, rs2 |
++ | + | + |
BLTU |
+rs1, rs2 |
++ | + | + |
BGEU |
+rs1, rs2 |
++ | + | + |
LB † |
+rs1 A |
+rd |
++ | + |
LH † |
+rs1 A |
+rd |
++ | + |
LW † |
+rs1 A |
+rd |
++ | + |
LBU † |
+rs1 A |
+rd |
++ | + |
LHU † |
+rs1 A |
+rd |
++ | + |
SB |
+rs1 A, rs2 D |
++ | + | + |
SH |
+rs1 A, rs2 D |
++ | + | + |
SW |
+rs1 A, rs2 D |
++ | + | + |
ADDI |
+rs1 |
+rd |
++ | + |
SLTI |
+rs1 |
+rd |
++ | + |
SLTIU |
+rs1 |
+rd |
++ | + |
XORI |
+rs1 |
+rd |
++ | + |
ORI |
+rs1 |
+rd |
++ | + |
ANDI |
+rs1 |
+rd |
++ | + |
SLLI |
+rs1 |
+rd |
++ | + |
SRLI |
+rs1 |
+rd |
++ | + |
SRAI |
+rs1 |
+rd |
++ | + |
ADD |
+rs1, rs2 |
+rd |
++ | + |
SUB |
+rs1, rs2 |
+rd |
++ | + |
SLL |
+rs1, rs2 |
+rd |
++ | + |
SLT |
+rs1, rs2 |
+rd |
++ | + |
SLTU |
+rs1, rs2 |
+rd |
++ | + |
XOR |
+rs1, rs2 |
+rd |
++ | + |
SRL |
+rs1, rs2 |
+rd |
++ | + |
SRA |
+rs1, rs2 |
+rd |
++ | + |
OR |
+rs1, rs2 |
+rd |
++ | + |
AND |
+rs1, rs2 |
+rd |
++ | + |
FENCE |
++ | + | + | + |
FENCE.I |
++ | + | + | + |
ECALL |
++ | + | + | + |
EBREAK |
++ | + | + | + |
CSRRW‡ |
+rs1, csr* |
+rd, csr |
++ | *unless rd= |
+
CSRRS‡ |
+rs1, csr |
+rd *, csr |
++ | *unless rs1= |
+
CSRRC‡ |
+rs1, csr |
+rd *, csr |
++ | *unless rs1= |
+
‡ carries a dependency from rs1 to csr and from csr to rd |
+||||
CSRRWI ‡ |
+csr * |
+rd, csr |
++ | *unless rd=x0 |
+
CSRRSI ‡ |
+csr |
+rd, csr* |
++ | *unless uimm[4:0]=0 |
+
CSRRCI ‡ |
+csr |
+rd, csr* |
++ | *unless uimm[4:0]=0 |
+
‡ carries a dependency from csr to rd |
+
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
LWU † |
+rs1 A |
+rd |
++ | + |
LD † |
+rs1 A |
+rd |
++ | + |
SD |
+rs1 A, rs2 D |
++ | + | + |
SLLI |
+rs1 |
+rd |
++ | + |
SRLI |
+rs1 |
+rd |
++ | + |
SRAI |
+rs1 |
+rd |
++ | + |
ADDIW |
+rs1 |
+rd |
++ | + |
SLLIW |
+rs1 |
+rd |
++ | + |
SRLIW |
+rs1 |
+rd |
++ | + |
SRAIW |
+rs1 |
+rd |
++ | + |
ADDW |
+rs1, rs2 |
+rd |
++ | + |
SUBW |
+rs1, rs2 |
+rd |
++ | + |
SLLW |
+rs1, rs2 |
+rd |
++ | + |
SRLW |
+rs1, rs2 |
+rd |
++ | + |
SRAW |
+rs1, rs2 |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
MUL |
+rs1, rs2 |
+rd |
++ | + |
MULH |
+rs1, rs2 |
+rd |
++ | + |
MULHSU |
+rs1, rs2 |
+rd |
++ | + |
MULHU |
+rs1, rs2 |
+rd |
++ | + |
DIV |
+rs1, rs2 |
+rd |
++ | + |
DIVU |
+rs1, rs2 |
+rd |
++ | + |
REM |
+rs1, rs2 |
+rd |
++ | + |
REMU |
+rs1, rs2 |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
MULW |
+rs1, rs2 |
+rd |
++ | + |
DIVW |
+rs1, rs2 |
+rd |
++ | + |
DIVUW |
+rs1, rs2 |
+rd |
++ | + |
REMW |
+rs1, rs2 |
+rd |
++ | + |
REMUW |
+rs1, rs2 |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
LR.W† |
+rs1 A |
+rd |
++ | + |
SC.W† |
+rs1 A, rs2 D |
+rd * |
++ | * if successful |
+
AMOSWAP.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOADD.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOXOR.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOAND.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOOR.W† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOMIN.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOMAX.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOMINU.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOMAXU.W† |
+rs1 A, rs2 D |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
LR.D† |
+rs1 A |
+rd |
++ | + |
SC.D† |
+rs1 A, rs2 D |
+rd * |
++ | *if successful |
+
AMOSWAP.D† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOADD.D† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOXOR.D† |
+rs1 A, rs2 D |
+rd |
++ | + |
AMOAND.D† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOOR.D† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOMIN.D† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOMAX.D† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOMINU.D† |
+rs1 A, rs2D |
+rd |
++ | + |
AMOMAXU.D† |
+rs1 A, rs2D |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
FLW† |
+rs1 A |
+rd |
++ | + |
FSW |
+rs1 A, rs2D |
++ | + | + |
FMADD.S |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FMSUB.S |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FNMSUB.S |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FNMADD.S |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FADD.S |
+rs1, rs2, frm* |
+rd |
+NV, OF, NX |
+*if rm=111 |
+
FSUB.S |
+rs1, rs2, frm* |
+rd |
+NV, OF, NX |
+*if rm=111 |
+
FMUL.S |
+rs1, rs2, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FDIV.S |
+rs1, rs2, frm* |
+rd |
+NV, DZ, OF, UF, NX |
+*if rm=111 |
+
FSQRT.S |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FSGNJ.S |
+rs1, rs2 |
+rd |
++ | + |
FSGNJN.S |
+rs1, rs2 |
+rd |
++ | + |
FSGNJX.S |
+rs1, rs2 |
+rd |
++ | + |
FMIN.S |
+rs1, rs2 |
+rd |
+NV |
++ |
FMAX.S |
+rs1, rs2 |
+rd |
+NV |
++ |
FCVT.W.S |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.WU.S |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FMV.X.W |
+rs1 |
+rd |
++ | + |
FEQ.S |
+rs1, rs2 |
+rd |
+NV |
++ |
FLT.S |
+rs1, rs2 |
+rd |
+NV |
++ |
FLE.S |
+rs1, rs2 |
+rd |
+NV |
++ |
FCLASS.S |
+rs1 |
+rd |
++ | + |
FCVT.S.W |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
FCVT.S.WU |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
FMV.W.X |
+rs1 |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
FCVT.L.S |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.LU.S |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.S.L |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
FCVT.S.LU |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
FLD† |
+rs1 A |
+rd |
++ | + |
FSD |
+rs1 A, rs2D |
++ | + | + |
FMADD.D |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FMSUB.D |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FNMSUB.D |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FNMADD.D |
+rs1, rs2, rs3, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FADD.D |
+rs1, rs2, frm* |
+rd |
+NV, OF, NX |
+*if rm=111 |
+
FSUB.D |
+rs1, rs2, frm* |
+rd |
+NV, OF, NX |
+*if rm=111 |
+
FMUL.D |
+rs1, rs2, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FDIV.D |
+rs1, rs2, frm* |
+rd |
+NV, DZ, OF, UF, NX |
+*if rm=111 |
+
FSQRT.D |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FSGNJ.D |
+rs1, rs2 |
+rd |
++ | + |
FSGNJN.D |
+rs1, rs2 |
+rd |
++ | + |
FSGNJX.D |
+rs1, rs2 |
+rd |
++ | + |
FMIN.D |
+rs1, rs2 |
+rd |
+NV |
++ |
FMAX.D |
+rs1, rs2 |
+rd |
+NV |
++ |
FCVT.S.D |
+rs1, frm* |
+rd |
+NV, OF, UF, NX |
+*if rm=111 |
+
FCVT.D.S |
+rs1 |
+rd |
+NV |
++ |
FEQ.D |
+rs1, rs2 |
+rd |
+NV |
++ |
FLT.D |
+rs1, rs2 |
+rd |
+NV |
++ |
FLE.D |
+rs1, rs2 |
+rd |
+NV |
++ |
FCLASS.D |
+rs1 |
+rd |
++ | + |
FCVT.W.D |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.WU.D |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.D.W |
+rs1 |
+rd |
++ | + |
FCVT.D.WU |
+rs1 |
+rd |
++ | + |
+ | Source Registers | +Destination Registers | +Accumulating CSRs | ++ |
---|---|---|---|---|
FCVT.L.D |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FCVT.LU.D |
+rs1, frm* |
+rd |
+NV, NX |
+*if rm=111 |
+
FMV.X.D |
+rs1 |
+rd |
++ | + |
FCVT.D.L |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
FCVT.D.LU |
+rs1, frm* |
+rd |
+NX |
+*if rm=111 |
+
FMV.D.X |
+rs1 |
+rd |
++ | + |
19. "Ztso" Extension for Total Store Ordering, Version 1.0
+20. "CMO" Extensions for Base Cache Management Operation ISA, Version 1.0.0
+21. "F" Extension for Single-Precision Floating-Point, Version 2.2
+22. "D" Extension for Double-Precision Floating-Point, Version 2.2
+23. "Q" Extension for Quad-Precision Floating-Point, Version 2.2
+24. "Zfh" and "Zfhmin" Extensions for Half-Precision Floating-Point, Version 1.0
+25. "BF16" Extensions for for BFloat16-precision Floating-Point, Version 1.0
+26. "Zfa" Extension for Additional Floating-Point Instructions, Version 1.0
+27. "Zfinx", "Zdinx", "Zhinx", "Zhinxmin" Extensions for Floating-Point in Integer Registers, Version 1.0
+28. "C" Extension for Compressed Instructions, Version 2.0
+This chapter describes the RISC-V standard compressed instruction-set +extension, named "C", which reduces static and dynamic code size by +adding short 16-bit instruction encodings for common operations. The C +extension can be added to any of the base ISAs (RV32, RV64, RV128), and +we use the generic term "RVC" to cover any of these. Typically, +50%-60% of the RISC-V instructions in a program can be replaced with RVC +instructions, resulting in a 25%-30% code-size reduction.
+28.1. Overview
+RVC uses a simple compression scheme that offers shorter 16-bit versions +of common 32-bit RISC-V instructions when:
+-
+
-
+
the immediate or address offset is small, or
+
+ -
+
one of the registers is the zero register (
+x0
), the ABI link register +(x1
), or the ABI stack pointer (x2
), or
+ -
+
the destination register and the first source register are identical, or
+
+ -
+
the registers used are the 8 most popular ones.
+
+
The C extension is compatible with all other standard instruction +extensions. The C extension allows 16-bit instructions to be freely +intermixed with 32-bit instructions, with the latter now able to start +on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C +extension, no instructions can raise instruction-address-misaligned +exceptions.
+Removing the 32-bit alignment constraint on the original 32-bit +instructions allows significantly greater code density. + |
+
The compressed instruction encodings are mostly common across RV32C, +RV64C, and RV128C, but as shown in Table 34, a few opcodes are used for +different purposes depending on base ISA. For example, the wider +address-space RV64C and RV128C variants require additional opcodes to +compress loads and stores of 64-bit integer values, while RV32C uses the +same opcodes to compress loads and stores of single-precision +floating-point values. Similarly, RV128C requires additional opcodes to +capture loads and stores of 128-bit integer values, while these same +opcodes are used for loads and stores of double-precision floating-point +values in RV32C and RV64C. If the C extension is implemented, the +appropriate compressed floating-point load and store instructions must +be provided whenever the relevant standard floating-point extension (F +and/or D) is also implemented. In addition, RV32C includes a compressed +jump and link instruction to compress short-range subroutine calls, +where the same opcode is used to compress ADDIW for RV64C and RV128C.
+Double-precision loads and stores are a significant fraction of static +and dynamic instructions, hence the motivation to include them in the +RV32C and RV64C encoding. +
+
+Although single-precision loads and stores are not a significant source +of static or dynamic compression for benchmarks compiled for the +currently supported ABIs, for microcontrollers that only provide +hardware single-precision floating-point units and have an ABI that only +supports single-precision floating-point numbers, the single-precision +loads and stores will be used at least as frequently as double-precision +loads and stores in the measured benchmarks. Hence, the motivation to +provide compressed support for these in RV32C. +
+
+Short-range subroutine calls are more likely in small binaries for +microcontrollers, hence the motivation to include these in RV32C. +
+
+Although reusing opcodes for different purposes for different base ISAs +adds some complexity to documentation, the impact on implementation +complexity is small even for designs that support multiple base ISAs. +The compressed floating-point load and store variants use the same +instruction format with the same register specifiers as the wider +integer loads and stores. + |
+
RVC was designed under the constraint that each RVC instruction expands +into a single 32-bit instruction in either the base ISA (RV32I/E, RV64I/E, +or RV128I) or the F and D standard extensions where present. Adopting +this constraint has two main benefits:
+-
+
-
+
Hardware designs can simply expand RVC instructions during decode, +simplifying verification and minimizing modifications to existing +microarchitectures.
+
+ -
+
Compilers can be unaware of the RVC extension and leave code compression +to the assembler and linker, although a compression-aware compiler will +generally be able to produce better results.
+
+
+We felt the multiple complexity reductions of a simple one-one mapping +between C and base IFD instructions far outweighed the potential gains +of a slightly denser encoding that added additional instructions only +supported in the C extension, or that allowed encoding of multiple IFD +instructions in one C instruction. + |
+
It is important to note that the C extension is not designed to be a +stand-alone ISA, and is meant to be used alongside a base ISA.
+Variable-length instruction sets have long been used to improve code +density. For example, the IBM Stretch (Buchholz, 1962), developed in the late 1950s, had +an ISA with 32-bit and 64-bit instructions, where some of the 32-bit +instructions were compressed versions of the full 64-bit instructions. +Stretch also employed the concept of limiting the set of registers that +were addressable in some of the shorter instruction formats, with short +branch instructions that could only refer to one of the index registers. +The later IBM 360 architecture (Amdahl et al., 1964) supported a simple variable-length +instruction encoding with 16-bit, 32-bit, or 48-bit instruction formats. +
+
+In 1963, CDC introduced the Cray-designed CDC 6600 (Thornton, 1965), a precursor to RISC +architectures, that introduced a register-rich load-store architecture +with instructions of two lengths, 15-bits and 30-bits. The later Cray-1 +design used a very similar instruction format, with 16-bit and 32-bit +instruction lengths. +
+
+The initial RISC ISAs from the 1980s all picked performance over code +size, which was reasonable for a workstation environment, but not for +embedded systems. Hence, both ARM and MIPS subsequently made versions of +the ISAs that offered smaller code size by offering an alternative +16-bit wide instruction set instead of the standard 32-bit wide +instructions. The compressed RISC ISAs reduced code size relative to +their starting points by about 25-30%, yielding code that was +significantly smaller than 80x86. This result surprised some, as their +intuition was that the variable-length CISC ISA should be smaller than +RISC ISAs that offered only 16-bit and 32-bit formats. +
+
+Since the original RISC ISAs did not leave sufficient opcode space free +to include these unplanned compressed instructions, they were instead +developed as complete new ISAs. This meant compilers needed different +code generators for the separate compressed ISAs. The first compressed +RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only a fixed +16-bit instruction size, which gave good reductions in static code size +but caused an increase in dynamic instruction count, which led to lower +performance compared to the original fixed-width 32-bit instruction +size. This led to the development of a second generation of compressed +RISC ISA designs with mixed 16-bit and 32-bit instruction lengths (e.g., +ARM Thumb2, microMIPS, PowerPC VLE), so that performance was similar to +pure 32-bit instructions but with significant code size savings. +Unfortunately, these different generations of compressed ISAs are +incompatible with each other and with the original uncompressed ISA, +leading to significant complexity in documentation, implementations, and +software tools support. +
+
+Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently support a compressed instruction format. It is surprising that the most popular 64-bit ISA for mobile platforms (ARM v8) does not include a compressed instruction format given that static code size and dynamic instruction fetch bandwidth are important metrics. Although static code size is not a major concern in larger systems, instruction fetch bandwidth can be a major bottleneck in servers running commercial workloads, which often have a large instruction working set.
+
+Benefiting from 25 years of hindsight, RISC-V was designed to support +compressed instructions from the outset, leaving enough opcode space for +RVC to be added as a simple extension on top of the base ISA (along with +many other extensions). The philosophy of RVC is to reduce code size for +embedded applications and to improve performance and energy-efficiency +for all applications due to fewer misses in the instruction cache. +Waterman shows that RVC fetches 25%-30% fewer instruction bits, which +reduces instruction cache misses by 20%-25%, or roughly the same +performance impact as doubling the instruction cache size. (Waterman, 2011) + |
+
28.2. Compressed Instruction Formats
+Table 21 shows the nine compressed instruction
+formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW,
+CL, CS, CA, and CB are limited to just 8 of them.
+Table 22 lists these popular registers, which
+correspond to registers x8
to x15
. Note that there is a separate
+version of load and store instructions that use the stack pointer as the
+base address register, since saving to and restoring from the stack are
+so prevalent, and that they use the CI and CSS formats to allow access
+to all 32 data registers. CIW supplies an 8-bit immediate for the
+ADDI4SPN instruction.
+The RISC-V ABI was changed to make the frequently used registers map to +registers 'x8-x15'. This simplifies the decompression decoder by +having a contiguous naturally aligned set of register numbers, and is +also compatible with the RV32E and RV64E base ISAs, which only have 16 integer +registers. + |
+
Compressed register-based floating-point loads and stores also use the
+CL and CS formats respectively, with the eight registers mapping to f8
to f15
.
+
The standard RISC-V calling convention maps the most frequently used floating-point registers to registers f8 to f15, which allows the same register decompression decoding as for integer register numbers.
+
+The formats were designed to keep bits for the two register source +specifiers in the same place in all instructions, while the destination +register field can move. When the full 5-bit destination register +specifier is present, it is in the same place as in the 32-bit RISC-V +encoding. Where immediates are sign-extended, the sign extension is +always from bit 12. Immediate fields have been scrambled, as in the base +specification, to reduce the number of immediate muxes required.
+The immediate fields are scrambled in the instruction formats instead of +in sequential order so that as many bits as possible are in the same +position in every instruction, thereby simplifying implementations. + |
+
For many RVC instructions, zero-valued immediates are disallowed and
+x0
is not a valid 5-bit register specifier. These restrictions free up
+encoding space for other instructions requiring fewer operand bits.
28.3. Load and Store Instructions
+To increase the reach of 16-bit instructions, data-transfer instructions +use zero-extended immediates that are scaled by the size of the data in +bytes: ×4 for words, ×8 for double +words, and ×16 for quad words.
+RVC provides two variants of loads and stores. One uses the ABI stack
+pointer, x2
, as the base address and can target any data register. The
+other can reference one of 8 base address registers and one of 8 data
+registers.
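A sketch contrasting the two variants (a4 and a5 are x14 and x15, so they satisfy the 8-register restriction):

    c.lwsp a5, 16(sp)   # CI format, sp-based: any of the 32 registers; expands to lw a5, 16(sp)
    c.lw   a4, 8(a5)    # CL format: base and destination limited to x8-x15; expands to lw a4, 8(a5)
    c.swsp a4, 20(sp)   # CSS format: expands to sw a4, 20(sp)
    c.sw   a4, 12(a5)   # CS format: expands to sw a4, 12(a5)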
28.3.1. Stack-Pointer-Based Loads and Stores
+These instructions use the CI format.
+C.LWSP loads a 32-bit value from memory into register rd. It computes
+an effective address by adding the zero-extended offset, scaled by 4,
+to the stack pointer, x2
. It expands to lw rd, offset(x2)
. C.LWSP is
+only valid when rd≠x0; the code points with rd=x0 are reserved.
C.LDSP is an RV64C/RV128C-only instruction that loads a 64-bit value
+from memory into register rd. It computes its effective address by
+adding the zero-extended offset, scaled by 8, to the stack pointer,
+x2
. It expands to ld rd, offset(x2)
. C.LDSP is only valid when
+rd≠x0; the code points with
+rd=x0 are reserved.
C.LQSP is an RV128C-only instruction that loads a 128-bit value from
+memory into register rd. It computes its effective address by adding
+the zero-extended offset, scaled by 16, to the stack pointer, x2
. It
+expands to lq rd, offset(x2)
. C.LQSP is only valid when
+rd≠x0; the code points with
+rd=x0 are reserved.
C.FLWSP is an RV32FC-only instruction that loads a single-precision
+floating-point value from memory into floating-point register rd. It
+computes its effective address by adding the zero-extended offset,
+scaled by 4, to the stack pointer, x2
. It expands to
+flw rd, offset(x2)
.
C.FLDSP is an RV32DC/RV64DC-only instruction that loads a
+double-precision floating-point value from memory into floating-point
+register rd. It computes its effective address by adding the
+zero-extended offset, scaled by 8, to the stack pointer, x2
. It
+expands to fld rd, offset(x2)
.
These instructions use the CSS format.
+C.SWSP stores a 32-bit value in register rs2 to memory. It computes an
+effective address by adding the zero-extended offset, scaled by 4, to
+the stack pointer, x2
. It expands to sw rs2, offset(x2)
.
C.SDSP is an RV64C/RV128C-only instruction that stores a 64-bit value in
+register rs2 to memory. It computes an effective address by adding the
+zero-extended offset, scaled by 8, to the stack pointer, x2
. It
+expands to sd rs2, offset(x2)
.
C.SQSP is an RV128C-only instruction that stores a 128-bit value in
+register rs2 to memory. It computes an effective address by adding the
+zero-extended offset, scaled by 16, to the stack pointer, x2
. It
+expands to sq rs2, offset(x2)
.
C.FSWSP is an RV32FC-only instruction that stores a single-precision
+floating-point value in floating-point register rs2 to memory. It
+computes an effective address by adding the zero-extended offset,
+scaled by 4, to the stack pointer, x2
. It expands to
+fsw rs2, offset(x2)
.
C.FSDSP is an RV32DC/RV64DC-only instruction that stores a
+double-precision floating-point value in floating-point register rs2
+to memory. It computes an effective address by adding the
+zero-extended offset, scaled by 8, to the stack pointer, x2
. It
+expands to fsd rs2, offset(x2)
.
+ + | +
+
+
+Register save/restore code at function entry/exit represents a +significant portion of static code size. The stack-pointer-based +compressed loads and stores in RVC are effective at reducing the +save/restore static code size by a factor of 2 while improving +performance by reducing dynamic instruction bandwidth. +
+
A common mechanism used in other ISAs to further reduce save/restore code size is load-multiple and store-multiple instructions. We considered adopting these for RISC-V but noted several drawbacks to these instructions.
+Furthermore, much of the gains can be realized in software by replacing +prologue and epilogue code with subroutine calls to common prologue and +epilogue code, a technique described in Section 5.6 of (Waterman, 2016). +
+
+While reasonable architects might come to different conclusions, we +decided to omit load and store multiple and instead use the +software-only approach of calling save/restore millicode routines to +attain the greatest code size reduction. + |
+
28.3.2. Register-Based Loads and Stores
++These instructions use the CL format.
+C.LW loads a 32-bit value from memory into register
+rd′
. It computes an effective address by adding the
+zero-extended offset, scaled by 4, to the base address in register
+rs1′
. It expands to lw rd′, offset(rs1′)
.
C.LD is an RV64C/RV128C-only instruction that loads a 64-bit value from
+memory into register rd′
. It computes an effective
+address by adding the zero-extended offset, scaled by 8, to the base
+address in register rs1′
. It expands to
+ld rd′, offset(rs1′)
.
C.LQ is an RV128C-only instruction that loads a 128-bit value from
+memory into register rd′
. It computes an effective
+address by adding the zero-extended offset, scaled by 16, to the base
+address in register rs1′
. It expands to
+lq rd′, offset(rs1′)
.
C.FLW is an RV32FC-only instruction that loads a single-precision
+floating-point value from memory into floating-point register
+rd′
. It computes an effective address by adding the
+zero-extended offset, scaled by 4, to the base address in register
+rs1′
. It expands to
+flw rd′, offset(rs1′)
.
C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision
+floating-point value from memory into floating-point register
+rd′
. It computes an effective address by adding the
+zero-extended offset, scaled by 8, to the base address in register
+rs1′
. It expands to
+fld rd′, offset(rs1′)
.
These instructions use the CS format.
+C.SW stores a 32-bit value in register rs2′
to memory.
+It computes an effective address by adding the zero-extended offset,
+scaled by 4, to the base address in register rs1′
. It
+expands to sw rs2′, offset(rs1′)
.
C.SD is an RV64C/RV128C-only instruction that stores a 64-bit value in
+register rs2′
to memory. It computes an effective
+address by adding the zero-extended offset, scaled by 8, to the base
+address in register rs1′
. It expands to
+sd rs2′, offset(rs1′)
.
C.SQ is an RV128C-only instruction that stores a 128-bit value in
+register rs2′
to memory. It computes an effective
+address by adding the zero-extended offset, scaled by 16, to the base
+address in register rs1′
. It expands to
+sq rs2′, offset(rs1′)
.
C.FSW is an RV32FC-only instruction that stores a single-precision
+floating-point value in floating-point register rs2′
to
+memory. It computes an effective address by adding the zero-extended
+offset, scaled by 4, to the base address in register
+rs1′
. It expands to
+fsw rs2′, offset(rs1′)
.
C.FSD is an RV32DC/RV64DC-only instruction that stores a
+double-precision floating-point value in floating-point register
+rs2′
to memory. It computes an effective address by
+adding the zero-extended offset, scaled by 8, to the base address in
+register rs1′
. It expands to
+fsd rs2′, offset(rs1′)
.
28.4. Control Transfer Instructions
+RVC provides unconditional jump instructions and conditional branch +instructions. As with base RVI instructions, the offsets of all RVC +control transfer instructions are in multiples of 2 bytes.
+These instructions use the CJ format.
+C.J performs an unconditional control transfer. The offset is
+sign-extended and added to the pc
to form the jump target address. C.J
+can therefore target a ±2 KiB range. C.J expands to
+jal x0, offset
.
C.JAL is an RV32C-only instruction that performs the same operation as
+C.J, but additionally writes the address of the instruction following
+the jump (pc+2
) to the link register, x1
. C.JAL expands to
+jal x1, offset
.
These instructions use the CR format.
+C.JR (jump register) performs an unconditional control transfer to the
+address in register rs1. C.JR expands to jalr x0, 0(rs1)
. C.JR is
only valid when rs1≠x0; the code point with rs1=x0 is reserved.
C.JALR (jump and link register) performs the same operation as C.JR, but
+additionally writes the address of the instruction following the jump
+(pc
+2) to the link register, x1
. C.JALR expands to
+jalr x1, 0(rs1)
. C.JALR is only valid when
rs1≠x0; the code point with rs1=x0 corresponds to the C.EBREAK
+instruction.
+ + | +
+
+
+Strictly speaking, C.JALR does not expand exactly to a base RVI +instruction as the value added to the PC to form the link address is 2 +rather than 4 as in the base ISA, but supporting both offsets of 2 and 4 +bytes is only a very minor change to the base microarchitecture. + |
+
These instructions use the CB format.
+C.BEQZ performs conditional control transfers. The offset is
+sign-extended and added to the pc
to form the branch target address.
+It can therefore target a ±256 B range. C.BEQZ takes the
+branch if the value in register rs1′ is zero. It
+expands to beq rs1′, x0, offset
.
C.BNEZ is defined analogously, but it takes the branch if
+rs1′ contains a nonzero value. It expands to
+bne rs1′, x0, offset
.
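The following hedged C sketch illustrates how a control-transfer target can be formed from the sign-extended byte offset; the helper names are hypothetical, and the field widths (a 12-bit signed byte offset for C.J, a 9-bit signed byte offset for C.BEQZ/C.BNEZ) follow from the ±2 KiB and ±256 B ranges and the 2-byte granularity stated above.

#include <stdint.h>

/* Sign-extend an n-bit field (illustrative helper, not from the spec). */
static int64_t sext(uint64_t value, unsigned bits)
{
    uint64_t sign = 1ull << (bits - 1);
    return (int64_t)((value ^ sign) - sign);
}

/* offset_bits is 12 for C.J (+-2 KiB) and 9 for C.BEQZ/C.BNEZ (+-256 B);
 * bit 0 of the byte offset is always zero. */
static uint64_t rvc_target(uint64_t pc, uint64_t offset_field, unsigned offset_bits)
{
    return pc + (uint64_t)sext(offset_field, offset_bits);
}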
28.5. Integer Computational Instructions
+RVC provides several instructions for integer arithmetic and constant +generation.
+28.5.1. Integer Constant-Generation Instructions
+The two constant-generation instructions both use the CI instruction +format and can target any integer register.
+C.LI loads the sign-extended 6-bit immediate, imm, into register rd.
+C.LI expands into addi rd, x0, imm
. C.LI is only valid when
+rd≠x0
; the code points with rd=x0
encode HINTs.
C.LUI loads the non-zero 6-bit immediate field into bits 17–12 of the
+destination register, clears the bottom 12 bits, and sign-extends bit 17
+into all higher bits of the destination. C.LUI expands into
+lui rd, imm
. C.LUI is only valid when
rd≠{x0, x2},
+and when the immediate is not equal to zero. The code points with
+imm=0 are reserved; the remaining code points with rd=x0
are
+HINTs; and the remaining code points with rd=x2
correspond to the
+C.ADDI16SP instruction.
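A hedged C sketch of the value produced by C.LUI as described above (place the 6-bit immediate in bits 17-12, clear the low 12 bits, sign-extend from bit 17); the function name is hypothetical.

#include <stdint.h>

/* imm6 is the non-zero 6-bit immediate field. */
static int64_t c_lui_value(uint32_t imm6)
{
    int64_t s = (int64_t)(imm6 ^ 0x20u) - 0x20;   /* sign-extend from bit 5 */
    return (int64_t)((uint64_t)s << 12);          /* e.g. imm6=0x3F -> ...FFFFF000 */
}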
28.5.2. Integer Register-Immediate Operations
+These integer register-immediate operations are encoded in the CI format +and perform operations on an integer register and a 6-bit immediate.
+C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in
+register rd then writes the result to rd. C.ADDI expands into
+addi rd, rd, imm
. C.ADDI is only valid when
+rd≠x0
and imm≠0
. The code
+points with rd=x0
encode the C.NOP instruction; the remaining code
+points with imm=0 encode HINTs.
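The code-point split just described can be summarised in the following hedged C sketch (illustrative only; the names are hypothetical).

enum rvc_addi_kind { RVC_NOP, RVC_HINT, RVC_ADDI };

static enum rvc_addi_kind classify_c_addi(unsigned rd, int imm)
{
    if (rd == 0)                              /* rd=x0 */
        return imm == 0 ? RVC_NOP : RVC_HINT; /* C.NOP, else HINT code point */
    return imm == 0 ? RVC_HINT : RVC_ADDI;    /* imm=0 encodes a HINT */
}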
C.ADDIW is an RV64C/RV128C-only instruction that performs the same
computation but produces a 32-bit result, then sign-extends the result to 64
+bits. C.ADDIW expands into addiw rd, rd, imm
. The immediate can be
+zero for C.ADDIW, where this corresponds to sext.w rd
. C.ADDIW is
+only valid when rd≠x0
; the code points with
+rd=x0
are reserved.
C.ADDI16SP shares the opcode with C.LUI, but has a destination field of
+x2
. C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to the
+value in the stack pointer (sp=x2
), where the immediate is scaled to
+represent multiples of 16 in the range (-512,496). C.ADDI16SP is used to
+adjust the stack pointer in procedure prologues and epilogues. It
+expands into addi x2, x2, nzimm[9:4]
. C.ADDI16SP is only valid when
+nzimm≠0; the code point with nzimm=0 is reserved.
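A hedged C sketch of the stack adjustment performed by C.ADDI16SP (sign-extend the non-zero 6-bit field and scale by 16, giving the (-512,496) range stated above); the names are hypothetical.

#include <stdint.h>

/* nzimm6 is the non-zero 6-bit immediate field. */
static uint64_t c_addi16sp_sp(uint64_t sp, uint32_t nzimm6)
{
    int64_t s = (int64_t)(nzimm6 ^ 0x20u) - 0x20;   /* sign-extend from bit 5 */
    return sp + (uint64_t)(s * 16);                 /* -512 .. +496 in steps of 16 */
}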
+ + | +
+
+
In the standard RISC-V calling convention, the stack pointer sp is always 16-byte aligned.
+
+C.ADDI4SPN is a CIW-format instruction that adds a zero-extended
+non-zero immediate, scaled by 4, to the stack pointer, x2
, and writes
+the result to rd′
. This instruction is used to generate
+pointers to stack-allocated variables, and expands to
+addi rd′, x2, nzuimm[9:2]
. C.ADDI4SPN is only valid when
+nzuimm≠0; the code points with nzuimm=0 are
+reserved.
C.SLLI is a CI-format instruction that performs a logical left shift of
+the value in register rd then writes the result to rd. The shift
+amount is encoded in the shamt field. For RV128C, a shift amount of
+zero is used to encode a shift of 64. C.SLLI expands into
+slli rd, rd, shamt[5:0]
, except for RV128C with shamt=0
, which expands to
+slli rd, rd, 64
.
For RV32C, shamt[5] must be zero; the code points with shamt[5]=1
+are designated for custom extensions. For RV32C and RV64C, the shift
+amount must be non-zero; the code points with shamt=0 are HINTs. For
+all base ISAs, the code points with rd=x0
are HINTs, except those
+with shamt[5]=1 in RV32C.
C.SRLI is a CB-format instruction that performs a logical right shift of
+the value in register rd′ then writes the result to
+rd′. The shift amount is encoded in the shamt field.
+For RV128C, a shift amount of zero is used to encode a shift of 64.
+Furthermore, the shift amount is sign-extended for RV128C, and so the
+legal shift amounts are 1-31, 64, and 96-127. C.SRLI expands into
+srli rd′, rd′, shamt
, except for
+RV128C with shamt=0
, which expands to
+srli rd′, rd′, 64
.
For RV32C, shamt[5] must be zero; the code points with shamt[5]=1 +are designated for custom extensions. For RV32C and RV64C, the shift +amount must be non-zero; the code points with shamt=0 are HINTs.
+C.SRAI is defined analogously to C.SRLI, but instead performs an
+arithmetic right shift. C.SRAI expands to
+srai rd′, rd′, shamt
.
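A hedged C sketch of the effective right-shift amount implied by the C.SRLI/C.SRAI description above (shamt=0 encodes 64, and on RV128 the sign-extended field yields 96-127); it ignores the RV32 custom code points with shamt[5]=1, and the names are hypothetical.

/* shamt6 is the raw 6-bit field; xlen is 32, 64, or 128. */
static unsigned rvc_right_shift_amount(unsigned shamt6, unsigned xlen)
{
    if (shamt6 == 0)
        return 64;                        /* encodes a shift of 64 (RV128 only) */
    if (xlen == 128 && (shamt6 & 0x20))
        return 64 + shamt6;               /* sign-extended field: 96..127 */
    return shamt6;                        /* 1..31 (and 32..63 where legal) */
}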
+ + | +
+
+
+Left shifts are usually more frequent than right shifts, as left shifts +are frequently used to scale address values. Right shifts have therefore +been granted less encoding space and are placed in an encoding quadrant +where all other immediates are sign-extended. For RV128, the decision +was made to have the 6-bit shift-amount immediate also be sign-extended. +Apart from reducing the decode complexity, we believe right-shift +amounts of 96-127 will be more useful than 64-95, to allow extraction of +tags located in the high portions of 128-bit address pointers. We note +that RV128C will not be frozen at the same point as RV32C and RV64C, to +allow evaluation of typical usage of 128-bit address-space codes. + |
+
C.ANDI is a CB-format instruction that computes the bitwise AND of the
+value in register rd′ and the sign-extended 6-bit
+immediate, then writes the result to rd′. C.ANDI
+expands to andi rd′, rd′, imm
.
28.5.3. Integer Register-Register Operations
++These instructions use the CR format.
+C.MV copies the value in register rs2 into register rd. C.MV expands
+into add rd, x0, rs2
. C.MV is only valid when
rs2≠x0; the code points with rs2=x0
correspond to the C.JR instruction. The code points with rs2≠x0
and rd=x0
are HINTs.
+ + | +
+
+
+C.MV expands to a different instruction than the canonical MV +pseudoinstruction, which instead uses ADDI. Implementations that handle +MV specially, e.g. using register-renaming hardware, may find it more +convenient to expand C.MV to MV instead of ADD, at slight additional +hardware cost. + |
+
C.ADD adds the values in registers rd and rs2 and writes the result
+to register rd. C.ADD expands into add rd, rd, rs2
. C.ADD is only
valid when rs2≠x0; the code points with rs2=x0
correspond to the C.JALR
+and C.EBREAK instructions. The code points with rs2≠x0
and rd=x0 are HINTs.
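The CR-format code-point split described for C.MV, C.ADD, C.JR, C.JALR and C.EBREAK can be summarised in the hedged sketch below; the "link" selector stands for the encoding bit that separates the two groups (an assumption for illustration), and all names are hypothetical.

enum cr_kind { CR_JR, CR_MV, CR_JALR, CR_EBREAK, CR_ADD, CR_HINT };

/* link=0 selects the C.MV/C.JR group, link=1 the C.ADD/C.JALR/C.EBREAK group. */
static enum cr_kind classify_cr(int link, unsigned rd_rs1, unsigned rs2)
{
    if (!link) {
        if (rs2 == 0)
            return CR_JR;                      /* reserved when rd_rs1 is also 0 */
        return rd_rs1 == 0 ? CR_HINT : CR_MV;
    }
    if (rs2 == 0)
        return rd_rs1 == 0 ? CR_EBREAK : CR_JALR;
    return rd_rs1 == 0 ? CR_HINT : CR_ADD;
}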
These instructions use the CA format.
+C.AND
computes the bitwise AND
of the values in registers
+rd′ and rs2′, then writes the result
+to register rd′. C.AND
expands into
+and rd′, rd′, rs2′
.
C.OR
computes the bitwise OR
of the values in registers
+rd′ and rs2′, then writes the result
+to register rd′. C.OR
expands into
+or rd′, rd′, rs2′
.
C.XOR
computes the bitwise XOR
of the values in registers
+rd′ and rs2′, then writes the result
+to register rd′. C.XOR
expands into
+xor rd′, rd′, rs2′
.
C.SUB
subtracts the value in register rs2′ from the
+value in register rd′, then writes the result to
+register rd′. C.SUB
expands into
+sub rd′, rd′, rs2′
.
C.ADDW
is an RV64C/RV128C-only instruction that adds the values in
+registers rd′ and rs2′, then
+sign-extends the lower 32 bits of the sum before writing the result to
+register rd′. C.ADDW
expands into
+addw rd′, rd′, rs2′
.
C.SUBW
is an RV64C/RV128C-only instruction that subtracts the value in
+register rs2′ from the value in register
+rd′, then sign-extends the lower 32 bits of the
+difference before writing the result to register rd′.
+C.SUBW
expands into subw rd′, rd′, rs2′
.
+ + | +
+
+
+This group of six instructions do not provide large savings +individually, but do not occupy much encoding space and are +straightforward to implement, and as a group provide a worthwhile +improvement in static and dynamic compression. + |
+
28.5.4. Defined Illegal Instruction
+A 16-bit instruction with all bits zero is permanently reserved as an +illegal instruction.
++ + | +
+
+
+We reserve all-zero instructions to be illegal instructions to help trap +attempts to execute zero-ed or non-existent portions of the memory +space. The all-zero value should not be redefined in any non-standard +extension. Similarly, we reserve instructions with all bits set to 1 +(corresponding to very long instructions in the RISC-V variable-length +encoding scheme) as illegal to capture another common value seen in +non-existent memory regions. + |
+
28.5.5. NOP Instruction
+C.NOP
is a CI-format instruction that does not change any user-visible
+state, except for advancing the pc
and incrementing any applicable
+performance counters. C.NOP
expands to nop
. C.NOP
is only valid when
+imm=0; the code points with imm≠0 encode HINTs.
28.5.6. Breakpoint Instruction
+Debuggers can use the C.EBREAK
instruction, which expands to ebreak
,
+to cause control to be transferred back to the debugging environment.
+C.EBREAK
shares the opcode with the C.ADD
instruction, but with rd and
+rs2 both zero, thus can also use the CR
format.
28.6. Usage of C Instructions in LR/SC Sequences
+On implementations that support the C extension, compressed forms of the +I instructions permitted inside constrained LR/SC sequences, as +described in [sec:lrscseq], are also permitted +inside constrained LR/SC sequences.
++ + | +
+
+
+The implication is that any implementation that claims to support both +the A and C extensions must ensure that LR/SC sequences containing valid +C instructions will eventually complete. + |
+
28.7. HINT Instructions
+A portion of the RVC encoding space is reserved for microarchitectural
+HINTs. Like the HINTs in the RV32I base ISA (see
+HINT Instructions), these instructions do not
+modify any architectural state, except for advancing the pc
and any
+applicable performance counters. HINTs are executed as no-ops on
+implementations that ignore them.
RVC HINTs are encoded as computational instructions that do not modify
+the architectural state, either because rd=x0
(e.g.
+C.ADD x0, t0
), or because rd is overwritten with a copy of itself
+(e.g. C.ADDI t0, 0
).
+ + | +
+
+
+This HINT encoding has been chosen so that simple implementations can +ignore HINTs altogether, and instead execute a HINT as a regular +computational instruction that happens not to mutate the architectural +state. + |
+
RVC HINTs do not necessarily expand to their RVI HINT counterparts. For
+example, C.ADD
x0, a0 might not encode the same HINT as
+ADD
x0, x0, a0.
+ + | +
+
+
+The primary reason to not require an RVC HINT to expand to an RVI HINT +is that HINTs are unlikely to be compressible in the same manner as the +underlying computational instruction. Also, decoupling the RVC and RVI +HINT mappings allows the scarce RVC HINT space to be allocated to the +most popular HINTs, and in particular, to HINTs that are amenable to +macro-op fusion. + |
+
Table 32 lists all RVC HINT code points. For RV32C, 78% +of the HINT space is reserved for standard HINTs. The remainder of the HINT space is designated for custom HINTs; +no standard HINTs will ever be defined in this subspace.
Instruction | Constraints | Code Points | Purpose |
---|---|---|---|
C.NOP | imm≠0 | 63 | Designated for future standard use |
C.ADDI | rd≠x0, imm=0 | 31 | |
C.LI | rd=x0 | 64 | |
C.LUI | rd=x0, imm≠0 | 63 | |
C.MV | rd=x0, rs2≠x0 | 31 | |
C.ADD | rd=x0, rs2≠x0, rs2≠x2-x5 | 27 | |
C.ADD | rd=x0, rs2∈{x2, x3, x4, x5} | 4 | (rs2=x2) C.NTL.P1 (rs2=x3) C.NTL.PALL (rs2=x4) C.NTL.S1 (rs2=x5) C.NTL.ALL |
C.SLLI | rd=x0, imm≠0 | 31 (RV32), 63 (RV64/128) | Designated for custom use |
C.SLLI64 | rd=x0 | 1 | |
C.SLLI64 | rd≠x0, RV32 and RV64 only | 31 | |
C.SRLI64 | RV32 and RV64 only | 8 | |
C.SRAI64 | RV32 and RV64 only | 8 | |
28.8. RVC Instruction Set Listings
Table 24 shows a map of the major opcodes for RVC. Each row of the table corresponds to one quadrant of the encoding space. The last quadrant, which has the two least-significant bits set, corresponds to instructions wider than 16 bits, including those in the base ISAs. Several instructions are only valid for certain operands; when invalid, they are marked either RES to indicate that the opcode is reserved for future standard extensions; Custom to indicate that the opcode is designated for custom extensions; or HINT to indicate that the opcode is reserved for microarchitectural hints (see Section 28.7).
inst[15:13] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 | |
---|---|---|---|---|---|---|---|---|---|
00 | ADDI4SPN | FLD | LW | FLW | Reserved | FSD | SW | FSW | RV32 |
01 | ADDI | JAL | LI | LUI/ADDI16SP | MISC-ALU | J | BEQZ | BNEZ | RV32 |
10 | SLLI | FLDSP | LWSP | FLWSP | J[AL]R/MV/ADD | FSDSP | SWSP | FSWSP | RV32 |
11 | >16b | | | | | | | | |
29. "Zc*" Extension for Code Size Reduction, Version 1.0.0
+29.1. Zc* Overview
+Zc* is a group of extensions that define subsets of the existing C extension (Zca, Zcd, Zcf) and new extensions which only contain 16-bit encodings.
+Zcm* all reuse the encodings for c.fld, c.fsd, c.fldsp, c.fsdsp.
+Instruction | +Zca | +Zcf | +Zcd | +Zcb | +Zcmp | +Zcmt | +
---|---|---|---|---|---|---|
The Zca extension is added as a way to refer to instructions in the C extension that do not include the floating-point loads and stores |
+||||||
C excl. c.f* |
+yes |
++ | + | + | + | + |
The Zcf extension is added as a way to refer to compressed single-precision floating-point load/stores |
+||||||
c.flw |
++ | rv32 |
++ | + | + | + |
c.flwsp |
++ | rv32 |
++ | + | + | + |
c.fsw |
++ | rv32 |
++ | + | + | + |
c.fswsp |
++ | rv32 |
++ | + | + | + |
The Zcd extension is added as a way to refer to compressed double-precision floating-point load/stores |
+||||||
c.fld |
++ | + | yes |
++ | + | + |
c.fldsp |
++ | + | yes |
++ | + | + |
c.fsd |
++ | + | yes |
++ | + | + |
c.fsdsp |
++ | + | yes |
++ | + | + |
Simple operations for use on all architectures |
+||||||
c.lbu |
++ | + | + | yes |
++ | + |
c.lh |
++ | + | + | yes |
++ | + |
c.lhu |
++ | + | + | yes |
++ | + |
c.sb |
++ | + | + | yes |
++ | + |
c.sh |
++ | + | + | yes |
++ | + |
c.zext.b |
++ | + | + | yes |
++ | + |
c.sext.b |
++ | + | + | yes |
++ | + |
c.zext.h |
++ | + | + | yes |
++ | + |
c.sext.h |
++ | + | + | yes |
++ | + |
c.zext.w |
++ | + | + | yes |
++ | + |
c.mul |
++ | + | + | yes |
++ | + |
c.not |
++ | + | + | yes |
++ | + |
PUSH/POP and double move which overlap with c.fsdsp. Complex operations intended for embedded CPUs |
+||||||
cm.push |
++ | + | + | + | yes |
++ |
cm.pop |
++ | + | + | + | yes |
++ |
cm.popret |
++ | + | + | + | yes |
++ |
cm.popretz |
++ | + | + | + | yes |
++ |
cm.mva01s |
++ | + | + | + | yes |
++ |
cm.mvsa01 |
++ | + | + | + | yes |
++ |
Table jump which overlaps with c.fsdsp. Complex operations intended for embedded CPUs |
+||||||
cm.jt |
++ | + | + | + | + | yes |
+
cm.jalt |
++ | + | + | + | + | yes |
+
29.2. C
+The C extension is the superset of the following extensions:
+-
+
-
+
Zca
+
+ -
+
Zcf if F is specified (RV32 only)
+
+ -
+
Zcd if D is specified
+
+
As C defines the same instructions as Zca, Zcf and Zcd, the rule is that:
+-
+
-
+
C always implies Zca
+
+ -
+
C+F implies Zcf (RV32 only)
+
+ -
+
C+D implies Zcd
+
+
29.3. Zce
+The Zce extension is intended to be used for microcontrollers, and includes all relevant Zc extensions.
+-
+
-
+
Specifying Zce on RV32 without F includes Zca, Zcb, Zcmp, Zcmt
+
+ -
+
Specifying Zce on RV32 with F includes Zca, Zcb, Zcmp, Zcmt and Zcf
+
+ -
+
Specifying Zce on RV64 always includes Zca, Zcb, Zcmp, Zcmt
+++-
+
-
+
Zcf doesn’t exist for RV64
+
+
+ -
+
Therefore common ISA strings can be updated as follows to include the relevant Zc extensions, for example:
+-
+
-
+
RV32IMC becomes RV32IM_Zce
+
+ -
+
RV32IMCF becomes RV32IMF_Zce
+
+
29.4. MISA.C
+MISA.C is set if the following extensions are selected:
+-
+
-
+
Zca and not F
+
+ -
+
Zca, Zcf and F is specified (RV32 only)
+
+ -
+
Zca, Zcf and Zcd if D is specified (RV32 only)
+++-
+
-
+
this configuration excludes Zcmp, Zcmt
+
+
+ -
+
-
+
Zca, Zcd if D is specified (RV64 only)
+++-
+
-
+
this configuration excludes Zcmp, Zcmt
+
+
+ -
+
29.5. Zca
The Zca extension is added as a way to refer to instructions in the C extension that do not include the floating-point loads and stores.
Therefore it excludes all 16-bit floating-point loads and stores: c.flw, c.flwsp, c.fsw, c.fswsp, c.fld, c.fldsp, c.fsd, c.fsdsp.
++ + | +
+
+
+the C extension only includes F/D instructions when D and F are also specified + |
+
29.6. Zcf (RV32 only)
+Zcf is the existing set of compressed single precision floating point loads and stores: c.flw, c.flwsp, c.fsw, c.fswsp.
+Zcf is only relevant to RV32, it cannot be specified for RV64.
+The Zcf extension depends on the Zca and F extensions.
+29.7. Zcd
+Zcd is the existing set of compressed double precision floating point loads and stores: c.fld, c.fldsp, c.fsd, c.fsdsp.
+The Zcd extension depends on the Zca and D extensions.
+29.8. Zcb
+Zcb has simple code-size saving instructions which are easy to implement on all CPUs.
+All encodings are currently reserved for all architectures, and have no conflicts with any existing extensions.
++ + | ++Zcb can be implemented on any CPU as the instructions are 16-bit versions of existing 32-bit instructions from the application class profile. + | +
The Zcb extension depends on the Zca extension.
+As shown on the individual instruction pages, many of the instructions in Zcb depend upon another extension being implemented. For example, c.mul is only implemented if M or Zmmul is implemented, and c.sext.b is only implemented if Zbb is implemented.
+The c.mul encoding uses the CA register format along with other instructions such as c.sub, c.xor etc.
++ + | ++ c.sext.w is a pseudo-instruction for c.addiw rd, 0 (RV64) + | +
RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
yes |
+yes |
+c.lbu rd', uimm(rs1') |
++ |
yes |
+yes |
+c.lhu rd', uimm(rs1') |
++ |
yes |
+yes |
+c.lh rd', uimm(rs1') |
++ |
yes |
+yes |
+c.sb rs2', uimm(rs1') |
++ |
yes |
+yes |
+c.sh rs2', uimm(rs1') |
++ |
yes |
+yes |
+c.zext.b rsd' |
++ |
yes |
+yes |
+c.sext.b rsd' |
++ |
yes |
+yes |
+c.zext.h rsd' |
++ |
yes |
+yes |
+c.sext.h rsd' |
++ |
+ | yes |
+c.zext.w rsd' |
++ |
yes |
+yes |
+c.not rsd' |
++ |
yes |
+yes |
+c.mul rsd', rs2' |
++ |
29.9. Zcmp
+The Zcmp extension is a set of instructions which may be executed as a series of existing 32-bit RISC-V instructions.
+This extension reuses some encodings from c.fsdsp. Therefore it is incompatible with Zcd, + which is included when C and D extensions are both present.
++ + | ++Zcmp is primarily targeted at embedded class CPUs due to implementation complexity. Additionally, it is not compatible with architecture class profiles. + | +
The Zcmp extension depends on the Zca extension.
+The PUSH/POP assembly syntax uses several variables, the meaning of which are:
+-
+
-
+
reg_list is a list containing 1 to 13 registers (ra and 0 to 12 s registers)
+++-
+
-
+
valid values: {ra}, {ra, s0}, {ra, s0-s1}, {ra, s0-s2}, …, {ra, s0-s8}, {ra, s0-s9}, {ra, s0-s11}
+
+ -
+
note that {ra, s0-s10} is not valid, giving 12 lists not 13 for better encoding
+
+
+ -
+
-
+
stack_adj is the total size of the stack frame.
+++-
+
-
+
valid values vary with register list length and the specific encoding, see the instruction pages for details.
+
+
+ -
+
RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
yes |
+yes |
+cm.push {reg_list}, -stack_adj |
++ |
yes |
+yes |
+cm.pop {reg_list}, stack_adj |
++ |
yes |
+yes |
+cm.popret {reg_list}, stack_adj |
++ |
yes |
+yes |
+cm.popretz {reg_list}, stack_adj |
++ |
yes |
+yes |
+cm.mva01s rs1', rs2' |
++ |
yes |
+yes |
+cm.mvsa01 r1s', r2s' |
++ |
29.10. Zcmt
+Zcmt adds the table jump instructions and also adds the jvt CSR. The jvt CSR requires a +state enable if Smstateen is implemented. See jvt CSR, table jump base vector and control register for details.
+This extension reuses some encodings from c.fsdsp. Therefore it is incompatible with Zcd, + which is included when C and D extensions are both present.
++ + | ++Zcmt is primarily targeted at embedded class CPUs due to implementation complexity. Additionally, it is not compatible with RVA profiles. + | +
The Zcmt extension depends on the Zca and Zicsr extensions.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
yes |
+yes |
+cm.jt index |
++ |
yes |
+yes |
+cm.jalt index |
++ |
29.11. Zc instruction formats
+Several instructions in this specification use the following new instruction formats.
+Format | +instructions | +15:10 | +9 | +8 | +7 | +6 | +5 | +4 | +3 | +2 | +1 | +0 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
CLB |
+c.lbu |
+funct6 |
+rs1' |
+uimm |
+rd' |
+op |
+||||||
CSB |
+c.sb |
+funct6 |
+rs1' |
+uimm |
+rs2' |
+op |
+||||||
CLH |
+c.lhu, c.lh |
+funct6 |
+rs1' |
+funct1 |
+uimm |
+rd' |
+op |
+|||||
CSH |
+c.sh |
+funct6 |
+rs1' |
+funct1 |
+uimm |
+rs2' |
+op |
+|||||
CU |
+c.[sz]ext.*, c.not |
+funct6 |
+rd'/rs1' |
+funct5 |
+op |
+|||||||
CMMV |
+cm.mvsa01 cm.mva01s |
+funct6 |
+r1s' |
+funct2 |
+r2s' |
+op |
+||||||
CMJT |
+cm.jt cm.jalt |
+funct6 |
+index |
+op |
+||||||||
CMPP |
+cm.push*, cm.pop* |
+funct6 |
+funct2 |
+urlist |
+spimm |
+op |
+
+ + | +
+
+
+c.mul uses the existing CA format. + |
+
29.12. Zcb instructions
+29.12.1. c.lbu
+Synopsis:
+Load unsigned byte, 16-bit encoding
+Mnemonic:
+c.lbu rd', uimm(rs1')
+Encoding (RV32, RV64):
+The immediate offset is formed as follows:
+ uimm[31:2] = 0;
+ uimm[1] = encoding[5];
+ uimm[0] = encoding[6];
+Description:
+This instruction loads a byte from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting byte is zero extended to XLEN bits and is written to rd'.
++ + | +
+
+
+rd' and rs1' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+X(rdc) = EXTZ(mem[X(rs1c)+EXTZ(uimm)][7..0]);
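As a hedged supplement to the pseudo-code above, the C sketch below extracts the c.lbu offset exactly as the immediate-formation rules state (uimm[1] = encoding[5], uimm[0] = encoding[6]); the function name is hypothetical.

#include <stdint.h>

static uint32_t c_lbu_uimm(uint32_t encoding)
{
    uint32_t uimm = 0;
    uimm |= ((encoding >> 5) & 1u) << 1;   /* uimm[1] = encoding[5] */
    uimm |= ((encoding >> 6) & 1u) << 0;   /* uimm[0] = encoding[6] */
    return uimm;                           /* resulting range: 0..3 */
}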
+29.12.2. c.lhu
+Synopsis:
+Load unsigned halfword, 16-bit encoding
+Mnemonic:
+c.lhu rd', uimm(rs1')
+Encoding (RV32, RV64):
+The immediate offset is formed as follows:
+ uimm[31:2] = 0;
+ uimm[1] = encoding[5];
+ uimm[0] = 0;
+Description:
+This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is zero extended to XLEN bits and is written to rd'.
++ + | +
+
+
+rd' and rs1' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+X(rdc) = EXTZ(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);
+29.12.3. c.lh
+Synopsis:
+Load signed halfword, 16-bit encoding
+Mnemonic:
+c.lh rd', uimm(rs1')
+Encoding (RV32, RV64):
+The immediate offset is formed as follows:
+ uimm[31:2] = 0;
+ uimm[1] = encoding[5];
+ uimm[0] = 0;
+Description:
+This instruction loads a halfword from the memory address formed by adding rs1' to the zero extended immediate uimm. The resulting halfword is sign extended to XLEN bits and is written to rd'.
++ + | +
+
+
+rd' and rs1' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+X(rdc) = EXTS(load_mem[X(rs1c)+EXTZ(uimm)][15..0]);
+29.12.4. c.sb
+Synopsis:
+Store byte, 16-bit encoding
+Mnemonic:
+c.sb rs2', uimm(rs1')
+Encoding (RV32, RV64):
+The immediate offset is formed as follows:
+ uimm[31:2] = 0;
+ uimm[1] = encoding[5];
+ uimm[0] = encoding[6];
+Description:
+This instruction stores the least significant byte of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.
++ + | +
+
+
+rs1' and rs2' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+mem[X(rs1c)+EXTZ(uimm)][7..0] = X(rs2c)
+29.12.5. c.sh
+Synopsis:
+Store halfword, 16-bit encoding
+Mnemonic:
+c.sh rs2', uimm(rs1')
+Encoding (RV32, RV64):
+The immediate offset is formed as follows:
+ uimm[31:2] = 0;
+ uimm[1] = encoding[5];
+ uimm[0] = 0;
+Description:
+This instruction stores the least significant halfword of rs2' to the memory address formed by adding rs1' to the zero extended immediate uimm.
++ + | +
+
+
+rs1' and rs2' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+mem[X(rs1c)+EXTZ(uimm)][15..0] = X(rs2c)
+29.12.6. c.zext.b
+Synopsis:
+Zero extend byte, 16-bit encoding
+Mnemonic:
+c.zext.b rd'/rs1'
+Encoding (RV32, RV64):
+Description:
+This instruction takes a single source/destination operand. +It zero-extends the least-significant byte of the operand to XLEN bits by inserting zeros into all of +the bits more significant than 7.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+32-bit equivalent:
+andi rd'/rs1', rd'/rs1', 0xff
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc. + |
+
Operation:
+X(rsdc) = EXTZ(X(rsdc)[7..0]);
+29.12.7. c.sext.b
+Synopsis:
+Sign extend byte, 16-bit encoding
+Mnemonic:
+c.sext.b rd'/rs1'
+Encoding (RV32, RV64):
+Description:
+This instruction takes a single source/destination operand. +It sign-extends the least-significant byte in the operand to XLEN bits by copying the most-significant bit +in the byte (i.e., bit 7) to all of the more-significant bits.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+Zbb is also required.
++ + | ++The SAIL module variable for rd'/rs1' is called rsdc. + | +
Operation:
+X(rsdc) = EXTS(X(rsdc)[7..0]);
+29.12.8. c.zext.h
+Synopsis:
+Zero extend halfword, 16-bit encoding
+Mnemonic:
+c.zext.h rd'/rs1'
+Encoding (RV32, RV64):
+Description:
+This instruction takes a single source/destination operand. +It zero-extends the least-significant halfword of the operand to XLEN bits by inserting zeros into all of +the bits more significant than 15.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+Zbb is also required.
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc. + |
+
Operation:
+X(rsdc) = EXTZ(X(rsdc)[15..0]);
+29.12.9. c.sext.h
+Synopsis:
+Sign extend halfword, 16-bit encoding
+Mnemonic:
+c.sext.h rd'/rs1'
+Encoding (RV32, RV64):
+Description:
+This instruction takes a single source/destination operand. +It sign-extends the least-significant halfword in the operand to XLEN bits by copying the most-significant bit +in the halfword (i.e., bit 15) to all of the more-significant bits.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+Zbb is also required.
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc. + |
+
Operation:
+X(rsdc) = EXTS(X(rsdc)[15..0]);
+29.12.10. c.zext.w
+Synopsis:
+Zero extend word, 16-bit encoding
+Mnemonic:
+c.zext.w rd'/rs1'
+Encoding (RV64):
+Description:
+This instruction takes a single source/destination operand. +It zero-extends the least-significant word of the operand to XLEN bits by inserting zeros into all of +the bits more significant than 31.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+Zba is also required.
+32-bit equivalent:
+add.uw rd'/rs1', rd'/rs1', zero
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc. + |
+
Operation:
+X(rsdc) = EXTZ(X(rsdc)[31..0]);
+29.12.11. c.not
+Synopsis:
+Bitwise not, 16-bit encoding
+Mnemonic:
+c.not rd'/rs1'
+Encoding (RV32, RV64):
+Description:
+This instruction takes the one’s complement of rd'/rs1' and writes the result to the same register.
++ + | +
+
+
+rd'/rs1' is from the standard 8-register set x8-x15. + |
+
Prerequisites:
+None
+32-bit equivalent:
+xori rd'/rs1', rd'/rs1', -1
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc. + |
+
Operation:
+X(rsdc) = X(rsdc) XOR -1;
+29.12.12. c.mul
+Synopsis:
+Multiply, 16-bit encoding
+Mnemonic:
+c.mul rsd', rs2'
+Encoding (RV32, RV64):
+Description:
+This instruction multiplies XLEN bits of the source operands from rsd' and rs2' and writes the lowest XLEN bits of the result to rsd'.
++ + | +
+
+
+rd'/rs1' and rs2' are from the standard 8-register set x8-x15. + |
+
Prerequisites:
+M or Zmmul must be configured.
++ + | +
+
+
+The SAIL module variable for rd'/rs1' is called rsdc, and for rs2' is called rs2c. + |
+
Operation:
+let result_wide = to_bits(2 * sizeof(xlen), signed(X(rsdc)) * signed(X(rs2c)));
+X(rsdc) = result_wide[(sizeof(xlen) - 1) .. 0];
+29.13. PUSH/POP register instructions
These instructions are collectively referred to as PUSH/POP:
- cm.push
- cm.pop
- cm.popret
- cm.popretz
The term PUSH refers to cm.push.
+The term POP refers to cm.pop.
+The term POPRET refers to cm.popret and cm.popretz.
+Common details for these instructions are in this section.
+29.13.1. PUSH/POP functional overview
+PUSH, POP, POPRET are used to reduce the size of function prologues and epilogues.
+-
+
-
+
The PUSH instruction
+++-
+
-
+
adjusts the stack pointer to create the stack frame
+
+ -
+
pushes (stores) the registers specified in the register list to the stack frame
+
+
+ -
+
-
+
The POP instruction
+++-
+
-
+
pops (loads) the registers in the register list from the stack frame
+
+ -
+
adjusts the stack pointer to destroy the stack frame
+
+
+ -
+
-
+
The POPRET instructions
+++-
+
-
+
pop (load) the registers in the register list from the stack frame
+
+ -
+
cm.popretz also moves zero into a0 as the return value
+
+ -
+
adjust the stack pointer to destroy the stack frame
+
+ -
+
execute a ret instruction to return from the function
+
+
+ -
+
29.13.2. Example usage
+This example gives an illustration of the use of PUSH and POPRET.
+The function processMarkers in the EMBench benchmark picojpeg in the following file on github: libpicojpeg.c
+The prologue and epilogue compile with GCC10 to:
+ 0001098a <processMarkers>:
+ 1098a: 711d addi sp,sp,-96 ;#cm.push(1)
+ 1098c: c8ca sw s2,80(sp) ;#cm.push(2)
+ 1098e: c6ce sw s3,76(sp) ;#cm.push(3)
+ 10990: c4d2 sw s4,72(sp) ;#cm.push(4)
+ 10992: ce86 sw ra,92(sp) ;#cm.push(5)
+ 10994: cca2 sw s0,88(sp) ;#cm.push(6)
+ 10996: caa6 sw s1,84(sp) ;#cm.push(7)
+ 10998: c2d6 sw s5,68(sp) ;#cm.push(8)
+ 1099a: c0da sw s6,64(sp) ;#cm.push(9)
+ 1099c: de5e sw s7,60(sp) ;#cm.push(10)
+ 1099e: dc62 sw s8,56(sp) ;#cm.push(11)
+ 109a0: da66 sw s9,52(sp) ;#cm.push(12)
+ 109a2: d86a sw s10,48(sp);#cm.push(13)
+ 109a4: d66e sw s11,44(sp);#cm.push(14)
+...
+ 109f4: 4501 li a0,0 ;#cm.popretz(1)
+ 109f6: 40f6 lw ra,92(sp) ;#cm.popretz(2)
+ 109f8: 4466 lw s0,88(sp) ;#cm.popretz(3)
+ 109fa: 44d6 lw s1,84(sp) ;#cm.popretz(4)
+ 109fc: 4946 lw s2,80(sp) ;#cm.popretz(5)
+ 109fe: 49b6 lw s3,76(sp) ;#cm.popretz(6)
+ 10a00: 4a26 lw s4,72(sp) ;#cm.popretz(7)
+ 10a02: 4a96 lw s5,68(sp) ;#cm.popretz(8)
+ 10a04: 4b06 lw s6,64(sp) ;#cm.popretz(9)
+ 10a06: 5bf2 lw s7,60(sp) ;#cm.popretz(10)
+ 10a08: 5c62 lw s8,56(sp) ;#cm.popretz(11)
+ 10a0a: 5cd2 lw s9,52(sp) ;#cm.popretz(12)
+ 10a0c: 5d42 lw s10,48(sp);#cm.popretz(13)
+ 10a0e: 5db2 lw s11,44(sp);#cm.popretz(14)
+ 10a10: 6125 addi sp,sp,96 ;#cm.popretz(15)
+ 10a12: 8082 ret ;#cm.popretz(16)
+with the GCC option -msave-restore the output is the following:
+0001080e <processMarkers>:
+ 1080e: 73a012ef jal t0,11f48 <__riscv_save_12>
+ 10812: 1101 addi sp,sp,-32
+...
+ 10862: 4501 li a0,0
+ 10864: 6105 addi sp,sp,32
+ 10866: 71e0106f j 11f84 <__riscv_restore_12>
+with PUSH/POPRET this reduces to
+0001080e <processMarkers>:
+ 1080e: b8fa cm.push {ra,s0-s11},-96
+...
+ 10866: bcfa cm.popretz {ra,s0-s11}, 96
+The prologue / epilogue reduce from 60-bytes in the original code, to 14-bytes with -msave-restore, +and to 4-bytes with PUSH and POPRET. +As well as reducing the code-size PUSH and POPRET eliminate the branches from +calling the millicode save/restore routines and so may also perform better.
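To make the byte counts explicit: the original prologue and epilogue consist of 14 and 16 compressed (2-byte) instructions respectively, giving 28 + 32 = 60 bytes; the -msave-restore version uses a 4-byte jal plus a 2-byte addi in the prologue and 2 + 2 + 4 bytes in the epilogue, giving 14 bytes; and the cm.push/cm.popretz pair is two 2-byte instructions, giving 4 bytes.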
++ + | +
+
+
+The calls to <riscv_save_0>/<riscv_restore_0> become 64-bit when the target functions are out of the ±1MB range, increasing the prologue/epilogue size to 22-bytes. + |
+
+ + | +
+
+
+POP is typically used in tail-calling sequences where ret is not used to return to ra after destroying the stack frame. + |
+
Stack pointer adjustment handling
+The instructions all automatically adjust the stack pointer by enough to cover the memory required for the registers being saved or restored. +Additionally the spimm field in the encoding allows the stack pointer to be adjusted in additional increments of 16-bytes. There is only a small restricted +range available in the encoding; if the range is insufficient then a separate c.addi16sp can be used to increase the range.
+Register list handling
+There is no support for the {ra, s0-s10} register list without also adding s11. Therefore the {ra, s0-s11} register list must be used in this case.
+29.13.3. PUSH/POP Fault handling
+Correct execution requires that sp refers to idempotent memory (also see Non-idempotent memory handling), because the core must be able to +handle traps detected during the sequence. +The entire PUSH/POP sequence is re-executed after returning from the trap handler, and multiple traps are possible during the sequence.
+If a trap occurs during the sequence then xEPC is updated with the PC of the instruction, xTVAL (if not read-only-zero) updated with the bad address if it was an access fault and xCAUSE updated with the type of trap.
++ + | ++It is implementation defined whether interrupts can also be taken during the sequence execution. + | +
29.13.4. Software view of execution
+Software view of the PUSH sequence
+From a software perspective the PUSH sequence appears as:
+-
+
-
+
A sequence of stores writing the bytes required by the pseudo-code
+++-
+
-
+
The bytes may be written in any order.
+
+ -
+
The bytes may be grouped into larger accesses.
+
+ -
+
Any of the bytes may be written multiple times.
+
+
+ -
+
-
+
A stack pointer adjustment
+
+
+ + | +
+
+
+If an implementation allows interrupts during the sequence, and the interrupt handler uses sp to allocate stack memory, then any stores which were executed before the interrupt may be overwritten by the handler. This is safe because the memory is idempotent and the stores will be re-executed when execution resumes. + |
+
The stack pointer adjustment must only be committed when it is certain that the entire PUSH instruction will commit.
Stores may also return imprecise faults from the bus. It is platform defined whether the core implementation waits for the bus responses before continuing to the final stage of the sequence, or handles error responses after completing the PUSH instruction.
+For example:
+cm.push {ra, s0-s5}, -64
+Appears to software as:
+# any bytes from sp-1 to sp-28 may be written multiple times before
+# the instruction completes therefore these updates may be visible in
+# the interrupt/exception handler below the stack pointer
+sw s5, -4(sp)
+sw s4, -8(sp)
+sw s3,-12(sp)
+sw s2,-16(sp)
+sw s1,-20(sp)
+sw s0,-24(sp)
+sw ra,-28(sp)
+
+# this must only execute once, and will only execute after all stores
+# completed without any precise faults, therefore this update is only
+# visible in the interrupt/exception handler if cm.push has completed
+addi sp, sp, -64
+Software view of the POP/POPRET sequence
+From a software perspective the POP/POPRET sequence appears as:
+-
+
-
+
A sequence of loads reading the bytes required by the pseudo-code.
+++-
+
-
+
The bytes may be loaded in any order.
+
+ -
+
The bytes may be grouped into larger accesses.
+
+ -
+
Any of the bytes may be loaded multiple times.
+
+
+ -
+
-
+
A stack pointer adjustment
+
+ -
+
An optional
+li a0, 0
+ -
+
An optional
+ret
+
If a trap occurs during the sequence, then any loads which were executed before the trap may update architectural state. +The loads will be re-executed once the trap handler completes, so the values will be overwritten. +Therefore it is permitted for an implementation to update some of the destination registers before taking a fault.
+The optional li a0, 0
, stack pointer adjustment and optional ret
must only be committed when it is certain that the entire POP/POPRET instruction will commit.
For POPRET once the stack pointer adjustment has been committed the ret
must execute.
For example:
+cm.popretz {ra, s0-s3}, 32;
+Appears to software as:
+# any or all of these load instructions may execute multiple times
+# therefore these updates may be visible in the interrupt/exception handler
+lw s3, 28(sp)
+lw s2, 24(sp)
+lw s1, 20(sp)
+lw s0, 16(sp)
+lw ra, 12(sp)
+
+# these must only execute once, will only execute after all loads
+# complete successfully all instructions must execute atomically
+# therefore these updates are not visible in the interrupt/exception handler
+li a0, 0
+addi sp, sp, 32
+ret
+29.13.5. Non-idempotent memory handling
+An implementation may have a requirement to issue a PUSH/POP instruction to non-idempotent memory.
+If the core implementation does not support PUSH/POP to non-idempotent memories, the core may use an idempotency PMA to detect it and take a +load (POP/POPRET) or store (PUSH) access fault exception in order to avoid unpredictable results.
+Software should only use these instructions on non-idempotent memory regions when software can tolerate the required memory accesses +being issued repeatedly in the case that they cause exceptions.
+29.13.6. Example RV32I PUSH/POP sequences
The examples included here show the load/store series expansion and the stack adjustment. Examples of cm.popret and cm.popretz are not included, as the difference in the expanded sequence from cm.pop is trivial in all cases.
+cm.push {ra, s0-s2}, -64
+Encoding: rlist=7, spimm=3
+expands to:
+sw s2, -4(sp);
+sw s1, -8(sp);
+sw s0, -12(sp);
+sw ra, -16(sp);
+addi sp, sp, -64;
+cm.push {ra, s0-s11}, -112
+Encoding: rlist=15, spimm=3
+expands to:
+sw s11, -4(sp);
+sw s10, -8(sp);
+sw s9, -12(sp);
+sw s8, -16(sp);
+sw s7, -20(sp);
+sw s6, -24(sp);
+sw s5, -28(sp);
+sw s4, -32(sp);
+sw s3, -36(sp);
+sw s2, -40(sp);
+sw s1, -44(sp);
+sw s0, -48(sp);
+sw ra, -52(sp);
+addi sp, sp, -112;
+cm.pop {ra}, 16
+Encoding: rlist=4, spimm=0
+expands to:
+lw ra, 12(sp);
+addi sp, sp, 16;
+cm.pop {ra, s0-s3}, 48
+Encoding: rlist=8, spimm=1
+expands to:
+lw s3, 44(sp);
+lw s2, 40(sp);
+lw s1, 36(sp);
+lw s0, 32(sp);
+lw ra, 28(sp);
+addi sp, sp, 48;
+cm.pop {ra, s0-s4}, 64
+Encoding: rlist=9, spimm=2
+expands to:
+lw s4, 60(sp);
+lw s3, 56(sp);
+lw s2, 52(sp);
+lw s1, 48(sp);
+lw s0, 44(sp);
+lw ra, 40(sp);
+addi sp, sp, 64;
+29.13.7. cm.push
+Synopsis:
+Create stack frame: store ra and 0 to 12 saved registers to the stack frame, optionally allocate additional stack space.
+Mnemonic:
+cm.push {reg_list}, -stack_adj
+Encoding (RV32, RV64):
++ + | +
+
+
+rlist values 0 to 3 are reserved for a future EABI variant called cm.push.e + |
+
Assembly Syntax:
+cm.push {reg_list}, -stack_adj
+cm.push {xreg_list}, -stack_adj
+The variables used in the assembly syntax are defined below.
+RV32E:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32I, RV64:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
+ case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
+ case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
+ case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
+ case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
+ case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
+ case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
+ case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
+ //note - to include s10, s11 must also be included
+ case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32E:
+
+stack_adj_base = 16;
+Valid values:
+stack_adj = [16|32|48|64];
+RV32I:
+
+switch (rlist) {
+ case 4.. 7: stack_adj_base = 16;
+ case 8..11: stack_adj_base = 32;
+ case 12..14: stack_adj_base = 48;
+ case 15: stack_adj_base = 64;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 7: stack_adj = [16|32|48| 64];
+ case 8..11: stack_adj = [32|48|64| 80];
+ case 12..14: stack_adj = [48|64|80| 96];
+ case 15: stack_adj = [64|80|96|112];
+}
+RV64:
+
+switch (rlist) {
+ case 4.. 5: stack_adj_base = 16;
+ case 6.. 7: stack_adj_base = 32;
+ case 8.. 9: stack_adj_base = 48;
+ case 10..11: stack_adj_base = 64;
+ case 12..13: stack_adj_base = 80;
+ case 14: stack_adj_base = 96;
+ case 15: stack_adj_base = 112;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 5: stack_adj = [ 16| 32| 48| 64];
+ case 6.. 7: stack_adj = [ 32| 48| 64| 80];
+ case 8.. 9: stack_adj = [ 48| 64| 80| 96];
+ case 10..11: stack_adj = [ 64| 80| 96|112];
+ case 12..13: stack_adj = [ 80| 96|112|128];
+ case 14: stack_adj = [ 96|112|128|144];
+ case 15: stack_adj = [112|128|144|160];
+}
+Description:
+This instruction pushes (stores) the registers in reg_list to the memory below the stack pointer, +and then creates the stack frame by decrementing the stack pointer by stack_adj, +including any additional stack space requested by the value of spimm.
++ + | +
+
+
+All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen. + |
+
For further information see PUSH/POP Register Instructions.
+Stack Adjustment Calculation:
+stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
+spimm is the number of additional 16-byte address increments allocated for the stack frame.
+The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, +as defined above.
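For example, the earlier RV32 sequence cm.push {ra, s0-s2}, -64 is encoded with rlist=7 and spimm=3: stack_adj_base is 16 (rlist values 4-7 on RV32I), spimm contributes 3 × 16 = 48, and so stack_adj = 16 + 48 = 64.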
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists
+Operation:
+The first section of pseudo-code may be executed multiple times before the instruction successfully completes.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+if (XLEN==32) bytes=4; else bytes=8;
+
+addr=sp-bytes;
+for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
+ //if register i is in xreg_list
+ if (xreg_list[i]) {
+ switch(bytes) {
+ 4: asm("sw x[i], 0(addr)");
+ 8: asm("sd x[i], 0(addr)");
+ }
+ addr-=bytes;
+ }
+}
+The final section of pseudo-code executes atomically, and only executes if the section above completes without any exceptions or interrupts.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+sp-=stack_adj;
+29.13.8. cm.pop
+Synopsis:
+Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame.
+Mnemonic:
+cm.pop {reg_list}, stack_adj
+Encoding (RV32, RV64):
++ + | +
+
+
+rlist values 0 to 3 are reserved for a future EABI variant called cm.pop.e + |
+
Assembly Syntax:
+cm.pop {reg_list}, stack_adj
+cm.pop {xreg_list}, stack_adj
+The variables used in the assembly syntax are defined below.
+RV32E:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32I, RV64:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
+ case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
+ case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
+ case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
+ case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
+ case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
+ case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
+ case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
+ //note - to include s10, s11 must also be included
+ case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32E:
+
+stack_adj_base = 16;
+Valid values:
+stack_adj = [16|32|48|64];
+RV32I:
+
+switch (rlist) {
+ case 4.. 7: stack_adj_base = 16;
+ case 8..11: stack_adj_base = 32;
+ case 12..14: stack_adj_base = 48;
+ case 15: stack_adj_base = 64;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 7: stack_adj = [16|32|48| 64];
+ case 8..11: stack_adj = [32|48|64| 80];
+ case 12..14: stack_adj = [48|64|80| 96];
+ case 15: stack_adj = [64|80|96|112];
+}
+RV64:
+
+switch (rlist) {
+ case 4.. 5: stack_adj_base = 16;
+ case 6.. 7: stack_adj_base = 32;
+ case 8.. 9: stack_adj_base = 48;
+ case 10..11: stack_adj_base = 64;
+ case 12..13: stack_adj_base = 80;
+ case 14: stack_adj_base = 96;
+ case 15: stack_adj_base = 112;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 5: stack_adj = [ 16| 32| 48| 64];
+ case 6.. 7: stack_adj = [ 32| 48| 64| 80];
+ case 8.. 9: stack_adj = [ 48| 64| 80| 96];
+ case 10..11: stack_adj = [ 64| 80| 96|112];
+ case 12..13: stack_adj = [ 80| 96|112|128];
+ case 14: stack_adj = [ 96|112|128|144];
+ case 15: stack_adj = [112|128|144|160];
+}
+Description:
+This instruction pops (loads) the registers in reg_list from stack memory, +and then adjusts the stack pointer by stack_adj.
++ + | +
+
+
+All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen. + |
+
For further information see PUSH/POP Register Instructions.
+Stack Adjustment Calculation:
+stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
+spimm is the number of additional 16-byte address increments allocated for the stack frame.
+The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, +as defined above.
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists
+Operation:
+The first section of pseudo-code may be executed multiple times before the instruction successfully completes.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+if (XLEN==32) bytes=4; else bytes=8;
+
+addr=sp+stack_adj-bytes;
+for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
+ //if register i is in xreg_list
+ if (xreg_list[i]) {
+ switch(bytes) {
+ 4: asm("lw x[i], 0(addr)");
+ 8: asm("ld x[i], 0(addr)");
+ }
+ addr-=bytes;
+ }
+}
+The final section of pseudo-code executes atomically, and only executes if the section above completes without any exceptions or interrupts.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+sp+=stack_adj;
+29.13.9. cm.popretz
+Synopsis:
+Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, move zero into a0, return to ra.
+Mnemonic:
+cm.popretz {reg_list}, stack_adj
+Encoding (RV32, RV64):
++ + | +
+
+
+rlist values 0 to 3 are reserved for a future EABI variant called cm.popretz.e + |
+
Assembly Syntax:
+cm.popretz {reg_list}, stack_adj
+cm.popretz {xreg_list}, stack_adj
+RV32E:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32I, RV64:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
+ case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
+ case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
+ case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
+ case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
+ case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
+ case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
+ case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
+ //note - to include s10, s11 must also be included
+ case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32E:
+
+stack_adj_base = 16;
+Valid values:
+stack_adj = [16|32|48|64];
+RV32I:
+
+switch (rlist) {
+ case 4.. 7: stack_adj_base = 16;
+ case 8..11: stack_adj_base = 32;
+ case 12..14: stack_adj_base = 48;
+ case 15: stack_adj_base = 64;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 7: stack_adj = [16|32|48| 64];
+ case 8..11: stack_adj = [32|48|64| 80];
+ case 12..14: stack_adj = [48|64|80| 96];
+ case 15: stack_adj = [64|80|96|112];
+}
+RV64:
+
+switch (rlist) {
+ case 4.. 5: stack_adj_base = 16;
+ case 6.. 7: stack_adj_base = 32;
+ case 8.. 9: stack_adj_base = 48;
+ case 10..11: stack_adj_base = 64;
+ case 12..13: stack_adj_base = 80;
+ case 14: stack_adj_base = 96;
+ case 15: stack_adj_base = 112;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 5: stack_adj = [ 16| 32| 48| 64];
+ case 6.. 7: stack_adj = [ 32| 48| 64| 80];
+ case 8.. 9: stack_adj = [ 48| 64| 80| 96];
+ case 10..11: stack_adj = [ 64| 80| 96|112];
+ case 12..13: stack_adj = [ 80| 96|112|128];
+ case 14: stack_adj = [ 96|112|128|144];
+ case 15: stack_adj = [112|128|144|160];
+}
+Description:
+This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj, moves zero into a0 and then returns to ra.
++ + | +
+
+
+All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen. + |
+
For further information see PUSH/POP Register Instructions.
+Stack Adjustment Calculation:
+stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
+spimm is the number of additional 16-byte address increments allocated for the stack frame.
+The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists
+Operation:
+The first section of pseudo-code may be executed multiple times before the instruction successfully completes.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+if (XLEN==32) bytes=4; else bytes=8;
+
+addr=sp+stack_adj-bytes;
+for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
+ //if register i is in xreg_list
+ if (xreg_list[i]) {
+ switch(bytes) {
+ 4: asm("lw x[i], 0(addr)");
+ 8: asm("ld x[i], 0(addr)");
+ }
+ addr-=bytes;
+ }
+}
+The final section of pseudo-code executes atomically, and only executes if the section above completes without any exceptions or interrupts.
Note: The li a0, 0 could be executed more than once, but is included in the atomic section for convenience.
//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+asm("li a0, 0");
+sp+=stack_adj;
+asm("ret");
+29.13.10. cm.popret
+Synopsis:
+Destroy stack frame: load ra and 0 to 12 saved registers from the stack frame, deallocate the stack frame, return to ra.
+Mnemonic:
+cm.popret {reg_list}, stack_adj
+Encoding (RV32, RV64):
Note: rlist values 0 to 3 are reserved for a future EABI variant called cm.popret.e.
Assembly Syntax:
+cm.popret {reg_list}, stack_adj
+cm.popret {xreg_list}, stack_adj
+The variables used in the assembly syntax are defined below.
+RV32E:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32I, RV64:
+
+switch (rlist){
+ case 4: {reg_list="ra"; xreg_list="x1";}
+ case 5: {reg_list="ra, s0"; xreg_list="x1, x8";}
+ case 6: {reg_list="ra, s0-s1"; xreg_list="x1, x8-x9";}
+ case 7: {reg_list="ra, s0-s2"; xreg_list="x1, x8-x9, x18";}
+ case 8: {reg_list="ra, s0-s3"; xreg_list="x1, x8-x9, x18-x19";}
+ case 9: {reg_list="ra, s0-s4"; xreg_list="x1, x8-x9, x18-x20";}
+ case 10: {reg_list="ra, s0-s5"; xreg_list="x1, x8-x9, x18-x21";}
+ case 11: {reg_list="ra, s0-s6"; xreg_list="x1, x8-x9, x18-x22";}
+ case 12: {reg_list="ra, s0-s7"; xreg_list="x1, x8-x9, x18-x23";}
+ case 13: {reg_list="ra, s0-s8"; xreg_list="x1, x8-x9, x18-x24";}
+ case 14: {reg_list="ra, s0-s9"; xreg_list="x1, x8-x9, x18-x25";}
+ //note - to include s10, s11 must also be included
+ case 15: {reg_list="ra, s0-s11"; xreg_list="x1, x8-x9, x18-x27";}
+ default: reserved();
+}
+stack_adj = stack_adj_base + spimm[5:4] * 16;
+RV32E:
+
+stack_adj_base = 16;
+Valid values:
+stack_adj = [16|32|48|64];
+RV32I:
+
+switch (rlist) {
+ case 4.. 7: stack_adj_base = 16;
+ case 8..11: stack_adj_base = 32;
+ case 12..14: stack_adj_base = 48;
+ case 15: stack_adj_base = 64;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 7: stack_adj = [16|32|48| 64];
+ case 8..11: stack_adj = [32|48|64| 80];
+ case 12..14: stack_adj = [48|64|80| 96];
+ case 15: stack_adj = [64|80|96|112];
+}
+RV64:
+
+switch (rlist) {
+ case 4.. 5: stack_adj_base = 16;
+ case 6.. 7: stack_adj_base = 32;
+ case 8.. 9: stack_adj_base = 48;
+ case 10..11: stack_adj_base = 64;
+ case 12..13: stack_adj_base = 80;
+ case 14: stack_adj_base = 96;
+ case 15: stack_adj_base = 112;
+}
+
+Valid values:
+switch (rlist) {
+ case 4.. 5: stack_adj = [ 16| 32| 48| 64];
+ case 6.. 7: stack_adj = [ 32| 48| 64| 80];
+ case 8.. 9: stack_adj = [ 48| 64| 80| 96];
+ case 10..11: stack_adj = [ 64| 80| 96|112];
+ case 12..13: stack_adj = [ 80| 96|112|128];
+ case 14: stack_adj = [ 96|112|128|144];
+ case 15: stack_adj = [112|128|144|160];
+}
+Description:
+This instruction pops (loads) the registers in reg_list from stack memory, adjusts the stack pointer by stack_adj and then returns to ra.
Note: All ABI register mappings are for the UABI. An EABI version is planned once the EABI is frozen.
For further information see PUSH/POP Register Instructions.
+Stack Adjustment Calculation:
+stack_adj_base is the minimum number of bytes, in multiples of 16-byte address increments, required to cover the registers in the list.
+spimm is the number of additional 16-byte address increments allocated for the stack frame.
+The total stack adjustment represents the total size of the stack frame, which is stack_adj_base added to spimm scaled by 16, as defined above.
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists
+Operation:
+The first section of pseudo-code may be executed multiple times before the instruction successfully completes.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+if (XLEN==32) bytes=4; else bytes=8;
+
+addr=sp+stack_adj-bytes;
+for(i in 27,26,25,24,23,22,21,20,19,18,9,8,1) {
+ //if register i is in xreg_list
+ if (xreg_list[i]) {
+ switch(bytes) {
+ 4: asm("lw x[i], 0(addr)");
+ 8: asm("ld x[i], 0(addr)");
+ }
+ addr-=bytes;
+ }
+}
+The final section of pseudo-code executes atomically, and only executes if the section above completes without any exceptions or interrupts.
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+sp+=stack_adj;
+asm("ret");
+29.13.11. cm.mvsa01
+Synopsis:
+Move a0-a1 into two registers of s0-s7
+Mnemonic:
+cm.mvsa01 r1s', r2s'
+Encoding (RV32, RV64):
Note: For the encoding to be legal, r1s' != r2s'.
Assembly Syntax:
+cm.mvsa01 r1s', r2s'
+Description: +This instruction moves a0 into r1s' and a1 into r2s'. r1s' and r2s' must be different. +The execution is atomic, so it is not possible to observe state where only one of r1s' or r2s' has been updated.
+The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. +The mapping between them is specified in the pseudo-code below.
Note: The s register mapping is taken from the UABI, and may not match the currently unratified EABI. cm.mvsa01.e may be included in the future.
Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists.
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+if (RV32E && (r1sc>1 || r2sc>1)) {
+ reserved();
+}
+xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
+xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
+X[xreg1] = X[10];
+X[xreg2] = X[11];
+29.13.12. cm.mva01s
+Synopsis:
+Move two s0-s7 registers into a0-a1
+Mnemonic:
+cm.mva01s r1s', r2s'
+Encoding (RV32, RV64):
+Assembly Syntax:
+cm.mva01s r1s', r2s'
+Description: +This instruction moves r1s' into a0 and r2s' into a1. +The execution is atomic, so it is not possible to observe state where only one of a0 or a1 have been updated.
+The encoding uses sreg number specifiers instead of xreg number specifiers to save encoding space. +The mapping between them is specified in the pseudo-code below.
Note: The s register mapping is taken from the UABI, and may not match the currently unratified EABI. cm.mva01s.e may be included in the future.
Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists.
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+if (RV32E && (r1sc>1 || r2sc>1)) {
+ reserved();
+}
+xreg1 = {r1sc[2:1]>0,r1sc[2:1]==0,r1sc[2:0]};
+xreg2 = {r2sc[2:1]>0,r2sc[2:1]==0,r2sc[2:0]};
+X[10] = X[xreg1];
+X[11] = X[xreg2];
+29.14. Table Jump Overview
+cm.jt (Jump via table) and cm.jalt (Jump and link via table) are referred to as table jump.
+Table jump uses a 256-entry XLEN wide table in instruction memory to contain function addresses. +The table must be a minimum of 64-byte aligned.
+Table entries follow the current data endianness. This is different from normal instruction fetch which is always little-endian.
+cm.jt and cm.jalt encodings index the table, giving access to functions within the full XLEN wide address space.
+This is used as a form of dictionary compression to reduce the code size of jal / auipc+jalr / jr / auipc+jr instructions.
+Table jump allows the linker to replace the following instruction sequences with a cm.jt or cm.jalt encoding, and an entry in the table:
- 32-bit j calls
- 32-bit jal ra calls
- 64-bit auipc+jr calls to fixed locations
- 64-bit auipc+jalr ra calls to fixed locations
  - The auipc+jr/jalr sequence is used because the offset from the PC is out of the ±1MB range.

If a return address stack is implemented, then as cm.jalt is equivalent to jal ra, it pushes to the stack.
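A minimal sketch of the replacement (the label and the table index are hypothetical; the substitution is normally performed by the linker, not written by hand):

# Before linking: a 32-bit call
        jal     ra, do_work

# After linking, assuming the linker placed the address of do_work in
# jump-vector-table entry 40 (indices 32-255 select the linking form):
        cm.jalt 40              # 16-bit encoding; target is read from
                                # jvt.base + 40*(XLEN/8)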
+29.14.1. jvt
+The base of the table is in the jvt CSR (see jvt CSR, table jump base vector and control register), each table entry is XLEN bits.
+If the same function is called with and without linking then it must have two entries in the table. +This is typically caused by the same function being called with and without tail calling.
+29.14.2. Table Jump Fault handling
+For a table jump instruction, the table entry that the instruction selects is considered an extension of the instruction itself. +Hence, the execution of a table jump instruction involves two instruction fetches, the first to read the instruction (cm.jt/cm.jalt) +and the second to read from the jump vector table (JVT). Both instruction fetches are implicit reads, and both require +execute permission; read permission is irrelevant. It is recommended that the second fetch be ignored for hardware triggers and breakpoints.
+Memory writes to the jump vector table require an instruction barrier (fence.i) to guarantee that they are visible to the instruction fetch.
+Multiple contexts may have different jump vector tables. JVT may be switched between them without an instruction barrier +if the tables have not been updated in memory since the last fence.i.
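A minimal sketch of a table update following the fence.i requirement above (hypothetical symbols, RV64):

# Patch one jump-vector-table entry, then make the write visible to
# instruction fetch before any cm.jt/cm.jalt may use it.
        la      t0, jump_table        # table base (the value written to jvt)
        la      t1, new_handler       # new target address
        sd      t1, 5*8(t0)           # overwrite entry 5 (XLEN/8 = 8 bytes)
        fence.i                       # required before the entry may be fetched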
+If an exception occurs on either instruction fetch, xEPC is set to the PC of the table jump instruction, xCAUSE is set as expected for the type of fault and xTVAL (if not set to zero) contains the fetch address which caused the fault.
+29.14.3. jvt CSR
+Synopsis:
+Table jump base vector and control register
+Address:
+0x0017
+Permissions:
+URW
+Format (RV32):
+Format (RV64):
+Description:
+The jvt register is an XLEN-bit WARL read/write register that holds the jump table configuration, consisting of the jump table base address (BASE) and the jump table mode (MODE).
+If Section 29.10 is implemented then jvt must also be implemented, but can contain a read-only value. If jvt is writable, the set of values the register may hold can vary by implementation. The value in the BASE field must always be aligned on a 64-byte boundary.
+jvt.base is a virtual address, whenever virtual memory is enabled.
+The memory pointed to by jvt.base is treated as instruction memory for the purpose of executing table jump instructions, implying execute access permission.
jvt.mode | Comment
---|---
000000 | Jump table mode
others | reserved for future standard use
jvt.mode is a WARL field, so can only be programmed to modes which are implemented. Therefore the discovery mechanism is to +attempt to program different modes and read back the values to see which are available. Jump table mode must be implemented.
Note: In future, the RISC-V Unified Discovery method will report the available modes.
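A minimal sketch of that discovery sequence (assuming the assembler recognises the jvt CSR name, otherwise the CSR address given above can be used; table_base is a hypothetical 64-byte-aligned symbol):

# Probe whether Jump Table Mode (jvt.mode = 0) is implemented.
        la      t0, table_base        # candidate BASE, MODE field = 0
        csrw    jvt, t0               # attempt to program the mode
        csrr    t1, jvt               # WARL read-back
        # if the MODE bits of t1 read back as 0, Jump Table Mode is available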
Architectural State:
+jvt CSR adds architectural state to the system software context (such as an OS process), therefore must be saved/restored on context switches.
+State Enable:
+If the Smstateen extension is implemented, then bit 2 in mstateen0, sstateen0, and hstateen0 is implemented. If bit 2 of a controlling stateen0 CSR is zero, then access to the jvt CSR and execution of a cm.jalt or cm.jt instruction by a lower privilege level results in an Illegal Instruction trap (or, if appropriate, a Virtual Instruction trap).
+29.14.4. cm.jt
+Synopsis:
+jump via table
+Mnemonic:
+cm.jt index
+Encoding (RV32, RV64):
Note: For this encoding to decode as cm.jt, index<32; otherwise it decodes as cm.jalt, see Jump and link via table.
Note: If jvt.mode = 0 (Jump Table Mode) then cm.jt behaves as specified here. If jvt.mode is a reserved value, then cm.jt is also reserved. In the future, other defined values of jvt.mode may change the behaviour of cm.jt.
Assembly Syntax:
+cm.jt index
+Description:
+cm.jt reads an entry from the jump vector table in memory and jumps to the address that was read.
+For further information see Table Jump Overview.
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists.
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+# target_address is temporary internal state, it doesn't represent a real register
+# InstMemory is byte indexed
+
+switch(XLEN) {
+ 32: table_address[XLEN-1:0] = jvt.base + (index<<2);
+ 64: table_address[XLEN-1:0] = jvt.base + (index<<3);
+}
+
+//fetch from the jump table
+target_address[XLEN-1:0] = InstMemory[table_address][XLEN-1:0];
+
+j target_address[XLEN-1:0]&~0x1;
+29.14.5. cm.jalt
+Synopsis:
+jump via table with optional link
+Mnemonic:
+cm.jalt index
+Encoding (RV32, RV64):
Note: For this encoding to decode as cm.jalt, index>=32; otherwise it decodes as cm.jt, see Jump via table.
Note: If jvt.mode = 0 (Jump Table Mode) then cm.jalt behaves as specified here. If jvt.mode is a reserved value, then cm.jalt is also reserved. In the future, other defined values of jvt.mode may change the behaviour of cm.jalt.
Assembly Syntax:
+cm.jalt index
+Description:
+cm.jalt reads an entry from the jump vector table in memory and jumps to the address that was read, linking to ra.
+For further information see Table Jump Overview.
+Prerequisites:
+None
+32-bit equivalent:
+No direct equivalent encoding exists.
+Operation:
+//This is not SAIL, it's pseudo-code. The SAIL hasn't been written yet.
+
+# target_address is temporary internal state, it doesn't represent a real register
+# InstMemory is byte indexed
+
+switch(XLEN) {
+ 32: table_address[XLEN-1:0] = jvt.base + (index<<2);
+ 64: table_address[XLEN-1:0] = jvt.base + (index<<3);
+}
+
+//fetch from the jump table
+target_address[XLEN-1:0] = InstMemory[table_address][XLEN-1:0];
+
+jal ra, target_address[XLEN-1:0]&~0x1;
+30. "B" Extension for Bit Manipulation, Version 1.0.0
+The B standard extension comprises instructions provided by the Zba, Zbb, and +Zbs extensions.
+30.1. Zb* Overview
+The bit-manipulation (bitmanip) extension collection is comprised of several component extensions to the base RISC-V architecture that are intended to provide some combination of code size reduction, performance improvement, and energy reduction. +While the instructions are intended to have general use, some instructions are more useful in some domains than others. +Hence, several smaller bitmanip extensions are provided. Each of these smaller extensions is grouped by common function and use case, and each has its own Zb*-extension name.
+Each bitmanip extension includes a group of several bitmanip instructions that have similar purposes and that can often share the same logic. Some instructions are available in only one extension while others are available in several. +The instructions have mnemonics and encodings that are independent of the extensions in which they appear. +Thus, when implementing extensions with overlapping instructions, there is no redundancy in logic or encoding.
+The bitmanip extensions are defined for RV32 and RV64. +Most of the instructions are expected to be forward compatible with RV128. +While the shift-immediate instructions are defined to have at most a 6-bit immediate field, a 7th bit is available in the encoding space should this be needed for RV128.
+30.2. Word Instructions
+The bitmanip extension follows the convention in RV64 that w-suffixed instructions (without a dot before the w) ignore the upper 32 bits of their inputs, operate on the least-significant 32-bits as signed values and produce a 32-bit signed result that is sign-extended to XLEN.
+Bitmanip instructions with the suffix .uw have one operand that is an unsigned 32-bit value that is extracted from the least significant 32 bits of the specified register. Other than that, these perform full XLEN operations.
+Bitmanip instructions with the suffix .b, .h and .w only look at the least significant 8-bits, 16-bits and 32-bits of the input (respectively) and produce an XLEN-wide result that is sign-extended or zero-extended, based on the specific instruction.
+30.3. Pseudocode for instruction semantics
+The semantics of each instruction in Instructions (in alphabetical order) is expressed in a SAIL-like syntax.
+30.4. Extensions
+The first group of bitmanip extensions to be released for Public Review are Zba, Zbb, Zbc, and Zbs.
+Below is a list of all of the instructions that are included in these extensions +along with their specific mapping:
+RV32 | +RV64 | +Mnemonic | +Instruction | +Zba | +Zbb | +Zbc | +Zbs | +
---|---|---|---|---|---|---|---|
+ | ✓ |
+add.uw rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
✓ |
+✓ |
+andn rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+clmul rd, rs1, rs2 |
++ | + | + | ✓ |
++ |
✓ |
+✓ |
+clmulh rd, rs1, rs2 |
++ | + | + | ✓ |
++ |
✓ |
+✓ |
+clmulr rd, rs1, rs2 |
++ | + | + | ✓ |
++ |
✓ |
+✓ |
+clz rd, rs |
++ | + | ✓ |
++ | + |
+ | ✓ |
+clzw rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+cpop rd, rs |
++ | + | ✓ |
++ | + |
+ | ✓ |
+cpopw rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+ctz rd, rs |
++ | + | ✓ |
++ | + |
+ | ✓ |
+ctzw rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+max rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+maxu rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+min rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+minu rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+orc.b rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+orn rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+rev8 rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+rol rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
+ | ✓ |
+rolw rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+ror rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+rori rd, rs1, shamt |
++ | + | ✓ |
++ | + |
+ | ✓ |
+roriw rd, rs1, shamt |
++ | + | ✓ |
++ | + |
+ | ✓ |
+rorw rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+bclr rd, rs1, rs2 |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+bclri rd, rs1, imm |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+bext rd, rs1, rs2 |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+bexti rd, rs1, imm |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+binv rd, rs1, rs2 |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+binvi rd, rs1, imm |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+bset rd, rs1, rs2 |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+bseti rd, rs1, imm |
++ | + | + | + | ✓ |
+
✓ |
+✓ |
+sext.b rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+sext.h rd, rs |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+sh1add rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
+ | ✓ |
+sh1add.uw rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
✓ |
+✓ |
+sh2add rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
+ | ✓ |
+sh2add.uw rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
✓ |
+✓ |
+sh3add rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
+ | ✓ |
+sh3add.uw rd, rs1, rs2 |
++ | ✓ |
++ | + | + |
+ | ✓ |
+slli.uw rd, rs1, imm |
++ | ✓ |
++ | + | + |
✓ |
+✓ |
+xnor rd, rs1, rs2 |
++ | + | ✓ |
++ | + |
✓ |
+✓ |
+zext.h rd, rs |
++ | + | ✓ |
++ | + |
30.4.1. Zba: Address generation
+The Zba instructions can be used to accelerate the generation of addresses that index into arrays of basic types (halfword, word, doubleword) using both unsigned word-sized and XLEN-sized indices: a shifted index is added to a base address.
+The shift and add instructions do a left shift of 1, 2, or 3 because these are commonly found in real-world code and because they can be implemented with a minimal amount of additional hardware beyond that of the simple adder. This avoids lengthening the critical path in implementations.
+While the shift and add instructions are limited to a maximum left shift of 3, the slli instruction (from the base ISA) can be used to perform similar shifts for indexing into arrays of wider elements. The slli.uw — added in this extension — can be used when the index is to be interpreted as an unsigned word.
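For example (a non-normative sketch on RV64, with an unsigned 32-bit index in a1 and the array base in a0), sh2add.uw folds the zero-extension, scaling and add into one instruction:

# Load a[i] from an array of 32-bit elements.

# Base ISA only:
        slli    t0, a1, 32            # zero-extend the 32-bit index
        srli    t0, t0, 32
        slli    t0, t0, 2             # scale by the element size (4 bytes)
        add     t0, a0, t0
        lw      a2, 0(t0)

# With Zba:
        sh2add.uw t0, a1, a0          # t0 = a0 + (zext32(a1) << 2)
        lw      a2, 0(t0)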
+The following instructions (and pseudoinstructions) comprise the Zba extension:
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
+ | ✓ |
+add.uw rd, rs1, rs2 |
++ |
✓ |
+✓ |
+sh1add rd, rs1, rs2 |
++ |
+ | ✓ |
+sh1add.uw rd, rs1, rs2 |
++ |
✓ |
+✓ |
+sh2add rd, rs1, rs2 |
++ |
+ | ✓ |
+sh2add.uw rd, rs1, rs2 |
++ |
✓ |
+✓ |
+sh3add rd, rs1, rs2 |
++ |
+ | ✓ |
+sh3add.uw rd, rs1, rs2 |
++ |
+ | ✓ |
+slli.uw rd, rs1, imm |
++ |
+ | ✓ |
+zext.w rd, rs |
++ |
30.4.2. Zbb: Basic bit-manipulation
+Logical with negate
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+andn rd, rs1, rs2 |
++ |
✓ |
+✓ |
+orn rd, rs1, rs2 |
++ |
✓ |
+✓ |
+xnor rd, rs1, rs2 |
++ |
Implementation Hint: The Logical with Negate instructions can be implemented by inverting the rs2 inputs to the base-required AND, OR, and XOR logic instructions. In some implementations, the inverter on rs2 used for subtraction can be reused for this purpose.
Count leading/trailing zero bits
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+clz rd, rs |
++ |
+ | ✓ |
+clzw rd, rs |
++ |
✓ |
+✓ |
+ctz rd, rs |
++ |
+ | ✓ |
+ctzw rd, rs |
++ |
Count population
+These instructions count the number of set bits (1-bits). This is also +commonly referred to as population count.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+cpop rd, rs |
++ |
+ | ✓ |
+cpopw rd, rs |
++ |
Integer minimum/maximum
+The integer minimum/maximum instructions are arithmetic R-type +instructions that return the smaller/larger of two operands.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+max rd, rs1, rs2 |
++ |
✓ |
+✓ |
+maxu rd, rs1, rs2 |
++ |
✓ |
+✓ |
+min rd, rs1, rs2 |
++ |
✓ |
+✓ |
+minu rd, rs1, rs2 |
++ |
Sign extension and zero extension
+These instructions perform the sign extension or zero extension of the least significant 8 bits or 16 bits of the source register.
+These instructions replace the generalized idioms slli rD,rS,(XLEN-<size>) + srli
(for zero extension) or slli + srai
(for sign extension) for the sign extension of 8-bit and 16-bit quantities, and for the zero extension of 16-bit quantities.
RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+sext.b rd, rs |
++ |
✓ |
+✓ |
+sext.h rd, rs |
++ |
✓ |
+✓ |
+zext.h rd, rs |
++ |
Bitwise rotation
+Bitwise rotation instructions are similar to the shift-logical operations from the base spec. However, where the shift-logical +instructions shift in zeros, the rotate instructions shift in the bits that were shifted out of the other side of the value. +Such operations are also referred to as ‘circular shifts’.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+rol rd, rs1, rs2 |
++ |
+ | ✓ |
+rolw rd, rs1, rs2 |
++ |
✓ |
+✓ |
+ror rd, rs1, rs2 |
++ |
✓ |
+✓ |
+rori rd, rs1, shamt |
++ |
+ | ✓ |
+roriw rd, rs1, shamt |
++ |
+ | ✓ |
+rorw rd, rs1, rs2 |
++ |
Architecture Explanation: The rotate instructions were included to replace a common four-instruction sequence to achieve the same effect (neg; sll/srl; srl/sll; or).
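A minimal sketch of that replacement, rotating a0 left by the amount held in a1 (shift amounts are taken modulo XLEN, so the negated amount yields the complementary shift):

# Base-ISA four-instruction sequence:
        neg     t0, a1
        sll     t1, a0, a1
        srl     t2, a0, t0
        or      a0, t1, t2

# With Zbb:
        rol     a0, a0, a1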
OR Combine
+orc.b sets the bits of each byte in the result rd to all zeros if no bit within the respective byte of rs is set, or to all ones if any bit within the respective byte of rs is set.
+One use-case is string-processing functions, such as strlen and strcpy, which can use orc.b to test for the terminating zero byte by counting the set bits in leading non-zero bytes in a word.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+orc.b rd, rs |
++ |
Byte-reverse
+rev8 reverses the byte-ordering of rs.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+rev8 rd, rs |
++ |
30.4.3. Zbc: Carry-less multiplication
+Carry-less multiplication is the multiplication in the polynomial ring over GF(2).
+clmul produces the lower half of the carry-less product and clmulh produces the upper half of the 2✕XLEN carry-less product.
+clmulr produces bits 2✕XLEN−2:XLEN-1 of the 2✕XLEN carry-less product.
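A minimal, non-normative sketch of forming the full 2✕XLEN-bit carry-less product of a0 and a1:

        clmul   t0, a0, a1            # low XLEN bits of the carry-less product
        clmulh  t1, a0, a1            # high XLEN bits of the carry-less product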
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+clmul rd, rs1, rs2 |
++ |
✓ |
+✓ |
+clmulh rd, rs1, rs2 |
++ |
✓ |
+✓ |
+clmulr rd, rs1, rs2 |
++ |
30.4.4. Zbs: Single-bit instructions
+The single-bit instructions provide a mechanism to set, clear, invert, or extract +a single bit in a register. The bit is specified by its index.
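A minimal, non-normative sketch of maintaining a set of flags in a0 with the single-bit instructions; the bit positions are arbitrary:

        bseti   a0, a0, 3             # set bit 3
        bclri   a0, a0, 7             # clear bit 7
        binvi   a0, a0, 1             # invert bit 1
        bexti   a1, a0, 3             # a1 = 1 if bit 3 of a0 is set, else 0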
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+bclr rd, rs1, rs2 |
++ |
✓ |
+✓ |
+bclri rd, rs1, imm |
++ |
✓ |
+✓ |
+bext rd, rs1, rs2 |
++ |
✓ |
+✓ |
+bexti rd, rs1, imm |
++ |
✓ |
+✓ |
+binv rd, rs1, rs2 |
++ |
✓ |
+✓ |
+binvi rd, rs1, imm |
++ |
✓ |
+✓ |
+bset rd, rs1, rs2 |
++ |
✓ |
+✓ |
+bseti rd, rs1, imm |
++ |
30.4.5. Zbkb: Bit-manipulation for Cryptography
+This extension contains instructions essential for implementing +common operations in cryptographic workloads.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+rol |
++ |
+ | ✓ |
+rolw |
++ |
✓ |
+✓ |
+ror |
++ |
✓ |
+✓ |
+rori |
++ |
+ | ✓ |
+roriw |
++ |
+ | ✓ |
+rorw |
++ |
✓ |
+✓ |
+andn |
++ |
✓ |
+✓ |
+orn |
++ |
✓ |
+✓ |
+xnor |
++ |
✓ |
+✓ |
+pack |
++ |
✓ |
+✓ |
+packh |
++ |
+ | ✓ |
+packw |
++ |
✓ |
+✓ |
+rev.b |
++ |
✓ |
+✓ |
+rev8 |
++ |
✓ |
++ | zip |
++ |
✓ |
++ | unzip |
++ |
30.4.6. Zbkc: Carry-less multiplication for Cryptography
+Carry-less multiplication is the multiplication in the polynomial ring over +GF(2). This is a critical operation in some cryptographic workloads, +particularly the AES-GCM authenticated encryption scheme. +This extension provides only the instructions needed to +efficiently implement the GHASH operation, which is part of this workload.
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+clmul rd, rs1, rs2 |
++ |
✓ |
+✓ |
+clmulh rd, rs1, rs2 |
++ |
30.4.7. Zbkx: Crossbar permutations
+These instructions implement a "lookup table" for 4 and 8 bit elements +inside the general purpose registers. +rs1 is used as a vector of N-bit words, and rs2 as a vector of N-bit +indices into rs1. +Elements in rs1 are replaced by the indexed element in rs2, or zero +if the index into rs2 is out of bounds.
+These instructions are useful for expressing N-bit to N-bit boolean +operations, and implementing cryptographic code with secret +dependent memory accesses (particularly SBoxes) such that the execution +latency does not depend on the (secret) data being operated on.
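A minimal, non-normative sketch (RV64): applying a 16-entry 4-bit S-box to every nibble of a1 in constant time; sbox_lut is a hypothetical symbol holding the 16 S-box nibbles packed into one 64-bit word:

        la      t0, sbox_lut          # address of the packed lookup table
        ld      t0, 0(t0)             # 16 nibbles = 64 bits = one register
        xperm.n a0, t0, a1            # rs1 = table, rs2 = nibble indices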
+RV32 | +RV64 | +Mnemonic | +Instruction | +
---|---|---|---|
✓ |
+✓ |
+xperm.n rd, rs1, rs2 |
++ |
✓ |
+✓ |
+xperm.b rd, rs1, rs2 |
++ |
30.5. Instructions (in alphabetical order)
+30.5.1. add.uw
+-
+
- Synopsis +
-
+
Add unsigned word
+
+ - Mnemonic +
-
+
add.uw rd, rs1, rs2
+
+ - Pseudoinstructions +
-
+
zext.w rd, rs1 → add.uw rd, rs1, zero
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs an XLEN-wide addition between rs2 and the zero-extended least-significant word of rs1.
+
+ - Operation +
let base = X(rs2);
+let index = EXTZ(X(rs1)[31..0]);
+
+X(rd) = base + index;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.2. andn
+-
+
- Synopsis +
-
+
AND with inverted operand
+
+ - Mnemonic +
-
+
andn rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs the bitwise logical AND operation between rs1 and the bitwise inversion of rs2.
+
+ - Operation +
X(rd) = X(rs1) & ~X(rs2);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.3. bclr
+-
+
- Synopsis +
-
+
Single-Bit Clear (Register)
+
+ - Mnemonic +
-
+
bclr rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit cleared at the index specified in rs2. +The index is read from the lower log2(XLEN) bits of rs2.
+
+ - Operation +
let index = X(rs2) & (XLEN - 1);
+X(rd) = X(rs1) & ~(1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.4. bclri
+-
+
- Synopsis +
-
+
Single-Bit Clear (Immediate)
+
+ - Mnemonic +
-
+
bclri rd, rs1, shamt
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit cleared at the index specified in shamt. +The index is read from the lower log2(XLEN) bits of shamt. +For RV32, the encodings corresponding to shamt[5]=1 are reserved.
+
+ - Operation +
let index = shamt & (XLEN - 1);
+X(rd) = X(rs1) & ~(1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.5. bext
+-
+
- Synopsis +
-
+
Single-Bit Extract (Register)
+
+ - Mnemonic +
-
+
bext rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns a single bit extracted from rs1 at the index specified in rs2. +The index is read from the lower log2(XLEN) bits of rs2.
+
+ - Operation +
let index = X(rs2) & (XLEN - 1);
+X(rd) = (X(rs1) >> index) & 1;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.6. bexti
+-
+
- Synopsis +
-
+
Single-Bit Extract (Immediate)
+
+ - Mnemonic +
-
+
bexti rd, rs1, shamt
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction returns a single bit extracted from rs1 at the index specified in shamt. +The index is read from the lower log2(XLEN) bits of shamt. +For RV32, the encodings corresponding to shamt[5]=1 are reserved.
+
+ - Operation +
let index = shamt & (XLEN - 1);
+X(rd) = (X(rs1) >> index) & 1;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.7. binv
+-
+
- Synopsis +
-
+
Single-Bit Invert (Register)
+
+ - Mnemonic +
-
+
binv rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit inverted at the index specified in rs2. +The index is read from the lower log2(XLEN) bits of rs2.
+
+ - Operation +
let index = X(rs2) & (XLEN - 1);
+X(rd) = X(rs1) ^ (1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.8. binvi
+-
+
- Synopsis +
-
+
Single-Bit Invert (Immediate)
+
+ - Mnemonic +
-
+
binvi rd, rs1, shamt
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit inverted at the index specified in shamt. +The index is read from the lower log2(XLEN) bits of shamt. +For RV32, the encodings corresponding to shamt[5]=1 are reserved.
+
+ - Operation +
let index = shamt & (XLEN - 1);
+X(rd) = X(rs1) ^ (1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.9. bset
+-
+
- Synopsis +
-
+
Single-Bit Set (Register)
+
+ - Mnemonic +
-
+
bset rd, rs1,rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit set at the index specified in rs2. +The index is read from the lower log2(XLEN) bits of rs2.
+
+ - Operation +
let index = X(rs2) & (XLEN - 1);
+X(rd) = X(rs1) | (1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.10. bseti
+-
+
- Synopsis +
-
+
Single-Bit Set (Immediate)
+
+ - Mnemonic +
-
+
bseti rd, rs1,shamt
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction returns rs1 with a single bit set at the index specified in shamt. +The index is read from the lower log2(XLEN) bits of shamt. +For RV32, the encodings corresponding to shamt[5]=1 are reserved.
+
+ - Operation +
let index = shamt & (XLEN - 1);
+X(rd) = X(rs1) | (1 << index)
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbs (Single-bit instructions) |
+v1.0 |
+Ratified |
+
30.5.11. clmul
+-
+
- Synopsis +
-
+
Carry-less multiply (low-part)
+
+ - Mnemonic +
-
+
clmul rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
clmul produces the lower half of the 2·XLEN carry-less product.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+let output : xlenbits = 0;
+
+foreach (i from 0 to (xlen - 1) by 1) {
+ output = if ((rs2_val >> i) & 1)
+ then output ^ (rs1_val << i)
+ else output;
+}
+
+X[rd] = output
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.12. clmulh
+-
+
- Synopsis +
-
+
Carry-less multiply (high-part)
+
+ - Mnemonic +
-
+
clmulh rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
clmulh produces the upper half of the 2·XLEN carry-less product.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+let output : xlenbits = 0;
+
+foreach (i from 1 to xlen by 1) {
+ output = if ((rs2_val >> i) & 1)
+ then output ^ (rs1_val >> (xlen - i))
+ else output;
+}
+
+X[rd] = output
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.13. clmulr
+-
+
- Synopsis +
-
+
Carry-less multiply (reversed)
+
+ - Mnemonic +
-
+
clmulr rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
clmulr produces bits 2·XLEN−2:XLEN-1 of the 2·XLEN carry-less +product.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+let output : xlenbits = 0;
+
+foreach (i from 0 to (xlen - 1) by 1) {
+ output = if ((rs2_val >> i) & 1)
+ then output ^ (rs1_val >> (xlen - i - 1))
+ else output;
+}
+
+X[rd] = output
Note: The clmulr instruction is used to accelerate CRC calculations. The r in the instruction’s mnemonic stands for reversed, as the instruction is equivalent to bit-reversing the inputs, performing a clmul, then bit-reversing the output.
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
30.5.14. clz
+-
+
- Synopsis +
-
+
Count leading zero bits
+
+ - Mnemonic +
-
+
clz rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 0’s before the first 1, starting at the most-significant bit (i.e., XLEN-1) and progressing to bit 0. Accordingly, if the input is 0, the output is XLEN, and if the most-significant bit of the input is a 1, the output is 0.
+
+ - Operation +
val HighestSetBit : forall ('N : Int), 'N >= 0. bits('N) -> int
+
+function HighestSetBit x = {
+ foreach (i from (xlen - 1) to 0 by 1 in dec)
+ if [x[i]] == 0b1 then return(i) else ();
+ return -1;
+}
+
+let rs = X(rs);
+X[rd] = (xlen - 1) - HighestSetBit(rs);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.15. clzw
+-
+
- Synopsis +
-
+
Count leading zero bits in word
+
+ - Mnemonic +
-
+
clzw rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 0’s before the first 1 starting at bit 31 and progressing to bit 0. +Accordingly, if the least-significant word is 0, the output is 32, and if the most-significant bit of the word (i.e., bit 31) is a 1, the output is 0.
+
+ - Operation +
val HighestSetBit32 : forall ('N : Int), 'N >= 0. bits('N) -> int
+
+function HighestSetBit32 x = {
+ foreach (i from 31 to 0 by 1 in dec)
+ if [x[i]] == 0b1 then return(i) else ();
+ return -1;
+}
+
+let rs = X(rs);
+X[rd] = 31 - HighestSetBit32(rs);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.16. cpop
+-
+
- Synopsis +
-
+
Count set bits
+
+ - Mnemonic +
-
+
cpop rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 1’s (i.e., set bits) in the source register.
+
+ - Operation +
let bitcount = 0;
+let rs = X(rs);
+
+foreach (i from 0 to (xlen - 1) in inc)
+ if rs[i] == 0b1 then bitcount = bitcount + 1 else ();
+
+X[rd] = bitcount
Software Hint: This operation is known as population count, popcount, sideways sum, bit summation, or Hamming weight. The GCC builtin function …
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.17. cpopw
+-
+
- Synopsis +
-
+
Count set bits in word
+
+ - Mnemonic +
-
+
cpopw rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 1’s (i.e., set bits) in the least-significant word of the source register.
+
+ - Operation +
let bitcount = 0;
+let val = X(rs);
+
+foreach (i from 0 to 31 in inc)
+ if val[i] == 0b1 then bitcount = bitcount + 1 else ();
+
+X[rd] = bitcount
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.18. ctz
+-
+
- Synopsis +
-
+
Count trailing zeros
+
+ - Mnemonic +
-
+
ctz rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 0’s before the first 1, starting at the least-significant bit (i.e., 0) and progressing to the most-significant bit (i.e., XLEN-1). +Accordingly, if the input is 0, the output is XLEN, and if the least-significant bit of the input is a 1, the output is 0.
+
+ - Operation +
val LowestSetBit : forall ('N : Int), 'N >= 0. bits('N) -> int
+
+function LowestSetBit x = {
+ foreach (i from 0 to (xlen - 1) by 1 in dec)
+ if [x[i]] == 0b1 then return(i) else ();
+ return xlen;
+}
+
+let rs = X(rs);
+X[rd] = LowestSetBit(rs);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.19. ctzw
+-
+
- Synopsis +
-
+
Count trailing zero bits in word
+
+ - Mnemonic +
-
+
ctzw rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction counts the number of 0’s before the first 1, starting at the least-significant bit (i.e., 0) and progressing to the most-significant bit of the least-significant word (i.e., 31). Accordingly, if the least-significant word is 0, the output is 32, and if the least-significant bit of the input is a 1, the output is 0.
+
+ - Operation +
val LowestSetBit32 : forall ('N : Int), 'N >= 0. bits('N) -> int
+
+function LowestSetBit32 x = {
+ foreach (i from 0 to 31 by 1 in dec)
+ if [x[i]] == 0b1 then return(i) else ();
+ return 32;
+}
+
+let rs = X(rs);
+X[rd] = LowestSetBit32(rs);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.20. max
+-
+
- Synopsis +
-
+
Maximum
+
+ - Mnemonic +
-
+
max rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns the larger of two signed integers.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+
+let result = if rs1_val <_s rs2_val
+ then rs2_val
+ else rs1_val;
+
+X(rd) = result;
++ + | +
+ Software Hint
+
+
+Calculating the absolute value of a signed integer can be performed +using the following sequence: neg rD,rS followed by max +rD,rS,rD. When using this common sequence, it is suggested that they +are scheduled with no intervening instructions so that +implementations that are so optimized can fuse them together. + |
+
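A minimal sketch of the sequence described in the hint (two's-complement arithmetic, so the most-negative value maps to itself):

        neg     a1, a0                # a1 = -a0
        max     a0, a0, a1            # a0 = |a0|; keep the pair adjacent so
                                      # optimized implementations can fuse it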
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.21. maxu
+-
+
- Synopsis +
-
+
Unsigned maximum
+
+ - Mnemonic +
-
+
maxu rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns the larger of two unsigned integers.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+
+let result = if rs1_val <_u rs2_val
+ then rs2_val
+ else rs1_val;
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.22. min
+-
+
- Synopsis +
-
+
Minimum
+
+ - Mnemonic +
-
+
min rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns the smaller of two signed integers.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+
+let result = if rs1_val <_s rs2_val
+ then rs1_val
+ else rs2_val;
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.23. minu
+-
+
- Synopsis +
-
+
Unsigned minimum
+
+ - Mnemonic +
-
+
minu rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction returns the smaller of two unsigned integers.
+
+ - Operation +
let rs1_val = X(rs1);
+let rs2_val = X(rs2);
+
+let result = if rs1_val <_u rs2_val
+ then rs1_val
+ else rs2_val;
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.24. orc.b
+-
+
- Synopsis +
-
+
Bitwise OR-Combine, byte granule
+
+ - Mnemonic +
-
+
orc.b rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
Combines the bits within each byte using bitwise logical OR. +This sets the bits of each byte in the result rd to all zeros if no bit within the respective byte of rs is set, or to all ones if any bit within the respective byte of rs is set.
+
+ - Operation +
let input = X(rs);
+let output : xlenbits = 0;
+
+foreach (i from 0 to (xlen - 8) by 8) {
+ output[(i + 7)..i] = if input[(i + 7)..i] == 0
+ then 0b00000000
+ else 0b11111111;
+}
+
+X[rd] = output;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
30.5.25. orn
+-
+
- Synopsis +
-
+
OR with inverted operand
+
+ - Mnemonic +
-
+
orn rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs the bitwise logical OR operation between rs1 and the bitwise inversion of rs2.
+
+ - Operation +
X(rd) = X(rs1) | ~X(rs2);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.26. pack
+-
+
- Synopsis +
-
+
Pack the low halves of rs1 and rs2 into rd.
+
+ - Mnemonic +
-
+
pack rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
The pack instruction packs the XLEN/2-bit lower halves of rs1 and rs2 into +rd, with rs1 in the lower half and rs2 in the upper half.
+
+ - Operation +
let lo_half : bits(xlen/2) = X(rs1)[xlen/2-1..0];
+let hi_half : bits(xlen/2) = X(rs2)[xlen/2-1..0];
+X(rd) = EXTZ(hi_half @ lo_half);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
30.5.27. packh
+-
+
- Synopsis +
-
+
Pack the low bytes of rs1 and rs2 into rd.
+
+ - Mnemonic +
-
+
packh rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
The packh instruction packs the least-significant bytes of rs1 and rs2 into the 16 least-significant bits of rd, zero-extending the rest of rd.
+
+ - Operation +
let lo_half : bits(8) = X(rs1)[7..0];
+let hi_half : bits(8) = X(rs2)[7..0];
+X(rd) = EXTZ(hi_half @ lo_half);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
30.5.28. packw
+-
+
- Synopsis +
-
+
Pack the low 16-bits of rs1 and rs2 into rd on RV64.
+
+ - Mnemonic +
-
+
packw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction packs the low 16 bits of +rs1 and rs2 into the 32 least-significant bits of rd, +sign extending the 32-bit result to the rest of rd. +This instruction only exists on RV64 based systems.
+
+ - Operation +
let lo_half : bits(16) = X(rs1)[15..0];
+let hi_half : bits(16) = X(rs2)[15..0];
+X(rd) = EXTS(hi_half @ lo_half);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
30.5.29. rev8
+-
+
- Synopsis +
-
+
Byte-reverse register
+
+ - Mnemonic +
-
+
rev8 rd, rs
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction reverses the order of the bytes in rs.
+
+ - Operation +
let input = X(rs);
+let output : xlenbits = 0;
+let j = xlen - 1;
+
+foreach (i from 0 to (xlen - 8) by 8) {
+ output[i..(i + 7)] = input[(j - 7)..j];
+ j = j - 8;
+}
+
+X[rd] = output
++ + | +
+ Note
+
+
+The rev8 mnemonic corresponds to different instruction encodings in RV32 and RV64. + |
+
Software Hint: The byte-reverse operation is only available for the full register width. To emulate word-sized and halfword-sized byte-reversal, perform a …
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+v1.0 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.30. rev.b
+-
+
- Synopsis +
-
+
Reverse the bits in each byte of a source register.
+
+ - Mnemonic +
-
+
rev.b rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction reverses the order of the bits in every byte of a register.
+
+ - Operation +
result : xlenbits = EXTZ(0b0);
+foreach (i from 0 to sizeof(xlen) by 8) {
+ result[i+7..i] = reverse_bits_in_byte(X(rs1)[i+7..i]);
+};
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | v1.0 |
+Ratified |
+
30.5.31. rol
+-
+
- Synopsis +
-
+
Rotate Left (Register)
+
+ - Mnemonic +
-
+
rol rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs a rotate left of rs1 by the amount in least-significant log2(XLEN) bits of rs2.
+
+ - Operation +
let shamt = if xlen == 32
+ then X(rs2)[4..0]
+ else X(rs2)[5..0];
+let result = (X(rs1) << shamt) | (X(rs1) >> (xlen - shamt));
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.32. rolw
+-
+
- Synopsis +
-
+
Rotate Left Word (Register)
+
+ - Mnemonic +
-
+
rolw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs a rotate left on the least-significant word of rs1 by the amount in least-significant 5 bits of rs2. +The resulting word value is sign-extended by copying bit 31 to all of the more-significant bits.
+
+ - Operation +
let rs1 = EXTZ(X(rs1)[31..0])
+let shamt = X(rs2)[4..0];
+let result = (rs1 << shamt) | (rs1 >> (32 - shamt));
+X(rd) = EXTS(result[31..0]);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.33. ror
+-
+
- Synopsis +
-
+
Rotate Right
+
+ - Mnemonic +
-
+
ror rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs a rotate right of rs1 by the amount in least-significant log2(XLEN) bits of rs2.
+
+ - Operation +
let shamt = if xlen == 32
+ then X(rs2)[4..0]
+ else X(rs2)[5..0];
+let result = (X(rs1) >> shamt) | (X(rs1) << (xlen - shamt));
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.34. rori
+-
+
- Synopsis +
-
+
Rotate Right (Immediate)
+
+ - Mnemonic +
-
+
rori rd, rs1, shamt
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction performs a rotate right of rs1 by the amount in the least-significant log2(XLEN) bits of shamt. +For RV32, the encodings corresponding to shamt[5]=1 are reserved.
+
+ - Operation +
let shamt = if xlen == 32
+ then shamt[4..0]
+ else shamt[5..0];
+let result = (X(rs1) >> shamt) | (X(rs1) << (xlen - shamt));
+
+X(rd) = result;
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.35. roriw
+-
+
- Synopsis +
-
+
Rotate Right Word by Immediate
+
+ - Mnemonic +
-
+
roriw rd, rs1, shamt
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs a rotate right on the least-significant word +of rs1 by the amount in the least-significant log2(XLEN) bits of +shamt. +The resulting word value is sign-extended by copying bit 31 to all of +the more-significant bits.
+
+ - Operation +
let rs1_data = EXTZ(X(rs1)[31..0]);
+let result = (rs1_data >> shamt) | (rs1_data << (32 - shamt));
+X(rd) = EXTS(result[31..0]);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.36. rorw
+-
+
- Synopsis +
-
+
Rotate Right Word (Register)
+
+ - Mnemonic +
-
+
rorw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs a rotate right on the least-significant word of rs1 by the amount in least-significant 5 bits of rs2. +The resultant word is sign-extended by copying bit 31 to all of the more-significant bits.
+
+ - Operation +
let rs1 = EXTZ(X(rs1)[31..0])
+let shamt = X(rs2)[4..0];
+let result = (rs1 >> shamt) | (rs1 << (32 - shamt));
+X(rd) = EXTS(result);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.37. sext.b
+-
+
- Synopsis +
-
+
Sign-extend byte
+
+ - Mnemonic +
-
+
sext.b rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction sign-extends the least-significant byte in the source to XLEN by copying the most-significant bit in the byte (i.e., bit 7) to all of the more-significant bits.
+
+ - Operation +
X(rd) = EXTS(X(rs)[7..0]);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
30.5.38. sext.h
+-
+
- Synopsis +
-
+
Sign-extend halfword
+
+ - Mnemonic +
-
+
sext.h rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction sign-extends the least-significant halfword in rs to XLEN by copying the most-significant bit in the halfword (i.e., bit 15) to all of the more-significant bits.
+
+ - Operation +
X(rd) = EXTS(X(rs)[15..0]);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
30.5.39. sh1add
+-
+
- Synopsis +
-
+
Shift left by 1 and add
+
+ - Mnemonic +
-
+
sh1add rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction shifts rs1 to the left by 1 bit and adds it to rs2.
+
+ - Operation +
X(rd) = X(rs2) + (X(rs1) << 1);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.40. sh1add.uw
+-
+
- Synopsis +
-
+
Shift unsigned word left by 1 and add
+
+ - Mnemonic +
-
+
sh1add.uw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs an XLEN-wide addition of two addends. +The first addend is rs2. The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 1 place.
+
+ - Operation +
let base = X(rs2);
+let index = EXTZ(X(rs1)[31..0]);
+
+X(rd) = base + (index << 1);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.41. sh2add
+-
+
- Synopsis +
-
+
Shift left by 2 and add
+
+ - Mnemonic +
-
+
sh2add rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction shifts rs1 to the left by 2 places and adds it to rs2.
+
+ - Operation +
X(rd) = X(rs2) + (X(rs1) << 2);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.42. sh2add.uw
+-
+
- Synopsis +
-
+
Shift unsigned word left by 2 and add
+
+ - Mnemonic +
-
+
sh2add.uw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs an XLEN-wide addition of two addends. +The first addend is rs2. +The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 2 places.
+
+ - Operation +
let base = X(rs2);
+let index = EXTZ(X(rs1)[31..0]);
+
+X(rd) = base + (index << 2);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.43. sh3add
+-
+
- Synopsis +
-
+
Shift left by 3 and add
+
+ - Mnemonic +
-
+
sh3add rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction shifts rs1 to the left by 3 places and adds it to rs2.
+
+ - Operation +
X(rd) = X(rs2) + (X(rs1) << 3);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.44. sh3add.uw
+-
+
- Synopsis +
-
+
Shift unsigned word left by 3 and add
+
+ - Mnemonic +
-
+
sh3add.uw rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs an XLEN-wide addition of two addends. The first addend is rs2. The second addend is the unsigned value formed by extracting the least-significant word of rs1 and shifting it left by 3 places.
+
+ - Operation +
let base = X(rs2);
+let index = EXTZ(X(rs1)[31..0]);
+
+X(rd) = base + (index << 3);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
30.5.45. slli.uw
+-
+
- Synopsis +
-
+
Shift-left unsigned word (Immediate)
+
+ - Mnemonic +
-
+
slli.uw rd, rs1, shamt
+
+ - Encoding +
-
+
- Description +
-
+
This instruction takes the least-significant word of rs1, zero-extends it, and shifts it left by the immediate.
+
+ - Operation +
X(rd) = (EXTZ(X(rs)[31..0]) << shamt);
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
+ | 0.93 |
+Ratified |
+
Architecture Explanation: This instruction is the same as slli with zext.w performed on rs1 before shifting.
30.5.46. unzip
+-
+
- Synopsis +
-
+
Implements the inverse of the zip instruction.
+
+ - Mnemonic +
-
+
unzip rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction gathers bits from the high and low halves of the source +word into odd/even bit positions in the destination word. +It is the inverse of the zip instruction. +This instruction is available only on RV32.
+
+ - Operation +
foreach (i from 0 to xlen/2-1) {
+ X(rd)[i] = X(rs1)[2*i]
+ X(rd)[i+xlen/2] = X(rs1)[2*i+1]
+}
++ + | +
+ Software Hint
+
+
+This instruction is useful for implementing the SHA3 cryptographic +hash function on a 32-bit architecture, as it implements the +bit-interleaving operation used to speed up the 64-bit rotations +directly. + |
+
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbkb (Bit-manipulation for Cryptography) (RV32) |
+v1.0 |
+Ratified |
+
30.5.47. xnor
+-
+
- Synopsis +
-
+
Exclusive NOR
+
+ - Mnemonic +
-
+
xnor rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
This instruction performs the bit-wise exclusive-NOR operation on rs1 and rs2.
+
+ - Operation +
X(rd) = ~(X(rs1) ^ X(rs2));
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
+ | v1.0 |
+Ratified |
+
30.5.48. xperm.b
+-
+
- Synopsis +
-
+
Byte-wise lookup of indices into a vector in registers.
+
+ - Mnemonic +
-
+
xperm.b rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
The xperm.b instruction operates on bytes. +The rs1 register contains a vector of XLEN/8 8-bit elements. +The rs2 register contains a vector of XLEN/8 8-bit indexes. +The result is each element in rs2 replaced by the indexed element in rs1, +or zero if the index into rs2 is out of bounds.
+
+ - Operation +
val xpermb_lookup : (bits(8), xlenbits) -> bits(8)
+function xpermb_lookup (idx, lut) = {
+ (lut >> (idx @ 0b000))[7..0]
+}
+
+function clause execute ( XPERM_B (rs2,rs1,rd)) = {
+ result : xlenbits = EXTZ(0b0);
+ foreach(i from 0 to xlen by 8) {
+ result[i+7..i] = xpermb_lookup(X(rs2)[i+7..i], X(rs1));
+ };
+ X(rd) = result;
+ RETIRE_SUCCESS
+}
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbkx (Crossbar permutations) |
+v1.0 |
+Ratified |
+
30.5.49. xperm.n
+-
+
- Synopsis +
-
+
Nibble-wise lookup of indices into a vector.
+
+ - Mnemonic +
-
+
xperm.n rd, rs1, rs2
+
+ - Encoding +
-
+
- Description +
-
+
The xperm.n instruction operates on nibbles. +The rs1 register contains a vector of XLEN/4 4-bit elements. +The rs2 register contains a vector of XLEN/4 4-bit indexes. +The result is each element in rs2 replaced by the indexed element in rs1, +or zero if the index into rs2 is out of bounds.
+
+ - Operation +
val xpermn_lookup : (bits(4), xlenbits) -> bits(4)
+function xpermn_lookup (idx, lut) = {
+ (lut >> (idx @ 0b00))[3..0]
+}
+
+function clause execute ( XPERM_N (rs2,rs1,rd)) = {
+ result : xlenbits = EXTZ(0b0);
+ foreach(i from 0 to xlen by 4) {
+ result[i+3..i] = xpermn_lookup(X(rs2)[i+3..i], X(rs1));
+ };
+ X(rd) = result;
+ RETIRE_SUCCESS
+}
+-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbkx (Crossbar permutations) |
+v1.0 |
+Ratified |
+
30.5.50. zext.h
+-
+
- Synopsis +
-
+
Zero-extend halfword
+
+ - Mnemonic +
-
+
zext.h rd, rs
+
+ - Encoding (RV32) +
-
+
- Encoding (RV64) +
-
+
- Description +
-
+
This instruction zero-extends the least-significant halfword of the source to XLEN by inserting 0’s into all of the bits more significant than 15.
+
+ - Operation +
X(rd) = EXTZ(X(rs)[15..0]);
++ + | +
+ Note
+
+
+The zext.h mnemonic corresponds to different instruction encodings in RV32 and RV64. + |
+
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbb (Basic bit-manipulation) |
+0.93 |
+Ratified |
+
30.5.51. zip
+-
+
- Synopsis +
-
+
Interleave the bits of the low and high halves of the source word into the even and odd bit positions of the destination.
+
+ - Mnemonic +
-
+
zip rd, rs
+
+ - Encoding +
-
+
- Description +
-
+
This instruction scatters the bits of the low and high halves of the source word into the even and odd bit positions of the destination word, respectively. It is the inverse of the unzip instruction. This instruction is available only on RV32. A C model of this bit rearrangement, together with unzip, is sketched at the end of this section.
+
+ - Operation +
foreach (i from 0 to xlen/2-1) {
+ X(rd)[2*i] = X(rs1)[i]
+ X(rd)[2*i+1] = X(rs1)[i+xlen/2]
+}
++ + | +
+ Software Hint
+
+
+This instruction is useful for implementing the SHA3 cryptographic +hash function on a 32-bit architecture, as it implements the +bit-interleaving operation used to speed up the 64-bit rotations +directly. + |
+
-
+
- Included in +
Extension | +Minimum version | +Lifecycle state | +
---|---|---|
Zbkb (Bit-manipulation for Cryptography) (RV32) |
+v1.0 |
+Ratified |
+
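The bit rearrangement performed by zip and unzip can be modeled in plain C as follows (an illustrative sketch only; the function names zip32 and unzip32 are ours, and XLEN = 32 is assumed since both instructions are RV32-only):

#include <stdint.h>

/* Software model of zip (RV32): bit i of the low half of rs moves to even
 * bit 2*i of rd, and bit i of the high half moves to odd bit 2*i+1. */
static uint32_t zip32(uint32_t rs)
{
    uint32_t rd = 0;
    for (int i = 0; i < 16; i++) {
        rd |= ((rs >> i) & 1u) << (2 * i);             /* low half  -> even bits */
        rd |= ((rs >> (i + 16)) & 1u) << (2 * i + 1);  /* high half -> odd bits  */
    }
    return rd;
}

/* Software model of unzip (RV32): the inverse gathering operation. */
static uint32_t unzip32(uint32_t rs)
{
    uint32_t rd = 0;
    for (int i = 0; i < 16; i++) {
        rd |= ((rs >> (2 * i)) & 1u) << i;             /* even bits -> low half  */
        rd |= ((rs >> (2 * i + 1)) & 1u) << (i + 16);  /* odd bits  -> high half */
    }
    return rd;
}

Applying one function to the result of the other returns the original word, which is the property that the SHA3 bit-interleaving optimization mentioned in the software hints relies on.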
30.6. Software optimization guide
+30.6.1. strlen
+The orc.b instruction allows for the efficient detection of NUL bytes in an XLEN-sized chunk of data:
+-
+
-
+
the result of orc.b on a chunk that does not contain any NUL bytes will be all-ones, and
+
+ -
+
after a bitwise-negation of the result of orc.b, the number of data bytes before the first NUL byte (if any) can be detected by ctz/clz (depending on the endianness of data).
+
+
A full example of a strlen function that uses these techniques, and that also demonstrates how to handle unaligned or partial data, is the following:
+#include <sys/asm.h>
+
+ .text
+ .globl strlen
+ .type strlen, @function
+strlen:
+ andi a3, a0, (SZREG-1) // offset
+ andi a1, a0, -SZREG // align pointer
+.Lprologue:
+ li a4, SZREG
+ sub a4, a4, a3 // SZREG - offset (valid bytes in the first chunk)
+ slli a3, a3, 3 // offset * 8
+ REG_L a2, 0(a1) // chunk
+ /*
+ * Shift the partial/unaligned chunk we loaded to remove the bytes
+ * from before the start of the string, adding NUL bytes at the end.
+ */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ srl a2, a2 ,a3 // chunk >> (offset * 8)
+#else
+ sll a2, a2, a3
+#endif
+ orc.b a2, a2
+ not a2, a2
+ /*
+ * Non-NUL bytes in the string have been expanded to 0x00, while
+ * NUL bytes have become 0xff. Search for the first set bit
+ * (corresponding to a NUL byte in the original chunk).
+ */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ ctz a2, a2
+#else
+ clz a2, a2
+#endif
+ /*
+ * The first chunk is special: compare against the number of valid
+ * bytes in this chunk.
+ */
+ srli a0, a2, 3
+ bgtu a4, a0, .Ldone
+ addi a3, a1, SZREG
+ li a4, -1
+ .align 2
+ /*
+ * Our critical loop is 4 instructions and processes data in 4 byte
+ * or 8 byte chunks.
+ */
+.Lloop:
+ REG_L a2, SZREG(a1)
+ addi a1, a1, SZREG
+ orc.b a2, a2
+ beq a2, a4, .Lloop
+
+.Lepilogue:
+ not a2, a2
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ ctz a2, a2
+#else
+ clz a2, a2
+#endif
+ sub a1, a1, a3
+ add a0, a0, a1
+ srli a2, a2, 3
+ add a0, a0, a2
+.Ldone:
+ ret
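The same chunk-at-a-time idea can also be written as a C sketch for reference. Here orc_b() is a software model of the orc.b instruction rather than a compiler intrinsic, little-endian data and a 64-bit XLEN are assumed, __builtin_ctzll is a GCC/Clang builtin, and the unaligned prologue simply scans byte by byte instead of using the shift trick from the assembly above:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software model of orc.b: each byte of x becomes 0xff if it is non-zero
 * and 0x00 if it is zero. */
static uint64_t orc_b(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        if ((x >> (8 * i)) & 0xff)
            r |= (uint64_t)0xff << (8 * i);
    return r;
}

size_t strlen_orcb(const char *s)
{
    const char *p = s;

    /* Simplified prologue: scan byte by byte until p is 8-byte aligned. */
    while ((uintptr_t)p & 7) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan one aligned 64-bit chunk at a time.  Like the assembly version,
     * this may read a few bytes past the terminator within the last aligned
     * word, which is fine at the ISA level but not strictly conforming C. */
    for (;;) {
        uint64_t chunk;
        memcpy(&chunk, p, sizeof(chunk));
        uint64_t mask = ~orc_b(chunk);          /* 0xff marks a NUL byte */
        if (mask != 0)                          /* lowest set bit = first NUL */
            return (size_t)(p - s) + (size_t)(__builtin_ctzll(mask) >> 3);
        p += sizeof(chunk);
    }
}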
30.6.2. strcmp
The orc.b instruction can be used in the same way for strcmp: a NUL byte in an aligned chunk is detected before the chunks loaded from the two strings are compared word-at-a-time, as in the following example:
+#include <sys/asm.h>
+
+ .text
+ .globl strcmp
+ .type strcmp, @function
+strcmp:
+ or a4, a0, a1
+ li t2, -1
+ and a4, a4, SZREG-1
+ bnez a4, .Lsimpleloop
+
+ # Main loop for aligned strings
+.Lloop:
+ REG_L a2, 0(a0)
+ REG_L a3, 0(a1)
+ orc.b t0, a2
+ bne t0, t2, .Lfoundnull
+ addi a0, a0, SZREG
+ addi a1, a1, SZREG
+ beq a2, a3, .Lloop
+
+ # Words don't match, and no null byte in first word.
+ # Get bytes in big-endian order and compare.
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+ rev8 a2, a2
+ rev8 a3, a3
+#endif
+ # Synthesize (a2 >= a3) ? 1 : -1 in a branchless sequence.
+ sltu a0, a2, a3
+ neg a0, a0
+ ori a0, a0, 1
+ ret
+
+.Lfoundnull:
+ # Found a null byte.
+ # If words don't match, fall back to simple loop.
+ bne a2, a3, .Lsimpleloop
+
+ # Otherwise, strings are equal.
+ li a0, 0
+ ret
+
+ # Simple loop for misaligned strings
+.Lsimpleloop:
+ lbu a2, 0(a0)
+ lbu a3, 0(a1)
+ addi a0, a0, 1
+ addi a1, a1, 1
+ bne a2, a3, 1f
+ bnez a2, .Lsimpleloop
+
+1:
+ sub a0, a2, a3
+ ret
+
+.size strcmp, .-strcmp
+31. "J" Extension for Dynamically Translated Languages, Version 0.0
+This chapter is a placeholder for a future standard extension to support +dynamically translated languages.
++ + | +
+
+
+Many popular languages are usually implemented via dynamic translation, +including Java and Javascript. These languages can benefit from +additional ISA support for dynamic checks and garbage collection. + |
+
32. "P" Extension for Packed-SIMD Instructions, Version 0.2
++ + | +
+
+
+Discussions at the 5th RISC-V workshop indicated a desire to drop this +packed-SIMD proposal for floating-point registers in favor of +standardizing on the V extension for large floating-point SIMD +operations. However, there was interest in packed-SIMD fixed-point +operations for use in the integer registers of small RISC-V +implementations. A task group is working to define the new P extension. + |
+
33. "V" Standard Extension for Vector Operations, Version 1.0
+34. Cryptography Extensions: Scalar & Entropy Source Instructions, Version 1.0.1
+35. Cryptography Extensions: Vector Instructions, Version 1.0
+36. Control-flow Integrity (CFI)
+CV64A6_MMU: The Zicfiss extension is not supported.
+CV64A6_MMU: The Zicfilp extension is not supported.
+37. RV32/64G Instruction Set Listings
+One goal of the RISC-V project is that it be used as a stable software +development target. For this purpose, we define a combination of a base +ISA (RV32I or RV64I) plus selected standard extensions (IMAFD, Zicsr, +Zifencei) as a "general-purpose" ISA, and we use the abbreviation G +for the IMAFDZicsr_Zifencei combination of instruction-set extensions. +This chapter presents opcode maps and instruction-set listings for RV32G +and RV64G.
+CV64A6_MMU: This chapter presents opcode maps and instruction-set +listings for CV64A6_MMU.
+
[Table 27: RV32/64G major opcode map, arranged by instruction bits inst[6:5] and inst[4:2]. Only the instruction-length entries survived extraction: the inst[4:2]=111 column is marked (>32b), and the remaining length entries are 48b, 64b, 48b, and ≥80b. The individual major-opcode names are not preserved here.]
Table 27 shows a map of the major opcodes for +RVG. Major opcodes with 3 or more lower bits set are reserved for +instruction lengths greater than 32 bits. Opcodes marked as reserved +should be avoided for custom instruction-set extensions as they might be +used by future standard extensions. Major opcodes marked as custom-0 +and custom-1 will be avoided by future standard extensions and are +recommended for use by custom instruction-set extensions within the base +32-bit instruction format. The opcodes marked custom-2/rv128 and +custom-3/rv128 are reserved for future use by RV128, but will +otherwise be avoided for standard extensions and so can also be used for +custom instruction-set extensions in RV32 and RV64.
+We believe RV32G and RV64G provide simple but complete instruction sets +for a broad range of general-purpose computing. The optional compressed +instruction set described in Chapter 28 can +be added (forming RV32GC and RV64GC) to improve performance, code size, +and energy efficiency, though with some additional hardware complexity.
+As we move beyond IMAFDC into further instruction-set extensions, the +added instructions tend to be more domain-specific and only provide +benefits to a restricted class of applications, e.g., for multimedia or +security. Unlike most commercial ISAs, the RISC-V ISA design clearly +separates the base ISA and broadly applicable standard extensions from +these more specialized additions. Chapter 38 +has a more extensive discussion of ways to add extensions to the RISC-V +ISA.
+
[RV32G/RV64G instruction-set listing tables (instruction encodings for the base integer ISAs and the standard extensions): the table contents were not preserved in this extraction.]
Table 28 lists the CSRs that have currently been +allocated CSR addresses. The timers, counters, and floating-point CSRs +are the only CSRs defined in this specification.
Number | Privilege | Name | Description
---|---|---|---
Floating-Point CSRs |||
0x001 | Read write | fflags | Floating-Point Accrued Exceptions.
0x002 | Read write | frm | Floating-Point Dynamic Rounding Mode.
0x003 | Read write | fcsr | Floating-Point Control and Status Register (frm + fflags).
Counters and Timers |||
0xC00 | Read-only | cycle | Cycle counter for RDCYCLE instruction.
0xC01 | Read-only | time | Timer for RDTIME instruction.
0xC02 | Read-only | instret | Instructions-retired counter for RDINSTRET instruction.
0xC80 | Read-only | cycleh | Upper 32 bits of cycle, RV32 only.
0xC81 | Read-only | timeh | Upper 32 bits of time, RV32 only.
0xC82 | Read-only | instreth | Upper 32 bits of instret, RV32 only.
38. Extending RISC-V
+In addition to supporting standard general-purpose software development, +another goal of RISC-V is to provide a basis for more specialized +instruction-set extensions or more customized accelerators. The +instruction encoding spaces and optional variable-length instruction +encoding are designed to make it easier to leverage software development +effort for the standard ISA toolchain when building more customized +processors. For example, the intent is to continue to provide full +software support for implementations that only use the standard I base, +perhaps together with many non-standard instruction-set extensions.
+This chapter describes various ways in which the base RISC-V ISA can be +extended, together with the scheme for managing instruction-set +extensions developed by independent groups. This volume only deals with +the unprivileged ISA, although the same approach and terminology is used +for supervisor-level extensions described in the second volume.
+38.1. Extension Terminology
+This section defines some standard terminology for describing RISC-V +extensions.
+38.1.1. Standard versus Non-Standard Extension
+Any RISC-V processor implementation must support a base integer ISA +(RV32I, RV32E, RV64I, RV64E, or RV128I). In addition, an implementation may +support one or more extensions. We divide extensions into two broad +categories: standard versus non-standard.
+-
+
-
+
A standard extension is one that is generally useful and that is +designed to not conflict with any other standard extension. Currently, +"MAFDQLCBTPV", described in other chapters of this manual, are either +complete or planned standard extensions.
+
+ -
+
A non-standard extension may be highly specialized and may conflict +with other standard or non-standard extensions. We anticipate a wide +variety of non-standard extensions will be developed over time, with +some eventually being promoted to standard extensions.
+
+
38.1.2. Instruction Encoding Spaces and Prefixes
+An instruction encoding space is some number of instruction bits within +which a base ISA or ISA extension is encoded. RISC-V supports varying +instruction lengths, but even within a single instruction length, there +are various sizes of encoding space available. For example, the base +ISAs are defined within a 30-bit encoding space (bits 31-2 of the 32-bit +instruction), while the atomic extension "A" fits within a 25-bit +encoding space (bits 31-7).
+We use the term prefix to refer to the bits to the right of an +instruction encoding space (since instruction fetch in RISC-V is +little-endian, the bits to the right are stored at earlier memory +addresses, hence form a prefix in instruction-fetch order). The prefix +for the standard base ISA encoding is the two-bit "11" field held in +bits 1-0 of the 32-bit word, while the prefix for the standard atomic +extension "A" is the seven-bit "0101111" field held in bits 6-0 of +the 32-bit word representing the AMO major opcode. A quirk of the +encoding format is that the 3-bit funct3 field used to encode a minor +opcode is not contiguous with the major opcode bits in the 32-bit +instruction format, but is considered part of the prefix for 22-bit +instruction spaces.
+Although an instruction encoding space could be of any size, adopting a +smaller set of common sizes simplifies packing independently developed +extensions into a single global encoding. +Table 29 gives the suggested sizes for RISC-V.
Size | Usage | # Available in 16-bit | 32-bit | 48-bit | 64-bit
---|---|---|---|---|---
14-bit | Quadrant of compressed 16-bit encoding | 3 | | |
22-bit | Minor opcode in base 32-bit encoding | | | |
25-bit | Major opcode in base 32-bit encoding | | 32 | |
30-bit | Quadrant of base 32-bit encoding | | 1 | |
32-bit | Minor opcode in 48-bit encoding | | | |
37-bit | Major opcode in 48-bit encoding | | | 32 |
40-bit | Quadrant of 48-bit encoding | | | 4 |
45-bit | Sub-minor opcode in 64-bit encoding | | | |
48-bit | Minor opcode in 64-bit encoding | | | |
52-bit | Major opcode in 64-bit encoding | | | | 32
(The last four columns give the number of such encoding spaces available within each standard instruction length; entries left blank were not preserved in this extraction.)
38.1.3. Greenfield versus Brownfield Extensions
+We use the term greenfield extension to describe an extension that +begins populating a new instruction encoding space, and hence can only +cause encoding conflicts at the prefix level. We use the term +brownfield extension to describe an extension that fits around +existing encodings in a previously defined instruction space. A +brownfield extension is necessarily tied to a particular greenfield +parent encoding, and there may be multiple brownfield extensions to the +same greenfield parent encoding. For example, the base ISAs are +greenfield encodings of a 30-bit instruction space, while the FDQ +floating-point extensions are all brownfield extensions adding to the +parent base ISA 30-bit encoding space.
+Note that we consider the standard A extension to have a greenfield +encoding as it defines a new previously empty 25-bit encoding space in +the leftmost bits of the full 32-bit base instruction encoding, even +though its standard prefix locates it within the 30-bit encoding space +of its parent base ISA. Changing only its single 7-bit prefix could move +the A extension to a different 30-bit encoding space while only worrying +about conflicts at the prefix level, not within the encoding space +itself.
++ | Adds state | +No new state | +
---|---|---|
Greenfield |
+RV32I(30), RV64I(30) |
+A(25) |
+
Brownfield |
+F(I), D(F), Q(D) |
+M(I) |
+
Table 30 shows the bases and standard extensions placed +in a simple two-dimensional taxonomy. One axis is whether the extension +is greenfield or brownfield, while the other axis is whether the +extension adds architectural state. For greenfield extensions, the size +of the instruction encoding space is given in parentheses. For +brownfield extensions, the name of the extension (greenfield or +brownfield) it builds upon is given in parentheses. Additional +user-level architectural state usually implies changes to the +supervisor-level system or possibly to the standard calling convention.
+Note that RV64I is not considered an extension of RV32I, but a different +complete base encoding.
+38.1.4. Standard-Compatible Global Encodings
+A complete or global encoding of an ISA for an actual RISC-V +implementation must allocate a unique non-conflicting prefix for every +included instruction encoding space. The bases and every standard +extension have each had a standard prefix allocated to ensure they can +all coexist in a global encoding.
+A standard-compatible global encoding is one where the base and every +included standard extension have their standard prefixes. A +standard-compatible global encoding can include non-standard extensions +that do not conflict with the included standard extensions. A +standard-compatible global encoding can also use standard prefixes for +non-standard extensions if the associated standard extensions are not +included in the global encoding. In other words, a standard extension +must use its standard prefix if included in a standard-compatible global +encoding, but otherwise its prefix is free to be reallocated. These +constraints allow a common toolchain to target the standard subset of +any RISC-V standard-compatible global encoding.
+38.1.5. Guaranteed Non-Standard Encoding Space
+To support development of proprietary custom extensions, portions of the +encoding space are guaranteed to never be used by standard extensions.
+38.2. RISC-V Extension Design Philosophy
+We intend to support a large number of independently developed +extensions by encouraging extension developers to operate within +instruction encoding spaces, and by providing tools to pack these into a +standard-compatible global encoding by allocating unique prefixes. Some +extensions are more naturally implemented as brownfield augmentations of +existing extensions, and will share whatever prefix is allocated to +their parent greenfield extension. The standard extension prefixes avoid +spurious incompatibilities in the encoding of core functionality, while +allowing custom packing of more esoteric extensions.
+This capability of repacking RISC-V extensions into different +standard-compatible global encodings can be used in a number of ways.
+One use-case is developing highly specialized custom accelerators, +designed to run kernels from important application domains. These might +want to drop all but the base integer ISA and add in only the extensions +that are required for the task in hand. The base ISAs have been designed +to place minimal requirements on a hardware implementation, and has been +encoded to use only a small fraction of a 32-bit instruction encoding +space.
+Another use-case is to build a research prototype for a new type of +instruction-set extension. The researchers might not want to expend the +effort to implement a variable-length instruction-fetch unit, and so +would like to prototype their extension using a simple 32-bit +fixed-width instruction encoding. However, this new extension might be +too large to coexist with standard extensions in the 32-bit space. If +the research experiments do not need all of the standard extensions, a +standard-compatible global encoding might drop the unused standard +extensions and reuse their prefixes to place the proposed extension in a +non-standard location to simplify engineering of the research prototype. +Standard tools will still be able to target the base and any standard +extensions that are present to reduce development time. Once the +instruction-set extension has been evaluated and refined, it could then +be made available for packing into a larger variable-length encoding +space to avoid conflicts with all standard extensions.
+The following sections describe increasingly sophisticated strategies +for developing implementations with new instruction-set extensions. +These are mostly intended for use in highly customized, educational, or +experimental architectures rather than for the main line of RISC-V ISA +development.
+38.3. Extensions within fixed-width 32-bit instruction format
+In this section, we discuss adding extensions to implementations that +only support the base fixed-width 32-bit instruction format.
++ + | +
+
+
+We anticipate the simplest fixed-width 32-bit encoding will be popular +for many restricted accelerators and research prototypes. + |
+
38.3.1. Available 30-bit instruction encoding spaces
+In the standard encoding, three of the available 30-bit instruction
+encoding spaces (those with 2-bit prefixes 00
, 01
, and 10
) are used to
+enable the optional compressed instruction extension. However, if the
+compressed instruction-set extension is not required, then these three
+further 30-bit encoding spaces become available. This quadruples the
+available encoding space within the 32-bit format.
38.3.2. Available 25-bit instruction encoding spaces
+A 25-bit instruction encoding space corresponds to a major opcode in the +base and standard extension encodings.
+There are four major opcodes expressly designated for custom extensions +Table 27, each of which represents a 25-bit +encoding space. Two of these are reserved for eventual use in the RV128 +base encoding (will be OP-IMM-64 and OP-64), but can be used for +non-standard extensions for RV32 and RV64.
+The two major opcodes reserved for RV64 (OP-IMM-32 and OP-32) can also +be used for non-standard extensions to RV32 only.
+If an implementation does not require floating-point, then the seven +major opcodes reserved for standard floating-point extensions (LOAD-FP, +STORE-FP, MADD, MSUB, NMSUB, NMADD, OP-FP) can be reused for +non-standard extensions. Similarly, the AMO major opcode can be reused +if the standard atomic extensions are not required.
+If an implementation does not require instructions longer than 32-bits, +then an additional four major opcodes are available (those marked in +gray in Table 27).
+The base RV32I encoding uses only 11 major opcodes plus 3 reserved +opcodes, leaving up to 18 available for extensions. The base RV64I +encoding uses only 13 major opcodes plus 3 reserved opcodes, leaving up +to 16 available for extensions.
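For example, on a toolchain whose assembler supports the .insn directive, a non-standard R-type instruction placed in the custom-0 major opcode could be emitted from C roughly as follows (a sketch only; the instruction itself is hypothetical and must actually be implemented by the hardware, otherwise executing it raises an illegal-instruction exception):

#include <stdint.h>

/* Emit a hypothetical R-type instruction in the custom-0 major opcode
 * (0b0001011 = 0x0b) with funct3 = 0 and funct7 = 0, using the GNU
 * assembler's .insn directive. */
static inline uint64_t custom0_op(uint64_t a, uint64_t b)
{
    uint64_t r;
    asm volatile(".insn r 0x0b, 0x0, 0x0, %0, %1, %2"
                 : "=r"(r)
                 : "r"(a), "r"(b));
    return r;
}

Because custom-0 and custom-1 will be avoided by future standard extensions, instructions placed there cannot conflict with future standard encodings, although they are of course only usable on implementations that define them.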
+38.3.3. Available 22-bit instruction encoding spaces
+A 22-bit encoding space corresponds to a funct3 minor opcode space in +the base and standard extension encodings. Several major opcodes have a +funct3 field minor opcode that is not completely occupied, leaving +available several 22-bit encoding spaces.
+Usually a major opcode selects the format used to encode operands in the +remaining bits of the instruction, and ideally, an extension should +follow the operand format of the major opcode to simplify hardware +decoding.
+38.3.4. Other spaces
+Smaller spaces are available under certain major opcodes, and not all +minor opcodes are entirely filled.
+38.4. Adding aligned 64-bit instruction extensions
+The simplest approach to provide space for extensions that are too large +for the base 32-bit fixed-width instruction format is to add naturally +aligned 64-bit instructions. The implementation must still support the +32-bit base instruction format, but can require that 64-bit instructions +are aligned on 64-bit boundaries to simplify instruction fetch, with a +32-bit NOP instruction used as alignment padding where necessary.
+To simplify use of standard tools, the 64-bit instructions should be +encoded as described in Table 1. +However, an implementation might choose a non-standard +instruction-length encoding for 64-bit instructions, while retaining the +standard encoding for 32-bit instructions. For example, if compressed +instructions are not required, then a 64-bit instruction could be +encoded using one or more zero bits in the first two bits of an +instruction.
++ + | +
+
+
+We anticipate processor generators that produce instruction-fetch units +capable of automatically handling any combination of supported +variable-length instruction encodings. + |
+
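As a concrete illustration of the standard length encoding referred to above, the following C sketch (function name ours) classifies an instruction by its first 16-bit parcel; formats of 80 bits and longer are left out for brevity:

#include <stdint.h>

/* Return the length in bytes of a RISC-V instruction given its first
 * 16-bit parcel, following the standard variable-length encoding:
 *   bits [1:0] != 11                -> 16-bit
 *   bits [1:0] == 11, [4:2] != 111  -> 32-bit
 *   bits [5:0] == 011111            -> 48-bit
 *   bits [6:0] == 0111111           -> 64-bit
 */
static int riscv_insn_length(uint16_t first_parcel)
{
    if ((first_parcel & 0x3) != 0x3)
        return 2;                       /* compressed 16-bit instruction */
    if ((first_parcel & 0x1c) != 0x1c)
        return 4;                       /* base 32-bit instruction */
    if ((first_parcel & 0x3f) == 0x1f)
        return 6;                       /* 48-bit instruction */
    if ((first_parcel & 0x7f) == 0x3f)
        return 8;                       /* 64-bit instruction */
    return 0;                           /* longer or reserved encoding */
}

An implementation that supports only 16-bit and 32-bit instructions, or that adopts a non-standard length encoding as discussed above, would of course use a correspondingly different decoder.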
38.5. Supporting VLIW encodings
+Although RISC-V was not designed as a base for a pure VLIW machine, VLIW +encodings can be added as extensions using several alternative +approaches. In all cases, the base 32-bit encoding has to be supported +to allow use of any standard software tools.
+38.5.1. Fixed-size instruction group
+The simplest approach is to define a single large naturally aligned +instruction format (e.g., 128 bits) within which VLIW operations are +encoded. In a conventional VLIW, this approach would tend to waste +instruction memory to hold NOPs, but a RISC-V-compatible implementation +would have to also support the base 32-bit instructions, confining the +VLIW code size expansion to VLIW-accelerated functions.
+38.5.2. Encoded-Length Groups
+Another approach is to use the standard length encoding from +Table 1 to encode parallel +instruction groups, allowing NOPs to be compressed out of the VLIW +instruction. For example, a 64-bit instruction could hold two 28-bit +operations, while a 96-bit instruction could hold three 28-bit +operations, and so on. Alternatively, a 48-bit instruction could hold +one 42-bit operation, while a 96-bit instruction could hold two 42-bit +operations, and so on.
+This approach has the advantage of retaining the base ISA encoding for +instructions holding a single operation, but has the disadvantage of +requiring a new 28-bit or 42-bit encoding for operations within the VLIW +instructions, and misaligned instruction fetch for larger groups. One +simplification is to not allow VLIW instructions to straddle certain +microarchitecturally significant boundaries (e.g., cache lines or +virtual memory pages).
+38.5.3. Fixed-Size Instruction Bundles
+Another approach, similar to Itanium, is to use a larger naturally +aligned fixed instruction bundle size (e.g., 128 bits) across which +parallel operation groups are encoded. This simplifies instruction +fetch, but shifts the complexity to the group execution engine. To +remain RISC-V compatible, the base 32-bit instruction would still have +to be supported.
+38.5.4. End-of-Group bits in Prefix
+None of the above approaches retains the RISC-V encoding for the +individual operations within a VLIW instruction. Yet another approach is +to repurpose the two prefix bits in the fixed-width 32-bit encoding. One +prefix bit can be used to signal "end-of-group" if set, while the +second bit could indicate execution under a predicate if clear. Standard +RISC-V 32-bit instructions generated by tools unaware of the VLIW +extension would have both prefix bits set (11) and thus have the correct +semantics, with each instruction at the end of a group and not +predicated.
+The main disadvantage of this approach is that the base ISAs lack the +complex predication support usually required in an aggressive VLIW +system, and it is difficult to add space to specify more predicate +registers in the standard 30-bit encoding space.
+39. ISA Extension Naming Conventions
+This chapter describes the RISC-V ISA extension naming scheme that is +used to concisely describe the set of instructions present in a hardware +implementation, or the set of instructions used by an application binary +interface (ABI).
++ + | +
+
+
+The RISC-V ISA is designed to support a wide variety of implementations +with various experimental instruction-set extensions. We have found that +an organized naming scheme simplifies software tools and documentation. + |
+
39.1. Case Sensitivity
+The ISA naming strings are case insensitive.
+39.2. Base Integer ISA
+RISC-V ISA strings begin with either RV32I, RV32E, RV64I, RV64E, or RV128I +indicating the supported address space size in bits for the base integer +ISA.
+39.3. Instruction-Set Extension Names
+Standard ISA extensions are given a name consisting of a single letter. +For example, the first four standard extensions to the integer bases +are: "M" for integer multiplication and division, "A" for atomic +memory instructions, "F" for single-precision floating-point +instructions, and "D" for double-precision floating-point +instructions. Any RISC-V instruction-set variant can be succinctly +described by concatenating the base integer prefix with the names of the +included extensions, e.g., "RV64IMAFD".
+We have also defined an abbreviation "G" to represent the +"IMAFDZicsr_Zifencei" base and extensions, as this is intended to +represent our standard general-purpose ISA.
+Standard extensions to the RISC-V ISA are given other reserved letters, +e.g., "Q" for quad-precision floating-point, or "C" for the 16-bit +compressed instruction format.
+Some ISA extensions depend on the presence of other extensions, e.g., +"D" depends on "F" and "F" depends on "Zicsr". These dependencies +may be implicit in the ISA name: for example, RV32IF is equivalent to +RV32IFZicsr, and RV32ID is equivalent to RV32IFD and RV32IFDZicsr.
+39.4. Version Numbers
+Recognizing that instruction sets may expand or alter over time, we +encode extension version numbers following the extension name. Version +numbers are divided into major and minor version numbers, separated by a +"p". If the minor version is "0", then "p0" can be omitted from +the version string. Changes in major version numbers imply a loss of +backwards compatibility, whereas changes in only the minor version +number must be backwards-compatible. For example, the original 64-bit +standard ISA defined in release 1.0 of this manual can be written in +full as "RV64I1p0M1p0A1p0F1p0D1p0", more concisely as +"RV64I1M1A1F1D1".
+We introduced the version numbering scheme with the second release. +Hence, we define the default version of a standard extension to be the +version present at that time, e.g., "RV32I" is equivalent to +"RV32I2".
+39.5. Underscores
+Underscores "_" may be used to separate ISA extensions to improve +readability and to provide disambiguation, e.g., "RV32I2_M2_A2".
+Because the "P" extension for Packed SIMD can be confused for the +decimal point in a version number, it must be preceded by an underscore +if it follows a number. For example, "rv32i2p2" means version 2.2 of +RV32I, whereas "rv32i2_p2" means version 2.0 of RV32I with version 2.0 +of the P extension.
+39.6. Additional Standard Unprivileged Extension Names
+Standard unprivileged extensions can also be named using a single "Z" followed by +an alphabetical name and an optional version number. For example, +"Zifencei" names the instruction-fetch fence extension described in +Chapter 6; "Zifencei2" and +"Zifencei2p0" name version 2.0 of same.
+The first letter following the "Z" conventionally indicates the most +closely related alphabetical extension category, IMAFDQLCBKJTPVH. For the +"Zfa" extension for additional floating-point instructions, for example, the letter "f" +indicates the extension is related to the "F" standard extension. If +multiple "Z" extensions are named, they should be ordered first by +category, then alphabetically within a category—for example, +"Zicsr_Zifencei_Zam".
+All multi-letter extensions, including those with the "Z" prefix, must be +separated from other multi-letter extensions by an underscore, e.g., +"RV32IMACZicsr_Zifencei".
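To illustrate how the naming rules above compose, the following C sketch (illustrative only) splits an ISA string into its base and extension names; it skips version numbers rather than interpreting them, does not special-case numeric names such as Sv39, and performs no validation of canonical order or dependencies:

#include <ctype.h>
#include <stdio.h>

/* Skip an optional version suffix of the form <major>[p<minor>], e.g. "2p1". */
static const char *skip_version(const char *p)
{
    while (isdigit((unsigned char)*p)) p++;
    if (tolower((unsigned char)*p) == 'p' && isdigit((unsigned char)p[1])) {
        p++;
        while (isdigit((unsigned char)*p)) p++;
    }
    return p;
}

/* Print the base and each extension name found in an ISA string such as
 * "RV64IMACZicsr_Zifencei_Xhwacha".  Sketch only. */
static void split_isa_string(const char *isa)
{
    const char *p = isa;

    /* Base: "RV", the width in bits, and the base letter (I or E). */
    if (tolower((unsigned char)p[0]) == 'r' && tolower((unsigned char)p[1]) == 'v') {
        const char *start = p;
        p += 2;
        while (isdigit((unsigned char)*p)) p++;      /* 32, 64, 128 */
        if (*p) p++;                                 /* I or E */
        printf("base: %.*s\n", (int)(p - start), start);
        p = skip_version(p);
    }

    while (*p) {
        if (*p == '_') { p++; continue; }
        const char *start = p;
        char c = (char)tolower((unsigned char)*p);
        if (c == 'z' || c == 's' || c == 'x') {
            /* Multi-letter extension: letters up to '_', a digit, or the end. */
            p++;
            while (*p && *p != '_' && !isdigit((unsigned char)*p)) p++;
        } else {
            p++;                                     /* single-letter extension */
        }
        printf("extension: %.*s\n", (int)(p - start), start);
        p = skip_version(p);
    }
}

Called on "RV64IMAFDC_Zicsr_Zifencei_Xhwacha2p0", the sketch reports the base RV64I followed by M, A, F, D, C, Zicsr, Zifencei, and Xhwacha, with the version suffix 2p0 skipped; "rv32i2p2" and "rv32i2_p2" are distinguished exactly as described above.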
+39.7. Supervisor-level Instruction-Set Extensions
+Standard extensions that extend the supervisor-level virtual-memory +architecture are prefixed with the letters "Sv", followed by an alphabetical +name and an optional version number, or by a numeric name with no version number. +Other standard extensions that extend +the supervisor-level architecture are prefixed with the letters "Ss", +followed by an alphabetical name and an optional version number. Such +extensions are defined in Volume II.
+Standard supervisor-level extensions should be listed after standard +unprivileged extensions. If multiple supervisor-level extensions are +listed, they should be ordered alphabetically.
+39.8. Hypervisor-level Instruction-Set Extensions
+Standard extensions that extend the hypervisor-level architecture are prefixed +with the letters "Sh". +If multiple hypervisor-level extensions are listed, they should be ordered +alphabetically.
+ + | +Many augmentations to the hypervisor-level architecture are more naturally defined as supervisor-level extensions, following the scheme described in the previous section. The "Sh" prefix is used by the few hypervisor-level extensions that have no supervisor-visible effects. | +
39.9. Machine-level Instruction-Set Extensions
+Standard machine-level instruction-set extensions are prefixed with the +letters "Sm".
+Standard machine-level extensions should be listed after standard +lesser-privileged extensions. If multiple machine-level extensions are +listed, they should be ordered alphabetically.
+39.10. Non-Standard Extension Names
+Non-standard extensions are named using a single "X" followed by an +alphabetical name and an optional version number. For example, +"Xhwacha" names the Hwacha vector-fetch ISA extension; "Xhwacha2" +and "Xhwacha2p0" name version 2.0 of same.
+Non-standard extensions must be listed after all standard extensions, and, +like other multi-letter extensions, must be separated from other multi-letter +extensions by an underscore. +For example, an ISA with non-standard extensions Argle and +Bargle may be named "RV64IZifencei_Xargle_Xbargle".
+If multiple non-standard extensions are listed, they should be ordered +alphabetically.
+39.11. Subset Naming Convention
+Table 31 summarizes the standardized extension +names. The table also defines the canonical +order in which extension names must appear in the name string, with +top-to-bottom in table indicating first-to-last in the name string, +e.g., RV32IMACV is legal, whereas RV32IMAVC is not.
+Subset | +Name | +Implies | +
---|---|---|
Base ISA |
++ | + |
Integer |
+I |
++ |
Reduced Integer |
+E |
++ |
Standard Unprivileged Extensions |
+||
Integer Multiplication and Division |
+M |
+Zmmul |
+
Atomics |
+A |
++ |
Single-Precision Floating-Point |
+F |
+Zicsr |
+
Double-Precision Floating-Point |
+D |
+F |
+
General |
+G |
+IMAFDZicsr_Zifencei |
+
Quad-Precision Floating-Point |
+Q |
+D |
+
16-bit Compressed Instructions |
+C |
++ |
B Extension |
+B |
++ |
Packed-SIMD Extensions |
+P |
++ |
Vector Extension |
+V |
+D |
+
Hypervisor Extension |
+H |
++ |
Additional Standard Unprivileged Extensions |
+||
Additional Standard unprivileged extensions "abc" |
+Zabc |
++ |
Standard Supervisor-Level Extensions |
+||
Supervisor-level extension "def" |
+Ssdef |
++ |
Standard Machine-Level Extensions |
+||
Machine-level extension "jkl" |
+Smjkl |
++ |
Non-Standard Extensions |
+||
Non-standard extension "mno" |
+Xmno |
++ |
40. History and Acknowledgments
+40.1. "Why Develop a new ISA?" Rationale from Berkeley Group
+We developed RISC-V to support our own needs in research and education, +where our group is particularly interested in actual hardware +implementations of research ideas (we have completed eleven different +silicon fabrications of RISC-V since the first edition of this +specification), and in providing real implementations for students to +explore in classes (RISC-V processor RTL designs have been used in +multiple undergraduate and graduate classes at Berkeley). In our current +research, we are especially interested in the move towards specialized +and heterogeneous accelerators, driven by the power constraints imposed +by the end of conventional transistor scaling. We wanted a highly +flexible and extensible base ISA around which to build our research +effort.
+A question we have been repeatedly asked is "Why develop a new ISA?" +The biggest obvious benefit of using an existing commercial ISA is the +large and widely supported software ecosystem, both development tools +and ported applications, which can be leveraged in research and +teaching. Other benefits include the existence of large amounts of +documentation and tutorial examples. However, our experience of using +commercial instruction sets for research and teaching is that these +benefits are smaller in practice, and do not outweigh the disadvantages:
+-
+
-
+
Commercial ISAs are proprietary. Except for SPARC V8, which is an +open IEEE standard (IEEE Standard for a 32-Bit Microprocessor, 1994) , most owners of commercial ISAs carefully guard +their intellectual property and do not welcome freely available +competitive implementations. This is much less of an issue for academic +research and teaching using only software simulators, but has been a +major concern for groups wishing to share actual RTL implementations. It +is also a major concern for entities who do not want to trust the few +sources of commercial ISA implementations, but who are prohibited from +creating their own clean room implementations. We cannot guarantee that +all RISC-V implementations will be free of third-party patent +infringements, but we can guarantee we will not attempt to sue a RISC-V +implementor.
+
+ -
+
Commercial ISAs are only popular in certain market domains. The most +obvious examples at time of writing are that the ARM architecture is not +well supported in the server space, and the Intel x86 architecture (or +for that matter, almost every other architecture) is not well supported +in the mobile space, though both Intel and ARM are attempting to enter +each other’s market segments. Another example is ARC and Tensilica, +which provide extensible cores but are focused on the embedded space. +This market segmentation dilutes the benefit of supporting a particular +commercial ISA as in practice the software ecosystem only exists for +certain domains, and has to be built for others.
+
+ -
+
Commercial ISAs come and go. Previous research infrastructures have +been built around commercial ISAs that are no longer popular (SPARC, +MIPS) or even no longer in production (Alpha). These lose the benefit of +an active software ecosystem, and the lingering intellectual property +issues around the ISA and supporting tools interfere with the ability of +interested third parties to continue supporting the ISA. An open ISA +might also lose popularity, but any interested party can continue using +and developing the ecosystem.
+
+ -
+
Popular commercial ISAs are complex. The dominant commercial ISAs +(x86 and ARM) are both very complex to implement in hardware to the +level of supporting common software stacks and operating systems. Worse, +nearly all the complexity is due to bad, or at least outdated, ISA +design decisions rather than features that truly improve efficiency.
+
+ -
+
Commercial ISAs alone are not enough to bring up applications. Even +if we expend the effort to implement a commercial ISA, this is not +enough to run existing applications for that ISA. Most applications need +a complete ABI (application binary interface) to run, not just the +user-level ISA. Most ABIs rely on libraries, which in turn rely on +operating system support. To run an existing operating system requires +implementing the supervisor-level ISA and device interfaces expected by +the OS. These are usually much less well-specified and considerably more +complex to implement than the user-level ISA.
+
+ -
+
Popular commercial ISAs were not designed for extensibility. The +dominant commercial ISAs were not particularly designed for +extensibility, and as a consequence have added considerable instruction +encoding complexity as their instruction sets have grown. Companies such +as Tensilica (acquired by Cadence) and ARC (acquired by Synopsys) have +built ISAs and toolchains around extensibility, but have focused on +embedded applications rather than general-purpose computing systems.
+
+ -
+
A modified commercial ISA is a new ISA. One of our main goals is to +support architecture research, including major ISA extensions. Even +small extensions diminish the benefit of using a standard ISA, as +compilers have to be modified and applications rebuilt from source code +to use the extension. Larger extensions that introduce new architectural +state also require modifications to the operating system. Ultimately, +the modified commercial ISA becomes a new ISA, but carries along all the +legacy baggage of the base ISA.
+
+
Our position is that the ISA is perhaps the most important interface in +a computing system, and there is no reason that such an important +interface should be proprietary. The dominant commercial ISAs are based +on instruction-set concepts that were already well known over 30 years +ago. Software developers should be able to target an open standard +hardware target, and commercial processor designers should compete on +implementation quality.
+We are far from the first to contemplate an open ISA design suitable for +hardware implementation. We also considered other existing open ISA +designs, of which the closest to our goals was the OpenRISC +architecture (OpenCores, 2012). We decided against adopting the OpenRISC ISA for several +technical reasons:
+-
+
-
+
OpenRISC has condition codes and branch delay slots, which complicate +higher performance implementations.
+
+ -
+
OpenRISC uses a fixed 32-bit encoding and 16-bit immediates, which +precludes a denser instruction encoding and limits space for later +expansion of the ISA.
+
+ -
+
OpenRISC does not support the 2008 revision to the IEEE 754 +floating-point standard.
+
+ -
+
The OpenRISC 64-bit design had not been completed when we began.
+
+
By starting from a clean slate, we could design an ISA that met all of +our goals, though of course, this took far more effort than we had +planned at the outset. We have now invested considerable effort in +building up the RISC-V ISA infrastructure, including documentation, +compiler tool chains, operating system ports, reference ISA simulators, +FPGA implementations, efficient ASIC implementations, architecture test +suites, and teaching materials. Since the last edition of this manual, +there has been considerable uptake of the RISC-V ISA in both academia +and industry, and we have created the non-profit RISC-V Foundation to +protect and promote the standard. The RISC-V Foundation website at +riscv.org contains the latest information on the Foundation +membership and various open-source projects using RISC-V.
+40.2. History from Revision 1.0 of ISA manual
+The RISC-V ISA and instruction-set manual builds upon several earlier +projects. Several aspects of the supervisor-level machine and the +overall format of the manual date back to the T0 (Torrent-0) vector +microprocessor project at UC Berkeley and ICSI, begun in 1992. T0 was a +vector processor based on the MIPS-II ISA, with Krste Asanović as main +architect and RTL designer, and Brian Kingsbury and Bertrand Irrisou as +principal VLSI implementors. David Johnson at ICSI was a major +contributor to the T0 ISA design, particularly supervisor mode, and to +the manual text. John Hauser also provided considerable feedback on the +T0 ISA design.
+The Scale (Software-Controlled Architecture for Low Energy) project at +MIT, begun in 2000, built upon the T0 project infrastructure, refined +the supervisor-level interface, and moved away from the MIPS scalar ISA +by dropping the branch delay slot. Ronny Krashinsky and Christopher +Batten were the principal architects of the Scale Vector-Thread +processor at MIT, while Mark Hampton ported the GCC-based compiler +infrastructure and tools for Scale.
+A lightly edited version of the T0 MIPS scalar processor specification +(MIPS-6371) was used in teaching a new version of the MIT 6.371 +Introduction to VLSI Systems class in the Fall 2002 semester, with Chris +Terman and Krste Asanović as lecturers. Chris Terman contributed most of +the lab material for the class (there was no TA!). The 6.371 class +evolved into the trial 6.884 Complex Digital Design class at MIT, taught +by Arvind and Krste Asanović in Spring 2005, which became a regular +Spring class 6.375. A reduced version of the Scale MIPS-based scalar +ISA, named SMIPS, was used in 6.884/6.375. Christopher Batten was the TA +for the early offerings of these classes and developed a considerable +amount of documentation and lab material based around the SMIPS ISA. +This same SMIPS lab material was adapted and enhanced by TA Yunsup Lee +for the UC Berkeley Fall 2009 CS250 VLSI Systems Design class taught by +John Wawrzynek, Krste Asanović, and John Lazzaro.
+The Maven (Malleable Array of Vector-thread ENgines) project was a +second-generation vector-thread architecture. Its design was led by +Christopher Batten when he was an Exchange Scholar at UC Berkeley +starting in summer 2007. Hidetaka Aoki, a visiting industrial fellow +from Hitachi, gave considerable feedback on the early Maven ISA and +microarchitecture design. The Maven infrastructure was based on the +Scale infrastructure but the Maven ISA moved further away from the MIPS +ISA variant defined in Scale, with a unified floating-point and integer +register file. Maven was designed to support experimentation with +alternative data-parallel accelerators. Yunsup Lee was the main +implementor of the various Maven vector units, while Rimas Avižienis was +the main implementor of the various Maven scalar units. Yunsup Lee and +Christopher Batten ported GCC to work with the new Maven ISA. +Christopher Celio provided the initial definition of a traditional +vector instruction set ("Flood") variant of Maven.
+Based on experience with all these previous projects, the RISC-V ISA +definition was begun in Summer 2010, with Andrew Waterman, Yunsup Lee, +Krste Asanović, and David Patterson as principal designers. An initial +version of the RISC-V 32-bit instruction subset was used in the UC +Berkeley Fall 2010 CS250 VLSI Systems Design class, with Yunsup Lee as +TA. RISC-V is a clean break from the earlier MIPS-inspired designs. John +Hauser contributed to the floating-point ISA definition, including the +sign-injection instructions and a register encoding scheme that permits +internal recoding of floating-point values.
+40.3. History from Revision 2.0 of ISA manual
+Multiple implementations of RISC-V processors have been completed, +including several silicon fabrications, as shown in +Fabricated RISC-V testchips table.
+Name | +Tapeout Date | +Process | +ISA | +
---|---|---|---|
Raven-1 |
+May 29, 2011 |
+ST 28nm FDSOI |
+RV64G1_Xhwacha1 |
+
EOS14 |
+April 1, 2012 |
+IBM 45nm SOI |
+RV64G1p1_Xhwacha2 |
+
EOS16 |
+August 17, 2012 |
+IBM 45nm SOI |
+RV64G1p1_Xhwacha2 |
+
Raven-2 |
+August 22, 2012 |
+ST 28nm FDSOI |
+RV64G1p1_Xhwacha2 |
+
EOS18 |
+February 6, 2013 |
+IBM 45nm SOI |
+RV64G1p1_Xhwacha2 |
+
EOS20 |
+July 3, 2013 |
+IBM 45nm SOI |
+RV64G1p99_Xhwacha2 |
+
Raven-3 |
+September 26, 2013 |
+ST 28nm SOI |
+RV64G1p99_Xhwacha2 |
+
EOS22 |
+March 7, 2014 |
+IBM 45nm SOI |
+RV64G1p9999_Xhwacha3 |
+
The first RISC-V processors to be fabricated were written in Verilog and +manufactured in a pre-production FDSOI technology from ST as the Raven-1 +testchip in 2011. Two cores were developed by Yunsup Lee and Andrew +Waterman, advised by Krste Asanović, and fabricated together: 1) an RV64 +scalar core with error-detecting flip-flops, and 2) an RV64 core with an +attached 64-bit floating-point vector unit. The first microarchitecture +was informally known as "TrainWreck", due to the short time available +to complete the design with immature design libraries.
+Subsequently, a clean microarchitecture for an in-order decoupled RV64 +core was developed by Andrew Waterman, Rimas Avižienis, and Yunsup Lee, +advised by Krste Asanović, and, continuing the railway theme, was +codenamed "Rocket" after George Stephenson’s successful steam +locomotive design. Rocket was written in Chisel, a new hardware design +language developed at UC Berkeley. The IEEE floating-point units used in +Rocket were developed by John Hauser, Andrew Waterman, and Brian +Richards. Rocket has since been refined and developed further, and has +been fabricated two more times in FDSOI (Raven-2, Raven-3), and five +times in IBM SOI technology (EOS14, EOS16, EOS18, EOS20, EOS22) for a +photonics project. Work is ongoing to make the Rocket design available +as a parameterized RISC-V processor generator.
+EOS14-EOS22 chips include early versions of Hwacha, a 64-bit IEEE +floating-point vector unit, developed by Yunsup Lee, Andrew Waterman, +Huy Vo, Albert Ou, Quan Nguyen, and Stephen Twigg, advised by Krste +Asanović. EOS16-EOS22 chips include dual cores with a cache-coherence +protocol developed by Henry Cook and Andrew Waterman, advised by Krste +Asanović. EOS14 silicon has successfully run at 1.25 GHz. EOS16 silicon suffered +from a bug in the IBM pad libraries. EOS18 and EOS20 have successfully +run at 1.35 GHz.
+Contributors to the Raven testchips include Yunsup Lee, Andrew Waterman, +Rimas Avižienis, Brian Zimmer, Jaehwa Kwak, Ruzica Jevtić, Milovan +Blagojević, Alberto Puggelli, Steven Bailey, Ben Keller, Pi-Feng Chiu, +Brian Richards, Borivoje Nikolić, and Krste Asanović.
+Contributors to the EOS testchips include Yunsup Lee, Rimas Avižienis, +Andrew Waterman, Henry Cook, Huy Vo, Daiwei Li, Chen Sun, Albert Ou, +Quan Nguyen, Stephen Twigg, Vladimir Stojanović, and Krste Asanović.
+Andrew Waterman and Yunsup Lee developed the C++ ISA simulator +"Spike", used as a golden model in development and named after the +golden spike used to celebrate completion of the US transcontinental +railway. Spike has been made available as a BSD open-source project.
+Andrew Waterman completed a Master’s thesis with a preliminary design of +the RISC-V compressed instruction set (Waterman, 2011).
+Various FPGA implementations of the RISC-V have been completed, +primarily as part of integrated demos for the Par Lab project research +retreats. The largest FPGA design has 3 cache-coherent RV64IMA +processors running a research operating system. Contributors to the FPGA +implementations include Andrew Waterman, Yunsup Lee, Rimas Avižienis, +and Krste Asanović.
+RISC-V processors have been used in several classes at UC Berkeley. +Rocket was used in the Fall 2011 offering of CS250 as a basis for class +projects, with Brian Zimmer as TA. For the undergraduate CS152 class in +Spring 2012, Christopher Celio used Chisel to write a suite of +educational RV32 processors, named "Sodor" after the island on which +"Thomas the Tank Engine" and friends live. The suite includes a +microcoded core, an unpipelined core, and 2, 3, and 5-stage pipelined +cores, and is publicly available under a BSD license. The suite was +subsequently updated and used again in CS152 in Spring 2013, with Yunsup +Lee as TA, and in Spring 2014, with Eric Love as TA. Christopher Celio +also developed an out-of-order RV64 design known as BOOM (Berkeley +Out-of-Order Machine), with accompanying pipeline visualizations, that +was used in the CS152 classes. The CS152 classes also used +cache-coherent versions of the Rocket core developed by Andrew Waterman +and Henry Cook.
+Over the summer of 2013, the RoCC (Rocket Custom Coprocessor) interface +was defined to simplify adding custom accelerators to the Rocket core. +Rocket and the RoCC interface were used extensively in the Fall 2013 +CS250 VLSI class taught by Jonathan Bachrach, with several student +accelerator projects built to the RoCC interface. The Hwacha vector unit +has been rewritten as a RoCC coprocessor.
+Two Berkeley undergraduates, Quan Nguyen and Albert Ou, have +successfully ported Linux to run on RISC-V in Spring 2013.
+Colin Schmidt successfully completed an LLVM backend for RISC-V 2.0 in +January 2014.
+Darius Rad at Bluespec contributed soft-float ABI support to the GCC +port in March 2014.
+John Hauser contributed the definition of the floating-point +classification instructions.
+We are aware of several other RISC-V core implementations, including one +in Verilog by Tommy Thorn, and one in Bluespec by Rishiyur Nikhil.
+40.4. Acknowledgments
+Thanks to Christopher F. Batten, Preston Briggs, Christopher Celio, +David Chisnall, Stefan Freudenberger, John Hauser, Ben Keller, Rishiyur +Nikhil, Michael Taylor, Tommy Thorn, and Robert Watson for comments on +the draft ISA version 2.0 specification.
+40.5. History from Revision 2.1
+Uptake of the RISC-V ISA has been very rapid since the introduction of
+the frozen version 2.0 in May 2014, with too much activity to record in
+a short history section such as this. Perhaps the most important single
+event was the formation of the non-profit RISC-V Foundation in August
+2015. The Foundation will now take over stewardship of the official
+RISC-V ISA standard, and the official website riscv.org
is the best
+place to obtain news and updates on the RISC-V standard.
40.6. Acknowledgments
+Thanks to Scott Beamer, Allen J. Baum, Christopher Celio, David +Chisnall, Paul Clayton, Palmer Dabbelt, Jan Gray, Michael Hamburg, and +John Hauser for comments on the version 2.0 specification.
+40.7. History from Revision 2.2
+ +40.8. Acknowledgments
+Thanks to Jacob Bachmeyer, Alex Bradbury, David Horner, Stefan O’Rear, +and Joseph Myers for comments on the version 2.1 specification.
+40.9. History for Revision 2.3
+Uptake of RISC-V continues at a breakneck pace.
+John Hauser and Andrew Waterman contributed a hypervisor ISA extension +based upon a proposal from Paolo Bonzini.
+Daniel Lustig, Arvind, Krste Asanović, Shaked Flur, Paul Loewenstein, +Yatin Manerkar, Luc Maranget, Margaret Martonosi, Vijayanand Nagarajan, +Rishiyur Nikhil, Jonas Oberhauser, Christopher Pulte, Jose Renau, Peter +Sewell, Susmit Sarkar, Caroline Trippel, Muralidaran Vijayaraghavan, +Andrew Waterman, Derek Williams, Andrew Wright, and Sizhuo Zhang +contributed the memory consistency model.
+40.10. Funding
+Development of the RISC-V architecture and implementations has been +partially funded by the following sponsors.
+-
+
-
+
Par Lab: Research supported by Microsoft (Award # 024263) and Intel +(Award # 024894) funding and by matching funding by U.C. Discovery (Award +# DIG07-10227). Additional support came from Par Lab affiliates Nokia, +NVIDIA, Oracle, and Samsung.
+
+ -
+
Project Isis: DoE Award DE-SC0003624.
+
+ -
+
ASPIRE Lab: DARPA PERFECT program, Award HR0011-12-2-0016. DARPA +POEM program Award HR0011-11-C-0100. The Center for Future Architectures +Research (C-FAR), a STARnet center funded by the Semiconductor Research +Corporation. Additional support from ASPIRE industrial sponsor, Intel, +and ASPIRE affiliates, Google, Hewlett Packard Enterprise, Huawei, +Nokia, NVIDIA, Oracle, and Samsung.
+
+
The content of this paper does not necessarily reflect the position or +the policy of the US government and no official endorsement should be +inferred.
+Appendix A: RVWMO Explanatory Material, Version 0.1
+This section provides more explanation for RVWMO +Chapter 18, using more informal +language and concrete examples. These are intended to clarify the +meaning and intent of the axioms and preserved program order rules. This +appendix should be treated as commentary; all normative material is +provided in Chapter 18 and in the rest of +the main body of the ISA specification. All currently known +discrepancies are listed in Section A.7. Any +other discrepancies are unintentional.
+A.1. Why RVWMO?
+Memory consistency models fall along a loose spectrum from weak to +strong. Weak memory models allow more hardware implementation +flexibility and deliver arguably better performance, performance per +watt, power, scalability, and hardware verification overheads than +strong models, at the expense of a more complex programming model. +Strong models provide simpler programming models, but at the cost of +imposing more restrictions on the kinds of (non-speculative) hardware +optimizations that can be performed in the pipeline and in the memory +system, and in turn imposing some cost in terms of power, area overhead, +and verification burden.
+RISC-V has chosen the RVWMO memory model, a variant of release +consistency. This places it in between the two extremes of the memory +model spectrum. The RVWMO memory model enables architects to build +simple implementations, aggressive implementations, implementations +embedded deeply inside a much larger system and subject to complex +memory system interactions, or any number of other possibilities, all +while simultaneously being strong enough to support programming language +memory models at high performance.
+To facilitate the porting of code from other architectures, some +hardware implementations may choose to implement the Ztso extension, +which provides stricter RVTSO ordering semantics by default. Code +written for RVWMO is automatically and inherently compatible with RVTSO, +but code written assuming RVTSO is not guaranteed to run correctly on +RVWMO implementations. In fact, most RVWMO implementations will (and +should) simply refuse to run RVTSO-only binaries. Each implementation +must therefore choose whether to prioritize compatibility with RVTSO +code (e.g., to facilitate porting from x86) or whether to instead +prioritize compatibility with other RISC-V cores implementing RVWMO.
+Some fences and/or memory ordering annotations in code written for RVWMO +may become redundant under RVTSO; the cost that the default of RVWMO +imposes on Ztso implementations is the incremental overhead of fetching +those fences (e.g., FENCE R,RW and FENCE RW,W) which become no-ops on +that implementation. However, these fences must remain present in the +code if compatibility with non-Ztso implementations is desired.
+A.2. Litmus Tests
The explanations in this chapter make use of litmus tests, or small programs designed to test or highlight one particular aspect of a memory model. Litmus sample shows an example of a litmus test with two harts. As a convention for this figure and for all figures that follow in this chapter, we assume that s0-s2 are pre-set to the same value in all harts and that s0 holds the address labeled x, s1 holds y, and s2 holds z, where x, y, and z are disjoint memory locations aligned to 8-byte boundaries. All other registers and all referenced memory locations are presumed to be initialized to zero. Each figure shows the litmus test code on the left, and a visualization of one particular valid or invalid execution on the right.
Litmus tests are used to understand the implications of the memory model in specific concrete situations. For example, in the litmus test of Litmus sample, the final value of a0 in the first hart can be either 2, 4, or 5, depending on the dynamic interleaving of the instruction stream from each hart at runtime. However, in this example, the final value of a0 in Hart 0 will never be 1 or 3; intuitively, the value 1 will no longer be visible at the time the load executes, and the value 3 will not yet be visible by the time the load executes. We analyze this test and many others below.
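The "Litmus sample" figure itself is not reproduced in this rendering. Purely as an illustrative sketch of the conventions just described (and not as the original figure; the labels, registers, and immediate values here are assumptions chosen to match the 2/4/5-but-never-1-or-3 discussion), such a two-hart test might look like:

    Hart 0                          Hart 1
        li t1, 1                        li t4, 4
        li t2, 2                        li t5, 5
    (a) sw t1, 0(s0)                (e) sw t4, 0(s0)
    (b) sw t2, 0(s0)                (f) sw t5, 0(s0)
    (c) lw a0, 0(s0)
        li t3, 3
    (d) sw t3, 0(s0)

In this sketch, (c) may legally return 2, 4, or 5, but never 1 (already overwritten by (b) earlier in program order) and never 3 (not yet written at the time (c) performs).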
| Edge | Full Name (and explanation) |
|---|---|
| rf | Reads From (from each store to the loads that return a value written by that store) |
| co | Coherence (a total order on the stores to each address) |
| fr | From-Reads (from each load to co-successors of the store from which the load returned a value) |
| ppo | Preserved Program Order |
| fence | Orderings enforced by a FENCE instruction |
| addr | Address Dependency |
| ctrl | Control Dependency |
| data | Data Dependency |

Table 33: Edge notation used in the execution diagrams
The diagram shown to the right of each litmus test shows a visual +representation of the particular execution candidate being considered. +These diagrams use a notation that is common in the memory model +literature for constraining the set of possible global memory orders +that could produce the execution in question. It is also the basis for +the herd models presented in +Section B.2. This notation is explained in +Table 33. Of the listed relations, rf edges between +harts, co edges, fr edges, and ppo edges directly constrain the global +memory order (as do fence, addr, data, and some ctrl edges, via ppo). +Other edges (such as intra-hart rf edges) are informative but do not +constrain the global memory order.
For example, in Litmus sample, a0=1 could occur only if one of the following were true:

- (b) appears before (a) in global memory order (and in the coherence order co). However, this violates the RVWMO PPO rule that orders a store after an earlier store to an overlapping address in the same hart (rule 1). The co edge from (b) to (a) highlights this contradiction.
- (a) appears before (b) in global memory order (and in the coherence order co). However, in this case, the Load Value Axiom would be violated, because (a) is not the latest matching store prior to (c) in program order. The fr edge from (c) to (b) highlights this contradiction.

Since neither of these scenarios satisfies the RVWMO axioms, the outcome a0=1 is forbidden.
Beyond what is described in this appendix, a suite of more than seven +thousand litmus tests is available at +github.com/litmus-tests/litmus-tests-riscv.
++ + | +
+
+
+The litmus tests repository also provides instructions on how to run the +litmus tests on RISC-V hardware and how to compare the results with the +operational and axiomatic models. +
+
+In the future, we expect to adapt these memory model litmus tests for +use as part of the RISC-V compliance test suite as well. + |
+
A.3. Explaining the RVWMO Rules
+In this section, we provide explanation and examples for all of the +RVWMO rules and axioms.
+A.3.1. Preserved Program Order and Global Memory Order
+Preserved program order represents the subset of program order that must +be respected within the global memory order. Conceptually, events from +the same hart that are ordered by preserved program order must appear in +that order from the perspective of other harts and/or observers. Events +from the same hart that are not ordered by preserved program order, on +the other hand, may appear reordered from the perspective of other harts +and/or observers.
+Informally, the global memory order represents the order in which loads +and stores perform. The formal memory model literature has moved away +from specifications built around the concept of performing, but the idea +is still useful for building up informal intuition. A load is said to +have performed when its return value is determined. A store is said to +have performed not when it has executed inside the pipeline, but rather +only when its value has been propagated to globally visible memory. In +this sense, the global memory order also represents the contribution of +the coherence protocol and/or the rest of the memory system to +interleave the (possibly reordered) memory accesses being issued by each +hart into a single total order agreed upon by all harts.
+The order in which loads perform does not always directly correspond to +the relative age of the values those two loads return. In particular, a +load b may perform before another load a to +the same address (i.e., b may execute before +a, and b may appear before a +in the global memory order), but a may nevertheless return +an older value than b. This discrepancy captures (among +other things) the reordering effects of buffering placed between the +core and memory. For example, b may have returned a value +from a store in the store buffer, while a may have ignored +that younger store and read an older value from memory instead. To +account for this, at the time each load performs, the value it returns +is determined by the load value axiom, not just strictly by determining +the most recent store to the same address in the global memory order, as +described below.
+A.3.2. Load value axiom
++ + | +
+
+
+Section 18.1.4.1: Each byte of each load i returns the value written +to that byte by the store that is the latest in global memory order among +the following stores: +
+
+
|
+
Preserved program order is not required to respect the ordering of a +store followed by a load to an overlapping address. This complexity +arises due to the ubiquity of store buffers in nearly all +implementations. Informally, the load may perform (return a value) by +forwarding from the store while the store is still in the store buffer, +and hence before the store itself performs (writes back to globally +visible memory). Any other hart will therefore observe the load as +performing before the store.
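Table 34, referenced below, is not reproduced in this rendering. The following sketch is an assumption about its shape, reconstructed from the step-by-step description that follows rather than copied from the original:

    Hart 0                          Hart 1
        li t1, 1                        li t1, 1
    (a) sw t1, 0(s0)                (e) sw t1, 0(s1)
    (b) lw a0, 0(s0)                (f) lw a2, 0(s1)
    (c) fence r, rw                 (g) fence r, rw
    (d) lw a1, 0(s1)                (h) lw a3, 0(s0)

Proposed outcome: a0=1, a1=0, a2=1, a3=0.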
Consider Table 34. When running this program on an implementation with store buffers, it is possible to arrive at the final outcome a0=1, a1=0, a2=1, a3=0 as follows:
- (a) executes and enters the first hart's private store buffer
- (b) executes and forwards its return value 1 from (a) in the store buffer
- (c) executes since all previous loads (i.e., (b)) have completed
- (d) executes and reads the value 0 from memory
- (e) executes and enters the second hart's private store buffer
- (f) executes and forwards its return value 1 from (e) in the store buffer
- (g) executes since all previous loads (i.e., (f)) have completed
- (h) executes and reads the value 0 from memory
- (a) drains from the first hart's store buffer to memory
- (e) drains from the second hart's store buffer to memory
Therefore, the memory model must be able to account for this behavior.
+To put it another way, suppose the definition of preserved program order +did include the following hypothetical rule: memory access +a precedes memory access b in preserved +program order (and hence also in the global memory order) if +a precedes b in program order and +a and b are accesses to the same memory +location, a is a write, and b is a read. +Call this "Rule X". Then we get the following:
- (a) precedes (b): by rule X
- (b) precedes (d): by rule 4
- (d) precedes (e): by the load value axiom. Otherwise, if (e) preceded (d), then (d) would be required to return the value 1. (This is a perfectly legal execution; it's just not the one in question.)
- (e) precedes (f): by rule X
- (f) precedes (h): by rule 4
- (h) precedes (a): by the load value axiom, as above.
The global memory order must be a total order and cannot be cyclic, +because a cycle would imply that every event in the cycle happens before +itself, which is impossible. Therefore, the execution proposed above +would be forbidden, and hence the addition of rule X would forbid +implementations with store buffer forwarding, which would clearly be +undesirable.
+Nevertheless, even if (b) precedes (a) and/or (f) precedes (e) in the +global memory order, the only sensible possibility in this example is +for (b) to return the value written by (a), and likewise for (f) and +(e). This combination of circumstances is what leads to the second +option in the definition of the load value axiom. Even though (b) +precedes (a) in the global memory order, (a) will still be visible to +(b) by virtue of sitting in the store buffer at the time (b) executes. +Therefore, even if (b) precedes (a) in the global memory order, (b) +should return the value written by (a) because (a) precedes (b) in +program order. Likewise for (e) and (f).
Another test that highlights the behavior of store buffers is shown in +Table 35. In this example, (d) is +ordered before (e) because of the control dependency, and (f) is ordered +before (g) because of the address dependency. However, (e) is not +necessarily ordered before (f), even though (f) returns the value +written by (e). This could correspond to the following sequence of +events:
- (e) executes speculatively and enters the second hart's private store buffer (but does not drain to memory)
- (f) executes speculatively and forwards its return value 1 from (e) in the store buffer
- (g) executes speculatively and reads the value 0 from memory
- (a) executes, enters the first hart's private store buffer, and drains to memory
- (b) executes and retires
- (c) executes, enters the first hart's private store buffer, and drains to memory
- (d) executes and reads the value 1 from memory
- (e), (f), and (g) commit, since the speculation turned out to be correct
- (e) drains from the store buffer to memory
A.3.3. Atomicity axiom
++ + | +
+
+
+Atomicity Axiom (for Aligned Atomics): If r and w are paired load and +store operations generated by aligned LR and SC instructions in a hart +h, s is a store to byte x, and r returns a value written by s, then s must +precede w in the global memory order, and there can be no store from +a hart other than h to byte x following s and preceding w in the global +memory order. + |
+
The RISC-V architecture decouples the notion of atomicity from the +notion of ordering. Unlike architectures such as TSO, RISC-V atomics +under RVWMO do not impose any ordering requirements by default. Ordering +semantics are only guaranteed by the PPO rules that otherwise apply.
+RISC-V contains two types of atomics: AMOs and LR/SC pairs. These +conceptually behave differently, in the following way. LR/SC behave as +if the old value is brought up to the core, modified, and written back +to memory, all while a reservation is held on that memory location. AMOs +on the other hand conceptually behave as if they are performed directly +in memory. AMOs are therefore inherently atomic, while LR/SC pairs are +atomic in the slightly different sense that the memory location in +question will not be modified by another hart during the time the +original hart holds the reservation.
| Instance 1 | Instance 2 | Instance 3 | Instance 4 |
|---|---|---|---|
| (a) lr.d a0, 0(s0) | (a) lr.d a0, 0(s0) | (a) lr.w a0, 0(s0) | (a) lr.w a0, 0(s0) |
| (b) sd t1, 0(s0) | (b) sw t1, 4(s0) | (b) sw t1, 4(s0) | (b) sw t1, 4(s0) |
| (c) sc.d t3, t2, 0(s0) | (c) sc.d t3, t2, 0(s0) | (c) sc.w t3, t2, 0(s0) | (c) addi s0, s0, 8 |
|  |  |  | (d) sc.w t3, t2, 8(s0) |
Figure 4: In all four (independent) instances, the final store-conditional instruction is permitted but not guaranteed to succeed.
The atomicity axiom forbids stores from other harts from being interleaved in global memory order between an LR and the SC paired with that LR. The atomicity axiom does not forbid loads from being interleaved between the paired operations in program order or in the global memory order, nor does it forbid stores from the same hart or stores to non-overlapping locations from appearing between the paired operations in either program order or in the global memory order. For example, the SC instructions in Figure 4 may (but are not guaranteed to) succeed. None of those successes would violate the atomicity axiom, because the intervening non-conditional stores are from the same hart as the paired load-reserved and store-conditional instructions. This way, a memory system that tracks memory accesses at cache line granularity (and which therefore will see the four snippets of Figure 4 as identical) will not be forced to fail a store-conditional instruction that happens to (falsely) share another portion of the same cache line as the memory location being held by the reservation.
+The atomicity axiom also technically supports cases in which the LR and +SC touch different addresses and/or use different access sizes; however, +use cases for such behaviors are expected to be rare in practice. +Likewise, scenarios in which stores from the same hart between an LR/SC +pair actually overlap the memory location(s) referenced by the LR or SC +are expected to be rare compared to scenarios where the intervening +store may simply fall onto the same cache line.
+A.3.4. Progress axiom
++ + | +
+
+
+Progress Axiom: No memory operation may be preceded in the global +memory order by an infinite sequence of other memory operations. + |
+
The progress axiom ensures a minimal forward progress guarantee. It +ensures that stores from one hart will eventually be made visible to +other harts in the system in a finite amount of time, and that loads +from other harts will eventually be able to read those values (or +successors thereof). Without this rule, it would be legal, for example, +for a spinlock to spin infinitely on a value, even with a store from +another hart waiting to unlock the spinlock.
+The progress axiom is intended not to impose any other notion of +fairness, latency, or quality of service onto the harts in a RISC-V +implementation. Any stronger notions of fairness are up to the rest of +the ISA and/or up to the platform and/or device to define and implement.
+The forward progress axiom will in almost all cases be naturally +satisfied by any standard cache coherence protocol. Implementations with +non-coherent caches may have to provide some other mechanism to ensure +the eventual visibility of all stores (or successors thereof) to all +harts.
+A.3.5. Overlapping-Address Orderings (Rules 1-3)
++ + | +
+
+
+Rule 1: b is a store, and a and b access overlapping memory addresses +
+
+Rule 2: a and b are loads, x is a byte read by both a and b, there is no +store to x between a and b in program order, and a and b return values +for x written by different memory operations +
+
+Rule 3: a is generated by an AMO or SC instruction, b is a load, and b +returns a value written by a + |
+
Same-address orderings where the latter is a store are straightforward: +a load or store can never be reordered with a later store to an +overlapping memory location. From a microarchitecture perspective, +generally speaking, it is difficult or impossible to undo a +speculatively reordered store if the speculation turns out to be +invalid, so such behavior is simply disallowed by the model. +Same-address orderings from a store to a later load, on the other hand, +do not need to be enforced. As discussed in +Load value axiom, this reflects the observable +behavior of implementations that forward values from buffered stores to +later loads.
+Same-address load-load ordering requirements are far more subtle. The +basic requirement is that a younger load must not return a value that is +older than a value returned by an older load in the same hart to the +same address. This is often known as "CoRR" (Coherence for Read-Read +pairs), or as part of a broader "coherence" or "sequential +consistency per location" requirement. Some architectures in the past +have relaxed same-address load-load ordering, but in hindsight this is +generally considered to complicate the programming model too much, and +so RVWMO requires CoRR ordering to be enforced. However, because the +global memory order corresponds to the order in which loads perform +rather than the ordering of the values being returned, capturing CoRR +requirements in terms of the global memory order requires a bit of +indirection.
Consider the litmus test of Table 36, which is one particular instance of the more general "fri-rfi" pattern. The term "fri-rfi" refers to the sequence (d), (e), (f): (d) "from-reads" (i.e., reads from an earlier write than) (e), which is in the same hart, and (f) reads from (e), which is in the same hart.
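Table 36 is likewise not reproduced here. A hedged sketch of a test with the shape implied by the labels (a)-(i) used below (the registers and values are assumptions, not the original table) is:

    Hart 0                          Hart 1
        li t1, 1                        li t2, 2
    (a) sw t1, 0(s0)                (d) lw a0, 0(s1)
    (b) fence w, w                  (e) sw t2, 0(s1)
    (c) sw t1, 0(s1)                (f) lw a1, 0(s1)
                                    (g) xor t3, a1, a1
                                    (h) add s0, s0, t3
                                    (i) lw a2, 0(s0)

Outcome under discussion: a0=1, a1=2, a2=0.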
From a microarchitectural perspective, the outcome a0=1, a1=2, a2=0 is legal (as are various other less subtle outcomes). Intuitively, the following would produce the outcome in question:
- (d) stalls (for whatever reason; perhaps it's stalled waiting for some other preceding instruction)
- (e) executes and enters the store buffer (but does not yet drain to memory)
- (f) executes and forwards from (e) in the store buffer
- (g), (h), and (i) execute
- (a) executes and drains to memory, (b) executes, and (c) executes and drains to memory
- (d) unstalls and executes
- (e) drains from the store buffer to memory
This corresponds to a global memory order of (f), (i), (a), (c), (d), +(e). Note that even though (f) performs before (d), the value returned +by (f) is newer than the value returned by (d). Therefore, this +execution is legal and does not violate the CoRR requirements.
+Likewise, if two back-to-back loads return the values written by the +same store, then they may also appear out-of-order in the global memory +order without violating CoRR. Note that this is not the same as saying +that the two loads return the same value, since two different stores may +write the same value.
Consider the litmus test of Table 37. The outcome a0=1, a1=v, a2=v, a3=0 (where v is some value written by another hart) can be observed by allowing (g) and (h) to be reordered. This might be done speculatively, and the speculation can be justified by the microarchitecture (e.g., by snooping for cache invalidations and finding none) because replaying (h) after (g) would return the value written by the same store anyway. Hence, assuming a1 and a2 would end up with the same value written by the same store anyway, (g) and (h) can be legally reordered. The global memory order corresponding to this execution would be (h),(k),(a),(c),(d),(g).
Executions of the test in Table 37 in which a1 does not equal a2 do in fact require that (g) appears before (h) in the global memory order. Allowing (h) to appear before (g) in the global memory order would in that case result in a violation of CoRR, because then (h) would return an older value than that returned by (g). Therefore, rule 2 forbids this CoRR violation from occurring. As such, rule 2 strikes a careful balance between enforcing CoRR in all cases while simultaneously being weak enough to permit "RSW" and "fri-rfi" patterns that commonly appear in real microarchitectures.
There is one more overlapping-address rule: rule 3 simply states that a value cannot +be returned from an AMO or SC to a subsequent load until the AMO or SC +has (in the case of the SC, successfully) performed globally. This +follows somewhat naturally from the conceptual view that both AMOs and +SC instructions are meant to be performed atomically in memory. However, +notably, rule 3 states that hardware +may not even non-speculatively forward the value being stored by an +AMOSWAP to a subsequent load, even though for AMOSWAP that store value +is not actually semantically dependent on the previous value in memory, +as is the case for the other AMOs. The same holds true even when +forwarding from SC store values that are not semantically dependent on +the value returned by the paired LR.
+The three PPO rules above also apply when the memory accesses in +question only overlap partially. This can occur, for example, when +accesses of different sizes are used to access the same object. Note +also that the base addresses of two overlapping memory operations need +not necessarily be the same for two memory accesses to overlap. When +misaligned memory accesses are being used, the overlapping-address PPO +rules apply to each of the component memory accesses independently.
+A.3.6. Fences (Rule 4)
++ + | +
+
+
+Rule 4: There is a FENCE instruction that orders a before b + |
+
By default, the FENCE instruction ensures that all memory accesses from instructions preceding the fence in program order (the "predecessor set") appear earlier in the global memory order than memory accesses from instructions appearing after the fence in program order (the "successor set"). However, fences can optionally further restrict the predecessor set and/or the successor set to a smaller set of memory accesses in order to provide some speedup. Specifically, fences have PR, PW, SR, and SW bits which restrict the predecessor and/or successor sets. The predecessor set includes loads (resp. stores) if and only if PR (resp. PW) is set. Similarly, the successor set includes loads (resp. stores) if and only if SR (resp. SW) is set.
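As a non-normative illustration of restricting these sets, a simple message-passing idiom only needs to order stores against stores on the producer side and loads against loads on the consumer side, so FENCE W,W and FENCE R,R suffice (the addresses in s0/s1 and the value 42 are arbitrary choices for this sketch):

    # Producer hart: s0 holds the data address, s1 holds the flag address
        li    t0, 42
        sw    t0, 0(s0)      # write the payload
        fence w, w           # predecessors: prior stores; successors: later stores
        li    t1, 1
        sw    t1, 0(s1)      # publish the flag

    # Consumer hart
    spin:
        lw    t2, 0(s1)      # read the flag
        beqz  t2, spin
        fence r, r           # predecessors: prior loads; successors: later loads
        lw    t3, 0(s0)      # observes the payload once the flag has been seen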
+The FENCE encoding currently has nine non-trivial combinations of the +four bits PR, PW, SR, and SW, plus one extra encoding FENCE.TSO which +facilitates mapping of "acquire+release" or RVTSO semantics. The +remaining seven combinations have empty predecessor and/or successor +sets and hence are no-ops. Of the ten non-trivial options, only six are +commonly used in practice:
- FENCE RW,RW
- FENCE.TSO
- FENCE RW,W
- FENCE R,RW
- FENCE R,R
- FENCE W,W
FENCE instructions using any other combination of PR, PW, SR, and SW are +reserved. We strongly recommend that programmers stick to these six. +Other combinations may have unknown or unexpected interactions with the +memory model.
Finally, we note that since RISC-V uses a multi-copy atomic memory model, programmers can reason about fence bits in a thread-local manner. There is no complex notion of "fence cumulativity" as found in memory models that are not multi-copy atomic.
+A.3.7. Explicit Synchronization (Rules 5-8)
++ + | +
+
+
+Rule 5: a has an acquire annotation +
+
+Rule 6: b has a release annotation +
+
+Rule 7: a and b both have RCsc annotations +
+
+Rule 8: a is paired with b + |
+
An acquire operation, as would be used at the start of a critical +section, requires all memory operations following the acquire in program +order to also follow the acquire in the global memory order. This +ensures, for example, that all loads and stores inside the critical +section are up to date with respect to the synchronization variable +being used to protect it. Acquire ordering can be enforced in one of two +ways: with an acquire annotation, which enforces ordering with respect +to just the synchronization variable itself, or with a FENCE R,RW, which +enforces ordering with respect to all previous loads.
    sd x1, (a1)                # Arbitrary unrelated store
    ld x2, (a2)                # Arbitrary unrelated load
    li t0, 1                   # Initialize swap value.
again:
    amoswap.w.aq t0, t0, (a0)  # Attempt to acquire lock.
    bnez t0, again             # Retry if held.
    # ...
    # Critical section.
    # ...
    amoswap.w.rl x0, x0, (a0)  # Release lock by storing 0.
    sd x3, (a3)                # Arbitrary unrelated store
    ld x4, (a4)                # Arbitrary unrelated load
Consider Example 1. Because this example uses aq, the loads and stores in the critical section are guaranteed to appear in the global memory order after the AMOSWAP used to acquire the lock. However, assuming a0, a1, and a2 point to different memory locations, the loads and stores in the critical section may or may not appear after the "Arbitrary unrelated load" at the beginning of the example in the global memory order.
    sd x1, (a1)                # Arbitrary unrelated store
    ld x2, (a2)                # Arbitrary unrelated load
    li t0, 1                   # Initialize swap value.
again:
    amoswap.w t0, t0, (a0)     # Attempt to acquire lock.
    fence r, rw                # Enforce "acquire" memory ordering
    bnez t0, again             # Retry if held.
    # ...
    # Critical section.
    # ...
    fence rw, w                # Enforce "release" memory ordering
    amoswap.w x0, x0, (a0)     # Release lock by storing 0.
    sd x3, (a3)                # Arbitrary unrelated store
    ld x4, (a4)                # Arbitrary unrelated load
+Now, consider the alternative in Example 2. In +this case, even though the AMOSWAP does not enforce ordering with an +aq bit, the fence nevertheless enforces that the acquire AMOSWAP +appears earlier in the global memory order than all loads and stores in +the critical section. Note, however, that in this case, the fence also +enforces additional orderings: it also requires that the "Arbitrary +unrelated load" at the start of the program appears earlier in the +global memory order than the loads and stores of the critical section. +(This particular fence does not, however, enforce any ordering with +respect to the "Arbitrary unrelated store" at the start of the +snippet.) In this way, fence-enforced orderings are slightly coarser +than orderings enforced by .aq.
+Release orderings work exactly the same as acquire orderings, just in +the opposite direction. Release semantics require all loads and stores +preceding the release operation in program order to also precede the +release operation in the global memory order. This ensures, for example, +that memory accesses in a critical section appear before the +lock-releasing store in the global memory order. Just as for acquire +semantics, release semantics can be enforced using release annotations +or with a FENCE RW,W operation. Using the same examples, the ordering +between the loads and stores in the critical section and the "Arbitrary +unrelated store" at the end of the code snippet is enforced only by the +FENCE RW,W in Example 2, not by +the rl in Example 1.
With RCpc annotations alone, store-release-to-load-acquire ordering is not enforced. This facilitates the porting of code written under the TSO and/or RCpc memory models. To enforce store-release-to-load-acquire ordering, the code must use store-release-RCsc and load-acquire-RCsc operations so that PPO rule 7 applies. RCpc alone is sufficient for many use cases in C/C++ but is insufficient for many other use cases in C/C++, Java, and Linux, to name just a few examples; see the code porting and mapping guidelines (Section A.5) for details.
+PPO rule 8 indicates that an SC must appear after +its paired LR in the global memory order. This will follow naturally +from the common use of LR/SC to perform an atomic read-modify-write +operation due to the inherent data dependency. However, PPO +rule 8 also applies even when the value being stored +does not syntactically depend on the value returned by the paired LR.
+Lastly, we note that just as with fences, programmers need not worry +about "cumulativity" when analyzing ordering annotations.
+A.3.8. Syntactic Dependencies (Rules 9-11)
++ + | +
+
+
+Rule 9: b has a syntactic address dependency on a +
+
+Rule 10: b has a syntactic data dependency on a +
+
+Rule 11: b is a store, and b has a syntactic control dependency on a + |
+
Dependencies from a load to a later memory operation in the same hart +are respected by the RVWMO memory model. The Alpha memory model was +notable for choosing not to enforce the ordering of such dependencies, +but most modern hardware and software memory models consider allowing +dependent instructions to be reordered too confusing and +counterintuitive. Furthermore, modern code sometimes intentionally uses +such dependencies as a particularly lightweight ordering enforcement +mechanism.
+The terms in Section 18.1.2 work as follows. Instructions
+are said to carry dependencies from their
+source register(s) to their destination register(s) whenever the value
+written into each destination register is a function of the source
+register(s). For most instructions, this means that the destination
+register(s) carry a dependency from all source register(s). However,
+there are a few notable exceptions. In the case of memory instructions,
+the value written into the destination register ultimately comes from
+the memory system rather than from the source register(s) directly, and
+so this breaks the chain of dependencies carried from the source
+register(s). In the case of unconditional jumps, the value written into
+the destination register comes from the current pc
(which is never
+considered a source register by the memory model), and so likewise, JALR
+(the only jump with a source register) does not carry a dependency from
+rs1 to rd.
(a) fadd f3,f1,f2
(b) fadd f6,f4,f5
(c) csrrs a0,fflags,x0
The notion of accumulating into a destination register rather than writing into it reflects the behavior of CSRs such as fflags. In particular, an accumulation into a register does not clobber any previous writes or accumulations into the same register. For example, in Listing 4, (c) has a syntactic dependency on both (a) and (b).
Like other modern memory models, the RVWMO memory model uses syntactic rather than semantic dependencies. In other words, this definition depends on the identities of the registers being accessed by different instructions, not the actual contents of those registers. This means that an address, control, or data dependency must be enforced even if the calculation could seemingly be optimized away. This choice ensures that RVWMO remains compatible with code that uses these false syntactic dependencies as a lightweight ordering mechanism.
ld a1,0(s0)
xor a2,a1,a1
add s1,s1,a2
ld a5,0(s1)
For example, there is a syntactic address dependency from the memory operation generated by the first instruction to the memory operation generated by the last instruction in Listing 5, even though a1 XOR a1 is zero and hence has no effect on the address accessed by the second load.
The benefit of using dependencies as a lightweight synchronization +mechanism is that the ordering enforcement requirement is limited only +to the specific two instructions in question. Other non-dependent +instructions may be freely reordered by aggressive implementations. One +alternative would be to use a load-acquire, but this would enforce +ordering for the first load with respect to all subsequent +instructions. Another would be to use a FENCE R,R, but this would +include all previous and all subsequent loads, making this option more +expensive.
lw x1,0(x2)
bne x1,x0,next
sw x3,0(x4)
next: sw x5,0(x6)
Control dependencies behave differently from address and data dependencies in the sense that a control dependency always extends to all instructions following the original target in program order. Consider Listing 6: the instruction at next will always execute, but the memory operation generated by that last instruction nevertheless still has a control dependency from the memory operation generated by the first instruction.
lw x1,0(x2)
bne x1,x0,next
next: sw x3,0(x4)
+Likewise, consider Listing 7. +Even though both branch outcomes have the same target, there is still a +control dependency from the memory operation generated by the first +instruction in this snippet to the memory operation generated by the +last instruction. This definition of control dependency is subtly +stronger than what might be seen in other contexts (e.g., C++), but it +conforms with standard definitions of control dependencies in the +literature.
+Notably, PPO rules 9-11 are also +intentionally designed to respect dependencies that originate from the +output of a successful store-conditional instruction. Typically, an SC +instruction will be followed by a conditional branch checking whether +the outcome was successful; this implies that there will be a control +dependency from the store operation generated by the SC instruction to +any memory operations following the branch. PPO +rule 11 in turn implies that any subsequent store +operations will appear later in the global memory order than the store +operation generated by the SC. However, since control, address, and data +dependencies are defined over memory operations, and since an +unsuccessful SC does not generate a memory operation, no order is +enforced between unsuccessful SC and its dependent instructions. +Moreover, since SC is defined to carry dependencies from its source +registers to rd only when the SC is successful, an unsuccessful SC has +no effect on the global memory order.
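As a brief illustrative sketch of this point (the registers here are arbitrary, and this is not one of the specification's own listings), the final store below has a syntactic control dependency on the store operation generated by the SC, because the branch consumes the SC's destination register:

    again:
        lr.w      a0, (a1)        # load-reserved
        addi      a0, a0, 1
        sc.w      t0, a0, (a1)    # store-conditional; t0 is 0 on success
        bnez      t0, again       # branch depends on the SC's result
        sw        t2, (a2)        # ordered after the successful SC's store (rule 11)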
In addition, the choice to respect dependencies originating at store-conditional instructions ensures that certain out-of-thin-air-like behaviors will be prevented. Consider Table 38. Suppose a hypothetical implementation could occasionally make some early guarantee that a store-conditional operation will succeed. In this case, (c) could return 0 to a2 early (before actually executing), allowing the sequence (d), (e), (f), (a), and then (b) to execute, and then (c) might execute (successfully) only at that point. This would imply that (c) writes its own success value to 0(s1)! Fortunately, this situation and others like it are prevented by the fact that RVWMO respects dependencies originating at the stores generated by successful SC instructions.
We also note that syntactic dependencies between instructions only have any force when they take the form of a syntactic address, control, and/or data dependency. For example, a syntactic dependency between two F instructions via one of the accumulating CSRs in Section 18.3 does not imply that the two F instructions must be executed in order. Such a dependency would only serve to ultimately set up a later dependency from both F instructions to a later CSR instruction accessing the CSR flag in question.
A.3.9. Pipeline Dependencies (Rules 12-13)
++ + | +
+
+
+Rule 12: b is a load, and there exists some store m between a and b in +program order such that m has an address or data dependency on a, +and b returns a value written by m +
+
+Rule 13: b is a store, and there exists some instruction m between a and +b in program order such that m has an address dependency on a + |
+
PPO rules 12 and 13 reflect behaviors of almost all real processor pipeline implementations. Rule 12 states that a load cannot forward from a store until the address and data for that store are known. Consider Table 39: (f) cannot be executed until the data for (e) has been resolved, because (f) must return the value written by (e) (or by something even later in the global memory order), and the old value must not be clobbered by the writeback of (e) before (d) has had a chance to perform. Therefore, (f) will never perform before (d) has performed.
If there were another store to the same address in between (e) and (f), +as in Table 41, +then (f) would no longer be dependent on the data of (e) being resolved, +and hence the dependency of (f) on (d), which produces the data for (e), +would be broken.
Rule 13 makes a similar observation to the previous rule: a store cannot be performed at memory until all previous loads that might access the same address have themselves been performed. Such a load must appear to execute before the store, but it cannot do so if the store were to overwrite the value in memory before the load had a chance to read the old value. Likewise, a store generally cannot be performed until it is known that preceding instructions will not cause an exception due to failed address resolution, and in this sense, rule 13 can be seen as somewhat of a special case of rule 11.
Consider Table 41: (f) cannot be executed until the address for (e) is resolved, because it may turn out that the addresses match, i.e., that a1=s0. Therefore, (f) cannot be sent to memory before (d) has executed and confirmed whether the addresses do indeed overlap.
A.4. Beyond Main Memory
+RVWMO does not currently attempt to formally describe how FENCE.I, +SFENCE.VMA, I/O fences, and PMAs behave. All of these behaviors will be +described by future formalizations. In the meantime, the behavior of +FENCE.I is described in Chapter 6, the +behavior of SFENCE.VMA is described in the RISC-V Instruction Set +Privileged Architecture Manual, and the behavior of I/O fences and the +effects of PMAs are described below.
+A.4.1. Coherence and Cacheability
+The RISC-V Privileged ISA defines Physical Memory Attributes (PMAs) +which specify, among other things, whether portions of the address space +are coherent and/or cacheable. See the RISC-V Privileged ISA +Specification for the complete details. Here, we simply discuss how the +various details in each PMA relate to the memory model:
- Main memory vs. I/O, and I/O memory ordering PMAs: the memory model as defined applies to main memory regions. I/O ordering is discussed below.
- Supported access types and atomicity PMAs: the memory model is simply applied on top of whatever primitives each region supports.
- Cacheability PMAs: the cacheability PMAs in general do not affect the memory model. Non-cacheable regions may have more restrictive behavior than cacheable regions, but the set of allowed behaviors does not change regardless. However, some platform-specific and/or device-specific cacheability settings may differ.
- Coherence PMAs: the memory consistency model for memory regions marked as non-coherent in PMAs is currently platform-specific and/or device-specific: the load-value axiom, the atomicity axiom, and the progress axiom all may be violated with non-coherent memory. Note however that coherent memory does not require a hardware cache coherence protocol. The RISC-V Privileged ISA Specification suggests that hardware-incoherent regions of main memory are discouraged, but the memory model is compatible with hardware coherence, software coherence, implicit coherence due to read-only memory, implicit coherence due to only one agent having access, or otherwise.
- Idempotency PMAs: Idempotency PMAs are used to specify memory regions for which loads and/or stores may have side effects, and this in turn is used by the microarchitecture to determine, e.g., whether prefetches are legal. This distinction does not affect the memory model.
A.4.2. I/O Ordering
+For I/O, the load value axiom and atomicity axiom in general do not +apply, as both reads and writes might have device-specific side effects +and may return values other than the value "written" by the most +recent store to the same address. Nevertheless, the following preserved +program order rules still generally apply for accesses to I/O memory: +memory access a precedes memory access b in +global memory order if a precedes b in +program order and one or more of the following holds:
- a precedes b in preserved program order as defined in Chapter 18, with the exception that acquire and release ordering annotations apply only from one memory operation to another memory operation and from one I/O operation to another I/O operation, but not from a memory operation to an I/O nor vice versa
- a and b are accesses to overlapping addresses in an I/O region
- a and b are accesses to the same strongly ordered I/O region
- a and b are accesses to I/O regions, and the channel associated with the I/O region accessed by either a or b is channel 1
- a and b are accesses to I/O regions associated with the same channel (except for channel 0)
Note that the FENCE instruction distinguishes between main memory +operations and I/O operations in its predecessor and successor sets. To +enforce ordering between I/O operations and main memory operations, code +must use a FENCE with PI, PO, SI, and/or SO, plus PR, PW, SR, and/or SW. +For example, to enforce ordering between a write to main memory and an +I/O write to a device register, a FENCE W,O or stronger is needed.
sd t0, 0(a0)
fence w,o
sd a0, 0(a1)
When a fence is in fact used, implementations must assume that the device may attempt to access memory immediately after receiving the MMIO signal, and subsequent memory accesses from that device to memory must observe the effects of all accesses ordered prior to that MMIO operation. In other words, in Listing 8, suppose 0(a0) is in main memory and 0(a1) is the address of a device register in I/O memory. If the device accesses 0(a0) upon receiving the MMIO write, then that load must conceptually appear after the first store to 0(a0) according to the rules of the RVWMO memory model. In some implementations, the only way to ensure this will be to require that the first store does in fact complete before the MMIO write is issued. Other implementations may find ways to be more aggressive, while others still may not need to do anything different at all for I/O and main memory accesses. Nevertheless, the RVWMO memory model does not distinguish between these options; it simply provides an implementation-agnostic mechanism to specify the orderings that must be enforced.
Many architectures include separate notions of "ordering" and "completion" fences, especially as it relates to I/O (as opposed to regular main memory). Ordering fences simply ensure that memory operations stay in order, while completion fences ensure that predecessor accesses have all completed before any successors are made visible. RISC-V does not explicitly distinguish between ordering and completion fences. Instead, this distinction is simply inferred from different uses of the FENCE bits.
+For implementations that conform to the RISC-V Unix Platform +Specification, I/O devices and DMA operations are required to access +memory coherently and via strongly ordered I/O channels. Therefore, +accesses to regular main memory regions that are concurrently accessed +by external devices can also use the standard synchronization +mechanisms. Implementations that do not conform to the Unix Platform +Specification and/or in which devices do not access memory coherently +will need to use mechanisms (which are currently platform-specific or +device-specific) to enforce coherency.
+I/O regions in the address space should be considered non-cacheable +regions in the PMAs for those regions. Such regions can be considered +coherent by the PMA if they are not cached by any agent.
+The ordering guarantees in this section may not apply beyond a +platform-specific boundary between the RISC-V cores and the device. In +particular, I/O accesses sent across an external bus (e.g., PCIe) may be +reordered before they reach their ultimate destination. Ordering must be +enforced in such situations according to the platform-specific rules of +those external devices and buses.
+A.5. Code Porting and Mapping Guidelines
Table 42: x86/TSO operations (Load, Store, Atomic RMW, Fence) and their RVWMO mappings. (The body of this table is not reproduced in this rendering; the mappings are described in the following paragraph.)
Table 42 provides a mapping from TSO memory +operations onto RISC-V memory instructions. Normal x86 loads and stores +are all inherently acquire-RCpc and release-RCpc operations: TSO +enforces all load-load, load-store, and store-store ordering by default. +Therefore, under RVWMO, all TSO loads must be mapped onto a load +followed by FENCE R,RW, and all TSO stores must be mapped onto +FENCE RW,W followed by a store. TSO atomic read-modify-writes and x86 +instructions using the LOCK prefix are fully ordered and can be +implemented either via an AMO with both aq and rl set, or via an LR +with aq set, the arithmetic operation in question, an SC with both +aq and rl set, and a conditional branch checking the success +condition. In the latter case, the rl annotation on the LR turns out +(for non-obvious reasons) to be redundant and can be omitted.
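A minimal sketch of the mappings just described, using word-sized accesses for illustration (Table 42 itself covers the other access sizes as well):

    # TSO load  ->  load; FENCE R,RW
    lw    a0, 0(s0)
    fence r, rw

    # TSO store ->  FENCE RW,W; store
    fence rw, w
    sw    a1, 0(s0)

    # TSO atomic RMW -> fully ordered AMO ...
    amoadd.w.aqrl a0, a1, (s0)

    # ... or an LR with aq, the operation, and an SC with aq and rl
    again:
        lr.w.aq    a0, (s0)
        add        t0, a0, a1
        sc.w.aqrl  t1, t0, (s0)
        bnez       t1, again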
+Alternatives to Table 42 are also possible. A TSO +store can be mapped onto AMOSWAP with rl set. However, since RVWMO PPO +Rule 3 forbids forwarding of values from +AMOs to subsequent loads, the use of AMOSWAP for stores may negatively +affect performance. A TSO load can be mapped using LR with aq set: all +such LR instructions will be unpaired, but that fact in and of itself +does not preclude the use of LR for loads. However, again, this mapping +may also negatively affect performance if it puts more pressure on the +reservation mechanism than was originally intended.
Table 43: Power operations (including Load, Load-Reserve, Store, and Store-Conditional) and their RVWMO mappings. (The body of this table is not reproduced in this rendering.)
Table 43 provides a mapping from Power memory +operations onto RISC-V memory instructions. Power ISYNC maps on RISC-V +to a FENCE.I followed by a FENCE R,R; the latter fence is needed because +ISYNC is used to define a "control+control fence" dependency that is +not present in RVWMO.
Table 44: ARM operations (Load, Load-Acquire, Load-Exclusive, Load-Acquire-Exclusive, Store, Store-Release, Store-Exclusive, Store-Release-Exclusive, and others) and their RVWMO mappings. (The body of this table is not reproduced in this rendering.)
Table 44 provides a mapping from ARM memory +operations onto RISC-V memory instructions. Since RISC-V does not +currently have plain load and store opcodes with aq or rl +annotations, ARM load-acquire and store-release operations should be +mapped using fences instead. Furthermore, in order to enforce +store-release-to-load-acquire ordering, there must be a FENCE RW,RW +between the store-release and load-acquire; Table 44 +enforces this by always placing the fence in front of each acquire +operation. ARM load-exclusive and store-exclusive instructions can +likewise map onto their RISC-V LR and SC equivalents, but instead of +placing a FENCE RW,RW in front of an LR with aq set, we simply also +set rl instead. ARM ISB maps on RISC-V to FENCE.I followed by +FENCE R,R similarly to how ISYNC maps for Power.
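Since the body of Table 44 is not reproduced here, the following is only a hedged sketch consistent with the prose above, not the table itself (in particular, the trailing FENCE R,RW after the acquire load is an assumption based on the acquire mapping discussed in Section A.3.7):

    # ARM load-acquire -> leading FENCE RW,RW, the load, then FENCE R,RW
    fence rw, rw
    lw    a0, 0(s0)
    fence r, rw

    # ARM store-release -> FENCE RW,W, then the store
    fence rw, w
    sw    a1, 0(s0)

    # ARM load-acquire-exclusive -> LR with both aq and rl set
    lr.w.aqrl a0, (s0)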
Table 45: Linux memory ordering macros and their RVWMO mappings, including separate sub-tables for AMO-based and LR/SC-based mappings. (The body of this table is not reproduced in this rendering.)
With regards to Table 45, other +constructs (such as spinlocks) should follow accordingly. Platforms or +devices with non-coherent DMA may need additional synchronization (such +as cache flush or invalidate mechanisms); currently any such extra +synchronization will be device-specific.
+Table 45 provides a mapping of Linux memory
+ordering macros onto RISC-V memory instructions. The Linux fences
+dma_rmb()
and dma_wmb()
map onto FENCE R,R and FENCE W,W,
+respectively, since the RISC-V Unix Platform requires coherent DMA, but
+would be mapped onto FENCE RI,RI and FENCE WO,WO, respectively, on a
+platform with non-coherent DMA. Platforms with non-coherent DMA may also
+require a mechanism by which cache lines can be flushed and/or
+invalidated. Such mechanisms will be device-specific and/or standardized
+in a future extension to the ISA.
The Linux mappings for release operations may seem stronger than +necessary, but these mappings are needed to cover some cases in which +Linux requires stronger orderings than the more intuitive mappings would +provide. In particular, as of the time this text is being written, Linux +is actively debating whether to require load-load, load-store, and +store-store orderings between accesses in one critical section and +accesses in a subsequent critical section in the same hart and protected +by the same synchronization object. Not all combinations of +FENCE RW,W/FENCE R,RW mappings with aq/rl mappings combine to +provide such orderings. There are a few ways around this problem, +including:
- Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices but is undesirable, as it defeats the purpose of the aq/rl modifiers.
- Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not currently work due to the lack of load and store opcodes with aq and rl modifiers.
- Strengthen the mappings of release operations such that they would enforce sufficient orderings in the presence of either type of acquire mapping. This is the currently recommended solution, and the one shown in Table 45.
RVWMO Mapping:
(a) lw a0, 0(s0)
(b) fence.tso   // vs. fence rw,w
(c) sd x0,0(s1)
…
loop:
(d) amoswap.d.aq a1,t1,0(s1)
bnez a1,loop
(e) lw a2,0(s2)
+For example, the critical section ordering rule currently being debated +by the Linux community would require (a) to be ordered before (e) in +Listing 9. If that will indeed be +required, then it would be insufficient for (b) to map as FENCE RW,W. +That said, these mappings are subject to change as the Linux Kernel +Memory Model evolves.
Linux Code:
(a) int r0 = *x;
(bc) spin_unlock(y, 0);
....
....
(d) spin_lock(y);
(e) int r1 = *z;

RVWMO Mapping:
(a) lw a0, 0(s0)
(b) fence.tso   // vs. fence rw,w
(c) sd x0,0(s1)
....
loop:
(d) lr.d.aq a1,(s1)
bnez a1,loop
sc.d a1,t1,(s1)
bnez a1,loop
(e) lw a2,0(s2)
Table 46 provides a mapping of C11/C++11 atomic operations onto RISC-V memory instructions. If load and store opcodes with aq and rl modifiers are introduced, then the mappings in Table 47 will suffice. Note however that the two mappings only interoperate correctly if atomic_<op>(memory_order_seq_cst) is mapped using an LR that has both aq and rl set. Even more importantly, a Table 46 sequentially consistent store followed by a Table 47 sequentially consistent load can be reordered unless the Table 46 mapping of stores is strengthened by either adding a second fence or mapping the store to amoswap.rl instead.
Table 46: C11/C++11 constructs (non-atomic loads and stores, atomic loads, stores, and read-modify-write operations, with separate AMO-based and LR/SC-based mappings) and their RVWMO mappings. (The body of this table is not reproduced in this rendering.)
Table 47: the alternative C11/C++11 mappings discussed above, in the same layout as Table 46. (The body of this table is not reproduced in this rendering.)
Any AMO can be emulated by an LR/SC pair, but care must be taken to ensure that any PPO orderings that originate from the LR are also made to originate from the SC, and that any PPO orderings that terminate at the SC are also made to terminate at the LR. For example, the LR must also be made to respect any data dependencies that the AMO has, given that load operations do not otherwise have any notion of a data dependency. Likewise, the effect of a FENCE R,R elsewhere in the same hart must also be made to apply to the SC, which would not otherwise respect that fence. The emulator may achieve this effect by simply mapping AMOs onto lr.aq; <op>; sc.aqrl, matching the mapping used elsewhere for fully ordered atomics.
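A minimal sketch of that emulation strategy for, e.g., an amoadd.w (register choices are illustrative):

    emulate_amoadd_w:
        lr.w.aq    a0, (a1)        # aq on the LR, as described above
        add        t0, a0, a2
        sc.w.aqrl  t1, t0, (a1)    # aq and rl on the SC
        bnez       t1, emulate_amoadd_w
        # a0 now holds the value that was in memory, as amoadd.w would return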
These C11/C++11 mappings require the platform to provide the following +Physical Memory Attributes (as defined in the RISC-V Privileged ISA) for +all memory:
- main memory
- coherent
- AMOArithmetic
- RsrvEventual
Platforms with different attributes may require different mappings, or +require platform-specific SW (e.g., memory-mapped I/O).
A.6. Implementation Guidelines

The RVWMO and RVTSO memory models by no means preclude microarchitectures from employing sophisticated speculation techniques or other forms of optimization in order to deliver higher performance. The models also do not impose any requirement to use any one particular cache hierarchy, nor even to use a cache coherence protocol at all. Instead, these models only specify the behaviors that can be exposed to software. Microarchitectures are free to use any pipeline design, any coherent or non-coherent cache hierarchy, any on-chip interconnect, etc., as long as the design only admits executions that satisfy the memory model rules. That said, to help people understand the actual implementations of the memory model, in this section we provide some guidelines on how architects and programmers should interpret the models' rules.

Both RVWMO and RVTSO are multi-copy atomic (or other-multi-copy-atomic): any store value that is visible to a hart other than the one that originally issued it must also be conceptually visible to all other harts in the system. In other words, harts may forward from their own previous stores before those stores have become globally visible to all harts, but no early inter-hart forwarding is permitted. Multi-copy atomicity may be enforced in a number of ways. It might hold inherently due to the physical design of the caches and store buffers, it may be enforced via a single-writer/multiple-reader cache coherence protocol, or it might hold due to some other mechanism.

Although multi-copy atomicity does impose some restrictions on the microarchitecture, it is one of the key properties keeping the memory model from becoming extremely complicated. For example, a hart may not legally forward a value from a neighbor hart's private store buffer (unless of course it is done in such a way that no new illegal behaviors become architecturally visible). Nor may a cache coherence protocol forward a value from one hart to another until the coherence protocol has invalidated all older copies from other caches. Of course, microarchitectures may (and high-performance implementations likely will) violate these rules under the covers through speculation or other optimizations, as long as any non-compliant behaviors are not exposed to the programmer.

As a rough guideline for interpreting the PPO rules in RVWMO, we expect the following from the software perspective:
- programmers will use PPO rules 1 and 4-8 regularly and actively.
- expert programmers will use PPO rules 9-11 to speed up critical paths of important data structures.
- even expert programmers will rarely if ever use PPO rules 2-3 and 12-13 directly. These are included to facilitate common microarchitectural optimizations (rule 2) and the operational formal modeling approach (rules 3 and 12-13) described in Section B.3. They also facilitate the process of porting code from other architectures that have similar rules.

We also expect the following from the hardware perspective:

- PPO rules 1 and 3-6 reflect well-understood rules that should pose few surprises to architects.
- PPO rule 2 reflects a natural and common hardware optimization, but one that is very subtle and hence is worth double checking carefully.
- PPO rule 7 may not be immediately obvious to architects, but it is a standard memory model requirement.
- The load value axiom, the atomicity axiom, and PPO rules 8-13 reflect rules that most hardware implementations will enforce naturally, unless they contain extreme optimizations. Of course, implementations should make sure to double check these rules nevertheless. Hardware must also ensure that syntactic dependencies are not optimized away.
Architectures are free to implement any of the memory model rules as conservatively as they choose. For example, a hardware implementation may choose to do any or all of the following:

- interpret all fences as if they were FENCE RW,RW (or FENCE IORW,IORW, if I/O is involved), regardless of the bits actually set
- implement all fences with PW and SR as if they were FENCE RW,RW (or FENCE IORW,IORW, if I/O is involved), as PW with SR is the most expensive of the four possible main memory ordering components anyway
- emulate aq and rl as described in Section A.5
- enforce all same-address load-load ordering, even in the presence of patterns such as fri-rfi and RSW
- forbid any forwarding of a value from a store in the store buffer to a subsequent AMO or LR to the same address
- forbid any forwarding of a value from an AMO or SC in the store buffer to a subsequent load to the same address
- implement TSO on all memory accesses, and ignore any main memory fences that do not include PW and SR ordering (e.g., as Ztso implementations will do)
- implement all atomics to be RCsc or even fully ordered, regardless of annotation
Architectures that implement RVTSO can safely do the following:

Other general notes:

- Silent stores (i.e., stores that write the same value that already exists at a memory location) behave like any other store from a memory model point of view. Likewise, AMOs which do not actually change the value in memory (e.g., an AMOMAX for which the value in rs2 is smaller than the value currently in memory) are still semantically considered store operations. Microarchitectures that attempt to implement silent stores must take care to ensure that the memory model is still obeyed, particularly in cases such as RSW (Section A.3.5) which tend to be incompatible with silent stores.
- Writes may be merged (i.e., two consecutive writes to the same address may be merged) or subsumed (i.e., the earlier of two back-to-back writes to the same address may be elided) as long as the resulting behavior does not otherwise violate the memory model semantics.
The question of write subsumption can be understood from the following example:

| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 3 | | li t3, 2 |
| | li t2, 1 | | |
| (a) | sw t1,0(s0) | (d) | lw a0,0(s1) |
| (b) | fence w, w | (e) | sw a0,0(s0) |
| (c) | sw t2,0(s1) | (f) | sw t3,0(s0) |
As written, if the load (d) reads value 1, then (a) must precede (f) in the global memory order:

- (a) precedes (c) in the global memory order because of rule 4
- (c) precedes (d) in the global memory order because of the Load Value axiom
- (d) precedes (e) in the global memory order because of rule 10
- (e) precedes (f) in the global memory order because of rule 1

In other words, the final value of the memory location whose address is in s0 must be 2 (the value written by the store (f)) and cannot be 3 (the value written by the store (a)).

A very aggressive microarchitecture might erroneously decide to discard (e), as (f) supersedes it, and this may in turn lead the microarchitecture to break the now-eliminated dependency between (d) and (f) (and hence also between (a) and (f)). This would violate the memory model rules, and hence it is forbidden. Write subsumption may in other cases be legal, if for example there were no data dependency between (d) and (e).
A.6.1. Possible Future Extensions

We expect that any or all of the following possible future extensions would be compatible with the RVWMO memory model:

- "V" vector ISA extensions
- "J" JIT extension
- Native encodings for load and store opcodes with aq and rl set
- Fences limited to certain addresses
- Cache writeback/flush/invalidate/etc. instructions
A.7. Known Issues

A.7.1. Mixed-size RSW
| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t1, 1 |
| (a) | lw a0,0(s0) | (d) | lw a1,0(s1) |
| (b) | fence rw,rw | (e) | amoswap.w.rl a2,t1,0(s2) |
| (c) | sw t1,0(s1) | (f) | ld a3,0(s2) |
| | | (g) | lw a4,4(s2) |
| | | | xor a5,a4,a4 |
| | | | add s0,s0,a5 |
| | | (h) | sw t1,0(s0) |

Outcome:
| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t1, 1 |
| (a) | lw a0,0(s0) | (d) | ld a1,0(s1) |
| (b) | fence rw,rw | (e) | lw a2,4(s1) |
| (c) | sw t1,0(s1) | | xor a3,a2,a2 |
| | | | add s0,s0,a3 |
| | | (f) | sw t1,0(s0) |

Outcome:
| | Hart 0 | | Hart 1 |
|---|---|---|---|
| | li t1, 1 | | li t1, 1 |
| (a) | lw a0,0(s0) | (d) | sw t1,4(s1) |
| (b) | fence rw,rw | (e) | ld a1,0(s1) |
| (c) | sw t1,0(s1) | (f) | lw a2,4(s1) |
| | | | xor a3,a2,a2 |
| | | | add s0,s0,a3 |
| | | (g) | sw t1,0(s0) |

Outcome:
There is a known discrepancy between the operational and axiomatic specifications within the family of mixed-size RSW variants shown in Table 49-Table 51. To address this, we may choose to add something like the following new PPO rule: Memory operation a precedes memory operation b in preserved program order (and hence also in the global memory order) if a precedes b in program order, a and b both access regular main memory (rather than I/O regions), a is a load, b is a store, there is a load m between a and b, there is a byte x that both a and m read, there is no store between a and m that writes to x, and m precedes b in PPO. In other words, in herd syntax, we may choose to add (po-loc & rsw);ppo;[W] to PPO. Many implementations will already enforce this ordering naturally. As such, even though this rule is not official, we recommend that implementers enforce it nevertheless in order to ensure forwards compatibility with the possible future addition of this rule to RVWMO.
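The herd expression above is just a relational composition. The following sketch, over small boolean adjacency matrices, shows what (po-loc & rsw);ppo;[W] would compute for a toy execution; the matrix encoding, the event count, and the event roles are assumptions made for this illustration.

#include <stdio.h>
#include <stdbool.h>

#define N 4   /* number of memory events in the toy execution */

/* r;s is relational composition over boolean adjacency matrices. */
static void compose(const bool a[N][N], const bool b[N][N], bool out[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            out[i][j] = false;
            for (int k = 0; k < N; k++)
                if (a[i][k] && b[k][j]) out[i][j] = true;
        }
}

int main(void)
{
    /* Toy execution: events 0..3 on one hart; event 3 is a store (W). */
    bool po_loc[N][N] = {{false}}, rsw[N][N] = {{false}}, ppo[N][N] = {{false}};
    bool is_store[N]  = { false, false, false, true };

    po_loc[0][1] = true;   /* two same-address loads in program order */
    rsw[0][1]    = true;   /* ...that read from the same store        */
    ppo[1][3]    = true;   /* the second load is ppo-before the store */

    /* step 1: po-loc & rsw (intersection) */
    bool step1[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            step1[i][j] = po_loc[i][j] && rsw[i][j];

    /* step 2: (po-loc & rsw);ppo (composition) */
    bool step2[N][N];
    compose(step1, ppo, step2);

    /* step 3: restrict the codomain to stores, i.e. ...;[W] */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (step2[i][j] && is_store[j])
                printf("proposed ppo edge: %d -> %d\n", i, j);
    return 0;
}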
Appendix B: Formal Memory Model Specifications, Version 0.1

To facilitate formal analysis of RVWMO, this chapter presents a set of formalizations using different tools and modeling approaches. Any discrepancies are unintended; the expectation is that the models describe exactly the same sets of legal behaviors.

This appendix should be treated as commentary; all normative material is provided in Chapter 17 and in the rest of the main body of the ISA specification. All currently known discrepancies are listed in Section A.7. Any other discrepancies are unintentional.

B.1. Formal Axiomatic Specification in Alloy

We present a formal specification of the RVWMO memory model in Alloy (alloy.mit.edu). This model is available online at github.com/daniellustig/riscv-memory-model.

The online material also contains some litmus tests and some examples of how Alloy can be used to model check some of the mappings in Section A.5.
// =RVWMO PPO=

// Preserved Program Order
fun ppo : Event->Event {
  // same-address ordering
  po_loc :> Store
  + rdw
  + (AMO + StoreConditional) <: rfi

  // explicit synchronization
  + ppo_fence
  + Acquire <: ^po :> MemoryEvent
  + MemoryEvent <: ^po :> Release
  + RCsc <: ^po :> RCsc
  + pair

  // syntactic dependencies
  + addrdep
  + datadep
  + ctrldep :> Store

  // pipeline dependencies
  + (addrdep+datadep).rfi
  + addrdep.^po :> Store
}

// the global memory order respects preserved program order
fact { ppo in ^gmo }

// =RVWMO axioms=

// Load Value Axiom
fun candidates[r: MemoryEvent] : set MemoryEvent {
  (r.~^gmo & Store & same_addr[r])  // writes preceding r in gmo
  + (r.^~po & Store & same_addr[r]) // writes preceding r in po
}

fun latest_among[s: set Event] : Event { s - s.~^gmo }

pred LoadValue {
  all w: Store | all r: Load |
    w->r in rf <=> w = latest_among[candidates[r]]
}

// Atomicity Axiom
pred Atomicity {
  all r: Store.~pair |            // starting from the lr,
  no x: Store & same_addr[r] |    // there is no store x to the same addr
    x not in same_hart[r]         // such that x is from a different hart,
    and x in r.~rf.^gmo           // x follows (the store r reads from) in gmo,
    and r.pair in x.^gmo          // and r follows x in gmo
}

// Progress Axiom implicit: Alloy only considers finite executions

pred RISCV_mm { LoadValue and Atomicity /* and Progress */ }
//Basic model of memory
+
+sig Hart { // hardware thread
+ start : one Event
+}
+sig Address {}
+abstract sig Event {
+ po: lone Event // program order
+}
+
+abstract sig MemoryEvent extends Event {
+ address: one Address,
+ acquireRCpc: lone MemoryEvent,
+ acquireRCsc: lone MemoryEvent,
+ releaseRCpc: lone MemoryEvent,
+ releaseRCsc: lone MemoryEvent,
+ addrdep: set MemoryEvent,
+ ctrldep: set Event,
+ datadep: set MemoryEvent,
+ gmo: set MemoryEvent, // global memory order
+ rf: set MemoryEvent
+}
+sig LoadNormal extends MemoryEvent {} // l{b|h|w|d}
+sig LoadReserve extends MemoryEvent { // lr
+ pair: lone StoreConditional
+}
+sig StoreNormal extends MemoryEvent {} // s{b|h|w|d}
+// all StoreConditionals in the model are assumed to be successful
+sig StoreConditional extends MemoryEvent {} // sc
+sig AMO extends MemoryEvent {} // amo
+sig NOP extends Event {}
+
+fun Load : Event { LoadNormal + LoadReserve + AMO }
+fun Store : Event { StoreNormal + StoreConditional + AMO }
+
+sig Fence extends Event {
+ pr: lone Fence, // opcode bit
+ pw: lone Fence, // opcode bit
+ sr: lone Fence, // opcode bit
+ sw: lone Fence // opcode bit
+}
+sig FenceTSO extends Fence {}
+
+/* Alloy encoding detail: opcode bits are either set (encoded, e.g.,
+ * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for
+ * anything else */
+fact { pr + pw + sr + sw in iden }
+// likewise for ordering annotations
+fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden }
+// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is
+fact { no FenceTSO.(pr + pw + sr + sw) }
+// =Basic model rules=
+
+// Ordering annotation groups
+fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc }
+fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc }
+fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc }
+fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc }
+
+// There is no such thing as store-acquire or load-release, unless it's both
+fact { Load & Release in Acquire }
+fact { Store & Acquire in Release }
+
+// FENCE PPO
+fun FencePRSR : Fence { Fence.(pr & sr) }
+fun FencePRSW : Fence { Fence.(pr & sw) }
+fun FencePWSR : Fence { Fence.(pw & sr) }
+fun FencePWSW : Fence { Fence.(pw & sw) }
+
+fun ppo_fence : MemoryEvent->MemoryEvent {
+ (Load <: ^po :> FencePRSR).(^po :> Load)
+ + (Load <: ^po :> FencePRSW).(^po :> Store)
+ + (Store <: ^po :> FencePWSR).(^po :> Load)
+ + (Store <: ^po :> FencePWSW).(^po :> Store)
+ + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent)
+ + (Store <: ^po :> FenceTSO) .(^po :> Store)
+}
+
+// auxiliary definitions
+fun po_loc : Event->Event { ^po & address.~address }
+fun same_hart[e: Event] : set Event { e + e.^~po + e.^po }
+fun same_addr[e: Event] : set Event { e.address.~address }
+
+// initial stores
+fun NonInit : set Event { Hart.start.*po }
+fun Init : set Event { Event - NonInit }
+fact { Init in StoreNormal }
+fact { Init->(MemoryEvent & NonInit) in ^gmo }
+fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart
+fact { all a: Address | one Init & a.~address } // one init store per address
+fact { no Init <: po and no po :> Init }
+// po
+fact { acyclic[po] }
+
+// gmo
+fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents
+
+//rf
+fact { rf.~rf in iden } // each read returns the value of only one write
+fact { rf in Store <: address.~address :> Load }
+fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) }
+
+//dep
+fact { no StoreNormal <: (addrdep + ctrldep + datadep) }
+fact { addrdep + ctrldep + datadep + pair in ^po }
+fact { datadep in datadep :> Store }
+fact { ctrldep.*po in ctrldep }
+fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po }
+fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed
+
+// rdw
+fun rdw : Event->Event {
+ (Load <: po_loc :> Load) // start with all same_address load-load pairs,
+ - (~rf.rf) // subtract pairs that read from the same store,
+ - (po_loc.rfi) // and subtract out "fri-rfi" patterns
+}
+
+// filter out redundant instances and/or visualizations
+fact { no gmo & gmo.gmo } // keep the visualization uncluttered
+fact { all a: Address | some a.~address }
+
+// =Optional: opcode encoding restrictions=
+
+// the list of blessed fences
+fact { Fence in
+ Fence.pr.sr
+ + Fence.pw.sw
+ + Fence.pr.pw.sw
+ + Fence.pr.sr.sw
+ + FenceTSO
+ + Fence.pr.pw.sr.sw
+}
+
+pred restrict_to_current_encodings {
+ no (LoadNormal + StoreNormal) & (Acquire + Release)
+}
+
+// =Alloy shortcuts=
+pred acyclic[rel: Event->Event] { no iden & ^rel }
+pred total[rel: Event->Event, bag: Event] {
+ all disj e, e': bag | e->e' in rel + ~rel
+ acyclic[rel]
+}
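As a concrete, non-Alloy illustration of the Load Value axiom formalized above, the following sketch computes the value a single load must return from a toy execution. The struct layout and the encoding of gmo/po as integer positions are assumptions of this sketch, not part of the model.

#include <stdio.h>

/* Illustrative sketch of the Load Value axiom: a load returns the value
 * of the latest store to the same address among (i) stores preceding it
 * in the global memory order and (ii) stores preceding it in program
 * order on the same hart.  These structs are simplifications made for
 * the sketch. */
struct store { int addr, value, gmo, hart, po; };
struct load  { int addr,        gmo, hart, po; };

static int load_value(const struct store *s, int n, struct load r)
{
    int best = -1, best_gmo = -1;
    for (int i = 0; i < n; i++) {
        if (s[i].addr != r.addr)
            continue;
        int in_gmo = s[i].gmo < r.gmo;                       /* precedes r in gmo */
        int in_po  = s[i].hart == r.hart && s[i].po < r.po;  /* precedes r in po  */
        if ((in_gmo || in_po) && s[i].gmo > best_gmo) {      /* keep the latest in gmo */
            best_gmo = s[i].gmo;
            best = s[i].value;
        }
    }
    return best;   /* -1 would stand for the initial memory value in a real model */
}

int main(void)
{
    /* Hart 0 stores 1 then 2 to address 8; hart 1's store of 3 to the same
     * address sits between them in the global memory order. */
    struct store stores[] = {
        { .addr = 8, .value = 1, .gmo = 0, .hart = 0, .po = 0 },
        { .addr = 8, .value = 3, .gmo = 2, .hart = 1, .po = 0 },
        { .addr = 8, .value = 2, .gmo = 3, .hart = 0, .po = 1 },
    };
    /* A load on hart 0 after its second store, but placed earlier in gmo:
     * it must read 2, forwarded from its own not-yet-globally-visible store. */
    struct load r = { .addr = 8, .gmo = 1, .hart = 0, .po = 2 };
    printf("load reads %d\n", load_value(stores, 3, r));
    return 0;
}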
B.2. Formal Axiomatic Specification in Herd

The tool herd takes a memory model and a litmus test as input and simulates the execution of the test on top of the memory model. Memory models are written in the domain specific language Cat. This section provides two Cat memory models of RVWMO. The first model, Listing 15, follows the global memory order definition of RVWMO in Chapter 18, as much as is possible for a Cat model. The second model, Listing 16, is an equivalent, more efficient, partial-order-based RVWMO model.

The simulator herd is part of the diy tool suite; see diy.inria.fr for software and documentation. The models and more are available online at diy.inria.fr/cats7/riscv/.
(*************)
+(* Utilities *)
+(*************)
+
+(* All fence relations *)
+let fence.r.r = [R];fencerel(Fence.r.r);[R]
+let fence.r.w = [R];fencerel(Fence.r.w);[W]
+let fence.r.rw = [R];fencerel(Fence.r.rw);[M]
+let fence.w.r = [W];fencerel(Fence.w.r);[R]
+let fence.w.w = [W];fencerel(Fence.w.w);[W]
+let fence.w.rw = [W];fencerel(Fence.w.rw);[M]
+let fence.rw.r = [M];fencerel(Fence.rw.r);[R]
+let fence.rw.w = [M];fencerel(Fence.rw.w);[W]
+let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M]
+let fence.tso =
+ let f = fencerel(Fence.tso) in
+ ([W];f;[W]) | ([R];f;[M])
+
+let fence =
+ fence.r.r | fence.r.w | fence.r.rw |
+ fence.w.r | fence.w.w | fence.w.rw |
+ fence.rw.r | fence.rw.w | fence.rw.rw |
+ fence.tso
+
+(* Same address, no W to the same address in-between *)
+let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc)
+(* Read same write *)
+let rsw = rf^-1;rf
+(* Acquire, or stronger *)
+let AQ = Acq|AcqRel
+(* Release or stronger *)
+and RL = RelAcqRel
+(* All RCsc *)
+let RCsc = Acq|Rel|AcqRel
+(* Amo events are both R and W, relation rmw relates paired lr/sc *)
+let AMO = R & W
+let StCond = range(rmw)
+
+(*************)
+(* ppo rules *)
+(*************)
+
+(* Overlapping-Address Orderings *)
+let r1 = [M];po-loc;[W]
+and r2 = ([R];po-loc-no-w;[R]) \ rsw
+and r3 = [AMO|StCond];rfi;[R]
+(* Explicit Synchronization *)
+and r4 = fence
+and r5 = [AQ];po;[M]
+and r6 = [M];po;[RL]
+and r7 = [RCsc];po;[RCsc]
+and r8 = rmw
+(* Syntactic Dependencies *)
+and r9 = [M];addr;[M]
+and r10 = [M];data;[W]
+and r11 = [M];ctrl;[W]
+(* Pipeline Dependencies *)
+and r12 = [R];(addr|data);[W];rfi;[R]
+and r13 = [R];addr;[M];po;[W]
+
+let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13
+Total
+
+(* Notice that herd has defined its own rf relation *)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(********************************)
+(* Generate global memory order *)
+(********************************)
+
+let gmo0 = (* precursor: ie build gmo as an total order that include gmo0 *)
+ loc & (W\FW) * FW | # Final write after any write to the same location
+ ppo | # ppo compatible
+ rfe # includes herd external rf (optimization)
+
+(* Walk over all linear extensions of gmo0 *)
+with gmo from linearizations(M\IW,gmo0)
+
+(* Add initial writes upfront -- convenient for computing rfGMO *)
+let gmo = gmo | loc & IW * (M\IW)
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Compute rf according to the load value axiom, aka rfGMO *)
+let WR = loc & ([W];(gmo|po);[R])
+let rfGMO = WR \ (loc&([W];gmo);WR)
+
+(* Check equality of herd rf and of rfGMO *)
+empty (rf\rfGMO)|(rfGMO\rf) as RfCons
+
+(* Atomicity axiom *)
+let infloc = (gmo & loc)^-1
+let inflocext = infloc & ext
+let winside = (infloc;rmw;inflocext) & (infloc;rf;rmw;inflocext) & [W]
+empty winside as Atomic
riscv.cat, an alternative herd presentation of the RVWMO memory model (3/3)

Partial
+(***************)
+(* Definitions *)
+(***************)
+
+(* Define ppo *)
+include "riscv-defs.cat"
+
+(* Compute coherence relation *)
+include "cos-opt.cat"
+
+(**********)
+(* Axioms *)
+(**********)
+
+(* Sc per location *)
+acyclic co|rf|fr|po-loc as Coherence
+
+(* Main model axiom *)
+acyclic co|rfe|fr|ppo as Model
+
+(* Atomicity axiom *)
+empty rmw & (fre;coe) as Atomic
+B.3. An Operational Memory Model
+This is an alternative presentation of the RVWMO memory model in +operational style. It aims to admit exactly the same extensional +behavior as the axiomatic presentation: for any given program, admitting +an execution if and only if the axiomatic presentation allows it.
+The axiomatic presentation is defined as a predicate on complete +candidate executions. In contrast, this operational presentation has an +abstract microarchitectural flavor: it is expressed as a state machine, +with states that are an abstract representation of hardware machine +states, and with explicit out-of-order and speculative execution (but +abstracting from more implementation-specific microarchitectural details +such as register renaming, store buffers, cache hierarchies, cache +protocols, etc.). As such, it can provide useful intuition. It can also +construct executions incrementally, making it possible to interactively +and randomly explore the behavior of larger examples, while the +axiomatic model requires complete candidate executions over which the +axioms can be checked.
+The operational presentation covers mixed-size execution, with +potentially overlapping memory accesses of different power-of-two byte +sizes. Misaligned accesses are broken up into single-byte accesses.
The operational model, together with a fragment of the RISC-V ISA semantics (RV64I and A), is integrated into the rmem exploration tool (github.com/rems-project/rmem). rmem can explore litmus tests (see Section A.2) and small ELF binaries exhaustively, pseudo-randomly and interactively. In rmem, the ISA semantics is expressed explicitly in Sail (see github.com/rems-project/sail for the Sail language, and github.com/rems-project/sail-riscv for the RISC-V ISA model), and the concurrency semantics is expressed in Lem (see github.com/rems-project/lem for the Lem language).

rmem has a command-line interface and a web-interface. The web-interface runs entirely on the client side, and is provided online together with a library of litmus tests: www.cl.cam.ac.uk/. The command-line interface is faster than the web-interface, especially in exhaustive mode.
Below is an informal introduction to the model states and transitions. The description of the formal model starts in the next subsection.

Terminology: In contrast to the axiomatic presentation, here every memory operation is either a load or a store. Hence, AMOs give rise to two distinct memory operations, a load and a store. When used in conjunction with "instruction", the terms "load" and "store" refer to instructions that give rise to such memory operations. As such, both include AMO instructions. The term "acquire" refers to an instruction (or its memory operation) with the acquire-RCpc or acquire-RCsc annotation. The term "release" refers to an instruction (or its memory operation) with the release-RCpc or release-RCsc annotation.

Model states

A model state consists of a shared memory and a tuple of hart states.
+The shared memory state records all the memory store operations that +have propagated so far, in the order they propagated (this can be made +more efficient, but for simplicity of the presentation we keep it this +way).
+Each hart state consists principally of a tree of instruction instances, +some of which have been finished, and some of which have not. +Non-finished instruction instances can be subject to restart, e.g. if +they depend on an out-of-order or speculative load that turns out to be +unsound.
Conditional branch and indirect jump instructions may have multiple successors in the instruction tree. When such an instruction is finished, any un-taken alternative paths are discarded.
+Each instruction instance in the instruction tree has a state that +includes an execution state of the intra-instruction semantics (the ISA +pseudocode for this instruction). The model uses a formalization of the +intra-instruction semantics in Sail. One can think of the execution +state of an instruction as a representation of the pseudocode control +state, pseudocode call stack, and local variable values. An instruction +instance state also includes information about the instance’s memory and +register footprints, its register reads and writes, its memory +operations, whether it is finished, etc.
+Model transitions
+The model defines, for any model state, the set of allowed transitions, +each of which is a single atomic step to a new abstract machine state. +Execution of a single instruction will typically involve many +transitions, and they may be interleaved in operational-model execution +with transitions arising from other instructions. Each transition arises +from a single instruction instance; it will change the state of that +instance, and it may depend on or change the rest of its hart state and +the shared memory state, but it does not depend on other hart states, +and it will not change them. The transitions are introduced below and +defined in Section B.3.5, with a precondition and +a construction of the post-transition model state for each.
Transitions for all instructions:

- Fetch instruction: This transition represents a fetch and decode of a new instruction instance, as a program order successor of a previously fetched instruction instance (or the initial fetch address).

The model assumes the instruction memory is fixed; it does not describe the behavior of self-modifying code. In particular, the Fetch instruction transition does not generate memory load operations, and the shared memory is not involved in the transition. Instead, the model depends on an external oracle that provides an opcode when given a memory location.
+-
+
-
+
Register write: This is a write of a register value.
+
+ -
+
Register read: This is a read of a register value from the most recent +program-order-predecessor instruction instance that writes to that +register.
+
+ -
+
Pseudocode internal step: This covers pseudocode internal computation: arithmetic, function +calls, etc.
+
+ -
+
Finish instruction: At this point the instruction pseudocode is done, the instruction cannot be restarted, memory accesses cannot be discarded, and all memory +effects have taken place. For conditional branch and indirect jump +instructions, any program order successors that were fetched from an +address that is not the one that was written to the pc register are +discarded, together with the sub-tree of instruction instances below +them.
+
+
Transitions specific to load instructions:

- Initiate memory load operations: At this point the memory footprint of the load instruction is provisionally known (it could change if earlier instructions are restarted) and its individual memory load operations can start being satisfied.
- Satisfy memory load operation by forwarding from unpropagated stores: This partially or entirely satisfies a single memory load operation by forwarding from program-order-previous memory store operations.
- Satisfy memory load operation from memory: This entirely satisfies the outstanding slices of a single memory load operation, from memory.
- Complete load operations: At this point all the memory load operations of the instruction have been entirely satisfied and the instruction pseudocode can continue executing. A load instruction can be subject to being restarted until the transition. But, under some conditions, the model might treat a load instruction as non-restartable even before it is finished (e.g. see ).
Transitions specific to store instructions:

- Initiate memory store operation footprints: At this point the memory footprint of the store is provisionally known.
- Instantiate memory store operation values: At this point the memory store operations have their values and program-order-successor memory load operations can be satisfied by forwarding from them.
- Commit store instruction: At this point the store operations are guaranteed to happen (the instruction can no longer be restarted or discarded), and they can start being propagated to memory.
- Propagate store operation: This propagates a single memory store operation to memory.
- Complete store operations: At this point all the memory store operations of the instruction have been propagated to memory, and the instruction pseudocode can continue executing.
Transitions specific to sc instructions:

- Early sc fail: This causes the sc to fail, either a spontaneous fail or because it is not paired with a program-order-previous lr.
- Paired sc: This transition indicates the sc is paired with an lr and might succeed.
- Commit and propagate store operation of an sc: This is an atomic execution of the transitions Commit store instruction and Propagate store operation; it is enabled only if the stores from which the lr read have not been overwritten.
- Late sc fail: This causes the sc to fail, either a spontaneous fail or because the stores from which the lr read have been overwritten.
Transitions specific to AMO instructions:

- Satisfy, commit and propagate operations of an AMO: This is an atomic execution of all the transitions needed to satisfy the load operation, do the required arithmetic, and propagate the store operation.

Transitions specific to fence instructions:

- Commit fence

The transitions labeled  can always be taken eagerly, as soon as their precondition is satisfied, without excluding other behavior; the  cannot. Although Fetch instruction is marked with a , it can be taken eagerly as long as it is not taken infinitely many times.
An instance of a non-AMO load instruction, after being fetched, will typically experience the following transitions in this order:

- Register read
- Initiate memory load operations
- Satisfy memory load operation by forwarding from unpropagated stores and/or Satisfy memory load operation from memory (as many as needed to satisfy all the load operations of the instance)
- Complete load operations
- Register write
- Finish instruction

Before, between and after the transitions above, any number of Pseudocode internal step transitions may appear. In addition, a Fetch instruction transition for fetching the instruction in the next program location will be available until it is taken.
+This concludes the informal description of the operational model. The +following sections describe the formal operational model.
+B.3.1. Intra-instruction Pseudocode Execution
+The intra-instruction semantics for each instruction instance is +expressed as a state machine, essentially running the instruction +pseudocode. Given a pseudocode execution state, it computes the next +state. Most states identify a pending memory or register operation, +requested by the pseudocode, which the memory model has to do. The +states are (this is a tagged union; tags in small-caps):
| Load_mem(kind, address, size, load_continuation) | memory load operation |
| Early_sc_fail(res_continuation) | allow the sc to fail early |
| Store_ea(kind, address, size, next_state) | memory store effective address |
| Store_memv(mem_value, store_continuation) | memory store value |
| Fence(kind, next_state) | fence |
| Read_reg(reg_name, read_continuation) | register read |
| Write_reg(reg_name, reg_value, next_state) | register write |
| Internal(next_state) | pseudocode internal step |
| Done | end of pseudocode |
Here:

- mem_value and reg_value are lists of bytes;
- address is an integer of XLEN bits;
- for load/store, kind identifies whether it is lr/sc, acquire-RCpc/release-RCpc, acquire-RCsc/release-RCsc, or acquire-release-RCsc;
- for fence, kind identifies whether it is a normal or TSO fence, and (for normal fences) the predecessor and successor ordering bits;
- reg_name identifies a register and a slice thereof (start and end bit indices); and
- the continuations describe how the instruction instance will continue for each value that might be provided by the surrounding memory model (the load_continuation and read_continuation take the value loaded from memory and read from the previous register write, the store_continuation takes false for an sc that failed and true in all other cases, and res_continuation takes false if the sc fails and true otherwise).
+ + | +
+
+
+For example, given the load instruction |
+
Notice that writing to memory is split into two steps, Store_ea and Store_memv: the first one makes the memory footprint of the store provisionally known, and the second one adds the value to be stored. We ensure these are paired in the pseudocode (Store_ea followed by Store_memv), but there may be other steps between them.

It is observable that the Store_ea can occur before the value to be stored is determined. For example, for the litmus test LB+fence.r.rw+data-po to be allowed by the operational model (as it is by RVWMO), the first store in Hart 1 has to take the Store_ea step before its value is determined, so that the second store can see it is to a non-overlapping memory footprint, allowing the second store to be committed out of order without violating coherence.
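A minimal sketch of the pseudocode execution state as a tagged union is given below. The field types, the use of plain function pointers for continuations, and the integer encodings of kind are simplifying assumptions of this sketch; the real model expresses these states with Sail/Lem closures.

#include <stdint.h>
#include <stddef.h>

/* Sketch: the intra-instruction pseudocode execution state as a tagged
 * union.  Continuations are modeled as function pointers returning the
 * next state; all names and types here are simplifications. */
struct isa_state;
typedef struct isa_state *(*load_cont)(const uint8_t *mem_value, size_t len);
typedef struct isa_state *(*store_cont)(int success);  /* also used as res_continuation */
typedef struct isa_state *(*read_cont)(uint64_t reg_value);

enum isa_state_tag {
    LOAD_MEM,        /* memory load operation          */
    EARLY_SC_FAIL,   /* allow the sc to fail early     */
    STORE_EA,        /* memory store effective address */
    STORE_MEMV,      /* memory store value             */
    FENCE_STEP,      /* fence                          */
    READ_REG,        /* register read                  */
    WRITE_REG,       /* register write                 */
    INTERNAL,        /* pseudocode internal step       */
    DONE             /* end of pseudocode              */
};

struct isa_state {
    enum isa_state_tag tag;
    union {
        struct { int kind; uint64_t address; size_t size; load_cont k; }     load_mem;
        struct { store_cont k; }                                             early_sc_fail;
        struct { int kind; uint64_t address; size_t size;
                 struct isa_state *next; }                                   store_ea;
        struct { const uint8_t *mem_value; size_t len; store_cont k; }       store_memv;
        struct { int kind; struct isa_state *next; }                         fence;
        struct { int reg_name; read_cont k; }                                read_reg;
        struct { int reg_name; uint64_t reg_value; struct isa_state *next; } write_reg;
        struct { struct isa_state *next; }                                   internal;
    } u;
};

int main(void)
{
    struct isa_state s = { .tag = DONE };   /* an instruction whose pseudocode has ended */
    return s.tag == DONE ? 0 : 1;
}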
The pseudocode of each instruction performs at most one store or one load, except for AMOs that perform exactly one load and one store. Those memory accesses are then split apart into the architecturally atomic units by the hart semantics (see Initiate memory load operations and Initiate memory store operation footprints below).

Informally, each bit of a register read should be satisfied from a register write by the most recent (in program order) instruction instance that can write that bit (or from the hart's initial register state if there is no such write). Hence, it is essential to know the register write footprint of each instruction instance, which we calculate when the instruction instance is created (see the Fetch instruction action below). We ensure in the pseudocode that each instruction does at most one register write to each register bit, and also that it does not try to read a register value it just wrote.

Data-flow dependencies (address and data) in the model emerge from the fact that each register read has to wait for the appropriate register write to be executed (as described above).
B.3.2. Instruction Instance State

Each instruction instance i has a state comprising:

- program_loc, the memory address from which the instruction was fetched;
- instruction_kind, identifying whether this is a load, store, AMO, fence, branch/jump or a simple instruction (this also includes a kind similar to the one described for the pseudocode execution states);
- src_regs, the set of source reg_names (including system registers), as statically determined from the pseudocode of the instruction;
- dst_regs, the destination reg_names (including system registers), as statically determined from the pseudocode of the instruction;
- pseudocode_state (or sometimes just state for short), one of (this is a tagged union; tags in small-caps):

| Plain(isa_state) | ready to make a pseudocode transition |
|---|---|
| Pending_mem_loads(load_continuation) | requesting memory load operation(s) |
| Pending_mem_stores(store_continuation) | requesting memory store operation(s) |

- reg_reads, the register reads the instance has performed, including, for each one, the register write slices it read from;
- reg_writes, the register writes the instance has performed;
- mem_loads, a set of memory load operations, and for each one the as-yet-unsatisfied slices (the byte indices that have not been satisfied yet), and, for the satisfied slices, the store slices (each consisting of a memory store operation and subset of its byte indices) that satisfied it;
- mem_stores, a set of memory store operations, and for each one a flag that indicates whether it has been propagated (passed to the shared memory) or not; and
- information recording whether the instance is committed, finished, etc.

Each memory load operation includes a memory footprint (address and size). Each memory store operation includes a memory footprint, and, when available, a value.
+A load instruction instance with a non-empty mem_loads, for which all +the load operations are satisfied (i.e. there are no unsatisfied load +slices) is said to be entirely satisfied.
+Informally, an instruction instance is said to have fully determined
+data if the load (and sc
) instructions feeding its source registers
+are finished. Similarly, it is said to have a fully determined memory
+footprint if the load (and sc
) instructions feeding its memory
+operation address register are finished. Formally, we first define the
+notion of fully determined register write: a register write
+ from reg_writes of instruction instance
+ is said to be fully determined if one of the following
+conditions hold:
-
+
-
+
is finished; or
+
+ -
+
the value written by is not affected by a memory +operation that has made (i.e. a value loaded from memory +or the result of
+sc
), and, for every register read that + has made, that affects , the register +write from which read is fully determined (or + read from the initial register state).
+
Now, an instruction instance is said to have fully +determined data if for every register read from +reg_reads, the register writes that reads from are +fully determined. An instruction instance is said to +have a fully determined memory footprint if for every register read + from reg_reads that feeds into ’s +memory operation address, the register writes that reads +from are fully determined.
++ + | +
+
+
+The |
+
B.3.3. Hart State

The model state of a single hart comprises:

- hart_id, a unique identifier of the hart;
- initial_register_state, the initial register value for each register;
- initial_fetch_address, the initial instruction fetch address;
- instruction_tree, a tree of the instruction instances that have been fetched (and not discarded), in program order.
B.3.4. Shared Memory State

The model state of the shared memory comprises a list of memory store operations, in the order they propagated to the shared memory.

When a store operation is propagated to the shared memory it is simply added to the end of the list. When a load operation is satisfied from memory, for each byte of the load operation, the most recent corresponding store slice is returned.

For most purposes, it is simpler to think of the shared memory as an array, i.e., a map from memory locations to memory store operation slices, where each memory location is mapped to a one-byte slice of the most recent memory store operation to that location. However, this abstraction is not detailed enough to properly handle the sc instruction.
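The following sketch illustrates this list-of-stores representation and the byte-by-byte satisfaction rule. The fixed-size arrays and field names are assumptions of the sketch, not part of the formal model.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the shared-memory state described above: a list of memory
 * store operations in propagation order.  A load is satisfied byte by
 * byte from the most recently propagated store that covers that byte. */
struct store_op {
    uint64_t addr;      /* footprint start address             */
    size_t   size;      /* footprint size in bytes             */
    uint8_t  data[8];   /* value, one byte per footprint byte  */
};

struct shared_memory {
    struct store_op ops[64];   /* in the order they propagated */
    size_t          count;
};

static void propagate(struct shared_memory *m, struct store_op op)
{
    m->ops[m->count++] = op;   /* appending = propagating to memory */
}

/* Satisfy one byte of a load from memory; 0 stands in for the initial
 * memory value when no store covers the byte. */
static uint8_t satisfy_byte(const struct shared_memory *m, uint64_t addr)
{
    uint8_t value = 0;
    for (size_t i = 0; i < m->count; i++) {        /* later entries win:          */
        const struct store_op *s = &m->ops[i];     /* they propagated more recently */
        if (addr >= s->addr && addr < s->addr + s->size)
            value = s->data[addr - s->addr];
    }
    return value;
}

int main(void)
{
    struct shared_memory mem = { .count = 0 };
    propagate(&mem, (struct store_op){ .addr = 0x100, .size = 4,
                                       .data = { 1, 1, 1, 1 } });
    propagate(&mem, (struct store_op){ .addr = 0x102, .size = 2,
                                       .data = { 2, 2 } });
    /* A 4-byte load from 0x100 is satisfied from two different store slices. */
    for (uint64_t a = 0x100; a < 0x104; a++)
        printf("byte at 0x%llx = %u\n", (unsigned long long)a, satisfy_byte(&mem, a));
    return 0;
}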
B.3.5. Transitions
+Each of the paragraphs below describes a single kind of system +transition. The description starts with a condition over the current +system state. The transition can be taken in the current state only if +the condition is satisfied. The condition is followed by an action that +is applied to that state when the transition is taken, in order to +generate the new system state.
+Fetch instruction
+A possible program-order-successor of instruction instance + can be fetched from address loc if:
+-
+
-
+
it has not already been fetched, i.e., none of the immediate +successors of in the hart’s instruction_tree are from +loc; and
+
+ -
+
if ’s pseudocode has already written an address to +pc, then loc must be that address, otherwise loc is:
+++-
+
-
+
for a conditional branch, the successor address or the branch target +address;
+
+ -
+
for a (direct) jump and link instruction (
+jal
), the target address;
+ -
+
for an indirect jump instruction (
+jalr
), any address; and
+ -
+
for any other instruction, .
+
+
+ -
+
Action: construct a freshly initialized instruction instance + for the instruction in the program memory at loc, +with state Plain(isa_state), computed from the instruction pseudocode, +including the static information available from the pseudocode such as +its instruction_kind, src_regs, and dst_regs, and add + to the hart’s instruction_tree as a successor of +.
+The possible next fetch addresses (loc) are available immediately
+after fetching and the model does not need to wait for
+the pseudocode to write to pc; this allows out-of-order execution, and
+speculation past conditional branches and jumps. For most instructions
+these addresses are easily obtained from the instruction pseudocode. The
+only exception to that is the indirect jump instruction (jalr
), where
+the address depends on the value held in a register. In principle the
+mathematical model should allow speculation to arbitrary addresses here.
+The exhaustive search in the rmem
tool handles this by running the
+exhaustive search multiple times with a growing set of possible next
+fetch addresses for each indirect jump. The initial search uses empty
+sets, hence there is no fetch after indirect jump instruction until the
+pseudocode of the instruction writes to pc, and then we use that value
+for fetching the next instruction. Before starting the next iteration of
+exhaustive search, we collect for each indirect jump (grouped by code
+location) the set of values it wrote to pc in all the executions in
+the previous search iteration, and use that as possible next fetch
+addresses of the instruction. This process terminates when no new fetch
+addresses are detected.
Initiate memory load operations
+An instruction instance in state Plain(Load_mem(kind, +address, size, load_continuation)) can always initiate the +corresponding memory load operations. Action:
+-
+
-
+
Construct the appropriate memory load operations :
+++-
+
-
+
if address is aligned to size then is a single +memory load operation of size bytes from address;
+
+ -
+
otherwise, is a set of size memory load +operations, each of one byte, from the addresses +.
+
+
+ -
+
-
+
set mem_loads of to ; and
+
+ -
+
update the state of to +Pending_mem_loads(load_continuation).
+
+
In Section 18.1.1 it is said that misaligned memory accesses may be decomposed at any granularity. Here we decompose them to one-byte accesses as this granularity subsumes all others.
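A small sketch of this footprint construction is shown below; the struct and the fixed output bound are assumptions of the sketch.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the footprint construction above: an access aligned to its
 * size becomes a single memory operation; a misaligned one is broken
 * into size one-byte operations. */
struct mem_op { uint64_t addr; size_t size; };

static size_t make_footprint(uint64_t addr, size_t size, struct mem_op out[])
{
    if (addr % size == 0) {                 /* aligned: one operation of the full size */
        out[0] = (struct mem_op){ addr, size };
        return 1;
    }
    for (size_t i = 0; i < size; i++)       /* misaligned: size one-byte operations */
        out[i] = (struct mem_op){ addr + i, 1 };
    return size;
}

int main(void)
{
    struct mem_op ops[16];
    size_t n = make_footprint(0x1002, 4, ops);   /* a misaligned 4-byte access */
    for (size_t i = 0; i < n; i++)
        printf("op: addr=0x%llx size=%zu\n",
               (unsigned long long)ops[i].addr, ops[i].size);
    return 0;
}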
+Satisfy memory load operation by forwarding from unpropagated stores
+For a non-AMO load instruction instance in state +Pending_mem_loads(load_continuation), and a memory load operation + in that has +unsatisfied slices, the memory load operation can be partially or +entirely satisfied by forwarding from unpropagated memory store +operations by store instruction instances that are program-order-before + if:
+-
+
-
+
all program-order-previous
+fence
instructions with.sr
and.pw
+set are finished;
+ -
+
for every program-order-previous
+fence
instruction, , +with.sr
and.pr
set, and.pw
not set, if is not +finished then all load instructions that are program-order-before + are entirely satisfied;
+ -
+
for every program-order-previous
+fence.tso
instruction, +, that is not finished, all load instructions that are +program-order-before are entirely satisfied;
+ -
+
if is a load-acquire-RCsc, all program-order-previous +store-releases-RCsc are finished;
+
+ -
+
if is a load-acquire-release, all +program-order-previous instructions are finished;
+
+ -
+
all non-finished program-order-previous load-acquire instructions are +entirely satisfied; and
+
+ -
+
all program-order-previous store-acquire-release instructions are +finished;
+
+
Let be the set of all unpropagated memory store
+operation slices from non-sc
store instruction instances that are
+program-order-before and have already calculated the
+value to be stored, that overlap with the unsatisfied slices of
+, and which are not superseded by intervening store
+operations or store operations that are read from by an intervening
+load. The last condition requires, for each memory store operation slice
+ in from instruction
+:
-
+
-
+
that there is no store instruction program-order-between +and with a memory store operation overlapping +; and
+
+ -
+
that there is no load instruction program-order-between +and that was satisfied from an overlapping memory store +operation slice from a different hart.
+
+
Action:
+-
+
-
+
update to indicate that + was satisfied by ; and
+
+ -
+
restart any speculative instructions which have violated coherence as +a result of this, i.e., for every non-finished instruction + that is a program-order-successor of , +and every memory load operation of +that was satisfied from , if there exists a memory +store operation slice in , and +an overlapping memory store operation slice from a different memory +store operation in , and is not +from an instruction that is a program-order-successor of +, restart and its restart-dependents.
+
+
Where, the restart-dependents of instruction are:
+-
+
-
+
program-order-successors of that have data-flow +dependency on a register write of ;
+
+ -
+
program-order-successors of that have a memory load +operation that reads from a memory store operation of +(by forwarding);
+
+ -
+
if is a load-acquire, all the program-order-successors +of ;
+
+ -
+
if is a load, for every
+fence
, , with +.sr
and.pr
set, and.pw
not set, that is a +program-order-successor of , all the load instructions +that are program-order-successors of ;
+ -
+
if is a load, for every
+fence.tso
, , +that is a program-order-successor of , all the load +instructions that are program-order-successors of ; and
+ -
+
(recursively) all the restart-dependents of all the instruction +instances above.
+
+
Forwarding memory store operations to a memory load might satisfy only +some slices of the load, leaving other slices unsatisfied.
+A program-order-previous store operation that was not available when +taking the transition above might make provisionally +unsound (violating coherence) when it becomes available. That store will +prevent the load from being finished (see Finish instruction), and will cause it to +restart when that store operation is propagated (see Propagate store operation).
+A consequence of the transition condition above is that +store-release-RCsc memory store operations cannot be forwarded to +load-acquire-RCsc instructions: does not include +memory store operations from finished stores (as those must be +propagated memory store operations), and the condition above requires +all program-order-previous store-releases-RCsc to be finished when the +load is acquire-RCsc.
+Satisfy memory load operation from memory
For an instruction instance of a non-AMO load instruction or an AMO instruction in the context of the Satisfy, commit and propagate operations of an AMO transition, any memory load operation in  that has unsatisfied slices can be satisfied from memory if all the conditions of Satisfy memory load operation by forwarding from unpropagated stores are satisfied. Action: let  be the memory store operation slices from memory covering the unsatisfied slices of , and apply the action of Satisfy memory load operation by forwarding from unpropagated stores.
Note that Satisfy memory load operation by forwarding from unpropagated stores might leave some slices of the memory load operation unsatisfied; those will have to be satisfied by taking the transition again, or by taking Satisfy memory load operation from memory. Satisfy memory load operation from memory, on the other hand, will always satisfy all the unsatisfied slices of the memory load operation.
Complete load operations
+A load instruction instance in state +Pending_mem_loads(load_continuation) can be completed (not to be +confused with finished) if all the memory load operations + are entirely satisfied (i.e. there +are no unsatisfied slices). Action: update the state of +to Plain(load_continuation(mem_value)), where mem_value is assembled +from all the memory store operation slices that satisfied +.
+Early sc
fail
+An sc
instruction instance in state
+Plain(Early_sc_fail(res_continuation)) can always be made to fail.
+Action: update the state of to
+Plain(res_continuation(false)).
Paired sc
+An sc
instruction instance in state
+Plain(Early_sc_fail(res_continuation)) can continue its (potentially
+successful) execution if is paired with an lr
. Action:
+update the state of to Plain(res_continuation(true)).
Initiate memory store operation footprints
+An instruction instance in state Plain(Store_ea(kind, +address, size, next_state)) can always announce its pending memory +store operation footprint. Action:
+-
+
-
+
construct the appropriate memory store operations +(without the store value):
+++-
+
-
+
if address is aligned to size then is a single +memory store operation of size bytes to address;
+
+ -
+
otherwise, is a set of size memory store +operations, each of one-byte size, to the addresses +.
+
+
+ -
+
-
+
set to ; and
+
+ -
+
update the state of to Plain(next_state).
+
+
Note that after taking the transition above the memory store operations +do not yet have their values. The importance of splitting this +transition from the transition below is that it allows other +program-order-successor store instructions to observe the memory +footprint of this instruction, and if they don’t overlap, propagate out +of order as early as possible (i.e. before the data register value +becomes available).
+Instantiate memory store operation values
+An instruction instance in state +Plain(Store_memv(mem_value, store_continuation)) can always +instantiate the values of the memory store operations +. Action:
+-
+
-
+
split mem_value between the memory store operations +; and
+
+ -
+
update the state of to +Pending_mem_stores(store_continuation).
+
+
Commit store instruction
+An uncommitted instruction instance of a non-sc
store
+instruction or an sc
instruction in the context of the Commit and propagate store operation of an sc
+transition, in state Pending_mem_stores(store_continuation), can be
+committed (not to be confused with propagated) if:
-
+
-
+
has fully determined data;
+
+ -
+
all program-order-previous conditional branch and indirect jump +instructions are finished;
+
+ -
+
all program-order-previous
+fence
instructions with.sw
set are +finished;
+ -
+
all program-order-previous
+fence.tso
instructions are finished;
+ -
+
all program-order-previous load-acquire instructions are finished;
+
+ -
+
all program-order-previous store-acquire-release instructions are +finished;
+
+ -
+
if is a store-release, all program-order-previous +instructions are finished;
+
+ -
+
all program-order-previous memory access instructions have a fully +determined memory footprint;
+
+ -
+
all program-order-previous store instructions, except for
+sc
that failed, +have initiated and so have non-empty mem_stores; and
+ -
+
all program-order-previous load instructions have initiated and so have +non-empty mem_loads.
+
+
Action: record that i is committed.
Notice that if condition 8 is satisfied, conditions 9 and 10 are also satisfied, or will be satisfied after taking some eager transitions. Hence, requiring them does not strengthen the model. By requiring them, we guarantee that previous memory access instructions have taken enough transitions to make their memory operations visible for the condition check of , which is the next transition the instruction will take, making that condition simpler.
Propagate store operation
+For a committed instruction instance in state +Pending_mem_stores(store_continuation), and an unpropagated memory +store operation in +, can be +propagated if:
+-
+
-
+
all memory store operations of program-order-previous store +instructions that overlap with have already +propagated;
+
+ -
+
all memory load operations of program-order-previous load instructions +that overlap with have already been satisfied, and +(the load instructions) are non-restartable (see definition below); +and
+
+ -
+
all memory load operations that were satisfied by forwarding + are entirely satisfied.
+
+
Where a non-finished instruction instance is +non-restartable if:
+-
+
-
+
there does not exist a store instruction and an +unpropagated memory store operation of +such that applying the action of the Propagate store operation transition to + will result in the restart of ; and
+
+ -
+
there does not exist a non-finished load instruction +and a memory load operation of such +that applying the action of the Satisfy memory load operation by forwarding from unpropagated stores/Satisfy memory load operation from memory transition (even if + is already satisfied) to will result +in the restart of .
+
+
Action:
+-
+
-
+
update the shared memory state with ;
+
+ -
+
update to indicate that + was propagated; and
+
+ -
+
restart any speculative instructions which have violated coherence as +a result of this, i.e., for every non-finished instruction + program-order-after and every memory +load operation of that was satisfied +from , if there exists a memory store operation +slice in that overlaps with + and is not from , and + is not from a program-order-successor of +, restart and its restart-dependents +(see Satisfy memory load operation by forwarding from unpropagated stores).
+
+
Commit and propagate store operation of an sc
+An uncommitted sc
instruction instance , from hart
+, in state Pending_mem_stores(store_continuation), with
+a paired lr
that has been satisfied by some store
+slices , can be committed and propagated at the same
+time if:
-
+
-
+
is finished;
+
+ -
+
every memory store operation that has been forwarded to + is propagated;
+
+ -
+
the conditions of Commit store instruction are satisfied;
+
+ -
+
the conditions of Propagate store operation are satisfied (notice that an sc instruction can only have one memory store operation); and
+ -
+
for every store slice from , + has not been overwritten, in the shared memory, by a +store that is from a hart that is not , at any point +since was propagated to memory.
+
+
Action:
+-
+
-
+
apply the actions of Commit store instruction; and
+
+ -
+
apply the action of Propagate store operation.
+
+
Late sc
fail
+An sc
instruction instance in state
+Pending_mem_stores(store_continuation), that has not propagated its
+memory store operation, can always be made to fail. Action:
-
+
-
+
clear ; and
+
+ -
+
update the state of to +Plain(store_continuation(false)).
+
+
For efficiency, the rmem tool allows this transition only when it is not possible to take the Commit and propagate store operation of an sc transition. This does not affect the set of allowed final states, but when explored interactively, if the sc should fail one should use the Early sc fail transition instead of waiting for this transition.
Complete store operations
+A store instruction instance in state +Pending_mem_stores(store_continuation), for which all the memory store +operations in have been propagated, +can always be completed (not to be confused with finished). Action: +update the state of to +Plain(store_continuation(true)).
+Satisfy, commit and propagate operations of an AMO
+An AMO instruction instance in state +Pending_mem_loads(load_continuation) can perform its memory access if +it is possible to perform the following sequence of transitions with no +intervening transitions:
+-
+
- + + +
- + + +
-
+
Pseudocode internal step (zero or more times)
+
+ - + + +
- + + +
- + + +
- + + +
and in addition, the condition of Finish instruction, with the exception of not requiring + to be in state Plain(Done), holds after those +transitions. Action: perform the above sequence of transitions (this +does not include Finish instruction), one after the other, with no intervening +transitions.
Notice that program-order-previous stores cannot be forwarded to the load of an AMO. This is simply because the sequence of transitions above does not include the forwarding transition. But even if it did include it, the sequence would fail when trying to do the Propagate store operation transition, as this transition requires all program-order-previous store operations to overlapping memory footprints to be propagated, and forwarding requires the store operation to be unpropagated.

In addition, the store of an AMO cannot be forwarded to a program-order-successor load. Before taking the transition above, the store operation of the AMO does not have its value and therefore cannot be forwarded; after taking the transition above the store operation is propagated and therefore cannot be forwarded.
Commit fence
+A fence instruction instance in state +Plain(Fence(kind, next_state)) can be committed if:
+-
+
-
+
if is a normal fence and it has
+.pr
set, all +program-order-previous load instructions are finished;
+ -
+
if is a normal fence and it has
+.pw
set, all +program-order-previous store instructions are finished; and
+ -
+
if is a
+fence.tso
, all program-order-previous load +and store instructions are finished.
+
Action:
+-
+
-
+
record that is committed; and
+
+ -
+
update the state of to Plain(next_state).
+
+
Register read
+An instruction instance in state +Plain(Read_reg(reg_name, read_cont)) can do a register read of +reg_name if every instruction instance that it needs to read from has +already performed the expected reg_name register write.
+Let read_sources include, for each bit of reg_name, the write to +that bit by the most recent (in program order) instruction instance that +can write to that bit, if any. If there is no such instruction, the +source is the initial register value from initial_register_state. Let +reg_value be the value assembled from read_sources. Action:
+-
+
-
+
add reg_name to with +read_sources and reg_value; and
+
+ -
+
update the state of to Plain(read_cont(reg_value)).
+
+
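The per-bit source resolution described above can be sketched as follows; representing register writes as (program-order index, bit mask, value) triples is an assumption of the sketch.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of register-read source resolution: each bit of the read comes
 * from the most recent program-order-previous write that can write that
 * bit, falling back to the initial register state.  writes[] is assumed
 * to be listed in program order. */
struct reg_write { size_t po; uint64_t mask, value; };

static uint64_t read_reg(uint64_t initial,
                         const struct reg_write w[], size_t n, size_t reader_po)
{
    uint64_t value = initial, resolved = 0;
    for (size_t i = n; i-- > 0; ) {                 /* scan most recent write first */
        if (w[i].po >= reader_po) continue;         /* not a program-order predecessor */
        uint64_t take = w[i].mask & ~resolved;      /* bits still unresolved */
        value = (value & ~take) | (w[i].value & take);
        resolved |= take;
    }
    return value;
}

int main(void)
{
    /* Writes in program order: a full-width write, then a write to the low byte. */
    struct reg_write writes[] = {
        { .po = 0, .mask = ~0ull,   .value = 0x1111111111111111ull },
        { .po = 1, .mask = 0xffull, .value = 0x22ull },
    };
    printf("read sees %#llx\n",
           (unsigned long long)read_reg(0, writes, 2, /*reader_po=*/2));
    return 0;
}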
Register write
+An instruction instance in state +Plain(Write_reg(reg_name, reg_value, next_state)) can always do a +reg_name register write. Action:
+-
+
-
+
add reg_name to with + and reg_value; and
+
+ -
+
update the state of to Plain(next_state).
+
+
where is a pair of the set of all read_sources from +, and a flag that is true iff + is a load instruction instance that has already been +entirely satisfied.
+Pseudocode internal step
+An instruction instance in state +Plain(Internal(next_state)) can always do that pseudocode-internal +step. Action: update the state of to +Plain(next_state).
Finish instruction
A non-finished instruction instance i in state Plain(Done) can be finished if:
- if i is a load instruction:
  - all program-order-previous load-acquire instructions are finished;
  - all program-order-previous fence instructions with .sr set are finished;
  - for every program-order-previous fence.tso instruction, f, that is not finished, all load instructions that are program-order-before f are finished; and
  - it is guaranteed that the values read by the memory load operations of i will not cause coherence violations, i.e., for any program-order-previous instruction instance i', let the combined footprint be the union of the footprints of the propagated memory store operations from store instructions program-order-between i' and i and of the fixed memory store operations that were forwarded to i from store instructions program-order-between i' and i, including i', and let the uncovered footprint be the complement of the combined footprint in the memory footprint of i (a sketch of this footprint computation is given after this transition). If the uncovered footprint is not empty:
    - i' has a fully determined memory footprint;
    - i' has no unpropagated memory store operation that overlaps with the uncovered footprint; and
    - if i' is a load with a memory footprint that overlaps with the uncovered footprint, then all the memory load operations of i' that overlap with the uncovered footprint are satisfied and i' is non-restartable (see the Propagate store operation transition for how to determine whether an instruction is non-restartable).
    Here, a memory store operation is called fixed if the store instruction has fully determined data.
- i has fully determined data; and
- if i is not a fence, all program-order-previous conditional branch and indirect jump instructions are finished.
Action:
1. if i is a conditional branch or indirect jump instruction, discard any untaken paths of execution, i.e., remove all instruction instances that are not reachable by the branch/jump taken in instruction_tree; and
2. record the instruction as finished, i.e., set finished to true.
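The footprint bookkeeping in the coherence condition above can be made concrete by modelling memory footprints as sets of byte addresses. The helper below is an illustrative sketch only; how the two groups of store operations are collected is assumed rather than specified here.

def uncovered_footprint(i_footprint, propagated_between, forwarded_fixed):
    """i_footprint: set of byte addresses accessed by the load i.
    propagated_between: footprints (address sets) of propagated store
    operations from store instructions program-order-between i' and i.
    forwarded_fixed: footprints of fixed store operations forwarded to i from
    store instructions program-order-between i' and i, including i'.
    Returns the part of i's footprint covered by neither group; the extra
    checks on i' only need to hold when this set is non-empty."""
    covered = set()
    for fp in propagated_between:
        covered |= fp
    for fp in forwarded_fixed:
        covered |= fp
    return i_footprint - covered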
B.3.6. Limitations
- The model covers user-level RV64I and RV64A. In particular, it does not support the misaligned atomicity granule PMA or the total store ordering extension "Ztso". It should be trivial to adapt the model to RV32I/A and to the G, Q, and C extensions, but we have never tried it. This will involve, mostly, writing Sail code for the instructions, with minimal, if any, changes to the concurrency model.
- The model covers only normal memory accesses (it does not handle I/O accesses).
- The model does not cover TLB-related effects.
- The model assumes the instruction memory is fixed. In particular, the Fetch instruction transition does not generate memory load operations, and the shared memory is not involved in the transition. Instead, the model depends on an external oracle that provides an opcode when given a memory location.
- The model does not cover exceptions, traps, and interrupts.
Appendix C: Vector Assembly Code Examples
Appendix D: Calling Convention for Vector State (Not authoritative - Placeholder Only)