Writing NES Emulator in Rust
Writing NES Emulator in Rust
html
The NES (Nintendo Entertainment System) was one of the most popular gaming platforms
throughout the 80s and the 90s. The platform and the emergent ecosystem was and still
is a huge cultural phenomenon. The device itself had relatively simple hardware (judging
from the modern days), and it's incredible how much was made out of it.
This series is about creating an emulator capable of running and playing �rst-gen NES
games, such as:
• PacMan
• Donkey Kong
• Ice Climber
• Super Mario Bros
• etc.
The goal is to play NES games on the hardware that we have, meaning we have to
simulate NES hardware. The process of simulation alone means that we are introducing
signi�cant performance overhead in comparison to running native applications. By
choosing rust, we hope to get some additional performance budget for our needs. NES
hardware specs are pretty modest by today's standards. For example, the NES CPU is
about 3000 times slower than modern CPUs. Emulating that in any language should not
be a problem. Some folks were able to get playable performance on an emulator written
in Python. But it is still nice to have extra power for free.
I expect the reader to have basic knowledge of the Rust language and understanding
primary language constructs and platform capabilities. I'll introduce some features as we
go, but others have to be learned elsewhere.
It's also assumed that the reader has a basic understanding of bit arithmetic, boolean
logic, and how binary and hexadecimal numbering systems work. Again, NES is a
relatively simple platform, and the NES CPU instructions set is small and straightforward,
but some basic understanding of computer systems is required.
Special thanks to Spencer Burris and Kirill Gusakov for reviews, edits and helpfull suggestions!
• Applications are running business logic and interact with hardware through an
Operating System.
• The Operating System communicates with the hardware using machine language.
• On a hardware level, each device can be seen as an array of memory elements,
processing units, or both. From this perspective, NES joypad is nothing more than
an array of eight 1-bit items, each representing a pressed/released state of a button
• Layers below ALU and Memory elements are less of an interest to us. On a
hardware level, it all comes down to logic gates and their arrangements.
If you want to get intimate knowledge of how computers are composed, starting
from the basic principles of boolean logic, I highly recommend the book:
"The Elements of Computing Systems. Building a Modern Computer from First
Principles" by Noam Nisan, Shimon Schocken.
Luckily for us, NES doesn't have an Operating System. That means that the Application
layer (Gamezzz) communicates with hardware directly using machine language.
As you can see, machine language is the interface between our emulator and our NES
games.
• Central Processing Unit ( ) - the NES's 2A03 is a modi�ed version of the 6502
chip. As with any CPU, the goal of this module is to execute the main program
instructions.
• Picture Processing Unit ( ) - was based on the 2C02 chip made by Ricoh, the
same company that made CPU. This module's primary goal is to draw the current
state of a game on a TV Screen.
• Both CPU and PPU have access to their 2 KiB (2048 bytes) banks of Random Access
Memory ( )
• Audio Processing Unit ( ) - the module is a part of 2A03 chip and is responsible
for generating speci�c �ve-channel based sounds, that made NES chiptunes so
recognizable.
• Cartridges - were an essential part of the platform because the console didn't have
an operating system. Each cartridge carried at least two large ROM chips - the
Character ROM (CHR ROM) and the Program ROM (PRG ROM). The former stored a
game's video graphics data, the latter stored CPU instructions - the game's code. (in
reality, when a cartridge is inserted into the slot CHR Rom is connected directly to
PPU, while PRG Rom is connected directly to CPU) The later version of cartridges
carried additional hardware (ROM and RAM) accessible through so-called mappers.
That explains why later games had provided signi�cantly better gameplay and
visuals despite running on the same console hardware.
• Gamepads - have a distinct goal to read inputs from a gamer and make it available
for game logic. As we will see later, the fact that the gamepad for the 8-bit platform
has only eight buttons is not a coincidence.
What's interesting is that CPU, PPU, and APU are independent of each other. This fact
makes NES a distributed system in which separate components have to coordinate to
generate one seamless gaming experience.
We can use the schema of the main NES components as an implementation plan for our
emulator.
We have to build a simulation of all of these modules. The goal is to have something
playable as soon as possible. Using an iterative approach, we will incrementally add
features to achieve this goal.
Roughly estimating the e�ort required for each component, PPU will be the hardest, and
the BUS the easiest.
Writing a perfect emulator is a never-ending quest. But this quest has a start, and we will
start by emulating the CPU.
The goal of this chapter is to get our �rst NES game up and running. We are going to play
the Snake game. The source code with comments can be found in this gist.
CPU is the heart of any computer system. It's the CPUs job to run program instructions
and orchestrate all of the available hardware modules to provide the full experience.
Despite PPU and APU running their independent circuits, they still have to march under
CPUs beat and execute commands issued by the CPU.
Before jumping into implementation, we need to brie�y discuss which resources are
available to the CPU to do its work.
The only two resources that the CPU has access to are the Memory Map and CPU
Registers.
From a programming standpoint, the memory map is just a continuous array of 1-byte
cells. NES CPU uses 16-bit for memory addressing, which means that it can address 65536
di�erent memory cells.
As we've seen before, the NES platform had only 2 KiB of RAM connected to the CPU.
Memory access is relatively slow, NES CPU has a few internal memory slots called
registers with signi�cantly lower access delay.
• Program Counter (PC) - holds the address for the next machine language instruction
to be executed.
• Stack Pointer - Memory space [0x0100 .. 0x1FF] is used for stack. The stack pointer
holds the address of the top of that space. NES Stack (as all stacks) grows from top
to bottom: when a byte gets pushed to the stack, SP register decrements. When a
byte is retrieved from the stack, SP register increments.
• Accumulator (A) - stores the results of arithmetic, logic, and memory access
operations. It used as an input parameter for some operations.
• Index Register X (X) - used as an o�set in speci�c memory addressing modes (more
on this later). Can be used for auxiliary storage needs (holding temp values, being
used as a counter, etc.)
• Processor status (P) - 8-bit register represents 7 status �ags that can be set or unset
depending on the result of the last executed instruction (for example Z �ag is set (1)
if the result of an operation is 0, and is unset/erased (0) otherwise)
Each CPU comes with a prede�ned hard-wired instruction set that de�nes everything a
CPU can do.
CPU receives instructions from the application layer in the form of machine codes. And
you can think of machine language as a thin layer connecting the software with the
hardware.
• https://www.nesdev.org/obelisk-6502-guide/reference.html
• http://www.6502.org/tutorials/6502opcodes.html
I tend to use both of the links. The pages provide full specs of available CPU features and
their machine codes.
I highly recommend reading this interactive tutorial on 6502 instructions before moving
on.
6502 chip is a relatively simple CPU; it supports only six types of commands and about 64
unique commands. Because some of the instructions have multiple versions for di�erent
memory addressing modes, it results in about 150 machine code operations that we are
to implement.
NES console had a custom chip 2A03 that is based on 6502, but has
noticeable di�erences:
To keep things simple, we would need to implement support for 256 di�erent
machine instructions.
The good news is that there are a lot of similarities between instructions. Once we
have the foundation in place, we will be constantly reusing them to implement the
whole set.
Let's try to interpret our �rst program. The program looks like this:
a9 c0 aa e8 00
Now it's more readable: it consists of 4 instructions, and the �rst instruction has a
parameter.
Let's interpret what's going on by referencing the opcodes from the 6502 Instruction
Reference
It looks like that the command loads a hexadecimal value 0xC0 into the accumulator CPU
register. It also has to update some bits in Processor Status register P (namely, bit 1 - Zero
Flag and bit 7 - Negative Flag).
spec shows that the opcode has one parameter. The instruction size is 2
bytes: one byte is for operation code itself (standard for all NES CPU opcodes), and
the other is for a parameter.
NES Opcodes can have no explicit parameters or one explicit parameter. For some
operations, the explicit parameter can take 2 bytes. And in that case, the machine
instruction would occupy 3 bytes.
Let's sketch out how our CPU might look like from a high-level perspective:
impl CPU {
pub fn new() -> Self {
CPU {
register_a: 0,
status: 0,
program_counter: 0,
}
}
Note that we introduced a program counter register that will help us track our current
position in the program. Also, note that the interpret method takes a mutable reference
to self as we know that we will need to modify during the execution.
loop {
let opscode = program[self.program_counter as usize];
self.program_counter += 1;
match opscode {
_ => todo!()
}
}
}
So far so good. Endless loop? Nah, it's gonna be alright. Now let's implement the
opcode:
match opscode {
0xA9 => {
let param = program[self.program_counter as usize];
self.program_counter +=1;
self.register_a = param;
if self.register_a == 0 {
self.status = self.status | 0b0000_0010;
} else {
self.status = self.status & 0b1111_1101;
}
}
_ => todo!()
}
We are not doing anything crazy here, just following the spec and using rust constructs to
do binary arithmetic.
It's essential to set or unset CPU �ag status depending on the results.
Because of the endless loop, we won't be able to test this functionality yet. Before moving
on, let's quickly implement opcode:
match opcode {
// ...
0x00 => {
return;
}
_ => todo!()
}
#[cfg(test)]
mod test {
use super::*;
#[test]
fn test_0xa9_lda_immediate_load_data() {
let mut cpu = CPU::new();
cpu.interpret(vec![0xa9, 0x05, 0x00]);
assert_eq!(cpu.register_a, 0x05);
assert!(cpu.status & 0b0000_0010 == 0b00);
assert!(cpu.status & 0b1000_0000 == 0);
}
#[test]
fn test_0xa9_lda_zero_flag() {
let mut cpu = CPU::new();
cpu.interpret(vec![0xa9, 0x00, 0x00]);
assert!(cpu.status & 0b0000_0010 == 0b10);
}
}
This one is also straightforward: copy a value from A to X, and update status register.
impl CPU {
// ...
pub fn interpret(&mut self, program: Vec<u8>) {
// ...
match opscode {
//...
0xAA => {
self.register_x = self.register_a;
if self.register_x == 0 {
self.status = self.status | 0b0000_0010;
} else {
self.status = self.status & 0b1111_1101;
}
}
}
}
}
#[test]
fn test_0xaa_tax_move_a_to_x() {
let mut cpu = CPU::new();
cpu.register_a = 10;
cpu.interpret(vec![0xaa, 0x00]);
assert_eq!(cpu.register_x, 10)
}
Before moving to the next opcode, we have to admit that our code is quite convoluted:
Let's �x that:
// ...
fn lda(&mut self, value: u8) {
self.register_a = value;
self.update_zero_and_negative_flags(self.register_a);
}
fn tax(&mut self) {
self.register_x = self.register_a;
self.update_zero_and_negative_flags(self.register_x);
}
self.lda(param);
}
_ => todo!(),
}
}
}
Ok. The code looks more manageable now. Hopefully, all tests are still passing.
I cannot emphasize enough the importance of writing tests for all of the opcodes we are
implementing. The operations themselves are almost trivial, but tiny mistakes can cause
unpredictable ripples in game logic.
Implementing that last opcode from the program should not be a problem, and I'll leave
this exercise to you.
#[test]
fn test_5_ops_working_together() {
let mut cpu = CPU::new();
cpu.interpret(vec![0xa9, 0xc0, 0xaa, 0xe8, 0x00]);
assert_eq!(cpu.register_x, 0xc1)
}
#[test]
fn test_inx_overflow() {
let mut cpu = CPU::new();
cpu.register_x = 0xff;
cpu.interpret(vec![0xe8, 0xe8, 0x00]);
assert_eq!(cpu.register_x, 1)
}
In our initial implementation, the CPU receives instructions as a separate input stream,
this is not how things work in an actual NES.
NES implements typical von Neumann architecture: both data and the instructions are
stored in memory. The executed code is data from the CPU perspective, and any data can
potentially be interpreted as executable code. There is no way CPU can tell the di�erence.
The only mechanism the CPU has is a register that keeps track of a
position in the instructions stream.
impl CPU {
match opscode {
//..
}
}
}
}
We have just created an array for the whole 64 KiB of address space. As discussed in a
future chapter, CPU has only 2 KiB of RAM, and everything else is reserved for memory
mapping.
We load program code into memory, starting at 0x8000 address. We've discussed that
[0x8000 .. 0xFFFF] is reserved for Program ROM, and we can assume that the instructions
stream should start somewhere in this space (not necessarily at 0x8000).
NES platform has a special mechanism to mark where the CPU should start the
execution. Upon inserting a new cartridge, the CPU receives a special signal called "Reset
interrupt" that instructs CPU to:
Before implementing that, I should brie�y mention that NES CPU can address 65536
memory cells. It takes 2 bytes to store an address. NES CPU uses Little-Endian addressing
rather than Big-Endian. That means that the 8 least signi�cant bits of an address will be
stored before the 8 most signi�cant bits.
Real Address
Address packed in big-endian
Address packed in little-endian
For example, the instruction to read data from memory cell 0x8000 into A register would
look like:
Now we can implement functionality properly. We will have to adjust the load and
load_and_run functions:
• method should load a program into PRG ROM space and save the reference to
the code into 0xFFFC memory cell
• method should restore the state of all registers, and initialize
self.program_counter = self.mem_read_u16(0xFFFC);
}
Remember LDA opcodes we implemented last chapter? That single mnemonic (LDA)
actually can be translated into 8 di�erent machine instructions depending on the type of
the parameter:
• here
• and here
In short, the addressing mode is a property of an instruction that de�nes how the CPU
should interpret the next 1 or 2 bytes in the instruction stream.
• ($A5) has a size of 2 bytes, one for opcode itself, and one for a
parameter. That's why zero page addressing can't reference memory above the �rst
255 bytes.
• ($AD) has 3 bytes, the Address occupies 2 bytes making it
possible to reference all 65536 memory cells. (NOTE: 2-byte the parameter will be
packed according to little-endian rules)
There are no opcodes that occupy more than 3 bytes. CPU instruction size can be either
1, 2, or 3 bytes.
The majority of CPU instructions provide more than one addressing alternative. Ideally,
we don't want to re-implement the same addressing mode logic for every CPU
instruction.
Let's try to codify how the CPU should interpret di�erent addressing modes:
#[derive(Debug)]
#[allow(non_camel_case_types)]
pub enum AddressingMode {
Immediate,
ZeroPage,
ZeroPage_X,
ZeroPage_Y,
Absolute,
Absolute_X,
Absolute_Y,
Indirect_X,
Indirect_Y,
NoneAddressing,
}
impl CPU {
// ...
fn get_operand_address(&self, mode: &AddressingMode) -> u16 {
match mode {
AddressingMode::Immediate => self.program_counter,
AddressingMode::Absolute =>
self.mem_read_u16(self.program_counter),
AddressingMode::ZeroPage_X => {
let pos = self.mem_read(self.program_counter);
let addr = pos.wrapping_add(self.register_x) as u16;
addr
}
AddressingMode::ZeroPage_Y => {
let pos = self.mem_read(self.program_counter);
let addr = pos.wrapping_add(self.register_y) as u16;
addr
}
AddressingMode::Absolute_X => {
let base = self.mem_read_u16(self.program_counter);
let addr = base.wrapping_add(self.register_x as u16);
addr
}
AddressingMode::Absolute_Y => {
let base = self.mem_read_u16(self.program_counter);
let addr = base.wrapping_add(self.register_y as u16);
addr
}
AddressingMode::Indirect_X => {
let base = self.mem_read(self.program_counter);
AddressingMode::NoneAddressing => {
panic!("mode {:?} is not supported", mode);
}
}
self.register_a = value;
self.update_zero_and_negative_flags(self.register_a);
}
match code {
0xA9 => {
self.lda(&AddressingMode::Immediate);
self.program_counter += 1;
}
0xA5 => {
self.lda(&AddressingMode::ZeroPage);
self.program_counter += 1;
}
0xAD => {
self.lda(&AddressingMode::Absolute);
self.program_counter += 2;
}
//....
}
}
}
#[test]
fn test_lda_from_memory() {
let mut cpu = CPU::new();
cpu.mem_write(0x10, 0x55);
assert_eq!(cpu.register_a, 0x55);
}
Using the same foundation, we can quickly implement instruction, which copies the
value from register A to memory.
0x95 => {
self.sta(AddressingMode::ZeroPage_X);
self.program_counter += 1;
}
//..
}
}
Before we wrap up, I'd like to mention that the current method is somewhat i�y for a
few reasons. First, the requirement to increment program_counter by 1 (or 2) after some
of the operations is error-prone. If we made an error, it would be tough to spot it.
Second, wouldn't it be more readable and convenient if we could group all "LDA"
operations under a single match statement?
Lastly, all we are doing is hard-coding Instructions spec into Rust code. The translation is
a bit hard to compare. Keeping the code in some table form looks like a more
manageable approach.
Implementing the rest of the 6502 CPU instructions should be relatively straightforward. I
won't go into detail for all of them.
• is perhaps the most complicated instruction from a logic �ow perspective. Note
that the spec contains details regarding decimal mode that can be entirely skipped
because the Ricoh modi�cation of the chip didn't support decimal mode.
This article goes into a detailed overview of how binary arithmetic is implemented in
6502: The 6502 over�ow �ag explained mathematically
For the curious and brave souls: The 6502 CPU's over�ow �ag explained at the
silicon level
• , and have to deal with 2 bit B-�ag. Except for interrupts execution,
those are the only commands that directly in�uence (or being directly in�uenced by)
the 5th bit of
If you get stuck, you can always look up the implementation of 6502 instruction set here:
Great, you've made it this far. What we are going to do next is to take a bit of a detour.
The snake game was introduced in this article: Easy 6502. In fact, it is not a true NES
game. It is built with 6502 instructions and uses quite di�erent memory mappings.
However, it's a fun way to validate that our CPU is truly functional, and it's fun to play the
�rst game.
The majority of logic we are to implement now would be reused some way or another
when we will be implementing rendering in PPU, so nothing is wasted e�ort.
Game
Execution code
code
Important note: the game expects that the execution code would be located right after the
output region. This means that we need to update our CPU.load function:
impl CPU {
// ...
pub fn load(&mut self, program: Vec<u8>) {
self.memory[0x0600..(0x0600 +
game_code.len())].copy_from_slice(&program[..]);
self.mem_write_u16(0xFFFC, 0x0600);
}
We need to intercept this cycle to get user input into the input mapping space and render
the state of the screen. Let's modify our CPU run cycle:
impl CPU {
// ...
pub fn run(&mut self) {
self.run_with_callback(|_| {});
}
loop {
callback(self);
//....
match code {
//...
}
// ..
}
}
}
Now, the client code can provide a callback that will be executed before every opcode
interpretation cycle.
fn main() {
let game_code = vec![
// ...
];
For our input/output, we would be using a cross-platform library that's popular in game
Luckily for us, there is a convenient crate that provides Rust bindings for the library: rust-
sdl2
# ...
[dependencies]
lazy_static = "1.4.0"
bitflags = "1.2.1"
sdl2 = "0.34.0"
rand = "=0.7.3"
use sdl2::event::Event;
use sdl2::EventPump;
use sdl2::keyboard::Keycode;
use sdl2::pixels::Color;
use sdl2::pixels::PixelFormatEnum;
fn main() {
// init sdl2
let sdl_context = sdl2::init().unwrap();
let video_subsystem = sdl_context.video().unwrap();
let window = video_subsystem
.window("Snake game", (32.0 * 10.0) as u32, (32.0 * 10.0) as u32)
.position_centered()
.build().unwrap();
//...
}
Because our game screen is tiny (32x32 pixels), we set the scale factor to 10.
Using .unwrap() is justi�able here because it's the outer layer of our application.
There are no other layers that potentially can handle Err values and do something
about it.
//...
let creator = canvas.texture_creator();
let mut texture = creator
.create_texture_target(PixelFormatEnum::RGB24, 32, 32).unwrap();
//...
We are telling SDL that our texture has a size of 32x32, and that each pixel is to be
represented by 3 bytes (for R, G and B colors). This means that the texture will be
represented by a 32x32x3 array of bytes.
3. Rendering the screen state is a bit trickier. Our program assumes 1 byte per pixel,
while SDL expects 3 bytes.
From the game point of view it doesn't matter much how we map colors, the only
two color maps that are essential are:
• 0 - Black
• 1 - White
Now we can transform the CPU screen map into 3 bytes like this:
A caveat with this is that we don't want to force updating the SDL canvas if the screen
state hasn't changed. Remember that the CPU will call our callback after each instruction,
and most of the time those instructions have nothing to do with the screen. Meanwhile,
updating the canvas is a heavy operation.
We can keep track of the screen state by creating a temp bu�er that will be populated
from the screen state. Only in the case of screen changes, we would update SDL canvas.
fn main() {
// ...init sdl
// ...load program
cpu.run_with_callback(move |cpu| {
handle_user_input(cpu, &mut event_pump);
cpu.mem_write(0xfe, rng.gen_range(1, 16));
::std::thread::sleep(std::time::Duration::new(0, 70_000));
});
}
The last sleep statement was added to slow things down so that the game runs at a
playable pace.
And there you have it, the �rst game running on our emulator.
CPU gets access to memory (including memory-mapped spaces) using three buses:
A bus itself is not a device; it's a wiring between platform components. Therefore, we
don't need to implement it as an independent module as Rust allows us to "wire" the
components directly. However, it's a convenient abstraction where we can o�oad quite a
bit of responsibility to keep the CPU code cleaner.
In our current code, the CPU has direct access to RAM space, and it is oblivious to
memory-mapped regions.
The good news is that we don't need to write a full-blown emulation of data, control, and
address buses. Because it's not a hardware chip, no logic expects any speci�c behavior
from the BUS. So we can just codify coordination and signal routing.
Mirroring is a side-e�ect of NES trying to keep things as cheap as possible. It can be seen
as an address space being mapped to another address space.
•
•
•
The reason for mirroring is the fact that CPU RAM has only 2 KiB of ram space, and only
11 bits is enough for addressing RAM space. Naturally, the NES motherboard had only 11
addressing tracks from CPU to RAM.
CPU however has addressing space reserved for RAM space - and
that's 13 bits. As a result, the 2 highest bits have no e�ect when accessing RAM. Another
way of saying this, when CPU is requesting address at (13 bits)
the RAM chip would receive only (11 bits) via the address bus.
So despite mirroring looking wasteful, it was a side-e�ect of the wiring, and on real
hardware it cost nothing. Emulators, on the other hand, have to do extra work to provide
the same behavior.
Long story short, the BUS needs to zero out the highest 2 bits if it receives a request in
the range of
So let's introduce a new module Bus, that will have direct access to RAM.
impl Bus {
pub fn new() -> Self{
Bus {
cpu_vram: [0; 2048]
}
}
}
The last step is to replace direct access to RAM from CPU with access via BUS
impl CPU {
pub fn new() -> Self {
CPU {
register_a: 0,
register_x: 0,
register_y: 0,
stack_pointer: STACK_RESET,
program_counter: 0,
status: CpuFlags::from_bits_truncate(0b100100),
bus: Bus::new(),
}
}
// ...
}
The �rst version of the cartridges was relatively simple. They carried two banks of ROM
memory: PRG ROM for code and CHR ROM for visual graphics.
Upon insertion into the console, PRG ROM gets connected to CPU, and CHR ROM gets
connected to PPU. So on a hardware level, CPU wasn't able to access CHR ROM directly,
and PPU wasn't able to access PRG ROM.
• mappers to provide access to extended ROM memory: both CHR ROM and
PRG ROM
• extra RAM (with a battery) to save and restore a game state
However, we won't be working with cartridges. Emulators work with �les containing
dumps of ROM spaces.
There are several �le formats for ROM dumps; the most popular one is iNES designed by
Marat Fayzullin
• 16-byte header
• optional 512 bytes of the so-called Trainer, a data section created by Famicom
copiers to keep their own mapping. We can skip this section if it is present.
• Section containing PRG ROM code
• Section containing CHR ROM data
Control Byte 1 and Control Byte 2 (Byte 06 and 07 in the header) contain some additional
info about the data in �le, but it's packed in bits.
We won't cover and support the iNES 2.0 format as it's not very popular. But you can �nd
the formal speci�cation of both iNES versions.
• PRG ROM
• CHR ROM
• Mapper type
• Mirroring type: Horizontal, Vertical, 4 Screen
Mirroring will be extensively covered in the following PPU chapters. For now, we need to
�gure out which mirroring type the game is using.
Mapper 0 essentially means "no mapper" that CPU reads both CHR and PRG ROM as is.
#[derive(Debug, PartialEq)]
pub enum Mirroring {
VERTICAL,
HORIZONTAL,
FOUR_SCREEN,
}
impl Rom {
pub fn new(raw: &Vec<u8>) -> Result<Rom, String> {
if &raw[0..4] != NES_TAG {
return Err("File is not in iNES file format".to_string());
}
Ok(Rom {
prg_rom: raw[prg_rom_start..(prg_rom_start +
prg_rom_size)].to_vec(),
chr_rom: raw[chr_rom_start..(chr_rom_start +
chr_rom_size)].to_vec(),
mapper: mapper,
screen_mirroring: screen_mirroring,
})
}
}
impl Bus {
pub fn new(rom: Rom) -> Self {
Bus {
cpu_vram: [0; 2048],
rom: rom,
}
}
//....
}
impl Bus {
// ...
You can download your �rst NES ROM dump �le on Github.
You will need to modify the main method to load binary from a �le.
Spoiler alert: it's a modi�cation of a snake game with funkier physics. The game expects
the same memory map for the input device, screen output, and random number
generator.
The NES dev community has created large suites of tests that can be used to check our
emulator.
They cover pretty much every aspect of the console, including quirks and notable bugs
that were embedded in the platform.
We will start with the most basic test covering main CPU features: instruction set,
memory access, and CPU cycles.
The iNES �le of the test is located here: nestest.nes An execution log accompanies the
test, showing what the execution should look like: nestest.log
The next goal is to generate a similar execution log for the CPU while running a program.
For now, we can ignore the last column and focus on the �rst �ve.
• The �rst number is the actual mem reference that we get if we apply an o�set to the
requesting address. 0xE1 is using the "Indirect X" addressing mode, and the o�set is
de�ned by register X
• The second number is a 2-byte target address fetched from . In this
case it's [0x00, 0x02]
• The third number is content of address cell 0x0200
impl CPU {
// ..
pub fn run_with_callback<F>(&mut self, mut callback: F)
where
F: FnMut(&mut CPU),
{
let ref opcodes: HashMap<u8, &'static opcodes::OpCode> =
*opcodes::OPCODES_MAP;
loop {
callback(self);
// ...
}
}
}
All we need to do is to create a callback function that will trace CPU state:
fn main() {
//...
cpu.run_with_callback(move |cpu| {
println!("{}", trace(cpu));
}
}
It's vital to get a execution log format precisely like the one used in the provided log.
#[cfg(test)]
mod test {
use super::*;
use crate::bus::Bus;
use crate::cartridge::test::test_rom;
#[test]
fn test_format_trace() {
let mut bus = Bus::new(test_rom());
bus.mem_write(100, 0xa2);
bus.mem_write(101, 0x01);
bus.mem_write(102, 0xca);
bus.mem_write(103, 0x88);
bus.mem_write(104, 0x00);
#[test]
fn test_format_mem_access() {
let mut bus = Bus::new(test_rom());
// ORA ($33), Y
bus.mem_write(100, 0x11);
bus.mem_write(101, 0x33);
//data
bus.mem_write(0x33, 00);
bus.mem_write(0x34, 04);
//target cell
bus.mem_write(0x400, 0xAA);
Now it's time to compare our execution log to the golden standard.
You can use any di� tool you'd like. But because our NES doesn't support CPU clock
cycles yet, it makes sense to remove last columns in the provided log:
I.e., everything that our emulator has produced should exactly match the golden
standard, up to line . If anything is o� before the line, we have a mistake in our
CPU implementation and it needs to be �xed.
But that doesn't explain why our program got terminated. Why didn't we get the perfect
match after the line ?
It looks like our CPU doesn't know how to interpret the opcode 0x04.
Here is the bad news: there are about 110 uno�cial CPU instructions. Most of the real
NES games use them a lot. For us to move on, we will need to implement all of them.
• nesdev.com/undocumented_opcodes.txt
• wiki.nesdev.com/w/index.php/Programming_with_uno�cial_opcodes
• wiki.nesdev.com/w/index.php/CPU_uno�cial_opcodes
• www.oxyron.de/html/opcodes02.html
The testing ROM should drive your progress. In the end, the CPU should support 256
instructions. Considering that 1 byte is for the operation code, we've exhausted all
possible values.
That's a good sign. 4015 is a memory map for the APU register, which we haven't
implemented yet.
Picture Processing Unit is the hardest one to emulate because it deals with the most
complicated aspect of gaming: rendering the state of the screen. The NES PPU has fair
number of quirks. While emulating some of them is not necessarily required, others are
crucially important to have a playable environment. 64KiB is not a hell of a lot of space,
and NES platform designers tried to squeeze as much out of it as possible. Working with
CHR ROM data means pretty much working with compressed data format; it requires a lot
of bit arithmetic, uncompressing, and parsing.
The �rst step is very similar to emulating the CPU. After the third one it will be possible to
play games with static screens:
• Pac-Man
• Donkey Kong
• Balloon Fight
When we are done with the scroll, we could play platformers such as Super Mario Bros.
So let's start.
PPU has its own memory map, composed of PPU RAM, CHR ROM, and address space
mirrors. PPU exposes 8 I/O Registers that are used by the CPU for communication. Those
registers are mapped to in the CPU memory map (and mirrored every
8 bytes through the region of )
To be precise, PPU has its own bus used to communicate with RAM and cartridge CHR
ROM. But we don't necessarily need to emulate the bus.
• Address (0x2006) & Data (0x2007) - provide access to the memory map available for
PPU
• OAM Address (0x2003) & OAM Data (0x2004) - Object Attribute Memory - the space
responsible for sprites
• Direct Memory Access (0x4014) - for fast copying of 256 bytes from CPU RAM to
OAM
• Controller (0x2000) - instructs PPU on general logic �ow (which memory table to use,
if PPU should interrupt CPU, etc.)
• Mask (0x2001) - instructs PPU how to render sprites and background
• Scroll (0x2005) - instructs PPU how to set a viewport
• Status 0x2002
The full spec of the registers can be found on NES Dev wiki.
PPU execution life cycle was tightly coupled with the electron beam of the TV screen.
The PPU renders 262 scanlines per frame. (0 - 240 are visible scanlines, the rest are
so-called vertical overscan)
Each scanline lasts for 341 PPU clock cycles, with each clock cycle producing one
pixel. (the �rst 256 pixels are visible, the rest is horizontal overscan)
The NES screen resolution is 320x240, thus scanlines 241 - 262 are not visible.
Upon entering the 241st scanline, PPU triggers VBlank NMI on the CPU. PPU makes
no memory accesses during 241-262 scanlines, so PPU memory can be freely
accessed by the program. The majority of games play it safe and update the screen
state only during this period, essentially preparing the view state for the next frame.
Where:
Mirroring and chr_rom are speci�c to each game and provided by a cartridge
impl NesPPU {
pub fn new(chr_rom: Vec<u8>, mirroring: Mirroring) -> Self {
NesPPU {
chr_rom: chr_rom,
mirroring: mirroring,
vram: [0; 2048],
oam_data: [0; 64 * 4],
palette_table: [0; 32],
}
}
}
Let's try to emulate two the most complex registers: Address ( ) and Data( )
There are multiple caveats in the way the CPU can access PPU RAM. Say the CPU wants to
access memory cell at 0x0600 PPU memory space:
1. It has to load the requesting address into the Addr register. It has to write to the
register twice - to load 2 bytes into 1-byte register:
LDA #$06
STA $2006
LDA #$00
STA $2006
2. Then, the CPU can request data load from PPU Data register (0x2007)
LDA $2007
Because CHR ROM and RAM are considered external devices to PPU, PPU can't
return the value immediately. PPU has to fetch the data and keep it in internal
bu�er.
The �rst read from 0x2007 would return the content of this internal bu�er �lled
during the previous load operation. From the CPU perspective, this is a dummy
read.
3. CPU has to read from 0x2007 one more time to �nally get the value from the PPUs
internal bu�er.
LDA $2007
Also note that read or write access to 0x2007 increments the PPU Address (0x2006).
The increment size is determined by the state of the Control register (0x2000):
impl AddrRegister {
pub fn new() -> Self {
AddrRegister {
value: (0, 0), // high byte first, lo byte second
hi_ptr: true,
}
}
fn set(&mut self, data: u16) {
self.value.0 = (data >> 8) as u8;
self.value.1 = (data & 0xff) as u8;
}
impl NesPPU {
// ...
fn write_to_ppu_addr(&mut self, value: u8) {
self.addr.update(value);
}
}
bitflags! {
// 7 bit 0
// ---- ----
// VPHB SINN
// |||| ||||
// |||| ||++- Base nametable address
// |||| || (0 = $2000; 1 = $2400; 2 = $2800; 3 = $2C00)
// |||| |+--- VRAM address increment per CPU read/write of PPUDATA
// |||| | (0: add 1, going across; 1: add 32, going down)
// |||| +---- Sprite pattern table address for 8x8 sprites
// |||| (0: $0000; 1: $1000; ignored in 8x16 mode)
// |||+------ Background pattern table address (0: $0000; 1: $1000)
// ||+------- Sprite size (0: 8x8 pixels; 1: 8x16 pixels)
// |+-------- PPU master/slave select
// | (0: read backdrop from EXT pins; 1: output color on EXT
pins)
// +--------- Generate an NMI at the start of the
// vertical blanking interval (0: off; 1: on)
pub struct ControlRegister: u8 {
const NAMETABLE1 = 0b00000001;
const NAMETABLE2 = 0b00000010;
const VRAM_ADD_INCREMENT = 0b00000100;
const SPRITE_PATTERN_ADDR = 0b00001000;
const BACKROUND_PATTERN_ADDR = 0b00010000;
const SPRITE_SIZE = 0b00100000;
const MASTER_SLAVE_SELECT = 0b01000000;
const GENERATE_NMI = 0b10000000;
}
}
impl ControlRegister {
pub fn new() -> Self {
ControlRegister::from_bits_truncate(0b00000000)
}
impl NesPPU {
//...
fn write_to_ctrl(&mut self, value: u8) {
self.ctrl.update(value);
}
}
impl NesPPU {
//...
fn increment_vram_addr(&mut self) {
self.addr.increment(self.ctrl.vram_addr_increment());
}
match addr {
0..=0x1fff => todo!("read from chr_rom"),
0x2000..=0x2fff => todo!("read from RAM"),
0x3000..=0x3eff => panic!("addr space 0x3000..0x3eff is not
expected to be used, requested = {} ", addr),
0x3f00..=0x3fff =>
{
self.palette_table[(addr - 0x3f00) as usize]
}
_ => panic!("unexpected access to mirrored space {}", addr),
}
}
}
We can emulate this internal bu�er behavior for RAM and ROM by using a temporary
�eld to hold a value from a previous read request:
impl NesPPU {
// ...
match addr {
0..=0x1fff => {
let result = self.internal_data_buf;
self.internal_data_buf = self.chr_rom[addr as usize];
result
}
0x2000..=0x2fff => {
let result = self.internal_data_buf;
self.internal_data_buf =
self.vram[self.mirror_vram_addr(addr) as usize];
result
}
// ..
}
}
}
Writing to PPU memory can be implemented in a similar way, just don't forget that writes
to 0x2007 also increments Address Register.
Again the NESDEV wiki provides excellent coverage of this topic: Mirroring.
VRAM mirroring is tightly coupled with the way NES implements scrolling of the viewport.
We would spend enough time discussing this in the chapter about Scroll. For now, we can
just code the mirroring behavior.
NES uses 1 KiB of VRAM to represent a single screen state. Having 2 KiB of VRAM onboard
means that NES can keep a state of 2 screens.
impl NesPPU {
//...
// Horizontal:
// [ A ] [ a ]
// [ B ] [ b ]
// Vertical:
// [ A ] [ B ]
// [ a ] [ b ]
pub fn mirror_vram_addr(&self, addr: u16) -> u16 {
let mirrored_vram = addr & 0b10111111111111; // mirror down
0x3000-0x3eff to 0x2000 - 0x2eff
let vram_index = mirrored_vram - 0x2000; // to vram vector
let name_table = vram_index / 0x400; // to the name table index
match (&self.mirroring, name_table) {
(Mirroring::VERTICAL, 2) | (Mirroring::VERTICAL, 3) => vram_index
- 0x800,
(Mirroring::HORIZONTAL, 2) => vram_index - 0x400,
(Mirroring::HORIZONTAL, 1) => vram_index - 0x400,
(Mirroring::HORIZONTAL, 3) => vram_index - 0x800,
_ => vram_index,
}
}
}
impl Bus {
pub fn new(rom: Rom) -> Self {
let ppu = NesPPU::new(rom.chr_rom, rom.screen_mirroring);
Bus {
cpu_vram: [0; 2048],
prg_rom: rom.prg_rom,
ppu: ppu,
}
}
//..
And provide memory mapping for the registers we've implemented so far:
impl Bus {
//...
fn mem_read(&mut self, addr: u16) -> u8 {
match addr {
RAM..=RAM_MIRRORS_END => {
let mirror_down_addr = addr & 0b00000111_11111111;
self.cpu_vram[mirror_down_addr as usize]
}
0x2000 | 0x2001 | 0x2003 | 0x2005 | 0x2006 | 0x4014 => {
panic!("Attempt to read from write-only PPU address {:x}",
addr);
}
0x2007 => self.ppu.read_data(),
0x2008..=PPU_REGISTERS_MIRRORS_END => {
let mirror_down_addr = addr & 0b00100000_00000111;
self.mem_read(mirror_down_addr)
}
0x8000..=0xFFFF => self.read_prg_rom(addr),
_ => {
println!("Ignoring mem access at {}", addr);
0
}
}
}
0x2006 => {
self.ppu.write_to_ppu_addr(data);
}
0x2007 => {
self.ppu.write_to_data(data);
}
0x2008..=PPU_REGISTERS_MIRRORS_END => {
let mirror_down_addr = addr & 0b00100000_00000111;
self.mem_write(mirror_down_addr, data);
}
0x8000..=0xFFFF => panic!("Attempt to write to Cartridge ROM
space: {:x}", addr),
_ => {
println!("Ignoring mem write-access at {}", addr);
}
}
}
}
The communication with the rest of the registers is similar. I leave this exercise to the
reader.
Interrupts are the mechanism for the CPU to break the sequential execution �ow and
react to events that require immediate attention ("attend to an interrupt").
We've already implemented one of the supported interrupts - the RESET signal. This
interrupt noti�es the CPU that a new cartridge was inserted and the CPU needs to
execute the reset subroutine.
PPU communicates that it's entering the VBLANK phase for the frame via another
interrupt signal - NMI (Non-Maskable Interrupt). From a high-level perspective, this means
two things:
The reason why VBLANK phase is unique is that while PPU is rendering visible scan
lines, it's constantly using internal bu�ers and memory. External access to IO
registers can corrupt data in those bu�ers and cause noticeable graphic glitches.
Unlike other interrupts, CPU can't ignore the NMI. And the �ag in the
has no e�ect on how the CPU attends to it. The CPU however, might
instruct PPU to not trigger NMI by resetting the 7th bit in the PPU Control register.
But to simplify,
PPU Cycles and CPU Cycles are not the same things
On the NES Platform, all components were running independently in parallel. This makes
NES a distributed system. The coordination hast to be carefully designed by game
developers based on timing specs of the instructions. I can only imagine how tedious this
manual process is.
1. Allocate a thread per component and simulate proper timing for each instruction. I
don't know of any emulator that does that. Simulating proper timing is a hell of a
task. Second, this approach requires allocating more hardware resources than
needed for the job (PPU, CPU, and APU would require 3 threads, and potentially
would occupy 3 cores on the host machine)
2. Execute all components sequentially in one thread, by advancing one clock cycle at a
time in each component. This is similar to creating a green-thread runtime and
using one dedicated OS thread to run this runtime. It would require substantial
investment in creating green-threads runtime.
3. Execute all components sequentially in one thread, but by letting CPU to execute
one full instruction, compute the clock cycles budget for other components and let
them run within the budget. This technique is called "catch-up"
For example, CPU takes 2 cycles to execute "LDA #$01" (opcode 0xA9), which means
that PPU can run for 6 PPU cycles now (PPU clock is ticking three times faster than
CPU clock) and APU can run for 1 cycle (APU clock is two times slower)
Because we already have CPU loop mostly spec'd out, the third approach is the easiest to
implement. Granted, it would be the least accurate one. But it's good enough to have
something playable as soon as possible.
impl CPU {
pub fn run_with_callback<F>(&mut self, mut callback: F)
where
F: FnMut(&mut CPU),
{
//...
loop {
// …
self.bus.tick(opcode.cycles);
if program_counter_state == self.program_counter {
self.program_counter += (opcode.len - 1) as u16;
}
}
}
}
The Bus should keep track of executed cycles and propagate tick call to PPU, but because
PPU clock is 3 times faster than CPU clock, it would multiply the value:
cycles: usize,
}
impl Bus {
pub fn new(rom: Rom) -> Self {
let ppu = NesPPU::new(rom.chr_rom, rom.screen_mirroring);
Bus {
cpu_vram: [0; 2048],
prg_rom: rom.prg_rom,
ppu: ppu,
cycles: 0,
}
}
pub fn tick(&mut self, cycles: u8) {
self.cycles += cycles as usize;
self.ppu.tick(cycles * 3);
}
}
The PPU would track cycles and calculate which scanline is should be drawing:
impl NesPPU {
// …
pub fn tick(&mut self, cycles: u8) -> bool {
self.cycles += cycles as usize;
if self.cycles >= 341 {
self.cycles = self.cycles - 341;
self.scanline += 1;
if self.scanline == 241 {
if self.ctrl.generate_vblank_nmi() {
self.status.set_vblank_status(true);
todo!("Should trigger NMI interrupt")
}
}
Some crucial details are still missing: some of the CPU operations take variable clock time
depending on the execution �ow. For example, conditional branch operations (like BNE)
take an additional CPU cycle if the comparison is successful. And yet another CPU cycle if
the JUMP would result in program counter to be on another memory page
Memory page size is 256 bytes. For example, the range [0x0000 .. 0x00FF]- belongs
to page 0, [0x0100 .. 0x01FF] belongs to page 1, etc. It's enough to compare the
upper byte of the addresses to see if they are on the same page.
I leave it up to the reader to �gure out how to codify those additional ticks that may or
may not happen.
The problem is that we want to deliver signals from PPU to CPU and Rust doesn't really
allow to have dependency cycles easily.
One way to overcome this is to replace the push model with pull. The CPU can ask if there
are interrupts ready at the beginning of the interpret cycle.
impl CPU {
//...
pub fn run_with_callback<F>(&mut self, mut callback: F)
where
F: FnMut(&mut CPU),
{
// ...
loop {
if let Some(_nmi) = self.bus.poll_nmi_status() {
self.interrupt_nmi();
}
// …
}
}
}
The �nal piece is to implement interrupt behavior. Upon receiving an interrupt signal the
CPU:
Interrupt handler would have to call RTI operation at the end to �nish interrupt
attendance. That would restore Status Flag and Program Counter position from the stack.
E�ectively going back to the execution �ow where it was left o�.
fn interrupt_nmi(&mut self) {
self.stack_push_u16(self.program_counter);
let mut flag = self.status.clone();
flag.set(CpuFlags::BREAK, 0);
flag.set(CpuFlags::BREAK2, 1);
self.stack_push(flag.bits);
self.status.insert(CpuFlags::INTERRUPT_DISABLE);
self.bus.tick(2);
self.program_counter = self.mem_read_u16(0xfffA);
}
In addition to scanline position, PPU would immediately trigger NMI if both of these
conditions are met:
In our CPU implementation, we've implemented opcode as a return from CPU fetch-
decode-execute cycle, but in reality it should trigger BRK interrupt. This is so-called
"software interrupt" that a game code can trigger programmatically in response to
events.
The address space on PPU is reserved for CHR ROM, which contains the
visual graphics data of a game.
That's 8 KiB worth of data. And that's all there was in the �rst versions of NES cartridges.
Visual data is packed in so-called tiles: an 8 x 8 pixel image that could use up to 4 colors.
(to be precise, background tile can have 4 colors, a sprite tile can have 3 colors, and 0b00
is used as an indication that a pixel should be transparent)
8 KiB / 128 bits = 512 tiles. I.e., each cartridge contained 512 tiles in total, divided into 2
pages/banks. The banks did not really have a name, historically they are called "left" and
"right".
8 pixels x 8 pixels is a tiny size, not much can be presented that way. The majority of
objects in NES games are composed of multiple tiles.
What makes the CHR format tricky is that tiles themselves don't hold any color
information. Each pixel in a tile is codi�ed using 2 bits, declaring a color index in a palette,
not a color itself.
If NES were using popular RGB format for each pixel, a single tile would occupy 8824
= 192 bytes. And it would require 96 KiB of CHR ROM space to hold 512 tiles.
The real color of a pixel is decided during the rendering phase by using the so-called color
palette, more on this later.
Surprisingly, 2 bits of a pixel are not codi�ed in the same byte. A tile is described using 16
bytes. And each row is encoded using 2 bytes that stand 8 bytes apart from each other.
To �gure out the color index of the top-left pixel, we need to read the 7th bit of byte
0x0000 and the 7th bit of byte 0x0008, to get the next pixel in the same row we would
need to read 6th bits in the same bytes, etc.
Before rendering CHR ROM content, we need to brie�y discuss the colors available to the
NES. Di�erent versions of the PPU chip had slightly di�erent system-level palettes of 52
hardwired colors.
There are multiple variations used in emulators. Some make the picture more visually
appealing, while others keep it closer to the original picture NES generated on a TV.
It doesn't matter much which one we would choose, most of them get us good enough
results.
However, we still need to codify that table in a RGB format that is recognized by the SDL2
library:
#[rustfmt::skip]
To render tiles from a CHR ROM, we need to get a ROM �le of a game. Google would help
you �nd a lot of ROM dumps of the well-known classics. However, downloading such
ROMs if you don't own a cartridge would be illegal (wink-wink). There is a site that listed
legit homebrew games that were recently developed. And some of them are pretty good,
most of them are free. Check it out: www.nesworld.com
The caveat here that our emulator supports only NES 1.0 format. And homebrew
developed games tend to use NES 2.0. Games like "Alter Ego" would do.
I would use Pac-Man, mostly because it's recognizable and I happen to own a cartridge of
this game.
First, let's create an abstraction layer for a frame, so we don't need to work with SDL
directly:
impl Frame {
const WIDTH: usize = 256;
const HIGHT: usize = 240;
for y in 0..=7 {
let mut upper = tile[y];
let mut lower = tile[y + 8];
for x in (0..=7).rev() {
let value = (1 & upper) << 1 | (1 & lower);
upper = upper >> 1;
lower = lower >> 1;
let rgb = match value {
0 => palette::SYSTEM_PALLETE[0x01],
1 => palette::SYSTEM_PALLETE[0x23],
2 => palette::SYSTEM_PALLETE[0x27],
3 => palette::SYSTEM_PALLETE[0x30],
_ => panic!("can't be"),
};
frame.set_pixel(x, y, rgb)
}
}
frame
}
For now, we're interpreting color indices randomly. Just pick 4 random colors
from the system palette for each index value to see how it would look like.
fn main() {
// ….init sdl2
// ....
loop {
for event in event_pump.poll_iter() {
match event {
Event::Quit { .. }
| Event::KeyDown {
keycode: Some(Keycode::Escape),
..
} => std::process::exit(0),
_ => { /* do nothing */ }
}
}
}
}
We can adjust code just a little bit to draw all tiles from CHR ROM:
Aha! Despite colors being clearly o�, the shapes are recognizable now. We can see parts
of ghosts, some letters, and some numbers. I guess that's it. Moving on...
At this point, the CPU and PPU are fully functional and working in coordination with each
other. If we were to load a game into our emulator, the game would execute and most
likely run into demo mode.
The problem is that we can't see what's going on inside. Remember how we had
intercepted the execution of the snake game to read the game screen state? And then
had it rendered via SDL2 canvas? We will have to do something similar here. It's just the
data format used by NES is slightly more complicated.
Both of those are constructed using CHR tiles, we've discussed in the previous chapter. In
fact, the same tiles can be used both for background and for sprites.
NES uses di�erent memory spaces to hold those categories. The set of possible
transformations is also di�erent.
Three main memory sections are responsible for the state of a background:
NES Screen background screen is composed of 960 tiles (a tile being 8x8 pixels: 256 / 8
* 240 / 8 = 960 ) Each tile is represented by one byte in VRAM in the space called
Nametable.
Using a byte in nametable PPU can address only 256 elements within a single bank
in pattern table. Control register decides which of two banks should be used for
background (and which one should be used for sprites).
In addition to 960 bytes for tiles, a nametable holds 64 bytes that specify color palette, we
will discuss later. In total, a single frame is de�ned as 1024 bytes (960 + 64). PPU VRAM
can simultaneously hold two nametables - the states of two frames.
Two additional nametables that exist in the address space of the PPU must be either
mapped to existing tables or to extra RAM space on a cartridge. More details.
Nametables are populated by CPU during program execution (using Addr and Data
registers that we've implemented). It's entirely determined by game code. All we need to
do is to read the correct part of the VRAM.
1. Determine which nametable is being used for the current screen (by reading bit 0
and bit 1 from Control register)
2. Determine which CHR ROM bank is used for background tiles (by reading bit 4 from
Control Register)
3. Read 960 bytes from the speci�ed nametable and draw a 32x30 tile-based screen
use crate::ppu::NesPPU;
use frame::Frame;
for i in 0..0x03c0 { // just for now, lets use the first nametable
let tile = ppu.vram[i] as u16;
let tile_x = i % 32;
let tile_y = i / 32;
let tile = &ppu.chr_rom[(bank + tile * 16) as usize..=(bank + tile *
16 + 15) as usize];
for y in 0..=7 {
let mut upper = tile[y];
let mut lower = tile[y + 8];
for x in (0..=7).rev() {
let value = (1 & upper) << 1 | (1 & lower);
upper = upper >> 1;
lower = lower >> 1;
let rgb = match value {
0 => palette::SYSTEM_PALLETE[0x01],
1 => palette::SYSTEM_PALLETE[0x23],
2 => palette::SYSTEM_PALLETE[0x27],
3 => palette::SYSTEM_PALLETE[0x30],
_ => panic!("can't be"),
};
frame.set_pixel(tile_x*8 + x, tile_y*8 + y, rgb)
}
}
}
}
Note: We are still using randomly picked colors from a system palette just to see
shapes
Again, we need to intercept the program execution to read the screen state. On the real
console the PPU is drawing one pixel each PPU clock cycle. However, we can take a
shortcut. Instead of reading part of the screen state on each PPU clock tick, we can wait
until the full screen is ready and read in one go.
This is quite a drastic simpli�cation that limits the types of games it will
be possible to play on the emulator.
More advanced games used a lot of tricks to enrich the gaming experience. For
example, changing scroll in the middle of the frame (split scroll) or changing palette
colors.
This simpli�cation wouldn't a�ect �rst-gen NES games much. The majority of NES
games would require more accuracy in PPU emulation, however.
On the real console, PPU is actively drawing screen state on a TV screen during 0 - 240
scanlines; during scanlines 241 - 262, the CPU is updating the state of PPU for the next
frame, then the cycle repeats.
One way to intercept is to read the screen state right after NMI interrupt - when PPU is
done rendering the current frame, but before CPU starts creating the next one.
First lets add callback to the bus, that will be called every time PPU triggers NMI:
ub struct Bus<'call> {
cpu_vram: [u8; 2048],
prg_rom: Vec<u8>,
ppu: NesPPU,
cycles: usize,
gameloop_callback: Box<dyn FnMut(&NesPPU) + 'call>,
impl<'a> Bus<'a> {
pub fn new<'call, F>(rom: Rom, gameloop_callback: F) -> Bus<'call>
where
F: FnMut(&NesPPU) + 'call,
{
let ppu = NesPPU::new(rom.chr_rom, rom.screen_mirroring);
Bus {
cpu_vram: [0; 2048],
prg_rom: rom.prg_rom,
ppu: ppu,
cycles: 0,
gameloop_callback: Box::from(gameloop_callback),
}
}
}
impl<'a> Bus<'a> {
//..
pub fn tick(&mut self, cycles: u8) {
self.cycles += cycles as usize;
fn main() {
// init sdl2…
canvas.present();
for event in event_pump.poll_iter() {
match event {
Event::Quit { .. }
| Event::KeyDown {
keycode: Some(Keycode::Escape),
..
} => std::process::exit(0),
_ => { /* do nothing */ }
}
}
});
cpu.reset();
cpu.run();
}
NES Console could generate 52 di�erent colors on a TV screen. Those colors constitute
the hardwired System Palette of the console.
However, a single screen can use only 25 colors simultaneously: 13 background colors
and 12 for sprites.
NES had internal memory RAM to store palette settings. The space is divided into 8
palettes tables: 4 for background and 4 for sprites. Each palette contains three colors.
Remember that a pixel in a tile was coded using 2 bits - that's 4 possible values. 0b00 is a
special one.
A single tile can be drawn using only one palette from the palette table. For background
tiles, the last 64 bytes of each nametable are reserved for assigning a speci�c palette to a
part of the background. This section is called an attribute table.
First let's extract the palette for a background tile speci�ed by its row and column
position on the screen:
And just rewire our color lookup in render function from using randomly picked colors
to the actual ones:
for i in 0..0x3c0 {
let tile = ppu.vram[i] as u16;
let tile_column = i % 32;
let tile_row = i / 32;
let tile = &ppu.chr_rom[(bank + tile * 16) as usize..=(bank + tile *
16 + 15) as usize];
let palette = bg_pallette(ppu, tile_column, tile_row);
for y in 0..=7 {
let mut upper = tile[y];
let mut lower = tile[y + 8];
for x in (0..=7).rev() {
let value = (1 & lower) << 1 | (1 & upper);
upper = upper >> 1;
lower = lower >> 1;
let rgb = match value {
0 => palette::SYSTEM_PALLETE[ppu.palette_table[0] as
usize],
1 => palette::SYSTEM_PALLETE[palette[1] as usize],
2 => palette::SYSTEM_PALLETE[palette[2] as usize],
3 => palette::SYSTEM_PALLETE[palette[3] as usize],
_ => panic!("can't be"),
};
frame.set_pixel(tile_column * 8 + x, tile_row * 8 + y, rgb)
}
}
}
}
That's it.
Rendering sprites is somewhat similar, but a bit easier. NES had an internal RAM for
storing states of all sprites in the frame, so-called Object Attribute Memory (OAM).
It had 256 bytes of RAM and reserved 4 bytes for each sprite. This gives an option of
having 64 tiles on a screen simultaneously (but keep in mind that a single object on a
screen usually consists of at least 3-4 tiles).
• using OAM Addr and OAM Data PPUT registers, updating one byte at a time.
• bulk updating the whole table by transferring 256 bytes from CPU RAM using OAM
DMA
NES Dev Wiki provides a pretty solid speci�cation of each byte in the OAM record
To render all visible sprites, we just need to scan through oam_data space and parse out
every 4 bytes into a sprite:
for y in 0..=7 {
let mut upper = tile[y];
let mut lower = tile[y + 8];
'ololo: for x in (0..=7).rev() {
let value = (1 & lower) << 1 | (1 & upper);
upper = upper >> 1;
lower = lower >> 1;
let rgb = match value {
0 => continue 'ololo, // skip coloring the pixel
1 => palette::SYSTEM_PALLETE[sprite_palette[1] as usize],
2 => palette::SYSTEM_PALLETE[sprite_palette[2] as usize],
3 => palette::SYSTEM_PALLETE[sprite_palette[3] as usize],
_ => panic!("can't be"),
};
match (flip_horizontal, flip_vertical) {
(false, false) => frame.set_pixel(tile_x + x, tile_y + y,
rgb),
(true, false) => frame.set_pixel(tile_x + 7 - x, tile_y +
y, rgb),
(false, true) => frame.set_pixel(tile_x + x, tile_y + 7 -
y, rgb),
(true, true) => frame.set_pixel(tile_x + 7 - x, tile_y + 7
- y, rgb),
}
}
}
• Joypads
• Power pads
• Lightgun Zappers
• Arkanoid controllers
• And even keyboards
We will emulate joypads as it's the most common and the easiest device to emulate
Two joypads are mapped to and CPU address space, respectively. The
same register can be used for both reading and writing. Reading from a controller reports
the state of a button (1 - pressed, 0 - released). The controller reports a state of one
button at a time. To get the state of all buttons, the CPU has to read the controller
register 8 times.
A -> B -> Select -> Start -> Up -> Down -> Left -> Right
After reporting the state of the button , the controller would continually return 1s
for all following read, until a strobe mode change.
The CPU can change the mode of a controller by writing a byte to the register. However,
only the �rst bit matters.
• strobe bit on - controller reports only status of the button A on every read
• strobe bit o� - controller cycles through all buttons
So the most basic cycle to read the state of a joypad for CPU:
bitflags! {
// https://wiki.nesdev.com/w/index.php/Controller_reading_code
pub struct JoypadButton: u8 {
const RIGHT = 0b10000000;
const LEFT = 0b01000000;
const DOWN = 0b00100000;
const UP = 0b00010000;
const START = 0b00001000;
const SELECT = 0b00000100;
const BUTTON_B = 0b00000010;
const BUTTON_A = 0b00000001;
}
}
We need to track:
impl Joypad {
pub fn new() -> Self {
Joypad {
strobe: false,
button_index: 0,
button_status: JoypadButton::from_bits_truncate(0),
}
}
}
impl Joypad {
//...
pub fn write(&mut self, data: u8) {
self.strobe = data & 1 == 1;
if self.strobe {
self.button_index = 0
}
}
Don't forget to connect the joypad to the BUS and map it for address 0x4016.
One last step is to adjust our game loop to update the status of the joypad depending of
a keyboard button being pressed or released on the host machine:
fn main() {
//... init sdl2
//... load the game
let mut key_map = HashMap::new();
key_map.insert(Keycode::Down, joypad::JoypadButton::DOWN);
key_map.insert(Keycode::Up, joypad::JoypadButton::UP);
key_map.insert(Keycode::Right, joypad::JoypadButton::RIGHT);
key_map.insert(Keycode::Left, joypad::JoypadButton::LEFT);
key_map.insert(Keycode::Space, joypad::JoypadButton::SELECT);
key_map.insert(Keycode::Return, joypad::JoypadButton::START);
key_map.insert(Keycode::A, joypad::JoypadButton::BUTTON_A);
key_map.insert(Keycode::S, joypad::JoypadButton::BUTTON_B);
canvas.present();
for event in event_pump.poll_iter() {
match event {
Event::Quit { .. }
| Event::KeyDown {
keycode: Some(Keycode::Escape),
..
} => std::process::exit(0),
_ => { /* do nothing */ }
}
}
});
//...
}
And here we are. Now we can play NES classics, using a keyboard. If you want to have a
little bit of geeky fun, I highly recommend buying USB replicas of original NES controllers
on Amazon.
SDL2 fully supports joysticks, and with just a tiny adjustment in our game loop, we can
have almost perfect NES experience.
Ok, we've made quite a bit of progress here. The two major pieces left are:
Before we start discussing scrolling, we need to clarify one detail. We've discussed that
PPU noti�es the state of the frame by triggering NMI interrupt, which tells CPU that
rendering of the current frame is �nished. That's not the whole story. PPU has 2
additional mechanisms to tell its progress:
Sprite over�ow is rarely used because it had a bug that resulted in false positives and
false negatives. Sprite 0 hit though is used by the majority of games that have scrolling.
It's the way to get mid-frame progress status of PPU:
This is a very rough simulation of the behavior. The accurate one requires checking
opaque pixels of a sprite colliding with opaque pixels of background.
if self.scanline == 241 {
self.status.set_vblank_status(true);
self.status.set_sprite_zero_hit(false);
if self.ctrl.generate_vblank_nmi() {
self.nmi_interrupt = Some(1);
}
}
Note: the sprite zero hit �ag should be erased upon entering VBLANK state.
The scroll is one of the primary mechanisms to simulate movement in space in NES
games. It's an old idea of moving the viewport against the static background to create an
illusion of movement through space.
The scroll is implemented on the PPU level and only a�ects the rendering of background
tiles (those stored in nametables). Sprites (OAM data) are not a�ected by this.
PPU can keep two screens in memory simultaneously (remember one name table - 1024
bytes, and PPU has 2 KiB of VRAM). This doesn't look like a lot, but this is enough to do
the trick. During the scroll the viewport cycles through those two nametables, while the
CPU is busy updating the part of the screen that's not yet visible, but will be soon. That
also means that most of the time, the PPU is rendering parts of both nametables.
Because this exhausts all available console resources, early games had only 2 options for
scrolling: horizontal or vertical. Old games were settled on the type of scroll for the whole
game. Games that came later on had a mechanism to alternate scrolling between stages.
And the most advanced games (like Zelda) provided the experience where a user can
"move" in all 4 directions.
Initially, the scroll was tightly coupled with mirroring - mostly because of the way NES
handled over�ow of a viewport from one nametable to another on hardware level.
For games like Super Mario Bros (Horizontal Scroll) or Ice Climber (Vertical Scroll), the
mechanism is entirely de�ned by:
Remember, a background screen is de�ned by 960 tiles, each tile being 8x8 pixels,
because PPU Scroll Register de�nes shifts in pixels, which means that on edges of the
viewport, we can see parts of a tile.
Updating PPU memory is relatively expensive, and the CPU can do this only during 241 -
262 scanlines. Because of these constraints, the CPU can update a relatively thin part
(2x30 tiles wide area) of a screen per frame. If we render parts of the nametables that are
not yet visible, we can see how the state of the world comes into existence a couple
frames before entering the viewport.
• The palette of a tile is de�ned by the nametable the tile belongs to, by the base
nametable speci�ed in the Control register
• For horizontal scrolling the content of the base nametable always goes to the left
part of the viewport (or top part in case of vertical scrolling)
Implementing scroll rendering is not hard but requires attention to details. The most
convenient mental model I could come up with is the following:
struct Rect {
x1: usize,
y1: usize,
x2: usize,
y2: usize,
}
impl Rect {
fn new(x1: usize, y1: usize, x2: usize, y2: usize) -> Self {
Rect {
x1: x1,
y1: y1,
x2: x2,
y2: y2,
}
}
}
• And apply shift transformation for each visible pixel - shift_x, shift_y
For example,
for i in 0..0x3c0 {
let tile_column = i % 32;
let tile_row = i / 32;
let tile_idx = name_table[i] as u16;
let tile = &ppu.chr_rom[(bank + tile_idx * 16) as usize..=(bank +
tile_idx * 16 + 15) as usize];
let palette = bg_pallette(ppu, attribute_table, tile_column,
tile_row);
for y in 0..=7 {
let mut upper = tile[y];
let mut lower = tile[y + 8];
for x in (0..=7).rev() {
let value = (1 & lower) << 1 | (1 & upper);
upper = upper >> 1;
lower = lower >> 1;
let rgb = match value {
0 => palette::SYSTEM_PALLETE[ppu.palette_table[0] as
usize],
1 => palette::SYSTEM_PALLETE[palette[1] as usize],
2 => palette::SYSTEM_PALLETE[palette[2] as usize],
3 => palette::SYSTEM_PALLETE[palette[3] as usize],
_ => panic!("can't be"),
};
let pixel_x = tile_column * 8 + x;
let pixel_y = tile_row * 8 + y;
render_name_table(ppu, frame,
main_nametable,
Rect::new(scroll_x, scroll_y, 256, 240 ),
-(scroll_x as isize), -(scroll_y as isize)
);
render_name_table(ppu, frame,
second_nametable,
Rect::new(0, 0, scroll_x, 240),
(256 - scroll_x) as isize, 0
);
// … render sprites
}
Implementing the vertical scroll is similar; we could reuse the same render_name_table
helper function without changes. Just need to �gure out proper addressing, shifts, and
view_port parameters.
Support for scrolling means that now we can play old platformers like Super Mario Bros
and Ice Climber.
• Twitter
• Github
• LinkedIn
And I’m a big fan of distributed systems, data-intensive apps, and functional
programming.
• Twitter
• GitHub
• LinkedIn