Sunday 31 December 2017

Booting C64 on FPGA

Foreword

In the previous post we managed to boot the C64 system in a Verilog simulation.

In this post we will attempt to boot the C64 system on the ZYBO board.

Tweaking the existing FPGA design

We need to make a couple of changes for the FPGA to boot the C64 system. We already made most of them within the c64_core module in the previous post, so for our FPGA design we can simply replace the code within our c64_core IP with the code from that post.

That sounds like a mouthful, so how do we go about doing it? First, let us refresh our memory of what the block design currently looks like:


To get to this diagram, you double click on the block design within the sources tab with the project open in Vivado:

Right click on c64_core_0 and select Edit in IP Packager:


You will be presented with the same kind of Wizard as in a previous post where we wrapped the c64_core into an IP.

Apply the Verilog changes, move to the last page of the Wizard (i.e. Review and Package) and click Repackage IP.

The IP repository will be updated, but your design will still have an old copy of the c64_core IP. For your block design to use the new version, you will need to delete it from the design and add it again.

You might recall from the previous post that we changed the inputs and outputs of c64_core a bit. We changed the address output to an input and added another output that returns the data for the given input address.

These changes to the inputs/outputs of c64_core necessitate changes to the gpio_manipulator IP:

module gpio_manipulator (
  input wire [19:0] gpio_output,
  input wire [7:0] data_out,
  output wire [19:0] gpio_input,
  output wire clk_gen_rst,
  output wire rst,
  output wire debug_mode,
  output wire debug_clk,
  output wire [15:0] address_out
); 

assign clk_gen_rst = gpio_output[16];
assign rst = gpio_output[17];
assign debug_mode = gpio_output[18];
assign debug_clk = gpio_output[19];
assign address_out = gpio_output[15:0]; 
assign gpio_input[7:0] = data_out;

endmodule

With these changes made and all the blocks in our block diagram wired up again, our block diagram will look like the following:



With all this done, we are ready to run synthesis on our design again and to generate a new bitstream.

Synthesis completed without an issue. However, Vivado stopped with an error during bitstream generation. In short, this error told me that my FPGA doesn't have enough resources available to accommodate the synthesised design.

This error really puzzled me and I started to look for clues on what could cause this issue.

Eventually I found something that looked suspect: Whereas the Kernel ROM and Basic ROM utilised Block RAM, the 64KB RAM instance itself utilised distributed RAM resulting in trying to use every component it can find as a piece of RAM.

After further digging, I found the cause of the issue:

 always @ (posedge clk)
    begin
     if (WE) 
     begin
      ram[addr] <= ram_in;
      ram_out <= ram_in;
     end
     else 
     begin
      ram_out <= ram[addr_ram_in];
     end 
    end 


The issue here is that one address input is used when writing to RAM, and another address input is used when reading from it. This doesn't correctly model a Block RAM port in a Xilinx FPGA, which uses a single address input for both reading and writing.

This bug was in fact introduced in the previous post, when we modified the 64KB RAM so that we could inspect its contents from an ARM Cortex program.

We can fix this bug by just using the addr_ram_in input for both scenarios.

With this fix our bitstream generation will complete without any issues.

Testing

Time to test the C64 bootup on our FPGA device. For this purpose we will again use a C program running on the ARM Cortex in the Zynq to evaluate the execution results.

Here is the complete C program:

#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xil_io.h"
#include <unistd.h>

int main()
{
    init_platform();
    //Output: bit 16: Reset clock system
    //Output: bit 17: Reset C64 core
    //Output: bit 18: Enable Debug mode
    //Output: bit 19: Debug Clock
    //Output: bit 15-0: Address within 64KB RAM that we want to read
    //Input: bit 7:0: Data output for given address as specified in previous line
    print("Hello World\n\r");
    u32 regval = (1 << 16) | (1 << 17);
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 16) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 17) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(20000000);
    regval = regval | (1 << 18);
    Xil_Out32(0x41200000, regval);
    for (int i = 1024; i < 1500; i++) {
     regval = regval & ~0xffff;
     regval = regval | i;
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = regval | (1 << 19);
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = ~(1 << 19) & regval;
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     u32 in = Xil_In32(0x41200000);
     in = in & 0xff;
     printf("in %x\n\r",in);
     usleep(500000);
    }
    cleanup_platform();
    return 0;
}


To make things clearer I have added explanatory comments at the beginning for the different bit positions.

We start off once again by resetting the clock system and the C64 core after which we wait about 20 seconds.

After the 20 second wait we display the contents of the first half of screen memory pausing half a second for each value. It is important to note that since we are in the debug mode at this time, we are responsible for clocking the 64KB Block RAM from our program at each read.
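
The per-address read sequence can also be pulled into a small helper. The following is only a sketch: the gpio_bus structure, read_ram_byte and the fake_* functions are hypothetical names that do not appear in the program above. The fake bus merely imitates a RAM that latches its output on the rising edge of debug_clk, so that the sequence can be tried on a PC without the ZYBO.

```c
#include <stdint.h>

#define ADDR_MASK     0xFFFFu       /* bits 15:0: address within the 64KB RAM */
#define DEBUG_CLK_BIT (1u << 19)    /* bit 19: debug clock */

/* Hypothetical bus abstraction, so the same sequence runs on and off target. */
typedef struct {
    void     (*write)(uint32_t value);
    uint32_t (*read)(void);
    void     (*delay_us)(uint32_t us);
} gpio_bus;

/* Present addr, pulse debug_clk once and return the byte on bits 7:0. */
static uint8_t read_ram_byte(const gpio_bus *bus, uint32_t *regval, uint16_t addr)
{
    *regval = (*regval & ~ADDR_MASK) | addr;
    bus->write(*regval);
    bus->delay_us(10000);
    *regval |= DEBUG_CLK_BIT;       /* rising edge clocks the Block RAM */
    bus->write(*regval);
    bus->delay_us(10000);
    *regval &= ~DEBUG_CLK_BIT;
    bus->write(*regval);
    bus->delay_us(10000);
    return (uint8_t)(bus->read() & 0xFFu);
}

/* Host-side fake: a RAM that latches its output on a rising debug_clk edge. */
static uint8_t  fake_ram[65536];
static uint32_t fake_reg;
static uint8_t  fake_data;

static void fake_write(uint32_t value)
{
    if (!(fake_reg & DEBUG_CLK_BIT) && (value & DEBUG_CLK_BIT))
        fake_data = fake_ram[value & ADDR_MASK];
    fake_reg = value;
}

static uint32_t fake_read(void) { return fake_data; }
static void fake_delay(uint32_t us) { (void)us; }
```

On the board, the three callbacks would be thin wrappers around Xil_Out32(0x41200000, ...), Xil_In32(0x41200000) and usleep.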

When our program starts outputting the values of screen memory, you will see a couple of 20's (i.e. the hex value of the screen code for a space). Keep watching, since you will eventually see the screen codes for the welcome message:


The four 2a's correspond to the four asterisks of the welcome message.

It appears that our FPGA device managed to boot the C64 system successfully!

In Summary

In this post we managed to boot the C64 system on the Zybo board from the simulation sources of the previous post.

In the next post we will explore how to interface with the SDRAM that comes equipped with the ZYBO.

With limited Block RAM available on the FPGA, SDRAM access opens new horizons for us, like buffering the output video frames and resizing them so they appear properly on an LCD screen.

Till next time!

Wednesday 13 December 2017

Booting the C64 System

Foreword

In the previous post we managed to successfully run Klaus Dormann's Test Suite on the Zybo Board.

In this post we will extend our implementation to boot the C64 system.

At the end of this post we will run the resulting implementation in a simulator and in the next post we will get to running it on the ZYBO board.

Adding the C64 ROMS

In order to boot the C64 system we need to add the two ROMS, i.e. BASIC and KERNEL, to our design.

The process is more or less the same as when we added the Test Suite binary in previous posts.

Since we are working with ROMS, however, we will only be adding logic to read data from the Block RAM and no write logic.

Since we are dealing with two ROMS, and in later posts three ROMS when adding the Chargen ROM, it makes sense to extract the common logic into a module of its own. The signature of this module will look as follows:

module rom#(
 parameter ADDR_WIDTH = 13,
 parameter ROM_FILE = ""

)

(
  input clk,
  input wire [ADDR_WIDTH-1:0] addr,
  output reg [7:0] rom_out
    )


You will notice our signature contains an extra section preceded by a hash, which is a style we haven't used before.

The hash section is basically a parameter section, declaring parameters with default values. The nice thing about these parameters is that you can override them with suitable values when you create a module instance.

In the parameter section of our rom module we have the parameter ADDR_WIDTH with a default value of 13. This means that if you instantiate a rom module and don't override the ADDR_WIDTH parameter, the resulting instance accepts addresses of at most 13 bits.

13 bits gives us 8KB of addressable space. This default is sufficient for both the BASIC ROM and the KERNEL.

In later posts, however, when we add the CharROM, which is only 4KB, we will need to override ADDR_WIDTH with a value of 12.
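
The relation between address width and ROM size is simply a power of two. A minimal C sketch of the arithmetic (rom_size_bytes is a name made up for this illustration):

```c
#include <stddef.h>

/* Number of byte locations addressable with addr_width address bits. */
static size_t rom_size_bytes(unsigned addr_width)
{
    return (size_t)1 << addr_width;
}
```

With this, rom_size_bytes(13) gives 8192 (8KB) and rom_size_bytes(12) gives 4096 (4KB).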

Let us now look at the meat of our rom module:

reg [7:0] rom[2**ADDR_WIDTH-1:0];

 always @ (posedge clk)
    begin
      rom_out <= rom[addr];
    end 

    
initial begin
      $readmemh(ROM_FILE, rom) ;
    end    


We begin by defining an array that will contain the contents for the applicable ROM. In defining the size of the array we make use of the ADDR_WIDTH parameter defined previously.

We populate the contents of this array in an initial block, as we did in a previous post.

We define an always block that pushes the contents at the given address to an output register on the positive edge of the clock.

With our rom module defined, we can now create some instances of it in our main module:

rom #(
 .ROM_FILE("/home/johan/Documents/roms/kernel.hex")
) kernel(
  .clk(clk),
  .addr(addr[12:0]),
  .rom_out(kernel_out)
    );

rom #(
 .ROM_FILE("/home/johan/Documents/roms/basic.hex")
) basic(
  .clk(clk),
  .addr(addr[12:0]),
  .rom_out(basic_out)
    );


For both instances we pass as a parameter the location of a hex formatted file containing the contents of the applicable ROM.

For the address we pass the least significant 13 bits of the address bus.
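
$readmemh expects a text file with hexadecimal values separated by whitespace. If your ROM dumps are raw binary files, they need converting first. The sketch below shows one way of doing the formatting in C, one byte per line. The function name format_readmemh is hypothetical, and the post does not say how the original kernel.hex and basic.hex files were produced, so treat this as one possible approach.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes each byte as two hex digits followed by a newline into out.
   Returns the number of characters written (excluding the terminating NUL),
   or 0 if out is too small to hold the whole result. */
static size_t format_readmemh(const uint8_t *data, size_t len,
                              char *out, size_t out_size)
{
    size_t needed = len * 3 + 1;    /* "xx\n" per byte plus the NUL */
    size_t pos = 0;

    if (out_size < needed)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++)
        pos += (size_t)snprintf(out + pos, out_size - pos, "%02x\n", data[i]);
    return pos;
}
```

The resulting text can then be written to a file with fputs and passed to the rom module through ROM_FILE.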

We are still missing some arbitration logic that decides, depending on the given address, whether we return the contents of the BASIC ROM, the KERNEL ROM or our 64KB RAM.

Adding Arbitration Logic

The logic for performing arbitration is as follows:

...
reg [7:0] combined_d_out;
...
always @*
  casex (addr)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    default: combined_d_out = ram_out;
  endcase
...

The function of this logic can be represented in a diagram as follows:


All our storage elements (BASIC, KERNEL and our RAM) get fed to a multiplexer, and we use the address as the selector to decide which one gets sent to the DI input of the 6502 CPU.

Let us now look at our piece of Verilog code in more detail. This will indeed look familiar to programmers as a case/switch statement.

This case statement, however, starts with casex instead of case. This is a special kind of Verilog statement, where in the selector you can specify Don't care values.

You specify a don't care value with an X, which means that this bit position can be any value.

Strictly speaking, if you look at our case statement, we could have connected only the three most significant bits to it, since the lower thirteen don't serve any purpose. But, as you will see later, we will need the full address for a scenario where we check for a specific address.
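
To make the address ranges concrete, here is the same selection written as a small C function. It is only a software model of the casex statement, and the names mem_source and select_source are invented for this sketch. The pattern 101x... covers $A000-$BFFF and 111x... covers $E000-$FFFF:

```c
#include <stdint.h>

typedef enum { SRC_RAM, SRC_BASIC, SRC_KERNEL } mem_source;

/* Pick the memory that answers a read, based on the top three address bits. */
static mem_source select_source(uint16_t addr)
{
    switch (addr >> 13) {
    case 0x5: return SRC_BASIC;     /* 101x_xxxx_xxxx_xxxx = $A000-$BFFF */
    case 0x7: return SRC_KERNEL;    /* 111x_xxxx_xxxx_xxxx = $E000-$FFFF */
    default:  return SRC_RAM;       /* everything else */
    }
}
```

For example, select_source(0xA000) yields SRC_BASIC, while the screen memory address 0x0400 yields SRC_RAM.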

One thing we haven't considered in our design is the way Block RAMs work. A Block RAM only shows its output one clock cycle after the address is asserted. In our design, however, we are multiplexing one clock cycle too early, meaning that by the time the data is ready, the next address might already have switched that Block RAM out of view.

The solution is to delay the address used by the multiplexer by one clock cycle as well. This results in the following changes:

...
reg [15:0] addr_delayed;
...
 always @ (posedge clk)
    addr_delayed <= addr;
...
always @*
  casex (addr_delayed)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    default: combined_d_out = ram_out;
  endcase
...
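
The effect of the delay can be imitated in software. The following C sketch models the three memories as registers that latch on every clock edge, with a multiplexer that can select on either the current or the delayed address. All names (mem_regs, clock_edge, mux) are invented for the illustration, and it is a simplified model, not a cycle-accurate simulation of the design:

```c
#include <stdint.h>

typedef struct {
    uint8_t  basic_out, kernel_out, ram_out;  /* registered memory outputs */
    uint16_t addr_delayed;                    /* address seen one clock ago */
} mem_regs;

static uint8_t basic_rom[8192], kernel_rom[8192], ram[65536];

/* One rising clock edge: every memory latches the byte for the current address. */
static void clock_edge(mem_regs *r, uint16_t addr)
{
    r->basic_out    = basic_rom[addr & 0x1FFF];
    r->kernel_out   = kernel_rom[addr & 0x1FFF];
    r->ram_out      = ram[addr];
    r->addr_delayed = addr;
}

/* Combinational multiplexer, selecting on the address passed in. */
static uint8_t mux(const mem_regs *r, uint16_t select_addr)
{
    switch (select_addr >> 13) {
    case 0x5: return r->basic_out;
    case 0x7: return r->kernel_out;
    default:  return r->ram_out;
    }
}
```

If the CPU presents $A000 on one clock edge and a RAM address in the following cycle, mux(&r, r.addr_delayed) still returns the byte latched from the BASIC ROM, whereas selecting on the new address would return ram_out instead.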


Preparing for Simulation

All the logic for our 6502 system is currently wrapped in a module called c64_core that is contained in Design sources, used for performing synthesis.

We also have a similar module within our simulation sources, containing code for assisting a simulation.

With this setup you would develop in the copy contained in simulation sources, making it easy to run a simulation now and again to check that you are on the right track.

Once finished with your development though, you would need to copy your changes to c64_core in Design sources.

This copy and pasting can be quite error prone. A better approach would be to let both the design and simulation sources share the same c64_core module. Then, within the simulation sources you create a top module surrounding the c64_core module. This top module would then contain all the simulation specific code.

Let us start with this top module. First, let us look again at the signature of c64_core module:

module c64_core(
  input wire clk_in,
  input wire rst,
  input wire debug_clk,
  input wire debug_mode,
  output wire [15:0] addr_out
    );


The resulting top module is quite simple:

reg clk = 0;
reg reset = 1;
wire [15:0] addr_out;

c64_core my_core(
    .clk_in(clk),
    .rst(reset),
    .debug_clk(1'b0),
    .debug_mode(1'b0),
    .addr_out(addr_out)
        );

always #10
clk <= ~clk;        

initial begin
  #100 reset <= 0;
  #100000000 $finish;
end    


First Simulation Attempt

With our first simulation our Wave output looks as follows:


If you go through the address requests on addr_out, you will see that the last couple of address requests range between ff5e and ff63. If you look at a disassembly listing of the kernel, you will see that these addresses correspond to the following:

FF5E   AD 12 D0   LDA $D012
FF61   D0 FB      BNE $FF5E

This loop rings a clear bell from my previous blogs where I wrote emulators for other platforms. When writing a C64 emulator from scratch, you will almost always get stuck at this loop the first time.

This is good news, since it means we are on the right track.

What we need to do next is imitate values for register D012 (which is a VIC-II register), so we can get past the above loop and see if screen memory gets populated with the C64 startup message.

Getting past the $FF5E loop

To get past the $FF5E loop we can simply link register D012 to a binary counter that counts up at each clock cycle.

The implementation of the binary counter is as simple as follows:

...
reg [7:0] line_counter;
...
always @(posedge clk)
  if (rst)
    line_counter <= 0;
  else
    line_counter <= line_counter + 1;
...

And finally we change our arbitration block:

always @*
  casex (addr_delayed)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    16'hd012: combined_d_out = line_counter;
    default: combined_d_out = ram_out;
  endcase
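
Why does a free-running counter get us past the loop? The loop at $FF5E keeps reading $D012 until it reads zero. Assuming the usual 6502 timings (4 cycles for LDA absolute, 3 for a taken BNE), the CPU samples the counter every 7 clock cycles. Because 7 and 256 share no common factor, the sampled values eventually step through every value of the 8-bit counter, including zero. The C sketch below models this; reads_until_zero is a hypothetical helper and not part of the design:

```c
#include <stdint.h>

/* Counts how many reads of a free-running 8-bit counter it takes to see zero,
   when the counter advances by cycles_per_loop between consecutive reads.
   Returns 0 if zero is not seen within 256 reads. */
static unsigned reads_until_zero(uint8_t first_read, unsigned cycles_per_loop)
{
    uint8_t  value = first_read;
    unsigned reads = 1;

    while (value != 0 && reads <= 256) {
        value = (uint8_t)(value + cycles_per_loop);  /* wraps modulo 256 */
        reads++;
    }
    return value == 0 ? reads : 0;
}
```

reads_until_zero(7, 7) returns 256, which is the worst case for a 7-cycle loop. A loop length that shares a factor with 256 can miss zero altogether: reads_until_zero(1, 8) returns 0.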


When we run the simulation again with the above changes, our wave output looks as follows:


If you now compare these addresses to a disassembly listing again, you will get to the following section:

; wait for return for keyboard
E5CA   20 16 E7   JSR $E716
E5CD   A5 C6      LDA $C6
E5CF   85 CC      STA $CC
E5D1   8D 92 02   STA $0292
E5D4   F0 F7      BEQ $E5CD
E5D6   78         SEI
E5D7   A5 CF      LDA $CF
E5D9   F0 0C      BEQ $E5E7
E5DB   A5 CE      LDA $CE
E5DD   AE 87 02   LDX $0287

I got this disassembly listing from ffd2.com

From this we can gather that our simulation got to the point where it is waiting for keyboard input, which happens just after C64 bootup.

Ok, I am pretty convinced the C64 boot process went fine, but I am itching to check one more thing: whether screen memory at memory location 1024 is populated with the welcome message.

Checking Screen memory for welcome message

As our FPGA implementation is at the moment, we don't really have a way to inspect the contents of our 64KB RAM. We therefore need to modify our debug mode functionality to return the information we want.

Firstly, let us start to modify the header of our c64_core module for returning the relevant information:

module c64_core(
  input wire clk_in,
  input wire rst,
  input wire debug_clk,
  input wire debug_mode,
  input wire [15:0] addr_in,
  output wire [7:0] data_out
    );

We have changed our addr_out to addr_in and added an output wire returning the data for the requested address.

Next thing we should do, is to disconnect our cpu from any clock once our core turns into debug mode. We do this by introducing an extra clocking wire for our CPU:

...
wire cpu_clk;
...
assign cpu_clk = debug_mode ? 1'b0 : clk_in;
...
cpu mycpu ( cpu_clk, rst, addr, combined_d_out, ram_in, WE, 1'b0, 1'b0, 1'b1 );
...

Next up, it is important to give our RAM logic the ability to get an address from two sources, depending on whether debug mode is selected:

...
wire [15:0] addr_ram_in;
...
assign addr_ram_in = debug_mode ? addr_in : addr;
...
assign data_out = ram_out;
...
 always @ (posedge clk)
    begin
     if (WE) 
     begin
      ram[addr] <= ram_in;
      ram_out <= ram_in;
     end
     else 
     begin
      ram_out <= ram[addr_ram_in];
     end 
    end 
...

We are done with our changes within c64_core. Next we make some modifications to the top module for our simulation.

First some declaration changes:

...
reg [15:0] index;
reg debug_mode = 0;  // set to 1 from the initial block
wire [7:0] d_out;
...
c64_core my_core(
    .clk_in(clk),
    .rst(reset),
    .debug_clk(clk),
    .debug_mode(debug_mode),
    .addr_in(index),
    .data_out(d_out)

        );


The index register I have defined will be updated by a loop, which I will discuss shortly.

We end off by modifying our initial block for our simulation:

initial begin
  #100 reset <= 0;
  #100000000 
  #20 debug_mode <= 1;
  for (index=1024; index<1500; index = index +1 ) 
  begin
    #20 $display("%d",d_out);    
  end  
  $finish;
end    

We have added a for-loop. For-loops are provided in Verilog to aid in simulation. I have read a couple of sources stating that a for-loop will indeed synthesise to something on an FPGA, but the end result would not necessarily be the result that you want. So the golden rule: Only use for-loops in simulations.

In our for-loop we keep incrementing the register index from 1024 until it reaches 1500. Each time, within the for-loop, we wait 20 simulation time units (defined by #20). This has the effect of executing our for-loop once every clock cycle.

Within our for-loop we have also introduced a new system task called $display. It works very similarly to printf in C. In our case we output the value of d_out at each increment. This loop will in effect output the first half of screen memory to the console.

When running the simulation with our changes, the output of the Tcl console will look as follows:


The output starts with a train of 32's, which is a space if you look at the screencode table. This looks promising. Scrolling down we do eventually see some signs of a message:


Converting these screencodes to the actual characters yields the following:

42 = *
42 = *
42 = *
42 = *
32 = SPACE
3  = C
15 = O
13 = M
13 = M
15 = O
4  = D
15 = O
18 = R
5  = E


This is exactly the first part of the C64 welcome message.
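
The conversion can also be done in code. The C sketch below covers only the screen codes for @, the letters A to Z, and the range 32 to 63 (space, punctuation and digits), which is enough for the welcome message. The function name and the use of '?' for everything else are choices made for this illustration:

```c
#include <stdint.h>

/* Convert a C64 screen code to ASCII; returns '?' for codes not handled here. */
static char screen_code_to_ascii(uint8_t code)
{
    if (code == 0)
        return '@';
    if (code >= 1 && code <= 26)
        return (char)('A' + code - 1);  /* 1..26 are the letters A..Z */
    if (code >= 32 && code <= 63)
        return (char)code;              /* same values as ASCII */
    return '?';
}
```

Feeding it the values listed above (42, 42, 42, 42, 32, 3, 15, 13, 13, 15, 4, 15, 18, 5) gives the characters of "**** COMMODORE".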

We can conclude that our simulation went fine up to the point of showing the welcome message.

In Summary

In this post we managed to successfully run a simulation for booting the C64 system and populating screen memory with the welcome message.

In the next post we will attempt to run the C64 boot process on the ZYBO board itself.

Till next time!

Friday 1 December 2017

Programming the ARM Cortex

Foreword

In the previous post we developed the FPGA implementation for running the 6502 Test Suite written by Klaus Dormann on the Zybo board.

In this post we will be writing an ARM Cortex program for controlling our FPGA implementation, that is starting it up and monitoring the status of the Testsuite execution.

Opening the Xilinx SDK

We will be developing our ARM Cortex program within the Xilinx SDK.

The Xilinx SDK gets installed as part of the Vivado installation process.

The Xilinx SDK can be launched from Vivado, but there are a couple of steps we need to do beforehand.

As you remember, we ended off by running Synthesis on our FPGA implementation and verified that there were no errors.

The next step is to generate a bitstream. This is done by clicking on Generate Bitstream in the left panel. Follow the prompts and wait for the process to complete.

We can now start preparing for the launch of the Xilinx SDK.

First export the hardware by selecting File/Export/Export Hardware:


On the resulting screen ensure that the Include Bitstream option is selected and Click OK:


We are now ready to launch the Xilinx SDK. Under File select Launch SDK. On the resulting dialogue click OK.

Xilinx SDK will now start up:


Xilinx SDK is based on Eclipse, so similar concepts apply; for example, you can have a couple of projects within the same Workspace.

As you can see, our Workspace already has one project called design_1_wrapper_hw_platform_0. This project contains some code for initialising our hardware platform at startup.

Our application will be contained in another project in the same workspace. So select File/New/Application Project.

Give your project a meaningful name and click Next. On the next page we need to select a template for our new project. The Hello World template, selected by default, will do. Click Finish.

You will see two new nodes created in the Project Explorer Panel:

The first Node, Test_Suite_Run, is your new project.

The folder ending with _bsp is a Board Support Package. This folder contains the necessary include files and libraries that your program will need to get to the hardware specific stuff of the core you are using.

The helloworld program itself is within Test_Suite_Run within the src folder called helloworld.c. This is the file we will use to add our code for controlling our custom core.

Getting all the info together

Let us now get all the information together that is needed to write our ARM Cortex program.

As you know, we will be communicating with our core via a GPIO block. The important pieces of information we need here are the pin assignments. We can get this information by looking at gpio_manipulator.v. In summary, here is the required information:

  • GPIO Inputs (Bits 15:0): Address input
  • GPIO Output (Bit 16): clk_gen_reset
  • GPIO Output (Bit 17): rst
  • GPIO Output (Bit 18): debug_mode
  • GPIO Output (Bit 19): debug_clk
The next piece of information we need is: How do we communicate with the GPIO from our ARM Cortex program?

Like many other peripherals in an ARM system, the GPIO is a set of registers mapped into the memory space. So firstly we need to know the memory address of our GPIO peripheral.

We get this info by opening our Block Design in Vivado. You will see that next to the Design tab there is an Address Editor tab:


Click on the Address Editor tab and you will see the required info:


As you see our gpio block is mapped to address 0x4120_0000 in memory space.

At this point you may be wondering how to use these registers. Xilinx provides this information in a Product Guide that is publicly available on their web site. To get to this guide, do an Internet search with the search terms Product Guide Xilinx GPIO. One of the first hits will be something like AXI GPIO v2.0 LogiCORE IP Product Guide (PG144). This is the guide we are after. On page 10 of this guide some more information is provided on how to use these registers.

Our GPIO instance only has a single channel, so only address offsets 0x0 and 0x4 are applicable. In our design we didn't connect up the tristate register, so this leaves us with only register 0x0 that we need to use.

The access type column indicates that register 0x0 accepts reads and writes. So, using the pin assignments listed earlier, we only need to read or write the applicable bits in this register to have the desired effect.
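
The pin assignments and register offsets gathered so far can be captured in a few definitions. This is a sketch only: the macro and function names are invented here, while the base address, the data register offset and the bit positions come from the Address Editor, the product guide and gpio_manipulator.v respectively.

```c
#include <stdint.h>

#define GPIO_BASE_ADDR   0x41200000u  /* from the Address Editor */
#define GPIO_DATA_OFFSET 0x0u         /* channel 1 data register */

#define BIT_CLK_GEN_RST  16
#define BIT_RST          17
#define BIT_DEBUG_MODE   18
#define BIT_DEBUG_CLK    19

/* Return regval with the given bit set. */
static uint32_t set_bit(uint32_t regval, unsigned bit)
{
    return regval | (1u << bit);
}

/* Return regval with the given bit cleared. */
static uint32_t clear_bit(uint32_t regval, unsigned bit)
{
    return regval & ~(1u << bit);
}

/* Bits 15:0 of a value read back carry the address output of the core. */
static uint32_t address_field(uint32_t regval)
{
    return regval & 0xFFFFu;
}
```

On the board, a value built this way would be written with Xil_Out32(GPIO_BASE_ADDR + GPIO_DATA_OFFSET, regval). Note that setting the debug clock means OR-ing in 1 << 19 (0x80000), not the number 19.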

Writing our ARM Cortex program

We finally have enough information to start writing our ARM Cortex program.

I will start outlining what we want to achieve in pseudo code:

  1. Initialise both reset pins (i.e. rst and clk_gen_rst) as asserted
  2. Wait one second
  3. Pull clk_gen_rst pin down
  4. Wait one second
  5. Pull rst pin low
  6. Wait two minutes
  7. Assert debug_mode pin
  8. Repeat 20 times 
    1. Toggle debug_clk
    2. Read address output pins
    3. Output the value of the address pins to UART
Just a quick explanation of the pseudo code.

In step 3, with the clk_gen_rst pin pulled down, the clock generator will start oscillating. It will, however, take a short time for the clock generator to reach a stable state. Strictly speaking we should look at the lock signal of the clock generator to know when it is in a stable state.

To keep things simple we haven't connected the lock pin. Instead, we will just wait a second which is more than enough time for our clock generator to reach a stable state.

Once our clock generator is in an assumed stable state, we can pull the reset pin of our custom core low. This will initiate the execution of the test suite.

We then wait two minutes, which should be more than enough time for our core to finish the Test Suite.

After two minutes we assert the debug_mode pin. This will shift the clock source used by our core from the clock generator to debug_clk, which we will manually clock in our code.

In step eight we enter a short loop, where we toggle the clock, read the address output of our core and output it to the UART.

Next we will implement this algorithm. Open up helloworld.c and modify it so that it looks as follows:

#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xil_io.h"
#include <unistd.h>

int main()
{
    init_platform();

    print("Hello World\n\r");

    u32 regval = (1 << 16) | (1 << 17);
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    //Bit 16: release the clock generator reset
    regval = ~(1 << 16) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 17) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(120000000);
    regval = regval | (1 << 18);
    Xil_Out32(0x41200000, regval);
    for (int i = 0; i < 20; i++) {
     usleep(10000);
     u32 in = Xil_In32(0x41200000);
     in = in & 0xffff;
     printf("in %x\n\r",in);
     regval = regval | (1 << 19);
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = ~(1 << 19) & regval;
     Xil_Out32(0x41200000, regval);
    }
    cleanup_platform();
    return 0;
}


We include two additional headers:

  • unistd.h: Header file containing usleep (microsleep). 
  • xil_io.h: Header file containing functions for reading and writing memory-mapped registers, such as those of the GPIO block.
As you can see, we use Xil_Out32 to write data to GPIO and Xil_In32 to read data from GPIO.

We are now ready to run our program on the ZYBO board. Ensure that the ZYBO board is plugged into your PC via the USB port and switch it on.

Next we should program the FPGA with our implementation. Do this by clicking on Program FPGA:


With the FPGA programmed, click on the Debug button and select Debug/Launch on Hardware(System Debugger):


After a couple of seconds, you will see the first line within your main method gets hit as a breakpoint:


At this point we need to start a terminal session with the UART on the ZYBO. Do this by issuing the following command:

screen /dev/ttyUSB1 115200

Now let the program run to completion. This will take about two minutes. The terminal output will look more or less like the following:


In this instance our core reached the loop at address 339a. If you have a look at the source code for Klaus Dormann's Test Suite, you will see that the Test Suite was successful if this loop was reached.

So, we know we have done the FPGA implementation correctly and the Arlet core is correct.

In Summary

In this post we wrote the ARM Cortex program for controlling our core and monitoring the execution of the Test Suite.

We confirmed that our implementation was correct.

In the next post we will try to boot our FPGA implementation with the C64 ROMS.

Till next time!