Sunday 31 December 2017

Booting C64 on FPGA

Foreword

In the previous post we managed to boot the C64 system in a Verilog simulation.

In this post we will attempt to boot the C64 system on the ZYBO board.

Tweeking existing FPGA design

We need to make a couple of changes in order for the FPGA to boot the C64 system. The majority of changes for this we already did within the c64_core module. So, for our FPGA design we can just replace the code within our c64_core IP with the code we did in the previous post.

Above paragraph sounds like a mouthful, so how do we go about doing it? First, let us just refresh ourselves with how the block design looks like currently:


To get to this diagram, you double click on the block design within the sources tab with the project open in Vivado:

Right click on c64_core_0 and select Edit in IP Packager:


You will be presented with the same kind of Wizard as in a previous post where we wrapped the c64_core into an IP.

Apply the verilog changes and move to the last page of the Wizard (e.g. Review and Package) and click Repackage IP.

The IP repository will be updated, but your design will still have a old copy of the c64_core IP. For your block design to use the new version, you will need to delete it from the design and add it again.

You might recall from my previous post that we changed our module inputs and outputs of c64_core a bit. We changed the address output to an input and added another output for returning the data for given input address.

These changes to the inputs/outputs of c64_core necessitates changes to gpio_manipulator IP:

module gpio_manipulator (
  input wire [19:0] gpio_output,
  input wire [7:0] data_out,
  output wire [19:0] gpio_input,
  output wire clk_gen_rst,
  output wire rst,
  output wire debug_mode,
  output wire debug_clk,
  output wire [15:0] address_out
); 

assign clk_gen_rst = gpio_output[16];
assign rst = gpio_output[17];
assign debug_mode = gpio_output[18];
assign debug_clk = gpio_output[19];
assign address_out = gpio_output[15:0]; 
assign gpio_input[7:0] = data_out;

endmodule

With these changes made and all blocks in our block diagram been wired up again, our block diagram will look like the following:



With all this done, we are ready to run synthesis on our design again and to generate a new bitstream.

With Synthesis everything completed without an issue. However, my Vivado got stuck with an error during Bitsream generation. In short, this error told me that my fpga doesn't have enough components available to accommodate the synthesised design.

This error really puzzled me and I started to look for clues on what could cause this issue.

Eventually I found something that looked suspect: Whereas the Kernel ROM and Basic ROM utilised Block RAM, the 64KB RAM instance itself utilised distributed RAM resulting in trying to use every component it can find as a piece of RAM.

After further digging, I found the cause of the issue:

 always @ (posedge clk)
    begin
     if (WE) 
     begin
      ram[addr] <= ram_in;
      ram_out <= ram_in;
     end
     else 
     begin
      ram_out <= ram[addr_ram_in];
     end 
    end 


The issue here is that when writing to RAM one address input is used, and another address input is used when reading from it. This doesn't perfectly module a Block RAM in a Xilinx FPGA, which uses only one address input for both reading and writing.

This bug was in fact introduced in the previous post when we wanted to modify the 64KB RAM so that we can inspect the contents from a ARM Cortex program.

We can fix this bug by just using the addr_ram_in input for both scenarios.

With this fix our bitstream generation will complete without any issues.

Testing

Time to test the C64 bootup on our FPGA device. For this purpose we will again use a C program running on the ARM Cortex in the Zynq to evaluate the execution results.

Here is the complete C program:

#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xil_io.h"
#include <unistd.h>

int main()
{
    init_platform();
    //Output: bit 16: Reset clock system
    //Output: bit 17: Reset C64 core
    //Output: bit 18: Enable Debug mode
    //Output: bit 19: Debug Clock
    //Output: bit 15-0: Address within 64KB RAM that we want to read
    //Input: bit 7:0: Data output for given address as specified in previous line
    print("Hello World\n\r");
    u32 regval = (1 << 16) | (1 << 17);
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 16) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 17) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(20000000);
    regval = regval | (1 << 18);
    Xil_Out32(0x41200000, regval);
    for (int i = 1024; i < 1500; i++) {
     regval = regval & ~0xffff;
     regval = regval | i;
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = regval | (1 << 19);
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = ~(1 << 19) & regval;
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     u32 in = Xil_In32(0x41200000);
     in = in & 0xff;
     printf("in %x\n\r",in);
     usleep(500000);
    }
    cleanup_platform();
    return 0;
}


To aid in making things clearer I have added explanatory comments in the beginning for the different bit positions.

We start off once again by resetting the clock system and the C64 core after which we wait about 20 seconds.

After the 20 second wait we display the contents of the first half of screen memory pausing half a second for each value. It is important to note that since we are in the debug mode at this time, we are responsible for clocking the 64KB Block RAM from our program at each read.

When our program starts outputting the values of screen memory, you will see a couple of 20's (e.g. hex value for screen code space). Keep watching, since you will eventually see the screen codes for the welcome message:


The four 2a's corresponds to the four asterisks of the welcome message.

It appears that our FPGA device managed to boot the C64 system successfully!

In Summary

In this post we managed to boot the C64 system on the Zybo board from the simulation sources of the previous post.

In the next post we will explore how to interface with the SDRAM that comes equipped with the ZYBO.

Having limited Block RAM available on the FPGA, SDRAM access opens new horizons for us like buffering the output video frames and resizing it so it appears properly on an LCD screen.

Till next time!

Wednesday 13 December 2017

Booting the C64 System

Foreword

In the previous post we managed to successfully run Klaus Dormann's Test Suite on the Zybo Board.

In this post we will extend our implementation to boot the C64 system.

At the end of this post we will run the resulting implementation in a simulator and in the next post we will get to running it on the ZYBO board.

Adding the C64 ROMS

In order to boot the C64 system we need to add the two ROMS, e.g. BASIC and KERNEL to our design.

The process would be more or less the same as we did with adding the TestSuite binary in previous posts.

Since we are working with ROMS, however, we will only be adding logic to read data from the Block RAM and no write logic.

Since we are dealing with two ROMS and in later posts three ROMS when adding the Chargen ROM, it make sense to extract the common logic into a module of its own. The signature of this module will look as follows:

module rom#(
 parameter ADDR_WIDTH = 13,
 parameter ROM_FILE = ""

)

(
  input clk,
  input wire [ADDR_WIDTH-1:0] addr,
  output reg [7:0] rom_out
    )


You will notice our signature contains an extra section preceded by a hash, which is a style we haven't use before.

The hash section is basically a parameter section, declaring parameters with default values. The nice thing about these parameters is that you can override these values when you create a module instance with suitable values.

In the parameter section of our rom module we have the parameter ADDR_WIDTH with a a default value of 13.  This means that if you instantiate a rom module instance and you don't override the ADDR_WIDTH parameter, your resulting instance can accept addresses of maximum 13 bits.

13 bits gives us 8KB of addressable space. This default is sufficient for both the BASIC ROM and the KERNEL.

In later posts, however, where we will be adding the CharROM which is only 4KB we will need to override the ADDR_WIDTH with a value of 12.

Let us now look at the meat of our rom module:

reg [7:0] rom[2**ADDR_WIDTH-1:0];

 always @ (posedge clk)
    begin
      rom_out <= rom[addr];
    end 

    
initial begin
      $readmemh(ROM_FILE, rom) ;
    end    


We begin by defining an array that will contain the contents for the applicable ROM. In defining the size of the array we make use of the ADDR_WIDTH parameter defined previously.

We populate the contents of this array with an initial block similarly as we did in a previous post.

We define a always block for pushing the contents for given address to an output register on the positive transition of the clock pulse.

With our rom module defined, we can now create some instances of it in our main module:

rom #(
 .ROM_FILE("/home/johan/Documents/roms/kernel.hex")
) kernel(
  .clk(clk),
  .addr(addr[12:0]),
  .rom_out(kernel_out)
    );

rom #(
 .ROM_FILE("/home/johan/Documents/roms/basic.hex")
) basic(
  .clk(clk),
  .addr(addr[12:0]),
  .rom_out(basic_out)
    );


For both instances we send as paramater the location to a hex formatted file containing the content for applicable ROM.

For the address we send through the least 13 bits of the address bus.

We are missing some arbitration logic that will ensure, depending on the given address whether we return the contents of the BASIC ROM, KERNEL or our 64KB RAM.

Adding Arbitration Logic

The logic for performing arbitration is as follows:

...
reg [7:0] combined_d_out;
...
always @*
  casex (addr)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    default: combined_d_out = ram_out;
  endcase
...

The function of this logic can be represented in a diagram as follows:


All our storage elements, BASIC, KERNEL and our RAM gets fed to a multiplexer and we use the address as selector to decide which one gets send to the DI input of the 6502 CPU.

Let us now look at our piece of Verilog code in more detail. This will indeed look familiar to programmers as a case/switch statement.

This case statement, however, starts with casex instead of case. This is a special kind of Verilog statement, where in the selector you can specify Don't care values.

A don't care value you sepcify with an X, and means that this position can be any value.

Strictly speaking, if you look at our case statement, you could have only connected only the most significant three bits to our case statement, since the lower thirteen doesn't serve any purpose. But, as you will see later, we will need to full addresses for a scenario where will check for a specific address.

One thing we haven't consider in our design is the way Block RAMS work. Block RAMS only show the output a clock pulse after the address is asserted. In our design, however, we are multiplexing one clock cycle to early, meaning that by the time the data is ready, we might have switched that block rom out of view with the next address.

The solution would be to delay address input also by one clock cycle. This will result into the following changes:

...
reg [15:0] addr_delayed;
...
 always @ (posedge clk)
    addr_delayed <= addr;
...
always @*
  casex (addr_delayed)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    default: combined_d_out = ram_out;
  endcase
...


Preparing for Simulation

All our for our 6502 system is currently wrapped in module called c64_core that is contained in Design sources, used for performing synthesis.

We also have a similar module within our simulation sources containing code for assisting a simulation.

With this current setup you would develop in the copy contained in simulation sources, making it is easy run a simulation now and again to check if you are on the right track.

Once finished with your development though, you would need to copy your changes to c64_core in Design sources.

This copy and pasting can be quite error prone. A better approach would be to let both the design and simulation sources share the same c64_core module. Then, within the simulation sources you create a top module surrounding the c64_core module. This top module would then contain all the simulation specific code.

Let us start with this top module. First, let us look again at the signature of c64_core module:

module c64_core(
  input wire clk_in,
  input wire reset,
  input wire debug_clk,
  input wire debug_mode,
  output wire [15:0] addr_out
    );


The resulting top module is quite simple:

reg clk = 0;
reg reset = 1;
wire [15:0] addr_out;

c64_core my_core(
    .clk_in(clk),
    .rst(reset),
    .debug_clk(1'b0),
    .debug_mode(1'b0),
    .addr_out(addr_out),
        );

always #10
clk <= ~clk;        

initial begin
  #100 reset <= 0;
  #100000000 $finish;
end    


First Simulation Attempt

With our first simulation our Wave output looks as follows:


If you go through the address requests of addr_out, you will see that the last couple of address requests ranges between ff5e-ff63. If you look at Disassembly listing of the kernel, you will see these addresses corresponds to the following:

FF5E   AD 12 D0   LDA $D012
FF61   D0 FB      BNE $FF5E

This loop rings a clear bell from my previous blogs where I wrote emulators for other platforms. Writing a C64 from scratch, you will most probably always got stuck at this loop for the first time.

This signals good news, since we are on the right track.

What we need to do next, is imitate values for register D012 (which is a VIC-II register) , so we can get past above loop, and see if screen memory get populated with the C64 startup message.

Getting past the $FF5E loop

To get past the $FF5E loop we can just link the memory register to a binary counter counting up at each clock cycle.

The implementation of the binary counter is as simple as follows:

...
reg [7:0] line_counter;
...
always @(posedge clk)
  if (rst)
    line_counter <= 0;
  else
    line_counter <= line_counter + 1;
...

And finally we change our arbitration block:

always @*
  casex (addr_delayed)
    16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
    16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
    16'hd012: combined_d_out = line_counter;
    default: combined_d_out = ram_out;
  endcase


When our run simulation again with above changes, our wave output looks as follows:


If you now compare these addresses to a disassembly listing again, you will get to the following section:

; wait for return for keyboard
E5CA   20 16 E7   JSR $E716
E5CD   A5 C6      LDA $C6
E5CF   85 CC      STA $CC
E5D1   8D 92 02   STA $0292
E5D4   F0 F7      BEQ $E5CD
E5D6   78         SEI
E5D7   A5 CF      LDA $CF
E5D9   F0 0C      BEQ $E5E7
E5DB   A5 CE      LDA $CE
E5DD   AE 87 02   LDX $0287

I got this dissasemmbly listing from ffd2.com

From this we can gather that our simulation got to the point where it is waiting for keyboard input, which just after C64 bootup.

Ok, I am pretty convinced the C64 boot process went fine, but I am itching to check one more thing: Checking whether screen memory at memory location 1024 is populated with the Welcome message.

Checking Screen memory for welcome message

As our FPGA implementation is at the moment, we don't really have a way to inspect the contents of our 64KB RAM. We therefore need to modify our debug mode functionality to return the information we want.

Firstly, let us start to modify the header of our c64_core module for returning the relevant information:

module c64_core(
  input wire clk_in,
  input wire reset,
  input wire debug_clk,
  input wire debug_mode,
  input wire [15:0] addr_in
  output wire [7:0] data_out
    )

We have change our addr_out to addr_in and added an output wire returning data for requested address.

Next thing we should do, is to disconnect our cpu from any clock once our core turns into debug mode. We do this by introducing an extra clocking wire for our CPU:

...
wire cpu_clk;
...
assign cpu_clk = debug_mode ? 1b'0 : clk_in;
...
cpu mycpu ( cpu_clk, rst, addr, combined_d_out, ram_in, WE, 1'b0, 1'b0, 1'b1 );
...

Next up, it is important to give our RAM logic the ability to get an address from two sources, depending on whether debug mode is selected:

...
wire [15:0] addr_ram_in;
...
assign addr_ram_in = debug_mode ? addr_in : addr;
...
assign data_out = ram_out;
...
 always @ (posedge clk)
    begin
     if (WE) 
     begin
      ram[addr] <= ram_in;
      ram_out <= ram_in;
     end
     else 
     begin
      ram_out <= ram[addr_ram_in];
     end 
    end 
...

We are done with our changes within c64_core. Next we some make some modifications to the top_module for our simulation.

First some declaration changes:

...
reg [15:0] index;
wire [7:0] d_out;    
..    
c64_core my_core(
    .clk_in(clk),
    .rst(reset),
    .debug_clk(clk),
    .debug_mode(1'b0),
    .addr_in(index),
    .data_out(d_out)

        );


The index register I have defined will updated by a loop which I will discuss shortly.

We end off by modifying our initial block for our simulation:

initial begin
  #100 reset <= 0;
  #100000000 
  #20 debug_mode < 1;
  for (index=1024; index<1500; index = index +1 ) 
  begin
    #20 $display("%d",d_out);    
  end  
  $finish;
end    

We have added a for-loop. For-loops are provided in Verilog to aid in simulation. I have read a couple of sources stating that a for-loop will indeed synthesise to something on an FPGA, but the end result would not be necessary the result that you want. So the golden rule: Only use for-loops in simulations.

In our for-loop we keep increment the register index from 1024 till it reaches 1500. Each time, within the for loop, we wait 20 simulation periods (defined by #20) . This have the effect of executing our for-loop once every clock cycle.

Within our for-loop we have also introduced a new simulation directive called $display. It works very similar to printf in c. In our case we actually outputs the value of d_out at each increment. This loop will in effect output the first half of screen memory to the console.

When running the simulation with our changes, the output of the Tcl console will look as follows:


The output starts with a train of 32's, which is a space if you look at the screencode table. This looks promising. Scrolling down we do eventually see some signs of a message:


Converting these screencodes to the actual characters yield the following:

42 = *
42 = *
42 = *
42 = *
32 = SPACE
3  = C
15 = O
13 = M
13 = M
15 = O
4  = D
15 = O
18 = R
5  = E


This is exactly the first part of the C64 welcome message.

We can conclude our simulation went ok up the point of showing the welcome message.

In Summary

In this post we managed to successfully run a simulation for booting the C64 system and populating screen memory with the welcome message.

In the next post we will attempt to run the C64 boot process on the ZYBO board itself.

Till next time!

Friday 1 December 2017

Programming the ARM Cortex

Foreword

In the previous post we developed the FPGA implementation for running the 6502 Test Suite written by Klaus Dormann on the Zybo board.

In this post we will be writing an ARM Cortex program for controlling our FPGA implementation, that is starting it up and monitoring the status of the Testsuite execution.

Opening the Xilinx SDK

We will be developing our ARM Cortex program within the Xilinx SDK.

The Xilinx SDK gets installed as part of the Vivado installation process.

The Xilinx SDK can be launched from Vivado, but before we do, there is a couple of steps we need to do beforehand.

As you remember we ended off running the Synthesis on our FPGA implementation and verified that there was no errors.

The next step we need to do is generate a bistream. This done by clicking on Generate Bitstream in the left Panel. Follow the prompts and wait for the process to complete.

We can now start preparing for the launch of the Xilinx SDK.

First export the hardware by selecting File/Export/Export Hardware:


On the resulting screen ensure that the Include Bitstream option is selected and Click OK:


We are now ready to launch the Xilinx SDK. Under File select Launch SDK. On the resulting dialogue click OK.

Xilinx SDK will now start up:


Xilinx SDK is based on Eclipse, so similar concepts apply, like you can have a couple of projects within the same Workspace.

As you can see, our Workspace already has one project called design_1_wrapper_hw_platform_0. This project contains some code for initialising our hardware platform at startup.

Our application will be contained in another project, same workspace. So select File/New/Application Project.

Give a meaningful name for your project and click Next. On the next page we need to select a Default template for our new project. The Hello World Template, selected by default, will do. Click finish.

You will see two new nodes created in the Project Explorer Panel:

The first Node, Test_Suite_Run, is your new project.

The folder ending with _bsp is a Board Support Package. This folder contains the necessary include files and libraries that your program will need to get to the hardware specific stuff of the core you are using.

The helloworld program itself is within Test_Suite_Run within the src folder called helloworld.c. This is the file we will use to add our code for controlling our custom core.

Getting all the info together

Let us now get all the information together that is needed to write our ARM Cortex program.

As you know we will be communicating with our core via a GPIO Block. Important pieces of information we need here is the pin assignments. We can get these information by looking at the gpio_manipulator.v. In summary here is the required information:

  • GPIO Inputs (Bits 15:0): Address input
  • GPIO Output (Bit 16): clk_gen_reset
  • GPIO Output (Bit 17): rst
  • GPIO Output (Bit 18): debug_mode
  • GPIO Output (Bit 19): debug_clk
The next piece of information we need is: How do we communicate with the GPIO from our ARM Cortex program?

As many other peripherals in a ARM system the GPIO is a set of registers mapped within the memory space. So the firstly we need to know the memory address of our GPIO peripheral.

We get this info by opening our Block Design in Vivado. You will see next to the Design tab is a Address Editor tab:


Click on the Address Editor tab and you will see the required info:


As you see our gpio block is mapped to address 0x4120_0000 in memory space.

At this point you may be wondering how to use these registers. Xilinx provide this information in a product Guide that is s publicly available on there web site. To get to this guide, do an Internet search with the search terms Product Guide Xilinx GPIO. One of the first hits will be something like AXI GPIO v2.0 LogiCORE IP Product Guide (PG144). This is the guide we are after. On page 10 of this guide some more information is provided on how to use these registers.

Our GPIO instance only has a single channel, so only address offset 0x0 and 0x4 is applicable. In our design we didn't connect up the tristate register, so this leaves us only with register 0x0 that we need to use.

The access type column indicates that register 0x0 accepts reads and writes. So, using the pin assignments from the previous section, we only need to read/write to the applicable bit in register to have the desired effect.

Writing our ARM Cortex program

We finally have enough information to start writing out ARM Cortex program.

I will start outlining what we want to achieve in pseudo code:

  1. Initialise both reset pins (e.g. rst and clk_gen_rst) as asserted
  2. Wait one second
  3. Pull clk_gen_rst pin down
  4. Wait one second
  5. Pull rst pin low
  6. Wait two minutes
  7. Assert debug_mode pin
  8. Repeat 20 times 
    1. Toggle debug_clk
    2. Read address outpins pins
    3. Output to value of address pins to UART
Just a quick explanation of the pseudo code.

In step 3 with the clk_gen_rst pin pulled down, the clock generator will start oscillating. It will however take a small time period for the clock generator to reach a stable state. Strictly speaking we should look at the lock of the clock generator to know when it is in a stable state.

To keep things simple we haven't connected the lock pin. Instead, we will just wait a second which is more than enough time for our clock generator to reach a stable state.

Once our clock generator is in an assumed stable state, we can pull the reset pin of our custom core low. This will initiate the execution of the test suite.

We then wait two minutes, which should be more than enough time for our core to finish the Test Suite.

After two minutes we assert the debug_mode pin. This will shift the clock source used by our core from the clock generator to debug_clk, which we will manually clock in our code.

In step eight we enter a short loop, where we toggle the clock, read the address output of our core and outputting it to the UART.

Next we will implement this algorithm. Open up helloworld.c and modify it so that it looks like follows:

#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xil_io.h"
#include <unistd.h>

int main()
{
    init_platform();

    print("Hello World\n\r");

    u32 regval = (1 << 16) | (1 << 17);
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    //16
    regval = ~(1 << 16) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(1000000);
    regval = ~(1 << 17) & regval;
    Xil_Out32(0x41200000, regval);
    usleep(120000000);
    regval = regval | (1 << 18);
    Xil_Out32(0x41200000, regval);
    for (int i = 0; i < 20; i++) {
     usleep(10000);
     u32 in = Xil_In32(0x41200000);
     in = in & 0xffff;
     printf("in %x\n\r",in);
     regval = regval | 19;
     Xil_Out32(0x41200000, regval);
     usleep(10000);
     regval = ~(1 << 19) & regval;
     Xil_Out32(0x41200000, regval);
    }
    cleanup_platform();
    return 0;
}


We include two additional headers:

  • unistd.h: Header file containing usleep (microsleep). 
  • xil_io.h: Header file containing functions for reading and writing to GPIO.
As you can see, we use Xil_Out32 to write data to GPIO and Xil_In32 to read data from GPIO.

We are now ready to run our program on the ZYBO board. Ensure that the ZYBO board is plugged into your PC via the USB port and switch it on.

Next we should program the FPGA with our implementation. Do this by clicking on Program FPGA:


With the FPGA programmed, click on the Debug button and select Debug/Launch on Hardware(System Debugger):


After a couple of seconds, you will see the first line within your main method gets hit as a breakpoint:


At this point we need to start a terminal session with the UART on the ZYBO. Do this by issuing the following command:

screen /dev/ttyUSB1 115200

Now let the program run to completion. This will take about two minutes. The terminal output will look more ore less like the following:


In this instance our core reached the loop at address 339a. If you have a look at the source code for Klaus Dormann's Test Suite, you will see that the Test Suite was successful if this loop was reached.

So, we know we have done the FPGA implementation correctly and the Arlet core is correct.

In Summary

In this post we wrote the ARM Cortex program for controlling our core and monitoring the execution of the Test Suite.

We confirmed that our implementation was correct.

In the next post we will try to boot our FPGA implementation with the C64 ROMS.

Till next time!

Sunday 26 November 2017

Running on the fpga

Foreword

In the previous post we managed to run the 6502 Test Suite developed by Klaus Dormann on Arlet Ottens's core.

In this post we will be modifying the top module so that it can run on run on the FPGA of the Zybo board.

The ultimate goal is to be able to run Klaus Dorman Dormann's Test Suite on our FPGA implementation. To accomplish this goal we need to do two things: Creating the FPGA implementation and writing a ARM Cortex program to control our core and to verify the results of the test suite.

In this post we will only tackle the first goal, creating the FPGA implementation.

The writing of the ARM Cortex program we will tackle in the next post.

More on ZYNQ development

As mentioned in my introduction post, the Zybo board is equipped with a Zynq FPGA chip.

Apart from an FPGA fabric, the Zynq also contains a ARM Cortex core with supporting peripherals, like USB, Ethernet and so on.

Most of your FPGA designs for the ZYNQ will co-operate with the ARM Cortex.

Interfacing your design with the ARM Cortex can get quite complex and we can easily get it wrong.

To assist in above mentioned complexity, Vivado provides a block design tool, which is a visual tool where you can drag and drop components and providing functionality for auto-connecting the components together.

The only caveat with this process is that if you want to add your own cores to the block design, you first need to wrap it into an IP for Vivado to use as a drop down component.

We will discuss how to wrap our core in a moment.

The intended design

Before going into details on how to wrap our core into an IP, I will first give some context on what we want to achieve.

The following diagram summarises more or less what we want to achieve:


The 6502 System block is basically the IP we are going to wrap. This block has 4 inputs and 1 output.

The clk_in input is fed by a clock generator which will clock at 8MHz allowing our core to execute at full speed.

The other 3 inputs and one output will be connected to the ARM Cortex via a GPIO (General purpose Input/Output) block.

It should be noted that we will also be writing a program that will run on the ARM Cortex. This program will control the pins linked to the GPIO block as shown in the diagram.

The idea is to run the Klaus Test Suite at full speed on the 6502 System block for about two minutes. This time interval will be more than enough to finish the Test Suite.

After the two minutes we want to drastically slow down the clock speed and monitor the addresses been placed on the address bus (via the addr_out pin) to get an idea at which point we are with the execution of the Test Suite.

The slow down of the clock speed is performed by pulling up the debug pin to high, which we will be doing via our ARM Cortex program.

With the debug_mode pin high, our 6502 System will not be using the clk_in pin anymore for clocking, but rather the debug_clk pin.

In our ARM Cortex program we will then be toggling the debug_pin, reading the addr_out after each toggle and outputting the value to the UART via a printf.

We can connect to the UART via terminal application and get an idea where we are in execution of the Test Suite.

This is in a nutshell what we want to achieve.

Wrapping our core into an IP

Let us start off by opening up the Vivado project we have created in the previous post.

Currently the sources panel in the IDE for this project should like follows:


As you can see there currently only defined simulation sources and no Design sources.

However, in order to wrap our core into an ip, we do need design sources. The quickest way to this would be to select all three simulation sources, right click them and select Move to design sources. This would however mean that our top module file would be shared among our simulation and design. This would require for us to add some directive within our top.v file so that it can be used for both simulation and design purposes.

For purposes of this post we will try to keep things simple, and just make a copy of top.v for synthesis purposes.

Click on the plus sign to add new sources. On the window that pops up ensure that Add or create design sources is selected and click next.

On the next screen click Create File. For the file name specify c64_core.v and click ok. For remaining popups, click Yes/ok till all is gone.

Now copy and paste the contents of top.v into c64_core.v and also change module name to c64_core.

Since we are creating an IP block, we need need to provide some parameters to our module. Modify your module header so that it looks as follows:

module c64_core(
  input wire clk_in,
  input wire reset,
  input wire debug_clk,
  input wire debug_mode,
  output wire [15:0] addr_out
    );


The purpose of these parameters we have discussed in a previous section.

We can now do a bit of cleanup in this module. For instance, we can remove the always block that generates the clk signal, since we will be getting it externally.

Similarly we can remove the initial block where we change the state of the reset pin.

At this point we might be tempted to also remove the initial block where we populate the contents of the RAM, since initial blocks in general doesn't synthesise to anything at all. This initial block is, however, an exception to the rule and many FPGA Syntheses tools, including Vivado, will synthesise a core which will also initialise applicable Block RAM with the contents of given HEX file.

I will proof above statement in a later section.

There is one remaining change we need to do with our top module code. As discussed earlier, our core will not always use clk_in for clocking. When the pin debug_mode is set to high, we need to use debug_clk as the system clock. To cater for this requirement we need to add the following assignment:

assign clk = debug_mode ? debug_clk : clk_in;

With this change it is also important to change the type of clk from a register to a wire.

A final thing we should do, is to make cpu.v and alu.v part of the design sources. Do this, select these two files in the Simulation sources section within the Sources panel. Right click the selection and then click Move to Design Sources.

At this point ensure everything is saved within the project.

We are now ready to do a synthesis run to see if there is any errors in our module. In the Project Manager panel, click Run Synthesis and follow the prompts.

Once we have verified that the synthesis is successful,  we can continue.

We will be referencing this project from a new project, so first close down this project and then create a new project.

With your new project also ensure that you select your ZYBO board as the target device.

Within your new project select Tools from the main menu and then Create and Package New IP. You will be presented with the following Wizard page:

Click Next.

On the next screen, ensure that Package a specific directory is selected and click next.

On the next screen, select the path to your project that you have closed a moment ago. Click next.

Give a sensible name for your IP project and click next again.

You will be finally brought to the final screen. Clicking Finish will create yet another new project for the specific use for defining additional properties for your IP.

The screen that opens up will looks something like the following:


On the left panel you will see Packaging Steps. Clicking on each of these steps will show a different set of information regarding your IP. For most of these steps, the default settings will do.

Click on the Review and Package step and then the Package IP button. The IP project will close down and you will be presented again your new empty project.

At this point your new IP is added to your new project and ready for use.

In the next section we will start our block design for the whole system.

Doing the Block Design

Let us do the block design.

Start by clicking on Create Block Design within the project manager panel. Leave defaults as is in the popup and click ok.

You will see an empty Design Panel opening containing the message: The design is empty. Press the + button to add IP.

Click the + button as hinted by the design panel. You you will be presented with a search dialogue.

The first IP we are going to add will be the ZYNQ processing system. So, within the search box enter zynq. The ZYNQ7 Processing System will be the first item item in the search results. Double click this item and you will see the block been added to your block design:

You will see a green line at the top providing you the option to Run Block Automation. This option basically means it will do some auto connections for you.

Please select this option. Click OK on the popup. You see a couple of detail has been added:


Next IP we are going to add, is the IP we have created in this post. Search with the name you have given this IP and add it.

Next it will be necessary to link up our core with ZYNQ processing system. As mentioned earlier we will be using a GPIO block for this. So , please add a AXI GPIO core to the design.

This time we will be provided a Connection Automation option. Once again we will be using this option. This time around we will be more picky with the options and the defaults will not do:



We will only be selecting S_AXI. The GPIO side will be connecting to our IP block, and we are not ready for any connections yet. The result of auto connections will look as follows:


As you can see a couple of new blocks have been added. However, the GPIO remains unconnected as well as our own core.

We basically want to connect four ports from our core to the GPIO port of the axi_gpio component. The Block design tool, however, only allows you to connect two ports to each other, and not a couple of ports to one port.

We will need to introduce core that will aggregate our ports into one port, which we will in turn connect to the GPIO port.

Here is the verilog code for the Aggregator:

module gpio_manipulator (
  input wire [19:0] gpio_output,
  output wire [19:0] gpio_input,
  output wire clk_gen_rst,
  output wire rst,
  output wire debug_mode,
  output wire debug_clk,
  input wire [15:0] address
); 

assign clk_gen_rst = gpio_output[16];
assign rst = gpio_output[17];
assign debug_mode = gpio_output[18];
assign debug_clk = gpio_output[19];
assign gpio_input[15:0]= address; 

endmodule


This module you should wrap as an IP. Once wrapped, we should also add it to our block design.

You will notice that our gpio_manipulator only makes provision for 20 gpio pins, whereas an AXI-GPIO has 32 pins by default. It is important that we also reduce the input/output pins on the axi component to 20.

We do this reduction by first double clicking on the AXI GPIO component. When the screen opens, go to the IP Configuration tab and change GPIO width from 32 to 20.

We can now start to connect our components to the rest of the system.

At this stage you might be puzzled as the gpio_manipulator has a gpio input and a gpio port, whereas the axi gpio component only has one gpio port. How do we connect our two gpio ports to the one port?

The problem can be solved by just clicking on the + next to GPIO on the AXI GPIO component. You will see more ports become available.


gpio_io_i and gpio_io_o will look familiar, but not gpio_io_t. gpio_io_t defines for each of the 20 pins whether it is an input or an output. For now we will not worry about gpio_io_t because we are going to directly connect to gpio_io_i and gpio_io_o.

We can now continue and connect some wires.

Move the mouse cursor to the gpio_io_i pin of the axi_gpio block. You will see the mouse cursor will change into a pencil. While keeping the left mouse button depressed, move the mouse to the gpio_input pin of the gpio_manipulator block. You will see that a couple of pins that can potentially be connected will light up with a green tick, as shown below:


While the mouse cursor is over gpio_input of gpio_manipulator, release the left button of the mouse.

The connection you have just made, will look as follows:

Now, make a similar connection from the pin gpio_io_o of the axi gpio component to the gpio_output pin of gpio manipulator.

This concludes the connections between the gpio manipulator and the axi gpio.

The rest of the pins of the gpio manipulator we can now connect to our other custom core. The result will look as follows:


At this point all the pins of c64_core is connect except for clk_in. A good connection candidate for this pin is FCLK_CLK0 of the the Processing system block. The clock frequency of this pin is however 100MHz.

Well, I am sure that not many people would mind a C64 clocking at 100MHz 😃 However, such a C64 would be too fast too use, so let us try and keep to the real speed as much as possible.

We will use a clock generator to bring the clock speed down. Add a new IP and within the search box type clock. In the search results click on Clocking Wizard. The resulting block been added to the design will look as follows:

clk_out1 will be the pin to connect to clk_in of our c64 core. Obviously clk_in1 we will connect to FCLK_CLK0 which is 100MHz.

The reset pin we will connect to clk_gen_rst of our gpio manipulator. The reset of our clock generator will therefore also form part of the responsibility of our ARM Cortex program.

One final thing we should is to configure the parameters of our clock generator. Double on the clock component and go to clocking options. Scroll right to the bottom and ensure that the input clock frequency is at Auto, which will yield 100MHz:


Next, go to the Output clocks tab. In this tab we want to push down the clock frequency to 1Mhz, close to the speed of a real C64. However, the wizard wont allow us to go below 4.6MHz. Som for now we will leave it a 5Mhz. We will work around this in future posts.


We are now done with the block design. Let us see if it can synthesise. In the Sources panel, right click on design and select create HDL wrapper:


With the HDL wrapper created we can now synthesise. On the left click Run Synthesis. Verify that there are no errors.

Synthesizable Block RAM Initialisation

In a previous section I mentioned that an initial block with a $readmem will indeed synthesise into block RAM elements with there content initialised by a given hex-file.

In this section I will prove this statement with the synthesised design that we ended with in the previous section.

Assuming the final synthesis in the previous section was successful, click on Open Synthesised design in the left panel. The resulting diagram will look something like the following:


We will now keep drilling down the design till we reach the block RAMS containing the Test Suite code.

We drill down by clicking on the + on the top left of the block. Drilling into this block will yield quite a sea of wires as shown below:


Every grey block corresponds more or less to a block you have added to the block design. If you zoom into the top three blocks you will see that they are related to the ps7 processing system.

The blocks we are really interested in is the four blocks bottom right. So let us zoom into this region:


It is not so clear in the picture, but the block second from the left is c64_core. This is the block we are after, so let us drill into it. After drilling twice into this block, we are actually getting to the block RAM (I have marked them in red):


In this particular scenario the block RAMS is arranges into eight rows containing 2 Block RAMS each. As you might have guessed, each row corresponds to one bit, and the eight rows together forms an addressable byte.

The two block rams per row, is a configuration called the cascade configuration. Each block RAM can store 32Kbit information, so the left block RAM contains data for addresses for the first 32KB of memory and the one the right contains the data for the the last 32KB of memory.

The schematic view also allows us the view the init values that will be assigned to a block RAM. To view, click on a block RAM and then within the cell properties panel click on Properties. If you now scroll down, you will eventually see lots of hex values:


You will see each row of Hex number is preceeded by 256'h which is typical Verilog syntax meaning the row contains 256 bits in total. It should be noted though that the order of the bits in each row is reversed, meaning the last bits is shown first and we go down to the first bit.

Let us see if we can match up some of the data shown in this view to the actual binary data in the Klaus Test Suite binary.

The values that we will try to match will be the last four bytes of memory (e.g. the reset and IRQ vector).

To retrieve the last four bytes of memory we need to look at the second block RAM in each row. For each of the these Block RAMS we need to get the contents of INIT_7F (7F is the last init row for each Block RAM, since each BLOCK RAM has 80Hex rows, numbered from INIT_00 to INIT_7F).

Here is the data in question, from top to down:


  • 256'h47FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'hCFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'hABFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'h43FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'h8BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'h8BFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'h47FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • 256'h47FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Since we want the last four bytes, we take the first hex digit of each line:

4
C
A
4
8
8
4
4

We now convert each of these values to binary:

4 = 0100
C = 1100
A = 1010
4 = 0100
8 = 1000
8 = 1000
4 = 0100
4 = 0100

Now, each column of binary digits forms a byte (NB!! Most significant bit at the bottom). Since the last byte is first, working its way to the first bit. The mapping is thus:


  • Column 1 = address $FFFF, value = 00110110b = 3B
  • Column 2 = address $FFFE, value = 11001011b = CB
  • Column 3 = address $FFFD, value = 00000100b = 04
  • Column 4 = address $FFFC, value = 00000000b = 00
These values match up to the values we see when opening up the Klaus Test Suite Binary with a Hex editor.

In Summary

In this post we developed an FPGA core that can execute the 6502 Test Suite developed by Klaus Dormann.

With this done, however, we are only halfway there.

What still needs to be done is to write a ARM Cortex program that will reset our FPGA core and monitor the execution of the Test Suite.

Till next time!

Tuesday 21 November 2017

Simulating a complete 6502 system

Foreword

In the previous post we covered some Verilog basics.

In this post we will be developing the top module surrounding the 6502 core written by Arlet Ottens.

In the top module we will basically interface the 6502 core with a 64KB memory image of Klaus Dormann's Test Suite.

We will then run Klaus Dormann's Test Suite and see if all tests succeeds.

In this post, however, we will not be running our implementation directly on the FPGA, but rather in a simulation.

We will be using the simulation tool included in the FPGA development IDE from the vendor, called Vivado.

Installing Vivado

This blog assumes you have installed Vivado.

If you don't have Vivado you start by downloading install via this URL: https://www.xilinx.com/support/download.html

On the web page ensure that are are on the Vivado tab:



Scroll down to Full Product Installation and select the applicable download for your OS:


A web installer will download which allows you to select the components you want to install. Some components are free, while other are evaluation only.

For the purpose of this blog series the free components will do.

As part of the install ensure that Xilinx SDK is selected. This will be used to develop software running on the ARM Cortex processor.

Just a final note. As part of the installation you might be required to have a Xilinx user account. You can register on the Xilinx website and it is free!

Configuring Vivado for ZYBO

Next, let us configure the installed Vivado for use with the ZYBO board. Strictly speaking it is not necessary for this post, but since we are at the point of configuring Vivado, let us do all configuration in one go.

Firstly, we need to get the board file for the ZYBO board. This can be downloaded via the following link:

https://github.com/Digilent/vivado-boards/archive/master.zip

When you open the downloaded zip file, you will see a folder vivado-boards-master. Go into this folder.

You will see two folders, called new and old.

The old folder is folder is for pre-2015.1 Vivado versions. The folder that we will be using, is new, which is for Vivado versions 2015.1 and above.

Within the new folder you will see another folder called board_files. This folder corresponds to the following folder your within your Vivado installation:

{Vivado installation folder}/2017.1/data/boards/board_files

The part in your path 2017.1 might be different, depending on the version of Vivado you have.

Now, copy the contents of the folder board_files in the zip file to the board_files folder in your Vivado installation.

If you now restart Vivado, the ZYBO board configurations will be available in Vivado installation.

Creating a new Project in Vivado

Time to create a Vivado project for our 6502 simulation.

On the Vivado home screen select Create Project.

You will be presented with a wizard of pages.

On the first page click next.

Give your new project a name and click next.

On the next wizard page, Project Type, leave defaults as is and click next.

The next wizard page is where things starts to get interesting. When you click on the boards button, you will see a list of known boards your Vivado installation supports. Among them you will see a list of Zybo boards:



The list of Zybo boards you see in the list is the extra list of boards that appeared due to the set of steps you did in the previous section.

For simulation not that important, but when running your design on the FPGA it crucial that you select the ZYBO board corresponding to your board revision. The listed Zybo boards are very similar to each other. However, there is very subtle differences among the DDR-RAM parameters among them.

This subtle differences in DDR-PRAMETERS can just be enough to send you on a wild goose chase! I was one of those that ended in a goose chase for a week :-) I started off a project selecting Zybo Z7-10 (B.2) instead of the Zybo B.3.

With the incorrect board selected the ARM processor manage to start-up, but after executing a couple of machine code instructions, some memory location would just be modified out of the blue, causing chaos.

Anyway, back to the plot. With your board selected, click next. You will reach the summary page at which you need to click finish. In a minute or a new project will be created for you.

In the sources panel select the big '+' button to add some sources to your project:

On the first wizard page ensure that Add or create simulation sources is selected and click next.

On the next page we first need to add the two files from Arlet Ottens's core, called cpu.v and alu.v. So, click the Add Files button, and browse to these two files respectively on your hardrive, and add them.

Finally, it is neccessary to create a new file in which will create our top module. For this click the Create File module. For the filename, please specify top.v. Now hit OK and then finish.

The request files will now be added and created. After finished, your source panel will look similar to the following:



For top.v, Vivado has created an empty skelleton for you:


We will populate this skeleton in the next section.

Writing the top module

Let us now populate our top module with some code.

Firstly, let us instantiate an instance of the cpu module:

cpu  mycpu ( .clk (), 
            .reset (), 
            .AB (), 
            .DI (), 
            .DO (), 
            .WE (), 
            .IRQ (), 
            .NMI (), 
            .RDY() );

For now, all the signals is unconnected. We will connect them as we go along.

The specifics for the clock signal is defined as follows:

...
reg clk = 0;
...
always #10 
clk <= ~clk;
...

In the always block we are creating the pulsing clock for simulation purposes. Every time when the always block executes, it first wait 10 simulation cycles, indicated by the #10, and assign the inverse of the current clock state as the new value for clk.

Next, let us tie up the reset line:

...
reg reset = 1;
...
initial begin
  #50 reset <= 0; 
  #10000000 $stop;
end    
...

We start the simulation with the reset line asserted.

The value of the reset line we change within an initial block. An initial block is a initialisation block that if called just once at simulation startup. It should also be noted that initial blocks is specifically for simulation and most of the time it would not even synthesise to anything at all on the FPGA.

Let us look a bit closer to the contents of this initial block. The initial block waits 50 simulation cycles (i.e. 5 clock cycles), before setting the reset line to zero.

Our initial block has one more purpose: Killing the complete simulation after 10000000 simulation cycles.

The simulation then runs for 10000000 simulation cycles, before killing the complete simulation with the $stop directive.

The next couple of lines of the cpu core, AB, DI, DO and WE are all part of a memory interface. So next, let us spend some time for designing this memory interface.

First, we instantiate our 64KB memory array:

    reg [7:0] ram[65535:0];


This register is a bit different from our previous register declarations. It basically states that 65536 registers should created (indicated by [65536]) and each register should have eight bits (indicated by [7:0])

Now, let us define the necessary memory interface wires and a always block for assignments:

...
wire [15:0] addr;
wire [7:0] ram_in;
reg [7:0] ram_out;
...
 always @ (posedge clk)
 begin
  if (WE) 
  begin
   ram[addr] <= ram_in;
   ram_out <= ram_in;
  end
  else 
  begin
   ram_out <= ram[addr];
  end 
 end 
...

The always block closely models how a block ram element work and when doing synthesis, the tool will also pick up that want a block ram with this always block. It will then perform the synthesis accordingly.

With everything defined, we can now connect all the signals within our 6502 instance:

cpu  mycpu ( .clk (clk), 
            .reset (reset), 
            .AB (addr), 
            .DI (ram_out), 
            .DO (ram_in), 
            .WE (WE), 
            .IRQ (1'b0), 
            .NMI (1'b0), 
            .RDY(1'b1) );


For now we are not interested in the IRQ, NMI and RDY signals. For each of these signals we just choose a constant value that won't impede the operation of the CPU.

Almost done coding our top module. What remains to be done, is to populate our ram array with  a image of Klaus Dormann's Test Suite.

There is two verilog directives that help us out with ram array population:

  • readmemb: Read contents from a binary file and populate ram array with contents
  • readmemh: Read contents from text file with hexadecimal strings and populate ram array with contents.
readmemb is probably first price to use, because you can use the test suite binary as is. However, in a couple of cases I found that Vivado tools doesn't work so nice with readmemb. When trying to do the 6502 simulation with a readmemb, my Vivado IDE ended up in a endless loop. Using readmemh, however, I didn't experienced such issues.

The readmemh directive expects a text file with one hexadecimal number per line. Such a file is quite easy to create with the aid of a hexeditor.

With the binary file open in a hexeditor, copy the contents of the left panel (e.g. hexadecimal view) and paste into a text editor:



With the hex data in a text editor, you can replace each space with a newline.


There is one change you need to make to your hex file. The reset vector should be modified to start at 0400. Do this by changing the contents of memory locations fffc and fffd to 0 and 4 respectively.


With the hex file created, you can now add the following initial block within you top block:



initial begin
  $readmemh("{path to your hex file}", ram) ;
end


You need to replace {path to your hex file} with the applicable path to your hex file on your local drive.

Running the simulation

We are now ready to run the simulation.

On the left panel of the Vivado IDE, click run simulation and on the popup click Run behavioural simulation.

You will see a progress box running for a short while and then a wave window will open:


You will see that only a small time period of your simulation has run. This is due to some default Vivado settings. You can override this default if you want to, but most of the time it useful to see a quick snapshot of your simulation before running the full one.

You can resume the full simulation by click the play button on the top bar (labled run all).

You will see the display of the wave window updating while the simulation is running.

Running the Klaus Test Suite within a Verilog simulation can take ages to run. So, for now we just want some kind of idea that our 6502 environment is running more or less fine, and leave the complete test suite for the FPGA itself.

Here is some example of some sanity checks you can do:


On the wave output you do some spot checks on whether the ram give out the correct data for given addresses. remember that the output is only available in the next clock cycle for block rams as indicated by above diagram by the red lines.

This concludes our simulation exercise for this post.

In Summary

In this post we build the top module surrounding Arlet Ottens's 6502 core.

The top model contained interface to a RAM array populated with Klaus Dormann's Test suite code.

We ended off by running a simulation in Vivado and doing some spot check on whether the RAM is returning correct data for given addresses.

In the next post we will start with the physical FPGA implementation of top module we developed in this post.

Till next time!