Thursday, 4 November 2021

Running an Amiga core on a Zybo Board: Conclusion

Foreword

In the previous set of posts, I attempted to get an Amiga FPGA implementation to run on a Zybo board.

My attempts started off looking promising: I got a Motorola 68K core to run on the FPGA.

However, in the last couple of months, I hit a brick wall: Memory latency. As you might know, the memory of an Amiga clocks at a speed of 7MHz. On the Zybo board, however, I was experiencing latency limiting my access to memory to 5MHz.

At this point, I would just like to point out that I don't have anything against the Zybo board. It is just that in order to properly utilise memory on the Zybo board, and on any modern computer system for that matter, you need to access memory in a pipelined manner and use some caches.

Compared to modern day systems, the Amiga accesses memory in quite a random fashion, so caching would not really benefit the core operation of an Amiga.

In this post I will unpack this limitation of memory latency a bit more, and I will give some pointers on which direction I am going to take to try and get around this issue.

More on the issue of latency

In my journey on creating an Amiga on the Zybo, I had been looking at the MiSTer project quite a bit:

https://github.com/MiSTer-devel/Main_MiSTer/wiki

There was one paragraph on the wiki page which, if I had seen it earlier, might have saved me some pain:

SDRAM board (recommended expansion) – This small board plugs into the GPIO0 connector of the DE10-nano board. Whilst the DE10-nano has fast DDR3 memory, it cannot be used to emulate a retro EDO DRAM due to a high latency and shared usage from the ARM side. This SDR SDRAM on a daughter board is required for most cores to emulate a retro memory module.

This actually also applies to the Zybo board, and since there is not really a way to add an SDRAM module to the Zybo board, the Zybo board is not really suited for emulating an Amiga within the FPGA.

Using an alternative board

With the Zybo board out of the question for what we want to do, I thought long and hard about which board I could use for this exercise.

What we are looking for, is a board that will provide as much direct access between the FPGA and RAM as possible. This will allow us to optimise and reduce latency for our kind of RAM access patterns.

As I mentioned in the previous section, the MiSTer wiki indicates that the emulated cores use SDRAM and that DDR3 RAM is not really suitable. What is not clear, though, is how much of the latency is due to the DDR3 RAM itself and how much is contributed by sharing the memory with the ARM side.

If it is true that DDR3 RAM just by itself gives unacceptable latency for an Amiga implementation, then this would immediately eliminate quite a number of FPGA boards that could be used for this exercise.

It would be nice if we can get some ballpark figure for DDR3 latency, to see, at least in theory, if it would be possible to run an Amiga core with DDR3 memory.

One brand of memory where the DDR3 timings are readily available in the datasheets is Micron. Firstly, according to this link, a good estimate for latency based on DDR timings would be the following formula:

T_RP + T_RCD + CL

Peeking at one of the Micron datasheets, it seems that a typical value for the sum of these numbers is around 39ns. So, if there is a delay of 39ns for each memory access, we are looking at a memory speed of about 25MHz.
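As a quick sanity check on this arithmetic, the conversion from latency to access rate can be done in a tiny simulation-only Verilog module. This is just a sketch: the module name is made up, the 39ns total is the ballpark figure quoted above, and the 7MHz value is the Amiga memory clock mentioned in the foreword.

```verilog
// Simulation-only sketch (hypothetical module name), not part of the design.
module ddr3_latency_estimate;
  real total_latency_ns;  // assumed sum of T_RP + T_RCD + CL, in ns
  real amiga_mem_mhz;     // Amiga memory clock in MHz, from the foreword
  real max_access_mhz;
  real amiga_period_ns;

  initial begin
    total_latency_ns = 39.0;
    amiga_mem_mhz    = 7.0;
    // One access per latency period: f [MHz] = 1000 / t [ns]
    max_access_mhz  = 1000.0 / total_latency_ns;
    // Time budget the Amiga allows per memory cycle
    amiga_period_ns = 1000.0 / amiga_mem_mhz;
    $display("Access rate at 39ns latency: %0.1f MHz", max_access_mhz);
    $display("Amiga memory cycle:          %0.1f ns", amiga_period_ns);
    $finish;
  end
endmodule
```

Run in any Verilog simulator, this should report a rate of just over 25MHz and a memory cycle of roughly 143ns, which suggests there is some headroom if every access really costs only the raw DDR3 latency.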

So, it seems that we might be ok, at least in theory, when using DDR3 memory for emulating an Amiga core. That being said, this assumes we can control how the data is accessed from DDR3 memory.

I am going to take a gamble on this, and see how far I get implementing an Amiga on another FPGA board that supports direct access to DDR3 memory.

The question is: which board? The board I have in mind is the Arty-A7 development board, from Digilent:

This is an FPGA development board with 256MB of DDR3 RAM. The FPGA has direct access to the DDR3 RAM.

In Summary

In this post we have concluded the exercise where we attempted to create an Amiga core on a Zybo board.

I hit a brick wall with regards to RAM latency, so it is not feasible to create an Amiga core on a Zybo board.

I haven't given up yet, and I am going to see if I can succeed with a different FPGA development board, on which the FPGA can directly control access to the DDR3 memory.

The board I am going to use is the Arty-A7.

Till next time!

Sunday, 25 April 2021

Running an Amiga core on a Zybo Board: Part 4

Foreword

In the previous post we managed to get the source code of the Mini Amiga project to compile in Vivado and ran the resulting bitstream on the Basys 3 board.

With the core on the Basys 3, we did a very basic test, confirming that the address requests to external SDRAM were targeted at the Kickstart ROM area.

Our next goal is to get the Kickstart ROM to run.

You might have noticed towards the end of the previous post that I am alternating between the term Kickstart ROM and AROS ROM. Many readers are more familiar with the term Kickstart ROM, whereas we are going to run the AROS ROM image in our design. So, just keep that in mind when I interchange between the two terms.

The Kickstart ROM is 512KB in size, more than the amount of Block RAM the Basys 3 can offer. We will therefore move in this post to the Zybo board, which supports external SDRAM.

Our primary focus in this post will be to develop an interface which, given an address, will return one word of data from that location in SDRAM.

Overview

Quite a while ago, while I was still developing a C64 block for an FPGA, I also created a wrapper for accessing SDRAM on the Zybo board, here.

The purpose was to save the frames produced by the VIC-II module and render them to a VGA screen at a different frame rate.

In the design I used the IP Burst block, provided by Xilinx, to shield me from the details of the AXI protocol.

The design I ended up with, was a streaming interface, that accessed data in a serial fashion. This is a suitable interface for rendering frames to a VGA screen, where the data required is also of a serial nature.

With the streaming interface we can easily predict which data will be required in the future and prefetch it, thereby mitigating the effect of latency.

In our Amiga design, however, the 68000 will be accessing the SDRAM in a much more random fashion, so a streaming interface will not give us any real benefit. So, in this post we will be designing a very simple SDRAM interface, where we provide an address, and the interface will return the relevant word of data from SDRAM.

Obviously, with this interface SDRAM latency will work against us, but for coming posts we will first try to get the system to work and attempt to fix the latency issues later on.

Creating an AXI Block

In a previous post, here, I explained how to create an AXI block in a Zybo block design.

In that post, I have also explained how to utilise IP Burst Block in your created AXI block. As mentioned earlier, the IP Burst block shields you from the technicalities of the AXI protocol.

Well, a couple of years down the line, I tend to disagree with the statement in the previous paragraph. Working with the AXI protocol is not that bad, and using the IP Burst block is a bit of an overkill.

The thing is, when creating an AXI master block, the wizard does create a template for you that is basically an example of how to use the AXI protocol. A couple of years ago, however, this template just looked like a very complex state machine, and so I decided to use the IP Burst block as an alternative.

Having a look at the state machine that the AXI Block Wizard creates, one can see that it is basically a memory tester. It starts off writing a certain test pattern of data to memory and afterwards reads it back from memory, checking to see if it matches the test pattern.

It is easy enough to alter the path of this state machine to read or write on demand. We will tackle this in the next section.

Modifying the AXI block

Let us modify the AXI block to fit our needs.

First we need to look at the state machine that is implemented in case-statements. The state machine transitions as follows: IDLE > WRITE > READ > COMPARE > IDLE.

In order to implement our read on demand, we need to transition back to IDLE after a read or a write. Also, in the IDLE state we need to transition directly to a READ or a WRITE when given a command. We do this as follows:

           if ( init_txn_pulse == 1'b1)                                                      
             begin                                                                                        
               mst_exec_state  <= user_write ? INIT_WRITE : INIT_READ;                                                              
               ERROR <= 1'b0;
               compare_done <= 1'b0;
             end                                                                                          
           else                                                                                            
             begin                                                                                        
               mst_exec_state  <= IDLE;                                                            
             end

Here user_write is an input port we define on the module. If it is a 1, it is a write and a read otherwise.

Next let us add some ports to the AXI block for supplying commands:

...
// Users to add ports here
        input [31:0] user_address,
        input [31:0] user_data,
        input user_write,
// User ports ends
...

So, we have a port to specify an address for a read/write command. Also, for a write, we have a port to specify the data.

Let us have a look at how these ports are going to be used:

// Next address after AWREADY indicates previous address acceptance    
 always @(posedge M_AXI_ACLK)                                        
 begin                                                                
   if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)                                            
     begin                                                            
       axi_awaddr <= user_address;                                            
     end                                                              
   else if (M_AXI_AWREADY && axi_awvalid)                            
     begin                                                            
       axi_awaddr <= axi_awaddr/* + burst_size_bytes*/;                  
     end                                                              
   else                                                              
     axi_awaddr <= axi_awaddr;                                        
   end          
...
/* Write Data Generator
The template's incrementing count is disabled; the data is loaded from user_data instead  */
 always @(posedge M_AXI_ACLK)                                                      
 begin                                                                            
   if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)                                                        
     axi_wdata <= user_data;                                                            
   //else if (wnext && axi_wlast)                                                  
   //  axi_wdata <= 'b0;                                                          
   else if (wnext)                                                                
     axi_wdata <= axi_wdata/* + 1*/;                                                  
   else                                                                            
     axi_wdata <= axi_wdata;                                                      
   end              
...
// Next address after ARREADY indicates previous address acceptance  
 always @(posedge M_AXI_ACLK)                                      
 begin                                                              
   if (M_AXI_ARESETN == 0 || init_txn_pulse == 1'b1)                                          
     begin                                                          
       axi_araddr <= user_address;                                          
     end                                                            
   else if (M_AXI_ARREADY && axi_arvalid)                          
     begin                                                          
       axi_araddr <= axi_araddr/* + burst_size_bytes*/;                
     end                                                            
   else                                                            
     axi_araddr <= axi_araddr;                                      
 end
...

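In this snippet of code, we have kept the original template code more or less intact. We have changed only a few pieces: the address and data registers are now loaded from user_address and user_data, and the increments are commented out.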

You will notice that in a couple of places we use init_txn_pulse. This is used to trigger a read or a write transaction.

One thing we still need to do is to indicate to the outside world when a read/write transaction has completed and also send the data when the transaction was a read. To do this, we just need to watch the signals wnext and rnext:

...
        output reg [31:0] user_data_out,
        output reg user_data_ready,
...
    always @(posedge M_AXI_ACLK)
    begin
      if (wnext || rnext)
      begin
        user_data_ready <= 1;
      end else if(!INIT_AXI_TXN)        
      begin
        user_data_ready <= 0;
      end      
    end
...
    always @(posedge M_AXI_ACLK)
    begin
      if (rnext)
      begin
        user_data_out <= M_AXI_RDATA;
      end
    end
...

Keep in mind that M_AXI_ACLK is clocking at 100MHz, so rnext/wnext might be high for one clock cycle and low at the next one. The Mini Amiga block is clocking considerably slower than 100MHz, and will miss these notifications if we rely directly on rnext/wnext. It is for that reason that we store the values in user_data_ready and user_data_out.
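On the consuming side, a common way to pick up such a held flag in a slower clock domain is a two-flop synchroniser. The sketch below is only an assumption about how the slower side could sample user_data_ready; the module name and the names slow_clk and data_ready_sync are hypothetical and do not come from the actual design.

```verilog
// Hypothetical sketch: sample the held ready flag in a slower clock domain.
module ready_sync (
    input  wire slow_clk,         // assumed clock of the slower consumer
    input  wire user_data_ready,  // flag held high by the 100MHz AXI block
    output wire data_ready_sync   // the same flag as seen in the slow_clk domain
);
    // Two flip-flops in series reduce the chance of metastability
    // propagating into the slow domain.
    reg [1:0] sync_ff = 2'b00;

    always @(posedge slow_clk)
    begin
        sync_ff <= {sync_ff[0], user_data_ready};
    end

    assign data_ready_sync = sync_ff[1];
endmodule
```

Because user_data_out is registered together with user_data_ready and then held, it should already be stable by the time data_ready_sync goes high, so the slower side can read it at that point.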

Testing the AXI block

Let us write a module for testing the AXI block we have created in this post.

With the test we will basically write some test data to SDRAM and then read it back.

We will implement this with a very simple state machine, which will be a counter whose bits have different purposes.

This counter will be 11 bits wide, of which we will use the bits as follows:

bit 10: Read/Write
bit 9/8: Index for generating read data/address
bit 7-0: Lower part of counter
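The bit layout above can be sketched as a free-running counter with its fields split out. This is a minimal illustration rather than the actual test module: the module name and the output names are hypothetical, and the polarity of bit 10 (low meaning write) is an assumption based on the test writing data first and reading it back afterwards.

```verilog
// Hypothetical sketch of the 11-bit test counter and its fields.
module test_counter_fields (
    input  wire       clk,
    output wire       write_phase,  // assumed: high during the write half
    output wire [1:0] index,        // selects the test address/data pair
    output wire [7:0] low_count     // used to time the init_txn pulse
);
    reg [10:0] counter = 11'd0;

    // Free-running counter; it wraps around after 2048 clock cycles.
    always @(posedge clk)
        counter <= counter + 11'd1;

    // Assumption: bit 10 low = write half, bit 10 high = read half,
    // so all four locations are written before they are read back.
    assign write_phase = ~counter[10];
    assign index       = counter[9:8];
    assign low_count   = counter[7:0];
endmodule
```

With this layout each value of index lasts for 256 clock cycles, which is the window in which a single transaction has to complete.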

As mentioned above, we will be using bits 9 and 8 as an index to generate some arbitrary addresses and data values for reading and writing, which we will do as follows:

always @(*)
begin
  if (counter[9:8] == 0)
  begin
    data_out = 20;
  end else if (counter[9:8] == 1) 
  begin
    data_out = 120;
  end else if (counter[9:8] == 2) 
  begin
    data_out = 30;
  end else if (counter[9:8] == 3) 
  begin
    data_out = 10;
  end else 
  begin
    data_out = 111;
  end
end

always @(*)
begin
  if (counter[9:8] == 0)
  begin
    address = 20;
  end else if (counter[9:8] == 1) 
  begin
    address = 120;
  end else if (counter[9:8] == 2) 
  begin
    address = 30;
  end else if (counter[9:8] == 3) 
  begin
    address = 10;
  end else 
  begin
    address = 111;
  end
end

We use the lower part of the counter, bits 7 - 0, to decide when to trigger the init_txn pulse. We use such a big range to ensure that we leave enough gap for latency:

always @(posedge clk)
begin
  if (counter[7:0] == 20)
  begin
    init_txn <= 1;
  end else if (counter[7:0] == 118) 
  begin
    init_txn <= 0;
  end
end

This is enough coding for a test. Let us do some testing.

The following logic trace shows some results:

The first two lines contain the requested address. At the bottom the read/write requests are shown, of which the section shown is mostly reads.

The line, user_data_out, shows the data that is read back from memory. This is not very clear in the picture, but the values are 0x14, 0x78, 0x1e. These values converted to decimal are 20, 120 and 30.

This corresponds to the values we used in our test module.

In Summary

In this post we have created an AXI block that will enable us to read and write to SDRAM on the Zybo board.

In the next post we will be integrating this AXI block with the Mini Amiga block, and try to boot the AROS ROM.

Till next time!

Wednesday, 14 April 2021

Running an Amiga core on a Zybo Board: Part 3

Foreword

In the previous post we managed to get the fx68k core to run on a Basys 3 board, and execute a very simple machine code program.

In this post we will continue our journey in getting an Amiga core to run on a Zybo board. Having said that, in this post we will again do the exercise on the Basys 3 board. We will be finally moving to the Zybo board in the next post.

As mentioned in a previous post, we will be scrutinising the Mini Amiga project, here, that will form the basis of our Amiga exercise.

The Mini Amiga project is based on an Altera FPGA, and for that reason we need to scrutinise the code of this project before we can use it on a Xilinx board. However, this makes the journey so much more exciting.

A bit of background on the Mini Amiga project

If one reads this article on Wikipedia, one can see that the original Mini Amiga project was based on a physical Motorola 68000 CPU, and the Amiga chipset was implemented within an FPGA.

Since then, the source code of the Mini Amiga project was adjusted so it can be used within the main MiSTer project.

To understand the source code of the Mini Amiga project, a good starting point would be to look at the file rtl/minimig.v. This file used to be the top level module for the original Mini Amiga project. Looking at the inputs/outputs of this module, it is immediately obvious that this module doesn't host a 68000 implementation, and that the CPU needs to be connected externally.

To cater for an on-FPGA implementation for the 68000, a couple of wrappers were created around the minimig module. More on this in the next section.

A deeper look into the source code

In the previous section I mentioned that rtl/minimig.v used to be the top level module for the original implementation for the Mini Amiga project, and that wrappers were written to interface it with an on-FPGA 68000 implementation.

Let us have a closer look at these wrappers. We start by looking into the file Minimig.sv, which is located in the root. This file hosts a module called emu, which in turn instantiates an instance of the minimig module.

Something else that is interesting about the module emu is that it instantiates cpu_wrapper, and it is here where an instance of the fx68k core is actually created and its ports are linked up to the corresponding minimig instance.

The rest of the wrapper code gets very specific to the features of the DE10-Nano board.

For our goal of implementing an Amiga on a Zybo board, we will be creating our own wrappers around the Minimig module.

Pruning the Minimig module

When porting a complex project from one platform to another, a lot of times it is required to have an in-depth understanding of how the system works.

This is easier said than done, especially if it is not possible to play with the system on the original platform.

In our case, we have a similar scenario. Your preferred FPGA board for implementing the Mini Amiga will probably not be the DE10-Nano board, for which the current project is written. Also, buying the DE10-Nano board to gradually understand the components of the Mini Amiga project, before moving to your actual choice of FPGA board, doesn't sound like an economically viable option either.

To get around the problem of complexity, I will be stripping down the Minimig module to its most basic form, and then re-adding functionality as we go along.

We start by looking at the ports of the Minimig module, which are grouped as follows:

m68k pins
sram pins
system pins
rs232 pins
I/O
host controller interface (SPI)
Video
RTG Framebuffer control
audio
user/io
fifo/track display

In our first round, we will only be using the first three groups of ports, which are m68k pins, sram pins and system pins. The rest of the ports we will remove or comment out.

Next, let us have a look inside the minimig module, to see which instances we can remove. I have found the following to remove:

userio
rtg

Modifying top.v

We are now going to modify the top module we have created in the previous post, so it can include an instance of the minimig module.

One thing you will notice when connecting the ports of the Minimig module is that there is quite a number of clock inputs, like clk7_en, c1, c3, cck and eclk. Luckily, a module is provided in the Mini Amiga project for generating these signals given a clock input. This module is present in the file rtl/amiga_clk.v.

Let us create an instance of this module in our top module:

module top(
...
    );
...
amiga_clk amiga_clk
        (
          .clk_28(clk_28mhz),     // 28MHz output clock ( 28.375160MHz)
          .clk7_en(clk7_en),    // 7MHz output clock enable (on 28MHz clock domain)
          .clk7n_en(clk7n_en),   // 7MHz negedge output clock enable (on 28MHz clock domain)
          .c1(c1),         // clk28m clock domain signal synchronous with clk signal
          .c3(c3),         // clk28m clock domain signal synchronous with clk signal delayed by 90 degrees
          .cck(cck),        // colour clock output (3.54 MHz)
          .eclk(eclk),       // 0.709379 MHz clock enable output (clk domain pulse)
          .reset_n(~(button_reset))
        );
...
endmodule

Here is another mystery. The amiga_clk module wants a 28MHz clock input, whereas in the previous post we have defined a clock of 16MHz for clocking our CPU.

It turns out that in the Mini Amiga project, the CPU is clocked at 28MHz, whereas most of the Amiga components have a resulting clock speed of 7MHz.

We therefore need to adjust the frequency of our generated clock from 16MHz to 28MHz (actually 28.375160MHz, to be exact).

The resulting fx68k and minimig instances look as follows:

...
    fx68k fx68k(
        .clk(clk_28mhz),
        .HALTn(1),                    // Used for single step only. Force high if not used
        // input logic HALTn = 1'b1,            // Not all tools support default port values
        
        // These two signals don't need to be registered. They are not async reset.
        .extReset(reset_cpu_in),            // External sync reset on emulated system
        .pwrUp(reset_cpu_in),            // Asserted together with reset on emulated system coldstart    
        .enPhi1(phi), .enPhi2(~phi),    // Clock enables. Next cycle is PHI1 or PHI2
        .eRWn(read_write),
        .oRESETn(reset_cpu_out),

        .ASn(As), 
        .LDSn(Lds), 
        .UDSn(Uds),
        .DTACKn(data_ack), 
        .VPAn(1),
        .BERRn(1),
        .BRn(1), .BGACKn(1),
        .IPL0n(interrupts[0]), 
        .IPL1n(interrupts[1]), 
        .IPL2n(interrupts[2]),
        .iEdb(data_in),
        .oEdb(data_out),
        .eab(add)
        );
...
minimig minimig
 (
     //m68k pins
     .cpu_address(add), // m68k address bus
     .cpu_data(data_in),    // m68k data bus
     .cpudata_in(data_out),  // m68k data in
     ._cpu_ipl(interrupts),    // m68k interrupt request
     ._cpu_as(As),     // m68k address strobe
     ._cpu_uds(Uds),    // m68k upper data strobe
     ._cpu_lds(Lds),    // m68k lower data strobe
     .cpu_r_w(read_write),     // m68k read / write
     ._cpu_dtack(data_ack),  // m68k data acknowledge
     ._cpu_reset(reset_cpu_in),  // m68k reset
     ._cpu_reset_in(reset_cpu_out),//m68k reset in
     .nmi_addr(0),    // m68k NMI address

     //sram pins
     .ram_data(),    // sram data bus
     .ramdata_in(22),  // sram data bus in
     .ram_address(ram_add), // sram address bus
     ._ram_bhe(),    // sram upper byte select
     ._ram_ble(),    // sram lower byte select
     ._ram_we(),     // sram write enable
     ._ram_oe(),     // sram output enable
     .chip48(),      // big chipram read
 
     //system    pins
     .rst_ext(),     // reset from ctrl block
     .rst_out(),     // minimig reset status
     .clk(clk_28mhz),         // 28.37516 MHz clock
     .clk7_en(clk7_en),     // 7MHz clock enable
     .clk7n_en(clk7n_en),    // 7MHz negedge clock enable
     .c1(c1),          // clock enable signal
     .c3(c3),          // clock enable signal
     .cck(cck),         // colour clock enable
     .eclk(eclk)         // ECLK enable (1/10th of CLK)
 );
...

One of the changes we did here was to feed interrupts from the minimig module to our CPU.

One interesting part of the minimig module is the sram section. This section should be interfaced to external SDRAM or DDR RAM, and the idea is that every time we access a memory location that forms part of chipram/fastram or Kickstart ROM, the memory request will be redirected to these ports.

My plan is to interface the sram pins to the DDR RAM present on the Zybo board via AXI, in future posts.

In this post, however, we will only be looking at the addresses sent on the ram_address port, and verify that the addresses sent upon start-up fall within the Kickstart ROM range. More on this later.

Building and testing

When trying to synthesise the design and create a bitstream, I encountered a couple of errors in Vivado. First of all, Vivado doesn't like the concept of local registers. As an example, take a look at the following snippet from ciaa.v:

// generate a keystrobe which is valid exactly one clk cycle
always @(posedge clk) begin
	reg kms_levelD;
	if (clk7n_en) begin
		kms_levelD <= kms_level;
		keystrobe <= (kms_level ^ kms_levelD) && (kbd_mouse_type == 2);
	end
end

In this case, the register kms_levelD is a local register. To make Vivado happy, you will need to move the register declaration outside of the always block.
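Applied to the snippet above, the change could look as follows. To keep the example self-contained it is wrapped in a small stand-alone module; the module name is hypothetical, and the port list (in particular the 2-bit width of kbd_mouse_type) is an assumption rather than a copy of the real ciaa.v.

```verilog
// Hypothetical stand-alone wrapper around the keystrobe logic from ciaa.v.
module keystrobe_sketch (
    input  wire       clk,
    input  wire       clk7n_en,
    input  wire       kms_level,
    input  wire [1:0] kbd_mouse_type,  // width is an assumption
    output reg        keystrobe
);
    // The register is now declared at module scope,
    // instead of inside the always block.
    reg kms_levelD;

    // generate a keystrobe which is valid exactly one clk cycle
    always @(posedge clk) begin
        if (clk7n_en) begin
            kms_levelD <= kms_level;
            keystrobe  <= (kms_level ^ kms_levelD) && (kbd_mouse_type == 2);
        end
    end
endmodule
```

The behaviour is unchanged; only the place of the declaration differs.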

Keep in mind that there are some cases where the same local register name is used in a couple of always blocks, in which case you will need to give them all unique names, which is quite an undertaking.

I have also picked up quite a number of places in the Mini Amiga source where registers/wires are used before they are declared further down in the source file. Vivado doesn't like this either.

You will also find that the file denise_colortable_ram_mf.v will fail to compile on Vivado. This file is very specific to Altera FPGAs.

To get around this error, just comment out the instantiation of altsyncram_component. We will revisit this module in a future post.

Eventually everything compiled, and I moved on to testing. As mentioned earlier, the goal was to see if the addresses output on the ram_address port upon startup were within the range of the Kickstart ROM.

Let me spend a moment to explain this. Starting at memory address 0, we have chipram, and ROM only starts at address $F80000.

With a Motorola 68000, the above setup is problematic, because at start-up the 68000 reads the initial value for the program counter from memory location 4, which is in chipram.

On the Amiga this dilemma is solved by also mapping the Kickstart ROM to location 0 upon start-up. Early in the boot process this mapping is switched off again via the CIA (the overlay, or OVL, signal), after which the chipram starting at location 0 is free to use.
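The overlay idea can be sketched as a small address decoder. This is not the decoder used in the Mini Amiga source; the module name, the signal names and the choice to model only a 512KB window at the bottom of memory are all assumptions made for illustration.

```verilog
// Hypothetical sketch of a Kickstart overlay decoder.
module overlay_decode_sketch (
    input  wire [23:0] cpu_byte_address, // assumed 24-bit byte address
    input  wire        ovl,              // 1 = overlay active (state after reset)
    output wire        sel_kickstart,    // access goes to the Kickstart ROM
    output wire        sel_chipram       // access goes to chipram
);
    // The 512KB Kickstart ROM occupies $F80000-$FFFFFF (top five bits all 1).
    wire in_rom_range = (cpu_byte_address[23:19] == 5'b11111);

    // Assumption: only the lowest 512KB ($000000-$07FFFF) is modelled here.
    wire in_low_range = (cpu_byte_address[23:19] == 5'b00000);

    // While the overlay is active, the low window reads from ROM,
    // so the reset vectors at locations 0 and 4 come from the Kickstart image.
    assign sel_kickstart = in_rom_range | (in_low_range & ovl);
    assign sel_chipram   = in_low_range & ~ovl;
endmodule
```

With ovl high, a read from location 4 selects the ROM; once ovl is cleared, the same address selects chipram.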

So, in my test I will be checking to see if addresses starting at $F80000 will appear on the ram_address port upon startup.

In my testing, a couple of things went wrong. Firstly, the minimig module didn't create a reset signal on the _cpu_reset pin for resetting the CPU. It turned out that the userio module, that we have commented out, is responsible for generating the reset signal.

To get around this issue, I have linked up our button_reset signal to the minimig module as follows:

module minimig
(
...
 input button_reset,
...
);
...
assign cpurst = button_reset;
...
endmodule

Also, our amiga_clk instance will need to use the _cpu_reset port of the minimig module:

module top(
...
    );
...
amiga_clk amiga_clk
        (
          .clk_28(clk_28mhz),     // 28MHz output clock ( 28.375160MHz)
          .clk7_en(clk7_en),    // 7MHz output clock enable (on 28MHz clock domain)
          .clk7n_en(clk7n_en),   // 7MHz negedge output clock enable (on 28MHz clock domain)
          .c1(c1),         // clk28m clock domain signal synchronous with clk signal
          .c3(c3),         // clk28m clock domain signal synchronous with clk signal delayed by 90 degrees
          .cck(cck),        // colour clock output (3.54 MHz)
          .eclk(eclk),       // 0.709379 MHz clock enable output (clk domain pulse)
          .reset_n(~(_cpu_reset))
        );
...
endmodule

Another signal with a similar issue was the halt signal. This signal is also controlled by the userio module, which can halt the cpu on demand.

Again, for this signal we are going to implement a quick fix: we assign the value 0 to the cpuhlt wire in the minimig module.

With these changes, I was able to proceed and I got the following waveforms as result:

Looking at the ram_add signal, we can see addresses 0x3c0000 and 0x3c0001 every time just before the As signal transitions to high. These are word addresses; converting them to byte addresses, we get 0x780000 and 0x780002. This differs from the Kickstart ROM start address of 0xf80000 in just one bit, the top address bit.
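The conversion from word address to byte address is just a shift left by one bit, which can be checked with a small simulation-only module. The module name and the bus widths used here are assumptions for illustration, not taken from the Mini Amiga source.

```verilog
// Simulation-only sketch (hypothetical names and widths).
module ram_addr_check;
    reg  [22:0] word_addr;                       // address as seen on ram_add
    wire [23:0] byte_addr = {word_addr, 1'b0};   // word address * 2

    initial begin
        word_addr = 23'h3c0000;
        #1 $display("word %h -> byte %h", word_addr, byte_addr);

        word_addr = 23'h3c0001;
        #1 $display("word %h -> byte %h", word_addr, byte_addr);

        // XOR with the documented ROM start shows which bits differ.
        word_addr = 23'h3c0000;
        #1 $display("differs from f80000 in bits: %h", byte_addr ^ 24'hf80000);
        $finish;
    end
endmodule
```

The last line should print 800000, confirming that only bit 23 differs between the observed address and $F80000.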

Carefully looking at the Mini Amiga source code, it looks like these addresses are expected. Thinking about this, I realised that reducing the start address of the Kickstart ROM to 0x780000 actually makes sense. If we had kept the address 0xf80000, it would just waste RAM.

In Summary

In this post we did a quick round of getting the Minimig source code to compile in Vivado, and verified that the ram addresses are as expected.

In the next post we will be moving to the Zybo board, trying to get the AROS ROM to boot, with the ROM image sitting in SDRAM.

We will be using the AROS ROM instead of the official Kickstart ROM, since the AROS ROM doesn't require a license.

Till next time!

Wednesday, 31 March 2021

Running an Amiga core on a Zybo Board: Part 2

Foreword

In the previous post I gave a quick rundown on what I plan for the next couple of posts, which is to get an Amiga core running on a Zybo board.

In this post we will get the fx68k core running on an FPGA, which is an FPGA implementation for the 68000 CPU.

Ironically, the FPGA I will be using for this exercise will not be the Zybo board, but one of my lower spec FPGA boards, the Basys 3, which is also Xilinx based.

The reason for this decision is that on the Zybo board one needs to wrap every core into an IP block, which is a bit time consuming if you are after quick R&D. Once we get to a point where many of the components are running, we can move to the Zybo board.

Simulation issues

As with many FPGA projects, one always starts testing a design with simulation.

In the couple of years I have been playing with FPGAs, I found that the fx68k was one of the cores I wasn't able to run in a simulator. I am referring here to the simulators that are free of charge. I am not sure about commercial ones.

When trying to simulate fx68k within Vivado, it gets stuck in the Elaboration step for hours. Even after 7 hours, Vivado couldn't get past this step.

When trying to use iVerilog, it didn't understand all the SystemVerilog syntax in the design.

Verilator's support for SystemVerilog is quite good, but in the fx68k design there are some structures where some of the bits are defined as wires and the rest are defined as registers. Verilator doesn't like such structures.

So, for testing the fx68k core, I will run it on the FPGA itself. This proved not to be a major issue.

If anyone had any luck in simulating fx68k with a particular simulator, please let me know.

Setting up the project

As mentioned earlier, I will not be using my Zybo board to test the fx68k, but a Basys 3 board, in which you don't need to package all your cores into IP blocks.

We start by creating a top module block:

module top(
   input clk
    );
...
endmodule

When working with the Basys 3 board, it is important to constrain all the ports of your top level block in an XDC file.

In this case we have an input clock which we constrain as follows:

set_property PACKAGE_PIN W5 [get_ports clk]
set_property IOSTANDARD LVCMOS33 [get_ports clk]
create_clock -period 10.000 -name sys_clk_pin -waveform {0.000 5.000} -add [get_ports clk]

Since we work here with a Basys 3 board, pin W5 of the FPGA chip is connected to an external clock clocking at 100MHz.

Now 100MHz is probably a bit too fast for our design, and I am not sure if all timing requirements will be met if we clock the fx68k core that high. So, I will instantiate a slower clock, which will clock around 16MHz.

Vivado provides a Clocking Wizard, which allows you to create a template block, and all instances of this template block will output 16MHz.

We create an instance of this clock within our top module as follows:

module top(
   input btnC,
   input clk,
   output [15:0] led
    );
...
wire clk_16mhz;
...
clk_wiz_0 clk_wiz_0 
 (
  // Clock out ports
  .clk_out1(clk_16mhz),
 // Clock in ports
  .clk_in1(clk)
 );
endmodule

clk_16mhz is the clock signal clocking at 16MHz, and we will use this clock in our design.

Next, let us see how to connect the ports of the fx68k core:

    fx68k fx68k(
        .clk(clk_16mhz),
        .HALTn(1),                    // Used for single step only. Force high if not used
        // These two signals don't need to be registered. They are not async reset.
        .extReset(button_reset),            // External sync reset on emulated system
        .pwrUp(button_reset),            // Asserted together with reset on emulated system coldstart    
        .enPhi1(phi), .enPhi2(~phi),    // Clock enables. Next cycle is PHI1 or PHI2
        .eRWn(read_write),
        .ASn(As),
        .LDSn(Lds),
        .UDSn(Uds),
        .DTACKn(0),
        .VPAn(1),
        .BERRn(1),
        .BRn(1), .BGACKn(1),
        .IPL0n(1),
        .IPL1n(1),
        .IPL2n(1),
        .iEdb(data_in),
        .oEdb(data_out),
        .eab(add)
        );

There are a couple of ports that we can leave unconnected for now. Let us quickly go through the purpose of some of these ports:

extReset and pwrUp: In my design I have linked these ports to a button on the Basys board. Preferably one should pass the button through a debounce core, so you don't have erratic spikes on the reset port.
eRWn: This port indicates to memory whether the fx68k wants to read or write. A read is indicated with a 1 and a write with a 0.
ASn, LDSn, UDSn: These signals are asserted during read/write cycles. We don't really need these for our fx68k test, but it helps to see them on a signal trace to check that everything is working correctly.
iEdb and oEdb: This is for data in and data out from memory.
eab: Address request to memory
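
The debounce step suggested for extReset and pwrUp could look roughly like the sketch below. The module name button_debounce, its port names and the 16-bit counter width are assumptions made for illustration and are not part of the fx68k or Mini Amiga sources. With a 16MHz clock, a 16-bit counter means the button has to stay stable for roughly 4ms before the output changes.

```verilog
// Hypothetical debounce sketch. Names and counter width are assumptions.
module button_debounce #(
    parameter COUNTER_WIDTH = 16        // 2^16 cycles at 16MHz is roughly 4ms
)(
    input  wire clk,
    input  wire button_raw,             // asynchronous, bouncing button input
    output reg  button_clean = 1'b0     // debounced output
);
    // Two-stage synchroniser to bring the button into the clk domain
    reg sync_0 = 1'b0;
    reg sync_1 = 1'b0;
    reg [COUNTER_WIDTH-1:0] stable_count = {COUNTER_WIDTH{1'b0}};

    always @(posedge clk) begin
        sync_0 <= button_raw;
        sync_1 <= sync_0;

        if (sync_1 == button_clean) begin
            // Input agrees with the current output: restart the count
            stable_count <= {COUNTER_WIDTH{1'b0}};
        end else begin
            // Input differs from the output: count how long it stays that way
            stable_count <= stable_count + 1'b1;
            if (&stable_count) begin
                // Input has differed for a full counter period: accept it
                button_clean <= sync_1;
            end
        end
    end
endmodule
```

In the top module this could be hooked up as button_debounce reset_debounce(.clk(clk_16mhz), .button_raw(btnC), .button_clean(button_reset)); assuming that btnC is the button used for reset.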

Just a note on enPhi1 and enPhi2. In my design I toggle these signals on every clock cycle and it is important that these two signals are always the inverse of each other. So, here is my simple implementation for phi:

    always @(negedge clk_16mhz)
    begin
      phi <= ~phi;
    end

Before ending this section, I would like to mention two ROM files in the fx68k source, microrom.mem and nanorom.mem. It is important to ensure that the contents of these two files get loaded into the synthesised design.

If you open up fx68k.sv, you will see that these files are used by the following modules:

module uRom( input clk, input [UADDR_WIDTH-1:0] microAddr, output logic [UROM_WIDTH-1:0] microOutput);
	reg [UROM_WIDTH-1:0] uRam[ UROM_DEPTH];		
	initial begin
		$readmemb("microrom.mem", uRam);
	end
	
	always_ff @( posedge clk) 
		microOutput <= uRam[ microAddr];
endmodule


module nanoRom( input clk, input [NADDR_WIDTH-1:0] nanoAddr, output logic [NANO_WIDTH-1:0] nanoOutput);
	reg [NANO_WIDTH-1:0] nRam[ NANO_DEPTH];		
	initial begin
		$readmemb("nanorom.mem", nRam);
	end
	
	always_ff @( posedge clk) 
		nanoOutput <= nRam[ nanoAddr];
endmodule

It is important that these files are in a location that is accessible to the synthesis tool. If in doubt, rather use absolute paths.

During synthesis you should also see some messages indicating whether it was able to load these files successfully.

Hello World boot

Now it is time to test the fx68k core. For this purpose, we will write a very simple 68000 assembly program:

The assembly language is the column on the far right, and the machine language equivalent is in the middle column.

Before continuing, I would like to mention the tool I used to get this view. This is an online tool you can access via the following url: onlinedisassembler.com/. This tool allows you to disassemble machine code for a variety of CPUs.

Back to the machine language. If one looks at the machine code, one will realise that the 68000 is a big endian architecture, that is, the most significant part of a value is stored first in memory. This is in contrast to the i386 and ARM processors (in their usual configuration), which use little endian.

Now let us look into the test program in more detail:

movew #1285,%d0: Here we load the register d0 with the value 1285
movew %d0, 0x00858585: Next, we store the value we loaded in the register d0 to memory at location address 0x858585.
jsr 0x93e86: Jump to subroutine at memory location 0x93e86.

Somewhere in this assembly language program I have introduced something that is not allowed, and we will see later how the 68000 core handles this scenario.

Now, the next question is, how do we get the fx68k core to execute our program? We need to know what location the 68000 will jump to when it comes out of reset. Unfortunately this information is not easy to come by.

After quite a bit of searching, I found this information in the M68000 user manual on page 5-29 in the section Reset Operation. I quote:

When RESET and HALT are driven by an external device, the entire system, including the
processor, is reset. Resetting the processor initializes the internal state. The processor
reads the reset vector table entry (address $00000) and loads the contents into the
supervisor stack pointer (SSP). Next, the processor loads the contents of address $00004
(vector table entry 1) into the program counter.

So, in short, we need to ensure that our implementation returns the initial stack pointer when the CPU requests the contents of memory locations 0, 1, 2 and 3.

Similarly, our implementation should return the initial value for the program counter when the CPU requests the contents of memory location 4, 5, 6 and 7.

It should be noted that the fx68k accesses memory in chunks of 16 bits, so we need to multiply the addresses the CPU puts on the address bus by 2 to get the equivalent byte address. The implications of this will become apparent in a moment.

So, let us write some simple logic that feeds our program in chunks to the CPU:
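
As a minimal sketch of that conversion, multiplying by 2 is the same as appending a zero bit to the word address. The testbench name and signal names below are made up for illustration, and a 23-bit word address is assumed (A23 down to A1 on a real 68000).

```verilog
// Sketch: convert a 68000 word address to the equivalent byte address.
// word_to_byte_tb, word_addr and byte_addr are illustrative names.
module word_to_byte_tb;
    reg  [22:0] word_addr;                      // address as driven by the CPU
    wire [23:0] byte_addr = {word_addr, 1'b0};  // append A0 = 0, same as * 2

    initial begin
        word_addr = 23'h000002;
        #1 $display("word %h -> byte %h", word_addr, byte_addr);
        word_addr = 23'h027f00;
        #1 $display("word %h -> byte %h", word_addr, byte_addr);
        $finish;
    end
endmodule
```

Run in a simulator, word address 2 should map to byte address 4, and word address 0x27f00 to byte address 0x4fe00.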

  
always @(posedge clk_16mhz)
        begin
            if (add == 16'h0)
            begin
              data_in <= 7;
            end else if (add == 16'h1)
            begin
              data_in <= 16'h9fe7;
            end else if (add == 16'h2)
            begin
              data_in <= 4;
            end else if (add == 16'h3)
            begin
              data_in <= 16'hfe00;
            end else if (add == 20'h27f00)
            begin
              data_in <= 16'h303c; //load immediate
            end else if (add == 20'h27f01)
            begin
              data_in <= 16'h0505;
            end else if (add == 20'h27f02)
            begin
              data_in <= 16'h33c0; //store
            end else if (add == 20'h27f03)
            begin
              data_in <= 16'h0085;
            end else if (add == 20'h27f04)
            begin
              data_in <= 16'h8585;
            end else if(add == 20'h27f05)
            begin
              data_in <= 16'h4eb9;
            end else if (add == 20'h27f06)
            begin
              data_in <= 9;
            end else if (add == 20'h27f07)
            begin
              data_in <= 16'h3e86;
            end
            else begin
              data_in <= 16'h33c0;
            end
        end

As mentioned previously, the address we select by is a word address and not a byte address.

Also, as per the M68000 user guide, location 0 contains the initial value for the stack pointer, which in this case is 0x79fe7. All addresses stored in memory are byte addresses.

Location 2 contains the initial value for our program counter. As this is a word address, we need to multiply by two to get to the byte address, resolving to 4, which corresponds to the address provided in the user guide. Our initial program counter value in this case is 0x4fe00.

Finally, as seen, our snippet of code goes up to word address 0x27f07, and for all other address requests we return the value 0x33c0.
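
The same test memory can also be written with the two reset vectors kept as 32-bit constants, which makes the big endian split into 16-bit words explicit. This is only a sketch under a few assumptions: the module name test_rom and the parameter names are made up for illustration, and add is assumed to be a 23-bit word address.

```verilog
// Alternative sketch of the test memory. test_rom, INITIAL_SSP, INITIAL_PC
// and PROG are illustrative names, not taken from the fx68k sources.
module test_rom #(
    parameter [31:0] INITIAL_SSP = 32'h00079fe7,  // reset vector 0
    parameter [31:0] INITIAL_PC  = 32'h0004fe00   // reset vector 1
)(
    input  wire        clk,
    input  wire [22:0] add,       // word address from the CPU (assumed width)
    output reg  [15:0] data_in
);
    // Word address of the first instruction: byte address divided by 2
    localparam [22:0] PROG = INITIAL_PC[23:1];

    always @(posedge clk) begin
        case (add)
            // Big endian: high word first, then low word
            23'd0:        data_in <= INITIAL_SSP[31:16];
            23'd1:        data_in <= INITIAL_SSP[15:0];
            23'd2:        data_in <= INITIAL_PC[31:16];
            23'd3:        data_in <= INITIAL_PC[15:0];
            // movew #1285,%d0
            PROG + 23'd0: data_in <= 16'h303c;
            PROG + 23'd1: data_in <= 16'h0505;
            // movew %d0,0x00858585
            PROG + 23'd2: data_in <= 16'h33c0;
            PROG + 23'd3: data_in <= 16'h0085;
            PROG + 23'd4: data_in <= 16'h8585;
            // jsr 0x00093e86
            PROG + 23'd5: data_in <= 16'h4eb9;
            PROG + 23'd6: data_in <= 16'h0009;
            PROG + 23'd7: data_in <= 16'h3e86;
            // Everything else returns 0x33c0
            default:      data_in <= 16'h33c0;
        endcase
    end
endmodule
```

Keeping the vectors as parameters means a different stack pointer can be tried without editing the individual 16-bit words by hand.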

Test Results

Let us have a look at the test results. Below are some waveforms I have captured from my Basys 3 board from the point it came out of reset:

As seen, the fx68k core starts off reading word locations 0 to 3, which contain the stack pointer and program counter. The next word location that is read is address 0x27f00. Multiplying this value by 2 gives 0x4fe00, which matches the value we have provided for the program counter.

If one follows the rest of the waveform, the read_write signal eventually goes low and the value 0x0505 gets written to memory. This is still expected behaviour from our test program.

After the above-mentioned write, another write kicks off to location 0x3cff2 and then no further reads or writes occur.

Word address 0x3cff2 (i.e. byte address 0x79fe4) looks like the stack pointer address we have defined, and we could argue that the last instruction executed was a jsr, which will push an address to the stack. However, whatever the processor tried to push to the stack, it failed to finish off what it was trying to do.

I was wondering if there was something wrong with the initial stack pointer we have provided, so I started playing with different values. One of the values I used as a stack pointer was 0xc033c0.

This time around, using 0xc033c0 as the initial stack pointer, there appeared to be many more stack operations, looking at the address requests.

Something else that was interesting was the reading of word locations 6 and 7, which seemed like some kind of vector. Referring back to the M68000 user guide, I found a table on page 6-7 explaining all the vectors. As it turns out, word address 6, which is byte address 12, is the address error vector. Further on in the user guide, on page 6-19, an address error is described as:

An address error exception occurs when the processor attempts to access a word or longword operand or an instruction at an odd address.

And indeed, we were writing the value of register d0 to memory location 0x858585, which was an odd address.
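
The rule from the user guide boils down to a check on bit 0 of the byte address for word and longword accesses. The sketch below only illustrates that rule; the testbench and signal names are assumptions, and it does not model how fx68k detects the error internally.

```verilog
// Sketch: a word access to an odd byte address triggers an address error.
// odd_address_tb and address_error are illustrative names.
module odd_address_tb;
    reg  [23:0] byte_addr;
    wire        address_error = byte_addr[0];   // odd address when bit 0 is set

    initial begin
        byte_addr = 24'h858585;                 // address used in the test program
        #1 $display("%h address_error=%b", byte_addr, address_error);
        byte_addr = 24'h858586;                 // next even address
        #1 $display("%h address_error=%b", byte_addr, address_error);
        byte_addr = 24'h079fe7;                 // first stack pointer value tried
        #1 $display("%h address_error=%b", byte_addr, address_error);
        $finish;
    end
endmodule
```

Note that the first stack pointer value, 0x79fe7, is odd as well. This may explain why the first run stopped after the write to word address 0x3cff2, although I have not verified this.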

If we change the address to an even number like 0x858586, everything seems to work fine:

The return address gets pushed onto the stack at memory word locations 0x6019de and 0x6019df.
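
These two word locations follow from the way jsr pushes a 32-bit return address: the stack pointer is decremented by 4 and the long word is written high word first. A small check of that arithmetic, assuming the initial stack pointer of 0xc033c0 (the testbench name is made up for illustration):

```verilog
// Sketch: where does jsr push the return address for a given stack pointer?
// jsr_push_tb, sp and new_sp are illustrative names.
module jsr_push_tb;
    reg [23:0] sp;
    reg [23:0] new_sp;

    initial begin
        sp     = 24'hc033c0;        // initial stack pointer (byte address)
        new_sp = sp - 24'd4;        // jsr pushes a 32-bit return address
        // Dropping bit 0 converts the byte address to a word address
        $display("high word at word address %h", new_sp[23:1]);
        $display("low word at word address %h", new_sp[23:1] + 23'd1);
        $finish;
    end
endmodule
```

The new stack pointer is byte address 0xc033bc, which corresponds to word addresses 0x6019de and 0x6019df.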

We also see that eventually data is read from word location 0x049f43 onwards, which is our jsr target 0x93e86 divided by 2, so it seems that our jsr has executed correctly. For this new location our implementation will just return the value 0x33c0.

If we look back at our test program, we see that the value 0x33c0 is actually the opcode for storing the value of the register d0 to memory. This is exactly what is happening towards the end of the waveform.

In Summary

In this post we managed to get the fx68k core to run on a Basys 3 development board.

In the Test Results section, I spent a bit of time unpacking the resulting waveform.

In the next post we will continue our journey to get an Amiga core to run on a Zybo/Basys 3 board.

Till next time!

Monday, 29 March 2021

Running an Amiga core on a Zybo Board: Part 1

Foreword

It has been quite some time since my last post on this blog. Things were pretty busy at work, so I ended up taking quite a long break from writing blog posts.

Recently some of my readers asked whether it was possible to implement an Amiga on an FPGA that provides 20K Logic Elements. The summary of my answer was 'I don't know'.

The Amiga was based on the Motorola 68000 CPU, which in itself consists of far more components than a 6502. So, one doesn't even know if there would be enough Logic elements left for the rest of the Amiga system.

One of the known Amiga cores can be found in the MiSTer project, here. This project is based on the DE10-Nano development board, which has 110k Logic elements. So, even with this project it was still unclear whether an Amiga core can be accommodated on a 20K FPGA, or even on my 30K element Zybo board.

After a couple of weeks I started to become interested in implementing an Amiga on an FPGA myself. I never owned an Amiga, so it would be nice to implement it on a Zybo board, to see what I have missed in the 90's concerning games 😀.

So, in the next couple of Blog Posts, I will attempt to get an Amiga core from the MiSTer project working on the Zybo board.

Whether I would be able to succeed, I cannot tell, but let us see how far we can get! 😀

Approach

In the coming Blog posts I will basically focus on the Mini Amiga sub project, here, from the MiSTer project.

Perhaps, before I continue, I should just give credit to the people who were involved in producing this core and providing it under the GPL 3 license:

Rok Krajnc, providing the project of which the source code of this core is based
Dennis van Weeren, Jakub Bednarski, Sascha Boing, A.M. Robinson, Tobias Gubener, Till Harbaum: For various contributions to the Mini Amiga project
Sorgelig: Maintainer of the overall MiSTer project

My approach would be to take the various components of the Mini Amiga project, bit by bit, write a top level module for each, and try to get it to work on the Zybo board. With such an incremental approach, it would be easier to isolate bugs that pop up.

I will start off with trying to get the CPU to work, by running a very simple 68000 machine language program and verifying that all inputs/outputs of this core are as expected.

The mini Amiga core supports two implementations for a 68000 core: fx68k and tg68k. I have chosen the fx68k core for this blog series. So, also special thanks to Jorge Cwik for providing this core as open source.

In Summary

This post was basically an introduction to the series of Blog Posts that will follow where I will attempt to get an Amiga core to run on a Zybo Board.

My focus will be on the Mini Amiga project and I will follow an incremental approach in an attempt to get the core running on the Zybo board.

In the next post I will try and get the fx68k core running on the Zybo board, using a very simple 68000 machine code program as a test.

Till next time!