Foreword

In the previous posts we have been building a 6502-based design that reads executable code from a FAT32-formatted SD Card and executes it.

Let me briefly summarise the goal of this project again: I want to run an Amiga core on an Arty A7 board. A 6502 core will do all the heavy lifting of loading Amiga ROMs and disk images from an SD Card into RAM, so that the Amiga core can execute them.

At this point in time our design uses block RAM for all of its memory. On every FPGA, block RAM is a limited resource, especially if we want to implement something like an Amiga core.

So, in this post we will be trying to run the 6502 core using the DDR RAM available on the Arty A7.

Stumbling Blocks

Let me start by discussing the stumbling blocks I came across over the past couple of months while trying to get the 6502 core to use DDR RAM.

Usually when I encounter stumbling blocks, I go into quite some detail about them in my blog posts. However, my stumbling blocks with DDR RAM over the past couple of months were gigantic, so I will try to keep it brief in this section.

So, my initial attempt to write code to interface the 6502 core with the DDR was pretty straightforward, and everything ran as expected during the simulation. However, when I tried running it on the actual Arty A7, things looked totally different than during the simulation.

Every other byte I read back from DDR on the Arty A7 was garbage. When this kind of thing happens while playing around with DDR, my heart sinks into my shoes, simply because there are not really any tools for troubleshooting this kind of issue. A lot of DDR operations happen at frequencies well above what the Integrated Logic Analysers can capture. In these cases one can only really solve the issue with some kind of intuition.

After a lot of back and forth, I decided to revisit my assumptions from a previous post:

In the post this diagram comes from, I was working on a memory tester. Basically, signal A will resemble the clock signal of the 6502 core.

At point A an address is asserted by the 6502, and at point B the first DDR instruction is loaded into a shift register, which shifts the instruction out to DDR to open the row associated with the address provided by the 6502 core.

The time period between two dotted lines is 1.5ns, so the time period between A and B is 3ns. This translates to 333MHz, which seemed like a very tight fit to me for the FPGA used on an Arty A7, although it was sufficient to run a memory tester on the board.
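As a quick sanity check of that figure, here is a small Python sketch of the arithmetic. It is not part of the design, and the function and constant names are my own.

```python
def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


GRID_NS = 1.5                # spacing between two dotted lines in the diagram
a_to_b_ns = 2 * GRID_NS      # points A and B are two grid spacings apart

print(f"A to B: {a_to_b_ns} ns, equivalent to "
      f"{period_ns_to_mhz(a_to_b_ns):.0f} MHz")
```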

I gave this some thought. There are a lot more logic cells involved in a 6502 core than in a simple memory tester. So, 3ns might not be enough for all the individual address lines to reach their full voltages.

My intuition told me, or should I rather say I made a hypothesis😆, that the problem may be solved by increasing the time period between A and B. We will explore this as a possible solution in the next section.

Clocking changes

With the hypothesis I made in the previous section, I came up with the following clocking scheme:

The 6502_clk is basically the clock that drives the 6502 core. It is an exact copy of mclk, except that I throw away nine clock pulses in between, keeping only every tenth one. With mclk at 83.3MHz, this gives us an effective 6502 clock of 8.3MHz, which is still above the 7MHz target clock our Amiga core will require in future.

At the point I have indicated with an arrow, we are loading our shift register with the address asserted by our 6502 core, which is one mclk cycle after the assertion. This works out to 12ns, compared to the 3ns of our earlier design. I think this will give our address lines ample time to settle before they are read at the next mclk cycle.
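The numbers in this clocking scheme can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the frequencies quoted above. The constant names are my own.

```python
MCLK_MHZ = 83.3      # mclk frequency quoted in this post
DIVIDER = 10         # we keep only every tenth mclk pulse
TARGET_MHZ = 7.0     # clock the Amiga core will need later on

mclk_period_ns = 1000.0 / MCLK_MHZ        # time from assertion to sampling
cpu_clk_mhz = MCLK_MHZ / DIVIDER          # effective 6502 clock
cpu_period_ns = mclk_period_ns * DIVIDER  # one full 6502 cycle

print(f"mclk period : {mclk_period_ns:.1f} ns (was 3 ns in the old design)")
print(f"6502 clock  : {cpu_clk_mhz:.2f} MHz, period {cpu_period_ns:.0f} ns")
print(f"above target: {cpu_clk_mhz > TARGET_MHZ}")
```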

The question remains whether this bigger time gap will introduce extra latency, causing us to miss our target frequency of 7MHz. We will revisit this question later on, when we have finished with the design.

Let us start by writing some Verilog code for a counter that keeps track of when to enable the 6502 clock:

    reg [3:0] edge_count = 9;

    always @(negedge mclk)
    begin 
        if (edge_count == 0)
            edge_count <= 9;
        else
            edge_count <= edge_count - 1;   
    end

    always @(negedge mclk)
    begin
        clk_8_enable <= edge_count == 0;
    end
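To convince myself that this counter really produces one enable pulse per ten mclk cycles, the two always blocks can be modelled in Python. This is a behavioural sketch only. I am assuming clk_8_enable starts at 0, which the Verilog snippet does not state, and the function name is my own.

```python
def simulate(n_edges: int):
    """Model edge_count and clk_8_enable over n_edges falling edges of mclk."""
    edge_count = 9        # same initial value as the Verilog register
    clk_8_enable = 0      # assumption: starts low
    trace = []
    for _ in range(n_edges):
        # Non-blocking assignments: both registers see the OLD edge_count.
        next_enable = 1 if edge_count == 0 else 0
        next_count = 9 if edge_count == 0 else edge_count - 1
        edge_count, clk_8_enable = next_count, next_enable
        trace.append((edge_count, clk_8_enable))
    return trace


trace = simulate(30)
enabled_at = [i for i, (_, enable) in enumerate(trace) if enable]
print(enabled_at)
print([count for count, enable in trace if enable])
```

In this model the enable is high exactly while edge_count reads 9, so the 6502 clock pulse lines up with edge_count being 9.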

We get the resulting 6502 clock with the following:

    BUFGCE BUFGCE_8_mhz (
       .O(clk_8_mhz),   // 1-bit output: Clock output
       .CE(clk_8_enable), // 1-bit input: Clock enable input for I0
       .I(mclk)    // 1-bit input: Primary clock
    );

So, we will use the signal clk_8_mhz to clock our 6502. It is important to add a constraint in Vivado to indicate that this signal must be treated as a clock when synthesizing the design. This constraint will look like the following:

create_generated_clock -name clkdiv1 -source [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O] \
     -edges {1 2 21} [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O]

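The edges parameter indicates which edges of mclk form part of the 6502 clock. Edges are numbered from 1, with odd numbers being rising edges and even numbers being falling edges, so {1 2 21} describes a clock that rises on the first mclk edge, falls on the second, and rises again on the 21st, which is ten mclk cycles later.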
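To see what these edge numbers mean in terms of time, here is a small Python sketch that converts a source-clock edge number into a timestamp. I am assuming a 12ns mclk period and that edge 1 sits at time zero. The helper name is my own.

```python
MCLK_PERIOD_NS = 12.0   # 1 / 83.3 MHz, rounded as elsewhere in this post


def edge_time_ns(edge_number: int, period_ns: float = MCLK_PERIOD_NS) -> float:
    """Time of a source-clock edge; edge 1 is the first rising edge at t = 0."""
    return (edge_number - 1) * period_ns / 2


rise, fall, next_rise = (edge_time_ns(e) for e in (1, 2, 21))
print(f"rise {rise} ns, fall {fall} ns, next rise {next_rise} ns")
print(f"generated period: {next_rise - rise} ns")
```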

Changing the command sequence

With the clocking changes performed in the previous section, we also need to make a change to the sequence of the commands issued to the DDR RAM. For this discussion you might want to refer back to the following posts:

https://c64onfpga.blogspot.com/2022/05/starting-with-memory-tester-on-arty.html

https://c64onfpga.blogspot.com/2022/07/shrinking-latency.html

In our initial attempts to shrink latency, we wrote the following code for reducing initial latency:

    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out) 
           ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd} : test_cmd;

It was this assignment to the wire that resulted in the address being sampled just 3ns after being asserted, as I mentioned earlier.

The above snippet needs to be removed, and the command should rather be asserted in the state machine as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
              end

You will also see that we only assert this command and go to the next state when edge_count is 8. This ensures that our state machine stays in sync with our 6502 clock.

Now, if you refer back to the previous posts I mentioned, you will see that the state following PREPARE_CMD used to be WAIT_CMD. Well, with our new way of clocking we don't need to transition to a wait state, because the waiting is done within PREPARE_CMD, where we wait for edge_count to reach the value 8.

So, the state after PREPARE_CMD should now be COL_CMD, because we need to issue the column read command in that state. The selector for that state looks as follows:

              COL_CMD: begin
                  state <= STATE_PREA;
                  test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                  cmd_slot <= 1;
                  data_in <= {8{cmd_data_out}};
                  do_write <= write_out;
              end
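If the Verilog concatenations are hard to read, the Python sketch below packs the same two 32-bit command words field by field, which also shows how the 6502 address is split between the two commands. It mirrors only the bit layout of the concatenations above. I make no claims here about what the constant fields mean to the memory controller, and the helper names are my own.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return value[hi:lo], like a Verilog part-select."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def activate_word(cmd_address: int) -> int:
    # {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd}
    return (bits(cmd_address, 15, 10) << 17) | 0x21FD


def column_word(cmd_address: int, map_address: int, write_out: bool) -> int:
    # {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1,
    #  (write_out ? 2'b11 : 2'b00), 10'h1fd}
    column = (bits(cmd_address, 9, 3) << 3) | bits(map_address, 2, 0)
    write_bits = 0b11 if write_out else 0b00
    return (column << 17) | (0x1 << 12) | (write_bits << 10) | 0x1FD


print(hex(activate_word(0x0400)))
print(hex(column_word(0x0008, 0, write_out=False)))
print(hex(column_word(0x0008, 0, write_out=True)))
```

So address bits 15 to 10 travel with the first command and bits 9 to 3 with the second.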

The rest of the state machine is the same.

Lowering the 6502 into the design

At this point in time we have two separate designs. The first is a prototype design for testing the DDR memory on the Arty A7, whose changes we have been discussing in this post. The second is the 6502-based design we developed over the last couple of posts for accessing data from an SD Card.

Now, we have come to a point where we need to merge the two designs, giving our 6502 based SD Card reader the power of DDR memory.

So, the top module of our 6502-based design will now move into mem_tester.v as an instance, with the code looking like this:

    retrosystem retrosystem(    
        .cs(),
        .mosi(),
        .miso(),
        .reset(wait_for_read > 0),
        .gen_clk(clk),
        .write_ddr(write),
        .ddr_data_out(data_out_byte),
        .ddr_data_in(data_in),
        .ddr_addr(address_byte),
        .led(led),
        .sclk(),
        .cd(),
        .wp()
    );

First of all, I had to come up with a name for a module that used to be top.v but is no longer a top module. So, I just picked the name retrosystem, which contains an SD Card module and a 6502 system.

Firstly, we have signals like cs, mosi, miso and so on, which form part of the SD interface. We will need to extend these signals all the way to the top module so that the SD Card module can be reached.

We have also added some extra signals to interface the 6502 with the DDR RAM on the Arty A7:

write_ddr
ddr_data_out
ddr_data_in
ddr_addr

With all this in place, let us see how to interface the 6502 with the external DDR RAM.

First, let us make a change to the following code block:

always @*
begin
    casex (addr_delayed)
        //16'hfexx: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx00: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx01: combined_data = wb_data_store[7:0];
        16'b1111_1011_xxxx_xx10: combined_data = wb_data_store[15:8];
        16'b1111_1011_xxxx_xx11: combined_data = wb_data_store[23:16];
        16'b0000_0xxx_xxxx_xxxx: combined_data = addr_delayed[0] 
            ? ddr_data_in[15:8] : ddr_data_in[7:0];

        default: combined_data = rom_out;
    endcase 
end

combined_data is the signal that combines data from the various sources and sends it to the 6502 core via the DI input.

The 16'b0000_0xxx_xxxx_xxxx selector used to get its data from a small segment of block RAM, but we have now changed it to get its data externally. We get data from DDR RAM in 16-bit pieces, and we therefore need to decide which byte to send to the 6502. Bit 0 of the address makes this decision.
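The read-side decode in the casex block can be modelled in Python as follows. This is just a behavioural sketch of the selectors as written, with a function name of my own choosing. The argument names are borrowed from the Verilog signals.

```python
def combined_data(addr, o_data_sdspi, wb_data_store, ddr_data_in, rom_out):
    """Pick the byte the 6502 sees on DI for a given 16-bit address."""
    if (addr >> 8) == 0xFB:                  # 1111_1011_xxxx_xxNN
        low2 = addr & 0b11
        if low2 == 0:
            return o_data_sdspi & 0xFF
        # NN = 01, 10, 11 select wb_data_store[7:0], [15:8], [23:16]
        return (wb_data_store >> (8 * (low2 - 1))) & 0xFF
    if (addr >> 11) == 0:                    # 0000_0xxx_xxxx_xxxx
        # Bit 0 chooses the high or low byte of the 16-bit DDR word.
        return (ddr_data_in >> 8) & 0xFF if addr & 1 else ddr_data_in & 0xFF
    return rom_out & 0xFF                    # default: ROM


word = 0x20A9   # 16-bit word holding the bytes for addresses 4 and 5
print(hex(combined_data(0x0004, 0, 0, word, 0xEA)))
print(hex(combined_data(0x0005, 0, 0, word, 0xEA)))
```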

Just as bit 0 of the address determines which byte to read from a 16-bit word, it also determines which byte of a 16-bit word to write to memory. This process is a bit more complicated, so I will not cover it here. Suffice it to say that we will need to make use of the DM signal on the DDR RAM to ensure the correct byte gets written.

We also need to assign some of the ports:

assign ram_6502_addr = cpu_address;
assign write_ddr = we_6502 & (cpu_address[15:9] == 0);
assign ddr_data_out = cpu_data_out;
assign ddr_addr = cpu_address_result;
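To make the write_ddr condition concrete, here is a tiny Python model of that one assignment. It only restates the expression as written above, with a function name of my own. With bits 15 to 9 required to be zero, writes go to DDR for addresses $0000 to $01FF.

```python
def write_ddr(we_6502: bool, cpu_address: int) -> bool:
    """Model of: we_6502 & (cpu_address[15:9] == 0)."""
    upper_bits = (cpu_address >> 9) & 0x7F   # cpu_address[15:9]
    return bool(we_6502) and (upper_bits == 0)


for addr in (0x0004, 0x01FF, 0x0200, 0xFB0B):
    print(f"${addr:04X}: {write_ddr(True, addr)}")
```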

I mentioned that the retrosystem block needs to be instantiated within mem_tester. Speaking of mem_tester, it also contains a state machine, which is no longer necessary.

Checking Timing

With all the code developed in the previous section, we still need to check if the time of a complete read/write cycle fits within our expectations of more or less 7MHz.

The simulation waveform gives an idea of the timings:

Firstly, clk_8_mhz is the signal clocking the CPU at 8.3MHz. All memory commands associated with a read/write (e.g. activate, column read, precharge) should be completed within one such cycle.

CPU address is the address that is output by the 6502 CPU core. You will also see that the address changes on a clk_8_mhz cycle.

On this simulation graph, I have also shown the DDR signals, which are prefixed by SD. I have numbered the different DDR commands. Point 1 is where a row activate happens. Point 2 is where a column read/write happens. Finally, point 3 is where the precharge happens, as the last command of a read/write cycle.

In this diagram I have also shown the precharge command of the previous cycle.

All in all it seems that a read/write can complete within the time period of one 8.3MHz clock cycle.

Now, when we do a read the actual data will be presented on the data_out signal, which I have also indicated on the diagram. In this case the data is the three blurbs after the long trains of X's. On the diagram it is not clear what the values are of these three blurbs, so let us zoom in a bit:

In the first blurb you will also see a number of X's, and in between them the values 20 hex and A9 hex. In this particular simulation test, 20 and A9 were the actual data I had written to address 4, so we know that the first blurb always contains the data we are looking for during a read.

However, this blurb only lasts one mclk clock cycle, and we need to extend the data until the next 8.3MHz clock cycle so that our CPU can pick it up. We do this by adjusting our earlier PREPARE_CMD selector as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 2)
                  begin
                      cap_value <= data_out;
                  end
                  do_capture <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
                  
                  cmd_status <= 1;
              end

As the if (edge_count == 2) block shows, we capture data_out when edge_count is 2.
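To get a feel for how these edge_count values sit within one 6502 cycle, the spacing between them can be worked out with a bit of modular arithmetic. This Python sketch rests on two assumptions of mine: a 12ns mclk period, and the 6502 clock pulse lining up with edge_count being 9, which is what the counter code earlier in this post suggests. The exact clock edges of the state machine are not modelled, so treat the results as approximate.

```python
MCLK_PERIOD_NS = 12.0   # 1 / 83.3 MHz, rounded
DIVIDER = 10            # edge_count runs 9, 8, ..., 0 and wraps


def cycles_between(from_count: int, to_count: int) -> int:
    """mclk cycles for the down-counter to get from one value to another."""
    return (from_count - to_count) % DIVIDER


ACTIVATE_AT = 8    # PREPARE_CMD issues the row activate here
CAPTURE_AT = 2     # PREPARE_CMD latches data_out into cap_value here
CPU_CLOCK_AT = 9   # assumption: 6502 clock pulse coincides with this value

activate_to_capture = cycles_between(ACTIVATE_AT, CAPTURE_AT)
capture_to_cpu = cycles_between(CAPTURE_AT, CPU_CLOCK_AT)

print(f"activate to capture: {activate_to_capture} mclk cycles "
      f"({activate_to_capture * MCLK_PERIOD_NS:.0f} ns)")
print(f"capture to next 6502 clock: {capture_to_cpu} mclk cycles "
      f"({capture_to_cpu * MCLK_PERIOD_NS:.0f} ns)")
```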

The Test Program

Let us end this post by looking at the Test program we used for testing 6502 and DDR RAM interaction.

Here is the listing:

.ORG $FC00
ldx #offset
copy
    lda zcode,x
    sta $4,x
    dex
    bpl copy

    ldx #0    
read
    lda $4,x
    inx
    cpx #$0a
    bne read

    jmp $4
zcode
    lda #$20
    sta $fb0b
    lda #0
    sta $fb0b

    lda #0
    sta $0
    sta $1
    sta $2
    sta $3

lp1
    inc $0
    bne lp1
lp2
    inc $1
    bne lp1
lp3
    inc $2
    lda $2
    cmp #60
    bne lp1
    lda #$20
    eor $3
    sta $fb0b
    sta $3
    lda #0
    sta $2
    beq lp1
endz
    nop
offset=*-zcode

ENDROM = $FFFF-*-3
.FILL ENDROM 00
.BYTE 0, $FC, 00, 00

This code, starting at FC00, which is the start of our "ROM", basically does three things. It starts by copying the code at label zcode into RAM, starting at address $4.

The next thing the program does is read the code back, starting from location 4. This was useful for confirming with an ILA that reads from DDR RAM work, by checking that the data returned to the 6502 is what we expect.

Finally, the code jumps to location 4, effectively starting to execute the code copied from label zcode. This is basically a nested waiting loop turning an LED on and off every second or so. You will remember from previous posts that a bit of register $FB0B controls an LED; in this listing it is bit 5, the $20 mask.

Real Life Results

I thought of ending this post by showing ILA captures of the design running on the real FPGA. Firstly, a list of data that we expect for the test:

Address 4: A9
Address 5: 20
Address 6: 8D
Address 7: 0B
Address 8: FB
Address 9: A9
Address a: 00
Address b: 8D
Address c: 0B
Address d: FB
Address e: A9
Address f: 00

And next the ILA capture:

The top row is the asserted CPU address, and the bottom two rows are selected bytes from cap_value. Let us start by reminding ourselves about the structure of the cap_value register.

Firstly, cap_value is 128 bits in width. In total it stores 8 bursts of data from DDR memory, of which we always just look at the first burst.

Furthermore, because the DDR RAM on the Arty A7 has 16 data lines, we have structured cap_value such that bits 0-63 contain the low bytes of the eight bursts, and bits 64-127 contain the high bytes of the eight bursts.

Coming back to the above diagram, the second line of the capture shows bits 64-71 of cap_value, which is the high byte for the relevant address. Similarly, the last line shows bits 0-7 of cap_value, which is the low byte for the relevant address.

Now, as we have discussed earlier on, we get the data for an asserted address just before the transition to the next address. So, for example, for address 4 we get low byte a9 and high byte 20. The same is true for address 5, because byte addresses 4 and 5 share the same 16-bit word.
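The cap_value layout described above can be summarised in a few lines of Python. This sketch only models the first burst, which is the one we look at. The function names are my own, and the test value is built from the a9 and 20 bytes expected for addresses 4 and 5.

```python
def first_burst(cap_value: int):
    """Return (low_byte, high_byte) of the first burst in the 128-bit value."""
    low = cap_value & 0xFF             # bits 0-7: low byte of burst 0
    high = (cap_value >> 64) & 0xFF    # bits 64-71: high byte of burst 0
    return low, high


def byte_for_address(cap_value: int, addr: int) -> int:
    """Bit 0 of the 6502 address picks the byte within the 16-bit word."""
    low, high = first_burst(cap_value)
    return high if addr & 1 else low


cap = (0x20 << 64) | 0xA9   # word shared by byte addresses 4 and 5
print(hex(byte_for_address(cap, 4)), hex(byte_for_address(cap, 5)))
```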

Comparing the diagram to the values we expect, we can confirm that our design works correctly. The LED on the Arty A7 also flashes as we expect.

In Summary

In this post we integrated our 6502 based design with the DDR RAM on the Arty A7 and verified that read/writes work correctly.

In the next post we will also wire up the SD Card ports to our top module and confirm that our 6502/SD Card/DDR design works together as expected.

Until Next time!

C64 on an FPGA

Sunday, 16 July 2023

Running 6502 with DDR RAM