Sunday 23 July 2023

Throwing an SD Card into the mix

Foreword

In the previous post we managed to get our 6502 based design to work with the DDR RAM on the Arty A7 board.

For this post I had originally planned to add SD Card access to our current design. However, this proved to be a no-brainer, so covering this topic alone would yield a very short blog post.

Another topic I had thought of discussing in this post was to prove that our 6502 based design works at the speed of an equivalent real 6502 processor at 8.33MHz. This is a doubt that pops up in me from time to time when mimicking a retro system that worked with asynchronous RAM by means of synchronous RAM. The doubt sprouts from the fact that with synchronous RAM the data only becomes available when the clock transitions at the next clock cycle, whereas with asynchronous RAM the data is already available just before the next clock transition.

Proving that our 6502 based design works at the same speed as an equivalent 8.33MHz 6502 based system will put this doubt to rest.

Adding SD Card access to our design

I mentioned in the previous post that we need to ensure that we propagate all the SD Card ports of our retro module all the way up to our top module.

With this done, we need to add the following constraints to our xdc file:

set_property -dict {PACKAGE_PIN G13 IOSTANDARD LVCMOS33} [get_ports cs]
set_property -dict {PACKAGE_PIN B11 IOSTANDARD LVCMOS33} [get_ports mosi]
set_property -dict {PACKAGE_PIN A11 IOSTANDARD LVCMOS33} [get_ports miso]
set_property -dict {PACKAGE_PIN D12 IOSTANDARD LVCMOS33} [get_ports sclk]
set_property -dict { PACKAGE_PIN D13   IOSTANDARD LVCMOS33 } [get_ports dat1]; #IO_L6N_T0_VREF_15 Sch=ja[7]
set_property -dict { PACKAGE_PIN B18   IOSTANDARD LVCMOS33 } [get_ports dat2]; #IO_L10P_T1_AD11P_15 Sch=ja[8]
set_property -dict {PACKAGE_PIN A18 IOSTANDARD LVCMOS33} [get_ports cd]
set_property -dict {PACKAGE_PIN K16 IOSTANDARD LVCMOS33} [get_ports wp]

Having done all this, I found that our design works perfectly with the flashing LED, as explained in this post: https://c64onfpga.blogspot.com/2023/04/sd-card-access-for-arty-a7-part-9.html. The only difference is that we now use DDR RAM instead of Block RAM.

Another mission accomplished!!!

Benchmarking

Let us now see if we can determine the speed of our 6502 based design compared to an equivalent retro 6502 based system.

The benchmark I will be using is a bit of an unconventional one. I will run a 6502 machine code program on the VICE Commodore 64 emulator that constantly changes the border color. I will then run a similar program on our 6502 based FPGA design, but flashing an LED instead. Comparing the time between border color changes with the time between LED toggles will be our benchmark.

This is perhaps an unfair comparison, because the C64 loses execution time to interrupts and to the VIC-II, which occasionally steals cycles from the 6502 to fetch extra display data. For this reason we will disable interrupts and blank the screen to avoid cycle stealing. The resulting C64 program is as follows:

    sei             ; disable interrupts
    lda $d011
    and #$ef        ; clear bit 4 (display enable) to blank the screen
    sta $d011
    lda #0
    sta $4
    sta $5
    sta $6
    sta $7

lp1
    inc $4
    bne lp1
lp2
    inc $5
    bne lp1
lp3
    inc $6
    lda $6
    cmp #60
    bne lp1
    lda #$02
    eor $7
    sta $d020
    sta $7
    lda #0
    sta $6
    beq lp1

This program blanks the screen and alternates the border between red and black. To determine how long it takes for the border color to change, I made a video recording of the screen. I then played the recording back in VLC media player, noting the timestamp at which the screen turns red and the timestamp at which it turns black again.

Here is a screenshot of when the screen turns red:


Here we see that the screen turns red 26 seconds into the video. Next, the following screenshot shows when the screen turns black again:


Here we see the screen turns black again 57 seconds into the video. Thus one color transition takes 57 - 26 = 31 seconds.

Now, let us do a similar test with our FPGA based 6502 design. Here is the code:

    lda #$20
    sta $fb0b
    lda #0
    sta $fb0b

    lda #0
    sta $0
    sta $1
    sta $2
    sta $3

lp1
    inc $0
    bne lp1
lp2
    inc $1
    bne lp1
lp3
    inc $2
    lda $2
    cmp #60
    bne lp1
    lda #$20
    eor $3
    sta $fb0b
    sta $3
    lda #0
    sta $2
    beq lp1

This is almost the same as our C64 variant, except that we are toggling a register to blink an LED. This program needs to be assembled and stored as a file named boot.bin on the SD Card.

Now, this code takes 4 seconds to go from the LED turning on to the LED turning off again. Let us do some math: 31/4 = 7.75, meaning our design is roughly 8 times faster than a C64. Keeping in mind that a C64 operates at about 1MHz, this comes close to 8MHz, which is more or less the speed of our FPGA based design.
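
As a rough cross-check of these measurements, the loop can also be costed by hand. The sketch below is written in Python purely for convenience. It uses the standard documented 6502 cycle counts and assumes that no branch crosses a page boundary and that every memory access completes in a single CPU cycle. The names are made up for illustration.

```python
# Documented 6502 cycle counts for the instructions used in the delay loop.
INC_ZP = 5
LDA_ZP = 3
LDA_IMM = 2
CMP_IMM = 2
EOR_ZP = 3
STA_ZP = 3
STA_ABS = 4
BRANCH_TAKEN = 3       # assumption: branch target is in the same page
BRANCH_NOT_TAKEN = 2


def cycles_per_toggle(outer_count=60):
    """CPU cycles between two consecutive writes to the border/LED register."""
    # lp1: inc / bne, 256 passes, the branch falls through on the last pass.
    inner = 256 * INC_ZP + 255 * BRANCH_TAKEN + BRANCH_NOT_TAKEN
    # lp2: a full inner loop, then inc / bne, 256 passes.
    middle = 256 * (inner + INC_ZP) + 255 * BRANCH_TAKEN + BRANCH_NOT_TAKEN
    # lp3: a full middle loop, then inc / lda / cmp / bne, outer_count passes.
    outer_body = middle + INC_ZP + LDA_ZP + CMP_IMM
    outer = (outer_count * outer_body
             + (outer_count - 1) * BRANCH_TAKEN + BRANCH_NOT_TAKEN)
    # Toggle sequence: lda #, eor zp, sta abs, sta zp, lda #, sta zp, beq.
    toggle = LDA_IMM + EOR_ZP + STA_ABS + STA_ZP + LDA_IMM + STA_ZP + BRANCH_TAKEN
    return outer + toggle


def seconds_per_toggle(clock_hz, outer_count=60):
    return cycles_per_toggle(outer_count) / clock_hz


print(round(seconds_per_toggle(1.0e6), 1))    # nominal 1MHz C64
print(round(seconds_per_toggle(8.33e6), 1))   # 8.33MHz FPGA design
```

With these assumptions one toggle costs about 31.6 million cycles, which works out to roughly 31.6 seconds at 1MHz and roughly 3.8 seconds at 8.33MHz. That agrees with the 31 and 4 seconds read off the video and the LED.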

In Summary

In this post we added SD Card access to our 6502 based design, bringing us to the point where our design can access both an SD Card and DDR RAM.

We also ran a benchmark that confirms that our current design runs at around 8MHz.

In the next post I will start to develop a dual channel memory controller for accessing DDR RAM. The purpose will be so that our 6502 core and Amiga core can independently access DDR RAM at around 8MHz. This will enable the 6502 part of the system to read through a disk image and simulate disk access to the Amiga core.

Until next time! 

Sunday 16 July 2023

Running 6502 with DDR RAM

Foreword

In the previous posts we had been creating a 6502 based design for reading executable code from a FAT32 formatted SD Card and executing it.

Let me summarise my goal with this project again. My goal is to run an Amiga core on an Arty A7 board. For this project I will be using a 6502 core to do all the heavy lifting of loading Amiga ROMs and disk images from an SD Card into RAM, so the Amiga core can execute them.

At this point in time our design uses only block RAM. On every FPGA, block RAM is a limited resource, especially if we want to implement something like an Amiga core.

So, in this post we will be trying to run the 6502 core using the DDR RAM available on the Arty A7. Achieving this goal will free up block RAM for the Amiga core.

Stumbling Blocks

Let me start by discussing the stumbling blocks I came across over the past couple of months in trying to get the 6502 core to use DDR RAM.

Usually when I encounter stumbling blocks, I go into quite some detail about them in my blog posts. However, my stumbling blocks with DDR RAM were gigantic the past couple of months, so I will try to keep it brief in this section.

So, my initial attempt to write code to interface the 6502 core with the DDR was pretty straightforward, and everything ran as expected during the simulation. However, when I tried running it on the actual Arty A7, things looked totally different than during the simulation.

Every other byte I read back from DDR on the Arty A7 was garbage. When this kind of thing happens when playing around with DDR, my heart sinks into my shoes, simply because there are not really any tools for troubleshooting this kind of issue. A lot of the operations of DDR happen at frequencies well above what can be captured by the Integrated Logic Analysers. In these cases one can only really solve the issue by some kind of intuition.

After a lot of back and forth, I decided to revisit my assumptions from a previous post:


In the post containing this diagram, I was working on a memory tester. Basically signal A will resemble the clock signal of the 6502 core.

At point A an address will be asserted by the 6502, and at point B the first DDR instruction will be loaded into a shift register that shifts the instruction out to DDR, opening the row associated with the address provided by the 6502 core.

The time period between two dotted lines is 1.5ns, so the time period between A and B is 3ns. This translates to 333MHz, and within the FPGA used on an Arty A7 it seemed like a very tight fit to me, although it was sufficient to run a memory tester on the board.

I gave this some thought. There are a lot more logic cells involved in a 6502 core than in a simple memory tester. So, 3ns might not be enough for all the individual address lines to reach their full voltages.

My intuition told me, or should I rather say I made a hypothesis😆, that the problem may be solved by increasing the time period between A and B. We will discover this as a possible solution in the next section.

Clocking changes

With the hypothesis I made in the previous section, I came up with the following clocking scheme:


The 6502_clk is basically the clock that drives the 6502 core. It is an exact copy of mclk, but I am throwing away 9 clocks in between, thus keeping only every tenth clock. With mclk at 83.3MHz, this gives us an effective 6502 clock of 8.3MHz, which is still above the target clock of 7MHz required for our Amiga core in future.

At the point I have indicated with an arrow, we load our shift register with the address asserted by our 6502 core, one mclk cycle after the assertion. This works out to 12ns, compared to the 3ns of our earlier design. I think this will give ample time for our address lines to settle before they are read at the next mclk cycle.

The question remains whether this bigger time gap will introduce extra latency, causing us to miss our target frequency of 7MHz. We will revisit this question later on, when we have finished with the design.

Let us start by writing some Verilog code for a counter that keeps track of when to enable the 6502 clock:

    reg [3:0] edge_count = 9;

    always @(negedge mclk)
    begin 
        if (edge_count == 0)
            edge_count <= 9;
        else
            edge_count <= edge_count - 1;   
    end

    always @(negedge mclk)
    begin
        clk_8_enable <= edge_count == 0;
    end
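
To see which mclk cycles this counter lets through, it can help to model the two always blocks in software. The following is only a rough Python model of the Verilog above, not part of the design. The function name is made up, and it assumes clk_8_enable starts at 0.

```python
def simulate_enable(n_cycles):
    """Model edge_count and clk_8_enable over n_cycles falling edges of mclk."""
    edge_count = 9       # same initial value as in the Verilog
    clk_8_enable = 0     # assumption: the enable register starts low
    trace = []
    for _ in range(n_cycles):
        # Non-blocking assignments: both registers are computed from the
        # values that were present before this clock edge.
        next_enable = 1 if edge_count == 0 else 0
        next_count = 9 if edge_count == 0 else edge_count - 1
        edge_count, clk_8_enable = next_count, next_enable
        trace.append(clk_8_enable)
    return trace


trace = simulate_enable(30)
print(trace)
print(83.3 / 10)   # effective clock in MHz when one in ten cycles is passed
```

In this model the enable is high for exactly one mclk cycle out of every ten, which is where the 8.33MHz figure comes from.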
We get the resulting 6502 clock with the following:

    BUFGCE BUFGCE_8_mhz (
       .O(clk_8_mhz),   // 1-bit output: Clock output
       .CE(clk_8_enable), // 1-bit input: Clock enable input for I0
       .I(mclk)    // 1-bit input: Primary clock
    );

So, we will use the signal clk_8_mhz to clock our 6502. It is important to add a constraint in Vivado to indicate that this signal must be treated as a clock when synthesizing the design. The constraint looks like the following:

create_generated_clock -name clkdiv1 -source [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O] \
     -edges {1 2 21} [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O]

The edges parameter indicates which edges of mclk form the first rising edge, the falling edge and the next rising edge of the 6502 clock.
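
To make the edge numbers concrete, here is a small Python sketch that converts an edges list into a period and a high time. It assumes an mclk period of 12ns and that the edges of mclk are numbered from 1, with edge 1 being a rising edge. The function name is made up for illustration.

```python
def generated_clock(edges, master_period_ns):
    """Return (period_ns, high_time_ns) of the clock described by three edges."""
    half_period = master_period_ns / 2.0
    # Edge n of the master clock occurs (n - 1) half periods after edge 1.
    rise, fall, next_rise = [(e - 1) * half_period for e in edges]
    return next_rise - rise, fall - rise


period_ns, high_ns = generated_clock((1, 2, 21), 12.0)
print(period_ns, high_ns)     # period and high time in ns
print(1000.0 / period_ns)     # resulting frequency in MHz
```

Under these assumptions the generated clock has a period of 120ns with a single 6ns high pulse, in other words one mclk pulse out of every ten.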

Changing the command sequence

With the clocking changes performed in the previous section, we also need to make a change to the sequence of the commands issued to the DDR RAM. For this discussion you might want to refer back to the following posts:


In our initial attempts to shrink latency, we wrote the following code:

    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out) 
           ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd} : test_cmd;

It was this assignment to a wire that resulted in the address being sampled only 3ns after being asserted, as I mentioned earlier.

The above snippet needs to be removed, and the command should rather be asserted in the state machine as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
              end

You will also see that we only assert this command and go to the next state when edge_count is 8. This ensures that our state machine stays in sync with our 6502 clock.

Now, if you refer back to the previous posts I mentioned, you will see that the state following PREPARE_CMD used to be WAIT_CMD. With our new way of clocking we don't need to transition to a wait state, because the waiting is done within PREPARE_CMD, where we wait for edge_count to reach the value 8.

So, the state after PREPARE_CMD should now be COL_CMD, because we need to issue the column read command in that state. The selector for that state looks as follows:

              COL_CMD: begin
                  state <= STATE_PREA;
                  test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                  cmd_slot <= 1;  
                  data_in <= {8{cmd_data_out}};
                  do_write <= write_out;
              end

The rest of the state machine is the same.

Lowering the 6502 into the design

At this point in time we have two separate designs. The first is a prototype design for testing the DDR memory on the Arty A7, whose changes we have been discussing in this post. The second is the 6502 based design we developed over the last couple of posts for accessing data from an SD Card.

Now, we have come to a point where we need to merge the two designs, giving our 6502 based SD Card reader the power of DDR memory.

So, the top module of our 6502 based design, will now move within mem_tester.v as an instance, with the code looking like this:

    retrosystem retrosystem(    
        .cs(),
        .mosi(),
        .miso(),
        .reset(wait_for_read > 0),
        .gen_clk(clk),
        .write_ddr(write),
        .ddr_data_out(data_out_byte),
        .ddr_data_in(data_in),
        .ddr_addr(address_byte),
        .led(led),
        .sclk(),
        .cd(),
        .wp()
    );

First of all, I had to come up with a name for a module that used to be top.v but is no longer a top module. So, I just picked the name retrosystem, which contains an SD Card module and a 6502 system.

Firstly, we have signals like cs, mosi, miso and so on, which form part of the SD Card interface. We will need to extend these signals all the way to the top module so the SD Card can be reached.

We have also added some extra signals to interface the 6502 with the DDR RAM on the Arty A7:
  • write_ddr
  • ddr_data_out
  • ddr_data_in
  • ddr_addr

With all this in place, let us see how to interface the 6502 with the external DDR RAM.

First, let us make a change to the following code block:

always @*
begin
    casex (addr_delayed)
        //16'hfexx: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx00: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx01: combined_data = wb_data_store[7:0];
        16'b1111_1011_xxxx_xx10: combined_data = wb_data_store[15:8];
        16'b1111_1011_xxxx_xx11: combined_data = wb_data_store[23:16];
        16'b0000_0xxx_xxxx_xxxx: combined_data = addr_delayed[0] 
            ? ddr_data_in[15:8] : ddr_data_in[7:0];

        default: combined_data = rom_out;
    endcase 
end

combined_data is the port that combines the data of the various sources and sends it to the 6502 core via the DI input.
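
To check which source a given address ends up reading from, the casex selectors can be mirrored in a few lines of Python. This is only a sketch of the decode logic above. The function name is made up, and the returned strings are simply the signal names from the Verilog.

```python
def select_source(addr_delayed):
    """Return the name of the signal that feeds combined_data for an address."""
    # 16'b1111_1011_xxxx_xxNN: high byte is $FB, the low two bits pick a register.
    if (addr_delayed & 0xFF00) == 0xFB00:
        return ("o_data_sdspi[7:0]",
                "wb_data_store[7:0]",
                "wb_data_store[15:8]",
                "wb_data_store[23:16]")[addr_delayed & 0x3]
    # 16'b0000_0xxx_xxxx_xxxx: the top five bits are zero, so $0000-$07FF.
    if (addr_delayed & 0xF800) == 0x0000:
        return "ddr_data_in"
    # Everything else falls through to the default.
    return "rom_out"


for a in (0x0004, 0x07FF, 0x0800, 0xFB0B, 0xFC00):
    print(hex(a), select_source(a))
```

The model makes it easy to see that only the lowest 2KB of the address space is routed to DDR RAM at this stage, while the boot code at $FC00 still comes from rom_out.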

The last selector before the default (16'b0000_0xxx_xxxx_xxxx) used to get its data from a small segment of block RAM, but we have changed it to get its data externally. We get data from DDR RAM in 16-bit pieces, and we therefore need to decide which byte we are going to send to the 6502. Bit 0 of the address makes this decision.

Just as bit 0 of the address determines which byte to read from a 16-bit word, bit 0 also determines which byte of a 16-bit word to write to memory. This process is a bit more complicated, so I will not cover it here. Suffice it to say that we will need to make use of the DM signal on the DDR RAM to ensure the correct byte gets written.
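
The byte lane idea can be sketched in Python as follows. The read side mirrors the selector shown earlier. The write side is only my assumption of how a data mask could be used: a mask bit of 1 is taken to mean that the byte lane is not written, which is how DM behaves on DDR3. It is not the actual code of this design, and both function names are made up.

```python
def read_byte(word16, addr):
    """Pick the byte the 6502 asked for out of a 16-bit DDR word."""
    return (word16 >> 8) & 0xFF if addr & 1 else word16 & 0xFF


def write_lanes(byte, addr):
    """Return (data16, dm) for writing a single byte.

    The byte is placed on both lanes, and the mask blocks the lane that
    must stay untouched (assumption: mask bit 1 = lane not written).
    """
    data16 = (byte << 8) | byte
    dm = 0b01 if addr & 1 else 0b10   # bit 0 masks the low lane, bit 1 the high lane
    return data16, dm


print(hex(read_byte(0x20A9, 4)), hex(read_byte(0x20A9, 5)))
print(write_lanes(0x8D, 6), write_lanes(0x0B, 7))
```

Placing the byte on both lanes keeps the data path simple, since only the mask depends on bit 0 of the address.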

We also need to assign some of the ports:

assign ram_6502_addr = cpu_address;
assign write_ddr = (we_6502 & cpu_address[15:9] == 0);
assign ddr_data_out = cpu_data_out;
assign ddr_addr = cpu_address_result;

I mentioned that the retrosystem block needs to be instantiated within mem_tester. Speaking of mem_tester, it also contains a state machine which is no longer necessary.

Checking Timing

With all the code developed in the previous section, we still need to check if the time of a complete read/write cycle fits within our expectations of more or less 7MHz.

The simulation waveform gives an idea of the timings:


Firstly, the signal clk_8_mhz is the signal clocking the CPU at 8.3MHz. All memory commands associated with a read/write (e.g. activate, column read, precharge) should be completed within one such cycle.

CPU address is the address that is output by the 6502 CPU core. You will also see the address changes on a clk_8_mhz cycle.

On this simulation graph, I have also shown the DDR signals, which are prefixed by SD. I have numbered the different DDR commands. Point 1 is where a row activate is happening. Point 2 is where a column read/write is happening. Finally, point 3 is where the precharge is happening, as the last command of a read/write cycle.

In this diagram I have also shown the precharge command of the previous cycle.

All in all it seems that a read/write can complete within the time period of one 8.3MHz clock cycle.

Now, when we do a read the actual data will be presented on the data_out signal, which I have also indicated on the diagram. In this case the data is the three blurbs after the long trains of X's. On the diagram it is not clear what the values are of these three blurbs, so let us zoom in a bit:


In the first blurb you will see a number of X's and, in between, the values 20 hex and A9 hex. In this particular simulation, the values 20 and A9 were the actual data I had written to address 4, so we know that the first blurb always contains the data we are looking for during a read.

However, this blurb only lasts one mclk clock cycle, and we need to hold the data until the next 8.3MHz clock cycle so that our CPU can pick it up. We do this by adjusting our PREPARE_CMD selector from earlier as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 2)
                  begin
                      cap_value <= data_out;
                  end
                  do_capture <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
                  
                  cmd_status <= 1;
              end

As shown by the new edge_count == 2 block, we capture data_out into cap_value when edge_count is 2.

The Test Program

Let us end this post by looking at the Test program we used for testing 6502 and DDR RAM interaction.

Here is the listing:

.ORG $FC00
ldx #offset
copy
    lda zcode,x
    sta $4,x
    dex
    bpl copy

    ldx #0    
read
    lda $4,x
    inx
    cpx #$0a
    bne read

    jmp $4
zcode
    lda #$20
    sta $fb0b
    lda #0
    sta $fb0b

    lda #0
    sta $0
    sta $1
    sta $2
    sta $3

lp1
    inc $0
    bne lp1
lp2
    inc $1
    bne lp1
lp3
    inc $2
    lda $2
    cmp #60
    bne lp1
    lda #$20
    eor $3
    sta $fb0b
    sta $3
    lda #0
    sta $2
    beq lp1
endz
    nop
offset=*-zcode

ENDROM = $FFFF-*-3
.FILL ENDROM 00
.BYTE 0, $FC, 00, 00

This code starting at FC00, which is the start of our "ROM", basically does three things. It starts by loading the code starting at label zcode into RAM starting at address $4.

The next thing this program does is read the code back, starting from location 4. This was useful for confirming that reading from DDR RAM works, by using an ILA to check that the data returned to the 6502 is what we expect.

Finally the code jumps to location 4, effectively starting to execute the code at label zcode. This is basically a nested waiting loop turning an LED on and off every few seconds. You will remember from previous posts that a bit of register $FB0B (bit 5, given the $20 value used here) controls an LED.

Real Life Results

I thought of ending this post by showing ILA captures of the design running on the real FPGA. Firstly, a list of data that we expect for the test:

Address 4: A9
Address 5: 20
Address 6: 8D
Address 7: 0B
Address 8: FB
Address 9: A9
Address a: 00
Address b: 8D
Address c: 0B
Address d: FB
Address e: A9
Address f: 00

And next the ILA capture:


The top row is the asserted CPU address and the bottom two rows are selected bytes from cap_value. Let us start by reminding ourselves again about the structure of the cap_value register.

Firstly, cap_value is 128 bits in width. In total it stores 8 bursts of data from DDR memory, of which we always just look at the first burst.

Furthermore, because the DDR RAM on the Arty A7 has 16 data lines, we have structured cap_value so that bits 0-63 contain the low bytes of the eight bursts, and bits 64-127 contain the high bytes of the eight bursts.

Coming back to the above diagram, the second line of the capture shows bits 64-71 of cap_value and is the high byte for the relevant address. Similarly, the last line of the capture shows bits 0-7 of cap_value and is the low byte for the relevant address.

Now, as we have discussed earlier on, we get the data for an asserted address just before the transition to the next address. So, for example, for address 4 we get low byte a9 and high byte 20. The same is true for address 5, because byte addresses 4 and 5 share the same 16-bit word.
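
The comparison between the capture and the expected values can be written down as a small Python sketch. It only looks at the first burst, that is bits 0-7 and bits 64-71 of cap_value as described above. The function names and the way the expected bytes are stored are my own choices for illustration.

```python
# Expected bytes in DDR RAM, starting at address 4 (from the list above).
EXPECTED = [0xA9, 0x20, 0x8D, 0x0B, 0xFB, 0xA9,
            0x00, 0x8D, 0x0B, 0xFB, 0xA9, 0x00]
BASE = 4


def first_burst(cap_value):
    """Return (low_byte, high_byte) of the first burst in a 128-bit capture."""
    low = cap_value & 0xFF             # bits 0-7
    high = (cap_value >> 64) & 0xFF    # bits 64-71
    return low, high


def expected_pair(addr):
    """Return the (low, high) bytes of the 16-bit word that holds addr."""
    word_base = addr & ~1              # both bytes of a word share this address
    return EXPECTED[word_base - BASE], EXPECTED[word_base + 1 - BASE]


# A capture holding the word at addresses 4/5, other bursts left as zero.
cap = (0x20 << 64) | 0xA9
print(first_burst(cap) == expected_pair(4))
print(first_burst(cap) == expected_pair(5))
```

Because an even address and the odd address after it share one 16-bit word, both lookups yield the same pair of bytes, just as in the capture.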

Comparing the diagram to the values we expect, we can confirm that our design works correctly. The LED on the Arty A7 also flashes as we expect.

In Summary

In this post we integrated our 6502 based design with the DDR RAM on the Arty A7 and verified that read/writes work correctly.

In the next post we will also wire up the SD Card ports to our top module and confirm that our 6502/SD Card/DDR design works together as expected.

Until Next time!