Friday 5 January 2024

Extending our Hypothetical Amiga core

Foreword

In the previous post we added some more functionality to issue some more sensible commands to the second channel of our memory controller.

We also created a Hypothetical Amiga core for serving our second memory channel with the sensible read/write commands. This Hypothetical Amiga core is a work in progress, to which we will keep adding functionality in coming posts.

In this post we will hook up the Minimig (aka Mini Amiga) core to our mostly empty Amiga core.

We briefly played with the Minimig core a couple of posts ago, just to get a feel for how it works. In particular, one of the Amiga features we briefly looked at is the memory overlay scheme at bootup, where the RAM starting at address 0 is disabled and a piece of ROM is mapped into this range of memory instead.

However, when we previously played with the Minimig, we barely touched on the technicalities of reading from memory via the Minimig. In this post we will explore some of these technicalities.

DACT Battles

From discussions in previous posts, we know that for a Motorola 68k processor to access memory, two signals are of importance: ASn and DACT (the data acknowledge signal, named DTACKn in the code snippets in this post). The CPU asserts ASn to signal that it wants to access memory. When the memory is ready with the data, it signals the CPU in return by asserting the DACT signal.

While playing with the Minimig core, I discovered something that puzzled me a bit from the following waveform:

In this waveform, the Amiga core is still in overlay mode, meaning that the Amiga would translate address 0 to address 3c0000 hex. The Amiga core, however, asserts the DACT signal (the second signal in the waveform) before the address changes to 3c0000. This means that the data could potentially be read from the incorrect address and returned to the CPU.

I was wondering how the Minimig core of the MiSTer project deals with this scenario. Delving a bit into the source code, I found the following snippet in the file rtl/cpu_wrapper.v:

...
fx68k cpu_inst_o
(
...
	.DTACKn(ramsel ? ~ramready : chip_dtack),
...
);
...

Here we see that for RAM accesses the dact signal of the Minimig core isn't used, but rather the ramready signal.

So, what does the ramready signal entail? The answer lies in the following snippet of code within the file rtl/sdram_ctrl.v:

...
cpu_cache_new cpu_cache
(
	.clk              (sysclk),                // clock
	.rst              (!reset || !cache_rst),  // cache reset
	.cpu_cache_ctrl   (cpu_cache_ctrl),        // CPU cache control
	.cache_inhibit    (cache_inhibit),         // cache inhibit
	.cpu_cs           (ramsel),                // cpu activity
	.cpu_adr          (cpuAddr),               // cpu address
	.cpu_bs           ({!cpuU, !cpuL}),        // cpu byte selects
	.cpu_we           (cpustate == 3),         // cpu write
	.cpu_ir           (cpustate == 0),         // cpu instruction read
	.cpu_dr           (cpustate == 2),         // cpu data read
	.cpu_dat_w        (cpuWR),                 // cpu write data
	.cpu_dat_r        (cpuRD),                 // cpu read data
	.cpu_ack          (cache_rd_ack),          // cpu acknowledge
	.wb_en            (cache_wr_ack),          // write enable
	.sdr_dat_r        (sdata_reg),             // sdram read data
	.sdr_read_req     (cache_req),             // sdram read request from cache
	.sdr_read_ack     (cache_fill),            // sdram read acknowledge to cache
	.snoop_act        (chipWE),                // snoop act (write only - just update existing data in cache)
	.snoop_adr        (chipAddr),              // snoop address
	.snoop_dat_w      (chipWR),                // snoop write data
	.snoop_bs         ({!chipU, !chipL})       // snoop byte selects
);
...
assign ramready = cache_rd_ack || write_ena;
...

Here we see that the Amiga core used in the MiSTer project doesn't read directly from SDRAM, but rather via a cache. Looking at the implementation of cpu_cache_new, we see that it is quite an advanced cache, with the same kind of functionality you would find in a cache on a modern-day CPU. This cache will even snoop the writes that peripheral chips have made to memory via DMA, so that the CPU will not miss out on these updates.

This cache is 8KB in size. I am not so sure if I will be using a cache in my implementation as well. The MiSTer project is meant for the DE10-Nano, which has over 500KB of block RAM. I am using the Arty A7, which has far less Block RAM, so I am not sure I would be able to match the capabilities of the DE10-Nano.

So, for now I will work with an implementation that doesn't have a cache, just to go easy on Block RAM.

More clocking

Now, from the previous post you will remember that we are clocking our amiga_mem_core module with a clock signal called clk_8_2_mhz. This clock signal only triggers once every 10 clock cycles of our mclk of 83.3333MHz, which resolves to a frequency of 8.3333MHz.

Well, it turns out that our Minimig design wants a clock signal of more or less 28MHz. So, instead of one clock cycle out of every ten, we will need to enable more clock cycles out of every 10.

We could enable every second clock cycle, which would give us 41MHz, but that is maybe a bit too fast. Alternatively, out of every 10 clock cycles we can enable only 4, giving us a frequency of 33MHz. I think this is reasonably close to 28MHz, given that we are limited to enabling different clock cycles within every 10 of them.

So, let us fiddle a bit with how clk_8_2_mhz is generated:

...
    BUFGCE BUFGCE_8_2_mhz (
       .O(clk_8_2_mhz),   // 1-bit output: Clock output
       .CE(clk_8_2_enable), // 1-bit input: Clock enable input for I0
       .I(mclk)    // 1-bit input: Primary clock
    );
...
    always @(negedge mclk)
    begin
        clk_8_2_enable <= (edge_count == 7 || edge_count == 5 || edge_count == 3);
    end
...
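As a rough sketch of the arithmetic, enabling N out of every 10 mclk cycles gives an effective clock of N x 8.333MHz: three enabled cycles work out to 25MHz and four to 33.3MHz. The snippet above lists three edge_count values, so if edge_count cycles through 0 to 9 (an assumption on my side, as its declaration is not shown), a fourth value would be needed to reach the 33MHz mentioned earlier. The module below is a hypothetical helper, not part of the original design, that makes the set of enabled cycles a parameter so that different ratios are easy to try:

```verilog
// Hypothetical helper: enable a selectable subset of every 10 mclk cycles.
// Assumes edge_count is 4 bits wide and cycles 0..9 in the mclk domain.
module clk_enable_gen #(
    // One bit per edge_count value; bits 3, 5 and 7 are set here,
    // matching the three values in the snippet above (3 of 10 = 25MHz).
    parameter [9:0] ENABLE_MASK = 10'b0010101000
)(
    input  wire       mclk,
    input  wire [3:0] edge_count,
    output reg        clk_enable = 1'b0
);
    wire [9:0] mask = ENABLE_MASK;

    // Registered on the falling edge, mirroring the original snippet.
    always @(negedge mclk)
    begin
        clk_enable <= mask[edge_count];
    end
endmodule
```

The clk_enable output would then drive the CE input of the BUFGCE in place of clk_8_2_enable. Setting one more bit in ENABLE_MASK, for example bit 1 or bit 9, would give four pulses per ten cycles.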

Having fed our amiga_mem_core module with the correct clock frequency, we need to instantiate a module instance within this module for generating the other clock signals the minimig core requires:

   amiga_clk amiga_clk
        (
          .clk_28(clk),     // 28MHz output clock ( 28.375160MHz)
          .clk7_en(clk7_en),    // 7MHz output clock enable (on 28MHz clock domain)
          .clk7n_en(clk7n_en),   // 7MHz negedge output clock enable (on 28MHz clock domain)
          .c1(c1),         // clk28m clock domain signal synchronous with clk signal
          .c3(c3),         // clk28m clock domain signal synchronous with clk signal delayed by 90 degrees
          .cck(cck),        // colour clock output (3.54 MHz)
          .eclk(eclk),       // 0.709379 MHz clock enable output (clk domain pulse)
          .reset_n(~(reset))
        );

This is also a module I used straight from the minimig project, where it lives in the file rtl/amiga_clk.v.

For completeness' sake, let us add the other module instances:

...
   always @(negedge clk)
    begin
      phi <= ~phi;
    end
...
   minimig minimig(     //m68k pins
     .cpu_address(add), // m68k address bus
     .cpu_data(data_in),    // m68k data bus
     .cpudata_in(data_out),  // m68k data in
     ._cpu_ipl(interrupts),    // m68k interrupt request
     ._cpu_as(As),     // m68k address strobe
     ._cpu_uds(Uds),    // m68k upper data strobe
     .button_reset(reset),
     ._cpu_lds(Lds),    // m68k lower data strobe
     .cpu_r_w(read_write),     // m68k read / write
     ._cpu_dtack(data_ack),  // m68k data acknowledge
     ._cpu_reset(/*reset*/),  // m68k reset
     ._cpu_reset_in(reset_cpu_out),//m68k reset in
     .nmi_addr(0),    // m68k NMI address
     //TODO
     //sram pins
     .ram_data(data),    // sram data bus
     .ramdata_in(ram_data_in),  // sram data bus in
     .ram_address(address), // sram address bus
     ._ram_bhe(),    // sram upper byte select
     ._ram_ble(),    // sram lower byte select
     ._ram_we(write),     // sram write enable
     ._ram_oe(oe),     // sram output enable
     .chip48(),      // big chipram read
 
     //system    pins
     .rst_ext(),     // reset from ctrl block
     .rst_out(),     // minimig reset status
     .clk(clk),         // 28.37516 MHz clock
     .clk7_en(clk7_en),     // 7MHz clock enable
     .clk7n_en(clk7n_en),    // 7MHz negedge clock enable
     .c1(c1),          // clock enable signal
     .c3(c3),          // clock enable signal
     .cck(cck),         // colour clock enable
     .eclk(eclk),        // ECLK enable (1/10th of CLK)
 
     //rs232 pins
     .rxd(),         // rs232 receive
     .txd(),         // rs232 send
     .cts(),         // rs232 clear to send
     .rts(),         // rs232 request to send
     .dtr(),         // rs232 Data Terminal Ready
     .dsr(),         // rs232 Data Set Ready
     .cd(),          // rs232 Carrier Detect
     .ri(),          // rs232 Ring Indicator
 
 
     //host controller interface (SPI)
     .IO_UIO(),
     .IO_FPGA(),
     .IO_STROBE(),
     .IO_WAIT(),
     .IO_DIN(),
     .IO_DOUT()
 

 
     //user i/o
     //output  [1:0] cpucfg,
     //output  [2:0] cachecfg,
     //output  [6:0] memcfg,
     //output        bootrom,     // enable bootrom magic in gary.v
);
...
   fx68k fx68k(        .clk(clk),
        .HALTn(1),                    // Used for single step only. Force high if not used
        // input logic HALTn = 1'b1,            // Not all tools support default port values
        
        // These two signals don't need to be registered. They are not async reset.
        .extReset(reset),            // External sync reset on emulated system
        .pwrUp(reset),            // Asserted together with reset on emulated system coldstart    
        .enPhi1(phi), .enPhi2(~phi),    // Clock enables. Next cycle is PHI1 or PHI2
        .eRWn(read_write),
        .oRESETn(reset_cpu_out),
        //output eRWn, output ASn, output LDSn, output UDSn,
        //output logic E, output VMAn,    
        //output FC0, output FC1, output FC2,
        //output BGn,
        //output oRESETn, output oHALTEDn,
        .ASn(As), 
        .LDSn(Lds), 
        .UDSn(Uds),
        .DTACKn(data_ack), 
        .VPAn(1),
        .BERRn(1),
        .BRn(1), .BGACKn(1),
        .IPL0n(interrupts[0]), 
        .IPL1n(interrupts[1]), 
        .IPL2n(interrupts[2]),
        .iEdb(data_in),
        .oEdb(data_out),
        .eab(add)
);
...

I have described the use of both modules in a previous post. You will also notice that DTACKn still blindly uses the data_ack signal from minimig, which I warned against in the previous section. We will give attention to this in the next section.

Synchronised Memory Access

As mentioned earlier, when the minimig core asserts the dact signal, there is no guarantee that the data will be available for the CPU. So, one needs to delay the dact signal somehow until the data is ready.

One signal we can use for this is the _ram_oe signal from the minimig core. Once this signal is asserted, we can be sure that the address asserted is correct and we can fetch the correct data. Obviously, we will only assert the DACT signal once the data is really ready.

We will implement all this logic with the following state machine:

    always @(posedge clk)
    begin
       case(dact_state)
         STATE_IDLE: begin
                       dact_state <= (!oe && !data_ack) ? STATE_OE : STATE_IDLE;
                     end
         STATE_OE: begin
                       dact_state <= STATE_OE_1;
                     end
         STATE_OE_1: begin
                       dact_state <= STATE_DACT;
                     end
         STATE_DACT: begin
                       if (data_ack)
                       begin
                           dact_state <= STATE_IDLE;
                       end
                     end

       endcase
    end

We transition from the IDLE state to the next state when both oe and data_ack are asserted (both are active low), just to ensure we act on a CPU memory access and not on a peripheral DMA access.

Purely for testing, I have added two states to simulate a two-cycle memory access time. When we eventually use our real memory controller, more clock cycles will apply.

We are now ready to supply our CPU with a true DACT signal:

   fx68k fx68k(        .clk(clk),
...
      .DTACKn(dact_state == STATE_DACT ? data_ack : 1), 
...
);
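Since the number of wait states will change once the real memory controller is in place, one possible variation is to replace the fixed STATE_OE and STATE_OE_1 states with a small counter. The following is only a sketch under my own assumptions: the module name, the port names and the WAIT_CYCLES parameter are all hypothetical, and a synchronous reset is added that the original state machine does not have.

```verilog
// Hypothetical sketch: hold back DTACKn for WAIT_CYCLES clocks after
// both _ram_oe and the minimig data_ack have gone low.
module dtack_delay #(
    parameter WAIT_CYCLES = 2        // assumed memory latency, must be >= 1
)(
    input  wire clk,
    input  wire reset,               // synchronous, active high (assumption)
    input  wire oe_n,                // _ram_oe from minimig, active low
    input  wire data_ack_n,          // _cpu_dtack from minimig, active low
    output wire dtack_n              // gated DTACKn for the CPU
);
    localparam STATE_IDLE = 2'd0;
    localparam STATE_WAIT = 2'd1;
    localparam STATE_DACT = 2'd2;

    reg [1:0] state = STATE_IDLE;
    reg [7:0] wait_count = 8'd0;

    always @(posedge clk)
    begin
        if (reset) begin
            state <= STATE_IDLE;
            wait_count <= 8'd0;
        end else begin
            case (state)
                STATE_IDLE: begin
                    if (!oe_n && !data_ack_n) begin
                        wait_count <= WAIT_CYCLES - 1;
                        state <= STATE_WAIT;
                    end
                end
                STATE_WAIT: begin
                    // Stay here for WAIT_CYCLES clocks in total.
                    if (wait_count == 0)
                        state <= STATE_DACT;
                    else
                        wait_count <= wait_count - 8'd1;
                end
                STATE_DACT: begin
                    // Wait for minimig to release its acknowledge.
                    if (data_ack_n)
                        state <= STATE_IDLE;
                end
                default: state <= STATE_IDLE;
            endcase
        end
    end

    // Only pass the acknowledge through once the wait has elapsed.
    assign dtack_n = (state == STATE_DACT) ? data_ack_n : 1'b1;
endmodule
```

With WAIT_CYCLES set to 2, this spends two clocks between IDLE and DACT, the same as the two intermediate states in the original state machine.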

What remains to be done is to link up the memory. For now I will just simulate memory with hardcoded values that form a small program, and see if our CPU acts accordingly:

...
   always @(posedge clk)
   begin
       oe_delayed <= oe;
   end
...   
   assign trigger_read = oe_delayed && !oe;
...   
    always @(negedge clk_8_2_mhz)
        begin
            if (amiga_test_address == 22'h3c0000 && trigger_read)
            begin
              data_in_amiga_test <= 16'hc0;
            end else if (amiga_test_address == 22'h3c0001 && trigger_read)
            begin
              data_in_amiga_test <= 16'h33c3;
            end else if (amiga_test_address == 22'h3c0002 && trigger_read)
            begin
              data_in_amiga_test <= 16'h0;
            end else if (amiga_test_address == 22'h3c0003 && trigger_read)
            begin
              data_in_amiga_test <= 16'h0008;
            end else if (amiga_test_address == 22'h3c0004 && trigger_read)
            begin
              data_in_amiga_test <= 16'h303c; //load immediate
            end else if (amiga_test_address == 22'h3c0005 && trigger_read)
            begin
              data_in_amiga_test <= 16'h0505;
            end else if (amiga_test_address == 22'h3c0006 && trigger_read)
            begin
              data_in_amiga_test <= 16'h33c0; //store
            end else if (amiga_test_address == 22'h3c0007 && trigger_read)
            begin
              data_in_amiga_test <= 16'h0085;
            end else if (amiga_test_address == 22'h3c0008 && trigger_read)
            begin
              data_in_amiga_test <= 16'h8586;
            end else if(amiga_test_address == 22'h3c0009 && trigger_read)
            begin
              data_in_amiga_test <= 16'h4eb9;
            end else if (amiga_test_address == 22'h3c000a && trigger_read)
            begin
              data_in_amiga_test <= 9;
            end else if (amiga_test_address == 22'h3c000b && trigger_read)
            begin
              data_in_amiga_test <= 16'h3e86;
            end
            else if (trigger_read) begin
              data_in_amiga_test <= 16'h33c0;
            end
        end
...

The data only gets assigned once the oe signal transitions from 1 to 0. As we know, when the 68k processor starts to execute, it starts by loading the vectors at address 0, which the minimig core translates to 3c0000.

As you might remember, the address at which the 68k starts executing is given by the vector starting at byte address 4, or 16-bit word address 2. In the code above, this vector contains the value 8, which is byte address 8, or word address 4. For this reason the program actually starts at word address 3c0004.
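The hardcoded test memory could also be written as a small lookup table, which makes the layout of the two reset vectors and the program a bit easier to see. This is just a sketch: the module and port names are hypothetical, and I am assuming the test address is 22 bits wide because it is compared against 22-bit constants. The data values are the ones from the code above.

```verilog
// Hypothetical lookup-table version of the hardcoded test memory.
module amiga_test_rom(
    input  wire        clk,           // clk_8_2_mhz in the code above
    input  wire        trigger_read,  // pulse on the 1 -> 0 transition of oe
    input  wire [21:0] address,       // word address (amiga_test_address)
    output reg  [15:0] data           // data_in_amiga_test
);
    reg [15:0] rom [0:11];

    initial begin
        rom[0]  = 16'h00c0;  // vector 0 (initial stack pointer), high word
        rom[1]  = 16'h33c3;  // vector 0, low word
        rom[2]  = 16'h0000;  // vector 1 (initial program counter), high word
        rom[3]  = 16'h0008;  // vector 1, low word: byte address 8
        rom[4]  = 16'h303c;  // load immediate
        rom[5]  = 16'h0505;
        rom[6]  = 16'h33c0;  // store
        rom[7]  = 16'h0085;
        rom[8]  = 16'h8586;
        rom[9]  = 16'h4eb9;
        rom[10] = 16'h0009;
        rom[11] = 16'h3e86;
    end

    always @(negedge clk)
    begin
        if (trigger_read) begin
            if (address >= 22'h3c0000 && address <= 22'h3c000b)
                data <= rom[address[3:0]];
            else
                data <= 16'h33c0;  // same fallback value as the code above
        end
    end
endmodule
```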

Let us have a look at what the waveform looks like, captured from the real FPGA:

The third line is the address sent to RAM for retrieving data. It resolves to the valid address starting with '3c' at the falling edge of cck. The second-last row shows the resulting data asserted shortly thereafter.

In Summary

In this post I added some more logic to our amiga_mem_core to make it resemble an Amiga more closely.

We also tested whether our 68k core could reliably fetch data and execute code.

In the next post we will try and link up our Amiga controller to our SDRAM memory controller.

Until next time!

Monday 23 October 2023

Adding more functionality to the second channel of the Memory controller

Foreword

In the previous post we started modifying our existing memory controller to become a dual channel memory controller.

A dual channel memory controller would allow us to have two cores both accessing memory at 7MHz, by allocating a different bank of memory within the DDR3 memory for each core.

In the previous post we basically got the timings right to trigger the DDR3 commands of the cores in an interleaved way.

In this post we are going to extend this functionality further and add a core to issue some dummy read/write commands on the second memory channel and see if we can read some sensible data back from DDR RAM via the second memory channel.

Using sensible addresses

In the previous post we didn't really worry about using sensible row/column addresses for the second channel of our memory controller and we just used the same hardcoded address for both the row and the column.

So, let us start this post by seeing if we can create some sensible row and column addresses. Firstly, we will create a block of code for driving our second memory channel:

amiga_mem_core amiga_mem_core(.clk(clk_8_2_mhz),
    .address(channel_address_2),
    .data(channel_data_2),
    .data_in(cap_value_2),
    .write(write_channel_2),
    .reset(reset_retro)
);

amiga_mem_core is our hypothetical Amiga core that will use the second memory channel for its memory needs. We will gradually develop this core in coming sections and future posts.

Let us quickly discuss the different ports of amiga_mem_core:

clk_8_2_mhz: This is basically the same kind of clock as the one that drives our main 6502 core. This is the 83.333MHz clock, but we only present every tenth clock pulse, which gives us an effective clock of 8.333MHz. I would like to point out here that, of the 10 available clock pulses, we will use a different one than we use for our 6502 core, because the second memory channel requires the address to be asserted at a different time than the first memory channel.
channel_address_2: A 16-bit linear address, giving a 64K address space. We will slice and dice this address to get the row address and the column address.
cap_value_2: 16 bits of data captured from DDR3 RAM. As we know from previous posts, the ISERDES captures this data from DDR3 RAM, but throws it away after the next 83MHz clock cycle. So, we need to capture this data so it is still available at the next 8.33MHz clock pulse.
write_channel_2: The Amiga core indicates whether it wants to write (i.e. set to 0) or read (i.e. set to 1).

Let us modify our memory controller state machine a bit to use the values from these ports:

              WAIT_READ_WRITE_2: begin
                  test_cmd <= 32'h000001ff;
                  phy_rcw_pos_2 <= 3;
                  phy_address_2 <= {9'b0,channel_address_2[15:10]};
				  state <= PRECHARGE_AFTER_WRITE;
              end
			  
	      PRECHARGE_AFTER_WRITE: begin
                  // CAS command
                  phy_rcw_pos_2 <= {2'b10, write_channel_2};
                  phy_address_2 <= {5'b0,channel_address_2[9:3], map_address_2[2:0]};
                  data_in <= {8{channel_data_2}};
                  dq_tri = write_channel_2 ? 15 : 0;
                  mem_channel <= 1;
                  state <= POST_READ_1;
                  cmd_slot <= 3;
                  test_cmd <= write_channel_2 ? 32'h000029fd : 32'h00002dfd;
              end

If you have a look at my previous post, you will see that I had already modified the above two selectors of the state machine to open a row for the second memory channel in the first selector, and then do a column read/write in the second selector. In this case I have added some more logic to use the address of our Hypothetical Amiga core.

Note that, as with our first channel, we form the row address from bits 10 upwards of the Amiga core address, and the column address from the lower ten bits.

You will notice I am not using the lower three bits as is for the column address, but rather make use of a map. I have used the same technique in the first channel of our memory controller. Let us quickly recap the reason for this.

As you might remember from previous posts, DDR3 memory will never just give you the single 16-bit word you are looking for, but will always return a burst of 4 or 8 words. Catching the data in the correct chunk within the 8-word burst is quite a challenge, and you need to fiddle quite a bit with the code to get it right.

So, I just took the lazy route: I looked at which word arrives for each address from 0 to 7, and then created a map to get the correct word within the burst. My mapping function looks like this:

    always @*
    begin
        if (channel_address_2[2:0] == 0)
        begin
            map_address_2 = 7;
        end else if (channel_address_2[2:0] == 1)
        begin
            map_address_2 = 0;
        end else if (channel_address_2[2:0] == 2)
        begin
            map_address_2 = 1;
        end else if (channel_address_2[2:0] == 3)
        begin
            map_address_2 = 2;
        end else if (channel_address_2[2:0] == 4)
        begin
            map_address_2 = 3;
        end else if (channel_address_2[2:0] == 5)
        begin
            map_address_2 = 4;
        end else if (channel_address_2[2:0] == 6)
        begin
            map_address_2 = 5;
        end else
        begin
            map_address_2 = 6;
        end
    end

Also, the mapping function differs between the simulation environment and the actual FPGA. I never managed to find the reason for the difference between the two, so for now I am just using two different mapping functions for the two environments.
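Looking at the table in the mapping function above (0 maps to 7, 1 to 0, 2 to 1 and so on up to 7 mapping to 6), it amounts to subtracting one modulo 8. A more compact way of writing that particular mapping could look like the sketch below. The module and port names are hypothetical, and this only covers the mapping shown above, not the second mapping used in the other environment.

```verilog
// Hypothetical compact form of the burst word mapping shown above.
module burst_word_map(
    input  wire [2:0] word_in,   // channel_address_2[2:0]
    output wire [2:0] word_out   // map_address_2
);
    // 0 -> 7, 1 -> 0, 2 -> 1, ... 7 -> 6: a decrement that wraps
    // around because the result is truncated to 3 bits.
    assign word_out = word_in - 3'd1;
endmodule
```

The explicit if/else chain has the advantage that an arbitrary, irregular mapping can be filled in while experimenting, so the compact form only makes sense once the offset is known to be a plain rotation.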

Moving on to the data_in assignment. Here I am just repeating the data I want to write for the full burst, until the write is complete. It is important in this case to assert the data mask bit at the correct time instant, to ensure the correct word is written within an 8-word column. So, I am just doing another mapping function:

    always @*
    begin
        if (cmd_offset[2:0] == 0) 
        begin
            dm_slot = ~1;
        end else if (cmd_offset[2:0] == 1)
        begin
            dm_slot = ~2;
        end else if (cmd_offset[2:0] == 2)
        begin
            dm_slot = ~4;
        end else if (cmd_offset[2:0] == 3)
        begin
            dm_slot = ~8;
        end else if (cmd_offset[2:0] == 4)
        begin
            dm_slot = ~16;
        end else if (cmd_offset[2:0] == 5)
        begin
            dm_slot = ~32;
        end else if (cmd_offset[2:0] == 6)
        begin
            dm_slot = ~64;
        end else if (cmd_offset[2:0] == 7)
        begin
            dm_slot = ~128;
        end
    end

The wire cmd_offset is used for both channels, so it is important we have a selector like this:

    assign cmd_offset = mem_channel == 0 ? cmd_address[2:0] : channel_address_2[2:0];
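The data mask mapping above clears exactly one bit, the one at position cmd_offset, and sets all the others, so it can also be expressed as an inverted one-hot shift. The sketch below assumes dm_slot is 8 bits wide, with one bit per word in the burst; its declaration is not shown, so treat the width and the module name as assumptions.

```verilog
// Hypothetical compact form of the data mask mapping shown above.
module dm_slot_decode(
    input  wire [2:0] cmd_offset,
    output wire [7:0] dm_slot    // assumed width: one mask bit per burst word
);
    // Shift a single 1 to the position of the word we want to write,
    // then invert so that only that word is left unmasked.
    assign dm_slot = ~(8'b0000_0001 << cmd_offset);
endmodule
```

For example, a cmd_offset of 3 gives 8'b1111_0111, the same value as ~8 in the if/else chain when truncated to 8 bits.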

Implementing the Hypothetical Amiga core

Let us implement the Hypothetical Amiga core we have been talking about in this post. This is basically the core where we will do some writes using the second memory channel and see if we can read the same data back. In future posts we will gradually evolve this core into a fully functional Amiga core.

This core will basically be a 6-bit counter, where we use the top bit to indicate read/write, with low indicating write. So, starting with the top bit as zero, we will do a bunch of writes, and when the counter reaches the point where bit 5 (i.e. the top bit) is set, we will do a series of reads.

The resulting core is fairly simple:

module amiga_mem_core(
    input wire clk,
    output wire [15:0] address,
    output wire write,
    input wire reset,
    output wire [15:0] data,
    input wire [15:0] data_in
    );
    
   (* mark_debug = "true" *) reg [5:0] counter = 0;
   (* mark_debug = "true" *) reg [15:0] captured_data;
   
   assign address = {11'b0, counter[4:0]};
   assign write = counter[5];
    
   always @(posedge clk)
   begin
       counter <= reset ? 0 : (counter + 1);
   end
   
   always @(posedge clk)
   begin
       captured_data <= data_in;
   end
   
   assign data = counter + 3;
endmodule

I have marked counter and captured_data to be debugged, so we can view those ports via ILA when running on the actual FPGA.

We also use the counter to generate some test data, adding three to it so that the test data differs from the address.

I mentioned earlier that the data the ISERDES captures is only retained for one 83.33MHz clock cycle, so by the time our Amiga core looks for the data, it will be long gone. So, we will need to capture it outside the Amiga core and feed it to the Amiga core like this:

    always @(posedge mclk)
    begin
        if (edge_count == 7)
        begin
            cap_value_2 <= {data_out[103:96], data_out[39:32]};
        end
    end

So, we always capture the data at the specific 83MHz clock cycle when the data is available. data_out is basically the output of our ISERDES block, which has captured a burst of 8 data words. Bits 63 - 0 contain the low byte of each of the 8 words in the burst, and bits 127 - 64 contain the high byte of each of the 8 words. By experimentation I found that the data we need is always at bits 39:32 and bits 103:96.

In Summary

In this post we added some more meat around the second channel of our memory controller, managed to write some test data to the DDR3 RAM and read the same data back.

In the next post we will start to do some more interesting stuff, and see if we can add an Amiga core that uses the second memory channel for memory storage.

Until next time!

Thursday 28 September 2023

New beginnings of a dual channel DDR3 memory controller

Foreword

In the previous post we managed to get a 6502 based ecosystem together where we could access both an SD Card and DDR3 memory.

With this design we can load quite a lot of stuff from SD Card into DDR3 memory and thus reduce our dependency on the limited Block RAM that is available on the FPGA. This opens the possibility to emulate an Amiga core on the Arty A7 FPGA board.

As mentioned in previous posts, we will be using a 6502 based system that will do all the work of loading everything the Amiga core requires from SD Card into DDR3. Needless to say, this would require both the 6502 core and the Amiga core to access the DDR3 memory.

One way to address the need for both the 6502 and the Amiga core to access the DDR3 would be to use the memory controller we developed in the last couple of posts, and just let the two cores take turns accessing DDR3 memory. Knowing that our memory controller runs at around 8MHz, that would mean that our Amiga core would be accessing DDR3 memory at around 4MHz, because it would be accessing memory on every second clock cycle. This is far from ideal, with a stock Amiga running at 7MHz at least.

So, in this post we will try and come up with an optimised dual channel memory controller where we will attempt to make both the Amiga core and 6502 core access memory at 7MHz.

The Magic of Memory Banks

In our journey with DDR3 memory, we got to know the different states memory can be in:

Activate: Activate a row for reading or writing
Read/Write: Read or write a particular column of data
Precharge: After you are finished with your reads/writes on a particular row, you first need to precharge the row, before moving on to the next one.

All the above mentioned takes time to complete. In my Arty A7 scenario, each of these states takes about 5 memory clock cycles to complete.

Once you have an open row, however, consecutive memory reads from the same row can be quite fast, provided you give the column addresses ahead of time.

Things, however, will not work out so well for our plan where a 6502 and an Amiga core need to access memory. The Amiga core, for instance, might need to access data from a different row than the one the 6502 is currently busy with. In such a case the Amiga core needs to wait for the 6502 core to finish its business with the current row, before the Amiga core can open the row it wants. This will again bring us to the point where the Amiga core can only access the memory at half the available memory bandwidth, i.e. 4MHz.

However, all hope is not lost. DDR3 memory divides memory into different memory banks, and each memory bank can have a row open independently of the other banks. The DDR3 memory chip on the Arty A7 has 8 memory banks. This means that each memory bank is 256MB/8 = 32MB.

So, the basic idea is to give the 6502 core and the Amiga core each its own bank, so that theoretically every core can get the full memory bandwidth of 8MHz. One just needs to carefully schedule the timing of when to issue DDR3 commands, so that these cores don't trip over each other. DDR3 RAM, for instance, still has only one data bus, so if you issue read commands to two banks, you can't expect the data to arrive at the same time. It will first output the data from the first bank, and thereafter the data from the other bank.

Coming back to the size of every memory bank: 32MB per bank is more than enough for what we want to do. For the Amiga core this will be more than enough for the ROMs and the amount of RAM you would get on the earlier Amigas. For the 6502 core this will also be more than enough to store a disk image and simulate a disk read from the Amiga.

Using timeslots wisely

The memory on the Arty A7 clocks at 333MHz, which is far beyond the speed capability of the FPGA on the Arty A7. As we learned from previous posts, the designers of the FPGA provided a way out in the form of OSERDES blocks for serialising out data. The OSERDES blocks themselves can serialise the data out at 333MHz. We need to provide the data to this block 4 chunks at a time, which reduces the required speed of the rest of the FPGA to 83MHz, which is more manageable.

Now, in our current design, for every 4 timeslots we can issue at most one DDR command. We have a choice of where this command can happen, but there can be at most one command within 4 timeslots.

With our plan to interleave DDR commands for a 6502 core and an Amiga core, issuing 1 command per 4 cycles is perhaps too tight. After thinking about this for a while, I came up with the idea of having two commands per 4 timeslots. I want to reserve the first two timeslots for the 6502 core, and the last 2 slots for the Amiga core.

To see how we are going to change our design to cater for this, let us review how the current design works. The following is a snippet of one of the selectors in our state machine for our memory controller:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
              end

test_cmd is the command we want to issue. I am not going to explain the individual bits for this, but it basically indicates what RAS/CAS/WRITE should be set as for the command. cmd_slot indicates at which of the 4 time slots the command should be issued.

The bits of these two registers go down a number of levels, until we reach the following snippet:

  cmd_addr #(
    .IODELAY_GRP(IODELAY_GRP),
    .IOSTANDARD(IOSTANDARD_CMDA),
    .SLEW(SLEW_CMDA),
    .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
    .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE),
    .ADDRESS_NUMBER(ADDRESS_NUMBER)
  ) cmd_addr_i(
    .ddr3_a   (ddr3_a[ADDRESS_NUMBER-1:0]), // output address ports (14:0) for 4Gb device
    .ddr3_ba  (ddr3_ba[2:0]),             // output bank address ports
    .ddr3_we  (ddr3_we),                 // output WE port
    .ddr3_ras (ddr3_ras),                // output RAS port
    .ddr3_cas (ddr3_cas),                // output CAS port
    .ddr3_cke (ddr3_cke),                // output Clock Enable port
    .ddr3_odt (ddr3_odt),                // output ODT port,
    .cmd_slot (cmd_slot),
    .clk      (clk),                     // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div  (clk_div),                 // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst      (rst),                     // reset delays/serdes
    .in_a     (in_a[2*ADDRESS_NUMBER-1:0]), // input address, 2 bits per signal (first, second) (29:0) for 4Gb device
    .in_ba    (in_ba[5:0]),              // input bank address, 2 bits per signal (first, second)
    .in_we    (in_we[1:0]),              // input WE, 2 bits (first, second)
    .in_ras   (in_ras[1:0]),             // input RAS, 2 bits (first, second)
    .in_cas   (in_cas[1:0]),             // input CAS, 2 bits (first, second)
    .in_cke   (in_cke[1:0]),             // input CKE, 2 bits (first, second)
    .in_odt   (in_odt[1:0]),             // input ODT, 2 bits (first, second)
//    .in_tri   (in_tri[1:0]),             // tristate command/address outputs - same timing, but no odelay
    .in_tri   (in_tri),             // tristate command/address outputs - same timing, but no odelay
    .dly_data (dly_data[7:0]),           // delay value (3 LSB - fine delay)
    .dly_addr (dly_addr[4:0]),           // select which delay to program
    .ld_delay (ld_cmda),               // load delay data to selected iodelayl (clk_div synchronous)
    .set      (set)                      // clk_div synchronous set all delays from previously loaded values
);

At this point we have already stripped off all the necessary bits from the command, as indicated in bold.

You might also pick up that we are doubling up on the bits: we multiply ADDRESS_NUMBER by 2, we pass 6 bits for the bank instead of the required 3, and so on. So, in effect, most of the system already caters for two commands per 4 time slots. It is just that right at the top we are passing down a single command.

Now, within the cmd_addr module, we need to make a couple of changes to handle two commands per 4 time slots. First let us look at the module for outputting the address to DDR3 memory:

// All addresses
generate
    genvar i;
    for (i=0; i<ADDRESS_NUMBER; i=i+1) begin: addr_block
//       assign decode_addr[i]=(ld_dly_addr[4:0] == i)?1'b1:1'b0;
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_addr_i (
    .dq(ddr3_a[i]),               // I/O pad (appears on the output 1/2 clk_div earlier, than DDR data)
    .clk(clk),          // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div(clk_div),      // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst(rst),
    .dly_data(dly_data_r[7:0]),     // delay value (3 LSB - fine delay)
    .din({{2{in_a_r[ADDRESS_NUMBER+i]}},{2{in_a_r[i]}}}),      // parallel data to be sent out
//    .tin(in_tri_r[1:0]),          // tristate for data out (sent out earlier than data!) 
    .tin(in_tri_r),          // tristate for data out (sent out earlier than data!) 
    .set_delay(set_r),             // clk_div synchronous load odelay value from dly_data
    .ld_delay(ld_dly_addr[i])      // clk_div synchronous set odealy value from loaded
);       
    end
endgenerate

Here cmda_single is applicable to a single address bit, so we need to replicate it for every bit of the address. We do that with a for-loop construct.

Now, with the din port we need to supply four bits of data for each address bit, as required by an OSERDES serializer. For the first two timeslots we duplicate the bit of the first address, and for the last two timeslots we duplicate the bit of the second address.

We need to do a similar exercise for the bank address, so I am not going to show the code for that here.

At first sight it may seem a bit puzzling that I duplicate the address bits and bank address bits, instead of pinning them to the correct slot. This works because the address is ignored in the non-command slots, so we can save quite a bit of logic here, especially since there are quite a number of address bits.
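The duplication can be modelled in a few lines of Python. This is only an illustrative sketch and not part of the design: the function name address_din is made up, and the value 15 for ADDRESS_NUMBER is an assumption taken from the "(14:0) for 4Gb device" comment. The bit order simply mirrors the Verilog concatenation.

```python
def address_din(in_a, address_number, i):
    """Return the 4-bit din value for address bit i.

    in_a packs two addresses: bits [address_number-1:0] hold the first
    address and bits [2*address_number-1:address_number] hold the second.
    """
    first = (in_a >> i) & 1
    second = (in_a >> (address_number + i)) & 1
    # Same order as {{2{in_a_r[ADDRESS_NUMBER+i]}},{2{in_a_r[i]}}}:
    # second address in the upper two bits, first address in the lower two.
    return (second << 3) | (second << 2) | (first << 1) | first


ADDRESS_NUMBER = 15  # assumed width, see the lead-in
first_addr = 0b101
second_addr = 0b110
packed = (second_addr << ADDRESS_NUMBER) | first_addr

for bit in range(3):
    print(bit, format(address_din(packed, ADDRESS_NUMBER, bit), "04b"))
```

Each address bit costs only two wires into the serializer, with no multiplexer per bit, which is where the saving in logic comes from.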

For the RAS/CAS/WE bits, we do something like the following:

// we
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
    .dq(ddr3_we),
    .clk(clk),
    .clk_div(clk_div),
    .rst(rst),
    .dly_data(dly_data_r[7:0]),
    .din({cmd_slot[1] ? {in_we_r[0], 1'b1}  : {1'b1 , in_we_r[0]},
          cmd_slot[0] ? {in_we_r[1], 1'b1}  : {1'b1 , in_we_r[1]}}),
    .tin(in_tri_r), 
    .set_delay(set_r),
    .ld_delay(ld_dly_cmd[3]));

Note that, as before, our command slot is still 2 bits wide, but its meaning has changed a bit. Previously cmd_slot was interpreted as a number between 0 and 3, but now each memory channel has its own bit, and each channel has access to only two slots.
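To get a feel for how the WE bits end up in the four slots, here is a small Python model of the din expression above. It is a sketch that copies the Verilog concatenation bit for bit; the function name we_din is made up for illustration, and a value of 1 means the (active low) signal is not asserted.

```python
def we_din(cmd_slot, in_we):
    """Model the 4-bit din value fed to the WE serializer."""
    we0 = in_we & 1
    we1 = (in_we >> 1) & 1
    # Upper pair: cmd_slot[1] ? {in_we[0], 1'b1} : {1'b1, in_we[0]}
    if (cmd_slot >> 1) & 1:
        upper = (we0 << 1) | 1
    else:
        upper = (1 << 1) | we0
    # Lower pair: cmd_slot[0] ? {in_we[1], 1'b1} : {1'b1, in_we[1]}
    if cmd_slot & 1:
        lower = (we1 << 1) | 1
    else:
        lower = (1 << 1) | we1
    return (upper << 2) | lower


for slot in range(4):
    print(format(slot, "02b"), format(we_din(slot, 0b00), "04b"))
```

With both WE inputs at 0, each pair contains exactly one 0, and the matching cmd_slot bit decides which of the two positions in the pair gets it. The other position is always padded with 1, so no command is seen there.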

With all these alterations done to deal with a dual channel memory controller, let us see how our state machine will deal with dual channel memory requests:

              ROW_CMD: begin
                  if (edge_count == 9)
                  begin
                      test_cmd <= 32'h000001ff;
                      phy_rcw_pos_2 <= 2;
                  end else
                  begin
                      test_cmd <= 32'h000005ff;
                      phy_rcw_pos_2 <= 7;
                  end
                  
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
                  
              end

              COL_CMD: begin
                          state <= WAIT_READ_WRITE_0;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;
                          mem_channel <= 0;  
                          data_in <= {8{cmd_data_out}};
                          do_write <= write_out;
              end
              WAIT_READ_WRITE_0: begin
                  state <= WAIT_READ_WRITE_1;
                  dq_tri <= do_write ? 0 : 15;
                  cmd_slot <= 0;
                  test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;                  
              end


              WAIT_READ_WRITE_1: begin
                  state <= WAIT_READ_WRITE_2;
              end

              WAIT_READ_WRITE_2: begin
                  test_cmd <= 32'h000001ff;
                  phy_rcw_pos_2 <= 3;
                  state <= PRECHARGE_AFTER_WRITE;
              end

              PRECHARGE_AFTER_WRITE: begin
                  
                  data_in <= {8{16'h8888}};
                  dq_tri <= 0;
                  phy_rcw_pos_2 <= 4;
                  mem_channel <= 1;
                  state <= POST_PRECHARGE;
                  cmd_slot <= 3;
                  test_cmd <= 32'h000029fd;
              end

              POST_PRECHARGE: begin
                  cap_value <= data_out;
                  state <= ROW_CMD;
                  phy_rcw_pos_2 <= 7;
                  test_cmd <= 32'h000005ff;
              end

I have bolded the parts that are required to perform memory operations for the second channel. For now I have hardcoded a write operation for the second channel, writing the hex value 8888 to a particular memory location in bank 1 every time it is the turn of the second memory channel.

I will give more meaningful stuff for the second memory channel to do in coming posts. For now it is just important to see that these two memory channels can co-exist without any issues.

An important part of the second memory channel's operations is the register phy_rcw_pos_2. This indicates which of the RAS/CAS/WE signals should be asserted in the applicable timeslot for the second memory channel. The bits are as follows:

Bit 0: Write Enable
Bit 1: CAS
Bit 2: RAS

It is important to note that these bits are active when low.
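As a quick sanity check, we can decode the values that the state machine assigns to phy_rcw_pos_2 (2, 3, 4 and 7). The Python sketch below is not part of the design. It uses the bit layout listed above together with the standard DDR3 command encodings, and the names decode_rcw and COMMANDS are made up for illustration.

```python
# Bit positions in phy_rcw_pos_2, as listed above (all active low).
WE_BIT, CAS_BIT, RAS_BIT = 0, 1, 2

# Standard DDR3 command encodings, keyed by the (RAS#, CAS#, WE#) levels.
COMMANDS = {
    (1, 1, 1): "NOP",
    (0, 1, 1): "ACTIVATE",
    (1, 0, 1): "READ",
    (1, 0, 0): "WRITE",
    (0, 1, 0): "PRECHARGE",
    (0, 0, 1): "REFRESH",
    (1, 1, 0): "ZQ CALIBRATION",
    (0, 0, 0): "MODE REGISTER SET",
}


def decode_rcw(value):
    """Translate a 3-bit phy_rcw_pos_2 value into a DDR3 command name."""
    ras = (value >> RAS_BIT) & 1
    cas = (value >> CAS_BIT) & 1
    we = (value >> WE_BIT) & 1
    return COMMANDS[(ras, cas, we)]


for value in (2, 3, 4, 7):
    print(value, decode_rcw(value))
```

So the second channel sees a precharge (2), an activate (3) and a write (4), with 7 acting as a no-operation filler in between.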

Viewing dual channel in action

Let us have a look at our dual channel setup in a simulation waveform:

I have marked with red C's when our 6502 core clocks.

I have marked with lime coloured arrows where operations of our first memory channel happen, which is also the memory channel that our 6502 core uses.

Likewise, I have indicated with blue arrows where operations happen for our second memory channel. As mentioned previously, we currently only do a write operation for our second channel, which operates on bank 1.

You might find it a bit strange that our first blue arrow is a precharge command (i.e. WE and RAS asserted) and not an activate command (i.e. only RAS asserted). This is because this command forms the last command in a series that started in the previous clock cycle.

During the simulation everything worked fine and I didn't get any DDR timing violation errors. I also ran the design on the physical FPGA, and all reads/writes of the first memory channel work 100%.

In Summary

In this post we started to implement a dual channel memory controller. Up to this point we got memory operations for the two channels to live together. The second memory channel, however, is only performing writes at the moment.

In the next post we will do some more work on our second channel of our memory controller, so that it can do some more useful work.

Until next time!

Sunday 23 July 2023

Throwing an SD Card into the mix

Foreword

In the previous post we managed to get our 6502 based design to work with the DDR RAM on the Arty A7 board.

For this post I originally planned to add SD Card access to our current design. However, this proved to be a no-brainer, so covering this topic alone would yield a very short blog post.

Another topic I thought of discussing in this post was to prove that our 6502 based design works at the speed of an equivalent real 6502 processor at 8.33MHz. This is a doubt that pops up in me from time to time when mimicking a retro system that worked with asynchronous RAM by means of synchronous RAM. This occasional doubt sprouts from the fact that with synchronous RAM the data is only available when the clock transitions at the next clock cycle, whereas with asynchronous RAM the data is already available just before the next clock transition.

Proving that our 6502 based design works at the same speed as an equivalent 8.33MHz based system will put this doubt to rest.

Adding SD Card access to our design

I mentioned in the previous post that we need to ensure that we propagate all the SD Card ports of our retro module all the way up to our top module.

With this done, we need to add the following constraints to our xdc file:

set_property -dict {PACKAGE_PIN G13 IOSTANDARD LVCMOS33} [get_ports cs]
set_property -dict {PACKAGE_PIN B11 IOSTANDARD LVCMOS33} [get_ports mosi]
set_property -dict {PACKAGE_PIN A11 IOSTANDARD LVCMOS33} [get_ports miso]
set_property -dict {PACKAGE_PIN D12 IOSTANDARD LVCMOS33} [get_ports sclk]
set_property -dict { PACKAGE_PIN D13   IOSTANDARD LVCMOS33 } [get_ports dat1]; #IO_L6N_T0_VREF_15 Sch=ja[7]
set_property -dict { PACKAGE_PIN B18   IOSTANDARD LVCMOS33 } [get_ports dat2]; #IO_L10P_T1_AD11P_15 Sch=ja[8]
set_property -dict {PACKAGE_PIN A18 IOSTANDARD LVCMOS33} [get_ports cd]
set_property -dict {PACKAGE_PIN K16 IOSTANDARD LVCMOS33} [get_ports wp]

Having done all this, I found that our design works perfectly with the flashing LED, as explained in this post: https://c64onfpga.blogspot.com/2023/04/sd-card-access-for-arty-a7-part-9.html. The only difference is that we use DDR for RAM and not Block RAM.

Another mission accomplished!!!

Benchmarking

Let us now see if we can determine the speed of our 6502 based design compared to an equivalent retro 6502 based system.

The benchmark I will be using, will be a bit of an unconventional one. I will be running a 6502 machine code program on a VICE Commodore64 emulator, constantly changing the border color. I will then run a similar program on our 6502 based FPGA, but flashing an LED. The comparison between the time changing the color of the border vs toggling an LED will be our benchmark.

This is perhaps an unfair comparison, because the C64 loses execution time to interrupts and to the VIC-II, which occasionally steals cycles from the 6502 to fetch extra display data. For this reason we will disable interrupts and blank the screen to avoid cycle stealing. The resulting C64 program is as follows:

    sei
    lda $d011
    and #$ef
    sta $d011
    lda #0
    sta $4
    sta $5
    sta $6
    sta $7

lp1
    inc $4
    bne lp1
lp2
    inc $5
    bne lp1
lp3
    inc $6
    lda $6
    cmp #60
    bne lp1
    lda #$02
    eor $7
    sta $d020
    sta $7
    lda #0
    sta $6
    beq lp1

This program will blank the screen and alternate between a red and a black border. To determine how long it takes for the border color to transition, I made a video recording of the screen. I then played the recorded video back with VLC media player, making a note of the timestamp when the screen transitions to red and when it transitions to black again.

Here is a screenshot of when the screen turns red:

Here we see that the screen turns red 26 seconds into the video. Next, the following screenshot shows when the screen turns black again:

Here we see the screen turns black again 57 seconds into the video. Thus one color transition takes 57 - 26 = 31 seconds.

Now, let us do a similar test with our FPGA based 6502 design. Here is the code:

    lda #$20
    sta $fb0b
    lda #0
    sta $fb0b

    lda #0
    sta $0
    sta $1
    sta $2
    sta $3

lp1
    inc $0
    bne lp1
lp2
    inc $1
    bne lp1
lp3
    inc $2
    lda $2
    cmp #60
    bne lp1
    lda #$20
    eor $3
    sta $fb0b
    sta $3
    lda #0
    sta $2
    beq lp1

Almost the same as our C64 variant, except that we are toggling a register for blinking an LED. This file needs to be assembled and stored as the file boot.bin on the SD Card.

Now, this code takes 4 seconds to go from the LED turning on to turning off again. Let us do some math: 31/4 = 7.75, meaning our design is roughly 8 times faster than a C64. Keeping in mind that a C64 operates at about 1MHz, this comes close to 8MHz, which is more or less the speed of our FPGA based design.
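We can cross-check the stopwatch numbers by counting cycles. The Python sketch below estimates how many CPU cycles one pass through the nested delay loop takes, using the standard 6502 instruction timings. It assumes that no branch crosses a page boundary and ignores the handful of instructions that toggle the border or LED, so treat the result as an estimate only.

```python
# Standard 6502 cycle counts for the instructions used in the loop.
INC_ZP = 5
LDA_ZP = 3
CMP_IMM = 2
BNE_TAKEN = 3       # assumes the branch does not cross a page boundary
BNE_NOT_TAKEN = 2


def cycles_per_toggle(outer_count=60):
    """Estimate the CPU cycles between two border/LED toggles."""
    # lp1: inc / bne runs 256 times, the last branch falls through.
    inner = 255 * (INC_ZP + BNE_TAKEN) + (INC_ZP + BNE_NOT_TAKEN)
    # lp2: each pass runs the inner loop, then inc / bne back to lp1.
    middle = (255 * (inner + INC_ZP + BNE_TAKEN)
              + (inner + INC_ZP + BNE_NOT_TAKEN))
    # lp3: inc / lda / cmp / bne back to lp1, repeated outer_count times.
    per_outer = middle + INC_ZP + LDA_ZP + CMP_IMM + BNE_TAKEN
    return outer_count * per_outer


cycles = cycles_per_toggle()
print(f"cycles per toggle: {cycles}")
print(f"at 1 MHz:    {cycles / 1.0e6:.1f} s")
print(f"at 8.33 MHz: {cycles / 8.33e6:.1f} s")
print(f"measured ratio: {31 / 4:.2f}")
```

The estimate comes to roughly 31.6 million cycles per toggle, which is about 31.6 seconds at 1MHz and about 3.8 seconds at 8.33MHz. That lines up well with the 31 seconds and 4 seconds read off the video timestamps.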

In Summary

In this post we added SD Card access to our 6502 based design, bringing us to the point where our design can access an SD Card and DDR RAM.

We also ran a benchmark that confirms that our current design runs at around 8MHz.

In the next post I will start to develop a dual channel memory controller for accessing DDR RAM. The purpose will be so that our 6502 core and Amiga core can independently access DDR RAM at around 8MHz. This will enable the 6502 part of the system to read through a disk image and simulate disk access to the Amiga core.

Until next time!

Sunday 16 July 2023

Running 6502 with DDR RAM

Foreword

In the previous posts we had been creating a 6502 based design for reading executable code from a FAT32 formatted SD Card and executing it.

Let me summarise my goal for this project again. My goal is to run an Amiga core on an Arty A7 board. For this project I will be using a 6502 core to do all the heavy lifting of loading Amiga ROMs and disk images from an SD Card into RAM, so that the Amiga core can execute them.

At this point in time our design uses only block RAM. On every FPGA, block RAM is a limited resource, especially if we want to implement something like an Amiga core.

So, in this post we will try to run the 6502 core using the DDR RAM available on the Arty A7.

Stumbling Blocks

Let me start by discussing the stumbling blocks I came across over the past couple of months in trying to get the 6502 core to use DDR RAM.

Usually when I encounter stumbling blocks, I go into quite some detail about them in my blog posts. However, my stumbling blocks with DDR RAM were gigantic over the past couple of months, so I will try to keep it brief in this section.

So, my initial attempt to write code to interface the 6502 core with the DDR was pretty straightforward, and everything ran as expected during the simulation. However, when I tried running it on the actual Arty A7, things looked totally different from the simulation.

Every other byte I read back from DDR on the Arty A7 was garbage. When this kind of thing happens while playing around with DDR, my heart sinks into my shoes, simply because there are not really any tools for troubleshooting these kinds of issues. A lot of the operations of DDR happen at frequencies well above what can be captured by the Integrated Logic Analysers. In these cases one can only really solve the issue by some kind of intuition.

After a lot of back and forth, I decided to revisit my assumptions from a previous post:

In the post this diagram comes from, I was working on a memory tester. Basically, signal A resembles the clock signal of the 6502 core.

At point A an address will be asserted by the 6502, and at point B the first DDR instruction will be loaded into a shift register, which shifts the instruction out to DDR to open the row associated with the address provided by the 6502 core.

The time period between two dotted lines is 1.5ns, so the time period between A and B is 3ns. This translates to 333MHz, and within the FPGA used on an Arty A7 this seemed like a very tight fit to me, although it was sufficient to run a memory tester on the board.

I gave this some thought. There are a lot more logic cells involved with a 6502 core than with a simple memory tester. So, 3ns might not be enough for all the individual address lines to reach their full voltages.

My intuition told me, or should I rather say I made a hypothesis😆, that the problem may be solved by increasing the time period between A and B. We will explore this as a possible solution in the next section.

Clocking changes

With the hypothesis I made in the previous section, I came up with the following clocking scheme:

The 6502_clk is basically the clock that should drive the 6502 core. It is an exact copy of mclk, but I am throwing away 9 clock pulses in between, thus keeping only every tenth one. With mclk at 83.3MHz, this gives us an effective 6502 clock of 8.3MHz, which is still above the target clock of 7MHz required for our Amiga core in future.

At the point I have indicated with an arrow, we are loading our shift register with the address asserted by our 6502 core, which is one mclk cycle after the assertion. This works out to 12ns, compared to the 3ns of our earlier design. I think this will give ample time for our address lines to settle before they are read at the next mclk cycle.
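The numbers quoted above follow from some simple arithmetic, which the Python snippet below works through. The constant names are made up for this sketch.

```python
MCLK_HZ = 83.3e6   # mclk frequency quoted above
DIVIDER = 10       # we keep only every tenth mclk pulse

mclk_period_ns = 1e9 / MCLK_HZ
cpu_clock_mhz = MCLK_HZ / DIVIDER / 1e6
old_window_ns = 2 * 1.5                    # two 1.5ns divisions between A and B
old_window_mhz = 1e3 / old_window_ns       # frequency equivalent of 3ns

print(f"mclk period:        {mclk_period_ns:.1f} ns")
print(f"6502 clock:         {cpu_clock_mhz:.2f} MHz")
print(f"old settle window:  {old_window_ns:.1f} ns ({old_window_mhz:.0f} MHz)")
print(f"new settle window:  {mclk_period_ns:.1f} ns")
```

The settle window grows from 3ns to a full mclk period of about 12ns, which is four times as long.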

The question remains whether this bigger time gap will introduce extra latency, causing us to miss our target frequency of 7MHz. We will revisit this question later on when we have finished with the design.

Let us start by writing some Verilog code for a counter that keeps track of when to enable the 6502 clock:

    reg [3:0] edge_count = 9;

    always @(negedge mclk)
    begin 
        if (edge_count == 0)
            edge_count <= 9;
        else
            edge_count <= edge_count - 1;   
    end

    always @(negedge mclk)
    begin
        clk_8_enable <= edge_count == 0;
    end
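To convince ourselves that this counter really enables one mclk pulse in ten, here is a small Python model of the two always blocks. It is only a behavioural sketch: the function name simulate is made up, and each loop iteration stands for one falling edge of mclk.

```python
def simulate(num_edges):
    """Return the clk_8_enable value after each falling edge of mclk."""
    edge_count = 9          # same initial value as the Verilog register
    enables = []
    for _ in range(num_edges):
        # Both always blocks use non-blocking assignments, so they
        # both see the value edge_count had before this edge.
        next_enable = 1 if edge_count == 0 else 0
        edge_count = 9 if edge_count == 0 else edge_count - 1
        enables.append(next_enable)
    return enables


enables = simulate(100)
positions = [i for i, e in enumerate(enables) if e]
print("enabled edges:", positions)
print("pulses kept:", sum(enables), "out of", len(enables))
```

The enable is high on every tenth edge, so the gated clock runs at one tenth of the mclk frequency.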

We get the resulting 6502 clock with the following:

    BUFGCE BUFGCE_8_mhz (
       .O(clk_8_mhz),   // 1-bit output: Clock output
       .CE(clk_8_enable), // 1-bit input: Clock enable input for I0
       .I(mclk)    // 1-bit input: Primary clock
    );

So, we will use the signal clk_8_mhz to clock our 6502. It is important to add a constraint in Vivado to indicate that this signal should be treated as a clock when synthesizing the design. This constraint will look like the following:

create_generated_clock -name clkdiv1 -source [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O] \
     -edges {1 2 21} [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O]

The edges parameter indicates which edges of the mclk clock form part of the 6502 clock.
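To see what {1 2 21} means in terms of timing, the Python sketch below converts the three edge numbers into times. It assumes an mclk period of 12ns and relies on the fact that the master clock edges are numbered from 1, alternating between rising and falling. The function name generated_clock is made up for illustration.

```python
def generated_clock(edges, master_period_ns):
    """Derive rise time, fall time and period (in ns) from an -edges triple.

    Consecutive master clock edges are half a master period apart.
    """
    half = master_period_ns / 2
    rise, fall, next_rise = [(e - 1) * half for e in edges]
    return rise, fall, next_rise - rise


rise, fall, period = generated_clock((1, 2, 21), 12.0)
print(f"rise at {rise} ns, fall at {fall} ns, period {period} ns")
print(f"frequency: {1e3 / period:.2f} MHz")
```

So the generated clock is high for half an mclk period and repeats every ten mclk periods, which matches the one-in-ten gating done by the BUFGCE.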

Changing the command sequence

With the clocking changes performed in the previous section, we also need to make a change to the sequence of the commands issued to the DDR RAM. For this discussion you might want to refer back to the following posts:

https://c64onfpga.blogspot.com/2022/05/starting-with-memory-tester-on-arty.html

https://c64onfpga.blogspot.com/2022/07/shrinking-latency.html

In our initial attempts to shrink latency, we wrote the following code for reducing initial latency:

    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out) 
           ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd} : test_cmd;

It was this assignment to the wire, which I mentioned earlier, that resulted in trying to sample the address 3ns after it was asserted.

The above snippet needs to be removed, and this command should rather be asserted in the state machine as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
              end

You will also see that we only assert this command and go to the next state when edge_count is 8. This ensures that our state machine stays in sync with our 6502 clock.

Now, if you refer back to the previous posts I mentioned, you will see that the state following PREPARE_CMD used to be WAIT_CMD. Well, with our new way of clocking we don't need to transition to a wait state, because the waiting is done within PREPARE_CMD, where we wait for edge_count to reach the value 8.

So, the state after PREPARE_CMD should now be COL_CMD, because we need to issue the column read/write command in that state. The selector for that state looks as follows:

              COL_CMD: begin
                  begin
                      state <= STATE_PREA;
                      test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                          (write_out ? 2'b11 : 2'b00), 10'h1fd};
                      cmd_slot <= 1;  
                      data_in <= {8{cmd_data_out}};
                          do_write <= write_out;
                  end
              end

The rest of the state machine is the same.

Lowering the 6502 into the design

At this point in time we have two separate designs. The first is a prototype design for testing the DDR memory on the Arty A7, the changes to which we have been discussing in this post. The second is the 6502 based design we were developing in the last couple of posts for accessing data from an SD Card.

Now, we have come to the point where we need to merge the two designs, giving our 6502 based SD Card reader the power of DDR memory.

So, the top module of our 6502 based design will now move into mem_tester.v as an instance, with the code looking like this:

    retrosystem retrosystem(    
        .cs(),
        .mosi(),
        .miso(),
        .reset(wait_for_read > 0),
        .gen_clk(clk),
        .write_ddr(write),
        .ddr_data_out(data_out_byte),
        .ddr_data_in(data_in),
        .ddr_addr(address_byte),
        .led(led),
        .sclk(),
        .cd(),
        .wp()
    );

First of all, I had to come up with a name for a module that used to be top.v but is not a top module anymore. So, I just picked the name retrosystem, which contains an SD Card module and a 6502 system.

Firstly, we have signals like cs, mosi, miso and so on, which form part of the SD interface. We will need to extend these signals all the way to the top module so that the SD Card module can be reached.

We have also added some extra signals to interface the 6502 with the DDR RAM on the Arty A7:

write_ddr
ddr_data_out
ddr_data_in
ddr_addr

With all this in place, let us see how to interface the 6502 with the external DDR RAM.

First, let us make a change to the following code block:

always @*
begin
    casex (addr_delayed)
        //16'hfexx: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx00: combined_data = o_data_sdspi[7:0];
        16'b1111_1011_xxxx_xx01: combined_data = wb_data_store[7:0];
        16'b1111_1011_xxxx_xx10: combined_data = wb_data_store[15:8];
        16'b1111_1011_xxxx_xx11: combined_data = wb_data_store[23:16];
        16'b0000_0xxx_xxxx_xxxx: combined_data = addr_delayed[0] 
            ? ddr_data_in[15:8] : ddr_data_in[7:0];

        default: combined_data = rom_out;
    endcase 
end

combined_data is the signal that combines the data from the various sources and sends it to the 6502 core via the DI input.

The bolded selector used to get its data from a small segment of block RAM, but we have now changed it to get its data externally. We get data from DDR RAM in 16-bit pieces, and we therefore need to decide which byte we are going to send to the 6502. Bit 0 of the address determines this.

Just as bit 0 of the address determines which byte to read from a 16-bit word, it also determines which byte of a 16-bit word to write to memory. This process is a bit more complicated, so I will not cover it here. Suffice it to say that we will need to make use of the DM signal on DDR RAM to ensure the correct byte gets written.
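The byte selection can be sketched in Python as follows. The read side mirrors the casex selector above. The write side is only an assumption about how the DM masking could be arranged, not the actual implementation: the function names, the choice to place the byte on both lanes, and the 2-bit dm encoding (a 1 masks that lane) are all made up for illustration.

```python
def read_byte(word16, addr):
    """Pick the byte the 6502 should see from a 16-bit DDR word."""
    return (word16 >> 8) & 0xFF if addr & 1 else word16 & 0xFF


def write_lanes(byte, addr):
    """Return (word16, dm) for a single-byte write (hypothetical scheme).

    dm has one bit per byte lane: bit 0 for the low byte, bit 1 for the
    high byte. A 1 tells the DDR chip to leave that lane untouched.
    """
    word16 = (byte << 8) | byte        # put the byte on both lanes
    dm = 0b01 if addr & 1 else 0b10    # mask the lane we do not want
    return word16, dm


print(hex(read_byte(0x20A9, 4)), hex(read_byte(0x20A9, 5)))
print(write_lanes(0x55, 4), write_lanes(0x55, 5))
```

For an even address only the low byte lane is written, and for an odd address only the high byte lane, so the neighbouring byte in the same 16-bit word stays intact.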

We also need to assign some of the ports:

assign ram_6502_addr = cpu_address;
assign write_ddr = (we_6502 & cpu_address[15:9] == 0);
assign ddr_data_out = cpu_data_out;
assign ddr_addr = cpu_address_result;
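To make the address decoding a bit more tangible, here is a Python model of the casex selectors and of the write_ddr assignment. It is a sketch for experimenting with addresses only. The function names are made up, and the returned strings are just the names of the Verilog signals that would be selected.

```python
def read_source(addr):
    """Mirror the casex selectors for a 16-bit CPU address."""
    if (addr & 0xFF00) == 0xFB00:
        # 1111_1011_xxxx_xxNN: the low two bits pick the register
        return ("o_data_sdspi[7:0]", "wb_data_store[7:0]",
                "wb_data_store[15:8]", "wb_data_store[23:16]")[addr & 3]
    if (addr & 0xF800) == 0x0000:
        # 0000_0xxx_xxxx_xxxx: served from DDR RAM
        return "ddr_data_in"
    return "rom_out"


def write_ddr(we_6502, addr):
    """Mirror: we_6502 & (cpu_address[15:9] == 0)."""
    return bool(we_6502) and (addr >> 9) == 0


for addr in (0x0004, 0x01FF, 0x0200, 0xFB0B, 0xFC00):
    print(hex(addr), read_source(addr), write_ddr(1, addr))
```

Note that, as written, the read selector covers addresses 0x0000-0x07FF, while write_ddr is only asserted for addresses 0x0000-0x01FF.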

I mentioned that the retrosystem block needs to be instantiated within mem_tester. Speaking of mem_tester, it also contains a state machine which is no longer necessary.

Checking Timing

With all the code developed in the previous section, we still need to check if the time of a complete read/write cycle fits within our expectations of more or less 7MHz.

The simulation waveform gives an idea of the timings:

Firstly, the signal clk_8_mhz is the signal clocking the CPU at 8.3MHz. All memory commands associated with a read/write (i.e. activate, column read/write, precharge) should be completed within one such cycle.

CPU address is the address that is output by the 6502 CPU core. You will also see that the address changes on a clk_8_mhz cycle.

On this simulation graph I have also shown the DDR signals, which are prefixed by SD. I have numbered the different DDR commands. Point 1 is where a row activate happens. Point 2 is where a column read/write happens. Finally, point 3 is where the precharge happens, as the last command of a read/write cycle.

In this diagram I have also shown the precharge command of the previous cycle.

All in all it seems that a read/write can complete within the time period of one 8.3MHz clock cycle.

Now, when we do a read the actual data will be presented on the data_out signal, which I have also indicated on the diagram. In this case the data is the three blurbs after the long trains of X's. On the diagram it is not clear what the values are of these three blurbs, so let us zoom in a bit:

In the first blurb you will also see a number of X's, and in between them the values 20 hex and a9 hex. In this particular simulation test, the values 20 and a9 were the actual data I had written to address 4, so we know that the first blurb always contains the data we are looking for during a read.

However, this blurb only lasts one mclk clock cycle, and we need to hold the data until the next 8.3MHz clock cycle so that our CPU can pick it up. We do this by adjusting our PREPARE_CMD selector from earlier as follows:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 2)
                  begin
                      cap_value <= data_out;
                  end
                  do_capture <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
                  
                  cmd_status <= 1;
              end

As shown by the bolded section, we capture data_out when edge_count is 2.

The Test Program

Let us end this post by looking at the Test program we used for testing 6502 and DDR RAM interaction.

Here is the listing:

.ORG $FC00
ldx #offset
copy
    lda zcode,x
    sta $4,x
    dex
    bpl copy

    ldx #0    
read
    lda $4,x
    inx
    cpx #$0a
    bne read

    jmp $4
zcode
    lda #$20
    sta $fb0b
    lda #0
    sta $fb0b

    lda #0
    sta $0
    sta $1
    sta $2
    sta $3

lp1
    inc $0
    bne lp1
lp2
    inc $1
    bne lp1
lp3
    inc $2
    lda $2
    cmp #60
    bne lp1
    lda #$20
    eor $3
    sta $fb0b
    sta $3
    lda #0
    sta $2
    beq lp1
endz
    nop
offset=*-zcode

ENDROM = $FFFF-*-3
.FILL ENDROM 00
.BYTE 0, $FC, 00, 00

This code, starting at FC00, which is the start of our "ROM", basically does three things. It starts by copying the code at label zcode into RAM starting at address $4.

The next thing this program does is read the code back, starting from location 4. This was useful for me to get confirmation that reading from DDR RAM works, by inspecting with an ILA whether the data returned to the 6502 is what we expect.

Finally, the code jumps to location 4, effectively starting to execute the copy of the code at label zcode. This is basically a nested waiting loop turning an LED on and off every second or so. You will remember from previous posts that a bit of register $FB0B controls an LED.

Real Life Results

I thought of ending this post by showing ILA captures of the design running on the real FPGA. Firstly, a list of data that we expect for the test:

Address 4: A9
Address 5: 20
Address 6: 8D
Address 7: 0B
Address 8: FB
Address 9: A9
Address a: 00
Address b: 8D
Address c: 0B
Address d: FB
Address e: A9
Address f: 00

And next the ILA capture:

The top row is the asserted CPU address and the bottom two are selected bytes from cap_value. Let us start by reminding ourselves about the structure of the cap_value register.

Firstly, cap_value is 128 bits in width. In total it stores 8 bursts of data from DDR memory, of which we always just look at the first burst.

Furthermore, because the DDR RAM on the Arty A7 has 16 data lines, we have structured cap_value so that bits 0-63 contain the low bytes of the eight bursts, and bits 64-127 contain the high bytes of the eight bursts.

Coming back to the above diagram, the second line of the capture shows bits 64-71 of cap_value, which is the high byte for the relevant address. Similarly, the last line of the capture shows bits 0-7 of cap_value, which is the low byte for the relevant address.

Now, as we have discussed earlier on, we get the data for an asserted address just before the transition to the next address. So, for example, for address 4 we get low byte a9 and high byte 20. The same is true for address 5, because byte addresses 4 and 5 share the same 16-bit word.
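The layout of cap_value described above can be captured in a short Python sketch. The function names are made up for illustration, and the test value simply places a9 in bits 0-7 and 20 in bits 64-71, as in the capture.

```python
def first_burst_bytes(cap_value):
    """Return (low, high) bytes of the first burst in the 128-bit value."""
    low = cap_value & 0xFF             # bits 0-7
    high = (cap_value >> 64) & 0xFF    # bits 64-71
    return low, high


def byte_for_address(cap_value, addr):
    """Pick the byte belonging to a CPU byte address."""
    low, high = first_burst_bytes(cap_value)
    return high if addr & 1 else low


cap_value = (0x20 << 64) | 0xA9
for addr in (4, 5):
    print(addr, format(byte_for_address(cap_value, addr), "02x"))
```

Address 4 picks up the low byte and address 5 the high byte of the same captured word, which is why the ILA shows the same cap_value for both addresses.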

Comparing the diagram to the values we expect, we can confirm that our design works correctly. The LED on the Arty A7 also flashes as we expect.

In Summary

In this post we integrated our 6502 based design with the DDR RAM on the Arty A7 and verified that read/writes work correctly.

In the next post we will also wire up the SD Card ports to our top module and confirm that our 6502/SD Card/DDR design works together as expected.

Until Next time!

Thursday 27 April 2023

SD Card Access for a Arty A7: Part 9

Foreword

In the previous post we developed a DMA module for transferring a read sector from the FIFO in the SD Card module to the 6502 memory space. We also wrote some 6502 Assembly code for testing this functionality.

In this post we will write some more 6502 Assembly code for reading a file from a FAT32 partition.

32-bit operations

When trying to determine the location of a file on an SD Card, one often needs to work with 32-bit quantities. However, as you know, the 6502 only works with 8 bits at a time. So, to make life simpler, let us start by writing some Assembly routines for a couple of 32-bit operations.

At the core of these routines is an imaginary 32-bit accumulator, which we will store at address C0 hex in memory, in little-endian format.

The first operation we need to define is a Load Accumulator, which we will give the symbol ld32. The address of the data we want to load into the accumulator must be passed in the X and Y registers, with the low byte in X and the high byte in Y:

ld32
     stx $b0
     sty $b1
     ldy #$0
     lda ($b0),y
     sta $c0
     iny
     lda ($b0),y
     sta $c1
     iny
     lda ($b0),y
     sta $c2
     iny
     lda ($b0),y
     sta $c3
     iny
     rts

First off, we need to store the address in memory locations b0 and b1, so we can load the data from the memory location in an indexed addressing fashion. Here I do a bit of loop unrolling, saving a few CPU cycles. When this routine returns, our accumulator will contain the necessary data in memory locations c0, c1, c2 and c3.

The next routine we will need stores the contents of our virtual accumulator to some other memory location:

st32
     stx $b0
     sty $b1
     ldy #$0
     lda $c0
     sta ($b0),y
     iny
     lda $c1
     sta ($b0),y
     iny
     lda $c2
     sta ($b0),y
     iny
     lda $c3
     sta ($b0),y
     iny
     rts

Again, the destination address needs to be passed in registers X and Y, which we store in memory locations b0 and b1 at the beginning of the routine.

All the routines so far are little endian. Our SD Card module, however, works with 32-bit LBA numbers that are big endian. So we need another variant of the Store Accumulator that stores the number in big-endian order:

st32rev
     stx $b0
     sty $b1
     ldy #$3
     lda $c0
     sta ($b0),y
     dey
     lda $c1
     sta ($b0),y
     dey
     lda $c2
     sta ($b0),y
     dey
     lda $c3
     sta ($b0),y
     iny
     rts
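
To make the difference between the two byte orders concrete, here is a small Python model of the three routines. It is a sketch of the behaviour only, with a bytearray standing in for 6502 memory; the source address $300 and the value used in the example are made up.

```python
memory = bytearray(65536)   # stand-in for the 6502 address space
ACC = 0xC0                  # virtual 32-bit accumulator, little endian


def ld32(addr: int) -> None:
    """Copy four bytes from addr into the accumulator."""
    memory[ACC:ACC + 4] = memory[addr:addr + 4]


def st32(addr: int) -> None:
    """Store the accumulator at addr, keeping little-endian order."""
    memory[addr:addr + 4] = memory[ACC:ACC + 4]


def st32rev(addr: int) -> None:
    """Store the accumulator at addr with the bytes reversed (big endian)."""
    memory[addr:addr + 4] = memory[ACC:ACC + 4][::-1]


# Hypothetical example: a little-endian value at $300 ends up big endian at 48.
memory[0x300:0x304] = (0x00002015).to_bytes(4, "little")
ld32(0x300)
st32rev(48)
```

After this, addresses 48 to 51 hold 00 00 20 15, which is the order the SD Card command expects.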

When determining the location of a file, one valuable 32-bit operation is addition. In FAT32 we are presented with both 16-bit and 32-bit numbers to add, so we need a routine for each:

add32
     stx $b0
     sty $b1
     ldy #0
     clc
     lda $c0
     adc ($b0),y
     sta $c0
     iny
     lda $c1
     adc ($b0),y
     sta $c1
     iny
     lda $c2
     adc ($b0),y
     sta $c2
     iny
     lda $c3
     adc ($b0),y
     sta $c3
     rts

add16
     stx $b0
     sty $b1
     ldy #0
     clc
     lda $c0
     adc ($b0),y
     sta $c0
     iny
     lda $c1
     adc ($b0),y
     sta $c1
     iny
     lda $c2
     adc #0
     sta $c2
     iny
     lda $c3
     adc #0
     sta $c3
     rts
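
The carry chain in these two routines can be mirrored in Python to check the arithmetic by hand. This is just a model of what the ADC sequence does, with a helper name of my own choosing; a 16-bit operand is padded with zeros, the same way add16 adds #0 to the upper two bytes.

```python
def add_bytes(acc, operand):
    """Add little-endian byte lists the way the ADC chain does.

    acc is four bytes; operand is two bytes (add16) or four (add32).
    Returns the new four accumulator bytes; a carry out of the top
    byte is dropped, as in the assembly.
    """
    carry = 0
    result = []
    for i in range(4):
        value = operand[i] if i < len(operand) else 0
        total = acc[i] + value + carry
        result.append(total & 0xFF)   # byte written back with STA
        carry = total >> 8            # carry flag for the next ADC
    return result


# 0x0000FFFF + 0x0001 carries all the way into the third byte.
example_sum = add_bytes([0xFF, 0xFF, 0x00, 0x00], [0x01, 0x00])
```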

Finding the root cluster

To find a file we need to loop through file entries in the root cluster. To determine the location of the root cluster, we need to load the bootsector of the FAT32 partition, which contains the necessary parameters for calculating this. The following code takes care of this:

mbr  equ $200
par1 equ $1be
lbastart equ 8
lbamemaddr equ mbr+par1+lbastart 

       ldx #<lbamemaddr
       ldy #>lbamemaddr
       jsr ld32
       ldx #48
       ldy #0
       jsr st32rev
       LDA #6
       JSR CMD
       LDA #$12
       STA $FB0B
       LDA #$16
       STA $FB0B

Let us start by breaking down the EQU's a bit. $200 is the address in 6502 memory space where we previously downloaded the MBR from the SD Card.

The value $1be is the offset within the MBR of the first partition entry. Offset 8 within every partition entry holds the LBA number of the first sector of the partition.

So, basically we need to store this LBA block number to address 48, which contains the LBA address that we will instruct the SD Card module to read from the SD Card. The bootsector will end up at address $400.
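
For comparison, the same lookup can be written in a few lines of Python. This is a sketch for checking values on a PC, not something that runs on the board; the function name and the example LBA of 2048 are assumptions.

```python
import struct

PARTITION_TABLE = 0x1BE   # offset of the first partition entry in the MBR
LBA_START = 8             # offset of the start LBA within a partition entry


def first_partition_lba(mbr: bytes) -> int:
    """Return the little-endian start LBA of the first partition."""
    return struct.unpack_from("<I", mbr, PARTITION_TABLE + LBA_START)[0]


# Hypothetical MBR whose first partition starts at sector 2048.
example_mbr = bytearray(512)
struct.pack_into("<I", example_mbr, PARTITION_TABLE + LBA_START, 2048)
example_lba = first_partition_lba(example_mbr)

# The SD Card command wants the same number in big-endian byte order.
example_command_bytes = example_lba.to_bytes(4, "big")
```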

Now we are ready to calculate the LBA block number of the root cluster. From the previous posts, we basically calculate this with the following formula: Bootsector location + reserved sectors + Number of FATs * Sectors per FAT.

From the previous snippet of code, we still have the Bootsector location stored in the virtual accumulator at address $C0, so we can just continue to add the number of reserved sectors and so on to get to the location of the root cluster.

Firstly, adding the reserved sectors:

...
bootsec    equ $400
reservedsec equ bootsec+$e
...
       ldx #<reservedsec
       ldy #>reservedsec
       jsr add16
...

As can be seen, the location of the Reserved Sectors is at $e in the Bootsector and is two bytes, so we need to use add16.

Next we need to add Sectors per fat a number of times as specified by Number of FATs:

...
numfat     equ bootsec+$10
secperfat  equ bootsec+$24
...
       ldx numfat
addfat
       txa
       pha
       ldx #<secperfat
       ldy #>secperfat
       jsr add32
       pla
       tax
       dex
       bne addfat 
...

With this we have the calculated LBA for the root cluster. This number we need to store again at address 48, which will instruct the SD Card core to load the root cluster sectors. We also need to make a backup of this number as well for future calculations:

       ldx #48
       ldy #0
       jsr st32rev
       ldx #$c4
       ldy #0
       jsr st32
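
As a cross-check for the assembly above, the whole calculation fits in one Python function. It is a sketch that reads the same three boot sector fields at offsets $e, $10 and $24; the function name and the example numbers are made up.

```python
import struct


def root_cluster_lba(bootsector_lba: int, bootsector: bytes) -> int:
    """Bootsector location + reserved sectors + Number of FATs * Sectors per FAT."""
    reserved = struct.unpack_from("<H", bootsector, 0x0E)[0]
    num_fats = bootsector[0x10]
    sectors_per_fat = struct.unpack_from("<I", bootsector, 0x24)[0]
    return bootsector_lba + reserved + num_fats * sectors_per_fat


# Hypothetical boot sector: 32 reserved sectors, 2 FATs of 1000 sectors each.
example_bootsector = bytearray(512)
struct.pack_into("<H", example_bootsector, 0x0E, 32)
example_bootsector[0x10] = 2
struct.pack_into("<I", example_bootsector, 0x24, 1000)
example_root_lba = root_cluster_lba(2048, example_bootsector)
```

With these made-up numbers the root cluster lands at 2048 + 32 + 2 * 1000.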

Searching for the file

With the root cluster location determined, we now need to loop through all the file entries to find the file we are looking for. If we just have a look at the purpose of all this, we want bootable ROM code of minimum size in block rom, and then load the rest of the boot code from the SD card.

So, we will always load a file with the hardcoded filename 'boot.bin'. This filename will form part of the boot ROM at the top of memory, defined as:

FILENAME
     .TEXT "BOOT    BIN"

It may look a bit strange with the extra white space between filename and extension, but this is how filenames are stored in file entries in FAT32 partitions. When looping through the file entries we need to compare each filename with the above.
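
If you want to generate this padded form for other short filenames, a small Python helper will do it. This is a sketch that only handles plain short names, and the function name is my own.

```python
def to_83_name(filename: str) -> bytes:
    """Pad a short filename to the 11-byte form used in a file entry."""
    name, _, ext = filename.upper().partition(".")
    # Eight characters of name and three of extension, space padded.
    return (name.ljust(8)[:8] + ext.ljust(3)[:3]).encode("ascii")


example_name = to_83_name("boot.bin")
```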

Since we need to do so many compare operations, it makes sense to move the text boot.bin into zero page:

       ldx #10

initfilename
       lda FILENAME,x
       sta $d0,x
       dex
       bpl initfilename

I have got into the habit of iterating in reverse order whenever I need to step through a number of memory locations. It eliminates the need for a compare operation in every loop iteration.

Quite a number of things need to happen when iterating through file entries. You need to read the root cluster sector by sector. Then, for each sector, you need to process all entries. What complicates things is that a sector is 512 bytes in size, whereas the 6502 works with pages of 256 bytes. So, one also needs to keep track of how many times a 256-byte page boundary is crossed to figure out when to load the next sector.

All this calls for a nested loop that is a number of levels deep. Here is some pseudo code for the nested loop:

for sectors = 1 to ...
   read sector
   for page = 0 to 1
      for fileentry = 0 to 7
        get file entry
        do something with file entry
      end
   end
end

Each file entry is 32 bytes, so in a page of 256 bytes there are 8 entries. For that reason we are looping from 0 to 7 in the innermost loop.

Let us do some initialisation:

nextsec
       LDA #6
       JSR CMD
       LDA #$12
       STA $FB0B
       LDA #$16
       STA $FB0B

       lda #0
       sta $b2
       lda #4
       sta $b3

We start with some code to load a root sector into memory, where the sector number is stored in addresses 48 - 51, as explained previously. The addresses b2/b3 contains the address at which the root sector is stored, which is $0400. We will be incrementing b2/b3 as we loop through the file entries.

Next, let us write some code for looping through the file entries:

nextentry
       clc
       lda $b2
       adc #32
       sta $b2
       bcc inspectfileentry
       inc $b3
       lda #1
       and $b3
       bne inspectfileentry
       inc 51
       jmp nextsec

In this snippet inspectfileentry is where we do something with the current file entry. Basically to get to the next entry we keep adding 32 to the address in b2/b3.

However, we need to be mindful of when we cross a page boundary, that is, when the carry flag gets set. In such a case we increment b3 and then inspect bit 0 of b3. When bit 0 is a 1, it means we are at byte 256 of 512 bytes, and we are still good to go.

However when we increment b3 and bit 0 is 0, it means we just passed the 512'th byte of the sector we are reading. In this case it is time to read the next sector from SD card. We do this by incrementing address 51, which is part of the 48-51 LBA number.

Finally, let us implement inspectfileentry:

inspectfileentry
       ldy #11
       lda ($b2),y
       cmp #15
       beq nextentry
loopfilesearch
       dey
       bmi done
       lda ($b2),y
       cmp $d0,y
       beq loopfilesearch

Again, we are working backwards. We start by inspecting the byte following the filename/extension, which contains the attributes. With this byte we check if this file entry forms part of a long file name entry. If it does, we skip to the next entry.

We then check the filename entry byte by byte to see if it matches 'boot.bin'. If it matches, we jump to done and load the file into memory.
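
The search logic can be summarised in Python as follows. It is a sketch of the idea rather than a line-by-line translation of the assembly: it takes the sectors as a list of byte strings and checks every 32-byte entry, and all the names in it are my own.

```python
ENTRY_SIZE = 32
ATTR_OFFSET = 11          # attribute byte follows the 11-byte name
ATTR_LONG_NAME = 0x0F     # marks an entry that belongs to a long file name


def find_entry(sectors, wanted: bytes):
    """Return (sector index, byte offset) of the matching entry, or None."""
    for sector_index, sector in enumerate(sectors):
        for offset in range(0, len(sector), ENTRY_SIZE):
            entry = sector[offset:offset + ENTRY_SIZE]
            if entry[ATTR_OFFSET] == ATTR_LONG_NAME:
                continue                  # skip long file name entries
            if entry[:11] == wanted:
                return sector_index, offset
    return None


# Hypothetical root sector: a long name entry first, the real entry third.
example_sector = bytearray(512)
example_sector[0:11] = b"BOOT    BIN"
example_sector[11] = ATTR_LONG_NAME
example_sector[64:75] = b"BOOT    BIN"
example_hit = find_entry([bytes(example_sector)], b"BOOT    BIN")
```

Here example_hit points at offset 64 of sector 0, because the first entry carries the long file name attribute and is skipped.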

Loading the file

With the file entry for the file we want, we now have in our hands the cluster number where the file starts. This cluster number is located at bytes 26, 27, 20 and 21 of the file entry. With a cluster number we always need to subtract 2 to get the physical cluster position. So, let us load the virtual accumulator with the cluster number and do the subtraction:

DONE
       ldy #26
       sec
       lda ($b2),y
       sbc #2
       sta $c0
       ldy #27
       lda ($b2),y
       sbc #0
       sta $c1
       ldy #20
       lda ($b2),y
       sbc #0
       sta $c2
       ldy #21
       lda ($b2),y
       sbc #0
       sta $c3

In all the code written in this post we are only doing one subtraction, so I didn't deem it necessary to create a routine for this process.
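
The odd byte order is easier to see in Python. The sketch below assembles the cluster number from the same four bytes and subtracts 2; the function names and the example entry are made up.

```python
def first_cluster(entry: bytes) -> int:
    """Cluster number of a file entry: bytes 26/27 low word, 20/21 high word."""
    low = entry[26] | (entry[27] << 8)
    high = entry[20] | (entry[21] << 8)
    return (high << 16) | low


def relative_cluster(entry: bytes) -> int:
    """Cluster position counted from the root cluster (cluster number - 2)."""
    return first_cluster(entry) - 2


# Hypothetical entry whose file starts at cluster 0x00010005.
example_entry = bytearray(32)
example_entry[26] = 0x05
example_entry[20] = 0x01
example_cluster = first_cluster(example_entry)
```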

At this point we should remember that we have a cluster number and not a sector number. As a cluster contains multiple sectors, we need to multiply this number by the number of sectors per cluster, which is byte 13 of the boot sector. In my experience this parameter is usually a power of 2, so, under this assumption, we can achieve the multiplication by shifting the cluster number left by a number of bit positions.

Obviously, we need to determine upfront how many left shifts are required for this operation. We need to do this while the bootsector is still in memory:

addfat
       txa
       pha
       ldx #<secperfat
       ldy #>secperfat
       jsr add32
       pla
       tax
       dex
       bne addfat 

       ldx #48
       ldy #0
       jsr st32rev
       ldx #$c4
       ldy #0
       jsr st32

       lda sectorspercluster
       ldx #0
       clc
shift
       ror a
       bcs endshift
       inx
       bcc shift
endshift
       stx $c8

You will recognise this code from an earlier section, to which I have just appended some extra code. We keep shifting the parameter right until the carry flag is set, keeping count of how many shifts are required. This number of shifts we store in location $c8.
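
The counting loop is equivalent to the following Python sketch. Unlike the assembly, it rejects values that are not a power of two, which is an extra check I added for illustration; the function name is my own.

```python
def shift_count(sectors_per_cluster: int) -> int:
    """Count the right shifts needed before the lowest set bit falls out."""
    if sectors_per_cluster <= 0 or sectors_per_cluster & (sectors_per_cluster - 1):
        raise ValueError("expected a power of two")
    count = 0
    while not sectors_per_cluster & 1:
        sectors_per_cluster >>= 1
        count += 1
    return count


example_shift = shift_count(8)
```

For 8 sectors per cluster this gives 3, so multiplying by 8 becomes three left shifts.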

With this calculated, we can now move back to the spot where we loaded our virtual accumulator with the cluster number of our file.

With this number we do a number of left shifts, which performs the multiplication I was referring to earlier:

       ldx $c8
       beq convdone     ; one sector per cluster, nothing to shift
conv
       asl $c0          ; shift a zero into bit 0
       rol $c1
       rol $c2
       rol $c3
       dex
       bne conv
convdone

Now we have the relative sector number where our file begins. We still need to add the location of the root cluster to get the absolute sector number. We previously stored this number at location $c4, so we can do the addition like this and load the first sector of the file into memory:

       ldx #$c4
       ldy #0
       jsr add32
       ldx #48
       ldy #0
       jsr st32rev
       LDA #6
       JSR CMD
       LDA #$12
       STA $FB0B
       LDA #$16
       STA $FB0B

After loading this sector of the file, one can also jump to it with JMP $400.
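
Putting the last few steps together, the LBA of the first sector of the file can be checked with a one-line Python function. It is only a sketch; the function name and the example numbers are made up.

```python
def file_start_lba(cluster: int, shift: int, root_lba: int) -> int:
    """(cluster - 2) shifted left by the sectors-per-cluster shift, plus root LBA."""
    return (((cluster - 2) << shift) + root_lba) & 0xFFFFFFFF


# Hypothetical values: file at cluster 5, 8 sectors per cluster, root at 4080.
example_file_lba = file_start_lba(5, 3, 4080)
```

With these numbers the file starts 3 clusters, or 24 sectors, past the root cluster.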

Testing

To test that all this functionality really works on a physical board, we can write a 6502 program in boot.bin that flashes an LED. There are some spare bits in the register ignore_reads that we can use. For this purpose we will be using bit 5 of this register. One also needs to map this bit to an LED on the board via the XDC constraint file.

The following snippet will do the flashing:

    lda #0
    sta $0
    clc
    ldx #0
    ldy #0
    lda #0
loop
    inx
    bne loop
loop2
    iny
    bne loop
loop3
    adc #1
    cmp #60
    bne loop
    lda #$20
    eor $0
    sta $FB0B
    sta $0
    lda #0
    beq loop

I have added a couple of nested loops to slow down the flashing enough for it to be visible to the human eye. One needs to assemble this snippet and store it as boot.bin in the root directory of the SD Card.

I followed this process and can confirm that the LED flashes on my board 😀

In Summary

In this post we wrote some more 6502 assembly code to read the first sector of a file stored on a FAT32 partition.

In the next post we will be revisiting our DDR3 core and see if we can get our 6502 based design to use DDR3 RAM rather than block RAM. This will bring us one step closer in trying to run an Amiga core on an Arty A7, using a 6502 to load all the ROM and images into memory.

Until next time!

Monday 3 April 2023

SD Card Access for a Arty A7: Part 8

Foreword

In the previous post we did some deeper exploring of FAT32 and wrote some high-level code to read a file from an example FAT32 partition.

In this post we will start to attempt the same exercise, but with the goal of writing the code in 6502 assembly. However, there is one hurdle holding us back from jumping straight into writing 6502 code for this exercise: the read sector data lives in a FIFO buffer within the SD Card core, and not within the memory space of the 6502.

Now, I have pointed out in a previous post that the SD Card core does have a register through which you can access the FIFO buffer 32 bits at a time. It would be possible to write some 6502 code for reading the contents of the FIFO buffer and storing it in 6502 memory space, but this would be quite messy.

What I am thinking is rather to write a DMA (Direct Memory Access) module. This module would then call the FIFO read register repeatedly, and then take the 32 bit data and write it in a byte-by-byte fashion to the 6502 memory space. This will simplify the 6502 assembly code somewhat.

So, in this post we will be developing the DMA module.

The DMA State Machine

As with so many things one develops in Verilog, one often finds the need to implement your requirements by means of a state machine. Our DMA module is no exception to this.

Let us start by listing the states our state machine needs, in the order in which it transitions through them:

IDLE
START: This state will be triggered by 6502 assembly code, indicating that the DMA transfer should happen. When the DMA module starts the transfer process, the 6502 CPU needs to be paused via RDY line, to avoid simultaneous writes to memory. In my design I will leave some headroom, waiting a number of cycles after de-assertion of RDY line, before starting the DMA transfer.
SEND_CMD: Send a read command to the SD Card core, to get 32 bits of data from the FIFO. The state machine always remains in this state for just one clock cycle.
WAIT_ACK: Wait for the ack signal from the SD Card core. This indicates that data is ready and needs to be captured by our DMA core.
SEND_6502_MEM: Send the captured 32 bits of data to the 6502 memory space. While in this state the data is transferred to 6502 memory space one byte at a time. Once all 32 bits are transferred, the state machine will transition to either IDLE or SEND_CMD, depending on whether the full 512 bytes have been transferred to 6502 memory space.

This gives us a high level overview of all the states involved. Let us now focus more on each individual state transition.

The transition from IDLE to START should be triggered by the 6502. So, let us start by adding an input port for this to our module:

module dma(
  input wire start
    );

At first sight one might think that one can just change state to START if this port is high. However, the 6502 might not get a chance to set this port low again, because it will be frozen for the duration of the DMA transfer. If the 6502 still cannot set it low in time after the transfer, the DMA will initiate another transfer and freeze the 6502 again.

So, it is better to trigger the transition from IDLE to START only at the point where the input port transitions from low to high.

Let us write some code for this:

...
 always @(posedge clk)
 begin
     start_delayed <= start;
 end
...
assign pos_trigger = start && !start_delayed;
always @(posedge clk)
begin
    case (state)
        IDLE: begin
             state <= pos_trigger ? START : IDLE;
           end
        START: begin
             state <= count == 0 ? SEND_CMD : START;
           end
    endcase 
end
...

The START state only transitions to the next state once a counter has expired, to give some headroom as I explained earlier.

Next, let us have a look at the state of sending a read command to SD Card, and waiting for acknowledge when data is ready:

        SEND_CMD: begin
             state <= WAIT_ACK;
           end
        WAIT_ACK: begin
             state <= ack ? SEND_6502_MEM : WAIT_ACK;
           end

As can be seen we are only in SEND_CMD for a single clock cycle, before going to WAIT_ACK.

Finally we have our SEND_6502_MEM state. We basically linger in this state until all 32 bits are transferred to 6502 memory space a byte at a time. Once these 4 bytes have been transferred, we either jump to state IDLE or to SEND_CMD, depending on whether we have transferred the full 512 bytes of the FIFO buffer.

This branch from the state SEND_6502_MEM requires us to maintain two counters: one for keeping track of how far we have shifted the 32 bits of data, and one for how much of the 512 bytes of data has been transferred.

... 
always @(posedge clk)
 begin
   if (ack && state == WAIT_ACK)
   begin
     shift_count <= 3;
   end else if (state == SEND_6502_MEM)
   begin
     shift_count <= shift_count - 1;
   end
 end
...
 always @(posedge clk)
 begin
   if (state == IDLE)
   begin
       address_6502 <= 0;
   end else if (address_6502 < 512 && state == SEND_6502_MEM) begin
       address_6502 <= address_6502 + 1;
   end
 end
...

With these two counters defined, we can now create the SEND_6502_MEM selector in our state machine:

        SEND_6502_MEM: begin
             if (shift_count == 0 && address_6502 != 511)
             begin
                 state <= SEND_CMD;
             end else if (shift_count != 0)
             begin
                 state <= SEND_6502_MEM;
             end 
             else
             begin
                 state <= IDLE;
             end
           end
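
To convince ourselves that these transitions visit every address exactly once, we can step through them in a small Python model. This is a behavioural sketch and not the Verilog itself: the START countdown, the ack delay and the 2-bit width of shift_count are assumptions, since the snippets above do not show them.

```python
IDLE, START, SEND_CMD, WAIT_ACK, SEND_6502_MEM = range(5)


def run_transfer(headroom=4, ack_latency=2):
    """Clock the modelled state machine through one transfer.

    Returns the 6502 addresses written, in order. headroom and
    ack_latency are made-up cycle counts, not values from the design.
    """
    state = START             # as if pos_trigger fired while in IDLE
    count = headroom          # assumed down-counter for the START state
    wait = 0                  # cycles left until the modelled ack
    shift_count = 0
    address = 0               # address_6502 is held at 0 while in IDLE
    writes = []

    while state != IDLE:
        next_state = state
        if state == START:
            if count == 0:
                next_state = SEND_CMD
            else:
                count -= 1
        elif state == SEND_CMD:
            next_state = WAIT_ACK         # always a single cycle
            wait = ack_latency
        elif state == WAIT_ACK:
            if wait == 0:                 # ack seen
                shift_count = 3
                next_state = SEND_6502_MEM
            else:
                wait -= 1
        elif state == SEND_6502_MEM:
            writes.append(address)        # write_6502 is high here
            if shift_count == 0 and address != 511:
                next_state = SEND_CMD
            elif shift_count != 0:
                next_state = SEND_6502_MEM
            else:
                next_state = IDLE
            shift_count = (shift_count - 1) & 3   # assumed 2-bit counter
            if address < 512:
                address += 1
        state = next_state
    return writes


example_writes = run_transfer()
```

In this model the transfer writes addresses 0 to 511 in order and then returns to IDLE, regardless of the headroom and ack delay chosen.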

The Remaining Verilog bits

The state machine we defined in the previous section forms the heart of our DMA module. However, we still need to write some more Verilog code for this module to glue everything together and to do something useful.

First let us create a complete list of ports our DMA module will need:

module dma(
  input wire [31:0] wb_data,
  input wire clk,
  input wire ack,
  input wire start,
  output wire read, 
  output reg pause_6502 = 0,
  output wire [7:0] o_data,
  output reg [15:0] address_6502,
  output wire write_6502
    );

Here is a quick description of the different ports:

wb_data: FIFO read data from the SD Card module. Returned when we issue a read command.
ack: Signal from SD Card module, indicating data requested is ready.
read: Signals the top module we want to do a dma read from the SD Card module.
pause_6502: Pauses the 6502 while we transfer a sector of data.
o_data: 8 bits of sector data to write to 6502 memory space.
address_6502: This is in actual fact the counter defined earlier on and is also used in writing data to 6502 memory space.
write_6502: perform a write to 6502 memory space. This is accompanied with the ports o_data and address_6502

For the above output ports we need to write some code for populating them with values. Let us start with the port pause_6502:

 always @(posedge clk)
 begin
     if (pos_trigger)
     begin
        pause_6502 <= 1;
     end else if (state == IDLE) 
     begin
        pause_6502 <= 0;
     end
 end

We basically assert the signal upon assertion of the start signal. Only once we are back at the state IDLE we release the assertion.

Next, let us tackle o_data. This port is 8 bits wide, whereas we receive the data 32 bits at a time, so we will need to implement some kind of shift register, which we do as follows:

 always @(posedge clk)
 begin
   if (ack && state == WAIT_ACK)
   begin
       captured_data <= wb_data;
   end else if (state == SEND_6502_MEM)
   begin
       captured_data <= {captured_data[23:0], 8'h0};
   end
 end

This is pretty self explanatory. Capture data when ready and shift out if in state SEND_6502_MEM.
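
The effect of the shift register can be mimicked in Python. The sketch below assumes that o_data is driven from the top byte, captured_data[31:24], which is not shown in the snippet above, so treat the byte order as an assumption.

```python
def bytes_from_word(word: int):
    """Return the four bytes shifted out of a captured 32-bit word."""
    captured = word & 0xFFFFFFFF
    out = []
    for _ in range(4):
        out.append(captured >> 24)                 # assumed o_data = captured_data[31:24]
        captured = (captured << 8) & 0xFFFFFFFF    # {captured_data[23:0], 8'h0}
    return out


example_bytes = bytes_from_word(0x11223344)
```

Under that assumption the most significant byte of the FIFO word is the first one written to 6502 memory.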

I would like to point out, though, that it is not enough to capture the data while checking ack alone. This signal is also asserted when the 6502 sends commands to the SD Card module. Therefore we need to check that we are in state WAIT_ACK as well.

There remain two output ports we need to do: read and write_6502. These ports are relatively straightforward:

...
assign write_6502 = state == SEND_6502_MEM;
...
assign read = state == SEND_CMD;
...

Wiring the DMA module to top module

With the DMA module fully developed, we need to interface it with the rest of the system. First let us have a look at the ports of our 6502:

cpu cpu( .clk(gen_clk), .reset(count_down > 0), .AB(cpu_address), .DI(combined_data), 
  .DO(cpu_data_out), .WE(we_6502), .IRQ(0), .NMI(0), .RDY(!(wait_read || pause_6502) ));

With reference to DMA, we are only interested in the RDY signal. We basically do an OR here of the existing pause condition in the system (wait_read) and the pause_6502 signal from our DMA module.

Next, let us move on to the affected ports in the SD Card module:

sdspi  sdspi (
...
            .i_wb_stb(dma_read ? 1 : wb_stb), 
            .i_wb_addr(dma_read ? 2 : cpu_address[3:2]),
...
    );

In both these ports we multiplex via dma_read between read instructions from 6502 and the dma module.

We blindly assert port i_wb_stb when dma_read is true. Also for the port i_wb_addr we assert the address 2 if dma_read is true. Address 2 instructs a read from the FIFO buffer.

Finally we need to modify our block RAM logic for the 6502 memory space so it can be written to by both the 6502 and the DMA module:

     assign ram_6502_addr = write_6502_dma ? {ignore_reads[4:3], ram_6502_addr_out[8:0]} : cpu_address;

     always @ (posedge gen_clk)
       begin
        if ((we_6502 & cpu_address[15:9] == 0) || write_6502_dma) 
        begin
         ram[ram_6502_addr] <= write_6502_dma ? o_data : cpu_data_out;
         ram_out <= write_6502_dma ? o_data : cpu_data_out;
        end
        else 
        begin
         ram_out <= ram[cpu_address];
        end 
       end

The key here is write_6502_dma, which is a signal from our DMA module.

ram_6502_addr is the address we use when writing to 6502 memory. You will notice that in the DMA version of the write address, we are making use of bits 4 and 3 from the ignore_reads register, a register we developed in a previous post and to which the 6502 can write.

By writing to these two bits in ignore_reads, we can control where in memory the dma data will end up, but on a 512 byte boundary.
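
The address concatenation is easy to mirror in Python if you want to work out where a given byte will land. This is a sketch of the bit arithmetic only, with a function name of my own.

```python
def dma_ram_address(ignore_reads: int, dma_address: int) -> int:
    """Model of {ignore_reads[4:3], ram_6502_addr_out[8:0]}."""
    page = (ignore_reads >> 3) & 0b11            # selects one of four 512-byte pages
    return (page << 9) | (dma_address & 0x1FF)   # low nine bits come from the DMA


example_first_byte = dma_ram_address(0x16, 0)
example_last_byte = dma_ram_address(0x16, 511)
```

With bits 4 and 3 set to 10 binary, the sector ends up at addresses 1024 to 1535.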

Writing some more 6502 Assembly

With the Verilog changes completed for our DMA, let us write some 6502 code to utilise it.

I want to start off by highlighting a limitation of our current setup. Currently the byte sequences for the different SD Card commands, including the reading of a sector, are stored as a table in ROM. For the sector read command this is problematic, since the sector number you want to read is also in ROM, meaning you cannot read a different sector from the one currently present in the ROM.

Not very useful, is it? 😂 To get around this, we will need to copy the sector read entry from the table to RAM, which will allow us to change the sector number for a read command. For this purpose, I am going to use zero page:

       ldx #7
ldzero lda data+48,X
       sta 48,X
       dex
       bpl ldzero

Just to recap from previous posts. data is the beginning of the mentioned table in ROM. The sector read command is entry number 6, and with each entry being 8 bytes, we come up with number 48.

In this code snippet I decided to do the copy in the reverse order. If you copy in ascending byte order, you will need to have an extra compare instruction to test whether X reached the target value. Descending order avoids the compare, and your branch instruction can just keep the loop going until x becomes negative.

Now, as soon as we are past the point of SD Card initialisation and we want to read one or more sectors, we need to change the table pointer from ROM to zero page. You might remember from previous posts that we use address A0 for our table pointer, which will result in the following code:

       LDA #0
       STA $A0
       STA $A1
       LDA #6
       JSR CMD

The pointer update we only need to do once. Also we don't need to make any changes to our CMD routine.

To read a different sector, we can just write code like this:

       lda #$20
       sta 50
       lda #$15
       sta 51
       LDA #6
       JSR CMD

This will read sector 2015(Hex). To do this, we just needed to adjust two bytes in the command entry we store in zero page.

So, with the sector being read and present in the FIFO buffer, we need to write some 6502 code to instruct our DMA to transfer data from FIFO buffer to 6502 memory space. The following snippet gives an example of reading two separate sectors into 6502 memory:

       LDA #6
       JSR CMD
       LDA #$e
       STA $FD0B
       nop
       nop
       lda #$20
       sta 50
       lda #$15
       sta 51
       LDA #6
       JSR CMD
       LDA #$12
       STA $FD0B
       LDA #$16
       STA $FD0B

       nop
       nop

DONE
       JMP DONE

The writes to the register $FD0B are the ones that perform the DMA transfers. Let us start by having a look at the first transfer, which is initiated by writing $e to the register $FD0B. Looking at the individual bits, setting bit 2 to one initiates the transfer. Bits 3 and 4 give us the value 01, meaning the transfer will go to addresses 512 to 1023 (i.e. 512-byte page 1).

Let us have a look at the second transfer. Here we see two separate writes to the register $FD0B. Comparing the writes, we see that bit 2 is first set to zero and then to one. This creates a positive transition, which triggers the DMA transfer. Bits 3-4 give us the value 10 binary, which is 512-byte page two, located at addresses 1024 to 1535. So the sectors we read will be next to each other.
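
The three register writes from the snippet can be decoded with a short Python sketch. It treats each write as one sample of bit 2 and assumes the bit starts out as zero; the function name is my own.

```python
def dma_triggers(writes):
    """Return the 512-byte page for every write that raises bit 2 from 0 to 1."""
    pages = []
    previous_start = 0                 # assumed: bit 2 is zero before the first write
    for value in writes:
        start = (value >> 2) & 1
        if start and not previous_start:
            pages.append((value >> 3) & 0b11)   # bits 4:3 select the page
        previous_start = start
    return pages


# The writes from the snippet above: $e, then $12 followed by $16.
example_pages = dma_triggers([0x0E, 0x12, 0x16])
```

The writes of $e and $16 each cause a transfer, to page 1 and page 2 respectively, while the write of $12 only lowers bit 2 again.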

You will also note that after every DMA trigger I am adding two nop instructions. This is because of an anomaly I discovered with the 6502 core and the RDY signal. When our DMA core pauses the 6502 via the RDY signal, the 6502 core somehow skips the next byte, which is supposed to be the next instruction opcode. I worked around this issue by adding nop instructions after the STA $FD0B, so if an opcode byte is skipped, it is at least a meaningless one.

This concludes our discussion on the 6502 code that needs to be written for triggering the DMA core.

In Summary

In this post we developed a DMA core for transferring sector data stored in the FIFO buffer of the SD Card module to 6502 memory space.

We also wrote some 6502 code for triggering a DMA transfer.

In the next post we will continue to write 6502 Assembly code for reading a file from a FAT32 partition.

Until next time!