Thursday, 28 September 2023

New beginnings of a dual channel DDR3 memory controller

Foreword

In the previous post we managed to get a 6502 based ecosystem together where we could access both an SD Card and DDR3 memory.

With this design we can load quite a lot of stuff from SD Card into DDR3 memory and thus reduce our dependency on limit Block RAM that is available on the FPGA. This opens the possibility to emulate an Amiga core on the Arty A7 FPGA board.

As mentioned in previous posts, we will be using an 6502 based system that will do all the work of loading all the required stuff from SD Card to DDR3, which the Amiga core requires to work. Needless to say, this would require both the 6502 core and Amiga core to access the DDR3 memory.

One way to address the need of both 6502 + Amiga core to access the DDR3, would be to use the memory controller we developed in the last couple of posts, and just let the two cores make turns to access DDR3 memory. Knowing that our memory controller runs at around 8MHz, that would mean that our Amiga core would be running accessing DDR3 memory at around 4MHZ, because it would be accessing memory at every second clock cycle. This is far from ideal with a stock Amiga running at least at 7MHz.

So, in this post we will try and come up with an optimised dual channel memory controller where we will attempt to make both the Amiga core and 6502 core access memory at 7MHz.

The Magic of Memory Banks

In our journey with DDR3 memory, we got the know the different states memory can be in:

  • Activate: Activate a row for reading or writing
  • Read/Write: Read or write a particular column of data
  • Precharge: After you are finished with your reads/writes on a particular row, you first need to precharge the row, before moving on to the next one.
All the above mentioned takes time to complete. In my Arty A7 scenario, each of these states takes about 5 memory clock cycles to complete.

Once you have an open row, however, consecutive memory reads from the same row can be quite fast, provided you give the column addresses ahead of time.

Things, however, will not work out so well for our plan where a 6502 and Amiga core need to access memory. The Amiga core, for instance, might need to access data from a different row than what the 6502 is currently busy with. In such case the Amiga core needs to wait for 6502 core to finish its business with the current row it is busy with, before the Amiga core can open the row it wants. This will again bring us to the point where the Amiga core can only access the memory at half the available memory bandwidth, e.g. 4MHz.

However, all hope is not lost. DDR3 memory divides memory into different memory banks and each memory bank can have a row open independently from the over banks. The DDR3 memory chip on the Arty A7 have 8 memory banks. This means that each memory bank is 256MB/8 = 32MB.

So, the basic idea is to give both the 6502 core and the Amiga core its own bank, then theoretically every core can get the full memory bandwidth of 8MHz. One just need to carefully schedule the timing of when to issue DDR3 commands, so that these cores don't trip over each over. DDR3 RAM, for instance, still have only one data bus, so if you issue read commands from two banks, you can't expect the data to arrive at the same time. It will first output the data from the first bank, and thereafter the data from the other bank.

Coming back to the size of every memory bank. 32MB per bank is more than enough of what we want to do. For the Amiga core this will be more than enough for the ROMS and the amount of RAM you will get for your earlier Amigas. For the 6502 core this will also be more than enough to store a disk image and simulate a disk read from the Amiga.

Using timeslots wisely

The memory on the Arty A7 clocks at 333Mhz, which is far beyond the speed capability of the FPGA on the Arty A7. As we learned from previous posts, the designers of the FPGA, provided a way out by providing OSERDES blocks, for serialising out data. The OSERDES blocks themselves can serialise the data out at 333MHz. We need to provide the data 4 chunks at a time to this block, which reduces the required speed from the rest of the FPGA to 83MHz, which is more manageable.

Now, in our current design, for every 4 timeslots, we can at most issue only one DDR command. We have a choice where this command can happen, but at most only one command within 4 timeslots.

With our plan to interleave DDR commands for a 6502 core and an Amiga core, issuing 1 command per 4 cycles, is perhaps too tight. After thinking of this for a while, I came thought of having two commands per 4 timeslots. I want to reserve the first two timeslots for th3 6502 core, and the last 2 slots for the Amiga core.

To see how we are going to change our design to cater for this, let us revise how the current design works. The following is a snippet of one of the selectors in our state machine for our memory controller:

              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
              end
test_cmd is the command we want to issue. I am not going to explain the individual bits for this, but it basically indicates what RAS/CAS/WRITE should be set as for the command. cmd_slot indicates at which of the 4 time slots the command should be issued.

The bits of these two registers goes down a number of levels, until we have reached the following snippet:

  cmd_addr #(
    .IODELAY_GRP(IODELAY_GRP),
    .IOSTANDARD(IOSTANDARD_CMDA),
    .SLEW(SLEW_CMDA),
    .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
    .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE),
    .ADDRESS_NUMBER(ADDRESS_NUMBER)
  ) cmd_addr_i(
    .ddr3_a   (ddr3_a[ADDRESS_NUMBER-1:0]), // output address ports (14:0) for 4Gb device
    .ddr3_ba  (ddr3_ba[2:0]),             // output bank address ports
    .ddr3_we  (ddr3_we),                 // output WE port
    .ddr3_ras (ddr3_ras),                // output RAS port
    .ddr3_cas (ddr3_cas),                // output CAS port
    .ddr3_cke (ddr3_cke),                // output Clock Enable port
    .ddr3_odt (ddr3_odt),                // output ODT port,
    .cmd_slot (cmd_slot),
    .clk      (clk),                     // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div  (clk_div),                 // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst      (rst),                     // reset delays/serdes
    .in_a     (in_a[2*ADDRESS_NUMBER-1:0]), // input address, 2 bits per signal (first, second) (29:0) for 4Gb device
    .in_ba    (in_ba[5:0]),              // input bank address, 2 bits per signal (first, second)
    .in_we    (in_we[1:0]),              // input WE, 2 bits (first, second)
    .in_ras   (in_ras[1:0]),             // input RAS, 2 bits (first, second)
    .in_cas   (in_cas[1:0]),             // input CAS, 2 bits (first, second)
    .in_cke   (in_cke[1:0]),             // input CKE, 2 bits (first, second)
    .in_odt   (in_odt[1:0]),             // input ODT, 2 bits (first, second)
//    .in_tri   (in_tri[1:0]),             // tristate command/address outputs - same timing, but no odelay
    .in_tri   (in_tri),             // tristate command/address outputs - same timing, but no odelay
    .dly_data (dly_data[7:0]),           // delay value (3 LSB - fine delay)
    .dly_addr (dly_addr[4:0]),           // select which delay to program
    .ld_delay (ld_cmda),               // load delay data to selected iodelayl (clk_div synchronous)
    .set      (set)                      // clk_div synchronous set all delays from previously loaded values
);
At this point we have already stripped of all the necessary bits from the command, as indicated in bold.

You might also pick up that we are doubling up on the bits, like we are multiplying ADDRESS_NUMBER by 2, with the bank we are passing 6 bits instead of the required 3 and so on. So, in effect for most part of the system we already catering for two commands per 4 time slots. It is just that right at the top we are passing down a single command.

Now, within cmd_addr module, we need to make a couple of changes to handle two commands per 4 time slots. First let us look at the module for outputting the address to DDR3 memory:

// All addresses
generate
    genvar i;
    for (i=0; i<ADDRESS_NUMBER; i=i+1) begin: addr_block
//       assign decode_addr[i]=(ld_dly_addr[4:0] == i)?1'b1:1'b0;
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_addr_i (
    .dq(ddr3_a[i]),               // I/O pad (appears on the output 1/2 clk_div earlier, than DDR data)
    .clk(clk),          // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div(clk_div),      // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst(rst),
    .dly_data(dly_data_r[7:0]),     // delay value (3 LSB - fine delay)
    .din({{2{in_a_r[ADDRESS_NUMBER+i]}},{2{in_a_r[i]}}}),      // parallel data to be sent out
//    .tin(in_tri_r[1:0]),          // tristate for data out (sent out earlier than data!) 
    .tin(in_tri_r),          // tristate for data out (sent out earlier than data!) 
    .set_delay(set_r),             // clk_div synchronous load odelay value from dly_data
    .ld_delay(ld_dly_addr[i])      // clk_div synchronous set odealy value from loaded
);       
    end
endgenerate
Here cmda_single is applicable to a single address bit, so we need to replicate it for every bit of the address. We do that with a for-loop construct.

Now, with the din port we need to supply four bits of data for each applicable address bit, which is needed by an OSEDRDES serializer. For the first two timeslots we duplicate the first address twice, and for the last two timeslots we duplicate the last address twice.

We need to do a similar exercise for the bank address, so I am not going to show the code for that here.

At first side it may seem a bit puzzling that I duplicate the address bits and bank address bits, instead of pinning it to the correct slot. This is because in the other non-command slots the address is ignored, so we can actually save quite a bit on logic here, especially knowing that there is quite a number of address bits.

For the RAS/CAS/WE bits, we do something like the following:

// we
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
    .dq(ddr3_we),
    .clk(clk),
    .clk_div(clk_div),
    .rst(rst),
    .dly_data(dly_data_r[7:0]),
    .din({cmd_slot[1] ? {in_we_r[0], 1'b1}  : {1'b1 , in_we_r[0]},
          cmd_slot[0] ? {in_we_r[1], 1'b1}  : {1'b1 , in_we_r[1]}}),
    .tin(in_tri_r), 
    .set_delay(set_r),
    .ld_delay(ld_dly_cmd[3]));
Note as before, our command slot is still 2 bits, but the meaning has a changed a bit. Previously cmd_slot was to be interpreted as a number between 0 and 3, but now each memory channel has its own bit, and have each access to only to two slots.

With all these alterations done to deal with a dual channel memory controller, let us see how our state machine will deal with dual channel memory requests:

              ROW_CMD: begin
                  if (edge_count == 9)
                  begin
                      test_cmd <= 32'h000001ff;
                      phy_rcw_pos_2 <= 2;
                  end else
                  begin
                      test_cmd <= 32'h000005ff;
                      phy_rcw_pos_2 <= 7;
                  end
                  
                  cmd_slot <= 0;
                  if (edge_count == 8)
                  begin
                      state <= COL_CMD;
                      test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                  end
                  
              end

              COL_CMD: begin
                          state <= WAIT_READ_WRITE_0;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;
                          mem_channel <= 0;  
                          data_in <= {8{cmd_data_out}};
                          do_write <= write_out;
              end
              WAIT_READ_WRITE_0: begin
                  state <= WAIT_READ_WRITE_1;
                  dq_tri <= do_write ? 0 : 15;
                  cmd_slot <= 0;
                  test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;                  
              end


              WAIT_READ_WRITE_1: begin
                  state <= WAIT_READ_WRITE_2;
              end

              WAIT_READ_WRITE_2: begin
                  test_cmd <= 32'h000001ff;
                  phy_rcw_pos_2 <= 3;
                  state <= PRECHARGE_AFTER_WRITE;
              end

              PRECHARGE_AFTER_WRITE: begin
                  
                  data_in <= {8{16'h8888}};
                  dq_tri <= 0;
                  phy_rcw_pos_2 <= 4;
                  mem_channel <= 1;
                  state <= POST_PRECHARGE;
                  cmd_slot <= 3;
                  test_cmd <= 32'h000029fd;
              end

              POST_PRECHARGE: begin
                  cap_value <= data_out;
                  state <= ROW_CMD;
                  phy_rcw_pos_2 <= 7;
                  test_cmd <= 32'h000005ff;
              end

I have bolded the parts that is required to perform memory operations for the second channel. For now I have hardcoded a write operation for the second channel, writing the hex value 8888 to a particular memory location in bank 1, every time it is the turn of the second memory controller.

I will give more meaningful stuff for the second memory channel to do in coming posts. For now it is just important to see that these two memory channels can co-exist without any issues.

An important part of second memory channel operations is the register phy_rcw_pos_2. This indicated which bits RAS/CAS/WE should be asserted for the applicable timeslot for the second memory controller. The bits are as follows:

  • Bit 0: Write Enable
  • Bit 1: CAS
  • Bit 2: RAS
It is important to note that these bits are active when low.

Viewing dual channel in action

Let us have a look at our dual channel setup in a simulation waveform:


I have marked with red C's when our 6502 core clocks.

I have marked with lime coloured arrows where operations of our first memory channel happens, which is also the memory channel that our 6502 core uses.

Likewise, I have indicated with blue arrows, where operations happens for our second memory channel. As mentioned previously, we only do a write operation currently for our second channel, which operates on bank 1.

You might find it a bit strange our first blue arrow is a pre-charge command (e.g. WE and RAS asserted) and not an activate command (e.g. RAS asserted only). This is because this command forms the last command in a series that started in the previous clock cycle. 

During the simulation everything worked fine and I didn't got any DDR timing violation errors. I also ran on the physical FPGA and all reads/writes of the first memory channel works 100%

In Summary

In this post we started to implement a dual channel memory controller. Up to this point we got memory operations for the two channels to live together. The second memory channel, however, is only performing writes at the moment.

In the next post we will do some more work on our second channel of our memory controller, so that it can do some more useful work.

Until next time!