Sunday 11 September 2022

Reading/Writing Data in 16-bit chunks from RAM

Foreword

In the previous post we reduced trailing latency for DDR RAM access, providing us with the memory throughput required by an Amiga core.

At first sight it seems that an Amiga core cannot really work with DDR3 memory. An Amiga core accesses memory 16 bits at a time, whereas DDR3 RAM transfers data in bursts of 4 or 8 16-bit words. So, in this post we will see if we can find a way to work with DDR3 memory 16 bits at a time.

Writing 16-bits at a time

On our journey to tackling 16-bits at a time, let us have a look at writes.

As mentioned earlier, DDR3 RAM works with either 4 or 8 bursts at a time. Putting an Amiga core into the picture, only one of these 4/8 bursts will ever be a valid write, and the memory covered by the remaining bursts will be unintentionally overwritten.

Somehow we need to be able to tell the DDR3 RAM which of the bursts contains valid write data, and indeed DDR3 memory provides a way to do this.

All DDR3 memory contains an input signal called Data Mask, abbreviated DM. During a burst, the DDR3 RAM examines the DM signal at every data burst. If the signal is 1, i.e. the burst is masked, the burst will be ignored. If, however, the signal is 0, i.e. unmasked, the burst will be considered valid and the relevant location in memory will be updated with the value.

Therefore, with an Amiga core and eight bursts, there will always be exactly one time slot where the DM signal is 0, and in all the others it will be 1.

Let us take an example. Suppose we want to write the value 25 to address 13. Burst writes always start at addresses that are multiples of 8, like addresses 0, 8, 16, 24 and so on. Address 13 falls within the range 8 to 15. So, the DM values for these bursts will look as follows:

DM:      1  1  1  1  1  0  1  1
Address: 08 09 10 11 12 13 14 15

Concerning the data, we can just repeat the data value 25 eight times, which makes life easier.
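
The example above can be checked with a small simulation-only sketch. The module and signal names (dm_example_tb, address, data, burst_base, slot, dm, burst_data) are made up for illustration and are not part of the design in this post. I also assume that bit 0 of the DM pattern corresponds to the first time slot.

```verilog
// Simulation-only sketch with hypothetical names, not part of the actual design.
module dm_example_tb;
    // Values taken from the example in the text.
    wire [15:0]  address = 16'd13;
    wire [15:0]  data    = 16'd25;

    // The burst starts at the multiple of 8 at or below the address.
    wire [15:0]  burst_base = {address[15:3], 3'b000};
    // The lower three bits select the time slot within the burst.
    wire [2:0]   slot       = address[2:0];
    // All slots masked (1) except the selected one (0).
    wire [7:0]   dm         = ~(8'b0000_0001 << slot);
    // The same 16-bit value repeated for all eight slots.
    wire [127:0] burst_data = {8{data}};

    initial begin
        #1;
        // Prints: base=8 slot=5 dm=11011111 (bit 5 is the only zero)
        $display("base=%0d slot=%0d dm=%b", burst_base, slot, dm);
        $display("data=%h", burst_data);
        $finish;
    end
endmodule
```

Note that the printed DM pattern reads right to left compared to the table above, since bit 0 is the rightmost digit.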

Now, let us write some code. First we need to reduce data_in/data_out ports of mem_tester to 16 bits:

module mem_tester(
    input clk,
    // 0 - reset
    // 1 - ready
    input [2:0] cmd_status,
    output reg select = 0,
    output reg refresh = 0,
    output reg write,
    output [15:0] address_out,
    output wire [/*127*/15:0] data_out,
    input [/*127:0*/15:0] data_in
    );
...
endmodule
Let us now move on to the file mcontr_sequencer.v, which contains the state machine that breaks up the commands from mem_tester into DDR3 memory commands. One of the case selectors needs to change as follows:

              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1,
                                       (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= {8{cmd_data_out}};
                          //column_address <= cmd_address[9:0];
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
We are basically duplicating the data_out of mem_tester 8 times. This is fed to the OSERDES module, which will repeat the same burst 8 times during a write.

Now, we still need to assert the correct DM time slot, so that data is written to the correct location in memory. We do this with the help of the lower three bits of the command address sent by the mem_tester:
    always @*
    begin
        if (cmd_address[2:0] == 0) 
        begin
            dm_slot = ~1;
        end else if (cmd_address[2:0] == 1)
        begin
            dm_slot = ~2;
        end else if (cmd_address[2:0] == 2)
        begin
            dm_slot = ~4;
        end else if (cmd_address[2:0] == 3)
        begin
            dm_slot = ~8;
        end else if (cmd_address[2:0] == 4)
        begin
            dm_slot = ~16;
        end else if (cmd_address[2:0] == 5)
        begin
            dm_slot = ~32;
        end else if (cmd_address[2:0] == 6)
        begin
            dm_slot = ~64;
        end else // cmd_address[2:0] == 7; a final else avoids an inferred latch
        begin
            dm_slot = ~128;
        end
    end
Here dm_slot is the data we need to feed to the OSERDES component dealing with the DM output. Once the mem_tester has asserted an address, the OSERDES component for DM will output the 8-bit pattern continuously, until the DDR3 RAM reaches the phase of receiving the data that should be written. During this phase the DDR3 RAM will use the DM slot that is zero as its cue.
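
The if/else chain above is a one-hot decode followed by an inversion, so it can also be written as a shift. The following is only a sketch of an equivalent form: the module name dm_slot_decode and the port name slot_address are hypothetical, and I assume dm_slot is 8 bits wide with bit 0 being the first time slot.

```verilog
// Sketch of a compact equivalent of the dm_slot if/else chain.
// Module and port names are hypothetical.
module dm_slot_decode (
    input  wire [2:0] slot_address, // lower three bits of the command address
    output wire [7:0] dm_slot       // one bit per time slot, 0 = unmasked
);
    // Set the bit of the selected slot, then invert the whole pattern
    // so that only the selected slot ends up unmasked.
    assign dm_slot = ~(8'b0000_0001 << slot_address);
endmodule
```

Because every input value produces an output, this form cannot infer a latch. The explicit chain remains handy when the slot values have to be adjusted by hand.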

Using the lower three bits of the address to decide which DM slot to enable is quite a nice rule of thumb. However, beware that due to so many things happening from the time mem_tester asserts a command, until we get to the point where the DDR3 memory actually reads the bursts, your guess on the correct DM slot versus the actual one might be out by a time slot or two.

One way to determine the correct DM slots would be to run a number of simulations and determine these values experimentally. However, in my experience with this particular setup, you can make the simulation environment work perfectly, and the design running on the actual FPGA might still differ by a time slot or two from the simulation. Scary stuff indeed! A simulation environment is supposed to behave exactly as the real hardware does. I still need to narrow down why theory is different from practice, but for the moment I need to live with the difference.

Ultimately, this means that I also need to obtain the DM slot values experimentally when running on the real FPGA, which sounds like quite a tall order. However, I found a solution for this, which I will share a bit later in another section.

Reading 16 bits at a time

Let us now look into reading 16 bits at a time. With reading we don't need to worry about masking off certain bursts. We can let the DDR3 send its usual 8 bursts at a time, and we just wait for the burst we are interested in.

Issues can arise if the burst we are interested in is towards the end, adding unnecessary latency and potentially missing the deadline when the data is required.

We could perhaps do better by asking the RAM to give us the data we are interested in first. Indeed, the RAM datasheet does indicate that we can access RAM in such a fashion:


Everything is driven by the lowest three bits of the address. If the lowest three bits are zero, word zero will be presented first. Similarly, if these three bits are one, then word 1 will be presented first. This follows a nice pattern until the three bits are seven, at which point word 7 is presented first.

At this point we still have the uncertainty whether data is written in the correct DM slot, as outlined in the previous section. Let us start somewhere and start writing some further code. The ultimate test is that mem_tester should write to memory locations 0 to 15 in sequence, the values 0 to 15 in the same sequence. If we read the values back in the same sequence, we expect the same values.

Here is a snippet of the simulation waveform:


The ddr3_dq signal shows the result of an 8 burst read. The data_cap signal shows the value we have captured from the burst, which in this case is bb0b. This indicates that we captured the third burst, which has the same value.

Receiving our required data only at the third burst complicates our life a bit. To see why, let us revisit the Burst table of the DDR3 datasheet I presented earlier on:


As indicated by the red column, if we were to receive our data at the first burst, life would have been much simpler. In this case the DDR3 will just give us the correct data for the lowest three bits of the address.

When we receive our data at the third burst, things get more complex, as shown in the green column. Supplying address 0 will give us the data of address 2. Supplying address 1 gives the data of address 3, and so on. Overall, the column doesn't follow a nice sequential pattern: 2, 3, 0, 1, 6, 7, 4, 5.
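
The pattern in the green column can be reproduced with a small piece of logic. This is a sketch only, assuming a burst length of 8 and the sequential burst type for reads, which is how I read the burst order table. The module name ddr3_burst_order and its port names are hypothetical.

```verilog
// Sketch: which address (lower three bits) appears at a given position
// of an 8 burst read. Assumes burst length 8, sequential burst type.
// Module and port names are hypothetical.
module ddr3_burst_order (
    input  wire [2:0] start_col, // lower three bits of the supplied address
    input  wire [2:0] beat,      // position within the burst, 0 = first
    output wire [2:0] col        // lower three bits of the address delivered
);
    // The lower two bits count upwards and wrap within a group of four.
    wire [1:0] low = start_col[1:0] + beat[1:0];
    // The upper bit flips for the second half of the burst.
    assign col = {start_col[2] ^ beat[2], low};
endmodule
```

With beat fixed at 2, i.e. the third burst, start_col values 0 to 7 give 2, 3, 0, 1, 6, 7, 4, 5, the same sequence as the green column.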

The easiest way to do translation with this non-sequential pattern, would be to use a look-up table. With a lookup-table we can also solve our potential problem in the previous section of finding the correct DM-slot for writes.

Let us start building this lookup table, using the screenshot of the simulation again as the example for the first entry. In this waveform we requested address c, but we got b. We can state this info in another way: to get address b, we need to specify address c, or write it like this:

b -> c

This is one entry for our lookup table. However, our lookup table only needs the lower 3 bits, so let us convert to binary:

1011 -> 1100

which, keeping only the lower three bits, resolves to 3 -> 4

Let us now find similar lookup values for supplied addresses in the range 0 to 7. We start by writing down the supplied addresses 0 to 7, each followed by the actual address we got:
0 - 7
1 - 0
2 - 5
3 - 6
4 - 3
5 - 4
6 - 1
7 - 2

Now we swop the columns around to get the lookup table:

7 -> 0
0 -> 1
5 -> 2
6 -> 3
3 -> 4
4 -> 5
1 -> 6
2 -> 7
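
Swapping the columns by hand is error prone, so the step can also be done in a small simulation-only sketch. The module and array names (invert_lut_tb, observed, lookup) are made up for illustration. The observed values are the ones listed above.

```verilog
// Simulation-only sketch with hypothetical names: invert the observed
// "supplied address - actual address" table into a lookup table.
module invert_lut_tb;
    reg [2:0] observed [0:7]; // index = supplied address, value = address we got
    reg [2:0] lookup   [0:7]; // index = wanted address, value = address to supply
    integer i;

    initial begin
        observed[0] = 3'd7; observed[1] = 3'd0;
        observed[2] = 3'd5; observed[3] = 3'd6;
        observed[4] = 3'd3; observed[5] = 3'd4;
        observed[6] = 3'd1; observed[7] = 3'd2;

        // Swap the columns: the address we got becomes the index.
        for (i = 0; i < 8; i = i + 1)
            lookup[observed[i]] = i[2:0];

        // Print the lookup table ordered by wanted address.
        for (i = 0; i < 8; i = i + 1)
            $display("%0d -> %0d", i, lookup[i]);
        $finish;
    end
endmodule
```

The printed table is ordered by wanted address (0 -> 1, 1 -> 6, 2 -> 7 and so on), which is the order needed when writing the mapping as a chain of comparisons.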

Now we can implement the mapping in Verilog code:

    //Sim mapping
    always @*
    begin
        if (cmd_address[2:0] == 0)
        begin
            map_address = 1;
        end else if (cmd_address[2:0] == 1)
        begin
            map_address = 6;
        end else if (cmd_address[2:0] == 2)
        begin
            map_address = 7;
        end else if (cmd_address[2:0] == 3)
        begin
            map_address = 4;
        end else if (cmd_address[2:0] == 4)
        begin
            map_address = 5;
        end else if (cmd_address[2:0] == 5)
        begin
            map_address = 2;
        end else if (cmd_address[2:0] == 6)
        begin
            map_address = 3;
        end else
        begin
            map_address = 0;
        end
    end
Wherever we need to specify the new address, we need to use map_address for the lower three bits:

              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1,
                                       (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= {8{cmd_data_out}};
                          //column_address <= cmd_address[9:0];
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
Let us now move on and examine the data when running on the actual FPGA. The following waveform shows some signals captured while the core was running on the FPGA:


To save space, I have omitted the captions on the left, otherwise everything looks very tiny. The omitted captions, from top to bottom are as follows:

  • Address Out
  • Clk of mem_tester
  • Captured data
  • Write/read
As you can see during this capture, the write/read signal is low, so these are all reads. 

With the red arrows I have indicated where we are asserting commands. In the clock cycle following each red arrow, we capture the actual data. From this diagram, we can see that the addresses and corresponding data are as follows:
  • Address 0 -> 5
  • Address 1 -> 6
  • Address 2 -> 7
  • Address 3 -> 0
  • Address 4 -> 1
  • Address 5 -> 2
Missing from the diagram are addresses 6 and 7, which yield the values 3 and 4 respectively. So, the mapping for running on the actual FPGA will be as follows:

    always @*
    begin
        if (cmd_address[2:0] == 0)
        begin
            map_address = 3;
        end else if (cmd_address[2:0] == 1)
        begin
            map_address = 4;
        end else if (cmd_address[2:0] == 2)
        begin
            map_address = 5;
        end else if (cmd_address[2:0] == 3)
        begin
            map_address = 6;
        end else if (cmd_address[2:0] == 4)
        begin
            map_address = 7;
        end else if (cmd_address[2:0] == 5)
        begin
            map_address = 0;
        end else if (cmd_address[2:0] == 6)
        begin
            map_address = 1;
        end else
        begin
            map_address = 2;
        end
    end
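
In this particular capture every supplied address N returned the value (N + 5) modulo 8, so the inverse mapping is simply "add 3 and wrap around at 8". A sketch of that observation follows. The module name map_address_fpga and the port name wanted are hypothetical, and the offset of 3 only holds for the values captured above.

```verilog
// Sketch: the FPGA mapping above written as a wrapping 3-bit addition.
// Module and port names are hypothetical. The offset of 3 comes from the
// values captured in this post and may differ on other setups.
module map_address_fpga (
    input  wire [2:0] wanted,      // lower three bits of the address we want
    output wire [2:0] map_address  // lower three bits to supply to the DDR3
);
    // All operands are 3 bits wide, so the sum wraps around at 8.
    assign map_address = wanted + 3'd3;
endmodule
```

The simulation mapping does not reduce to a single offset like this, so there the explicit lookup table is still needed.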

In Summary

In this post we implemented 16-bit reading/writing from DDR3 memory.

At this point in time we have a potential DDR3 memory solution for an Amiga core. We still need to fill this memory with ROM contents, which it makes sense to load from an SD Card.

So, in the next post we will start to work on an FPGA design where we read data from an SD Card.

Until next time!