Sunday, 3 July 2022

Shrinking Latency

Foreword

In the previous post we created a very elementary memory tester for the Arty A7 board to see if we more or less got the logic correct for writing to and reading from memory.

When I wrote the memory tester, I added quite a bit of padding between DDR commands, just to avoid violating any DDR timing parameters. The purpose of that exercise was just to get the memory tester working, and not to worry at that point about getting the most efficient timing possible. As the old saying goes, you eat an elephant one bite at a time. 😀

In this post we will revisit timings for our elementary memory tester, and see where we can remove any wasted clock cycles. The ultimate goal is to be able to access memory at a rate of at least 7MHz, which will be sufficient to emulate an Amiga core.

Reducing initial latency

Let us see if we can reduce the initial latency. This is the latency from the moment the Memory Tester asserts a command until the time when DDR RAM receives the first command for fulfilling this request. This is all illustrated with the following hand-drawn diagram:


In this diagram every division indicated has a period of 1.5ns and we show a couple of clock signals. Let us start by having a look at the frequencies of these clock signals:

  • Memtester has a frequency of 20MHz, and I am not showing a complete cycle of it.
  • Oserdes Out has a period of 2 divisions = 3ns. This equals a frequency of 333MHz. This is the signal driving the commands out to the DDR RAM.
  • Oserdes Load has a period of 8 divisions = 12ns. This equals a frequency of 83MHz. This clock signal is used to load the OSERDES block with 4 bits worth of data at a time. 
  • As you can see from the diagram, Mclk has exactly the same frequency as Oserdes Load, but is shifted 45 degrees.
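As a quick check of the figures in the list above: frequency is the reciprocal of the period, and a 45 degree shift on a 12ns period works out to 1.5ns, which is exactly one division on the diagram. The symbols used here are only for illustration.

```latex
\begin{align*}
f_{\text{Oserdes Out}}  &= \frac{1}{3\,\text{ns}}  \approx 333\,\text{MHz} \\
f_{\text{Oserdes Load}} &= \frac{1}{12\,\text{ns}} \approx 83.3\,\text{MHz} \\
t_{45^\circ} &= \frac{45^\circ}{360^\circ} \times 12\,\text{ns} = 1.5\,\text{ns}
\end{align*}
```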
Let us now look at the initial flow of events when the Mem Tester asserts a command. The Mem Tester asserts a command at point A.

We capture this command from the Mem Tester at point B. You might remember from the previous post, that we store this captured command in a register called test_cmd.

Our test_cmd register is hooked up to the inputs of the Oserdes blocks, which load values from test_cmd on the rising edge of the Oserdes Load clock.

From the diagram you will see that point B happens a small amount of time after a rising edge of Oserdes Load. This means that the Oserdes blocks will not capture the data of test_cmd straight away, and will only do so at point C.

With the way Oserdes blocks work, they will not start outputting the 4-bit sequence to DDR RAM at point C, but rather at the following rising edge of the Oserdes Load clock.

As you can see quite a bit of time is wasted from the time a command is asserted by the mem tester, until the time the first command is issued to DDR RAM, resulting in reduced throughput.

There are a number of ways we can reduce this initial latency. For starters, quite a bit of time is wasted by first storing a value into a register, which makes this data available to the rest of the system only at the following clock cycle.

For the initial command to the DDR originating from the mem tester, we can bypass the test_cmd register and feed the command directly to the Oserdes blocks. We can do this as follows:

...
    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out) 
           ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd} : test_cmd;
...
    phy_cmd #(
      ...
    ) phy_cmd_i (
      ...
        .phy_cmd_word        (result_cmd),
      ...
    );
...
So, if we are in the state WAIT_CMD, the mem tester indicates the command is valid, and it is not a refresh command, we don't use test_cmd, but build up a row activate command on the fly. With this setup the oserdes components will load on the first oserdes load-clock edge following the mem tester clock edge, and will basically start outputting commands to DDR RAM at point C in the diagram. In effect we have shaved off a full oserdes load-clock cycle of latency.
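The effect of bypassing a register can be seen in isolation with a small simulation. This is only a sketch: the module and signal names (bypass_demo, take_bypass, cmd_in, cmd_reg, cmd_out) are hypothetical and not taken from the project sources, and the 8-bit width and clock period are arbitrary.

```verilog
// Sketch: registered path versus combinational bypass.
// All names are hypothetical; they do not come from the project sources.
module bypass_demo (
    input  wire       clk,
    input  wire       take_bypass, // high while the fresh command should skip the register
    input  wire [7:0] cmd_in,      // command as asserted by the requester
    output reg  [7:0] cmd_reg,     // registered copy, updated on the next clk edge
    output wire [7:0] cmd_out      // what the next stage samples
);
    // Registered path: the command only appears here after a rising clk edge.
    always @(posedge clk)
        cmd_reg <= cmd_in;

    // Bypass: present the fresh command directly instead of the registered copy.
    assign cmd_out = take_bypass ? cmd_in : cmd_reg;
endmodule

module bypass_demo_tb;
    reg        clk         = 1'b0;
    reg        take_bypass = 1'b0;
    reg  [7:0] cmd_in      = 8'h00;
    wire [7:0] cmd_reg;
    wire [7:0] cmd_out;

    bypass_demo dut (
        .clk(clk), .take_bypass(take_bypass), .cmd_in(cmd_in),
        .cmd_reg(cmd_reg), .cmd_out(cmd_out)
    );

    always #5 clk = ~clk;  // rising edges at t = 5, 15, 25, ...

    initial begin
        // New command arrives between two rising edges.
        #7 cmd_in = 8'hA5; take_bypass = 1'b1;
        // Before the next edge: cmd_out already carries the command,
        // while cmd_reg still holds the previously registered value.
        #1 $display("t=%0t cmd_reg=%h cmd_out=%h", $time, cmd_reg, cmd_out);
        // After the next edge the register has caught up.
        #8 $display("t=%0t cmd_reg=%h cmd_out=%h", $time, cmd_reg, cmd_out);
        #4 $finish;
    end
endmodule
```

In the first printout the bypassed output is valid a full clock edge before the registered copy, which is the cycle we are trying to win back.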

The period between the rising edge of the mem tester clock and the following rising edge of the oserdes load clock is 1.5ns. During testing I found that this period doesn't allow all the bits of the command from the mem tester to settle properly.

To make everything work properly, I had to double the period between the two rising edges to 3ns. To accomplish this, I had to adjust the phases of the clocks that the MMCM block provides. The adjustments of the notable clocks are as follows:
  • Oserdes data load: From 45 degrees to 90 degrees.
  • Serial data out: From 180 degrees to 0 degrees.
  • Mclk: From 90 degrees to 135 degrees.
So, in effect, all the clocks need to shift by 1.5ns, with the exception of the mem tester clock.
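The phase values in the list follow from converting the 1.5ns shift into degrees for each clock period. The following worked arithmetic assumes the periods given earlier (12ns for Oserdes data load and Mclk, 3ns for serial data out):

```latex
\begin{align*}
\Delta\varphi_{12\,\text{ns}} &= \frac{1.5\,\text{ns}}{12\,\text{ns}} \times 360^\circ = 45^\circ
  &&\Rightarrow\; 45^\circ \to 90^\circ,\quad 90^\circ \to 135^\circ \\
\Delta\varphi_{3\,\text{ns}} &= \frac{1.5\,\text{ns}}{3\,\text{ns}} \times 360^\circ = 180^\circ
  &&\Rightarrow\; 180^\circ \to 360^\circ \equiv 0^\circ
\end{align*}
```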

Changing command slots

As mentioned previously, the Oserdes components for outputting commands to DDR receive 4 bits of data at a time. At any one point in time only one of these 4 bits will contain a command. The other three bits will be part of NOP commands.

At this point in time we don't have much flexibility at which of the four time slots a command can be issued. If we decide, for instance, that slot 2 should be used for commands, then every command we issue should be issued at slot 2, i.e. we cannot alternate between different slots between commands.

This inability to alternate between different slots can also cause unnecessary latency.

Let us take an example. The DDR RAM on the Arty A7 has a minimum latency of 5 clock cycles between commands. Let us assume we have decided to use the first slot as the command slot and we want to give a row activate command followed by a column read command. Let us quickly visualise the time slots:

1000 1000

Here we can see that issuing an Activate command in the first four time slots and then the column read command in the next four time slots is not going to work for us. In this setup there are only 4 cycles between the commands instead of the required 5. Because we cannot alternate between time slots, we will need to issue the commands like this:

1000 0000 1000

We can clearly see that, just because we cannot alternate between time slots, we have to insert an extra word of 4 cycles to guarantee a minimum of 5 cycles between commands. The commands end up 8 cycles apart where 5 would have been enough.
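The spacing can also be written down as a small formula. Here the slots are numbered 0 to 3 and the 4-bit words are counted with an index w. These symbols are my own shorthand and not from the project. With a fixed slot the distance between two commands is always a multiple of 4, while moving the second command one slot later in the very next word (1000 0100) gives exactly 5:

```latex
\begin{align*}
d &= 4\,(w_2 - w_1) + (s_2 - s_1) \\
\text{fixed slot } (s_2 = s_1):\quad d &\in \{4, 8, 12, \dots\}
  \;\Rightarrow\; d_{\min} = 8 \text{ for } d \ge 5 \\
\text{alternating } (w_2 - w_1 = 1,\; s_2 - s_1 = 1):\quad d &= 4 + 1 = 5
\end{align*}
```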

To implement alternating between different time slots, we need to look inside the file cmd_addr.v. You can find it within the Elphel project in the path memctrl/phy/cmd_addr.v. Within this module we need to add the following input port:

input                  [1:0] cmd_slot,
This indicates at which slot number the given command should be triggered. Let us look at the write enable signal as an example of how to use this input port:

// we
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
    .dq(ddr3_we),
    .clk(clk),
    .clk_div(clk_div),
    .rst(rst),
    .dly_data(dly_data_r[7:0]),
    .din({cmd_slot == 0 ? {1'b1, 1'b1, 1'b1, in_we_r[1]} :
          cmd_slot == 1 ? {1'b1, 1'b1, in_we_r[1], 1'b1} :
          cmd_slot == 2 ? {1'b1, in_we_r[1], 1'b1, 1'b1} :  
          {in_we_r[1], 1'b1, 1'b1, 1'b1}}),
    .tin(in_tri_r), 
    .set_delay(set_r),
    .ld_delay(ld_dly_cmd[3]));
As can be seen here, at port din we place the signal in a different bit position for each value of cmd_slot, with the remaining three bits tied high.
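The slot selection can be tried out on its own with the following sketch. It is not the code from cmd_addr.v: the module name slot_mux and its ports are hypothetical, it uses an indexed assignment instead of the nested conditional shown above, and it assumes that bit 0 of the word is the first one sent out.

```verilog
// Sketch: place an active-low command bit into one of four time slots.
// Hypothetical helper, not part of the Elphel sources.
module slot_mux (
    input  wire       cmd_bit,   // active-low command signal (e.g. write enable)
    input  wire [1:0] cmd_slot,  // slot in which the bit should be sent
    output reg  [3:0] din        // 4-bit word for the serializer (bit 0 first, assumed)
);
    always @* begin
        din           = 4'b1111;  // default: every slot idle (high)
        din[cmd_slot] = cmd_bit;  // overwrite only the selected slot
    end
endmodule

module slot_mux_tb;
    reg        cmd_bit;
    reg  [1:0] cmd_slot;
    wire [3:0] din;
    integer    s;

    slot_mux dut (.cmd_bit(cmd_bit), .cmd_slot(cmd_slot), .din(din));

    initial begin
        cmd_bit = 1'b0;  // assert the active-low command bit
        for (s = 0; s < 4; s = s + 1) begin
            cmd_slot = s[1:0];
            #1 $display("cmd_slot=%0d din=%b", s, din);
        end
        #1 $finish;
    end
endmodule
```

For each slot value the printed word has a single 0 that moves from bit 0 up to bit 3, which matches the four cases of the conditional on port din.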

We need to repeat the above for each of the command signals. For the address and bank number signals we can just repeat the value for each slot, taking address as an example:

generate
    genvar i;
    for (i=0; i<ADDRESS_NUMBER; i=i+1) begin: addr_block
//       assign decode_addr[i]=(ld_dly_addr[4:0] == i)?1'b1:1'b0;
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_addr_i (
    .dq(ddr3_a[i]),               // I/O pad (appears on the output 1/2 clk_div earlier, than DDR data)
    .clk(clk),          // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div(clk_div),      // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst(rst),
    .dly_data(dly_data_r[7:0]),     // delay value (3 LSB - fine delay)
    .din({4{ in_a_r[ADDRESS_NUMBER+i]}}),      // parallel data to be sent out
//    .tin(in_tri_r[1:0]),          // tristate for data out (sent out earlier than data!) 
    .tin(in_tri_r),          // tristate for data out (sent out earlier than data!) 
    .set_delay(set_r),             // clk_div synchronous load odelay value from dly_data
    .ld_delay(ld_dly_addr[i])      // clk_div synchronous set odealy value from loaded
);       
    end
endgenerate

Putting everything together

We are now ready to adjust our main state machine within mcontr_sequencer.v. We start with the PREPARE_CMD and WAIT_CMD states:

...
              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  state <= WAIT_CMD;
...
              end
              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, cmd_address[9:0], 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= cmd_data_out;
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
...
In PREPARE_CMD we set cmd_slot to 0, so that the next command, the ACTIVATE, is issued in the first time slot.

You will remember that earlier on we defined the wire result_cmd, which passes the ACTIVATE command to DDR ahead of time when the Mem tester has asserted a command. So within the WAIT_CMD selector we can go ahead and issue a DDR Read/Write command, or a Refresh command if desired.

For the rest of the states, we remove unnecessary waits, which results in the following:

             
             STATE_PREA: begin
                  state <= WAIT_WRITE_RECOVERY;
                  dq_tri <= do_write ? 0 : 15;
                  cmd_slot <= 0;
                  test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;                 
              end
              WAIT_WRITE_RECOVERY: begin
                  test_cmd <= 32'h000001ff;
                  state <= PRECHARGE_AFTER_WRITE;
              end
              PRECHARGE_AFTER_WRITE: begin
                  do_capture <= 1;
                  state <= POST_READ_1;
                  cmd_slot <= 3;
                  test_cmd <= 32'h000029fd;
              end

              POST_READ_1: begin
                  state <= PREPARE_CMD;
                  test_cmd <= 32'h000001ff;
              end
Let us quickly go through this code. In STATE_PREA we wait for the read/write cycle to complete, and keep the ODT signal asserted during this time if it is a write.

After a read/write cycle is complete we need to close the open row in preparation for the next read/write command, by means of a PRECHARGE command. However, with DDR RAM you cannot issue a PRECHARGE command straight away, but need to wait for a period of time after a read/write cycle has completed. This time period is called Write Recovery, and on the Arty A7 this is 5 clock cycles.

Test Results

Let us see what happens in practice. The following simulation waveform shows what happens during a read:


On this waveform I have indicated points A, B, C, D and E:
  • Point A is the clock signal for our Memory Tester. Originally this frequency was 20MHz, but during experimentation I found that 20MHz is a tight fit, so I lowered the frequency to 16.7MHz. Here the Memory Tester issues the command at the first rising edge and expects the data to be available at the following rising edge.
  • At point B we issue a row Activate command.
  • At point C we issue a column read command.
  • At point D we issue the precharge command.
  • Point E indicates the point when we receive data from the data_out port and when this value is captured by the cap_value port.
As we see at point E, the data is captured after the second rising edge of the Mem Tester clock, but we require it to be captured before the second rising edge. We will tackle this in the next post.

In Summary

In this post we have attempted to reduce latency in our memory controller. We managed to reduce the time from the moment the memory tester issues a memory command until the time a physical command is issued to DDR RAM.

An issue that we still need to resolve is making read data available before the second rising edge of the Mem Tester clock.

Until next time!