Monday, 25 April 2022

Writing and reading to Arty RAM

Foreword

In the previous post we played a bit with read levelling on the Arty A7.

I also came up with the theory that you don't really need to do write or read levelling on the Arty A7. This board contains only a single RAM chip, which sits fairly close to the FPGA, so it looks as though one would not need to account for variable clock skew.

If this theory proves to be correct throughout this series, our design will remain relatively simple.

In this post we attempt to write a value to the RAM of the Arty, and see if we can read the same value back.

If we can achieve this, it will be a huge step towards creating our custom memory controller.

As in my previous couple of posts, I will be building on the Elphel 10393 memory controller, the source code of which is available here.

Overview of RAM commands for reading and writing

Let us start off by looking at the commands to do a read and a write to RAM.

Before we do anything, the row on which we want to work needs to be activated. This is done by bringing the RAS signal low and specifying the row address. Before issuing the next command, one needs to wait for a couple of clock cycles. On the Arty A7, this waiting period is 5 clock cycles.

With the row activated, read and write commands can now be issued to that row.

To do a write, the CAS signal needs to be brought down, as well as the WE signal. The column address to which we want to write also needs to be supplied at the same time.

Once the write command is supplied to the RAM, the data needs to be provided on the DQ pins. However, the data cannot be provided straight away and one needs to wait for 5 clock cycles before providing it.

In our case, after we have performed a write, we want to see if we can read the same data back. For a read command the CAS needs to be pulled down once again, but we leave WE high. We also need to specify the same address as we did previously for the write.

After waiting again for 5 clock cycles, the data will be available on the DQ pins.

Just a note about the data written to or read from the DQ pins. During a read or write operation, eight 16-bit words are transferred at a time. These eight words are transferred over a period of 4 clock cycles, with one word transferred on each rising and each falling clock edge.

ISERDESE2 and OSERDESE2 revisited

As you might have guessed, we will be using ISERDESE2 and OSERDESE2 components for capturing/writing the eight words on the DQ pins.

From the previous post we have seen that these components are serial-parallel/parallel-serial converters, allowing data on DQ pins to work at a speed of 333MHz, while allowing the rest of the FPGA to operate at a more convenient 83MHz.

Since my last post, I discovered a caveat with the ISERDESE2 and OSERDESE2 components. With OSERDESE2 you can indeed specify 8 bits at a time, which makes our 83MHz/333MHz sum work out.

However, I discovered that ISERDESE2 can only accept 4 bits of data from the DQ pins at a time, so suddenly our 83MHz/333MHz sum doesn't work out anymore. We would need to clock the ISERDESE2 at 167MHz to ensure we offload the captured bits in time.

I don't feel comfortable running everything in the FPGA at 167MHz, so I will add my own shift register to the design that shifts in 4 bits at a time and outputs 8 bits. The shift register will therefore operate at 167MHz, and the rest of the design can operate at 83MHz.
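
The clock arithmetic behind these numbers can be sketched as a tiny simulation-only module. This is just an illustration: the module and parameter names are made up for this example, and 333MHz is the rounded figure used in this post.

```verilog
// Simulation-only sketch; module and parameter names are hypothetical.
module serdes_ratio_demo;
    // Clock on the DQ side, rounded as in the text above
    localparam real DDR_CLK_MHZ  = 333.0;
    // DDR moves one bit per pin on each clock edge, so two bits per clock cycle
    localparam real BITS_PER_CLK = 2.0;

    initial begin
        // OSERDESE2 accepting 8 bits per parallel-clock cycle
        $display("8 bits at a time -> parallel clock of %0.2f MHz",
                 DDR_CLK_MHZ * BITS_PER_CLK / 8.0);
        // ISERDESE2 handing over 4 bits per parallel-clock cycle
        $display("4 bits at a time -> parallel clock of %0.2f MHz",
                 DDR_CLK_MHZ * BITS_PER_CLK / 4.0);
    end
endmodule
```

The two results correspond to the roughly 83MHz and 167MHz clocks mentioned above.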

The Command State Machine

Up to now I have been making use of a state machine for issuing the series of commands for initialising the RAM, and doing read and write levelling. I will be using the same state machine in this post for testing the read and write operation.

I haven't given much detail of this state machine, so I will do so in this section.

I have implemented the state machine in the file memctrl/phy/mcontr_sequencer.v of the Elphel 10393 memory controller.

Each command issued by the state machine is basically a 32-bit number, where the layout of the important bits is as follows:

  • bits 30-17: address
  • bits 16-14: bank
  • bit 13: RAS
  • bit 12: CAS
  • bit 11: Write Enable
  • bit 10: ODT
  • bit 9: CKE

Each bit of this command will ultimately end up at an OSERDESE2 component. In DDR RAM the command and address signals are only clocked on the rising edge, so for these OSERDESE2 components you only need to specify 4 bits at a time instead of 8.

The 4 bits we supply to an OSERDESE2 component correspond to 4 timeslots, and one of these timeslots will carry our command bit.

At this point the obvious question is which timeslot should be used for the command bit. This will become apparent a bit later.

Let us get back to our state machine. To perform an Activate, we do the following snippet:

    always @(posedge mclk)
    begin
        if (start_init)
        begin
            case (state)
              ...
              ... initialise RAM
              ...
              STATE_ACTIVATE: begin
                  state <= STATE_WAIT_ACTIVATE;
                  //Do activate
                  test_cmd <= 32'h000021fd;
                  //Put the DQ tristate buffers in output mode
                  dq_tri <= 0;
                  //Eight 16-bit words to write later on
                  data_in <= 128'h112233005566778899aabbccddeeff44;
              end

              STATE_WAIT_ACTIVATE: begin
                  //De-assert RAS, CAS and WE again
                  test_cmd <= 32'h000001ff;
                  state <= STATE_WAIT_STATE_1;
              end
              ...
            endcase
        end
    end

A couple of things are happening here. First we send the command for doing an activate, which is represented by the value 32'h000021fd. A command should only be signalled for one clock cycle, after which the RAS, CAS and WE signals should be de-asserted. It is for this reason that we set the command to 32'h000001ff on the next clock cycle.

You will also notice that I am setting dq_tri to zero. This sets the DQ tristate buffers to output mode so that we can perform the write at a later stage.

Also, during STATE_ACTIVATE, data_in gets assigned a value. This is in effect eight 16-bit words wrapped in a 128-bit value, which we want to write to memory. This data is supplied to the OSERDESE2 components for the DQ pins.

Once we have set data_in to this value, the OSERDESE2 components for the DQ pins will continuously output these 8 values.

After we have waited for at least 5 cycles for ACTIVATE to complete, we can issue a write command and then a read command:

              READ_TEST: begin
                  state <= WRITE_DELAY_0;
                  //Column write
                  test_cmd <= 32'h00081dfd;
              end
			  
              WRITE_DELAY_0: begin
                  test_cmd <= 32'h000005ff;
                  state <= WRITE_DELAY_1;
              end
              ...
              ... Wait 5 clock cycles
              ...

              DO_READ: begin
                  state <= PAUSE_AFTER_READ;
                  //Column read
                  dq_tri <= 15;
                  test_cmd <= 32'h000011fd;
              end

              PAUSE_AFTER_READ: begin
                  state <= PAUSE_AFTER_READ;
                  test_cmd <= 32'h000001ff;
              end

You will notice that after issuing the WRITE command, we issue the value 32'h000005ff instead of 32'h000001ff. This is because the ODT bit needs to stay asserted during a write.
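
To see how the command values used so far break down into fields, here is a small simulation-only sketch that prints each field using the bit layout listed earlier. The module and task names are hypothetical and are not part of the Elphel sources; the command values are the ones from the snippets above.

```verilog
// Simulation-only sketch; the module and task names are hypothetical.
module cmd_decode_demo;
    // Print the fields of a 32-bit command word, using the bit layout
    // given earlier in this post (bits 30-17 address, 16-14 bank, ...).
    task show_cmd(input [31:0] cmd);
        begin
            $display("%h: addr=%h bank=%h RAS=%b CAS=%b WE=%b ODT=%b CKE=%b",
                     cmd, cmd[30:17], cmd[16:14],
                     cmd[13], cmd[12], cmd[11], cmd[10], cmd[9]);
        end
    endtask

    initial begin
        show_cmd(32'h000021fd); // activate
        show_cmd(32'h000001ff); // idle word after the activate
        show_cmd(32'h00081dfd); // column write
        show_cmd(32'h000005ff); // idle word after the write
        show_cmd(32'h000011fd); // column read
    end
endmodule
```

Going by these values, a 1 in the command word appears to mean that the signal is asserted: the activate word has only the RAS bit set out of RAS, CAS and WE, and the two idle words differ only in the ODT bit.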

Five clock cycles after the WRITE command is issued, the RAM will sample the DQ pins and write that data to memory. Earlier we set up the OSERDESE2 components to continuously output our 8 values, so the RAM will pick up these values.

After the write command is complete, we issue a read command. Note that dq_tri is now set so that the DQ tristate buffers are prepared for reading. The RAM chip will provide the requested data after 5 clock cycles.

The ISERDESE2 components for the DQ pins will capture the requested read data and output it in groups of 4 bits. It is important that we latch this data into a register when it becomes available, otherwise the data will be gone after 4 more clock cycles.

We will have a look at latching this data in the next section. 

Latching read data

Apart from the latching described in the previous section, we also need latching to solve the problem that ISERDESE2 only outputs 4 bits at a time instead of 8. Aggregating these into 8 bits is what allows the rest of our design to clock at the slower 83MHz instead of 167MHz.

So, let us start by addressing the ISERDESE2 problem. The file in question is wrap/iserdes_mem.v of the Elphel 10393 memory controller. This is used by memctrl/phy/dq_single.v.

My instance of ISERDESE2 in iserdes_mem.v looks like this:

...
 parameter IOBDELAY = "IBUF",
... 
	 ISERDESE2 #(
         .DATA_RATE                  ("DDR"),
         .DATA_WIDTH                 (4),
         .DYN_CLKDIV_INV_EN          (DYN_CLKDIV_INV_EN),
         .DYN_CLK_INV_EN             ("FALSE"),
         .INIT_Q1                    (1'b0),
         .INIT_Q2                    (1'b0),
         .INIT_Q3                    (1'b0),
         .INIT_Q4                    (1'b0),
         
         .INTERFACE_TYPE             ("MEMORY"),
         .NUM_CE                     (1),
         .IOBDELAY                   (IOBDELAY),
         
         .OFB_USED                   ("FALSE"),
         .SERDES_MODE                ("MASTER"),
         .SRVAL_Q1                   (1'b0),
         .SRVAL_Q2                   (1'b0),
         .SRVAL_Q3                   (1'b0),
         .SRVAL_Q4                   (1'b0)
         
         )
         iserdes_i
         (
         .O                          (comb_out),
         .Q1                         (iserdes_out[3]),
         .Q2                         (iserdes_out[2]),
         .Q3                         (iserdes_out[1]),
         .Q4                         (iserdes_out[0]),
         .SHIFTOUT1                  (),
         .SHIFTOUT2                  (),
         .BITSLIP                    (1'b0),
         .CE1                        (1'b1),
         .CE2                        (1'b1),
         .CLK                        (iclk),
         .CLKB                       (!iclk),
         .CLKDIVP                    (), // used with phasers, source-sync
         .CLKDIV                     (oclk_div),
         .DDLY                       (ddly),
         .D                          (d_direct), // direct connection to IOB bypassing idelay
         .DYNCLKDIVSEL               (inv_clk_div),
         .DYNCLKSEL                  (1'b0),
         .OCLK                       (oclk),
         .OCLKB                      (!oclk),
         .OFB                        (),
         .RST                        (rst),
         .SHIFTIN1                   (1'b0),
         .SHIFTIN2                   (1'b0)
         );

One particular change here is that I am using the value "IBUF" for the parameter IOBDELAY. This is because I want to use the direct input, i.e. D, instead of the delayed version, DDLY. We are using our own fixed delays.

Now, in order to aggregate these 4 bits to 8 bits, we do the following:

 ...
 output [7:0] dout,
 ...
 always @(negedge oclk_div)
 begin
   dout_le <= {dout_le[3:0], iserdes_out};
 end
 ...

oclk_div is a 167MHz clock signal. With this snippet of code we get a fresh 8 bits of data on every second 167MHz clock cycle.
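
The behaviour of this shift register can be tried out in isolation with a small testbench. This is only a sketch under my own assumptions: the testbench name, the clock period and the two test nibbles are made up, and the ISERDESE2 output is simply driven by hand on the rising edge of the clock.

```verilog
`timescale 1ns/1ps
// Simulation-only sketch; the testbench name and stimulus are hypothetical.
module nibble_to_byte_tb;
    reg        oclk_div    = 1'b0;  // stands in for the 167MHz clock
    reg  [3:0] iserdes_out = 4'h0;  // stands in for the 4 ISERDESE2 outputs
    reg  [7:0] dout_le     = 8'h00;

    // 6 ns period, roughly 167MHz
    always #3 oclk_div = ~oclk_div;

    // Same shift register as in the snippet above:
    // the previous nibble moves up, the new nibble enters at the bottom.
    always @(negedge oclk_div)
        dout_le <= {dout_le[3:0], iserdes_out};

    initial begin
        @(posedge oclk_div) iserdes_out = 4'h5;  // first nibble
        @(posedge oclk_div) iserdes_out = 4'hA;  // second nibble
        @(posedge oclk_div);
        // Both nibbles have been shifted in by now
        #1 $display("dout_le = %h", dout_le);
        $finish;
    end
endmodule
```

After two falling edges the first nibble sits in the upper half of dout_le and the second nibble in the lower half, which is why a complete byte is only available on every second cycle.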

We now need to capture these 8 bits of data within our state machine. In the state machine we need to signal our design when to capture the data:

...
... Some delay
...
              POST_READ: begin
                  do_capture <= 1;
                  state <= POST_READ_1;
              end

              POST_READ_1: begin
                  do_capture <= 0;
                  state <= POST_READ_6;
              end
...

Directly after we have set do_capture to 1, we set it back to 0, so that the captured value is not overwritten. I determined the right moment to set do_capture by experimenting in the Verilog simulator.

We use do_capture as follows:

    always @(negedge mclk)
    begin
        if (do_capture)
        begin
            cap_value <= data_out;
        end
    end

Test Results

Let us have a look at simulation waveforms for our design:


The signal ddr3_dq is the read data that comes back from DDR memory.

The data_out signal is the output from the DQ ISERDESE2 components. The bits are rearranged: the first 64 bits contain the upper byte of every burst, and the last 64 bits contain the lower byte of every burst.

If we split the last data_out value into two pieces, we get:
  • xxxxxxxx 5577xxxx
  • xxxxxxxx 6688xxxx

You will see that each line starts with 8 x's. This is because only 4 bursts have been captured at that point in time. When the next 4 bursts are captured, these 8 x's will be replaced by 5577xxxx and 6688xxxx respectively, and the old position of these two data elements will be replaced with the freshly captured 4 bits of data.

One thing that might look strange is that both 5577 and 6688 are followed by 4 x's. This is because the ISERDESE2 components started capturing the 4 bits of data one cycle too early. We can fix this by altering the timeslot at which commands are issued.

The RAM commands are issued within the file memctrl/phy/cmd_addr.v. I have made quite a few changes to this file myself. Let us have a look at the part of this file that handles the WE signal:

// we
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
    .dq(ddr3_we),
    .clk(clk),
    .clk_div(clk_div),
    .rst(rst),
    .dly_data(dly_data_r[7:0]),
    .din({1'b1, in_we_r[0], in_we_r[1], 1'b1}),
    .tin(in_tri_r), 
    .set_delay(set_r),
    .ld_delay(ld_dly_cmd[3]));
In my version of the file, din accepts 4 bits, whereas the original version only accepts two bits.

Now, the problem was that we started capturing the read data one clock cycle too early, so if we issue the commands one clock cycle earlier, the problem will be solved. To do this, we change the din assignment as follows:

  • .din({in_we_r[0], in_we_r[1], 1'b1, 1'b1})

So, here we moved the in_we_r bits to the beginning.
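
The effect of this change on the 4 timeslot bits can be shown with a tiny simulation-only sketch. The module name and the example value of in_we_r are assumptions made purely for illustration; the two concatenations are the ones from the original and the modified din assignment.

```verilog
// Simulation-only sketch; the module name and test value are hypothetical.
module we_timeslot_demo;
    reg [1:0] in_we_r;

    initial begin
        // Example value only: in_we_r[1] is low, in_we_r[0] is high
        in_we_r = 2'b01;
        // Original layout: in_we_r bits in the two middle timeslots
        $display("original: %b", {1'b1, in_we_r[0], in_we_r[1], 1'b1});
        // Modified layout: in_we_r bits moved to the first two timeslots
        $display("modified: %b", {in_we_r[0], in_we_r[1], 1'b1, 1'b1});
    end
endmodule
```

With this example value the single low bit moves one position towards the start of the 4-bit pattern, which is the shift to an earlier timeslot described above.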

With this change the simulation looks better:


There is just one thing I would like to mention. When running this on the actual FPGA, I have found that the result might differ by one clock cycle from the simulation.

In Summary

In this post we did a simple test of writing a value to the Arty RAM and trying to read the same value back.

I managed to get this design running both in simulation and on the FPGA.

In the next post I will try and create a memory tester from what we have learned so far. This will give us an opportunity to stress test our implementation for writing and reading to memory.

Till next time!