Friday 23 December 2022

SD Card Access for an Arty A7: Part 4

Foreword

In the previous post we managed to issue an IDLE command to an SD Card via an SD Card reader attached to the Arty A7 board. We also confirmed that the SD Card sent a response back for the command.

Up to this point we made use of a state machine for issuing command sequences to the Gisselquist SD core. There are quite a number of commands one needs to issue to an SD Card in order to read or write data stored on it. Using a state machine for this exercise can become quite unpleasant in the long run.

Thus, in this post we will look at using a CPU core, on which we can run a stored program for issuing the SD Card command sequences. The CPU core I will be using for this purpose is Arlet Ottens' 6502 core.

I am sure there will be a few frowns out there on using an 8-bit CPU to work with a 32-bit Wishbone device like the Gisselquist SD Card core. However, the 6502 is fairly light on FPGA resources and I think it is worthwhile to see how far this core can help us out.

The Memory Map

Let us have a look at the memory map for our 6502 system:

  • FFFF - FF00: ROM. For starters we will have a 256-byte ROM, but it might grow beyond this size over time. As with many 6502 systems, the startup ROM needs to live in the top part of the address space, because the reset vector is at addresses FFFC-FFFD.
  • FEFF - FE00: Interface to the registers of the Gisselquist SD Card Core. As we have seen in the previous post, we have access to two 32-bit registers via the Wishbone bus of the Gisselquist core.

Let us zoom in a bit on the interface to the Gisselquist core. The Wishbone interface works with 32 bits of data, whereas the 6502 works with 8 bits of data at a time. How does one deal with these differences in data widths?

To explain a possible solution to the problem, let us start by arranging the memory locations starting at FE00 like this:

So, the addresses FE00-FE03 map to Register 0 of the Gisselquist core, and the addresses FE04-FE07 map to Register 1 of the Gisselquist core.

Now, reads and writes to the lowest byte of each register (marked in red) will trigger transactions on the Wishbone bus. The byte addresses in black map to temporary registers.

Suppose we want our 6502 to write a value to Register 1. We will start by writing the top three bytes of the 32-bit word to memory locations FE07, FE06 and FE05. Writing to these addresses will set the values of temporary registers and will not trigger any Wishbone transaction. Writing to FE04, however, will trigger a Wishbone write transaction.

With this Wishbone write transaction, we will concatenate the values stored in temporary registers FE07, FE06 and FE05, together with the value currently being written by the 6502 to address FE04.

A Wishbone read works in a very similar way, triggered by reading either FE04 or FE00. The top three bytes returned by the Wishbone read will be stored in another set of temporary registers, which can afterwards also be read by the 6502 at addresses FE07/FE06/FE05 or FE03/FE02/FE01.
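To make the idea a bit more tangible before we get to the Verilog, here is a small software model of the scheme described above. This is only a sketch: the names ByteWordBridge, cpu_write and cpu_read are made up for illustration, and the two Wishbone registers are stood in for by a plain array rather than the real SDSPI core.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical software model of the 8-bit to 32-bit scheme described above.
// Offsets are relative to FE00, so offset 4 is address FE04.
struct ByteWordBridge {
    uint8_t  write_latch[2][3] = {};  // per register: bytes 1..3 waiting for a write
    uint8_t  read_latch[2][3]  = {};  // per register: bytes 1..3 kept from the last read
    uint32_t wb_regs[2]        = {};  // stand-in for the two 32-bit Wishbone registers
    int      wb_transactions   = 0;   // number of simulated Wishbone transactions

    // The CPU stores one byte at FE00 + offset.
    void cpu_write(unsigned offset, uint8_t value) {
        assert(offset < 8);
        unsigned reg = offset >> 2, lane = offset & 3;
        if (lane != 0) {              // bytes 1..3 only update a temporary register
            write_latch[reg][lane - 1] = value;
            return;
        }
        // Byte 0: concatenate the latches with the byte being written.
        wb_regs[reg] = (uint32_t(write_latch[reg][2]) << 24) |
                       (uint32_t(write_latch[reg][1]) << 16) |
                       (uint32_t(write_latch[reg][0]) << 8)  | value;
        ++wb_transactions;
    }

    // The CPU loads one byte from FE00 + offset.
    uint8_t cpu_read(unsigned offset) {
        assert(offset < 8);
        unsigned reg = offset >> 2, lane = offset & 3;
        if (lane != 0) return read_latch[reg][lane - 1];
        uint32_t word = wb_regs[reg]; // byte 0: perform the Wishbone read
        ++wb_transactions;
        read_latch[reg][0] = uint8_t(word >> 8);
        read_latch[reg][1] = uint8_t(word >> 16);
        read_latch[reg][2] = uint8_t(word >> 24);
        return uint8_t(word);
    }
};
```

The point to take away is that only an access to the lowest byte of a register touches the bus; the other three byte addresses merely park data on either side of that access.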

I will give more detail on implementing this in coming sections.

Wiring up the 6502

Let us start wiring up the 6502 core.

For starters, we need a ROM for feeding the 6502 with a program to execute. For this we take a trip down memory lane to 2017, when we created a ROM module for our C64 core. You can find the full source for this module on Github: https://github.com/ovalcode/c64fpga/blob/master/ip/bblock/src/rom.v

An instance of the ROM module looks like this:

   rom#(
    .ADDR_WIDTH(8),
    .ROM_FILE("romsdspi.bin")
)   rom (
      .clk(gen_clk),
      .addr(cpu_address[7:0]),
      .rom_out(rom_out)
    );
As mentioned earlier, we will start off with only a 256-byte ROM. For this reason ADDR_WIDTH is set to 8. We also only use the lower 8 bits of the address from the CPU.

The parameter .ROM_FILE is the path to a file on the file system containing a ROM image. This is a text file with one byte per line, in hex. So, our 256-byte ROM will result in a file containing 256 lines. As mentioned previously, the reset vector is at addresses FFFC-FFFD, so the last four lines of our ROM file will look like this:

00
FF
00
00
Here we see the 6502 will start executing at address FF00, the beginning of the last 256-byte page in the 64K address space. We will cover the assembly code a bit later.
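If you would rather generate such a ROM file than type it by hand, something along the following lines could do the job. This is just a sketch under a few assumptions: the helper names make_rom_image and to_hex_lines are my own, unused bytes are filled with 00, and the program is assumed to start at FF00.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Build a 256-byte ROM image mapped at FF00-FFFF, with the reset vector
// at FFFC/FFFD pointing to FF00.
std::array<uint8_t, 256> make_rom_image(const std::vector<uint8_t>& program) {
    assert(program.size() <= 0xFA);   // keep the program below the vectors at FFFA-FFFF
    std::array<uint8_t, 256> rom{};   // unused bytes default to 00
    for (std::size_t i = 0; i < program.size(); ++i) rom[i] = program[i];
    rom[0xFC] = 0x00;                 // reset vector, low byte  (FFFC)
    rom[0xFD] = 0xFF;                 // reset vector, high byte (FFFD)
    return rom;
}

// Render the image as text: one byte per line, two hex digits each.
std::string to_hex_lines(const std::array<uint8_t, 256>& rom) {
    std::string out;
    char line[4];
    for (uint8_t byte : rom) {
        std::snprintf(line, sizeof line, "%02X\n", static_cast<unsigned>(byte));
        out += line;
    }
    return out;
}
```

Writing the returned string to the file named in .ROM_FILE then gives the 256 lines, with the last four lines being 00, FF, 00 and 00 as shown above.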

Let us now have a quick look at an instance of the Arlet Ottens core:

cpu cpu( .clk(gen_clk), .reset(...), .AB(cpu_address), .DI(rom_out), .DO(cpu_data_out), .WE(we_6502), .IRQ(0), .NMI(0), .RDY(1) );

Writing from 6502 to SDSPI Core

Let us now focus on the functionality for writing from the 6502 core to a SDSPI Core register.

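Firstly, because the 6502 can only deal with 8 bits at a time, we need to add three temporary registers so that we have the full 32 bits available that the SDSPI core requires for a write. Note that in this implementation a single set of temporary registers, at FE01-FE03, serves both SDSPI registers: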

always @(posedge gen_clk)
begin
  if (we_6502)
  begin
    if (cpu_address == 16'hfe01)
    begin
        reg_1 <= cpu_data_out;
    end else if (cpu_address == 16'hfe02)
    begin
        reg_2 <= cpu_data_out;
    end else if (cpu_address == 16'hfe03)
    begin
        reg_3 <= cpu_data_out;
    end
  end
end
Next, let us generate a strobe signal for the SDSPI Core:

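// High when the CPU addresses the lowest byte of a 32-bit register
assign on_word_boundary = (cpu_address[1:0] == 2'b00);

// Strobe the Wishbone bus only for word-aligned accesses in page FE
assign wb_stb = (cpu_address[15:8] == 8'hfe) && on_word_boundary;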
So, we only strobe on a word boundary, i.e. at addresses like FE00 and FE04. The applicable signals on the SDSPI core look like this:

sdspi  sdspi (
...
		// Wishbone interface
		// {{{
		.i_wb_cyc(1), .i_wb_stb(wb_stb), .i_wb_we(we_6502),
		.i_wb_addr({1'b0,cpu_address[2]}),
		.i_wb_data({reg_3, reg_2, reg_1, cpu_data_out}),
...
	);

Reading with the 6502

Now, let us look into reading with the 6502. Reading is a bit more complex than writing, because we can read from potentially two sources: ROM and registers of the SDSPI core.

To cater for the two possible read sources, let us create the following outline:

...
cpu cpu( ... .DO(cpu_data_out), ... );
...
always @(posedge gen_clk)
begin
  addr_delayed <= cpu_address;
end
...
always @*
begin
    casex (addr_delayed)
        ...
        default: combined_data = rom_out;
    endcase 
end
...
The casex is the main part for selecting the correct source. We use a casex instead of a usual case because we use a subset of the bits to decide which source to select. We will add more selectors to our casex in a bit.

One thing you will also notice is that we are using a delayed version of the address for selection. This is just to cater for the way Block RAMs work, which always have the data for a given address ready at the next clock cycle. By then the 6502 core can potentially assert a different address, which, without the delay, could cause data from the wrong source to be selected and presented to the CPU.

Now, let us extend our outline so that we make our 6502 read registers from the SDSPI core:

...
always @(posedge gen_clk)
begin
    wb_data_store <= (wb_stb && !we_6502) ? o_data_sdspi[31:8] : wb_data_store;  
end
...
always @*
begin
    casex (addr_delayed)
        16'b1111_1110_xxxx_xx00: combined_data = o_data_sdspi[7:0];
        16'b1111_1110_xxxx_xx01: combined_data = wb_data_store[7:0];
        16'b1111_1110_xxxx_xx10: combined_data = wb_data_store[15:8];
        16'b1111_1110_xxxx_xx11: combined_data = wb_data_store[23:16];

        default: combined_data = rom_out;
    endcase 
end
...
So, when we read from an SDSPI core register, we store the top three bytes in a temporary register called wb_data_store, which the 6502 can read at a later stage if so desired.
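If the casex patterns with their x bits look a bit cryptic, it may help to see the same selection written as a mask and compare. The following is only an illustrative model of the selection logic above; the enum and the function name select_read_source are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

enum class ReadSource { SdspiLowByte, StoredByte1, StoredByte2, StoredByte3, Rom };

// A casex pattern such as 16'b1111_1110_xxxx_xx00 only compares the bits
// that are not 'x'. In software this is a mask followed by a compare.
ReadSource select_read_source(uint16_t addr) {
    const uint16_t care_mask = 0xFF03;  // upper byte plus the two lowest bits
    switch (addr & care_mask) {
        case 0xFE00: return ReadSource::SdspiLowByte;  // o_data_sdspi[7:0]
        case 0xFE01: return ReadSource::StoredByte1;   // wb_data_store[7:0]
        case 0xFE02: return ReadSource::StoredByte2;   // wb_data_store[15:8]
        case 0xFE03: return ReadSource::StoredByte3;   // wb_data_store[23:16]
        default:     return ReadSource::Rom;           // rom_out
    }
}
```

One consequence of the don't-care bits is that the whole FE page decodes this way, so FE04 selects the same source as FE00, and FE05 the same as FE01.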

At this point we have a small caveat: the SDSPI core will not have register data ready at the next clock cycle, but requires one additional clock cycle before the data is ready. This behaviour breaks the assumptions the 6502 core makes.

Luckily, the 6502 core does provide an RDY input signal, with which we can effectively pause the 6502 on a read for as many clock cycles as we want, until the data we want is ready on the data bus. With this in mind, we need to change the code above to the following:

...
always @(posedge gen_clk)
begin
    wait_read <= wait_read ? 0 : (wb_stb && !we_6502);
end

always @(posedge gen_clk)
begin
    capture_data <= wait_read;
end

always @(posedge gen_clk)
begin
    wb_data_store <= capture_data ? o_data_sdspi[31:8] : wb_data_store;  
end

cpu cpu(... .RDY(!wait_read) );

always @(posedge gen_clk)
begin
  addr_delayed <= wait_read ? addr_delayed : cpu_address;
end
...
As seen from this code, we also need to wait before we capture a value for wb_data_store, and addr_delayed needs to be held for the duration of the wait.

The 6502 Assembly Program

Let us have a look at a 6502 Assembly program for accessing the SDSPI core, which will ultimately put an SD Card into IDLE mode:

FF00   A9 55     LDA #$55
FF02   8D 01 FE  STA $FE01
FF05   8D 02 FE  STA $FE02
FF08   8D 03 FE  STA $FE03
FF0B   A9 0B     LDA #$0B
FF0D   8D 04 FE  STA $FE04 ; Store the value $5555550B into Data Register
FF10   A9 C0     LDA #$C0  
FF12   8D 00 FE  STA $FE00 ; Init Config registers with value stored in Data Register
FF15   A9 FF     LDA #$FF
FF17   8D 03 FE  STA $FE03
FF1A   8D 02 FE  STA $FE02
FF1D   8D 01 FE  STA $FE01
FF20   8D 04 FE  STA $FE04 ; Load Data register with $FFFFFFFF
FF23   A9 40     LDA #$40
FF25   8D 00 FE  STA $FE00 ; Give Idle command ($40) followed by $FFFFFFFF (e.g. Data Register)
Just to give some context again. Data Register mentioned in the comments is register 1 of the SDSPI core.

The actual command byte is issued via address $FE00. The command byte value $C0 instructs the SDSPI core to load the config registers with values stored in the Data Register. Command byte value $40 instructs the SD Card to go into IDLE mode.
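The pattern in the listing is the same every time a full 32-bit word is written: three stores to the temporary registers, followed by one store that triggers the Wishbone write. As a sketch of that pattern, here is a helper that lists the stores needed for a given register and value. The names Store and stores_for_word are hypothetical, and the byte placement follows the {reg_3, reg_2, reg_1, cpu_data_out} concatenation from the Verilog earlier in this post.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Store { uint16_t address; uint8_t value; };

// Byte stores the 6502 performs to write `value` to SDSPI register `reg`
// (0 = command register at FE00, 1 = data register at FE04).
// The upper three bytes go to the shared temporary registers at FE01-FE03;
// the final store of the lowest byte triggers the Wishbone write.
std::array<Store, 4> stores_for_word(unsigned reg, uint32_t value) {
    assert(reg < 2);
    const uint16_t trigger = static_cast<uint16_t>(0xFE00 + 4 * reg);
    return {{
        {0xFE01, static_cast<uint8_t>(value >> 8)},   // bits 15:8
        {0xFE02, static_cast<uint8_t>(value >> 16)},  // bits 23:16
        {0xFE03, static_cast<uint8_t>(value >> 24)},  // bits 31:24
        {trigger, static_cast<uint8_t>(value)},       // bits 7:0, starts the write
    }};
}
```

For register 1 and the value $5555550B this gives the stores to FE01, FE02 and FE03 with $55, followed by the store of $0B to FE04, matching the first part of the listing.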

One part I haven't shown in this program is a required endless loop.

The waveforms produced by this Assembly program are the same as in the previous post, where we issued an IDLE command by means of a state machine, so I will not present the waveforms in this post.

In Summary

In this post we added Arlet Ottens' 6502 core to our design, so that we can programmatically initialise an SD Card. Doing SD Card initialisation with a state machine will just become too cumbersome in the long run.

In the next post we will try to fully initialise the SD Card and see if we can read a sector of data from the card.

Until next time!

  

Sunday 4 December 2022

SD Card Access for an Arty A7: Part 3

Foreword

This is the third part in the series where we try to get an SD Card reader to work on an Arty A7 board.

In the previous post we had a look at the FPGA core by Dan Gisselquist that can interface with an SD Card Reader.

A nice feature of Dan Gisselquist's core is the test bench that can simulate responses from an SD Card. Out of the box this Test Bench works within the Verilator ecosystem. However, in the previous post we managed to use Gisselquist's SD Card response module with simulation in Vivado.

We concluded the previous post by being able to issue an IDLE command to the SD Card response module and getting a response back.

In this post we will do the same exercise on the physical Arty A7 board, issuing an IDLE command to the SD Card and checking if we can also get a response back from an SD Card.

Attaching the SD Card module to the Arty A7

In the first part of this series I briefly showed a picture of a PMOD SD Card reader still in its packaging.

To be honest, this module stayed in its packaging until now 😁 So, I thought I would share a picture of the SD Card module attached to the Arty A7 board:


There was one caveat I discovered straight away when inserting the SD Card module, which I hadn't thought of beforehand: this module occupies some space in front of the PMOD header next to it.

This might pose an issue when we want to use a VGA PMOD module later in the project, which uses both PMOD headers JB and JC.

To get around this issue, we might need to make use of a PMOD extension cord for inserting the SD Card module, freeing some space in front of header JB. But, we will tackle this issue when we get there.

Creating the constraints

We need to create some constraints to ensure the ports from the top module are mapped to the correct pins on the PMOD header. We start by looking at the general xdc file for the Arty A7 on Github. In particular, we are interested in PMOD section JA, which we use for our SD Card Module:

We adjust the port names as follows:

## Pmod Header JA
set_property -dict {PACKAGE_PIN G13 IOSTANDARD LVCMOS33} [get_ports cs]
set_property -dict {PACKAGE_PIN B11 IOSTANDARD LVCMOS33} [get_ports mosi]
set_property -dict {PACKAGE_PIN A11 IOSTANDARD LVCMOS33} [get_ports miso]
set_property -dict {PACKAGE_PIN D12 IOSTANDARD LVCMOS33} [get_ports sclk]
set_property -dict { PACKAGE_PIN D13   IOSTANDARD LVCMOS33 } [get_ports dat1]; #IO_L6N_T0_VREF_15 Sch=ja[7]
set_property -dict { PACKAGE_PIN B18   IOSTANDARD LVCMOS33 } [get_ports dat2]; #IO_L10P_T1_AD11P_15 Sch=ja[8]
set_property -dict {PACKAGE_PIN A18 IOSTANDARD LVCMOS33} [get_ports cd]
set_property -dict {PACKAGE_PIN K16 IOSTANDARD LVCMOS33} [get_ports wp]
Next, we should ensure that our top level module uses the same port names:

module top(
    input wire CLK100MHZ,
    output wire cs,
    output wire mosi,
    input wire miso,
    output wire sclk,
    input wire cd,
    input wire wp
    );
...
endmodule
An instance of the Gisselquist SD Card core will live inside this top level module. We will also be implementing a state machine in this module for instructing the Gisselquist core to send an IDLE command to the SD Card.

We will cover the state machine in the next section.

The State Machine

Let us create a state machine for issuing an IDLE command to the Gisselquist core.

For starters, one needs to look at clock speed. The input clock to the Arty A7 is always 100MHz. This is perhaps a bit too fast for our purposes, so one can create a slower clock with the help of an MMCME2_ADV block. I have covered the use of such a block in one of my previous posts some time ago, so I am not going to cover the process of instantiating one here.

Preferably our state machine should only start once the generated clock is stable. For this purpose we will use the .LOCKED signal of the MMCME2_ADV instance. With this in mind, let us start with an outline of our state machine:

always @(posedge gen_clk)
begin
    if (clk_locked)
    begin
      case (state)
      ...
      endcase
    end
end
So, the state machine will only start changing states once the clock is locked. We start with a number of dummy states to give the Gisselquist core a chance to initialise, after which we de-assert the reset signal to the core:

...
          0: state <= 1;
          1: state <= 2;
          2: state <= 3;
          3: state <= 4;
          4: begin
               state <= 5;
               reset_sd <= 0;
             end
...
So, what next? We could go straight ahead and issue the IDLE command, but it is preferable to first set the clock signal that drives the SD Card to the desired initial frequency of 400KHz. I am clocking the SD Card core at a frequency of 10MHz, so we need to bring it down by means of the internal clock divider provided by this core. To get to 400KHz we need to use a divider value of 11, which is 0b in hexadecimal. We set the value with the state machine as follows:

          5: begin
                 state <= 6;
                 stb <= 1;
                 wb_sel <= 4'hf;
                 addr <= 1;
                 we <= 1;
                 data <= 32'h5556550b;
             end
          6: begin
                 state <= 7;
                 stb <= 0;
             end
          7: begin
                 state <= 8;
                 stb <= 1;
                 wb_sel <= 4'hf;
                 addr <= 0;
                 we <= 1;
                 data <= 32'hc0;
             end
To understand this snippet, we need to quickly look at the internal registers of the Gisselquist core:
  • Register 0: Used for sending command bytes. SD Cards always expect a command byte that starts with the bits 01. All other combinations of the two most significant bits of the command byte are therefore free to be used for commands operating on the Gisselquist core itself.
  • Register 1: Data register containing four extra bytes of data associated with the command byte. If you want to issue a command consisting of multiple bytes, you need to set this register first before issuing the command.
So, in the above snippet we are issuing the command byte c0, which sets some internal configuration registers of the Gisselquist core. The value to write to the configuration registers should be stored in the data register beforehand, which in this case is 32'h5556550b. The lower eight bits of this value form the value for the divider, which is 0b.
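The split between SD Card commands and core-internal commands can be captured in a couple of lines. The sketch below is only a model of the rule as described above, not code taken from the Gisselquist core, and the names CommandTarget, classify_command_byte and sd_command_index are my own.

```cpp
#include <cassert>
#include <cstdint>

enum class CommandTarget { SdCard, CoreInternal };

// A command byte whose two most significant bits are 01 is meant for the
// SD Card; every other combination is treated as a core-internal command.
CommandTarget classify_command_byte(uint8_t cmd) {
    return ((cmd >> 6) == 0x1) ? CommandTarget::SdCard
                               : CommandTarget::CoreInternal;
}

// For an SD Card command, the lower six bits hold the command number.
uint8_t sd_command_index(uint8_t cmd) {
    assert(classify_command_byte(cmd) == CommandTarget::SdCard);
    return cmd & 0x3F;
}
```

With this rule the byte 40 is command 0 for the SD Card (the IDLE command), whereas c0 stays within the core.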

Once we have set the internal configuration registers, we are free to issue the idle command to the SD Card:
 
          8: begin
                 state <= 9;
                 stb <= 1;
                 wb_sel <= 4'hf;
                 addr <= 1;
                 we <= 1;
                 data <= 32'hffffffff;
                  
              end
          9: begin
                 state <= 10;
                 stb <= 1;
                 wb_sel <= 4'hf;
                 addr <= 0;
                 we <= 1;
                 data <= 32'h40;
                  
              end
          10: begin 
                 stb <= 0;
             end
So, we issue the command 40 to the SD Card, followed by 4 ff bytes.

Clocking the Sd Card

The Gisselquist core generates a clock with which we can clock the SD Card. The temptation is great just to connect this signal directly to the outside world, as we do with the other signals to the SD Card.

However, everything always gets more complex with clocks, whether passing it around within the FPGA or passing it externally. In my previous project where I implemented a C64 core on a Zybo board, I had some fun at one stage dealing with an external clock, here.

I wanted to make use of the onboard Audio Codec on the Zybo board, and with my first attempt I wanted to clock this device directly. The Audio Codec simply refused to work. After some digging on the Internet, I discovered that you should always pass a clock to the outside world with an ODDR block.

The clock signal to the SD Card is no exception, so let us define an ODDR instance:

   ODDR #(
      .DDR_CLK_EDGE("OPPOSITE_EDGE"), // "OPPOSITE_EDGE" or "SAME_EDGE" 
      .INIT(1'b0),    // Initial value of Q: 1'b0 or 1'b1
      .SRTYPE("SYNC") // Set/Reset type: "SYNC" or "ASYNC" 
   ) ODDR_inst (
      .Q(sclk),   // 1-bit DDR output
      .C(o_sclk),   // 1-bit clock input
      .CE(1), // 1-bit clock enable input
      .D1(1), // 1-bit data input (positive edge)
      .D2(0), // 1-bit data input (negative edge)
      .R(0),   // 1-bit reset
      .S(0)    // 1-bit set
   );
With all this in place, we are now ready to do a test run on the Arty A7 board.

Test Results

Let us have a look at the Test Results, when running the core on the Arty A7:


I tried to cram quite a lot of info into this picture and the names of the signals are perhaps not so readable, so I repeat the signal names:
  • cs (Chip Select)
  • miso
  • mosi
  • o_sclk
Looking at the signals, we can see that we issue the IDLE command (0x40) on the mosi signal, and eventually we get a response back from the SD Card (0x01) via the miso signal, which is what we expect.

In Summary

In this post we gave the Gisselquist core a test run on an Arty A7 with an SD Card module attached and issued an IDLE command. In the test the SD Card responded correctly, confirming that we are on track at the moment with our setup.

There are quite a number of commands one needs to issue to an SD Card to read the data stored on it and this can be very cumbersome to implement via a state machine.

It will be far easier to write a program executed by a CPU for issuing the SD Card commands. We will look into this with the next post.

The CPU I have in mind for this exercise is the 6502. Granted, this is an 8-bit CPU and one will probably not get the best performance given the 32-bit data that needs to be passed quite often to the Gisselquist core, but I think it is a good start.

The 6502 doesn't need many FPGA resources, which will leave us more room for the Amiga core to be used later in the Blog series.

If required, we can always help the 6502 out with some hardware acceleration.

Until next time!

Monday 14 November 2022

SD Card Access for an Arty A7: Part 2

Foreword

In the previous post we started a multi-part series on interfacing the Arty A7 with an SD Card that will be used for non-volatile storage. This non-volatile storage is crucial for running an Amiga core on the Arty A7, storing the necessary disk images and ROMs.

In the previous post we covered the hardware we will use for interfacing with an SD Card.

In this post we will continue our journey with the software side of things. In particular we will be looking at an open-source core for interfacing with an SD Card and see if we can issue a basic command to an SD Card in simulation.

The SD Card interface by Dan Gisselquist

In this post we will be using the SD Card interface by Dan Gisselquist, here. This core does all of its communications with an SD Card by means of SPI (Serial Peripheral Interface), rather than the native SD Card interface. The SPI interface is slower than the native interface, because the native interface sends/receives four bits of data at a time, whereas SPI only deals with one bit at a time. However, the speed of SPI will be sufficient for our purposes.

Within the FPGA, this core is accessed via a Wishbone bus. A Wishbone bus is similar in function to an AXI bus you find in ARM devices. We will touch on the technical details of the Wishbone bus a bit later in this post.

There is one final feature of the Gisselquist core source code which I think is very cool. This is the fact that the source code contains a Test Bed which can simulate responses from an SD Card. Very useful, since the SD Card itself is a very complex state machine, and thus if we don't need to worry about this it makes life so much simpler.

The Test Bed for simulating SD Card responses is written in C++. To get this Test Bed to work is quite an exercise, needing to install a number of dependencies, working with Verilator.

I will deviate a bit from the steps on how the Test Bed is meant to be used. For starters, I will be using Vivado for simulation instead of Verilator. To be frank, I haven't used a C++ module in simulation with Vivado before, but I have found a nice tutorial for this on the Internet, which I will share in the next section.

Using C++ in a Vivado simulation

As mentioned in the previous section, I have never worked with C++ code in a Vivado Simulation before.

So, is it even possible? It turns out that it is indeed possible, if one has a look at this write up by Adam Taylor. The instructions Taylor gives are for Vivado on Windows. However, it is not so difficult to adapt them for Linux, which I will be using.

Taylor also mentions that Vivado provides an example, based on a counter, showing how to interface Vivado with C++. From this example I will just point out a couple of important operations one needs when using C++ in a Vivado simulation.

The first operation is get_port_number. Here is an example how this function is used:

        int i_clk = Xsi_Instance.get_port_number("i_clk");
This is typically the first step if you want to operate on a port of the top module. The number returned is like a handle that you will use to operate on that port like setting a value or reading a value.

Let us see how one would set the value on a port, using the i_clk port again as the example:

    const s_xsi_vlog_logicval logic_val  = {0X00000001, 0X00000000};
    Xsi_Instance.put_value(i_clk, &logic_val);
This will set the value on the i_clk port to one. There is quite a bit going on in this code snippet, but basically one needs to send a two valued structure to the function put_value. At first, these two valued structures can be very confusing, but luckily Vivado does provide us with some documentation within the comments of their code for this structure:


So, basically with the two integers together, we have a two bit value for every bit position. This kind of setup is necessary because apart from 0's and 1's, we also need to cater for X's (unknowns) and Z's (high impedance).

However, in our case we are not really interested in X's and Z's, so it is sufficient just to set the values of the 1's and 0's we want in the first integer, and leave the second integer zero.
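To see how the two integers work together, here is a small sketch that decodes a single bit position. Two assumptions are made here: LogicVal is a stand-in struct with the same two fields, so the sketch compiles without the Vivado headers, and the pairing of the bits follows the convention of the standard Verilog VPI vector values, which is worth double checking against the comments in xsi.h.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for s_xsi_vlog_logicval, so this sketch compiles without xsi.h.
struct LogicVal { uint32_t aVal; uint32_t bVal; };

// Decode bit `pos`, assuming the standard Verilog VPI pairing:
// (a,b) = (0,0) -> '0', (1,0) -> '1', (0,1) -> 'Z', (1,1) -> 'X'.
char decode_bit(const LogicVal& v, unsigned pos) {
    assert(pos < 32);
    unsigned a = (v.aVal >> pos) & 1u;
    unsigned b = (v.bVal >> pos) & 1u;
    if (b == 0) return a ? '1' : '0';
    return a ? 'X' : 'Z';
}

// Plain 0/1 values only need the first integer; the second stays zero.
LogicVal from_bits(uint32_t bits) { return LogicVal{bits, 0u}; }
```

As long as the second integer is zero, the first integer can simply be read as an ordinary binary number, which is what we rely on in the rest of this post.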

If we want to read a value from a port, instead of writing, we will use get_value:

    s_xsi_vlog_logicval count_val = {0X00000000, 0X00000000};
    Xsi_Instance.get_value(port, &count_val);
Again, the rules of two valued structures apply.

One important thing to do when using C++ code with Vivado, is that you need to advance in time in order to do anything useful. To advance in time, you need to call run(), like this:

        Xsi_Instance.run(10);
The parameter needs to be in time precision units. In Verilog, time precision is typically defined with a timescale directive in the beginning of a Verilog file, like this:

`timescale 1ns/1ps
In this case, one time precision unit would be 1ps. Calling the run() method with a value of 10 in this scenario would thus advance time by 10ps.

Let us wrap up this section by looking at a concrete C++ example, generating a 100MHz clock signal:

const s_xsi_vlog_logicval one_val  = {0X00000001, 0X00000000};
const s_xsi_vlog_logicval zero_val = {0X00000000, 0X00000000};
int i_clk = Xsi_Instance.get_port_number("i_clk");

while (1) {
    Xsi_Instance.put_value(i_clk, &zero_val);
    Xsi_Instance.run(5000);
    Xsi_Instance.put_value(i_clk, &one_val);
    Xsi_Instance.run(5000);
}
This example starts by setting the i_clk port to zero, and waiting 5000 time precision units (i.e. 5000ps). This is 5ns, which is half of a 100MHz clock cycle. After waiting 5ns, the i_clk port is set to one, followed by another 5ns wait, after which the whole process repeats.
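The value 5000 follows directly from the clock frequency and the time precision. As a sketch of the arithmetic, the following hypothetical helper (half_period_units is not part of the Vivado API) calculates the value to pass to run() for a given clock frequency, with the time precision given in picoseconds:

```cpp
#include <cassert>
#include <cstdint>

// Number of time precision units in half a clock period.
// precision_ps is the precision part of the timescale directive in
// picoseconds, e.g. 1 for `timescale 1ns/1ps.
uint64_t half_period_units(uint64_t freq_hz, uint64_t precision_ps) {
    assert(freq_hz > 0 && precision_ps > 0);
    const uint64_t ps_per_second = 1000000000000ULL;
    uint64_t period_ps = ps_per_second / freq_hz;  // integer division, rounds down
    return period_ps / (2 * precision_ps);
}
```

For 100MHz and a precision of 1ps the period is 10000ps, so half a period is 5000 units, the value used in the loop above.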

Basics of using the Gisselquist Test Bench

Now, let us have a look at using the Gisselquist Test Bench. The key file we will use from this test bench is https://github.com/ZipCPU/sdspi/blob/master/bench/cpp/sdspisim.cpp.

This file simulates responses from an SD Card. In this file there is only one method we are interested in:

...
int	SDSPISIM::operator()(const int csn, const int sck, const int mosi) {
...
}
...
This method needs to be called repeatedly during a simulation, at least once per clock transition. The parameters to this method correspond to a handful of output ports of the SDSPI main module:
  • o_cs_n
  • o_sck
  • o_mosi
So, each time this method is called, Xsi_Instance.get_value(port, &count_val) needs to be called for each of these ports, and all the values passed as parameters to operator(). Keep in mind that count_val is a two valued structure, so you will need to pass the first value of the structure.

You will notice that operator() also returns a value. This is the value for the i_miso port of the SDSPI main module.  When calling operator() you should also do a put_value() every time for the i_miso port.

Let us write a quick outline how we will use operator() during simulation:

...
const s_xsi_vlog_logicval one_val  = {0X00000001, 0X00000000};
const s_xsi_vlog_logicval zero_val = {0X00000000, 0X00000000};
...
// Read a port and return only aVal, which holds the plain 0/1 bits.
int get_port_value(Xsi::Loader& xsi, int port) {
    s_xsi_vlog_logicval count_val = {0X00000000, 0X00000000};
    xsi.get_value(port, &count_val);
    return count_val.aVal;
}
...
// Feed the SDSPI outputs to the SD Card model and drive its answer back on i_miso.
// m_sdspi is assumed to point to an SDSPISIM instance created elsewhere.
void update_spsi(Xsi::Loader& xsi) {
    int o_cs_n = xsi.get_port_number("o_cs_n");
    int o_sck = xsi.get_port_number("o_sck");
    int o_mosi = xsi.get_port_number("o_mosi");
    int i_miso = xsi.get_port_number("i_miso");
    int m_value = (*m_sdspi)(get_port_value(xsi, o_cs_n), get_port_value(xsi, o_sck), get_port_value(xsi, o_mosi));
    xsi.put_value(i_miso, m_value ? &one_val : &zero_val);
}
...

int i_clk = Xsi_Instance.get_port_number("i_clk");

while (1) {
    Xsi_Instance.put_value(i_clk, &zero_val);
    Xsi_Instance.run(5000);
    update_spsi(Xsi_Instance);
    Xsi_Instance.put_value(i_clk, &one_val);
    Xsi_Instance.run(5000);
    update_spsi(Xsi_Instance);
}
...
This concludes a high-level overview of using the Test Bench. We still need to fill in the remaining details, like using the Wishbone bus and issuing commands to the SD controller. We will cover this in the remaining sections of this post.

Using the Wishbone bus

As mentioned earlier, we need to make use of the Wishbone bus to access the SD Card core from Gisselquist.

Let us start by having a quick look at the signals involved in the wishbone bus:

		input	wire		i_wb_stb, i_wb_we,
		input	wire [AW-1:0]	i_wb_addr,
		input	wire [DW-1:0]	i_wb_data,
		output	wire		o_wb_stall,
		output	reg		o_wb_ack,
		output	reg [DW-1:0]	o_wb_data,
Here follows a discussion on the signals:
  • i_wb_stb: Signals a command to read/write. This signal needs to be de-asserted at the next clock cycle.
  • i_wb_we: Signals whether the command is a read or a write.
  • i_wb_addr: Address to read/write
  • i_wb_data & o_wb_data: data for writing or result of read
  • o_wb_stall: Indicates if the core is unable to accept a command at a point in time.
  • o_wb_ack: Indicates whether a command has completed.
At this point the question is, how do we address the SD Card module so that an SD Card comes to life? To answer this question, let us have a look at some of the protocols involved with an SD Card. The following link gives a nice overview of these protocols:

https://openlabpro.com/guide/interfacing-microcontrollers-with-sd-card/

The following diagram, which I also used from the above link, gives an example of a typical command issued to an SD Card:

Every command starts with a command byte, followed by 4 bytes of arguments. So in total a command consists of 5 bytes. However, a Wishbone bus can only work with 4 bytes of data at a time. The SD Card module deals with this via two separate addresses, namely address 0 and address 1:
  • Write the command byte to address 0
  • Write the four argument bytes to address 1. This address is actually a data register that will store the argument bytes for later use.
Actually, when sending a command to the SD Card module, you should write the parts in the reverse order. That is, first write the argument bytes to address 1, the data register, and then write the command byte to address 0.

As soon as you have written the command byte to address 0, the SD Card module will start transmitting the whole command to the SD Card via the MOSI port, starting with the command byte, and following with the arguments stored in the data register.
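The sequence described above can be modelled in a few lines of C++. This is a sketch only: SdCommandPort is a made-up name and not part of the Gisselquist sources, the argument bytes are assumed to go out most significant byte first, and any trailing CRC byte that a real SD command frame carries is left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of the two-address command interface described above.
struct SdCommandPort {
    uint32_t data_register = 0;           // address 1: the four argument bytes
    std::array<uint8_t, 5> last_frame{};  // bytes sent on MOSI, in order
    int frames_sent = 0;

    // `value` written to address 0 is assumed to be an SD Card command byte.
    void write(unsigned address, uint32_t value) {
        assert(address < 2);
        if (address == 1) {               // only remember the arguments
            data_register = value;
            return;
        }
        // Address 0: send the command byte first, then the stored arguments.
        last_frame = std::array<uint8_t, 5>{{
            static_cast<uint8_t>(value),
            static_cast<uint8_t>(data_register >> 24),
            static_cast<uint8_t>(data_register >> 16),
            static_cast<uint8_t>(data_register >> 8),
            static_cast<uint8_t>(data_register)}};
        ++frames_sent;
    }
};
```

The model makes the required ordering visible: if address 0 is written before address 1, the frame goes out with whatever argument bytes happened to be in the data register at that point.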

Results of the simulation

Time to run a simulation with the Gisselquist core. The goal of the simulation is to give a simple command to the SD Card, which in this case is simulated by a test bench, and see if we get the appropriate response back.

Now, the Test Bench provided by Gisselquist generates a wdb waveform of the simulation, which can be viewed by GTKWave. However, should it be preferred to rather view the waveform in Vivado, Vivado can also open wdb files.

To open a wdb file in Vivado, from the flow menu item in the menu bar simply select Open Static Simulation. You can then browse for the wdb file in the file system and open it.

Let us look at some waveforms. First, here is where we issue a command:


The command is being output on the o_mosi line, and in this case it is 0x40, which instructs the SD Card to go into idle mode.

If you have a look at the timescale, you will see that o_sck is clocking the SD Card at 10MHz. This is compared to the usual 400KHz which you should clock the SD Card at from startup. We will address this in the next post, when we will try to run the command on a real FPGA.

Next, let us look at the response from the SD Card:


The response is provided in the i_miso signal. The value we get back is 0x01, which means the card is in the idle state.

In Summary

In this post we gave a high level overview of the SD Card core by Dan Gisselquist. We also did a very simple simulation via the Test Bench that Dan Gisselquist provides.

While writing this post, I actually discovered that Dan Gisselquist maintains a blog where he discusses a variety of topics regarding FPGA development. If you are interested, head over to his blog at www.zipcpu.com.

In the next post we will be trying out the SD Card Command example on a real Arty A7.

Until next time!









Thursday 27 October 2022

SD Card Access for a Arty A7: Part 1

Foreword

In the previous post we wrote some code enabling the reading and writing of 16 bits at a time from DDR3 memory, as required by an Amiga FPGA core. All this effort was required because DDR3 memory doesn't work with a single 16-bit group at a time, but with several.

Luckily we could get around this "hurdle" of DDR3 by making use of data masks.

At this point in the journey to implement an Amiga core on an Arty A7, it is becoming clear that we will need some sort of non-volatile storage for storing ROMs and disk images. We will be using SD Card storage as non-volatile storage on the Arty A7.

The technical details of interfacing with an SD Card via an FPGA can be quite overwhelming, even if you are using an existing library. So, in order to make it less intimidating, I will make a multi-part series on interfacing with the SD Card.

This post forms the first in the series on interfacing an Arty A7 with an SD Card. More parts to come in the future.

Using PMOD Headers

When I started this blog, more than three years ago, the FPGA I started playing with was a Zybo board. The nice thing about this board was that it had a built-in SD Card slot.

On the Arty A7 board, however, we don't have such a luxury. All hope is not lost, however. One can add an external SD Card reader by means of a PMOD header.

For those that are not familiar with PMOD headers, let us go into a bit more detail on what PMOD headers are and how they are used.

PMOD is a standard Digilent uses on their development boards. Digilent, by the way, is the manufacturer that produces both the Arty A7 board and the Zybo board.

Let us start by having a look at the PMOD headers on my Arty A7 board:


As you can see, each PMOD header is 6 pins wide and 2 pins high. The pin spacing for these headers is fairly standard, and is the same as you will find on a breadboard or the GPIO pins on a Raspberry Pi.

One thing that might annoy the alert reader is that there is no 'safety notch' to prevent inserting the PMOD module the wrong way around. It is indeed possible to insert a PMOD module the wrong way around, so just keep this in mind when working with PMODs. More on this a bit later.

At this point, the question is: Which PMOD header do we use on the Arty A7 board? At first sight, this may sound like a stupid question. I mean, if you have a PC with a number of USB ports, you just pick one, right?

Well, with PMODs you need to exercise a bit more caution. Let me explain. You get two types of PMOD: Standard and high speed. With a standard PMOD the current is limited on each pin by means of resistors. This is a kind of safety feature, so that if you accidentally short two pins on the header, the FPGA doesn't go up in smoke.

The downside of these resistors is that they also limit the maximum speed that the pins on the PMOD can run at and this is where high speed PMOD comes in. On a high speed PMOD, these protection resistors are absent, which makes it possible to make your FPGA go up in smoke 😱.

So, there is nothing wrong with using a high speed PMOD, but if you don't need the speed it provides, it is safer to go for the Standard PMOD.

Let us see where these types of PMODs are located on the Arty A7. I got this picture from the Arty A7 Reference manual, hosted here:


The PMODs are located right at the top, as indicated by callout #16. If you look closely at the diagram, you will see the PMODs are numbered from left to right as JA, JB, JC and JD.

The Reference manual states that JB and JC are high speed PMODs and JA and JD are standard PMODs. At first I found it quite intriguing that the high speed PMODs are next to each other, while the Standard PMODs are not. I eventually found some answer for this when I saw what a VGA PMOD looks like:

Here we see that a VGA PMOD needs two separate PMOD headers next to each other. This module needs High Speed PMODs, and thus it makes some sense to me that Digilent put two High Speed PMODs next to each other on their development boards. All the other Standard PMOD modules I have seen for sale only use a single PMOD header, so the location of Standard PMODs doesn't really matter.

The SD Card PMOD

So, I bought myself a PMOD SD Card Reader from RS Components. With everything removed from the bag of the courier and excess packaging, this is what I got:


This module is for the full size SD Card, so I will need to use an adaptor for reading a MicroSD Card.

There is a nice diagram on the packaging indicating which pin is which on the PMOD header. The Technical reference for this module, here, makes mention that there is a small printing error on the printed diagram on the packaging. However, I verified the diagram on my packaging, and it looked correct to me. Perhaps they fixed it in later revisions.

Now, to come back to our question from the previous section, how do we determine the correct orientation for inserting this module into the PMOD header? Start by flipping this module around, so that the SD Card slot faces to the top:


In this picture I have indicated with a red arrow the presence of a "1" on the PCB. This marks pin 1 of the PMOD header, which if you follow this pin from the PCB, is the top right pin on the PMOD header.

Compare this to the pin out of a PMOD header, which I also retrieved from digilent.com :


This means the upright orientation of the SD Card slot, as shown above, is the correct way to insert the module into the PMOD Header.

I feel I was quite pedantic, just to explain which side is up 😂.

About the Software

We have more or less the hardware in place for reading from an SD Card. However, we still need the necessary software for interfacing with the SD Card.

I believe Vivado does provide some IP blocks for interfacing with an SD Card reader. However, I am not a big fan of these proprietary blocks, and would rather use an open-source module.

There is indeed an open-source module on GitHub developed by Dan Gisselquist. This module is written in Verilog. We will explore this project more in the next post.

In Summary

In this post we explored how to interface the Arty A7 with an SD Card for non-volatile storage. I will be using a PMOD module from Digilent that will act as the SD Card reader.

In the next post we will be exploring the SD Card project by Dan Gisselquist, which we will use to interface the Arty A7 with the SD Card reader.

Until next time!

Sunday 11 September 2022

Reading/Writing Data in 16-bit chunks from RAM

Foreword

In the previous post we reduced trailing latency for DDR RAM access, providing us with the memory throughput required by an Amiga core.

At first sight it seems that an Amiga core cannot really work with DDR3 memory. An Amiga core works with 16 bits at a time from memory, whereas DDR3 RAM works with bursts of 4 or 8 16-bit words at a time. So, in this post we will see if we can find a way to work with DDR3 memory 16 bits at a time.

Writing 16-bits at a time

On our journey to tackling 16-bits at a time, let us have a look at writes.

As mentioned earlier, DDR3 RAM works with either 4 or 8 bursts at a time. With an Amiga core in the picture, only one of these 4/8 bursts will ever be a valid write, and the memory covered by the remaining bursts will be unintentionally overwritten.

Somehow we need to be able to tell the DDR3 RAM which of the bursts contains a valid write, and indeed DDR3 memory provides a way to do this.

All DDR3 memory contains an input signal called Data Mask, abbreviated DM. During a burst session, the DDR3 RAM will examine the DM signal at every data burst. If the signal is a 1, i.e. the burst is masked, the burst will be ignored. If, however, the signal is 0, i.e. unmasked, the burst will be considered valid and the relevant location in memory will be updated with the value.

Therefore, with an Amiga core, with eight bursts there will always be only one timeslot where the DM signal will be 0 and the rest will be ones.

Let us take an example. Suppose we want to write the value 25 to address 13. Burst writes always start at 8 byte boundaries, like addresses 0, 8, 16, 24 and so on. Address 13 falls within the boundary 8 to 15. So, the DM values for these bursts will look as follows:

DM:       1  1  1  1  1  0  1  1
Address: 08 09 10 11 12 13 14 15

Concerning the data, we can just repeat the data value 25 8 times, making life easier.
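
As a quick sanity check of the example above, here is a small Python sketch that works out the DM pattern for a given address. The function names dm_pattern and dm_slot_byte are my own and only serve as illustration. The sketch assumes the simple rule that the lower three bits of the address select the unmasked slot.

```python
def dm_pattern(address, burst_length=8):
    """DM value per burst, in address order: 0 = unmasked (written), 1 = masked."""
    slot = address % burst_length
    return [0 if i == slot else 1 for i in range(burst_length)]


def dm_slot_byte(address):
    """Same pattern packed into a byte, with bit k cleared for slot k."""
    return ~(1 << (address & 7)) & 0xFF


print(dm_pattern(13))          # [1, 1, 1, 1, 1, 0, 1, 1]
print(bin(dm_slot_byte(13)))   # 0b11011111
```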

Now, let us write some code. First we need to reduce data_in/data_out ports of mem_tester to 16 bits:

module mem_tester(
    input clk,
    // 0 - reset
    // 1 - ready
    input [2:0] cmd_status,
    output reg select = 0,
    output reg refresh = 0,
    output reg write,
    output [15:0] address_out,
    output wire [/*127*/15:0] data_out,
    input [/*127:0*/15:0] data_in
    );
...
endmodule
Let us now move on to the file mcontr_sequencer.v, which contains the state machine that breaks up the commands from mem_tester into DDR3 memory commands. One of the selectors we need to change as follows:

              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= {8{cmd_data_out}};
                          //column_address <= cmd_address[9:0];
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
We are basically duplicating the data_out of mem_tester 8 times, which will be fed to the OSERDES module that will repeat the same burst 8 times during a Write.

Now, we still need to assert the correct DM timeslot, so that data is written to the correct location in memory. We do this with the help of the lower three bits of the command address sent by the mem_tester:
    always @*
    begin
        if (cmd_address[2:0] == 0) 
        begin
            dm_slot = ~1;
        end else if (cmd_address[2:0] == 1)
        begin
            dm_slot = ~2;
        end else if (cmd_address[2:0] == 2)
        begin
            dm_slot = ~4;
        end else if (cmd_address[2:0] == 3)
        begin
            dm_slot = ~8;
        end else if (cmd_address[2:0] == 4)
        begin
            dm_slot = ~16;
        end else if (cmd_address[2:0] == 5)
        begin
            dm_slot = ~32;
        end else if (cmd_address[2:0] == 6)
        begin
            dm_slot = ~64;
        end else if (cmd_address[2:0] == 7)
        begin
            dm_slot = ~128;
        end
    end
Here dm_slot is the data we need to feed the OSERDES component dealing with the DM output. Once the mem_tester has asserted an address, the OSERDES component for DM will output the 8-bit pattern continuously, until the DDR3 RAM is at the phase of receiving data that should be written. During this phase the DDR3 RAM will look for the DM slot which is zero as its cue.

Using the lower three bits of the address to decide which DM slot to enable is quite a nice rule of thumb. However, beware that due to so many things happening from the time mem_tester asserts a command, until we get to the point where the DDR3 memory actually reads the bursts, your guess on the correct DM slot versus the actual one might be out by a time slot or two.

One way to determine the correct DM slots would be to run a number of simulations and determine these values experimentally. However, in my experience with this particular setup, you will be able to make the simulation environment work perfectly, but the actual FPGA might again differ by a time slot or two from the simulation environment. Scary stuff indeed! The simulation environment is supposed to behave exactly as the hardware does in practice. I still need to narrow down why theory is different from practice, but for the moment I need to live with the difference.

Ultimately, this means that I also need to obtain DM slot values experimentally when running on the real FPGA, which sounds like quite a tall order. However, I found a solution for this, which I will share a bit later in another section.

Reading 16 bits at a time

Let us now look into reading 16 bits at a time. With reading we don't need to worry about masking off certain bursts. We can let the DDR3 send its usual 8 bursts at a time, and we just wait for the burst we are interested in.

Issues can arise if the burst we are interested in is towards the end, adding unnecessary latency and potentially missing the deadline when the data is required.

We could perhaps do better by asking the RAM to give us the data we are interested in first. Indeed, the RAM datasheet does indicate that we can access RAM in such a fashion:


Everything is driven by the lowest three bits of the address. If the lowest three bits are zero, byte zero will be presented first. Similarly, if these three bits are one, then byte 1 will be presented first. This follows a nice pattern until the three bits are seven, at which point byte 7 is presented first.

At this point we still have the uncertainty whether data is written in the correct DM slot, as outlined in the previous section. Let us start somewhere and start writing some further code. The ultimate test is that mem_tester should write to memory locations 0 to 15 in sequence, the values 0 to 15 in the same sequence. If we read the values back in the same sequence, we expect the same values.

Here is a snippet of the simulation waveform:


The ddr3_dq signal shows the result of an 8 burst read. The data_cap signal shows the value we have captured from the burst, which in this case is bb0b. This indicates that we captured the third burst, which has the same value.

Receiving our required data only at the third burst complicates our life a bit. To see why, let us revisit the Burst table of the DDR3 datasheet I presented earlier on:


As indicated by the red column, if we were to receive our data at the first burst, life would have been much simpler. In this case the DDR3 will just give us the correct data for the lowest three bits of the address.

Receiving data at the third burst makes things more complex, as shown in the green column. Supplying address 0 will give us the data of address 2. Supplying address 1 gives the data of address 3, and so on. Overall, the column doesn't follow a nice sequential pattern: 2, 3, 0, 1, 6, 7, 4, 5.
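
To see where this pattern comes from, here is a small Python sketch that rebuilds the burst order. It assumes the sequential burst type with a burst length of 8, where the order wraps within each group of four. The function name ddr3_sequential_burst_order is my own.

```python
def ddr3_sequential_burst_order(start):
    """Order in which the 8 words of a burst arrive for a given 3-bit start address.

    Assumes the sequential burst type with burst length 8: the starting
    group of four wraps around on itself first, then the other group follows.
    """
    low, high = start & 3, start & 4
    first = [high | ((low + i) & 3) for i in range(4)]
    second = [(high ^ 4) | ((low + i) & 3) for i in range(4)]
    return first + second


# Word seen at the first and at the third burst for each supplied address.
first_burst = [ddr3_sequential_burst_order(a)[0] for a in range(8)]
third_burst = [ddr3_sequential_burst_order(a)[2] for a in range(8)]
print(first_burst)   # [0, 1, 2, 3, 4, 5, 6, 7]
print(third_burst)   # [2, 3, 0, 1, 6, 7, 4, 5]
```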

The easiest way to do translation with this non-sequential pattern, would be to use a look-up table. With a lookup-table we can also solve our potential problem in the previous section of finding the correct DM-slot for writes.

Let us start building this lookup table. Let us look at the screenshot of the simulation again as an example for the first value. In this waveform we requested address c, but we got b. We can state this info in another way: To get address b, we need to specify address c, or write it like this:

b -> c

This is one entry for our lookup table. However, our lookup table only needs 3 bits, so let us convert to binary:

1011 -> 1100

which resolves to 3 -> 4

Let us now find similar lookup values for supplied addresses in the range 0 to 7. We start by writing down the numbers 0 to 7, and what the actual address was we got:
0 - 7
1 - 0
2 - 5
3 - 6
4 - 3
5 - 4
6 - 1
7 - 2

Now we swop the columns around to get the lookup table:

7 -> 0
0 -> 1
5 -> 2
6 -> 3
3 -> 4
4 -> 5
1 -> 6
2 -> 7
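
The column swap can also be done mechanically. Here is a small Python sketch of the idea, using the simulation values listed above. The names observed and invert are my own.

```python
# supplied address -> address whose data we actually received (simulation)
observed = {0: 7, 1: 0, 2: 5, 3: 6, 4: 3, 5: 4, 6: 1, 7: 2}


def invert(mapping):
    """Swap the columns: wanted address -> address we need to supply."""
    inverse = {got: supplied for supplied, got in mapping.items()}
    if len(inverse) != len(mapping):
        raise ValueError("mapping is not one-to-one")
    return inverse


lookup = invert(observed)
for wanted in sorted(lookup):
    print(wanted, "->", lookup[wanted])
```

Sorted by the wanted address, the table reads 0 -> 1, 1 -> 6, 2 -> 7, 3 -> 4, 4 -> 5, 5 -> 2, 6 -> 3 and 7 -> 0. The entries are the same as in the table above, just in a different order.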

Now we can implement the mapping in verilog code:

    //Sim mapping
    always @*
    begin
        if (cmd_address[2:0] == 0)
        begin
            map_address = 1;
        end else if (cmd_address[2:0] == 1)
        begin
            map_address = 6;
        end else if (cmd_address[2:0] == 2)
        begin
            map_address = 7;
        end else if (cmd_address[2:0] == 3)
        begin
            map_address = 4;
        end else if (cmd_address[2:0] == 4)
        begin
            map_address = 5;
        end else if (cmd_address[2:0] == 5)
        begin
            map_address = 2;
        end else if (cmd_address[2:0] == 6)
        begin
            map_address = 3;
        end else
        begin
            map_address = 0;
        end
    end
Wherever we need to specify the new address, we need to use map_address for the lower bits:

              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]},
                                       1'b0, 4'h1, (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= {8{cmd_data_out}};
                          //column_address <= cmd_address[9:0];
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
Let us now move on, to examine the data when running on the actual FPGA. The following waveform shows some signals captured, while the core was running on the FPGA:


To save space, I have omitted the captions on the left, otherwise everything looks very tiny. The omitted captions, from top to bottom are as follows:

  • Address Out
  • Clk of mem_tester
  • Captured data
  • Write/read
As you can see during this capture, the write/read signal is low, so these are all reads. 

With the red arrows I have indicated where we are asserting commands. In the clock cycle following each red arrow, we capture the actual data. From this diagram, we can see the address and corresponding data is as follows:
  • Address 0 -> 5
  • Address 1 -> 6
  • Address 2 -> 7
  • Address 3 -> 0
  • Address 4 -> 1
  • Address 5 -> 2
Missing from the diagram are addresses 6 and 7, which will yield values 3 and 4 respectively. So, the mapping for running on the actual FPGA will be as follows:

    always @*
    begin
        if (cmd_address[2:0] == 0)
        begin
            map_address = 3;
        end else if (cmd_address[2:0] == 1)
        begin
            map_address = 4;
        end else if (cmd_address[2:0] == 2)
        begin
            map_address = 5;
        end else if (cmd_address[2:0] == 3)
        begin
            map_address = 6;
        end else if (cmd_address[2:0] == 4)
        begin
            map_address = 7;
        end else if (cmd_address[2:0] == 5)
        begin
            map_address = 0;
        end else if (cmd_address[2:0] == 6)
        begin
            map_address = 1;
        end else
        begin
            map_address = 2;
        end
    end
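As a cross-check of the FPGA mapping, here is a small Python sketch. It uses the captured values listed above and tests whether the mapping is simply a rotation by three positions. That it is a plain rotation is my own observation from the numbers, not something taken from a datasheet, and the names observed_fpga and map_address are only for illustration.

```python
# supplied address -> value read back on the real FPGA (from the capture)
observed_fpga = {0: 5, 1: 6, 2: 7, 3: 0, 4: 1, 5: 2, 6: 3, 7: 4}


def map_address(wanted):
    """Address to supply in order to read back 'wanted' (rotation by 3)."""
    return (wanted + 3) % 8


# Supplying the mapped address should give back the value we wanted.
for wanted in range(8):
    assert observed_fpga[map_address(wanted)] == wanted

print([map_address(a) for a in range(8)])   # [3, 4, 5, 6, 7, 0, 1, 2]
```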

In Summary

In this post we implemented 16-bit reading/writing from DDR3 memory.

At this point in time we have a potential DDR3 memory solution for an Amiga core. We still need to fill this memory with ROM contents, which it makes sense to load from an SD Card.

So, in the next post we will start to work on an FPGA design where we read data from an SD Card.

Until next time!


Thursday 4 August 2022

Shrinking trailing latency

Foreword

In the previous post we managed to reduce initial latency in order to improve throughput.

Just to put everything into perspective again. Our memory controller is clocking at 16.7MHz. A request to read/write from memory is provided at the first clock pulse, and data is expected at the second pulse.

In the previous post we found that requested data from memory was available shortly after the second pulse, which is simply too late.

In this post we will attempt to reduce some of the trailing latency even further so at least we can have the requested data before the second pulse.

The exercise of reducing trailing latency has proven not to be so major, so you will find that this post is shorter than usual.

The Plan

Let us quickly review our plan by looking at the diagram below.

Point 1 is the first clock edge of the 16.7MHz clock signal. At this point we assert a command for memory access.

At point 2 we are expecting the required data to be ready. However, the data is captured only at point 3, which is way after the second required clock edge.


My plan is to rather capture the data at the negative clock edge of the clock that drives the output of the Iserdes block, which I have indicated in the diagram as point 4. In the next section I will indicate how to accomplish this.

Capturing at the right moment

Let us now see how we can capture the data at the right moment, as outlined in the previous section.

From the diagram in the previous section, we have seen that the capture should happen at state 2f.

The state register is maintained in the module mcontr_sequencer, and we check for state 2f:

module  mcontr_sequencer   #(
  ...
)(
 ...
);
...
assign store_captured_data = state == 7'h2f;
...
endmodule
We can now pass this signal down through subsequent modules until we reach the module iserdes_mem.
module  iserdes_mem #
(
   ...
) (
...
    input        store_captured_data,
...
);
...
always @(negedge oclk_div)
begin
    if (store_captured_data)
    begin
        dout_le <= {dout_le[3:0], iserdes_out};
    end
end
...
endmodule
As you can see, we do the capture on the negative edge of the clock that clocks the output of an iserdes block.

With this logic we will be able to capture data at point 4, indicated in the diagram of the previous section.

In Summary

In this post we managed to shorten the trailing latency, so that we would be able to capture the required data at the second pulse of a 16.7MHz clock signal.

So far in the game, we have worked with multiple 16-bit bursts at a time with every memory access. This is not really suitable for an Amiga based design, which only works with a single 16-bit piece of data at a time from memory.

So, in the next post we will start focussing on changing our design so that it works only with one 16-bit value at a time from memory.

Till next time!

Sunday 3 July 2022

Shrinking Latency

Foreword

In the previous post we created a very elementary memory tester for the Arty A7 board to see if we more or less got the logic correct for writing and reading to memory.

When I wrote the memory tester, I added quite a bit of padding between DDR commands, just to avoid violating some DDR timing parameters. The purpose of this exercise was just to get the memory tester working, and not to worry at that point about getting the most efficient timing possible. As the old saying goes, when eating an elephant, you do so one bite at a time. 😀

In this post we will revisit timings for our elementary memory tester, and see where we can remove any wasted clock cycles. The ultimate goal is to be able to access memory at a rate of at least 7MHz, which will be sufficient to emulate an Amiga core.

Reducing initial latency

Let us see if we can reduce initial latency. That is the latency from the moment the Memory Tester asserts a command, until the time when DDR RAM receives the first command for fulfilling this request. This is all illustrated with the following hand drawn diagram:


In this diagram every division indicated has a period of 1.5ns and we show a couple of clock signals. Let us start by having a look at the frequencies of these clock signals:

  • Memtester has a frequency of 20MHz, and I am not showing a complete cycle of it.
  • Oserdes Out has a period of 2 divisions = 3ns. This equals a frequency of 333MHz. This is the signal driving the commands out to the DDR RAM.
  • Oserdes Load has a period of 8 divisions = 12ns. This equals a frequency of 83MHz. This clock signal is used to load the OSERDES block with 4 bits worth of data at a time. 
  • As you can see from the diagram, Mclk has exactly the same frequency as Oserdes Load, but is shifted 45 degrees.
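
The frequencies in the list follow directly from the number of divisions. Here is a small Python sketch of the arithmetic, assuming the 1.5ns per division stated above. The name frequency_mhz is my own.

```python
DIVISION_NS = 1.5   # one division in the diagram


def frequency_mhz(divisions):
    """Frequency in MHz of a clock whose period spans the given divisions."""
    period_ns = divisions * DIVISION_NS
    return 1000.0 / period_ns


print(round(frequency_mhz(2), 1))   # 333.3  (Oserdes Out, 3ns period)
print(round(frequency_mhz(8), 1))   # 83.3   (Oserdes Load and Mclk, 12ns period)
```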
Let us now look at the initial flow of events when the Mem Tester asserts a command. The Mem Tester asserts a command at point A.

We capture this command from the Mem Tester at point B. You might remember from the previous post that we store this captured command in a register called test_cmd.

Our test_cmd register is hooked up to the inputs of the Oserdes blocks, which load values from test_cmd on the rising edge of the Oserdes load clock.

From the diagram you will see that point B will happen a small amount of time after a rising edge of Oserdes Load. This means that the Oserdes blocks will not capture the data of test_cmd straight away. This will only happen at point C.

With the way Oserdes blocks work, they will not start outputting the 4-bit sequence to DDR RAM at point C, but rather at the following rising edge of the Oserdes load clock.

As you can see quite a bit of time is wasted from the time a command is asserted by the mem tester, until the time the first command is issued to DDR RAM, resulting in reduced throughput.

There are a number of ways we can reduce this initial latency. For starters, quite a bit of time is wasted by first storing a value into a register, making this data only available to the rest of the system at the following clock cycle.

For the initial command to the DDR originating from the mem tester, we can bypass the test_cmd register and feed it directly to the Oserdes blocks. We can do this as follows:

...
    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out) 
           ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd} : test_cmd;
...
    phy_cmd #(
      ...
    ) phy_cmd_i (
      ...
        .phy_cmd_word        (result_cmd),
	  ...
    );
...
So, if we are in the state WAIT_CMD, the mem tester indicates the command is valid, and it is not a refresh command, we don't use test_cmd, but build up a row activate command on the fly. With this setup the oserdes components will load on the first oserdes load-clock edge following the mem tester clock edge, and will basically start outputting commands to DDR RAM at point C in the diagram. In effect we have shaved off a full oserdes load-clock cycle of latency.

The period between the rising edge of the mem tester-clock cycle and the following oserdes load clock-cycle is 1.5ns. During testing I have found that this period doesn't allow for proper settling of all bits of the command from the mem tester.

To make everything work ok, I had to double this period between the two rising edges to 3ns. To accomplish this, I had to adjust the phases of the clock that the MMCM block provides. The adjustments of the notable clocks are as follows:
  • Oserdes data load: From 45 degrees to 90 degrees.
  • Serial data out: From 180 degrees to 0 degrees.
  • Mclk: From 90 degrees to 135 degrees.
So, in effect all the clocks need to shift by a period of 1.5ns with the exception of the mem tester clock.
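
We can check that each of these phase changes indeed amounts to 1.5ns. The Python sketch below assumes that the Oserdes data load clock and Mclk have a 12ns period and that the serial data out clock is the 3ns clock. The function name phase_shift_ns is my own.

```python
def phase_shift_ns(old_deg, new_deg, period_ns):
    """Time shift in ns when a clock's phase moves from old_deg to new_deg."""
    delta = (new_deg - old_deg) % 360
    return delta / 360.0 * period_ns


# name: (old phase, new phase, assumed period in ns)
clocks = {
    "Oserdes data load": (45, 90, 12.0),
    "Serial data out": (180, 0, 3.0),
    "Mclk": (90, 135, 12.0),
}

for name, (old, new, period) in clocks.items():
    print(name, phase_shift_ns(old, new, period))   # 1.5 for each clock
```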

Changing command slots

As mentioned previously, the Oserdes components for outputting commands to DDR receive 4 bits of data at a time. At any one point in time only one of these 4 bits will contain a command. The other three bits will be part of NOP commands.

At this point in time we don't have much flexibility at which of the four time slots a command can be issued. If we decide, for instance, that slot 2 should be used for commands, then every command we issue should be issued at slot 2, i.e. we cannot alternate between different slots between commands.

This inability to alternate between different slots can also cause unnecessary latency.

Let us take an example.  The DDR RAM on the Arty A7 has a minimum latency of 5 clock cycles between commands. Let us assume we have decided to use the first slot as the command slot and we want to give a row activate command followed by a column read command. Let us quickly visualise the time slots:

1000 1000

Here we can see that issuing an Activate command in the first four time slots and then the read column command in the next four time slots is not going to work for us. In this setup there are 4 cycles of latency between the commands instead of the required 5. Because we cannot alternate between time slots, we will need to issue the commands like this:

1000 0000 1000

We can clearly see that we are wasting 4 cycles just because we cannot alternate between time slots, in order to guarantee a minimum of 5 cycles between commands.
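
To illustrate what alternating between time slots buys us, here is a small Python sketch that places the second command as early as the minimum gap allows. It does this once with a fixed slot and once with a free choice of slot. The helper names second_command_cycle and render are my own, and the model only looks at the gap between two commands.

```python
SLOTS_PER_WORD = 4   # the Oserdes blocks are loaded with 4 time slots at a time


def second_command_cycle(first_cycle, min_gap, fixed_slot=None):
    """Earliest cycle for the second command, optionally tied to one slot."""
    cycle = first_cycle + min_gap
    if fixed_slot is not None:
        while cycle % SLOTS_PER_WORD != fixed_slot:
            cycle += 1
    return cycle


def render(cycles):
    """Draw the time slots, with a 1 wherever a command is issued."""
    words = max(cycles) // SLOTS_PER_WORD + 1
    bits = ["1" if c in cycles else "0" for c in range(words * SLOTS_PER_WORD)]
    return " ".join("".join(bits[i:i + 4]) for i in range(0, len(bits), 4))


fixed = second_command_cycle(0, 5, fixed_slot=0)
flexible = second_command_cycle(0, 5)
print(render([0, fixed]))      # 1000 0000 1000
print(render([0, flexible]))   # 1000 0100
```

With a free choice of slot, the second command can go out at cycle 5 instead of cycle 8.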

For us to implement alternating between different time slots, we need to look inside the file cmd_addr.v. You can find it within the Elphel project in the path memctrl/phy/cmd_addr.v. Within this module we need to add the following input port:

input                  [1:0] cmd_slot,
This indicates at which slot number the given command should be triggered. Let us look at the write enable signal as an example on how to use this input port:

// we
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
    .dq(ddr3_we),
    .clk(clk),
    .clk_div(clk_div),
    .rst(rst),
    .dly_data(dly_data_r[7:0]),
    .din({cmd_slot == 0 ? {1'b1, 1'b1, 1'b1, in_we_r[1]} :
          cmd_slot == 1 ? {1'b1, 1'b1, in_we_r[1], 1'b1} :
          cmd_slot == 2 ? {1'b1, in_we_r[1], 1'b1, 1'b1} :  
          {in_we_r[1], 1'b1, 1'b1, 1'b1}}),
    .tin(in_tri_r), 
    .set_delay(set_r),
    .ld_delay(ld_dly_cmd[3]));
As can be seen here at port din we place the signal in a different position for each value of cmd_slot.
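
The selection at port din boils down to placing one signal bit in an otherwise idle word of four ones. Here is a small Python model of that idea. The function name din_word is my own, and the model only mirrors the bit positions used in the snippet above.

```python
def din_word(signal_bit, cmd_slot):
    """4-bit word for the Oserdes: all ones, with the signal placed at cmd_slot."""
    idle = 0b1111
    return (idle & ~(1 << cmd_slot)) | ((signal_bit & 1) << cmd_slot)


# With the signal driven low, the zero moves along with cmd_slot.
for slot in range(4):
    print(slot, format(din_word(0, slot), "04b"))
# 0 1110
# 1 1101
# 2 1011
# 3 0111
```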

We need to repeat the above for each of the command signals. For the address and bank number signals we can just repeat the value for each slot, taking address as an example:

generate
    genvar i;
    for (i=0; i<ADDRESS_NUMBER; i=i+1) begin: addr_block
//       assign decode_addr[i]=(ld_dly_addr[4:0] == i)?1'b1:1'b0;
    cmda_single #(
         .IODELAY_GRP(IODELAY_GRP),
         .IOSTANDARD(IOSTANDARD),
         .SLEW(SLEW),
         .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
         .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_addr_i (
    .dq(ddr3_a[i]),               // I/O pad (appears on the output 1/2 clk_div earlier, than DDR data)
    .clk(clk),          // free-running system clock, same frequency as iclk (shared for R/W)
    .clk_div(clk_div),      // free-running half clk frequency, front aligned to clk (shared for R/W)
    .rst(rst),
    .dly_data(dly_data_r[7:0]),     // delay value (3 LSB - fine delay)
    .din({4{ in_a_r[ADDRESS_NUMBER+i]}}),      // parallel data to be sent out
//    .tin(in_tri_r[1:0]),          // tristate for data out (sent out earlier than data!) 
    .tin(in_tri_r),          // tristate for data out (sent out earlier than data!) 
    .set_delay(set_r),             // clk_div synchronous load odelay value from dly_data
    .ld_delay(ld_dly_addr[i])      // clk_div synchronous set odealy value from loaded
);       
    end
endgenerate

Putting everything together

We are now ready to adjust our main state machine within mcontr_sequencer.v. We start with the PREPARE_CMD and WAIT_CMD states:

...
              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  cmd_slot <= 0;
                  state <= WAIT_CMD;
...
              end
              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_1;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 4'b0, cmd_address[9:0], 1'b0, 4'h1, 
                      (write_out ? 2'b11 : 2'b00), 10'h1fd};
                          cmd_slot <= 1;  
                          data_in <= cmd_data_out;
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end
...
In PREPARE_CMD we ensure that all commands will be issued at the first timeslot.

You will remember that earlier on we defined the wire result_cmd, which passes the ACTIVATE command to the DDR RAM beforehand, when the Mem Tester has asserted a command. So, within the WAIT_CMD state we can go ahead and issue a DDR Read/Write command, or a Refresh command if desired.

For the rest of the states, we remove unnecessary waits, which results in the following:

             
             STATE_PREA: begin
                  state <= WAIT_WRITE_RECOVERY;
                  dq_tri <= do_write ? 0 : 15;
                  cmd_slot <= 0;
                  test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;                 
              end
              WAIT_WRITE_RECOVERY: begin
                  test_cmd <= 32'h000001ff;
                  state <= PRECHARGE_AFTER_WRITE;
              end
              PRECHARGE_AFTER_WRITE: begin
                  do_capture <= 1;
                  state <= POST_READ_1;
                  cmd_slot <= 3;
                  test_cmd <= 32'h000029fd;
              end

              POST_READ_1: begin
                  state <= PREPARE_CMD;
                  test_cmd <= 32'h000001ff;
              end
Let us quickly go through this code. In STATE_PREA we wait for the read/write cycle to complete, and keep the ODT signal asserted during this time if it is a write.

After a read/write cycle is complete we need to close the open row in preparation for the next read/write command, by means of a PRECHARGE command. However, with DDR RAM you cannot issue a PRECHARGE command straight away, but need to wait for a period of time after the read/write cycle has completed. This time period is called Write Recovery, and on the Arty A7 this is 5 clock cycles.

Test Results

Let us see what happens in practice. The following simulation waveform shows what happens during a read:


On this waveform I have indicated points A, B, C, D and E:
  • Point A is the clock signal for our Memory Tester. Originally this frequency was 20MHz, but during experimentation, I have found that 20MHz is a tight fit. I have lowered the frequency to 16.7MHz instead. Here the Memory Tester issues the command at the first rising edge and data is available at the following rising edge.
  • At point B we issue a row Activate command.
  • At point C we issue a column read command.
  • At point D we issue the precharge command.
  • Point E indicates the point when we receive data from the data_out port and when this value is captured by the cap_value port.
As we see at point E, the data is captured after the second rising edge of the Mem Tester clock, but we require it to be captured before the second rising edge. We will tackle this in the next post.

In Summary

In this post we have attempted to reduce latency in our memory controller. We managed to reduce the time from when the memory tester issues a memory command until a physical command is issued to DDR RAM.

An issue that we still need to resolve is making read data available before the second rising edge of the Mem Tester clock.

Until next time!

Thursday 26 May 2022

Starting with a memory tester on the Arty

Foreword

In the previous post we managed to write a value to a memory location, and read the same value back.

In this post we will create a very simple memory tester, just to stress test the memory a tiny bit and see if there are any obvious setup and hold timing violations, resulting in data corruption.

This is a habit I have grown into, since your design might operate ok for a few clock cycles, but you might experience a weird glitch after a couple of thousand clock cycles, due to a setup and hold violation. So, it is always good to stress test bits of your design as soon as possible, to avoid a lot of rework.

For the memory tester I will be covering in this post, I will write test data to a couple of rows of DDR RAM, wait for about 20 seconds, and see if I can read back the correct value from a particular memory location.

Obviously, for this test I will also need to implement some refresh logic, so that the data doesn't leak away from the tiny capacitors in the DDR RAM during the 20 seconds of waiting. 

Abstracting the memory tester from Technical details

The Memory tester we will be developing in this post issues a series of write and read commands. In the future this memory tester will be replaced by the Amiga core, which will be issuing these commands.

The Amiga core doesn't understand the technical details of DDR memory, like splitting an address into a separate row address and column address. The Amiga core also doesn't know that for any memory read/write command you first need to activate a row and afterwards pre-charge it.

For all these reasons, we need to abstract the technical details of DDR memory from our memory tester.

The abstracted interface for our Memory tester looks as follows:

    
module mem_tester (
    output reg select,
    output reg write,
    output [15:0] address_out
    );
	
endmodule
Address_out is a linear address. Outside this module it will be converted to row and column addresses. For now, I am only going to make the width of this output 16 bits.

The Memory tester will use the select output to assert a command and indicate if it is a read/write command with the write output.

We need to add some more ports to our memory tester:

module mem_tester(
...
    input clk,
    input [2:0] cmd_status,
    output reg refresh = 0,
    output wire [127:0] data_out,
    input [127:0] data_in
...
    );
	
endmodule
For our input clock, I want to use a frequency of 20MHz, which is close to the frequency used by the Amiga core.

I can hear a couple of screams at this point: "Cross clock domains!". Indeed, crossing clock domains is always a pain to work with 😀.

However, in the past couple of months I have discovered that, when working with a Mixed-Mode Clock Manager (MMCM) in Xilinx FPGAs, crossing clock domains is not so bad. With an MMCM you can align the rising edges of different output clocks. Provided that the frequency of the faster clock is an integer multiple of the frequency of the slower clock, every rising edge of the slower clock will line up with a rising edge of the faster clock.

The next port, cmd_status, will indicate to our memory tester when memory is ready to accept the next command.

You will note that I also have a refresh output port, indicating that our Memory tester is also responsible for performing memory refresh. This goes a bit against our goal of abstracting the technical details of DDR RAM, but I found it difficult to orchestrate a refresh from the outside with the different clock domain.

Finally, we are sending and receiving data 128 bits at a time. This is because the DDR RAM works with bursts of 8 transfers, and each transfer is 16 bits wide. Again, this is a bit of a mismatch with the Amiga core, which works with only 16 bits at a time, but we will handle it in future when we get there.

Coding the Memory Tester

With the ports defined for our memory tester, let us start with some code. We start with a state machine:

    always @(posedge clk)
    begin
        case (state)
              0: begin
                      if (cmd_status == 1)
                      begin
                          if (refresh_underflow)
                          begin
                            refresh <= 1;
                            select <= 1;
                            state <= 2;
                          end else if (address[15:14] != 2'b11)
                          begin
                            write <= 1;
                            select <= 1;
                            state <= 2;
                          end else if (wait_for_read == 0)
                          begin
                            write <= 0;
                            select <= 1;
                            state <= 2;
                          end
                      end 
                 end
              2: begin
                     select <= 0;
                     refresh <= 0;
                     state <= wait_for_read == 0 ? 3 : 0;
                 end
        endcase
    end
State 0 is an idle state, where we wait for the memory to become ready. When the memory controller is ready, we first need to check if the memory is due for a refresh.

The address register keeps track of which address we need to write test data to. Once the two most significant bits of the address register are both 1, we are finished with writing. From this point onwards we wait for wait_for_read to reach zero, which signals that we need to read data from a particular location.

We need state 2 to immediately deassert the command we issued in the previous state. We also want to stop issuing commands once we have issued a read command, which we do by assigning 3 to state.

Next, let us have a look at other snippets of code on which our state machine depends. First, the refresh logic:

    always @(posedge clk)
    begin
        if (refresh_counter == 0)
        begin
            refresh_underflow <= 1;
        end else if (refresh)
        begin
            refresh_underflow <= 0;
        end
    end
    
    always @(posedge clk)
    begin
        if (refresh_counter > 0)
        begin
            refresh_counter <= refresh_counter - 1;
        end else
        begin
            refresh_counter <= 120;
        end        
    end
The refresh counter continuously counts down from 120 to zero. With our module clocked at 20MHz (a period of 50ns), this means the counter reaches zero about every 6 microseconds, which is in line with the specs of our DDR RAM, stating that a refresh command should be issued every 7 microseconds.

The refresh_underflow register remembers that a refresh needs to happen, and gets cleared as soon as the refresh command has been issued.
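To make the arithmetic behind the value 120 explicit, here is a minimal sketch that derives the reload value from the clock frequency. The module name refresh_timer and the parameters CLK_HZ and REFRESH_NS are assumptions for illustration and do not appear in the actual design.

```verilog
// Sketch only: derives the refresh reload value instead of hard-coding it.
// CLK_HZ and REFRESH_NS are hypothetical parameter names.
module refresh_timer #(
    parameter CLK_HZ     = 20000000, // memory tester clock, 20MHz
    parameter REFRESH_NS = 6050      // desired refresh interval in ns
) (
    input  wire clk,
    input  wire refresh,             // high while a refresh command is issued
    output reg  refresh_underflow = 0
);
    // Clock period in ns: 1e9 / 20e6 = 50
    localparam CLK_NS = 1000000000 / CLK_HZ;
    // The counter runs RELOAD..0, which is RELOAD + 1 cycles per interval,
    // so 6050 / 50 - 1 = 120
    localparam RELOAD = REFRESH_NS / CLK_NS - 1;

    reg [15:0] refresh_counter = RELOAD;

    always @(posedge clk) begin
        // Free-running down counter
        if (refresh_counter > 0)
            refresh_counter <= refresh_counter - 1;
        else
            refresh_counter <= RELOAD;

        // Remember a pending refresh until it has been issued
        if (refresh_counter == 0)
            refresh_underflow <= 1;
        else if (refresh)
            refresh_underflow <= 0;
    end
endmodule
```

Counting from 120 down to 0 takes 121 cycles, so at 50ns per cycle the interval works out to 6.05 microseconds.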

Next, let us look at the code that keeps track of the address we need to write to:

    always @(posedge clk)
    begin
        if (state == 2 && !refresh)
        begin
            address <= address + 8;
        end
    end
We advance the address once we are finished with a write command. We advance by 8 instead of 1 because of the bursty nature of DDR RAM: every command transfers 8 words. We also don't want to advance the address if the previous command was a refresh.

Let us have a look at data generation for the writes:

...
    assign data_out = {data_counter, 3'b000,
                       data_counter, 3'b001,
                       data_counter, 3'b010,
                       data_counter, 3'b011,
                       data_counter, 3'b100,
                       data_counter, 3'b101,
                       data_counter, 3'b110,
                       data_counter, 3'b111};
...					   
    always @(posedge clk)
    begin
        if (state == 2 && !refresh)
        begin
            data_counter <= data_counter + 1;
        end
    end
...
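Here we create the data for all 8 transfers of a burst at once. Each 16-bit word consists of the current data_counter value, followed by a 3-bit index that identifies the position of the word within the burst.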
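Because the pattern is fully determined by data_counter, read-back data can be verified by rebuilding the same 128-bit word and comparing. The post does not show such a check, so the module below, including the name burst_check and the match output, is purely a hypothetical sketch. It assumes data_counter is 13 bits wide, which follows from 8 words of 16 bits, each with a 3-bit index.

```verilog
// Hypothetical checker (not part of the original design): rebuilds the
// write pattern for a given counter value and compares it with the burst
// that was read back.
module burst_check (
    input  wire [12:0]  data_counter, // counter value used for the write
    input  wire [127:0] data_in,      // burst read back from DDR RAM
    output wire         match         // 1 when all 8 words are as expected
);
    // Same layout as data_out in the memory tester
    wire [127:0] expected = {data_counter, 3'b000,
                             data_counter, 3'b001,
                             data_counter, 3'b010,
                             data_counter, 3'b011,
                             data_counter, 3'b100,
                             data_counter, 3'b101,
                             data_counter, 3'b110,
                             data_counter, 3'b111};

    assign match = (data_in == expected);
endmodule
```

A mismatch in only the 3-bit index fields would hint at words arriving in the wrong order within the burst, while a mismatch in the counter fields would point to data from a different address.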

Let us next create the logic where we wait for the read:

...
    reg [31:0] wait_for_read = 400000000;
...	
	always @(posedge clk)
    begin
        if (wait_for_read > 0)
        begin
            wait_for_read <= wait_for_read - 1;
        end
    end
...
This snippet will wait for about 20 seconds before doing a read. The cycle which writes all the test data will complete long before then, after which the memory tester will continue to refresh the DDR RAM until it is time to do the read.
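The constant 400000000 is simply 20 seconds multiplied by the 20MHz clock frequency. As a sketch, the same delay could be expressed with parameters. The names read_delay, CLK_HZ, WAIT_SECONDS and read_due are assumptions for illustration only.

```verilog
// Sketch only: expresses the read delay in seconds rather than in cycles.
// All names except wait_for_read are hypothetical.
module read_delay #(
    parameter CLK_HZ       = 20000000, // memory tester clock, 20MHz
    parameter WAIT_SECONDS = 20        // delay before the read
) (
    input  wire clk,
    output wire read_due               // 1 once the delay has elapsed
);
    // 20000000 * 20 = 400000000, which still fits in 32 bits
    localparam WAIT_CYCLES = CLK_HZ * WAIT_SECONDS;

    reg [31:0] wait_for_read = WAIT_CYCLES;

    always @(posedge clk) begin
        if (wait_for_read > 0)
            wait_for_read <= wait_for_read - 1;
    end

    assign read_due = (wait_for_read == 0);
endmodule
```

Note that a 32-bit counter limits this approach to roughly 214 seconds at 20MHz, so a longer wait would need a wider register.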

Adding the Memory Tester to the existing design

Let us now add our Memory Tester to our existing design. From my previous post, you will remember that I have implemented our logic as another state machine in the module mcontr_sequencer. Within this module we will also place an instance of the memory tester, as follows:

mem_tester m2(
    .clk(memtest_out),
    .cmd_status(cmd_status),
    .select(cmd_valid),
    .write(write_out),
    .address_out(cmd_address),
    .refresh(refresh_out),
    .data_out(cmd_data_out),
    .data_in()
    );
We need to change the state machine living within the module mcontr_sequencer a bit, so that it works with our memory tester.

The first change is as follows:

    always @(posedge mclk)
    begin
        if (start_init)
        begin
            case (state)
			...
			... initialise the memory ...
			...
              PREPARE_CMD: begin
                  test_cmd <= 32'h000001ff;
                  do_capture <= 0;
                  state <= WAIT_CMD;
                  cmd_status <= 1;
              end
              WAIT_CMD: begin
                  if (cmd_valid)
                  begin
                      if (refresh_out)
                      begin
                          state <= REFRESH_0;
                          cmd_status <= 2; 
                      end else begin
                          state <= STATE_PREA;
                          test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
                          data_in <= cmd_data_out;
                          column_address <= cmd_address[9:0];
                          do_write <= write_out;
                          cmd_status <= 2;
                      end
                  end
              end			
			  ...
            endcase
        end
    end
We get into the state PREPARE_CMD, right after the memory was initialised, which we have covered in a previous post. Within the state PREPARE_CMD we set cmd_status to 1. This signals our memory tester that it is free to submit a command.

We then wait in the state WAIT_CMD until the memory tester has given us a command. We will cover the refresh command a bit later.

When we have received a read/write command, the first thing we need to do is to activate the row in question. You can see from the snippet above that we get the row address from bits 15:10 of cmd_address. The lower bits, 9:0, form the column address, which we save for later use.
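The address split on its own can be sketched as a small combinational module. The module name addr_split and the output name row_address are hypothetical; in the actual design the row bits are placed directly into test_cmd.

```verilog
// Sketch only: splits the 16-bit linear address from the memory tester
// into a row and a column address, using the same bit ranges as the
// WAIT_CMD state. row_address is a hypothetical name.
module addr_split (
    input  wire [15:0] cmd_address,    // linear address from the memory tester
    output wire [5:0]  row_address,    // bits 15:10 select the row
    output wire [9:0]  column_address  // bits 9:0 select the column
);
    assign row_address    = cmd_address[15:10];
    assign column_address = cmd_address[9:0];
endmodule
```

For example, a cmd_address of 16'h0408 yields row 1 and column 8.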

In the next state we get the dq bus ready for either reading or writing, and in subsequent states we wait for the ACTIVATE phase of the DDR RAM to complete:

              ...
              STATE_PREA: begin
                  state <= STATE_WAIT_READ_PATTERN_0;
                  dq_tri <= do_write ? 0 : 15;
                  test_cmd <= 32'h000001ff;                  
              end
			  .... wait until activate is complete ...
Once activation is complete, we can issue the command for reading/writing a column:

              ISSUE_CMD: begin
                  state <= ASSERT_ODT;
                  test_cmd <= {1'b0, 4'b0, column_address, 1'b0, 4'h1, 
                      (do_write ? 2'b11 : 2'b00), 10'h1fd};
              end

              ASSERT_ODT: begin
                  test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;
                  state <= WAIT_CMD_FINISHED;
              end
With writes we need to assert the ODT line, which we do in the ASSERT_ODT state.

Finally, after completing a read/write, we need to precharge the row, which I am not going to show here.

In our state machine we still need to serve the Refresh command. Apart from assigning a value to test_cmd to trigger the Refresh command, we also need to honour the timing period tRFC after issuing the command, which is 160ns for the DDR RAM chip we use on the Arty A7 board. 

We also implement this delay in our state machine, so our memory tester only needs to issue the refresh command and doesn't need to worry about timing the tRFC delay. Our state machine implementing the refresh command looks like this:

...
reg                   [3:0] refresh_wait = 14;
...
             REFRESH_0: begin
                  test_cmd <= 32'h000031ff;
                  state <= REFRESH_1;
              end

              REFRESH_1: begin
                  test_cmd <= 32'h000001ff;
                  if (refresh_wait > 0)
                  begin
                      refresh_wait <= refresh_wait - 1;
                  end else
                  begin
                      refresh_wait <= 14;
                  end
                  if (refresh_wait == 0)
                  begin
                      state <= PREPARE_CMD;
                  end
              end

Since our state machine is operating in the 83MHz domain, with a clock period of about 12ns, 14 cycles give us 168ns, which satisfies tRFC.
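The cycle count follows from rounding tRFC up to a whole number of clock periods. The snippet below is a small simulation-only sketch of that calculation. The module name and the localparam names are made up for illustration, and the clock period is rounded to 12ns as in the text.

```verilog
// Simulation-only sketch: computes how many clock cycles are needed to
// cover tRFC. All names are hypothetical. Times are in picoseconds to
// stay within integer arithmetic.
module trfc_cycles_demo;
    localparam CLK_PS  = 12000;   // 83MHz clock period, rounded to 12ns
    localparam TRFC_PS = 160000;  // tRFC of 160ns

    // Ceiling division: (160000 + 11999) / 12000 = 14
    localparam TRFC_CYCLES = (TRFC_PS + CLK_PS - 1) / CLK_PS;

    initial begin
        // Prints: tRFC needs 14 cycles (168 ns)
        $display("tRFC needs %0d cycles (%0d ns)",
                 TRFC_CYCLES, TRFC_CYCLES * CLK_PS / 1000);
    end
endmodule
```

Rounding up rather than down matters here: 13 cycles would give only 156ns, which is shorter than the required 160ns.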

Test Results

Running our design on the FPGA returned the results I expected when I read back a test value from memory after 20 seconds.

For the moment, I have nothing else to report back on 😄

In Summary

In this post we have implemented a very simple memory tester where we write a volume of data to memory, wait 20 seconds and read a test value back.

In the next post we will continue to chip away at our memory controller.

One thing I am aware I should give attention to in our memory controller is reducing latency, so that we can comfortably operate at 7MHz, which is the memory bandwidth the Amiga core requires. We will give attention to that in the next post.

Till next time!