Sunday, 13 May 2018

Reading From SDRAM

Foreword

In the previous post we managed to pass pixel data from one clock domain to another and get it displayed on a VGA screen.

The above was a very nice practical test for me around all the theory surrounding cross clock domains. I must admit, I was quite overwhelmed with all the theory surrounding Cross Clock Domains on the Internet.

When overwhelmed with theory one must always attempt, if you can, to do a simple practical test for a reality check 😄

The next big goal is to attempt to read frames from SDRAM and display them on a VGA screen. This is quite a big chunk to tackle at once, so we will try and break it down into smaller chunks.

In this post we will try and see if we can attempt to read from SDRAM to FPGA. We have already managed to write to SDRAM in a previous post, so the reading shouldn't be that hard.

However, the only part that concerns me with the reading is that our pixel clock is operating at a speed very close to our AXI clock speed, so we must validate whether our AXI system can keep up with sufficient performance.

So, in this post I will also show a very primitive way for checking AXI-bus performance.

The Plan

Throughout this whole blog series I have been following the approach of divide and conquer.

This just makes life simpler when exploring new turf and should you encounter a bug you limit the scope you need to search in order to isolate the problem.

This is also the reason why, in this post, we are only going to focus on reading from SDRAM to FPGA.

But, in order to validate that we are reading data correctly from SDRAM, we need to be able to populate SDRAM with known data in the first place.

So, what is an easy way to populate SDRAM with known data? The Xilinx SDK comes to our rescue here.

The XSCT console integrated within the Xilinx SDK actually allows you to write a binary file that is present on your desktop machine directly to SDRAM on the Zybo Board at an address you specify.

Since our main goal is to develop a C64 on FPGA I am going to use the XSCT console to write a copy of the BASIC ROM from the C64 to the SDRAM of the Zybo board.

We are then going to attempt to read this binary back from SDRAM to our FPGA by means of our developed design and inspect the data that comes back by means of an Integrated Logic Analyser.

I have covered the use of an Integrated Logic Analyser in a previous post.

To make things more interesting we will take the data received from SDRAM, which is in 32-bit word form, and output it to an 8-bit port in a byte by byte fashion. Reliving the 8-bit era!

Reusing old code

To make our development a bit faster, we will be reusing our code developed in a previous post where we managed to write to SDRAM. With some tinkering we can make it read from SDRAM.

Let us refresh our minds again on how this code worked.

The whole design revolves around the IP provided by Xilinx called the LogiCORE AXI Master Burst.

The relevant IP Core I have just described is indicated by the block AXI Master Burst.

This block connects to the outside world via AXI. Our user logic will connect to this block via the IPIC protocol which is somewhat simpler than the AXI protocol.

Now, we need to wrap the AXI Master Block within an IP within Vivado so we can add it as a block within our Block Design.

We will then wire up the rest of our design to this Block.

In the post where we implemented the functionality for writing to SDRAM, we encapsulated the user related logic within a module called burst_block which had the following signature:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs

    );

All the ports whose names start with ip2 form part of the IPIC, which is connected to the AXI Master Block.

During the course of this post we will rework both the module burst_block and our AXI IP to cater for SDRAM reading.

I want to stress, though, that I will not change these two components to provide dual functionality, but rather make copies and change the copies to provide the read functionality.

Thus, in the end we will have an AXI IP and burst_block providing the AXI write functionality, and another AXI IP and burst_block set providing the read functionality.

The new AXI IP Block

Let us start by creating a new AXI IP Block for reading.

As mentioned in a previous post, you will start off this process by clicking on the Tools menu, selecting Create and Package new IP, and selecting Create AXI Peripheral on the second wizard page.

I am not going to bore you with the whole process again, but let us have a look at what the complete AXI Block will look like in the block design:


This block looks almost the same as the one we developed for writing to SDRAM.

The real major difference is that we don't have a data input port, but a data output port.

Another difference is the assignment of m00_axi_aruser[0] and m00_axi_awuser[0]. These assignments will change as follows:

    assign m00_axi_aruser[0] = 1'b1;
    assign m00_axi_awuser[0] = 1'b0;


In our write AXI block from a previous post the zero and one values were the other way around. Because this block is a read block, we are interested in coherent reads rather than writes.

Just to refresh our minds again, we are making use of the ACP AXI port on the processing subsystem, through which we see memory in exactly the same way as the two ARM cores see memory. For that reason we need to worry about coherency. Here is a quote from the Technical Reference Manual for the Zynq regarding the above mentioned ports:

ACP coherent read requests: An ACP read request is coherent when ARUSER[0] = 1 and ARCACHE[1] = 1 alongside ARVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is read directly from the relevant processor and returned to the ACP port. When the data is not present in any of the Cortex-A9 processors, the read request is issued on one of the SCU AXI master ports, along with all its AXI parameters, with the exception of the locked attribute.

ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] = 1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.

Why do we need to look at memory in the same way as a CPU core? The answer is that we will be using the XSCT console to write test data to SDRAM, for reading back later by our user logic.

The XSCT console, however, always operates within the context of a CPU core, so reads and writes that we do from this console will always go via the L1 or L2 cache. So, if we make use of an AXI master port that accesses the DDR RAM directly, like HP AXI or GP AXI, we might not read back the data that we wrote via the XSCT console.

Changes to burst_block

Let us create a new burst_block for doing the reading. The definition of this new module looks as follows:

module burst_read_block(
  input wire clk,
  input wire reset,
  input wire restart,
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [31:0] axi_d_out,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs,
  output wire empty,
  input wire read
    );


A major change compared to the previous AXI write module is that our module will now be receiving data from the AXI port and the rest of our design will now read data from this block.

Our FIFO buffer contained within this module will also have its read and write ports swapped around a bit:

fifo #(
  .DATA_WIDTH(32),
  .ADDRESS_WIDTH(5)
)
data_buf (
  .clk(clk),
  .reset(!reset | restart),   // a restart also empties the FIFO
  .read(read),                // read side: driven by user logic
  // write side: driven by the AXI master (src_rdy is active low)
  .write(!master_read_src_rdy & (bytes_to_receive > 0)),
  .write_data(ip2bus_mstrd_d),
  .empty(empty),
  .full(),
  .read_data(axi_d_out)
);


The read port will now be driven by user logic instead of the AXI port.

However, the write port will now be controlled by the AXI port via master_read_src_rdy. It should be noted that source and destination have different meanings within an AXI read context. Within an AXI read context, the source is SDRAM and the destination is our user logic.

Apart from the master_read_src_rdy signal, bytes_to_receive is also used to indicate that a write should happen to the FIFO buffer.

bytes_to_receive keeps track of how much data we still expect from the AXI port for the request we issued. Despite its name, it counts down once per received 32-bit word:

always @(posedge clk)
 if (state == START)
 
   bytes_to_receive <= BURST_THRES;
 else if ((state > START) & !master_read_src_rdy & bytes_to_receive != 0) 
   bytes_to_receive <= bytes_to_receive - 1;


We initialise this register when our state machine is in the START state.

Our state machine uses the same states as with our AXI write mechanism, with some minor changes to the conditions for transitioning to other states, but more on this in a moment.

You might have noticed we are referencing a port called restart. I am using this port to cater for the scenario where we are constantly reading frames from SDRAM.

When we have finished reading a frame from SDRAM, we would like to reset the memory pointer to the beginning of the frame.

The first place you might have seen this port being used is on the reset port of data_buf. So, on a restart we will empty the FIFO. The other places where we need to make use of this signal within our module are shown below:

...
always @(posedge clk)
if (!reset | restart)
  count_in_buf <= 0;
else if (!read & write)
  count_in_buf <= count_in_buf + 1;
else if (read & !write)    
  count_in_buf <= count_in_buf - 1;
...
always @(negedge clk)
if (!reset | (restart))
begin
  axi_start_address <= 32'h200000;
  axi_data_inc <= 0;
end
else if (state == INIT_CMD)
begin
  axi_start_address <= axi_start_address + axi_data_inc;
  axi_data_inc <= {BURST_THRES,2'b0};
end    
...

The restart signal also needs to be used by the state machine, which I haven't shown above, since it needs special mention.

In our state machine we cannot blindly reset the state to IDLE. We first need to wait for the AXI port to deliver its last batch of data:

always @(posedge clk)
if (!reset | (restart & bytes_to_receive == 0 & state == 0))  
  state <= 0;
else
  case( state )
  //cater for scenario of flush
    IDLE: if (count_in_buf < BURST_THRES) 
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_read_src_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: if (!master_read_src_rdy & bytes_to_receive == 1)
                    state <= IDLE;    
  
  endcase


Finally, let us quickly cover the assignments to some crucial ports:

assign master_read_src_rdy = ip2bus_otputs[3];
assign cmd_ack = ip2bus_otputs[0];
assign ip2bus_inputs[0] = mstread_req;
assign ip2bus_inputs[1] = mst_type; 
assign ip2bus_mst_addr = axi_start_address;
assign ip2bus_inputs[2] = master_read_dst_rdy;


Gluing everything together

With our new AXI IP and burst_read_block module developed, it is time we give it a test run.

As mentioned earlier, our test will be to write the BASIC ROM to SDRAM via the XSCT console and then see if our FPGA reads back the same information.

To do this test, we need to develop an extra Verilog module for gluing everything together. We will call this module axi_restart_test, and here is the complete code for this module:

module axi_restart_test(
  input wire clk,
  input wire reset,
  //input wire write,
  //input wire [31:0] write_data, 
  //output src ready
  //-----------------------------------------
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [7:0] byte_out,
  output wire trigger_restart,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs
    );
    
reg [31:0] byte_shift_reg;    
reg [1:0] byte_num = 0;
reg [9:0] bytes_to_send = 1023;
reg [1:0] state = 0;
reg [6:0] restart_counter = 63;
wire [31:0] data_buf_out;
wire empty;

assign trigger_restart = state == 2;  
    
burst_read_block my_read_block(
          .clk(clk),
          .reset(reset),
          
          .restart(trigger_restart),
           
          .count_in_buf(),
          
          
          .ip2bus_mst_addr(ip2bus_mst_addr),
          .ip2bus_mst_length(ip2bus_mst_length),
          .ip2bus_mstrd_d(ip2bus_mstrd_d),
          .ip2bus_inputs(ip2bus_inputs),
          .ip2bus_otputs(ip2bus_otputs),
          .axi_d_out(data_buf_out),
          .empty(empty),
          .read(!empty & byte_num == 0)
            );

assign byte_out = byte_shift_reg[31:24];

always @(posedge clk)
  case (state)
    2'b0 : state <= 1;
    2'b1 : state <= bytes_to_send == 0 ? 2 : 1;
    2'b10 : state <= restart_counter == 0 ? 0 : 2;    
  endcase
  
always @(posedge clk)
  restart_counter <= state == 2 ? restart_counter - 1 : 63; 
    
always @(posedge clk)
if (byte_num == 0 & !empty)
  byte_shift_reg <= data_buf_out;
else
  byte_shift_reg <= {byte_shift_reg[23:0], 8'b0};
  
always @(posedge clk)
if (state == 0)
  bytes_to_send <= 1023;
else if (state == 1 & !empty)
  bytes_to_send <= bytes_to_send - 1;
  
always @(posedge clk)    
  byte_num <= trigger_restart & empty ? 0 : byte_num + 1;
    
endmodule


This module takes the words received from the burst_read_block and returns them byte for byte over the output port byte_out.

It also continuously outputs the first 1024 bytes of BASIC ROM which is retrieved from SDRAM.
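To make the byte ordering concrete, here is a small Python model of what byte_shift_reg does with a single word. It is only a sketch of the shift behaviour, not a cycle-accurate simulation of the module, and the function name serialize_word is made up for illustration. The test word is one of the words that shows up in the capture later in this post.

```python
def serialize_word(word):
    """Yield the four bytes of a 32-bit word, most significant byte first.

    Mirrors byte_out = byte_shift_reg[31:24], followed by a left shift
    of the register by 8 bits on every clock.
    """
    shift_reg = word & 0xFFFFFFFF
    for _ in range(4):
        yield (shift_reg >> 24) & 0xFF
        shift_reg = (shift_reg << 8) & 0xFFFFFFFF


# Prints ['e3', '7b', 'e3', '94']
print([f"{b:02x}" for b in serialize_word(0xE37BE394)])
```

So the top byte of each word leaves the port first, which is worth keeping in mind when comparing byte_out with the ROM contents.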

With these modules developed, we can add them to our block design and wire everything up, which results in the following:

You can see on the image that I have also added an Integrated Logic Analyser on the right hand side. I have attached the following probes to the Integrated Logic Analyser:


  • ip2bus_mst_addr (32 bits)
  • ip2bus_mstrd_d (32 bits)
  • byte_out (8 bits)
  • trigger_restart (1 bit)
  • ip2bus_inputs (5 bits)
  • ip2bus_otputs (6 bits)

Test Results

Let us have a look at the Test Results.

Probably the easiest way is just to view the captured waveforms in the Waveform window.

We have, however, also the option to view the captured data as a csv file, which we will use in this post.

To export the captured data as a CSV file, you need to have the Tcl console of the Hardware Manager open. You then issue the following command within the Tcl console:

write_hw_ila_data my_hw_ila_data_file.zip [upload_hw_ila_data hw_ila_1]

You need to specify the full path for the resulting zip file. Also, hw_ila_1 is the name of your wave capture window.

When you open up the created zip file, you will see a number of files, one of which is a CSV file. In our case the CSV file looks like the following:


The fields are listed as below:

  • Sample in Buffer
  • Sample in Window
  • TRIGGER
  • design_1_i/axi_restart_test_0_byte_out[7:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_1[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_2[0:0]
  • design_1_i/axi_restart_test_0_ip2bus_mst_addr_1[5:2]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_3[31:6]
  • design_1_i/myip_burst_read_test_0_bus2ip_mstrd_d[31:0]
  • design_1_i/myip_burst_read_test_0_ip2bus_otputs[4:0]
  • u_ila_0_myip_burst_read_test_0_ip2bus_otputs[5:5]
  • design_1_i/axi_restart_test_0_ip2bus_inputs[2:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_4[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr[0:0]
  • design_1_i/axi_restart_test_0_trigger_restart_1

You will see that some signals, like ip2bus_mst_addr, are split across several fields for some reason.

Now, let us see if we can analyse the data!

A good place to start is after a restart pulse, which will bring us to the beginning of the Basic ROM data. Let us have a look at a snippet after a restart pulse:

798,798,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,1
799,799,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,1
800,800,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,1
801,801,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,1
802,802,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,1
803,803,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,0
804,804,0,00,0,0,0,0008000,ca59c65b,08,0,4,0,0,0
805,805,0,00,0,0,0,0008000,ca59c65b,08,0,3,0,0,0
806,806,0,00,0,0,0,0008000,ca59c65b,09,0,3,0,0,0
807,807,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
808,808,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0


The restart pulse, in the last column, ends at sample 803. Two other columns change shortly afterwards: one goes from 4 to 3 at sample 805, and another goes from 08 to 09 at sample 806.

Let us see if we can puzzle out what these signal changes represent. First, let us have a look at the signal which changes from a 4 to a 3.

From our column names we see that this column represents ip2bus_inputs. Let us quickly recap on the meaning of the different bits for this signal:

  • Bit 2: master_read_dst_rdy
  • Bit 1: mst_type
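  • Bit 0: mstread_req

So, we can see that in transitioning from 4 to 3, we are signaling that we are ready to receive data and we issue a read request. Note that it is a 0 in Bit 2 that signals ready here, so this signal appears to be active low, just like the source ready signal coming back from the AXI block.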

Let us now see if we can figure out the signal that changes from an 8 to a 9. This time we see that this signal represents ip2bus_otputs. The meaning of these bits is as follows:

  • Bit 0: bus2ip_mst_cmdack 
  • Bit 1: bus2ip_mst_cmplt
  • Bit 2: bus2ip_mst_error
  • Bit 3: bus2ip_mstrd_src_rdy_n     
  • Bit 4: md_error

So we see that Bit 0 transitions to 1, indicating that the AXI subsystem has indeed accepted our command!
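Decoding these hex columns by hand gets tedious quickly, so a small helper can do it. The Python sketch below simply applies the two bit lists above to a hex value from the CSV file; the helper name decode and the two dictionaries are made up for illustration.

```python
# Bit assignments taken from the two lists above.
INPUT_BITS = {0: "mstread_req", 1: "mst_type", 2: "master_read_dst_rdy"}
OUTPUT_BITS = {
    0: "bus2ip_mst_cmdack",
    1: "bus2ip_mst_cmplt",
    2: "bus2ip_mst_error",
    3: "bus2ip_mstrd_src_rdy_n",
    4: "md_error",
}


def decode(hex_value, bit_names):
    """Return the names of all bits that are set in a hex string from the CSV."""
    value = int(hex_value, 16)
    return [name for bit, name in sorted(bit_names.items()) if value & (1 << bit)]


print(decode("4", INPUT_BITS))    # ['master_read_dst_rdy']
print(decode("3", INPUT_BITS))    # ['mstread_req', 'mst_type']
print(decode("09", OUTPUT_BITS))  # ['bus2ip_mst_cmdack', 'bus2ip_mstrd_src_rdy_n']
```

Keep in mind that a set bit only means the wire is at 1. For the active low ready signals, a 1 means not ready.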

What is funny, though, is that the entries following the transition show that the source ready signal is not asserted straightaway (Bit 3 is active low and stays at 1)! This means that we don't yet have data available from our AXI system that we can read.

We only get data about 37 clock cycles later:

940,940,0,bc,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
941,941,0,58,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
942,942,0,bc,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
943,943,0,cc,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
944,944,0,b3,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
945,945,0,7d,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
946,946,0,03,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
947,947,0,10,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
948,948,0,bf,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
949,949,0,71,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
950,950,0,b3,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
951,951,0,9e,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
952,952,0,b9,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
953,953,0,ea,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
954,954,0,e0,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
955,955,0,97,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
956,956,0,e2,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
957,957,0,64,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
958,958,0,bf,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
959,959,0,ed,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
960,960,0,e2,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
961,961,0,b4,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
962,962,0,e2,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
963,963,0,6b,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
964,964,0,b8,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
965,965,0,0d,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
966,966,0,e3,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
967,967,0,0e,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
968,968,0,b4,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
969,969,0,65,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
970,970,0,b7,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
971,971,0,7c,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
972,972,0,b7,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
973,973,0,8b,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
974,974,0,b7,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
975,975,0,ad,0,0,e,0008001,f0c83aa4,08,0,0,0,0,0
976,976,0,00,0,0,e,0008001,b700b6ec,00,0,0,0,0,0
977,977,0,00,0,0,e,0008001,b737b72c,00,0,0,0,0,0


I had a look at a couple of subsequent read requests and all of them have this delay of more or less 37 clock cycles.

So, we can conclude that for each AXI read request, there is a latency of about 37 clock cycles.
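If you want to measure this latency for many transactions, counting rows by hand is error prone. Below is a rough Python sketch that takes (sample number, ip2bus_otputs) pairs and returns the number of cycles between the command acknowledge and the first data beat. The function name read_latency is hypothetical, and in the example the samples between the rows shown in this post are assumed to stay at 08.

```python
CMDACK = 1 << 0      # bus2ip_mst_cmdack
SRC_RDY_N = 1 << 3   # bus2ip_mstrd_src_rdy_n (active low)


def read_latency(samples):
    """Cycles from the cmdack pulse to the first sample with src_rdy_n low.

    samples is an iterable of (sample_number, otputs) pairs in capture order.
    Returns None if no complete transaction is found.
    """
    ack_sample = None
    for number, otputs in samples:
        if ack_sample is None:
            if otputs & CMDACK:
                ack_sample = number
        elif not (otputs & SRC_RDY_N):
            return number - ack_sample
    return None


# cmdack at sample 806, first data word at sample 843.
# Samples 807..842 are assumed to read 08 throughout.
capture = [(805, 0x08), (806, 0x09)]
capture += [(n, 0x08) for n in range(807, 843)]
capture += [(843, 0x00)]
print(read_latency(capture))  # 37
```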

Now, let us have a look at the actual data that comes through:

840,840,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
841,841,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
842,842,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
843,843,0,00,0,0,0,0008000,e37be394,00,0,0,0,0,0
844,844,0,00,0,0,0,0008000,424d4243,00,0,0,0,0,0
845,845,0,00,0,0,0,0008000,43495341,00,0,0,0,0,0
846,846,0,00,0,0,0,0008000,a741a830,00,0,0,0,0,0
847,847,0,00,0,0,0,0008000,a8f7ad1d,00,0,0,0,0,0


This looks more or less like the data at the beginning of a C64 BASIC ROM, with the byte order reversed, though.
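The reversal is simply little-endian word packing: the byte at the lowest address ends up in the least significant byte of each 32-bit word. As a quick sanity check, this Python snippet unpacks the first three captured words back into memory order.

```python
# First three words as captured on bus2ip_mstrd_d.
words = [0xE37BE394, 0x424D4243, 0x43495341]

# Unpack each word with the least significant byte first (little-endian).
rom_bytes = b"".join(w.to_bytes(4, "little") for w in words)

print(rom_bytes.hex())   # 94e37be343424d4241534943
print(rom_bytes[4:])     # b'CBMBASIC'
```

The CBMBASIC text that appears after the first four bytes is a good sign that we are really looking at the BASIC ROM.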

Let us now see if we can pinpoint the place where these 4-byte words get output as a stream of bytes. This happens a couple of clock cycles after we start receiving data:

840,840,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
841,841,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
842,842,0,00,0,0,0,0008000,ca59c65b,08,0,0,0,0,0
843,843,0,00,0,0,0,0008000,e37be394,00,0,0,0,0,0
844,844,0,00,0,0,0,0008000,424d4243,00,0,0,0,0,0
845,845,0,00,0,0,0,0008000,43495341,00,0,0,0,0,0
846,846,0,00,0,0,0,0008000,a741a830,00,0,0,0,0,0
847,847,0,00,0,0,0,0008000,a8f7ad1d,00,0,0,0,0,0
848,848,0,e3,0,0,0,0008000,abbeaba4,00,0,0,0,0,0
849,849,0,7b,0,0,0,0008000,ac05b080,00,0,0,0,0,0
850,850,0,e3,0,0,0,0008000,a89fa9a4,00,0,0,0,0,0
851,851,0,94,0,0,0,0008000,b5489809,08,0,0,0,0,0
852,852,0,42,0,0,0,0008000,b5489809,08,0,0,0,0,0
853,853,0,4d,0,0,0,0008000,b5489809,08,0,0,0,0,0


The byte values look more or less ok, except that just after the 64th byte we are presented with 8 zero byte values, which is not correct if you compare with the actual BASIC ROM data.

Some closer investigation revealed that this was caused by our AXI bus not being able to provide the data fast enough. To get an idea of the problem, I have created the following table:

byte#,byte_out,mst_addr,mstrd_d,otputs,word#
,00,0008000,e37be394,00,0
,00,0008000,424d4243,00,1
,00,0008000,43495341,00,2
,00,0008000,a741a830,00,3
,00,0008000,a8f7ad1d,00,4
0,e3,0008000,abbeaba4,00,5
1,7b,0008000,ac05b080,00,6
2,e3,0008000,a89fa9a4,00,7
3,94,0008000,b5489809,08,
4,42,0008000,b5489809,08,
5,4d,0008000,b5489809,08,
6,42,0008000,b5489809,08,
7,43,0008000,a927a870,00,8
8,43,0008000,a882a81c,00,9
9,49,0008000,a93aa8d1,00,10
10,53,0008000,a94aa82e,00,11
11,41,0008000,e167b82c,00,12
12,a7,0008000,e164e155,00,13
13,41,0008000,b823b3b2,00,14
14,a8,0008000,01b033c5,08,
15,30,00f8000,01b033c5,08,
16,a8,00f8000,01b033c5,0a,
17,f7,00f8000,01b033c5,08,
18,ad,00f8000,01b033c5,09,
19,1d,00f8000,01b033c5,08,
20,ab,00f8000,01b033c5,08,
21,be,00f8000,01b033c5,08,
22,ab,00f8000,01b033c5,08,
23,a4,00f8000,01b033c5,08,
24,ac,00f8000,01b033c5,08,
25,05,00f8000,01b033c5,08,
26,b0,00f8000,01b033c5,08,
27,80,00f8000,01b033c5,08,
28,a8,00f8000,01b033c5,08,
29,9f,00f8000,01b033c5,08,
30,a9,00f8000,01b033c5,08,
31,a4,00f8000,01b033c5,08,
32,a9,00f8000,01b033c5,08,
33,27,00f8000,01b033c5,08,
34,a8,00f8000,01b033c5,08,
35,70,00f8000,01b033c5,08,
36,a8,00f8000,01b033c5,08,
37,82,00f8000,01b033c5,08,
38,a8,00f8000,01b033c5,08,
39,1c,00f8000,01b033c5,08,
40,a9,00f8000,01b033c5,08,
41,3a,00f8000,01b033c5,08,
42,a8,00f8000,01b033c5,08,
43,d1,00f8000,01b033c5,08,
44,a9,00f8000,01b033c5,08,
45,4a,00f8000,01b033c5,08,
46,a8,00f8000,01b033c5,08,
47,2e,00f8000,01b033c5,08,
48,e1,00f8000,01b033c5,08,
49,67,00f8000,01b033c5,08,
50,b8,00f8000,01b033c5,08,
51,2c,00f8000,01b033c5,08,
52,e1,00f8000,01b033c5,08,
53,64,00f8000,01b033c5,08,
54,e1,00f8000,01b033c5,08,
55,55,00f8000,aa9faa7f,00,15
56,b8,00f8000,6c10a260,08,
57,23,00f8000,6c10a260,08,
58,b3,00f8000,6c10a260,08,
59,b2,00f8000,6c10a260,08,
60,aa,00f8000,6c10a260,08,
61,9f,00f8000,6c10a260,08,
62,aa,00f8000,6c10a260,08,
63,7f,00f8000,6c10a260,08,
,00,00f8000,6c10a260,08,
,00,00f8000,6c10a260,08,
,00,00f8000,6c10a260,08,
,00,00f8000,6c10a260,08,
,00,00f8000,a69ba856,00,16
,00,00f8000,aa85a65d,00,17
,00,00f8000,e1bde129,00,18
,00,00f8000,ab7ae1c6,00,19

This is basically a snippet of the CSV file converted into an HTML table.

I have also added an extra column on the left hand side and a column on the right hand side.

The column on the left hand side indicates when our byte output stream starts, and increments the count after each byte output. For byte count 64 and up the count cell is blank, which indicates invalid data.

The extra column on the right hand side counts each time we receive a four-byte word from the AXI bus.

You can see that received word number 15 is the last word of data that arrives "in time" and gets output as byte numbers 60, 61, 62 and 63.

Received word 16, however, only arrives some 5 clock cycles after byte number 63 was output. So, we have a small buffer underflow issue here 😞

We can avoid this buffer underflow situation by tweaking the size of our FIFO receive buffer. The size of the buffer should be made just over twice the BURST_THRES parameter.

Which value of BURST_THRES should we choose? BURST_THRES should be set so that we trigger the next AXI read transaction with enough elements in our FIFO buffer to survive the latency of more or less 37 clock cycles.

So, to cater for a possible worst case scenario, we can probably choose a BURST_THRES of 50. A typical FIFO buffer size for this scenario would then be 102 elements.
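The sizing argument above boils down to a bit of arithmetic, sketched here in Python. The function names are made up, the rate of one word consumed per AXI clock is an assumed worst case (our pixel clock runs close to the AXI clock), and the "+ 2" is just one reading of "just over twice".

```python
import math


def burst_threshold(latency_cycles, words_per_cycle, margin=0):
    """Smallest fill level that survives the read latency without underflow."""
    return math.ceil(latency_cycles * words_per_cycle) + margin


def fifo_depth(threshold):
    """FIFO size of just over twice the threshold (assumed here as 2n + 2)."""
    return 2 * threshold + 2


print(burst_threshold(37, 1.0))   # 37: bare minimum for one word per clock
print(fifo_depth(50))             # 102: the size quoted above for BURST_THRES = 50
```

A threshold of 50 leaves 13 elements of headroom over the bare minimum of 37, which should absorb the occasional longer delay.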

Just a final note on analysing the data in the CSV file.

Another kind of odd behaviour you might realise from the HTML table above is that once we start receiving the data words from the AXI subsystem, we receive it continuously from Word 0 to Word 7. However, after Word 7 there is a delay of four clock cycles before we receive data again.

You get this kind of behaviour for the other AXI read transactions as well, but for the majority of cases it is not only 4 delay cycles, but 6, 7 or 8 delay cycles!

Considering that we are requesting 16 data words at a time, these delay clock cycles eat more or less 50% of the available bandwidth!

Investigating the loss of Bandwidth

Let us see if we can figure out why our bandwidth is so low.

After some reading I found a clue.

The ACP-AXI port we are using is 64-bit wide on the Processing system side. Our AXI IP block, however, has a 32-bit wide data port instead of a 64-bit one.

I suspect that this 32-bit wide data port is the cause for our bandwidth issue.

The obvious check would then be to extend the data bus of our existing AXI IP to 64 bits as well.

However, from a quick look at Vivado, it doesn't seem so obvious how to extend the data bus width of a created AXI peripheral to 64 bits. There is an option to specify the data bus width from a drop-down, but in the drop-down, surprise surprise, there is only a 32-bit option!

I am sure with some tweaking one could make a 64-bit option appear, but for now I want something simple to test my theory on whether the mismatch of port width is the cause for the bandwidth issue.

Another option would be to use an AXI port on the Processing subsystem that is also 32 bits wide. At least then we can test whether matching port widths also result in the same number of lost clock cycles.

There is indeed an AXI port on the Processing subsystem that is 32-bits in width: The General Purpose (GP) AXI port.

To be honest, I have been avoiding AXI-GP ports up to this point for the simple reason that it is not guaranteed that the FPGA will see, on an AXI-GP port, the test data we wrote via the XSCT console. So, let us move the bar higher and see if we can make the FPGA read data on the AXI-GP port that was written by the XSCT console.

To do this, let us have a look again at the question: Why can't we guarantee that memory writes via the XSCT console will be read back via the FPGA on the AXI-GP port?

The answer to this question is that memory writes via the XSCT console don't get written directly to DDR memory, but rather to the L1 cache associated with the ARM core our XSCT console is attached to.

So, is there a way to flush the contents of the L1 cache to DDR memory? To answer this question, we need to look a bit in the ZYNQ Technical Reference manual on how the whole L1-Cache mechanism works.

The L1 cache contains a number of cache lines. When a read request is made to main memory, the contents are read from DDR memory and then stored within a cache line. All subsequent requests to the same memory location are served from the relevant cache line within the L1 cache.

With writes to main memory the relevant contents is first read from Main memory and stored within a cache line. The contents to be written is then applied to the relevant cache line, and the whole cache line is marked as dirty.

If we just left it there, we would just have a cache line entry marked as dirty, which would never be written to main memory.

However, things start to change when we keep writing data up to the point where we have no free cache line entries left. In this scenario the L1 cache manager will choose a cache line entry to evict to accommodate the new write. The cache line entry chosen will usually be a least recently read/written one.

If the chosen Cache Line is marked as dirty, the existing contents first gets written to main memory.

So, to ensure that our writes from the XSCT console get written to main memory, we need to write a file that is bigger than the size of the L1 cache.

The L1 data cache is 32KB, while the BASIC ROM we used previously for testing is 8KB in size. So we need to create a new file by concatenating the BASIC ROM a number of times until the resulting file is bigger than 32KB.

On the Bash Terminal in Linux this concatenated file can be produced with the following:

cat basic.bin basic.bin basic.bin basic.bin basic.bin basic.bin > big.bin

The file big.bin will then be the file you use to write to memory via the XSCT console.
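If you are not on Linux, the same file can be built with a few lines of Python. This is a minimal sketch following the reasoning above, so it only considers the size of the L1 data cache. The function name repeat_past is made up, and the zero-filled rom is a stand-in for the real 8KB contents of basic.bin.

```python
L1_DATA_CACHE = 32 * 1024  # bytes


def repeat_past(data, cache_size):
    """Repeat data until the result is strictly larger than cache_size bytes."""
    if not data:
        raise ValueError("data must not be empty")
    copies = cache_size // len(data) + 1
    return data * copies


rom = bytes(8 * 1024)  # placeholder; in practice read basic.bin from disk
big = repeat_past(rom, L1_DATA_CACHE)
print(len(big))  # 40960, i.e. five copies of the 8KB ROM
```

Five copies is the bare minimum to exceed 32KB. The cat command above uses six, which gives a bit more margin.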

Changing the Design to use an AXI-GP port

Let us change the design to make use of an AXI-GP port.

We do this by double clicking on the Processing Subsystem Block within the Block Design.

On the Page Navigator Panel on the Left select PS-PL Configuration and make the following selections:


We unselect the ACP Slave Interface so that only one AXI slave port is present on the Zynq Processing block.

Finally you need to wire everything up again.

Test Results with the AXI-GP port

This time around the results look a lot better. You obviously still have your latency of about 37 clock cycles for every AXI read transaction, but once the data starts coming in, it comes in without any long runs of delay clock cycles. With some read transactions you might see one or two intermittent delay clock cycles while receiving data, but that is about it:

.
.
.
1096,1096,0,00,ca59c65b,0,0,0,0,0,0,0008000,08,0,0
1097,1097,0,00,ca59c65b,0,0,0,0,0,0,0008000,08,0,0
1098,1098,0,00,e37be394,0,0,0,0,0,0,0008000,00,0,0
1099,1099,0,00,424d4243,0,0,0,0,0,0,0008000,00,0,0
1100,1100,0,00,43495341,0,0,0,0,0,0,0008000,00,0,0
1101,1101,0,00,a741a830,0,0,0,0,0,0,0008000,00,0,0
1102,1102,0,00,a8f7ad1d,0,0,0,0,0,0,0008000,00,0,0
1103,1103,0,e3,abbeaba4,0,0,0,0,0,0,0008000,00,0,0
1104,1104,0,7b,ac05b080,0,0,0,0,0,0,0008000,00,0,0
1105,1105,0,e3,a89fa9a4,0,0,0,0,0,0,0008000,00,0,0
1106,1106,0,94,a927a870,0,0,0,0,0,0,0008000,00,0,0
1107,1107,0,42,a882a81c,0,0,0,0,0,0,0008000,00,0,0
1108,1108,0,4d,a93aa8d1,0,0,0,0,0,0,0008000,00,0,0
1109,1109,0,42,a94aa82e,0,0,0,0,0,0,0008000,00,0,0
1110,1110,0,43,e167b82c,0,0,0,0,0,0,0008000,00,0,0
1111,1111,0,43,e164e155,0,0,0,0,0,0,0008000,00,0,0
1112,1112,0,49,b823b3b2,0,0,0,0,0,0,0008000,00,0,0
1113,1113,0,53,01b033c5,4,0,0,0,0,0,0008000,08,0,0
1114,1114,0,41,01b033c5,4,0,0,0,0,f,0008000,08,0,0
.
.
.
1153,1153,0,b8,01b033c5,0,0,0,0,0,f,0008000,08,0,0
1154,1154,0,2c,01b033c5,0,0,0,0,0,f,0008000,08,0,0
1155,1155,0,e1,aa9faa7f,0,0,0,0,0,f,0008000,00,0,0
1156,1156,0,64,6c10a260,0,0,0,0,0,f,0008000,08,0,0
1157,1157,0,e1,6c10a260,0,0,0,0,0,f,0008000,08,0,0
1158,1158,0,55,a69ba856,0,0,0,0,0,f,0008000,00,0,0
1159,1159,0,b8,aa85a65d,0,0,0,0,0,f,0008000,00,0,0
1160,1160,0,23,e1bde129,0,0,0,0,0,f,0008000,00,0,0
1161,1161,0,b3,ab7ae1c6,0,0,0,0,0,f,0008000,00,0,0
1162,1162,0,b2,bc39a641,0,0,0,0,0,f,0008000,00,0,0
1163,1163,0,aa,bc58bccc,0,0,0,0,0,f,0008000,00,0,0
1164,1164,0,9f,b37d0310,0,0,0,0,0,f,0008000,00,0,0
1165,1165,0,aa,bf71b39e,0,0,0,0,0,f,0008000,00,0,0
1166,1166,0,7f,b9eae097,0,0,0,0,0,f,0008000,00,0,0
1167,1167,0,a6,e264bfed,0,0,0,0,0,f,0008000,00,0,0
1168,1168,0,9b,e2b4e26b,0,0,0,0,0,f,0008000,00,0,0
1169,1169,0,a8,b80de30e,0,0,0,0,0,f,0008000,00,0,0
1170,1170,0,56,b465b77c,0,0,0,0,0,f,0008000,00,0,0
1171,1171,0,aa,b78bb7ad,0,0,0,0,0,f,0008000,00,0,0
1172,1172,0,85,e164e155,4,0,0,0,0,f,0008000,08,0,0
1173,1173,0,a6,e164e155,4,0,0,0,0,e,0008001,08,0,0
.
.
.
1212,1212,0,0d,e164e155,0,0,0,0,0,e,0008001,08,0,0
1213,1213,0,e3,e164e155,0,0,0,0,0,e,0008001,08,0,0
1214,1214,0,0e,b700b6ec,0,0,0,0,0,e,0008001,00,0,0
1215,1215,0,b4,b737b72c,0,0,0,0,0,e,0008001,00,0,0
1216,1216,0,65,e37be394,0,0,0,0,0,e,0008001,08,0,0
1217,1217,0,b7,e37be394,0,0,0,0,0,e,0008001,08,0,0
1218,1218,0,7c,79b86979,0,0,0,0,0,e,0008001,00,0,0
1219,1219,0,b7,2a7bb852,0,0,0,0,0,e,0008001,00,0,0
1220,1220,0,8b,bb117bba,0,0,0,0,0,e,0008001,00,0,0
1221,1221,0,b7,50bf7a7f,0,0,0,0,0,e,0008001,00,0,0
1222,1222,0,ad,e546afe8,0,0,0,0,0,e,0008001,00,0,0
1223,1223,0,b7,bfb37daf,0,0,0,0,0,e,0008001,00,0,0
1224,1224,0,00,64aed35a,0,0,0,0,0,e,0008001,00,0,0
1225,1225,0,b6,4e45b015,0,0,0,0,0,e,0008001,00,0,0
1226,1226,0,ec,d24f46c4,0,0,0,0,0,e,0008001,00,0,0
1227,1227,0,b7,d458454e,0,0,0,0,0,e,0008001,00,0,0
1228,1228,0,37,c1544144,0,0,0,0,0,e,0008001,00,0,0
1229,1229,0,b7,55504e49,0,0,0,0,0,e,0008001,00,0,0
1230,1230,0,2c,4e49a354,0,0,0,0,0,e,0008001,00,0,0
1231,1231,0,79,e164e155,4,0,0,0,0,e,0008001,08,0,0
1232,1232,0,b8,e164e155,4,0,0,0,0,d,0008002,08,0,0
.
.
.
1279,1279,0,4e,e164e155,0,0,0,0,0,d,0008002,08,0,0
1280,1280,0,49,e164e155,0,0,0,0,0,d,0008002,08,0,0
1281,1281,0,a3,44d45550,0,0,0,0,0,d,0008002,00,0,0
1282,1282,0,54,4552cd49,0,0,0,0,0,d,0008002,00,0,0
1283,1283,0,e1,454cc441,0,0,0,0,0,d,0008002,00,0,0
1284,1284,0,64,544f47d4,0,0,0,0,0,d,0008002,00,0,0
1285,1285,0,e1,ce5552cf,0,0,0,0,0,d,0008002,00,0,0
1286,1286,0,55,4552c649,0,0,0,0,0,d,0008002,00,0,0
1287,1287,0,45,524f5453,0,0,0,0,0,d,0008002,00,0,0
1288,1288,0,52,534f47c5,0,0,0,0,0,d,0008002,00,0,0
1289,1289,0,cd,4552c255,0,0,0,0,0,d,0008002,00,0,0
1290,1290,0,49,ce525554,0,0,0,0,0,d,0008002,00,0,0
1291,1291,0,45,53cd4552,0,0,0,0,0,d,0008002,00,0,0
1292,1292,0,4c,4fd04f54,0,0,0,0,0,d,0008002,00,0,0
1293,1293,0,c4,494157ce,0,0,0,0,0,d,0008002,00,0,0
1294,1294,0,41,414f4cd4,0,0,0,0,0,d,0008002,00,0,0
1295,1295,0,54,564153c4,0,0,0,0,0,d,0008002,00,0,0
1296,1296,0,4f,b465b77c,4,0,0,0,0,d,0008002,08,0,0
1297,1297,0,47,b465b77c,4,0,0,0,0,c,0008003,08,0,0
.
.
.
1336,1336,0,4f,b465b77c,0,0,0,0,0,c,0008003,08,0,0
1337,1337,0,4c,b465b77c,0,0,0,0,0,c,0008003,08,0,0
1338,1338,0,d4,524556c5,0,0,0,0,0,c,0008003,00,0,0
1339,1339,0,56,44d94649,0,0,0,0,0,c,0008003,00,0,0
1340,1340,0,41,4f50c645,0,0,0,0,0,c,0008003,00,0,0
1341,1341,0,53,5250c54b,0,0,0,0,0,c,0008003,00,0,0
1342,1342,0,c4,a3544e49,0,0,0,0,0,c,0008003,00,0,0
1343,1343,0,52,4e495250,0,0,0,0,0,c,0008003,00,0,0
1344,1344,0,45,4e4f43d4,0,0,0,0,0,c,0008003,00,0,0
1345,1345,0,56,53494cd4,0,0,0,0,0,c,0008003,00,0,0
1346,1346,0,c5,d24c43d4,0,0,0,0,0,c,0008003,00,0,0
1347,1347,0,44,53c44d43,0,0,0,0,0,c,0008003,00,0,0
1348,1348,0,d9,504fd359,0,0,0,0,0,c,0008003,00,0,0
1349,1349,0,46,4c43ce45,0,0,0,0,0,c,0008003,00,0,0
1350,1350,0,49,47c5534f,0,0,0,0,0,c,0008003,00,0,0
1351,1351,0,4f,454ed445,0,0,0,0,0,c,0008003,00,0,0
1352,1352,0,50,424154d7,0,0,0,0,0,c,0008003,00,0,0
1353,1353,0,c6,55504e49,4,0,0,0,0,c,0008003,08,0,0
1354,1354,0,45,55504e49,4,0,0,0,0,b,0008004,08,0,0
.
.
.

From this set of results we can see that we are coming pretty close to the theoretical bandwidth.
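One rough way to put numbers on a trace like the one above is to measure how many clock cycles pass between consecutive read transactions. The following is only a sketch in Python, not part of the FPGA design. It assumes that the first column is the clock cycle counter and that the column stepping 0008000, 0008001, 0008002, ... changes once per read transaction; both are readings of the excerpt rather than documented facts. The function names are made up for illustration, and the trace lines are copied from the capture above.

```python
# Lines copied from the capture above, taken around the points where the
# assumed per-transaction counter (column 11) changes value.
TRACE = """\
1172,1172,0,85,e164e155,4,0,0,0,0,f,0008000,08,0,0
1173,1173,0,a6,e164e155,4,0,0,0,0,e,0008001,08,0,0
1231,1231,0,79,e164e155,4,0,0,0,0,e,0008001,08,0,0
1232,1232,0,b8,e164e155,4,0,0,0,0,d,0008002,08,0,0
1296,1296,0,4f,b465b77c,4,0,0,0,0,d,0008002,08,0,0
1297,1297,0,47,b465b77c,4,0,0,0,0,c,0008003,08,0,0
1353,1353,0,c6,55504e49,4,0,0,0,0,c,0008003,08,0,0
1354,1354,0,45,55504e49,4,0,0,0,0,b,0008004,08,0,0
"""


def transaction_starts(trace, cycle_col=0, counter_col=11):
    """Return the cycle numbers at which the assumed counter column changes."""
    starts = []
    previous = None
    for line in trace.splitlines():
        fields = line.split(",")
        if len(fields) <= counter_col:
            continue  # skip "." filler lines and anything malformed
        counter = fields[counter_col]
        if previous is not None and counter != previous:
            starts.append(int(fields[cycle_col]))
        previous = counter
    return starts


def intervals(starts):
    """Cycles elapsed between each pair of consecutive transaction starts."""
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


starts = transaction_starts(TRACE)
gaps = intervals(starts)
print("transaction starts:", starts)
print("cycles between transactions:", gaps)
```

Running the same function over a full capture file instead of this short excerpt would show how consistent the spacing is over a longer period.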

In Summary

In this post we have discovered how to read from SDRAM to FPGA via the AXI port on the Zynq.

We also had a look at the bandwidth performance while reading. Reading data from an AXI-ACP port with a data bus width of 32 bits is probably not the cleverest thing to do, and you sacrifice a lot of the potential bandwidth in this way.

Our test on a 32-bit AXI_GP behaved more or less as expected and we got close to the theoretical bandwidth.

In both tests (i.e. AXI_ACP and AXI_GP) we experienced a latency of more or less 37 clock cycles per AXI read transaction. Your FIFO buffer should be made large enough so that this latency doesn't affect the bandwidth.
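To get a feel for what "large enough" means, here is a back-of-the-envelope sketch in Python. It estimates how many 32-bit words the pixel side drains from the FIFO while a read transaction is still waiting out its latency. Only the 37-cycle latency comes from the measurements above. The clock frequencies and the number of pixels per word are hypothetical placeholders, and the function name is made up for illustration, so substitute the values from your own design.

```python
def min_fifo_words(latency_cycles, axi_clk_hz, pixel_clk_hz, pixels_per_word):
    """Words consumed by the pixel side during one AXI read latency.

    latency_cycles  : read latency, counted in AXI clock cycles
    axi_clk_hz      : AXI clock frequency (hypothetical value in the example)
    pixel_clk_hz    : pixel clock frequency (hypothetical value in the example)
    pixels_per_word : pixels packed in one AXI data word (assumption)
    """
    # Pixels drained during the latency window, expressed as a fraction:
    #   latency_cycles * pixel_clk_hz / axi_clk_hz
    # Dividing by pixels_per_word converts pixels into FIFO words.
    numerator = latency_cycles * pixel_clk_hz
    denominator = axi_clk_hz * pixels_per_word
    # Integer ceiling division, so we never round the requirement down.
    return -(-numerator // denominator)


# Example with placeholder numbers: 37 cycles of latency (measured above),
# a 100 MHz AXI clock, an 85 MHz pixel clock and two 16-bit pixels per word.
words_needed = min_fifo_words(
    latency_cycles=37,
    axi_clk_hz=100_000_000,
    pixel_clk_hz=85_000_000,
    pixels_per_word=2,
)
print("FIFO must hold at least", words_needed, "words when a read is issued")
```

This is only a lower bound on the fill level at the moment the next read is requested. In practice you would add some margin for the occasional extra delay cycles seen in the traces.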

In the next post we will see if we can continuously read a picture frame from SDRAM and display it on the VGA screen.

 Till next time!
