Tuesday, 13 November 2018

Getting the cursor to flash and Keyboard Interaction


Welcome back! It has been a while since I played around with my C64 FPGA implementation, so I first had to familiarise  myself with where I was and where do I want to go 😊

In the previous post I managed to show the C64 Welcome message on the VGA screen, but without any flashing cursor.

In this post we will be implementing the flashing cursor, as well as implementing keyboard interaction.

Implementing the flashing cursor

To implement the C64 flashing cursor on our FPGA implementation, we should just ensure we interrupt our 6502 at a regular interval of 60 times per second.

To do this, we first need to implement a counter for this:

    always @(posedge clk_1_mhz)
    if (c64_reset | counter_60_hz == 0)
      counter_60_hz <= 16666;
      counter_60_hz <= counter_60_hz - 1;

Here we have created a counter counting down from 16666 to zero and then reloading it with a value of 16666. Since we are clocking it with the 1 MHz signal, the counter will underflow at a rate of 60Hz, which is what we need.

We will use this clock to generate our interrupt signal:

    always @(posedge clk_1_mhz)
    if (c64_reset)
      int_occ <= 0;
    else if (counter_60_hz == 0)
      int_occ <= 1;
    else if (addr == 16'hdc0d & we == 0)               
      int_occ <= 0;     

As you can see we are setting int_occ to a one when our counter reaches zero. At this stage we should mimic CIA 6526 behaviour, meaning that once an interrupt happens the interrupt status for this interrupt should remain set until cleared by software.

The interrupt status gets cleared by simply reading the interrupt status register and this is done with the else statement  else if (addr == 16'hdc0d & we == 0).

Great, we can now generate an interrupt 60 times per second and all we need to do is hooking up this signal to our 6502 core:

    cpu mycpu ( clk_1_mhz, c64_reset, addr, 
                          combined_d_out, ram_in, we, int_occ, 1'b0, 1'b1 );

That is all there is to get the cursor flashing.

Let us now continue to implement keyboard interaction.

Keyboard Interaction

Within a real C64 the keys of its keyboard is arranged electrically as an 8x8 square matrix.

This 8x8 matrix in turn is hooked up to Port A and Port B of CIA#1.

Port A energises specific rows in the 8x8 matrix and Port B can see which keys within the energised row is either open or closed.

The following diagram gives an idea of how the keys is arranged within the matrix:

Image result for c64 keyboard matrix

On the top right you can get an idea of how the keyboard connector looks like.

I am not in possession of a real C64, so we will need to make use of a USB keyboard and take the keystrokes and emulate C64 keystrokes.

So, how do we go about with this keyboard emulation? Well, firstly if you have a look at the diagram above, you will see all the keys is numbered from 0 to 63.

This gives us 64 possible keys, each one that can be either on or off. Each key can therefore be thought of as a bit.

Thinking of the memory space of our two ARM cores living on the ZYNQ, each memory location is 32 bits wide. Thus, we could fit all the possible C64 keys within two memory locations!

Is is then up to us to write a program running on the ARM processor, fetching keystrokes from the USB keyboard, and toggling the desired bits within above mentioned two memory locations to emulate the desired C64 key presses.

The previous paragraph sounds like a mouthful, so let us try to break it down a little. We need to achieve the following:

  • Interface with the USB keyboard and interpret keystrokes
  • Enable our C64 module to receive data from one of the ARM cores, also located on the ZYNQ
  • Emulate the C64 keypress with the data we received from one of the ARM cores.
To interface with the USB keyboard can be quite challenging task. At this point in time I don't want to elaborate too much, but towards the end of this  post I will reveal a plan of action to get to a point of getting input from a USB keyboard 😀

Looking at the second point. In order for our C64 module to receive information from a ARM core, we need to engage the road of AXI Slave interfaces. We will cover this a bit a later in this post.

We will the third point, C64 keypresses emulation in the next section.

C64 Keypress emulation

Let us now continue to implement the C64 key press emulation.

We start off by adding two ports to our C64 module:

module block_test(
  input clk,
  input axi_clk_in,
  input proc_rst,
  output proc_rst_neg,
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs,
  input wire [31:0] slave_0_reg, 
  input wire [31:0] slave_1_reg

These two ports are the two words as mentioned in the previous section, where each bit corresponds to a key on the C64 keyboard.

The following diagram gives an explanation of each bit. The number in each bit position is the relevant C64 scancode.

We will leave the connection of these two ports to the outside world for another section.

Next, we should split these 2 words into separate rows:

    wire [7:0] keyboard_row_0;
    wire [7:0] keyboard_row_1;
    wire [7:0] keyboard_row_2;
    wire [7:0] keyboard_row_3;
    wire [7:0] keyboard_row_4;
    wire [7:0] keyboard_row_5;
    wire [7:0] keyboard_row_6;
    wire [7:0] keyboard_row_7;     
    assign keyboard_row_0 = slave_0_reg[7:0];
    assign keyboard_row_1 = slave_0_reg[15:8];
    assign keyboard_row_2 = slave_0_reg[23:16];
    assign keyboard_row_3 = slave_0_reg[31:24];
    assign keyboard_row_4 = slave_1_reg[7:0];
    assign keyboard_row_5 = slave_1_reg[15:8];
    assign keyboard_row_6 = slave_1_reg[23:16];
    assign keyboard_row_7 = slave_1_reg[31:24];

We should now think how should emulate the keyboard behaviour. Remember, port A on CIA#1 energise the applicable row, and we read the result via port B.

So, for starters we should capture 6502 writes to port A, which is address $DC00:

    reg [7:0] keyboard_control_byte;
    always @(posedge clk_1_mhz)
    if (addr == 16'hdc00 & we)
      keyboard_control_byte <= ram_in;

Next, we should simulate the value for Port B, which is address $DC01:

    wire [7:0] keyboard_result_byte;
    assign keyboard_result_byte = (~keyboard_control_byte[0] ? keyboard_row_0 : 0) |           
                                  (~keyboard_control_byte[1] ? keyboard_row_1 : 0) |
                                  (~keyboard_control_byte[2] ? keyboard_row_2 : 0) |
                                  (~keyboard_control_byte[3] ? keyboard_row_3 : 0) |
                                  (~keyboard_control_byte[4] ? keyboard_row_4 : 0) |
                                  (~keyboard_control_byte[5] ? keyboard_row_5 : 0) |
                                  (~keyboard_control_byte[6] ? keyboard_row_6 : 0) |
                                  (~keyboard_control_byte[7] ? keyboard_row_7 : 0);
    always @*
        casex (addr_delayed)
          16'b101x_xxxx_xxxx_xxxx : combined_d_out = basic_out;
          16'b111x_xxxx_xxxx_xxxx : combined_d_out = kernel_out;
          16'hd012: combined_d_out = line_counter;
          16'hdc01: combined_d_out = ~keyboard_result_byte;
          default: combined_d_out = ram_out;

I might be worthwhile to mention that we are working here with active when high logic, because it makes live easier. You might recall though that the C64 works with active low logic.

So to work between these two worlds, we negate the value received from port A, and when you send the calculated value back to port B we negate it again.

Connecting to the outside World

With the changes perfoemd to our C64 module, we need a way to interface to an ARM core. This is where we need to work again with AXI's.

In previous posts we worked a couple of times with AXI's. The AXI's we worked with previously were all AXI Masters.

We defined an AXI Master for writing frames produced by our VIC-II to SDRAM. We also defined an AXI Master for reading back these frames from SDRAM by our VGA module and generating a VGA signal for displaying these frames on screen.

An AXI Master can be seen as a source for generating memory requests.

In our case where we need to receive data from an ARM core, we need something the opposite, which is receiving memory orders. An AXI peripheral receiving memory orders, is called an AXI Slave.

You can create an AXI block which contain both an AXI slave and an AXI master. The following is an example:

This is within our existing design.

Marked in green is the new AXI slave port called S00_AXI and in red is our existing AXI Master port called M00_AXI.

You will also see that I have also hooked the two slave port as indicated also in green. To enable these two ports on the AXI block, I had to some custom code changes which I will cover now.

With our AXI open in IP Packager, scroll to the user port section and change it as follows:

 // Users to add ports here
        input wire [31:0] ip2bus_mst_addr,
        input wire [11:0] ip2bus_mst_length,
        input wire [31:0] ip2bus_mstwr_d,
        input wire [4:0] ip2bus_inputs,
        output wire [5:0] ip2bus_otputs,
        output wire [31:0] slave_reg_0,
        output wire [31:0] slave_reg_1,
 // User ports ends

One of the things you will realise when you configure an AXI module to have an AXI slave interface, is that an AXI slave module will automatically be created and an instance be created within the top module. The instance within the top module will look something like the following:

 myip_burst_test_v1_0_S00_AXI # ( 
 ) myip_burst_test_v1_0_S00_AXI_inst (

Let us now have a look at the code for this module, looking only at the interesting parts of the code, though.

You will see that there is a couple of slave registers defined within this module:

 reg [C_S_AXI_DATA_WIDTH-1:0] slv_reg0;
 reg [C_S_AXI_DATA_WIDTH-1:0] slv_reg1;
 reg [C_S_AXI_DATA_WIDTH-1:0] slv_reg2;
 reg [C_S_AXI_DATA_WIDTH-1:0] slv_reg3;

These registers basically forms the heart of the AXI slave interface. In effect it is these registers that will map at a specific address in address space, and if one of the ARM cores to a write to this address range, the contents of the write will end off in one of these slave registers.

It is the content of these registers which we want to propogate to our C64 module to inform it which key was pressed. More on this later.

The following snippet is also interesting:

 always @( posedge S_AXI_ACLK )
   if ( S_AXI_ARESETN == 1'b0 )
       slv_reg0 <= 0;
       slv_reg1 <= 0;
       slv_reg2 <= 0;
       slv_reg3 <= 0;
   else begin
     if (slv_reg_wren)
         case ( axi_awaddr[ADDR_LSB+OPT_MEM_ADDR_BITS:ADDR_LSB] )
             for ( byte_index = 0; byte_index <= (C_S_AXI_DATA_WIDTH/8)-1; byte_index = byte_index+1 )
               if ( S_AXI_WSTRB[byte_index] == 1 ) begin
                 // Respective byte enables are asserted as per write strobes 
                 // Slave register 0
                 slv_reg0[(byte_index*8) +: 8] <= S_AXI_WDATA[(byte_index*8) +: 8];
             for ( byte_index = 0; byte_index <= (C_S_AXI_DATA_WIDTH/8)-1; byte_index = byte_index+1 )
               if ( S_AXI_WSTRB[byte_index] == 1 ) begin
                 // Respective byte enables are asserted as per write strobes 
                 // Slave register 1
                 slv_reg1[(byte_index*8) +: 8] <= S_AXI_WDATA[(byte_index*8) +: 8];
             for ( byte_index = 0; byte_index <= (C_S_AXI_DATA_WIDTH/8)-1; byte_index = byte_index+1 )
               if ( S_AXI_WSTRB[byte_index] == 1 ) begin
                 // Respective byte enables are asserted as per write strobes 
                 // Slave register 2
                 slv_reg2[(byte_index*8) +: 8] <= S_AXI_WDATA[(byte_index*8) +: 8];
             for ( byte_index = 0; byte_index <= (C_S_AXI_DATA_WIDTH/8)-1; byte_index = byte_index+1 )
               if ( S_AXI_WSTRB[byte_index] == 1 ) begin
                 // Respective byte enables are asserted as per write strobes 
                 // Slave register 3
                 slv_reg3[(byte_index*8) +: 8] <= S_AXI_WDATA[(byte_index*8) +: 8];
           default : begin
                       slv_reg0 <= slv_reg0;
                       slv_reg1 <= slv_reg1;
                       slv_reg2 <= slv_reg2;
                       slv_reg3 <= slv_reg3;

This code starts off by saying that at a reset, all slave registers is been initialised to a zero.

During a write operation the applicable slave register gets written in the case statement.

What remains to be done is to surface the contents of the first two slave registers as two output ports in this module:

 // Users to add ports here
        output wire [31:0] slave_reg_0,
        output wire [31:0] slave_reg_1,
 // User ports ends
 // Add user logic here
    assign slave_reg_0 = slv_reg0;
    assign slave_reg_1 = slv_reg1;

 // User logic ends

These ports should then be connected all the way to the top module, which we can then connect to our C64 module.

With the ports added and everything hooked up, you will see in the address editor a section in address space is reserved for this slave interface:

With this address map, when have a program running on the ARM core and it writes to either address 0x43c0_0000 or 0x43c0_0004, the content will arrive at our C64 module at the two slave ports.

The Test Program

With our block design completed, we need to write a small C program that will run on one of the ARM cores to test the design.

This test program should basically set one of the bits in the two slave registers to trigger a simulated keypress.

For the program, the following main method will do:

int main()
    return 0;

This program sets bit 8 of the slave register. Bit 8 in this register corresponds to scancode 8, and should therefore type a '3' on the C64 screen.

A Test Run

Time to do a quick Test Run. The following video shows the result:

The video starts off with a flashing cursor and shortly afterwards a '3' gets printed when the program executes.

It works!

In Summary

In this post we managed to implement the flashing cursor as well as implementing key press simulation.

Well, obviously to make our live easier it would be nice to capture keystrokes from a real keyboard, which in our case would be an USB keyboard.

This is where things can really get interesting.

Firstly, we can make our life easy by installing PetaLinux on our Zybo board. This is a version of Linux and have drivers that will take care of all USB communications and detecting keyboard strokes for us.

When running the Zybo board in Standalone mode, you cannot make use of these USB/keyboard drivers and you will need to develop something yourself.

This is where my Hacker instinct starts to kick in and the eagerness to learn how stuff works that we all takes for granted.

This is an excellent opportunity to learn how USB works, so I thought of dedicating a couple of posts on developing a stripped down USB protocol stack.

So, in the next post I will spend some time on a bit of theory on how USB communications work and then take it from there.

Till next time!

Saturday, 25 August 2018

Displaying the C64 welcome screen on VGA screen


In the previous post we managed to display a static frame stored in SDRAM to a VGA screen.

In this post we are going to take it one step forward and try to display frames rendered from our VIC-II module to a VGA screen.

To View a Video for this Post...

This video explains how to modify our current C64 design to take output the frames from our VIC-II module and display it on a VGA screen.

For a more in detail discussion on the contents of this video, please continue reading...

A recap on the current state of our C64 design

It has been some time since we worked on our C64 design.

In the last couple of posts all efforts was diverted into developing functionality for displaying frames stored in SDRAM to a VGA screen.

Within goal accomplished in the previous post, I think it is time we focus again on our C64 design.

So, let us start by refreshing our minds where we ended off with our C64 design.

When we last worked on the C64 design, it was in a state where we could generate a VIC-II frame with the C64 welcome screen, and store it in SDRAM.

So, with the functionality from the previous post where we could display a frame from SDRAM to a VGA screen, we should be able see the C64 welcome screen displayed on a VGA screen.

One things to mention though is that our design from the previous post expects frames to be at the exact resolution of the VGA monitor you are using. Our C64 design, however, produces frames at a much lower resolution (e.g. something like 404x284).

We will therefore need to modify our VGA block to cater for the resolution produced by our C64 design.

Modifying the VGA output block

As mentioned in the previous section, our C64 block produces frames that is lower in resolution than a typical recent VGA monitor.

On the over hand, since most recent VGA monitors are LCD's, it is best to produce VGA signals having the native resolutions of the VGA screen in question. It just looks better on these screens in native resolutions.

In this post we will therefore output a signal at native resolution and display the C64 output frames in small section on the screen.

This requirement requires a couple of changes to our VGA block design.

The first change is within our Asynchronous FIFO that buffer that buffers pixel data from the AXI clock domain to the pixel clock domain:

     //Reading port
     .ReadEn_in((vert_pos_next > 100)  & (vert_pos_next < 384) &
                                (horiz_pos_next > 100) & (horiz_pos_next < 505)),
     //Writing port.  
     .WriteEn_in(/*state != GET_SET*/(state_shift_reg == STATE_16_SHIFT_STORED | state_shift_reg == STATE_16_SHIFT_SHIFTED) & !buffer_full),
     .Clear_in(/*state == RESET_CYCLE)*/trigger_restart_state == RESTART_STATE_RESTART)

With this change we only enable reads from this FIFO when our VGA signal is at the visible region on the screen.

The visible region is between vertical position 100 and 384 and between horizontal position 100 and 505. This will give as the small 404x284 C64 window on the screen.

We would like our small C64 window to be surrounded by a black border on the screen. To that we need to do the following modifications:

assign out_pixel_buffer_final = (vert_pos > 100)  & (vert_pos < 384) &
                                (horiz_pos > 100) & (horiz_pos < 505)
                                ? out_pixel_buffer : 0;

With this change we output the pixel data from our FIFO if it is within the visible region. For all other positions we output a black pixel.

Adding VGA output to our C64 design

Let us now add VGA output to our C64 design. We do this by first adding the VGA block mentioned in the previous section.

Next, we add the AXI Read block we have developed and used in the last couple of posts.

We will wire up this VGA and AXI read block in a similar way as described previously. Care should however be taken with the AXI Master output port of our AXI read block.

When adding the VGA functionality to our existing C64 design, we will have indeed have two AXI master ports we will need to hookup, whereas our current Processing system only have one AXI slave port configure.

We will therefor need to twig our design a bit to cater for the two master ports.

Start off by removing all AXI helper blocks. I have highlighted them in the following picture:

With these blocks removed, we need to configure our Processing block by double clicking on it, and selecting an extra GP port:

With this option selected, you will see an extra AXI GP port on the processing block:

You can now make use of the designer assistance provided by the IDE to wire up both GP ports.

Just after the wiring up, you will most probably get the following warning:

The IDE will nonetheless allow you to continue, but you will eventually be stopped with an actual error during either the Synthesis or Bitstream generation process. So it is better to try and resolve the warning at this point.

Let us see if we can resolved this warning. We start by opening the Address editor tab within Block design window:

The values in the address editor is used to configure the AXI Helper blocks to decide to which GP port to forward a particular address. But, there is a bit of duplicate mappings here. For instance, the 512MB block range starting at address 0x0 is mapped to both GP0 and GP1.

We can simplify the mapping the following:

Now we have GP0 dedicated to our AXI write block and GP1 dedicated to our AXI read block.

With these changes our design should Synthesise and generate the BitStream without any errors.

The End result

With everything started up, the screen looks as expected:

This is only the Welcome screen with no Flashing cursor.

In Summary

In this post we managed to display the frames produced from our VIC-II module to the VGA screen.

In the next post we will attempt to implement the flashing cursor.

Till next time!

Tuesday, 10 July 2018

Fixing the Non static frame


In the previous post we attempted to view random data in SDRAM as a static frame on the VGA screen.

We, however, ended with a random alternating pattern displayed instead of a random static pattern.

In this post we will attempt to fix this anomaly.

To view a Video of this Blog...

This video explains with the help of a Xilinx community post that the cause of the non static frame displayed was likely caused by the asynchronous FIFO implementation used. I also show how I apply the suggestions from the Community post to my existing post in order to fix the problem...

If you rather prefer the written version together with a discussion on the actual changes to the Verilog code, please continue reading...

Some help from a old community post

The anomaly encountered in the previous post really baffled me, and I didn't know where to actually start looking for the cause of the problem. So, I consulted the Internet...

In my searching I came across the following  post on a Xilinx Community forum:

Interesting thing here is that the community member  that posted the query used exactly the same implementation I used from the Asic-World website:

The member was experiencing some serious timing violations when trying  to synthesise the design.

The key to the solution was provided by the community member with the nickname Avrumw. He pointed out that in the comments of the mentioned design it was suggested that the design follows some recommendations from a Xilinx article.  Avrumw, however had some serious doubts whether Xilinx would make some of these suggestions at all because of the following (I am quoting from Awrumw's answer):
  - It uses a latch
  - It uses the asynchronous preset/clear inputs of flip-flops for part of its functionality
The suggestions Awrum gave to avoid these practices was the following:

  - infer the RAM
  - use Gray counts for bringing addresses between domains
  - use standard "two back to back flip-flop" synchronizers (with the ASYNC_REG property set) to move the Gray coded read pointer into the write domain (for generating full) and the Gray coded write pointer into the read domain (for generating empty)
Admittedly, the design on Asic World did indeed made use of Gray Counters.

Let us know proceed and see if we can apply these suggestions to our design

Applying the suggestions to our design

From the suggestions, the first thing I am going to do, is make use of back-back flip synchronizers.

We will need a set of two of these back-back synchronizers. One for passing pNextWordToWrite to the read side and another one for passing pNextWordToRead to the write side.

These synchronizers will be defined as follows:

   (* ASYNC_REG = "TRUE" *)  reg [3:0] synchro_write_side_0, synchro_write_side_1; 
   (* ASYNC_REG = "TRUE" *) reg [3:0] synchro_read_side_0, synchro_read_side_1;

The ASYNC_REG annotation will ensure that the flip-flops for each synchroniser set will be placed closed to each other when synthesising the design.

These flip-flops will be assigned as follows:

//write synchroniser
     always @(posedge WClk) 
       synchro_write_side_0 <= pNextWordToRead;
       synchro_write_side_1 <= synchro_write_side_0;

//read synchroniser
     always @(posedge RClk) 
       synchro_read_side_0 <= pNextWordToWrite;
       synchro_read_side_1 <= synchro_read_side_0;

Please take note that each synchroniser gets clocked by a different clock.

Let us now see where these synchronisers will get used. Before we continue, I would just like to mention that I had to deuplicate the code for tboth the write side and the read side. So, let us first  start with the code on the write side:

//Empty/Full Handling on Write Side
    //'EqualAddresses' logic:
    assign EqualAddresses_write_side = (pNextWordToWrite == synchro_write_side_1);

    //'Quadrant selectors' logic:
    assign Set_Status_write_side = (pNextWordToWrite[ADDRESS_WIDTH-2] ~^ synchro_write_side_1[ADDRESS_WIDTH-1]) &
                         (pNextWordToWrite[ADDRESS_WIDTH-1] ^  synchro_write_side_1[ADDRESS_WIDTH-2]);
    assign Rst_Status_write_side = (pNextWordToWrite[ADDRESS_WIDTH-2] ^  synchro_write_side_1[ADDRESS_WIDTH-1]) &
                         (pNextWordToWrite[ADDRESS_WIDTH-1] ~^ synchro_write_side_1[ADDRESS_WIDTH-2]);

Here we have replaced all instances of pNextWordToRead with synchro_write_side_1.

Next, let us get rid of the transparent latch. First, let us have a look at the original code that inferred a transparent latch:

    //'Status' latch logic:
    always @ (Set_Status, Rst_Status, Clear_in) //D Latch w/ Asynchronous Clear & Preset.
        if (Rst_Status | Clear_in)
            Status = 0;  //Going 'Empty'.
        else if (Set_Status)
            Status = 1;  //Going 'Full'.

If you look closely at the code, you will identify many scenarios where there will be no assignment. In those scenarios we need to revert to one or other previous stored state. For this reason the above code will be inferred as a transparent latch.

To eliminate the need for a transparent latch we need to split the above into pieces that will infer into a pure computational logic block and a storage element. The result is as follows:

    always @*            
        if (Rst_Status_write_side | Clear_in)
          Status_write_side = 0;  //Going 'Empty'.
        else if (Set_Status_write_side)
          Status_write_side = 1;  //Going 'Full'.
          Status_write_side = Status_write_prev_side; 
    always @(posedge WClk)
         Status_write_prev_side <= Status_write_side; 

So, we have a pure storage element Status_write_prev_side that store the contents of the computational block  Status_write_side at each clock cycle. So, in the case where there is no assignment happening for Status_write_side, we can just output the value of Status_write_prev_side.

Next, let us see what we can do to eliminate the need for a flip flop with an asynchronous preset. First, let us look again at the original code that will infer a flip-flop with an asynchronous preset:

    //'Full_out' logic for the writing port:
    assign PresetFull = Status & EqualAddresses;  //'Full' Fifo.
    always @ (posedge WClk, posedge PresetFull) //D Flip-Flop w/ Asynchronous Preset.
        if (PresetFull)
            Full_out <= 1;
            Full_out <= 0;

Looking at this piece of code, one can immediately see why they needed to use an asynchronous flip-flop. In deriving PresetFull we had to use some values that gets assigned in the read clock domain. So, it would make sense to trigger the assignment the moment PresetFull transitions from a zero to a one rather than waiting for the Wclk to transition. In this way we can avoid a setup and hold violation.

However, with Xilinx FPGA's we still try and avoid these asynchronous presets. Since we safely moved over pNextWordToRead from the read domain to the write domain, we don't need such manoeuvres. So, the assignment of Full_out, just simplifies to:

   assign PresetFull_write_side = Status_write_side & EqualAddresses_write_side;  //'Full' Fifo.
   assign Full_out = PresetFull_write_side;             

This takes care of the Full indicator on the write side. For the empty indicator that is used on the read side, we have a similar set of code:

//Empty/Full Handling on Read Side
    //'EqualAddresses' logic:
assign EqualAddresses_read_side = (synchro_read_side_1 == pNextWordToRead);

//'Quadrant selectors' logic:
assign Set_Status_read_side = (synchro_read_side_1[ADDRESS_WIDTH-2] ~^ pNextWordToRead[ADDRESS_WIDTH-1]) &
                     (synchro_read_side_1[ADDRESS_WIDTH-1] ^  pNextWordToRead[ADDRESS_WIDTH-2]);
assign Rst_Status_read_side = (synchro_read_side_1[ADDRESS_WIDTH-2] ^  pNextWordToRead[ADDRESS_WIDTH-1]) &
                     (synchro_read_side_1[ADDRESS_WIDTH-1] ~^ pNextWordToRead[ADDRESS_WIDTH-2]);
                     //reg                                 Status_write_side, Status_write_prev_side;
//'Status' latch logic:
always @*            
    if (Rst_Status_read_side | Clear_in)
      Status_read_side = 0;  //Going 'Empty'.
    else if (Set_Status_read_side)
      Status_read_side = 1;  //Going 'Full'.
      Status_read_side = Status_read_prev_side; 
always @(posedge RClk)
     Status_read_prev_side <= Status_read_side; 
//'Full_out' logic for the writing port:
assign PresetEmpty_read_side = ~Status_read_side & EqualAddresses_read_side;  //'Full' Fifo.

assign Empty_out = PresetEmpty_read_side;


This is all the changes required to our design

The Results

I can confirm that the mentioned changes did in fact solve my issue and a static random pattern was displayed on screen.

I wanted to show a picture in this post on how the screen looks like with these changes, but the photo is not very clear. I better exercise would be to display a meaningful photo on the VGA screen.

To do this exercise we will make use of the XSCT console to write the contents of a image file  to the SDRAM of the ZYBO board.

Needless to say, this image file will need to contain raw pixel data in the format RGB565. The file format that comes close this is Microsoft's BMP format. Interesting enough, GIMP allows us to create a BMP file in the RGB565 format.

To do this open up the image you want to convert in GIMP and then select File/Export As.

Give a filename, suffix it with a .bmp extension and hit export. Specify the options in the option window as follows and hit the export button again:

We will then use this file and write its contents to the SDRAM of the Zybo board. You should remember though that the image file doesn't start with raw image straight away, but rather from byte offset 0x46 as deduced via the following article on Wikipedia:

So, because our image frame starts at address 0x200000 in Zybo SDRAM we should write our file at the address starting at 0x200000 - 0x46 to account for the header. Thus, we should write our file to SDRAM starting at address  0x1fffba.

With our Zybo board programmed and a program been kicked off via the Xilinx SDK, we should issue the following command via the XSCT console:

mwr -size b -bin -file /home/johan/Downloads/bm1360.bmp 0x1fffba 3000000

Obviously you need to specify your own file name.

With the image data written our VGA display looks as follows:

Static image indeed.

In Summary

In this post we fixed the issue where a non static image was shown onscreen.

This issue was caused by the following unsafe practices :

  • Using transparent latches
  • Using flip-flops with asynchronous presets.
In the next post we will attempt to display the output from our VIC-II to the VGA screen.

Till next time!

Tuesday, 12 June 2018

Displaying Frames from SDRAM to VGA


In the previous post we managed to read a long stream of data sequentially from SDRAM via the AXI sub system.

In the blog post prior that we played around passing data over a cross clock domain (e.g. generating data in a 100MHz domain and passed it to the 85MHz pixel clock domain). Specifically for this, we used the Asynchronous FIFO implementation of Alex Claros F published on the AsicWorld website. 

In this post we will try and bring together the contents of the previous posts, that is reading a video video frame from SDRAM, and passing to the VGA pixel clock domain in order to display it on a VGA LCD monitor.

Also, in this post I started playing the idea to also include a Youtube video outlining the contents covert in the Blog post.

I am going to be frank and admit that I am a total Rookie in making Youtube Videos, so please forgive me for blunders made in the video 😊

To view a Video of this Blog...

This video gives an overview of this Blog Post, as well as a practical session in Vivado on the changes needed to the existing design.

If you rather prefer the written version together with a discussion on the actual changes to the Verilog code, please continue reading...


The following diagram shows an overview of what we want to achieve in this post:

We will be receiving the frame data from SDRAM via the AXI subsystem.

We will buffer the data coming from the AXI subsystem in a FIFO. This is the same FIFO we used as mentioned in the previous post.

I might have not put emphases on this previously, but a special feature of this FIFO is that it provides a half full indicator in addition to the Full and Empty indicators you will find with most FIFO implementations.

I could probably have build the Half full functionality into out existing Asynchronous Pixel FIFO and thereby eliminated the need for a second FIFO.

However, this would mean more functionality living in the cross clock domain, which would potentially add to more potential frustrations, debugging setup and hold violations.

You will also see a kind of a shift register living between the two FIFO's. The reason for this is beacuse data comes in as 32-bit words, whereas every pixel is only 16-bits is size. So in effect we have two pixels per 32-bit word.

It is the sole responsibility of this shift register to split the 32-bit words into individual pixels.

The shift register shifts 16-bits to the left at a time. The upper 16-bits provides our pixel data, which is buffered within our Asynchrous FIFO.

To Read/Write or not to Read/Write

Reading and writing to and from our two FIFO's at the right moment is crucial to avoid buffer overflows or buffer underflows.

In addition, these reading and writing patterns also effect the operation of our shift register between these FIFO's.

All in all we should attempt to always keep our AXI buffer at least half full and the asynchronous FIFO completely full.

Let us have a quick look at a outline of Verilog code that effects the reading and writing to these two FIFO buffers:

assign read_from_axi = !axi_buffer_empty & !buffer_full & ? 1 : 0;

     //Reading port
burst_read_block my_read_block(


The read_from_axi wire comtrols reads from our AXI FIFO. Obviously we only read from this FIFO if it is not empty and our Asynchrous FIFO is not full.

As mentioned, the shift register also plays an important role in reading and writing to the FIFO buffers, so let us have a look at the implementation of this shift register:

reg [31:0] shift_reg_16_bit;

parameter STATE_16_SHIFT_IDLE = 2'd0;
parameter STATE_16_SHIFT_STORED = 2'd1;
parameter STATE_16_SHIFT_SHIFTED = 2'd2;

always @(posedge clk_axi)
  if (read_from_axi)
    shift_reg_16_bit <= {axi_read_data[15:0], axi_read_data[31:16]};
  else if (state_shift_reg == STATE_16_SHIFT_STORED & !buffer_full)
    shift_reg_16_bit <= {shift_reg_16_bit[15:0], 16'b0};

always @(posedge clk_axi)
  case (state_shift_reg)
    2'd0: state_shift_reg <= !axi_buffer_empty & !buffer_full ? STATE_16_SHIFT_STORED : STATE_16_SHIFT_IDLE;      
    2'd1: state_shift_reg <= buffer_full ? STATE_16_SHIFT_STORED : STATE_16_SHIFT_SHIFTED;
                   if (buffer_full)
                     state_shift_reg <= STATE_16_SHIFT_SHIFTED;
                   else if (axi_buffer_empty)
                     state_shift_reg <= STATE_16_SHIFT_IDLE;
                     state_shift_reg <= STATE_16_SHIFT_STORED;      

As you can see, the state machine plays an important role in the operation of the shift register. We start off with an IDLE state. Once our AXI FIFO has data and our Asynchrous buffer is not full we transition to the SHIFT stored state. This instructs the Shift register to store a value from the AXI FIFO rather than shifting.

The Shift register performs a shift operation when we transition to the SHIFTED state.

While our two buffers is in the right state our SHift register will continue the operation of loading a value from AXI buffer followed by a shifting operation.

This state machine is also crucial for the opeation of other parts in the system as highlighted below:

assign read_from_axi = !axi_buffer_empty & !buffer_full & (state_shift_reg == STATE_16_SHIFT_IDLE | state_shift_reg == STATE_16_SHIFT_SHIFTED) ? 1 : 0;

     //Reading port
     .WriteEn_in(state_shift_reg == STATE_16_SHIFT_STORED | state_shift_reg == STATE_16_SHIFT_SHIFTED) & !buffer_full),


Preparing for the next Frame at VSYNC

When we finished drawing a frame on the screen, it is time for us to prepare our system for the next frame.

This preparation consists out of a couple of things:

  • Clearing all FIFO buffers
  • Allow time to finish all AXI transactions currently busy
  • Reset the pointer of the next address to be read from SDRAM to the beginning og the frame
  • Refilling the FIFO's with pixel data
We should allow sufficient time for this preparation. An ideal event to trigger this preparation is when a VSYNC signal occurs. This will allow more than enough to finish all prep for the next frame. 

This whole prep process will also be driven by a state machine:

always @(posedge clk_axi)
  if (trigger_restart_state == RESTART_STATE_WAIT)
    restart_counter <= 400;
  else if (trigger_restart_state == RESTART_STATE_RESTART)
    restart_counter <= restart_counter == 0 ? 0 : restart_counter - 1;
always @(posedge clk_axi)
  case (trigger_restart_state)
    RESTART_STATE_WAIT : trigger_restart_state <= vert_sync_delayed_5 ? RESTART_STATE_RESTART : RESTART_STATE_WAIT;
    RESTART_STATE_RESTART : trigger_restart_state <= restart_counter == 0 ? RESTART_STATE_END : RESTART_STATE_RESTART;   
    RESTART_STATE_END : trigger_restart_state <= vert_sync_delayed_5 ? RESTART_STATE_END : RESTART_STATE_WAIT;   

When the VSYNC pulse is encountered the prep prcess will last for 400 AXI clock cycles. This is enough time for initialisation for the next frame as well as enough time for any AXI transactions that is in process to complete.

Let us have a quick look at what is effect by the prep for next frame:

     //Reading port
     .Clear_in(trigger_restart_state == RESTART_STATE_RESTART),

burst_read_block my_read_block(
          .restart(trigger_restart_state == RESTART_STATE_RESTART),

The End Results

A simple test for our design is just to start up the Zybo board without any picture frame loaded in SDRAM.

When you power SDRAM without initialising the contents, it will contain a random sequence of bytes which will result in a noisy pattern on screen.

The big test here is that the noise pattern should be static and not alternating. An alternating noise pattern would indicate some buffer underflow conditions happening.

Here is a frame I have captured from the video:

There is a random pattern indeed. However, it is alternating!

So, we need to spend some time troubleshooting.

We will leave this investigation for the next post.

In Summary

In this post we have attempted to display a frame stored in SDRAM to VGA.

Our ultimate test was to display a static noise pattern on screen. In the end we managed to get a noise pattern displayed on screen, but in an alternating fashion.

In the next post we are going to investigate what causes the noise pattern to alternate and attempt to fix it.

Till next time!

Sunday, 13 May 2018

Reading From SDRAM


In the previous post we manage to pass pixel data from one clock domain to another and getting it displayed it on a VGA screen.

The above was a very nice practical test for me around all the theory surrounding cross clock domains. I must admit, I was quite overwhelmed with all the theory surrounding Cross Clock Domains on the Internet.

When overwhelmed with theory one must always attempt, if you can, to do a simple practical test for a reality check 😄

The next big goal is to attempt to read frames from SDRAM and display it on a VGA screen. This is quite a big chunk to perform at once, so we will try and break it down in small chunks.

In this post we will try and see if we can attempt to read from SDRAM to FPGA. We have already managed to write to SDRAM in a previous post, so the reading shouldn't be that hard.

However, the only part that concerns me with the reading is that our pixel clock clock is operating at speed very close to our AXI clock speed, so we must validate if our AXI system can keep up with with sufficient performance.

So, in this post I will also show a very primitive way for checking AXI-bus performance.

The Plan

Throughout this whole Blog-series I have been following the approach of divide and conquer.

This just makes life simpler when exploring new turf and should you encounter a bug you limit the scope you need to search in order to isolate the problem.

This is also the reason why we are only going to focus in this post in reading from SDRAM to FPGA.

But, in order to validate that we are reading data correctly from SDRAM, we need to be able to populate SDRAM with known data in the first place.

So, what is an easy way to populate SDRAM with known data? The Xilinx SDK come to our rescue here.

The XCST console integrated within the Xilinx SDK actually allows you to write a binary file that is present on your desktop machine directly to SDRAM on the Zybo Board at an address you specify.

Since our main goal is to develop a C64 on FPGA I am going to use the XSCT console to write a copy of the BASIC ROM from the C64 to the SDRAM of the Zybo board.

We are then going to attempt to read this binary back from SDRAM to our FPGA by means of our developed design and inspect the data that comes back by means of an Integrated Logic Analyser.

I have covered the use of an Integrated Logic Analyser in a previous post.

To make things more interesting we will take the data received from SDRAM, which is in 32-bit word form, and output it to a 8-bit port in a byte by byte fashion, Reliving the 8-bit area!

Reusing old code

To make our development a bit faster, we will be reusing our code developed in a previous post where we managed to write to SDRAM. With some tinkering we can make it read from SDRAM.

Let us refresh our minds again on how this code worked.

The whole design revolves around the IP provided by Xilinx called the LogiCore AXI Master Burst. The following diagram taken from the Xilinx datasheet explains the basic use of this core:

The relevent IP Core I have just described is indicated by the block AXI Master Burst.

This block connects to the outside world via AXI. Our user logic will connect to this block via the IPIC protocol which is somewhat simpler than the AXI protocol.

Now, we need to wrap the AXI Master Block within an IP within Vivado so we can add it as a block within our Block Design.

We will then wire up the rest  of our design to this Block.

In the post where have implemented the functionaslity for writing to SDRAM, we encapsulated the user related logic within a module called burst_block which had the following signature:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs


All the port which names starts with ip2,  forms part of the IPIC which is connected to the AXI MASTER Block.

During the course of this post we will rework both the module burst block and our AXI IP to cater for SDRAM reading.

I want to stress though that I will however not change these two components to provide dual functionality, but rather make copies and change the copies to provide the read functionality.

Thus, in the end we will have a AXI IP and burst_block providing the AXI write functionality and we will have another AXI IP and burst_block-set providing the read functionality.

The new AXI IP Block

Let us start by creating a new AXI IP Block for reading.

As mentioned in a previous post, you will start off this process by clicking on the Tools menu, selecting Create and Package new IP, and selecting Create AXI Peripheral on the second wizard page.

I am not going to bore you with the whole process again, but let us have a look at how the complete AXI Block will look like in the block design:

This block almost looks the same as the one we have develop for writing to SDRAM.

The real major difference is that we don't have a data input port, but a data output port.

Another difference is the assignment of m00_axi_aruser[0] and m00_axi_awuser[0]. These assignments will change as follows:

    assign m00_axi_aruser[0] = 1'b1;
    assign m00_axi_awuser[0] = 1'b0;

In our Write Axi block from a previous post we swapped around the zero and one values. Because this block is a read block, we are interested in Coherent reads rather than writes.

Just to refresh our minds again, we are making use of the ACP AXI port on the processing subsystem, in which we see memory in exactly the same way as the two ARM cores see memory. For that reason we need to worry about coherency. Here is a quite quote from the Technical Reference Manual for the Zynq, regarding above mentioned ports:

ACP coherent read requests: An ACP read request is coherent when ARUSER[0] = 1 and ARCACHE[1] = 1 alongside ARVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is read directly from the relevant processor and returned to the ACP port. When the data is not present in any of the Cortex-A9 processors, the read request is issued on one of the SCU AXI master ports, along with all its AXI parameters, with the exception of the locked attribute.

ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] =1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.
Why do we need to look at memory in the same way as a CPU core? The answer is that because we will be using XSCT console to write test data to SDRAM for reading back later by our user logic.

The XSCT console, however, always operates within the context of a CPU core. So reads/writes that we do from this console will always be from a L1 or a L2 cache. So, if we make use a AXI master port accessing the DDR RAM directly, like HP AXI or GP AXI, we might not read the data back that we wrote via the XSCT console.

Changes to block_burst

Let us create a new block_burst for doing reading. The definition of this new will looks as follow:

module burst_read_block(
  input wire clk,
  input wire reset,
  input wire restart,
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [31:0] axi_d_out,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs,
  output wire empty,
  input wire read

A major change compared to the previous AXI write module is that our module will now be receiving data from the AXI port and the rest of our design will now read data from this block.

Our FIFO buffer contained within this module will also have its read and write ports swapped around a bit:

fifo #(

   data_buf (
            .reset(!reset | restart),
            .write(!master_read_src_rdy & bytes_to_receive > 0),

The read port will now be driven user logic instead of the AXI port.

However, the write port will now be controlled by the AXI port via master_read_src_rdy. It should be noted that source and destination is different within a AXI read context. Within a AXI read context, the source is SDRAM and the destination is our user logic.

Apart from the master_read_src_rdy signal, bytes_to_received is also used to indicate that a write should happen to the FIFO buffer.

bytes_to_received keeps track of how many bytes we have still expect from the AXI port that we asked it to send:

always @(posedge clk)
 if (state == START)
   bytes_to_receive <= BURST_THRES;
 else if ((state > START) & !master_read_src_rdy & bytes_to_receive != 0) 
   bytes_to_receive <= bytes_to_receive - 1;

We initialise this register when our state machine is in the START status.

Our state machine uses the same states as with our AXI write mechanism, with some minor changes to the conditions for transitioning to other states, but more on this in a moment.

You might have noticed we are referencing a port called restart. I am using this port to cater for the scenario where we are constantly reading frames from SDRAM.

When we have finished reading a frame from SDRAM, we would like to reset the memory pointer to the beginning of the frame.

The first place you might have seen we use this port is on the reset port of the data_buf. So, on a restart we will empty the FIFO. The other places we need to make use of this signal within our module is shown below:

always @(posedge clk)
if (!reset | restart)
  count_in_buf <= 0;
else if (!read & write)
  count_in_buf <= count_in_buf + 1;
else if (read & !write)    
  count_in_buf <= count_in_buf - 1;
always @(negedge clk)
if (!reset | (restart))
  axi_start_address <= 32'h200000;
  axi_data_inc <= 0;
else if (state == INIT_CMD)
  axi_start_address <= axi_start_address + axi_data_inc;
  axi_data_inc <= {BURST_THRES,2'b0};

The restart signal also needs to be used by state machine, which I haven't shown above, since it nees special mention.

In our state machine we cannot blindly reset the state to IDLE. We first need to wait for the AXI port to deliver it last batch of data:

always @(posedge clk)
if (!reset | (restart & bytes_to_receive == 0 & state == 0))  
  state <= 0;
  case( state )
  //cater for scenario of flush
    IDLE: if (count_in_buf < BURST_THRES) 
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_read_src_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: if (!master_read_src_rdy & bytes_to_receive == 1)
                    state <= IDLE;    

Finally, let us quickly cover the assignments to some crucial ports:

assign master_read_src_rdy = ip2bus_otputs[3];
assign cmd_ack = ip2bus_otputs[0];
assign ip2bus_inputs[0] = mstread_req;
assign ip2bus_inputs[1] = mst_type; 
assign ip2bus_mst_addr = axi_start_address;
assign ip2bus_inputs[2] = master_read_dst_rdy;

Gluing everything together

With our new AXI IP and burst_block_read module developed, it is time we give it a test run.

As mentioned earlier, our test will be to write the BASIC ROM to SDRAM via the XSCT console and then see if our FPGA reads back the same information.

To do this test, we need to develop an extra Verilog module for gluing everything together. We will call this module axi_restart_test, and here is the complete code for this module:

module axi_restart_test(
  input wire clk,
  input wire reset,
  //input wire write,
  //input wire [31:0] write_data, 
  //output src ready
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [7:0] byte_out,
  output wire trigger_restart,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs
reg [31:0] byte_shift_reg;    
reg [1:0] byte_num = 0;
reg [9:0] bytes_to_send = 1023;
reg [1:0] state = 0;
reg [6:0] restart_counter = 63;
wire [31:0] data_buf_out;
wire empty;

assign trigger_restart = state == 2;  
burst_read_block my_read_block(
          .read(!empty & byte_num == 0)

assign byte_out = byte_shift_reg[31:24];

always @(posedge clk)
  case (state)
    2'b0 : state <= 1;
    2'b1 : state <= bytes_to_send == 0 ? 2 : 1;
    2'b10 : state <= restart_counter == 0 ? 0 : 2;    
always @(posedge clk)
  restart_counter <= state == 2 ? restart_counter - 1 : 63; 
always @(posedge clk)
if (byte_num == 0 & !empty)
  byte_shift_reg <= data_buf_out;
  byte_shift_reg <= {byte_shift_reg[23:0], 8'b0};
always @(posedge clk)
if (state == 0)
  bytes_to_send <= 1023;
else if (state == 1 & !empty)
  bytes_to_send <= bytes_to_send - 1;
always @(posedge clk)    
  byte_num <= trigger_restart & empty ? 0 : byte_num + 1;

This module takes the words received from the burst_read_block and it returns it byte for byte over the output port byte_out.

It also continuously outputs the first 1024 bytes of BASIC ROM which is retrieved from SDRAM.

With these modules developed, we can add them to our block design and wire everything up, which results in the following:

You can see on the image that I have also added an Integrated Logic Analyser on the right hand side. I have attached the following probes to the Integrated Logic Analyser:

  • ip2bus_mst_addr (32 bits)
  • ip2bus_mstrd_d (32 bits)
  • byte_out (8 bits)
  • trigger restart (1 bit)
  • ip2bus_inputs (5 bits)
  • ip2bus_outputs (6 bits)

Test Results

Let us have a look at the Test Results.

Probably the easiest way is just to view the captured waveforms in the Waveform window.

We have, however, also the option to view the captured data as a csv file, which we will use in this post.

To export the captured data as a csv file, you need to ensure that you have the tcl console open of the hardware manager. You then issue the following command within the Tcl console:

write_hw_ila_data my_hw_ila_data_file.zip [upload_hw_ila_data hw_ila_1]

You need to specify the full path for the resulting zip file. Also, hw_ila_1 is the name of your wave capture window.

When you open up the created zip, you will see a number files, of which one of them will be a csv file. In our case the csv file will like the following:

The fields are listed as below:

  • Sample in Buffer
  • Sample in Window
  • design_1_i/axi_restart_test_0_byte_out[7:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_1[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_2[0:0]
  • design_1_i/axi_restart_test_0_ip2bus_mst_addr_1[5:2]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_3[31:6]
  • design_1_i/myip_burst_read_test_0_bus2ip_mstrd_d[31:0]
  • design_1_i/myip_burst_read_test_0_ip2bus_otputs[4:0]
  • u_ila_0_myip_burst_read_test_0_ip2bus_otputs[5:5]
  • design_1_i/axi_restart_test_0_ip2bus_inputs[2:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_4[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr[0:0]
  • design_1_i/axi_restart_test_0_trigger_restart_1
You will see that some fields, like ip2bus_mst_addr, is broken down into two fields for some reason.

Now, let us see if we can analyse the data!

A good place to start is after a restart pulse, which will bring us to the beginning of the Basic ROM data. Let us have a look at a snippet after a restart pulse:


I have highlighted in Bold Black the end of the restart pulse. I have indicated changes to other signals in red.

Let us see if we can puzzle out what the signal changes in red represent. First, let us have a look at the signal which change from a 4 to a 3.

From our column names we see that this column represents ip2bus_inputs. Let us quickly recap on the meaning of the different bits for this signal:

  • Bit 2: master_read_dst_rdy
  • Bit 1: mst_type
  • Bit 0: mstread_req
So, we can see that in transitioning from 4 to 3, we are signaling that we are ready to receive data and we issue a read request.

Let us now see if we can figure out the signal that changes from an 8 to a 9. This time we see that this signal represents ip2bus_otputs. The meaning of these signals are as follows:

  • Bit 0: bus2ip_mst_cmdack 
  • Bit 1: bus2ip_mst_cmplt
  • Bit 2: bus2ip_mst_error
  • Bit 3: bus2ip_mstrd_src_rdy_n     
  • Bit 4: md_error
So we see that Bit 0 transitions to 1 indicating the the AXI subsystem has indeed accepted our command!

What is funny, though, is that the two entries following the transition indicates that source ready signal is not asserted straightaway! This means that we don't have data available yet from our AXI system that we can read.

We only get data about 37 clock cycles later:


I had a look at a couple of subsequent read requests and all of them have this delay of more less 37 clock cycles.

So, we can conlcude that for each AXI read request, there is a 37 clock cycle latency.

Now, let us have a look at the actual data that comes through:


This looks more or less like the data at the beginning of a C64 BASIC ROM, with the byte order reversed, though.

Let us now see if we can pinpoint the place where this 4 byte words gets output as a stream of bytes. This happens a couple of clock cycles after we start receiving data:


The byte values looks more or less ok, except that just after the 64th byte we are presented with 8 zero byte values, which is not correct if you compare with the actual BASIC ROM data.

Some closer investigation yielded that this was caused by our AXI bus not been able to provide the data fast enough. To get an idea of the problem, I have created the following table:


This is basically a snippet of the CSV file converted into an HTML table.

I have also added an extra column on the left hand site and a column on the right hand side.

The column on the left hand site indicates when our byte output stream starts, and increment the count after each byte output. For byte count 64 and up the count cell is blank and indicates invalid data.

The extra column on the righthand site counts each time we receive a four-byte word from the AXI bus.

You can see that received word number 15 is the last word of data that arrives "in-time"and gets displayed during byte number 60, 61, 62 and 63.

Received word 16, however, only arrives some 5 clock cycles after byte value number 63 was outputted. So, we have a small buffer underflow issue here 😞

We can avoid this buffer underflow situation by tweaking the size of our FIFO receive buffer. The size of the buffer should be made just over twice the BURST_THRES paramater.

Which value of BURST_THRES should we choose? BURST_THRES should be set so that we trigger the next AXI read transaction with enough elements in our FIFO buffer to survive the latency of more or less 36 clock cycles.

So, to cater for a possible worst case scenario, we can probably choose a BURST_THRES of 50. So a typical FIFO buffer size for this scenario would 102 elements.

Just a final note on analysing the data in the CSV file.

Another kind of odd behaviour you might realise from the HTML table above is that once we start receiving the data words from the AXI subsystem, we receive it continuously from Word 0 to Word 7. However, after Word 7 there is a delay of four clock cycles before we receive data again.

You get this kind of behaviour for the other AXI read transactions as well, but for the majority of cases it is not only 4 delay cycles, but 6, 7 or 8 delay cycles!

Considering that we are requesting 16 data words at a time, these delay clock cycles eats more or less 50% of the available bandwidth!

Investigating the loss of Bandwidth

Let us see if we can figure out why our bandwidth is so low.

After some reading I found a clue.

The ACP-AXI port we are using is 64-bit wide on the Processing system side. Our AXI IP block, however, has a 32-bit wide data port instead of a 64-bit one.

I suspect that this 32-bit wide data port is the cause for our bandwidth issue.

The obvious check then be to extend the databus of our existing AXI IP also to 64-bit.

However, quickly looking at Vivado it doesn't seem so obvious to extend the databus width of a AXI created peripheral to 64 bits. There is an option to specify databus width from the dropdown, but in the drop-down, suprise-suprise, there is only a 32-bit option!

I am sure with some tweaking one could make a 64-bit option appear, but for now I want something simple to test my theory on whether the mismatch of port width is the cause for the bandwidth issue.

Another option would be to use an AXI port on the Processing subsystem that is also 32-bits wide. At least then we test if matching ports also result in the same amount of lost clock cycles.  

There is indeed an AXI port on the Processing subsystem that is 32-bits in width: The General Purpose (GP) AXI port.

To be honest, I have have been avoiding AXI-GP ports up to this point in time for the simple reason that it is not guaranteed that the FPGA will see test data we wrote via the XSCT console on an AXI-GP port. SO, let us see if we can move the bar higher and see if we can make the FPGA read data on the AXI-GP port written by the XSCT console.

To do this, let us have a look again at the question: Why can't we guarantee that memory writes via the XSCT console will be read back via the FPGA on the AXI-GP port?

The answer to this question is that memory writes via the XSCT console doesn't get written directly to DDR memory, but rather to the L1 cache associated with the ARM Core our XSCT console is attached to.

So, is there a way to flush the contents of the L1 cache to DDR memory? To answer this question, we need to look a bit in the ZYNQ Technical Reference manual on how the whole L1-Cache mechanism works.

The L1-Cache contains a number of Cache lines. When a request is made from Main Memory, the contents is read from DDR memory and then stored within a Cache Line. All subsequent requests to the same memory location is retrieved from the relevant Cache line within the L1-Cache.

With writes to main memory the relevant contents is first read from Main memory and stored within a cache line. The contents to be written is then applied to the relevant cache line, and the whole cache line is marked as dirty.

If we just left it there, we will just have a Cache line entry marked as dirty, which will never be written to Main Memory.

However, things start to change when we keep writing data up to a point when we have no free Cache Line entries left. In this scenario the L1 Cache Manager will choose a Cache Line entry to evacuate to accommodate the new write. The Cache Line entry chosen will usually be a Least Recently Read/Wite one.

If the chosen Cache Line is marked as dirty, the existing contents first gets written to main memory.

So, to ensure that our writes from XSCT console gets written to Main Memory we need to write a file that is bigger than the size of the L1 Cache.

The L1 Data Cache is 32KB, while the BASIC ROM we used previously for testing is 8KB in size. So we need to create new file by contenting the BASIC ROM a number of times till the resulting file is bigger than 32KB.

On the Bash Terminal in Linux this concatenated file can be produced with the following:

cat basic.bin basic.bin basic.bin basic.bin basic.bin basic.bin > big.bin

The file big.bin will then be the file you use to write to memory via the XSCT console.

Changing the Design to use an AXI-GP port

Let us change the design to make use of an AXI-GP port.

We this by double clicking on the Processing Sub System Block within the Block Design.

On the Page Navigator Panel on the Left select PS-PL Configuration and make the following selections:

We Unselect the ACP Salve Interface so that only one AXI Slave port is present on the Zynq Processing block.

Finally you need to wire everything up again.

Test Results with the AXI-GP port

This time around the results look a lot better. You obviously still have your about 37 clock latency for every AXI read transaction, but once the data starts coming in, it comes in without any intermittent delay clock cycles. With some read transactions you might see one or two intermittent delay clock cycle during the receiving of data, but that is about it:


From these set of results we can see that are coming pretty close to the theoretical bandwidth.

In Summary

In this post we have discovered how to read from SDRAM to FPGA via the AXI port on the Zynq.

We also had at the bandwidth performance while reading. Reading data from a AXI-ACP port with a databus width of 32-bits is probably not the cleverest thing to do, and you sacrifice a lot of the potential bandwidth in this way.

Our test on a 32-bit AXI_GP behaved more or less as expected and we got close to the theoretical bandwidth.

In both tests (e.g AXI_ACP and AXI_GP) we experienced a latency of more less 37 clock cycles per AXI read transaction. Your FIFO buffer should be made large enough so that this latency doesn't effect the Bandwidth.

In the next post we will see if we can continuously read a picture fram from SDRAM and display it on the VGA screen.

 Till next time!