Saturday, 25 August 2018

Displaying the C64 welcome screen on VGA screen


In the previous post we managed to display a static frame stored in SDRAM to a VGA screen.

In this post we are going to take it one step forward and try to display frames rendered from our VIC-II module to a VGA screen.

To View a Video for this Post...

This video explains how to modify our current C64 design to take the frames output by our VIC-II module and display them on a VGA screen.

For a more in-depth discussion of the contents of this video, please continue reading...

A recap on the current state of our C64 design

It has been some time since we worked on our C64 design.

In the last couple of posts all effort was diverted into developing functionality for displaying frames stored in SDRAM on a VGA screen.

With that goal accomplished in the previous post, I think it is time we focus on our C64 design again.

So, let us start by refreshing our minds on where we left off with our C64 design.

When we last worked on the C64 design, it was in a state where we could generate a VIC-II frame with the C64 welcome screen, and store it in SDRAM.

So, with the functionality from the previous post, where we could display a frame from SDRAM on a VGA screen, we should be able to see the C64 welcome screen displayed on a VGA screen.

One thing to mention, though, is that our design from the previous post expects frames at the exact resolution of the VGA monitor you are using. Our C64 design, however, produces frames at a much lower resolution (something like 404x284).

We will therefore need to modify our VGA block to cater for the resolution produced by our C64 design.

Modifying the VGA output block

As mentioned in the previous section, our C64 block produces frames that are lower in resolution than those of a typical recent VGA monitor.

On the other hand, since most recent VGA monitors are LCDs, it is best to produce VGA signals at the native resolution of the VGA screen in question. It just looks better on these screens at native resolution.

In this post we will therefore output a signal at native resolution and display the C64 output frames in a small section of the screen.

This requires a couple of changes to our VGA block design.

The first change is within our asynchronous FIFO that buffers pixel data from the AXI clock domain to the pixel clock domain:

     //Reading port
     .ReadEn_in((vert_pos_next > 100)  & (vert_pos_next < 384) &
                (horiz_pos_next > 100) & (horiz_pos_next < 505)),
     //Writing port
     .WriteEn_in((state_shift_reg == STATE_16_SHIFT_STORED | state_shift_reg == STATE_16_SHIFT_SHIFTED) & !buffer_full),
     .Clear_in(trigger_restart_state == RESTART_STATE_RESTART)

With this change we only enable reads from this FIFO when our VGA signal is within the visible region on the screen.

The visible region is between vertical positions 100 and 384 and between horizontal positions 100 and 505. This will give us the small 404x284 C64 window on the screen.
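
As a quick sanity check on the window size, here is a small Python sketch (the 800 and 600 ranges are just arbitrary upper bounds for the counters). Note that, taken literally, the strict comparisons yield 404 columns but only 283 rows, so in practice one of the vertical bounds presumably needs to be inclusive for all 284 lines to show:

```python
# Count the positions that pass the strict comparisons used in ReadEn_in.
visible_cols = sum(1 for h in range(800) if 100 < h < 505)
visible_rows = sum(1 for v in range(600) if 100 < v < 384)
print(visible_cols, visible_rows)
```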

We would like our small C64 window to be surrounded by a black border on the screen. For that we need to make the following modification:

assign out_pixel_buffer_final = (vert_pos > 100)  & (vert_pos < 384) &
                                (horiz_pos > 100) & (horiz_pos < 505)
                                ? out_pixel_buffer : 0;

With this change we output the pixel data from our FIFO if it is within the visible region. For all other positions we output a black pixel.

Adding VGA output to our C64 design

Let us now add VGA output to our C64 design. We do this by first adding the VGA block mentioned in the previous section.

Next, we add the AXI Read block we have developed and used in the last couple of posts.

We will wire up this VGA block and AXI read block in a similar way as described previously. Care should, however, be taken with the AXI master output port of our AXI read block.

When adding the VGA functionality to our existing C64 design, we will indeed have two AXI master ports to hook up, whereas our current processing system only has one AXI slave port configured.

We will therefore need to tweak our design a bit to cater for the two master ports.

Start off by removing all AXI helper blocks. I have highlighted them in the following picture:

With these blocks removed, we need to configure our processing block by double-clicking on it and selecting an extra GP port:

With this option selected, you will see an extra AXI GP port on the processing block:

You can now make use of the designer assistance provided by the IDE to wire up both GP ports.

Just after wiring up, you will most probably get the following warning:

The IDE will nonetheless allow you to continue, but you will eventually be stopped by an actual error during either synthesis or bitstream generation. So it is better to try and resolve the warning at this point.

Let us see if we can resolve this warning. We start by opening the Address Editor tab within the Block Design window:

The values in the address editor are used to configure the AXI helper blocks, deciding to which GP port a particular address gets forwarded. But there are some duplicate mappings here. For instance, the 512MB block range starting at address 0x0 is mapped to both GP0 and GP1.

We can simplify the mapping as follows:

Now we have GP0 dedicated to our AXI write block and GP1 dedicated to our AXI read block.

With these changes our design should synthesise and generate the bitstream without any errors.

The End result

With everything started up, the screen looks as expected:

This is only the Welcome screen with no Flashing cursor.

In Summary

In this post we managed to display the frames produced by our VIC-II module on the VGA screen.

In the next post we will attempt to implement the flashing cursor.

Till next time!

Tuesday, 10 July 2018

Fixing the Non static frame


In the previous post we attempted to view random data in SDRAM as a static frame on the VGA screen.

We, however, ended with a random alternating pattern displayed instead of a random static pattern.

In this post we will attempt to fix this anomaly.

To view a Video of this Blog...

This video explains, with the help of a Xilinx community post, that the non-static frame was likely caused by the asynchronous FIFO implementation used. I also show how I applied the suggestions from the community post to my existing design in order to fix the problem...

If you prefer the written version, together with a discussion of the actual changes to the Verilog code, please continue reading...

Some help from an old community post

The anomaly encountered in the previous post really baffled me, and I didn't know where to actually start looking for the cause of the problem. So, I consulted the Internet...

In my searching I came across the following post on a Xilinx Community forum:

The interesting thing here is that the community member who posted the query used exactly the same implementation I did, from the Asic-World website:

The member was experiencing some serious timing violations when trying to synthesise the design.

The key to the solution was provided by the community member with the nickname Avrumw. He pointed out that the comments of the mentioned design suggest that it follows recommendations from a Xilinx article. Avrumw, however, had serious doubts whether Xilinx would make some of these suggestions at all, because of the following (I am quoting from Avrumw's answer):
  - It uses a latch
  - It uses the asynchronous preset/clear inputs of flip-flops for part of its functionality
The suggestions Avrumw gave to avoid these practices were the following:

  - infer the RAM
  - use Gray counts for bringing addresses between domains
  - use standard "two back to back flip-flop" synchronizers (with the ASYNC_REG property set) to move the Gray coded read pointer into the write domain (for generating full) and the Gray coded write pointer into the read domain (for generating empty)
Admittedly, the design on Asic World did indeed make use of Gray counters.
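
The reason Gray counts are safe for moving pointers between clock domains is that consecutive values differ in exactly one bit, so a synchronizer sampling mid-transition can only ever see the old or the new pointer, never a garbled mixture. A minimal Python illustration of the property:

```python
def to_gray(n: int) -> int:
    """Binary-reflected Gray code: consecutive values differ in exactly one bit."""
    return n ^ (n >> 1)

# Check the single-bit-change property for a 4-bit counter, wrap-around included.
codes = [to_gray(n) for n in range(16)]
for a, b in zip(codes, codes[1:] + codes[:1]):
    assert bin(a ^ b).count("1") == 1
print([format(c, "04b") for c in codes[:4]])
```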

Let us now proceed and see if we can apply these suggestions to our design.

Applying the suggestions to our design

Based on the suggestions, the first thing I am going to do is make use of back-to-back flip-flop synchronizers.

We will need two of these back-to-back synchronizers: one for passing pNextWordToWrite to the read side and another for passing pNextWordToRead to the write side.

These synchronizers will be defined as follows:

   (* ASYNC_REG = "TRUE" *)  reg [3:0] synchro_write_side_0, synchro_write_side_1; 
   (* ASYNC_REG = "TRUE" *) reg [3:0] synchro_read_side_0, synchro_read_side_1;

The ASYNC_REG attribute ensures that the flip-flops of each synchroniser pair are placed close to each other when the design is synthesised.

These flip-flops will be assigned as follows:

//write synchroniser
     always @(posedge WClk) begin
       synchro_write_side_0 <= pNextWordToRead;
       synchro_write_side_1 <= synchro_write_side_0;
     end

//read synchroniser
     always @(posedge RClk) begin
       synchro_read_side_0 <= pNextWordToWrite;
       synchro_read_side_1 <= synchro_read_side_0;
     end

Please take note that each synchroniser gets clocked by a different clock.
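
To make the two-edge latency of such a synchronizer concrete, here is a small behavioural Python model (metastability itself obviously cannot be captured in software; this only shows the timing behaviour). The class name is mine, purely for illustration:

```python
class TwoFlopSynchronizer:
    """Behavioural model of a two back-to-back flip-flop synchronizer:
    a value launched in the source domain appears at the output after
    two destination-clock edges."""
    def __init__(self):
        self.stage0 = 0
        self.stage1 = 0

    def tick(self, async_in: int) -> int:
        # Update stage1 from the OLD stage0 first, mirroring the
        # non-blocking assignments in the Verilog.
        self.stage1 = self.stage0
        self.stage0 = async_in
        return self.stage1

sync = TwoFlopSynchronizer()
outputs = [sync.tick(0xA), sync.tick(0xA), sync.tick(0xA)]
print(outputs)  # the new value only emerges after two edges
```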

Let us now see where these synchronisers get used. Before we continue, I would just like to mention that I had to duplicate the code for both the write side and the read side. So, let us first start with the code on the write side:

//Empty/Full Handling on Write Side
    //'EqualAddresses' logic:
    assign EqualAddresses_write_side = (pNextWordToWrite == synchro_write_side_1);

    //'Quadrant selectors' logic:
    assign Set_Status_write_side = (pNextWordToWrite[ADDRESS_WIDTH-2] ~^ synchro_write_side_1[ADDRESS_WIDTH-1]) &
                         (pNextWordToWrite[ADDRESS_WIDTH-1] ^  synchro_write_side_1[ADDRESS_WIDTH-2]);
    assign Rst_Status_write_side = (pNextWordToWrite[ADDRESS_WIDTH-2] ^  synchro_write_side_1[ADDRESS_WIDTH-1]) &
                         (pNextWordToWrite[ADDRESS_WIDTH-1] ~^ synchro_write_side_1[ADDRESS_WIDTH-2]);

Here we have replaced all instances of pNextWordToRead with synchro_write_side_1.

Next, let us get rid of the transparent latch. First, let us have a look at the original code that inferred a transparent latch:

    //'Status' latch logic:
    always @ (Set_Status, Rst_Status, Clear_in) //D Latch w/ Asynchronous Clear & Preset.
        if (Rst_Status | Clear_in)
            Status = 0;  //Going 'Empty'.
        else if (Set_Status)
            Status = 1;  //Going 'Full'.

If you look closely at the code, you will identify several scenarios where no assignment takes place. In those scenarios we need to fall back on a previously stored state. For this reason the above code infers a transparent latch.

To eliminate the transparent latch we need to split the above into a pure combinational logic block and a storage element. The result is as follows:

    always @*            
        if (Rst_Status_write_side | Clear_in)
          Status_write_side = 0;  //Going 'Empty'.
        else if (Set_Status_write_side)
          Status_write_side = 1;  //Going 'Full'.
        else
          Status_write_side = Status_write_prev_side; 
    always @(posedge WClk)
         Status_write_prev_side <= Status_write_side; 

So, we have a pure storage element, Status_write_prev_side, that stores the value of the combinational block Status_write_side at each clock edge. In the cases where no new value is assigned to Status_write_side, we simply fall back on the value of Status_write_prev_side.

Next, let us see what we can do to eliminate the need for a flip-flop with an asynchronous preset. First, let us look again at the original code that infers such a flip-flop:

    //'Full_out' logic for the writing port:
    assign PresetFull = Status & EqualAddresses;  //'Full' Fifo.
    always @ (posedge WClk, posedge PresetFull) //D Flip-Flop w/ Asynchronous Preset.
        if (PresetFull)
            Full_out <= 1;
        else
            Full_out <= 0;

Looking at this piece of code, one can immediately see why an asynchronous flip-flop was needed. In deriving PresetFull we had to use values that get assigned in the read clock domain. So, it makes sense to trigger the assignment the moment PresetFull transitions from zero to one rather than waiting for WClk to transition. In this way we can avoid a setup and hold violation.

However, with Xilinx FPGAs we still try to avoid these asynchronous presets. Since we have safely moved pNextWordToRead from the read domain to the write domain, we don't need such manoeuvres. So the assignment of Full_out simplifies to:

   assign PresetFull_write_side = Status_write_side & EqualAddresses_write_side;  //'Full' Fifo.
   assign Full_out = PresetFull_write_side;             

This takes care of the Full indicator on the write side. For the empty indicator that is used on the read side, we have a similar set of code:

//Empty/Full Handling on Read Side
    //'EqualAddresses' logic:
    assign EqualAddresses_read_side = (synchro_read_side_1 == pNextWordToRead);

    //'Quadrant selectors' logic:
    assign Set_Status_read_side = (synchro_read_side_1[ADDRESS_WIDTH-2] ~^ pNextWordToRead[ADDRESS_WIDTH-1]) &
                         (synchro_read_side_1[ADDRESS_WIDTH-1] ^  pNextWordToRead[ADDRESS_WIDTH-2]);
    assign Rst_Status_read_side = (synchro_read_side_1[ADDRESS_WIDTH-2] ^  pNextWordToRead[ADDRESS_WIDTH-1]) &
                         (synchro_read_side_1[ADDRESS_WIDTH-1] ~^ pNextWordToRead[ADDRESS_WIDTH-2]);

    //'Status' logic:
    always @*            
        if (Rst_Status_read_side | Clear_in)
          Status_read_side = 0;  //Going 'Empty'.
        else if (Set_Status_read_side)
          Status_read_side = 1;  //Going 'Full'.
        else
          Status_read_side = Status_read_prev_side; 
    always @(posedge RClk)
         Status_read_prev_side <= Status_read_side; 

    //'Empty_out' logic for the reading port:
    assign PresetEmpty_read_side = ~Status_read_side & EqualAddresses_read_side;  //'Empty' Fifo.

    assign Empty_out = PresetEmpty_read_side;


These are all the changes required to our design.

The Results

I can confirm that the mentioned changes did in fact solve my issue and a static random pattern was displayed on screen.

I wanted to show a picture in this post of how the screen looks with these changes, but the photo is not very clear. A better exercise would be to display a meaningful image on the VGA screen.

To do this exercise we will make use of the XSCT console to write the contents of an image file to the SDRAM of the ZYBO board.

Needless to say, this image file will need to contain raw pixel data in RGB565 format. The file format that comes closest to this is Microsoft's BMP format. Interestingly enough, GIMP allows us to create a BMP file in RGB565 format.

To do this, open the image you want to convert in GIMP and then select File/Export As.

Give a filename, suffix it with a .bmp extension and hit Export. Specify the options in the options window as follows and hit the Export button again:

We will then write this file's contents to the SDRAM of the Zybo board. You should remember, though, that the image file doesn't start with raw image data straight away; the pixel data only starts at byte offset 0x46, as deduced from the following article on Wikipedia:

So, because our image frame starts at address 0x200000 in Zybo SDRAM, we should write our file starting at address 0x200000 - 0x46 to account for the header. Thus, we should write our file to SDRAM starting at address 0x1fffba.
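
The offset arithmetic is easy to double-check with a couple of lines of Python:

```python
FRAME_BASE = 0x200000   # start of the frame buffer in Zybo SDRAM
BMP_HEADER = 0x46       # offset of the raw RGB565 data inside the BMP file

# Loading the whole file here makes its pixel data land exactly on FRAME_BASE.
load_addr = FRAME_BASE - BMP_HEADER
print(hex(load_addr))
```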

With our Zybo board programmed and a program kicked off via the Xilinx SDK, we issue the following command via the XSCT console:

mwr -size b -bin -file /home/johan/Downloads/bm1360.bmp 0x1fffba 3000000

Obviously you need to specify your own file name.

With the image data written our VGA display looks as follows:

Static image indeed.

In Summary

In this post we fixed the issue where a non-static image was shown on screen.

This issue was caused by the following unsafe practices:

  • Using transparent latches
  • Using flip-flops with asynchronous presets.
In the next post we will attempt to display the output from our VIC-II to the VGA screen.

Till next time!

Tuesday, 12 June 2018

Displaying Frames from SDRAM to VGA


In the previous post we managed to read a long stream of data sequentially from SDRAM via the AXI sub system.

In the blog post prior to that, we played around with passing data across a clock domain crossing (e.g. generating data in a 100MHz domain and passing it to the 85MHz pixel clock domain). Specifically, we used the asynchronous FIFO implementation by Alex Claros F., published on the AsicWorld website.

In this post we will try to bring together the contents of the previous posts, that is, reading a video frame from SDRAM and passing it to the VGA pixel clock domain in order to display it on a VGA LCD monitor.

Also, in this post I started playing with the idea of including a YouTube video outlining the contents covered in the blog post.

I am going to be frank and admit that I am a total rookie at making YouTube videos, so please forgive any blunders made in the video 😊

To view a Video of this Blog...

This video gives an overview of this Blog Post, as well as a practical session in Vivado on the changes needed to the existing design.

If you prefer the written version, together with a discussion of the actual changes to the Verilog code, please continue reading...


The following diagram shows an overview of what we want to achieve in this post:

We will be receiving the frame data from SDRAM via the AXI subsystem.

We will buffer the data coming from the AXI subsystem in a FIFO. This is the same FIFO we used in the previous post.

I might not have placed emphasis on this previously, but a special feature of this FIFO is that it provides a half-full indicator in addition to the Full and Empty indicators you will find in most FIFO implementations.

I could probably have built the half-full functionality into our existing asynchronous pixel FIFO and thereby eliminated the need for a second FIFO.

However, this would mean more functionality living in the clock domain crossing, which would potentially mean more frustration debugging setup and hold violations.

You will also see a kind of shift register living between the two FIFOs. The reason for this is that data comes in as 32-bit words, whereas every pixel is only 16 bits in size. So in effect we have two pixels per 32-bit word.

It is the sole responsibility of this shift register to split the 32-bit words into individual pixels.

The shift register shifts 16 bits to the left at a time. The upper 16 bits provide our pixel data, which is buffered within our asynchronous FIFO.
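
Per my reading of the {axi_read_data[15:0], axi_read_data[31:16]} swap on load further down, the low half-word of each 32-bit AXI word is emitted as a pixel first, followed by the high half-word. A hedged Python sketch of that splitting (function name is mine):

```python
def split_word(word: int) -> tuple[int, int]:
    """Split a 32-bit AXI word into the two 16-bit RGB565 pixels it carries.
    The Verilog swaps the half-words on load, so the low half-word comes
    out first, followed by the high half-word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

first, second = split_word(0xAABBCCDD)
print(hex(first), hex(second))
```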

To Read/Write or not to Read/Write

Reading and writing to and from our two FIFO's at the right moment is crucial to avoid buffer overflows or buffer underflows.

In addition, these reading and writing patterns also affect the operation of our shift register between these FIFOs.

All in all we should attempt to always keep our AXI buffer at least half full and the asynchronous FIFO completely full.

Let us have a quick look at an outline of the Verilog code that governs the reading and writing of these two FIFO buffers:

assign read_from_axi = !axi_buffer_empty & !buffer_full ? 1 : 0;

The read_from_axi wire controls reads from our AXI FIFO. Obviously we only read from this FIFO if it is not empty and our asynchronous FIFO is not full.

As mentioned, the shift register also plays an important role in reading and writing to the FIFO buffers, so let us have a look at the implementation of this shift register:

reg [31:0] shift_reg_16_bit;

parameter STATE_16_SHIFT_IDLE = 2'd0;
parameter STATE_16_SHIFT_STORED = 2'd1;
parameter STATE_16_SHIFT_SHIFTED = 2'd2;

always @(posedge clk_axi)
  if (read_from_axi)
    shift_reg_16_bit <= {axi_read_data[15:0], axi_read_data[31:16]};
  else if (state_shift_reg == STATE_16_SHIFT_STORED & !buffer_full)
    shift_reg_16_bit <= {shift_reg_16_bit[15:0], 16'b0};

always @(posedge clk_axi)
  case (state_shift_reg)
    STATE_16_SHIFT_IDLE:    state_shift_reg <= !axi_buffer_empty & !buffer_full ? STATE_16_SHIFT_STORED : STATE_16_SHIFT_IDLE;
    STATE_16_SHIFT_STORED:  state_shift_reg <= buffer_full ? STATE_16_SHIFT_STORED : STATE_16_SHIFT_SHIFTED;
    STATE_16_SHIFT_SHIFTED: if (buffer_full)
                              state_shift_reg <= STATE_16_SHIFT_SHIFTED;
                            else if (axi_buffer_empty)
                              state_shift_reg <= STATE_16_SHIFT_IDLE;
                            else
                              state_shift_reg <= STATE_16_SHIFT_STORED;
  endcase

As you can see, the state machine plays an important role in the operation of the shift register. We start off in the IDLE state. Once our AXI FIFO has data and our asynchronous buffer is not full, we transition to the STORED state. This instructs the shift register to store a value from the AXI FIFO rather than shifting.

The Shift register performs a shift operation when we transition to the SHIFTED state.

While our two buffers are in the right state, our shift register will continue the cycle of loading a value from the AXI buffer followed by a shifting operation.
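
The transition logic, as I read it from the case statement, can be modelled in a few lines of Python; with data available and room in the pixel FIFO, the machine alternates between STORED and SHIFTED (this is an illustrative model, not the Verilog itself):

```python
IDLE, STORED, SHIFTED = 0, 1, 2

def next_state(state, axi_buffer_empty, buffer_full):
    """Next-state function of the shift-register state machine."""
    if state == IDLE:
        return STORED if not axi_buffer_empty and not buffer_full else IDLE
    if state == STORED:
        return STORED if buffer_full else SHIFTED
    # SHIFTED: hold while the pixel FIFO is full, go idle if no AXI data,
    # otherwise load the next word.
    if buffer_full:
        return SHIFTED
    if axi_buffer_empty:
        return IDLE
    return STORED

state, trace = IDLE, []
for _ in range(5):
    state = next_state(state, axi_buffer_empty=False, buffer_full=False)
    trace.append(state)
print(trace)
```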

This state machine is also crucial for the operation of other parts of the system, as highlighted below:

assign read_from_axi = !axi_buffer_empty & !buffer_full & (state_shift_reg == STATE_16_SHIFT_IDLE | state_shift_reg == STATE_16_SHIFT_SHIFTED) ? 1 : 0;

     //Writing port
     .WriteEn_in((state_shift_reg == STATE_16_SHIFT_STORED | state_shift_reg == STATE_16_SHIFT_SHIFTED) & !buffer_full),


Preparing for the next Frame at VSYNC

When we finish drawing a frame on the screen, it is time to prepare our system for the next frame.

This preparation consists of a couple of things:

  • Clearing all FIFO buffers
  • Allowing time for all AXI transactions currently busy to finish
  • Resetting the pointer of the next address to be read from SDRAM to the beginning of the frame
  • Refilling the FIFOs with pixel data
We should allow sufficient time for this preparation. An ideal event to trigger it is the VSYNC signal, which allows more than enough time to finish all prep for the next frame.

This whole prep process will also be driven by a state machine:

always @(posedge clk_axi)
  if (trigger_restart_state == RESTART_STATE_WAIT)
    restart_counter <= 400;
  else if (trigger_restart_state == RESTART_STATE_RESTART)
    restart_counter <= restart_counter == 0 ? 0 : restart_counter - 1;
always @(posedge clk_axi)
  case (trigger_restart_state)
    RESTART_STATE_WAIT :    trigger_restart_state <= vert_sync_delayed_5 ? RESTART_STATE_RESTART : RESTART_STATE_WAIT;
    RESTART_STATE_RESTART : trigger_restart_state <= restart_counter == 0 ? RESTART_STATE_END : RESTART_STATE_RESTART;
    RESTART_STATE_END :     trigger_restart_state <= vert_sync_delayed_5 ? RESTART_STATE_END : RESTART_STATE_WAIT;
  endcase

When the VSYNC pulse is encountered, the prep process lasts for 400 AXI clock cycles. This is enough time for initialisation for the next frame, as well as enough time for any AXI transactions that are in progress to complete.
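
A quick behavioural model of the two always blocks (each clock edge sees the old state and counter values, mirroring the non-blocking assignments) suggests the restart phase actually spans 401 cycles, since the counter runs from 400 down to 0 inclusive; close enough to the 400 quoted above:

```python
WAIT, RESTART, END = 0, 1, 2

def step(state, counter, vsync):
    """One clk_axi edge; both update rules see the OLD state and counter."""
    if state == WAIT:
        new_counter = 400
    elif state == RESTART:
        new_counter = 0 if counter == 0 else counter - 1
    else:
        new_counter = counter
    if state == WAIT:
        new_state = RESTART if vsync else WAIT
    elif state == RESTART:
        new_state = END if counter == 0 else RESTART
    else:
        new_state = END if vsync else WAIT
    return new_state, new_counter

state, counter, restart_cycles = WAIT, 400, 0
for cycle in range(1000):
    vsync = cycle < 5            # a short VSYNC pulse at the start
    state, counter = step(state, counter, vsync)
    restart_cycles += state == RESTART
print(restart_cycles)
```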

Let us have a quick look at what is affected by the prep for the next frame:

     //Reading port
     .Clear_in(trigger_restart_state == RESTART_STATE_RESTART),

burst_read_block my_read_block(
          .restart(trigger_restart_state == RESTART_STATE_RESTART),

The End Results

A simple test for our design is just to start up the Zybo board without any picture frame loaded in SDRAM.

When you power up SDRAM without initialising its contents, it will contain a random sequence of bytes, which will result in a noisy pattern on screen.

The big test here is that the noise pattern should be static and not alternating. An alternating noise pattern would indicate buffer underflow conditions.

Here is a frame I have captured from the video:

There is a random pattern indeed. However, it is alternating!

So, we need to spend some time troubleshooting.

We will leave this investigation for the next post.

In Summary

In this post we have attempted to display a frame stored in SDRAM to VGA.

Our ultimate test was to display a static noise pattern on screen. In the end we managed to get a noise pattern displayed on screen, but in an alternating fashion.

In the next post we are going to investigate what causes the noise pattern to alternate and attempt to fix it.

Till next time!

Sunday, 13 May 2018

Reading From SDRAM


In the previous post we managed to pass pixel data from one clock domain to another and display it on a VGA screen.

The above was a very nice practical test of all the theory surrounding cross clock domains. I must admit, I was quite overwhelmed by all the theory on this subject available on the Internet.

When overwhelmed with theory, one must always attempt, if possible, a simple practical test as a reality check 😄

The next big goal is to read frames from SDRAM and display them on a VGA screen. This is quite a big chunk to perform at once, so we will break it down into smaller chunks.

In this post we will see if we can read from SDRAM into the FPGA. We have already managed to write to SDRAM in a previous post, so the reading shouldn't be that hard.

However, the one part that concerns me about reading is that our pixel clock operates at a speed very close to our AXI clock speed, so we must verify that our AXI system can keep up with sufficient performance.

So, in this post I will also show a very primitive way for checking AXI-bus performance.

The Plan

Throughout this whole Blog-series I have been following the approach of divide and conquer.

This just makes life simpler when exploring new turf: should you encounter a bug, you limit the scope you need to search in order to isolate the problem.

This is also the reason why, in this post, we are only going to focus on reading from SDRAM into the FPGA.

But, in order to validate that we are reading data correctly from SDRAM, we need to be able to populate SDRAM with known data in the first place.

So, what is an easy way to populate SDRAM with known data? The Xilinx SDK comes to our rescue here.

The XSCT console integrated within the Xilinx SDK allows you to write a binary file that is present on your desktop machine directly to SDRAM on the Zybo board, at an address you specify.

Since our main goal is to develop a C64 on FPGA I am going to use the XSCT console to write a copy of the BASIC ROM from the C64 to the SDRAM of the Zybo board.

We are then going to attempt to read this binary back from SDRAM to our FPGA by means of our developed design and inspect the data that comes back by means of an Integrated Logic Analyser.

I have covered the use of an Integrated Logic Analyser in a previous post.

To make things more interesting we will take the data received from SDRAM, which is in 32-bit word form, and output it to an 8-bit port in byte-by-byte fashion. Reliving the 8-bit era!
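
The word-to-byte unpacking can be sketched in a couple of lines of Python. The least-significant-byte-first order here is an assumption for illustration; the actual design may emit the bytes in the opposite order:

```python
def word_to_bytes(word: int) -> list[int]:
    """Break a 32-bit word into four bytes, least significant byte first
    (byte order assumed for illustration)."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]

print([hex(b) for b in word_to_bytes(0xAABBCCDD)])
```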

Reusing old code

To make our development a bit faster, we will be reusing our code developed in a previous post where we managed to write to SDRAM. With some tinkering we can make it read from SDRAM.

Let us refresh our minds again on how this code worked.

The whole design revolves around an IP provided by Xilinx called the LogiCORE AXI Master Burst. The following diagram, taken from the Xilinx datasheet, explains the basic use of this core:

The relevant IP core I have just described is indicated by the block AXI Master Burst.

This block connects to the outside world via AXI. Our user logic will connect to this block via the IPIC protocol which is somewhat simpler than the AXI protocol.

Now, we need to wrap the AXI Master Block within an IP within Vivado so we can add it as a block within our Block Design.

We will then wire up the rest of our design to this block.

In the post where we implemented the functionality for writing to SDRAM, we encapsulated the user-related logic within a module called burst_block, which had the following signature:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs
);


All the ports whose names start with ip2 form part of the IPIC interface, which is connected to the AXI Master Burst block.

During the course of this post we will rework both the burst_block module and our AXI IP to cater for SDRAM reading.

I want to stress, though, that I will not change these two components to provide dual functionality, but rather make copies and change the copies to provide the read functionality.

Thus, in the end we will have an AXI IP and burst_block providing the AXI write functionality, and another AXI IP and burst_block set providing the read functionality.

The new AXI IP Block

Let us start by creating a new AXI IP Block for reading.

As mentioned in a previous post, you start this process by clicking on the Tools menu, selecting Create and Package New IP, and selecting Create AXI Peripheral on the second wizard page.

I am not going to bore you with the whole process again, but let us have a look at how the complete AXI block looks in the block design:

This block looks almost the same as the one we developed for writing to SDRAM.

The real major difference is that we don't have a data input port, but a data output port.

Another difference is the assignment of m00_axi_aruser[0] and m00_axi_awuser[0]. These assignments will change as follows:

    assign m00_axi_aruser[0] = 1'b1;
    assign m00_axi_awuser[0] = 1'b0;

In our write AXI block from a previous post, the zero and one values were swapped around. Because this block is a read block, we are interested in coherent reads rather than coherent writes.

Just to refresh our minds: we are making use of the ACP AXI port on the processing subsystem, through which we see memory in exactly the same way as the two ARM cores see it. For that reason we need to worry about coherency. Here is a quote from the Technical Reference Manual for the Zynq regarding the above-mentioned port:

ACP coherent read requests: An ACP read request is coherent when ARUSER[0] = 1 and ARCACHE[1] = 1 alongside ARVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is read directly from the relevant processor and returned to the ACP port. When the data is not present in any of the Cortex-A9 processors, the read request is issued on one of the SCU AXI master ports, along with all its AXI parameters, with the exception of the locked attribute.

ACP coherent write requests: An ACP write request is coherent when AWUSER[0] = 1 and AWCACHE[1] =1 alongside AWVALID. In this case, the SCU enforces coherency. When the data is present in one of the Cortex-A9 processors, the data is first cleaned and invalidated from the relevant CPU. When the data is not present in any of the Cortex-A9 processors, or when it has been cleaned and invalidated, the write request is issued on one of the SCU AXI master ports, along with all corresponding AXI parameters with the exception of the locked attribute.
Why do we need to look at memory in the same way as a CPU core? The answer is that we will be using the XSCT console to write test data to SDRAM, to be read back later by our user logic.

The XSCT console, however, always operates within the context of a CPU core. Reads and writes that we do from this console will therefore always go via the L1 or L2 cache. So, if we make use of an AXI master port that accesses the DDR RAM directly, like HP AXI or GP AXI, we might not read back the data that we wrote via the XSCT console.

Changes to the burst block

Let us create a new burst block for reading. The definition of this new module looks as follows:

module burst_read_block(
  input wire clk,
  input wire reset,
  input wire restart,
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [31:0] axi_d_out,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs,
  output wire empty,
  input wire read
);

A major change compared to the previous AXI write module is that our module will now be receiving data from the AXI port and the rest of our design will now read data from this block.

Our FIFO buffer contained within this module will also have its read and write ports swapped around a bit:

fifo #(

   data_buf (
            .reset(!reset | restart),
            .write(!master_read_src_rdy & bytes_to_receive > 0),

The read port will now be driven by user logic instead of by the AXI port.

However, the write port will now be controlled by the AXI port via master_read_src_rdy. It should be noted that source and destination are different within an AXI read context: the source is the SDRAM and the destination is our user logic.

Apart from the master_read_src_rdy signal, bytes_to_receive is also used to indicate that a write should happen to the FIFO buffer.

bytes_to_receive keeps track of how many bytes we still expect from the AXI port out of those we asked it to send:

always @(posedge clk)
 if (state == START)
   bytes_to_receive <= BURST_THRES;
 else if ((state > START) & !master_read_src_rdy & bytes_to_receive != 0) 
   bytes_to_receive <= bytes_to_receive - 1;

We initialise this register when our state machine is in the START state.

Our state machine uses the same states as with our AXI write mechanism, with some minor changes to the conditions for transitioning to other states, but more on this in a moment.

You might have noticed we are referencing a port called restart. I am using this port to cater for the scenario where we are constantly reading frames from SDRAM.

When we have finished reading a frame from SDRAM, we would like to reset the memory pointer to the beginning of the frame.

The first place you might have seen we use this port is on the reset port of the data_buf. So, on a restart we will empty the FIFO. The other places we need to make use of this signal within our module is shown below:

always @(posedge clk)
if (!reset | restart)
  count_in_buf <= 0;
else if (!read & write)
  count_in_buf <= count_in_buf + 1;
else if (read & !write)    
  count_in_buf <= count_in_buf - 1;
always @(negedge clk)
if (!reset | restart)
  begin
    axi_start_address <= 32'h200000;
    axi_data_inc <= 0;
  end
else if (state == INIT_CMD)
  begin
    axi_start_address <= axi_start_address + axi_data_inc;
    axi_data_inc <= {BURST_THRES,2'b0};
  end
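The address bookkeeping above can be modelled with a few lines of Python. This is only a sketch: the first burst reads from the base address 0x200000, and each pass through INIT_CMD advances the pointer by the size of the burst just issued (BURST_THRES words of four bytes, i.e. BURST_THRES shifted left by two). The value 16 for BURST_THRES is assumed here, matching the 16-word requests mentioned later in the post.

```python
# Python model of the burst address pointer: advance by the size of the
# previous burst each time the state machine passes through INIT_CMD.
BURST_THRES = 16            # assumed burst length in 32-bit words
BASE = 0x200000

addr, inc = BASE, 0
addresses = []
for _ in range(3):          # model three consecutive bursts
    addr += inc             # what happens in state INIT_CMD
    inc = BURST_THRES << 2  # bytes covered by the burst just issued
    addresses.append(addr)

print([hex(a) for a in addresses])  # ['0x200000', '0x200040', '0x200080']
```

Note that the increment lags by one burst on purpose: the pointer is only advanced once the previous burst has actually been issued.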

The restart signal also needs to be used by the state machine, which I haven't shown above, since it needs special mention.

In our state machine we cannot blindly reset the state to IDLE. We first need to wait for the AXI port to deliver its last batch of data:

always @(posedge clk)
if (!reset | (restart & bytes_to_receive == 0 & state == 0))  
  state <= 0;
else
  case( state )
  //cater for scenario of flush
    IDLE: if (count_in_buf < BURST_THRES) 
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_read_src_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: if (!master_read_src_rdy & bytes_to_receive == 1)
                    state <= IDLE;    
  endcase

Finally, let us quickly cover the assignments to some crucial ports:

assign master_read_src_rdy = ip2bus_otputs[3];
assign cmd_ack = ip2bus_otputs[0];
assign ip2bus_inputs[0] = mstread_req;
assign ip2bus_inputs[1] = mst_type; 
assign ip2bus_mst_addr = axi_start_address;
assign ip2bus_inputs[2] = master_read_dst_rdy;

Gluing everything together

With our new AXI IP and burst_read_block module developed, it is time we give them a test run.

As mentioned earlier, our test will be to write the BASIC ROM to SDRAM via the XSCT console and then see if our FPGA reads back the same information.

To do this test, we need to develop an extra Verilog module for gluing everything together. We will call this module axi_restart_test, and here is the complete code for this module:

module axi_restart_test(
  input wire clk,
  input wire reset,
  //input wire write,
  //input wire [31:0] write_data, 
  //output src ready
  output wire [31:0] ip2bus_mst_addr,
  output wire [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [7:0] byte_out,
  output wire trigger_restart,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs
);
reg [31:0] byte_shift_reg;    
reg [1:0] byte_num = 0;
reg [9:0] bytes_to_send = 1023;
reg [1:0] state = 0;
reg [6:0] restart_counter = 63;
wire [31:0] data_buf_out;
wire empty;

assign trigger_restart = state == 2;  
burst_read_block my_read_block(
          .read(!empty & byte_num == 0)

assign byte_out = byte_shift_reg[31:24];

always @(posedge clk)
  case (state)
    2'b0 : state <= 1;
    2'b1 : state <= bytes_to_send == 0 ? 2 : 1;
    2'b10 : state <= restart_counter == 0 ? 0 : 2;    
  endcase

always @(posedge clk)
  restart_counter <= state == 2 ? restart_counter - 1 : 63; 

always @(posedge clk)
if (byte_num == 0 & !empty)
  byte_shift_reg <= data_buf_out;
else
  byte_shift_reg <= {byte_shift_reg[23:0], 8'b0};

always @(posedge clk)
if (state == 0)
  bytes_to_send <= 1023;
else if (state == 1 & !empty)
  bytes_to_send <= bytes_to_send - 1;

always @(posedge clk)    
  byte_num <= trigger_restart & empty ? 0 : byte_num + 1;

This module takes the words received from the burst_read_block and returns them byte by byte over the output port byte_out.

It also continuously outputs the first 1024 bytes of the BASIC ROM, which it retrieves from SDRAM.
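The word-to-byte serialisation can be sketched in Python. This models the shift-register behaviour only (it is not the actual Verilog): byte_out is the top byte of the shift register, which is refilled from the FIFO every fourth clock and shifted left by eight bits in between.

```python
# Python model of the word-to-byte serialiser: most significant byte
# first, refilling the 32-bit shift register every four clocks.
def serialise(words):
    out = []
    for word in words:
        shift_reg = word
        for _ in range(4):
            out.append(shift_reg >> 24)              # byte_out = bits 31:24
            shift_reg = (shift_reg << 8) & 0xFFFFFFFF  # shift left by 8
    return out

print([hex(b) for b in serialise([0x11223344])])
# ['0x11', '0x22', '0x33', '0x44']
```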

With these modules developed, we can add them to our block design and wire everything up, which results in the following:

You can see on the image that I have also added an Integrated Logic Analyser on the right hand side. I have attached the following probes to the Integrated Logic Analyser:

  • ip2bus_mst_addr (32 bits)
  • ip2bus_mstrd_d (32 bits)
  • byte_out (8 bits)
  • trigger_restart (1 bit)
  • ip2bus_inputs (5 bits)
  • ip2bus_outputs (6 bits)

Test Results

Let us have a look at the Test Results.

Probably the easiest way is just to view the captured waveforms in the Waveform window.

We have, however, also the option to view the captured data as a csv file, which we will use in this post.

To export the captured data as a csv file, you need to ensure that you have the Tcl console of the Hardware Manager open. You then issue the following command within the Tcl console:

write_hw_ila_data [upload_hw_ila_data hw_ila_1]

You need to specify the full path for the resulting zip file. Also, hw_ila_1 is the name of your wave capture window.

When you open up the created zip, you will see a number of files, one of which will be a csv file. In our case the csv file will look like the following:

The fields are listed as below:

  • Sample in Buffer
  • Sample in Window
  • design_1_i/axi_restart_test_0_byte_out[7:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_1[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_2[0:0]
  • design_1_i/axi_restart_test_0_ip2bus_mst_addr_1[5:2]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_3[31:6]
  • design_1_i/myip_burst_read_test_0_bus2ip_mstrd_d[31:0]
  • design_1_i/myip_burst_read_test_0_ip2bus_otputs[4:0]
  • u_ila_0_myip_burst_read_test_0_ip2bus_otputs[5:5]
  • design_1_i/axi_restart_test_0_ip2bus_inputs[2:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr_4[0:0]
  • u_ila_0_axi_restart_test_0_ip2bus_mst_addr[0:0]
  • design_1_i/axi_restart_test_0_trigger_restart_1
You will see that some fields, like ip2bus_mst_addr, are broken down into multiple fields for some reason.

Now, let us see if we can analyse the data!

A good place to start is after a restart pulse, which will bring us to the beginning of the Basic ROM data. Let us have a look at a snippet after a restart pulse:


I have highlighted in Bold Black the end of the restart pulse. I have indicated changes to other signals in red.

Let us see if we can puzzle out what the signal changes in red represent. First, let us have a look at the signal which changes from a 4 to a 3.

From our column names we see that this column represents ip2bus_inputs. Let us quickly recap on the meaning of the different bits for this signal:

  • Bit 2: master_read_dst_rdy
  • Bit 1: mst_type
  • Bit 0: mstread_req
So, we can see that in transitioning from 4 to 3, we are signaling that we are ready to receive data and we issue a read request.

Let us now see if we can figure out the signal that changes from an 8 to a 9. This time we see that this signal represents ip2bus_otputs. The meanings of these bits are as follows:

  • Bit 0: bus2ip_mst_cmdack 
  • Bit 1: bus2ip_mst_cmplt
  • Bit 2: bus2ip_mst_error
  • Bit 3: bus2ip_mstrd_src_rdy_n     
  • Bit 4: md_error
So we see that Bit 0 transitions to 1, indicating that the AXI subsystem has indeed accepted our command!

What is funny, though, is that the two entries following the transition indicate that the source ready signal is not asserted straight away! This means that we don't yet have data available from our AXI system that we can read.

We only get data about 37 clock cycles later:


I had a look at a couple of subsequent read requests and all of them showed this delay of more or less 37 clock cycles.

So, we can conclude that each AXI read request incurs a latency of about 37 clock cycles.

Now, let us have a look at the actual data that comes through:


This looks more or less like the data at the beginning of a C64 BASIC ROM, though with the byte order reversed.

Let us now see if we can pinpoint the place where these 4-byte words get output as a stream of bytes. This happens a couple of clock cycles after we start receiving data:


The byte values look more or less ok, except that just after the 64th byte we are presented with 8 zero byte values, which is not correct if you compare with the actual BASIC ROM data.

Some closer investigation revealed that this was caused by our AXI bus not being able to provide the data fast enough. To get an idea of the problem, I have created the following table:


This is basically a snippet of the CSV file converted into an HTML table.

I have also added an extra column on the left-hand side and a column on the right-hand side.

The column on the left-hand side indicates where our byte output stream starts, incrementing the count after each byte output. For byte count 64 and up the count cell is blank, indicating invalid data.

The extra column on the right-hand side counts each time we receive a four-byte word from the AXI bus.

You can see that received word number 15 is the last word of data that arrives "in time" and gets displayed during byte numbers 60, 61, 62 and 63.

Received word 16, however, only arrives some 5 clock cycles after byte value number 63 was output. So, we have a small buffer underflow issue here 😞

We can avoid this buffer underflow situation by tweaking the size of our FIFO receive buffer. The size of the buffer should be made just over twice the BURST_THRES parameter.

Which value of BURST_THRES should we choose? BURST_THRES should be set so that we trigger the next AXI read transaction with enough elements in our FIFO buffer to survive the latency of more or less 36 clock cycles.

So, to cater for a possible worst case scenario, we can probably choose a BURST_THRES of 50. A typical FIFO buffer size for this scenario would then be 102 elements.
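The sizing argument can be checked with some back-of-envelope Python. The assumptions here come from the measurements above: the user logic drains one byte per clock (so one 32-bit word every four clocks) and a new burst only starts delivering data roughly 37 clocks after it is requested.

```python
# Back-of-envelope check of the FIFO sizing: how many words drain out of
# the FIFO while we wait for a requested burst to start arriving?
LATENCY = 37                       # observed request-to-first-word delay
CLOCKS_PER_WORD = 4                # one byte out per clock, 4 bytes/word

words_drained = LATENCY / CLOCKS_PER_WORD   # 9.25 words gone while waiting

BURST_THRES = 50                   # chosen with a generous safety margin
FIFO_DEPTH = 2 * BURST_THRES + 2   # "just over twice" the threshold

print(words_drained, BURST_THRES > words_drained, FIFO_DEPTH)
```

Anything comfortably above ten words would already survive the latency; 50 just gives us plenty of headroom.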

Just a final note on analysing the data in the CSV file.

Another kind of odd behaviour you might realise from the HTML table above is that once we start receiving the data words from the AXI subsystem, we receive it continuously from Word 0 to Word 7. However, after Word 7 there is a delay of four clock cycles before we receive data again.

You get this kind of behaviour for the other AXI read transactions as well, but for the majority of cases it is not only 4 delay cycles, but 6, 7 or 8 delay cycles!

Considering that we are requesting 16 data words at a time, these delay clock cycles eat up more or less 50% of the available bandwidth!
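A rough Python estimate makes the 50% figure plausible. The numbers here are read off the captured CSV discussed above: a 16-word burst arrives as two groups of 8 words, with roughly 4 to 8 idle clocks tacked onto each group (this is an approximation of the observed pattern, not an exact model of the bus).

```python
# Rough efficiency estimate: 16 data cycles per burst, plus two gaps of
# idle cycles, one after each group of eight words.
WORDS_PER_BURST = 16

for gap in (4, 6, 7, 8):
    cycles = WORDS_PER_BURST + 2 * gap   # data cycles plus two gaps
    print(gap, round(WORDS_PER_BURST / cycles, 2))
```

With a mid-range gap of 7 idle clocks the efficiency lands around 53%, i.e. close to half the bandwidth lost.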

Investigating the loss of Bandwidth

Let us see if we can figure out why our bandwidth is so low.

After some reading I found a clue.

The ACP-AXI port we are using is 64-bit wide on the Processing system side. Our AXI IP block, however, has a 32-bit wide data port instead of a 64-bit one.

I suspect that this 32-bit wide data port is the cause for our bandwidth issue.

The obvious check would then be to also extend the databus of our existing AXI IP to 64 bits.

However, quickly looking at Vivado, it doesn't seem so obvious to extend the databus width of an AXI-created peripheral to 64 bits. There is an option to specify the databus width in a drop-down but, surprise-surprise, the drop-down only contains a 32-bit option!

I am sure with some tweaking one could make a 64-bit option appear, but for now I want something simple to test my theory on whether the mismatch of port width is the cause for the bandwidth issue.

Another option would be to use an AXI port on the Processing subsystem that is also 32 bits wide. At least then we can test whether matching port widths avoid the same loss of clock cycles.

There is indeed an AXI port on the Processing subsystem that is 32 bits in width: the General Purpose (GP) AXI port.

To be honest, I have been avoiding AXI-GP ports up to this point in time for the simple reason that it is not guaranteed that the FPGA will see test data we wrote via the XSCT console on an AXI-GP port. So, let us raise the bar and see if we can make the FPGA read data on the AXI-GP port written by the XSCT console.

To do this, let us have a look again at the question: Why can't we guarantee that memory writes via the XSCT console will be read back via the FPGA on the AXI-GP port?

The answer to this question is that memory writes via the XSCT console don't get written directly to DDR memory, but rather to the L1 cache associated with the ARM core our XSCT console is attached to.

So, is there a way to flush the contents of the L1 cache to DDR memory? To answer this question, we need to look a bit in the ZYNQ Technical Reference manual on how the whole L1-Cache mechanism works.

The L1 cache contains a number of cache lines. When a read request is made to main memory, the contents are read from DDR memory and then stored within a cache line. All subsequent requests to the same memory location are served from the relevant cache line within the L1 cache.

With writes to main memory, the relevant contents are first read from main memory and stored within a cache line. The data to be written is then applied to the relevant cache line, and the whole cache line is marked as dirty.

If we just left it there, we would simply have a cache line entry marked as dirty, which would never be written to main memory.

However, things start to change when we keep writing data up to the point where we have no free cache line entries left. In this scenario the L1 cache manager will choose a cache line entry to evict in order to accommodate the new write. The cache line entry chosen will usually be a least recently read/written one.

If the chosen cache line is marked as dirty, its existing contents first get written to main memory.

So, to ensure that our writes from the XSCT console get written to main memory, we need to write a file that is bigger than the size of the L1 cache.

The L1 data cache is 32KB, while the BASIC ROM we used previously for testing is 8KB in size. So we need to create a new file by concatenating the BASIC ROM a number of times until the resulting file is bigger than 32KB.

On the Bash Terminal in Linux this concatenated file can be produced with the following:

cat basic.bin basic.bin basic.bin basic.bin basic.bin basic.bin > big.bin

The file big.bin will then be the file you use to write to memory via the XSCT console.
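As a quick sanity check on the concatenation count (pure arithmetic, using the sizes stated above):

```python
# Six copies of the 8 KB BASIC ROM give 48 KB, comfortably larger than
# the 32 KB L1 data cache, so every dirty line is forced out.
ROM_SIZE = 8 * 1024
L1_DCACHE = 32 * 1024
copies = 6

print(copies * ROM_SIZE, copies * ROM_SIZE > L1_DCACHE)
```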

Changing the Design to use an AXI-GP port

Let us change the design to make use of an AXI-GP port.

We do this by double-clicking on the Processing Sub System block within the Block Design.

On the Page Navigator Panel on the Left select PS-PL Configuration and make the following selections:

We unselect the ACP Slave Interface so that only one AXI slave port is present on the Zynq Processing block.

Finally you need to wire everything up again.

Test Results with the AXI-GP port

This time around the results look a lot better. You obviously still have the latency of about 37 clock cycles for every AXI read transaction, but once the data starts coming in, it comes in without any intermittent delay clock cycles. With some read transactions you might see one or two intermittent delay clock cycles during the receiving of data, but that is about it:


From this set of results we can see that we are coming pretty close to the theoretical bandwidth.

In Summary

In this post we have discovered how to read from SDRAM to FPGA via the AXI port on the Zynq.

We also had a look at the bandwidth performance while reading. Reading data from an AXI-ACP port with a databus width of 32 bits is probably not the cleverest thing to do, as you sacrifice a lot of the potential bandwidth this way.

Our test on a 32-bit AXI_GP port behaved more or less as expected, and we got close to the theoretical bandwidth.

In both tests (i.e. AXI_ACP and AXI_GP) we experienced a latency of more or less 37 clock cycles per AXI read transaction. Your FIFO buffer should be made large enough so that this latency doesn't affect the bandwidth.

In the next post we will see if we can continuously read a picture frame from SDRAM and display it on the VGA screen.

Till next time!

Wednesday, 25 April 2018

Crossing from the AXI to the VGA Pixel Clock Domain


In the previous post we played around with VGA output from the Zybo Board.

In the end we managed to get a screen filled with A's.

This is one step closer to getting the pixel output from our VIC-II module displayed on a VGA-enabled screen.

In a previous post we managed to write the pixel output from our VIC-II module to SDRAM. So, the next logical step would be to continuously read the pixel data back from SDRAM and display it on the VGA-enabled screen.

In reading back the pixel data from SDRAM and displaying it on a VGA screen, we are again faced with a cross clock domain problem: the AXI port used for retrieving data from SDRAM operates at 100MHz, whereas our VGA pixel clock runs at around 85MHz.

In the post where we managed to write pixel data from the VIC-II to SDRAM, we were also faced with a cross clock domain issue. That cross clock domain problem, however, was easier to solve since we could fit multiple 100MHz clock cycles into a single VIC-II clock pulse. These multiple clock cycles made it easy for us to ensure that we are more or less at the centre of a VIC-II clock pulse when sampling data for the AXI domain.

The case is not so simple in our SDRAM->VGA scenario where we have 100MHz versus 85MHz. In this scenario we can fit only one-and-a-bit AXI clock cycles into one VGA pixel clock pulse. There is thus no easy way for us to tell when we are at the edge of a VGA pixel clock pulse.

The target of this post therefore is to see if we can find a way to solve this particular Cross Clock domain problem.

We will also test the solution to the above on the physical FPGA to see if we really get meaningful data back when it crosses from the AXI clock domain to the VGA clock domain.

Some Research

I did some searching on the Internet to see how people managed to solve cross clock domain issues similar to the one we currently have with our SDRAM->VGA path.

Most resources on the web suggest that you should make use of an asynchronous FIFO buffer. With an asynchronous FIFO buffer you feed in data at one clock frequency and read data back at a different clock frequency.

This sounds like exactly what we need! But how to implement an asynchronous FIFO is another story.

It all boils down to the fact that for any FIFO, whether asynchronous or not, we need two pointers: one for keeping track of the current top (e.g. the next place in memory where we will write data), and another for keeping track of the current bottom (e.g. the next place in memory from where we will read data).

The trick comes in with the fact that the top pointer and bottom pointer get updated in different clock domains, but occasionally both clock domains need access to both pointers.

In our case, for instance, the AXI clock domain will update the top pointer every time it writes an element to the FIFO. Similarly, the VGA pixel domain clock will update the bottom pointer every time data is read from the FIFO.

Also, both the AXI and VGA pixel clock domains need access to both pointers. The AXI clock domain needs to read the bottom pointer so that it doesn't write past this position, which would overwrite data that has not been read yet. Similarly, the VGA clock domain needs to be able to read the top pointer to avoid reading past valid data.

So, we are basically still having to solve a cross clock domain issue regarding the top and bottom pointers.
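The pointer bookkeeping described above can be sketched as a toy single-clock FIFO model in Python. This is purely illustrative: it shows the roles of the two pointers and the full/empty conditions, while the real asynchronous version must additionally pass each pointer safely into the other clock domain.

```python
# Toy two-pointer FIFO: 'top' is the write pointer, 'bottom' the read
# pointer, both wrapping around a fixed-depth circular buffer.
DEPTH = 16
buf = [0] * DEPTH
top = bottom = 0

def fifo_write(value):
    global top
    if (top + 1) % DEPTH == bottom:   # full: writing would catch up with
        return False                  # the unread bottom pointer
    buf[top] = value
    top = (top + 1) % DEPTH
    return True

def fifo_read():
    global bottom
    if bottom == top:                 # empty: nothing left to read
        return None
    value = buf[bottom]
    bottom = (bottom + 1) % DEPTH
    return value

fifo_write(42)
fifo_write(43)
print(fifo_read(), fifo_read(), fifo_read())  # 42 43 None
```

Note that one slot is sacrificed to distinguish full from empty; real hardware FIFOs often use an extra pointer bit for the same purpose.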

The first possible solution for the above issue that comes to mind is to use a two flip-flop synchronizer as described in the following web page:

This solution, however, only works well for single-bit signals. For multi-bit signals, such as the top and bottom pointers, you run the risk that some of the bits might settle before the others, leaving you with half-baked values.

The solution many web resources propose for passing multi-bit counter values across clock domains is to make them count using Gray code. When counting in Gray code, only a single bit changes at a time, as shown below for a four-bit Gray counter:

0 0 0 0 
0 0 0 1
0 0 1 1
0 0 1 0
0 1 1 0
0 1 1 1
0 1 0 1
0 1 0 0
1 1 0 0
1 1 0 1
1 1 1 1
1 1 1 0
1 0 1 0
1 0 1 1
1 0 0 1
1 0 0 0
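The sequence above is simply n XOR (n >> 1). A quick Python check confirms that consecutive codes, including the wrap-around from the last code back to the first, differ in exactly one bit:

```python
# Binary-to-Gray conversion and a check that every step changes one bit.
def gray(n):
    return n ^ (n >> 1)

codes = [gray(i) for i in range(16)]
for a, b in zip(codes, codes[1:] + codes[:1]):
    assert bin(a ^ b).count("1") == 1   # Hamming distance of exactly 1

print([format(c, "04b") for c in codes[:4]])  # first four rows of the table
```

This single-bit-change property is exactly what makes Gray-coded pointers safe to sample in another clock domain: even if the synchronizer catches the counter mid-transition, it sees either the old or the new value, never a half-baked one.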

The concept sounds simple, but it is still quite a mission to implement an asynchronous FIFO buffer from scratch. So, looking around on the Internet, I found an existing implementation of an asynchronous FIFO:

This implementation was written by Alex Claros F, and he based it on the article Asynchronous FIFO in Virtex-II FPGAs, written by Peter Alfke.

I was about to post a copy of the above mentioned implementation in my Blog, but then it came to mind that the author of this module didn't really give explicit permission within the comments of the module for reproducing his work on another website.

However, this shouldn't stop me from thanking Alex Claros F and Peter Alfke for publicly sharing their work.

The Approach

Having found an implementation of an asynchronous FIFO buffer, I am curious to know whether this buffer would really work within our context of sending data from the AXI domain to our VGA pixel clock domain.

To test a cross clock domain implementation, we will basically take the VGA module developed in the previous post and split it into two clock domains.

We will move all the functionality responsible for pixel data generation into the 100MHz clock domain, and write this pixel data to the asynchronous FIFO buffer at a rate of 100MHz.

We will link up the receiving end of the asynchronous FIFO buffer to the VGA pixel clock domain, reading one pixel element at a time and outputting it to the VGA connector.

One final piece of information worth mentioning is that on the AXI clock domain side we will try and keep the FIFO buffer full at all times.

Overview of the Asynchronous FIFO module

Let us cover some finer details of Alex Claros's asynchronous FIFO module.

When instantiating an instance of this module, there are two crucial parameters: DATA_WIDTH and ADDRESS_WIDTH.

The default value for DATA_WIDTH is 8 bits. In our case we will need to bump this value to 16 bits because of our 16-bit pixel size.

The default value for ADDRESS_WIDTH is 4 bits. This means a FIFO buffer size of 16 elements which will be sufficient for our case.

Let us now have a look at the ports of this module.

Firstly we have a port called Data_out for reading data and Data_in for writing data.

The reading and writing is clocked by two separate clocks RClk and WClk.

Also, we have two ports, ReadEn_in and WriteEn_in, specifying whether we want to write or read on a given clock.

The Clear_in port resets the FIFO to an empty state. We will typically use this functionality when we have just finished drawing a frame on the screen, to ensure we stay in sync.

The Full_out port indicates that the FIFO buffer is full and we should abstain from writing any more data while this pin is high. We will use this port to ensure the buffer is kept full at all times.

Finally, Empty_out indicates that the buffer is empty and reads should not be done. In our implementation we will not be using this port since we always try to keep the buffer full.

Implementing a State Machine

Our whole buffering mechanism will be driven by the Vertical Sync signal. When we reach a Vertical Sync pulse, we will clear the FIFO with the Clear_in port, and start populating the FIFO again with data starting again with the beginning of the frame.

During the course of the drawing the next frame we will try and keep the buffer full, till we encounter another Vertical Sync pulse.

To aid in this process flow we will need to implement a state machine.

We implement this state machine within our existing VGA module as follows:

parameter WAIT_START_VSYNC = 2'd0;
parameter RESET_CYCLE = 2'd1;
parameter GET_SET = 2'd2;
parameter WAIT_END_VSYNC = 2'd3;
reg [1:0] state = 2'b0;
always @(posedge clk_axi)
  case (state)
    WAIT_START_VSYNC: state <= vert_sync ? RESET_CYCLE : WAIT_START_VSYNC;  
    RESET_CYCLE: state <= GET_SET;
    GET_SET: state <= WAIT_END_VSYNC; 
    WAIT_END_VSYNC: state <= vert_sync ? WAIT_END_VSYNC : WAIT_START_VSYNC;    
  endcase

The majority of the time the state machine waits for the vertical sync signal, after which it resets all state for the beginning of a new frame.

You might have noticed that I have written vert_sync in italics. This is just to remind us that this signal comes from the VGA pixel clock domain. Yes, another cross clock domain issue we should take care of!

Luckily this is a single-bit signal, for which we can use the double flip-flop synchronizer we mentioned earlier.

When you do some reading on double flip-flop synchronizers, you will see it mentioned quite often that they catch 99% of all setup and hold violations. To cater for the remaining 1% of setup and hold violations you should add one or more additional flip-flops to the chain. I have gone a bit overboard and used a chain of five flip-flops:

reg vert_sync_delayed_1;
reg vert_sync_delayed_2;
reg vert_sync_delayed_3;
reg vert_sync_delayed_4;
reg vert_sync_delayed_5;
always @(posedge clk_axi)
  begin
    vert_sync_delayed_1 <= vert_sync;
    vert_sync_delayed_2 <= vert_sync_delayed_1;   
    vert_sync_delayed_3 <= vert_sync_delayed_2;
    vert_sync_delayed_4 <= vert_sync_delayed_3;
    vert_sync_delayed_5 <= vert_sync_delayed_4;  
  end
always @(posedge clk_axi)
  case (state)
    WAIT_START_VSYNC: state <= vert_sync_delayed_5 ? RESET_CYCLE : WAIT_START_VSYNC;  
    RESET_CYCLE: state <= GET_SET;
    GET_SET: state <= WAIT_END_VSYNC; 
    WAIT_END_VSYNC: state <= vert_sync_delayed_5 ? WAIT_END_VSYNC : WAIT_START_VSYNC;    
  endcase

Generating pixels in the AXI domain

As mentioned earlier we will move the generation of pixel data from the VGA pixel Clock domain to the AXI clock domain.

To implement this change we need to implement horizontal/vertical position counters that are also clocked in the AXI domain:

reg [10:0] horiz_pos_buffer = 0;
reg [10:0] vert_pos_buffer = 0;

always @(posedge clk_axi)
if (state == RESET_CYCLE)
  begin
    horiz_pos_buffer <= 0;
    vert_pos_buffer <= 0;
  end 
else if (buffer_full)
  begin
    //do nothing
  end 
else if (horiz_pos_buffer < 1359)
  horiz_pos_buffer <= horiz_pos_buffer + 1;
else begin
  horiz_pos_buffer <= 0;
  if (vert_pos_buffer < 767)
    vert_pos_buffer <= vert_pos_buffer + 1;
  else
    vert_pos_buffer <= 0;  
end

You will notice that we don't increment the counters while the buffer is full.
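The counter logic can be walked through in Python. This is a sketch only; the limits 1360 and 768 come from the counter comparisons in the Verilog above.

```python
# Python walk-through of the position counters: one step per AXI clock,
# frozen while the FIFO is full, wrapping at 1360x768.
H_TOTAL, V_TOTAL = 1360, 768
horiz = vert = 0

def step(buffer_full):
    global horiz, vert
    if buffer_full:
        return                          # hold position while FIFO is full
    if horiz < H_TOTAL - 1:
        horiz += 1
    else:                               # end of line: wrap and advance row
        horiz = 0
        vert = vert + 1 if vert < V_TOTAL - 1 else 0

for _ in range(H_TOTAL):                # advance exactly one full line
    step(buffer_full=False)
print(horiz, vert)                      # back at column 0, on the next row
```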

Now, let us give some attention to the generation of pixel data:

assign pixel_in_char = horiz_pos_buffer[2:0]; 

always @(posedge clk_axi)
  if (buffer_full)
    begin
      //do nothing
    end
  else if (pixel_in_char == 0)
    //pixel_shift_reg <= 0;
    case ({horiz_pos_buffer[3],vert_pos_buffer[2:0]})
      4'h0 : pixel_shift_reg <= 8'h18;
      4'h1 : pixel_shift_reg <= 8'h3C;
      4'h2 : pixel_shift_reg <= 8'h66;
      4'h3 : pixel_shift_reg <= 8'h7E;
      4'h4 : pixel_shift_reg <= 8'h66;
      4'h5 : pixel_shift_reg <= 8'h66;
      4'h6 : pixel_shift_reg <= 8'h66;
      4'h7 : pixel_shift_reg <= 8'h00;
      4'h8 : pixel_shift_reg <= 8'h7c;
      4'h9 : pixel_shift_reg <= 8'h66;
      4'ha : pixel_shift_reg <= 8'h66;
      4'hb : pixel_shift_reg <= 8'h7c;
      4'hc : pixel_shift_reg <= 8'h66;
      4'hd : pixel_shift_reg <= 8'h66;
      4'he : pixel_shift_reg <= 8'h7c;
      4'hf : pixel_shift_reg <= 8'h00;      
    endcase
  else
    pixel_shift_reg <= {pixel_shift_reg[6:0], 1'b0};   

We reuse the existing functionality of pixel_shift_reg and extend it a bit. Obviously we are now using the position counters within the AXI domain (i.e. the counter variables with the suffix _buffer).

You will also notice that we don't perform any operation on the pixel_shift_reg if the buffer is full.

To make things a bit more interesting, I will not be filling the screen only with A's this time, but with AB's.

Up to this point we haven't really had a look at the statement for instantiating the asynchronous FIFO buffer, so let us quickly have a look at how it looks at the moment:

     //Reading port
     //Writing port.  
     .WriteEn_in(state != GET_SET),
     .Clear_in(state == RESET_CYCLE));

As you see we are naming the instance my_fifo and we are overriding the DATA_WIDTH to 16 bits as explained previously.

Also, we are clearing the buffer when we are in the state RESET_CYCLE.

You will also notice that we have enabled writing to the buffer in almost all cases, except when we are in the state GET_SET. The reason for this is that in the clock cycle directly after RESET_CYCLE, the shift register isn't initialised yet. If we had wired WriteEn_in to a '1' value, we would indeed have written data to the buffer during this clock cycle, which would have resulted in an extra erroneous pixel.

It is for this reason that I have introduced an extra state after RESET_CYCLE, holding back the first write so that pixel_shift_reg can initialise properly. I have called this state GET_SET after the analogy of an athletics event where the athletes transition through the following states: On your marks, GET SET, GO.

Reading out pixels for display

We are now ready for implementing the functionality that falls within the VGA Pixel Clock domain.

We start off by doing the necessary changes to our FIFO instance:

     //Reading port
     .ReadEn_in((vert_pos < VERT_RES) & (horiz_pos < HORIZ_RES)),
     //Writing port.  
     .WriteEn_in(state != GET_SET),
     .Clear_in(state == RESET_CYCLE));

First of all, we only enable a read when we are currently within a visible portion of the screen. out_pixel_buffer is the pixel data we need to display.

We wire this port to the rest of our VGA Pixel Clock domain as follows:

assign red = out_pixel_buffer_final[15:11];
assign green = out_pixel_buffer_final[10:5];
assign blue = out_pixel_buffer_final[4:0];
assign out_pixel_buffer_final = (vert_pos < VERT_RES) 
                                & (horiz_pos < HORIZ_RES) ? 
                                out_pixel_buffer : 0;

Here again we only output pixel data from the buffer if we are within a visible portion of the screen.
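As a side note, the red/green/blue assigns implement a 16-bit RGB565 split. A quick Python equivalent of the same bit ranges (illustrative only):

```python
# Slice a 16-bit RGB565 pixel into its colour components, using the same
# bit ranges as the Verilog assigns: red 15:11, green 10:5, blue 4:0.
def split_rgb565(pixel):
    red   = (pixel >> 11) & 0x1F   # 5 bits
    green = (pixel >> 5)  & 0x3F   # 6 bits
    blue  = pixel         & 0x1F   # 5 bits
    return red, green, blue

print(split_rgb565(0xFFFF))   # (31, 63, 31) -> white
print(split_rgb565(0xF800))   # (31, 0, 0)  -> pure red
```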

The End Result

Let us have a look at the end result. I have again taken a close-up of the screen:

All the characters look normal to me, and I didn't really spot any odd-one-out pixels caused by potential setup-and-hold violations.

This is indeed a very crude check, but at least I think the Asynchronous FIFO-buffer is doing its job.

In Summary

In this post we managed to successfully split the VGA module developed in the previous post across two clock domains with the help of an asynchronous FIFO buffer.

In the next post we will start to implement functionality for reading from SDRAM to our FPGA.

Till next time!