Wednesday 25 April 2018

Crossing from the AXI to the VGA Pixel Clock Domain

Foreword

In the previous post we played around with VGA output from the Zybo Board.

In the end we managed to get a screen filled with A's.

This is one step closer to getting the pixel output from our VIC-II module displayed on a VGA-enabled screen.

In an earlier post we managed to write the pixel output from our VIC-II module to SDRAM. So, the next logical step would be to continuously read the pixel data back from SDRAM and display it on the VGA-enabled screen.

In reading back the pixel data from SDRAM and displaying it on a VGA screen, we are again faced with a cross clock domain problem: the AXI port used for retrieving data from SDRAM operates at 100MHz, whereas our VGA pixel clock runs at around 85MHz.

In the post where we managed to write pixel data from the VIC-II to SDRAM, we were also faced with a cross clock domain issue. That problem, however, was easier to solve since we could fit multiple 100MHz clock cycles into a single VIC-II clock pulse. These multiple clock cycles made it easy for us to ensure that we were more or less at the centre of a VIC-II clock pulse when sampling data for the AXI domain.

The case is not so simple in our SDRAM->VGA scenario, where we have 100MHz versus 85MHz. In this scenario we can fit only one "and a bit" AXI clock cycles into one VGA pixel clock pulse. There is thus no easy way for us to tell when we are at the edge of a VGA pixel clock pulse.

The goal of this post is therefore to see if we can find a way to solve this particular cross clock domain problem.

We will also test the solution on the physical FPGA to see if we really get meaningful data back once it has crossed from the AXI clock domain to the VGA clock domain.

Some Research

I did some searching on the Internet to see how people have solved cross clock domain issues similar to the one we currently have with our SDRAM->VGA path.

Most resources on the web suggest that you should make use of an asynchronous FIFO buffer. With an asynchronous FIFO buffer you write data using one clock and read it back using a different clock.

This sounds like exactly what we need! But how to implement an asynchronous FIFO is another story.

It all boils down to the fact that any FIFO, whether asynchronous or not, needs two pointers: one for keeping track of the current top (i.e. the next place in memory where we will write data), and another for keeping track of the current bottom (i.e. the next place in memory from which we will read data).

The trick comes in with the fact that the top pointer and the bottom pointer get updated in different clock domains, yet both clock domains need access to both pointers.

In our case, for instance, the AXI clock domain will update the top pointer every time it writes an element to the FIFO. Similarly, the VGA pixel clock domain will update the bottom pointer every time data is read from the FIFO.

Also, both the AXI and VGA pixel clock domains need access to both pointers. The AXI clock domain needs to read the bottom pointer so that it doesn't write past this position, overwriting data that has not been read yet. Similarly, the VGA clock domain needs to be able to read the top pointer to avoid reading past valid data.

So, we basically still have to solve a cross clock domain issue for the top and bottom pointers.

The first possible solution for the above issue that comes to mind is to use a two flip-flop synchronizer as described in the following web page: https://www.edn.com/electronics-blogs/day-in-the-life-of-a-chip-designer/4435339/Synchronizer-techniques-for-multi-clock-domain-SoCs

This solution, however, only works well for single-bit signals. For multi-bit signals, such as the top and bottom pointers, you run the risk that some of the bits settle before the others, so that you end up with half-baked values.
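To make the idea concrete, here is a minimal sketch of a two flip-flop synchronizer for a single-bit signal. The module and signal names (sync_2ff, clk_dst, async_in, sync_out) are made up for illustration, and the ASYNC_REG attribute is a Xilinx Vivado hint that other tools may ignore.

```verilog
// Minimal two flip-flop synchronizer for a SINGLE-bit signal.
// All names in this module are illustrative only.
module sync_2ff (
  input  wire clk_dst,   // clock of the receiving domain
  input  wire async_in,  // signal generated in another clock domain
  output wire sync_out   // version intended for use in clk_dst logic
);

  // ASYNC_REG is a Xilinx Vivado attribute that asks the tools to keep
  // the two flip-flops close together; other tools may ignore it.
  (* ASYNC_REG = "TRUE" *) reg stage_1 = 1'b0;
  (* ASYNC_REG = "TRUE" *) reg stage_2 = 1'b0;

  always @(posedge clk_dst)
  begin
    stage_1 <= async_in;  // may go metastable, so it is never used directly
    stage_2 <= stage_1;   // has had a full clock period to settle
  end

  assign sync_out = stage_2;

endmodule
```

The price of this scheme is latency: a change on async_in only shows up on sync_out after two rising edges of clk_dst.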

The solution many web resources propose for passing multi-bit counter values across clock domains is to make them count in Gray code. When counting in Gray code, only a single bit changes at a time, as shown below for a four-bit Gray counter:

0 0 0 0 
0 0 0 1
0 0 1 1
0 0 1 0
0 1 1 0
0 1 1 1
0 1 0 1
0 1 0 0
1 1 0 0
1 1 0 1
1 1 1 1
1 1 1 0
1 0 1 0
1 0 1 1
1 0 0 1
1 0 0 0
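The sequence above can be generated from an ordinary binary counter. The following is a small sketch of the standard binary-to-Gray and Gray-to-binary conversions. The module and port names are my own, and this is not necessarily how the FIFO implementation referenced later in this post does it internally.

```verilog
// Binary <-> Gray conversion for a WIDTH-bit value.
// Module and port names are illustrative only.
module gray_convert #(
  parameter WIDTH = 4
) (
  input  wire [WIDTH-1:0] bin_in,
  output wire [WIDTH-1:0] gray_out,  // Gray encoding of bin_in
  input  wire [WIDTH-1:0] gray_in,
  output reg  [WIDTH-1:0] bin_out    // binary decoding of gray_in
);

  // Binary to Gray: XOR each bit with its more significant neighbour.
  assign gray_out = bin_in ^ (bin_in >> 1);

  // Gray to binary: bit i is the XOR of Gray bits i and above.
  integer i;
  always @(*)
  begin
    for (i = 0; i < WIDTH; i = i + 1)
      bin_out[i] = ^(gray_in >> i);
  end

endmodule
```

Feeding bin_in with the values 0 to 15 reproduces the table above on gray_out, and wiring gray_out back into gray_in returns the original count on bin_out.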

The concept sounds simple, but it is still quite a mission to implement an asynchronous FIFO buffer from scratch. So, looking around on the Internet, I found an existing implementation of an Asynchronous FIFO:

http://www.asic-world.com/examples/verilog/asyn_fifo.html

This implementation was written by Alex Claros F, who based it on the article Asynchronous FIFO in Virtex-II FPGAs, written by Peter Alfke.

I was about to post a copy of the above-mentioned implementation on my blog, but then it occurred to me that the author of this module didn't give explicit permission, within the comments of the module, for reproducing his work on another website.

However, this shouldn't stop me from thanking Alex Claros F and Peter Alfke for publicly sharing their work.

The Approach

Having found an implementation of an Asynchronous FIFO buffer, I am curious to know whether this buffer will really work in our context of sending data from the AXI domain to our VGA pixel clock domain.

To test a cross clock domain implementation, we will take the VGA module developed in the previous post and split it into two clock domains.

Into the 100MHz clock domain we will move all the functionality responsible for pixel data generation. We will write this pixel data to the Asynchronous FIFO buffer, clocked at 100MHz.

We will link up the receiving end of the Asynchronous FIFO buffer to the VGA pixel clock domain, reading one pixel element at a time and outputting it to the VGA connector.

One final piece of information worth mentioning is that on the AXI clock domain side we will try to keep the FIFO buffer full at all times.

Overview of the Asynchronous FIFO module

Let us cover some finer details of Alex Claros's Asynchronous FIFO module.

When instantiating this module, there are two crucial parameters: DATA_WIDTH and ADDRESS_WIDTH.

The default value for DATA_WIDTH is 8 bits. In our case we will need to bump this value to 16 bits because of our pixel bit size.

The default value for ADDRESS_WIDTH is 4 bits. This means a FIFO buffer size of 16 elements which will be sufficient for our case.

Let us now have a look at the ports of this module.

Firstly, we have a port called Data_out for reading data and a port called Data_in for writing data.

Reading and writing are clocked by two separate clocks, RClk and WClk.

We also have two ports, ReadEn_in and WriteEn_in, that specify whether we want to read or have something to write on a given clock cycle.

The Clear_in port resets the FIFO to an empty state. We will typically use this functionality when we have just finished drawing a frame on the screen, to ensure we stay in sync.

The Full_out port indicates that the FIFO buffer is full and that we should abstain from writing any more data while this pin is high. We will use this port to ensure the buffer is kept full at all times.

Finally, the Empty_out port indicates that the buffer is empty and that reads should not be done. In our implementation we will not be using this port, since we always try to keep the buffer full.

Implementing a State Machine

Our whole buffering mechanism will be driven by the Vertical Sync signal. When we reach a Vertical Sync pulse, we will clear the FIFO with the Clear_in port and start populating the FIFO again with data from the beginning of the frame.

While the next frame is being drawn we will try to keep the buffer full, until we encounter another Vertical Sync pulse.

To aid in this process flow we will need to implement a state machine.

We implement this state machine within our existing VGA module as follows:

...
parameter WAIT_START_VSYNC = 2'd0;
parameter RESET_CYCLE = 2'd1;
parameter GET_SET = 2'd2;
parameter WAIT_END_VSYNC = 2'd3;
...
reg [1:0] state = 2'b0;
...
always @(posedge clk_axi)
  case (state)
    WAIT_START_VSYNC: state <= vert_sync ? RESET_CYCLE : WAIT_START_VSYNC;  
    RESET_CYCLE: state <= GET_SET;
    GET_SET: state <= WAIT_END_VSYNC; 
    WAIT_END_VSYNC: state <= vert_sync ? WAIT_END_VSYNC : WAIT_START_VSYNC;    
  endcase
...

The majority of the time the state machine waits for the Vertical Sync signal, after which it resets all state for the beginning of a new frame.

You might have noticed that I have written vert_sync in italics. This is just to remind us that this signal comes from the VGA pixel clock domain. Yes, another cross clock domain issue we should take care of!

Luckily this is a single-bit signal, for which we can use the double flip-flop synchronizer we mentioned earlier.

When you do some reading on double flip-flop synchronizers, you will often see it mentioned that they catch 99% of all setup and hold violations. To cater for the remaining 1% of setup and hold violations, you should add one or more additional flip-flops to the chain. I have gone a bit overboard and used a chain of five flip-flops:

...
reg vert_sync_delayed_1;
reg vert_sync_delayed_2;
reg vert_sync_delayed_3;
reg vert_sync_delayed_4;
reg vert_sync_delayed_5;
...
always @(posedge clk_axi)
begin
  vert_sync_delayed_1 <= vert_sync;
  vert_sync_delayed_2 <= vert_sync_delayed_1;   
  vert_sync_delayed_3 <= vert_sync_delayed_2;
  vert_sync_delayed_4 <= vert_sync_delayed_3;
  vert_sync_delayed_5 <= vert_sync_delayed_4;  
end
...
always @(posedge clk_axi)
  case (state)
    WAIT_START_VSYNC: state <= vert_sync_delayed_5 ? RESET_CYCLE : WAIT_START_VSYNC;  
    RESET_CYCLE: state <= GET_SET;
    GET_SET: state <= WAIT_END_VSYNC; 
    WAIT_END_VSYNC: state <= vert_sync_delayed_5 ? WAIT_END_VSYNC : WAIT_START_VSYNC;    
  endcase
...

Generating pixels in the AXI domain

As mentioned earlier we will move the generation of pixel data from the VGA pixel Clock domain to the AXI clock domain.

To implement this change we need horizontal and vertical position counters that are also clocked in the AXI domain:

...
reg [10:0] horiz_pos_buffer = 0;
reg [10:0] vert_pos_buffer = 0;
...
always @(posedge clk_axi)
if (state == RESET_CYCLE)
begin
  horiz_pos_buffer <= 0;
  vert_pos_buffer <= 0;
end else
if (buffer_full)
begin
  //do nothing
end else
if (horiz_pos_buffer < 1359)
  horiz_pos_buffer <= horiz_pos_buffer + 1;
else begin
  horiz_pos_buffer <= 0;
  if (vert_pos_buffer < 767)
  begin
    vert_pos_buffer <= vert_pos_buffer + 1;
  end else
  begin
    vert_pos_buffer <= 0;  
  end
end
...

You will notice that we don't increment the counters when the buffer is full.

Now, let us turn our attention to the generation of pixel data:

...
assign pixel_in_char = horiz_pos_buffer[2:0]; 
...
always @(posedge clk_axi)
  if (buffer_full)
  begin
    //do nothing: hold the shift register while the FIFO is full
  end
  else
  if (pixel_in_char == 0)
  begin
  //pixel_shift_reg <= 0;
    case ({horiz_pos_buffer[3],vert_pos_buffer[2:0]})
      4'h0 : pixel_shift_reg <= 8'h18;
      4'h1 : pixel_shift_reg <= 8'h3C;
      4'h2 : pixel_shift_reg <= 8'h66;
      4'h3 : pixel_shift_reg <= 8'h7E;
      4'h4 : pixel_shift_reg <= 8'h66;
      4'h5 : pixel_shift_reg <= 8'h66;
      4'h6 : pixel_shift_reg <= 8'h66;
      4'h7 : pixel_shift_reg <= 8'h00;
      4'h8 : pixel_shift_reg <= 8'h7c;
      4'h9 : pixel_shift_reg <= 8'h66;
      4'ha : pixel_shift_reg <= 8'h66;
      4'hb : pixel_shift_reg <= 8'h7c;
      4'hc : pixel_shift_reg <= 8'h66;
      4'hd : pixel_shift_reg <= 8'h66;
      4'he : pixel_shift_reg <= 8'h7c;
      4'hf : pixel_shift_reg <= 8'h00;      
    endcase
  end    
  else
    pixel_shift_reg <= {pixel_shift_reg[6:0], 1'b0};   
...



We have taken the existing functionality of pixel_shift_reg and extended it a bit. Obviously we are now using the position counters within the AXI domain (i.e. the counter variables with the suffix _buffer).

You will also notice that we don't perform any operation on pixel_shift_reg if the buffer is full.

To make things a bit more interesting, I will not be filling the screen only with A's this time, but with AB's.

Up to this point we haven't really had a look at the statement for instantiating the Asynchronous FIFO buffer, so let us quickly have a look at what it looks like at the moment:

aFifo
  #(.DATA_WIDTH(16))
  my_fifo
     //Reading port
    (.Data_out(), 
     .Empty_out(),
     .ReadEn_in(),
     .RClk(clk),        
     //Writing port.  
     .Data_in(out_pixel),  
     .Full_out(buffer_full),
     .WriteEn_in(state != GET_SET),
     .WClk(clk_axi),
  
     .Clear_in(state == RESET_CYCLE));


As you can see, we are naming the instance my_fifo and overriding DATA_WIDTH to 16 bits, as explained previously.

Also, we are clearing the buffer when we are in the state RESET_CYCLE.

You will also notice that we have enabled writing to the buffer in all cases except when we are in the state GET_SET. The reason for this is that in the clock cycle directly after RESET_CYCLE, the shift register isn't initialised yet. If we had wired WriteEn_in to a constant '1', we would have written data to the buffer during this clock cycle, which would have been an extra erroneous pixel.

It is for this reason that I have introduced an extra state after RESET_CYCLE, holding back the first write so that pixel_shift_reg can initialise properly. I have called this state GET_SET after the analogy of an athletics event where the athletes go through the following states: On your marks, GET SET, GO.

Reading out pixels for display

We are now ready to implement the functionality that falls within the VGA pixel clock domain.

We start off by making the necessary changes to our FIFO instance:

aFifo
  #(.DATA_WIDTH(16))
  my_fifo
     //Reading port
    (.Data_out(out_pixel_buffer), 
     .Empty_out(),
     .ReadEn_in((vert_pos < VERT_RES) & (horiz_pos < HORIZ_RES)),
     .RClk(clk),        
     //Writing port.  
     .Data_in(out_pixel),  
     .Full_out(buffer_full),
     .WriteEn_in(state != GET_SET),
     .WClk(clk_axi),
  
     .Clear_in(state == RESET_CYCLE));


First of all, we only enable a read when we are within the visible portion of the screen. out_pixel_buffer is the pixel data we need to display.

We wire this port to the rest of our VGA Pixel Clock domain as follows:

...
assign red = out_pixel_buffer_final[15:11];
assign green = out_pixel_buffer_final[10:5];
assign blue = out_pixel_buffer_final[4:0];
...
assign out_pixel_buffer_final = (vert_pos < VERT_RES) 
                                & (horiz_pos < HORIZ_RES) ? 
                                out_pixel_buffer : 0;
...



Here again, we only output pixel data from the buffer if we are within the visible portion of the screen.

The End Result

Let us have a look at the end result. I have again taken a close-up of the screen:



All the characters look normal to me, and I didn't spot any odd pixels that could have been caused by setup and hold violations.

This is indeed a very crude check, but at least I think the Asynchronous FIFO-buffer is doing its job.

In Summary

In this post we managed to split the VGA module developed in the previous post into two clock domains with the help of an Asynchronous FIFO buffer.

In the next post we will start to implement functionality for reading from SDRAM to our FPGA.

Till next time!

Thursday 19 April 2018

Playing with VGA output

Foreword

In the previous post we managed to write the output of our VIC-II module to SDRAM. The resulting frame as retrieved from the SDRAM of the Zybo looked pretty distorted, although we could see some resemblance to the C64 Welcome screen.

After I published the last post, I did some investigation into why the frames get distorted. It turned out that an old bitstream file got stuck in the Xilinx SDK Workspace I used, and exporting new bitstream files from Vivado to this Workspace simply didn't overwrite this old bitstream.

Eventually, after some frustration, I deleted this Workspace, created a new one and exported the bitstream again from my Vivado project to this Workspace. It was only then that I had the aha moment: my frame rendered perfectly without any distortions!

Well, one less thing to worry about. In this post then I will focus on something totally different.

In this post we will get VGA output to work. We will, however, not be developing a full solution for displaying the contents of a framebuffer on a monitor, but something simple, which is displaying a screen filled with the character 'A'.

Introduction to VGA timings

When you want to display something on a VGA enabled screen you will spend most of your effort getting to know the meaning of VGA timing parameters. So let us start covering them.

The basis of it all is the frequency of the pixel clock. The pixel clock sets the pace at which you dish out the pixels of your frame to your monitor. The name says it all: one pixel for every clock cycle of the pixel clock.

It is important to note that you will not be outputting displayable information at every cycle of the pixel clock. There will also be pixel cycles during which blanking and synchronisation happen. These terms will become clear in a moment.

Three terms you will hear quite often when talking about VGA parameters are front porch, back porch and synchronisation pulse. The following diagram will clarify these terms:


This graph illustrates a typical video signal. In the centre, Picture information represents the visible part of the signal.

The two small pedestals in the picture represent synchronisation pulses. A synchronisation pulse instructs the monitor to move the drawing position to the beginning of the next line. These sync pulses ensure that the monitor draws the pixels of the video signal at the correct places on the screen.

You will notice that the sync pulse does not directly follow the picture information, but rather has some padding surrounding it.

This padding was added to cater for the limitations of Cathode Ray Tubes, which were used in the first VGA monitors.

Actually, calling CRT's "the first VGA monitors" sounds a bit misleading, as if CRT's were only used briefly as VGA monitors. This is anything but the case!

CRT's were used as VGA monitors for roughly two decades. Even in the early 2000's your standard monitor was a CRT. LCD monitors only really started killing off CRT's towards the end of that decade.

But, let us get back to the point of the discussion: what limitations does a CRT have? Let us start by reviewing how a CRT works.

A CRT projects an electron beam onto a surface that is coated with phosphor. Where the beam hits the surface, a tiny spot on the screen will illuminate. Thus, to have a picture displayed on the screen, this beam needs to continuously scan across the whole screen.

This is performed from left to right and from top to bottom. The beam is moved around with the aid of magnetic deflection coils.

When the electron beam reaches the end of a line, the horizontal deflection coils move the beam rapidly back to the left to start a new line.

During the period when the beam moves rapidly back to the left and resumes scanning from left to right, the beam is not moving at a uniform speed. During this period we don't want the beam to draw anything on the screen at all.

It is for this reason we need to add some padding surrounding the horizontal sync pulse, so that drawing on the screen can only resume once the beam has reached a steady speed.

The padding periods surrounding the sync pulse are also parameters that need to be specified for a VGA signal. There are two parameters for this purpose: Back Porch and Front Porch.

Front Porch is the period of padding in front of the sync pulse and Back Porch is the period of padding after the sync pulse. In the VGA world, these two parameters are specified in terms of pixels.

So in summary, we have covered the following parameters so far:

  • Pixel Clock Frequency
  • Horizontal resolution, measured in pixels
  • Horizontal Front porch in pixels
  • Horizontal Back Porch in pixels
  • Horizontal Sync pulse width, also measured in pixels.

Most of the parameters above apply to the horizontal direction. There is, however, a similar set of parameters for the vertical direction. These parameters are not specified in terms of pixels, but rather in lines:

  • Vertical resolution (lines)
  • Vertical Front Porch (lines)
  • Vertical Back Porch (lines)
  • Vertical Sync pulse (lines)

This is about all there is to VGA timings.

Figuring out the VGA Timings

Before we can start developing an FPGA implementation for outputting a VGA signal, we first need to figure out the VGA timing parameters discussed in the previous section.

For this exercise I am going to display the signal on an LCD monitor that has a resolution of 1360x768 @ 60Hz. This is not really a standard VESA resolution that you will find timings for on the VESA website, so I had a bit of a hard time searching the Internet for the parameters.

Eventually I found something useful from the following link:

http://forums.entechtaiwan.com/index.php?topic=2578.25;wap2

Scrolling down the web page I found the timing parameters I was looking for, but packaged as a Linux modeline:

"1360x768" 85.875 1360 1408 1520 1768 768 769 772 810 +hsync +vsync

I have seen Linux modelines a couple of times in the past; they are used to configure the video card when running X Windows. However, I never really paid attention to what these numbers mean. The numbers in quotation marks obviously look like the target resolution, but the other numbers are a bit Greek to me :-)

So, let us do a bit of further Internet Searching on what these numbers mean...

The following web page comes to the rescue:

http://howto-pages.org/ModeLines/

The key part of the page is where they describe how to write down the numbers:

...you write down the frequency of the pixel clock in MHz: 108
Next, you simply list out in this order: HDisplay HSyncStart HSyncEnd HTotal. In my case:
1280 1346 1458 1688.
Fourthly, you list out the corresponding vertical data: VDisplay VSyncStart VSyncEnd VTotal:
1024 1025 1028 1066

We can apply the same reasoning to our modeline. So, the number 85.875 is the frequency of the pixel clock in MHz.

To understand the rest of the numbers, we should visualise one long line that starts at the beginning of the visible data and extends all the way to the end of the Horizontal Back Porch. All the crucial timing elements are then marked as specific pixels on the line.

So, let us start with the first number after the pixel clock. This number, 1360, is the number of visible pixels on the line. Counting from zero, pixel 1359 is the last visible pixel, and pixel 1360 is the first pixel of the Front Porch.

The Front Porch pixels carry on until we reach pixel number 1408 (i.e. the next number in the list of parameters). At this pixel we enable the Horizontal Sync pulse, which lasts until we reach pixel 1520, at which point the sync pulse is switched off.

After the Horizontal Sync pulse is switched off, we are in the Back Porch period, which lasts until we reach the line total of 1768 pixels.

Pixel 1767 is therefore the last pixel of the line, after which we wrap back to pixel 0 and are at the beginning of the visible area of the next line.

The set of numbers following 1768 relates to vertical syncing, which follows the same convention as the horizontal parameters. The only difference is that we specify the vertical parameters in terms of lines.

With Linux modelines somewhat demystified, we can now calculate the parameters for use within our FPGA design.

Starting with the Front Porch, we know it starts at pixel 1360 and carries on until pixel 1408, so the Horizontal Front Porch width in terms of pixels is:

1408 - 1360 = 48

Similarly, the Horizontal sync pulse width can be calculated as:

1520 - 1408 = 112

Finally, we can calculate the Back Porch width as:

1768 - 1520 = 248

We can now move on to the Vertical parameters. Front Vertical Porch width:

769 - 768 = 1 line

Vertical sync pulse width:

772 - 769 = 3 lines

Back Vertical Porch:

810 - 772 = 38 lines

The last two parameters in the list, +hsync and +vsync, indicate the polarity of the horizontal and vertical sync pulses. In this case both our sync pulses will trigger a sync action when they are at a logic level '1'.
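As a sanity check on the arithmetic above, the subtractions can be left to the simulator. The following is a simulation-only sketch: the module name and the localparam names are made up for illustration. It derives the porch and sync widths from the modeline numbers and also prints the refresh rate implied by the pixel clock, which should come out just under the 60Hz we are aiming for.

```verilog
// Derive the porch and sync widths from the modeline numbers and print them.
// Simulation-only sketch; all names are illustrative.
module modeline_check;

  // "1360x768" 85.875 1360 1408 1520 1768 768 769 772 810 +hsync +vsync
  localparam real    PIXEL_CLOCK_HZ = 85875000.0;
  localparam integer H_DISPLAY    = 1360;
  localparam integer H_SYNC_START = 1408;
  localparam integer H_SYNC_END   = 1520;
  localparam integer H_TOTAL      = 1768;
  localparam integer V_DISPLAY    = 768;
  localparam integer V_SYNC_START = 769;
  localparam integer V_SYNC_END   = 772;
  localparam integer V_TOTAL      = 810;

  // Widths are simply the differences between consecutive markers.
  localparam integer H_FRONT_PORCH = H_SYNC_START - H_DISPLAY;
  localparam integer H_SYNC_WIDTH  = H_SYNC_END - H_SYNC_START;
  localparam integer H_BACK_PORCH  = H_TOTAL - H_SYNC_END;
  localparam integer V_FRONT_PORCH = V_SYNC_START - V_DISPLAY;
  localparam integer V_SYNC_WIDTH  = V_SYNC_END - V_SYNC_START;
  localparam integer V_BACK_PORCH  = V_TOTAL - V_SYNC_END;

  initial
  begin
    $display("Horizontal: front porch %0d, sync %0d, back porch %0d pixels",
             H_FRONT_PORCH, H_SYNC_WIDTH, H_BACK_PORCH);
    $display("Vertical:   front porch %0d, sync %0d, back porch %0d lines",
             V_FRONT_PORCH, V_SYNC_WIDTH, V_BACK_PORCH);
    // One frame takes H_TOTAL * V_TOTAL pixel clocks.
    $display("Refresh rate: %f Hz", PIXEL_CLOCK_HZ / (H_TOTAL * V_TOTAL));
  end

endmodule
```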

Designing the FPGA module

We finally have enough information to start the design of our FPGA module.

We start with a very basic skeleton:

module vga(
  input clk
    );

endmodule


We currently only have the clock as an input port. Obviously this clock will need to run at the desired pixel frequency, which is 85.875MHz.

Next, we should add the various VGA parameters to our module:

module vga(
  input clk
    );

parameter HORIZ_RES = 1360;
parameter VERT_RES = 768;
parameter HORIZ_BACK_PORCH = 248;
parameter HORIZ_FRONT_PORCH = 48;
parameter HORIZ_SYNC = 112;
parameter VERT_BACK_PORCH = 38;
parameter VERT_FRONT_PORCH = 1;
parameter VERT_SYNC = 3;

endmodule


From these parameters we derive further parameters for triggering the various events during scanlines:

...
parameter TOTAL_HORIZ_RES = HORIZ_RES + HORIZ_BACK_PORCH + HORIZ_SYNC + HORIZ_FRONT_PORCH;
parameter TOTAL_VERT_RES = VERT_RES + VERT_BACK_PORCH + VERT_SYNC + VERT_FRONT_PORCH;
parameter HORIZ_SYNC_START = HORIZ_RES + HORIZ_FRONT_PORCH;
parameter HORIZ_SYNC_END = HORIZ_SYNC_START + HORIZ_SYNC;          
parameter VERT_SYNC_START = VERT_RES + VERT_FRONT_PORCH;
parameter VERT_SYNC_END = VERT_SYNC_START + VERT_SYNC;          
...

Admittedly, these parameters look exactly like the modeline parameters we started with. You would probably only take this approach if you got the VGA parameters in another way and not via a modeline...

Next, we should implement two counters, one each for the vertical and horizontal directions:

...
reg [10:0] horiz_pos = 0;
reg [10:0] vert_pos = 0;
...
always @(posedge clk)
if (horiz_pos < TOTAL_HORIZ_RES - 1)
  horiz_pos <= horiz_pos + 1;
else begin
  horiz_pos <= 0;
  if (vert_pos < TOTAL_VERT_RES - 1)
  begin
    vert_pos <= vert_pos + 1;
  end else
  begin
    vert_pos <= 0;  
  end
end
...

These counters will synchronise all the functionality within our VGA module.

Next up, let us generate the vertical and horizontal sync pulses:

module vga(
  input clk,
  output vert_sync,
  output horiz_sync
    );
...
assign vert_sync = vert_pos >= VERT_SYNC_START & vert_pos < VERT_SYNC_END;  
assign horiz_sync = horiz_pos >= HORIZ_SYNC_START & horiz_pos < HORIZ_SYNC_END;
...
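Before going to hardware, the sync timing can be checked in simulation. The sketch below gathers the counters and sync assignments shown so far into one module and adds a small testbench that counts clock cycles. The names vga_sync and tb_vga_sync and the testbench variables are made up for this sketch, and it assumes a Verilog simulator of your choice. With the parameters above it should report an hsync width of 112 clocks and a line length of 1768 clocks, matching the modeline.

```verilog
`timescale 1ns / 1ps

// Counters and sync generation from the post, gathered into one module
// so that the testbench below is self-contained. Names are illustrative.
module vga_sync (
  input  wire clk,
  output wire vert_sync,
  output wire horiz_sync
);

  parameter HORIZ_RES = 1360;
  parameter VERT_RES = 768;
  parameter HORIZ_BACK_PORCH = 248;
  parameter HORIZ_FRONT_PORCH = 48;
  parameter HORIZ_SYNC = 112;
  parameter VERT_BACK_PORCH = 38;
  parameter VERT_FRONT_PORCH = 1;
  parameter VERT_SYNC = 3;

  parameter TOTAL_HORIZ_RES = HORIZ_RES + HORIZ_BACK_PORCH + HORIZ_SYNC + HORIZ_FRONT_PORCH;
  parameter TOTAL_VERT_RES = VERT_RES + VERT_BACK_PORCH + VERT_SYNC + VERT_FRONT_PORCH;
  parameter HORIZ_SYNC_START = HORIZ_RES + HORIZ_FRONT_PORCH;
  parameter HORIZ_SYNC_END = HORIZ_SYNC_START + HORIZ_SYNC;
  parameter VERT_SYNC_START = VERT_RES + VERT_FRONT_PORCH;
  parameter VERT_SYNC_END = VERT_SYNC_START + VERT_SYNC;

  reg [10:0] horiz_pos = 0;
  reg [10:0] vert_pos = 0;

  always @(posedge clk)
  if (horiz_pos < TOTAL_HORIZ_RES - 1)
    horiz_pos <= horiz_pos + 1;
  else begin
    horiz_pos <= 0;
    if (vert_pos < TOTAL_VERT_RES - 1)
      vert_pos <= vert_pos + 1;
    else
      vert_pos <= 0;
  end

  assign vert_sync = vert_pos >= VERT_SYNC_START & vert_pos < VERT_SYNC_END;
  assign horiz_sync = horiz_pos >= HORIZ_SYNC_START & horiz_pos < HORIZ_SYNC_END;

endmodule

// Testbench: measures the hsync pulse width and the line length in clocks.
module tb_vga_sync;

  reg clk = 0;
  wire vert_sync;
  wire horiz_sync;

  reg prev_sync = 0;
  reg seen_edge = 0;
  integer high_clocks = 0;
  integer line_clocks = 0;

  vga_sync dut (.clk(clk), .vert_sync(vert_sync), .horiz_sync(horiz_sync));

  always #5 clk = ~clk;   // the exact period does not matter for counting

  // Sample on the falling edge, away from the edge where the counters change.
  always @(negedge clk)
  begin
    if (horiz_sync && !prev_sync)   // rising edge of hsync
    begin
      if (seen_edge)
      begin
        $display("hsync width: %0d clocks, line length: %0d clocks",
                 high_clocks, line_clocks);
        $finish;
      end
      seen_edge = 1;
      high_clocks = 0;
      line_clocks = 0;
    end
    if (horiz_sync)
      high_clocks = high_clocks + 1;
    line_clocks = line_clocks + 1;
    prev_sync = horiz_sync;
  end

endmodule
```

The same approach can be repeated for vert_sync by counting lines instead of clocks, at the cost of a longer simulation run.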


Next, we should generate the actual displayable pixel data. As mentioned earlier, we want to display a screen filled with 'A's. We use the A image contained in the C64 Character ROM, which is an 8x8 pixel image.

We will generate the image data in almost the same way as we did with our VIC-II module, which is loading a byte of image data into a shift register and then shifting it out bit by bit for display.

Here is the implementation for the shift register:

wire [2:0] pixel_in_char;
reg [7:0] pixel_shift_reg;
...
assign pixel_in_char = horiz_pos[2:0];
...
always @(posedge clk)
  if (pixel_in_char == 0)
  begin
    case (vert_pos[2:0])
      3'h0 : pixel_shift_reg <= 8'h18;
      3'h1 : pixel_shift_reg <= 8'h3C;
      3'h2 : pixel_shift_reg <= 8'h66;
      3'h3 : pixel_shift_reg <= 8'h7E;
      3'h4 : pixel_shift_reg <= 8'h66;
      3'h5 : pixel_shift_reg <= 8'h66;
      3'h6 : pixel_shift_reg <= 8'h66;
      3'h7 : pixel_shift_reg <= 8'h00;
    endcase
  end    
  else
    pixel_shift_reg <= {pixel_shift_reg[6:0], 1'b0};   
...

We basically break up the visible area into 8x8 cells. When we are at the first pixel of a cell (i.e. bits 2-0 of horiz_pos == 0) we load pixel_shift_reg with the byte value for the applicable row. For the remaining pixels, we just keep shifting out until we get to a new 8x8 cell.

So, if we are within the visible area of the screen, pixel_shift_reg[7] will tell us whether the current pixel should be on or off.

Next, we should map an on/off pixel to a color. Before we can do this, we should first find out how color signals work in VGA.

To convey color information, a VGA connector provides three analogue pins: a separate pin each for Red, Green and Blue.

Since an FPGA can only output zeros and ones on its output pins, a DAC (Digital to Analogue Converter) is required to interface with the color pins on the VGA connector.

Luckily the designers of the Zybo board have taken care of this for us by providing, apart from the onboard VGA connector, a simple DAC between the FPGA and the VGA connector. One can see a diagram of the setup in the Technical Reference Manual of the Zybo:

You provide color sample values as 16-bit binary numbers in the RGB-565 format. On the Zybo board itself there are three resistor ladder networks, one for converting each color channel to an analogue representation. If you want to read a bit more on digital to analogue conversion using resistor ladders, you can read the following on Wikipedia:

https://en.wikipedia.org/wiki/Resistor_ladder#R–2R_resistor_ladder_network_(digital_to_analog_conversion)

It is quite an interesting subject!

Back to our FPGA design. For now, we will just output Black if the pixel is off and White if it is on. This translates to the following:

module vga(
  input clk,
  output vert_sync,
  output horiz_sync,
  output [4:0] red,
  output [5:0] green,
  output [4:0] blue
 
    );
...
wire [15:0] out_pixel;
...
assign red = out_pixel[15:11];
assign green = out_pixel[10:5];
assign blue = out_pixel[4:0];
...
assign out_pixel = (vert_pos < VERT_RES) & (horiz_pos < HORIZ_RES) ? (pixel_shift_reg[7] ? 16'hffff : 0) : 0;
...

We only output a value for out_pixel from our shift register if we are within the visible region, otherwise we just output a black pixel.
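Should you want colors other than black and white, a 24-bit color needs to be squeezed into the RGB-565 layout used above. The following is a small sketch of one common way of doing this, which is to keep only the most significant bits of each channel. The module and port names are hypothetical and are not part of the VGA module in this post.

```verilog
// Pack an 8-bit-per-channel color into the 16-bit RGB-565 layout by
// keeping only the most significant bits of each channel.
// Module and port names are illustrative only.
module rgb888_to_rgb565 (
  input  wire [7:0]  red8,
  input  wire [7:0]  green8,
  input  wire [7:0]  blue8,
  output wire [15:0] pixel565   // {red[4:0], green[5:0], blue[4:0]}
);

  // Red lands in bits 15-11, green in bits 10-5 and blue in bits 4-0,
  // which is the same split used by the red/green/blue assigns above.
  assign pixel565 = {red8[7:3], green8[7:2], blue8[7:3]};

endmodule
```

Full white (255, 255, 255) maps to 16'hffff, which is the value we already use for pixels that are on.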

This concludes our VGA output module.

Wiring everything up

With the VGA output module complete, we need to create an instance of this module and wire up all the ports.

We do this by first wrapping this module into an IP Block, which we covered in a previous post.

We then create a new block design. In this Block Design we will start by dropping in an instance of our VGA block.

We will also need to invoke the Clocking Wizard to create a block for generating an 85.875MHz clock signal, which will be our pixel clock. We will link up this signal to the clk port of our VGA block.

As usual for our Zybo designs, we also need to add a ZYNQ processing block with relevant supporting blocks to our block design.

Up to this point the block design will look something like the following:


What still needs to be done is to connect the output ports of our VGA block to the pins of the FPGA that lead to the VGA connector.

We need to create a constraints file for doing the pin assignments. There is a constraints file available on GitHub for the pin assignments of the Zybo Board. From this file, just copy out the pin definitions for the VGA-related pins, which will yield more or less the following:

#VGA Connector
#IO_L7P_T1_AD2P_35
set_property PACKAGE_PIN M19 [get_ports {vga_r[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_r[0]}]

#IO_L9N_T1_DQS_AD3N_35
set_property PACKAGE_PIN L20 [get_ports {vga_r[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_r[1]}]

#IO_L17P_T2_AD5P_35
set_property PACKAGE_PIN J20 [get_ports {vga_r[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_r[2]}]

#IO_L18N_T2_AD13N_35
set_property PACKAGE_PIN G20 [get_ports {vga_r[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_r[3]}]

#IO_L15P_T2_DQS_AD12P_35
set_property PACKAGE_PIN F19 [get_ports {vga_r[4]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_r[4]}]

#IO_L14N_T2_AD4N_SRCC_35
set_property PACKAGE_PIN H18 [get_ports {vga_g[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[0]}]

#IO_L14P_T2_SRCC_34
set_property PACKAGE_PIN N20 [get_ports {vga_g[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[1]}]

#IO_L9P_T1_DQS_AD3P_35
set_property PACKAGE_PIN L19 [get_ports {vga_g[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[2]}]

#IO_L10N_T1_AD11N_35
set_property PACKAGE_PIN J19 [get_ports {vga_g[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[3]}]

#IO_L17N_T2_AD5N_35
set_property PACKAGE_PIN H20 [get_ports {vga_g[4]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[4]}]

#IO_L15N_T2_DQS_AD12N_35
set_property PACKAGE_PIN F20 [get_ports {vga_g[5]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_g[5]}]

#IO_L14N_T2_SRCC_34
set_property PACKAGE_PIN P20 [get_ports {vga_b[0]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_b[0]}]

#IO_L7N_T1_AD2N_35
set_property PACKAGE_PIN M20 [get_ports {vga_b[1]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_b[1]}]

#IO_L10P_T1_AD11P_35
set_property PACKAGE_PIN K19 [get_ports {vga_b[2]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_b[2]}]

#IO_L14P_T2_AD4P_SRCC_35
set_property PACKAGE_PIN J18 [get_ports {vga_b[3]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_b[3]}]

#IO_L18P_T2_AD13P_35
set_property PACKAGE_PIN G19 [get_ports {vga_b[4]}]
set_property IOSTANDARD LVCMOS33 [get_ports {vga_b[4]}]

#IO_L13N_T2_MRCC_34
set_property PACKAGE_PIN P19 [get_ports vga_hs]
set_property IOSTANDARD LVCMOS33 [get_ports vga_hs]

#IO_0_34
set_property PACKAGE_PIN R19 [get_ports vga_vs]
set_property IOSTANDARD LVCMOS33 [get_ports vga_vs]

You will notice that each pin of a vector like vga_r, vga_g, vga_b is specified separately.

We still need to add the actual pins to our block design. So, right-click your block design and select Create Port. Complete the popup box as follows:


For the port name, you should specify the same name that follows get_ports in the constraints file. You will need to create a port for each color channel, called vga_r, vga_g and vga_b.

Remember to specify the correct vector range for each one (e.g. 4..0 for vga_r/vga_b and 5..0 for vga_g).

Luckily you need to add only one port per color channel, and not one per pin as was done in the constraints file.

After the color channel ports, you need to create two more ports, vga_hs and vga_vs, which are both single-bit ports.

With all the ports created, you just need to wire them up to your VGA block, yielding the following:


We are done drawing our block design. We can now continue to synthesise the design and generate the bitstream file.

Once this is finished you can export the bitstream to a Xilinx SDK workspace and start the design on the FPGA as we did in previous posts.

The End Results

I took a close up of the screen with the FPGA running our VGA module:

The A's are pretty crisp. As mentioned, this screen is 1360 pixels wide, so one can fit 170 characters on a line on this screen.

A small disappointment is a small unused part at the top of the screen. The following photo will give you an idea of the unused part of the screen. (The overuse of the flash was on purpose. The glare on the gloss border of the monitor helps to identify the real margins of the screen):


There is also a very small margin on the right hand side.

For now, however, I am not too fussed with the margins.

In Summary

In this post we played around with VGA output using the Zybo Board.

In the end we managed to get a screen filled with A's.

In the next post and in coming posts I will start working on functionality for reading the frames back from SDRAM to our FPGA and then displaying them on the VGA screen.

Till next time!

Friday 13 April 2018

Writing video frames to SDRAM

Foreword

In the previous post I described the glitch I encountered using the Block RAM in the Zybo in Dual port mode.

With this glitch the VIC-II couldn't read the contents of RAM from the assigned Block RAM port.

In the end the issue seemed to be related to the fact that the VIC-II only uses the first 16KB of memory. We solved this issue by first reading the full range of RAM a couple of times from the assigned VIC-II RAM port upon startup.

In this post we will implement functionality for writing the frame data produced by the VIC-II to SDRAM. We will then download this data to a PC so we can verify that the frames produced by the VIC-II on the physical FPGA are indeed correct.

The planned Approach

You might recall that in a previous post we developed a Verilog module called burst_block with which we managed to write data from the FPGA to SDRAM.

In this post we will also use this module to capture video data from our VIC-II module to SDRAM. There are, however, a couple of modifications we need to make to our design before using burst_block for this purpose.

The first required change is due to different clock domains. The burst_block uses the AXI clock which runs at 100MHz, which we cannot really change. Our VIC-II core, however, outputs pixels at a rate of 8MHz. We therefore need to put in some effort to accommodate these different clock domains in order to avoid setup and hold violations.

The second required change is due to different data widths. Each pixel output of the VIC-II has a data width of 24 bits whereas the burst_block expects data words of 32 bits. This is a waste of 8 bits per pixel!

We can definitely improve on the differing data width situation. Firstly, 24 bits per pixel from the VIC-II might be a bit of overkill considering that the VIC-II only has 16 distinct colors.

We can truncate each pixel from the VIC-II to 16-bits using the RGB565 format. With the RGB565 format we have 5 bits for Red, 6 bits for Green and another 5 bits for Blue.

With the pixel output of the VIC-II truncated to 16 bits we can fit two pixels within the 32-bit word input to the burst_block.

Concatenating two pixels into a Word

Let us start with the requirement of squeezing two pixels into a word that goes to our burst_block.

Firstly we truncate the output pixel of the VIC-II module to 16 bits:

wire [15:0] pixel_16_bit;
...
    assign pixel_16_bit = {out_rgb[23:19],out_rgb[15:10],out_rgb[7:3]};
...
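
As a quick way to check the bit slicing above on a PC, the same truncation can be modelled in Python. This is only a sketch of the arithmetic, not part of the FPGA design; the function name rgb888_to_rgb565 is made up for this illustration, and it assumes out_rgb is laid out as 0xRRGGBB.

```python
def rgb888_to_rgb565(rgb):
    """Truncate a 24-bit 0xRRGGBB value to 16-bit RGB565.

    Mirrors {out_rgb[23:19], out_rgb[15:10], out_rgb[7:3]}: the top bits
    of each channel are kept and the low bits are dropped.
    """
    r5 = (rgb >> 19) & 0x1F  # out_rgb[23:19], top 5 bits of red
    g6 = (rgb >> 10) & 0x3F  # out_rgb[15:10], top 6 bits of green
    b5 = (rgb >> 3) & 0x1F   # out_rgb[7:3], top 5 bits of blue
    return (r5 << 11) | (g6 << 5) | b5
```

For example, pure red (0xFF0000) lands in the top five bits of the result and pure blue (0x0000FF) in the bottom five.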

The next question is how do we concatenate two of these pixels into a single 32-bit word? We do this by means of a delay element:

reg [15:0] pixel_16_bit_delay;
...
    always @(posedge clk)      
      pixel_16_bit_delay <= pixel_16_bit;
...

The clock source should be the same as the one that drives the pixel clock of the VIC-II, which is 8MHz.

The combined 32-bit word can be formed by just concatenating the above:

wire [31:0] combined_word;
...
   assign combined_word = {pixel_16_bit_delay,pixel_16_bit};
...

Obviously the write to burst_block should only be triggered every second clock cycle.
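
To make the pairing concrete, here is a small Python model of the delay element and the write on every second pixel. It is a sketch under the assumption that the earlier pixel of each pair ends up in the upper 16 bits, as in combined_word; the function name pack_pixels is hypothetical.

```python
def pack_pixels(pixels):
    """Pack 16-bit pixels pairwise into 32-bit words.

    The previous pixel plays the role of pixel_16_bit_delay, and a word
    is only emitted on every second pixel. A trailing unpaired pixel is
    dropped, since no second pixel arrives to complete the word.
    """
    words = []
    delayed = 0
    for index, pixel in enumerate(pixels):
        if index % 2 == 1:
            # Equivalent of {pixel_16_bit_delay, pixel_16_bit}
            words.append(((delayed & 0xFFFF) << 16) | (pixel & 0xFFFF))
        delayed = pixel
    return words
```

Feeding it the four pixels 0x1111, 0x2222, 0x3333 and 0x4444 gives the two words 0x11112222 and 0x33334444.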

Handling the Cross Clock Domains

As mentioned earlier, our burst_block is clocked at 100MHz and the VIC-II at 8MHz. These are two different clock domains, and crossing between them needs special attention.

Before we decide how to deal with these two cross clock domains, let us familiarise ourselves again with the ports of the burst_block module:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs

    );


The key port to look at is write. When the AXI clock transitions to high and write is 1, the contents of write_data will be written to an internal buffer of burst_block, queued for writing to SDRAM.

Ideally we should try to keep the write wire high for one AXI clock pulse somewhere in the middle of a clock pulse of the VIC-II pixel clock.

Let us start by first determining the centre of a VIC-II clock pulse in terms of 100MHz clock cycles.

We can fit 100/8=12.5 100MHz clock cycles in a single VIC-II clock cycle.

The length of a single VIC-II clock pulse is half of this, i.e. 6.25 100MHz clock cycles. The centre of a VIC-II clock pulse is therefore about 3 100MHz clock cycles (6.25/2 = 3.125, rounded down) after its edge.
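
The same calculation can be written out in a few lines of Python as a sanity check. The variable names are purely illustrative and do not appear in the design.

```python
AXI_CLK_MHZ = 100   # AXI clock frequency
PIXEL_CLK_MHZ = 8   # VIC-II pixel clock frequency

# Number of AXI clock cycles in one full VIC-II clock cycle
cycles_per_period = AXI_CLK_MHZ / PIXEL_CLK_MHZ

# One clock pulse (the high or the low half) is half a period
cycles_per_pulse = cycles_per_period / 2

# The centre of a pulse, rounded down to whole AXI clock cycles
centre_offset = int(cycles_per_pulse / 2)

print(cycles_per_period, cycles_per_pulse, centre_offset)
```

The rounded-down value is the constant that cont_bits gets compared against.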

With this information we write the following:

...
reg target_logic_level = 0;
reg do_sample;
...
    always @(posedge axi_clk_in)
    if (cont_bits == 3)
      target_logic_level <= ~target_logic_level;
...
    always @(posedge axi_clk_in)
    if (clk == target_logic_level)
      cont_bits <= cont_bits + 1;
    else
      cont_bits <= 0;
...
    always @(negedge axi_clk_in)
        do_sample <= (cont_bits == 3) & target_logic_level;
...

target_logic_level keeps track of the logic level we currently expect from the 8MHz clock signal. When we encounter 3 consecutive AXI clock cycles at this logic level, we know we are more or less in the centre of the clock pulse.
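
To get a feel for when do_sample fires, the logic can be stepped through in software. The following Python model is a rough sketch only: it assumes ideal clocks that start phase-aligned at tick 0, ignores metastability, and assumes cont_bits is wide enough not to overflow. The function name simulate_sampler and its parameters are hypothetical.

```python
def simulate_sampler(num_ticks, axi_mhz=100, pixel_mhz=8, centre=3):
    """Return the AXI tick numbers at which do_sample would be high."""
    target = 0   # models target_logic_level
    count = 0    # models cont_bits
    samples = []
    for tick in range(num_ticks):
        # Level of the pixel clock as seen at this AXI rising edge
        phase = (tick * pixel_mhz) % axi_mhz
        clk = 1 if 2 * phase < axi_mhz else 0
        # Both always blocks see the old register values (non-blocking)
        next_target = target ^ 1 if count == centre else target
        next_count = count + 1 if clk == target else 0
        target, count = next_target, next_count
        # The negedge block samples the freshly updated registers
        if count == centre and target == 1:
            samples.append(tick)
    return samples

print(simulate_sampler(110))
```

In this model a sample is produced once per high pulse of the pixel clock, with the spacing alternating between 12 and 13 AXI cycles because 12.5 cycles fit in one pixel clock period.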

do_sample makes use of this info and is a very good candidate signal to feed to the write port of the burst_block. There are, however, a couple of extra conditions we need to incorporate, as shown below:

reg pixel_sample_offset = 0;

    always @(posedge clk)
      if (!blank_signal)
        pixel_sample_offset <= ~pixel_sample_offset;

assign write_pin = do_sample ? !blank_signal & pixel_sample_offset : 0;

pixel_sample_offset ensures that we only trigger a write every second pixel as mentioned earlier.

blank_signal also plays an important role since we don't write pixels during horizontal and vertical blanking.

Catering for Frame Synchronisation

In its current state, burst_block only accepts data, and we have no control over the address the data is written to. This poses a problem when we start a new frame and want to set the address back to the beginning of the frame buffer.

In this section we will cater for this scenario by adding an extra port, next_frame, to burst_block for receiving a frame sync signal:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire next_frame,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs

    );


Next, we make some modifications to the part where we set the axi_start_address:

always @(negedge clk)
if (!reset | (next_frame & count_in_buf == 0))
begin
  axi_start_address <= 32'h200000;
  axi_data_inc <= 0;
end
else if (state == INIT_CMD)
begin
  axi_start_address <= axi_start_address + axi_data_inc;
  axi_data_inc <= {BURST_THRES,2'b0};
end    


Previously axi_start_address was only set to an initial value on a reset. We have extended the if statement to also set the address when next_frame is set and count_in_buf is zero.

Why should count_in_buf be zero? Well, the moment we hit a next_frame, we might still have a partially filled buffer. Before we reset the address we should ensure that this buffer is flushed, otherwise the last bits of data of the frame would appear in the beginning of the next frame.

Talking of flushing the buffer, we should also implement some functionality for performing this action. We do this within the case statement where we assign our state:

always @(posedge clk)
if (!reset)  
  state <= 0;
else
  case( state )
  //cater for scenario of flush
    IDLE: if ((count_in_buf > BURST_THRES) | (next_frame & count_in_buf > 0))
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_write_dst_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: if (!master_write_dst_rdy & bytes_to_send == 1)
                    state <= IDLE;    
  
  endcase


Previously we only initiated an AXI write transaction if the buffer reached a certain threshold. We have now added an extra condition to also start an AXI write transaction if we are at the end of the frame and we still have some data left in the buffer.

There is one thing remaining that we should do concerning the flushing of the buffer. Currently we have ip2bus_mst_length hardcoded to the value 20. This always informs the AXI bus that the number of bytes to send is 20. In a buffer flush scenario, however, it might be less. To cater for this scenario we need to make the following changes:

always @(negedge clk)
if (state == INIT_CMD)
  ip2bus_mst_length <= (count_in_buf > BURST_THRES) ? {BURST_THRES,2'b0} : 
    {count_in_buf[9:0],2'b0}; 


You might have noticed that I am appending two zero bits to values that are assigned to addresses and lengths. The reason is that our buffer works in terms of 32-bit words, whereas the AXI bus expects values in terms of bytes. Appending two zero bits multiplies the word count by four.
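
The length selection can be modelled in Python to see the word-to-byte conversion in action. This is a sketch only: the function name burst_length_bytes is made up, and the threshold of 5 words used in the example is an assumption that merely happens to match the 20 bytes mentioned earlier.

```python
def burst_length_bytes(count_in_buf, burst_thres):
    """Model of the value assigned to ip2bus_mst_length, in bytes.

    A full burst is sent when the buffer holds more than burst_thres
    words; otherwise (a flush) only the words left in the buffer go out.
    """
    if count_in_buf > burst_thres:
        words = burst_thres
    else:
        words = count_in_buf & 0x3FF  # count_in_buf[9:0]
    return words << 2  # appending 2'b0 multiplies by four

# Assumed threshold of 5 words, for illustration only
print(burst_length_bytes(8, 5))  # full burst
print(burst_length_bytes(3, 5))  # flush of a partially filled buffer
```

With this assumed threshold, a full burst is 20 bytes and a flush of three remaining words is 12 bytes.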

The Test Run

With all the modules hooked up, the Design Synthesised and Bitstream written, let us have a look at some results.

We will again do the programming of the FPGA within Xilinx SDK and fire off a hello world program in Debug mode as we did in a previous post where we originally developed burst_block.

We will also use the XSCT console for inspecting the contents of memory to see what the frame written to SDRAM looks like. We will use slightly different parameters, though:

mrd -bin -file /home/johan/fram_data.bin 0x200000 57368

This will dump a portion of the Zybo's SDRAM to your PC/laptop as a binary file. The start address is 0x200000, which is the start address of the framebuffer mentioned in the previous section. The number 57368 is the amount of data to transfer in terms of words. In this post we are using a word size of 32 bits, so let us do some quick calculations.

Our frame is 404 pixels wide and 284 lines high, giving a total of 114736 pixels. Within each word we can accommodate two pixels, as mentioned earlier. So, we need to divide 114736 by two, giving us 57368, which is the number we should supply our mrd command as a parameter.
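
Written out in Python, the calculation looks as follows. The constant names are illustrative only, and the byte count is added just to show how large the resulting file should be.

```python
FRAME_WIDTH = 404      # pixels per line
FRAME_HEIGHT = 284     # lines per frame
PIXELS_PER_WORD = 2    # two RGB565 pixels per 32-bit word
BYTES_PER_WORD = 4

total_pixels = FRAME_WIDTH * FRAME_HEIGHT
words_to_dump = total_pixels // PIXELS_PER_WORD
bytes_to_dump = words_to_dump * BYTES_PER_WORD

print(total_pixels, words_to_dump, bytes_to_dump)
```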

I captured a couple of frames with this command and used a custom program for converting these binary files to a format that an image viewer can open.
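
The conversion program itself is not shown in this post. Purely as an illustration, a converter along these lines could be sketched in Python. The sketch assumes the dump holds little-endian 32-bit words with the earlier pixel of each pair in the upper 16 bits (as in combined_word), and it produces a binary PPM image, which many image viewers can open. The function name dump_to_ppm is hypothetical.

```python
import struct

def dump_to_ppm(raw, width=404, height=284):
    """Convert a raw RGB565 frame dump to the bytes of a binary PPM file."""
    word_count = (width * height) // 2
    # Assumption: little-endian 32-bit words, as stored by the ARM cores
    words = struct.unpack("<%dI" % word_count, raw[:word_count * 4])
    out = bytearray(b"P6\n%d %d\n255\n" % (width, height))
    for word in words:
        # Assumption: the earlier pixel sits in the upper halfword
        for pixel in (word >> 16, word & 0xFFFF):
            r5 = (pixel >> 11) & 0x1F
            g6 = (pixel >> 5) & 0x3F
            b5 = pixel & 0x1F
            # Expand each channel back to 8 bits by shifting left
            out += bytes((r5 << 3, g6 << 2, b5 << 3))
    return bytes(out)
```

A dump could then be converted by reading the file produced by mrd in binary mode, passing its contents to dump_to_ppm and writing the result to a file ending in .ppm.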

The results are a bit strange:




The frames faintly resemble the C64 Welcome screen, although distorted.

The distortion still requires a bit of investigation and I will report back in the next post.

In Summary

In this post we have implemented the functionality for writing the frame output of our VIC-II module to SDRAM.

The frames produced by running the design on the FPGA itself faintly resemble the C64 Welcome screen, with some distortion.

In the next post I will report back on whether I could isolate the cause of this distortion.

Till Next time!