Friday, 13 April 2018

Writing video frames to SDRAM

Foreword

In the previous post I described a glitch I encountered when using the Block RAM on the Zybo in dual-port mode.

With this glitch the VIC-II couldn't read the contents of RAM from the assigned Block RAM port.

In the end the issue seemed to be related to the fact that the VIC-II only uses the first 16KB of memory. We solved this issue by first reading the full range of RAM a couple of times from the assigned VIC-II RAM port upon startup.

In this post we will implement functionality to write the frame data produced by the VIC-II to SDRAM. We will then download this data to a PC, so we can verify that the frames the VIC-II produces on the physical FPGA are indeed correct.

The planned Approach

You might recall that in a previous post we developed a Verilog module called burst_block with which we managed to write data from the FPGA to SDRAM.

In this post we will also use this module to capture video data from our VIC-II module to SDRAM. There are, however, a couple of modifications we need to make to our design before using burst_block for this purpose.

The first required change is due to different clock domains. The burst_block uses the AXI clock, which runs at 100MHz and which we cannot really change. Our VIC-II core, however, outputs pixels at a rate of 8MHz. We therefore need to put in some effort to accommodate these different clock domains in order to avoid setup and hold violations.

The second required change is due to different data widths. Each pixel output of the VIC-II has a data width of 24 bits whereas the burst_block expects data words of 32 bits. This is a waste of 8 bits per pixel!

We can definitely improve on the differing data width situation. Firstly, 24 bits per pixel from the VIC-II is a bit of overkill, considering that the VIC-II only has 16 distinct colors.

We can truncate each pixel from the VIC-II to 16-bits using the RGB565 format. With the RGB565 format we have 5 bits for Red, 6 bits for Green and another 5 bits for Blue.

With the pixel output of the VIC-II truncated to 16 bits we can fit two pixels within the 32-bit word input to the burst_block.

Concatenating two pixels into a Word

Let us start with the requirement of squeezing two pixels into a word that goes to our burst_block.

Firstly we truncate the output pixel of the VIC-II module to 16 bits:

wire [15:0] pixel_16_bit;
...
    assign pixel_16_bit = {out_rgb[23:19],out_rgb[15:10],out_rgb[7:3]};
...

The next question is how do we concatenate two of these pixels into a single 32-bit word? We do this by means of a delay element:

reg [15:0] pixel_16_bit_delay;
...
    always @(posedge clk)      
      pixel_16_bit_delay <= pixel_16_bit;
...

The clock source should be the same as the one that drives the pixel clock of the VIC-II, which is 8MHz.

The combined 32-bit word can be formed by just concatenating the above:

wire [31:0] combined_word;
...
   assign combined_word = {pixel_16_bit_delay,pixel_16_bit};
...

Obviously the write to burst_block should only be triggered on every second pixel clock cycle, since each word holds two pixels.
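
To see how the three fragments above fit together, here is a minimal sketch that collects them into a single module. The module name pixel_packer and its port list are assumptions made for illustration; only pixel_16_bit, pixel_16_bit_delay, combined_word and out_rgb come from the snippets above.

```verilog
// Hypothetical wrapper module: the name and port list are assumptions.
module pixel_packer(
  input  wire        clk,           // 8MHz VIC-II pixel clock
  input  wire [23:0] out_rgb,       // 24-bit pixel from the VIC-II
  output wire [31:0] combined_word  // two RGB565 pixels side by side
);

  wire [15:0] pixel_16_bit;
  reg  [15:0] pixel_16_bit_delay = 0;

  // Keep the top 5 bits of red, 6 bits of green and 5 bits of blue.
  assign pixel_16_bit = {out_rgb[23:19], out_rgb[15:10], out_rgb[7:3]};

  // Hold the previous pixel for one pixel clock cycle.
  always @(posedge clk)
    pixel_16_bit_delay <= pixel_16_bit;

  // Previous pixel in the upper half, current pixel in the lower half.
  assign combined_word = {pixel_16_bit_delay, pixel_16_bit};

endmodule
```

Note the ordering within the word: the earlier pixel ends up in bits 31 to 16 and the later pixel in bits 15 to 0, which matters when decoding the captured data on the PC.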

Handling the Cross Clock Domains

As mentioned earlier, our burst_block is clocked at 100MHz and the VIC-II at 8MHz. These are two different clock domains, and crossing between them needs special attention.

Before we decide how to deal with these two clock domains, let us familiarise ourselves again with the ports of the burst_block module:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs

    );


The key port to look at is write. On a rising edge of the AXI clock, if write is 1, the contents of write_data will be written to an internal buffer of burst_block, queued for writing to SDRAM.

Ideally we should try to keep the write wire high for one AXI clock pulse somewhere in the middle of a clock pulse of the VIC-II pixel clock.

Let us start by first determining the centre of a VIC-II clock pulse in terms of 100MHz clock cycles.

We can fit 100/8=12.5 100MHz clock cycles into a single VIC-II clock cycle.

The length of a single VIC-II clock pulse is half of this, i.e. 6.25 100MHz clock cycles. The centre of a VIC-II clock pulse is therefore roughly 3 100MHz clock cycles after the clock edge.

With this information we write the following:

...
reg target_logic_level = 0;
reg do_sample;
...
    always @(posedge axi_clk_in)
    if (cont_bits == 3)
      target_logic_level <= ~target_logic_level;
...
    always @(posedge axi_clk_in)
    if (clk == target_logic_level)
      cont_bits <= cont_bits + 1;
    else
      cont_bits <= 0;
...
    always @(negedge axi_clk_in)
        do_sample <= (cont_bits == 3) & target_logic_level;
...

target_logic_level keeps track of the logic level we currently expect from the 8MHz clock signal. When we encounter 3 consecutive 100MHz clock cycles at this logic level, we know we are more or less in the centre of the clock pulse.

do_sample makes use of this info and is a very good candidate signal to feed to the write port of the burst_block. There are, however, a couple of extra conditions we need to incorporate, as shown below:

reg pixel_sample_offset = 0;

    always @(posedge clk)
      if (!blank_signal)
        pixel_sample_offset <= ~pixel_sample_offset;

assign write_pin = do_sample ? !blank_signal & pixel_sample_offset : 0;

pixel_sample_offset ensures that we only trigger a write every second pixel as mentioned earlier.

blank_signal also plays an important role since we don't write pixels during horizontal and vertical blanking.
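
For reference, the sampling fragments in this section can be gathered into one self-contained module, sketched below. The module name write_strobe, the port list, the 3-bit width of cont_bits and the initial values of do_sample and cont_bits are all assumptions on my part, since the snippets above don't show those declarations. Three bits are enough here because cont_bits never counts past 4 before it is reset.

```verilog
// Hypothetical module: name, ports and the cont_bits width are assumptions.
module write_strobe(
  input  wire axi_clk_in,    // 100MHz AXI clock
  input  wire clk,           // 8MHz VIC-II pixel clock
  input  wire blank_signal,  // high during horizontal/vertical blanking
  output wire write_pin      // goes to the write port of burst_block
);

  reg       target_logic_level = 0;
  reg       do_sample = 0;
  reg [2:0] cont_bits = 0;            // assumed width
  reg       pixel_sample_offset = 0;

  // Flip the expected level once we have seen it for 3 AXI cycles.
  always @(posedge axi_clk_in)
    if (cont_bits == 3)
      target_logic_level <= ~target_logic_level;

  // Count consecutive AXI cycles at the expected pixel clock level.
  always @(posedge axi_clk_in)
    if (clk == target_logic_level)
      cont_bits <= cont_bits + 1;
    else
      cont_bits <= 0;

  // One AXI cycle wide pulse near the centre of the high pixel clock pulse.
  always @(negedge axi_clk_in)
    do_sample <= (cont_bits == 3) & target_logic_level;

  // Toggle on every visible pixel, so that only every second one is written.
  always @(posedge clk)
    if (!blank_signal)
      pixel_sample_offset <= ~pixel_sample_offset;

  assign write_pin = do_sample ? !blank_signal & pixel_sample_offset : 0;

endmodule
```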

Catering for Frame Synchronisation

In its current state, burst_block only accepts data, and we have no control over the address the data is written to. This poses a problem when we want to write a new frame, since we need to set the address back to the beginning of the frame buffer.

In this section we will cater for this scenario by adding an extra port to burst_block, called next_frame, for receiving a frame sync signal:

module burst_block(
  input wire clk,
  input wire reset,
  input wire write,
  input wire next_frame,
  input wire [31:0] write_data, 
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  output wire [31:0] ip2bus_mstwr_d,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs

    );


Next, we make some modifications to the part where we set the axi_start_address:

always @(negedge clk)
if (!reset | (next_frame & count_in_buf == 0))
begin
  axi_start_address <= 32'h200000;
  axi_data_inc <= 0;
end
else if (state == INIT_CMD)
begin
  axi_start_address <= axi_start_address + axi_data_inc;
  axi_data_inc <= {BURST_THRES,2'b0};
end    


Previously axi_start_address was only set to an initial value on a reset. We have extended the if statement to also set the address when next_frame is set and count_in_buf is zero.

Why should count_in_buf be zero? Well, the moment we hit a next_frame, we might still have a partially filled buffer. Before we reset the address we should ensure that this buffer is flushed, otherwise the last bits of data of the frame would appear in the beginning of the next frame.

Talking of flushing the buffer, we should also implement some functionality for performing this action. We do this within the case statement where we assign our state:

always @(posedge clk)
if (!reset)  
  state <= 0;
else
  case( state )
  //cater for scenario of flush
    IDLE: if ((count_in_buf > BURST_THRES) | (next_frame & count_in_buf > 0))
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_write_dst_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: if (!master_write_dst_rdy & bytes_to_send == 1)
                    state <= IDLE;    
  
  endcase


Previously we only initiated an AXI write transaction if the buffer reached a certain threshold. We have now added an extra condition to also start an AXI write transaction if we are at the end of the frame and still have some data left in the buffer.

There is one thing remaining that we should do concerning flushing the buffer. Currently ip2bus_mst_length is hardcoded to the value 20, which always informs the AXI bus that 20 bytes will be sent. In a buffer flush scenario, however, it might be less. To cater for this scenario we need to make the following changes:

always @(negedge clk)
if (state == INIT_CMD)
  ip2bus_mst_length <= (count_in_buf > BURST_THRES) ? {BURST_THRES,2'b0} : 
    {count_in_buf[9:0],2'b0}; 


You might have realised that I am appending two zero bits to values that are assigned to addresses and lengths. The reason is that our buffer works in terms of 32-bit words, whereas the AXI bus expects values in terms of bytes.
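
To see the word-to-byte conversion in isolation, here is a tiny simulation-only sketch. The module name and the example count of 5 words are made up for illustration; appending two zero bits is the same as shifting left by two, i.e. multiplying by 4.

```verilog
// Simulation-only demo; module name and example value are hypothetical.
module words_to_bytes_demo;

  reg  [9:0]  words = 10'd5;          // example word count
  wire [11:0] bytes = {words, 2'b0};  // words * 4

  initial
    #1 $display("%0d words = %0d bytes", words, bytes);  // prints "5 words = 20 bytes"

endmodule
```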

The Test Run

With all the modules hooked up, the design synthesised and the bitstream written, let us have a look at some results.

We will again do the programming of the FPGA within Xilinx SDK and fire off a hello world program in Debug mode as we did in a previous post where we originally developed burst_block.

We will also use the XSCT console for inspecting the contents of memory to see what the frame written to SDRAM looks like. We will use slightly different parameters, though:

mrd -bin -file /home/johan/fram_data.bin 0x200000 57368

This will dump a portion of the Zybo's SDRAM to your PC/Laptop as a binary file. The start address is 0x200000, which is the start address of the frame buffer mentioned in the previous section. The number 57368 is the amount of data to transfer in terms of words. In this post we are using a word size of 32 bits, so let us do some quick calculations.

Our frame is 404 pixels wide and 284 lines high, giving a total of 114736 pixels. Within each word we can accommodate two pixels, as mentioned earlier. So, we need to divide 114736 by two, giving us 57368, which is the number we should supply our mrd command as a parameter.
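
The same calculation can be written down as a few constants, which also gives the size of the frame in bytes and the last byte address of the frame buffer. This is only a sketch: the module and parameter names are my own and don't appear anywhere in the design.

```verilog
// Simulation-only demo; all names here are hypothetical.
module frame_size_demo;

  localparam FRAME_WIDTH  = 404;
  localparam FRAME_HEIGHT = 284;
  localparam FRAME_BASE   = 32'h200000;                   // start of frame buffer
  localparam PIXELS       = FRAME_WIDTH * FRAME_HEIGHT;   // 114736
  localparam WORDS        = PIXELS / 2;                   // 57368, two pixels per word
  localparam BYTES        = WORDS * 4;                    // 229472
  localparam FRAME_END    = FRAME_BASE + BYTES - 1;       // 32'h23805F

  initial
    $display("words=%0d bytes=%0d last address=%h", WORDS, BYTES, FRAME_END);

endmodule
```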

I captured a couple of frames with this command and used a custom program for converting these binary files to a format that an image viewer can open.

The results are a bit strange:




The frames faintly resemble the C64 Welcome screen, although distorted.

The distortion still requires a bit of investigation and I will report back in the next post.

In Summary

In this post we have implemented the functionality for writing the frame output of our VIC-II module to SDRAM.

The frames produced by running the design on the FPGA itself faintly resemble the C64 Welcome screen, with some distortion.

In the next post I will report back on whether I could isolate the cause of this distortion.

Till Next time!

2 comments:

  1. Hi,


    I'm trying to wrap my head around some of the design decisions here.

    Why did you decide to drive the VGA output from SDRAM?

    I mean, you could have just driven the VGA output straight from BRAM, right? From what I can tell, that's what other FPGA c64 implementations do. MEGA65 for example doesn't use DDR on the Nexys4 board at all. All memory is simulated with BRAM. The advantage, I guess, is that it simplifies design since you don't cross clock domains and don't need the AXI interface, but the drawback is that you need a lot of BRAM on the FPGA chip for the framebuffer. You could probably not use the framebuffer at all and drive the display straight from the VIC-II, the way the real c64 does, but then you'd need a 15kHz-capable display. Am I getting this right?

    One advantage of doing it the way you did (that I can think of) is that you're sort of decoupling the display logic from the rest of the c64 system. And if you wanted to use, for example, 720p HDMI output then you'd need a huge framebuffer in BRAM. This way the framebuffer size is not limited by the amount of BRAM available.

    Does this make sense?

  2. The points you mentioned summarise the main thoughts I had that made me decide to use SDRAM for buffering frames.

    The main reason, though, came up when I started to use an LCD monitor. This type of monitor can probably handle a signal of the frequency produced by a VIC-II, but makes all kinds of funny decisions.

    An LCD monitor will try to fit the image of the signal to the whole screen, causing the image to look stretched on the screen (e.g. C64 video frames are more square, whereas LCD monitors these days are wider).

    Also, LCD monitors don't handle scaling very well, so the resulting image typically looks blurry.

    In the end my thought was to produce a signal for the monitor at its native resolution, which needs a pixel clock of 85MHz.

    This difference between the pixel clocks of a VIC-II and an LCD monitor just started to make the design complex, and then I just resorted to SDRAM buffering.
