C64 on an FPGA: 2019

Friday, 27 December 2019

Creating and Running UBoot

Foreword

In the previous post we started our journey on getting Linux to run on the Zybo board.

We managed to create and run a First Stage Bootloader (FSBL).

In this post we will be building and running UBoot.

UBoot is an intermediate stage bootloader for booting Linux on ARM based devices.

Creating an ARM Cross Compiler

To compile UBoot/Linux for the Zynq, one needs a cross compiler for compiling source into ARM machine code.

Linux distros like Ubuntu provide these cross compilers as packages that you can download and install. These packages, however, are sometimes a couple of versions behind and might not be sufficient for building ARM based packages.

In my own experience I have found it better to build the Cross Compiler toolchain yourself. Here is a very handy resource for creating your own toolchain:

https://preshing.com/20141119/how-to-build-a-gcc-cross-compiler/

This set of instructions explains how to build GCC version 4.9.2. This version of GCC is a bit outdated for our purposes, so for some of the components we need to download newer versions, which are as follows:

GCC version 7.5.0
Binutils version 2.25
isl version 0.18

When following the instructions from the above link, there are some instructions that need to be tweaked so we can compile for the ARM architecture. These are as follows:

--target=arm-linux-gnueabihf as well as --host=arm-linux-gnueabihf
ARCH=arm
The folder under /opt/cross should be arm-linux-gnueabihf
Also, in step 4 (i.e. Standard C Library Headers and Startup Files) the gcc cross compiler is invoked. The filename for this should be arm-linux-gnueabihf-gcc.

Following the steps from the above resource, together with the suggested amendments for the ARM architecture, will yield the Cross compiler within the folder /opt/cross.

If you want to use this Cross compiler, you need to ensure that it is within the system path, which you can define as follows:

export PATH=/opt/cross/bin:$PATH
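Before kicking off a long build, it can be handy to confirm that the cross compiler is really visible on the path. The following is a minimal Python sketch of such a check. The function name find_cross_compiler is my own invention for illustration; it only relies on shutil.which from the Python standard library.

```python
import shutil


def find_cross_compiler(prefix="arm-linux-gnueabihf-", tool="gcc"):
    """Return the full path of the cross tool, or None if it is not on PATH."""
    return shutil.which(prefix + tool)


if __name__ == "__main__":
    path = find_cross_compiler()
    if path is None:
        # Assumption: the toolchain was installed under /opt/cross as described above.
        print("arm-linux-gnueabihf-gcc not found; is /opt/cross/bin on your PATH?")
    else:
        print("Cross compiler found at", path)
```

If the tool is not found, re-check the export line above in the same terminal session in which you plan to run the build.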

Building UBoot

To build UBoot, we need to follow the instructions on this resource: https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841973/Build+U-Boot

Let us follow these instructions step by step. Firstly, we need to get the source code from Github, via a terminal window:

git clone http://github.com/Xilinx/u-boot-xlnx.git

This is basically the uboot source with some Xilinx customisations.

Now, change into the cloned directory.

Next we need to set some environment variables within our terminal session:

export CROSS_COMPILE=arm-linux-gnueabihf-

export ARCH=arm

We can now build the u-boot image with the following:

make distclean
make zynq_zybo_defconfig
make

After the build process, there should be an executable called u-boot.elf.

Testing U-boot

Let us test the U-boot executable.

As in the previous post, with the Zybo board connected to the PC and powered up, start a screen session in a terminal window:

sudo screen /dev/ttyUSB1 115200

In another terminal window, start an XSDB session. Also, as in the previous post, issue XSDB commands for starting the FSBL and then stopping it. At this point, the DDR should have been initialised and we should be able to load the u-boot image into it:

dow ~/u-boot-xlnx/u-boot.elf

If you try these steps just after your PC/laptop has come out of hibernation, you might be presented with the following error:

Memory write error at 0x4000000. Cannot access DDR: the controller is held in reset

If this is the case, just switch the Zybo board off for a couple of seconds and then on again. After switching the Zybo board on, re-establish a screen session and follow the steps of downloading/starting/stopping the FSBL, and then download the U-Boot image.

Issuing the command con will start the U-boot image. If everything went well, the output on your screen session should look like the following:

If you just miss the opportunity to stop the autoboot, you will see some output indicating that U-Boot is trying to boot images from various devices.

In Summary

In this post we managed to build and run U-Boot. In the next post we will be building Linux and running it on the Zybo board.

Till next time!

Tuesday, 24 December 2019

The boot process of the ZYNQ

Foreword

In the previous post we scaled up the video frames produced by the VIC-II module so that they can fill the whole screen.

At this point our C64 module has enough functionality implemented so that we can play a game on it. However, every time we want to play a game with it, the Zybo board needs to be attached to a PC, and we need to issue a couple of commands within the Xilinx SDK.

Wouldn't it be nice if the Zybo could just start up on its own and you don't need to connect it to a PC?

This will be the purpose of the next series of posts: To be able to boot our C64 system on the Zybo board without any hand holding.

To approach this goal, we will work towards being able to run Linux on the Zybo Board. If we are within the Linux ecosystem on the Zybo board, we have access to many device drivers, making it easy for us, for instance, to load a tape image from a Micro SDCard.

In this post we will start by getting an overview on how the boot process works on the Zybo Board.

Also, in this post I will assume the Zybo board will be connected to a PC running Ubuntu OS or similar.

The Boot Process

When the Zynq is powered up and ready to start the process of booting into an OS, it is faced with a challenging initial condition: The DRAM is disabled.

To get past this hurdle, the Zynq contains onchip memory (OCM). The OCM contains 128KB BootROM code and 256KB SRAM.

The BootROM contains just enough code to load a boot image from a handful of devices, like an SDCard and QSPI, into SRAM and start execution of it.

256KB of SRAM for loading a boot image is not a lot of memory, so on Zynq devices, the boot process is split into a number of stages:

Stage 0: The BootROM starts executing and loads the First Stage Bootloader into SRAM
Stage 1: The First Stage Bootloader starts to execute from SRAM. It is responsible for enabling DRAM. Amongst other things it also initialises a number of crucial on chip peripherals as well as a number of clocks on the Zynq. The bitstream and the user application are also read into memory
Stage 2: At this stage the DRAM is fully operational and control is handed to the user application. In a Linux ecosystem, we will start with running UBoot, which will eventually load and start a Linux Image.

This is quite a cumbersome process. However, I can understand why this process is necessary. To initialise DRAM is quite a complicated process and one can easily introduce a bug when writing the software routine for this process. Should such a bug exist in BootROM, this could render the Zynq chip useless. Moving DRAM initialisation to software that one retrieves from a SDCard greatly reduces this risk.

Of course SRAM is quite an expensive resource. For this reason the Zynq only contains 256KB of it, and we need three stages of booting instead of just 2.

In this post we will focus on the first stage bootloader (FSBL) and in future posts the Stage 2 bootloader.

Creating a First Stage Bootloader

Let us start by having an overview of the process for creating an FSBL. There is a very nice block diagram of the process here:

The Xilinx SDK gets a Hardware handoff file from Vivado and generates the FSBL. On the diagram the FSBL gets aggregated with other files like U-boot and uImage to produce a file called BOOT.BIN.

In this post, however, I will show a method where we will directly invoke the FSBL without having to generate a BOOT.BIN first.

Now, in Vivado let us create a Block Design containing only a Zynq block. Also, let us perform all the suggested wiring:

This block diagram looks very simplistic. However, when you open this block, you will see that it contains very important settings, like DDR settings:

It is these types of settings that Vivado will hand off to the Xilinx SDK and that will be incorporated into the FSBL.

With this block design created, run synthesis, generate the bitstream and do an export.

Then, from Vivado launch the Xilinx SDK.

In the File menu, select new application project and provide a name for the application project.

Click next and in the left panel select Zynq FSBL:

When you click finish, a new project will be created and visible in the project panel.

A quick way to test our FSBL, would be to write a message to the console when it runs.

To do this, expand the src folder and open the file main.c. Scroll down to the main method. Just after the call to RegisterHandlers, add the following line:

 xil_printf("Hello World");

Click the down arrow next to the hammer icon and select the Release configuration. Now build the project.

Now, within the sdk folder, within your project folder and under the folder Release, there should be a *.elf file. This is the First stage Bootloader that we will use in the next section.

Testing the First Stage Bootloader

Let us test the FSBL we have created in the previous section.

Ensure the Zybo board is configured to boot into JTAG mode. Hook it up to a PC and power it up.

For this exercise we will need two terminal windows running on the Ubuntu PC. On the first terminal, issue the following command:

sudo screen /dev/ttyUSB1 115200

This is the terminal session where we expect our Hello World message to appear when our FSBL runs. The device file might be different in your case, for example /dev/ttyUSB2.

In the other terminal change to the bin folder of your Xilinx SDK installation and issue the following command:

./xsdb

You will be presented with the XSDB console. At this console, issue the following command:

connect

If you now issue the targets command, you will see a list of targets you can connect to:

xsdb% targets                                                                   
  1  APU
     2  ARM Cortex-A9 MPCore #0 (Running)
     3  ARM Cortex-A9 MPCore #1 (Running)
  4  xc7z010

Our goal is to run our FSBL on the first core, so let us select this core and stop it:

xsdb% target 2                                                                  
xsdb% stop                                                                      
Info: ARM Cortex-A9 MPCore #0 (target 2) Stopped at 0xffffff28 (Suspended)
xsdb% targets                                                                   
  1  APU
     2* ARM Cortex-A9 MPCore #0 (Suspended)
     3  ARM Cortex-A9 MPCore #1 (Running)
  4  xc7z010

We will now load the FSBL into the Zybo board from the location where it was generated. In my case the console output will look as follows:

xsdb% dow ~/fsbl-test/fsbl-test.sdk/fsbl_test/Release/fsbl_test.elf             
Downloading Program -- /home/johan/fsbl-test/fsbl-test.sdk/fsbl_test/Release/fsbl_test.elf
 section, .text: 0x00000000 - 0x0000d38b
 section, .handoff: 0x0000d38c - 0x0000d3d7
 section, .init: 0x0000d3d8 - 0x0000d3ef
 section, .fini: 0x0000d3f0 - 0x0000d407
 section, .rodata: 0x0000d408 - 0x0000d75f
 section, .data: 0x0000d760 - 0x0001054f
 section, .eh_frame: 0x00010550 - 0x00010553
 section, .mmu_tbl: 0x00014000 - 0x00017fff
 section, .init_array: 0x00018000 - 0x00018003
 section, .fini_array: 0x00018004 - 0x00018007
 section, .rsa_ac: 0x00018008 - 0x0001903f
 section, .bss: 0x00019040 - 0x0001ae6f
 section, .heap: 0x0001ae70 - 0x0001ce6f
 section, .stack: 0xffff0000 - 0xffffd3ff
100%    0MB   0.5MB/s  00:00                                                    
Setting PC to Program Start Address 0x00000000
Successfully downloaded /home/johan/fsbl-test/fsbl-test.sdk/fsbl_test/Release/fsbl_test.elf

The FSBL will be loaded starting at address 0, which is where the SRAM of the OCM resides.
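As a quick sanity check on the download log, we can add up the section sizes and compare the total with the 256KB of SRAM mentioned earlier. The following Python sketch does just that. The section addresses are copied from the log above; the helper names are my own, and the sum deliberately ignores the unused gaps between sections, so treat it as a rough estimate only.

```python
# Section ranges copied from the XSDB download log (end addresses are inclusive).
SECTIONS = {
    ".text": (0x00000000, 0x0000D38B),
    ".handoff": (0x0000D38C, 0x0000D3D7),
    ".init": (0x0000D3D8, 0x0000D3EF),
    ".fini": (0x0000D3F0, 0x0000D407),
    ".rodata": (0x0000D408, 0x0000D75F),
    ".data": (0x0000D760, 0x0001054F),
    ".eh_frame": (0x00010550, 0x00010553),
    ".mmu_tbl": (0x00014000, 0x00017FFF),
    ".init_array": (0x00018000, 0x00018003),
    ".fini_array": (0x00018004, 0x00018007),
    ".rsa_ac": (0x00018008, 0x0001903F),
    ".bss": (0x00019040, 0x0001AE6F),
    ".heap": (0x0001AE70, 0x0001CE6F),
    ".stack": (0xFFFF0000, 0xFFFFD3FF),
}

OCM_SRAM_BYTES = 256 * 1024  # SRAM size as stated earlier in this post


def section_size(start, end):
    """Size in bytes of a section with an inclusive end address."""
    return end - start + 1


def total_size(sections):
    """Sum of all section sizes, ignoring alignment gaps between sections."""
    return sum(section_size(start, end) for start, end in sections.values())


if __name__ == "__main__":
    used = total_size(SECTIONS)
    print("FSBL sections use", used, "of", OCM_SRAM_BYTES, "bytes")
```

With the sections from my build, the total stays comfortably below the 256KB limit.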

We can now resume execution of core #0 with the command con. We can now see the output on the other Terminal Window:

Our first stage bootloader worked!

In Summary

In this post we have created and tested a First Stage Bootloader.

In the next post we will continue our journey in getting Linux to run on the Zybo Board.

In particular, we will be getting UBoot to run on the Zybo Board. UBoot is a component you will find in many ARM based systems that assists in booting Linux.

Till next time!

Friday, 20 December 2019

Scaling up the display: Part 2

Foreword

In the previous post we started to investigate the possibility of scaling up the images produced by the VIC-II module, so that it can fill the whole display.

For this purpose we used David Kronstein's Video scaler core. So, in the previous post we tested this core with a test bench to see what the image produced by this core looks like.

I was quite satisfied with the results produced by Kronstein's core, so in this post we will integrate Kronstein's core within our C64 FPGA design.

Overview

The following diagram gives an overview of what we want to accomplish in this post:

The flow of the diagram starts off more or less the same as our current design which displays the VIC-II frames on a VGA screen.

We retrieve pixel data from SDRAM via AXI and buffer it. As this data consists of 32-bit words, each containing two pixels, we need to split each word into individual pixels.

We also buffer these individual pixels into a FIFO buffer. This FIFO buffer has an additional function of moving data from the AXI clock domain (100MHz) to the VGA clock domain (84MHz).

In our previous design we directly output pixels from this FIFO to the VGA display.

In this post, however, we introduce two new blocks, the Video Scaler and a FIFO for buffering the effect of potential lag from the Video Scaler.

The trickiest scenario in this design is when we reset all the components in preparation for the next frame. With the video scaler being reset for the next frame, it will immediately start requesting data from the asynchronous FIFO when it becomes available. This can potentially lead to some empty conditions in our asynchronous FIFO.

In practice, however, I found that our Asynchronous FIFO doesn't handle these intermittent empty states very well.

It is far better on frame reset to rather give the Asynchronous FIFO time to fill up a bit before starting to read from it. In this way we avoid the asynchronous buffer running empty. We will cover this in a bit more detail in a coming section.

Supplying input data to the Video Scaler

Let us connect the necessary ports so that we can supply input data to the Video Scaler.

First, let us cater for the scenario where we need to reset all the blocks upon a new frame.

The trigger_restart_state register indicates when we are about to start with a new frame. However, this register is clocked within the AXI clock domain, but we need it within the VGA domain, so let us create a multi-stage flip-flop synchroniser to take care of this scenario:

(* ASYNC_REG = "TRUE" *) reg state_1, state_2, state_3, state_4, state_5;

always @(posedge clk)
begin
  state_1 <= trigger_restart_state == RESTART_STATE_RESTART;
  state_2 <= state_1;
  state_3 <= state_2;
  state_4 <= state_3;
  state_5 <= state_4;
end

streamScaler #(
//---------------------------Parameters----------------------------------------
.DATA_WIDTH(8),  //Width of input/output data
.CHANNELS(3)  //Number of channels of DATA_WIDTH, for color images
//---------------------Non-user-definable parameters----------------------------
)
  myscaler
(
...
.start(state_5),
...
);

We receive pixel data from the asynchronous FIFO relaying 16-bit pixel values from the AXI domain to the VGA domain. As mentioned in the previous post, the Video scaler expects 24-bit samples, so let us do a conversion:

streamScaler #(
//---------------------------Parameters----------------------------------------
.DATA_WIDTH(8),  //Width of input/output data
.CHANNELS(3)  //Number of channels of DATA_WIDTH, for color images
//---------------------Non-user-definable parameters----------------------------
)
  myscaler
(
...
.dIn({out_pixel_buffer[15:11],3'b0,out_pixel_buffer[10:5],2'b0,out_pixel_buffer[4:0],3'b0}),
...
);
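To convince ourselves that the bit concatenation above lands every colour channel in the right place, here is a small Python model of the same conversion. It is only a software sketch for checking values by hand; the function name rgb565_to_rgb888 is my own and does not appear in the design.

```python
def rgb565_to_rgb888(pixel):
    """Expand a 16-bit RGB565 pixel to 24 bits by padding each channel with zeros.

    Mirrors the concatenation {R[4:0],3'b0, G[5:0],2'b0, B[4:0],3'b0}.
    """
    red5 = (pixel >> 11) & 0x1F    # bits 15:11
    green6 = (pixel >> 5) & 0x3F   # bits 10:5
    blue5 = pixel & 0x1F           # bits 4:0
    # Red ends up in bits 23:19, green in bits 15:10 and blue in bits 7:3.
    return (red5 << 19) | (green6 << 10) | (blue5 << 3)


if __name__ == "__main__":
    for sample in (0xF800, 0x07E0, 0x001F, 0xFFFF):
        print(hex(sample), "->", hex(rgb565_to_rgb888(sample)))
```

Note that with zero padding a full intensity channel does not quite reach 255, which is the price we pay for keeping the conversion this simple.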

The next port to focus on is the port on the video scaler signalling that the data is valid. For this, let's start off simple, saying that the data is valid if the asynchronous buffer is not empty and it is not the start of the frame:

assign data_valid_in = !state_5 && !async_empty;

You might remember that in the previous section I mentioned that it is preferable to give the asynchronous buffer some time to fill up before reading from it. It is indeed the data_valid_in port where we need to cater for this:

assign data_valid_in = !state_5 && !async_empty && scalar_init;

always @(posedge clk)
begin
  if (state_5)
    scalar_init <= 0;    
  else if (!async_empty && (count_till_read == 60))
    scalar_init <= 1;

  if (state_5)
    count_till_read <= 0;
  else if ((count_till_read < 60) && !async_empty)
    count_till_read <= count_till_read + 1;
end

So, we hold back asserting the data_valid_in port till our async buffer has been non-empty for about 60 clock cycles.
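To get a feel for the timing of this hold-off, the following Python sketch models the two registers cycle by cycle. It is a rough software model only: it takes one (state_5, async_empty) pair per VGA clock and ignores all real clock-domain effects. The function name simulate_prefill is my own.

```python
def simulate_prefill(cycles, threshold=60):
    """Model scalar_init and count_till_read, one step per VGA clock.

    cycles: iterable of (state_5, async_empty) booleans.
    Returns the value of scalar_init after each clock edge.
    """
    scalar_init = 0
    count_till_read = 0
    trace = []
    for state_5, async_empty in cycles:
        # Non-blocking assignments: work out both next values from the
        # current register contents before updating either of them.
        next_init, next_count = scalar_init, count_till_read
        if state_5:
            next_init = 0
        elif not async_empty and count_till_read == threshold:
            next_init = 1
        if state_5:
            next_count = 0
        elif count_till_read < threshold and not async_empty:
            next_count = count_till_read + 1
        scalar_init, count_till_read = next_init, next_count
        trace.append(scalar_init)
    return trace


if __name__ == "__main__":
    # One reset cycle, followed by a FIFO that is never empty.
    trace = simulate_prefill([(True, True)] + [(False, False)] * 70)
    print("scalar_init first goes high at cycle index", trace.index(1))
```

Cycles in which the FIFO is empty do not advance the counter, so in practice the hold-off can last longer than the threshold suggests.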

Next, we need to connect the read port on the async fifo:

aFifo
  #(.DATA_WIDTH(16))
  my_fifo
     
    (
...
           .ReadEn_in(nextDIn & data_valid_in),
...
     );

You might recall from quite a number of posts that we have enabled reading from this port when the vga raster was within the visible range. This port is now controlled by the video_scaler (nextDIn). We hold the read back by means of data_valid_in, giving the aFifo a chance to fill up.

Buffering the output of the Video Scaler

As mentioned in the Overview section, we need to buffer the output of the Video scaler.

So, let us start by defining another FIFO instance:

fifo #(
  .DATA_WIDTH(16),
  .ADDRESS_WIDTH(4)
)

   data_buf_vga (
            .clk(clk), 
            .reset(state_5),
        );

This buffer has a capacity of 16 elements of 16 bits each. Since the Video Scaler outputs samples of 24 bits, we need to connect the write_data port of the FIFO as follows:

fifo #(
  .DATA_WIDTH(16),
  .ADDRESS_WIDTH(4)
)

   data_buf_vga (
...
            .write_data({data_out[23:19],data_out[15:10],data_out[7:3]}),
...
        );
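The packing in the write_data port can be checked with a small Python model as well. This sketch simply keeps the top 5, 6 and 5 bits of the red, green and blue channels, as the concatenation above does. The function name rgb888_to_rgb565 is my own.

```python
def rgb888_to_rgb565(sample):
    """Pack a 24-bit sample into RGB565 by keeping the top bits of each channel.

    Mirrors the concatenation {data_out[23:19], data_out[15:10], data_out[7:3]}.
    """
    red5 = (sample >> 19) & 0x1F    # bits 23:19
    green6 = (sample >> 10) & 0x3F  # bits 15:10
    blue5 = (sample >> 3) & 0x1F    # bits 7:3
    return (red5 << 11) | (green6 << 5) | blue5


if __name__ == "__main__":
    for sample in (0xFF0000, 0x00FF00, 0x0000FF, 0xFFFFFF):
        print(hex(sample), "->", hex(rgb888_to_rgb565(sample)))
```

The lower bits of each channel are simply dropped, so any fine gradation added by the scaler's interpolation below that precision is lost again at this point.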

Now, the nextDout port of the Video Scaler needs to be in sync with the write port of the FIFO:

...
fifo #(
  .DATA_WIDTH(16),
  .ADDRESS_WIDTH(4)
)

   data_buf_vga (
...
            .write((vert_pos > 10)  & (vert_pos < 760) & data_valid_out & !full_vga_fifo),
            .full(full_vga_fifo),
...
        );
...
streamScaler #(
//---------------------------Parameters----------------------------------------
.DATA_WIDTH(8),  //Width of input/output data
.CHANNELS(3)  //Number of channels of DATA_WIDTH, for color images
//---------------------Non-user-definable parameters----------------------------
)
  myscaler
(
...
.dOutValid(data_valid_out_debug),
.nextDout((vert_pos > 10)  & (vert_pos < 760) & !full_vga_fifo),
...
);
...

The actual idea is to start streaming data out to the screen at line number 20, so we start pre-filling the buffer at line 10.

Streaming the data out to the screen

In the previous section we buffered data from the Video scaler. In this section we will output the buffered data to the VGA port.

As the first step, let us connect all the read ports:

fifo #(
  .DATA_WIDTH(16),
  .ADDRESS_WIDTH(4)
)

   data_buf_vga (
...
            .read((vert_pos > 20)  & (vert_pos < 760) &
                                            (horiz_pos > 100) & (horiz_pos < 1175)),
            .read_data(fifo_data_read)
...
        );

As seen here the visible portion of the screen is between line 20 and 760. On each line the visible portion is between pixel 100 and 1175.

The invisible portions of the screen we want to fill with a black border. To do this we need to block out the read_data when we are within the invisible regions:

 assign out_pixel_buffer_final = (vert_pos > 20)  & (vert_pos < 760) &
                                (horiz_pos > 100) & (horiz_pos < 1175)
                                ? fifo_data_read : 0;

This out_pixel_buffer_final signal we need to split into the individual red, green and blue signals that go to the VGA port:

assign red = out_pixel_buffer_final[15:11];
assign green = out_pixel_buffer_final[10:5];
assign blue = out_pixel_buffer_final[4:0];

Results

I created the following video to demonstrate how the C64 module renders on the VGA screen with the help of video upscaling:

For this demo I loaded the game Blue Max from a tape image. It starts off with the last couple of seconds playing the music of the loader, then the intro tune of Blue Max. I then briefly play the game for a couple of seconds.

In Summary

In this post we integrated David Kronstein's core within our C64 module.

Up to this point we always fired up the Zybo board attached to a PC. It would actually be nice if we could fire up the Zybo board on its own, with an external power supply.

So in the next post we will start investigating how to boot the Zybo from a SDCARD. To kick off this investigation, we will see if we can boot Linux on the Zybo board.

Till next time!

Sunday, 15 December 2019

Scaling up the display: Part 1

Foreword

In the previous two posts we implemented SID sound within our C64 FPGA module.

If we look back to the Introduction post of this Blog series, the purpose of this series was to create a Complete C64 system on an FPGA.

I think we got pretty close to this goal. We have implemented the following:

Integrated Arlet Otten's 6502 core into our design.
Implemented C64 memory banking.
Booting the whole C64 system.
Loading a game from a .TAP image
Implementing VIC-II module capable of displaying sprites together with a couple of its graphics modes, like multicolor bitmap, and multicolor text mode.
SID sound.

Granted, an important item missing from the list is implementing a C64 disk drive like a 1541. I am, however, not entirely sure if I would want to go down that road, since we already utilised the majority of the Block RAM resources of the ZYNQ FPGA, so I doubt if there would be sufficient resources left for implementing a 1541 module (the core of a 1541 disk drive is also a 6502 CPU, which also requires RAM and ROM to operate).

There are, however, some other items currently missing in our C64 module, which I thought would be nice to implement and for which I will be writing some blog posts on how to implement them.

The first item is to scale up the frames produced by the VIC-II module. Currently these frames have a resolution of 404x284. With most monitors available on the market today, these frames will just fill a tiny portion of the screen.

So, we will dedicate a post or two on how to scale the VIC-II generated frames up, so that they fill most of the screen.

Another issue that is worth looking into is the fact that currently our C64 module cannot operate on its own on a Zybo board. The Zybo board always needs to be connected to a PC to upload a Bitstream image and for kicking off a standalone program in the Xilinx SDK for providing USB keyboard functionality.

I will also write some Blog posts on implementing a solution for the above-mentioned issue, which would involve booting Linux from a SDCard fitted to the Zybo board and also loading a bitstream image from the same SDCard into the FPGA.

This is more or less what I have planned for future posts in this Blog series.

Let us start and see if we can upscale the frames produced by the VIC-II!

David Kronstein's Video Scaler Core

As the old saying goes: Don't re-invent the wheel. In this series I tried to apply this bit of advice numerous times:

Using Arlet's 6502 core.
Making use of an asynchronous FIFO buffer as suggested on a Xilinx's community forum.
Using Thomas Kindler's SID implementation.

So, is there a Verilog module available that can scale up an image? Indeed there is, on the OpenCores website: https://opencores.org/projects/video_stream_scaler

The SVN browser on the website allows us to get hold of the source code. Two files are of importance:

Video+Stream+Scaler+Specifications.pdf
scalar.v

The pdf explains very nicely how the scaler works.

The file scalar.v contains the main module, as well as all sub modules, within one file.

Let us start by having a look at the ports of the Video Scaler module:

//---------------------------Module IO-----------------------------------------
//Clock and reset
input wire    clk,
input wire    rst,
 
//User interface
//Input
input wire [DATA_WIDTH*CHANNELS-1:0]dIn,
input wire       dInValid,
output wire       nextDin,
input wire       start,
 
//Output
output reg [DATA_WIDTH*CHANNELS-1:0]  dOut,
output reg         dOutValid,
input wire         nextDout,
 
//Control
input wire [DISCARD_CNT_WIDTH-1:0] inputDiscardCnt, 
input wire [INPUT_X_RES_WIDTH-1:0] inputXRes,
input wire [INPUT_Y_RES_WIDTH-1:0] inputYRes,
input wire [OUTPUT_X_RES_WIDTH-1:0] outputXRes,
input wire [OUTPUT_Y_RES_WIDTH-1:0] outputYRes,
input wire [SCALE_BITS-1:0]   xScale,
input wire [SCALE_BITS-1:0]   yScale,
 
input wire [OUTPUT_X_RES_WIDTH-1+SCALE_FRAC_BITS:0] leftOffset,
input wire [SCALE_FRAC_BITS-1:0] topFracOffset, 
input wire    nearestNeighbor
);

The clk and rst ports are obvious, so let us skip to the input port section.

The dIn is the pixel data input. For our emulator we will have three channels (i.e. RGB) and each channel will be 8 bits wide. This may sound confusing at first since the Zybo board works with 16 bit pixels in the format RGB565. However, this scaler assumes the same data width for all channels, so for this reason we will just stick with 8 bits per channel.

The next two signals are handshake signals between the pixel data originator and the video scaler. When the pixel data originator has made data available for a new pixel, it will assert the dInValid signal. In return, the video scaler will assert the nextDin signal when it has accepted the data.

Something to keep in mind with our VIC-II module is that it is outputting pixels at a constant rate and cannot be told to pause for a couple of clock cycles. The Video scaler in turn can end up time and again in a situation where it is not able to accept data at a particular clock pulse. We will, however, cross this bridge when we get there.

The last port of the input section, start, is used to signal the video scaler that we are about to transmit data for a new frame.

Next we get to the output port section. With this set of ports our Video scaler behaves like a pixel data producer, producing the pixel data for the actual upscaled image. Similarly, dOutValid and nextDout are handshake signals.

Let us move onto the final section, the control section. There are a couple of ports in this section I am not going to worry about and I am just going to connect them to the value zero. These ports are the following:

inputDiscardCnt,
leftOffset,
topFracOffset and
nearestNeighbor

The other ports are for specifying the input resolution, output resolution and the ratio by which the input frames should be resized.

We will calculate the values of these ports in the next section.

Calculating the values of the control ports

Let us start by determining the values for inputXRes and inputYRes.

The spec for the video scaler states that for each resolution port we should supply a value that is the actual value, minus one.

We know that the frames produced by our VIC-II have a resolution of 404x284, so we should specify a value of 403 for inputXRes and a value of 283 for inputYRes.

Next, we should decide on the output resolution. For this one would be tempted to use the physical resolution of the monitor you are going to use for the display of the output frames.

However, these days chances are good that the monitor you will be using will be a wide screen, whereas the output of a VIC-II would be more towards a square aspect ratio. So, one would end up with a stretched image if using the physical screen resolution as the output resolution for the Video Scaler.

So, we need to proportionally scale up the input image till it just fills the height of the screen.

I am going to use my screen as an example, which has a resolution of 1366x768. I am a bit hesitant to use the full height of the screen, since I want to leave a bit of 'buffering' space to account for possible lag by the video scaler before producing output for the next frame.

So, I will be using a height of 758 for our output frame. At this point we need to calculate the ratio by which we will be resizing our input image.

(Output Resolution Y) / (Input Resolution Y)

= 758 / 284

= 2.6690

This is a very important factor, and we will be using it later again for the ports xScale and yScale.

To get the horizontal output resolution, we should multiply the horizontal input resolution by this factor:

(Input Resolution X) * 2.6690

= 404 * 2.6690

= 1078.276 ≈ 1078

Thus, our output image should have a resolution of 1078x758, resulting in a value of 1077 for outputXRes and a value of 757 for outputYRes, using the minus one constraint specified in the spec of the video scaler.
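If your monitor has a different resolution than mine, the same calculation can be repeated with a few lines of Python. This is just a sketch of the arithmetic above; the function names fit_to_height and port_value are my own.

```python
def fit_to_height(in_width, in_height, out_height):
    """Scale the input frame proportionally so it fills out_height lines.

    Returns the scale factor and the rounded output width.
    """
    factor = out_height / in_height
    out_width = round(in_width * factor)
    return factor, out_width


def port_value(resolution):
    """The scaler spec asks for the actual resolution minus one."""
    return resolution - 1


if __name__ == "__main__":
    factor, out_width = fit_to_height(404, 284, 758)
    print("scale factor:", round(factor, 4))
    print("inputXRes  =", port_value(404), " inputYRes  =", port_value(284))
    print("outputXRes =", port_value(out_width), " outputYRes =", port_value(758))
```

For the 404x284 input and an output height of 758, this reproduces the values worked out by hand above.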

Next, we should calculate the values for xScale and yScale, which in our case will be the same.

According to the spec, xScale gets calculated by (inputSize / Output size). This is different to the way we have calculated our scale factor, which is (Output size / Input Size). So, to get to a valid value for xScale and yScale, we should use the reciprocal of our factor, which yields around 0.374672.

We now need to represent this fraction in binary. Sounds like a daunting task, but fear not!

The scale value uses 4 bits for the integer value and 14 bits for the fraction, totalling 18 bits. The weights of the first few bits after the binary point can be visually represented as follows:

0.5   0.25   0.125   0.0625

For clarity, I have shown only the first couple of fraction bits.

Through some trial and error, I found that 0.25 and 0.125 give a good enough estimation for us: 0.375. Let us convert this fraction into hexadecimal:

0000.01100000000000

= 000001100000000000

= 00 0001 1000 0000 0000

=  0   1    8   0    0

So, the value to use for both xScale and yScale is 18'h1800.
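Instead of trial and error, the fixed point value can also be calculated directly by multiplying the fraction by 2 to the power of the number of fraction bits. The following Python sketch shows this for the 4.14 format described above. Rounding to the nearest integer is my own assumption; I did not check what rounding the scaler spec prefers, and the function name to_fixed_point is made up for this example.

```python
FRAC_BITS = 14   # fraction bits of the scale ports, as described above
TOTAL_BITS = 18  # 4 integer bits + 14 fraction bits


def to_fixed_point(value, frac_bits=FRAC_BITS, total_bits=TOTAL_BITS):
    """Convert a non-negative fraction to an unsigned fixed point integer."""
    fixed = round(value * (1 << frac_bits))  # assumption: round to nearest
    if not 0 <= fixed < (1 << total_bits):
        raise ValueError("value does not fit in the fixed point format")
    return fixed


if __name__ == "__main__":
    estimate = to_fixed_point(0.375)
    exact = to_fixed_point(284 / 758)
    print("0.375   -> 18'h%x" % estimate)
    print("284/758 -> 18'h%x" % exact)
```

The estimate of 0.375 and the exact ratio differ only by a few units in the last fraction bits, so which of the two you use is a matter of how closely the scaled image must match the calculated output resolution.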

Creating a Test Bench

Let us create a test bench to test the Video Scaler.

The first thing we should do is to get hold of source image data. An easy way to get this is to run our C64 FPGA design on the Zybo Board, and then do an mrd (i.e. memory read command) on the XSCT console on the memory area where the frame is stored and write the result to a file.

However, be mindful of the fact that information is stored in memory as 32-bit words in little endian format. With the pixels being in the format RGB565, it means that every pair of pixels will have their order reversed.

So, it might be necessary to write a program for reversing the order of the pixels for this exercise.
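A minimal Python sketch of such a program is shown below. It assumes the dump is a flat binary file of 16-bit pixels in which every adjacent pair of pixels needs to be swapped; depending on how you captured the memory, your file layout may differ. The function name swap_pixel_pairs and the file names are placeholders of my own.

```python
def swap_pixel_pairs(data):
    """Swap the two 16-bit pixels inside every 32-bit word of a raw dump."""
    if len(data) % 4 != 0:
        raise ValueError("expected a whole number of 32-bit words")
    out = bytearray(len(data))
    for i in range(0, len(data), 4):
        out[i:i + 2] = data[i + 2:i + 4]  # second pixel moves to the front
        out[i + 2:i + 4] = data[i:i + 2]  # first pixel moves to the back
    return bytes(out)


def convert_file(src_path, dst_path):
    """Read a raw dump, swap the pixel pairs and write the result."""
    with open(src_path, "rb") as src:
        swapped = swap_pixel_pairs(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(swapped)


if __name__ == "__main__":
    # Placeholder file names; replace them with your own dump and output file.
    # convert_file("frame_dump.bin", "frame_swapped.data")
    print(swap_pixel_pairs(bytes([1, 2, 3, 4])).hex())
```

The byte order within each pixel is left untouched; only the position of the two pixels within the word changes.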

Next, let us write some code to read data from this file and supply it to the video scaler on demand:

reg [15:0] pixel_in_data;

initial begin
  f = $fopen("<file.data>","rb");
  #300
  @(negedge clk)
  start = 1;
  @(negedge clk)
  start = 0;
  #300;
  
  while (data_count < 500000)
  begin
    @(negedge clk)
    if(nextDIn)
    begin
      data_valid_in = 1;
      $fread(pixel_in_data,f);    
    end 
  end
end

We start by triggering the start flag to inform the video scaler that we are at the beginning of the frame and about to send data.

We keep reading pixel data while the video scaler asserts the nextDIn port. With the first pixel that we read, we also assert the data_valid_in port.

Data in the source file is in the format RGB565, and the video scaler expects it in the format 24-bit color, so we convert it like this:

streamScaler #(
.DATA_WIDTH(8),
.CHANNELS(3)
)
  myscaler
(
...
.dIn({pixel_in_data[15:11],3'b0,pixel_in_data[10:5],2'b0,pixel_in_data[4:0],3'b0}),
...
);

Next, we should capture all the generated pixels from the video scaler and save it as an image file, so we can view it in an image viewer.

I would like the resulting image file to be again in RGB565 format, so I will do a conversion again:

...
wire [15:0] data_out_concat;
...
assign data_out_concat = {data_out[23:19], data_out[15:10], data_out[7:3]};
...
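Here is a quick Python model of the two conversions, handy for checking that a round trip loses nothing. The function names are my own, and I assume the usual RGB565 layout of five red, six green and five blue bits.

```python
def rgb565_to_rgb888(pixel: int) -> int:
    """Expand RGB565 to 24-bit color by padding each channel with zero bits."""
    r = (pixel >> 11) & 0x1F
    g = (pixel >> 5) & 0x3F
    b = pixel & 0x1F
    return (r << 19) | (g << 10) | (b << 3)


def rgb888_to_rgb565(pixel: int) -> int:
    """Keep the top five red, six green and five blue bits."""
    r = (pixel >> 19) & 0x1F
    g = (pixel >> 10) & 0x3F
    b = (pixel >> 3) & 0x1F
    return (r << 11) | (g << 5) | b


print(hex(rgb565_to_rgb888(0xFFFF)))
```

Note that white does not expand to full-scale 24-bit white, because the padding bits are always zero.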

We write the pixel data as follows:

initial begin
fw = $fopen("<outputfile.data>","wb");
while (data_count < 5000000)
begin
  @(posedge clk)
  if (data_valid_out)
  begin
    $fwrite(fw,"%c",data_out_concat[7:0]);
    $fwrite(fw,"%c",data_out_concat[15:8]);
  end
end
end

So, while data_valid_out is asserted we write the pixel data to the resulting image file. Also, we write each pixel in little endian order, which is the format required by the image viewer.

The Result

Let us have a look at the resulting image file.

We are going to use GIMP to view the image file. GIMP can read raw image data, provided you use the file extension .data.

On opening this file, we specify the format RGB565 and the resolution 1078x758:

The resulting rescaled image is as follows:

Admittedly, one cannot clearly see the difference between the scaled up output of the video scaler and the original frame.

We will, however, better see the effect of the upscaling when displayed on a monitor, which we will cover in the next post.

In Summary

In this post we started to investigate upscaling for use in our C64 emulator, so we can fill the whole screen instead of a tiny portion.

We identified David Kronstein's Video scaler core as the candidate for use in this task.

We created a test bench for testing David Kronstein's Video Scaler and successfully managed to upscale a test image.

In the next post we will be integrating this core into our C64 emulator.

Till next time!

Friday, 29 November 2019

Implementing Sound: Part 2

Foreword

In the previous post we gave some thought to the idea of adding SID sound to our C64 module.

This turned out not to be such a daunting task, since we found an existing SID implementation on Github, written in SystemVerilog by Thomas Kindler.

We tested this SID implementation by capturing a couple of seconds worth of SID register writes from a JavaScript emulator I wrote a couple of years ago. These SID register writes I then supplied to Kindler's SID core and listened to the output.

The result was very pleasing. Initially I spotted a bit of clipping, but subsequently fixed this by reducing the volume of each voice.

In this post we will be adding Kindler's core to our C64 module and see if we can play SID sound in realtime.

The importance of clock locking

I would like to start off this post by talking about an issue of a different kind I had to solve with the C64 module.

As I kept adding more functionality to the C64 module, I ended up once again with a case where this core didn't want to boot up anymore.

After checking the Verilog code of the C64 module time and again, I couldn't find anything wrong.

At one point I started wondering: In the clock wizard generating the 16MHz signal, I am not using the lock signal at all. Can this be a source of issues?

The following post on Xilinx Community forums sheds a bit of light on this issue: https://www.xilinx.com/support/answers/52806.html. As quoted from this post:

Until the LOCKED signal is asserted High, the DCM/DLL output clocks are not valid and can exhibit glitches, spikes, or other spurious movement.

So, it is a very good idea to honour the Locked signal.

Of course we need to use this signal when generating the reset signal for the 6502:

...
assign c64_reset = (reset_counter > 8000000) & (reset_counter < 8000020) ? 1 : 0;
...
    always @(posedge clk_in)
     if ((reset_counter < 9000000) & locked)
       reset_counter <= reset_counter + 1;
...

You might remember that we generate this signal within the VIC-II module which is clocked at 8MHz. So, in this code we wait about a second after the 8MHz clock generator is locked, after which we assert the signal for a couple of cycles.
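The timing can be checked with a few lines of Python. This is only a sketch of the arithmetic, assuming the counter is indeed clocked at 8MHz as described above.

```python
CLOCK_HZ = 8_000_000  # assumption: the counter runs off the 8MHz VIC-II clock


def c64_reset(reset_counter: int) -> bool:
    """Mirror of the assign statement: high strictly between the two limits."""
    return 8_000_000 < reset_counter < 8_000_020


# Time from LOCKED going high until the reset pulse starts.
delay_seconds = 8_000_000 / CLOCK_HZ

# Number of counter values (and so clock cycles) for which reset is high.
pulse_cycles = sum(c64_reset(c) for c in range(7_999_990, 8_000_030))

print(delay_seconds, pulse_cycles)
```

So the reset pulse starts one second after lock and lasts 19 clock cycles.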

Mapping SID into memory

For now, we will only worry about performing writes to the SID, and not any reading. This will result in the following port assignments of the SID module:

MOS6581 sid(
...
    .addr(addr[4:0]),
    .data(ram_in),       
    .n_cs(!(we & io_enabled & (addr[15:8] == 8'hd4))),
    .rw(0),
    .clk(clk_1_mhz), .clk_en(1), .n_reset(!c64_reset)
);

Since the SID only has 29 registers, we only connect the lower 5 bits of the address to the SID module.

We permanently wire this module to write mode (i.e. rw is set to zero).

Also, we enable writing when there is a write within the IO region (i.e. addresses D000 to DFFF) and the upper eight bits of the address equal 0xD4.
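The decode logic is small enough to model in Python. The sketch below mirrors the n_cs and addr port expressions; the function names are my own.

```python
def sid_n_cs(addr: int, we: bool, io_enabled: bool) -> int:
    """Active-low chip select: 0 only for writes to D400-D4FF with IO enabled."""
    selected = we and io_enabled and ((addr >> 8) & 0xFF) == 0xD4
    return 0 if selected else 1


def sid_register(addr: int) -> int:
    """Only the lower five address bits reach the SID."""
    return addr & 0x1F


print(sid_n_cs(0xD418, True, True), sid_register(0xD418))
```

One consequence of using only five address bits is that the registers repeat every 32 bytes within D400 to D4FF, so a write to D438 lands in the same register as a write to D418.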

Outputting samples to the Sound System

Some time ago we played with sound on the Zybo board. The Zybo board can create high quality sound with the help of the Analog Devices SSM2603 Audio Codec.

This codec receives samples in a serial fashion with the I2S protocol.

We implemented a block that generates a monotone and converted the samples to the I2S protocol so the audio codec can receive it.

In this section we extend this block so that it can receive audio samples from the SID block.

Let us start by having a look at the port definitions of the I2S block:

module i2s(
  input clk,
  input clk_1_mhz,
  input [15:0] audio_in,
  output clk_1_5_mhz,
  output channel_enable,
  output out_data,
  output mute_en
    );

The output ports are basically the I2S ports that we will connect to the Audio codec.

The audio_in port carries the audio samples from the SID module.

We again have a clock domain crossing that we need to solve. The SID generates samples at a rate of 1MHz, whereas the audio codec needs to receive the samples at a rate of 48KHz.

So, apart from solving the clock domain issue, we also need to discard a number of samples from the SID module to get to the 48KHz sample rate.

Let us start by having a look at the critical point at which we need to inject a sample from the 1MHz clock domain:

    always @(posedge clk)
    if (channel_enable_counter == 15 & neg_edge)
    begin
      shift_reg <= {data_val, data_val};      
    end
    else if (neg_edge)
      shift_reg <= {shift_reg[30:0] , 1'b0};

This is the logic for the shift register that shifts out the data to the audio codec. Basically we want a 1MHz sample to be sitting in data_val by the time the shift register gets reloaded.

We wrote the above snippet quite some time ago, so let us familiarise ourselves again with what is going on in it.

Within the world of our audio codec, there are three clock frequencies:

The master clock: 12.288MHz
The serial data clock: 1.536MHz
The sample clock: 48KHz

To avoid multiple cross clock domain issues, we try to clock all our always blocks (Except the 1MHz bits) at 12.288MHz.

We want our shift register to clock at 1.536MHz, so we introduce a signal neg_edge that gets asserted when we are at the negative edge of the 1.536MHz signal:

    reg [1:0] clk_div_counter = 0;
    
    assign neg_edge = (clk_div_counter == 3) & (bclk_int == 1) ? 1 : 0;

    always @(posedge clk)
      clk_div_counter <= clk_div_counter + 1; 

    always @(posedge clk)
    if (clk_div_counter == 3)
      bclk_int <= ~bclk_int;

So, bclk_int is our 1.536MHz clock, which is generated by toggling it every four clock cycles.
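To convince ourselves of the ratios, here is a small cycle-by-cycle Python model of the divider. It is a sketch only: I assume both clk_div_counter and bclk_int start at zero, and the function name is my own.

```python
MASTER_HZ = 12_288_000


def count_neg_edges(n_cycles: int) -> int:
    """Count neg_edge pulses over n_cycles of the 12.288MHz master clock."""
    clk_div_counter = 0
    bclk_int = 0
    neg_edges = 0
    for _ in range(n_cycles):
        # Combinational: neg_edge = (clk_div_counter == 3) & (bclk_int == 1)
        if clk_div_counter == 3 and bclk_int == 1:
            neg_edges += 1
        # Clocked updates, both based on the values before the clock edge.
        next_bclk = bclk_int ^ 1 if clk_div_counter == 3 else bclk_int
        clk_div_counter = (clk_div_counter + 1) & 3  # 2-bit counter wraps
        bclk_int = next_bclk
    return neg_edges


cycles_per_sample = MASTER_HZ // 48_000      # master cycles per 48KHz sample
print(cycles_per_sample, count_neg_edges(cycles_per_sample))
```

One 48KHz sample period is 256 master clock cycles, in which neg_edge fires 32 times. That matches the 32-bit shift register, which holds two copies of the 16-bit sample.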

Although bclk_int is a clock signal, we don't use it to clock any always blocks, so there is no need to worry about any cross clock domain issues here.

Let us similarly bring the 48KHz sample signal into the picture:

    always @(posedge clk)
    if (neg_edge & channel_enable_counter == 15)
      prclk_int <= ~prclk_int;

This looks very similar to the signal that we use to load data into the shift register. The correct instant to load a value from the SID into data_val is a cycle or two after the reload of the shift register has occurred.

The following snippet accomplishes just that for us:

    (* ASYNC_REG = "TRUE" *) reg sig_48_khz_0, sig_48_khz_1, sig_48_khz_2;

    always @(posedge clk_1_mhz)
    begin
       sig_48_khz_0 <= prclk_int;
       sig_48_khz_1 <= sig_48_khz_0;
       sig_48_khz_2 <= sig_48_khz_1; 
    end

    always @(posedge clk_1_mhz)
    if (!sig_48_khz_1 & sig_48_khz_2)
      data_val <= audio_in;

Here we have a multi-flop synchroniser again to bring the 48KHz signal to the 1MHz domain.

This multi-flop synchroniser has an additional function: Delaying the assignment of a new value until the current value has been handed over to the shift register.

So, with the above setup we are effectively removing the overlap of fetching a sample from the 1MHz domain and reading it in the 48KHz domain.
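The effect of the synchroniser and edge detector can be sketched in Python. This is an idealised model under my own assumptions: prclk_int is treated as a perfect 48KHz square wave, it is sampled once per microsecond by the 1MHz clock, and metastability is not modelled at all.

```python
def count_loads(n_cycles: int) -> int:
    """Count how often data_val would be reloaded in n_cycles of the 1MHz clock."""
    s0 = s1 = s2 = 0          # sig_48_khz_0, sig_48_khz_1, sig_48_khz_2
    loads = 0
    for n in range(n_cycles):
        # Ideal 48KHz square wave: 48 periods per 1000 microseconds.
        prclk = 1 if (n * 48) % 1000 < 500 else 0
        # Falling edge of the synchronised signal: newer flop low, older high.
        if not s1 and s2:
            loads += 1
        # Non-blocking assignments: every flop takes its neighbour's old value.
        s0, s1, s2 = prclk, s0, s1
    return loads


print(count_loads(1000))
```

In one millisecond of 1MHz clock cycles the model reloads data_val 48 times, so roughly one SID sample in every 21 is kept and the rest are discarded.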

Implementing these changes will give us a fully functional SID implementation within our C64 module.

In Summary

In this post we finished off our SID implementation within our C64 FPGA implementation.

Special thanks to Thomas Kindler for sharing the source code on Github for his SID implementation.

Originally I intended this post to be the last one.

However, an extra idea popped up. The C64 FPGA implementation in its current state only fills a small area on the screen, so in the next post we will see if we can scale this image up so it can fill most of the screen.

I want to end off this post with an interesting thought. When doing FPGA programming, every now and again one is faced with cross clock domain issues. That makes one realise that although we are working with digital electronics, where we always have a discrete state of a zero or a one, the circuits in a chip still exhibit behaviour similar to that of an analogue circuit.

Till next time!

Wednesday, 30 October 2019

Implementing Sound: Part 1

Foreword

In the previous post we finished off implementing sprites into our C64 emulator.

This enabled us to fully play the game Dan Dare within our C64 emulator.

In this post we will start to implement a nice-to-have: Sound emulation.

To implement SID emulation from scratch can be quite a daunting task. So, I did some searching on the Internet to see if I could find an existing SID-implementation, written in Verilog.

As part of this post, I will also show how to create a test bench for evaluating such an implementation.

The chosen SID implementation

After some searching on the Internet, I came across a nice SID implementation on Github coded in SystemVerilog, by Thomas Kindler. Here is the link to the project:

https://github.com/thomask77/verilog-sid-mos6581

I don't have hands-on experience with SystemVerilog, but according to many resources, Vivado does in fact support SystemVerilog. So my SystemVerilog disadvantage shouldn't be much of a setback :-)

There was one thing I did experience when using this code with Vivado. Output ports feeding off sequential elements need to be declared with the 'reg' keyword.

Let us do an overview on how Thomas Kindler's SID module works. Thomas Kindler based much of the inner workings on the Interview with Bob Yannes, the designer of the original SID chip. A copy of this interview can be found on a couple of places on the Internet, including here.

At the heart of each voice on the SID is a 24-bit phase accumulator, clocked at 1MHz.

The phase accumulator is the work horse for generating one complete cycle for the desired waveform at the desired frequency.

Each Voice on the SID has a phase accumulator, which gets incremented at each clock cycle by the value stored in its 16-bit frequency register. These are registers 54272+54273 for Voice 1, 54279+54280 for Voice 2, and 54286+54287 for Voice 3.

If a Voice circuit has a frequency value of 1, the 24-bit accumulator takes 2^24 cycles to wrap around, so we will be producing waveforms with a period of about 16.8 seconds, which is a frequency of less than 1Hz. On the other hand, a frequency value of 65535 will yield a waveform with a frequency of about 4KHz.
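The relationship between the frequency register and the output frequency is easy to check in Python. This sketch assumes an exact 1MHz clock; the function name is my own.

```python
ACC_BITS = 24
CLOCK_HZ = 1_000_000  # assumption: exactly 1MHz


def voice_frequency_hz(freq_reg: int) -> float:
    """Output frequency of a phase accumulator stepped by freq_reg per clock."""
    return freq_reg * CLOCK_HZ / (1 << ACC_BITS)


print(1 / voice_frequency_hz(1))    # period in seconds for register value 1
print(voice_frequency_hz(65535))    # frequency in Hz for the maximum value
```

The period for a register value of 1 comes to about 16.78 seconds, and the maximum register value gives just over 3906Hz.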

Let us have a look at how the different waveforms get created.

The triangle waveform is generated as follows:

...
   out_triangle = acc[22:12] << 1;
...
    if (acc[23])
        out_triangle ^= '1;
...

We are using the lower 23 bits of the phase accumulator. Our waveform starts off increasing till bit 23 of the phase accumulator gets set, after which we do an XOR on the resulting values, giving us a mirrored image of the previous time period.

Generating sawtooth is much simpler:

    out_saw      = acc[23:12];

And also pulse:

   out_pulse    = acc[23:12] < pw ? '1 : '0;
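The three expressions can be modelled in Python to see what they produce at a few points in the accumulator cycle. This is a sketch under my own assumptions: I take the waveform outputs and the pulse width pw to be 12 bits wide, so the SystemVerilog '1 becomes 0xFFF.

```python
def triangle(acc: int) -> int:
    """Rising ramp for the first half of the cycle, mirrored for the second."""
    out = ((acc >> 12) & 0x7FF) << 1       # acc[22:12] << 1
    if acc & 0x800000:                     # acc[23] set: invert all 12 bits
        out ^= 0xFFF
    return out


def sawtooth(acc: int) -> int:
    """Top 12 bits of the accumulator, acc[23:12]."""
    return (acc >> 12) & 0xFFF


def pulse(acc: int, pw: int) -> int:
    """All ones while the top 12 bits are below the pulse width, else zero."""
    return 0xFFF if sawtooth(acc) < pw else 0


for acc in (0x000000, 0x400000, 0x7FFFFF, 0x800000, 0xC00000, 0xFFFFFF):
    print(hex(acc), triangle(acc), sawtooth(acc), pulse(acc, 0x800))
```

With pw set to 0x800 the pulse output is a square wave, high for the first half of the accumulator cycle and low for the second.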

Noise gets generated with a Fibonacci-style linear feedback shift register (LFSR), which I will not be going into detail on here.

I will also not be going into the specifics of Envelope generation. I will only mention here that Envelope generation applies ADSR (Attack, Decay, Sustain, Release) to the resulting waveform, and is dealt with in the file sid_env.sv.

Creating the Testbed

The simple part of creating a Testbed for the SID module is wiring up all the ports, supplying a clock signal, and applying some reset logic.

The complex part is getting hold of a sequence of SID register writes that will generate a sound that we can evaluate by ear.

The most obvious way to get hold of such sequences would be to intercept writes to SID registers when executing the applicable program within a C64 emulator.

Some time ago I wrote an emulator in JavaScript, here, which I am going to use for this purpose. The full source of this emulator can be found here. We should now briefly put our JavaScript thinking caps on :-)

I will try, though, to keep the discussion on this JavaScript emulator short, conveying just the basic idea. Should some of you like to have a more detailed blog post on this, please drop me a comment.

The place where we will be doing the interception of SID writes, will be within the file memory.js, within the following method:

  function IOWrite(address, value) {
    if ((address >= 0xdc00) & (address <= 0xdcff)) {
      return ciaWrite(address, value);
    } else if ((address >= 0xd000) & (address <= 0xd02e)) {
      return myVideo.writeReg(address - 0xd000, value);
    } else if ((address >= 0xd800) & (address <= 0xdbe8)) {
      return myVideo.writeColorRAM (address - 0xd800, value);
    } else {
      IOUnclaimed[address - 0xd000] = value;
      if ((address & 0xff00) == 54272)
      mysid.log(address & 31, value);
      return;
    } 
  }

mysid is the instance of a class that we still need to define, so let us start with the outline of the class:

function sid() {
var mycpu;
var lasttime = 0;

  this.setCpu = function(cpu) {
    mycpu = cpu;
  }

  this.log = function(addr, val) {
    diff = mycpu.getCycleCount() - lasttime;
    lasttime = mycpu.getCycleCount();

  }

}

One important thing when recording the writes, is to also record the exact time instance when the write happened. For this purpose we need a handle to the CPU instance to get the current Cycle Count.

With these pieces of information at hand, we can create a series of Verilog statements for the register writes that we can place in an initial-begin..end block. We will write these statements to a text area defined within our HTML page. With these changes our log function looks as follows:

...
  this.log = function(addr, val) {
    diff = mycpu.getCycleCount() - lasttime;
    lasttime = mycpu.getCycleCount();
        var ins = document.getElementById("diss");
        var temp = ins.value+ "\n#"+diff*10+";\n@(negedge clk)\n"+
          "rw = 0; n_cs = 0; addr = 5'd"+addr+
          "; data = "+val+";\n@(negedge clk)\n"+"rw=1; n_cs=1;";
        ins.value = temp;
  }
...

A typical sequence of this generated Verilog code looks as follows:

#580;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd11; data = 32;
@(negedge clk)
rw=1; n_cs=1;
#712860;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd0; data = 162;
@(negedge clk)
rw=1; n_cs=1;
#80;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd1; data = 37;
@(negedge clk)
rw=1; n_cs=1;
#200;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd4; data = 32;
@(negedge clk)
rw=1; n_cs=1;

So, we delay each set of assignments by a certain period of time as captured by the log function.

Interesting, JavaScript generating Verilog code!
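For readers who would rather post-process a captured log offline, the same string building can be sketched in Python. The function name is my own, and the factor of 10 applied to the cycle difference is simply carried over from the JavaScript above.

```python
def verilog_for_write(diff_cycles: int, addr: int, value: int) -> str:
    """Build the Verilog statements for one SID register write."""
    return (
        f"#{diff_cycles * 10};\n"
        "@(negedge clk)\n"
        f"rw = 0; n_cs = 0; addr = 5'd{addr}; data = {value};\n"
        "@(negedge clk)\n"
        "rw=1; n_cs=1;"
    )


# A write of 32 to register 11, 58 CPU cycles after the previous write.
print(verilog_for_write(58, 11, 32))
```

Called with a cycle difference of 58, register 11 and value 32, this reproduces the first five lines of the generated sequence shown above.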

Next, let us write some code for capturing the produced sound samples to a file so that we can listen to the produced sound later on:

...
integer f = 0;
integer i = 0;
...
initial begin
  f = $fopen("sound.raw","wb");
  #100;
  for (i = 0; i < 90000000; i = i + 1) begin
    @(negedge clk)
    if ((i% 20) == 1)
    begin     
      $fwrite(f,"%c",audio[7:0]);
      $fwrite(f,"%c",audio[15:8]);
    end
  end
$fclose(f);

end

The sound gets produced at a rate of 1MHz. I am reducing the sample rate to roughly 48KHz by capturing only every 20th sample (strictly speaking, 1MHz divided by 20 gives 50KHz). In this way most sound players would be able to keep up.
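It is worth working out the numbers for this capture. The Python sketch below assumes the simulated clock is exactly 1MHz; the constants are taken from the test bench above.

```python
SID_CLOCK_HZ = 1_000_000    # assumption: simulated clock is exactly 1MHz
DECIMATION = 20             # from the (i % 20) == 1 test
TOTAL_CYCLES = 90_000_000   # loop bound in the test bench

sample_rate_hz = SID_CLOCK_HZ // DECIMATION
samples_written = sum(1 for i in range(0, TOTAL_CYCLES, DECIMATION))
file_size_bytes = samples_written * 2          # two bytes per sample
duration_seconds = samples_written / sample_rate_hz

print(sample_rate_hz, samples_written, file_size_bytes, duration_seconds)
```

Under that assumption the file holds 90 seconds of sound at 50KHz. Importing it at 50000Hz should give the exact pitch, while importing it at 48000Hz would play it back about 4% slow.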

Audacity is an open source program that allows you to import and play these raw samples.

Test Results

Let us listen to the resulting sounds.

Our first attempt is kind of successful, but there is a bit of distortion:

The distortion is more visible within the wave editor of Audacity:

One can clearly see the waveform goes off the screen in a couple of places.

Looking at the source file sid_filter.sv of the SID implementation, one kind of gets a feeling of where things go wrong:

out_next = (out_next * reg_vol) >> 2;

Here we multiply the final sample with the master volume and divide the result by 4. When inspecting the waveform during a Vivado simulation, multiplying by a master volume of 15 sometimes yields a number that is way past the range of a 16-bit number, and dividing by 4 simply isn't enough. I fix this by dividing by 8 instead of four:

out_next = (out_next * reg_vol) >> 3;
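To get a feel for the headroom involved, here is a rough Python calculation. It is a sketch under my own assumptions: the final output is treated as a signed 16-bit value, and the master volume is at its maximum of 15. The actual widths inside sid_filter.sv may differ.

```python
INT16_MAX = 32767
MASTER_VOLUME = 15  # maximum value of the 4-bit volume register


def scaled(sample: int, shift: int) -> int:
    """Sample after the volume multiply and the right shift."""
    return (sample * MASTER_VOLUME) >> shift


def largest_safe_sample(shift: int) -> int:
    """Largest positive input that still fits in 16 bits after scaling."""
    return (((INT16_MAX + 1) << shift) - 1) // MASTER_VOLUME


for shift in (2, 3):
    print(shift, largest_safe_sample(shift))
```

With a shift of 2 any input above 8738 overflows, whereas a shift of 3 doubles the usable input range to 17476.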

The result is much better, although not taking advantage of the full volume range:

In Summary

In this post we started implementing sound within our emulator.

We evaluated Thomas Kindler's SID implementation and found it to work very well.

Many thanks to Thomas Kindler for making the source of this implementation available on Github.

In the next post we will continue by integrating this SID core into our C64 core.

Till next time!

Monday, 14 October 2019

Implementing Sprites: Part 3

Foreword

In the previous post we have implemented the capability for our sprite to expand in both the X and Y directions. We also have implemented Sprite multicolor mode.

Up to now our VIC-II only supported a single sprite, Sprite 0. So, in this post we will be connecting the remaining seven sprites.

With our VIC-II module able to display all eight sprites, we would be able to fully play the game Dan Dare with our emulator.

This would indeed be a very nostalgic moment for me, but it raised a bit of a concern. If one is going to play for extended periods on the Zybo with our emulator, wouldn't the Zynq SoC eventually get very hot?

My concern was driven by the fact that these days you find quite a number of videos on the Internet concerning cooling solutions for single board computers. With this in mind, when you come to the Zybo board, you cannot really find any information regarding what kind of temperatures to expect during general use of the board.

So, I will end off this post by sharing what I have found by experimentation regarding the temperature of the Zynq when running our emulator for half an hour or so.

Hooking up the remaining sprites

Currently we only have a single instance of sprite_generator for sprite 0. Let us start by adding instances for the remaining sprites. For simplicity, I am only showing the declarations for the first three:

sprite_generator sprite_0(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[0],sprite_0_xpos}),
  .sprite_y_pos(sprite_0_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 0),

  .x_expand(x_expand[0]),
  .y_expand(y_expand[0]),
  .multi_color_mode(multi_color_mode[0]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_0),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[0]),
  .show_pixel(show_pixel_sprite_0),
  .output_pixel(out_pixel_sprite_0),
  .request_data(),
  .request_line_offset(sprite_0_offset)
    );

sprite_generator sprite_1(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[1],sprite_1_xpos}),
  .sprite_y_pos(sprite_1_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 1),

  .x_expand(x_expand[1]),
  .y_expand(y_expand[1]),
  .multi_color_mode(multi_color_mode[1]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_1),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[1]),
  .show_pixel(show_pixel_sprite_1),
  .output_pixel(out_pixel_sprite_1),
  .request_data(),
  .request_line_offset(sprite_1_offset)
    );

sprite_generator sprite_2(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[2],sprite_2_xpos}),
  .sprite_y_pos(sprite_2_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 2),

  .x_expand(x_expand[2]),
  .y_expand(y_expand[2]),
  .multi_color_mode(multi_color_mode[2]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_2),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[2]),
  .show_pixel(show_pixel_sprite_2),
  .output_pixel(out_pixel_sprite_2),
  .request_data(),
  .request_line_offset(sprite_2_offset)
    );

This is a typical copy and paste exercise. However, some of the ports are specific to the sprite itself. Here is a list of these ports:

sprite_x_pos/ sprite_y_pos
store_byte
x_expand/y_expand
multi_color_mode
primary_color
sprite_enabled
show_pixel/output_pixel
request_line_offset

Obviously some ports will get their value from a particular bit position in a register, whereas the other ports in this list have their own dedicated registers.

Let us have a look at the output ports. The first is request_line_offset, which we wired to sprite_0_offset through sprite_7_offset. We use these signals as follows:

   always @*
     case (sprite_data_region_offset[6:4])
       3'd0: sprite_offset = sprite_0_offset;
       3'd1: sprite_offset = sprite_1_offset;
       3'd2: sprite_offset = sprite_2_offset;
       3'd3: sprite_offset = sprite_3_offset;
       3'd4: sprite_offset = sprite_4_offset;
       3'd5: sprite_offset = sprite_5_offset;
       3'd6: sprite_offset = sprite_6_offset;
       3'd7: sprite_offset = sprite_7_offset;
    endcase

     always @*
       if (!sprite_data_region && (clk_counter == 6 | clk_counter == 7))
         addr = bit_data_pointer;       
       else if (sprite_data_region && (sprite_data_region_offset[3:0] < 3))
         addr = {mem_pointers[7:4], 7'h7f, sprite_data_region_offset[6:4]};
       else if (sprite_data_region)
         addr = {sprite_data_location, (sprite_offset + sprite_byte_num)}; 
       else
         addr =  {mem_pointers[7:4], screen_mem_pos};

So, we use the applicable sprite_offset when it is the data cycle for a particular sprite.

We sit with a couple of show_pixel/output_pixel pairs for each sprite. We combine these as follows:

always @*
   if (show_pixel_sprite_0)
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1)
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2)
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3)
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4)
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5)
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6)
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7)
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else
     color_for_bit_with_sprite = color_for_bit;

   assign color_for_bit = multicolor_data ? multi_color :    
            (pixel_shift_reg[7] == 1 ? char_buffer_out_delayed[11:8] : background_color);
   assign final_color = (visible_vert & visible_horiz & screen_enabled) ? color_for_bit_with_sprite : border_color;

For now we only assume that all sprites are in front of the main graphics, implementing the hardcoded priority, where the lower the sprite number, the higher the priority.

When we run our implementation on the Zybo board with these changes, it looks very promising: our characters have finally appeared!

One small thing doesn't look right though. Our characters are always in front of everything! They appear in front of rocks. Also, when we walk underwater, using a reed as a snorkel, only the reed should be visible. This is not the case with our emulator in its current state:

We see Dan Dare, his pet, and the Snorkel!

OK, I agree, this shouldn't come as a surprise, since we implemented sprites to be always visible in front of the background graphics.

Fine tuning Sprite display priority

There are a couple of pieces of Sprite priority functionality that should be implemented before our game screen can render correctly.

The first is priority according to the Sprite priority register at address D01B. Firstly we need to implement this register in our VIC-II so it can be written to or read by the 6502. This is similar to the other registers we have implemented.

We use this register as follows:

always @*
   if (show_pixel_sprite_0 && !sprite_priority[0])
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1 && !sprite_priority[1])
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2 && !sprite_priority[2])
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3 && !sprite_priority[3])
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4 && !sprite_priority[4])
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5 && !sprite_priority[5])
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6 && !sprite_priority[6])
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7 && !sprite_priority[7])
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else if (pixel_shift_reg[7])
     color_for_bit_with_sprite = color_for_bit;
   else if (show_pixel_sprite_0)
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1)
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2)
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3)
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4)
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5)
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6)
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7)
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else
     color_for_bit_with_sprite = color_for_bit;

Within this snippet of code, you spot another implied priority by means of the check for pixel_shift_reg[7].

So, if this background pixel has a bit value of zero, it is actually transparent, allowing the sprites with background priorities to show through.

Obviously, if there is no visible sprite pixel with either back or front priority, we will show the applicable background color.
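The long if-else chain boils down to three passes, which the following Python sketch makes explicit. The function and parameter names are my own; the lists stand in for the eight show_pixel, output_pixel and sprite_priority signals, and foreground_bit stands in for pixel_shift_reg[7].

```python
def pixel_color(show_pixel, out_pixel, sprite_priority,
                foreground_bit, color_for_bit):
    """Pick the colour for one pixel from sprites and background graphics."""
    # Pass 1: sprites with front priority, lowest sprite number first.
    for i in range(8):
        if show_pixel[i] and not sprite_priority[i]:
            return out_pixel[i]
    # Pass 2: a foreground graphics pixel hides all remaining sprites.
    if foreground_bit:
        return color_for_bit
    # Pass 3: sprites with background priority show through a background pixel.
    for i in range(8):
        if show_pixel[i]:
            return out_pixel[i]
    # Nothing else visible: plain background graphics.
    return color_for_bit


show = [True] + [False] * 7
colors = [1, 2, 3, 4, 5, 6, 7, 8]
behind = [True] + [False] * 7
print(pixel_color(show, colors, behind, True, 14))   # sprite 0 is hidden
print(pixel_color(show, colors, behind, False, 14))  # sprite 0 shows through
```

In the usage example sprite 0 has background priority, so it is hidden by a foreground pixel but visible wherever the graphics pixel is transparent.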

There is a very interesting scenario when our main graphics is in multicolor mode. In Multicolor mode the high order bit indicates whether it is a background pixel or not. This means that we can have two possible background colors in multicolor mode, pixel value 00 and 01.

Having two background colors enables us to have a sprite that is sometimes hidden behind some objects and in front of others.

The following video shows how the game screen now looks with the recent round of changes:

This time around our emulator renders the scene more realistically. We go behind the rocks and are not visible when we go underwater.

This is actually the great nostalgic moment that this whole series of Blog posts was about!

Temperatures on the Zynq SoC

I mentioned in the beginning of this post that I have a bit of a concern on the temperature of the Zynq SoC when you are using it for extended periods of time.

One of my key areas of concern is our USB stack program that runs on one of the ARM cores, which catches keystrokes from the USB keyboard and sends them to our emulator hosted within the FPGA. To get an overall context, here is the main method of our USB stack program:

int main()
{
    Xil_DCacheDisable();
    init_platform();
    initint();
    initUsb();
    status = 0;
    state_machine();
    usleep(100000000);
    cleanup_platform();
    return 0;
}

We do some initialisation, and then we sleep for a long period of time (which in this case is 100 seconds). This sleep is necessary so our program is not terminated as a whole.

The code that does the actual work is the method state_machine, which is invoked every 10 milliseconds by a timer interrupt.

It should be noted that this program is running in standalone mode, and the usleep library call is implemented using busy waiting.

With busy waiting your CPU runs at full speed, checking in a loop for something to happen, which in this case is for 100 seconds to pass.

As we know with busy waiting, your CPU is effectively running at 100% utilisation all the time, which uses more energy and produces more heat.

So, how much heat will be produced by the above program when we run it for about half an hour?

Vivado provides some tools for us to answer this question. On the Hardware dashboard, temperature is one of the probes you can add.

When I started the emulator on the Zybo board, the temperature was around 51°C. Within minutes the temperature had risen to about 54°C.

After about half an hour, the temperature settled at about 58°C.

This wasn't as bad as I had expected. For interest's sake, I was wondering whether you could do some overclocking on the Zybo.

After some fiddling with the settings in the Vivado Block design, it doesn't really look like there are any real overclocking options. The only option that I could see was to set the frequency of an ARM core between 50MHz and 667MHz.

So, in short, it doesn't look like using the Zybo board for long periods would cause any kinds of overheating.

Also, busy waiting didn't appear to be a big issue after all. However, I was still wondering what kind of temperature difference it would make if we could avoid busy waiting.

On ARM processors, the instruction WFI (wait for interrupt) is provided for this purpose. As per the documentation on ARM's web site:

WFI (Wait For Interrupt) makes the processor suspend execution (Clock is stopped) until one of the following events take place:

An IRQ interrupt
An FIQ interrupt
A Debug Entry request made to the processor.

So, in our case when we call WFI, our CPU would freeze until our timer interrupt fires externally:

int main()
{
    Xil_DCacheDisable();
    init_platform();
    initint();
    initUsb();
    status = 0;
    state_machine();
    asm("loop: wfi");
    asm("b loop");
    cleanup_platform();
    return 0;
}

Here we added some inline assembly for invoking wfi. It should be remembered that once an external interrupt has occurred and has been served, code execution will continue just after the wfi instruction.

It is therefore important that we loop back to the wfi instruction. If we don't, our main method will run to completion.

When we monitor the temperature while using the WFI method, the Zynq definitely runs cooler. During this run I saw a temperature between 52°C and 53°C. About a 5 degree difference!

In Summary

In this post we implemented all eight sprites within our VIC-II module. We also implemented the different priorities between Sprites and the Background.

This indeed brought us to the point where we could fully play the game Dan Dare within our emulator.

With this we are nearing the end of this Blog Series. There is, however, one more thing I would like to do, and this is to see if it is possible to add sound to the emulator.

So, in the next post we will start to implement sound.

Till next time!

Thursday, 10 October 2019

Implementing Sprites: Part 2

Foreword

In the previous post we added some very basic sprite functionality to our C64 emulator that enabled us to show a moving sprite.

In this post we will continue to add some more sprite functionality, which will involve the capability to expand a sprite and multicolor mode.

Sprite Expansion

Sprites on the VIC-II have the capability to be expanded in both the X direction and the Y direction.

Register D017 has a bit for each sprite indicating whether it should be expanded in the Y direction.

Similarly, register D01D has a bit for each sprite indicating whether a sprite should be expanded in the X Direction.

So, let us start off by redirecting the bits of registers D017 and D01D to our sprite_generator module as input ports:

module sprite_generator(
...
  input x_expand,
  input y_expand,
...
    );

The first thing that is affected when a sprite is expanded is its display region. So let us modify how we determine this region:

...
  wire [5:0] sprite_width;
  wire [5:0] sprite_height;
...
  assign sprite_height = y_expand ? 42 : 21;
  assign sprite_width = x_expand ? 48 : 24;
...
  assign sprite_display_region = (raster_y_pos >= sprite_y_pos && raster_y_pos < (sprite_y_pos + sprite_height)) &&
                                 (raster_x_pos >= sprite_x_pos && raster_x_pos < (sprite_x_pos + sprite_width));
...
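
To make the region test easier to experiment with, here is a minimal C sketch of the same comparison. The function name is an assumption of mine; the parameter names mirror the Verilog wires. Note that the model uses plain C integers, so it ignores any wrap-around that the fixed wire widths in the hardware could cause.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software model of sprite_display_region (hypothetical helper name). */
bool sprite_in_display_region(uint16_t raster_x_pos, uint16_t raster_y_pos,
                              uint16_t sprite_x_pos, uint8_t sprite_y_pos,
                              bool x_expand, bool y_expand)
{
    uint16_t sprite_width  = x_expand ? 48 : 24;  /* pixels */
    uint16_t sprite_height = y_expand ? 42 : 21;  /* raster lines */

    bool in_rows = raster_y_pos >= sprite_y_pos &&
                   raster_y_pos < sprite_y_pos + sprite_height;
    bool in_cols = raster_x_pos >= sprite_x_pos &&
                   raster_x_pos < sprite_x_pos + sprite_width;
    return in_rows && in_cols;
}
```

For a sprite at position (100, 50), pixel column 124 falls outside the region of a normal sprite, but inside the region once the sprite is X-expanded.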

Next, let us consider what should happen when we expand the sprite in the Y direction. In such a case our 21 line sprite should cover 42 lines of the sprite area on the screen. So, each line should be repeated twice.

We do this by dividing the current line within the active sprite area by two:

...
  wire [5:0] request_line_pre;
...
  assign request_line_pre = next_raster - sprite_y_pos;
  assign request_line = y_expand ? (request_line_pre>>1) : request_line_pre;
...
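
As a quick sanity check, the line selection can be modelled in C. This is only a sketch under my own assumptions: the function name is hypothetical, and the 6-bit mask imitates the width of the request_line_pre wire.

```c
#include <stdint.h>

/* Which sprite row (0-20) is requested for the next raster line. */
uint8_t sprite_request_line(uint16_t raster_y_pos, uint8_t sprite_y_pos, int y_expand)
{
    uint16_t next_raster = raster_y_pos + 1;   /* data is fetched one line ahead */
    /* request_line_pre is a 6-bit wire, so keep only the low six bits */
    uint8_t request_line_pre = (uint8_t)((next_raster - sprite_y_pos) & 0x3F);
    return y_expand ? (uint8_t)(request_line_pre >> 1) : request_line_pre;
}
```

With Y expansion switched on, two consecutive raster lines map to the same sprite row, so the last row (20) is reached on the 42nd line of the sprite area.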

Similarly, when we expand in the X direction we need to repeat each pixel on a line twice. We do this by shifting a pixel out only every second clock cycle:

...
 reg [1:0] toggle; 
...
  always @(posedge clk_in)
  if (!sprite_display_region)
    toggle <= 0;
  else
    toggle <= toggle + 1;
...
 assign toggle_single_color_bit = x_expand ? toggle[0] : 1;
...
  always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region && toggle_single_color_bit)
      sprite_data <= {sprite_data[22:0], 1'b0};
...

We achieve this slow down with the toggle counter. This counter only needs to be one bit wide for now. The reason I made it two bits wide, is for multicolor mode later on.
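
The effect of the toggle counter can be illustrated with a small C simulation. This is a sketch rather than a cycle-exact model: the function name and the output buffer are assumptions, and it presumes we are inside the display region for all the clocks simulated.

```c
#include <stdint.h>

/*
 * Clock the 24-bit sprite shift register for n_clocks pixel clocks and
 * record the bit visible on each clock (sprite_data[23]).
 */
void sprite_shift_out(uint32_t sprite_data, int x_expand,
                      int n_clocks, uint8_t *pixels_out)
{
    uint8_t toggle = 0;                 /* 2-bit counter, zero on entering the region */
    for (int clk = 0; clk < n_clocks; clk++) {
        pixels_out[clk] = (sprite_data >> 23) & 1;
        /* toggle_single_color_bit: always shift, or every second clock if expanded */
        int shift_now = x_expand ? (toggle & 1) : 1;
        if (shift_now)
            sprite_data = (sprite_data << 1) & 0xFFFFFF;  /* keep 24 bits */
        toggle = (toggle + 1) & 3;
    }
}
```

Feeding in a register that starts with the bits 1010 yields 1, 0, 1, 0 without expansion, and 1, 1, 0, 0, 1, 1, 0, 0 with X expansion.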

Multicolor Mode

Let us tackle multicolor mode next.

As we know, in multicolor mode each sprite pixel is two screen pixels wide and is described by a pair of bits. This means that we need to shift out two bits at a time when in multicolor mode. This also implies that we can only do this shift every second clock cycle.

This sounds very similar to the previous section, where we needed to expand our sprite in the X direction. We will therefore make use of the toggle counter again.

When we are expanding a multicolor sprite in the X direction, we need to slow down the clocking out of the pixels even more: four clock cycles per pixel.

With all this in mind, we need to add the following code:

...
assign toggle_multi_color_bit = x_expand ? (toggle[1:0] == 2'b11) : toggle[0];
...
 always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region && toggle_single_color_bit && !multi_color_mode)
      sprite_data <= {sprite_data[22:0], 1'b0};
    else if (sprite_display_region && toggle_multi_color_bit && multi_color_mode)
      sprite_data <= {sprite_data[21:0], 2'b0};
...
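
To see the multicolor pacing in action, here is a rough C simulation of the shift logic above. The function name and output buffer are hypothetical, and the model assumes we stay inside the display region with multi_color_mode set.

```c
#include <stdint.h>

/* Record the bit pair sprite_data[23:22] on each of n_clocks pixel clocks. */
void sprite_shift_out_multicolor(uint32_t sprite_data, int x_expand,
                                 int n_clocks, uint8_t *pairs_out)
{
    uint8_t toggle = 0;                 /* 2-bit counter, zero on entering the region */
    for (int clk = 0; clk < n_clocks; clk++) {
        pairs_out[clk] = (sprite_data >> 22) & 3;
        /* toggle_multi_color_bit: every 2nd clock, or every 4th when X-expanded */
        int shift_now = x_expand ? (toggle == 3) : (toggle & 1);
        if (shift_now)
            sprite_data = (sprite_data << 2) & 0xFFFFFF;  /* drop one bit pair */
        toggle = (toggle + 1) & 3;
    }
}
```

Each bit pair stays visible for two clocks in normal multicolor mode and for four clocks when the sprite is also X-expanded.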

Now, at any point in time, bits [23:22] will be the value for our current pixel. Next, let us have a look at the meanings of the different bit values, as quoted from https://www.c64-wiki.com/wiki/Sprite:

Pixels with a bit pair of "00" appear transparent, like "0" bits do in high resolution mode.
Pixels with a bit pair of "01" will have the color specified in address 53285/$D025.
Pixels with a bit pair of "11" will have the color specified in address 53286/$D026.
Pixels with a bit pair of "10" will have the color specified assigned to the sprite in question in the range 53287–53294/$D027–D02E.

For the above colors, we need to define extra registers within our VIC-II module, and connect them via input ports on our sprite_generator:

module sprite_generator(
...
  input [3:0] sprite_multi_0,
  input [3:0] sprite_multi_1,
  input [3:0] primary_color,
...
    );

Let us now create a case statement for the different colors:

...
 reg [3:0] output_pixel_multi;
...
  always @*
    case (sprite_data[23:22])
      2'b01: output_pixel_multi = sprite_multi_0;
      2'b10: output_pixel_multi = primary_color;
      2'b11: output_pixel_multi = sprite_multi_1;
      default:  output_pixel_multi = 0;
    endcase
...

Let us wire up the final pieces:

...
assign output_pixel = multi_color_mode ? output_pixel_multi : output_pixel_single;
...
assign show_pixel = sprite_enabled && (multi_color_mode ? !(sprite_data[23:22] == 2'b0) : sprite_data[23]) && sprite_display_region;
...
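
Putting the colour lookup and the transparency check together, a multicolor pixel could be decoded in software roughly as follows. This is a sketch only: the struct and function names are my own, while the parameters mirror the ports of sprite_generator.

```c
#include <stdint.h>

/* Result of decoding the top bit pair of the sprite shift register. */
typedef struct {
    int     show_pixel;    /* 0 means transparent or sprite not shown */
    uint8_t output_pixel;  /* 4-bit C64 colour index */
} sprite_pixel_t;

sprite_pixel_t decode_multicolor_pixel(uint32_t sprite_data, int sprite_enabled,
                                       int in_display_region,
                                       uint8_t sprite_multi_0, /* $D025 */
                                       uint8_t sprite_multi_1, /* $D026 */
                                       uint8_t primary_color)  /* $D027-$D02E */
{
    sprite_pixel_t p = {0, 0};
    uint8_t pair = (sprite_data >> 22) & 3;   /* sprite_data[23:22] */

    switch (pair) {
    case 1:  p.output_pixel = sprite_multi_0; break;
    case 2:  p.output_pixel = primary_color;  break;
    case 3:  p.output_pixel = sprite_multi_1; break;
    default: p.output_pixel = 0;              break;  /* "00": transparent */
    }
    p.show_pixel = sprite_enabled && pair != 0 && in_display_region;
    return p;
}
```

A bit pair of "00" never shows, regardless of the colour registers, which is what lets the background shine through the sprite.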

Test Program and Test Results

To test the code we have developed, we need a simple test program that displays multicolor sprites that are X- and Y-expanded.

There is a nice example program for multicolor expanded sprites in the book Introduction to Basic: Part 2. The book was part of a two book series published in 1983, titled: An Introduction to Basic - The Comprehensive Teach yourself programming series.

Both these books are available for download on archive.org. The program appears on pages 300 and 301 and is titled Glasgow Bus. In this program we have a multicolored sprite expanded in both directions, which moves from left to right.

The following video shows execution of this program within our FPGA C64 implementation:

In Summary

In this post we implemented Sprite expansion capability and multicolor mode within our C64 emulator.

In the next post we will connect up eight instances of our sprite_generator within our C64 emulator, and see if we can make the characters appear when we play the game Dan Dare.

Till next time!

Saturday, 5 October 2019

Implementing Sprites: Part 1

Foreword

In the previous post we have implemented Raster Interrupts as well as Multicolor Text mode.

This enabled us to completely render the status bar and the Background of the game. We were even able to move between screens of the environments.

There was, however, a crucial piece of the game play experience missing: The characters were invisible!

The reason our characters were hidden was that we haven't implemented Sprites in our VIC-II module yet. So, our next focus in this Blog series will be to implement Sprites in our C64 FPGA implementation.

Implementing sprites in a C64 emulator that is still under development can be quite a daunting task. The following tasks come to mind, just to name a few:

Coordinate memory access between fetching Sprite data, fetching screen memory content and fetching character image data.
Mixing the sprite images with text/bitmap mode graphics to get the final picture
Adding functionality to either show sprites in front of text or behind it.
Dealing with transparency
Implementing multicolor mode
Adding functionality to stretch a sprite in either the Y- or X-direction

I have therefore decided to split the implementation of sprites into a number of separate posts.

In this post we will focus on showing a single sprite in front of text.

To test the resulting Sprite implementation, we will be using a simple Basic program for moving a sprite across the screen.

Retrieving Sprite Data from Memory

Let us start our journey of implementing sprite rendering by thinking how we will be fetching sprite data from memory.

A good start will be to review how our VIC-II currently interfaces with memory to get image data. Here is a quick outline:

An output port named addr, for sending the address of the data we want.
The requested data is sent to input port data_in. This port is 12 bits wide: eight bits of data and four bits from Color RAM. In this way, for each screen location the character code and associated color arrive simultaneously, thus eliminating the need for an extra memory cycle to get the color code.
Memory requests are clocked at 2MHz. This translates to 2 memory accesses during an eight pixel period.

If you went through the particulars of the VIC-II, you will see that it clocks memory accesses at 1MHz. So, why am I clocking memory at 2MHz in my VIC-II implementation?

The key to this answer lies in the fact that within a C64, memory access happens on both the rising edge and the falling edge of a 1MHz clock cycle. The VIC-II accesses memory on the rising edge and the 6510 CPU on the falling edge of a 1MHz clock pulse.

There are, however, cases where the VIC-II will access memory on both the rising and falling edge. This happens at the beginning of each character line, where the VIC-II needs to retrieve the character code as well as the relevant pixel data to display. The VIC-II needs that extra time to retrieve the code for the character to be displayed, so the CPU cannot do any memory accesses during these times.

As we can see, memory access times are very tight for the VIC-II, so one might wonder how the VIC-II manages to get some memory cycles for retrieving sprite data. This is where Christian Bauer's write-up on the VIC-II comes to the rescue. The section of interest is 3.6.3, Timing of a raster line.

In this section a couple of VIC-II memory access diagrams are shown for a couple of scenarios. The scenario where sprites 2-7 are active on a raster line gives us a very good idea of where the VIC-II retrieves Sprite data from memory (I have added the legend for convenience):

Cycl-# 6                   1 1 1 1 1 1 1 1 1 1 |5 5 5 5 5 5 5 6 6 6 6 6 6
       5 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 |3 4 5 6 7 8 9 0 1 2 3 4 5 1
        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _| _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ø0 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |_ _ _ _ _ _ _ _ _ _ _ _ _ _
       __                                      |
   IRQ   ______________________________________|____________________________
                             __________________|________
    BA ______________________                  |        ____________________
                              _ _ _ _ _ _ _ _ _| _ _ _ _ _ _ _
   AEC _______________________ _ _ _ _ _ _ _ _ |_ _ _ _ _ _ _ ______________
                                               |
   VIC ss3sss4sss5sss6sss7sssr r r r r g g g g |g g g i i i i 0sss1sss2sss3s
  6510                        x x x x x x x x x| x x x x X X X
                                               |
Graph.                      |===========0102030|7383940============
                                               |
X coo. \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\|\\\\\\\\\\\\\\\\\\\\\\\\\\\\
       1111111111111111111111111110000000000000|1111111111111111111111111111
       999aaaabbbbccccddddeeeeffff0000111122223|344445555666677778888889999a
       48c048c048c048c048c048c048c048c048c048c0|c048c048c048c048c04cccc04c80

  c  Access to video matrix and Color RAM (c-access)
  g  Access to character generator or bitmap (g-access)
 0-7 Reading the sprite data pointer for sprite 0-7 (p-access)
  s  Reading the sprite data (s-access)
  r  DRAM refresh
  i  Idle access

  x  Read or write access of the processor
  X  Processor may do write accesses, stops on first read (BA is low and so
     is RDY)

Let us unpack the above diagram a bit. Right on top we have cycle number, which is a count that gets incremented at a frequency of 1MHz.

The diagram starts with the last cycle of the previous line, which in this case is 65. This tells us that the diagram is for an NTSC VIC-II variant.

The diagram then continues from cycle 1 to cycle 19. For clarity the cycles from 20 to 52 are omitted in the diagram, and we continue again from cycle 53 to 65 and the first cycle of the next line.

You will also notice that after each cycle number there is a space. This space is there for the periods where the VIC-II might make use of both the rising and the falling edge of the 1MHz clock cycle for retrieving data.

The Graph row shows us when the electron beam is busy drawing on the screen. Making that statement felt a bit weird to me, since we don't really use CRTs anymore :-)

On the Graph row, an equals sign (=) marks the period when the border is drawn, and a digit marks the period when we are actually drawing character data.

With the Graph row as reference, let us have a look at the VIC row to see when sprite data is read from memory.

We can actually see that just before we are finished drawing a screen line, which is obviously a border operation, we start reading sprite info. It starts with: 0sss. We read sprite data pointer 0 from the end of screen memory, followed by three bytes of sprite data.

We do the same for sprites 1 to 7 and we stop just before we start drawing the border of the next raster line.

Also, please note that on the diagram, there is no space between the sprite pointer and sprite data symbols. So, when we get the relevant data for a sprite, we are using all available memory cycles.

In short, the sprite data is retrieved during the non-visible parts of a screen line, or to use PAL and NTSC terminology, during the front- and back-porch of a scan line.

Implementing sprite data Memory accesses

The previous section gives us a good indication of where we should implement the reading of sprite data during the drawing of a line.

Admittedly, in our VIC-II we don't start with a blank period on a rasterline, but rather start immediately to draw the border at the beginning of each raster line.

We will, however, use the blank period at the end of the rasterline to read sprite data.

Let us start with some calculations. There are 4 memory accesses per sprite (i.e. the sprite pointer and three data bytes). For sprites there are two memory accesses per cycle. So, a sprite needs two cycles to get all its data.

We will be using our x_pos counter to determine when to read sprite data. For this purpose, we write the following code:

...
wire sprite_data_region;
...
assign sprite_data_region = (x_pos > 368 && x_pos < 496);
...

Our rasterline is 504 pixels, so we start reading sprite data at the very end of the line. Obviously the sprite pixels will be shown in the next line.

It is also more convenient to work with an offset of 368, rather than an absolute x_pos value:

...
wire [9:0] sprite_data_region_offset;
...
assign sprite_data_region_offset = {1'b0, x_pos} - 368;
...

Within this sprite data region, each sprite gets its data in a time period of 16 pixels. We can therefore extract the following information from sprite_data_region_offset:

bits 6 - 4: sprite number
bits 3 - 0: Time cycle within the data cycle of current sprite. This is useful for orchestrating the various reads that should happen for the sprite.
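
The bit-field split above can be sketched in C as follows. The struct, function and macro names are assumptions of mine; the constants come from the text (eight sprites of 16 pixels each give 128 pixels, which is why the region runs from 368 up to 496).

```c
#include <stdint.h>

#define SPRITE_FETCH_START 368   /* x_pos where the sprite data region begins */

typedef struct {
    uint8_t sprite_num;  /* bits 6-4 of the offset */
    uint8_t slot;        /* bits 3-0: pixel period within this sprite's window */
} sprite_fetch_pos_t;

/* Split an x_pos inside the sprite data region into sprite number and slot. */
sprite_fetch_pos_t decode_sprite_fetch_pos(uint16_t x_pos)
{
    uint16_t offset = x_pos - SPRITE_FETCH_START;  /* sprite_data_region_offset */
    sprite_fetch_pos_t pos;
    pos.sprite_num = (offset >> 4) & 7;
    pos.slot       = offset & 0xF;
    return pos;
}
```

For example, x_pos 384 is the first pixel period of sprite 1, and x_pos 495 is the last pixel period of sprite 7.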

Next, let us focus on address generation. With this we should keep in mind that our VIC-II module reads data from a memory port that is clocked at 2MHz. In relation to a group of 8 pixels, this clock pulses at pixel number 3 and pixel number 7.

Let us change our address generation functionality as follows:

...
reg [7:0] sprite_data_location; 
wire [1:0] sprite_byte_num;
...
assign sprite_byte_num = sprite_data_region_offset[3:2] + 2'b11;
...
always @(posedge clk_in)
if (sprite_data_region && sprite_data_region_offset[3:0] == 4)
  sprite_data_location <= data_in[7:0];
...
     always @*
       if (!sprite_data_region && (clk_counter == 6 | clk_counter == 7))
         addr = bit_data_pointer;       
       else if (sprite_data_region && (sprite_data_region_offset[3:0] < 3))
         addr = {mem_pointers[7:4], 7'h7f, sprite_data_region_offset[6:4]};
       else if (sprite_data_region)
         addr = {sprite_data_location, (sprite_0_offset + sprite_byte_num)}; 
       else
         addr =  {mem_pointers[7:4], screen_mem_pos};
...

So, in each sprite section, we need to ensure we assert the address for the applicable sprite pointer before the first 2MHz clock pulse triggers. When this clock pulse triggers, the Block RAM will return the value of the sprite pointer in question.

We store this sprite pointer in a register called sprite_data_location, one pixel period after the Block RAM has returned this pointer.

Sprite_data_location will be used to generate addresses for the actual sprite data. In addition, we need the following pieces of information to generate the addresses for the sprite data:

sprite offset: Line number of the sprite we need data for, as a linear address. This will be a multiple of three. For instance, should we need the sprite data for line 2, we will specify 6 as the offset.
sprite_byte_num: Selects byte 0, 1 or 2 of the requested line.
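
The two address concatenations can be modelled in C as shown below. This is a minimal sketch with function names of my own choosing; it computes the 14-bit addresses as seen within the current VIC-II bank.

```c
#include <stdint.h>

/* Address of the pointer for sprite_num: the last 8 bytes of the 1K screen block. */
uint16_t sprite_pointer_addr(uint8_t mem_pointers, uint8_t sprite_num)
{
    uint16_t screen_base = (mem_pointers >> 4) & 0xF;   /* mem_pointers[7:4] */
    return (uint16_t)((screen_base << 10) | (0x7F << 3) | (sprite_num & 7));
}

/* Address of one sprite data byte: pointer * 64 + line * 3 + byte_num. */
uint16_t sprite_data_addr(uint8_t sprite_data_location, uint8_t line, uint8_t byte_num)
{
    uint8_t low6 = (uint8_t)((line * 3 + byte_num) & 0x3F);
    return (uint16_t)((sprite_data_location << 6) | low6);
}
```

With the screen at address 1024, the pointer for sprite 0 lands on 2040, the location familiar from BASIC sprite programs. A pointer value of 13 places the sprite data at address 832.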

We will be using bits 3 and 2 of sprite_data_region_offset for the sprite_byte_num. One should remember that we only start reading sprite data from combination 01 and not from combination 00.

During combination 00 we are still reading sprite_data_location. We therefore need to subtract one from this bit combination to get the actual bit combination.

As a quick hack, I achieved this subtraction by one by just adding 2'b11 with the help of Two's complement.
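
The reason this works is that, in a 2-bit field, adding 3 and subtracting 1 give the same result modulo 4. A small C sketch (the function name simply mirrors the wire name) shows the mapping:

```c
#include <stdint.h>

/* Bits 3-2 of the offset minus one, computed as "plus 3" in two bits. */
uint8_t sprite_byte_num(uint8_t sprite_data_region_offset)
{
    uint8_t slot_group = (sprite_data_region_offset >> 2) & 3;  /* bits 3-2 */
    return (uint8_t)((slot_group + 3) & 3);   /* same as (slot_group - 1) mod 4 */
}
```

Bit combinations 01, 10 and 11 map to byte numbers 0, 1 and 2. Combination 00 maps to 3, but that value is never used because the sprite pointer is being fetched during that period.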

With this piece of code in place we will be receiving the sprite data for all the sprites during the applicable time frames. It is up to us to actually catch this data at the right time and manipulate it up to the point that the sprite gets rendered on the screen.

For all this functionality it makes sense to encapsulate it in a sprite_generator module, which we will cover in the next section.

The Sprite Generator module

Let us start our Sprite Generator module by providing input ports indicating the current Raster Position and Sprite position:

module sprite_generator(
  input clk_in,
  input [8:0] raster_y_pos,
  input [8:0] raster_x_pos,
  input [8:0] sprite_x_pos,
  input [7:0] sprite_y_pos,
...
    );

First thing we need to calculate is the linear address for the sprite line we want data for:

...
  wire [8:0] next_raster;
  wire [5:0] request_line;
...
  assign next_raster = raster_y_pos + 1;
  assign request_line = next_raster - sprite_y_pos;
  assign request_line_offset = (request_line << 1) + request_line;
...

The calculation of the linear address involves multiplying the line number by three, which we achieve by shifting the line number left by one bit and adding the line number to the result.
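
The shift-and-add trick is easy to verify in C. This sketch uses a hypothetical function name and is meant for the valid sprite lines 0 to 20 only:

```c
#include <stdint.h>

/* Byte offset of a sprite line within its 63 bytes of data: line * 3. */
uint8_t sprite_line_offset(uint8_t request_line)
{
    /* (line << 1) is line * 2; adding line once more gives line * 3 */
    return (uint8_t)((request_line << 1) + request_line);
}
```

Line 2 gives an offset of 6, and the last line, 20, gives an offset of 60, which still fits in six bits.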

Next, we should add a 3 byte shift register to our sprite generator that can shift the data bytes in when they arrive, as well as shift the bits out when we are in the display region of the sprite:

module sprite_generator(
...
  input store_byte,
  input [7:0] data,
...
    );
...
  wire sprite_display_region;
  reg [23:0] sprite_data;
...
  assign sprite_display_region = (raster_y_pos >= sprite_y_pos && raster_y_pos < (sprite_y_pos + 21)) &&
                                 (raster_x_pos >= sprite_x_pos && raster_x_pos < (sprite_x_pos + 24));
...
  always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region)
      sprite_data <= {sprite_data[22:0], 1'b0};
...
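
The two operations on the shift register can be sketched in C as two small helpers. The function names are assumptions, and the 24-bit register is held in the low bits of a 32-bit integer.

```c
#include <stdint.h>

/* store_byte: append one fetched byte at the low end of the 24-bit register. */
uint32_t sprite_store_byte(uint32_t sprite_data, uint8_t data)
{
    return ((sprite_data << 8) | data) & 0xFFFFFF;
}

/* One pixel clock in the display region: drop the top bit, shift in a zero. */
uint32_t sprite_shift_pixel(uint32_t sprite_data)
{
    return (sprite_data << 1) & 0xFFFFFF;
}
```

After three stores the register holds exactly the three bytes of the current sprite line, with the first byte fetched ending up in the top eight bits. Whatever was in the register before is pushed out completely.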

So, when we are within the visible region of the sprite we shift out the contents of sprite_data one pixel at a time. What we need to do next is to output a color for each bit we shift out.

For now we will only output the color white if the bit is a 1:

module sprite_generator(
...
  output [3:0] output_pixel,
...
    );

...
assign output_pixel = sprite_data[23] ? 4'b1 : 0;
...

One thing we should keep in mind, is that bits with the value zero are in fact transparent. So, we need to have an additional output port indicating whether the current pixel should be transparent or not.

A transparent sprite pixel is just one condition in which the VIC-II shouldn't show the current sprite pixel. There are additional conditions in which our VIC-II shouldn't show the pixel of a sprite:

The sprite is disabled
We are not currently within the area of the sprite on the screen

Let us wrap all these conditions together and output to a single port:

module sprite_generator(
...
  input sprite_enabled,
  output show_pixel,
...
    );
...
  assign show_pixel = sprite_enabled && sprite_data[23] && sprite_display_region;
...

Wiring everything up

With our sprite generator created, let us hook it up to our VIC-II module.

We need to end up with 8 sprite generators. That is a sprite_generator for each sprite. However, for now, to keep things simple, we will just be using a single one.

To start with, we are going to implement some more of the VIC-II registers in our VIC-II module:

reg [7:0] sprite_0_xpos;
reg [7:0] sprite_0_ypos;
reg [7:0] sprite_1_xpos;
reg [7:0] sprite_1_ypos;
reg [7:0] sprite_2_xpos;
reg [7:0] sprite_2_ypos;
reg [7:0] sprite_3_xpos;
reg [7:0] sprite_3_ypos;
reg [7:0] sprite_4_xpos;
reg [7:0] sprite_4_ypos;
reg [7:0] sprite_5_xpos;
reg [7:0] sprite_5_ypos;
reg [7:0] sprite_6_xpos;
reg [7:0] sprite_6_ypos;
reg [7:0] sprite_7_xpos;
reg [7:0] sprite_7_ypos;
reg [7:0] sprite_msb_x = 0;
reg [7:0] sprite_enabled;

always @(posedge clk_1_mhz)
     case (addr_in)
       6'h00: data_out_reg <= sprite_0_xpos;
       6'h01: data_out_reg <= sprite_0_ypos;
       6'h02: data_out_reg <= sprite_1_xpos;
       6'h03: data_out_reg <= sprite_1_ypos;
       6'h04: data_out_reg <= sprite_2_xpos;
       6'h05: data_out_reg <= sprite_2_ypos;
       6'h06: data_out_reg <= sprite_3_xpos;
       6'h07: data_out_reg <= sprite_3_ypos;
       6'h08: data_out_reg <= sprite_4_xpos;
       6'h09: data_out_reg <= sprite_4_ypos;
       6'h0a: data_out_reg <= sprite_5_xpos;
       6'h0b: data_out_reg <= sprite_5_ypos;
       6'h0c: data_out_reg <= sprite_6_xpos;
       6'h0d: data_out_reg <= sprite_6_ypos;
       6'h0e: data_out_reg <= sprite_7_xpos;
       6'h0f: data_out_reg <= sprite_7_ypos;
       6'h10: data_out_reg <= sprite_msb_x;
       6'h15: data_out_reg <= sprite_enabled;
       
       6'h20: data_out_reg <= {4'b0,border_color};
       6'h21: data_out_reg <= {4'b0,background_color};
       6'h22: data_out_reg <= {4'b0,extra_background_color_1};
       6'h23: data_out_reg <= {4'b0,extra_background_color_2};
       6'h11: data_out_reg <= {y_pos_real[8],screen_control_1[6:0]};
       6'h12: data_out_reg <= {y_pos_real[7:0]};
       6'h16: data_out_reg <= screen_control_2;
       6'h18: data_out_reg <= mem_pointers;
       6'h19: data_out_reg <= {7'h0,raster_int};
       6'h1a: data_out_reg <= int_enabled;
     endcase

always @(posedge clk_1_mhz)
begin
  if (we & addr_in == 6'h00)
    sprite_0_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h01)
    sprite_0_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h02)
     sprite_1_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h03)
     sprite_1_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h04)
     sprite_2_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h05)
     sprite_2_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h06)
     sprite_3_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h07)
     sprite_3_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h08)
     sprite_4_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h09)
     sprite_4_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0a)
     sprite_5_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0b)
     sprite_5_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0c)
     sprite_6_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0d)
     sprite_6_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0e)
     sprite_7_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0f)
     sprite_7_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h10)
     sprite_msb_x <= reg_data_in[7:0];
  else if (we & addr_in == 6'h15)
     sprite_enabled <= reg_data_in[7:0];

...
end

We now have enough information to link up most of the input ports of our sprite_generator:

sprite_generator sprite_0(
  .raster_y_pos(y_pos),
  .raster_x_pos(x_pos),
  .sprite_x_pos({sprite_msb_x[0],sprite_0_xpos}),
  .sprite_y_pos(sprite_0_ypos),
  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[0]),
...
    );

We still need to connect the input port store_byte. The following snippet of code will basically generate this signal for us:

...
reg store_sprite_pixel_byte;
...
always @*
  case (sprite_data_region_offset[3:0])
    7, 11, 15:  store_sprite_pixel_byte = sprite_data_region && 1;
    default:  store_sprite_pixel_byte = 0;
  endcase
...

So, for any given sprite data phase, pixel periods 7, 11 and 15 are the periods just after the sprite data bytes were retrieved from block RAM. At these times we would like to persist the data byte to our sprite_generator.

However, this signal fires for the data bytes of all the sprites on a rasterline, so we cannot use it as is for the store_byte input port. So, when we assign to the store_byte input port, we add an extra check to make sure the data byte is meant for our sprite_generator:

sprite_generator sprite_0(
...
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 0),
...
    );
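
To check the timing of this strobe, the combined condition can be modelled in C. This is a sketch under my own naming; the constants 368 and 496 are taken from the sprite_data_region definition earlier in this post.

```c
#include <stdint.h>

/* Returns 1 when the byte arriving at x_pos should be stored by sprite_num. */
int sprite_store_byte_strobe(uint16_t x_pos, uint8_t sprite_num)
{
    int in_region = x_pos > 368 && x_pos < 496;   /* sprite_data_region */
    uint16_t offset = x_pos - 368;                /* sprite_data_region_offset */
    uint8_t slot = offset & 0xF;                  /* bits 3-0 */

    /* store_sprite_pixel_byte: a data byte has just arrived from Block RAM */
    int byte_ready = in_region && (slot == 7 || slot == 11 || slot == 15);

    /* only the generator whose number matches bits 6-4 takes the byte */
    return byte_ready && ((offset >> 4) & 7) == sprite_num;
}
```

Each sprite generator therefore sees exactly three store pulses per rasterline, all within its own 16 pixel window.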

All our input ports are now connected.

Let us now see how we can use the output ports of our sprite_generator to render sprites with our VIC-II module:

wire show_pixel_sprite_0;
wire [3:0] out_pixel_sprite_0;


sprite_generator sprite_0(
...
  .show_pixel(show_pixel_sprite_0),
  .output_pixel(out_pixel_sprite_0),
...
    );
...
   assign color_for_bit = multicolor_data ? multi_color :    
            (pixel_shift_reg[7] == 1 ? char_buffer_out_delayed[11:8] : background_color);
   assign color_for_bit_with_sprite = show_pixel_sprite_0 ? out_pixel_sprite_0 : color_for_bit;

   assign final_color = (visible_vert & visible_horiz & screen_enabled) ? color_for_bit_with_sprite : border_color;
...

The actual mixing of the sprite images and the main graphics happens in the wire color_for_bit_with_sprite. If our sprite_generator asserts the show_pixel output port, the sprite pixel will be shown. Otherwise we show a pixel of the main graphics.
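
The priority between border, sprite and main graphics can be summarised with a short C sketch. The function name and the in_visible_window parameter are my own shorthand for the visible_vert, visible_horiz and screen_enabled signals combined.

```c
#include <stdint.h>

/* Pick the colour index for the current pixel. */
uint8_t mix_final_color(int in_visible_window, int show_pixel_sprite_0,
                        uint8_t out_pixel_sprite_0, uint8_t color_for_bit,
                        uint8_t border_color)
{
    if (!in_visible_window)   /* the border always wins */
        return border_color;
    /* inside the window: the sprite pixel covers the main graphics */
    return show_pixel_sprite_0 ? out_pixel_sprite_0 : color_for_bit;
}
```

Note that with this ordering a sprite that overlaps the border area is hidden by the border.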

This concludes the sprite implementation for this post.

In the following sections we will test our implementation.

Creating a Testbed for simulation

As you can see from this post, there is quite a bit of code that needs to be written just for a single sprite to be displayed.

When undertaking a task like the one in this post, it is always handy to have a simulation Testbed at hand, like the one we created in a previous post where we implemented Multicolor Bitmap mode.

The RAM image of such a Testbed contains data for a test image that enables us to test snippets of new code within seconds.

We start by getting hold of any simple program in C64 BASIC that will display a sprite for us. For this purpose I will be using a program in the C64 Users Guide, which is discussed on pages 68 - 71.

This program will show the following image as a sprite:

The program listing is as follows:

In this program we need to make two small modifications, which involves removing the clear screen statement, and using Sprite zero instead of sprite 2.

We remove the Clear Screen statement because we actually want to see that the sprite renders correctly against a background.

We will use the VICE C64 emulator to run this code and to create a RAM image for our testbed. For this I will be using the same process as we used in the previous post where we developed multicolor bitmap mode. So I will not cover the process here.

Here is a quick screenshot of the VICE C64 emulator executing the test BASIC program:

We have the balloon with the BASIC program as a background. We need to create a RAM image from a Vice Snapshot for our testbed.

Our testbed should then render more or less the same image.

Test Results

Here is an image from our Testbed:

The colors are a bit different from our standard C64 startup colors. This is because I reused the testbed from a previous post where we developed the multicolor bitmap mode.

Other than that, this image looks more or less the same as the VICE screenshot in the previous section.

However, you will notice that the sprite has a bit of an offset compared to the VICE emulator screenshot. Taking the fourth data element on line 220 (i.e. the value 3) as reference, you will see that on the VICE screenshot the balloon is situated southeast of the '3', whereas in our testbed image the balloon appears west of the '3'.

These kinds of offsets are probably to be expected, since our C64 FPGA implementation is by no means 100% cycle accurate compared to a real C64.

For now we will just do some offset hacks to get the sprite displayed in the correct position:

sprite_generator sprite_0(
...
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
...
    );

With these changes, let us see how this program runs on the physical FPGA:

This correlates more or less to the VICE rendering of this BASIC program.

In Summary

In this post we have started to implement sprite functionality within our C64 FPGA.

In the next couple of posts we will continue to implement sprite functionality.

Till next time!