C64 on an FPGA: October 2019

Wednesday, 30 October 2019

Implementing Sound: Part 1

Foreword

In the previous post we finished off implementing sprites into our C64 emulator.

This enabled us to fully play the game Dan Dare within our C64 emulator.

In this post we will start to implement a nice-to-have: Sound emulation.

To implement SID emulation from scratch can be quite a daunting task. So, I did some searching on the Internet to see if I could find an existing SID-implementation, written in Verilog.

As part of this post, I will also show how to create a test bench for evaluating such an implementation.

The chosen SID implementation

After some searching on the Internet, I am game across a nice SID implementation on Github coded in SystemVerilog, by Thomas Kindler. Here is the link to the project:

https://github.com/thomask77/verilog-sid-mos6581

I don't have hand-on experience with SystemVerilog, but according to many resources, Vivado does in fact support SystemVerilog. So my SystemVerilog disadvantage shouldn't be much of a setback :-)

Thete was one thing I did experience when using this code with Vivado. For output ports feeding off sequential elements, you need to declare with the 'reg' keyword.

Let us do an overview on how Thomas Kindler's SID module works. Thomas Kindler based much of the inner workings on the Interview with Bob Yannes, the designer of the original SID chip. A copy of this interview can be found on a couple of places on the Internet, including here.

At the heart of each voice on the SID is a 24-bit phase accumulator, clocked at 1MHz.

The phase accumulator is the work horse for generating one complete cycle for the desired waveform at the desired frequency.

Each Voice on the SID have a phase accumulator and gets incremented at each clock cycle by the value stored in its 16 bit frequency register. That is registers 54272+54273 for Voice 1, 54279+54280 for Voice 2, and 54286+54287 for Voice 3.

If a Voice circuit have a frequency value of 1, we will therefore be producing waveforms with a period of 16 seconds, which is a frequency less than 1Hz. On the other hand, a frequency value of 65535 will yield a waveform with a frequency of about 4KHz.

Let us have a look at how the different waveforms gets created.

The triangle waveform is generated as follows:

...
   out_triangle = acc[22:12] << 1;
...
    if (acc[23])
        out_triangle ^= '1;
...

We are using the lower 23 bits of the phase accumulator. Our waveform starts off increasing till bit 23 of the phase accumulator gets set, after which we do a XOR on the resulting values, giving us a mirrored image of the previous time period.

Generating sawtooth is much simpler:

    out_saw      = acc[23:12];

And also pulse:

   out_pulse    = acc[23:12] < pw ? '1 : '0;

Noise gets generated with a Fibonacci sequence, which will not be going into detail here.

I also not be going into the specifics of Envelope generation. I will only mention here that Envelope generation applies ADSR (Attack, Decay, Sustain, Release) to the resulting waveform, and is dealt with in the file sid_env.sv.

Creating the Testbed

The simple part of creating a Testbed for the SID module is wiring up all the ports, supplying a clock signal, and applying some reset logic.

The complex part comes to get hold of a sequence of SID register writes that will generate a sound that we can evaluate by ear.

The most obvious way to get hold of such sequences would be to intercept writes to SID registers when executing the applicable program within a C64 emulator.

Some time ago a write an emulator in JavaScript, here, which I am going to use for this purpose. The full source of this emulator can be found here. We should now briefly put our JavaScript thinking caps on :-)

I will try though to keep the discussion on this JavaScript emulator short, trying to convey just the basic idea. Should some of you would like to have a more detailed blog post on this, please drop me a comment.

The place where we will be doing the interception of SID writes, will be within the file memory.js, within the following method:

  function IOWrite(address, value) {
    if ((address >= 0xdc00) & (address <= 0xdcff)) {
      return ciaWrite(address, value);
    } else if ((address >= 0xd000) & (address <= 0xd02e)) {
      return myVideo.writeReg(address - 0xd000, value);
    } else if ((address >= 0xd800) & (address <= 0xdbe8)) {
      return myVideo.writeColorRAM (address - 0xd800, value);
    } else {
      IOUnclaimed[address - 0xd000] = value;
      if ((address & 0xff00) == 54272)
      mysid.log(address & 31, value);
      return;
    } 
  }

mysid is the instance of a class that we still need to define, so let us start with the outline of the class:

function sid() {
var mycpu;
var lasttime = 0;

  this.setCpu = function(cpu) {
    mycpu = cpu;
  }

  this.log = function(addr, val) {
    diff = mycpu.getCycleCount() - lasttime;
    lasttime = mycpu.getCycleCount();

  }

}

One important thing when recording the writes, is to also record the exact time instance when the write happened. For this purpose we need a handle to the CPU instance to get the current Cycle Count.

With these pieces of information at hand, we can create a series of Verilog statement for the register writes that we can place in an initial-begin..end block. We will write these statements to a text area defined within our HTML page. With these changes are log function looks as follows:

...
  this.log = function(addr, val) {
    diff = mycpu.getCycleCount() - lasttime;
    lasttime = mycpu.getCycleCount();
        var ins = document.getElementById("diss");
        var temp = ins.value+ "\n#"+diff*10+";\n@(negedge clk)\n"+
          "rw = 0; n_cs = 0; addr = 5'd"+addr+
          "; data = "+val+";\n@(negedge clk)\n"+"rw=1; n_cs=1;";
        ins.value = temp;
  }
...

A typical sequence of this generated Verilog code looks as follows:

#580;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd11; data = 32;
@(negedge clk)
rw=1; n_cs=1;
#712860;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd0; data = 162;
@(negedge clk)
rw=1; n_cs=1;
#80;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd1; data = 37;
@(negedge clk)
rw=1; n_cs=1;
#200;
@(negedge clk)
rw = 0; n_cs = 0; addr = 5'd4; data = 32;
@(negedge clk)
rw=1; n_cs=1;

So, we delay each set of assignments by a certain period of time as captured by the log function.

Interesting, JavaScript generating Verilog code!

Next, let us write some code for capturing the produced sound samples to a file so that we can listen to the produced sound later on:

...
integer f = 0;
integer i = 0;
...
initial begin
  f = $fopen("sound.raw","wb");
  #100;
  for (i = 0; i < 90000000; i = i + 1) begin
    @(negedge clk)
    if ((i% 20) == 1)
    begin     
      $fwrite(f,"%c",audio[7:0]);
      $fwrite(f,"%c",audio[15:8]);
    end
  end
$fclose(f);

end

The sound gets produced at a rate of 1MHz. I am reducing the sample rate to 48KHz by catching only every 20th sample. In this way most sound player would be able to keep up.

Audacity is a Opensource program that allows you to import and play these raw samples.

Test Results

Let us listen to the resulting sounds.

Our first attempt is kind of successful, but there is a bit of distortion:

The distortion is more visible within the wave editor of Audacity:

One can clearly see the waveform goes off the screen in a couple of places.

Looking at the source file sid_filter.sv of the SID implementation, one kind of get a feeling of where things go wrong:

out_next = (out_next * reg_vol) >> 2;

Here we multiply the final sample with the master volume and divide the result by 4. When inspecting the waveform during a Vivado simulation, multiplying by a master volume of 15 sometimes yield a number that is way past the range of a 16-bit number, and dividing by 4 simply isn't enough. I fix this by dividing by 8 instead of four:

out_next = (out_next * reg_vol) >> 3;

The result is much better, although not taking advantage of the full volume range:

In Summary

In this post we started implementing sound within our emulator.

We evaluated Thomas Kindler's SID implementation and found it work very well.

Many thanks for Thomas Kindler for making the source of this implementation available on Github.

In the next post we will continue to integrate this SID core to our C64 core.

Till next time!

Monday, 14 October 2019

Implementing Sprites: Part 3

Foreword

In the previous post we have implemented the capability for our sprite to expand in both the X and Y directions. We also have implemented Sprite multicolor mode.

Up to now are our VIC-II only supported a single sprite, Sprite 0. So, in this post we will be connecting the remaining seven sprites.

With our VIC-II module able to display all eight sprites, we would be able to fully play the game Dan Dare with our emulator.

This would indeed be a very nostalgic moment for me, but raised a bit of a concern for me. If one is going to play extended periods on the Zybo with our emulator, wouldn't the Zynq SoC eventually get very hot?

My concern was driven by the fact that these days you find quite a number of videos on the Internet concerning cooling solutions for single board computers. With this in mind, when you come to the Zybo board, you cannot really find any information regarding what kind of temperatures to expect during general use of the board.

So, I will end off this post by sharing what I have found by experimentation regarding the temperature of the Zynq when run our emulator for half an hour or so.

Hooking up the remaining sprites

Currently we only have a single instance of sprite_generator for sprite 0. Let u start by adding instances for the remaining sprites. For simplicity, I am only showing the declarations for the first three:

sprite_generator sprite_0(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[0],sprite_0_xpos}),
  .sprite_y_pos(sprite_0_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 0),

  .x_expand(x_expand[0]),
  .y_expand(y_expand[0]),
  .multi_color_mode(multi_color_mode[0]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_0),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[0]),
  .show_pixel(show_pixel_sprite_0),
  .output_pixel(out_pixel_sprite_0),
  .request_data(),
  .request_line_offset(sprite_0_offset)
    );

sprite_generator sprite_1(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[1],sprite_1_xpos}),
  .sprite_y_pos(sprite_1_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 1),

  .x_expand(x_expand[1]),
  .y_expand(y_expand[1]),
  .multi_color_mode(multi_color_mode[1]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_1),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[1]),
  .show_pixel(show_pixel_sprite_1),
  .output_pixel(out_pixel_sprite_1),
  .request_data(),
  .request_line_offset(sprite_1_offset)
    );

sprite_generator sprite_2(
  .clk_in(clk_in),
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
  .sprite_x_pos({sprite_msb_x[2],sprite_2_xpos}),
  .sprite_y_pos(sprite_2_ypos),
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 2),

  .x_expand(x_expand[2]),
  .y_expand(y_expand[2]),
  .multi_color_mode(multi_color_mode[2]),
  .sprite_multi_0(sprite_multi_color_0),
  .sprite_multi_1(sprite_multi_color_1),
  .primary_color(sprite_primary_color_2),

  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[2]),
  .show_pixel(show_pixel_sprite_2),
  .output_pixel(out_pixel_sprite_2),
  .request_data(),
  .request_line_offset(sprite_2_offset)
    );

This is a typical copy and paste exercise. However, some of the ports is specific to the sprite itself. Here is a list of these ports:

sprite_x_pos/ sprite_y_pos
store_byte
x_expand/y_expand
multi_color_mode
primary_color
sprite_enabled
show_pixel/out_pixel
request_line_offset

Obviously some ports will get its value from a particular bit position in a register, whereas the other ports in this list have there own dedicated registers.

Let us have a look at the output ports. The first port is sprite_x_offset. We use these ports as follows:

   always @*
     case (sprite_data_region_offset[6:4])
       3'd0: sprite_offset = sprite_0_offset;
       3'd1: sprite_offset = sprite_1_offset;
       3'd2: sprite_offset = sprite_2_offset;
       3'd3: sprite_offset = sprite_3_offset;
       3'd4: sprite_offset = sprite_4_offset;
       3'd5: sprite_offset = sprite_5_offset;
       3'd6: sprite_offset = sprite_6_offset;
       3'd7: sprite_offset = sprite_7_offset;
    endcase

     always @*
       if (!sprite_data_region && (clk_counter == 6 | clk_counter == 7))
         addr = bit_data_pointer;       
       else if (sprite_data_region && (sprite_data_region_offset[3:0] < 3))
         addr = {mem_pointers[7:4], 7'h7f, sprite_data_region_offset[6:4]};
       else if (sprite_data_region)
         addr = {sprite_data_location, (sprite_offset + sprite_byte_num)}; 
       else
         addr =  {mem_pointers[7:4], screen_mem_pos};

So, we use the applicable sprite_offset when it is the data cycle for a particular sprite.

We sit with a couple of show_pixel/output_pixel pairs for each sprite. We combine these as follows:

always @*
   if (show_pixel_sprite_0)
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1)
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2)
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3)
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4)
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5)
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6)
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7)
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else
     color_for_bit_with_sprite = color_for_bit;

   assign color_for_bit = multicolor_data ? multi_color :    
            (pixel_shift_reg[7] == 1 ? char_buffer_out_delayed[11:8] : background_color);
   assign final_color = (visible_vert & visible_horiz & screen_enabled) ? color_for_bit_with_sprite : border_color;

For now we only assume that all sprites are in front of the main graphics, implementing the hardcoded priority, where the lower the sprite number, the higher the priority.

When we run our implementation on the Zybo board with these changes, it looks very promising: our characters have finally appeared!

One small thing doesn't look right though. Our characters are always in front of everything! They appear in front of rocks. Also, when we walk underwater, using a reed as a snorkel, only the reed should be visible. This is not the case with our emulator in its current state:

We see Dan Dare, his pet, and the Snorkel!

OK, i agree, this shouldn't come as a surprise, since we implemented sprites to be always visible in front of the background graphics.

Fine tuning Sprite display priority

There is a couple of Sprite priority functionality that should be implemented before our game screen can render correctly.

The first priority is priority according the Sprite priority register at address D01B. Firstly we need to implement this register into our VIC-II so it be be written to or read by the 6502. This is similar to the other registers we have implemented.

We use this register as follows:

always @*
   if (show_pixel_sprite_0 && !sprite_priority[0])
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1 && !sprite_priority[1])
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2 && !sprite_priority[2])
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3 && !sprite_priority[3])
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4 && !sprite_priority[4])
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5 && !sprite_priority[5])
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6 && !sprite_priority[6])
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7 && !sprite_priority[7])
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else if (pixel_shift_reg[7])
     color_for_bit_with_sprite = color_for_bit;
   else if (show_pixel_sprite_0)
     color_for_bit_with_sprite = out_pixel_sprite_0;
   else if (show_pixel_sprite_1)
     color_for_bit_with_sprite = out_pixel_sprite_1;
   else if (show_pixel_sprite_2)
     color_for_bit_with_sprite = out_pixel_sprite_2;
   else if (show_pixel_sprite_3)
     color_for_bit_with_sprite = out_pixel_sprite_3;
   else if (show_pixel_sprite_4)
     color_for_bit_with_sprite = out_pixel_sprite_4;
   else if (show_pixel_sprite_5)
     color_for_bit_with_sprite = out_pixel_sprite_5;
   else if (show_pixel_sprite_6)
     color_for_bit_with_sprite = out_pixel_sprite_6;
   else if (show_pixel_sprite_7)
     color_for_bit_with_sprite = out_pixel_sprite_7;
   else
     color_for_bit_with_sprite = color_for_bit;

Within this snippet of code, you spot another implied priority by means of the check for pixel_shift_reg[7].

So, if this background pixel has a bit value of zero, it is actually transparent, allowing the sprites with background priorities to show through.

Obviously, if there is neither a visible sprite pixel with back or front priority, we will show the applicable background color.

There is a very interesting scenario when our main graphics is in multicolor mode. In Multicolor mode the high order bit indicates whether it is a background pixel or not. This means that we can have two possible background colors in multicolor mode, pixel value 00 and 01.

Having two background colors enables us to have a sprite that is sometimes hidden behind some objects and in front of others.

The following video shows how the game screen now looks with the recent round of changes:

This time around our emulator renders the scene more realistic. We go behind the rocks and is not visible when we go underwater.

This is actually the great nostalgic moment, what all this whole series of Blog posts were about!

Temperatures on the Zynq SoC

I mentioned in the beginning of this post that I have a bit of a concern on the temperature of the Zynq SoC when you are using it for extended periods of time.

One of my key areas for this concern is our USB stack program that runs on one of the ARM cores, which catches keystrokes from the USB keyboard and send it to our emulator hosted within the FPGA. To get an overall context, here is the main method of our USB stack program:

int main()
{
    Xil_DCacheDisable();
    init_platform();
    initint();
    initUsb();
    status = 0;
    state_machine();
    usleep(100000000);
    cleanup_platform();
    return 0;
}

We do some initialisation, and then we sleep for a long period of time (which in this case is 100 seconds). This sleep is necessary so our program is not terminated as a whole.

The code that does the actual work is the method state_machine, which is invoked every 10 milliseconds by a timer interrupt.

It should be noted that this program is running in standalone mode, and the usleep library call is implemented using busy waiting.

With busy waiting your CPU runs at full speed checking in a loop for something something to happen, which in this case is for 100 seconds to past.

As we know with busy waiting, your CPU is effectively running at 100% utilisation all the time, which uses more energy and produces more heat.

So, how heat will be produced by above program when we run for about half an hour?

Vivado provides some tools for us to answer this question. On the Hardware dashboard, temperature is one of the probes you can add.

When I started the emulator on the Zybo board, the temperature was around 51^oC. Within minutes temperature has risen to about 54^oC.

After about half an hour, the temperature settled to about 58^oC.

This wasn't as bad a I have expected. For interest sake, I was wondering whether you could do some overclocking on the Zybo.

Some fiddling of the settings in the Vivado Block design, it doesn't really look like there is any real overclocking options. The only options that I could see, was to set the frequency of an ARM core between 50MHz to 667MHz.

So, in short, it doesn't look like using the Zybo board for long periods would cause any kinds of overheating.

Also, busy waiting didn't appear to be a big issue after all. However, I was still wondering what kind of temperature difference it would make if we could avoid busy waiting.

On ARM processors, the instruction WFI (wait for interrupt) is provided for this purpose. As per the documentation on ARM's web site:

WFI (Wait For Interrupt) makes the processor suspend execution (Clock is stopped) until one of the following events take place:

An IRQ interrupt
An FIQ interrupt
A Debug Entry request made to the processor.

So, in our case when we call WFI, our CPU would freeze until our timer interrupt fires externally:

int main()
{
    Xil_DCacheDisable();
    init_platform();
    initint();
    initUsb();
    status = 0;
    state_machine();
    asm("loop: wfi");
    asm("b loop");
    cleanup_platform();
    return 0;
}

Here we with added some inline assembly for invoking wfi. It should be remembered once an external interrupt has occurred and has been served, code execution will continue just after the wfi instruction.

It is therefore important that we loop back to the wfi instruction. If we don't, our main method will run to completion.

When we monitor the temperature when we use the WFI method, the Zynq definitely runs cooler. During this run I saw a temperature between 52^oC and 53^oC. About a 5 degree difference!

In Summary

In this post we implemented all eight sprites within our VIC-II module. We also implemented the different priorities between Sprites and the Background.

This indeed brought us to the point where we could fully play the game Dan Dare within our emulator.

With this we are nearing almost the end of this Blog Series. There is, however, one more thing I would like to do, and this is to see if it is possible to add sound to the emulator.

So, in the next post we will start to implement sound.

Till next time!

Thursday, 10 October 2019

Implementing Sprites: Part 2

Foreword

In the previous post we added some very basic sprite functionality to our C64 emulator that enabled us to show a moving sprite.

In this post we will continue to add some more sprite functionality, which will involve the capability to expand a sprite and multicolor mode.

Sprite Expansion

Sprites on the VIC-II has the capability to be expanded in both the X direction and the Y direction.

Register D017 has a bit for each sprite indicating whether it should be expanded in the Y direction.

Similarly, register D01D has a bit for each sprite indicating whether a sprite should be expanded in the X Direction.

So, let us start off by redirecting the bits of registers D017 and D01D to our sprite_generator module as input ports:

module sprite_generator(
...
  input x_expand,
  input y_expand,
...
    );

The first thing that is effected if a sprite is expanded, is its display region. So let us modify we determine this region:

...
  wire [5:0] sprite_width;
  wire [5:0] sprite_height;
...
  assign sprite_height = y_expand ? 42 : 21;
  assign sprite_width = x_expand ? 48 : 24;
...
  assign sprite_display_region = (raster_y_pos >= sprite_y_pos && raster_y_pos < (sprite_y_pos + sprite_height)) &&
                                 (raster_x_pos >= sprite_x_pos && raster_x_pos < (sprite_x_pos + sprite_width));
...

Next, let us consider what should happen when we expand the sprite in the Y direction. In such a case our 21 line sprite should cover 42 lines of the sprite area on the screen. So, each line should be repeated twice.

We do this by dividing the current line within the active sprite area by two:

...
  wire [5:0] request_line_pre;
...
  assign request_line_pre = next_raster - sprite_y_pos;
  assign request_line = y_expand ? (request_line_pre>>1) : request_line_pre;
...

Similarly, when we expand in the X direction we need to repeat each pixel on a line twice. We do this by shifting a pixel out only every second clock cycle:

...
 reg [1:0] toggle; 
...
  always @(posedge clk_in)
  if (!sprite_display_region)
    toggle <= 0;
  else
    toggle <= toggle + 1;
...
 assign toggle_single_color_bit = x_expand ? toggle[0] : 1;
...
  always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region && toggle_single_color_bit)
      sprite_data <= {sprite_data[22:0], 1'b0};
...

We achieve this slow down with the toggle counter. This counter only needs to be one bit wide for now. The reason I made it two bits wide, is for multicolor mode later on.

Multicolor Mode

Let us tackle multicolor mode next.

As we know, in multi color mode our pixels is two pixels wide. This means that we need to shift out two pixels at a time when in multicolor mode. This also implies that we can only do this shift every second clock cycle.

This sounds very similar to the previous section when we need to expand our sprite in the X direction.We will therefore make use again of the toggle counter.

When we are expanding a multicolor sprite in the X direction, we need to slow down the clocking out of the pixels even more: Four clock cycles per pixels.

With all this in mind, we need to add the following code:

...
assign toggle_multi_color_bit = x_expand ? (toggle[1:0] == 2'b11) : toggle[0];
...
 always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region && toggle_single_color_bit && !multi_color_mode)
      sprite_data <= {sprite_data[22:0], 1'b0};
    else if (sprite_display_region && toggle_multi_color_bit && multi_color_mode)
      sprite_data <= {sprite_data[21:0], 2'b0};
...

Now, at any point in time, bits [23:22] will be the value for our current pixel. Next, let us have a look of the meanings for the different bit values, as quoted from https://www.c64-wiki.com/wiki/Sprite:

Pixels with a bit pair of "00" appear transparent, like "0" bits do in high resolution mode.
Pixels with a bit pair of "01" will have the color specified in address 53285/$D025.
Pixels with a bit pair of "11" will have the color specified in address 53286/$D026.
Pixels with a bit pair of "10" will have the color specified assigned to the sprite in question in the range 53287–53294/$D027–D02E.

For above colors, we need to define extra registers within our VIC-II module, and connect it via input ports on our sprite_generator:

module sprite_generator(
...
  input [3:0] sprite_multi_0,
  input [3:0] sprite_multi_1,
  input [3:0] primary_color,
...
    );

Let us now create a case statement for the different colors:

...
 reg [3:0] output_pixel_multi;
...
  always @*
    case (sprite_data[23:22])
      2'b01: output_pixel_multi = sprite_multi_0;
      2'b10: output_pixel_multi = primary_color;
      2'b11: output_pixel_multi = sprite_multi_1;
      default:  output_pixel_multi = 0;
    endcase
...

Let us wire up some finals:

...
assign output_pixel = multi_color_mode ? output_pixel_multi : output_pixel_single;
...
assign show_pixel = sprite_enabled && (multi_color_mode ? !(sprite_data[23:22] == 2'b0) : sprite_data[23]) && sprite_display_region;
...

Test Program and Test Results

To test the code we have developed, we need a simple test program that displays multicolor sprites that is X- and Y-Expanded.

There is a nice example program for multicolor expanded sprites, in the Book Introduction to Basic: Part 2. The Book was part of a two book series published in 1983, titled: An Introduction to Basic - The Comprehensive Teach yourself programming series.

Both these books are available for download on archive.org. The program appear on pages 300, 301 and is titled Glasgow Bus. In this program we have a multicolored sprite expanded in both directions, which moves from left to right.

The following video shows execution of this program within our FPGA C64 implementation:

In Summary

In this post we implemented Sprite expansion capability and multicolor mode within our C64 emulator.

In the next post we will connect up eight instances of our sprite_generator within our C64 emulator, and see if we can make the characters appear when we play the game Dan Dare.

Till next time!

Saturday, 5 October 2019

Implementing Sprites: Part 1

Foreword

In the previous post we have implemented Raster Interrupts as well as Multicolor Text mode.

This enabled us to completely render the status bar and the Background of the game. We were even able to move between screens of the environments.

There was, however, a crucial piece of the game play experience missing: The characters were invisible!

The reason I our characters were hidden, was because we haven't implemented Sprites in our VIC-II module yet. So, our next focus in this Blog series would be to implement Sprites in C64 FPGA implementation.

To implement sprites in an upcoming C64 emulator can be quite a daunting task. The following tasks come to mind, just to name a few:

Coordinate memory access between fetching Sprite data, fetching screen memory content and fetching character image data.
Mixing the sprite images with Text /bitmap mode graphics to get the final picture
Adding functionality to either show sprites in front of text or behind it.
Dealing with transparency
Implementing multicolor mode
Adding functionality to stretch a sprite in ether the Y- or X-direction

I have therefore decided to split the implementation of sprites into a number of separate posts.

In this post we will focus on showing a single sprite in front of text.

To test the resulting Sprite implementation, we will be using a simple Basic program for moving a sprite across the screen.

Retrieving Sprite Data from Memory

Let us start our journey of implementing sprite rendering by thinking how we will be fetching sprite data from memory.

A good start will be to review how our VIC-II currently interface with memory to get image data. Here is a quick outline:

Output port named addr for sending required address of which we want data for.
Requested data is send to input port data_in. This port is 12 bits wide, eight bits data and 4 bits from Color RAM. In this way for each screen location the character code and associated color arrives simultaneously, thus eliminating the need for an extra memory cycle to get the color code.
Memory requests is clocked at 2MHz. This translates to 2 memory accesses during an eight pixel period.

If you went through the particulars of the VIC-II, you will see that it clocks memory accesses at 1MHz. So, why am I clocking memory at 2MHz in my VIC-II implementation?

The key to this answer lies in the fact that within a C64 memory access happens on both the rising edge and falling edge of a 1Mhz clock cycle. The VIC-II access memory on the rising edge and the 6510 CPU on the falling edge of a 1MHz clock pulse.

There is, however, cases where the VIC-II will access memory on both the rising and falling edge. This happens at the beginning of each character line, where the VIC-II needs to retrieve the character code as well as the relevant pixel data to display. The VIC-II needs that extra time to retrieve the code for the character to be displayed, so the CPU cannot do any memory accesses during these times.

As we can see memory access times is very tight for the VIC-II, so one might wonder how the VIC-II manages to get some memory cycles for retrieving spite data. This is where Christian Bauer's write-up on the VIC-II comes to the rescue, as explained here. The section of interest is 3.6.3, Timing of a raster line.

In this section a couple of VIC-II memory access diagrams is shown for a couple of scenarios. The scenario where sprites 2-7 is active on a raster line gives us a very good idea where the VIC-II rertrieves Sprite data from memory (I have added the legend for convenience):

Cycl-# 6                   1 1 1 1 1 1 1 1 1 1 |5 5 5 5 5 5 5 6 6 6 6 6 6
       5 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 |3 4 5 6 7 8 9 0 1 2 3 4 5 1
        _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _| _ _ _ _ _ _ _ _ _ _ _ _ _ _
    ø0 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ |_ _ _ _ _ _ _ _ _ _ _ _ _ _
       __                                      |
   IRQ   ______________________________________|____________________________
                             __________________|________
    BA ______________________                  |        ____________________
                              _ _ _ _ _ _ _ _ _| _ _ _ _ _ _ _
   AEC _______________________ _ _ _ _ _ _ _ _ |_ _ _ _ _ _ _ ______________
                                               |
   VIC ss3sss4sss5sss6sss7sssr r r r r g g g g |g g g i i i i 0sss1sss2sss3s
  6510                        x x x x x x x x x| x x x x X X X
                                               |
Graph.                      |===========0102030|7383940============
                                               |
X coo. \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\|\\\\\\\\\\\\\\\\\\\\\\\\\\\\
       1111111111111111111111111110000000000000|1111111111111111111111111111
       999aaaabbbbccccddddeeeeffff0000111122223|344445555666677778888889999a
       48c048c048c048c048c048c048c048c048c048c0|c048c048c048c048c04cccc04c80

  c  Access to video matrix and Color RAM (c-access)
  g  Access to character generator or bitmap (g-access)
 0-7 Reading the sprite data pointer for sprite 0-7 (p-access)
  s  Reading the sprite data (s-access)
  r  DRAM refresh
  i  Idle access

  x  Read or write access of the processor
  X  Processor may do write accesses, stops on first read (BA is low and so
     is RDY)

Let us unpack the above diagram a bit. Right on top we have cycle number, which is a count that gets incremented at a frequency of 1MHz.

The diagram starts with the last cycle of the previous line, which in this case is 65, which is obviously an NTSC VIC-II variant.

The diagram then continues from cycle#1 to cycle#19. For clarity the cycles from 20-52 is omitted in the diagram, and we continue again from cycle 53 to 65 and the first pixel of the next line.

You will also notice that after each each cycle number there is a space. This is just for the potential time period where the VIC-II might make use of both the rising and the falling edge of the 1MHz clock cycle for retrieving data.

The Graph row shows us when the electron beam is busy drawing on the screen. Making that statement felt a weird for me since we don't really use CRT's anymore :-)

On the Graph row equal(=) is the period when the Border is drawn, and a digit the when we are actually drawing character data.

With the Graph row as reference, let us have a look at the VIC row to see when sprite data is read from memory.

We can actually see that just before we are finished drawing a screen line, which is obviously a border operation, we start reading sprite info. It starts with: 0sss. We read sprite data pointer 0 from the end of screen memory, proceeded by reading three bytes of sprite data.

We do the same for sprites 1 to 7 and we stop just before we start drawing the border of the next raster line.

Also, please note that on the diagram, there is no space between the sprite pointer and sprite data data symbols. So, when we get the relevant data for a sprite, we are using all available memory cycles.

In short, the sprite data is retrieved during the none visible parts of a screen line, or to use PAL and NTSC terminology, during the front- and back-porch of a scan line.

Implementing sprite data Memory accesses

The previous section gives us a good indication where we should implement the reading of sprite data during the drawing of line.

Admitted, in our VIC-II we don't start with a blank period on a rasterline, but rather start immediately to draw the border on the beginning of each raster line.

We will, however, use the blank period of the end of the rasterline to read sprite data.

Let us start to some calculations. There are 4 memory accesses per sprite (e.g. sprite pointer and three bytes). For sprites there are two memory accesses per cycle. So, a sprite needs two cycles to get all its data.

We will be using our x_pos counter to determine when to read sprite data. For this purpose, we write the following code:

...
sprite_data_region;
...
assign sprite_data_region = (x_pos > 368 && x_pos < 496);
...

Our rasterline is 504 pixels, so we start reading sprite data at the very end of the line. Obviously the sprite pixels will be shown in the next line.

It is also more convenient to work with an offset of 368, rather than an absolute x_pos value:

...
wire [9:0] sprite_data_region_offset;
...
assign sprite_data_region_offset = {1'b0, x_pos} - 368;
...

Within this sprite data region, each sprite gets its data in a time period 16 pixels. We can therefore extract the following information from sprite_data_region_offset:

bits 6 - 4: sprite number
bits 3 - 0: Time cycle within the data cycle of current sprite. This is useful for orchestrating the various reads that should happen for the sprite.

Next, let us focus on address generation. With this we should keep in my mind that our VIC-II module reads data from a memory port that is clocked at 2MHz. In relation to a group of 8 pixels this clock pulse at pixel number 3 and pixel number 7.

Let us change our address generation functionality as follows:

...
reg [7:0] sprite_data_location; 
wire [1:0] sprite_byte_num;
...
assign sprite_byte_num = sprite_data_region_offset[3:2] + 2'b11;
...
always @(posedge clk_in)
if (sprite_data_region && sprite_data_region_offset[3:0] == 4)
  sprite_data_location <= data_in[7:0];
...
     always @*
       if (!sprite_data_region && (clk_counter == 6 | clk_counter == 7))
         addr = bit_data_pointer;       
       else if (sprite_data_region && (sprite_data_region_offset[3:0] < 3))
         addr = {mem_pointers[7:4], 7'h7f, sprite_data_region_offset[6:4]};
       else if (sprite_data_region)
         addr = {sprite_data_location, (sprite_0_offset + sprite_byte_num)}; 
       else
         addr =  {mem_pointers[7:4], screen_mem_pos};
...

So, in each sprite section, we need to ensure we assert the address for the applicable sprite pointer before the first 2MHz clock pulse trigger. When this clock pulse trigger the Block RAM will return the value of the sprite pointer in question.

We store this sprite pointer in a register called sprite_data_location at a pixel period after the Block RAM returned this pointer.

Sprite_data_location will be used to generate addresses for the actual sprite data. We need the following pieces of information in addition to generate the addresses for the sprite data:

sprite offset: Line number of the sprite we need data for as a linear address. This will be a factor of three. For instance, should we need the sprite data for line 2, we will specify 6 as the offset.
sprite_byte_num: Either return 0,1 or 2 of the requested line.

We will be using bits 3 and 2 of sprite_data_region_offset for the sprite_byte_num. One should rather remember that we only start reading sprite data from combination 01 and not from combination 00.

During combination 00 we are still reading sprite_data_location. We therefore need to subtract one from this bit combination to get the actual bit combination.

As a quick hack, I achieved this subtraction by one by just adding 2'b11 with the help of Two's complement.

With this piece of code in place we will be receiving the sprite data for all the sprites during the applicable time frames. It is up to use to actually catch this data at the right time and manipulate it up to the point that the sprite gets rendered on the screen.

For all this functionality it makes sense to encapsulate it in a sprite_generator module, which we will cover in the next section.

The Sprite Generator module

Let us start our Sprite Generator module by providing input ports indicating the current Raster Position and Sprite position:

module sprite_generator(
  input clk_in,
  input [8:0] raster_y_pos,
  input [8:0] raster_x_pos,
  input [8:0] sprite_x_pos,
  input [7:0] sprite_y_pos,
    );

First thing we need to calculate is the linear address for the sprite line we want data for:

...
  wire [8:0] next_raster;
  wire [5:0] request_line;
...
  assign next_raster = raster_y_pos + 1;
  assign request_line = next_raster - sprite_y_pos;
  assign request_line_offset = (request_line << 1) + request_line;
...

The calculation of the linear address involves multiplying the line number by three, which we achieve by left shifting the line number by left and adding the line number to the result.

Next, we should add a 3 byte shift register to our sprite generator that can shift the data bytes in when it arrives, as well as shifting the bits out when we are in the display region of the sprite:

module sprite_generator(
...
  input store_byte,
  input [7:0] data,
...
    );
...
  wire sprite_display_region;
  reg [23:0] sprite_data;
...
  assign sprite_display_region = (raster_y_pos >= sprite_y_pos && raster_y_pos < (sprite_y_pos + 21)) &&
                                 (raster_x_pos >= sprite_x_pos && raster_x_pos < (sprite_x_pos + 24));
...
  always @(posedge clk_in)
    if (store_byte)
      sprite_data <= {sprite_data[15:0], data[7:0]};
    else if (sprite_display_region)
      sprite_data <= {sprite_data[22:0], 1'b0};
...

So, when we are within the visible region of the sprite we shift out the contents of sprite_date one pixel at a time. What we need to next, is to output a color for each bit we shift out.

For now we will only output the color white if the bit is a 1:

module sprite_generator(
...
  output [3:0] output_pixel,
...
    );

...
assign output_pixel = sprite_data[23] ? 4'b1 : 0;
...

One thing we should keep in mind, is that bits with the value zero are in fact transparent. So, we need to have an additional output port indicating whether the current pixel should be transparent or not.

If a sprite pixel is transparent, it just a condition indicating that the VIC-II shouldn't show the current sprite pixel. There are additional conditions in which our VIC-II shouldn't show the pixel of a sprite:

The sprite is disabled
We are not current within the area of the sprite on the screen

Let us wrap all these conditions together and output to a single port:

module sprite_generator(
...
  input sprite_enabled,
  output show_pixel,
...
    );
...
  assign show_pixel = sprite_enabled && sprite_data[23] && sprite_display_region;
...

Wiring everything up

With our sprite generator created, let us hook up to our VIC-II module.

We need to end up with 8 sprite generators. That is a sprite_generator for each sprite. However, for now, to keep things simple, we will just be using a single one.

To start with, we are going to implement some more of the VIC-II registers in our VIC-II module:

reg [7:0] sprite_0_xpos;
reg [7:0] sprite_0_ypos;
reg [7:0] sprite_1_xpos;
reg [7:0] sprite_1_ypos;
reg [7:0] sprite_2_xpos;
reg [7:0] sprite_2_ypos;
reg [7:0] sprite_3_xpos;
reg [7:0] sprite_3_ypos;
reg [7:0] sprite_4_xpos;
reg [7:0] sprite_4_ypos;
reg [7:0] sprite_5_xpos;
reg [7:0] sprite_5_ypos;
reg [7:0] sprite_6_xpos;
reg [7:0] sprite_6_ypos;
reg [7:0] sprite_7_xpos;
reg [7:0] sprite_7_ypos;
reg [7:0] sprite_msb_x = 0;
reg [7:0] sprite_enabled;

always @(posedge clk_1_mhz)
     case (addr_in)
       6'h00: data_out_reg <= sprite_0_xpos;
       6'h01: data_out_reg <= sprite_0_ypos;
       6'h02: data_out_reg <= sprite_1_xpos;
       6'h03: data_out_reg <= sprite_1_ypos;
       6'h04: data_out_reg <= sprite_2_xpos;
       6'h05: data_out_reg <= sprite_2_ypos;
       6'h06: data_out_reg <= sprite_3_xpos;
       6'h07: data_out_reg <= sprite_3_ypos;
       6'h08: data_out_reg <= sprite_4_xpos;
       6'h09: data_out_reg <= sprite_4_ypos;
       6'h0a: data_out_reg <= sprite_5_xpos;
       6'h0b: data_out_reg <= sprite_5_ypos;
       6'h0c: data_out_reg <= sprite_6_xpos;
       6'h0d: data_out_reg <= sprite_6_ypos;
       6'h0e: data_out_reg <= sprite_7_xpos;
       6'h0f: data_out_reg <= sprite_7_ypos;
       6'h10: data_out_reg <= sprite_msb_x;
       6'h15: data_out_reg <= sprite_enabled;
       
       6'h20: data_out_reg <= {4'b0,border_color};
       6'h21: data_out_reg <= {4'b0,background_color};
       6'h22: data_out_reg <= {4'b0,extra_background_color_1};
       6'h23: data_out_reg <= {4'b0,extra_background_color_2};
       6'h11: data_out_reg <= {y_pos_real[8],screen_control_1[6:0]};
       6'h12: data_out_reg <= {y_pos_real[7:0]};
       6'h16: data_out_reg <= screen_control_2;
       6'h18: data_out_reg <= mem_pointers;
       6'h19: data_out_reg <= {7'h0,raster_int};
       6'h1a: data_out_reg <= int_enabled;
     endcase

always @(posedge clk_1_mhz)
begin
  if (we & addr_in == 6'h00)
    sprite_0_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h01)
    sprite_0_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h02)
     sprite_1_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h03)
     sprite_1_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h04)
     sprite_2_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h05)
     sprite_2_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h06)
     sprite_3_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h07)
     sprite_3_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h08)
     sprite_4_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h09)
     sprite_4_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0a)
     sprite_5_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0b)
     sprite_5_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0c)
     sprite_6_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0d)
     sprite_6_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0e)
     sprite_7_xpos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h0f)
     sprite_7_ypos <= reg_data_in[7:0];
  else if (we & addr_in == 6'h10)
     sprite_msb_x <= reg_data_in[7:0];
  else if (we & addr_in == 6'h15)
     sprite_enabled <= reg_data_in[7:0];

...
end

We now have enough information to link up most of the inputs ports of our sprite_generator:

sprite_generator sprite_0(
  .raster_y_pos(y_pos),
  .raster_x_pos(x_pos),
  .sprite_x_pos({sprite_msb_x[0],sprite_0_xpos}),
  .sprite_y_pos(sprite_0_ypos),
  .data(data_in[7:0]),
  .sprite_enabled(sprite_enabled[0]),
    );

We still need to connect the input port store_byte. The following snippet of code will basically generate this signal for us:

...
reg store_sprite_pixel_byte;
...
always @*
  case (sprite_data_region_offset[3:0])
    7, 11, 15:  store_sprite_pixel_byte = sprite_data_region && 1;
    default:  store_sprite_pixel_byte = 0;
  endcase
...

So, for any given sprite data phase, pixel periods 7, 11 and 15 is the periods just after sprite data bytes was retrieved from block RAM. At these times we would like to persist the data byte to our sprite_generator.

However, this code will return the data bytes for all the sprites for a rasterline, so we cannot use as is for the store_byte input port. So, when we assign to the store_byte input port, we just an extra just to make sure the data byte is meant for our sprite_generator:

sprite_generator sprite_0(
...
  .store_byte(store_sprite_pixel_byte && sprite_data_region_offset[6:4] == 0),
...
    );

All our input ports are now connected.

Let us now see how we can use the output ports of our sprite_generator to render sprites with our VIC-II module:

wire show_pixel_sprite_0;
wire [3:0] out_pixel_sprite_0;


sprite_generator sprite_0(
...
  .show_pixel(show_pixel_sprite_0),
  .output_pixel(out_pixel_sprite_0),
...
    );
...
   assign color_for_bit = multicolor_data ? multi_color :    
            (pixel_shift_reg[7] == 1 ? char_buffer_out_delayed[11:8] : background_color);
   assign color_for_bit_with_sprite = show_pixel_sprite_0 ? out_pixel_sprite_0 : color_for_bit;

   assign final_color = (visible_vert & visible_horiz & screen_enabled) ? color_for_bit_with_sprite : border_color;
...

The actual mixing of the sprite images and the main graphics happens in the wire color_for_bit_with_sprite. If our sprite_generator asserts the show_pixel output port, the sprite pixel will be shown. Otherwise we show a pixel of the main graphics.

This concludes the sprite implementation for this post.

In the following sections we will test our implementation.

Creating a Testbed for simulation

As you can see from this post, there is quite a bit of code that needs to be written just for a single sprite to be displayed.

When undertaking a task like in this post, it is always handy to have a simulation Testbed at hand, like we had created a previous post where we implemented Multicolor Bitmap mode.

The RAM image of such a Testbed contains data for a test image that enables us to test snippets of new code within seconds.

We start by getting hold of any simple program in C64 BASIC that will display a sprite for us. For this purpose I will be using a program in the C64 Users Guide, which is discussed on pages 68 - 71.

This program will show the following image as a sprite:

The program listing is as follows:

In this program we need to make two small modifications, which involves removing the clear screen statement, and using Sprite zero instead of sprite 2.

We remove the Clear Screen statement because we want actually to see that the sprite renders correctly against a background.

We will use Vice C64 emulator to run this code and to create a RAM image for our testbed. For this I will be using the same process as we used in the previous post where we have developed multicolor bitmap mode. So I will not cover the process here.

Here is quick screenshot of Vice C64 emulator executing the test BASIC program:

We have the balloon with the BASIC program as a background. We need to create a RAM image from a Vice Snapshot for our testbed.

Our testbed should then render more or less the same image.

Test Results

Here is an image from our Tesbed

The colora are a bit different from our standard C64 startup colors. This is because a Reused the testbed from a previous post where we developed the multicolor bitmap mode.

Other than that, this image looks more or less the same as the Vice screenshot in the previous section.

However, you will realise the sprite has a bit of a offset compared to the VICE emulator screenshot. Taking the fourth data element on line 220 (e.g. value 3) as reference, you will see on the VICE screenshot the balloon is situated Southeast of the '3', whereas in our testbed image the balloon appears west of the '3'.

These kind of offsets is probably expected, since our C64 FPGA implementation is by no means 100% cycle accurate compared to a real C64.

For now we will just do some offset hacks to get the sprite displayed in the correct position:

sprite_generator sprite_0(
...
  .raster_y_pos(y_pos - 5),
  .raster_x_pos(x_pos - 16),
...
    );

With these changes, let us see how this program runs on the physical FPGA:

This correlates more or less to the VICE rendering of this BASIC program.

In Summary

In this post we have started to implement sprite functionality within our C64 FPGA.

In the next couple of posts we will continue to implement sprite functionality.

Till next time!