C64 on an FPGA: April 2019

Friday, 26 April 2019

Integrating Tape Interface with C64 Module: Part 1

Foreword

In the previous post we created the building blocks for simulating the PWM output of a Commodore 1530 Datasette given a .TAP file.

What we need to do next is integrate our Tape simulator with our existing C64 module, so we can start loading programs stored in .TAP format.

Integrating our Tape simulator with our C64 module brings one concern to mind: memory access complexity. In our previous post we only needed one AXI port for memory access, but now we need three AXI ports:

One for writing produced VIC-II frames to SDRAM.
A second AXI port, used by the VGA block to retrieve the produced VIC-II frames from SDRAM for display.
A third AXI port, to be implemented in this post, for reading .TAP file data for our Tape interface.

Adding a third AXI port is probably not a big deal, but it is important to make sure that each component gets its fair share of memory bandwidth.

Thus, in this post we will focus on integrating the Tape interface with SDRAM, alongside the VIC-II and VGA modules. We will then verify that there is no memory bandwidth contention by checking that the frames are rendered properly on screen and that the Tape interface produces pulses whose widths correspond to the .TAP file stored in memory.

Encapsulating the Tape interface

In the previous post we developed the various building blocks for our Tape interface, but we haven't yet discussed how to glue these blocks together.

We need to create a module containing instances of these blocks:

module tape_interface(
  input clk,
  input clk_1_mhz,
  input restart,
  input reset,
  output [31:0] ip2bus_mst_addr,
  output [11:0] ip2bus_mst_length,
  input [31:0] ip2bus_mstrd_d,
  output [4:0] ip2bus_inputs,
  input [5:0] ip2bus_otputs,
  output pwm
    );
    
wire data_valid_read_word;
wire [7:0] byte_data;
wire [31:0] r_word_data_out;
wire ack_byte_slice;
wire data_valid_byte;
wire ack_sample_assem;
wire [23:0] timer_val;
wire load_timer;
    
read_word r_word(
      .clk(clk),
      .restart(restart),
      .reset(reset),
      .ack(ack_byte_slice),
      .ip2bus_mst_addr(ip2bus_mst_addr),
      .ip2bus_mst_length(ip2bus_mst_length),
      .ip2bus_mstrd_d(ip2bus_mstrd_d),
      .data_wire_out(r_word_data_out),
      .ip2bus_inputs(ip2bus_inputs),
      .ip2bus_otputs(ip2bus_otputs),
      .data_valid(data_valid_read_word)
        );
        
byteslicer byte_slice(
          .clk(clk_1_mhz),
          .data_valid(data_valid_read_word),
          .byte_out(byte_data),
          .ack(ack_byte_slice),
          .data_in(r_word_data_out),
          .restart(restart),
          .data_valid_out(data_valid_byte),
          .read(ack_sample_assem)
            );
        
sample_assembler samp_assem(
              .clk(clk_1_mhz),
              .data_valid(data_valid_byte),
              .data(byte_data),
              .ack(ack_sample_assem),
              .pwm(pwm),
              .timer_val(timer_val),
              .load_timer(load_timer),
              .restart(restart)
                );
        
tape_pwm t_pwm(
                  .time_val(timer_val),
                  .load_timer(load_timer),
                  .pwm(pwm),
                  .clk(clk_1_mhz)
                    );
    
endmodule

Let us now see how this block fits in with the rest of the design:

Our new AXI interface block is the one at the bottom left. This is the AXI interface block we will use for retrieving .TAP file data from SDRAM.

In the above screenshot I am also showing one of the existing AXI interface blocks. The reason is that this block also includes a slave port, which we will extend so that our ARM core can access information about the tape_interface (e.g. PWM and Restart).

Later in this post we will write a C program, running on the ARM core, that measures the pulse widths on the PWM pin. In this way we can determine whether the correct pulses are produced.

We will again use Block diagram assistance, which I described in an earlier post, to connect the new AXI Master port.

For interest's sake, here is what the resulting wiring looks like:

Our key component here is axi_smc. It now has three S_AXI inputs on the left, whereas previously there were only two.

At the output of axi_smc, however, we still have only two M_AXI ports. This is because the Zynq block only supports two S_AXI_GP ports. So two of our input AXI ports need to share a GP port, and this is where memory mapping becomes very important, as we will see in the next section.

Memory Mapping

Let us have a look at the address mapping:

The first block, myip_burst_read_test_0, is the AXI read block that fetches data for the VGA block. This block consumes most of the memory bandwidth in our design, and it uses GP0.

You can see that the next two blocks, myip_burst_read_test_0 and myip_burst_test_0, both share S_AXI_GP1 of the Processing System. Together these two blocks consume much less bandwidth than the VGA block, which is why I have grouped them on the same GP port.

Verifying the design

With everything hooked up, it is time to verify the design. We will do this by checking that the VGA output is rendered correctly and that the pulse widths produced by our Tape interface are correct.

The most difficult part is validating the pulse widths. For this purpose we are going to write a C program that will execute on one of the ARM cores, which will measure the time period of the pulses.

The program is as follows:

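#include <stdio.h>
#include "xil_exception.h"
#include "xparameters.h"
#include "platform.h"
#include "xil_printf.h"
#include "xil_cache.h"
#include "xil_io.h"
#include "xscugic.h"
#include "xgpiops.h"
#include <unistd.h>

// USB0 general purpose timer 0 of the Zynq (counts down in microseconds)
#define GPTIMER0_LOAD 0xE0002080
#define GPTIMER0_CTRL 0xE0002084
#define TIMER_LOAD_US 15000000

// AXI slave register of our design: bit 0 = PWM output, bit 1 = restart
#define TAPE_REG 0x43c00008

void scheduleTimer(int usec) {
    // set the timer reload value
    Xil_Out32(GPTIMER0_LOAD, usec);
    // reset the timer so that the counter picks up the reload value
    Xil_Out32(GPTIMER0_CTRL, 0x40000000);
    // start the timer in repeat mode
    Xil_Out32(GPTIMER0_CTRL, 0x81000000);
}

// Wait for one full PWM period: wait for the output to go low, then high again
static void waitForRisingEdge(void) {
    while (Xil_In32(TAPE_REG) & 1) {
    }
    while (!(Xil_In32(TAPE_REG) & 1)) {
    }
}

int main()
{
    init_platform();
    // pulse the restart bit so the tape interface starts from the beginning
    Xil_Out32(TAPE_REG, 2);
    usleep(1000);
    Xil_Out32(TAPE_REG, 0);
    usleep(1000);
    scheduleTimer(TIMER_LOAD_US);

    // wait for the PWM output to go high
    while (!(Xil_In32(TAPE_REG) & 1)) {
    }

    // skip the pulses preceding the file header
    for (int i = 0; i < (0x6a00 - 32); i++)
        waitForRisingEdge();

    u32 time = Xil_In32(GPTIMER0_CTRL) & 0xffffff;
    int numbers[100];
    for (int i = 0; i < 100; i++) {
        waitForRisingEdge();
        u32 end = Xil_In32(GPTIMER0_CTRL) & 0xffffff;
        if (end > time)   // counter reached zero and reloaded with TIMER_LOAD_US
            numbers[i] = time + (TIMER_LOAD_US - end);
        else
            numbers[i] = time - end;
        time = end;       // this rising edge starts the next period
    }

    // print the measured periods in microseconds
    for (int i = 0; i < 100; i++)
        xil_printf("%d\r\n", numbers[i]);

    cleanup_platform();
    return 0;
}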

In this program we use one of the timers in the USB block of the Zynq for accurate time measurement. This timer counts down in steps of one microsecond and is set to run continuously in repeat mode.

One limitation of this USB timer is that its 24-bit counter can only measure durations of up to about 16.7 seconds (2^24 microseconds), which is a bit too short for our purposes, since the header in a .TAP file only starts arriving about 18 seconds into the tape.

We can, however, overcome this limitation by checking whether the pulse end time is larger than the start time (remember that our counter counts down). If it is, we know the timer has wrapped around and we can adjust the result to reflect the true duration. Since every individual pulse is far shorter than the timer period, at most one wrap-around can occur during a single measurement.
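The wrap-around correction can be isolated into a small helper. This is a sketch rather than code from the project: elapsed_ticks is a hypothetical name, and it assumes the counter reloads from the value written to its load register (15000000 in the scheduleTimer call) after reaching zero.

```c
#include <stdint.h>

/* Elapsed ticks between two readings of a down-counting timer that
   reloads to 'reload' after reaching zero (repeat mode). Valid as long as
   at most one reload happens between the two readings; accurate to within
   one tick, since the reload step itself is not counted. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end, uint32_t reload)
{
    if (end > start)                   /* counter reached zero and reloaded */
        return start + (reload - end);
    return start - end;                /* no wrap: counting down, start >= end */
}
```

For example, a pulse that starts at a count of 100 and ends at 14999900 with a reload value of 15000000 spans 200 microseconds.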

In the code you will also see frequent references to register 0x43c00008. This register contains two bits of importance:

Bit 0: PWM output
Bit 1: Restart bit

When the code runs, it skips the first 0x69e0 (0x6a00 - 32) pulses, bringing us close to the start of the pulses for the file header. We then gather a small set of sensible pulses, which we can use to spot-check the pulse widths produced.

In Summary

In this post we have integrated our Tape Interface with the memory subsystem of our C64 design as a third AXI access port.

We also verified that the pulses produced by the Tape Interface are sensible values.

In the next post we will continue integrating our Tape module with the rest of the C64 system, gradually working our way to an implementation where we load a program from a .TAP file into our C64 system.

Till next time!

Tuesday, 9 April 2019

Creating the Tape Interface on the Zybo Board

Foreword

In the previous post we managed to play sound on the Zybo Board.

In this post we will again focus on developing a cassette interface driven by a .TAP file stored in main memory.

Once developed, we will test this interface by playing its output through speakers.

High Level Overview

Let us start with a high-level look at what we want to achieve. The following block diagram describes it in a nutshell:

We start off with a .TAP file stored in the SDRAM of the ZYBO board. The contents of this file are transferred word by word via the AXI protocol to our FPGA logic.

You might recall from previous posts that whenever we access SDRAM via AXI, we make use of two modules that we developed earlier on.

We use the first module to connect to one of the AXI ports of the ZYNQ processor. In this module we also make use of an AXI Burst block, which is an IP provided by Xilinx. The AXI Burst block abstracts away the technical details of the AXI protocol and provides us with a set of signals that are easier to work with.

The second module takes the simplified set of signals provided by the AXI Burst block and stores the stream of data words in a FIFO. The FIFO buffers the information received from the AXI port and absorbs the bursty nature of the AXI protocol. Thus, on the receiving end of the FIFO you get the data words at a constant rate.

You will also recall from previous posts that this FIFO should have sufficient depth to avoid underflow. For our cassette interface, however, it is sufficient to store a single word at a time, which makes a FIFO a bit of an overkill.

Why don't we need a FIFO for the cassette interface? Because we receive data words on the AXI bus at 100 MHz, whereas we produce a pulsating data signal with a maximum frequency of about 3 kHz. Between pulse toggles we therefore have more than enough time to fetch the next sample from SDRAM.
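As a rough check of this claim, the ratio of the two rates gives the number of AXI clock cycles available per pulse, using the approximate 3 kHz maximum quoted above:

```latex
\frac{f_{\mathrm{AXI}}}{f_{\mathrm{pulse,max}}}
  = \frac{100\ \mathrm{MHz}}{3\ \mathrm{kHz}}
  \approx 33\,333\ \text{AXI clock cycles per pulse}
```

Since a 32-bit word holds at most four one-byte pulse values, each word lasts roughly 4 × 33,333 ≈ 133,000 AXI clock cycles or more, which leaves ample slack for fetching the next word even if the read has to wait for other AXI traffic.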

Let us now refer back to our high level block diagram. Our block used for storing a word of AXI data at a time is the READ WORD block.

We receive data from AXI 32 bits at a time, whereas with .TAP file data it is easier to inspect the data a byte at a time. For this reason we have implemented a BYTE SLICER block that breaks a word up into its individual bytes. This functionality is implemented with a shift register that shifts eight bits at a time.
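The byte reversal in read_word's data_wire_out only works if the byte slicer presents the top byte, bits [31:24], and then shifts left by eight bits. The C model below is for illustration only (swap_bytes and slice_word are hypothetical names). It assumes an AXI read returns the byte at the lowest address in bits [7:0], which is AXI's little-endian byte-lane ordering, and shows the bytes coming out in file order.

```c
#include <stdint.h>

/* Mirrors read_word's data_wire_out: byte 0 of the word (the first .TAP
   byte in memory) is moved to bits [31:24], byte 3 to bits [7:0]. */
static uint32_t swap_bytes(uint32_t w)
{
    return (w << 24) | ((w << 8) & 0x00ff0000u) |
           ((w >> 8) & 0x0000ff00u) | (w >> 24);
}

/* Mirrors the byte slicer: present bits [31:24], then shift left by eight. */
static void slice_word(uint32_t axi_word, uint8_t out[4])
{
    uint32_t reg = swap_bytes(axi_word);
    for (int i = 0; i < 4; i++) {
        out[i] = (uint8_t)(reg >> 24);   /* current output byte */
        reg <<= 8;                       /* shift in zeros from the bottom */
    }
}
```

A word read from memory bytes 0x30, 0x2f, 0x31, 0x00 arrives as 0x00312f30, and slice_word returns 0x30, 0x2f, 0x31, 0x00 again.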

You will see that apart from the data signal between the READ WORD and BYTE SLICER blocks, there are two extra signals: Valid and ACK. This is a pattern you will see quite often in a pipeline architecture.

When the READ WORD block has received a piece of data from the AXI port, it informs the BYTE SLICER by asserting the Valid line. When this line is asserted, the BYTE SLICER stores the data and asserts the ACK line. This in turn informs the READ WORD block that it can go ahead and retrieve the next word from the AXI port.

In this way both the READ WORD and BYTE SLICER blocks are kept busy, much like an assembly line in a factory.
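The Valid/ACK exchange can be illustrated with a simple single-clock C model. It is a sketch under simplifying assumptions (one clock for both blocks, a fixed number of busy cycles per word in the consumer, and hypothetical names such as run_handshake), not the actual design, but it shows the READ WORD side fetching the next word while the BYTE SLICER is still busy with the current one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Transfers n words from src (READ WORD side) to dst (BYTE SLICER side)
   using a Valid/ACK handshake and returns the number of cycles taken.
   'work' is how many cycles the consumer stays busy after taking a word. */
static int run_handshake(const uint32_t *src, uint32_t *dst, int n, int work)
{
    bool valid = false;      /* READ WORD: Valid line */
    uint32_t word = 0;       /* READ WORD: word waiting to be handed over */
    int fetched = 0;         /* words fetched from "SDRAM" so far */
    int busy_left = 0;       /* BYTE SLICER: cycles until it is free */
    int received = 0, cycles = 0;

    while (received < n) {
        bool ack = false;
        if (busy_left > 0) {
            busy_left--;                 /* slicer still emitting bytes */
        } else if (valid) {
            dst[received++] = word;      /* slicer stores the word ... */
            busy_left = work;
            ack = true;                  /* ... and asserts ACK */
        }
        if (ack) {
            valid = false;               /* READ WORD sees ACK, drops Valid */
        } else if (!valid && fetched < n) {
            word = src[fetched++];       /* fetch the next word over AXI */
            valid = true;
        }
        cycles++;
    }
    return cycles;
}
```

With three words and four busy cycles per word the model finishes in 12 cycles, and each new word is already waiting when the consumer becomes free.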

One more thing I want to highlight between the READ WORD and BYTE SLICER blocks is the dotted line with the caption Cross Clock Domain. On the left side of the dotted line we work at the AXI clock frequency of 100 MHz, while on the right side we work at only 1 MHz. To cater for these different clocks, we will again use multi-flop synchronisers in both blocks.

Next, let us have a look at the Sample Assembler block. If you read through the specification of the .TAP file format, you will see that each pulse width value is either one byte or four bytes long. The rule (in version 1 of the format) is simple: if the byte value is zero, the next three bytes give the absolute pulse width, least significant byte first, in C64 clock cycles, each of which is roughly a microsecond since the C64 runs at about 1 MHz. If the byte value is non-zero, the pulse width is contained in that single byte and equals the byte value multiplied by eight.

It is thus the purpose of the Sample Assembler to determine the duration of the next pulse from the stream of incoming bytes.
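The decoding rule can be sketched in C as a software reference model. This is not part of the FPGA design; tap_decode is a hypothetical name, and the input is assumed to be the pulse data only, without the 20-byte .TAP file header.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes version 1 .TAP pulse data into pulse lengths in C64 clock
   cycles. Returns the number of pulses written to 'out'. */
static size_t tap_decode(const uint8_t *data, size_t len,
                         uint32_t *out, size_t max_out)
{
    size_t i = 0, n = 0;
    while (i < len && n < max_out) {
        if (data[i] != 0) {
            out[n++] = (uint32_t)data[i] * 8;     /* one byte: value x 8 */
            i += 1;
        } else {
            if (i + 3 >= len)
                break;                            /* truncated long pulse */
            out[n++] = (uint32_t)data[i + 1]      /* 24-bit value, LSB first */
                     | (uint32_t)data[i + 2] << 8
                     | (uint32_t)data[i + 3] << 16;
            i += 4;
        }
    }
    return n;
}
```

The non-zero case corresponds to the {data, 3'b0} assignment in the sample assembler, and the zero case to shifting three bytes into timer_val, least significant byte first.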

Our final block is the PWM block, whose name a lot of you will recognise as the acronym for Pulse Width Modulation. PWM actually describes the data signal you receive from a Commodore Datasette: a set of pulses of varying length.

Our PWM block is basically implemented as a countdown timer that toggles its output each time an underflow occurs.

You will also see that the output of the PWM block is fed back to the Sample Assembler. The Sample Assembler uses the high-to-low transition of the pulse as a cue to start assembling the next pulse duration.

Implementing the READ WORD block

Let us have a look at the code for the READ WORD block.

I will start by showing the complete code for this module and then highlight important snippets from it:

module read_word(
  input wire clk,
  input wire restart,
  input wire reset,
  output reg [12:0] count_in_buf,
  input ack,
  output wire [31:0] ip2bus_mst_addr,
  output reg [11:0] ip2bus_mst_length,
  input wire [31:0] ip2bus_mstrd_d,
  output wire [31:0] axi_d_out,
  output wire [31:0] data_wire_out,
  output wire [4:0] ip2bus_inputs,
  input wire [5:0] ip2bus_otputs,
  output wire empty,
  input wire read,
  output reset_1_mhz,
  output data_valid
    );

reg master_read_dst_rdy; //change to axi name
wire cmd_ack; // change to axi name
wire mstread_req;
wire mst_type;
reg  [31:0] axi_start_address;
reg  [31:0] data_cap;
reg [31:0] reset_1_counter = 50000000;
wire [11:0] burst_len;
(* ASYNC_REG = "TRUE" *) reg sync_ack_0, sync_ack_1, sync_ack_2;
wire master_read_src_rdy;
reg [12:0] bytes_to_receive;
reg [3:0] state;
reg axi_data_loaded = 0;
reg [12:0] axi_data_inc;
wire neg_clk;
wire pos_edge_ack;

assign data_valid = axi_data_loaded;
assign pos_edge_ack = !sync_ack_2 & sync_ack_1;
assign data_wire_out = {data_cap[7:0], data_cap[15:8], data_cap[23:16], data_cap[31:24]};
assign reset_1_mhz = reset_1_counter > 21000000 ? 1 : 0;

parameter
  IDLE = 4'h0,
  INIT_CMD = 4'h1,
  START = 4'h2,
  ACT = 4'h3,
  TRANSMITTING = 4'h4;

parameter BURST_THRES = 124;  

assign neg_clk = ~clk;

always @(posedge clk)
if (reset_1_counter > 20000000)
  reset_1_counter <= reset_1_counter - 1;

always @(posedge clk)
begin
  sync_ack_0 <= ack;
  sync_ack_1 <= sync_ack_0;
  sync_ack_2 <= sync_ack_1;
end

always @(posedge clk)
 if (restart | pos_edge_ack | reset)
  axi_data_loaded <= 0; 
 else if ((state > START) & !master_read_src_rdy & !axi_data_loaded) 
   axi_data_loaded <= 1;
    
always @(posedge clk)
  if (!master_read_src_rdy & !axi_data_loaded)
    data_cap <= ip2bus_mstrd_d;

always @(posedge clk)
if ((reset | restart) & !axi_data_loaded & state == 0)  
  state <= 0;
else
  case( state )
    IDLE: if (!axi_data_loaded) 
            state <= INIT_CMD;
    INIT_CMD: state <= START;             
    START: if (cmd_ack)
             state <= ACT;
    ACT: if (!master_read_src_rdy)
             state <= TRANSMITTING;
    TRANSMITTING: state <= IDLE;    
  
  endcase  
  
always @(negedge clk)
if (restart | reset)
begin
  axi_start_address <= 32'h200000;
  axi_data_inc <= 0;
end
else if (state == INIT_CMD)
begin
  axi_start_address <= axi_start_address + axi_data_inc;
  axi_data_inc <= 4;
end    

always @(negedge clk)
if (state == INIT_CMD)
  ip2bus_mst_length <= 4; 
  
assign mstread_req = (state == START) ? 1 : 0;

assign mst_type = (state == START) ? 1 : 0;

always @*
  if (state == START)
    master_read_dst_rdy = 0;
  else if (state > START & !axi_data_loaded)
    master_read_dst_rdy = 0;
  else
   master_read_dst_rdy = 1;
         
assign master_read_src_rdy = ip2bus_otputs[3];
assign cmd_ack = ip2bus_otputs[0];
assign ip2bus_inputs[0] = mstread_req;
assign ip2bus_inputs[1] = mst_type; 
assign ip2bus_mst_addr = axi_start_address;
assign ip2bus_inputs[2] = master_read_dst_rdy;

assign ip2bus_inputs[3] = 1'b0;
assign ip2bus_inputs[4] = 1'b0;
endmodule

Firstly, this code contains some glue logic for interfacing with the AXI Burst block. There is also some reset and restart logic. The restart logic is important if you reload SDRAM with a new .TAP file.

We receive the ACK signal from the Byte Slicer, which lives in another clock domain. For this reason we define three synchroniser flip-flops: sync_ack_0, sync_ack_1 and sync_ack_2. We also use these flip-flops to detect the positive edge of the ACK signal, which triggers the loading of the next word from the AXI port.
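To see how three flip-flops give both synchronisation and edge detection, here is a small cycle-level C model of the sync_ack_0/1/2 chain and the pos_edge_ack expression. It is an illustration only (ack_sync_t and ack_sync_step are hypothetical names): software cannot model metastability, so it only shows the delay through the chain and that a long ACK pulse produces a single pos_edge_ack.

```c
#include <stdbool.h>

/* State of the three synchroniser flip-flops in read_word. */
typedef struct { bool s0, s1, s2; } ack_sync_t;

/* One rising edge of the 100 MHz clock. Returns pos_edge_ack as it reads
   after that edge. */
static bool ack_sync_step(ack_sync_t *s, bool ack_in)
{
    /* non-blocking assignments: each flop takes its input's old value,
       so update from the end of the chain backwards */
    s->s2 = s->s1;
    s->s1 = s->s0;
    s->s0 = ack_in;
    return !s->s2 && s->s1;   /* pos_edge_ack = !sync_ack_2 & sync_ack_1 */
}
```

Feeding it an ACK that stays high for five cycles yields exactly one cycle with pos_edge_ack set, so only one word load is triggered per ACK pulse.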

Byte Slicer

Let us have a look at the implementation of the byte slicer:

module byteslicer(
  input clk,
  input data_valid,
  output [7:0] byte_out,
  output ack,
  input [31:0] data_in,
  input restart,
  input read
    );
    
parameter STATE_INIT = 0;
parameter STATE_LOADED = 1;
parameter STATE_SHIFT_1 = 2;
parameter STATE_SHIFT_2 = 3;
parameter STATE_SHIFT_3 = 4;
    
reg [3:0] state = 0;
reg [31:0] data_reg;
(* ASYNC_REG = "TRUE" *) reg data_valid_0, data_valid_1;

assign ack = state == STATE_INIT & data_valid_1;
assign byte_out = data_reg[31:24]; // top byte is the next byte to deliver (data_reg shifts left)

always @(posedge clk)
begin
  data_valid_0 <= data_valid;
  data_valid_1 <= data_valid_0;
end

always @(posedge clk)
if (state == STATE_INIT & data_valid_1)
  data_reg <= data_in;
else if ((state == STATE_LOADED | state == STATE_SHIFT_1 | state == STATE_SHIFT_2 | state == STATE_SHIFT_3) & read)
  data_reg <= {data_reg[23:0],8'h0};

always @(posedge clk)
if (restart)
  state <= STATE_INIT;
else case (state)
  STATE_INIT: state <= data_valid_1 ? STATE_LOADED : STATE_INIT;
  STATE_LOADED: state <= read ? STATE_SHIFT_1 : STATE_LOADED;
  STATE_SHIFT_1: state <= read ? STATE_SHIFT_2 : STATE_SHIFT_1;
  STATE_SHIFT_2: state <= read ? STATE_SHIFT_3 : STATE_SHIFT_2;
  STATE_SHIFT_3: state <= read ? STATE_INIT : STATE_SHIFT_3;
endcase

endmodule

In this module we again have the scenario where we receive a signal from another clock domain, in this case data_valid. For this reason we create the synchronisers data_valid_0 and data_valid_1.

As mentioned earlier, the byte slicer is a shift register shifting eight bits at a time. In our implementation, the shift happens while the read line is asserted; this line is driven by the sample assembler when it needs more bytes.

Sample Assembler

The implementation of the sample assembler is as follows:

module sample_assembler(
  input clk,
  input data_valid,
  input [7:0] data,
  output ack,
  input pwm,
  output reg [23:0] timer_val,
  output tape_out,
  input restart
    );
    
parameter STATE_START = 0;
parameter STATE_LOADED = 1;
parameter STATE_LOADED_1 = 2;
parameter STATE_LOADED_2 = 3;
parameter STATE_LOADED_3 = 4;


reg [3:0] state = 0;
reg pwm_0, pwm_1;
reg three_byte_sample = 0;
wire neg_edge;

assign tape_out = pwm;
assign neg_edge = !pwm_0 & pwm_1;

always @(posedge clk)
begin
  pwm_0 <= pwm;
  pwm_1 <= pwm_0;
end

assign ack = state == STATE_START | (state == STATE_LOADED_1 & data_valid) | (state == STATE_LOADED_2 & data_valid) | (state == STATE_LOADED_3 & data_valid);

always @(posedge clk)
if (state == STATE_START & data_valid & data != 0)
  timer_val <= {data, 3'b0};
else if ((state == STATE_LOADED_1 | state == STATE_LOADED_2 | state == STATE_LOADED_3) & data_valid)
  timer_val <= {data, timer_val[23:8]};

always @(posedge clk)
if (restart)
  state <= STATE_START;
else case(state)
  STATE_START: begin
                 three_byte_sample <= 0;
                 if (data_valid & data != 0) 
                   state <= STATE_LOADED;
                 else if (data_valid)
                   state <= STATE_LOADED_1;
               end
  STATE_LOADED_1: if (data_valid)
                  begin
                    //three_byte_sample <= 1;             
                    state <= STATE_LOADED_2;
                  end     
    //state <= data_valid ? STATE_LOADED : STATE_START;
  STATE_LOADED_2: if (data_valid)
                   state <= STATE_LOADED_3;
                       
  STATE_LOADED_3: if (data_valid)
                   state <= STATE_LOADED;    

  
  STATE_LOADED: begin
    state <= neg_edge ? STATE_START : STATE_LOADED;
    three_byte_sample <= 0;
  end
endcase
endmodule

The Sample Assembler starts off by inspecting the first byte that comes in. If it is non-zero, three zero bits are appended to it (multiplying it by eight, as the .TAP format requires) and we have a sample value.

If the first byte is a zero, the Sample Assembler waits patiently for the next three bytes to be clocked in to get the full sample value.

Once a sample value is created, we wait for a negative (high-to-low) transition of the PWM output to restart the process.

PWM

Let us have a look at our final module. Here is the implementation:

module tape_pwm(
  input [23:0] time_val,
  input load_timer,
  output pwm,
  input clk
    );
    
  reg polarity = 1;
  reg [23:0] load = 100;
  reg [23:0] timer = 100;
  assign pwm = polarity;
    
  always @(posedge clk)
    if (load_timer)
      load <= {1'b0,time_val[23:1]};
  
  always @(posedge clk)
  if (timer > 0)
    timer <= timer - 1;
  else
    timer <= load;
    
  always @(posedge clk)
    if (timer == 0)
      polarity <= ~polarity;
endmodule

As mentioned, this is just a countdown timer that toggles its output on underflow.

You will also notice that when storing the timer value in the load register, we shift it right by one bit, discarding the lowest bit. This is because the timer values in a .TAP file give the period between two positive transitions, so we need to toggle the output every half period.
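A short C model of tape_pwm shows how the halving interacts with the reload. It is an illustration only: pwm_period is a hypothetical helper, and it assumes load_timer stays constant and the timer starts from the loaded value. Because the timer holds each level for load + 1 cycles, the full period is 2 × (N/2 + 1) cycles, which is within two clock cycles of the .TAP value N.

```c
#include <stdint.h>

/* Cycle model of tape_pwm: returns the number of 1 MHz clock cycles
   between two consecutive rising edges of pwm for a .TAP value 'tap'. */
static uint32_t pwm_period(uint32_t tap)
{
    uint32_t load = tap >> 1;         /* load <= {1'b0, time_val[23:1]} */
    uint32_t timer = load;
    int polarity = 1, prev = 1;
    uint32_t cycle = 0, first_rise = 0;
    int rises = 0;

    for (;;) {
        int toggle = (timer == 0);               /* uses the old timer value */
        timer = (timer > 0) ? timer - 1 : load;  /* count down, then reload */
        if (toggle)
            polarity = !polarity;
        cycle++;
        if (polarity && !prev) {                 /* rising edge of pwm */
            if (++rises == 1)
                first_rise = cycle;
            else
                return cycle - first_rise;
        }
        prev = polarity;
    }
}
```

For a short pulse value of 384 cycles the model gives a period of 386 cycles, an error of about half a percent.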

Linking everything up

With all the modules created it is a matter of linking everything up.

A port that needs special mention is the restart port found on most of these modules. This port needs to be assigned to an AXI slave port register so the ZYNQ processor can access it. You can then toggle this bit programmatically after you have loaded a .TAP file into SDRAM.

A .TAP file can be loaded into SDRAM with the XSCT command mwr (memory write). The read_word module starts reading at address 0x200000, so that is where the file data should go.

With all the modules linked up, you can test the design by integrating it with the sound system we developed in the previous post. The result should sound similar to a C64 tape being played on a tape deck.

In Summary

In this post we have developed the cassette interface that will take a .TAP file and produce a corresponding signal of variable pulse widths.

In the next post we will start to integrate this cassette interface to our C64 module.

Till next time!