Foreword
In the previous posts we created a 6502-based design that reads executable code from a FAT32-formatted SD Card and executes it.
Let me quickly summarise the goal of this project again. My goal is to run an Amiga core on an Arty A7 board. For this project I will be using a 6502 core to do all the heavy lifting of loading Amiga ROMs and disk images from an SD Card into RAM, so that the Amiga core can execute them.
At this point in time our design uses only block RAM. On every FPGA, block RAM is a limited resource, especially if we want to implement something like an Amiga core.
So, in this post we will be trying to run the 6502 core using the DDR RAM available on the Arty A7. Achieving this goal will free up block RAM for the Amiga core.
Stumbling Blocks
Let me start by discussing the stumbling blocks I came across over the past couple of months while trying to get the 6502 core to use DDR RAM.
Usually when I encounter stumbling blocks, I go into quite some detail about them in my blog posts. However, the stumbling blocks I hit with DDR RAM over the past couple of months were gigantic, so I will try to keep this section brief.
So, my initial attempt to write code to interface the 6502 core with the DDR was pretty straightforward, and everything ran as expected during the simulation. However, when I tried running it on the actual Arty A7, things looked totally different than during the simulation.
Every other byte I read back from DDR on the Arty A7 was garbage. When this kind of thing happens while playing around with DDR, my heart sinks into my shoes, simply because there are not really any tools for troubleshooting these kinds of issues. A lot of the DDR operations happen at frequencies well above what can be captured by the Integrated Logic Analysers. In these cases one can only really solve the issue by some kind of intuition.
After a lot of back and forth, I decided to revisit my assumptions from a previous post:
In the post this diagram comes from, I was working on a memory tester. Basically, signal A resembles the clock signal of the 6502 core.
At point A an address is asserted by the 6502, and at point B the first DDR instruction is loaded into a shift register that shifts the instruction out to DDR, opening the row associated with the address provided by the 6502 core.
The time between two dotted lines is 1.5ns, so the time between A and B is 3ns. This translates to 333MHz, and within the FPGA used on an Arty A7 it seemed like a very tight fit to me, although it was sufficient to run a memory tester on the board.
I gave this some thought. There are a lot more logic cells involved with a 6502 core than with a simple memory tester. So, 3ns might not be enough for all the individual address lines to reach their full voltages.
My intuition told me, or should I rather say I made a hypothesis😆, that the problem may be solved by increasing the time period between A and B. We will explore this possible solution in the next section.
Clocking changes
With the hypothesis I made in the previous section, I came up with the following clocking scheme:
The 6502_clk is basically the clock that should drive the 6502 core. It is an exact copy of mclk, but I am throwing away 9 clocks in between, thus keeping only every tenth clock. With mclk at 83.3MHz, this gives us an effective 6502 clock of 8.3MHz, which is still above the target clock of 7MHz required for our Amiga core in future.
At the point I have indicated with an arrow, we load our shift register with the address asserted by our 6502 core, which is one mclk cycle after the assertion. This works out to 12ns, compared to the 3ns of our earlier design. I think this will give ample time for our address lines to settle before they are read at the next mclk cycle.
The question remains whether this bigger time gap will introduce extra latency, causing us to miss our target frequency of 7MHz. We will revisit this question later on, when we have finished with the design.
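As a quick sanity check of the numbers quoted in this section, here is a small Python sketch. Python is only used as a calculator here; the constant names are my own, and the values are the ones mentioned in the text (1.5ns per division, mclk of 83.3MHz, divide by ten, 7MHz target).

```python
# Back-of-the-envelope timing check for the figures quoted in the text.
MCLK_MHZ = 83.3      # mclk frequency from the post
DIVIDER = 10         # keep every tenth mclk pulse
TARGET_MHZ = 7.0     # target clock for the future Amiga core


def period_ns(freq_mhz):
    """Period in nanoseconds of a clock whose frequency is given in MHz."""
    return 1000.0 / freq_mhz


old_gap_ns = 2 * 1.5                  # two 1.5 ns divisions between A and B
old_gap_mhz = 1000.0 / old_gap_ns     # frequency equivalent of a 3 ns window
new_gap_ns = period_ns(MCLK_MHZ)      # one full mclk cycle
cpu_clk_mhz = MCLK_MHZ / DIVIDER      # effective 6502 clock

print(f"old settle window: {old_gap_ns:.1f} ns (~{old_gap_mhz:.0f} MHz)")
print(f"new settle window: {new_gap_ns:.1f} ns")
print(f"6502 clock: {cpu_clk_mhz:.2f} MHz, above target: {cpu_clk_mhz > TARGET_MHZ}")
```

This only confirms the arithmetic. Whether the extra latency still fits into one 6502 cycle has to be checked against the real state machine.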
Let us start by writing some Verilog code for a counter that keeps track of when to enable the 6502 clock:
    reg [3:0] edge_count = 9;

    always @(negedge mclk)
    begin
        if (edge_count == 0)
            edge_count <= 9;
        else
            edge_count <= edge_count - 1;
    end

    always @(negedge mclk)
    begin
        clk_8_enable <= edge_count == 0;
    end

We get the resulting 6502 clock with the following:
    BUFGCE BUFGCE_8_mhz (
        .O(clk_8_mhz),     // 1-bit output: Clock output
        .CE(clk_8_enable), // 1-bit input: Clock enable input for I0
        .I(mclk)           // 1-bit input: Primary clock
    );

So, we will use the signal clk_8_mhz to clock our 6502. It is important to add a constraint in Vivado to indicate that this signal should be treated as a clock when synthesizing the design. This constraint will look like the following:
    create_generated_clock -name clkdiv1 -source [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O] -edges {1 2 21} [get_pins mcntrl393_i/memctrl16_i/mcontr_sequencer_i/BUFGCE_8_mhz/O]

The edges parameter indicates which edges of the mclk clock form part of the 6502 clock.
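To convince myself that the counter and the -edges list describe the same clock, the behaviour can be modelled in a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, clk_8_enable is assumed to start at 0 (the post does not show an initial value), the mclk period is rounded to 12ns, and the edge numbering follows my understanding of the Vivado convention, where edge 1 is the first rising edge of the source clock and rising and falling edges alternate from there.

```python
MCLK_PERIOD_NS = 12.0   # roughly 1 / 83.3 MHz


def simulate_enable(num_negedges, start=9):
    """Model both always blocks; return clk_8_enable after each negedge of mclk."""
    edge_count = start
    clk_8_enable = 0          # assumed initial value, not shown in the post
    trace = []
    for _ in range(num_negedges):
        # Non-blocking assignments: both right-hand sides see the old edge_count.
        next_enable = 1 if edge_count == 0 else 0
        next_count = 9 if edge_count == 0 else edge_count - 1
        edge_count, clk_8_enable = next_count, next_enable
        trace.append(clk_8_enable)
    return trace


def edge_time_ns(n):
    """Time of source clock edge n (1 = first rising edge, 2 = first falling edge)."""
    return (n - 1) * MCLK_PERIOD_NS / 2


trace = simulate_enable(40)
pulses = [i for i, enabled in enumerate(trace) if enabled]
spacing = {b - a for a, b in zip(pulses, pulses[1:])}

rise, fall, next_rise = (edge_time_ns(n) for n in (1, 2, 21))

print("enable asserted at negedge indices:", pulses)
print("mclk cycles between enables:", spacing)
print(f"generated clock: high for {fall - rise} ns, period {next_rise - rise} ns")
```

In this model the enable is high for one mclk cycle out of every ten, and the edge list {1 2 21} gives a pulse that is half an mclk period wide and repeats every ten mclk periods, so the two descriptions agree.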
Changing the command sequence
    assign result_cmd = (state == WAIT_CMD && cmd_valid && !refresh_out)
                        ? {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd}
                        : test_cmd;

It was this assignment to the wire I mentioned earlier that resulted in the address being sampled 3ns after it was asserted.
    PREPARE_CMD: begin
        test_cmd <= 32'h000001ff;
        cmd_slot <= 0;
        if (edge_count == 8)
        begin
            state <= COL_CMD;
            test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
        end
    end

You will also see that we only assert this command and go to the next state when edge_count is 8. This ensures that our state machine stays in sync with our 6502 clock.
    COL_CMD: begin
        begin
            state <= STATE_PREA;
            test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0,
                         4'h1, (write_out ? 2'b11 : 2'b00), 10'h1fd};
            cmd_slot <= 1;
            data_in <= {8{cmd_data_out}};
            do_write <= write_out;
        end
    end

The rest of the state machine is the same.
Lowering the 6502 into the design
    retrosystem retrosystem(
        .cs(),
        .mosi(),
        .miso(),
        .reset(wait_for_read > 0),
        .gen_clk(clk),
        .write_ddr(write),
        .ddr_data_out(data_out_byte),
        .ddr_data_in(data_in),
        .ddr_addr(address_byte),
        .led(led),
        .sclk(),
        .cd(),
        .wp()
    );

First of all, I had to come up with a name for a module that used to be top.v but is not a top module anymore. So, I just picked the name retrosystem. It contains an SD Card module and a 6502 system.
- write_ddr
- ddr_data_out
- ddr_data_in
- ddr_addr
    always @*
    begin
        casex (addr_delayed)
            //16'hfexx: combined_data = o_data_sdspi[7:0];
            16'b1111_1011_xxxx_xx00: combined_data = o_data_sdspi[7:0];
            16'b1111_1011_xxxx_xx01: combined_data = wb_data_store[7:0];
            16'b1111_1011_xxxx_xx10: combined_data = wb_data_store[15:8];
            16'b1111_1011_xxxx_xx11: combined_data = wb_data_store[23:16];
            16'b0000_0xxx_xxxx_xxxx: combined_data = addr_delayed[0] ? ddr_data_in[15:8] : ddr_data_in[7:0];
            default: combined_data = rom_out;
        endcase
    end

combined_data is the signal that combines the data from the various sources and sends it to the 6502 core via the DI input.
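To make the address map in the casex easier to read, here is a small Python model of the same decoding. It is only a sketch: select_source is a hypothetical helper that returns the name of the selected source rather than any data, and it assumes the address is fully defined (casex would also treat x or z bits in addr_delayed itself as don't-cares, which this model ignores).

```python
def select_source(addr):
    """Return the name of the source the casex routes to the 6502 for a 16-bit address."""
    addr &= 0xFFFF
    # 16'b1111_1011_xxxx_xxNN: high byte 0xFB, low two bits select the register.
    if (addr & 0xFF00) == 0xFB00:
        return [
            "o_data_sdspi[7:0]",
            "wb_data_store[7:0]",
            "wb_data_store[15:8]",
            "wb_data_store[23:16]",
        ][addr & 0x3]
    # 16'b0000_0xxx_xxxx_xxxx: addresses 0x0000-0x07FF come from DDR,
    # with address bit 0 picking the upper or lower byte of the 16-bit word.
    if (addr & 0xF800) == 0x0000:
        return "ddr_data_in[15:8]" if addr & 1 else "ddr_data_in[7:0]"
    # Everything else falls through to the ROM.
    return "rom_out"


for a in (0x0004, 0x0005, 0x07FF, 0x0800, 0xFB00, 0xFB0B, 0xFC00):
    print(f"${a:04X} -> {select_source(a)}")
```

Running it over a few addresses shows that reads from $0000 to $07FF go to DDR, reads in page $FB go to the SD Card registers, and everything else is served by the ROM.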
    assign ram_6502_addr = cpu_address;
    assign write_ddr = (we_6502 & (cpu_address[15:9] == 0));
    assign ddr_data_out = cpu_data_out;
    assign ddr_addr = cpu_address_result;

I mentioned that the retrosystem block needs to be instantiated within mem_tester. Speaking of mem_tester, it also contains a state machine which is no longer necessary.
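Coming back to the write_ddr assignment for a moment: the condition cpu_address[15:9] == 0 selects addresses $0000 to $01FF, which on a 6502 are the zero page and the stack page. The Python sketch below models just that condition. The function name simply mirrors the signal name and is not taken from any library.

```python
def write_ddr(we_6502, cpu_address):
    """Model of: write_ddr = we_6502 & (cpu_address[15:9] == 0)."""
    upper_bits = (cpu_address >> 9) & 0x7F   # cpu_address[15:9]
    return bool(we_6502) and upper_bits == 0


for addr in (0x0000, 0x0004, 0x01FF, 0x0200, 0xFB0B):
    print(f"write to ${addr:04X} reaches DDR: {write_ddr(1, addr)}")
```

Note that this is a smaller range than the $0000 to $07FF range that the read multiplexer routes to DDR.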
Checking Timing
    PREPARE_CMD: begin
        test_cmd <= 32'h000001ff;
        cmd_slot <= 0;
        if (edge_count == 2)
        begin
            cap_value <= data_out;
        end
        do_capture <= 0;
        if (edge_count == 8)
        begin
            state <= COL_CMD;
            test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
        end
        cmd_status <= 1;
    end

As the edge_count == 2 check shows, we capture data_out when edge_count is 2.
The Test Program
            .ORG $FC00
            ldx #offset
    copy    lda zcode,x
            sta $4,x
            dex
            bpl copy
            ldx #0
    read    lda $4,x
            inx
            cpx #$0a
            bne read
            jmp $4
    zcode   lda #$20
            sta $fb0b
            lda #0
            sta $fb0b
            lda #0
            sta $0
            sta $1
            sta $2
            sta $3
    lp1     inc $0
            bne lp1
    lp2     inc $1
            bne lp1
    lp3     inc $2
            lda $2
            cmp #60
            bne lp1
            lda #$20
            eor $3
            sta $fb0b
            sta $3
            lda #0
            sta $2
            beq lp1
    endz    nop
    offset=*-zcode
    ENDROM = $FFFF-*-3
            .FILL ENDROM 00
            .BYTE 0, $FC, 00, 00

This code, which starts at $FC00, the start of our "ROM", basically does three things. It starts by copying the code at label zcode into RAM, starting at address $4.
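Since the copy loop places zcode at address $4, we can predict which bytes should end up in the first few RAM locations and compare them against what is later read back from DDR. The Python sketch below hand-assembles the first five instructions of zcode using the standard 6502 opcodes for LDA immediate ($A9) and STA absolute ($8D). The helper names are made up for this illustration and it is not a general assembler.

```python
LDA_IMM = 0xA9   # lda #value
STA_ABS = 0x8D   # sta address, operand stored low byte first


def lda_imm(value):
    """Encode 'lda #value'."""
    return [LDA_IMM, value & 0xFF]


def sta_abs(addr):
    """Encode 'sta addr' with a little-endian 16-bit operand."""
    return [STA_ABS, addr & 0xFF, (addr >> 8) & 0xFF]


# First five instructions at label zcode.
zcode_prefix = (
    lda_imm(0x20) + sta_abs(0xFB0B) +
    lda_imm(0x00) + sta_abs(0xFB0B) +
    lda_imm(0x00)
)

BASE = 0x4   # destination address used by the copy loop
for i, byte in enumerate(zcode_prefix):
    print(f"Address {BASE + i:x}: {byte:02X}")
```

The printed list uses the same "Address n: value" layout as the readback in the Real Life Results section, so the two can be compared line by line.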
Real Life Results
    Address 4: A9
    Address 5: 20
    Address 6: 8D
    Address 7: 0B
    Address 8: FB
    Address 9: A9
    Address a: 00
    Address b: 8D
    Address c: 0B
    Address d: FB
    Address e: A9
    Address f: 00

And next the ILA capture: