Foreword
In the previous post we managed to get a 6502-based ecosystem together where we could access both an SD Card and DDR3 memory.
With this design we can load quite a lot of stuff from SD Card into DDR3 memory and thus reduce our dependency on the limited Block RAM that is available on the FPGA. This opens the possibility to emulate an Amiga core on the Arty A7 FPGA board.
As mentioned in previous posts, we will be using a 6502-based system that will do all the work of loading everything the Amiga core requires from SD Card into DDR3. Needless to say, this requires both the 6502 core and the Amiga core to access the DDR3 memory.
One way to let both the 6502 core and the Amiga core access the DDR3 would be to use the memory controller we developed in the last couple of posts, and just let the two cores take turns accessing DDR3 memory. Knowing that our memory controller runs at around 8MHz, that would mean that our Amiga core would be accessing DDR3 memory at around 4MHz, because it would only get every second clock cycle. This is far from ideal, given that a stock Amiga runs at 7MHz or more.
So, in this post we will try to come up with an optimised dual-channel memory controller, where we will attempt to make both the Amiga core and the 6502 core access memory at 7MHz.
The Magic of Memory Banks
In our journey with DDR3 memory, we got to know the different states memory can be in:
- Activate: Activate a row for reading or writing
- Read/Write: Read or write a particular column of data
- Precharge: After you are finished with your reads/writes on a particular row, you first need to precharge the row, before moving on to the next one.
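To make the order of these three steps concrete, here is a minimal sketch of a state machine that walks a single row access through them. The module name, the port names and the state encoding are all made up for illustration and are not taken from the design in this post. The sketch also spends exactly one clock in each phase, ignoring the real DDR3 wait times (such as tRCD and tRP) that an actual controller has to respect.

```verilog
// Hypothetical sketch: one row access as Activate -> Read/Write -> Precharge.
// Names and timing are illustrative only; real DDR3 needs wait cycles
// between these phases.
module row_access_fsm (
    input  wire       clk,
    input  wire       rst,
    input  wire       start,   // request one access
    output reg  [1:0] phase    // current phase of the access
);
    localparam IDLE       = 2'd0;
    localparam ACTIVATE   = 2'd1;  // open (activate) the row
    localparam READ_WRITE = 2'd2;  // read or write a column in the open row
    localparam PRECHARGE  = 2'd3;  // close the row before moving to another

    always @(posedge clk) begin
        if (rst)
            phase <= IDLE;
        else begin
            case (phase)
                IDLE:       if (start) phase <= ACTIVATE;
                ACTIVATE:   phase <= READ_WRITE;
                READ_WRITE: phase <= PRECHARGE;
                PRECHARGE:  phase <= IDLE;
            endcase
        end
    end
endmodule
```

The point to take away is that a row has to go through all three phases in this order before the next row in the same bank can be touched.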
Using timeslots wisely
    PREPARE_CMD: begin
        test_cmd <= 32'h000001ff;
        cmd_slot <= 0;
        if (edge_count == 8) begin
            state <= COL_CMD;
            test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
        end
    end

test_cmd is the command we want to issue. I am not going to explain the individual bits for this, but it basically indicates what RAS/CAS/WRITE should be set as for the command. cmd_slot indicates at which of the 4 time slots the command should be issued.
    cmd_addr #(
        .IODELAY_GRP(IODELAY_GRP),
        .IOSTANDARD(IOSTANDARD_CMDA),
        .SLEW(SLEW_CMDA),
        .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
        .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE),
        .ADDRESS_NUMBER(ADDRESS_NUMBER)
    ) cmd_addr_i(
        .ddr3_a   (ddr3_a[ADDRESS_NUMBER-1:0]), // output address ports (14:0) for 4Gb device
        .ddr3_ba  (ddr3_ba[2:0]),  // output bank address ports
        .ddr3_we  (ddr3_we),       // output WE port
        .ddr3_ras (ddr3_ras),      // output RAS port
        .ddr3_cas (ddr3_cas),      // output CAS port
        .ddr3_cke (ddr3_cke),      // output Clock Enable port
        .ddr3_odt (ddr3_odt),      // output ODT port,
        .cmd_slot (cmd_slot),
        .clk      (clk),           // free-running system clock, same frequency as iclk (shared for R/W)
        .clk_div  (clk_div),       // free-running half clk frequency, front aligned to clk (shared for R/W)
        .rst      (rst),           // reset delays/serdes
        .in_a     (in_a[2*ADDRESS_NUMBER-1:0]), // input address, 2 bits per signal (first, second) (29:0) for 4Gb device
        .in_ba    (in_ba[5:0]),    // input bank address, 2 bits per signal (first, second)
        .in_we    (in_we[1:0]),    // input WE, 2 bits (first, second)
        .in_ras   (in_ras[1:0]),   // input RAS, 2 bits (first, second)
        .in_cas   (in_cas[1:0]),   // input CAS, 2 bits (first, second)
        .in_cke   (in_cke[1:0]),   // input CKE, 2 bits (first, second)
        .in_odt   (in_odt[1:0]),   // input ODT, 2 bits (first, second)
    //  .in_tri   (in_tri[1:0]),   // tristate command/address outputs - same timing, but no odelay
        .in_tri   (in_tri),        // tristate command/address outputs - same timing, but no odelay
        .dly_data (dly_data[7:0]), // delay value (3 LSB - fine delay)
        .dly_addr (dly_addr[4:0]), // select which delay to program
        .ld_delay (ld_cmda),       // load delay data to selected iodelayl (clk_div synchronous)
        .set      (set)            // clk_div synchronous set all delays from previously loaded values
    );

At this point we have already stripped off all the necessary bits from the command, as indicated in bold.
    // All addresses
    generate
        genvar i;
        for (i=0; i<ADDRESS_NUMBER; i=i+1) begin: addr_block
    //      assign decode_addr[i]=(ld_dly_addr[4:0] == i)?1'b1:1'b0;
            cmda_single #(
                .IODELAY_GRP(IODELAY_GRP),
                .IOSTANDARD(IOSTANDARD),
                .SLEW(SLEW),
                .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
                .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
            ) cmda_addr_i (
                .dq(ddr3_a[i]),             // I/O pad (appears on the output 1/2 clk_div earlier, than DDR data)
                .clk(clk),                  // free-running system clock, same frequency as iclk (shared for R/W)
                .clk_div(clk_div),          // free-running half clk frequency, front aligned to clk (shared for R/W)
                .rst(rst),
                .dly_data(dly_data_r[7:0]), // delay value (3 LSB - fine delay)
                .din({{2{in_a_r[ADDRESS_NUMBER+i]}},{2{in_a_r[i]}}}), // parallel data to be sent out
    //          .tin(in_tri_r[1:0]),        // tristate for data out (sent out earlier than data!)
                .tin(in_tri_r),             // tristate for data out (sent out earlier than data!)
                .set_delay(set_r),          // clk_div synchronous load odelay value from dly_data
                .ld_delay(ld_dly_addr[i])   // clk_div synchronous set odealy value from loaded
            );
        end
    endgenerate

Here cmda_single is applicable to a single address bit, so we need to replicate it for every bit of the address. We do that with a for-loop construct.
    // we
    cmda_single #(
        .IODELAY_GRP(IODELAY_GRP),
        .IOSTANDARD(IOSTANDARD),
        .SLEW(SLEW),
        .REFCLK_FREQUENCY(REFCLK_FREQUENCY),
        .HIGH_PERFORMANCE_MODE(HIGH_PERFORMANCE_MODE)
    ) cmda_we_i (
        .dq(ddr3_we),
        .clk(clk),
        .clk_div(clk_div),
        .rst(rst),
        .dly_data(dly_data_r[7:0]),
        .din({cmd_slot[1] ? {in_we_r[0], 1'b1} : {1'b1 , in_we_r[0]},
              cmd_slot[0] ? {in_we_r[1], 1'b1} : {1'b1 , in_we_r[1]}}),
        .tin(in_tri_r),
        .set_delay(set_r),
        .ld_delay(ld_dly_cmd[3]));

Note that, as before, our command slot is still 2 bits wide, but its meaning has changed a bit. Previously cmd_slot was to be interpreted as a number between 0 and 3, but now each memory channel has its own bit, and each channel has access to only two slots.
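To see what the din expression on the WE line does for each value of cmd_slot, the sketch below pulls it out into a small stand-alone module with a throwaway testbench. The module names we_slot_mux and we_slot_mux_tb and the wire names hi_pair and lo_pair are my own and do not exist in the design; the expression itself is copied from the listing above. The sketch only shows the resulting 4-bit pattern, and makes no claim about the order in which the serializer shifts those bits out.

```verilog
// Hypothetical helper: isolates the din expression used for the WE line.
module we_slot_mux (
    input  wire [1:0] cmd_slot,  // one select bit per pair of slots
    input  wire [1:0] in_we_r,   // WE value for each pair (active low)
    output wire [3:0] din        // 4 slots handed to the serializer
);
    // Each pair carries the WE value in one slot and an idle 1 in the other.
    wire [1:0] hi_pair = cmd_slot[1] ? {in_we_r[0], 1'b1} : {1'b1, in_we_r[0]};
    wire [1:0] lo_pair = cmd_slot[0] ? {in_we_r[1], 1'b1} : {1'b1, in_we_r[1]};
    assign din = {hi_pair, lo_pair};
endmodule

// Throwaway testbench: WE asserted (0) for both pairs, all cmd_slot values.
module we_slot_mux_tb;
    reg  [1:0] cmd_slot;
    reg  [1:0] in_we_r;
    wire [3:0] din;

    we_slot_mux dut (.cmd_slot(cmd_slot), .in_we_r(in_we_r), .din(din));

    initial begin
        in_we_r = 2'b00;
        cmd_slot = 2'b00; #1 $display("cmd_slot=%b din=%b", cmd_slot, din); // din=1010
        cmd_slot = 2'b01; #1 $display("cmd_slot=%b din=%b", cmd_slot, din); // din=1001
        cmd_slot = 2'b10; #1 $display("cmd_slot=%b din=%b", cmd_slot, din); // din=0110
        cmd_slot = 2'b11; #1 $display("cmd_slot=%b din=%b", cmd_slot, din); // din=0101
        $finish;
    end
endmodule
```

In every case each pair contains exactly one 0 and one idle 1, so each bit of cmd_slot only moves the command between the two slots of its own pair and never into the other pair.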
    ROW_CMD: begin
        if (edge_count == 9) begin
            test_cmd <= 32'h000001ff;
            phy_rcw_pos_2 <= 2;
        end else begin
            test_cmd <= 32'h000005ff;
            phy_rcw_pos_2 <= 7;
        end
        cmd_slot <= 0;
        if (edge_count == 8) begin
            state <= COL_CMD;
            test_cmd <= {1'b0, 8'b0, cmd_address[15:10], 1'b0, 16'h21fd};
        end
    end
    COL_CMD: begin
        state <= WAIT_READ_WRITE_0;
        test_cmd <= {1'b0, 4'b0, {cmd_address[9:3], map_address[2:0]}, 1'b0, 4'h1, (write_out ? 2'b11 : 2'b00), 10'h1fd};
        cmd_slot <= 1;
        mem_channel <= 0;
        data_in <= {8{cmd_data_out}};
        do_write <= write_out;
    end
    WAIT_READ_WRITE_0: begin
        state <= WAIT_READ_WRITE_1;
        dq_tri <= do_write ? 0 : 15;
        cmd_slot <= 0;
        test_cmd <= do_write ? 32'h000005ff : 32'h000001ff;
    end
    WAIT_READ_WRITE_1: begin
        state <= WAIT_READ_WRITE_2;
    end
    WAIT_READ_WRITE_2: begin
        test_cmd <= 32'h000001ff;
        phy_rcw_pos_2 <= 3;
        state <= PRECHARGE_AFTER_WRITE;
    end
    PRECHARGE_AFTER_WRITE: begin
        data_in <= {8{16'h8888}};
        dq_tri <= 0;
        phy_rcw_pos_2 <= 4;
        mem_channel <= 1;
        state <= POST_PRECHARGE;
        cmd_slot <= 3;
        test_cmd <= 32'h000029fd;
    end
    POST_PRECHARGE: begin
        cap_value <= data_out;
        state <= ROW_CMD;
        phy_rcw_pos_2 <= 7;
        test_cmd <= 32'h000005ff;
    end

I have bolded the parts that are required to perform memory operations for the second channel. For now I have hardcoded a write operation for the second channel, writing the hex value 8888 to a particular memory location in bank 1, every time it is the turn of the second memory controller.
- Bit 0: Write Enable
- Bit 1: CAS
- Bit 2: RAS
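As a rough illustration of how three such bits select a command, the sketch below decodes them using the standard DDR3 command truth table. It assumes the bits carry the actual pin levels of the active-low RAS#, CAS# and WE# signals, packed in the order listed above, with the chip select asserted. If the design stores them with inverted polarity, every constant flips. The module name ddr3_cmd_decode and its port names are hypothetical.

```verilog
// Hypothetical decoder: rcw = {RAS#, CAS#, WE#} as pin levels (active low).
// Bit 2 = RAS, bit 1 = CAS, bit 0 = Write Enable, chip select assumed low.
module ddr3_cmd_decode (
    input  wire [2:0] rcw,
    output wire       is_nop,
    output wire       is_activate,
    output wire       is_read,
    output wire       is_write,
    output wire       is_precharge
);
    localparam CMD_PRECHARGE = 3'b010;  // RAS low, WE low
    localparam CMD_ACTIVATE  = 3'b011;  // RAS low only
    localparam CMD_WRITE     = 3'b100;  // CAS low, WE low
    localparam CMD_READ      = 3'b101;  // CAS low only
    localparam CMD_NOP       = 3'b111;  // nothing asserted

    assign is_nop       = (rcw == CMD_NOP);
    assign is_activate  = (rcw == CMD_ACTIVATE);
    assign is_read      = (rcw == CMD_READ);
    assign is_write     = (rcw == CMD_WRITE);
    assign is_precharge = (rcw == CMD_PRECHARGE);
endmodule
```

Read and Write differ only in bit 0, which is why a single Write Enable bit is enough to turn a column read into a column write.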