The Ice-V: a simple, compact RISC-V RV32I implementation in Silice

TL;DR A small CPU design that can come in handy, a detailed code walkthrough, a good place to start learning about both Silice and RISC-V.

Please note: The text likely needs more polish, please send suggestions!

What is this?

The Ice-V is a processor that implements the RISC-V RV32I specification. It is simple and compact (~100 lines when reduced, see image below), demonstrates many features of Silice and can be a good companion in projects. It is specialized to execute code from BRAM, where the code is baked into the BRAM upon synthesis (this can be a boot loader that later loads from other sources).

It is easily hackable and could be extended to boot from SPI, execute code from a RAM, and connect to various peripherals. The example drives LEDs and an external SPI screen.

The version here runs out of the box on the IceStick ice40 1HK, and can be adapted to other boards with minimal effort.

Features

implements the RV32I specifications
runs code compiled with gcc RISC-V (build scripts included)
executes instructions in 3 cycles, load/store in 4
less than 1K LUTs
validates at around 65 MHz on the IceStick
< 300 lines of commented code (~100 lines compacted)
1 bit per cycle shifter
32 bits RDCYCLE
comes with a DooM fire demo 😉

The entire processor code

Running the design

The build is performed in two steps, first compile some code for the processor to run:

From projects/ice-v (this directory) run:

./compile_c.sh tests/c/test_leds.c

Plug your board into the computer for programming and, from the project folder, run:

On an IceStick the LEDs will blink around the center one in a rotating pattern.

You can also simulate the design with:

./compile_c.sh tests/c/test_leds_simul.c
make verilator

The console will output the LEDs pattern until you press CTRL+C to interrupt
the simulation.

LEDs: 00001
LEDs: 00010
LEDs: 00100
LEDs: 01000
LEDs: 00001
...

Optionally you can plug in a small OLED screen (I used this one, 128×128 RGB with SSD1351 driver).

The pinout for the IceStick is:

IceStick	OLED
PMOD10 (pin 91)	din
PMOD9 (pin 90)	clk
PMOD8 (pin 88)	cs
PMOD7 (pin 87)	dc
PMOD1 (pin 78)	rst

Equipped with this, you can test the DooM fire or the starfield demos.

For the DooM fire:

./compile_c.sh tests/c/fire.c
make icestick -f Makefile.oled

Note: Compiling code for the processor requires a RISC-V toolchain. Under Windows, this is included in the binary package from my fpga-binutils repo. Under macOS and Linux there are precompiled packages, or you may prefer to compile from source. See getting
started for more detailed instructions.

The Ice-V design: code walkthrough

Now that we have tested the Ice-V let's dive into the code! The entire processor fits in less than 300 lines of Silice code (~130 without comments).

A RISC-V processor is surprisingly simple! This is also a good opportunity to discover some Silice syntax and features.

The processor is in file ice-v.ice. For the demos, it is included in
a minimalist SOC in file ice-v-soc.ice.

The processor is made of two algorithms:

algorithm execute is responsible for splitting a 32 bit instruction just read from memory into information used by the rest of the processor (decoder), as well as performing all integer arithmetic (ALU): add, sub, shifts, bitwise operators, etc.
algorithm rv32i_cpu is the main processor loop. It fetches instructions from memory, reads registers, sets up the decoder and ALU with this data, performs additional memory load/stores as required, and stores results in registers.

Let's start with a high level overview of the processor loop in algorithm rv32i_cpu.

Processor loop

We will skip everything at the beginning (we will come back to that when necessary)
and focus on the infinite loop that executes instructions. It has the following structure:

while (1) {

    // 1. - an instruction just became available
    //    - setup register read

++: // wait for registers to be read (1 cycle)

    // 2. - register data is available
    //    - trigger ALU

    while (1) { // decode + ALU while entering the loop (1 cycle)

        // results from decoder and ALU available

        if (exec.load | exec.store) {   

            // 4. - setup load/store RAM address
            //    - enable memory store?

++: // wait for memory transaction (1 cycle)
            
            // 5. - write loaded data to register?
            //    - restore next instruction address
            
            break; // done
            // next instruction read while looping back (1 cycle)
        
        } else {

            // 6. - store result of instruction in register
            //    - setup next instruction address

            if (exec.working == 0) { // ALU done?
                break; // done
                // next instruction read while looping back (1 cycle)
            }
        }
    }
}

The loop structure is constructed such that most instructions take three cycles, with load/store requiring an additional cycle. It also allows waiting for the ALU, which sometimes needs multiple cycles (shifts proceed one bit per cycle).
Silice has precise rules on how cycles are used in control flow (while/break/if/else), which allows us to write the loop so that no cycles are wasted.

A new instruction comes in

Let's go through this step by step. The first while (1) is the main processor loop. At the start of the iteration (marker 1. above) an instruction is available from memory, either from the boot address at startup, or from the previous iteration setup.
We first copy the data read from memory into a local instr variable, so that we
are free to do other memory transactions. We also copy the memory address from which
the instruction came into a variable called pc for program counter.

// data is now available
instr           = mem.rdata;
pc              = mem.addr;

Registers

Storing the instruction in instr will also update the values read from the registers.
This is done in the always_after block, which specifies things to be done every cycle
and after everything else. The always_after block contains these two lines
setting up the register read from instr:

//               vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv ignore for now
xregsA.addr    = /* xregsA.wenable ? exec.write_rd : */ Rtype(instr).rs1;
xregsB.addr    = /* xregsA.wenable ? exec.write_rd : */ Rtype(instr).rs2;

The registers are stored in two BRAMs, xregsA and xregsB. By setting their
addr field we know that the values of the registers will be in their rdata
field at the next clock cycle. This is why we wait for one cycle with
a ++: after marker 1.

The Rtype(instr).rs1 syntax is using the bitfield declared at the top of the file:

// bitfield for easier decoding of instructions
bitfield Rtype { uint1 unused1, uint1 sign, uint5 unused2, uint5 rs2, 
                 uint5 rs1,     uint3 op,   uint5 rd,      uint7 opcode}

Writing Rtype(instr).rs1 is the same as instr[15,5] (5 bits width from bit 15), but in an
easier to read/modify format.
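The equivalence between a bitfield access and a bit slice can be sketched in software. The following is a minimal Python model, not Silice: the helper names `bits` and `rtype_fields` are hypothetical, and the field positions assume the bitfield lists its members from the most significant bit down, which is what the instr[15,5] equivalence above suggests.

```python
def bits(value, first, width):
    """Model of the slice value[first,width]: `width` bits starting at bit `first`."""
    return (value >> first) & ((1 << width) - 1)

def rtype_fields(instr):
    """Split a 32-bit instruction following the Rtype layout (assumed MSB first)."""
    return {
        "opcode": bits(instr, 0, 7),
        "rd":     bits(instr, 7, 5),
        "op":     bits(instr, 12, 3),
        "rs1":    bits(instr, 15, 5),
        "rs2":    bits(instr, 20, 5),
        "sign":   bits(instr, 30, 1),
    }

# 0x002081B3 is the standard RV32I encoding of: add x3, x1, x2
fields = rtype_fields(0x002081B3)
print(fields)
```

In hardware these slices are only wires, so naming them through a bitfield costs nothing.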

The reason we use two BRAMs is to be able to read two registers in a single cycle.
So these two BRAMs always contain the same values, but can be read independently.

The BRAMs are instantiated at the beginning of the processor:

bram int32 xregsA[32] = {pad(0)}; bram int32 xregsB[32] = {pad(0)};

pad(0) fills the arrays with zeros.

xregsA and xregsB are always written to together, so they hold the same values.
This is also done in the always_after block:

// write back data to both register BRAMs
xregsA.wdata   = write_back;      xregsB.wdata   = write_back;     
// xregsB written when xregsA is
xregsB.wenable = xregsA.wenable; 
// write to write_rd, else track instruction register
xregsA.addr    = xregsA.wenable ? exec.write_rd : Rtype(instr).rs1;
xregsB.addr    = xregsA.wenable ? exec.write_rd : Rtype(instr).rs2;

Both BRAM wdata fields are set to the same write_back value and xregsB.wenable tracks
xregsA.wenable. Finally, when their wenable field is 1 they both write to the
same addr given by exec.write_rd. This ensures both BRAMs always contain
the same values.

Triggering decoder and ALU

After this setup, we wait for one cycle (++: ) for the register values to be available at the BRAM outputs. Once the register values are available (marker 2.), the decoder and ALU will start refreshing from these updated values. Both decoder and ALU are grouped in the second
algorithm called execute. The outputs of execute are accessed with the 'dot'
syntax: exec.name_of_output.

Algorithm execute is instantiated within the processor as follows:

// decoder + ALU, executes the instruction and tells processor what to do
execute exec(
    instr <:: instr, pc <:: pc, xa <: xregsA.rdata, xb <: xregsB.rdata
);

The algorithm receives the instruction instr, the program counter pc, and the
register values xregsA.rdata and xregsB.rdata. These are bound to the
algorithm's inputs with the wiring operators <:: and <: . There is an important
difference between the two, related to timing. Operator <:: means that the variable is wired such
that the instantiated algorithm execute sees its value before it is modified
by the host algorithm rv32i_cpu during the cycle. Therefore execute does not
immediately see the change when both instr and pc are assigned at marker 1.,
but only sees the change at the next cycle (after the ++: ). This creates a one
cycle latency, but also makes the circuit less deep, leading to an increased max
frequency for the design. These are important tradeoffs to play with in your designs.

Back to marker 2., as soon as the register data is available, data flows
through execute and we have nothing specific to do. However, the ALU part of execute
needs to be told that it should trigger its computations at this specific cycle. This is done by pulsing exec.trigger to 1.

For all details on the (important!) topic of algorithm bindings and timings please refer to the dedicated page.

The operations loop

Then we enter a second while(1) loop. In many cases we will break out of this second loop after just one cycle, but sometimes the ALU needs to work over multiple cycles, so the loop allows us to wait. Entering a loop takes one cycle, so while we enter the loop data flows through execute and its outputs are ready when we are in the loop.

In the loop we distinguish two cases: either a load/store has to be performed if (exec.load | exec.store) or else another instruction is running. Let's first consider the second case (marker 6.). A non load/store instruction ran through the decoder and ALU.

Other than load/store

First, we consider writing the instruction result to a register. This is done with the following code:

// store result in register
xregsA.wenable = ~exec.no_rd;

Recall xregsA is a BRAM holding register values. Its wenable field indicates whether we write (1) or read (0). Here, it will be enabled if the decoder output exec.no_rd is low. But that seems a bit short? For instance, where do we tell what to write?
This is in fact done in the always_after block, as we have seen earlier (see Registers section above). The data to be written is set with:

xregsA.wdata   = write_back;      xregsB.wdata   = write_back;

This explains why we do not need to set it again when writing the result of the instruction to the register.

But why do that? Why not simply write this code in 6. alongside the rest? This is for efficiency, both in terms of circuit size and frequency. If the assignment was in 6. a more complex circuit would be generated to ensure it is only done in this specific state. This would require a more complex multiplexer circuit, and therefore it is best to always blindly set this value in the always_after block. As long as we do not set xregsA.wenable = 1 nothing gets written anyway. This is an important aspect of efficient hardware design, and by carefully avoiding unnecessary conditions your designs will be made far more efficient. Please also refer to Silice design guidelines.

So what is the value of write_back? It is defined with the following code:

// what do we write in register? (pc, alu or val, load is handled separately)
// 'or trick' from femtorv32
int32 write_back <: (exec.jump      ? (next_pc<<2)        : 32b0)
                  | (exec.storeAddr ? exec.n[0,$addrW+2$] : 32b0)
                  | (exec.storeVal  ? exec.val            : 32b0)
                  | (exec.load      ? loaded              : 32b0)
                  | (exec.intop     ? exec.r              : 32b0);

write_back <: ... defines an expression tracker: the read-only variable write_back
is an alias to the expression given in its definition (a wire in Verilog terms).
write_back gives the value to write back based on the decoder outputs.
exec.storeAddr indicates to write back the address computed by the ALU in exec.n (AUIPC).
exec.storeVal indicates to write back the value exec.val from the decoder (LUI or RDCYCLE).
exec.jump indicates to write back next_pc<<2 (JAL, JALR, conditional branches taken). The shift transforms the 32-bit instruction pointer into a byte address.
exec.intop indicates to write back the result exec.r of an integer operation.
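The 'or trick' can be illustrated with a small software model. This is a Python sketch under stated assumptions, not the Silice circuit: the function name `write_back` and the `sel` dictionary are illustrative, all values are simply masked to 32 bits (the real code slices exec.n to $addrW+2$ bits), and it assumes at most one select signal is high at a time.

```python
MASK32 = 0xFFFFFFFF

def write_back(sel, next_pc, n, val, loaded, r):
    """Each candidate is gated by its one-bit select, then all are ORed together.
    With at most one select high this behaves like a multiplexer."""
    return ((((next_pc << 2) & MASK32) if sel.get("jump") else 0)
            | ((n & MASK32) if sel.get("storeAddr") else 0)
            | ((val & MASK32) if sel.get("storeVal") else 0)
            | ((loaded & MASK32) if sel.get("load") else 0)
            | ((r & MASK32) if sel.get("intop") else 0))

# Hypothetical candidate values, for illustration only
candidates = dict(next_pc=0x10, n=0x200, val=0x12345000, loaded=0xAB, r=42)
print(hex(write_back({"intop": 1}, **candidates)))
```

When no select is high the result is zero, which is harmless because nothing is written unless wenable is set.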

Alright! the register is updated. Back to marker 6. Next we set the address of the next instruction to fetch and execute:

// next instruction address
wide_addr      = exec.jump ? (exec.n >> 2) : next_pc;

This is either the address computed from the ALU in case of a jump/branch
as indicated by exec.jump, or the value of next_pc which is simply pc + 1:
the instruction following the current one.

Almost done, but first we have to check whether the ALU is not in a multi-cycle
operation.
This is why we only break if (exec.working == 0).
If not, the loop iterates again, waiting for the ALU.
Note that 6. will be visited again, so we will write again to the register. And yes, if the
ALU is not yet done the write we did before might be an incorrect value. But that is
all fine: the result will be correct at the last iteration, and it costs us nothing
to do these writes. In fact it costs us less because not doing them would again
require more multiplexer circuitry!

When we break, it takes one cycle to go back to the start of the loop. During this
cycle the next instruction is read from mem and the result (if any) is written to
the register in xregsA.

You may have noticed that we wrote the next address in wide_addr while the
memory interface is mem, so shouldn't we have written to mem.addr? This is to allow
the SOC to see a wider address bus and do memory mapping. The address we set
in wide_addr is assigned to mem.addr in the always_after block, that is
applied at the end of every cycle: mem.addr = wide_addr. It is also output
from the algorithm to the SOC: output! uint12 wide_addr(0) where output! means
the SOC immediately sees changes to wide_addr.

Load/store

That is it for non load/store instructions. Now let us go back to if (exec.load | exec.store)
and see how a load/store is handled. Since the Ice-V is specialized for BRAM, we
know all memory transactions take a single cycle. While we will have to account
for this cycle, this is a big luxury compared to having to wait for an unknown
number of cycles for an external memory controller.

When reaching marker 4. we first set up the address of the load/store. This address
comes from the ALU:

// memory address from which to load/store
wide_addr = exec.n >> 2;

The shift by two is due to the fact that computed addresses are in bytes, while
the memory interface addresses are in 32-bit words.

Then, this is either a store or a load. If this is a store, we need to enable
writing to memory. The memory is called mem and is a BRAM, given to the CPU: algorithm rv32i_cpu( bram_port mem, ... ). The BRAM holds 32-bit words at each address.
To enable writing we set its wenable member. However this BRAM has a specificity:
it allows a write mask. So wenable is not a single bit, but four bits, which
makes it possible to selectively write any of the four bytes at each memory address.

And we need that! The RISC-V RV32I specification features load/store for bytes,
16-bit and 32-bit words. This means that depending on the instruction (SB/SH/SW)
we need to set up the write mask appropriately. This is done with this code:

// == Store (enabled below if exec.store == 1)
// build write mask depending on SB, SH, SW
mem.wenable = ({4{exec.store}} & { { 2{exec.op[0,2]==2b10} },
                                       exec.op[0,1] | exec.op[1,1], 1b1 
                                } ) << exec.n[0,2];

This may seem a bit cryptic but what this does is produce a write mask of the form 4b0001, 4b0010, 4b0100, 4b1000 (SB) or 4b0011, 4b1100 (SH) or 4b1111 (SW) depending on exec.op[0,2] (access width) and exec.n[0,2] (address lowest bits).
As this may not be a store after all, an AND between the mask and exec.store is
applied. The syntax {4{exec.store}} means that the bit exec.store is replicated
four times to form a uint4.
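To make the mask expression less cryptic, here is a minimal Python model of it. This is a sketch rather than the hardware itself: the function name `write_mask` is hypothetical, and the final truncation to four bits is assumed to match the assignment into the four-bit wenable field.

```python
def write_mask(store, op, addr):
    """4-bit byte-enable mask for SB/SH/SW, mirroring the expression term by term."""
    width = op & 0b11                              # exec.op[0,2]: 00=SB, 01=SH, 10=SW
    upper = 0b11 if width == 0b10 else 0b00        # {2{exec.op[0,2]==2b10}}
    bit1 = (op & 1) | ((op >> 1) & 1)              # exec.op[0,1] | exec.op[1,1]
    base = (upper << 2) | (bit1 << 1) | 1          # 0001 (SB), 0011 (SH), 1111 (SW)
    enable = 0b1111 if store else 0b0000           # {4{exec.store}}
    return ((enable & base) << (addr & 0b11)) & 0b1111

# Byte store to the third byte of a word, half-word store to the upper half
print(format(write_mask(1, 0b000, 2), "04b"), format(write_mask(1, 0b001, 2), "04b"))
```

The base pattern encodes the access width and the shift moves it onto the addressed bytes.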

Next we wait one cycle for the memory transaction to occur in BRAM with ++: . If
this was a store we just wrote and we are done when we reach marker 5.

If this was a load we just read from memory and still need to store the result
in the selected register. This is done by this code:

// == Load (enabled below if exec.load == 1)
// commit result
xregsA.wenable = ~exec.no_rd;

This is enough to trigger the register update, since xregsA.wdata and
xregsA.addr are correctly set up afterwards in the always_after block where
wdata is assigned write_back.
Recall that write_back selects loaded when exec.load is high (see
definition of write_back above). loaded is defined as follows:

// decodes values loaded from memory (used when exec.load == 1)
uint32 aligned <: mem.rdata >> {exec.n[0,2],3b000};
switch ( exec.op[0,2] ) { // LB / LBU, LH / LHU, LW
    case 2b00:{ loaded = {{24{(~exec.op[2,1])&aligned[ 7,1]}},aligned[ 0,8]}; }
    case 2b01:{ loaded = {{16{(~exec.op[2,1])&aligned[15,1]}},aligned[ 0,16]};}
    case 2b10:{ loaded = aligned;   }
    default:  { loaded = {32{1bx}}; } // don't care (does not occur)
}

This selects the loaded value depending on whether a byte (LB/LBU), 16 bits (LH/LHU) or 32 bits (LW) were accessed (U indicates unsigned). mem.rdata is the value straight out of memory, and
it is shifted in aligned to select the part chosen by the address lowest bits exec.n[0,2].

Note that {exec.n[0,2],3b000} is simply exec.n[0,2] << 3 (a left shift by three bits is the same as concatenating three 0 bits to the right).
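The load decoding can be checked with a small software model. The following Python sketch mirrors the switch above; the function name `decode_load` is hypothetical, and the don't-care default case is modeled as an error since it has no defined value.

```python
def decode_load(rdata, op, addr):
    """Model of `loaded`: align the word on the addressed byte, then extend."""
    aligned = (rdata >> ((addr & 0b11) << 3)) & 0xFFFFFFFF   # shift by 0, 8, 16 or 24
    unsigned = (op >> 2) & 1                                 # exec.op[2,1]: LBU / LHU
    width = op & 0b11                                        # exec.op[0,2]
    if width == 0b00:                                        # LB / LBU
        sign = (aligned >> 7) & 1
        return (aligned & 0xFF) | (0xFFFFFF00 if (sign and not unsigned) else 0)
    if width == 0b01:                                        # LH / LHU
        sign = (aligned >> 15) & 1
        return (aligned & 0xFFFF) | (0xFFFF0000 if (sign and not unsigned) else 0)
    if width == 0b10:                                        # LW
        return aligned
    raise ValueError("don't care case, does not occur for valid loads")

word = 0x12348056
print(hex(decode_load(word, 0b000, 1)), hex(decode_load(word, 0b100, 1)))
```

Sign extension replicates the top bit of the loaded byte or half-word unless the unsigned variant is used.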

After the load/store is done we restore the next instruction address next_pc,
so that the processor is ready to proceed with the next iteration after the break:

// restore address to program counter
wide_addr      = next_pc;
// exit the operations loop
break;

And that's it! We have seen the entire processor logic. Let's now dive into some
of the other components.

The decoder

The decoder is part of algorithm execute. It is a relatively simple affair.
It starts by decoding all
the possible immediate values: these are constants encoded in the different
types of instructions.

// decode immediates
int32 imm_u  <: {instr[12,20],12b0};
int32 imm_j  <: {{12{instr[31,1]}},instr[12,8],instr[20,1],instr[21,10],1b0};
int32 imm_i  <: {{20{instr[31,1]}},instr[20,12]};
int32 imm_b  <: {{20{instr[31,1]}},instr[7,1],instr[25,6],instr[8,4],1b0};
int32 imm_s  <: {{20{instr[31,1]}},instr[25,7],instr[7,5]};

These values are only used when the matching instruction executes. For instance
imm_i is used in register-immediate integer operations.
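The bit shuffling of the immediates is easier to follow with concrete encodings. Below is a minimal Python model of the five expressions above. The helpers `bits`, `sext` and `immediates` are hypothetical names, results are returned as signed Python integers for readability, and the sample encodings in the comments are standard RV32I encodings.

```python
def bits(value, first, width):
    """Model of value[first,width]."""
    return (value >> first) & ((1 << width) - 1)

def sext(value, width):
    """Interpret a `width`-bit value as a signed integer."""
    return value - (1 << width) if value & (1 << (width - 1)) else value

def immediates(instr):
    s = bits(instr, 31, 1)  # sign bit, replicated in the upper bits by the hardware
    return {
        "imm_u": sext(bits(instr, 12, 20) << 12, 32),
        "imm_j": sext((s << 20) | (bits(instr, 12, 8) << 12) | (bits(instr, 20, 1) << 11)
                      | (bits(instr, 21, 10) << 1), 21),
        "imm_i": sext(bits(instr, 20, 12), 12),
        "imm_b": sext((s << 12) | (bits(instr, 7, 1) << 11) | (bits(instr, 25, 6) << 5)
                      | (bits(instr, 8, 4) << 1), 13),
        "imm_s": sext((bits(instr, 25, 7) << 5) | bits(instr, 7, 5), 12),
    }

# addi x1, x0, -1 is encoded as 0xFFF00093
print(immediates(0xFFF00093)["imm_i"])
```

All five values are computed for every instruction, and the decoder outputs decide which one is actually used.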

The next part checks the opcode and sets a boolean for each possible instruction:

uint5 opcode    <: instr[ 2, 5];
uint1 AUIPC     <: opcode == 5b00101;  uint1 LUI    <: opcode == 5b01101;
uint1 JAL       <: opcode == 5b11011;  uint1 JALR   <: opcode == 5b11001;
uint1 IntImm    <: opcode == 5b00100;  uint1 IntReg <: opcode == 5b01100;
uint1 Cycles    <: opcode == 5b11100;  uint1 branch <: opcode == 5b11000;

These are of course mutually exclusive, so only one of them is 1 at a given
cycle.

Finally we set the decoder outputs, telling the processor what to do with the instruction.

// ==== set decoder outputs depending on incoming instructions
// load/store?
load         := opcode == 5b00000;   store        := opcode == 5b01000;   
// operator for load/store           // register to write to?
op           := Rtype(instr).op;     write_rd     := Rtype(instr).rd;    
// do we have to write a result to a register?
no_rd        := branch  | store  | (Rtype(instr).rd == 5b0);
// integer operations                // store next address?
intop        := (IntImm | IntReg);   storeAddr    := AUIPC;  
// value to store directly           // store value?
val          := LUI ? imm_u : cycle; storeVal     := LUI     | Cycles;

The always assign operator := used on outputs means that
the output is set to this value first thing every cycle (this is a shortcut
equivalent to a normal assignment = in an always_before block).

For instance write_rd := Rtype(instr).rd is the index of the destination
register for the instruction, while no_rd := branch | store | (Rtype(instr).rd == 5b0)
indicates whether the write to the register is enabled or not.

Note the condition Rtype(instr).rd == 5b0 in no_rd. This is because
register zero, as per the RISC-V spec, should always remain zero.

The ALU

The ALU performs all integer computations. It consists of three parts. The
integer operations such as ADD, SUB, SLLI, SRLI, AND, XOR (output r) ;
the comparator for conditional branches (output jump) ; the next address adder (output n).

Because of the way the data flow is set up we can use a nice trick. The ALU as well
as the comparator take two integers for their operations. The setup of the Ice-V
is such that both can input the same integers, so they can share the same circuits
to perform similar operations. And what is common to <,<=,>,>=? They
can all be done with a single subtraction! This trick is implemented as follows:

// ==== allows to do subtraction and all comparisons with a single adder
// trick from femtorv32/swapforth/J1
int33 a_minus_b <: {1b1,~b} + {1b0,xa} + 33b1;
uint1 a_lt_b    <: (xa[31,1] ^ b[31,1]) ? xa[31,1] : a_minus_b[32,1];
uint1 a_lt_b_u  <: a_minus_b[32,1];
uint1 a_eq_b    <: a_minus_b[0,32] == 0;

xa is the first register, while b is selected beforehand based on results from the decoder:

// ==== select ALU second input 
int32 b         <: regOrImm ? (xb) : imm_i;

The choice is made by this line in the decoder:

uint1 regOrImm  <: IntReg  | branch;
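The single-adder trick can be verified with a short software model. This Python sketch reproduces the 33-bit computation of a_minus_b and the three comparison bits derived from it; the function name `compare` and the returned dictionary are illustrative choices, not part of the Silice source.

```python
MASK32 = 0xFFFFFFFF
MASK33 = (1 << 33) - 1

def compare(a, b):
    """Model of a_minus_b = {1b1,~b} + {1b0,xa} + 33b1 and the derived flags."""
    a &= MASK32
    b &= MASK32
    a_minus_b = (((1 << 32) | (~b & MASK32)) + a + 1) & MASK33
    borrow = (a_minus_b >> 32) & 1            # bit 32 is set exactly when a < b (unsigned)
    sign_a, sign_b = a >> 31, b >> 31
    lt = sign_a if (sign_a ^ sign_b) else borrow   # signs differ: the negative one is smaller
    return {
        "diff": a_minus_b & MASK32,           # a - b on 32 bits, reused for SUB
        "lt": lt,
        "ltu": borrow,
        "eq": int((a_minus_b & MASK32) == 0),
    }

print(compare(3, 5))
```

Adding the one's complement of b plus one is a two's complement subtraction, and the extra 33rd bit captures the borrow used by the unsigned comparison.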

Similarly, the next address adder selects its first input based on the decoder
indications:

// ==== select next address adder first input
int32 addr_a    <: pcOrReg ? __signed({pc[0,$addrW-2$],2b0}) : xa;

For instance, instructions AUIPC, JAL and branch will select the program
counter pc for addr_a, as can be seen in the decoder:

uint1 pcOrReg   <: AUIPC   | JAL    | branch;

The second value in the next address computation is an immediate selected based
on the running instruction:

// ==== select immediate for the next address computation
int32 addr_imm  <: (AUIPC  ? imm_u : 32b0) | (JAL         ? imm_j : 32b0)
                |  (branch ? imm_b : 32b0) | ((JALR|load) ? imm_i : 32b0)
                |  (store  ? imm_s : 32b0);

The next address is then simply the sum of addr_a and the immediate: n = addr_a + addr_imm.

The comparator and most of the ALU are switch cases returning the selected
computation from op := Rtype(instr).op.
For the comparator:

// ====================== Comparator for branching
switch (op[1,2]) {
    case 2b00:  { j = a_eq_b;  } /*BEQ */ case 2b10: { j=a_lt_b;} /*BLT*/ 
    case 2b11:  { j = a_lt_b_u;} /*BLTU*/ default:   { j = 1bx; }
}
jump = (JAL | JALR) | (branch & (j ^ op[0,1]));
//                                   ^^^^^^^ negates comparator result
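How the XOR with op[0,1] covers the negated branches (BNE, BGE, BGEU) can be sketched as follows. This is a hypothetical Python model: the function `take_jump` and its argument names are illustrative, and the don't-care default is modeled as 0 for simplicity.

```python
def take_jump(op, is_branch, is_jal, is_jalr, eq, lt, ltu):
    """Model of: jump = (JAL | JALR) | (branch & (j ^ op[0,1]))."""
    sel = (op >> 1) & 0b11            # op[1,2] selects the comparison
    if sel == 0b00:
        j = eq                        # BEQ / BNE
    elif sel == 0b10:
        j = lt                        # BLT / BGE
    elif sel == 0b11:
        j = ltu                       # BLTU / BGEU
    else:
        j = 0                         # don't care in hardware, 0 chosen here
    negate = op & 1                   # op[0,1] is set for BNE, BGE, BGEU
    return int(bool(is_jal or is_jalr or (is_branch and (j ^ negate))))

# BNE (op = 3b001) with unequal operands is taken
print(take_jump(0b001, 1, 0, 0, eq=0, lt=0, ltu=0))
```

Only three comparisons are computed, and the lowest bit of op turns each of them into its opposite.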

For the integer arithmetic:

// all ALU operations
switch (op) {
    case 3b000: { r = sub ? a_minus_b : xa + b; }            // ADD / SUB
    case 3b010: { r = a_lt_b; } case 3b011: { r = a_lt_b_u; }// SLTI / SLTU
    case 3b100: { r = xa ^ b; } case 3b110: { r = xa | b;   }// XOR / OR
    case 3b001: { r = shift;  } case 3b101: { r = shift;    }// SLLI/SRLI/SRAI
    case 3b111: { r = xa & b; }     // AND
    default:    { r = {32{1bx}}; }  // don't care
}

However, something special is going on for the shifts (SLLI, SRLI, SRAI). Indeed, integer shifts << and >>
can be performed in one cycle but at the expense of a large circuit (many LUTs!).
Instead, we want a compact design. So the rest of the code in the ALU describes
a shifter shifting one bit per cycle. Here it is:

int32 shift(0);
// shift (one bit per clock)
if (working) {
    // decrease shift size
    shamt = shamt - 1;
    // shift one bit
    shift = op[2,1] ? (Rtype(instr).sign ? {r[31,1],r[1,31]} 
                        : {__signed(1b0),r[1,31]}) : {r[0,31],__signed(1b0)};      
} else {
    // start shifting?
    shamt = ((aluShift & trigger) ? __unsigned(b[0,5]) : 0);
    // store value to be shifted
    shift = xa;
}
// are we still shifting?
working = (shamt != 0);

The idea is that shift is the result of shifting r by one bit
each cycle. r is updated with shift in the ALU switch case:
case 3b001: { r = shift; } case 3b101: { r = shift; }.
Initially, the shifter is not working and shift is assigned xa.
After that, shift is r shifted by one bit with proper signedness: shift = op[2,1] ? ...

shamt is the number of bits by which to shift. It starts with the amount read
from the decoder ((aluShift & trigger) ? __unsigned(b[0,5]) : 0) and then
decreases by one each cycle when working.

Note how trigger is used in the test initializing shamt and starting the shifter. This ensures the shifter only triggers at the right cycle, when exec.trigger is pulsed to 1 by the processor.

And voilà, our ALU is complete, and we have seen all important components of the processor!
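The one-bit-per-cycle shifter can be modeled as a simple loop. This Python sketch is a behavioral approximation only: the function name `shift_one_bit_per_step` is hypothetical, and the returned step count is the number of single-bit shifts performed, not a claim about the exact hardware cycle count.

```python
MASK32 = 0xFFFFFFFF

def shift_one_bit_per_step(xa, amount, op, sign):
    """Shift xa by `amount` bits, one bit per step, as the ALU shifter does.
    op bit 2 selects right (1) or left (0); sign selects arithmetic right shift."""
    r = xa & MASK32                   # shift = xa when the shifter starts
    shamt = amount & 0b11111          # b[0,5]
    steps = 0
    while shamt != 0:                 # working = (shamt != 0)
        if (op >> 2) & 1:             # right shift (SRLI / SRAI)
            top = ((r >> 31) & 1) if sign else 0
            r = (top << 31) | (r >> 1)
        else:                         # left shift (SLLI)
            r = (r << 1) & MASK32
        shamt -= 1
        steps += 1
    return r, steps

print(shift_one_bit_per_step(0x80000000, 4, 0b101, 1))
```

The circuit only ever shifts by one position, which is just rewiring, so the cost is paid in time rather than in LUTs.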

Going further

The Ice-V can serve as a nice playground. You might want to try to plug it to a RAM
that has a wait interface, turn it into a RV32IM, or experiment with other
setups of decoder and ALU. There are so many possibilities and tradeoffs!

Links

This implementation greatly benefited from other projects (see also comments
in source):

Ice-V's best friend: FemtoRV https://github.com/BrunoLevy/learn-fpga/tree/master/FemtoRV, also fits easily in less than 100 Verilog lines.
PicoRV https://github.com/cliffordwolf/picorv32
Ibex https://github.com/lowRISC/ibex
Stackoverflow post on CPU design (see answer) https://stackoverflow.com/questions/51592244/implementation-of-simple-microprocessor-using-verilog

A few other great RISC-V projects (there are many! happy to add links, let me know)

The smallest processor in the world: SERV
vexriscv
neorv32

Toolchain links:

RISC-V toolchain https://github.com/riscv/riscv-gnu-toolchain
Pre-compiled riscv-toolchain for Linux https://matthieu-moy.fr/spip/?Pre-compiled-RISC-V-GNU-toolchain-and-spike&lang=en
Homebrew RISC-V toolchain for macOS https://github.com/riscv/homebrew-riscv
