The industry needs interns trained in pipelined microarchitecture and timing, with solid coding skills
Every year, electronic companies get new interns from universities. Some interns become very useful, contributing pieces of design that eventually become silicon inside mass-market commercial devices. To reach this level, an RTL or DV intern should arrive already trained: with solid SystemVerilog coding skills, hands-on practice with pipelined designs and control flow, and some intuition about static timing analysis.
Screening and interviewing interns take time and effort from senior engineers and hiring managers. This process can be improved by reviewing the candidate’s solutions to public challenges based on open-source projects. The challenge should be generic and realistic at the same time. In this article, we propose an example of such a challenge.
The Mad Max idea: Build exercises from pieces of open-source CPUs
We at Verilog Meetup got this idea from the movie “Mad Max: Fury Road” where the post-apocalyptic survivors build cars from pieces of other vehicles:
We use the same approach: we take open-source CPUs, break them into pieces, and ask students to assemble custom computing blocks from those parts. Specifically, we took an FPU from an open-source CPU called Wally and used it to create black-box modules that implement floating-point addition, subtraction, multiplication, division, comparison and square root. We ask the students to implement blocks that compute various formulas, such as:
- Sorting a set of numbers.
- Roots of quadratic and similar equations.
- Taylor and Maclaurin series.
- 3D rotations using quaternions.
- Out-of-order FPUs with different kinds of scoreboard and Tomasulo-like approaches.
The Wally FPU is described in detail in the book RISC-V System-on-Chip Design by David Harris, James Stine, Sarah Harris and Rose Thompson. However, our exercises do not require a student to read that book. Just the opposite: we require every operation to be treated as a black box, with no attempt to cheat by converting a pipelined block with a latency into a combinational module using knowledge of the IEEE 754 floating-point format or of the underlying computation algorithm.
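To make the black-box contract concrete, here is a rough sketch of what such a wrapper looks like from the outside. The module name, the port names and the FLEN parameter are our illustration, not the exact interface produced by the repository scripts; the only thing a student may rely on is the handshake: the result appears on res together with res_vld some unknown number of cycles after arg_vld.

// A hypothetical black-box wrapper around one Wally FP operation.
// The real wrappers generated by the repository scripts may have
// different names and extra ports; treat the internals and the
// latency as unknown.
module f_mult_black_box # (parameter FLEN = 64)
(
    input               clk,
    input               rst,
    input               arg_vld,  // the operands are applied this cycle
    input  [FLEN - 1:0] a,
    input  [FLEN - 1:0] b,
    output              res_vld,  // asserted when the product is ready
    output [FLEN - 1:0] res
);
    // The body comes from the Wally FPU and is off-limits for the exercises.
endmodule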
For each formula, we expect a student to create several implementations:
- An FSM-based implementation that uses a minimal number of arithmetic blocks and makes no assumption about whether they are pipelined or not.
- An FSM-based implementation that uses a minimal number of arithmetic blocks but uses the fact that these blocks are pipelined inside. This assumption can reduce the number of FSM states.
- A pipelined implementation capable of accepting the formula arguments back-to-back, taking a new set of arguments every clock cycle, indefinitely, without gaps and without losing data.
- Optional: A pipelined implementation with flow control using the valid/ready protocol both upstream and downstream. This can be created in several ways: either by adding double buffers at each pipeline stage or by adding a FIFO and a credit counter (see the details below).
- Optional: An implementation that distributes work between several FSMs to improve bandwidth. A student should figure out the cases where such a solution is more practical than a pipelined solution.
An example of an exercise: the discriminant of a quadratic equation
Let’s illustrate this with a specific exercise, for example 03_finite_state_machines/03_08_float_discriminant/03_08_float_discriminant.sv from the systemverilog-homework GitHub repository. The task is to design a block that computes the discriminant of a quadratic equation using the formula b² - 4*a*c.
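The interface of the block looks roughly like the sketch below. This is a simplification based on the signals discussed further in this article; the actual task file in the repository defines the data width in its own way and may contain additional ports.

// A simplified interface sketch; the real task file may differ in details.
module float_discriminant # (parameter FLEN = 64)  // the width here is an assumption
(
    input                     clk,
    input                     rst,
    input                     arg_vld,
    input        [FLEN - 1:0] a,
    input        [FLEN - 1:0] b,
    input        [FLEN - 1:0] c,
    output logic              res_vld,
    output logic [FLEN - 1:0] res
);
    // The student's FSM or pipeline goes here, built from the
    // black-box FP multiplication and subtraction modules.
endmodule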
The FSM-based solutions 1 and 2 should use only one instance of the multiplication module and one instance of the subtraction module:
The waveform should look like the following image: after applying the arguments a, b and c to the module’s inputs and asserting arg_vld, you have to wait a number of clock cycles for res_vld before applying a new set of arguments.
Note: the diagram does not show the real latency of the calculation (i.e. the result does not have to be ready on the 6th clock cycle). A student has to figure out the actual latency based on the latencies of the sub-blocks and their microarchitecture.
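One possible control schedule for the FSM-based solution is sketched below. This is an illustration under our own naming, not the reference solution; the datapath registers that capture the intermediate products and the one-cycle pulses that start the multiplier and the subtractor are omitted for brevity.

// A sketch of one possible FSM schedule (not the reference solution).
// The single multiplier is reused three times (b*b, a*c, 4*(a*c)),
// then the single subtractor computes b*b - 4*a*c.
module discriminant_fsm_sketch
(
    input  logic clk,
    input  logic rst,
    input  logic arg_vld,      // a new (a, b, c) set is applied
    input  logic mul_res_vld,  // the shared multiplier has finished
    input  logic sub_res_vld,  // the shared subtractor has finished
    output logic res_vld       // the discriminant is ready
);
    enum logic [2:0]
    {
        ST_IDLE,     // wait for arg_vld
        ST_MUL_BB,   // compute b * b
        ST_MUL_AC,   // compute a * c
        ST_MUL_4AC,  // compute 4 * (a * c)
        ST_SUB       // compute (b * b) - (4 * a * c)
    }
    state, new_state;

    always_comb
    begin
        new_state = state;

        case (state)
        ST_IDLE    : if ( arg_vld     ) new_state = ST_MUL_BB;
        ST_MUL_BB  : if ( mul_res_vld ) new_state = ST_MUL_AC;
        ST_MUL_AC  : if ( mul_res_vld ) new_state = ST_MUL_4AC;
        ST_MUL_4AC : if ( mul_res_vld ) new_state = ST_SUB;
        ST_SUB     : if ( sub_res_vld ) new_state = ST_IDLE;
        endcase
    end

    always_ff @ (posedge clk)
        if (rst)
            state <= ST_IDLE;
        else
            state <= new_state;

    assign res_vld = (state == ST_SUB) & sub_res_vld;
endmodule

This schedule corresponds to variant 1: it waits for each result before issuing the next operation. In variant 2, which relies on the blocks being pipelined, b*b and a*c can be issued on consecutive cycles, which removes one waiting state.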
Unlike the FSM-based implementations, the pipelined implementation should allow data to be processed back-to-back, consuming a new set of a, b and c every clock cycle, indefinitely, without gaps and without losing data. The pipelined implementation should use three instances of the multiplier and one instance of the subtractor.
Note: the diagram does not show the real pipeline latency (i.e. the result does not have to be ready on the 6th clock cycle). A student has to figure out the actual latency based on the latencies of the sub-blocks and their microarchitecture.
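In the valid-only pipelined solution 3, the b*b product comes out earlier than 4*a*c, which passes through two multipliers in series, so the earlier result has to be delayed to stay aligned with the later one before entering the subtractor. One way is a parameterized data-plus-valid delay line like the sketch below. This is our own helper, not part of the repository; DEPTH has to match the multiplier latency the student has measured, and DEPTH >= 2 is assumed for brevity.

// A delay line to align an early result with a later one.
// Our own helper; DEPTH must equal the measured latency, DEPTH >= 2.
module delay_line # (parameter WIDTH = 64, DEPTH = 3)
(
    input                clk,
    input                rst,
    input                in_vld,
    input  [WIDTH - 1:0] in_data,
    output               out_vld,
    output [WIDTH - 1:0] out_data
);
    logic [DEPTH - 1:0]              vld_r;
    logic [DEPTH - 1:0][WIDTH - 1:0] data_r;

    always_ff @ (posedge clk)
        if (rst)
            vld_r <= '0;
        else
            vld_r <= { vld_r [DEPTH - 2:0], in_vld };

    // The data does not need a reset: only the valid bits matter
    always_ff @ (posedge clk)
        data_r <= { data_r [DEPTH - 2:0], in_data };

    assign out_vld  = vld_r  [DEPTH - 1];
    assign out_data = data_r [DEPTH - 1];
endmodule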
The pipelined implementation 4 with flow control should use the valid/ready protocol both upstream and downstream. One way to design such a module is described in the textbook Digital Design: A Systems Approach by William James Dally and R. Curtis Harting, which covers pipelines with double buffering. However, this approach is now considered old-fashioned. A better approach, based on credit counters, is described in a newer textbook, Modern System-on-Chip Design by David J. Greaves.
We recommend the credit counter-based approach. To convert the valid-only pipelined solution 3 into solution 4 with valid/ready, all you have to do is add a FIFO, a credit counter and some glue logic.
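Here is a minimal sketch of the credit counter itself, assuming the output FIFO has DEPTH entries (the names are ours): the counter starts at DEPTH, loses a credit whenever a new transaction is accepted into the pipeline (reserving a FIFO slot for its future result) and regains a credit whenever the downstream consumer pops a result from the FIFO. The upstream ready simply means "at least one credit is left".

// A minimal credit counter sketch; the names and the FIFO depth are ours.
module credit_counter # (parameter DEPTH = 8)
(
    input  logic clk,
    input  logic rst,
    input  logic push,   // a new transaction enters the pipeline
    input  logic pop,    // the downstream consumer pops the output FIFO
    output logic ready   // it is safe to accept a new transaction upstream
);
    logic [$clog2 (DEPTH + 1) - 1:0] credits;

    always_ff @ (posedge clk)
        if (rst)
            credits <= DEPTH;
        else if (push & ~ pop)
            credits <= credits - 1'b1;
        else if (pop & ~ push)
            credits <= credits + 1'b1;

    assign ready = (credits != 0);
endmodule

The FIFO guarantees that the pipeline, which has no internal back-pressure, always has a place to land its results, so no data is lost even when the downstream consumer stalls.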
For some formulas, a pipelined version is way too costly in terms of area. For other formulas, the number of operations varies too much: the computation may take from tens of clock cycles to hundreds. In such cases, a worthy approach to improve bandwidth is to run multiple FSMs and schedule new requests dynamically between them. A variant of the discriminant exercise for this approach is in 04_arithmetics_and_pipelining/04_12_float_discriminant_distributor.
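The dispatch part of such a distributor can be as simple as "start the request on the first idle engine". The sketch below shows only this dispatch logic under our own naming; it is one possible arbitration scheme, and it ignores the question of returning the results in order, which needs tagging or a reorder buffer if the interface requires it.

// Dispatch logic only: one possible scheme (ours); result reordering is not shown.
module discriminant_dispatch_sketch # (parameter N_ENGINES = 4)
(
    input  logic                   arg_vld,
    input  logic [N_ENGINES - 1:0] engine_busy,   // one bit per FSM engine
    output logic [N_ENGINES - 1:0] engine_start,  // one-hot start pulse
    output logic                   ready          // at least one engine is free
);
    assign ready = ~ (& engine_busy);

    always_comb
    begin
        engine_start = '0;

        if (arg_vld)
            for (int i = 0; i < N_ENGINES; i ++)
                if (! engine_busy [i] && engine_start == '0)
                    engine_start [i] = 1'b1;
    end
endmodule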
How to check the solution works
The exercises in systemverilog-homework use self-checking testbenches and scripts for basic verification of the student’s solutions. They work under Linux, macOS and Windows (either with Git Bash, which comes with Git for Windows, or under WSL). The required software includes Icarus Verilog, Git and Bash; you may also need the GTKWave or Surfer waveform viewer. To get up and running, do the following:
git clone https://github.com/yuri-panchul/systemverilog-homework.git
cd systemverilog-homework/03_finite_state_machines
./run_linux_mac.sh
You will see the following output:
Answer “y”. This will clone a repository for Wally CPU and extract the necessary files from there.
Your first goal is to get a PASS, at least for the 03_08 exercise:
Then you need to repeat the same thing with 04_arithmetics_and_pipelining/04_12_float_discriminant_distributor.
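Assuming that chapter directory is organized the same way and has the same run script, the steps are analogous:

cd ../04_arithmetics_and_pipelining    # assuming the 04 chapter has the same run script
./run_linux_mac.sh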
Once you develop a solution, you also need to check whether the code is synthesizable and has reasonable timing. You can use various FPGA synthesis tools for this (Xilinx, Altera, Gowin, Lattice, Efinix) or ASIC design tools (Synopsys Design Compiler, Cadence Genus, OpenLane, Caravel, Tiny Tapeout); it is up to you. If you want to try open-source ASIC synthesis, there is an article on how to work around the pitfalls on this path: The State of Caravel: the First Look.
However, this is not the end of the story. Your solution has to be reviewed by somebody with experience in SystemVerilog and microarchitecture: a teacher, a colleague or an interviewer. In any case, good luck, and we hope you enjoy the experience.