Throughput and latency are fundamental concepts in modern digital systems. Throughput is the amount of data a system processes per unit of time (for example, per clock cycle), whereas latency is the time it takes for data to travel from one point to another to complete an operation. High throughput and low latency are essential in today’s fast-paced digital world. For example, high throughput ensures smooth, uninterrupted audio and video streaming, and low latency provides a seamless, responsive gaming experience with less lag. The two goals often pull against each other, so it is necessary to strike a balance between them. Processing more data simultaneously boosts throughput, but only if the system can handle the additional load. Digital designers therefore need to find the right balance for the specific requirements of the system. The two most commonly implemented strategies to improve throughput are pipelining and parallel processing.

Pipelining

Pipelining is the process of breaking down a complex task into multiple smaller tasks grouped into stages. The stages execute in a linear sequence, with data flowing from one stage to the next, and each stage contributes part of the overall task.
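To see why pipelining raises throughput without shortening latency, the cycle counts can be worked out with a small Python sketch. It assumes an ideal pipeline in which every stage takes one clock cycle and a new item can enter on every cycle, and it compares that against a non-pipelined design that finishes one item before starting the next. The function name `pipeline_metrics` and the example numbers are chosen only for illustration.

```python
def pipeline_metrics(num_stages: int, num_items: int) -> dict:
    """Cycle counts for an ideal pipeline versus a non-pipelined design.

    Assumptions (illustrative only): each stage takes exactly one clock
    cycle, and the non-pipelined design handles one item at a time.
    """
    latency = num_stages                               # cycles for one item, either way
    total_pipelined = num_stages + (num_items - 1)     # fill once, then one item per cycle
    total_unpipelined = num_stages * num_items         # items never overlap
    return {
        "latency_cycles": latency,
        "total_cycles_pipelined": total_pipelined,
        "total_cycles_unpipelined": total_unpipelined,
        "items_per_cycle_pipelined": num_items / total_pipelined,
        "items_per_cycle_unpipelined": num_items / total_unpipelined,
    }


metrics = pipeline_metrics(num_stages=4, num_items=100)
print(metrics)
```

With 4 stages and 100 items, the pipelined version needs 103 cycles instead of 400, while each individual item still takes 4 cycles from input to output.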

Parallel Processing

Parallel processing is the technique of replicating the processing hardware so that multiple independent pieces of data are processed at the same time. Instead of dividing one task into sequential stages, several identical units work side by side, and throughput grows with the number of units.

Basic Pipelining

Two fundamental control signals, “valid” and “ready”, play a critical role in managing data flow and ensuring smooth operation in a pipelined system. The valid signal indicates that the data currently presented by a stage is valid for processing in the next stage; the current stage asserts valid when its data is available for the next stage. The ready signal indicates that the next stage in the pipeline can accept data from the current stage; the next stage asserts ready when it is able to receive data from the previous pipeline stage. The two signals work in conjunction: data is transferred only in a clock cycle in which both valid and ready are asserted. This ensures that data moves only when the next stage is prepared to process it, preventing data loss or corruption.

Pipeline with Valid/Ready

 

Valid/Ready Waveform
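The valid/ready rule described above can be sketched as a cycle-by-cycle Python model. This is a behavioral illustration, not RTL: the function name `run_handshake`, the item labels, and the ready pattern are all made up for the example. The producer holds each item, with valid high, until a cycle in which ready is also high.

```python
def run_handshake(items, ready_pattern):
    """Simulate a producer/consumer pair using a valid/ready handshake.

    Returns one tuple per cycle: (cycle, valid, ready, data, transferred).
    The producer keeps presenting the same item until it is accepted.
    """
    pending = list(items)
    log = []
    for cycle, ready in enumerate(ready_pattern):
        valid = bool(pending)                    # valid while there is data to send
        data = pending[0] if valid else None
        transferred = valid and bool(ready)      # transfer only when both are high
        log.append((cycle, int(valid), int(ready), data, transferred))
        if transferred:
            pending.pop(0)                       # producer moves on to the next item
    return log


log = run_handshake(["D0", "D1", "D2"], ready_pattern=[1, 0, 0, 1, 1, 0])
accepted = [(cycle, data) for cycle, _, _, data, ok in log if ok]
for row in log:
    print(row)
```

In this run D1 is held on the bus during cycles 1 and 2, while ready is low, and is only transferred in cycle 3.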

Pipeline with Upstream and Downstream Signals

When designing a pipeline stage, it is necessary to define a clear nomenclature to avoid confusion. Generally, inputs to the pipeline stage are referred to as upstream signals and outputs from the pipeline stage are called downstream signals. For simplicity, since data flows from input to output (left to right), all signals to the left of the pipeline stage are labeled with the prefix “up_” and all signals to the right of the pipeline stage are labeled with the prefix “dwn_”. The two control signals, valid and ready, are labeled “vld” and “rdy” respectively.

Interface signals of the nth pipeline stage among N pipeline stages:

  • Upstream
  1. up_data: input data from the (n-1)th pipeline stage.
  2. up_vld: input signal indicating that the data on the up_data line is valid and available for processing.
  3. up_rdy: output signal to the previous (n-1)th stage to indicate the pipeline stage is ready to receive new data.

 

  • Downstream
  1. dwn_data: output data from the nth pipeline stage, after processing, to the (n+1)th pipeline stage.
  2. dwn_vld: output signal indicating that the data on the dwn_data line is valid for the next (n+1)th pipeline stage.
  3. dwn_rdy: input signal from the (n+1)th stage indicating that it is ready to receive the new data.

When up_vld is high and up_rdy is asserted, up_data is latched into the pipeline stage. Similarly, when dwn_vld is high and dwn_rdy is asserted, the data stored in the pipeline stage is transferred to the next stage over dwn_data.
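The interface above can be modeled behaviorally in a few lines of Python. The sketch below is one possible single-register stage, not the author's implementation: it assumes up_rdy is derived combinationally from dwn_rdy (the stage is ready when it is empty or is being emptied in the same cycle). The class name `PipeStage` and the `func` parameter are hypothetical.

```python
class PipeStage:
    """Behavioral model of one pipeline stage with valid/ready handshaking."""

    def __init__(self, func=lambda x: x):
        self.func = func          # processing done by this stage (hypothetical)
        self.dwn_vld = False      # True when the stage register holds valid data
        self.dwn_data = None      # contents of the stage register

    def clock(self, up_vld, up_data, dwn_rdy):
        """Evaluate one clock cycle; return (up_rdy, item handed downstream or None)."""
        # Downstream transfer: the stored item leaves when dwn_vld and dwn_rdy are high.
        sending = self.dwn_vld and dwn_rdy
        sent = self.dwn_data if sending else None
        # Assumed policy: ready when the stage is empty or is emptying this cycle.
        up_rdy = (not self.dwn_vld) or dwn_rdy
        if up_vld and up_rdy:
            self.dwn_data = self.func(up_data)   # latch the processed input
            self.dwn_vld = True
        elif sending:
            self.dwn_vld = False                 # drained and nothing new arrived
        return up_rdy, sent


stage = PipeStage(func=lambda x: x + 1)
trace = [
    stage.clock(up_vld=True, up_data=10, dwn_rdy=False),    # 11 is latched
    stage.clock(up_vld=True, up_data=20, dwn_rdy=False),    # stage full: 20 must wait
    stage.clock(up_vld=True, up_data=20, dwn_rdy=True),     # 11 leaves, 21 is latched
    stage.clock(up_vld=False, up_data=None, dwn_rdy=True),  # 21 leaves, stage empties
]
print(trace)
```

In the second cycle up_rdy is low because the register is occupied and the downstream side is not ready, so the upstream stage has to keep presenting 20 until the third cycle.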

What is Back-pressure?

When the upstream stage (producer) generates data faster than the downstream stage (consumer) can accept it, data is lost because the consumer cannot keep up with the rate of production. To ensure no data is lost, a control mechanism called back-pressure is introduced so that the upstream stages do not generate new data until the downstream stage is ready to accept it. This mechanism is crucial for maintaining system stability and preventing data loss. Buffers are used to store data temporarily between stages while the downstream side is busy. Back-pressure is a feedback mechanism that continuously monitors each stage: when a downstream stage cannot accept more data, it signals the upstream stages to stop producing data until it is ready again.
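A toy Python comparison makes the effect concrete. It assumes a producer that can emit one item per cycle and a consumer that is ready only every other cycle; the function name `transfer` and the `use_backpressure` flag are invented for this sketch. Without back-pressure the producer moves on regardless and the unaccepted item is lost; with back-pressure it holds the item until the consumer is ready.

```python
def transfer(items, consumer_ready, use_backpressure):
    """Return the items the consumer actually receives, in order."""
    pending = list(items)
    received = []
    for ready in consumer_ready:
        if not pending:
            break
        if ready:
            received.append(pending.pop(0))   # normal transfer
        elif not use_backpressure:
            pending.pop(0)                    # producer ignores the consumer: item lost
        # with back-pressure the producer simply holds the item this cycle
    return received


ready_pattern = [1, 0] * 6                    # consumer is ready every other cycle
lossy = transfer(range(6), ready_pattern, use_backpressure=False)
safe = transfer(range(6), ready_pattern, use_backpressure=True)
print("without back-pressure:", lossy)
print("with back-pressure:   ", safe)
```

The price of losing nothing is time: with back-pressure the six items take roughly twice as many cycles to get through, which is why the buffering schemes in the following sections matter.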

Different Types of Buffers to Handle Back-pressure

In pipeline design, buffers are essential components for managing data flow and implementing back-pressure. They temporarily store data between stages to prevent data loss and ensure smooth operation. Different types of buffers are used for different purposes, each with its own characteristics and use cases. Here are a few of the schemes used to handle back-pressure in pipeline design:

 

Global Stall

The entire pipeline is halted, or stalled, when a global stall occurs. Data does not flow through the pipeline until the consumer is ready to accept new data. Figure 4 shows the logic for a global stall.
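As a rough behavioral illustration of a global stall, the Python sketch below freezes every stage register whenever the last stage holds valid data that the consumer cannot take. The three-stage depth, the function name `run_global_stall`, and the use of `None` to mark an empty stage are assumptions made for the example, not details from the figure.

```python
def run_global_stall(items, dwn_rdy_pattern, num_stages=3):
    """Shift items through a pipeline that stalls as a whole.

    Returns a list of (cycle, item) pairs for every item delivered downstream.
    """
    regs = [None] * num_stages                # None marks an empty stage (bubble)
    pending = list(items)
    delivered = []
    for cycle, dwn_rdy in enumerate(dwn_rdy_pattern):
        # One stall signal for the whole pipeline.
        stall = regs[-1] is not None and not dwn_rdy
        if stall:
            continue                          # every register keeps its value
        if regs[-1] is not None:
            delivered.append((cycle, regs[-1]))
        new_item = pending.pop(0) if pending else None
        regs = [new_item] + regs[:-1]         # all stages advance together
    return delivered


delivered = run_global_stall([1, 2, 3], dwn_rdy_pattern=[1, 1, 1, 0, 0, 1, 1, 1, 1])
print(delivered)
```

The consumer is not ready in cycles 3 and 4, so the first item, which would otherwise leave in cycle 3, is delivered in cycle 5 and everything behind it is delayed by the same two cycles.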

Half-Performance Buffer

The half-performance buffer, more generally known as the half-rate buffer, is a scheme in which each stage of the pipeline takes two clock cycles to handle a data item instead of one. Because each stage takes two clock cycles, there is more time available to process the data between pipeline stages. The cost is that a half-rate buffer increases latency and reduces throughput, since each stage accepts new data at most every other clock cycle.

Half Rate Buffer
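One common way to realize a half-rate buffer is a single register whose ready output depends only on its own registered state. The Python sketch below models that variant as an assumption; the figure may differ in detail, and the names `HalfRateBuffer` and `drive` are hypothetical. Even with a producer that always has data and a consumer that is always ready, only one item gets through every two cycles.

```python
class HalfRateBuffer:
    """Single-register buffer: up_rdy is simply 'register empty'."""

    def __init__(self):
        self.full = False
        self.data = None

    def clock(self, up_vld, up_data, dwn_rdy):
        up_rdy = not self.full        # no combinational path from dwn_rdy
        dwn_vld = self.full
        sent = None
        if dwn_vld and dwn_rdy:       # drain cycle: nothing can be accepted
            sent = self.data
            self.full = False
        elif up_vld and up_rdy:       # fill cycle
            self.data = up_data
            self.full = True
        return up_rdy, sent


def drive(buf, items, dwn_rdy_pattern):
    """Feed items as fast as the buffer allows; collect what comes out."""
    pending, out = list(items), []
    for dwn_rdy in dwn_rdy_pattern:
        up_vld = bool(pending)
        up_rdy, sent = buf.clock(up_vld, pending[0] if up_vld else None, bool(dwn_rdy))
        if up_vld and up_rdy:
            pending.pop(0)
        if sent is not None:
            out.append(sent)
    return out


out = drive(HalfRateBuffer(), range(10), dwn_rdy_pattern=[1] * 8)
print(out)   # 4 items in 8 cycles
```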

Skid Buffer

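A skid buffer, also known as a double buffer, is a technique used in pipeline design in which two storage elements hold data. When the downstream stage stalls, the second element captures the item that is already in flight, so the stage can accept data on every cycle in normal operation. This improves performance and throughput.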

Skid Buffer
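A behavioral Python sketch of a skid buffer follows. It assumes the usual arrangement of a main register plus a "skid" register, with up_rdy driven only by registered state (high while the skid register is empty); the class and helper names are hypothetical. The second run stalls the consumer for two cycles to show that nothing is lost or reordered.

```python
class SkidBuffer:
    """Main register plus a skid register that absorbs one in-flight item."""

    def __init__(self):
        self.main, self.main_vld = None, False
        self.skid, self.skid_vld = None, False

    def clock(self, up_vld, up_data, dwn_rdy):
        up_rdy = not self.skid_vld            # registered: ready while skid is empty
        accept = up_vld and up_rdy
        send = self.main_vld and dwn_rdy
        sent = self.main if send else None
        if send:
            if self.skid_vld:                 # refill main from the skid register
                self.main, self.skid_vld = self.skid, False
            elif accept:                      # stream straight through main
                self.main = up_data
            else:
                self.main_vld = False
        elif accept:
            if self.main_vld:                 # downstream stalled: use the skid register
                self.skid, self.skid_vld = up_data, True
            else:
                self.main, self.main_vld = up_data, True
        return up_rdy, sent


def drive(buf, items, dwn_rdy_pattern):
    """Feed items as fast as the buffer allows; collect what comes out."""
    pending, out = list(items), []
    for dwn_rdy in dwn_rdy_pattern:
        up_vld = bool(pending)
        up_rdy, sent = buf.clock(up_vld, pending[0] if up_vld else None, bool(dwn_rdy))
        if up_vld and up_rdy:
            pending.pop(0)
        if sent is not None:
            out.append(sent)
    return out


streaming = drive(SkidBuffer(), range(10), [1] * 8)               # no stalls
stalled = drive(SkidBuffer(), range(10), [1, 1, 0, 0, 1, 1, 1, 1])  # two stall cycles
print(streaming, stalled)
```

Without stalls the buffer delivers one item per cycle after a single cycle of latency; with the two-cycle stall the output is delayed but stays complete and in order.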

2-Depth FIFO

A two-depth FIFO is a simple FIFO with only two entries. If there is a stall, the second storage location in the FIFO is used to store the incoming data.

2-depth FIFO
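The same idea can be sketched in Python with a two-entry queue. The model assumes up_rdy and dwn_vld are derived from the occupancy at the start of the cycle (so a full FIFO does not accept data in the cycle in which it is read); the names `TwoDepthFifo` and `drive` are hypothetical.

```python
from collections import deque


class TwoDepthFifo:
    """Behavioral model of a FIFO with two entries."""

    DEPTH = 2

    def __init__(self):
        self.entries = deque()

    def clock(self, up_vld, up_data, dwn_rdy):
        # Both flags use the occupancy at the start of the cycle.
        up_rdy = len(self.entries) < self.DEPTH
        dwn_vld = len(self.entries) > 0
        sent = self.entries.popleft() if (dwn_vld and dwn_rdy) else None
        if up_vld and up_rdy:
            self.entries.append(up_data)      # second entry is used during a stall
        return up_rdy, sent


def drive(fifo, items, dwn_rdy_pattern):
    """Feed items as fast as the FIFO allows; collect what comes out."""
    pending, out = list(items), []
    for dwn_rdy in dwn_rdy_pattern:
        up_vld = bool(pending)
        up_rdy, sent = fifo.clock(up_vld, pending[0] if up_vld else None, bool(dwn_rdy))
        if up_vld and up_rdy:
            pending.pop(0)
        if sent is not None:
            out.append(sent)
    return out


streaming = drive(TwoDepthFifo(), range(10), [1] * 8)
stalled = drive(TwoDepthFifo(), range(10), [1, 1, 0, 0, 1, 1, 1, 1])
print(streaming, stalled)
```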

Design Example

Let’s consider the design example sqrt(A + sqrt(B + sqrt(C))) from Focus on Microarchitecture. The square root block is based on the algorithm from Hacker’s Delight by Henry Warren. The square root design is itself built as an N-stage pipeline, which gives it a fixed latency of N clock cycles: when an input X is applied to the square root design, it takes N clock cycles to process the data and produce the output Y. Two valid signals, “x_vld” and “y_vld”, indicate the presence of valid data on the input and output respectively. Figure 2 shows the waveform of the implemented square root design with a latency of N.

 

The source code for the square root module can be found here. Using this square root module to implement the pipelined sqrt(A + sqrt(B + sqrt(C))) requires the designer to align the data carefully so that no data is lost. Buffers must be added to align the data so that each operation receives the operands that belong together. The figure below shows the design with its pipeline stages, the appropriate valid/ready signals at each block, and the buffer blocks used for data alignment.

Assuming the square root block takes N clock cycles to produce its result, the buffer that aligns the data on the B line should have a depth of N stages. Similarly, the buffer on the A line should have a depth of 2*N+1: 2*N corresponds to the two square root blocks and 1 corresponds to the pipeline stage between square root blocks 1 and 2. These buffer stages can be designed using one of the four buffers described earlier. One of the simplest ways to align the data coming from the square root block with the data in the buffer block is to use shift registers that are N and 2*N+1 stages deep. As long as N is small, you can get away with simple shift registers. What if the design is large and N is large? In such cases shift registers are not advisable, and an N-depth FIFO is more suitable. The example above was implemented with N=4 stages, using a FIFO as the buffer stage to align the data, and the performance of the system is tabulated below:
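The alignment depths can be checked with a small behavioral Python model. It is only a sketch under stated assumptions: Python's `math.isqrt` stands in for the square root module, every block is modeled as a fixed-latency delay line with no back-pressure, the final addition is treated as combinational, and the names `DelayLine` and `build_pipeline` are invented for the example.

```python
import math
from collections import deque


class DelayLine:
    """Fixed-latency block: the output is func(input) from `depth` cycles earlier."""

    def __init__(self, depth, func=lambda x: x):
        self.func = func
        self.regs = deque([None] * depth)     # None marks a cycle with no valid data

    def clock(self, x):
        self.regs.append(None if x is None else self.func(x))
        return self.regs.popleft()


def build_pipeline(n):
    sqrt1 = DelayLine(n, math.isqrt)          # sqrt(C), latency N
    b_align = DelayLine(n)                    # B delayed by N
    add1_reg = DelayLine(1)                   # register after B + sqrt(C)
    sqrt2 = DelayLine(n, math.isqrt)          # latency N
    a_align = DelayLine(2 * n + 1)            # A delayed by 2*N + 1
    sqrt3 = DelayLine(n, math.isqrt)          # latency N

    def clock(abc):
        a, b, c = abc if abc is not None else (None, None, None)
        s1 = sqrt1.clock(c)
        b_d = b_align.clock(b)
        sum1 = add1_reg.clock(None if s1 is None else b_d + s1)
        s2 = sqrt2.clock(sum1)
        a_d = a_align.clock(a)
        return sqrt3.clock(None if s2 is None else a_d + s2)

    return clock


N = 4
clock = build_pipeline(N)
stimulus = [(21, 9, 49), (0, 0, 0), (7, 5, 16)]     # (A, B, C) triples
results = []
for cycle in range(len(stimulus) + 3 * N + 1):
    y = clock(stimulus[cycle] if cycle < len(stimulus) else None)
    if y is not None:
        results.append((cycle, y))
print(results)
```

In this model the first result appears 3*N+1 = 13 cycles after its operands were applied, and each output matches the integer result computed directly from its own (A, B, C) triple, which is what correct alignment means. Changing either delay depth makes the additions combine operands from different input triples.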

Note: when implementing the design with a 2-depth FIFO, the FIFO buffer length must be doubled to achieve maximum performance.
