Queue Design in SystemVerilog:

Entries are stored into the queue in a defined order. The order can be as simple as picking the first vacant entry, the next vacant entry after the previous allocation, or the entry that most recently became available.

Queues are used in digital design when data from a stream needs to be stored in a structure, manipulated, and taken out of order based on a protocol or on events in the design.

An entry can be taken out of the queue (de-allocated) based on a protocol. If the protocol involves a series of events that is common to every entry, then an FSM like the one shown below is used. However, if each entry in the queue needs a separate event, then simple combinational trigger logic that clears a valid bit may be sufficient.

Here’s an example of an 8-entry queue that uses an FSM to manage each entry:

Queue FSM
parameter   DEPTH         = 8;
parameter   DATA_WIDTH    = 32;      // assumed value; the width is not given in the text
parameter   INVALID_STATE = 4'b0001;
parameter   VALID_STATE   = 4'b0010;
parameter   EVENT1_STATE  = 4'b0100;
parameter   DISABLE_STATE = 4'b1000;

logic [3:0]            state_ps[DEPTH-1:0];  // one 4-bit one-hot state per entry
logic [3:0]            state_ns[DEPTH-1:0];
logic                  queue_empty;
logic                  queue_full;

logic [DATA_WIDTH-1:0] data_in;
logic [DATA_WIDTH-1:0] queue_data[DEPTH-1:0];
logic                  queue_valid[DEPTH-1:0];
logic [DATA_WIDTH-1:0] data_out[DEPTH-1:0];

logic [DEPTH-1:0]      new_entry;
logic [DEPTH-1:0]      allocate_entry;
logic                  allocate_new;

logic [DEPTH-1:0]      event1_started;
logic [DEPTH-1:0]      event1_completed;
logic [DEPTH-1:0]      entry_dis;

genvar x, i, j;

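// An entry is vacant when its FSM is in INVALID_STATE.
logic [DEPTH-1:0] entry_free;

always_comb begin
  for (int e = 0; e < DEPTH; e++)
    entry_free[e] = (state_ps[e] == INVALID_STATE);
end

// Pick the first (lowest-numbered) vacant entry as a one-hot vector.
always_comb begin
  casez (entry_free)
    8'b????_???1 : new_entry = 8'h01;
    8'b????_??10 : new_entry = 8'h02;
    8'b????_?100 : new_entry = 8'h04;
    8'b????_1000 : new_entry = 8'h08;
    8'b???1_0000 : new_entry = 8'h10;
    8'b??10_0000 : new_entry = 8'h20;
    8'b?100_0000 : new_entry = 8'h40;
    8'b1000_0000 : new_entry = 8'h80;
    default      : new_entry = '0;
  endcase
end

assign queue_full     = ~(|entry_free);  // no vacant entry left
assign allocate_entry = (~queue_full & allocate_new) ? new_entry : 8'b0;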

always_comb begin
  for (int entry = 0; entry < DEPTH; entry++) begin
    case (state_ps[entry])
      INVALID_STATE : if (entry_dis[entry])
                        state_ns[entry] = DISABLE_STATE;
                      else if (allocate_entry[entry])
                        state_ns[entry] = VALID_STATE;
                      else
                        state_ns[entry] = INVALID_STATE;

      DISABLE_STATE : if (~entry_dis[entry])
                        state_ns[entry] = INVALID_STATE;
                      else
                        state_ns[entry] = DISABLE_STATE;

      VALID_STATE   : if (event1_started[entry])
                        state_ns[entry] = EVENT1_STATE;
                      else
                        state_ns[entry] = VALID_STATE;

      EVENT1_STATE  : if (event1_completed[entry])
                        state_ns[entry] = INVALID_STATE;
                      else
                        state_ns[entry] = EVENT1_STATE;

      default       : state_ns[entry] = state_ps[entry];
    endcase
  end
end

generate
for (x = 0; x < DEPTH; x++) begin : g_state
  always_ff @(posedge clk or negedge reset) begin
    if (~reset)                      // reset is active-low
      state_ps[x] <= INVALID_STATE;
    else
      state_ps[x] <= state_ns[x];
  end
end
endgenerate

assign event1_started   = request_sent[DEPTH-1:0];  // From other Module
assign event1_completed = response_ack[DEPTH-1:0];  // From other Module
assign entry_dis        = entry_disable[DEPTH-1:0]; // From other Module

//Queue Data maintenance
generate
for (i = 0; i < DEPTH; i++) begin : g_data
  always_ff @(posedge clk or negedge reset) begin
    if (~reset) begin
      queue_data[i]  <= '0;
      queue_valid[i] <= 1'b0;
    end
    else if (event1_completed[i]) begin
      queue_valid[i] <= 1'b0;  // data is held; it can be zeroed here if required
    end
    else if (allocate_entry[i]) begin
      queue_data[i]  <= data_in;
      queue_valid[i] <= 1'b1;
    end
  end
end
endgenerate


//Queue Data readout of multiple entries.
generate
  for (j = 0; j < DEPTH; j++) begin : g_rd
    assign data_out[j] = (event1_completed[j] & queue_valid[j])
                         ? queue_data[j] : '0;
  end
endgenerate

Here the request and response could be sent to and received from another design module. These events (started and completed) could occur for one entry per cycle or for multiple entries per cycle. Hence multiple entries can receive state updates in the same cycle, i.e. multiple entries can be taken out of the queue at once.

If only a single event is expected to complete every cycle, then a decoder is needed to extract a single data element from the queue, as shown below:

logic [2:0] entry_dec;
always_comb begin
  unique case (event1_completed[7:0])
    8'b0000_0001 : entry_dec = 3'h0;
    8'b0000_0010 : entry_dec = 3'h1;
    8'b0000_0100 : entry_dec = 3'h2;
    8'b0000_1000 : entry_dec = 3'h3;
    8'b0001_0000 : entry_dec = 3'h4;
    8'b0010_0000 : entry_dec = 3'h5;
    8'b0100_0000 : entry_dec = 3'h6;
    8'b1000_0000 : entry_dec = 3'h7;
    default      : entry_dec = 3'h0;  // no completion this cycle; readout is masked below
  endcase
end

//Queue Data readout of Single entry
//(data_out is a single DATA_WIDTH-wide signal here, not the per-entry array used above)
assign data_out = ((|event1_completed) & queue_valid[entry_dec])
                   ? queue_data[entry_dec] : '0;
 

Entry disable is used to limit the number of usable queue entries or to apply back-pressure on the preceding design module. It is usually programmed through firmware and is an effective tool for testing corner cases.

Synchronous FIFO :

FIFOs (first-in-first-out) are used for serial transfer of information whenever there is a difference in transfer rate. The transfer rate may differ due to a difference in the number of ports, the frequency or the data width between source and destination.

The FIFO size is chosen to compensate for the difference in transfer rate and is calculated as follows:

FIFO size = (Source Freq. * Ports * Data-width) / (Dest. Freq. * Ports * Data-width)

Ex: Source : Port = 1, Freq. = 100kHz, Data-Width = 20
Destination : Port = 1, Freq. = 50kHz, Data-Width = 10
FIFO Size : (1*100*20) / (1*50*10) = 4 Entries

If the FIFO size is a fractional number, round it up to the next whole number. For Ex 4.33 -> 5.
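
The sizing rule above, including the round-up, can be sketched as a small SystemVerilog function. This is only an illustration under stated assumptions: the names fifo_size_sketch and fifo_size are hypothetical, and frequencies are passed as integers in the same unit on both sides.

```systemverilog
module fifo_size_sketch;
  // Hypothetical helper: returns ceil(source rate / destination rate),
  // where rate = ports * frequency * data width.
  function automatic int fifo_size(int src_ports, int src_freq, int src_width,
                                   int dst_ports, int dst_freq, int dst_width);
    int src_rate;
    int dst_rate;
    src_rate = src_ports * src_freq * src_width;
    dst_rate = dst_ports * dst_freq * dst_width;
    return (src_rate + dst_rate - 1) / dst_rate;  // integer round-up
  endfunction

  initial begin
    // (1*100*20) / (1*50*10) = 4
    assert (fifo_size(1, 100, 20, 1, 50, 10) == 4);
    // 1300 / 300 = 4.33, which rounds up to 5
    assert (fifo_size(1, 130, 10, 1, 30, 10) == 5);
    $display("fifo_size checks done");
  end
endmodule
```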

Here’s an SV reference for an 8-entry-deep FIFO with a data width of 32 bits:

8-Entry FIFO
logic [3:0]  wraddr, rdaddr;        // next-state pointers (MSB tracks wrap-around)
logic [3:0]  wraddr_ff, rdaddr_ff;  // registered pointers
logic [7:0]  wren, rden;
logic [31:0] data_ff[7:0];
logic [31:0] data_in, data_out;
logic        wr_en, rd_en;
logic        fifo_empty, fifo_full;

// Advance the write pointer only on an accepted write
always_comb begin
  if (!fifo_full & wr_en)
    wraddr = wraddr_ff + 1'b1;
  else
    wraddr = wraddr_ff;
end

// Advance the read pointer only on an accepted read
always_comb begin
  if (!fifo_empty & rd_en)
    rdaddr = rdaddr_ff + 1'b1;
  else
    rdaddr = rdaddr_ff;
end

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    rdaddr_ff <= '0;
  else
    rdaddr_ff <= rdaddr;
end

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    wraddr_ff <= '0;
  else
    wraddr_ff <= wraddr;
end

// Full and empty use the registered pointers, which avoids a combinational loop
assign fifo_full  = (wraddr_ff[2:0] == rdaddr_ff[2:0]) &
                    (wraddr_ff[3]   != rdaddr_ff[3]);

assign fifo_empty = (wraddr_ff[2:0] == rdaddr_ff[2:0]) &
                    (wraddr_ff[3]   == rdaddr_ff[3]);

assign wren = (wr_en & !fifo_full)  ? (8'b1 << wraddr_ff[2:0]) : 8'b0;
assign rden = (rd_en & !fifo_empty) ? (8'b1 << rdaddr_ff[2:0]) : 8'b0;

genvar i;
generate
for (i = 0; i < 8; i++) begin : g_mem
  always_ff @(posedge clk) begin
    if (wren[i])
      data_ff[i] <= data_in;
  end
end
endgenerate

always_comb begin
  data_out = '0;                 // default when no entry is being read
  for (int j = 0; j < 8; j++) begin
    if (rden[j])
      data_out = data_ff[j];
  end
end

If the FIFO depth is not a power of 2, detecting full and empty conditions is difficult using the above method. Alternatively, we can use the two implementations below, which scale to any FIFO depth:

The 1st implementation uses a single counter to track the full and empty conditions:

logic [3:0] fifo_count, fifo_count_ff;  // 4 bits so the count can reach the depth of 8

always_comb begin
  unique case ({rd_en, wr_en})
    2'b00 : fifo_count = fifo_count_ff;
    2'b01 : fifo_count = (fifo_count_ff == 4'd8) ? 4'd8
                                                 : (fifo_count_ff + 4'd1);
    2'b10 : fifo_count = (fifo_count_ff == 4'd0) ? 4'd0
                                                 : (fifo_count_ff - 4'd1);
    2'b11 : fifo_count = fifo_count_ff;  // simultaneous read and write: occupancy unchanged
  endcase
end

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    fifo_count_ff <= 4'd0;
  else
    fifo_count_ff <= fifo_count;
end

assign fifo_full  = (fifo_count_ff == 4'd8);
assign fifo_empty = (fifo_count_ff == 4'd0);

The 2nd implementation uses read and write roll-over bits alongside wraddr and rdaddr.

logic wr_rollover, wr_rollover_ff;
logic rd_rollover, rd_rollover_ff;

// Toggle the write roll-over bit when an accepted write wraps the pointer.
// 3'b111 is the last address here (DEPTH-1 in general).
always_comb begin
  if ((wraddr_ff[2:0] == 3'b111) & wr_en & !fifo_full)
    wr_rollover = ~wr_rollover_ff;
  else
    wr_rollover = wr_rollover_ff;
end

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    wr_rollover_ff <= 1'b0;
  else
    wr_rollover_ff <= wr_rollover;
end

// Toggle the read roll-over bit when an accepted read wraps the pointer.
always_comb begin
  if ((rdaddr_ff[2:0] == 3'b111) & rd_en & !fifo_empty)
    rd_rollover = ~rd_rollover_ff;
  else
    rd_rollover = rd_rollover_ff;
end

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    rd_rollover_ff <= 1'b0;
  else
    rd_rollover_ff <= rd_rollover;
end

assign fifo_full  = (wraddr_ff[2:0] == rdaddr_ff[2:0]) &
                    (wr_rollover_ff != rd_rollover_ff);

assign fifo_empty = (wraddr_ff[2:0] == rdaddr_ff[2:0]) &
                    (wr_rollover_ff == rd_rollover_ff);

XOR based clock gating & implementation:

Clock gating is a way to save power in synchronous logic by temporarily shutting off clocks in sequential logic. The clock gating logic could be based on the functional behavior of the sequential logic or purely on detection of traffic activity through the logic block. More details are given in the clock-gating section.

XOR clock gating : In order to understand XOR based clock gating, let us first understand the property of XOR logic. Here X and Y are the inputs and Output is the output of the XOR gate.

X   Y  Output
0   0   0
0   1   1
1   0   1
1   1   0

XOR logic has a property that allows us to detect whether 2 inputs are different, i.e. if the 2 inputs differ the output is 1, otherwise it is 0. Also, XOR inverts a bit when that bit is XORed with 1.

In XOR clock gating, the information to be stored (A) is inverted using XOR with 1s (Abar). The information to be stored is then compared with the current information (B) in the flops (again using XOR). If more than 50% of the bits differ, the inverted information (Abar) is stored instead of A. This reduces the number of bit-flips required to store A into B, thereby saving switching power.

For E.g:
Consider a 10-bit sequential logic (B[9:0]) that stores a value on a valid, where the value needs to be written and read (b_out) every clock cycle.

logic [9:0] A;           // new info to be written
logic [9:0] Ainv;        // XOR of new and stored info (1 = bit would flip)
logic [9:0] Abar;        // inverted info
logic [9:0] B;           // stored info
logic [3:0] Ainv_cnt;    // number of bits that would flip
logic       save_inv;    // inversion indicator
logic       save_inv_ff; // stored inversion indicator
logic       valid;       // incoming valid
logic [9:0] b_out;       // info to be read out

assign Abar = A ^ {10{1'b1}};
assign Ainv = A ^ B;

// Count number of 1s
always_comb begin
  Ainv_cnt = '0;
  for (int cnt = 0; cnt < 10; cnt++) begin
    Ainv_cnt += Ainv[cnt];
  end
end

// Detect if bit-flips are more than 50%
assign save_inv = (Ainv_cnt > 5) ? 1'b1 : 1'b0;

// Remember whether the stored value is inverted
always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    save_inv_ff <= 1'b0;
  else if (valid)
    save_inv_ff <= save_inv;
end

// Store A or Abar to minimize switching
always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    B <= 10'b0;
  else if (save_inv & valid)
    B <= Abar;
  else if (~save_inv & valid)
    B <= A;
end

// On read-out, undo the inversion if the stored value was inverted.
assign b_out = save_inv_ff ? (B ^ {10{1'b1}}) : B;


Please note that the XOR logic for inversion and the adders for counting bits increase the combinational gate count of the logic. An extra bit (save_inv_ff) is also stored. Therefore, any power saving here comes at the cost of an area increase. This area increase raises static power but reduces the dynamic (switching) power of the flops. Careful analysis is therefore recommended before using this technique.

In general, this technique is more suitable for highly correlated data, e.g. media or video type workloads.
 

ECC (Error Correction Codes) :

ECC refers to Error Correction Codes. An error happens whenever a bit flips and information is read incorrectly. The flip can affect a single bit or two bits, causing single-bit errors or double-bit errors. Bit flips can occur because of hard errors or soft errors.
Hard errors are due to inherent circuit defects from manufacturing, temperature variance and general wear and tear. These issues often cause stuck-at faults, where a bit is permanently stuck at 0 or 1. Soft errors occur when radiation (for example alpha particles or cosmic-ray neutrons) strikes a storage cell and flips the bit. Temporary bit flips are also caused by noisy environments where electronic interference is high, e.g. if a circuit happens to be close to a power supply.

There are 3 key parts to ECC :

ECC Generation:
ECC generation is the process of applying an algorithm to calculate extra bits that are stored with the data. The algorithm is XOR logic in which each ECC bit is derived from the XOR of several data bits. These bits are stored along with the data in a memory or array and then retrieved for detection and correction. The number of ECC bits depends on the size of the data and can be calculated as follows:

SEC    : the smallest r that satisfies 2^r >= k + r + 1, where k = number of data bits and r = number of ECC bits.
SECDED : r + 1 ECC bits (the SEC bits plus one overall parity bit).
DECTED : needs more ECC bits than SECDED; the exact count depends on the code construction used.

For E.g :
For 8 bits of data, r = 4 is the smallest value that works (2^4 = 16 >= 8 + 4 + 1 = 13), so single-bit correction needs 4 ECC bits.
For 8 bits of data with single-bit correction and double-bit detection (SECDED) we would need 4 + 1 = 5 ECC bits.

ECC Detection:
Detection is the method of knowing whether there was an error. At a minimum we should know whether there was a bit flip, and in some cases also how many bits were flipped. These are the 2 pieces of information that allow us to decide whether the error can be corrected.
For detection, the ECC bits are re-generated with the same XOR formula that was used in generation and then compared against the original ECC bits retrieved from the memory/array. If the XOR of the original and regenerated ECC bits (the syndrome) is not 0, an error is detected.
To work out which bits were in error, the syndrome is referenced against a list of stored error syndrome codes. If there is a match, the error can be corrected; otherwise the error cannot be corrected even though it was detected.

ECC Correction:
Correction is the process of restoring the data to its original state. This is done using the error syndrome. The syndrome is unique per bit, so if the XOR result matches a stored syndrome then that particular bit is in error. For E.g for SECDED protection on 8-bit data with 5 bits of ECC, there will be 13 error syndrome codes, one for a single-bit error in each of the 13 bit positions. To correct the bit, its polarity is flipped, i.e. from 0 -> 1 or 1 -> 0, and the original value is restored.
It is also important to note that if the XOR result is not 0 (i.e. an error was detected) but no matching error syndrome is found, then the error cannot be corrected. This typically happens whenever there are more bit flips than the ECC bits can correct, e.g. 2 bit flips under SECDED protection.
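
The generate, detect and correct steps can be illustrated with the classic Hamming(7,4) single-error-correcting code. This is a minimal sketch, not a production ECC block: it covers only 4 data bits with 3 check bits, does correction only (no double-bit detection), and the module and function names are hypothetical. Bit i of the codeword holds Hamming position i, so the syndrome value is directly the position of the flipped bit.

```systemverilog
module hamming74_sketch;
  // Generation: place data in positions 3,5,6,7 and parity in positions 1,2,4.
  // cw[0] is unused so that the index equals the Hamming position.
  function automatic logic [7:0] encode(input logic [3:0] d);
    logic [7:0] cw;
    cw    = '0;
    cw[3] = d[0];
    cw[5] = d[1];
    cw[6] = d[2];
    cw[7] = d[3];
    cw[1] = cw[3] ^ cw[5] ^ cw[7];
    cw[2] = cw[3] ^ cw[6] ^ cw[7];
    cw[4] = cw[5] ^ cw[6] ^ cw[7];
    return cw;
  endfunction

  // Detection: re-compute each parity group; a non-zero syndrome means an error.
  function automatic logic [2:0] syndrome(input logic [7:0] cw);
    logic [2:0] s;
    s[0] = cw[1] ^ cw[3] ^ cw[5] ^ cw[7];
    s[1] = cw[2] ^ cw[3] ^ cw[6] ^ cw[7];
    s[2] = cw[4] ^ cw[5] ^ cw[6] ^ cw[7];
    return s;
  endfunction

  // Correction: flip the bit the syndrome points at, then extract the data.
  function automatic logic [3:0] correct(input logic [7:0] cw);
    logic [7:0] fixed;
    logic [2:0] s;
    s     = syndrome(cw);
    fixed = cw;
    if (s != 3'd0)
      fixed[s] = ~fixed[s];
    return {fixed[7], fixed[6], fixed[5], fixed[3]};
  endfunction

  // Self-check: every single-bit flip of every codeword must be corrected.
  initial begin
    for (int d = 0; d < 16; d++) begin
      for (int pos = 1; pos <= 7; pos++) begin
        logic [7:0] cw;
        cw      = encode(d[3:0]);
        cw[pos] = ~cw[pos];
        assert (correct(cw) == d[3:0])
          else $error("data %0d, flipped position %0d not corrected", d, pos);
      end
    end
  end
endmodule
```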

SystemVerilog Assertions :

Assertions are a useful way to verify the behavior of a design. They can be written wherever we expect a signal behavior to be true or false. Assertions help designers protect against bad inputs and also assist in faster debug. Assertions are a critical component in achieving formal proof of the design.

In general Assertions are classified into two categories:
1. Immediate Assertions
2. Concurrent Assertions

1. Immediate Assertions: These assertions are clock independent. They are evaluated like any other procedural statement, at the moment the statement executes. For Ex. :

// P1: opcode must not be a reserved value
always_comb begin
  P1 : assert (req.opcode != reserved)
       else $error ("opcode Error seen");
end

always_comb begin
  P2 : assert (!Read && !Write);
end

2. Concurrent Assertions: These assertions are clock based, so the property is checked only at the posedge or negedge of the clock. They are the more popular type in synchronous designs. For Ex. :

P1: assert property (@(posedge clk) disable iff (!rst) (req |=> grant));

sequence s1;
  (valid == 1'b1);
endsequence

sequence s2;
  ##[1:3] (data != '0);
endsequence

P2: assert property (@(posedge clk) disable iff (!rst) (s1 |-> s2));

Since assertions cannot be synthesized, it is necessary to guard them with `ifdef and `endif. Alternatively, assertions are grouped into a dedicated package and the package is selectively added depending on the type of compilation. However, moving them into a package makes debug more difficult because the assertions are isolated from the source code.
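
As an illustration of the guard described above, here is a minimal sketch. The macro name SYNTHESIS is an assumption (a common convention, but the name actually defined depends on the tool flow), and the module name req_grant_check is hypothetical.

```systemverilog
module req_grant_check (
  input logic clk,
  input logic rst,    // active-low, matching disable iff (!rst) in the examples
  input logic req,
  input logic grant
);
`ifndef SYNTHESIS
  // Simulation/formal only: this block is skipped when SYNTHESIS is defined.
  a_req_grant : assert property (@(posedge clk) disable iff (!rst) (req |=> grant))
    else $error("grant did not follow req");
`endif
endmodule
```

Keeping the guarded assertion inside the module it checks avoids the debug drawback of a separate package, at the cost of some clutter in the RTL file.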

Lets take a look at different examples of Assertion Operators:

1. $fell() : Signal changed from 1 to 0 between 2 consecutive cycles.
// clk enable falls 1 cycle after valid is 0.
P1: assert property (@(posedge clk) (~val) |-> ##1 $fell(clk_en));

2. $changed() : Signal changed value between 2 consecutive cycles.
// FSM state changes within 1 to 2 cycles whenever ack is received.
P2: assert property (@(posedge clk) (ack) |-> ##[1:2] $changed(state));

3. $stable() : Signal is stable between 2 consecutive cycles.
// counter is stable whenever wren is 0.
P3: assert property (@(posedge clk) (~wren) |-> $stable(count));

4. $onehot() : Signal is onehot encoded.
// FSM State is onehot encoded
P4: assert property (@(posedge clk) $onehot(fsm_state));

5. $onehot0() : Signal has at most one bit set, or is all 0s.
// Mux select is onehot encoded at most or could be 0.
P5: assert property (@(posedge clk) $onehot0(mux_sel));

6. $rose() : Signal changed from 0 to 1 between 2 consecutive cycles.
// Grant is seen after 1 cycle whenever request is asserted.
P6: assert property (@(posedge clk) (req) |=> $rose(grant));

7. $past() : Returns the value of the signal in the previous cycle.
// If grant is seen then request was seen in the previous cycle.
P7: assert property (@(posedge clk) (grant) |-> $past(req));

8. ##N : One Event is followed by another event after N cycles. // see 1 for example.

9. ##[M:N] : One Event is followed by another event within M to N cycles. // see 2 for example

10. [*N] : Event is repeated for N consecutive cycles.
// whenever stall is asserted ack is low for 3 consecutive cycles
P10: assert property (@(posedge clk) (stall) |-> (~ack)[*3]);

11. if (cond) prop1 else prop2 : If condition (cond) is satisfied then
Property (prop1) must be True, otherwise property (prop2) must be True.
P11: assert property (@(posedge clk) if (fifo_empty) (!read_en) else (read_en));

12. Ev1 |-> Ev2 : Whenever Event (Ev1) is True then Event (Ev2) is also True, starting in the same cycle. // see 1 for example

13. Ev1 |=> Ev2: Whenever Event (Ev1) is True then Event (Ev2) is also True
starting next cycle. // see 6 for example

14. $isunknown() : Check if Event/Signal is X or Z.
// opcode should not be X or Z.
P14: assert property (@(posedge clk) !$isunknown(opcode));

15. $countones() : Count the number of 1s in the Signal/Event.
// For a 3-bit mux_sel, at most two bits may be set at a time.
P15: assert property (@(posedge clk) ($countones(mux_sel) <= 2));

16. k[*M:N] : Event (k) is expected to be repeated for between M and N consecutive cycles.
// Power off should result in valid out being off for 3 to 5 cycles.
P16: assert property (@(posedge clk) (pwr_off) |-> (~valid_out)[*3:5]);

17. k[->N] : Match on the Nth occurrence of the Event (k); the occurrences need not be in consecutive cycles.
// After in_wr, the sequence matches on the 8th cycle in which write_valid is low.
P17: assert property (@(posedge clk) (in_wr) |-> (~write_valid)[->8]);

18. k[->M:N] : Match on the Mth to Nth occurrence of the Event (k).
// After in_rd, the sequence matches on the 8th to 10th cycle in which rd_valid is low.
P18: assert property (@(posedge clk) (in_rd) |-> (~rd_valid)[->8:10]);

19. ##[0:$] or ##[*] : Open-ended, Event is True eventually or
by end of simulation.
// Arbiter will grant request eventually if there is no stall and request is high.
P19: assert property (@(posedge clk) (req & ~stall) |-> ##[1:$] grant);


A sequence is used as part of an assertion whenever a series of events needs to happen in order for the property to hold. Here’s an example of sequences:

sequence req_active;
  // request is de-asserted and then seen to be asserted within 1 to 3 cycles
  (!req) ##[1:3] $rose(req);
endsequence

sequence stall_inactive;
  // stall is held at 0 for 3 consecutive cycles and then rises on the 4th cycle
  (~stall)[*3] ##1 $rose(stall);
endsequence

// whenever sequence req_active is True then sequence stall_inactive is True.
assert property (@(posedge clk) disable iff (~rst) (req_active |-> stall_inactive));

It's important to note that assertions should not be more complex than necessary. If an assertion is complicated, its conditions can be split into multiple assertions for simplicity and easier debug. Assertions are also a useful way to determine whether there is an X-propagation issue in the logic. This can be checked simply by adding an assertion on control signals to verify that they are driven to known values, i.e. !$isunknown(sig).
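
A minimal sketch of such an X-check is shown below. The module name ctrl_known_check, the signal name ctrl and the width parameter W are hypothetical; rst is assumed to be active-low as in the earlier examples.

```systemverilog
module ctrl_known_check #(parameter int W = 4) (
  input logic         clk,
  input logic         rst,   // active-low (assumption)
  input logic [W-1:0] ctrl   // control bus that must never carry X or Z
);
  // Fails on any clock edge (outside reset) where ctrl has an unknown bit.
  a_ctrl_known : assert property (@(posedge clk) disable iff (!rst) !$isunknown(ctrl))
    else $error("ctrl has X or Z bits");
endmodule
```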

Clock Domain Synchronization :

Clock domain synchronization is required when signals cross between logic domains running on clocks that are asynchronous to each other. The signal from the source domain needs to be synchronized to the destination domain before it can be used. If synchronization is not performed, the receiving flop can become metastable and sample the signal incorrectly. The process of synchronization can be broadly categorized into 2 types.

  1. Open loop solution.
  2. Closed loop solution.

Open-loop Solution :
2 D-flop synchronizer: Here the source signal is flopped twice on the destination clock before it is used by the destination logic. The first flop may go metastable; the second flop gives it a full destination clock cycle to settle before the value is used. We may also use 3 stages of D-flops if needed, but 2 D-flops are sufficient for most cases. The general rule is that the source signal should be held stable for at least 1.5 times the destination clock period so that the destination clock is guaranteed to sample it. The example below shows the 2 D-flop synchronizer.

Figure: 2 D-FLOP Synchronizer (Courtesy: Clifford E. Cummings, SNUG 2001)
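
A minimal RTL sketch of the 2 D-flop synchronizer described above follows. The module and port names (sync_2ff, dst_clk, dst_rst_n, async_in, sync_out) are assumptions for illustration, and the active-low asynchronous reset follows the style used elsewhere in this post.

```systemverilog
module sync_2ff (
  input  logic dst_clk,    // destination-domain clock
  input  logic dst_rst_n,  // active-low reset in the destination domain
  input  logic async_in,   // single-bit signal from the source domain
  output logic sync_out    // safe to use in the destination domain
);
  logic meta_ff;           // first stage; may go metastable

  always_ff @(posedge dst_clk or negedge dst_rst_n) begin
    if (!dst_rst_n) begin
      meta_ff  <= 1'b0;
      sync_out <= 1'b0;
    end
    else begin
      meta_ff  <= async_in;  // stage 1 samples the asynchronous input
      sync_out <= meta_ff;   // stage 2 samples the settled value
    end
  end
endmodule
```

Only sync_out should fan out to destination logic; meta_ff should feed nothing but the second flop.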

Binary to Gray Conversion : If we have multiple bits or a signal bus crossing a clock domain, the value can be converted into Gray code. Gray code guarantees a 1-bit difference between consecutive values, so it applies to values that change by one step at a time, such as counters and FIFO pointers. Because only one bit changes per update, the synchronized value is always either the old value or the new value and never an inconsistent mix. The conversion from Gray to Binary is discussed in the Async FIFO post.
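
The conversion itself is a single XOR with a shifted copy. The sketch below is an illustration under stated assumptions: the names gray_sketch, bin2gray and gray2bin and the width W = 4 are hypothetical. The initial block checks the round trip and the one-bit-change property for every value, including the wrap from the last value back to 0.

```systemverilog
module gray_sketch;
  localparam int W = 4;  // assumed pointer width

  // Binary to Gray: each Gray bit is the XOR of two adjacent binary bits.
  function automatic logic [W-1:0] bin2gray(input logic [W-1:0] bin);
    return bin ^ (bin >> 1);
  endfunction

  // Gray to binary: work down from the MSB, XOR-ing as we go.
  function automatic logic [W-1:0] gray2bin(input logic [W-1:0] gray);
    logic [W-1:0] bin;
    bin[W-1] = gray[W-1];
    for (int i = W-2; i >= 0; i--)
      bin[i] = bin[i+1] ^ gray[i];
    return bin;
  endfunction

  initial begin
    for (int v = 0; v < (1 << W); v++) begin
      logic [W-1:0] g, g_next;
      g      = bin2gray(v[W-1:0]);
      g_next = bin2gray((v + 1) % (1 << W));
      assert (gray2bin(g) == v[W-1:0]);       // round trip recovers the value
      assert ($countones(g ^ g_next) == 1);   // consecutive codes differ in 1 bit
    end
  end
endmodule
```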

Synchronized Load Pulse: Another open-loop solution is to pass a synchronized load pulse to the destination domain and then use this synchronized pulse to flop in the data bits, as shown.

Figure: Synchronized Load Pulse (Courtesy: Clifford E. Cummings, SNUG 2001)

Closed-Loop Solution :
Synchronized Load Pulse with Feedback: In a closed-loop solution, feedback from the destination clock domain is received before the next signal is sent from the source clock domain. This makes the transfer more reliable, as the original signal is re-tried if the clock synchronization does not happen correctly. The feedback mechanism is implemented with a simple FSM, and the feedback signal from the destination domain needs to be clock crossed as well.
Due to the feedback mechanism, this type of solution is slower than open-loop but more reliable. It can be used to transfer critical control bits from one clock domain to another, e.g. FSM states, mux selects etc. Here’s an example of a closed-loop solution using a synchronized load pulse.

Figure: Synchronized Load Pulse with Feedback (Courtesy: Clifford E. Cummings, SNUG 2001)

Async FIFO: Async FIFOs are used for clock crossing data bits or large groups of signal buses. In this type of implementation the read and write pointers are first converted into Gray code and then clock crossed into the opposite clock domain. The depth of the FIFO is determined using the clock domain frequencies and data widths. Please refer to the Async FIFO implementation post for more details.
Alternatively, if the frequency difference is small and the data widths are the same, the FIFO depth can be restricted to 1. This type of implementation is called a 1-depth 2-register FIFO. The 2 registers are needed for clock crossing, while the depth of 1 is used to indicate full and empty conditions. This type of implementation is shown below:

Figure: 2-Register 1-depth FIFO (Courtesy: Clifford E. Cummings, SNUG 2001)

In general it's recommended to create a single clock crossing module. This common module can be instantiated multiple times throughout the design as needed.

This post only discusses high-level details of CDC, but if you are interested in implementation and more in-depth technical details please refer to this paper by Clifford E. Cummings.

Clock Divider :

In industry, most clock division happens either through a PLL (Phase-Locked Loop) in ASICs or through a DCM (Digital Clock Manager) in FPGAs.

Once a clock is available, it's possible to build a simple synchronous clock divider from a few combinational gates and D-flops. Here are a few examples of clock division with equal duty cycle. Qout (not shown) is the output of the right-most flop :

Divide by 2 Counter :

Figure: Clock divide by 2
always_ff @(posedge clk or negedge reset) begin
  if(!reset)      // reset is active-low
    Q <= 1'b0;
  else
    Q <= D;
end

assign D = ~Q;
assign Qout = Q;

Divide by 1.5 Counter :

Figure: Divide by 1 1/2
always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    Q0 <= 1'b0;
  else
    Q0 <= D0;
end

assign D0 = ~Q0;

always_ff @(negedge clk or negedge reset) begin
  if(!reset)
    Q1 <= 1'b0;
  else
    Q1 <= Q0;
end

// Qout is high while either Q0 or Q1 is high, i.e. for 1.5 clk periods out of every 2.
assign Qout = Q1 | Q0;

Divide by 3 Counter :

Figure: Clock divide by 3
always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    Q0 <= 1'b0;
  else
    Q0 <= D0;
end

always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    Q1 <= 1'b0;
  else
    Q1 <= Q0;
end

always_ff @(negedge clk or negedge reset) begin
  if(!reset)
    Q2 <= 1'b0;
  else
    Q2 <= Q1;
end

// NOR feedback steps {Q0,Q1} through 00 -> 10 -> 01, so Q1 is high 1 cycle in 3.
assign D0 = ~(Q0 | Q1);
// Q2 is Q1 delayed by half a cycle; the OR is high for 1.5 of every 3 clk periods.
assign Qout = Q1 | Q2;

Divide by 4 Counter :

Figure: Clock division by 4
always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    Q0 <= 1'b0;
  else
    Q0 <= D0;
end
assign D0 = ~Q0;

// Second stage is clocked by the divided-by-2 output of the first stage
always_ff @(posedge Q0 or negedge reset) begin
  if(!reset)
    Q1 <= 1'b0;
  else
    Q1 <= D1;
end

assign D1 = ~Q1;
assign Qout = Q1;

Clock and Power Gating Techniques:

There are mainly two types of Power dissipation in CMOS Transistors.

1. Static Power dissipation :
Static power dissipation is mainly caused by transistor leakage. The leakage can come from any of the following sources:

1. Gate Leakage through dielectric.
2. Subthreshold leakage when CMOS is off.
3. Junction leakage from source and drain diffusion.
4. Contention Current in ratioed circuits.

Pstat = (Isub + Igate + Icont + Ijunc) * V
Where, V  = Voltage
       I* = Various leakage Currents

Static power dissipation is inherent to the properties of the transistor, so it mainly depends on the type/technology of the CMOS transistor. As transistors shrink, static power dissipation increases because leakage is higher at smaller technology nodes.

2. Dynamic Power dissipation:
Dynamic power dissipation is caused by the switching of transistors, i.e. from 1->0 or 0->1. It is also caused by the short-circuit current that flows when both pMOS and nMOS are partially ON for a very short time :

Pdyn = Psw + Psc
Where, Psw = Switching Power
       Psc = Short Circuit Power
Psw = a * C * V * V * f. 

Where,a = activity factor, 
      V = voltage,
      C = Capacitance,
      f = Frequency

As short-circuit power is often very small, it's ignored in Pdyn calculations. Therefore Pdyn is directly proportional to frequency, capacitance, activity factor and the square of the voltage. Reducing any of these factors reduces dynamic power accordingly.

The Total Power is combination of Static and dynamic Power and can be stated as :

Ptot = Pstat + Pdyn

To understand how each power dissipating factor can be reduced in static and dynamic power dissipation, please refer to Chapter 5, Power, of the book CMOS VLSI Design by Neil Weste & David Harris (Refer to Link below).

In front-end RTL design, static and dynamic power can be saved through efficient power and clock gating techniques.

Power Gating :
In the power gating technique, the power source to a logic block is turned off temporarily whenever no logical processing or activity is required. This is often done through a dedicated Power Management Unit inside the design that controls the power supplied to different parts of the design. However, it should be noted that loss of power (e.g. P1 domain) results in loss of data, so important information such as FSM states and firmware values should be stored somewhere that stays powered (e.g. Pon domain) so that it can be retrieved when power is back and the design block is functional again. This information is often stored in power-retention registers or a local RAM. These registers and RAM retain power when the rest of the logic is powered off. There are also levels of power gating depth, depending on the extent of the logic and the length of time it is supposed to be inactive. A good example would be Sleep vs Hibernate in a PC. Here’s a block diagram that gives an overview of power gating.

Power Gating Structural Hierarchy

Clock-gating :
Clock gating is a way of reducing dynamic power dissipation by temporarily turning off the clock of the flops in certain parts of the logic, or by turning off the enable on gated flops. In other words, flops are turned on only if there is valid information to be stored or transferred. The accuracy with which these clocks are turned off is captured by clock gating efficiency. Examples of these 2 types of mechanisms are shown below:

1) Clock Gating with Free running Clocks :

always_ff @(posedge clk or negedge reset) begin
  if(!reset)      // reset is active-low
     Q <= 1'b0;
  else if (clk_en)
     Q <= D;
end


2) Clock Gating with Gated Clocks

// Latch the enable while clk is low so gated_clk cannot glitch
always_latch  begin
 if(~clk)
    clkg_en = enable;
end

assign gated_clk = clk & clkg_en;

always_ff @(posedge gated_clk or negedge reset) begin
  if(!reset)
     Q <= 1'b0;
  else
     Q <= D;
end

Figure: 1) Flop with Clock Enable, 2) Flop with Gated Clock

As a generic guideline, if there are a lot of flops in the logic that use the same gating enable, then it's better to design a gated clock implementation. This 2nd implementation is well suited for data flops, and it also decreases timing risk on the enable (clk_en) as the enable signal does not need to travel to every gated flop. This type of implementation is also able to provide glitch-free clocking. Back-end tools often create gated clocks by combining several flops that use the same clock gating enable if the clock optimization feature is turned on during synthesis. There are also other tools like PowerArtist that statically analyze the RTL design and identify gated flops, their efficiencies, the overall power of each block and potential flops that could be gated.
The 1st implementation is used where the clock gating logic is more diverse and each or some of the flops need a separate functional clock enable. This type of clock gating is also used to hold or freeze the value of flops as a way to debug or stall the logic.
There are also levels in clock gating, similar to power gating. The first level is called trunk level gating, wherein the clock to the top level of a design block can be shut off. Then there is leaf level gating, wherein parts of the module can be gated individually while the rest of the submodules are still ON.

Clock gating Structural Hierarchy

Clock & Power-down Overrides :
Power and clock overrides are used to ungate the clocks. There may be cases where a functional issue causes Flops to be gated incorrectly. If such issues occur late in a Project, when the Design is in convergence mode, they may cause significant delays as the Design needs to be re-synthesized after the bug is corrected. In order to avoid such delays, Powerdown & Clock overrides act as a backup feature to work around the issue. These overrides also support DFT (Design for Test) by providing a way to Scan the flops. The override mechanism is often enabled through Firmware Programming.
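
As a rough sketch, an override could be folded into the latch-based gate from the gated-clock example above by OR-ing it into the enable. The module name and the `clk_override` and `scan_mode` inputs are assumptions made for illustration; they are not signals defined elsewhere in this article.

```systemverilog
// Sketch only: clk_gate_ovr, clk_override and scan_mode are hypothetical names.
module clk_gate_ovr (
  input  logic clk,
  input  logic enable,        // functional clock-gating enable
  input  logic clk_override,  // firmware-programmed override bit
  input  logic scan_mode,     // keeps the clock running during scan
  output logic gated_clk
);
  logic clkg_en;

  // The latch is transparent while clk is low, so the enable cannot
  // change while clk is high and glitch the gated clock.
  always_latch begin
    if (~clk)
      clkg_en <= enable | clk_override | scan_mode;
  end

  assign gated_clk = clk & clkg_en;
endmodule
```

With the override bit set, the clock is never gated, so a late-found gating bug can be worked around at the cost of the Power saving.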

Adders :

Adders are used extensively in Computer arithmetic for addition, subtraction, multiplication and division. There is a constant attempt to make Adders faster, as faster Arithmetic leads to faster Machines, e.g. faster Graphics. At a very basic level Adders are classified as Half Adders and Full Adders:

Half Adder : It's a 1-bit Adder with no Carry-in. The output is a 1-bit Sum and a Carry-out. Sum is the XOR of the inputs and Carry-out is the AND of the inputs, represented as follows:

Sum   = A ^ B;
Carry = A & B;

Gate and block diagram representation of Half Adder is shown below :

Half Adder
(Courtesy: Creative Commons, Attribution-Share Alike 4.0 International )
Half Adder Block Diagram

Full Adder : It's a 1-bit Adder with a carry-in (C) from the previous Adder stage, and is the building block of multi-bit Adders. Sum and Carry-out are represented as :

Sum    = A ^ B ^ C;
       = A B C + A' B' C + A' B C' + A B' C';

Carry  = C (A ^ B) + AB;
       = C ( A B' + A' B) + AB; 
       = A B' C + A' B C + A B;

Gate and block diagram representation of Full Adder is shown below :

Full Adder
(Courtesy : Creative Commons Attribution-Share Alike 4.0 International )
Full Adder Block Diagram

Multi-bit Adders can be formed by combining HA and FA. In each case, calculation of Carry is the most critical part in terms of Timing. Here’s an example of 2-bit Adder & 4-bit Adder formed using 1 HA and multiple FA:

2-bit Adder
4-Bit Adder (Ripple-Carry)
(Courtesy: Creative Commons Attribution-Share Alike 4.0 International )
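
The composition described above can be sketched in SystemVerilog as follows. This is a minimal illustration that reuses the Sum and Carry equations from this section; the module and port names (`half_adder`, `full_adder`, `ripple_adder4`) are assumptions chosen for the example.

```systemverilog
// Sketch only: module and port names are hypothetical.
module half_adder (input logic a, b, output logic sum, carry);
  assign sum   = a ^ b;
  assign carry = a & b;
endmodule

module full_adder (input logic a, b, cin, output logic sum, cout);
  assign sum  = a ^ b ^ cin;
  assign cout = (cin & (a ^ b)) | (a & b);
endmodule

// 4-bit ripple-carry adder: 1 HA for bit 0, then 3 FAs.
module ripple_adder4 (
  input  logic [3:0] a, b,
  output logic [3:0] sum,
  output logic       cout
);
  logic [3:0] c;  // c[i] is the carry out of bit i

  half_adder u_ha (.a(a[0]), .b(b[0]), .sum(sum[0]), .carry(c[0]));

  for (genvar i = 1; i < 4; i++) begin : g_fa
    full_adder u_fa (.a(a[i]), .b(b[i]), .cin(c[i-1]),
                     .sum(sum[i]), .cout(c[i]));
  end

  assign cout = c[3];
endmodule
```

Each carry `c[i]` depends on `c[i-1]`, which is why the carry chain sets the critical Timing path of this structure.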

In order to optimize carry and make Adders faster, different flavors of Adders have been introduced:
1. Ripple Carry Adder
2. Carry-Save Adder
3. Look-Ahead Carry Adder

Signal Names in Digital Design :

There are various methods of defining signal names in Digital Design. Naming a signal correctly may seem a trivial & underrated task but it's important because it often results in faster debug and increases the overall readability of the Code. Defining an intuitive signal name can be difficult sometimes due to shortage of time or because the number of signals needed in the design is hard to predict. Below are a few references that can help Designers in faster and intuitive Signal Naming:

Defining Inter-Block Signals : Here Block represents a group of Sub-Modules or Modules enclosed in a Top-level Module or Wrapper. Communication between the Blocks may require clock crossing, e.g. between 2 entities within an IP.

Nomenclature : source_destination_signalname_clkdomain_flopStaging

Example 1: logic gen_recv_req_c1ff 
gen   = generator block. 
recv  = receiver block.
req   = signal name (request).
c1    = Clock Domain (optional if same clock domain)
ff    = Staging i.e arriving from a flop.

Defining Intra-Block Signals : These signals have source and destination within the Block but source could be one Submodule & destination could be in another Module within the block. Therefore we can Skip Block name for such cases:

Nomenclature : module1_module2_signalname_staging

Examples:
 
1. logic [7:0] genagt_arb_data_ff

   genagt =  generator agent (Module 1)
   arb    =  Arbiter (Module 2)
   data   =  Signal Name
   ff     =  Staging i.e Flopped    

2. logic buff_cntrl_sel_comb
   buff  = Buffer (Module 1)
   cntrl = Control (Module 2)
   sel   = Signal Name
   comb  = Staging i.e Combinational

Defining Internal Module/Submodule Signals : These types of signals do not communicate outside, as their source and destination are within the Module. For such signals, the Block, Module, source & destination names can be skipped, as communication is from one piece of combinational logic to another, or to a Flop and vice versa. Here are a few examples :

Nomenclature  : signalname_staging

Example 1 :
logic eventsel_qual
eventsel  = Event Select (signal name)
qual      = qualified (staging)

Example 2 :
logic stallcyc_ff
stallcyc = Stall Cycle (Signal name)
ff       = flop (Staging)

It is also important to avoid some signal names that could be misleading or unclear, here are examples of unclear Signal Naming :

1. logic arbvalsff; // Difficult to understand without spaces/underscores in between.

2. logic Arb_Val_Sff;// Difficult to grep for some Editors if upper and lower case signals are mixed, also decreases readability.

3. logic arbiter_value_sclock_flopped; // Too long, becomes unreadable.

4. logic arvl_sff; // Too short, becomes unreadable.

5. logic arb_val_ff2; // Numeric Value at the end can cause problems in some Compilers.

6. logic arb_val_inst; // '_inst' could be mistaken for a Module Instance

Clocking Blocks:

Clocking blocks are used to drive and sample DUT signals relative to a clock event. A Clocking block captures the timing of a protocol and is usually defined in an Interface. In certain instances a Clocking block can be used to trigger an event that happens after certain conditions are met, e.g. trigger a counter after Request or Valid is seen. It can also stand in for clocking events that are on a different clock and are not readily available, or for logic that has not been coded yet. Because signals are sampled and driven with a defined skew around the clocking event, Clocking blocks help avoid Race conditions between the Program or Testbench and the DUT.

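interface example_i(input logic clk);

logic request, grant, stall;

// Testbench view: sample grant/stall, drive request
clocking cb_i @(posedge clk);
  input  grant, stall;
  output request;
endclocking

// A grant without a stall must be followed by a request in the next cycle
property check;
  @(posedge clk) (~stall & grant) |=> request;
endproperty

endinterface

program test(example_i interf_i);

Prop_1: assert property (interf_i.check);

initial begin
  @(interf_i.cb_i);                // wait for the clocking event
  interf_i.cb_i.request <= 1'b1;   // synchronous drive through the clocking block
end

endprogram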

Clocking blocks cannot be defined inside Tasks, Functions or Packages, and the Scope of the Clocking Block is Static, i.e. limited to the Interface, Program or Module in which it's defined.

A Testbench can synchronize to the clocking block by waiting on the clocking block name itself, or by waiting on the sampled input signals inside it, which are captured with the declared skew.

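clocking legal_bus_read @(posedge clk);
  input  negedge bus_avail;   // sampled at the negedge before the clocking event
  output request_valid;       // driven by the Testbench
endclocking

// In a program or Testbench Module

initial begin
  //Option-1: wait for the clocking event
  @(legal_bus_read);

  //Option-2: wait on a sampled input (outputs cannot be read back)
  wait (legal_bus_read.bus_avail == 1);
end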

SystemVerilog Interface :

Interface with Modports
Interface

SystemVerilog Interface is a convenient method of communication between 2 design blocks. An Interface encapsulates information about signals such as ports, clocks, defines, parameters and directionality into a single entity. This entity can then be accessed at a very low level, e.g. Register access, or at a very high level, e.g. a Virtual Interface.

Additionally, we can also define Tasks and Functions inside an Interface along with Assertions and SVA Checks. This encapsulation provides portability to a Design that can use the same interface to communicate differently depending on the type of use. This is achieved by defining Modports and Clocking blocks inside the Interface.

Similarly, the same design module may use different Interfaces to communicate differently, provided the interfaces define the same Tasks or Functions.

Here’s a simple example of an Interface :

interface example_inter;
  logic [7:0] data_load;
  logic [7:0] data_read;
  logic rd_en;
  logic wr_en;
endinterface

module primary(example_inter example_if, input clock, reset);
 logic [7:0] data_storage;

 always_ff @(posedge clock) begin
   if(reset)
     data_storage <= '0;
   else
     data_storage <= example_if.data_load;
 end 

 assign example_if.data_read = example_if.rd_en ? data_storage : '0;  
endmodule


module top;
  logic clock;
  logic reset;
  
   example_inter example_if ();
   primary       primary_if 
 (.example_if(example_if), .clock(clock), .reset(reset));  
   
endmodule

Here’s an example of Interface using Modport :

interface example_inter;
  logic [7:0] data_load;
  logic [7:0] data_read;
  logic rd_en;
  logic wr_en;

 modport generator_m (output rd_en, output wr_en, output data_load);
 modport receiver_m 
 (input rd_en, input wr_en, input data_load, output data_read);  

endinterface

module generator (example_inter.generator_m gen_i);
// generator Module code
...
..
endmodule

module receiver (example_inter.receiver_m rec_i);
// receiver Module Code
...
..
endmodule

module top;
 example_inter inter_inst();
 
 generator gen_inst (.gen_i(inter_inst));
 receiver  rec_inst (.rec_i(inter_inst));
endmodule 

Task inside Interface :

A Task defined inside an Interface can be called by any Module connected to that Interface. This allows the same Task to be reused by different Modules, each supplying its own values. E.g. below, one Task ‘timer‘ can be used by different Modules for counting purposes, each providing its own threshold value on the interface.

interface example_inter;
 logic [2:0] count_value;
 logic [2:0] threshold;

task timer (input logic [2:0] count_value);
//Counter Code to count till count_value
... 
..
endtask

endinterface

module example_delay (example_inter ifc);
logic [2:0] final_count;
assign final_count = ifc.threshold;

// count till Threshold
// (a Task must be called from a procedural block)
initial begin
  ifc.timer(final_count);
end

endmodule

Asynchronous FIFO :

Asynchronous FIFO is needed whenever we want to transfer data between design blocks that are in different clock domains. The difference in clock domains makes writing and reading the FIFO tricky.

If appropriate precautions are not taken then we could end up in a scenario where write into FIFO has not yet finished and we are attempting to Read it or Vice-versa. This scenario often causes data loss and Metastability issues.

In order to avoid such scenarios, the reading and writing is done via a synchronizer. The synchronizer ensures that read and write pointers calculations are consistent and data in FIFO is not accidentally overwritten or read twice.

However, with the clock crossing we need to ensure that the FIFO full and empty conditions take the synchronization cycles into account. In other words, the full and empty conditions need to be pessimistic: full may stay asserted a few cycles after a read frees an entry, and empty may stay asserted a few cycles after a write.

Here’s an example of an 8-deep FIFO with Write in the aclk domain and Read in the bclk domain:

logic [3:0] wraddr, wraddr_aff, wraddr_gray, wraddr_gray_aff;
logic [3:0] rdaddr, rdaddr_bff, rdaddr_gray, rdaddr_gray_bff;
logic [3:0] wraddr_bff, wraddr_b1ff; // write pointer synchronized to bclk
logic [3:0] rdaddr_aff, rdaddr_a1ff; // read pointer synchronized to aclk
logic [7:0] wren_qual, rden_qual;
logic [31:0] data_ff[7:0];
logic [31:0] data_in, data_out;
logic wr_en, rd_en;
logic fifo_empty, fifo_full;

//---- Next write pointer, bit [3] is the wrap bit ----//
always_comb begin
 if(!fifo_full & wr_en)
   wraddr = wraddr_aff + 1'b1;
 else
   wraddr = wraddr_aff;
end

//---- Convert wraddr to Gray code ----//
assign wraddr_gray = (wraddr >> 1) ^ wraddr;

always_ff @(posedge aclk or negedge reset) begin
 if(!reset) begin
   wraddr_aff      <= '0;
   wraddr_gray_aff <= '0;
 end
 else begin
   wraddr_aff      <= wraddr;
   wraddr_gray_aff <= wraddr_gray; // flopped so the crossing is glitch-free
 end
end

//----- Clock sync to bclk --------//
always_ff @(posedge bclk or negedge reset) begin
 if(!reset) begin
   wraddr_bff  <= '0;
   wraddr_b1ff <= '0;
 end
 else begin
   wraddr_bff  <= wraddr_gray_aff;
   wraddr_b1ff <= wraddr_bff;
 end
end

//---- Next read pointer, bit [3] is the wrap bit ----//
always_comb begin
 if(!fifo_empty & rd_en)
   rdaddr = rdaddr_bff + 1'b1;
 else
   rdaddr = rdaddr_bff;
end

//------ Convert rdaddr to Gray code -------//
assign rdaddr_gray = (rdaddr >> 1) ^ rdaddr;

always_ff @(posedge bclk or negedge reset) begin
 if(!reset) begin
   rdaddr_bff      <= '0;
   rdaddr_gray_bff <= '0;
 end
 else begin
   rdaddr_bff      <= rdaddr;
   rdaddr_gray_bff <= rdaddr_gray; // flopped so the crossing is glitch-free
 end
end

//------- Clock Sync to aclk ----------//
always_ff @(posedge aclk or negedge reset) begin
 if(!reset) begin
   rdaddr_aff  <= '0;
   rdaddr_a1ff <= '0;
 end
 else begin
   rdaddr_aff  <= rdaddr_gray_bff;
   rdaddr_a1ff <= rdaddr_aff;
 end
end

//------- Data Read and Write ---------//
// One-hot entry selects decoded from the lower 3 bits of the current pointers
assign wren_qual = (wr_en & !fifo_full) ? 
                   (8'b1 << wraddr_aff[2:0]) : 8'b0;
assign rden_qual = (rd_en & !fifo_empty) ? 
                   (8'b1 << rdaddr_bff[2:0]) : 8'b0;

genvar i;
generate
for (i = 0; i < 8; i++) begin : g_data
always_ff @(posedge aclk) begin  // data flops are not reset
  if(wren_qual[i]) 
     data_ff[i] <= data_in;
  end
end
endgenerate

always_comb begin
 data_out = '0;                  // default when no entry is being read
 for (int j = 0; j < 8; j++) begin
  if(rden_qual[j]) 
    data_out = data_ff[j];
 end
end

//------ Full and Empty Conditions -----------------//
// Gray-coded pointers are one full wrap apart when the top two bits
// differ and the remaining bits match.
assign fifo_full = 
(wraddr_gray_aff == {~rdaddr_a1ff[3:2], rdaddr_a1ff[1:0]});

assign fifo_empty = 
(wraddr_b1ff[3:0] == rdaddr_gray_bff[3:0]);
Alternately, similar to Synchronous FIFO, we can also use synchronized write and read rollover signals in calculation of full and empty conditions.

Rise & Fall-Edge Signal Detection:

A Digital Designer often needs to define an interface to communicate with other Design Modules. This communication is defined by a protocol that may involve detection of the Rising or Falling edge of a Signal, e.g. the Rising edge of Request and the Falling edge of Ack. In such cases Edge detection logic can be designed as follows:

Rising Edge Detection :

logic rise_edge_sig_a;
logic level_sig_a;
logic level_sig_a_ff;

always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    level_sig_a_ff <= 1'b0;
  else
    level_sig_a_ff <= level_sig_a;
end

assign rise_edge_sig_a = level_sig_a & (~level_sig_a_ff);
Rising Edge Detection

Falling Edge Detection :

logic fall_edge_sig_b;
logic level_sig_b_ff;
logic level_sig_b;

always_ff @(posedge clk or negedge reset) begin
  if(!reset)
    level_sig_b_ff <= 1'b0;
  else
    level_sig_b_ff <= level_sig_b;
end

assign fall_edge_sig_b = (~level_sig_b) & level_sig_b_ff;
Falling Edge Detection

Please note that if your intention is to use Level Signal information & convert it into corresponding pulses (Level-to-Pulse Converter) then this design is not a good design fit. This is because the design is Edge detection circuit and relies on edge of the source signal. Therefore, Level Signal information may get lost in the conversion from Level to Pulse.

Mux/De-Mux/Case Statements in SystemVerilog :

Multiplexers are used to select a single input from several inputs with the help of a Select signal. Muxes form combinational logic that can be written as follows. For a binary-encoded select, the number of select bits n is given by 2^n = number of inputs. The example below uses a one-hot select instead, with one select bit per input.

logic [3:0] select;
logic mux_out; // 'output' and 'input' are reserved words and cannot be signal names
logic input_1, input_2, input_3, input_4;
always_comb begin
case (select[3:0])
4'b0001 : mux_out = input_1;
4'b0010 : mux_out = input_2;
4'b0100 : mux_out = input_3;
4'b1000 : mux_out = input_4;
default : mux_out = 1'b0;
endcase
end

The above logic can also be coded using an “if else” statement in an always_comb block or using an assign statement with the “? :” operator. In this case the Mux becomes a Priority Encoder, as the priority is input_1 > input_2 > input_3 > input_4.

always_comb begin
if (select == 4'b0001) mux_out = input_1;
else if (select == 4'b0010) mux_out = input_2;
else if (select == 4'b0100) mux_out = input_3;
else if (select == 4'b1000) mux_out = input_4;
else mux_out = 1'b0;
end

assign mux_out = (select == 4'b0001) ? input_1 :
(select == 4'b0010) ? input_2 :
(select == 4'b0100) ? input_3 :
(select == 4'b1000) ? input_4 :
1'b0;

If the Muxes are being used to drive a Data Bus then it's recommended that the Selects be driven from Flops (e.g. an FSM), otherwise the outputs may change continuously if the selects are not stable in that cycle. Also, if the select to the Muxes is not from a Flop but is computed in the same always_comb block, it's recommended to assign the select a default value first, before the logic that computes it, as this avoids latches and X-propagation into downstream logic:

always_comb begin
select = 4'b0; // default value
// combinational logic that computes select goes here
case (select[3:0])
4'b0001 : mux_out = input_1;
4'b0010 : mux_out = input_2;
4'b0100 : mux_out = input_3;
4'b1000 : mux_out = input_4;
default : mux_out = 1'b0;
endcase
end

De-Multiplexers or decoders perform the opposite operation to Multiplexers. Here a single input is distributed to many outputs depending on the select. The number of bits in the select is log2 of the number of outputs, e.g. 2 select bits for 4 outputs. Decoders are often used in converting packed signals to unpacked signals or in generating a per-entry enable for Arrays.

logic [1:0]  sel;
logic [3:0] out;

always_comb begin
unique case (sel[1:0])
2'b00 : out = 4'b0001;
2'b01 : out = 4'b0010;
2'b10 : out = 4'b0100;
2'b11 : out = 4'b1000;
endcase
end

Mux and De-Mux design in SV is often done using Case Statements. There are different flavors of case statements and each one is used in different scenarios to achieve different results.

Casex : In this type of case statement, bits with ‘x’ or ‘z’ values, in either the case expression or the case items, are treated as don’t-care in the comparison. casex statements can result in different simulation and synthesis results, so one needs to be extra careful while using casex. E.g. when select[2:0] is 3’bxxx in simulation, every bit is ignored and the first item matches, so the output is 2’b01 and the unknown select is hidden. In hardware the select always has a real value, e.g. 3’b100, for which the output is 2’b00.

logic [2:0] select;
logic [1:0] output_a;

always_comb begin
casex (select[2:0])
3'bxx1 : output_a = 2'b01;
3'b01x : output_a = 2'b10;
3'b001 : output_a = 2'b11; // never reached, 3'bxx1 above matches first
3'b100 : output_a = 2'b00;
default : output_a = 2'b00;
endcase
end

Casez : In casez statements, bits with ‘z’ values are ignored or treated as don’t-care. However, the bits with ‘x’ values are used in comparison. The casez statements are very useful in creating a priority logic and are more readable than if-else statements.

logic [2:0] selb;
logic [1:0] output_b;

// Priority of selection [0] > [1] > [2]
always_comb begin
casez (selb[2:0])
3'b??1 : output_b = 2'b01;
3'b?10 : output_b = 2'b10;
3'b100 : output_b = 2'b11;
default : output_b = 2'b00;
endcase
end

Synthesis Directives : In some instances depending on the compiler it is allowed to pass on compiler directives such as ‘// synthesis full_case‘ and ‘// synthesis parallel_case‘ with the case statement.

logic [2:0] selb;
logic [1:0] output_b;

// 1) Example of full case
// Priority of selection [0] > [1] > [2]
always_comb begin
casez (selb[2:0]) // synthesis full_case
3'b??1 : output_b = 2'b01;
3'b?10 : output_b = 2'b10;
3'b100 : output_b = 2'b11;
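endcase
end

In full_case, the user explicitly tells synthesis that the output is don’t-care when selb = 3’b000, so the unlisted value is not evaluated and less hardware is generated. Simulation ignores the directive and holds the previous value of output_b for selb = 3’b000, which behaves like storage (a latch), so Simulation and Synthesis results can differ.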

logic [2:0] selc;
logic input_x, input_y, input_w, output_z;
always_comb begin
case (selc) // synthesis parallel_case
3'b001 : output_z = input_x;
3'b010 : output_z = input_y;
3'b100 : output_z = input_w;
endcase
end

In parallel case, the user tells synthesis that the selection conditions are mutually exclusive, so synthesis does not build priority logic between the case items. Parallel case is often useful in case of one-hot logic, e.g. an FSM.

In general compiler directives should be avoided as it may result in different Simulation and Synthesis results.
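
SystemVerilog also provides the `priority` and `unique` case modifiers, which state a similar intent in the language itself so that the simulator can report a violation instead of silently diverging from synthesis. The sketch below reuses the signal names from the two examples above; the wrapper module `case_modifiers` is a hypothetical name added only to make the snippet self-contained.

```systemverilog
// Sketch only: case_modifiers is a hypothetical wrapper module.
module case_modifiers (
  input  logic [2:0] selb, selc,
  input  logic       input_x, input_y, input_w,
  output logic [1:0] output_b,
  output logic       output_z
);
  // priority: items are evaluated in order and one of them must match.
  // selb == 3'b000 is reported as a violation during simulation.
  always_comb begin
    priority casez (selb)
      3'b??1 : output_b = 2'b01;
      3'b?10 : output_b = 2'b10;
      3'b100 : output_b = 2'b11;
    endcase
  end

  // unique: exactly one item must match, so a non-one-hot selc
  // is reported as a violation during simulation.
  always_comb begin
    unique case (selc)
      3'b001 : output_z = input_x;
      3'b010 : output_z = input_y;
      3'b100 : output_z = input_w;
    endcase
  end
endmodule
```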

Basic Gates using Muxes :

  • Inverter/NOT Gate using 2:1 Mux
logic input_x;
logic output_y;
assign output_y = input_x ? 1'b0 : 1'b1;
  • AND Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign output_y = input_x ? input_w : 1'b0;
  • OR Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign output_y = input_x ? 1'b1 : input_w;
  • XOR Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign output_y = input_x ? ~input_w : input_w;

Operator usage in SystemVerilog:

  • Assign operator: a continuous assignment, used in writing Combinational logic.
Ex : assign a = b; 
  • Arithmetic & Assignment operator : Generally used in combinational logic, loops and generate loops.
Arithmetic Operator types
x = y + z; - Add Operator
x = y - z; - Subtract Operator
x = y / z; - Divide Operator
x = y % z; - Modulo Operator
x = y * z; - Multiply operator

Arithmetic Assignment Operator types
a+=1; i.e, a = a + 1;
a-=1; i.e, a = a - 1;
a/=1; i.e, a = a / 1;
a*=1; i.e, a = a * 1;
a%=4; i.e. a = a % 4;
a*=2; i.e, a = a * 2;
  • Reduction Operators: Generally, used in combinational control logic:
    logic [3:0] sel;
logic any_sel_hi;
logic all_sel_hi;
logic sel_parity;

any_sel_hi = |sel; // any_sel_hi = sel[3] | sel[2] | sel[1] | sel[0];
all_sel_hi = &sel; // all_sel_hi = sel[3] & sel[2] & sel[1] & sel[0];
sel_parity = ^sel; // sel_parity = sel[3] ^ sel[2] ^ sel[1] ^ sel[0];

inv_sel_parity = ~^sel;
//inv_sel_parity = ~(sel[3] ^ sel[2] ^ sel[1] ^ sel[0]);

inv_any_sel_hi = ~|sel;
//inv_any_sel_hi = ~(sel[3] | sel[2] | sel[1] | sel[0]);

inv_all_sel_hi = ~&sel;
//inv_all_sel_hi = ~(sel[3] & sel[2] & sel[1] & sel[0]);
  • Relational Operators : Used for comparison in combinational logic:
logic a;
logic b;
logic c;

assign c = a > b; // c is high/True if a greater than b
assign c = a < b; // c is high/True if a less than b

assign c = a >= b; // c is high/True if a greater than or equal to b
assign c = a <= b; // c is high/True if a less than or equal to b
  • Shift Operators : Logical Shift & Arithmetic Shift.
logic [2:0] a;
logic signed [2:0] b;
logic [2:0] c, d, e, f;
assign a = 3'b101;
assign b = 3'b101;

// Logical Shift
assign c = a << 1; // shift a by 1 position to left &
// fill the LSB with Zero and remove MSB,
// i.e c = 3'b010;
assign d = a >> 1; // shift a by 1 position to right &
// fill the MSB with Zero and remove LSB,
// i.e d = 3'b010;

//Arithmetic Shift
assign e = b <<< 1; // shift b by 1 position to left &
// fill the LSB with Zero and remove MSB, same as
// the logical left shift, i.e e = 3'b010;
assign f = b >>> 1; // shift b by 1 position to right &
// fill the MSB with signed bit and
// remove LSB, i.e f = 3'b110;

// Note that if b were not a Signed datatype,
// the arithmetic right shift would also fill the MSB with Zero,
// so results of arithmetic and logical shift would have been same.
// i.e c = e, d = f;

  • Conditional Operator: Used in combinational logic to create Muxes and/or decoder logic.
logic a, b, c, d;
assign c = b ? a : d; // check if b is true, if yes then c = a else c = d.
  • Concatenation & Replication Operator : Used in joining bits to create Bus, concatenation can be on LHS and RHS of assignments. concatenation is treated as packed vector.
logic [4:0] d;
logic [1:0] b,c;
logic a;
// Concatenation
assign d = {a, b, c};

// Replication
d = {5{a}}; // d = {a,a,a,a,a};

// Concatenation + Replication
d = {b,{3{a}}}; // d = {b[1:0], a, a, a};
d = {{1{b,c}},a}; // d = {b[1:0],c[1:0],a};
  • Logical Operator : Used in comparison of logical expressions. Mainly used to compare and create boolean results, i.e. True or False. Bitwise operators are used instead if multiple bits are being manipulated.
logic a, b;
logic [3:0] d;
assign a = 1'b1; assign b = 1'b0;
// Result d = 4'h0 if any of a or b is 0.
if(a && b) d = 4'hf; else d = 4'h0; // Result d = 4'h0;

assign a = 1'b1; assign b = 1'b0;
// Result d = 4'hf if any of a or b is 1.
if(a || b) d = 4'hf; else d = 4'h0; // Result d = 4'hf;

  • Wildcard Operator : ‘==?’, ‘!=?’. Here operator ‘?’ acts as a wildcard and matches ‘x’ and ‘z’ values from RHS to any value of corresponding bit on LHS.
logic [1:0] c;
logic [1:0] d;
logic e;

assign c = 2'bxz;
assign d = 2'b11;
assign e = (d ==? c); // Result e = 1'b1, as x and z act as wild cards

assign c = 2'b10;
assign d = 2'b1z;
assign e = (d ==? c); // Result e = 1'bx, as z on LHS is not a wild card.
  • Streaming Operator : Streaming operators ‘<<‘ & ‘>>’ are used inside a concatenation, as ‘{<<{...}}’ or ‘{>>{...}}’, to pack or unpack the data in a specified order. The packing or unpacking can be done on a matching data-type or by type-casting to a particular data-type that matches the Widths. If the packed data consists of 4-State and 2-State variables, the result is inferred as a 4-State type.
logic a, b, c, d;
logic [3:0] e;
assign e = {>>{a,b,c,d}}; // packs a stream of a,b,c,d, i.e e = {a,b,c,d}
// {>>{e}} generates the stream e[3],e[2],e[1],e[0]
// {<<{e}} generates the stream e[0],e[1],e[2],e[3]
{>>{a,b,c,d}} = e; // unpack a=e[3], b=e[2], c=e[1], d=e[0];

Resets in Digital Design :

There are two types of Resets in Digital Design:

Synchronous Resets :  In Synchronous Reset, flop reset is asserted/de-asserted on a predictable clock edge. The clock edge could be positive or negative depending on the Design requirements.      

Reset Removal in Multi Clock Domain

Asynchronous Resets :  In this type of reset, the flop does not need a clock for Reset. Since Reset is a separate input into the Flop, it can be reset asynchronously.

Reset Glitch Removal Circuit

Digital Design:

Synchronous Reset Example :

always @(posedge clk) begin
  if(Rst)
     Q <= 1'b0;
  else
     Q <= S;
end

Asynchronous Reset Example :

always @(posedge clk or posedge Rst) begin
  if(Rst)
     Q <= 1'b0;
  else
     Q <= S;
end

 Physical Design :

Due to the Cell libraries available in physical design, Synchronous resets synthesize to smaller Flops compared to their Asynchronous counterparts. In Designs where logic is reset on a set of conditions, Asynchronous reset may not be the optimum choice. Also, one of the biggest problems of Asynchronous reset is the de-assertion of Reset. During the de-assertion of Reset, there is a chance that logic may go into a Metastable state if the reset is released at or close to a clock edge. This may not be visible in Simulation and could easily become a problem in Silicon. In order to avoid it, there are various Reset de-assertion techniques for Asynchronous Resets.

For E.g Asynchronous Assertion & Synchronous de-assertion.
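
A commonly used form of this technique is a two-flop reset synchronizer. The sketch below is a minimal illustration assuming an active-low reset; the module and signal names (`reset_sync`, `rst_n_async`, `rst_n_sync`, `rst_meta`) are hypothetical.

```systemverilog
// Sketch only: module and signal names are hypothetical.
module reset_sync (
  input  logic clk,
  input  logic rst_n_async,  // asynchronous, active-low reset input
  output logic rst_n_sync    // asserts immediately, releases on a clk edge
);
  logic rst_meta;

  always_ff @(posedge clk or negedge rst_n_async) begin
    if (!rst_n_async) begin
      // Assertion is asynchronous: both flops clear without a clock.
      rst_meta   <= 1'b0;
      rst_n_sync <= 1'b0;
    end
    else begin
      // De-assertion shifts a 1 through two flops, so rst_n_sync
      // is released synchronously, two clk edges after rst_n_async.
      rst_meta   <= 1'b1;
      rst_n_sync <= rst_meta;
    end
  end
endmodule
```

The downstream flops then use `rst_n_sync` as their asynchronous reset, so they all leave reset on a clock edge.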

Reset Distribution

Resets need to be distributed such that flops receiving resets receive it around the same time. This is ensured by creating a balanced Reset Tree. The Reset tree is similar to clock tree but it’s not as critical if the skew is reasonable. The skew is often balanced by using buffer & inverters pairs to Reset or by adjusting clock skew to the source Reset Synchronizer itself.  These techniques are often used after detailed Timing of Resets and clocks are available in physical design. (Reference : Clifford E. Cummings Paper on Resets).

Reset Synchronizer

Reset Glitches:

There could also be glitches which may cause flops to be reset accidentally. This behavior is avoided by adding delay circuit to reset. The intent of this delay circuit is to filter glitches in the reset and allow legal reset to propagate. These delay circuits are often available as a part of Gate Library or can be added as an additional logic. The requirement of Delay circuit is not necessary for all the designs and it depends on System Requirements. Due to Reset glitches, the Reset flops are often coded or inferred as non-scannable flops & not available for Scan Tests . 

Multi-Clock Logic Reset Removal:

In a multi-clock design, the reset should be distributed in each clock domain via its own reset distribution tree. The reset removal for such a design could be an issue if the reset is removed independently in each domain. In such cases reset removal is either coordinated or done sequentially, if system requirements permit. In most cases, there is a type of handshake circuit that waits for all the clock domains to come out of reset. The handshake logic then asserts a signal called IP Ready or Design Ready.

Asynchronous Reset Verification :

Disabling Resets in SV Assertions:

The assertions coded in RTL may sometimes need to be disabled as they are no longer applicable in Reset or behavior of certain Signals in Reset is of no significant value. In such cases, the Assertion or property is disabled using “disable iff(reset)” attribute.

E.x :    

assert property (@(posedge clock) disable iff (reset) $onehot(select_bus));      

In above example, the select_bus is a control signal that is expected to be one-hot after reset but we don’t expect one-hot behavior to hold true during Reset i.e select_bus could be 0 or X during reset. In Reset, we avoid the Assertion from false triggering by using “disable” attribute.

Synchronous Resets are less complex to verify as logic blocks come out of reset at predictable times. This results in fewer chances of X-propagation in the Design.

On the other hand, Asynchronous resets may cause an accidental X-propagation. For example, an Asynchronous Reset may power an FSM into an unknown State. Also, Asynchronous Resets may cause accidental glitches in the Design & it is recommended to use blocking assignments in Testbenches for Resets .

Reset to Memory Array, Data & Control signals.

Control signals & signals on the interface need to Reset to a known value. This is necessary to avoid X-prop or avoid nondeterministic behavior in the System.

It’s not recommended to Reset Data signals or large Memory Arrays. Using Reset Flops may result into larger Area as Reset Flops are bigger than normal flops and also Timing issues on Reset Routing.
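
The split between reset Control flops and non-reset Data flops can be sketched as follows. This is only an illustration of the guideline above; the module name `ctrl_data_regs` and all signal names are assumptions.

```systemverilog
// Sketch only: module and signal names are hypothetical.
module ctrl_data_regs (
  input  logic        clk,
  input  logic        rst_n,     // active-low asynchronous reset
  input  logic        valid_in,
  input  logic [31:0] data_in,
  output logic        valid_ff,
  output logic [31:0] data_ff
);
  // Control flop: reset to a known value so downstream logic never sees X.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)
      valid_ff <= 1'b0;
    else
      valid_ff <= valid_in;
  end

  // Data flops: no reset. The contents are only meaningful
  // when qualified by valid_ff.
  always_ff @(posedge clk) begin
    if (valid_in)
      data_ff <= data_in;
  end
endmodule
```

Consumers of `data_ff` are expected to qualify it with `valid_ff`, which is what keeps the unreset Data from propagating X into Control logic.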

Resets in SV Test-Bench:

Resets in simple Unit-level Test benches are coded using blocking statements. For such an implementation an initial block is used.  

module unit_tb();

logic clock;
logic reset;
logic req;

 initial  begin
   clock = 0;
   reset = 1;
   repeat (10) @(posedge clock);
   reset = 0;
   @(posedge clock);
   req = 1;
   @(posedge clock);
   req = 0;  
   $finish;
end

// Free-running clock with a period of 50 time units
always #25 clock = ~clock;


endmodule



Hard Reset vs Soft Resets:

There are various types of resets used in logic design. Hard reset is basically a system wide reset wherein logic is reset without any prior indication for saving any FSM States or any valuable System information inside Memory Array or Registers. This Reset is often caused by any catastrophic failure for E.g Circuit failure or User Reset or Device Malfunction.

Soft Resets are caused by failures that are often non-catastrophic for E.g Page-Faults & Exceptions. In such scenarios, valuable information in FSM and Memory is often secured and restored back after Logic is out of reset again. This enables faster recovery and increases overall availability of the system.

Brown-out Resets:

A Brown-out Reset is a type of Reset that is caused by changes in the operating voltage levels of the system. This change in operating voltage may be due to loss of Power or fluctuations in Power levels. This often happens due to Temperature changes or due to wearing of Electrical components. Brown-out Resets are often detected in the System by the Power Management Unit, and actions are taken to protect the System by flagging Interrupts or by running a special Interrupt Service Routine such as a safe Recovery Mode.