XOR based clock gating & implementation:

Clock gating is way to save power in synchronous logic by temporarily shutting-off clocks in sequential logic. The clock gating logic could be based on functional behavior of sequential logic or could be purely based of detection of Traffic activity through the logic block. More details are specified in clock-gating section here : 

XOR clock gating : In order to understand XOR based clock gating, let us first understand the property of XOR logic. Here X and Y are inputs and Y is output of XOR gate.

X   Y  Output
0   0   0
0   1   1
1   0   1
1   1   0

XOR logic has a property that allows us to detect if 2 inputs are different i.e if 2 inputs are different the output is 1 otherwise its 0. Also, XOR allows the bits to be inverted if the bits are XORed with 1.

In case of XOR clock gating, the information(A) to be stored is inverted using XOR with 1s (Abar), then, the information to be stored is compared with current information(B) in the flops (again using XOR) and if more than 50% of the bits differ in terms of polarity, then the inverted information (Abar) is stored instead of information(A). This reduces number of bit-flips required for storage of information of A into B thereby saving switching power.

For E.g:
Consider a 10-bit Sequential logic (B[9:0]) storing a value on a valid and this value needs to be written and read (bout) every clock cycle.

logic [9:0] A;         // new info to be written
logic [9:0] Ainv;      // Xored info
logic [9:0] Abar;      // inverted info
logic [9:0] B;         // info to be Stored
logic [3:0] Ainv_cnt;  // Inverted count 
logic       save_inv; // Inversion indicator
logic       save_inv_ff; // Inversion indicator
logic       valid;    // incoming valid 
logic [9:0] b_out;     // info to be read out

assign Abar =  A ^ {10{1'b1}};
assign Ainv =  A ^ B;

// Count number of 1s
always_comb begin
  Ainv_cnt = '0;  
  for (int cnt = 0; cnt < 10; cnt++) begin
    Ainv_cnt += Ainv[cnt];

// Detect if bit-flips are more than 50%
assign save_inv = (Ainv_cnt > 5) ? 1'b1 : 1'b0;

always_ff (@posdege clk or negedge reset) begin
   sav_inv_ff <= 1'b0;
   save_inv_ff <= save_inv; 

// Store A or Abar to minimize switching
always_ff (@posdege of clk or negedge reset) begin
    b <= 10'b0;
  elsif(save_inv & Valid)
    b <= Abar;
  elsif (~save_inv & Valid)
    b <= A;
    b <= b;
// On read-out, make sure to read correct stored info.
bout = save_inv_ff ? b ^ {10{1'b1}} : b;

Please note that XOR logic for inversion and adders gates for counting the bits increases the combinational gate count of the logic. Also an extra bit(save_inv_ff) is stored additionally. Therefore, any power savings here comes at a cost of Area increase. This Area increase will increase static power but will reduce dynamic or switching power of Flops. Therefore careful analysis is recommended before using this technique. 

In general, this technique is more suitable for highly correlated data. For E.g Media or Video type workloads.

Delay Module & Shift Registers :

Here are some Design problems that I have encountered at some point in my experience as a Designer and had discussions with others on possible solutions. Please note that there can be multiple approaches to solve the same problem so there are no exclusive right answers.

Question 1: How to detect a signal coming into a logic domain whose clocks are off?

Signal Capture
Signal Capture
Solution :  We can use a Set-Reset flop (signal_ff) whose clock is actually the Signal that we are trying to detect(bit_in). The Set pin of the signal is tied to 1 and Reset is tied to the level detection logic (reset) of flop as shown.Please note that the signal is asynchronous in nature and should be used carefully.

logic bit_in;
logic signal_ff;
logic reset;

always_ff (@posedge bit_in) begin
  if (reset)
    signal_ff <= 1'b0;
    signal_ff <= 1'b1;

assign reset = bit_in & signal_ff;

Question 2: How to detect quickly and efficiently if all bits in a BUS are a) all 0s, b) all 1s, c) any 0s and d) any 0s?

Solution :All zeros can be detected using below logic:
 a) assign allzeros = ~(|bus[7:0]);
 b) assign allones  = (&bus[7:0]);
 c) assign anyzeros = ~(&bus[7:0]);
 c) assign anyones  = (|bus[7:0]);

Question 3: How to design a FIFO or Buffer without explicitly checking Full & Empty conditions ?

This can be done through credit counting logic. 
1. Basically, the fixed number of credits are allocated to the requestor.
2. The requestor agent sends a request to the receiver FIFO/Buffer, its credit is then deducted from credit-counter. 
3. The credit is then incremented whenever FIFO is read out.
4. This Credit counter logic allows FIFO to never get Full as requestor agent is back-pressured whenever credit is not available.

Question 4: How to capture serial stream of Bits and calculate if its 8-bits are odd or even ?

Solution :
In this case, we can use a 8-bit shift Register and simple XOR logic to check if its bits are Odd. The implementation can done as follows :

logic       valid;
logic       bit_in;
logic [7:0] regin;
logic [7:0] regin_ff;
logic       odd_byte;

//8-bit Shift Register:
always_ff (@posedge clk or negedge rst) begin
    regin_ff <= 8'b0;
  elseif (Valid)
    regin_ff <= regin;
    regin_ff <= regin_ff;
assign regin = {regin_ff[6:0] bit_in};

//Odd byte calculation
assign odd_byte = (^regin_ff[7:0]);

Question : How to design a delay Module that has an input valid, data and a parameter that defines how many cycle to delay the data and output valid and data?.

Delay Module
Solution :
  In this case we can use a shift register as discussed above.
  However, key thing to note here is that we don't have to shift data. 
  This can be achieved by :
1. Creating a static ID for each data-packet and shifting the ID instead.
3. The data associated with incoming valid is allocated in any of the empty slots by using Find-First logic and an ID is created.
4. The Valid and ID are then shifted until valid reaches MSB.
5. Whenever MSB Valid is detected, the associated ID is used to index Data. 
6. The data is Muxed out and associated Valid and ID are de-allocated/Invalidated.  

SystemVerilog Assertions :

Assertions are a useful way to verify the behavior of the design. Assertions can be written whenever we expect certain signal behavior to be True or False. Assertions help designers to protect against bad inputs & also assist in faster Debug. Assertions are critical component in achieving Formal Proof of the Design.

In general Assertions are classified into two categories:
1. Concurrent Assertions
2. Immediate Assertions

1. Immediate Assertions: These type of Assertions check the properties that hold True or False all the time i.e Clock independent. For Ex. :

P1 :  if (req.opcode != reserved)
      $error ("opcode Error seen");
assert property (P1);

P2 : assert property (!Read && !Write);

2. Concurrent Assertions: These type of assertions are clock based and therefore property is checked only @posedge or @negedge of the clock. These Assertions are more popular in most of the Synchronous Designs. For Ex. :

P1: assert property @(posedge clk) disable iff(!rst) (req |=> grant);

sequence s1;
 (valid == 1b1);

sequence s2:
  ##[1:3] (data != '0);

P2: assert property @(posedge clk) disable iff(!rst) (s1 |-> s2);

Since Assertions cannot be synthesized it is necessary to guard them with `ifdef and `endif. Alternately, Assertions are grouped into a dedicated package and the package is selectively added depending on the type of compilation. However, adding it into a package makes difficult to debug as source code is isolated.

Lets take a look at different examples of Assertion Operators:

1. $fell() : Event fell in between 2 consecutive cycles.
// clk enable is 0 after 1 cycle whenever valid is 0.
P1: assert property @(posedge clk) (~val) |-> ##1 $fell(clk_en);

2. $change() : Event changed in between 2 consecutive cycles.
// FSM State changes between 2 cycles whenever ack is received.
P2: assert property @(posedge clk) (~ack) |-> ##[1:2] $change(state);

3. $stable() : Event is stable in between 2 consecutive cycles.
// counter is stable whenever wren is 0.
P3: assert property @(posedge clk) (~wren) |->  $stable(count);

4. $onehot() : Event is onehot encoded.
// FSM State is onehot encoded
P4: assert property @(posedge clk) $onehot(fsm_state);

5. $onehot0() : Event is at the most onehot or it could be all 0s.
// Mux select is onehot encoded at most or could be 0.
P5: assert property @(posedge clk) $onehot0(mux_sel);

6. $rose() : Event rose in between 2 consecutive cycles.
// Grant is seen after 1 cycle whenever request is asserted.
P6: assert property @(posedge clk) (req) |=> $rose(grant);

7. $past() : Event was True in previous cycle.
// If grant was seen then request was seen previously.
P7: assert property @(posedge clk) (grant) |-> $past(req);

8.  ##N : One Event was followed by another event in N cycles. // see 1 for example.

9. ##[M:N] : One Event was followed by another event in between M to N cycles. // see 2 for example

10. [*N] : Event was repeated for at least N consecutive cycles.  
// whenever stall is asserted ack is low for 3 consecutive cycles
P10: assert property @(posedge clk) (stall) |-> (~ack)[*3];

11. if (cond) prop1 else prop2 : If condition (cond) is satisfied then
Property (prop1) is True otherwise property (prop2) is True.
P11: assert property @(posedge clk) if (fifo_empty) (!read_en) else (read_en);

12. Ev1 |-> Ev2 : Whenever Event (Ev1) is True then Event (Ev2) is also True. // see 1 for example

13. Ev1 |=> Ev2: Whenever Event (Ev1) is True then Event (Ev2) is also True 
starting next cycle. // see 6 for example

14. $isunknown() : Check if Event/Signal is X or Z.
// opcode should not be X or Z.  
P14: assert property @(posedge clk) $isunknown(opcode);

15. $countones() : Count the number of 1s in the Signal/Event.
//For a 3-bit Mux-sel, assert select values should be less than 6
P15 : assert property @(posedge clk) ($countones(mux_sel) < 3'h6);

16. k[*M:N] : Event (k) is expected to be repeated between M and N Cycles
 // Power off should result in valid out to be off in between 3 to 5 cycles.  
 P16: assert property @(posedge clk) (pwr_off) |-> (~valid_out)[*3:5];

17. k[->N] : Match the Nth cycle of the Event (k).
//whenever write val is asserted write is seen at 8th cycle.
P17: assert property @(posedge clk) (in_wr) |-> (~write_valid)[->8];

18. k[->M:N] : Match the event (k) is True from M to N Cycle.
//whenever read val is asserted read is seen between 8 to 10 cycles.
P18: assert property @(posedge clk) (in_rd) |-> (~rd_valid)[->8:10];

19. ##[0:$] or ##[*] : Open-ended, Event is True eventually or 
by end of simulation.
// Arbiter will grant request eventually if no stall and request is high.
P18: assert property @(posedge clk) (req & ~no_stall) |-> ##[1:$] grant;

Sequence is also used as a part of Assertions whenever there are a series of events that need to happen in order for the event to hold False or True. Here’s an example of sequence:

sequence req_active;
   //request is de-asserted and seen to be asserted in between 1 to 3 cycles
  (!req)  ##[1:3] $rose(req); 

sequence stall_inactive;
 // Stall signal is held 0 to 3 consecutive cycles and then rose on 4th cycle
 (~stall)[*3] ##1 $rose(stall);

// whenever sequence req_active is True then sequence stall_active is True.
assert property @(posedge clk) disable iff (~rst) (req_active |-> stall_active);

Its important to note that Assertions should not be more complex than necessary. If the Assertion is complicated then the conditions can be split into multiple Assertions for simplicity and Debug. Also, Assertions also act as useful way to determine if there is an X-prop issue in the logic. This can be simply checked by adding an assertion on control signals to check whether they are driven to known values i.e !$unknown(sig)

Clock and Power Gating Techniques:

There are mainly two types of Power dissipation in CMOS Transistors.

1. Static Power dissipation :
Static Power dissipation is mainly caused due to leakage of Transistors. The leakage could be from any of the sources such as :

1. Gate Leakage through dielectric.
2. Subthreshold leakage when CMOS is off.
3. Junction leakage from source and drain diffusion.
4. Contention Current in ratioed circuits.

Pstat = (Isub + Igate + Icont + Ijunc) * V
Where, V  = Volatage
       I* = Various leakage Currents 

The static Power dissipation is inherent to the properties of Transistor and therefore its efficiency mainly depends on the type/technology of CMOS Transistor. As the Transistors are shrinking Static Power dissipation is increasing as Leakages are higher at smaller Technology nodes.

2. Dynamic Power dissipation:
Dynamic Power dissipation is caused due switching of Transistor i.e from 1->0 or 0 ->1. Also, it can be caused due to short circuit when both pMOS and nMOS are partially ON for very short time :

Pdyn = Psw + Psc
Where, Psw = Switching Power
       Psc = Short Circuit Power
Psw = a * C * V * V * f. 

Where,a = activity factor, 
      V = voltage,
      C = Capacitance,
      f = Frequency

As short-circuit power is often very small, its ignored in Pdyn calculations. Therefore Pdyn is directly proportional to frequency, Capacitance, activity factor & square of voltage. If we are able to reduce any of these factors dynamic power reduces in proportion.

The Total Power is combination of Static and dynamic Power and can be stated as :

Ptot = Pstat + Pdyn

In order to study understand how each Power dissipating factor can be reduced in Static and Dynamic Power dissipation, please refer to Chapter 5 Power of a book CMOS VLSI Design by Neil Weste & David Harris (Refer to Link below).

In a Front-end RTL design, Static Power and dynamic Power can be saved by efficient Power and clock gating techniques.

Power Gating :
In Power gating technique, the source of power to a logic block is turned-off temporarily whenever there is no logical processing needed or no activity is required. This is often done through a dedicated Power Management Unit inside the Design that provides various clock sources to different parts of the design. However, it should be noted that loss of power (e.g P1 domain) results into loss of data so important information such as FSM States and other important Firmware values should be stored (e.g Pon domain) somewhere so that it could be retrieved whenever Power is back and design block is functional again. This useful information is often shared in Power-retention Registers or a local RAM. These Registers and RAM retain power when rest of the logic is powered-off. There are also levels of depth of Power gating depending on the extent and length of the Time, the logic its supposed to be inactive. A good example would be Sleep vs Hibernate in a PC. Here’s an block diagram that gives an overview of Power Gating.

Power Gating Structural Hierarchy

Clock-gating :
Clock gating is a way reducing dynamic Power dissipation by temporary turning-off clock of the Flops on certain parts of the logic or by turning-off enable on gated Flops. In other words, Flops are turned-on only if there is valid information to be stored or transferred. The accuracy with which these clocks are Turned-off is captured by clock gating efficiency. The examples of these 2 types of mechanisms is shown below:

1) Clock Gating with Free running Clocks :

always_ff @(posedge clk or negedge reset) begin
     Q <= 1'b0;
  else if (clk_en)
     Q <= D;

2) Clock Gating with Gated Clocks

always_latch  begin
    clkg_en = enable;
assign gated_clk = clk & clkg_en;

always_ff @(posedge gated_clk or negedge reset) begin
     Q <= 1'b0;
     Q <= D;
1) Flop with Clock Enable
power and clock gating
2) Flop with Gated Clock

As a generic guideline, if there a lot of flops in the logic that use same gating enable then its better to design a gated clock implementation. This 2nd implementation is well suited for data-flops and it also decreases Timing-risk on enable (clk_en) as enable signal does not need to travel to every gated flop. Also this type of implementation is able to provide glitch-free clocking. Backend Tools often convert create gated clocks by combining several flops that use same clock gating enable if clock optimization feature is Turned-on during Synthesis. There are also other Tools like PowerArtist that statically analyze RTL design and identify gated Flops, their efficiencies, overall Power of each block and potential flops that could be gated.
The 1st implementation is used where clock gating logic is much diverse and each or some of the flops need a separate functional clock enable. This type of clock gating is also used to hold or freeze the value of flops as a way to debug or Stall the Logic.
There are also levels in Clock gating similar to Power Gating. The first level of clock gating is called Trunk level level gating wherein clock to a a Top level of the design block could be shut-off. Then there is a Leaf level gating wherein parts of the modules could be gated individually while rest of the submodules are still ON.

Clock gating Structural Hierarchy

Clock & Power-down Overrides :
Power and clock overrides are used to ungate the clocks. There maybe cases where there maybe a functional issue that may cause Flops to be gated incorrectly. If such issues occur late in a Project where the Design is in convergence mode it may cause significant delays as Design needs to be re-synthesized after correction of bug. In order to avoid such delays Powerdown & Clock overrides act as backup feature to workaround the issue. These overrides also provide a way of Testing DFT (Design for Test) by allowing way to Scan the flops. The override mechanism is often enabled through Firmware Programming.

SystemVerilog Interface :

Interface with Modports

SystemVerilog Interface is a convenient method of communication between 2 design blocks. Interface encapsulates information about signals such ports, clocks, defines, parameters and directionality into a single entity. This entity, then, can be accessed at very low level for e.g Register access or to a very high level for E.g Virtual Interface.

Additionally, we can also define Tasks, Functions inside an Interface along with Assertion and SVA Checks. This encapsulation provides portability to Design that can use same interface to communicate differently depending on the type of use. This is achieved by defining Modports and Clocking blocks inside Interface.

Similarly, same design module may use different Interfaces communicate differently if the two interface have same Task or Functions defined.

Here’s an simple example of an Interface :

interface example_inter;
  logic [7:0] data_load;
  logic [7:0] data_read;
  logic rd_en;
  logic wr_en;

module primary(example_inter example_if, input clock, reset);
 logic [7:0] data_storage;

 always_ff @(posedge clock) begin
     data_storage <= '0;
     data_storage <= example_if.data_load;

 assign example_if.data_read = example_if.rd_en ? data_storage : '0;  

module top;
  logic clock;
  logic reset;
   example_inter example_if ();
   primary       primary_if 
 (.example_if(example_if), .clock(clock), .reset(reset));  

Here’s an example of Interface using Modport :

interface example_inter;
  logic [7:0] data_load;
  logic [7:0] data_read;
  logic rd_en;
  logic wr_en;

 modport generator_m (output rd_en, output wr_en, output data_load);
 modport receiver_m 
 (input rd_en, input wr_en, input data_load, output data_read);  


module generator (example_inter.generator_m gen_i);
// generator Module code

module receiver (example_inter.receiver_m rec_i);
// receiver Module Code

module top;
 example_inter inter_inst();
 generator gen_inst (.gen_i(inter_inst));
 receiver  rec_inst (.rec_i(inter_inst));

Task inside Interface :

Task defined inside interface can be used by different Modules by defining the Task inside Interface. This method allows same Task to be used by different Modules by providing unique values. For E.g. Below, one Task ‘Timer‘ can be used by different Modules for counting purposes by providing unique values of threshold on interface.

interface example_inter;
 logic [2:0] count_value;
 logic [2:0] threshold;

task timer (input logic [2:0] count_value);
//Counter Code to count till count_value


module example_delay (example_inter ifc);
logic [2:0] final_count;
assign final_count = ifc.threshold;

// count till Threshold


Mux/De-Mux/Case Statements in SystemVerilog :

Multiplexers are used to select a single input from several inputs with the help of Select signal. Muxes form a combinational logic that can be written as follows. The number of bits required of select are calculated as 2^n = number of inputs , where n is number of select bits.

logic [3:0] select;
logic output, input;
always_comb begin
case (select[3:0]) begin
4'b0001 : output = input_1;
4'b0010 : output = input_2;
4'b0100 : output = input_3;
4'b1000 : output = input_4;
default : output = 1'b0;

The above logic can also be coded as using “if else” statement using a always_comb or using an assign statement using “? :” operator. In this case Mux becomes a Priority Encoded as priority of input_1 > input_2 > input_3 > input_4.

always_comb begin
if(select == 4'b0001) output = input_1;
else if (select == 4'b0001) output = input_1;
else if (select == 4'b0010) output = input_2;
else if (select == 4'b0100) output = input_3;
else if (select == 4'b1000) output = input_4;
else output = 1'b0;

assign output = (select == 4'b0001) ? input_1 :
(select == 4'b0010) ? input_2 :
(select == 4'b0100) ? input_2 :
(select == 4'b1000) ? input_2 :

If the Muxes are being used to drive Data Bus then its recommended that Selects to be driven from Flops (For E.x FSM ) otherwise the outputs may change continuously if the selects are not stable in that cycle. Also, if the select to the Muxes is not from a Flop its recommended to initialize select signal first as it avoids X-propagation into downstream logic:

always_comb begin
select = 4'b0;
case (select[3:0]) begin
4'b0001 : output = input_1;
4'b0010 : output = input_2;
4'b0100 : output = input_3;
4'b1000 : output = input_4;
default : output = 1'b0;

De-Multiplexers or decoders perform opposite operation to Multiplexers. Here a single input is distributed to many outputs depending on the select. The number is bits in select can be calculated as power of 2 of number of outputs. The decoders are often used in converting packed signals to unpacked signals or per entry enable for Arrays.

logic [1:0]  sel;
logic [3:0] out;

always_comb begin
unique case (sel[1:0]) begin
2'b00 : out = 4'b0001;
2'b01 : out = 4'b0010;
2'b10 : out = 4'b0100;
2'b11 : out = 4'b1000;

Mux and De-Mux design in SV is often done using Case Statements. There are different flavors of case statements and each one is used in different scenarios to achieve different results.

Casex : In this type of case statement bits used in comparison can be selectively ignored if the values of comparison are ‘x’ or ‘z’. casex statements can result in different simulation and synthesis results so one needs to be extra careful will using casex. For E.x : When select[2:0] is 3’bxxx the output in simulation could be 2’b01 while in synthesis output could be 2’b11 because select[2:0] becomes 3’b001 in ‘x’ or ‘z’ conditions.

logic [2:0] select;
logic [1:0] output_a;

always_comb begin
casex (select[2:0])
3'bxx1 : output_a = 2'b01;
3'b01x : output_a = 2'b10;
3'b001 : output_a = 2'b11;
3'b100 : output_a = 2'b00;
default : output_a = 2'b00;

Casez : In casez statements, bits with ‘z’ values are ignored or treated as don’t-care. However, the bits with ‘x’ values are used in comparison. The casez statements are very useful in creating a priority logic and are more readable than if-else statements.

logic [2:0] selb;
logic [1:0] output_b;

// Priority of selection [0] > [1] > [2]
always_comb begin
casez (selb[2:0])
3'b??1 : output_b = 2'b01;
3'b?10 : output_b = 2'b10;
3'b100 : output_b = 2'b11;
default : output_b = 2'b00;

Synthesis Directives : In some instances depending on the compiler it is allowed to pass on compiler directives such as ‘// synthesis full_case‘ and ‘// synthesis parallel_case‘ with the case statement.

logic [2:0] selb;
logic [1:0] output_b;

// 1) Example of full case
// Priority of selection [0] > [1] > [2]
always_comb begin
casez (selb[2:0]) // synthesis full_case
3'b??1 : output_b = 2'b01;
3'b?10 : output_b = 2'b10;
3'b100 : output_b = 2'b11;

In full_case, the user explicitly tells synthesis that we do not care about output when the selb = 2’b00. The Mux in this case infers storage and generates less hardware by not evaluating all the possible selb values.

logic [2:0] selc;
logic input_x, input_y, input_w, output_z;
always_comb begin
case (selc) // synthesis parallel_case
3'b001 : output_z = input_x;
3'b010 : output_z = input_y;
2'b100 : output_z = input_w;

In parallel case, user implies that the selection conditions are mutually exclusive so the synthesis generates limited States. Parallel case is often useful in case of one-hot logic for E.g FSM.

In general compiler directives should be avoided as it may result in different Simulation and Synthesis results.

Basic Gates using Muxes :

  • Inverter/NOT Gate using 2:1 Mux
logic input_x;
logic output_y;
assign ouput_y = input_x ? 1'b0 : 1'b1;
  • AND Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign ouput_y = input_x ? input_w : 1'b0;
  • OR Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign ouput_y = input_x ? 1'b1 : input_w;
  • XOR Gate using 2:1 Mux
logic output_y;
logic input_x;
logic input_w;
assign ouput_y = input_x ? ~input_w : input_w;