XOR based clock gating & implementation:

Clock gating is a way to save power in synchronous logic by temporarily shutting off the clocks to sequential logic. The clock-gating logic can be based on the functional behavior of the sequential logic, or purely on detecting traffic activity through the logic block. More details are given in the clock-gating section.

XOR clock gating : In order to understand XOR-based clock gating, let us first understand the property of XOR logic. Here X and Y are the inputs and Output is the output of the XOR gate.

X   Y  Output
0   0   0
0   1   1
1   0   1
1   1   0

XOR logic has a property that allows us to detect whether two inputs are different, i.e. if the two inputs differ the output is 1, otherwise it is 0. Also, XOR inverts a bit when that bit is XORed with 1.
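
As a quick illustration of these two XOR properties, here is a minimal Python sketch. The function names are hypothetical and chosen only for this example.

```python
def differing_bits(x, y):
    """Return a mask with a 1 in every bit position where x and y differ."""
    return x ^ y


def invert(x, width):
    """Invert the low `width` bits of x by XORing them with all 1s."""
    return x ^ ((1 << width) - 1)


mask = differing_bits(0b1100, 0b1010)   # 0b0110: the two middle bits differ
flipped = invert(0b1100, 4)             # 0b0011: every bit is inverted
```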

In XOR clock gating, the information to be stored (A) is inverted by XORing it with all 1s (Abar). A is then compared with the information currently held in the flops (B), again using XOR. If more than 50% of the bits differ in polarity, the inverted information (Abar) is stored instead of A, along with an indicator bit so that the original value can be recovered on read-out. This reduces the number of bit-flips needed to store A into B, thereby saving switching power.

For example:
Consider a 10-bit sequential logic (B[9:0]) that stores a value on a valid, where this value needs to be written and read (b_out) every clock cycle.
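
The decision described above can be sketched as a small behavioral model in Python. This is only an illustration under the assumption of a 10-bit word; WIDTH, ALL_ONES and choose_store_value are hypothetical names, not part of any tool or library.

```python
WIDTH = 10
ALL_ONES = (1 << WIDTH) - 1


def choose_store_value(a, b):
    """Return (value_to_store, inverted) for new data a and stored data b."""
    flips = bin(a ^ b).count("1")     # bits that would toggle if a were stored
    if flips > WIDTH // 2:            # more than 50% of the bits differ
        return a ^ ALL_ONES, True     # store Abar and remember the inversion
    return a, False                   # store A unchanged


stored = 0b0000000000
new = 0b1111111100                    # 8 of the 10 bits differ from `stored`
value, inverted = choose_store_value(new, stored)
flips_after = bin(value ^ stored).count("1")   # 2 bit-flips instead of 8
```

On read-out, the original data is recovered by inverting the stored value again whenever the indicator is set.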

logic [9:0] A;            // new info to be written
logic [9:0] Ainv;         // A XOR B: bits that would flip if A were stored
logic [9:0] Abar;         // inverted info
logic [9:0] B;            // stored info (A or Abar)
logic [3:0] Ainv_cnt;     // number of bits that would flip
logic       save_inv;     // inversion indicator
logic       save_inv_ff;  // inversion indicator stored alongside B
logic       valid;        // incoming valid
logic [9:0] b_out;        // info to be read out

assign Abar = A ^ {10{1'b1}};
assign Ainv = A ^ B;

// Count number of 1s
always_comb begin
  Ainv_cnt = '0;
  for (int cnt = 0; cnt < 10; cnt++) begin
    Ainv_cnt += Ainv[cnt];
  end
end

// Detect if bit-flips are more than 50%
assign save_inv = (Ainv_cnt > 5);

// Store the inversion indicator only when B is written,
// so that it always matches the stored data
always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    save_inv_ff <= 1'b0;
  else if (valid)
    save_inv_ff <= save_inv;
end

// Store A or Abar to minimize switching; B holds its value otherwise
always_ff @(posedge clk or negedge reset) begin
  if (!reset)
    B <= 10'b0;
  else if (save_inv & valid)
    B <= Abar;
  else if (~save_inv & valid)
    B <= A;
end

// On read-out, undo the inversion if the stored info was inverted
assign b_out = save_inv_ff ? B ^ {10{1'b1}} : B;

Please note that the XOR logic for inversion and the adder logic for counting the bits increase the combinational gate count of the design. An extra bit (save_inv_ff) also has to be stored. Therefore, any power savings here come at the cost of an area increase. This area increase will raise static power, but it will reduce the dynamic (switching) power of the flops. Careful analysis is therefore recommended before using this technique.

In general, this technique is more suitable for highly correlated data, for example media or video workloads.
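
One way to do a first-pass analysis is to count flop toggles for a given data sequence with and without the XOR scheme. The Python sketch below is a rough behavioral model, not a power estimate: it assumes a 10-bit word, flops that start at 0, and it counts toggles of the extra indicator flop as well. All names are hypothetical.

```python
WIDTH = 10
ALL_ONES = (1 << WIDTH) - 1


def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def count_flips(words):
    """Return (plain_flips, xor_flips) for writing `words` in order.

    xor_flips includes toggles of the extra inversion-indicator flop.
    """
    plain_b = 0               # flops storing the data directly
    enc_b, flag = 0, 0        # flops storing A or Abar, plus the indicator
    plain_flips = xor_flips = 0
    for a in words:
        plain_flips += popcount(a ^ plain_b)
        plain_b = a
        invert = popcount(a ^ enc_b) > WIDTH // 2
        new_b = a ^ ALL_ONES if invert else a
        xor_flips += popcount(new_b ^ enc_b) + (int(invert) ^ flag)
        enc_b, flag = new_b, int(invert)
    return plain_flips, xor_flips


words = [0b1111111111, 0b0000000000, 0b1111111111]
plain, encoded = count_flips(words)   # 30 toggles without the scheme, 3 with it
```

The benefit depends on the data. For the sequence [0b1111111111, 0b1111100000] the same model gives 15 toggles without the scheme and 7 with it, and when exactly half the bits change the indicator flop can make the scheme slightly worse than plain storage.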

Clock and Power Gating Techniques:

There are mainly two types of Power dissipation in CMOS Transistors.

1. Static Power dissipation :
Static power dissipation is mainly caused by transistor leakage. The leakage can come from any of the following sources:

1. Gate Leakage through dielectric.
2. Subthreshold leakage when CMOS is off.
3. Junction leakage from source and drain diffusion.
4. Contention Current in ratioed circuits.

Pstat = (Isub + Igate + Icont + Ijunc) * V
Where, V  = Voltage
       I* = Various leakage Currents 
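
As a worked example of the formula above, here is a short Python sketch. The current and voltage values are made up for illustration and are not measured data.

```python
def static_power(i_sub, i_gate, i_cont, i_junc, v):
    """Pstat = (Isub + Igate + Icont + Ijunc) * V, currents in A, voltage in V."""
    return (i_sub + i_gate + i_cont + i_junc) * v


# Hypothetical leakage currents for one block at a 0.8 V supply
p_stat = static_power(i_sub=2e-3, i_gate=0.5e-3, i_cont=0.0, i_junc=0.1e-3, v=0.8)
print(f"Pstat = {p_stat * 1e3:.2f} mW")
```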

Static power dissipation is inherent to the properties of the transistor, so it depends mainly on the type/technology of the CMOS transistor. As transistors shrink, static power dissipation increases because leakage is higher at smaller technology nodes.

2. Dynamic Power dissipation:
Dynamic power dissipation is caused by the switching of transistors, i.e. from 1->0 or 0->1. It can also be caused by short-circuit current when both the pMOS and nMOS are partially ON for a very short time:

Pdyn = Psw + Psc
Where, Psw = Switching Power
       Psc = Short Circuit Power
Psw = a * C * V * V * f. 

Where,a = activity factor, 
      V = voltage,
      C = Capacitance,
      f = Frequency

As short-circuit power is often very small, it is ignored in Pdyn calculations. Therefore Pdyn is directly proportional to frequency, capacitance and activity factor, and to the square of the voltage. Reducing any of these factors reduces dynamic power accordingly, and lowering the voltage has the largest effect because of the square term.

The total power is the combination of static and dynamic power and can be stated as:
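
To make the scaling concrete, here is a small Python sketch of the switching-power formula. The numbers are hypothetical and only show how each factor scales the result.

```python
def switching_power(a, c, v, f):
    """Psw = a * C * V * V * f (activity factor, farads, volts, hertz)."""
    return a * c * v * v * f


base = switching_power(a=0.1, c=1e-9, v=1.0, f=1e9)      # reference point
half_f = switching_power(a=0.1, c=1e-9, v=1.0, f=0.5e9)  # half the frequency
half_v = switching_power(a=0.1, c=1e-9, v=0.5, f=1e9)    # half the voltage

print(half_f / base)   # frequency scales power linearly
print(half_v / base)   # voltage scales power quadratically
```

Halving the frequency halves the switching power, while halving the voltage cuts it to a quarter.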

Ptot = Pstat + Pdyn

To understand how each power-dissipating factor can be reduced in static and dynamic power dissipation, please refer to Chapter 5, Power, of the book CMOS VLSI Design by Neil Weste & David Harris (refer to link below).

In a Front-end RTL design, Static Power and dynamic Power can be saved by efficient Power and clock gating techniques.

Power Gating :
In the power gating technique, the power supply to a logic block is turned off temporarily whenever no logical processing or activity is required. This is often done through a dedicated Power Management Unit inside the design that provides the various power sources to different parts of the design. However, loss of power (e.g. in the P1 domain) results in loss of data, so important information such as FSM states and important firmware values should be stored somewhere that stays powered (e.g. the Pon domain), so that it can be retrieved when power is back and the design block is functional again. This information is often stored in power-retention registers or a local RAM, which retain power when the rest of the logic is powered off. There are also levels of depth of power gating, depending on how long the logic is expected to be inactive. A good example would be Sleep vs Hibernate in a PC. Here’s a block diagram that gives an overview of power gating.

Power Gating Structural Hierarchy
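
The save and restore sequence around a power-down can be sketched with a toy Python model. This is only an illustration of the idea under simplifying assumptions; the class, its fields and the state names are hypothetical.

```python
class GatedBlock:
    """Toy model of a power-gated block with one retention register."""

    def __init__(self):
        self.powered = True
        self.fsm_state = "IDLE"   # lives in the gated domain, lost on power-off
        self.retention = None     # lives in the always-on domain

    def power_down(self):
        self.retention = self.fsm_state   # save state before removing power
        self.fsm_state = None             # gated-domain contents are lost
        self.powered = False

    def power_up(self):
        self.powered = True
        self.fsm_state = self.retention   # restore the saved state


block = GatedBlock()
block.fsm_state = "RUN"
block.power_down()
state_while_off = block.fsm_state   # None: the gated domain lost its data
block.power_up()
state_after_wake = block.fsm_state  # "RUN": restored from retention
```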

Clock-gating :
Clock gating is a way of reducing dynamic power dissipation by temporarily turning off the clock to the flops in certain parts of the logic, or by turning off the enable on gated flops. In other words, flops are clocked only if there is valid information to be stored or transferred. The accuracy with which these clocks are turned off is captured by the clock gating efficiency. Examples of these two mechanisms are shown below:

1) Clock Gating with Free running Clocks :

always_ff @(posedge clk or negedge reset) begin
  if (!reset)
     Q <= 1'b0;
  else if (clk_en)
     Q <= D;
end

2) Clock Gating with Gated Clocks

// Latch the enable while clk is low so that gated_clk cannot glitch
always_latch begin
  if (!clk)
    clkg_en <= enable;
end
assign gated_clk = clk & clkg_en;

always_ff @(posedge gated_clk or negedge reset) begin
  if (!reset)
     Q <= 1'b0;
  else
     Q <= D;
end

Figures: 1) Flop with Clock Enable  2) Flop with Gated Clock
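
To see why a gated clock is usually built with a latch on the enable, here is a small Python waveform model. It is a sketch under the assumption that the latch is transparent while the clock is low (a common arrangement); the function names and the sample waveforms are hypothetical.

```python
def and_gate_clock(clk, enable):
    """Gate the clock with a plain AND: enable changes pass straight through."""
    return [c & e for c, e in zip(clk, enable)]


def latch_gate_clock(clk, enable):
    """Gate the clock through a latch that is transparent only while clk is low."""
    latched, out = 0, []
    for c, e in zip(clk, enable):
        if c == 0:
            latched = e           # latch follows enable during the low phase
        out.append(c & latched)   # held value is used during the high phase
    return out


# Four samples per clock cycle: two low, then two high.
clk = [0, 0, 1, 1, 0, 0, 1, 1]
# enable rises in the middle of the first high phase and
# falls in the middle of the second high phase.
enable = [0, 0, 0, 1, 1, 1, 1, 0]

plain = and_gate_clock(clk, enable)      # [0, 0, 0, 1, 0, 0, 1, 0]
latched = latch_gate_clock(clk, enable)  # [0, 0, 0, 0, 0, 0, 1, 1]
```

With the plain AND, both clock pulses are cut short by the enable changing mid-phase. With the latch, the first pulse is suppressed entirely and the second one keeps its full width.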

As a generic guideline, if there are a lot of flops in the logic that use the same gating enable, then it is better to design a gated clock implementation. This 2nd implementation is well suited for data flops, and it also decreases the timing risk on the enable (clk_en), since the enable signal does not need to travel to every gated flop. This type of implementation is also able to provide glitch-free clocking. Backend tools often create gated clocks by combining several flops that use the same clock gating enable, if the clock optimization feature is turned on during synthesis. There are also other tools like PowerArtist that statically analyze the RTL design and identify gated flops, their efficiencies, the overall power of each block and potential flops that could be gated.
The 1st implementation is used where the clock gating logic is more diverse and each, or some, of the flops need a separate functional clock enable. This type of clock gating is also used to hold or freeze the value of flops as a way to debug or stall the logic.
There are also levels in clock gating, similar to power gating. The first level is called trunk level gating, wherein the clock to a top level of the design block can be shut off. Then there is leaf level gating, wherein parts of the modules can be gated individually while the rest of the submodules are still ON.

Clock gating Structural Hierarchy
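
The relationship between the two levels can be sketched in Python as follows. This is a simplified model that assumes a single trunk enable feeding several leaf-gated submodules; the function name and the flop counts are hypothetical.

```python
def clocked_flops(trunk_en, leaves):
    """Count the flops that still receive a clock.

    leaves is a list of (leaf_en, flop_count) pairs, one per submodule.
    """
    if not trunk_en:
        return 0                  # trunk gating shuts off every submodule
    return sum(count for leaf_en, count in leaves if leaf_en)


leaves = [(True, 100), (False, 400), (True, 50)]
running = clocked_flops(True, leaves)       # 150: only enabled leaves toggle
all_gated = clocked_flops(False, leaves)    # 0: trunk is off
```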

Clock & Power-down Overrides :
Power and clock overrides are used to ungate the clocks. There may be cases where a functional issue causes flops to be gated incorrectly. If such issues occur late in a project, when the design is in convergence mode, they may cause significant delays because the design needs to be re-synthesized after the bug is corrected. In order to avoid such delays, power-down and clock overrides act as a backup feature to work around the issue. These overrides also help with DFT (Design for Test) by providing a way to scan the flops. The override mechanism is often enabled through firmware programming.
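
A minimal Python sketch of how an override can be combined with the functional enable is shown below. It assumes the override is simply ORed with the enable, which is one common arrangement and not necessarily how any particular design does it; the names are hypothetical.

```python
def effective_clock_enable(func_enable, override):
    """Enable seen by the gating logic: the override forces the clock on."""
    return func_enable or override


# Functional logic wrongly gates the clock (enable stuck at False);
# setting the override through firmware ungates it.
stuck = effective_clock_enable(func_enable=False, override=False)
worked_around = effective_clock_enable(func_enable=False, override=True)
```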