

#### EECS151/251A Spring 2024 Digital Design and Integrated Circuits

Instructor: John Wawrzynek

Lecture 22: Multiplier Circuits and Shifters

## Announcements

- Homework 10 posted, <u>due next Wednesday</u>
- 2 more weeks of lecture (including this week)
- Next week Monday guest lecture: Sandesh Bharadwaj, from Apple, Hardware Verification
- 1 more homework exercise

# Warmup

Recall long multiplication of base-10 by hand: 56 x 12

In base-2 (binary), we do the same thing: 011 x 101
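As a worked sketch of the warmup (assuming the binary operands on the slide are 011 and 101), both products can be written as a sum of shifted partial products, one per digit of the multiplier:

$$
\begin{aligned}
56 \times 12 &= 56 \times 2 + 56 \times 10 = 112 + 560 = 672 \\
011_2 \times 101_2 &= 011_2 \times 1 + 011_2 \times 0 \times 2 + 011_2 \times 1 \times 4 = 00011_2 + 01100_2 = 01111_2 = 15_{10}
\end{aligned}
$$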



Many different circuits exist for multiplication.

Each one has a different balance between speed (performance) and amount of logic (cost).

## "Shift and Add" Multiplier



- Cost $\propto n$, $T = n$ clock cycles.
- What is the critical path for determining the min clock period?

Sums each partial product, one at a time.

In binary, each partial product is a shifted version of A or 0.

Control Algorithm:

1. $P \leftarrow 0$, $A \leftarrow$ multiplicand, $B \leftarrow$ multiplier
2. If LSB of B == 1 then add A to P, else add 0
3. Shift [P][B] right 1
4. Repeat steps 2 and 3 n-1 more times.
5. [P][B] has product.
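The control algorithm above can be sketched as a behavioral model. This is a Python illustration, not hardware description code; the function name `shift_add_multiply` and the explicit carry handling are assumptions made for the sketch.

```python
def shift_add_multiply(a: int, b: int, n: int) -> int:
    """Behavioral model of an n-bit unsigned shift-and-add multiplier."""
    mask = (1 << n) - 1
    a &= mask          # A register: multiplicand
    b &= mask          # B register: multiplier, consumed LSB first
    p = 0              # P register: running partial sum
    for _ in range(n):
        # Step 2: add A or 0, depending on the LSB of B.
        total = p + (a if b & 1 else 0)   # n+1 bits including the carry out
        # Step 3: shift [carry][P][B] right by one place.
        b = ((total & 1) << (n - 1)) | (b >> 1)
        p = total >> 1
    # Step 5: the 2n-bit product sits in [P][B].
    return (p << n) | b


print(shift_add_multiply(0b011, 0b101, 3))  # 3 x 5
```

Each loop iteration corresponds to one clock cycle, which is why the latency is n cycles.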

## **Signed Multiplication**

*Remember* for 2's complement numbers <u>MSB has negative weight</u>:

$$X = \sum_{i=0}^{n-2} x_i \cdot 2^i - x_{n-1} \cdot 2^{n-1}$$

Example:

$$-6 = 11010_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 - 1 \cdot 2^4 = 0 + 2 + 0 + 8 - 16 = -6$$

Therefore, for multiplication:

a) subtract final partial product (multiplier is signed)

b) <u>sign-extend partial products</u> (multiplicand is signed)

Modifications to shift & add circuit:

a) adder/subtractor

b) sign-extender on the P shift register
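The two modifications can be sketched in a behavioral Python model. This is an illustration under assumptions: the names `to_signed` and `signed_shift_add_multiply` are hypothetical, and Python's arithmetic right shift on negative integers stands in for the sign-extender on P.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def signed_shift_add_multiply(a: int, b: int, n: int) -> int:
    """Model of the shift-and-add multiplier for n-bit 2's complement inputs."""
    a_signed = to_signed(a, n)        # multiplicand is signed
    b_bits = b & ((1 << n) - 1)       # multiplier bit pattern
    p = 0
    for i in range(n):
        if b_bits & 1:
            # (a) the final partial product has negative weight: subtract it.
            p = p - a_signed if i == n - 1 else p + a_signed
        # (b) arithmetic shift of [P][B]: Python's >> sign-extends P.
        b_bits = ((p & 1) << (n - 1)) | (b_bits >> 1)
        p >>= 1
    return to_signed((p << n) | b_bits, 2 * n)


print(signed_shift_add_multiply(0b1101, 0b1011, 4))  # -3 x -5
```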




## **Outline for Multipliers**

- Combinational multiplier
- Latency & Throughput
  - Wallace Tree
  - Pipelining to increase throughput
- Smaller multipliers
  - Booth encoding
  - Serial, bit-serial

- Two's complement multiplier



### Unsigned Combinational Multiplier

## **Array Multiplier**

Single cycle multiply: Generates all n partial products simultaneously.



## **Carry-Save Addition**

- Speeding up multiplication is a matter of speeding up the summing of the partial products.
- "Carry-save" addition can help.
- Carry-save addition passes (saves) the carries to the output, rather than propagating them.
- Carry-save addition takes in 3 numbers and produces 2 (sometimes called a "3:2 compressor").

Example: sum four numbers, $1_{10} = 0001$, $3_{10} = 0011$, $2_{10} = 0010$, $3_{10} = 0011$.

| Step | Inputs | Outputs |
|------|--------|---------|
| Carry-save add | 0001 ($1_{10}$), 0011 ($3_{10}$), 0010 ($2_{10}$) | c = 0110 ($6_{10}$), s = 0000 ($0_{10}$) |
| Carry-save add | c = 0110, s = 0000, 0011 ($3_{10}$) | c = 0100 ($4_{10}$), s = 0101 ($5_{10}$) |
| Carry-propagate add | c = 0100, s = 0101 | 1001 ($9_{10}$) |

With this technique, we can avoid carry propagation until final addition!
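A bitwise sketch of a 3:2 compressor, applied to the four-number example above. The function name `carry_save_add` is an assumption for illustration; each bit position acts as an independent full adder, so no carry ripples across positions.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compressor: reduce three numbers to a (carry, sum) pair."""
    s = x ^ y ^ z                                # per-bit sum, no propagation
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, weighted by 2
    return c, s


c1, s1 = carry_save_add(0b0001, 0b0011, 0b0010)  # 1 + 3 + 2
c2, s2 = carry_save_add(c1, s1, 0b0011)          # fold in the last 3
total = c2 + s2                                  # single carry-propagate add
print(c1, s1, c2, s2, total)
```

The invariant is that `c + s` always equals `x + y + z`, so the carry-propagate add can be deferred to the very end.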

## **Carry-save Circuits**

When adding sets of numbers, carry-save can be used on all but the final sum.

- Standard adder (carry propagate) is used for final sum.
- Carry-save is fast (no carry propagation) and cheap (same cost as a ripple adder).





## Array Multiplier using Carry-save Addition



## **Carry-save Addition**

CSA is associative and commutative. For example:

 $(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$ 



- A balanced tree can be used to reduce the logic delay.
- It doesn't matter where you add the carries and sums, as long as you eventually do add them.
- This structure is the basis of the *Wallace Tree Multiplier*.
  - Partial products are summed with the CSA tree. Fast CPA (ex: CLA) is used for final sum.
- Multiplier delay $\propto \log_{3/2} N + \log_2 N$
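The tree reduction can be sketched as follows. This is a behavioral Python illustration with hypothetical names; the greedy grouping of operands into threes is one simple way to organize the tree, not necessarily the arrangement drawn on the slide.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compressor: reduce three numbers to a (carry, sum) pair."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z


def csa_tree_sum(operands: list[int]) -> tuple[int, int]:
    """Sum operands with layers of CSAs; return (total, number of CSA levels)."""
    terms = list(operands)
    levels = 0
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3       # operands that form full triples
        next_terms = []
        for i in range(0, full, 3):
            next_terms.extend(carry_save_add(*terms[i:i + 3]))
        next_terms.extend(terms[full:])          # leftovers pass to the next level
        terms = next_terms
        levels += 1
    return sum(terms), levels                    # final carry-propagate add


def multiply_with_csa_tree(a: int, b: int, n: int) -> tuple[int, int]:
    """Unsigned n-bit multiply: n partial products reduced by a CSA tree."""
    partial_products = [(a << i) if (b >> i) & 1 else 0 for i in range(n)]
    return csa_tree_sum(partial_products)


print(multiply_with_csa_tree(200, 123, 8))
```

Each level shrinks the operand count by roughly a factor of 3/2, which is where the $\log_{3/2} N$ term in the delay comes from.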

## Increasing Throughput: Pipelining

Idea: split processing across several clock cycles by dividing circuit into pipeline stages separated by registers that hold values passing from one stage to the next.





#### Smaller Combinational Multipliers

## **Bit-serial Multiplier**

Bit-serial multiplier (n<sup>2</sup> cycles, one bit of result per n cycles):



Control Algorithm:

```
repeat n cycles {            // outer (i) loop
    repeat n cycles {        // inner (j) loop
        shiftA, selectSum, shiftHI
    }
    shiftB, shiftHI, shiftLOW, reset
}
```

Note: The occurrence of a control signal x means x=1. The absence of x means x=0.



#### **Signed Multipliers**

## **Combinational Multiplier (signed!)**




## **Combinational Multiplier (signed)**



#### 2's Complement Multiplication

Step 1: two's complement operands so high order bit is  $-2^{N-1}$ . Must sign extend partial products and subtract the last one

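|    | $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|----|------|------|------|------|------|------|------|------|
|    |      |      |      |      | X3   | X2   | X1   | X0   |
| \* |      |      |      |      | Y3   | Y2   | Y1   | Y0   |
|    | X3Y0 | X3Y0 | X3Y0 | X3Y0 | X3Y0 | X2Y0 | X1Y0 | X0Y0 |
| +  | X3Y1 | X3Y1 | X3Y1 | X3Y1 | X2Y1 | X1Y1 | X0Y1 |      |
| +  | X3Y2 | X3Y2 | X3Y2 | X2Y2 | X1Y2 | X0Y2 |      |      |
| -  | X3Y3 | X3Y3 | X2Y3 | X1Y3 | X0Y3 |      |      |      |
| =  | Z7   | Z6   | Z5   | Z4   | Z3   | Z2   | Z1   | Z0   |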

Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1).

|   | $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ | Note |
|---|------|------|------|------|------|------|------|------|------|
|   | X3Y0 | X3Y0 | X3Y0 | X3Y0 | X3Y0 | X2Y0 | X1Y0 | X0Y0 |      |
| + |      |      |      |      | 1    |      |      |      |      |
| + | X3Y1 | X3Y1 | X3Y1 | X3Y1 | X2Y1 | X1Y1 | X0Y1 |      |      |
| + |      |      |      | 1    |      |      |      |      |      |
| + | X3Y2 | X3Y2 | X3Y2 | X2Y2 | X1Y2 | X0Y2 |      |      |      |
| + |      |      | 1    |      |      |      |      |      |      |
| + | $\overline{X3Y3}$ | $\overline{X3Y3}$ | $\overline{X2Y3}$ | $\overline{X1Y3}$ | $\overline{X0Y3}$ | | | | $-B = \sim B + 1$: the $\sim B$ term |
| + |      |      |      |      | 1    |      |      |      | $-B = \sim B + 1$: the $+1$ term |
| + |      | 1    |      |      |      |      |      |      |      |
| - |      | 1    | 1    | 1    | 1    |      |      |      |      |

(Baugh-Wooley)

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

|   | $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|------|------|------|------|------|------|------|------|
|   |      |      |      |      | $\overline{X3Y0}$ | X2Y0 | X1Y0 | X0Y0 |
| + |      |      |      | $\overline{X3Y1}$ | X2Y1 | X1Y1 | X0Y1 |      |
| + |      |      | $\overline{X3Y2}$ | X2Y2 | X1Y2 | X0Y2 |      |      |
| + |      | X3Y3 | $\overline{X2Y3}$ | $\overline{X1Y3}$ | $\overline{X0Y3}$ | | | |
| + |      |      |      |      | 1    |      |      |      |
| - |      | 1    | 1    | 1    | 1    |      |      |      |

An overbar denotes the complement of that partial-product bit.

Step 4: finish computing the constants...

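|   | $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
|---|------|------|------|------|------|------|------|------|
|   |      |      |      |      | $\overline{X3Y0}$ | X2Y0 | X1Y0 | X0Y0 |
| + |      |      |      | $\overline{X3Y1}$ | X2Y1 | X1Y1 | X0Y1 |      |
| + |      |      | $\overline{X3Y2}$ | X2Y2 | X1Y2 | X0Y2 |      |      |
| + |      | X3Y3 | $\overline{X2Y3}$ | $\overline{X1Y3}$ | $\overline{X0Y3}$ | | | |
| + | 1    |      |      | 1    |      |      |      |      |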

Result: multiplying 2's complement operands takes approximately the same amount of hardware as multiplying unsigned operands!
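The end result of the four steps can be checked with a behavioral Python sketch. The function name `baugh_wooley_multiply` is hypothetical. The model assumes the final array complements every partial-product bit that involves exactly one sign bit, and that the leftover constants are ones at bit positions $n$ and $2n-1$ (positions 4 and 7 for the 4-bit case).

```python
def baugh_wooley_multiply(x: int, y: int, n: int = 4) -> int:
    """Sum the modified partial-product array modulo 2**(2n).

    x and y are n-bit 2's complement bit patterns; the return value is the
    2n-bit 2's complement bit pattern of the product.
    """
    total = 0
    for i in range(n):            # multiplier bit index (row)
        for j in range(n):        # multiplicand bit index (column within row)
            bit = ((x >> j) & 1) & ((y >> i) & 1)
            # Complement terms that involve exactly one of the two sign bits.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    total += (1 << n) + (1 << (2 * n - 1))   # constants left after Step 4
    return total & ((1 << (2 * n)) - 1)


print(baugh_wooley_multiply(0b1101, 0b1011))  # -3 x -5
```

Note that the array contains only additions of single bits, so it maps onto the same adder array as the unsigned multiplier plus a few inverters.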

## **2's Complement Multiplication**




#### Example

• What's -3 x -5?

1101 x 1011
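One way to work this example by hand is to use the negative weight of the multiplier's MSB, so the last partial product is subtracted. With $X = 1101_2 = -3$ and $Y = 1011_2 = -5$:

$$
\begin{aligned}
X \cdot Y &= y_0 X \cdot 2^0 + y_1 X \cdot 2^1 + y_2 X \cdot 2^2 - y_3 X \cdot 2^3 \\
&= (-3) + (-6) + 0 - (-24) \\
&= 15 = 00001111_2
\end{aligned}
$$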

#### **Multiplication in Verilog**

You can use the "\*" operator to multiply two numbers:

```
wire [9:0] a,b;
wire [19:0] result = a*b; // unsigned multiplication!
```

If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration:

```
wire signed [9:0] a,b;
wire signed [19:0] result = a*b; // signed multiplication!
```

Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. Same is true of the >>> (arithmetic right shift) operator. To get signed operations all operands must be signed.

```
wire signed [9:0] a;
wire [9:0] b;
wire signed [19:0] result = a*$signed(b);
```

To make a signed constant: 10'sh37C
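To see why the signed and unsigned cases need different circuitry, the sketch below models both interpretations of the same 10-bit patterns. It is written in Python rather than Verilog, and the names `as_signed`, `A_BITS`, and `B_BITS` are hypothetical. The low 10 bits of the two products agree, but the upper bits of the 20-bit result do not.

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


A_BITS = 0x37C   # same bit pattern as the constant 10'sh37C
B_BITS = 0x003
MASK20 = (1 << 20) - 1

# Unsigned interpretation: both operands are non-negative.
unsigned_product = (A_BITS * B_BITS) & MASK20
# Signed interpretation: 0x37C has its MSB set, so it is negative.
signed_product = (as_signed(A_BITS, 10) * as_signed(B_BITS, 10)) & MASK20

print(as_signed(A_BITS, 10), hex(unsigned_product), hex(signed_product))
```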



## Constant Coefficient Multiplication Shifters

## **Constant Multiplication**

- Our multiplier circuits so far have assumed both the multiplicand (A) and the multiplier (B) can vary at runtime.
- What if one of the two is a constant?

Y = C \* X

"Constant Coefficient" multiplication comes up often in signal processing and other hardware. Ex:

$$y_i = \alpha y_{i-1} + x_i$$

where  $\alpha$  is an application dependent constant that is hard-wired into the circuit.

How do we build an array-style (combinational) multiplier that takes advantage of the constancy of one of the operands?

## **Multiplication by a Constant**

If the constant C in C\*X is a power of 2, then the multiplication is simply a shift of X.



- What about division?
- What about multiplication by non-powers of 2?

## **Multiplication by a Constant**

- In general, a combination of fixed shifts and addition:
  - Ex: $6 \cdot X = 0110_2 \cdot X = (2^2 + 2^1) \cdot X = 2^2 X + 2^1 X$



• Details:



## Multiplication by a Constant

Another example: $C = 23_{10} = 010111_2$



- In general, the number of additions equals one less than the number of 1's in the constant.
- Using carry-save adders (for all but one addition) helps reduce the delay and cost, and using balanced trees helps with delay.
- Is there a way to further reduce the number of adders (and thus the cost and delay)?
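The shift-and-add decomposition can be sketched in Python as a behavioral illustration. The function name `constant_multiply_shift_add` is an assumption; in hardware the shifts are just wiring and only the additions cost logic.

```python
def constant_multiply_shift_add(c: int, x: int) -> tuple[int, int]:
    """Compute c*x from fixed shifts of x; return (product, adders needed)."""
    # One shifted copy of x for every 1 bit in the constant.
    terms = [x << i for i in range(c.bit_length()) if (c >> i) & 1]
    adders = max(len(terms) - 1, 0)
    return sum(terms), adders


print(constant_multiply_shift_add(6, 7))    # 6 = 0110: one adder
print(constant_multiply_shift_add(23, 5))   # 23 = 010111: three adders
```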

## **Multiplication using Subtraction**

- Subtraction is approximately the same cost and delay as addition.
- Consider C\*X where C is the constant value $15_{10} = 01111_2$. C\*X requires 3 additions.
- We can "recode" 15 from $01111 = (2^3 + 2^2 + 2^1 + 2^0)$ to $1000\overline{1} = (2^4 - 2^0)$, where $\overline{1}$ means negative weight.
- Therefore, 15\*X can be implemented with only one subtractor.



## **Canonic Signed Digit Representation**

- CSD represents numbers using 1, $\overline{1}$, and 0 with the least possible number of non-zero digits.
  - Strings of 2 or more non-zero digits are replaced.
  - Leads to a unique representation.
- To form the CSD representation might take 2 passes:
  - First pass: replace all occurrences of 2 or more 1's: $01..10$ by $10..\overline{1}0$
  - Second pass: same as above, plus replace $01\overline{1}0$ by $0010$ and $0\overline{1}10$ by $00\overline{1}0$

Examples:

| $29_{10}$ | $23_{10}$ | $54_{10}$ |
|-----------|-----------|-----------|
| $011101 = 29$ | $0010111 = 23$ | $0110110 = 54$ |
|  | $001100\overline{1}$ | $10\overline{1}10\overline{1}0$ |
| $100\overline{1}01 = 32 - 4 + 1$ | $010\overline{1}00\overline{1} = 32 - 8 - 1$ | $100\overline{1}0\overline{1}0 = 64 - 8 - 2$ |
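A sketch of CSD recoding in Python. Note that this uses a single right-to-left pass (the standard non-adjacent-form recoding) rather than the two-pass string replacement described above; it is swapped in here because it is compact and yields the same unique representation. The function names are hypothetical.

```python
def to_csd(c: int) -> list[int]:
    """Return the CSD digits of a non-negative constant, LSB first.

    Each digit is +1, 0, or -1, and no two adjacent digits are non-zero.
    """
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)   # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d            # removing the digit leaves a multiple of 4
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def csd_value(digits: list[int]) -> int:
    """Evaluate LSB-first signed digits back to an integer."""
    return sum(d << i for i, d in enumerate(digits))


for constant in (29, 23, 54, 15):
    msb_first = to_csd(constant)[::-1]
    print(constant, msb_first)
```

The number of adders or subtractors needed for C\*X is one less than the number of non-zero digits returned.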

Can we further simplify the multiplier circuits?

## "Constant Coefficient Multiplication" (KCM)



- CSD helps, but the multipliers are limited to shifts followed by adds.
  - CSD multiplier: Y = 231\*X = (2<sup>8</sup> - 2<sup>5</sup> + 2<sup>3</sup> - 2<sup>0</sup>)\*X
- How about shift/add/shift/add ...?
  - KCM multiplier: Y = 231\*X = 7\*33\*X = (2<sup>3</sup> - 2<sup>0</sup>)\*(2<sup>5</sup> + 2<sup>0</sup>)\*X



- No simple algorithm exists to determine the optimal KCM representation.
- Most approaches use an exhaustive search.
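The two decompositions of 231\*X can be compared in a short Python sketch. The function names are hypothetical, and the adder counts in the comments simply count the `+` and `-` operations in each expression.

```python
def csd_231(x: int) -> int:
    """231*x as shifts followed by adds: 2^8 - 2^5 + 2^3 - 2^0 (3 add/sub)."""
    return (x << 8) - (x << 5) + (x << 3) - x


def kcm_231(x: int) -> int:
    """231*x as shift/add/shift/add: 7 * (33 * x) (2 add/sub)."""
    t = (x << 5) + x          # 33*x = (2^5 + 2^0) * x
    return (t << 3) - t       # 7*t  = (2^3 - 2^0) * t


print(csd_231(10), kcm_231(10))
```

Factoring the constant lets the intermediate result `t` be reused, which is what saves one adder here.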



#### **Shifters**

## **Fixed Shifters / Rotators Defined**



## Variable Shifters / Rotators

- Example: X >> S, where S is unknown when we synthesize the circuit.
- Uses: shift instruction in processors (ARM includes a shift on every instruction), floating-point arithmetic, division/multiplication by powers of 2, etc.
- One way to build this is a simple shift-register:
  - a) Load word, b) shift enable for S cycles, c) read word.



- Worst case delay O(N), not good for processor design.
- Can we do it in O(logN) time and fit it in one cycle?

## Log Shifter / Rotator

□ Log(N) stages, each shifts (or not) by a power of 2 places,  $S=[s_2;s_1;s_0]$ :
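A behavioral Python sketch of the staged structure, assuming a right shift and a word width that is a power of 2. The function name `log_shift_right` and its `rotate` flag are hypothetical. Stage k is controlled by bit $s_k$ and either passes the word through or moves it by $2^k$ places, like one row of 2-to-1 muxes.

```python
def log_shift_right(x: int, s: int, n: int = 8, rotate: bool = False) -> int:
    """Shift (or rotate) an n-bit word right by s using log2(n) stages."""
    mask = (1 << n) - 1
    x &= mask
    stages = (n - 1).bit_length()        # log2(n) when n is a power of 2
    for k in range(stages):
        if (s >> k) & 1:                 # stage k is enabled by bit s_k
            amount = 1 << k
            if rotate:
                x = ((x >> amount) | (x << (n - amount))) & mask
            else:
                x >>= amount
    return x


print(bin(log_shift_right(0b10110000, 5)))               # stages 0 and 2 active
print(bin(log_shift_right(0b10110001, 5, rotate=True)))
```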



## LUT Mapping of Log shifter



Efficient with 2-to-1 multiplexers, for instance, 3-LUTs.

Virtex-6 has 6-LUTs, which naturally make 4-to-1 muxes:

Reorganize the shifter to use 4-to-1 muxes.





## "Improved" Shifter / Rotator

How about this approach? Could it lead to even less delay?



- What is the delay of these big muxes?
- Look at a transistor-level implementation?

Left-shift with rotate

## **Barrel Shifter**

Cost/delay?



## **Connection Matrix**

- Generally useful structure:
  - N<sup>2</sup> control points.
  - What other interesting functions can it do?



## **Cross-bar Switch**

- Nlog(N) control signals.
- Supports all interesting permutations
  - All one-to-one and one-to-many connections.
- Commonly used in communication hardware (switches, routers).

