

# EECS151/251A Spring 2024 Digital Design and Integrated Circuits

Instructor: John Wawrzynek

Lecture 20



# Homework assignment 9 posted - due next Friday



# **Outline**

- Register Transfer Notation
- List Processor Example
- Design Optimization
- Resource Utilization Charts



# **Register Transfer Notation**



- At a high level, we view synchronous digital systems as a collection of state elements and combinational logic (CL) blocks.
- "RTL" is a commonly used acronym for "Register Transfer Level" description.
- It follows from the fact that all synchronous digital systems can be described as a set of state elements connected by combinational logic blocks.
- Though not strictly correct, some also use "RTL" to mean the Verilog or VHDL code that describes such systems.

# **Register Transfer "notation" Descriptions**

- We introduce a notation for describing the behavior of systems at the register transfer level.
- We can view the operation of synchronous digital systems as a set of data transfers between registers, with combinational logic operations happening during the transfer.

RT notation comprises a set of register transfers with optional operators as part of the transfer. Example:

> regA ← regB
>
> regC ← regA + regB
>
> if (start==1) regA ← regC

We use ";" to separate transfers that occur on separate cycles.

Use "," to separate transfers that occur on the same cycle.

Example (2 cycles):

 $regA \leftarrow regB, regB \leftarrow 0;$  $regC \leftarrow regA;$ 
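The "," and ";" rules can be made concrete with a small Python sketch. It assumes that every transfer in a cycle reads register values from before the clock edge, which is how synchronous registers behave. The `run` helper and the dict-of-lambdas encoding are illustrative choices, not part of RT notation.

```python
def run(cycles, regs):
    """Apply RT-notation cycles to a dict of register values.

    Each cycle is a dict {destination: function(old_regs) -> new value}.
    Every right-hand side reads the values from before the clock edge,
    so transfers in the same cycle cannot see each other's results.
    """
    for transfers in cycles:
        new_values = {dst: rhs(regs) for dst, rhs in transfers.items()}
        regs = {**regs, **new_values}  # all destinations update together
    return regs


# regA <- regB, regB <- 0;   regC <- regA;
program = [
    {"regA": lambda r: r["regB"], "regB": lambda r: 0},
    {"regC": lambda r: r["regA"]},
]

final = run(program, {"regA": 1, "regB": 2, "regC": 3})
print(final)
```

In the second cycle regC receives the value regA took in the first cycle, because the two transfers are separated by ";".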

# **Example of Using RT Notation**

 $ACC \leftarrow ACC + R0, R1 \leftarrow R0;$  $ACC \leftarrow ACC + R1, R0 \leftarrow R1;$  $R0 \leftarrow ACC;$ 



- In this case: RT notation description is used to sequence the operations on the datapath.
- It becomes the high-level specification for the controller.
- Design of the FSM controller follows directly from the RT notation sequence. In this example (and most other designs) the FSM controls movement of data by controlling the multiplexor control signals.

# **Example of Using RT Notation**

Sometimes RT Notation is used as a starting point for designing *both* the datapath and the control:

Example:

 $regA \leftarrow IN;$   $regB \leftarrow IN;$   $regC \leftarrow regA + regB;$  $regB \leftarrow regC;$ 

From this we can deduce:

- IN must fanout to both regA and regB
- regA and regB must output to an adder
- the adder must output to regC
- regB must take its input from a mux that selects between IN and regC

• What does the datapath look like?



• The controller:

#### Control points:

- clock enable for A register
- clock enable for B register
- mux control

```
FSM controller:
4 states (one per cycle)
```
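One possible reading of this controller, as a Python sketch rather than the design from the lecture figure. Assumptions: the control names `ce_a`, `ce_b`, and `b_sel` are hypothetical, regC loads on every cycle (consistent with only A and B being listed as clock-enable control points), the datapath is 8 bits wide, and the FSM steps through its four states in order.

```python
# Control word per state: (ce_a, ce_b, b_sel); b_sel 0 selects IN, 1 selects regC.
CONTROL = [
    (1, 0, 0),  # state 0: regA <- IN
    (0, 1, 0),  # state 1: regB <- IN
    (0, 0, 0),  # state 2: regC <- regA + regB (regC loads every cycle)
    (0, 1, 1),  # state 3: regB <- regC
]


def step(regs, state, data_in):
    """One clock edge: every next value is computed from the current values."""
    ce_a, ce_b, b_sel = CONTROL[state]
    a, b, c = regs
    next_a = data_in if ce_a else a
    next_b = (c if b_sel else data_in) if ce_b else b
    next_c = (a + b) & 0xFF  # assumed 8-bit adder
    return (next_a, next_b, next_c)


def run(inputs):
    """Run the four states in order, one input sample per cycle."""
    regs = (0, 0, 0)
    for state, data_in in enumerate(inputs):
        regs = step(regs, state, data_in)
    return regs


result = run([5, 7, 0, 0])
print(result)  # (regA, regB, regC)
```

The FSM does nothing except drive the clock enables and the mux select; all data movement happens in the datapath.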



# **List Processor Example**


- RT Notation gives us a framework for making high-level optimizations.
- General design procedure outline:
  1. Problem, Constraints, and Component Library Spec.
  2. "Algorithm" Selection
  3. Micro-architecture Specification
  4. Analysis of Cost, Performance, Power
  5. Optimizations, Variations
  6. Detailed Design

# **1. Problem Specification**

Design a circuit that forms the sum of all the 2's complement integers stored in a linked-list structure starting at memory address 0:



Assume: All integers and pointers are 8-bit. The linked list is stored in a memory block with an 8-bit address port and 8-bit data port, as shown below. The pointer from the last element in the list is 0. The list contains at least one node.



Note: We don't assume nodes are aligned on 2 Byte boundaries.

[Figure: memory contents, with each list node stored as a pointer byte followed by a value byte (X).]

# **1. Other Specifications**

#### Design Constraints:

Usually the design specification puts a restriction on cost, performance, power, or all three. We will leave this unspecified for now and return to it later.

#### Component Library:

| component             | delay                           |
|-----------------------|---------------------------------|
| simple logic gates    | 0.5ns                           |
| n-bit register        | clk-to-Q = 0.5ns, setup = 0.5ns |
| n-bit 2-1 multiplexor | 1ns                             |
| n-bit adder           | (2 log(n) + 2)ns                |
| memory                | 10ns read (asynchronous read)   |
| zero compare          | (0.5 log(n))ns                  |

(single ported memory, register has CE but no RST)

Are these reasonable?

## **Review of Register with "Clock Enable"**

Register with Clock Enable:

Functional description only. A transistor-level circuit could have lower input delay and fewer transistors.

The clock enable allows the register either to be loaded on a selected positive clock edge or to retain its previous value.

- Assume both data and CE require setup time = 0.5ns.
- Assume no reset input.
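A behavioral sketch of this register in Python, modelling only the function (load when CE is 1, hold otherwise) and not the timing. The class name `RegisterCE` and its `clock` method are hypothetical names chosen for illustration, and the initial value exists only so the simulation has something to start from.

```python
class RegisterCE:
    """Register with clock enable and no reset input."""

    def __init__(self, initial=0):
        # Real hardware without reset powers up in an unknown state;
        # the initial value here is a simulation convenience.
        self.q = initial

    def clock(self, d, ce):
        """Model one positive clock edge and return the new output Q."""
        if ce:
            self.q = d  # CE = 1: load the data input
        return self.q   # CE = 0: retain the previous value


reg = RegisterCE()
after_load = reg.clock(d=5, ce=1)
after_hold = reg.clock(d=9, ce=0)
print(after_load, after_hold)
```

This is equivalent to placing a 2-1 mux in front of a plain register, with CE selecting between the new data and the fed-back output.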

# 2. "Algorithm" Selection

In this case the memory only allows one access per cycle, so the algorithm is limited to sequential execution. If in another case more input data is available at once, then a more parallel solution may be possible.

#### Assume datapath state registers NEXT and SUM.

- NEXT holds a pointer to the node in memory.
- SUM holds the result of adding the node values to this point.

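The loop body in RT notation (listed again later as the unoptimized solution):

> SUM ← SUM + Memory[NEXT+1];
>
> NEXT ← Memory[NEXT];

*This RT notation becomes the basis for the datapath (DP) and controller.*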
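The two-step loop that these notes later list as the unoptimized solution (SUM←SUM + Memory[NEXT+1]; NEXT←Memory[NEXT]) can be checked with a short Python reference model. This is a sketch under stated assumptions: the helpers `make_memory` and `to_signed` are hypothetical, each node is taken to be a pointer byte followed by a value byte, and the loop exits when the fetched pointer is 0.

```python
def make_memory(nodes):
    """Build a 256-byte memory from (address, value) pairs in list order.

    Each node is two bytes: pointer to the next node, then the value.
    The first node must be at address 0; the last pointer is 0.
    """
    memory = [0] * 256
    for i, (addr, value) in enumerate(nodes):
        next_addr = nodes[i + 1][0] if i + 1 < len(nodes) else 0
        memory[addr] = next_addr
        memory[addr + 1] = value & 0xFF  # store as 8-bit 2's complement
    return memory


def list_sum(memory):
    """Return (8-bit sum, loop cycles) for the two-step RT loop."""
    next_ptr, total, cycles = 0, 0, 0
    while True:
        total = (total + memory[(next_ptr + 1) & 0xFF]) & 0xFF  # COMPUTE_SUM
        next_ptr = memory[next_ptr]                             # GET_NEXT
        cycles += 2
        if next_ptr == 0:
            return total, cycles


def to_signed(byte):
    """Interpret an 8-bit value as a 2's complement integer."""
    return byte - 256 if byte >= 128 else byte


memory = make_memory([(0, 3), (5, 10), (2, -4)])  # nodes need not be aligned
total, cycles = list_sum(memory)
print(to_signed(total), cycles)
```

The example list places its second node at an odd address, matching the note that nodes are not assumed to be aligned.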

# 3. Micro-Architecture #1

Direct implementation of RTL description:



# 4. Analysis of Cost, Performance, and Power

- Skip Power for now.
- Cost:
  - How do we measure it? # of transistors? # of gates? # of CLBs?
  - Depends on implementation technology. Often we are just interested in comparing the *relative* cost of two competing implementations. (Save this for later)

#### Performance:

- 2 clock cycles per number added.
- What is the minimum clock period?
- The controller might be on the critical path. Therefore we need to know the implementation, and controller input and output delay. We do a design and could later optimize if it is indeed on the critical path.

#### **Possible Controller Implementation**

Based on this, what is the controller input and output delay?

**One-hot FSM** 



# 4. Analysis of Performance

Other paths exist for each cycle in the loop. These are the worst case.




#### Detailed timing:

clock period (T) = max (clock period for each state)

T > 31ns, F < 32 MHz
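The 31ns figure can be reproduced from the component library with a few lines of Python. The path breakdown is an assumption, since the timing diagram is not reproduced in these notes: COMPUTE\_SUM is taken as register clk-to-Q, 8-bit add, mux, memory read, 15-bit add, mux, and setup, and GET\_NEXT as clk-to-Q, mux, memory read, mux, and setup. The log in the adder formula is assumed to be log2 rounded up, and controller delay is ignored.

```python
import math

# Delays in ns, from the component library table.
CLK_TO_Q, SETUP, MUX, MEMORY = 0.5, 0.5, 1.0, 10.0


def adder_delay(n_bits):
    """(2 log(n) + 2) ns, with log assumed to be log2 rounded up."""
    return 2 * math.ceil(math.log2(n_bits)) + 2


# Assumed worst-case register-to-register path for each state.
paths = {
    "COMPUTE_SUM": [CLK_TO_Q, adder_delay(8), MUX, MEMORY,
                    adder_delay(15), MUX, SETUP],
    "GET_NEXT": [CLK_TO_Q, MUX, MEMORY, MUX, SETUP],
}

period = {state: sum(delays) for state, delays in paths.items()}
t_min = max(period.values())   # the clock must fit the slowest state
f_max_mhz = 1000.0 / t_min     # 1 / (1 ns) = 1000 MHz

print(period, t_min, round(f_max_mhz, 1))
```

Under these assumptions COMPUTE\_SUM sets the clock period, which agrees with the observation that this state does most of the work.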

Observation:

COMPUTE\_SUM state does most of the work. Most of the components are inactive in GET\_NEXT state.

- GET\_NEXT does: Memory access + ...
- COMPUTE\_SUM does: 8-bit add, memory access, 15-bit add + ...

Conclusion:

Move one of the adds to GET\_NEXT.



# **List Processor Optimization**

## 5. Optimization

Add new register named NUMA, for address of number to add.

Update code to reflect our change (note still 2 cycles per iteration):

> If (START==1) NEXT←0, SUM←0, NUMA←1;
>
> repeat {
>
> SUM←SUM + Memory[NUMA];
>
> NUMA←Memory[NEXT] + 1, NEXT←Memory[NEXT];
>
> } until (NEXT==0);
>
> R←SUM, DONE←1;

Architecture #2:

[Figure: datapath with Memory and the NEXT, NUMA, and SUM registers; control signals A\_SEL, NEXT\_SEL, SUM\_SEL, LD\_NEXT, LD\_SUM; status signal NEXT\_ZERO.]

Incremental cost: addition of another register and mux.
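A short Python sketch of the NUMA version of the loop (SUM←SUM + Memory[NUMA]; then NUMA←Memory[NEXT] + 1, NEXT←Memory[NEXT]) shows that moving the address increment into the second cycle leaves the result and the cycle count unchanged. The function name `list_sum_numa` and the example memory contents are made up for illustration; tuple assignment is used so that both transfers in the second cycle read the old NEXT.

```python
def list_sum_numa(memory):
    """Return (8-bit sum, loop cycles) for the loop with the NUMA register."""
    next_ptr, total, numa, cycles = 0, 0, 1, 0
    while True:
        # cycle 1: SUM <- SUM + Memory[NUMA]
        total = (total + memory[numa]) & 0xFF
        # cycle 2: NUMA <- Memory[NEXT] + 1, NEXT <- Memory[NEXT]
        numa, next_ptr = (memory[next_ptr] + 1) & 0xFF, memory[next_ptr]
        cycles += 2
        if next_ptr == 0:
            return total, cycles


# Example list: node at 0 (value 3) -> node at 5 (value 10) -> node at 2 (value -4)
memory = [0] * 256
memory[0], memory[1] = 5, 3
memory[5], memory[6] = 2, 10
memory[2], memory[3] = 0, 0xFC  # 0xFC is -4 in 8-bit 2's complement

total, cycles = list_sum_numa(memory)
print(total, cycles)
```

Each cycle now contains one memory access and at most one add, which is the point of the optimization.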

# 5. Optimization, Architecture #2



New timing:

Clock Period (T) = max (clock period for each state)

T > 23ns, F < 43 MHz
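As a rough check on the 23ns figure, the two states can be added up from the component library. The path breakdown is an assumption, since the timing diagram is not reproduced here: register clk-to-Q, address mux, memory read, one adder, destination mux, register setup. The 15-bit add in the sum state is taken as 10ns and the 8-bit add in the next-pointer state as 8ns. $T_{sum}$ and $T_{next}$ are labels introduced here for the two states.

$$
\begin{aligned}
T_{sum} &= 0.5 + 1 + 10 + 10 + 1 + 0.5 = 23\,\text{ns} \\
T_{next} &= 0.5 + 1 + 10 + 8 + 1 + 0.5 = 21\,\text{ns} \\
T &> \max(T_{sum}, T_{next}) = 23\,\text{ns}, \qquad F < \frac{1}{23\,\text{ns}} \approx 43\,\text{MHz}
\end{aligned}
$$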

Is this worth the extra cost? Can we lower the cost?

Notice that the circuit now only performs one add on every cycle. Why not share the adder for both cycles?

## 5. Optimization, Architecture #3



- Incremental cost:
  - Addition of another mux and control (ADD\_SEL). Removal of an 8-bit adder.
- Performance:
  - No change.
- Change is definitely worth it.



## **Resource Utilization Charts**


- One way to visualize these (and other possible) optimizations is through the use of a *resource utilization chart*.
- These are used in high-level design to help schedule operations on shared resources.
- Resources are listed on the y-axis. Time (in cycles) on the x-axis.

| cycle         | 1 | 2        | 3     | 4        | 5     | 6 | 7 |
|---------------|---|----------|-------|----------|-------|---|---|
| memory        |   | fetch A1 |       | fetch A2 |       |   |   |
| bus           |   | fetch A1 |       | fetch A2 |       |   |   |
| register-file |   | read B1  |       | read B2  |       |   |   |
| ALU           |   |          | A1+B1 |          | A2+B2 |   |   |

Our list processor has two shared resources: memory and adder.
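A resource utilization chart is easy to build and check in code. The Python sketch below uses a hypothetical encoding (a list of `(cycle, resource, operation)` tuples) and reports any cycle in which one resource is asked to do two things. The example schedules the unoptimized loop onto a single shared adder to show the conflict that forces two adders in that design.

```python
from collections import defaultdict


def build_chart(schedule):
    """Return (chart, conflicts) for a list of (cycle, resource, operation)."""
    chart = defaultdict(dict)   # resource -> {cycle: operation}
    conflicts = []
    for cycle, resource, op in schedule:
        if cycle in chart[resource]:
            # The resource is already busy in this cycle.
            conflicts.append((cycle, resource, chart[resource][cycle], op))
        else:
            chart[resource][cycle] = op
    return chart, conflicts


def render(chart, n_cycles, width=11):
    """Format the chart with resources as rows and cycles as columns."""
    lines = []
    for resource, row in chart.items():
        cells = [row.get(c, "").ljust(width) for c in range(1, n_cycles + 1)]
        lines.append(resource.ljust(8) + "|" + "|".join(cells))
    return "\n".join(lines)


# Unoptimized loop, but with only one adder available:
schedule = [
    (1, "memory", "fetch x"),
    (1, "adder", "next+1"),
    (1, "adder", "sum"),        # second add in the same cycle
    (2, "memory", "fetch next"),
]
chart, conflicts = build_chart(schedule)
print(render(chart, 2))
print(conflicts)
```

An empty conflict list means no resource is double-booked; it says nothing about whether data dependencies are respected.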

## List Example Resource Scheduling

- Unoptimized solution:

1. SUM←SUM + Memory[NEXT+1];
2. NEXT←Memory[NEXT];

| step   | 1       | 2          | 1       | 2          |
|--------|---------|------------|---------|------------|
| memory | fetch x | fetch next | fetch x | fetch next |
| adder1 | next+1  |            | next+1  |            |
| adder2 | sum     |            | sum     |            |

- Optimized solution:

1. SUM←SUM + Memory[NUMA];
2. NEXT←Memory[NEXT], NUMA←Memory[NEXT]+1;

| step   | 1       | 2          | 1       | 2          |
|--------|---------|------------|---------|------------|
| memory | fetch x | fetch next | fetch x | fetch next |
| adder  | sum     | numa       | sum     | numa       |

- How about the other combination: add an X register.

1. X←Memory[NUMA], NUMA←NEXT+1;
2. NEXT←Memory[NEXT], SUM←SUM+X;

| step   | 1       | 2          | 1       | 2          |
|--------|---------|------------|---------|------------|
| memory | fetch x | fetch next | fetch x | fetch next |
| adder  | numa    | sum        | numa    | sum        |

- Does this work? If so, the clock period is very short: the fetch and the add in each cycle would be independent, giving  $T = max(T_{mem}, T_{add})$  instead of  $T_{mem} + T_{add}$.
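One way to probe this question is to simulate the two transfers (X←Memory[NUMA], NUMA←NEXT+1; then NEXT←Memory[NEXT], SUM←SUM+X) in Python. The initialization follows the values given later in these notes (x=0, numa=1, sum=0, next=memory[0]). The loop test placement and the two finishing cycles are assumptions made for this sketch, not the lecture's controller.

```python
def list_sum_pipelined(memory):
    """Return (8-bit sum, cycles after initialization) using the X register."""
    x, numa, total, next_ptr = 0, 1, 0, memory[0]
    cycles = 0
    while next_ptr != 0:  # assumed: test NEXT before each loop iteration
        # cycle 1: X <- Memory[NUMA], NUMA <- NEXT+1 (both read old values)
        x, numa = memory[numa], (next_ptr + 1) & 0xFF
        # cycle 2: NEXT <- Memory[NEXT], SUM <- SUM+X
        next_ptr, total = memory[next_ptr], (total + x) & 0xFF
        cycles += 2
    # assumed finishing cycles: fetch the last value, then add it
    x = memory[numa]
    total = (total + x) & 0xFF
    cycles += 2
    return total, cycles


# Example list: node at 0 (value 3) -> node at 5 (value 10) -> node at 2 (value -4)
memory = [0] * 256
memory[0], memory[1] = 5, 3
memory[5], memory[6] = 2, 10
memory[2], memory[3] = 0, 0xFC  # 0xFC is -4 in 8-bit 2's complement

total, cycles = list_sum_pipelined(memory)
print(total, cycles)

single = [0] * 256
single[1] = 7  # one node: pointer 0 at address 0, value 7 at address 1
print(list_sum_pipelined(single))
```

Within each cycle the memory access and the add use different operands, so neither has to wait for the other.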

#### List Example Resource Scheduling - Ad hoc method

Schedule one loop iteration followed by the next (4 cycles per result):

| cycle  | 1                 | 2                 | 3              | 4                | 5                 | 6                 | 7              | 8                |
|--------|-------------------|-------------------|----------------|------------------|-------------------|-------------------|----------------|------------------|
| Memory | next<sub>1</sub>  |                   | x<sub>1</sub>  |                  | next<sub>2</sub>  |                   | x<sub>2</sub>  |                  |
| adder  |                   | numa<sub>1</sub>  |                | sum<sub>1</sub>  |                   | numa<sub>2</sub>  |                | sum<sub>2</sub>  |

Initiation Interval (II) = 4

How can we overlap iterations? next<sub>2</sub> depends on next<sub>1</sub>.

- "Slide" second iteration into first (3 cycles per result):

| cycle  | 1                 | 2                 | 3              | 4                 | 5                 | 6              | 7                |
|--------|-------------------|-------------------|----------------|-------------------|-------------------|----------------|------------------|
| Memory | next<sub>1</sub>  |                   | x<sub>1</sub>  | next<sub>2</sub>  |                   | x<sub>2</sub>  |                  |
| adder  |                   | numa<sub>1</sub>  |                | sum<sub>1</sub>   | numa<sub>2</sub>  |                | sum<sub>2</sub>  |

Initiation Interval (II) = 3

- Or further:

| cycle  | 1                 | 2                 | 3                 | 4                | 5                 | 6                 | 7                 | 8                | 9                |
|--------|-------------------|-------------------|-------------------|------------------|-------------------|-------------------|-------------------|------------------|------------------|
| Memory | next<sub>1</sub>  | next<sub>2</sub>  | x<sub>1</sub>     | x<sub>2</sub>    | next<sub>3</sub>  | next<sub>4</sub>  | x<sub>3</sub>     | x<sub>4</sub>    |                  |
| adder  |                   | numa<sub>1</sub>  | numa<sub>2</sub>  | sum<sub>1</sub>  | sum<sub>2</sub>   | numa<sub>3</sub>  | numa<sub>4</sub>  | sum<sub>3</sub>  | sum<sub>4</sub>  |

The repeating pattern is 4 cycles. Not exactly the pattern we were looking for. But does it work correctly?
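Part of that question can be checked mechanically: every operation must be scheduled in a later cycle than the operations that produce its inputs. The Python sketch below encodes an assumed dependency model (numa needs next from the same iteration, x needs numa, sum needs x and the previous sum, next needs the previous next) and lists any ordering violations. It does not model register lifetimes or controller complexity, so an empty result is necessary but not sufficient for a working design.

```python
def deps(kind, i):
    """Producers that must finish in an earlier cycle (assumed model)."""
    if kind == "next":
        return [("next", i - 1)] if i > 1 else []
    if kind == "numa":
        return [("next", i)]
    if kind == "x":
        return [("numa", i)]
    if kind == "sum":
        return [("x", i)] + ([("sum", i - 1)] if i > 1 else [])
    raise ValueError(kind)


def violations(rows):
    """Return (operation, dependency) pairs that are missing or scheduled too late."""
    when = {op: cycle for row in rows
            for cycle, op in enumerate(row, start=1) if op is not None}
    return [(op, dep) for op, cycle in when.items()
            for dep in deps(*op)
            if dep not in when or when[dep] >= cycle]


N, A, X, S = "next", "numa", "x", "sum"

# The 9-cycle schedule from the last chart above.
memory_row = [(N, 1), (N, 2), (X, 1), (X, 2), (N, 3), (N, 4), (X, 3), (X, 4), None]
adder_row = [None, (A, 1), (A, 2), (S, 1), (S, 2), (A, 3), (A, 4), (S, 3), (S, 4)]
print(violations([memory_row, adder_row]))

# A deliberately broken schedule: x1 is fetched in the cycle that computes numa1.
bad = violations([[(N, 1), (X, 1), None], [None, (A, 1), (S, 1)]])
print(bad)
```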

#### List Example Resource Scheduling - another attempt

In this case, first spread out, then pack.

| cycle  | 1                 | 2                 | 3              | 4                |
|--------|-------------------|-------------------|----------------|------------------|
| Memory | next<sub>1</sub>  |                   | x<sub>1</sub>  |                  |
| adder  |                   | numa<sub>1</sub>  |                | sum<sub>1</sub>  |



- Three different loop iterations active at once.
- Short cycle time (no dependencies within a cycle).
- Full utilization (only 2 cycles per result).
- Initialization: x=0, numa=1, sum=0, next=memory[0]
- Control states (out of the loop):
  - two to start: initialize next, clear sum, set numa, clear x; get next<sub>2</sub>
  - two to finish:

## 5. Optimization, Architecture #4



- Incremental cost:
  - Addition of another register & mux, adder mux, and control.
- **Performance:** find max time of the four actions
  1. X←Memory[NUMA], NUMA←NEXT+1;
  2. NEXT←Memory[NEXT], SUM←SUM+X;

  0.5+1+10+1+0.5 = 13ns, same for all  $\Rightarrow$  T > 13ns, F < 77MHz

# **Other Optimizations**

#### Node alignment restriction:

- If the application of the list processor allows us to restrict the placement of nodes in memory so that they are aligned on 2-byte boundaries (even addresses):
  - NUMA addition can be eliminated.
  - Controller supplies "0" for low-bit of memory address for NEXT, and "1" for X.
- Furthermore, if we could use a memory with a 16-bit wide output, then we could fetch an entire node in one cycle:

{NEXT, X}  $\leftarrow$  Memory[NEXT], SUM  $\leftarrow$  SUM + X;

 $\Rightarrow$  execution time cut in half (half as many cycles)
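A Python sketch of the wide-memory version, assuming each 16-bit word holds one node and is modelled as a `(next, x)` pair keyed by the node's even address. The dictionary encoding and the single extra cycle that adds the last value are assumptions made for illustration.

```python
def list_sum_wide(memory):
    """Return (8-bit sum, cycles) when a whole node is fetched per cycle."""
    next_ptr, x, total, cycles = 0, 0, 0, 0
    while True:
        # {NEXT, X} <- Memory[NEXT], SUM <- SUM + X  (SUM uses the old X)
        (next_ptr, x), total = memory[next_ptr], (total + x) & 0xFF
        cycles += 1
        if next_ptr == 0:
            break
    total = (total + x) & 0xFF  # assumed extra cycle: add the last value
    cycles += 1
    return total, cycles


# Nodes on even addresses: 0 (value 3) -> 4 (value 10) -> 2 (value -4)
memory = {0: (4, 3), 4: (2, 10), 2: (0, 0xFC)}
total, cycles = list_sum_wide(memory)
print(total, cycles)
```

For this three-node list the loop takes one cycle per node plus one cycle to drain the last value, compared with two cycles per node before.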

### **List Processor Conclusions**

- Through careful optimization:
  - clock frequency increased from 32MHz to 77MHz
  - little cost increase.
- "Scheduling" was used to overlap operations and to maximize use of resources.
- Essentially, this was done by pipelining the operations (the extra added registers, NUMA and X, act as pipeline registers).
- Questions:
  - Consider the design process we went through:
    - Could a computer program go from RTL description to circuits automatically?
    - Could a computer program derive the optimizations that we did?
    - It is the goal of "High-Level Synthesis" to do similar transformations and automatic mappings. "C-to-gates" compilers are an example.