## **Cache memories**

[§5.1] A *cache* is a small, fast memory which is *transparent* to the processor.

- The cache duplicates information that is in main memory.
- With each data block in the cache, there is associated an *identifier* or *tag*. This allows the cache to be *content* addressable.



- A cache miss is the term analogous to a page fault. It occurs when a referenced word is not in the cache.
  - Cache misses must be handled much more quickly than page faults. Thus, they are handled in hardware.
- Caches can be organized according to four different strategies:
  - Direct
  - Fully associative
  - Set associative
  - Sectored

- A cache implements several different *policies* for retrieving and storing information, one in each of the following categories:
  - *Placement policy*: determines where a block is placed when it is brought into the cache.
  - *Replacement policy*: determines what information is purged when space is needed for a new entry.
  - *Write policy*: determines how soon information in the cache is written to lower levels in the memory hierarchy.

#### Cache memory organization

[§5.2] Information is moved into and out of the cache in *blocks*. When a block is in the cache, it occupies a cache *line*. Blocks are usually larger than one byte,

- to take advantage of locality in programs, and
- because memory may be organized so that it can overlap transfers of several bytes at a time.

The block size is the same as the line size of the cache.

A *placement policy* determines where a particular block can be placed when it goes into the cache. E.g., is a block of memory eligible to be placed in any line in the cache, or is it restricted to a single line?

In our examples, we assume:

- The cache contains 2048 bytes, with 16 bytes per line. Thus it has 128 lines.
- Main memory is made up of 256K bytes, or 16384 blocks. Thus an address consists of 18 bits: a 14-bit block number followed by a 4-bit byte offset within the block.

Lecture 4

Architecture of Parallel Computers

We want to structure the cache to achieve a high hit ratio.

- *Hit*: the referenced information is in the cache.
- *Miss*: the referenced information is not in the cache and must be read in from main memory.

$$\text{Hit ratio} \equiv \frac{\text{Number of hits}}{\text{Total number of references}}$$

We will study caches that have three different placement policies (direct, fully associative, set associative).

## Direct

Only 1 choice of where to place a block.

block $i \rightarrow$ line $i \bmod 128$

Each line has its own tag associated with it.

When the line is in use, the tag contains the high-order seven bits of the main-memory address of the block.

© 2023 Edward F. Gehringer

CSC 506 Lecture Notes, Spring 2023

To search for a word in the cache,

1. Determine what line to look in (easy; just select bits 10–4 of the address).
2. Compare the leading seven bits (bits 17–11) of the address with the tag of the line. If it matches, the block is in the cache.
3. Select the desired bytes from the line.
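The three search steps can be sketched in a few lines of Python. This is only an illustration for the example configuration (18-bit addresses, 128 lines of 16 bytes); the names `split_address` and `DirectMappedCache` are hypothetical, and the model tracks tags only, not the data bytes.

```python
OFFSET_BITS = 4   # 16 bytes per line  -> address bits 3-0
INDEX_BITS = 7    # 128 lines          -> address bits 10-4
# The remaining 7 bits (17-11) of the 18-bit address form the tag.

def split_address(addr):
    """Split an 18-bit address into (tag, line index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    """Tag store only; None marks a line that is not in use."""

    def __init__(self):
        self.tags = [None] * (1 << INDEX_BITS)

    def access(self, addr):
        """Return True on a hit. On a miss, install the block and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:      # the single tag comparison
            return True
        self.tags[index] = tag           # replace whatever was in the line
        return False

cache = DirectMappedCache()
print(split_address(0x2ABCD))
print(cache.access(0x2ABCD), cache.access(0x2ABC0))
```

Two addresses that differ only in their tag bits map to the same line and evict each other, which is the contention discussed next.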

Advantages:

- Fast lookup (only one comparison needed).
- Cheap hardware (only one tag needs to be checked).
- Easy to decide where to place a block.

Disadvantage: Contention for cache lines.

<u>Exercise</u>: What would the size of the tag, index, and offset fields be if

- the line size from our example were doubled, without changing the size of the cache? 7, 6, 5
- the cache size from our example were doubled, without changing the size of the line? 6, 8, 4
- an address were 32 bits long, but the cache size and line size were the same as in the example? 21, 7, 4

## Fully associative

Any block can be placed in any line in the cache.

- This means that we have 128 choices of where to place a block.
- block $i \rightarrow$ any free (or purgeable) cache location

Each line has its own tag associated with it.

When the line is in use, the tag contains the high-order *fourteen* bits of the main-memory address of the block.

To search for a word in the cache,

1. Simultaneously compare the leading 14 bits (bits 17–4) of the address with the tags of all lines. If it matches any one, the block is in the cache.
2. Select the desired bytes from the line.

Advantages:

Minimal contention for lines.

Wide variety of replacement algorithms feasible.

<u>Exercise</u>: What would the size of the tag and offset fields be if

- the line size from our example were doubled, without changing the size of the cache? 13, 5
- the cache size from our example were doubled, without changing the size of the line? 14, 4
- an address were 32 bits long, but the cache size and line size were the same as in the example? 28, 4

Disadvantage:

The most expensive of all organizations, due to the high cost of associative-comparison hardware.

A flowchart of cache operation: The process of searching a fully associative cache is very similar to using a directly mapped cache. Let us consider them in detail.

Which steps would be different if the cache were directly mapped?

## Set associative

1 < *n* < 128 choices of where to place a block.

A compromise between direct and fully associative strategies. The cache is divided into *s* sets, where *s* is a power of 2.

- block $i \rightarrow$ any line in set $i \bmod s$

Each line has its own tag associated with it.

When the line is in use, the tag contains the high-order *eight* bits of the main-memory address of the block. (The next six bits can be derived from the set number.)




<u>Exercise</u>: What would the size of the tag, index, and offset fields be if

- the line size from our example were doubled, without changing the size of the cache? 8, 5, 5
- the set size from our example were doubled, without changing the size of a line or the cache? 9, 5, 4
- the cache size from our example were doubled, without changing the size of the line or a set? 7, 7, 4
- an address were 32 bits long, but the cache size and line size were the same as in the example? 22, 6, 4
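The field widths in these exercises follow mechanically from the address width, cache size, line size, and associativity. A minimal sketch, assuming all sizes are powers of two and that the running set-associative example is 2-way (which is what an 8-bit tag and a 6-bit set number imply for a 128-line cache); `field_widths` is a hypothetical helper name:

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1

def field_widths(addr_bits, cache_bytes, line_bytes, ways):
    """Return (tag, index, offset) widths in bits.

    ways == 1 models a direct-mapped cache; ways == number of lines
    models a fully associative cache (the index width is then 0).
    """
    lines = cache_bytes // line_bytes
    sets = lines // ways
    offset = log2_exact(line_bytes)
    index = log2_exact(sets)
    tag = addr_bits - index - offset
    return tag, index, offset

print(field_widths(18, 2048, 16, 2))   # the 2-way example
print(field_widths(18, 2048, 32, 2))   # line size doubled
print(field_widths(18, 2048, 16, 4))   # set size doubled
```

Doubling the line size moves one bit from the index to the offset, doubling the set size moves one bit from the index to the tag, and doubling the cache size (with lines and sets unchanged) moves one bit from the tag to the index.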

To search for a word in the cache,

1. Select the proper set (*i* mod *s*).
2. Simultaneously compare the leading 8 bits (bits 17–10) of the address with the tags of all lines in the set. If it matches any one, the block is in the cache.
   - At the same time, the (first bytes of) the lines are also being read out so they will be accessible at the end of the cycle.
3. If a match is found, gate the data from the proper block to the cache-output buffer.
4. Select the desired bytes from the line.



- All reads from the cache occur as early as possible, to allow maximum time for the comparison to take place.
- Which line to use is decided late, after the data have reached high-speed registers, so the processor can receive the data fast.

Factors influencing line lengths:

- Long lines ⇒ higher hit ratios.
- Long lines ⇒ less memory devoted to tags.
- Long lines ⇒ longer memory transactions (undesirable in a multiprocessor).
- Long lines ⇒ more write-backs (explained below).

For most machines, line sizes between 32 and 128 bytes perform best.

If there are *b* lines per set, the cache is said to be *b-way* set associative. What was the associativity of the example above?

The logic to compare 2, 4, or 8 tags simultaneously can be made quite fast.

But as *b* increases beyond that, cycle time starts to climb, and the higher cycle time begins to offset the increased associativity.

Almost all L1 caches are less than 8-way set-associative. L2 caches often have higher associativity.

## Two-level caches

# Write policy

[§5.2.3] Answer these questions, based on the text.

What are the two write policies mentioned in the text?


Which one is typically used when a block is to be written to main memory, and why?

Which one can be used when a block is to be written to a lower level of the cache, and why?

Can you explain what error correction has to do with the choice of write policy?

Explain what a parity bit has to do with this.

#### **Principle of inclusion**

[§5.2.4] To analyze a second-level cache, we use the *principle of inclusion*—a large second-level cache includes everything in the first-level cache.

We can then do the analysis by assuming the first-level cache did not exist, and measuring the hit ratio of the second-level cache alone.

How should the line length in the second-level cache relate to the line length in the first-level cache? The line length in the 2<sup>nd</sup>-level cache should not be shorter than the line length in the 1<sup>st</sup>-level cache.

When we measure a two-level cache system, two miss ratios are of interest:

The *local miss rate* for a cache is

$$\text{local miss rate} = \frac{\text{number of misses experienced by the cache}}{\text{number of incoming references}}$$

To compute this ratio for the L2 cache, we need to know the number of misses in the L1 cache.

The *global miss rate* of the cache is

$$\text{global miss rate} = \frac{\text{number of L2 misses}}{\text{number of references made by the processor}}$$

This is the primary measure of the L2 cache.
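A small numeric sketch may help keep the two ratios apart. The counts below are made up for illustration, and the model assumes that the only references reaching the L2 are L1 misses (it ignores write-backs and prefetches); `miss_rates` is a hypothetical helper.

```python
def miss_rates(processor_refs, l1_misses, l2_misses):
    """Return (L1 local, L2 local, L2 global) miss rates.

    Assumes every L1 miss becomes exactly one incoming L2 reference.
    """
    l1_local = l1_misses / processor_refs
    l2_local = l2_misses / l1_misses        # incoming L2 references = L1 misses
    l2_global = l2_misses / processor_refs
    return l1_local, l2_local, l2_global

# Illustrative counts: 1000 processor references, 50 L1 misses, 10 L2 misses.
l1_local, l2_local, l2_global = miss_rates(1000, 50, 10)
print(l1_local, l2_local, l2_global)
```

Under this assumption the L2 global miss rate equals the product of the L1 and L2 local miss rates, which is why a high local L2 miss rate can still correspond to a low global one.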

What conditions need to be satisfied in order for inclusion to hold?

• L2 associativity must be ≥ L1 associativity, irrespective of the number of sets.

Otherwise, more entries in a particular set could fit into the L1 cache than the L2 cache, which means the L2 cache couldn't hold everything in the L1 cache.

• The number of L2 sets has to be  $\geq$  the number of L1 sets, irrespective of L2 associativity.

(Assume that the L2 line size is  $\geq$  L1 line size.)

If this were not true, multiple L1 sets would depend on a single L2 set for backing store. So references to one L1 set could affect the backing store for another L1 set.

• All reference information from L1 is passed to L2 so that it can update its replacement bits.

Even if all of these conditions hold, we still won't have logical inclusion if L1 is write-back. (However, we will still have *statistical inclusion*—L2 *usually* contains L1 data.)

## [§5.2.6] Translation Lookaside Buffers

The CPU generates *virtual* addresses, which correspond to locations in virtual memory.

In principle, the virtual addresses are translated to physical addresses using a page table.



Therefore, the TLB and the cache must be accessed sequentially.



This adds an extra cycle in case of a hit.

(The page *displacement* is sometimes called the "page offset." But we will call it the displacement to avoid confusion with the "block offset," which we just call "offset.")

How can we avoid wasting this time?

Lecture 5

Architecture of Parallel Computers

Let's take a look at address translation.

In this example, what is the page size? 2<sup>12</sup> = 4096 bytes

How much physical memory is there? 2<sup>25</sup> bytes

Our goal is to allow the cache to be indexed before address translation completes.

In order to do that, we need to have the index field be *entirely contained* within the page displacement.

So, if the displacement is *d* bits wide, the width of the index is *j* bits, and the offset is *k* bits, we must have  $j + k \le d$ .



Let's look at what happens when a memory address is accessed.



What are the steps in cache access?

- 1. Access the set that could contain the sought-after address.
- 2. Pull down the tags into the sense amplifiers (purple).
- 3. Compare the tags with the tag of the sought-after address
- 4. Read all lines in the set into the sense amplifiers (purple).
- 5. Select the line that actually contains the sought-after address.
- 6. Select the sought-after byte(s) or word(s) to return.
- 7. Return the sought-after byte(s) or word(s) to the processor.

We always need to read lines into the sense amplifiers and then select the word (cf. the direct-mapped cache diagram in Lecture 4).

Now, if we know the index *before* address translation takes place, we <u>can perform steps</u> 1, 2, and 4 while address translation is occurring.

There is a tradeoff between speed and power efficiency.

- For power efficiency, in which order should steps 1 through 4 be performed? 1, 2, 3, 4
- For maximum speed, which of steps 1 through 4 can be performed in parallel? 2 & 4


Cache hit time reduces from two cycles to one!

... because the cache can now be *indexed* in parallel with TLB (although the tag match uses output from the TLB).

But there are some constraints...

• Suppose our cache is direct mapped. Then the index field just contains the line number. So, (line number || block offset) must fit inside the page displacement.

What is the largest the cache can be? 2<sup>12</sup> bytes = 1 page

- If we want to increase the size of the cache, what can we do? Make it set-associative, because this increases the tag width and decreases the width of the index.
- Options:
  - For new machines, select the page size such that

    $$\text{page size} \geq \frac{\text{cache size}}{\text{associativity}}$$

  - If the page size is fixed, select the associativity so that

    $$\text{associativity} \geq \frac{\text{cache size}}{\text{page size}}$$

Example: MC88110

- Page size = 4KB
- I-cache, D-cache are both: 8KB, 2-way set-associative (4KB = 8KB / 2)

Example: VAX series

- Page size = 512B
- For a 16KB cache, need assoc. = (16KB / 512B) = 32-way set. assoc.!
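The two examples can be checked with a short calculation. This is a sketch of the constraint only (index plus offset bits must fit in the page displacement, i.e. cache size / associativity ≤ page size); the function names are hypothetical.

```python
def min_associativity(cache_bytes, page_bytes):
    """Smallest associativity for which cache size / associativity <= page size."""
    return max(1, -(-cache_bytes // page_bytes))   # ceiling division

def can_index_in_parallel(cache_bytes, ways, page_bytes):
    """True if the index and offset fields fit inside the page displacement."""
    return cache_bytes // ways <= page_bytes

KB = 1024
print(min_associativity(8 * KB, 4 * KB))    # MC88110: 8KB cache, 4KB pages
print(min_associativity(16 * KB, 512))      # VAX: 16KB cache, 512B pages
print(can_index_in_parallel(8 * KB, 1, 4 * KB))
```

A direct-mapped 8KB cache with 4KB pages fails the check, because one index bit would come from the translated part of the address.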

The textbook gives these three alternatives for cache indexing and tagging. <u>Answer some questions</u> about them.

Physically Indexed and Tagged

Virtually Indexed and Tagged

Virtually Indexed but Physically Tagged

- What's the main disadvantage of physically indexed and tagged?
- What is the organization we have just been discussing (in the last diagram)?
- What is the main disadvantage of virtually indexed and tagged?

# Multilevel cache design

What are distinguishing <u>features of the different cache levels</u> of the four-level design (from 2013) illustrated on p. 135 of the textbook?

|           | Distinguishing feature | Size      | Access time    | Implementation technology |
|-----------|------------------------|-----------|----------------|---------------------------|
| L1 cache  | Split                  | 32KB–64KB | 2–3 cycles     | SRAM                      |
| L2 cache  | Unified                | 256KB–1MB | 10–20 cycles   | SRAM                      |
| L3 cache  | Banked                 | 8–80 MB   | 20–50 cycles   | DRAM                      |
| L4 cache  | Off-chip               | 30–100 MB | 50–80 cycles   | DRAM                      |
| Main mem. | Off-chip               | 4–32 GB   | 120–400 cycles | DRAM                      |

What are some advantages of a centralized cache?

- Interconnect between L2 and other levels is simplified, because it can be in one place.
- Movement of data between banks is simplified (e.g., in virtually indexed and tagged caches).

What are some advantages of a banked structure?

- A portion of the cache is close to each processor, which helps speed access time.
- More scalable. A single tile (core, L1 caches, 1 bank of L2) can be designed and stamped as many times as needed.

#### Inclusion in multilevel caches

Answer these questions about inclusion policies.

- Which kind(s) of caches move a block from one level to the other?
- Which kind(s) of caches propagate up an eviction from the L2 to the L1?
- Which kind(s) of caches have to inform the L2 about a write to the L1?
- In an inclusive cache, can L2 associativity be greater than L1 associativity?

Find and describe the typo in this diagram.

**Replacement policies**

LRU is a good strategy for cache replacement.

In a set-associative cache, LRU is reasonably cheap to implement. Why? The number of lines you need to check is small (= associativity of the cache)

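With the LRU algorithm, the lines can be arranged in an *LRU stack*, in order of recency of reference. Suppose a string of references is

a b c d a b e a b c d e

and there are 4 lines. Then the LRU stacks after each reference are as follows (most recently used line on top; * marks a miss):

| Reference    | a | b | c | d | a | b | e | a | b | c | d | e |
|--------------|---|---|---|---|---|---|---|---|---|---|---|---|
| Top (MRU)    | a | b | c | d | a | b | e | a | b | c | d | e |
|              |   | a | b | c | d | a | b | e | a | b | c | d |
|              |   |   | a | b | c | d | a | b | e | a | b | c |
| Bottom (LRU) |   |   |   | a | b | c | d | d | d | e | a | b |
| Miss?        | * | * | * | * |   |   | * |   |   | * | * | * |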

Notice that at each step:

- The line that is referenced moves to the top of the LRU stack.
- All lines below that line keep their same position.
- All lines above that line move down by one position.
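The stack behavior described by these three rules can be reproduced with a short simulation. This is a sketch for the reference string and the 4-line cache used above; `lru_stacks` is a hypothetical helper, and a plain Python list stands in for the hardware state.

```python
def lru_stacks(references, num_lines):
    """Return a list of (reference, stack after the reference, hit?) tuples.

    stack[0] is the most recently used line; stack[-1] is the LRU line.
    """
    stack = []
    history = []
    for ref in references:
        hit = ref in stack
        if hit:
            stack.remove(ref)        # lines below the referenced line stay put
        elif len(stack) == num_lines:
            stack.pop()              # miss in a full cache: evict the LRU line
        stack.insert(0, ref)         # referenced line moves to the top
        history.append((ref, list(stack), hit))
    return history

for ref, stack, hit in lru_stacks("abcdabeabcde", 4):
    print(ref, "".join(stack), "" if hit else "*")
```

Each printed line shows the reference, the stack from top to bottom, and an asterisk on a miss, so the output can be compared column by column with the table of LRU stacks.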

How many bits per set are required to keep track of LRU status in both of the implementations described in the text?

- Matrix: 16
- Pseudo-LRU: 3




## The Cache-Coherence Problem

Lecture 13 (Chapter 6)

NC State University, CSC/ECE 506: Architecture of Parallel Computers





















    sum = 0;
    for (i = 1; i <= 2; i++) {
        lock(id, myLock);
        sum = sum + a[i];
        unlock(id, myLock);
    }
    print sum;

Suppose a[1] = 3 and a[2] = 7.

- Will it print sum = 10?

















Peterson's Algorithm

    int turn;
    int interested[n];   // initialized to false

    void lock (int process, int lvar) {   // process is 0 or 1
        int other = 1 - process;
        interested[process] = TRUE;
        turn = other;
        while (turn == other && interested[other] == TRUE) {};
    }
    // Post: turn != other || interested[other] == FALSE

    void unlock (int process, int lvar) {
        interested[process] = FALSE;
    }

Acquisition of the lock occurs only if

1. interested[other] == FALSE: either the other process has not competed for the lock, or it has just called unlock(), or
2. turn != other: the other process is competing, has set the turn to our process, and will be blocked in the while() loop.































Snoop-Based Coherence on a Bus







Write-Through State-Transition Diagram

[State-transition diagram for a write-through, no-write-allocate, write-invalidate protocol with states V (valid) and I (invalid). Transition labels: PrRd/--, PrWr/BusWr, PrRd/BusRd, BusWr. Legend: processor-initiated transactions and bus-snooper-initiated transactions.]

- How does this protocol guarantee write propagation?
- How does it guarantee write serialization?

Key: A write invalidates all other caches.

Therefore, we have:

- Modified line: exists as V in only 1 cache
- Clean line: exists as V in at least 1 cache
- Invalid state represents an invalidated line or one that is not present in the cache
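The transitions named on this slide can be written out as a small state machine. The sketch below is an assumption-laden reading of the diagram: it tracks a single block, uses two states (V and I), and takes the standard behavior for a write-through, no-write-allocate, write-invalidate protocol; the function names are hypothetical.

```python
VALID, INVALID = "V", "I"

def processor_event(state, event):
    """Return (next state, bus transaction or None) for PrRd / PrWr."""
    if event == "PrRd":
        if state == VALID:
            return VALID, None           # PrRd/--  : read hit, no bus traffic
        return VALID, "BusRd"            # PrRd/BusRd: read miss fetches the block
    if event == "PrWr":
        return state, "BusWr"            # PrWr/BusWr: write through; no allocation on a miss
    raise ValueError(event)

def snoop_event(state, transaction):
    """Return the next state of a cache that observes another cache's transaction."""
    if transaction == "BusWr":
        return INVALID                   # a write invalidates all other copies
    return state

def run(trace, num_caches):
    """Apply (processor, event) pairs; return final states and the bus log."""
    states = [INVALID] * num_caches
    bus = []
    for proc, event in trace:
        states[proc], txn = processor_event(states[proc], event)
        if txn is not None:
            bus.append((proc, txn))
            for other in range(num_caches):
                if other != proc:
                    states[other] = snoop_event(states[other], txn)
    return states, bus

print(run([(0, "PrRd"), (1, "PrRd"), (0, "PrWr"), (1, "PrRd")], 2))
```

In this trace, the write by processor 0 appears on the bus and invalidates processor 1's copy, so processor 1's next read misses and fetches the new value from memory. Every write appears on the bus, and the bus orders the writes for all observers.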

























    P1:                  P2:
    A = 1;               while (flag == 0); /* spin idly */
    flag = 1;            print A;

- When "multiple" means "all", we have sequential consistency (SC).
- Sequential consistency (SC) corresponds to our intuition.
- Other memory consistency models do not obey our intuition!
- Coherence doesn't help; it pertains only to a single location.

Lecture 14 Outline

- Invalidation vs. update coherence
- Bus-based coherence

Update-Based Protocols

- Idea: If this block is written, send the new word to all other caches.
- New bus transaction: Update
- Compared to invalidate, <u>what are advs. and disads.</u>?
- Advantages
  - Other processors don't miss on next access
  - Saves refetch: In invalidation protocols, they would miss and cause a bus transaction.
  - Saves bandwidth: A single bus transaction updates several caches
- Disadvantages
  - Multiple writes by same processor cause multiple update transactions
    - In invalidation, first write gets exclusive ownership, other writes local

Invalidate versus Update

Is a block written by one processor read by other processors before it is rewritten?

- Invalidation:
  - Yes: Readers will take a miss.
  - No: Multiple writes can occur without additional traffic. Copies that won't be used again get cleared out.
- Update:
  - Yes: Readers will not miss if they had a copy previously. A single bus transaction will update all copies.
  - No: Multiple useless updates, even to dead copies.

Invalidation protocols are much more popular. Some systems provide both, or even a hybrid.
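The yes/no cases can be made concrete by counting bus transactions for a single block under a deliberately simplified model. Everything here is an assumption made for illustration: one transaction per miss, one per upgrade (invalidation), one per update broadcast, infinite caches, and no distinction between transaction sizes. `bus_transactions` is a hypothetical helper, not a model of any specific protocol.

```python
def bus_transactions(trace, protocol):
    """Count bus transactions for one block; trace is a list of (proc, "R" or "W")."""
    holders = set()      # processors that currently hold a valid copy
    exclusive = None     # processor that may write locally (invalidate protocol only)
    count = 0
    for proc, op in trace:
        if proc not in holders:
            count += 1                       # miss: fetch the block over the bus
            holders.add(proc)
            if op == "W" and protocol == "invalidate":
                holders = {proc}             # read-exclusive: other copies invalidated
                exclusive = proc
                continue
            exclusive = None                 # block is now (potentially) shared
        if op == "W":
            if protocol == "invalidate":
                if exclusive != proc:
                    count += 1               # upgrade: invalidate the other copies
                    holders = {proc}
                    exclusive = proc
            elif len(holders) > 1:
                count += 1                   # update: broadcast the new word
    return count

write_run = [(1, "R")] + [(0, "W")] * 4 + [(1, "R")]     # rewritten before being read
ping_pong = [(0, "W"), (1, "R")] * 4                     # read after every write
for name, trace in (("write run", write_run), ("ping-pong", ping_pong)):
    print(name, bus_transactions(trace, "invalidate"), bus_transactions(trace, "update"))
```

In this toy model the write run favors invalidation (later writes are local), while the ping-pong pattern favors update (the reader never misses again).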

















## Performance of coherence protocols

Cache misses have traditionally been classified into four categories:

- Cold misses (or "compulsory misses") occur the first time that a block is referenced.
- Conflict misses are misses that would not occur if the cache were fully associative with LRU replacement.
- Capacity misses occur when the cache size is not sufficient to hold data between references.
- Coherence misses are misses caused by the coherence protocol.

The first three types occur in uniprocessors. The last is specific to multiprocessors.

To these, Solihin adds *context-switch* (or "system-related") misses, which are related to task switches.

Let's look at a uniprocessor example, a very small cache that has only four lines.

Let's look first at a fully associative cache. Which kind(s) of misses can't it have?

Here's an example of a reference trace of 0, 2, 4, 0, 2, 4, 6, 8, 0.



In a fully associative cache, there are 5 cold misses, because 5 different blocks are referenced.

There are 3 hits.

Lecture 16

Architecture of Parallel Computers

The remaining reference (the third one to block 0) is not a cold miss.

It must be a capacity miss, because the cache doesn't have room to hold all five blocks.

We'll assume that replacement is LRU; in this case, block 0 replaces the LRU line, which at that point is line 1.

Now let's suppose the cache is 2-way set associative. This means there are two sets, one (set 0) that will hold the even-numbered blocks, and one (set 1) that will hold the odd-numbered blocks.

Since only even-numbered blocks are referenced in this trace, they will all map to set 0.

This time, though, there won't be any hits.

<u>Classify each of these references</u> as a hit or a particular kind of miss.

References that would have been hits in a fully associative cache, but are misses in a less-associative cache, are conflict misses.

Finally, let's look at a direct-mapped cache. Blocks with numbers congruent to 0 mod 4 map to line 0; blocks with numbers congruent to 1 mod 4 map to line 1, etc.

<u>Classify each of these references</u> as a hit or a particular kind of miss.

Of the three conflict misses in the set-associative cache, one is a hit here. Block 2 is still in the cache the second time it is referenced. The other two are conflict misses in this cache.

Now, let's talk about coherence misses.

Coherence misses can be divided into those caused by *true sharing* and those caused by *false sharing* (see p. 236 of the Solihin text).

- False-sharing misses are those caused by having a line size larger than one word. <u>Can you explain?</u>
- True-sharing misses, on the other hand, occur when
  - a processor writes into a cache line, invalidating a copy of the same block in another processor's cache,
  - after which the other processor references the word that was written.

How can we attack each of the four kinds of misses?

- To reduce capacity misses, we can increase the cache size.
- To reduce conflict misses, we can increase the associativity.
- To reduce cold misses, we can increase the line size.
- To reduce coherence misses, we can change the line size.

Similarly, context-switch misses can be divided into categories.

- *Replaced* misses are blocks that were replaced while the other process(es) were active.
- *Reordered* misses are blocks that were shoved so far down the LRU stack by the other process(es) that they are replaced soon afterwards (when they otherwise would've stayed in the cache).

Which protocol is best? What cache line size performs best? What kind of misses predominate?
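For the four-line uniprocessor example with the trace 0, 2, 4, 0, 2, 4, 6, 8, 0, the hit/miss classification can be checked with a small simulator. This is a sketch under the definitions used in these notes: a miss is cold on the first reference to a block, a conflict miss if a fully associative LRU cache of the same size would have hit, and a capacity miss otherwise. The function name `classify` is hypothetical, and LRU replacement is assumed within each set.

```python
from collections import OrderedDict

def classify(trace, num_lines, ways):
    """Label each reference as 'hit', 'cold', 'conflict', or 'capacity'."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # per-set LRU order, oldest first
    full = OrderedDict()     # reference model: fully associative LRU, same total size
    seen = set()
    labels = []
    for block in trace:
        target = sets[block % num_sets]
        if block in target:
            labels.append("hit")
        elif block not in seen:
            labels.append("cold")
        elif block in full:
            labels.append("conflict")
        else:
            labels.append("capacity")
        seen.add(block)
        # Update both the simulated cache and the fully associative reference.
        for cache, capacity in ((target, ways), (full, num_lines)):
            if block in cache:
                cache.move_to_end(block)
            else:
                if len(cache) == capacity:
                    cache.popitem(last=False)         # evict the LRU block
                cache[block] = True
    return labels

trace = [0, 2, 4, 0, 2, 4, 6, 8, 0]
for name, ways in (("fully associative", 4), ("2-way", 2), ("direct-mapped", 1)):
    print(name, classify(trace, 4, ways))
```

Running the three configurations side by side shows the same cold and capacity misses in each, with only the conflict misses changing as associativity drops.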

# Simulations

Questions like these can be answered by simulation. Getting the answer right is part art and part science.

Parameters need to be chosen for the simulator. Culler & Singh (1998) selected a single-level 4-way set-associative 1 MB cache with 64-byte lines.

The simulation assumes an idealized memory model, which assumes that references take constant time. Why is this not realistic?

The simulated workload consists of

- six parallel programs (Barnes, LU, Ocean, Radix, Radiosity, Raytrace) from the SPLASH-2 suite and
- one multiprogrammed workload, consisting of mainly serial programs.

Invalidate vs. update

with respect to miss rate

Which is better, an update or an invalidation protocol?

Let's look at real programs.



Where there are many coherence misses, update performs better

If there were many capacity misses, update would hurt, because it would needlessly keep data in cache, which would need to be updated, increasing bus traffic.

#### with respect to bus traffic

Compare the

- upgrades in the invalidation protocol with the
- updates in the update protocol.

Each of these operations produces bus traffic.

Which are more frequent? Updates in an update protocol are more prevalent than upgrades in an invalidation protocol.

Which protocol causes more bus traffic? The update protocol causes more traffic.

The main problem is that one processor tends to write a block multiple times before another processor reads it.

This causes several bus transactions instead of one, as there would be in an invalidation protocol.

Effect of cache line size

#### on miss rate

If we increase the line size, <u>what happens</u> to each of the following classes of misses?

- capacity misses? Hard to say
- conflict misses? Hard to say
- true-sharing misses? Down
- false-sharing misses? Increase

If we increase the line size, what happens to bus traffic? It increases, because for each miss, more data needs to be brought in.

So it is not clear which line size will work best.

Results for the first three applications seem to show that which line size is best? 256

For the second set of applications, which do not fit in cache, Radix shows a greatly increasing number of false-sharing misses with increasing line size.



#### on bus traffic

Larger line sizes generate more bus traffic.






The results are different from those for miss rate: traffic almost always increases with increasing line size.

But address-bus traffic moves in the opposite direction from data-bus traffic.

With this in mind, which line size appears to be best? 32 or 64

Context-switch misses

As cache size gets larger, there are fewer uniprocessor ("natural") cache misses.

But the <u>number of context-switch misses</u> may go up (mcf, soplex) or down (namd, perlbench).

- Why could it go up?
- Why could it go down?

Reordered misses also decline as the cache becomes large. Why? Because there are enough lines in the cache that blocks farther down the LRU stack won't be replaced.

Figure 5.13: Breakdown of the types of L2 cache misses suffered by SPEC2006 applications with various cache sizes. Source: [39].

## Physical cache organization

[Solihin §5.6] A cache is *centralized* ("united") if its banks are adjacent on the chip.

What are some advantages of a centralized structure?

- Uniform access time
- Interconnect between the cache and the next level (e.g., an on-chip memory controller) is simplified, because it can be in one place.

A centralized cache usually uses a crossbar (see also p. 167 of the text).

A cache is *distributed* if its banks are scattered around the chip.

Usually, a portion of the L2 is placed near each L1; this is a *tiled* arrangement.

[Figure: Tiled Multicore (with Ring); cells labeled Core & L1, L2 Cache, and R.]

What are some advantages of a distributed structure?

- In replication: A single tile (core, L1 caches, 1 bank of L2) can be designed and stamped as many times as needed. So it is more scalable, easier to verify, and easier to reuse in the next generation (same advantages as multicore!).
- In layout: More feasible for a manycore processor, where wire length and thermal considerations prevent a cache from being centralized.

*Hybrid centralized + distributed structure:* There's a tradeoff between centralized and distributed.

- A large cache is uniformly slow, especially if it needs to handle coherence.
- A distributed cache requires a lot of interconnections, and routing latency is high if the cache is in too many places.

A compromise is to have an L2 cache that is distributed, but not as distributed as the L1 caches.

[Figure: hybrid centralized + distributed organization; cells labeled Core & L1, L2 Cache, and R.]

## Logical cache organization

[Solihin §5.7] Regardless of whether a cache is centralized or distributed, there are several options in mapping addresses to tiles.

- A processor can be limited to accessing a single tile, the one closest to it (private cache configuration).
  - A block in the local cache may also exist in other caches; the copies must be kept coherent by a coherence protocol.
- All of the tiles can form a large logical cache. The address of a block completely determines what tile it is found in (shared 1-tile associative).
  - It may require a lot of hops to get from a processor to the cache.
- A block can be mapped to two tiles (shared 2-tile associative).
  - Block numbers are arranged to improve distance locality.
- Or, a block can be allowed to map to any tile (full tile associativity).
  - <u>What is the upside</u>?
  - What is the downside?

Another option is a partitioned shared cache organization.



- Can you tell how many tiles each block can map to? Yes, four.
- Can you tell how many *lines* each block can map to? No, because we don't know how the address is divided into tag, index, and offset fields.
- How does coherence play a role? Blocks shared between processors can be allowed to be cached in different groups if there is a coherence protocol.


#### Cache Coherence vs. Memory Consistency

- Cache coherence
  - deals with ordering of writes to a single memory location
  - only needed for systems with caches
- Memory consistency
  - deals with ordering of reads/writes to *all* memory locations
  - needed in systems with or without caches

Why is a memory consistency model needed?

[§9.1] Programmer's intuition:

| P0:                   | P1:                           |
|-----------------------|-------------------------------|
| S1: datum = 5;        | S3: while (!datumIsReady);    |
| S2: datumIsReady = 1; | S4: … = datum;                |

Programmers expect S4 to read the new value of datum (i.e., 5).

This expectation is violated if:

- S2 appears to be executed before S1
- S4 appears to be executed before S3

Thus, Hypothesis 1: Program-order expectation

Programmers expect memory accesses in a thread to be executed in the same order in which they occur in the source code.

Not only the executing thread, but *all* threads, are expected to see them in this order.

| P0:                            | P1:                                                             | P2:                                        |
|--------------------------------|-----------------------------------------------------------------|--------------------------------------------|
| S1: x = 5;<br>S2: xReady = 1;  | S3: while (!xReady) {};<br>S4: y = x + 4;<br>S5: xyReady = 1;   | S6: while (!xyReady) {};<br>S7: z = x * y; |


Memory accesses emanating from a processor should be performed in program order, and each of them should be performed atomically.

These expectations were incorporated in Lamport's 1979 definition of sequential consistency:

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.

## Sequentially consistent vs. non-SC outcomes

Consider these code sequences, with a and b initialized to 0.

| P0:                      | P1:                          |
|--------------------------|------------------------------|
| S1: a = 1;<br>S2: b = 1; | S3: print b;<br>S4: print a; |

Note that this program is *non-deterministic* due to a lack of synchronization.

Under SC, $\texttt{S1} \rightarrow \texttt{S2}$ and $\texttt{S3} \rightarrow \texttt{S4}$ are guaranteed.

Assuming SC, what values might possibly be printed for a and b?

- S1, S2, S3, S4 cause a, b = 1, 1
- S3, S4, S1, S2 cause a, b = 0, 0
- S1, S3, S2, S4 cause a, b = 1, 0

What values for a, b are impossible? a = 0, b = 1

Prove it.

For a to print as 0, it must be that $\texttt{S4} \rightarrow \texttt{S1}$: e.g., S3, S4, S1, S2, in which S4 precedes S1.

For b to print as 1, it must be that $\texttt{S2} \rightarrow \texttt{S3}$: e.g., S1, S2, S3, S4, in which S1 precedes S4.


Let's say, initially, x = y = z = xReady = xyReady = 0

As a programmer, what would you expect to be the value of z at S7? 45

This implies that if the new value of **x** has been propagated to **P2**, it has also been propagated to **P1**.

Thus, Hypothesis 2: Atomicity expectation

A read or write happens instantaneously with respect to all processors.

How can the atomicity expectation be violated?

Step 1: New values of x and xReady have been propagated to P1, but have not reached P2.

Step 2: New values of y and xyReady have been propagated to P2 before x is propagated to P2.

Step 3: When x is propagated to P2, P2 has already read the old value of x, and z has been set to 0.
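The three steps can be replayed in a small simulation. This is only a sketch under stated assumptions: each processor is given its own view of every variable, and the names `Views`, `write_local`, `propagate`, and `run_s1_to_s7` are hypothetical helpers, not part of any real system. A write becomes visible to each other processor only when `propagate` delivers it.

```c
#include <assert.h>
#include <stdbool.h>

enum { X, XREADY, Y, XYREADY, NVARS };
enum { P0, P1, P2, NPROCS };

/* view[p][v] is the value of variable v as currently seen by processor p. */
typedef struct { int view[NPROCS][NVARS]; } Views;

/* A write is first visible only to the processor that issued it. */
static void write_local(Views *m, int writer, int var, int value) {
    m->view[writer][var] = value;
}

/* Deliver the writer's current value of var to one other processor. */
static void propagate(Views *m, int from, int to, int var) {
    m->view[to][var] = m->view[from][var];
}

/* Returns the value of z that P2 computes at S7. */
int run_s1_to_s7(bool writes_are_atomic) {
    Views m = {{{0}}};                          /* x = y = xReady = xyReady = 0 */

    write_local(&m, P0, X, 5);                  /* S1 */
    write_local(&m, P0, XREADY, 1);             /* S2 */
    propagate(&m, P0, P1, X);                   /* Step 1: P1 sees x, xReady */
    propagate(&m, P0, P1, XREADY);
    if (writes_are_atomic) {                    /* atomic: P2 sees them too */
        propagate(&m, P0, P2, X);
        propagate(&m, P0, P2, XREADY);
    }

    assert(m.view[P1][XREADY] == 1);            /* S3 leaves its loop */
    write_local(&m, P1, Y, m.view[P1][X] + 4);  /* S4 */
    write_local(&m, P1, XYREADY, 1);            /* S5 */
    propagate(&m, P1, P2, Y);                   /* Step 2: y, xyReady reach P2 */
    propagate(&m, P1, P2, XYREADY);

    assert(m.view[P2][XYREADY] == 1);           /* S6 leaves its loop */
    int z = m.view[P2][X] * m.view[P2][Y];      /* S7 */

    propagate(&m, P0, P2, X);                   /* Step 3: x arrives too late */
    return z;
}
```

Calling `run_s1_to_s7(true)` models atomic writes and yields 45; `run_s1_to_s7(false)` follows Steps 1 to 3 and yields 0.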



Is there any other way that a violation of store atomicity can lead to a wrong value for z?

What is <u>another "incorrect" value</u> that could be written for z? Explain how this could happen.

Summary of programmer's expectations:


Both of these conditions cannot hold. Prove it.

On a non-SC machine, the outcome of a, b = 0, 1 is possible. What statement ordering can produce it? S4, S1, S2, S3

In this case, which of the two SC precedence guarantees (above) is violated? Program order

Let's take another example.

| P0:          | P1:          |
|--------------|--------------|
| S1: a = 1;   | S3: b = 1;   |
| S2: print b; | S4: print a; |

Exercise: Assuming that a and b are initialized to 0,

- what values can be printed under SC?
- what values are impossible to print under SC?
- prove that the impossible results can only occur if SC is violated.

Answer: Note that the program is non-deterministic due to a lack of synchronization.

With SC, $\texttt{S1} \rightarrow \texttt{S2}$ and $\texttt{S3} \rightarrow \texttt{S4}$ are guaranteed.

S1, S2, S3, S4 cause a, b = 1, 0

S3, S4, S1, S2 cause a, b = 0, 1

S1, S3, S2, S4 cause a, b = 1, 1

a, b = 0, 0 is impossible in SC. Proof:

For a to be 0, it must be that $\texttt{S4} \rightarrow \texttt{S1}$: S3, S4, S1, S2

For b to be 0, it must be that $\texttt{S2} \rightarrow \texttt{S3}$: S1, S2, S3, S4

These cannot both hold, because we would have to have $\texttt{S1} \rightarrow \texttt{S2} \rightarrow \texttt{S3} \rightarrow \texttt{S4} \rightarrow \texttt{S1}$.
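The same conclusion can be checked by brute force. The sketch below enumerates every interleaving of S1, S2 with S3, S4 that preserves each thread's program order (which is all SC permits) and records what gets printed. The function names `execute`, `sc_outcomes`, and `sc_allows` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Execute the four statements in the given order; report what is printed. */
static void execute(const int order[4], int *printed_a, int *printed_b) {
    int a = 0, b = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 1: a = 1;          break;   /* S1: a = 1;   */
        case 2: *printed_b = b; break;   /* S2: print b; */
        case 3: b = 1;          break;   /* S3: b = 1;   */
        case 4: *printed_a = a; break;   /* S4: print a; */
        }
    }
}

/* Bit (2*a + b) of the result is set if some SC interleaving prints a, b. */
unsigned sc_outcomes(void) {
    unsigned outcomes = 0;
    for (unsigned pick = 0; pick < 16; pick++) {   /* bit i set: slot i is P1's */
        int next_p0 = 1, next_p1 = 3, order[4];
        for (int i = 0; i < 4; i++)
            order[i] = (pick & (1u << i)) ? next_p1++ : next_p0++;
        if (next_p0 != 3 || next_p1 != 5)
            continue;                              /* need two slots per thread */
        int pa = -1, pb = -1;
        execute(order, &pa, &pb);
        outcomes |= 1u << (2 * pa + pb);
    }
    return outcomes;
}

/* True if printing the pair (a, b) is possible under SC. */
bool sc_allows(int a, int b) {
    assert((a == 0 || a == 1) && (b == 0 || b == 1));
    return (sc_outcomes() >> (2 * a + b)) & 1u;
}
```

Of the six legal interleavings, none prints a, b = 0, 0, so `sc_allows(0, 0)` is false while the other three pairs are allowed.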

On a non-SC machine, the outcome a, b = 0, 0 is possible.

- S4, S1, S2, S3
  - In this case, $\texttt{S3} \rightarrow \texttt{S4}$ is violated.
- S2, S3, S4, S1
  - In this case, $\texttt{S1} \rightarrow \texttt{S2}$ is violated.

Both of the previous examples are non-deterministic.

Non-deterministic codes are notoriously hard to debug.

But non-determinism may have legitimate uses. See Code 3.16 (ocean-current simulation) and 3.18 (smoothing filter for grayscale image).

So, does preserving ordering of memory accesses matter?

- Probably not if non-determinism is intentional
- Otherwise, yes, because:
  - Helps keep programmers sane during debugging.
  - Even properly synchronized programs need ordering for the synchronization to work properly.

## Building a SC system

[§9.2] Which of the two hypotheses (expectations) can be guaranteed by software? Maintaining program order

- Ensure that compiler does not reorder memory accesses;
- Declare critical variables as volatile (to avoid register allocation, code elimination, etc.)

What hypothesis needs to be maintained by hardware? Atomicity

- Execute memory accesses one at a time, in program order. One access needs to be complete before the next can start.
- In the processor pipeline, memory accesses can be overlapped or reordered.


- (Store prefetching) As soon as the address is known/predictable, issue a prefetch request to fetch the block in Modified state.

But this is not a perfect strategy. Why not?

- Prefetch too late ⇒
- Prefetch too early ⇒

#### Via speculation

We can violate ordering, but undo the effect if atomicity is violated.

- The ability to undo execution and re-execute is already present in out-of-order processors (as covered in ECE 563).
  - So, we only need to determine when atomicity has been violated.
- Consider load A, followed by load B
  - In strict SC, load B must wait until load A completes
  - With speculation, load B accesses the cache anyway; the processor just marks load B as speculative
  - If B is invalidated before it "retires," atomicity has been violated.
  - In this case, the architecture cancels B and re-executes it.

Store speculation is harder, because stores cannot be canceled. Hence, only load speculation is employed.

- But they must go to the cache in program order.
- A load is complete when the block has been read from the cache.
- A store is complete when an invalidation has been posted (on a bus) or acknowledged (see details in §10.2.1).

#### Example of SC Ordering

| S1: ld R1, A | S1 must complete before S2, |
|--------------|-----------------------------|
| S2: ld R2, B | S2 before S3, etc.          |
| S3: st R3, C |                             |
| S4: st R4, D |                             |
| S5: ld R5, D |                             |

Implications

- If S1 is a cache miss but S2 is a cache hit, S2 still must wait until S1 is completed. Same with S3 and S4.
- **S4** must wait for **S3** to complete, even though stores are often retired early.
- S5 must wait for S4 to complete, even though they are to the same location!

## Improving SC performance

#### Via prefetching

We still have to obey ordering, but we can make each load/store complete faster, e.g. by converting cache misses into cache hits:

- Employ load prefetching
  - As soon as the address is known/predictable (even before previous loads have completed), issue a prefetch request to fetch the block in Exclusive/Shared state
- Employ store prefetching


#### **Relaxed Memory-Consistency Models**

<u>Review</u>. Why are relaxed memory-consistency models needed? How do relaxed MC models require programs to be changed?

The "safety net" between operations whose order needs to be guaranteed is often a *fence* instruction.

- The fence ensures that memory operations that are "younger" are not issued until the older mem ops have globally performed. The newer instruction must
  - wait until all older writes have been posted on the bus (or received InvAck);
  - wait until all older reads have completed;
  - flush the pipeline to avoid issuing younger mem ops early; e.g., instructions already in the pipeline may have previously read the shared variable that was just updated.
- Programmers must insert fences.

What if amateur programmers perform their own synchronization, and forget fences? Machine does not guarantee correctness.
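As one concrete illustration of inserting fences, here is a minimal sketch using the C11 `atomic_thread_fence` function from `<stdatomic.h>`. The datum/datumIsReady pattern is borrowed from the earlier example; the function names `post_datum` and `try_read_datum` are hypothetical, and the use of C11 fences (rather than a particular machine's fence instruction) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int datum;                    /* ordinary shared data           */
static atomic_int datum_is_ready;    /* synchronization flag, starts 0 */

/* Producer side: S1, fence, S2. */
void post_datum(int value) {
    datum = value;                                   /* S1 */
    /* The older write to datum is ordered before the younger flag write. */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&datum_is_ready, 1, memory_order_relaxed);  /* S2 */
}

/* Consumer side: one iteration of S3, fence, S4.
   Returns false if the flag is not yet set. */
bool try_read_datum(int *out) {
    assert(out != NULL);
    if (atomic_load_explicit(&datum_is_ready, memory_order_relaxed) == 0)
        return false;                                /* S3: still waiting */
    /* The younger read of datum is not performed ahead of the flag read. */
    atomic_thread_fence(memory_order_seq_cst);
    *out = datum;                                    /* S4 */
    return true;
}
```

A consumer thread would call `try_read_datum` in a loop until it returns true. Without the two fences, the relaxed accesses to the flag would give no ordering guarantee for `datum`.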

#### A continuum of consistency models

Sequential consistency is one view of what a programming model should guarantee.

Let us introduce a way of diagramming consistency models. Suppose that—

- The value of a particular memory word in processor 2's local memory is 0.
- Then processor 1 writes the value 1 to that word of memory. Note that this is a remote write.


From this we can see that running the same program twice in a row in a system with sequential consistency may not give the same results.

#### Causal consistency


The first step in weakening the consistency constraints is to distinguish between events that are potentially *causally* connected and those that are not.

Two events are causally related if one can influence the other.

| P<sub>1</sub>: | W(x)1 |       |       |
|----------------|-------|-------|-------|
| P<sub>2</sub>: |       | R(x)1 | W(y)2 |

Here, the write to x could influence the write to y, because  $P_2$  might have read x and used its value to calculate y.

On the other hand, without the intervening read, the two writes would not have been causally connected:

| P<sub>1</sub>: | W(x)1 |       |
|----------------|-------|-------|
| P<sub>2</sub>: |       | W(y)2 |

The following pairs of operations are potentially causally related:

- A read followed by a later write by the same processor.
- A write followed by a later read to the same location.
- The transitive closure of the above two types of pairs of operations.

Operations that are not causally related are said to be concurrent.

**Causal consistency:** Writes that are potentially causally related must be seen in the same order by all processors.

Concurrent writes may be seen in a different order by different processors.

- Processor 2 then reads the word. But, being local, the read occurs quickly, and the value 0 is returned.

What's wrong with this?

This situation can be diagrammed like this (the horizontal axis represents time):

| P<sub>1</sub>: | W(x)1 |       |
|----------------|-------|-------|
| P<sub>2</sub>: |       | R(x)0 |

Depending upon how the program is written, it may or may not be able to tolerate a situation like this.

But, in any case, the programmer must understand what can happen when memory is accessed in a DSM system.

#### Sequential consistency

Sequential consistency: The result of any execution is the same as if

- the memory operations of all processors were executed in some sequential order, and
- the operations of each individual processor appear in this sequence in the order specified by its program.

Sequential consistency does *not* mean that writes are instantly visible throughout the system (it would be impossible to implement that anyway).

The example below illustrates two sequentially consistent executions.

Note that a read from  $P_2$  is allowed to return an out-of-date value (because it has not yet "seen" the previous write).




Here is a sequence of events that is allowed with a causally consistent memory, but disallowed by a sequentially consistent memory:

| P<sub>1</sub>: | W(x)1 W(x)3 |             |
|----------------|-------------|-------------|
| P<sub>2</sub>: | R(x)1 W(x)2 |             |
| P<sub>3</sub>: | R(x)1       | R(x)3 R(x)2 |
| P<sub>4</sub>: | R(x)1       | R(x)2 R(x)3 |

Why is this not allowed by sequential consistency?

Why is this allowed by causal consistency?

What is the violation of causal consistency in the sequence below?

| P<sub>1</sub>: | W(x)1       |             |
|----------------|-------------|-------------|
| P<sub>2</sub>: | R(x)1 W(x)2 |             |
| P<sub>3</sub>: |             | R(x)2 R(x)1 |
| P<sub>4</sub>: |             | R(x)1 R(x)2 |

Without the R(x)1 by  $P_2$ , this sequence would've been causally consistent.
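The check implied here can be written down in a few lines. Because $P_2$ reads 1 before writing 2, W(x)1 causally precedes W(x)2, so no processor may observe 1 after it has observed 2. The sketch below assumes every write stores a distinct value (as in the diagrams) and uses a hypothetical helper name, `respects_order`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* reads[0..n-1] are the values one processor read from x, in order.
   Returns true if the value `earlier` is never read after the value
   `later` has already been read. */
bool respects_order(const int *reads, size_t n, int earlier, int later) {
    assert(reads != NULL || n == 0);
    bool seen_later = false;
    for (size_t i = 0; i < n; i++) {
        if (reads[i] == later)
            seen_later = true;
        else if (reads[i] == earlier && seen_later)
            return false;          /* went back to the causally older value */
    }
    return true;
}
```

For the sequence above, the reads of $P_3$ (2 then 1) fail the check with `earlier = 1, later = 2`, while the reads of $P_4$ (1 then 2) pass.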

Implementing causal consistency requires the construction of a dependency graph, showing which operations depend on which other operations.

#### Processor consistency

Causal consistency requires that all processes see causally related writes from *all* processors in the same order.

The next step is to relax this requirement, to require only that writes from the same processor be seen in order. This gives processor consistency.

Processor consistency: Writes performed by a single processor are received by all other processors in the order in which they were issued.

Writes from different processors may be seen in a different order by different processors.

Processor consistency would permit this sequence that we saw violated causal consistency:

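| P<sub>1</sub>: | W(x)1       |             |
|----------------|-------------|-------------|
| P<sub>2</sub>: | R(x)1 W(x)2 |             |
| P<sub>3</sub>: |             | R(x)2 R(x)1 |
| P<sub>4</sub>: |             | R(x)1 R(x)2 |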

Another way of looking at this model is that all writes generated by different processors are considered to be concurrent.

Note: Some definitions of processor consistency require cache coherence too. Processor consistency without cache coherence is called PRAM consistency.

Exercise: What is the strongest consistency model that each of the following satisfy?

| <i>P</i> <sub>1:</sub> | <i>W</i> ( <i>x</i> )1 |       |             |
|------------------------|------------------------|-------|-------------|
| P <sub>2:</sub>        | <i>R</i> ( <i>x</i> )1 | W(x)2 |             |
| <b>P</b> <sub>3:</sub> |                        |       | R(x)1 R(x)2 |
| <b>P</b> <sub>4:</sub> |                        |       | R(x)2 R(x)1 |

- Loads do not wait for stores to complete ("perform"); they access the cache right away (without being speculative!).
- A load dependent on an older store (in the same processor) can "bypass" (directly obtain the store value before it is stored).


PC also removes write atomicity.




flag2 = 1; if (flag1 == 0)

PC produces SC results, because ordering between 2 stores is preserved.

PC fails to produce SC results, because PC does not guarantee ordering between a store and a younger load.

- How close is PC to programmers' expectation?
  - Most of the time, very close (e.g., post-wait synchronization works correctly)
  - Major OSes are ported to PC with relative ease
- Cases that cause errors in PC usually are due to races that also happen in SC.
  - However, debugging races in PC is more difficult.

#### Weak ordering

Processor consistency is still stronger than necessary for many programs, because it requires that writes originating in a single processor be seen in order everywhere.

| P<sub>1</sub>: | W(y)1 |       |             |
|----------------|-------|-------|-------------|
| P<sub>2</sub>: | R(x)1 | W(y)2 |             |
| P<sub>3</sub>: |       |       | R(y)1 R(y)2 |
| P<sub>4</sub>: |       |       | R(y)2 R(y)1 |

| P<sub>1</sub>: | W(x)1       |             |
|----------------|-------------|-------------|
| P<sub>2</sub>: | R(x)1 W(y)2 |             |
| P<sub>3</sub>: |             | R(x)1 R(y)2 |
| P<sub>4</sub>: |             | R(y)2 R(x)1 |

Sometimes processor consistency can lead to counterintuitive results. Assume that a and b are initialized to 0.

| <i>P</i> <sub>1</sub> :       | <b>P</b> <sub>2</sub> : |
|-------------------------------|-------------------------|
| a = 1;<br>if (b == 0)         | b = 1;<br>if (a == 0)   |
| <i>kill(p</i> <sub>2</sub> ); | <i>kill(p</i> 1);       |

At first glance, it seems that no more than one process should be killed.

With processor consistency, however, it is possible for both to be killed. Explain how.

What processor consistency guarantees

- SC ensures ordering of
  - LD → LD
  - LD → ST
  - ST → LD
  - ST → ST
- PC removes the ST → LD constraint, with significant implications for ILP.
  - Values can be loaded into other caches, even if there's a store to the same location in some write buffer.
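The effect of dropping the ST → LD constraint can be seen in a small simulation of the a/b "kill" example with write buffers. This is a sketch, not a model of any particular processor: the one-entry `WriteBuffer` type and the helpers `buffer_store`, `drain`, and `run_kill_example` are invented for illustration, and only one interleaving (both stores issue, then both loads) is replayed.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool p1_killed, p2_killed; } KillResult;

/* One-entry write buffer: a store waits here before reaching memory. */
typedef struct { bool pending; int *target; int value; } WriteBuffer;

static void buffer_store(WriteBuffer *wb, int *target, int value) {
    assert(!wb->pending);
    wb->pending = true;
    wb->target = target;
    wb->value = value;
}

static void drain(WriteBuffer *wb) {
    if (wb->pending) {
        *wb->target = wb->value;
        wb->pending = false;
    }
}

/* store_before_load = true enforces ST -> LD, as SC does;
   false lets each load bypass the older buffered store, as PC allows. */
KillResult run_kill_example(bool store_before_load) {
    int a = 0, b = 0;                       /* globally visible memory */
    WriteBuffer wb1 = {false, 0, 0}, wb2 = {false, 0, 0};

    buffer_store(&wb1, &a, 1);              /* P1: a = 1; */
    buffer_store(&wb2, &b, 1);              /* P2: b = 1; */
    if (store_before_load) {
        drain(&wb1);
        drain(&wb2);
    }

    int b_seen_by_p1 = b;                   /* P1: if (b == 0) */
    int a_seen_by_p2 = a;                   /* P2: if (a == 0) */

    drain(&wb1);                            /* the stores perform eventually */
    drain(&wb2);

    KillResult r = { .p1_killed = (a_seen_by_p2 == 0),
                     .p2_killed = (b_seen_by_p1 == 0) };
    return r;
}
```

With `store_before_load` false, both loads see 0 and both processes are killed. With it true, this interleaving kills neither (other SC interleavings kill at most one).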


But it is not always necessary for other processors to see writes in order, or even to see all writes, for that matter.

Suppose a processor is in a tight loop in a critical section, reading and writing variables.

Other processes aren't supposed to touch these variables until the process exits its critical section.

Under processor consistency, the memory has no way of knowing that other processes don't care about these writes, so it has to propagate all writes to all other processors in the normal way.

To relax our consistency model further, we have to divide memory operations into two classes and treat them differently.

- Accesses to synchronization variables are sequentially consistent.
- Accesses to other memory locations can be treated as concurrent.

This strategy is known as weak ordering.

With weak ordering, we don't need to propagate accesses that occur during a critical section.

We can just wait until the process exits its critical section, and then:

- make sure that the results are propagated throughout the system, and
- stop other actions from taking place until this has happened.

Similarly, when we want to enter a critical section, we need to make sure that all previous writes have finished.

These constraints yield the following definition:

**Weak ordering:** A memory system exhibits weak ordering iff:

1. Accesses to synchronization variables are sequentially consistent.
2. No access to a synchronization variable can be performed until all previous writes have completed everywhere.
3. No data access (read or write) can be performed until all previous accesses to synchronization variables have been performed.

Thus, by doing a synchronization before reading shared data, a process can be assured of getting the most recent values written by other processes before their immediately preceding *S*s.

Note that this model does not allow more than one critical section to execute at a time, even if the critical sections involve disjoint sets of variables.

This model puts a greater burden on the programmer, who must decide which variables are synchronization variables.

Weak ordering says that memory does not have to be kept up to date between synchronization operations.

This is similar to how a compiler can put variables in registers for efficiency's sake. Memory is only up to date when these variables are written back.

If there were any possibility that another process would want to read these variables, they couldn't be kept in registers.

This shows that processes can live with out-of-date values, provided that they know when to access them and when not to.

The following is a legal sequence under weak ordering. Can you explain why?

| P<sub>1</sub>: | W(x)1 | W(x)2 | S     |       |   |
|----------------|-------|-------|-------|-------|---|
| P<sub>2</sub>: |       |       | R(x)2 | R(x)1 | S |
| P<sub>3</sub>: |       |       | R(x)1 | R(x)2 | S |

Here's a sequence that's illegal under weak ordering. Why?

- making sure that the local processor has seen all previous writes anywhere in the system.



If the memory could tell the difference between entry and exit of a critical section, it would only need to satisfy one of these conditions.

Release consistency provides two operations:

- acquire operations tell the memory system that a critical section is about to be entered.
- release operations say a critical section has just been exited.

It is possible to acquire or release a single synchronization variable, so more than one critical section can be in progress at a time.

When an acquire occurs, the memory will make sure that all the local copies of shared variables are brought up to date.

When a release is done, the shared variables that have been changed are propagated out to the other processors.

But—

- doing an acquire does not guarantee that locally made changes will be propagated out immediately.
- doing a release does not necessarily import changes from other processors.

Here is an example of a valid event sequence for release consistency (*A* stands for "acquire," and *Q* for "release" or "quit"):



| P<sub>1</sub>: | W(x)1 | W(x)2 | S |       |
|----------------|-------|-------|---|-------|
| P<sub>2</sub>: |       |       | S | R(x)1 |

[Figure: groups of Load/Store operations separated by Sync operations.]

Synch may be implemented as a lock acquire/release.

Before a synch, all previous ops must finish. Before any ld/st, all previous synch must finish.

Why safe? Typically within a critical section, we have made sure that only one process is inside, thus safe to reorder anything in the critical section.

Outside a critical section, we usually do not care about the order of mem ops (we would have used synchronization if we had cared).

How to know whether a particular ld/st serves as a synchronization point?

- Assume all atomic instructions are synchronization points
  - fetch-and-op, test-and-set
- Assume all load linked (LL) and store conditional (SC) are synchronization points

#### Release consistency

Weak ordering does not distinguish between entry to critical section and exit from it.

Thus, on both occasions, it has to take the actions appropriate to both:

- making sure that all locally initiated writes have been propagated to all other memories, and




Note that since  $P_3$  has not done a synchronize, it does not necessarily get the new value of *x*.

**Release consistency:** A system is release consistent if it obeys these rules:

1. Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed.
2. Before a release is allowed to be performed, all previous reads and writes done by the process must have completed.
3. The acquire and release accesses must be processor consistent.

If these conditions are met, and processes use *acquire* and *release* properly, the results of an execution will be the same as on a sequentially consistent memory.
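To make the acquire/release asymmetry concrete, here is a deliberately simplified simulation. It is a sketch only: it models one shared variable x, tracks propagation but not mutual exclusion, and all names (`RcMemory`, `rc_acquire`, `rc_release`, and so on) are hypothetical. An acquire only imports the published value; a release only exports local changes.

```c
#include <assert.h>
#include <stdbool.h>

enum { NPROCS = 3 };

typedef struct {
    int  shared_x;            /* the copy that releases publish to       */
    int  local_x[NPROCS];     /* each process's local copy of x          */
    bool dirty[NPROCS];       /* local copy holds unpublished writes     */
} RcMemory;

void rc_init(RcMemory *m) {
    m->shared_x = 0;
    for (int p = 0; p < NPROCS; p++) {
        m->local_x[p] = 0;
        m->dirty[p] = false;
    }
}

/* Ordinary accesses touch only the local copy. */
void rc_write(RcMemory *m, int p, int value) {
    m->local_x[p] = value;
    m->dirty[p] = true;
}

int rc_read(const RcMemory *m, int p) {
    return m->local_x[p];
}

/* Acquire: bring the local copy up to date. Nothing is published. */
void rc_acquire(RcMemory *m, int p) {
    assert(!m->dirty[p]);     /* assumption: writes happen between acquire and release */
    m->local_x[p] = m->shared_x;
}

/* Release: publish local changes. Nothing is imported. */
void rc_release(RcMemory *m, int p) {
    if (m->dirty[p]) {
        m->shared_x = m->local_x[p];
        m->dirty[p] = false;
    }
}
```

If process 0 acquires, writes 1 and then 2, and releases, a later acquire by process 1 reads 2. A process that never acquires keeps reading its stale local copy, which the model permits.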

Summary: Sequential consistency is possible, but costly. The model can be relaxed in various ways.

Consistency models not using synchronization operations:

| Type of<br>consistency | Description |
|------------------------|-------------|
| Sequential             | All processes see all shared accesses in same order. |
| Causal                 | All processes see all causally related shared accesses in the same order. |
| Processor              | All processes see writes from each processor in the order they were initiated. Writes from different processors may not be seen in the same order, except that writes to the same location will be seen in the same order everywhere. |

Consistency models using synchronization operations:


| Type of<br>consistency | Description                                                                             |
|------------------------|-----------------------------------------------------------------------------------------|
| Weak                   | Shared data can only be counted on to be<br>consistent after a synchronization is done. |
| Release                | Shared data are made consistent when a critical region is exited.                       |

The following diagram contrasts various forms of consistency.

| Sequential consistency                                                   | Processor consistency                                                                | Weak<br>ordering                                       | Release<br>consistency |
|--------------------------------------------------------------------------|--------------------------------------------------------------------------------------|--------------------------------------------------------|------------------------|
| $\mathbb{R} \to \mathbb{R} \to \mathbb{R} \to \mathbb{R} \to \mathbb{R}$ | $\begin{array}{c} R \to R \to W \to R \\ W, W, \cdots \\ \{W, W, \cdots \end{array}$ | {M, M}<br>↓<br>SYNCH<br>↓<br>{M, M}<br>↓<br>SYNCH<br>: |                        |


## Lock Implementations

[§8.1] Recall the three kinds of synchronization from Lecture 6:

- Point-to-point: post() and wait(); send() and receive()
- Lock
- Barrier

Performance metrics for lock implementations

- Uncontended latency
  - Time to acquire a lock when there is no contention
- Traffic
  - Lock acquisition when lock is already locked
  - Lock acquisition when lock is free
  - Lock release
- Fairness
  - Degree to which a thread can acquire a lock with respect to others
- Storage
  - As a function of # of threads/processors

## The need for atomicity

This code sequence illustrates the need for atomicity. Explain.

```
void lock (int *lockvar) {
  while (*lockvar == 1) {}; // wait until released
  *lockvar = 1; // acquire lock
}
void unlock (int *lockvar) {
  *lockvar = 0;
}
```

In assembly language, the sequence looks like this:

    lock:   ld  R1, &lockvar   // R1 = lockvar
            bnz R1, lock       // jump to lock if R1 != 0
            sti &lockvar, #1   // lockvar = 1
            ret                // return to caller
    unlock: sti &lockvar, #0   // lockvar = 0
            ret                // return to caller

- The ld-to-sti sequence must be executed atomically:
  - The sequence appears to execute in its entirety
  - Multiple sequences are serialized
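The failure can be reproduced deterministically by stepping two threads through the non-atomic lock one instruction at a time. In this sketch, the `Thread` type, the `step` function, and `threads_in_critical_section` are hypothetical; `schedule[i]` names which thread (0 or 1) executes the next instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int pc; int r1; bool in_critical_section; } Thread;

/* Execute one instruction of the non-atomic lock routine for thread t. */
static void step(Thread *t, int *lockvar) {
    switch (t->pc) {
    case 0:  t->r1 = *lockvar; t->pc = 1;            break;  /* ld  R1, &lockvar */
    case 1:  t->pc = (t->r1 != 0) ? 0 : 2;           break;  /* bnz R1, lock     */
    case 2:  *lockvar = 1; t->pc = 3;                break;  /* sti &lockvar, #1 */
    default: t->in_critical_section = true;          break;  /* ret              */
    }
}

/* Run the given schedule; return how many threads hold the "lock". */
int threads_in_critical_section(const int *schedule, size_t n) {
    Thread t[2] = { {0, 0, false}, {0, 0, false} };
    int lockvar = 0;
    for (size_t i = 0; i < n; i++) {
        assert(schedule[i] == 0 || schedule[i] == 1);
        step(&t[schedule[i]], &lockvar);
    }
    return (int)t[0].in_critical_section + (int)t[1].in_critical_section;
}
```

With the alternating schedule 0,1,0,1,0,1,0,1, both threads load lockvar = 0 before either stores 1, so both enter the critical section. With the serialized schedule 0,0,0,0,1,1,1,1, only thread 0 enters.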

Examples of atomic instructions

- test-and-set Rx, M
  - read the value stored in memory location M, test the value against a constant (e.g. 0), and if they match, write the value in register Rx to the memory location M.
- fetch-and-op M
  - read the value stored in memory location M, perform op to it (e.g., increment, decrement, addition, subtraction), then store the new value to the memory location M.
- exchange Rx, M
  - atomically exchange (or swap) the value in memory location M with the value in register Rx.
- compare-and-swap Rx, Ry, M
  - compare the value in memory location M with the value in register Rx. If they match, write the value in register Ry to M, and copy the value in Rx to Ry.
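For readers who want to experiment, the four instructions can be approximated with C11 atomics from `<stdatomic.h>`. This is a sketch of the semantics described above, not an ISA definition: the wrapper names are made up, registers are modeled as ordinary `int` values, and fetch-and-op is shown for addition only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* test-and-set Rx, M: if M == 0, write rx to M. Returns the old value of M. */
int test_and_set(atomic_int *m, int rx) {
    assert(m != NULL);
    int old = 0;                                  /* the constant tested against  */
    atomic_compare_exchange_strong(m, &old, rx);  /* on mismatch, old = current M */
    return old;
}

/* fetch-and-op M, with op = add: M = M + delta. Returns the old value of M. */
int fetch_and_add(atomic_int *m, int delta) {
    assert(m != NULL);
    return atomic_fetch_add(m, delta);
}

/* exchange Rx, M: swap rx with M. Returns the old value of M (the new Rx). */
int exchange_reg(atomic_int *m, int rx) {
    assert(m != NULL);
    return atomic_exchange(m, rx);
}

/* compare-and-swap Rx, Ry, M: if M == rx, write *ry to M and copy rx to *ry.
   Returns true if the swap happened. */
bool compare_and_swap(int rx, int *ry, atomic_int *m) {
    assert(ry != NULL && m != NULL);
    int expected = rx;
    if (atomic_compare_exchange_strong(m, &expected, *ry)) {
        *ry = rx;
        return true;
    }
    return false;
}
```

Each wrapper is a single atomic read-modify-write, so no other access to M can fall between its read and its write.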


How to ensure one atomic instruction is executed at a time:

1. Reserve the bus until done
   - Other atomic instructions cannot get to the bus
2. Reserve the cache block involved until done
   - Obtain exclusive permission (e.g. "M" in MESI)
   - Reject or delay any invalidation or intervention requests until done
3. Provide "illusion" of atomicity instead
   - Using load-link/store-conditional (to be discussed later)

#### Test and set

test-and-set can be used like this to implement a lock:

    lock:   t&s R1, &lockvar   // R1 = MEM[&lockvar];
                               // if (R1==0) MEM[&lockvar]=1
            bnz R1, lock       // jump to lock if R1 != 0
            ret                // return to caller
    unlock: sti &lockvar, #0   // MEM[&lockvar] = 0
            ret                // return to caller

What value does lockvar have when the lock is acquired? 1. When it is free? 0.

Here is an example of test-and-set execution. Describe what it shows.

| Thread 0                                         | Thread 1                                         |
|--------------------------------------------------|--------------------------------------------------|
| t&s R1, &lockvar // successful<br>bnz R1, lock   | t&s R1, &lockvar // failed<br>bnz R1, lock       |
| in critical section                              | t&s R1, &lockvar // failed<br>bnz R1, lock       |
| sti &lockvar, #0                                 | t&s R1, &lockvar // failed<br>bnz R1, lock       |
|                                                  | t&s R1, &lockvar // successful<br>bnz R1, lock   |
|                                                  | in critical section                              |
|                                                  | sti &lockvar, #0                                 |

(Time advances downward.)

Let's look at how a sequence of test-and-sets by three processors plays out:

| Request    | P1 | P2 | P3 | BusRequest |
|------------|----|----|----|------------|
| Initially  | -  | -  | -  | -          |
| P1: t&s    | M  | -  | -  | BusRdX     |
| P2: t&s    | I  | M  | -  | BusRdX     |
| P3: t&s    | I  | I  | M  | BusRdX     |
| P2: t&s    | I  | M  | I  | BusRdX     |
| P1: unlock | M  | I  | I  | BusRdX     |
| P2: t&s    | I  | M  | I  | BusRdX     |
| P3: t&s    | I  | I  | M  | BusRdX     |
| P3: t&s    | I  | I  | M  | -          |
| P2: unlock | I  | M  | I  | BusRdX     |
| P3: t&s    | I  | I  | M  | BusRdX     |
| P3: unlock | I  | I  | M  | -          |
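The BusRequest column can be reproduced with a few lines of simulation. The sketch assumes, as the table does, that every t&s and every unlock needs the block in state M whether or not the t&s succeeds, and that only the M and I states matter. The names `LineState`, `write_access`, and `count_busrdx` are hypothetical; processors are numbered from 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NCACHES = 3 };
typedef enum { NOT_CACHED, INVALID, MODIFIED } LineState;

/* A t&s or an unlock needs the block in M.
   Returns true if a BusRdX has to be placed on the bus. */
static bool write_access(LineState cache[NCACHES], int requester) {
    if (cache[requester] == MODIFIED)
        return false;                      /* hit in M: no bus transaction   */
    for (int p = 0; p < NCACHES; p++)
        if (cache[p] == MODIFIED)
            cache[p] = INVALID;            /* previous owner is invalidated  */
    cache[requester] = MODIFIED;
    return true;
}

/* requesters[i] is the processor issuing the i-th t&s or unlock. */
int count_busrdx(const int *requesters, size_t n) {
    LineState cache[NCACHES] = { NOT_CACHED, NOT_CACHED, NOT_CACHED };
    int busrdx = 0;
    for (size_t i = 0; i < n; i++) {
        assert(requesters[i] >= 0 && requesters[i] < NCACHES);
        if (write_access(cache, requesters[i]))
            busrdx++;
    }
    return busrdx;
}
```

Feeding in the eleven requests of the table (P1, P2, P3, P2, P1, P2, P3, P3, P2, P3, P3, written as 0, 1, 2, 1, 0, 1, 2, 2, 1, 2, 2) gives nine BusRdX transactions, matching the nine non-empty entries in the BusRequest column.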

How does test-and-set perform on the four metrics listed above?

- Uncontended latency
- Fairness
- Traffic
- Storage

## Drawbacks of Test&Set Lock (TSL)

What is the main drawback of test&set locks?

- High traffic, many coherence transactions. These retard the progress of processes that don't have the lock
- The invalidations by processes trying to get in may make the critical section slower.

Without changing the lock mechanism, how can we diminish this overhead?


- Back off: pause for a while before retrying
  - Back off by too little: traffic still high
  - Back off by too much: wait longer than necessary
- Exponential back-off: Increase the back-off interval exponentially with each failure.
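The back-off idea above can be sketched in C11 atomics. This is only an illustration, not the lecture's code: `atomic_exchange` stands in for the `t&s` instruction, and the names `tsl_lock`, `MIN_DELAY`, `MAX_DELAY`, `next_delay`, and `pause_for` are hypothetical, with arbitrary tuning constants.

```c
#include <stdatomic.h>

/* Hypothetical tuning constants; real values depend on the machine. */
#define MIN_DELAY 1u
#define MAX_DELAY 1024u

typedef struct { atomic_int lockvar; } tsl_lock;   /* 0 = free, 1 = held */

/* One test&set: returns the old value (0 means the caller got the lock). */
static int test_and_set(tsl_lock *l) {
    return atomic_exchange_explicit(&l->lockvar, 1, memory_order_acquire);
}

/* Exponential back-off: double the interval after each failure, up to a cap. */
static unsigned next_delay(unsigned delay) {
    return (delay * 2 > MAX_DELAY) ? MAX_DELAY : delay * 2;
}

/* Crude busy-wait used as the pause (an assumption; any delay would do). */
static void pause_for(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* Acquire the lock; returns the number of failed test&sets, for illustration. */
static unsigned tsl_acquire(tsl_lock *l) {
    unsigned delay = MIN_DELAY, failures = 0;
    while (test_and_set(l) != 0) {
        failures++;
        pause_for(delay);           /* stay off the bus for a while */
        delay = next_delay(delay);  /* wait longer after each failure */
    }
    return failures;
}

static void tsl_release(tsl_lock *l) {
    atomic_store_explicit(&l->lockvar, 0, memory_order_release);
}
```

The cap `MAX_DELAY` is one way to limit the "back off by too much" problem; the lock mechanism itself is unchanged.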

## Test and Test&Set Lock (TTSL)

- Busy-wait with ordinary read operations, not test&set.
  - Cached lock variable will be invalidated when release occurs
- When value changes (to 0), try to obtain lock with test&set
  - Only one attempter will succeed; others will fail and start testing again.

Let's compare the code for TSL with TTSL.

TSL:

|         | Instruction      | Comment                                             |
|---------|------------------|-----------------------------------------------------|
| lock:   | t&s R1, &lockvar | // R1 = MEM[&lockvar]; if (R1==0) MEM[&lockvar]=1   |
|         | bnz R1, lock     | // jump to lock if R1 != 0                          |
|         | ret              | // return to caller                                 |
| unlock: | sti &lockvar, #0 | // MEM[&lockvar] = 0                                |
|         | ret              | // return to caller                                 |

TTSL:

|         | Instruction      | Comment                                             |
|---------|------------------|-----------------------------------------------------|
| lock:   | ld R1, &lockvar  | // R1 = MEM[&lockvar]                               |
|         | bnz R1, lock     | // jump to lock if R1 != 0                          |
|         | t&s R1, &lockvar | // R1 = MEM[&lockvar]; if (R1==0) MEM[&lockvar]=1   |
|         | bnz R1, lock     | // jump to lock if R1 != 0                          |
|         | ret              | // return to caller                                 |
| unlock: | sti &lockvar, #0 | // MEM[&lockvar] = 0                                |
|         | ret              | // return to caller                                 |

The **lock** method now contains two loops. What would happen if we removed the second loop? The test and the set would no longer be atomic; the code would be equivalent to the first code sequence in this lecture, which did not enforce mutual exclusion.
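As a rough C11 analogue of the TTSL routine, the sketch below spins with an ordinary load and issues the atomic exchange (standing in for `t&s`) only when the lock looks free. The type `ttsl_lock`, the function names, and the `tas_issued` counter are assumptions added for illustration; the counter exists only to make visible how many test&sets are attempted.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  lockvar;     /* 0 = free, 1 = held */
    atomic_uint tas_issued;  /* illustration only: number of test&sets attempted */
} ttsl_lock;

/* One pass through the two TTSL tests; returns true if the lock was acquired. */
static bool ttsl_try(ttsl_lock *l) {
    /* First test (ld; bnz): an ordinary read, satisfied from the cache while held. */
    if (atomic_load_explicit(&l->lockvar, memory_order_relaxed) != 0)
        return false;
    /* Second test (t&s; bnz): the lock looked free, so try to take it atomically. */
    atomic_fetch_add_explicit(&l->tas_issued, 1, memory_order_relaxed);
    return atomic_exchange_explicit(&l->lockvar, 1, memory_order_acquire) == 0;
}

static void ttsl_acquire(ttsl_lock *l) {
    while (!ttsl_try(l)) { }   /* on failure, go back to testing with loads */
}

static void ttsl_release(ttsl_lock *l) {
    atomic_store_explicit(&l->lockvar, 0, memory_order_release);
}
```

While the lock is held, `ttsl_try` returns without touching `tas_issued`, which mirrors the claim that waiting threads spin without issuing test&sets.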

Here's a trace of a TSL, and then a TTSL, execution. Let's compare them line by line.

| TSL: Request | P1 | P2 | P3 | BusRequest |
|--------------|----|----|----|------------|
| Initially    | -  | -  | -  | -          |
| P1: t&s      | M  | -  | -  | BusRdX     |
| P2: t&s      | I  | M  | -  | BusRdX     |
| P3: t&s      | I  | I  | M  | BusRdX     |
| P2: t&s      | I  | M  | I  | BusRdX     |
| P1: unlock   | M  | I  | I  | BusRdX     |
| P2: t&s      | I  | M  | I  | BusRdX     |
| P3: t&s      | I  | I  | M  | BusRdX     |
| P3: t&s      | I  | I  | M  | -          |
| P2: unlock   | I  | M  | I  | BusRdX     |
| P3: t&s      | I  | I  | M  | BusRdX     |
| P3: unlock   | I  | I  | M  | -          |

| TTSL: Request | P1 | P2 | P3 | BusRequest |
|---------------|----|----|----|------------|
| Initially     | -  | -  | -  | -          |
| P1: ld        | E  | -  | -  | BusRd      |
| P1: t&s       | M  | -  | -  | -          |
| P2: ld        | S  | S  | -  | BusRd      |
| P3: ld        | S  | S  | S  | BusRd      |
| P2: ld        | S  | S  | S  | -          |
| P1: unlock    | M  | I  | I  | BusUpgr    |
| P2: ld        | S  | S  | I  | BusRd      |
| P2: t&s       | I  | M  | I  | BusUpgr    |
| P3: ld        | I  | S  | S  | BusRd      |
| P3: ld        | I  | S  | S  | -          |
| P2: unlock    | I  | M  | I  | BusUpgr    |
| P3: ld        | I  | S  | S  | BusRd      |
| P3: t&s       | I  | I  | M  | BusUpgr    |
| P3: unlock    | I  | I  | M  | -          |

Fill out this table:

|                 | TSL | TTSL |
|-----------------|-----|------|
| # BusReads      | 0   | 6    |
| # BusReadXs     | 9   | 0    |
| # BusUpgrs      | 0   | 4    |
| # invalidations | 8   | 5    |

(What's the proper way to count invalidations?)

© 2023 Edward F. Gehringer

CSC/ECE 506 Lecture Notes, Spring 2023

TSL vs. TTSL summary

- Successful lock acquisition:
  - 2 bus transactions in TTSL
    - 1 BusRd to intervene with a remotely cached block
    - 1 BusUpgr to invalidate all remote copies
  - vs. only 1 in TSL
    - 1 BusRdX to invalidate all remote copies
- Failed lock acquisition:
  - 1 bus transaction in TTSL
    - 1 BusRd to read a copy
    - then, loop until lock becomes free
  - vs. unlimited with TSL
    - Each attempt generates a BusRdX

## LL/SC

- TTSL is an improvement over TSL.
- But bus-based locking
  - has a limited applicability (explain)
  - is not scalable with fine-grain locks (explain)
- Suppose we could lock a cache block instead of a bus ...
  - Expensive, must rely on buffering or NACK to prevent a line from being stolen by another processor.
- Instead of providing atomicity, can we provide an illusion of atomicity?
  - This would involve detecting a violation of atomicity.
  - If something "happens to" the value loaded, cancel the store (because we must not allow the newly stored value to become visible to other processors)
  - Go back and repeat all other instructions (load, branch, etc.).

This can be done with two new instructions:

- Load Linked/Locked (LL)
  - reads a word from memory, and
  - stores the address in a special LL register
- The LL register is cleared if anything happens that may break atomicity, e.g.,
  - A context switch occurs
  - The block containing the address in the LL register is invalidated.
- Store Conditional (SC)
  - tests whether the address in the LL register matches the store address
  - if so, store succeeds: store goes to cache/memory;
  - else, store fails: the store is canceled, 0 is returned.
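To make the LL register rules above concrete, here is a toy, single-threaded software model of one processor's LL register. It is a sketch for reasoning only, not how hardware is built: `cpu_model`, `load_linked`, `clear_link`, `remote_store`, and `store_conditional` are hypothetical names, and the model tracks a single word rather than a whole cache block.

```c
#include <stddef.h>

/* Hypothetical model of one processor's LL register. */
typedef struct {
    const int *link_addr;   /* address recorded by LL, or NULL when cleared */
} cpu_model;

/* LL: read the word and record its address in the LL register. */
static int load_linked(cpu_model *cpu, const int *addr) {
    cpu->link_addr = addr;
    return *addr;
}

/* Any event that may break atomicity (e.g., a context switch) clears the register. */
static void clear_link(cpu_model *cpu) {
    cpu->link_addr = NULL;
}

/* A write by another processor: in this model it invalidates the observer's
 * copy, so the observer's LL register is cleared if it holds this address. */
static void remote_store(cpu_model *observer, int *addr, int value) {
    *addr = value;
    if (observer->link_addr == addr)
        clear_link(observer);
}

/* SC: store only if the LL register still matches; returns 1 on success, 0 on failure. */
static int store_conditional(cpu_model *cpu, int *addr, int value) {
    if (cpu->link_addr != addr)
        return 0;           /* store is canceled; memory is unchanged */
    *addr = value;
    return 1;
}
```

In this model, an LL followed directly by an SC to the same address succeeds, while an intervening `remote_store` or `clear_link` makes the SC return 0 and leave memory untouched.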

Here is the code.

|         | Instruction      | Comment                                |
|---------|------------------|----------------------------------------|
| lock:   | LL R1, &lockvar  | // R1 = lockvar; LINKREG = &lockvar    |
|         | bnz R1, lock     | // jump to lock if R1 != 0             |
|         | add R1, R1, #1   | // R1 = 1                              |
|         | SC R1, &lockvar  | // lockvar = R1;                       |
|         | beqz R1, lock    | // jump to lock if SC fails            |
|         | ret              | // return to caller                    |
| unlock: | sti &lockvar, #0 | // lockvar = 0                         |
|         | ret              | // return to caller                    |

Note that this code, like the TTSL code, consists of two loops. Compare each loop with its TTSL counterpart.

- The first loop
- The second loop
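C does not expose LL and SC directly. A commonly used stand-in is a weak compare-and-exchange, which compilers for LL/SC-style architectures typically implement with a load-linked/store-conditional pair and which, like SC, is allowed to fail even when the value has not changed. The sketch below follows the two-loop structure of the code above under that assumption; `llsc_lock` and the function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int lockvar; } llsc_lock;   /* 0 = free, 1 = held */

/* One pass through both tests; returns true if the lock was acquired. */
static bool llsc_try(llsc_lock *l) {
    int expected = 0;
    /* First loop's test: read the lock and give up this pass if it is held. */
    if (atomic_load_explicit(&l->lockvar, memory_order_relaxed) != 0)
        return false;
    /* Second loop's test: conditional store of 1; may fail, like a failed SC. */
    return atomic_compare_exchange_weak_explicit(
        &l->lockvar, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void llsc_acquire(llsc_lock *l) {
    while (!llsc_try(l)) { }   /* retry from the load, as the assembly does */
}

static void llsc_release(llsc_lock *l) {
    atomic_store_explicit(&l->lockvar, 0, memory_order_release);
}
```

Because the weak form may fail spuriously, callers should always retry in a loop, as `llsc_acquire` does.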

Here is a trace of execution. Compare it with TTSL.

| Request    | P1 | P2 | P3 | BusRequest |
|------------|----|----|----|------------|
| Initially  | -  | -  | -  | -          |
| P1: LL     | E  | -  | -  | BusRd      |
| P1: SC     | M  | -  | -  | -          |
| P2: LL     | S  | S  | -  | BusRd      |
| P3: LL     | S  | S  | S  | BusRd      |
| P2: LL     | S  | S  | S  | -          |
| P1: unlock | M  | I  | I  | BusUpgr    |
| P2: LL     | S  | S  | I  | BusRd      |
| P2: SC     | I  | M  | I  | BusUpgr    |
| P3: LL     | I  | S  | S  | BusRd      |
| P3: LL     | I  | S  | S  | -          |
| P2: unlock | I  | M  | I  | BusUpgr    |
| P3: LL     | I  | S  | S  | BusRd      |
| P3: SC     | I  | I  | M  | BusUpgr    |
| P3: unlock | I  | I  | M  | -          |

- Similar bus traffic
  - Spinning using loads $\Rightarrow$ no bus transactions when the lock is not free
  - Successful lock acquisition involves two bus transactions. What are they?
- But a failed SC does not generate a bus transaction (in TTSL, all test&sets generate bus transactions).
  - Why don't SCs fail often?

#### Limitations of LL/SC

- Suppose a lock is highly contended by *p* threads
  - There are *O*(*p*) attempts to acquire and release a lock
  - A single release invalidates *O*(*p*) caches, causing *O*(*p*) subsequent cache misses
  - Hence, each critical section causes $O(p^2)$ network traffic
- Fairness: There is no guarantee that a thread that contends for a lock will eventually acquire it.

These issues can be addressed by two different kinds of locks.

#### **Ticket Lock**

- Ensures fairness, but still incurs O(p<sup>2</sup>) traffic
- Uses the concept of a "bakery" queue
- A thread attempting to acquire a lock is given a ticket number representing its position in the queue.
- Lock acquisition order follows the queue order.

#### Implementation:

```
void ticketLock_init(int *next_ticket, int *now_serving) {
    *now_serving = *next_ticket = 0;
}

void ticketLock_acquire(int *next_ticket, int *now_serving) {
    int my_ticket = fetch_and_inc(next_ticket);  // atomically take the next ticket
    while (*now_serving != my_ticket) {};        // spin until it is our turn
}

void ticketLock_release(int *next_ticket, int *now_serving) {
    (*now_serving)++;  // parentheses are needed: *now_serving++ would advance the pointer instead
}
```

Trace:

| Steps             | next_ticket | now_serving | my_ticket (P1) | my_ticket (P2) | my_ticket (P3) |
|-------------------|-------------|-------------|----------------|----------------|----------------|
| Initially         | 0           | 0           | -              | -              | -              |
| P1: fetch&inc     | 1           | 0           | 0              | -              | -              |
| P2: fetch&inc     | 2           | 0           | 0              | 1              | -              |
| P3: fetch&inc     | 3           | 0           | 0              | 1              | 2              |
| P1: now_serving++ | 3           | 1           | 0              | 1              | 2              |
| P2: now_serving++ | 3           | 2           | 0              | 1              | 2              |
| P3: now_serving++ | 3           | 3           | 0              | 1              | 2              |

Note that fetch&inc can be implemented with LL/SC.
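One way to sketch this in C11 is a retry loop around a weak compare-and-exchange, which plays the role of the LL/SC pair here (an assumption: C has no direct LL/SC, and the compiler chooses the actual instructions). The `ticket_lock` type and the `ticket_*` helper names are hypothetical and are included only so the trace above can be replayed.

```c
#include <stdatomic.h>

/* fetch&inc built from a load and a conditional store that is retried on failure. */
static int fetch_and_inc(atomic_int *addr) {
    int old = atomic_load_explicit(addr, memory_order_relaxed);   /* like LL */
    /* Like SC: store old+1 only if *addr still equals old. On failure the
     * call refreshes old with the current value and the loop retries. */
    while (!atomic_compare_exchange_weak_explicit(
               addr, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) { }
    return old;   /* the value before the increment */
}

typedef struct {
    atomic_int next_ticket;   /* next ticket to hand out */
    atomic_int now_serving;   /* ticket currently allowed into the critical section */
} ticket_lock;

static int ticket_take(ticket_lock *l) {
    return fetch_and_inc(&l->next_ticket);
}

static void ticket_wait(ticket_lock *l, int my_ticket) {
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my_ticket) { }
}

static void ticket_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Splitting acquisition into `ticket_take` and `ticket_wait` is a presentation choice: it separates the single atomic operation from the read-only spinning that follows it.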

## Array-Based Queueing Locks

With a ticket lock, a release still invalidates O(p) caches.

*Idea:* Avoid this by letting each thread wait for a unique variable. Waiting processes poll on different locations in an array of size *p*.

Just change now\_serving to an array! (renamed "can\_serve").

A thread attempting to acquire a lock is given a ticket number in the queue.

Lock acquisition order follows the queue order

- Acquire
  - fetch&inc obtains the address on which to spin (the next array element).
  - We must ensure that these addresses are in different cache lines or memories
- Release
  - Set next location in array to 1, thus waking up process spinning on it.

Advantages and disadvantages:

- O(1) traffic per acquire with coherent caches
  - And each release invalidates only one cache.
- FIFO ordering, as in ticket lock, ensuring fairness
- But, O(p) space per lock
- Good scalability for bus-based machines

#### Implementation:

<pre>
ABQL_init(int *next_ticket, int *can_serve) {
    *next_ticket = 0;
    for (i = 1; i &lt; MAXSIZE; i++)
        can_serve[i] = 0;
    can_serve[0] = 1;   // the first ticket holder may enter immediately
}
</pre>

<pre>
ABQL_acquire(int *next_ticket, int *can_serve) {
    *my_ticket = fetch_and_inc(next_ticket) % MAXSIZE;  // my_ticket is thread-private
    while (can_serve[*my_ticket] != 1) {};              // spin on our own array element
}
</pre>

<pre>
ABQL_release(int *next_ticket, int *can_serve) {
    can_serve[(*my_ticket + 1) % MAXSIZE] = 1;  // wake the next waiter, wrapping around the array
    can_serve[*my_ticket] = 0;                  // prepare for next time
}
</pre>

Trace:

| Steps              | next_ticket | can_serve[]  | my_ticket (P1) | my_ticket (P2) | my_ticket (P3) |
|--------------------|-------------|--------------|----------------|----------------|----------------|
| Initially          | 0           | [1, 0, 0, 0] | -              | -              | -              |
| P1: f&i            | 1           | [1, 0, 0, 0] | 0              | -              | -              |
| P2: f&i            | 2           | [1, 0, 0, 0] | 0              | 1              | -              |
| P3: f&i            | 3           | [1, 0, 0, 0] | 0              | 1              | 2              |
| P1: can_serve[1]=1 | 3           | [0, 1, 0, 0] | 0              | 1              | 2              |
| P2: can_serve[2]=1 | 3           | [0, 0, 1, 0] | 0              | 1              | 2              |
| P3: can_serve[3]=1 | 3           | [0, 0, 0, 1] | 0              | 1              | 2              |

Let's compare array-based queueing locks with ticket locks.

Fill out this table, assuming that 10 threads are competing:

|                              | Ticket locks         | Array-based queueing locks |
|------------------------------|----------------------|----------------------------|
| # of invalidations           | 9                    | 9                          |
| # of subsequent cache misses | 9 + 8 + ... + 1 = 45 | 1 + 1 + ... + 1 = 9        |
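The cache-miss counts in this comparison can be checked with a short calculation. This is one reading of the numbers, assuming $p = 10$ threads, 9 hand-offs of the lock, and that after each hand-off every thread still waiting on a ticket lock misses once, whereas with an array-based queueing lock only the next thread in line misses:

$$\sum_{k=1}^{p-1} k = \frac{p(p-1)}{2} = \frac{10 \cdot 9}{2} = 45, \qquad \sum_{k=1}^{p-1} 1 = p - 1 = 9.$$

The first sum grows as $O(p^2)$ and the second as $O(p)$, which matches the traffic claims made for the two locks.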

## Comparison of lock implementations

| Criterion             | TSL                   | TTSL                  | LL/SC                 | Ticket                | ABQL                  |
|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
| Uncontended latency   | Lowest                | Lower                 | Lower                 | Higher                | Higher                |
| 1 release max traffic | <i>O</i>(<i>p</i>)    | <i>O</i>(<i>p</i>)    | <i>O</i>(<i>p</i>)    | <i>O</i>(<i>p</i>)    | <i>O</i>(1)           |
| Wait traffic          | High                  | Low                   | -                     | -                     | -                     |
| Storage               | <i>O</i>(1)           | <i>O</i>(1)           | <i>O</i>(1)           | <i>O</i>(1)           | <i>O</i>(<i>p</i>)    |
| Fairness guaranteed?  | No                    | No                    | No                    | Yes                   | Yes                   |

Discussion:

- Design must balance latency vs. scalability
  - ABQL is not necessarily best.
  - Often LL/SC locks perform very well.
  - Scalable programs rarely use highly-contended locks.
- Fairness sounds good in theory, but
  - Must ensure that the current/next lock holder does not suffer from context switches or any long delay events


## Barriers

[§8.2] Like locks, barriers can be implemented in different ways, depending upon how important efficiency is.

- Performance criteria
  - Latency: time spent from reaching the barrier to leaving it
  - Traffic: number of bytes communicated as a function of number of processors
- In current systems, barriers are typically implemented in software using locks, flags, counters.
  - Adequate for small systems
  - Not scalable for large systems

A thread might have this general organization:

```
parallel region
BARRIER
parallel region
BARRIER
```

Note that barriers are usually constructed using locks, and thus can use any of the lock implementations in the previous lecture.

A barrier can be implemented like this (first attempt):

```
// shared variables used in barrier & their initial values
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;
```

```
// barrier implementation
void barrier () {
    lock(&barLock);
    if (numArrived == 0) // first thread sets flag
        canGo = 0;
    numArrived++;
    int myCount = numArrived;
    unlock(&barLock);

    if (myCount < NUM_THREADS) {
        while (canGo == 0) {}; // wait for last thread
    }
    else { // this is the last thread to arrive
        numArrived = 0; // reset for next barrier
        canGo = 1; // release all threads
    }
}
```

What's wrong with this? Suppose the first thread to leave the first barrier arrives at another barrier (or at this barrier again) and sets canGo back to 0 before all of the other threads notice that it has become 1. Those threads never see the release and remain stuck at the first barrier.

#### Sense-reversal centralized barrier

[§8.2.1] The simplest solution to the correctness problem above just toggles the barrier ...

- the first time, the threads wait for canGo to become 1;
- the next time they wait for it to become 0;
- and then they alternate waiting for it to become 1 and 0 at successive barriers.

Here is the code:

```
// variables used in a barrier and their initial values
int numArrived = 0;
lock_type barLock = 0;
int canGo = 0;

// thread-private variable
int valueToAwait = 0;

// barrier implementation
void barrier () {
    valueToAwait = 1 - valueToAwait; // toggle it
    lock(&barLock);
    numArrived++;
    int myCount = numArrived;
    unlock(&barLock);

    if (myCount < NUM_THREADS) {
        while (canGo != valueToAwait) {}; // await last thread
    }
    else { // this is the last thread to arrive
        numArrived = 0; // reset for next barrier
        canGo = valueToAwait; // release all threads
    }
}
```

How does the traffic at this barrier scale? Each thread increments numArrived, which forces all other threads to re-cache the block (canGo is probably in the same block). So the traffic is $O(p^2)$.

#### Combining-tree barrier

[§8.2.2] A tree-based strategy can be used to reduce contention, similarly to the way we used partial sums in Lecture 6.

- Threads represent the leaf nodes of a tree.
- The non-leaf nodes are the variables that the threads spin on.
- Each thread spins on the variable of its immediate parent, which constitutes an intermediate barrier.
- Once all threads have arrived at the intermediate barrier, one of these threads goes on and spins on the variable immediately above.
- This is repeated until the root is reached. At this point, the root releases all threads by setting a flag (or by propagating the release flag all the way down the tree to the leaf nodes).

How does this improve performance? No barrier has more than a few threads spinning on it. So the number of invalidations is reduced to O(p) per barrier.

But there is an offsetting cost to a combining tree. What is it? Latency is higher, because threads need to traverse log(p) barriers to know that all threads have reached the barrier.

Lecture 20

© 2024 Edward F. Gehringer

CSC/ECE 506 Lecture Notes, Spring 2024

[§8.2.3] In very large supercomputers, however, this technique does not suffice.

The BlueGene/L system has a special *barrier network* for implementing barriers and broadcasting notifications to processors.

The network contains four independent channels.

Each level does a global AND of the signals from the levels below it.

The signals are combined in hardware and propagate to the top of a combining tree.



The tree can also be used to do a global interrupt when the entire machine or partition must be stopped as soon as possible "for diagnostic purposes."

In this case, each level does a global OR of the signals from beneath.

Once the signal propagates to the top of the tree, the resultant notification is broadcast down the tree.

The round-trip latency is only 1.5 µs for a system of 64K nodes.
