## 🔍 Overview

High performance computer architecture is like the coal engine room of a ship. It graciously allows everything beautiful, greedy, and consumerist to thrive on the ship (web developers (/s)), while the engine is isolated and must function.

This involves the design of task scheduling, caching, concurrency, energy consumption, and specialization for certain computation.

This course provides an introduction to instruction pipelining, task scheduling, caching, concurrency, and other topics, with a look at some of the finer details of implementation. Coding is done in SESC, a multiprocessor chip simulator written in C++. Lectures cover specific algorithms, their tradeoffs in speed and memory, as well as edge cases that must be accounted for.

*I took this course in Spring 2020; contents may have changed since then.*

## 🏢 Structure

• Four Projects - 50% (of the final grade)
• Midterm - 20%
• Final - 30%

### Projects

Each of these assignments relies on using the architecture simulator, SESC:

1. Intro to SESC - Getting our feet wet with the simulator and its outputs
2. Branch Prediction - Implementing changes to SESC to analyze branch prediction behavior
3. Cache Misses - SESC changes to classify cache misses
4. Cache Coherence - SESC changes to classify multiprocessor cache misses

### Exams

The course content, including the exams, follows the Udacity course identically. While the project assignments do not cover all these topics, I believe almost everything was fair game for the exams.

## 💻 Project 1 - Introduction to SESC

At the core of developing and examining computer architecture is determining performance. While there are many metrics to be considered in practice, we consider methods that optimize for speed and cycle efficiency.

The Iron Law of Processor Performance is

CPU Time = $\frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Time}{Cycle}$

Instructions per program, the code size, can be optimized from the compiler side. Cycles per instruction (CPI) is optimized through processor design. Time per cycle is determined by the hardware from chip design. While we touch on compilers and hardware a bit, the main focus of this course is processor design.

Another important concept is Amdahl's Law, which measures the overall speedup of an application given the fraction of the program that would be affected and the improved rate.

Speedup = $\frac{1}{(1-p)+\frac{p}{N}}$, where $p$ is the fraction of the overall execution time that is affected by the enhancement, and $N$ is the rate of improvement.

For example, we can compare two possible upgrades to otherwise identical systems. The first is a processor that is 1.5x the speed of the original and affects 70% of the overall execution time; the second is a hard drive that is 3x the speed of the original and affects 20% of the overall execution time.

Processor improvement = $\frac{1}{(1-0.7)+\frac{0.7}{1.5}}$ = 1.304 speedup

Hard drive improvement = $\frac{1}{(1-0.2)+\frac{0.2}{3}}$ = 1.15 speedup

You can also apply this to other comparisons, like finding the speedup given the percentage of a program that is parallelizable and the number of threads, or even optimizing for effort spent in your daily life. It's okay: you don't need to spend so much effort on anger towards XYZ; it's a small weight on your life experience.

Instruction Pipelining enables instruction-level parallelism within a single processor. At the very high level, to execute a single assembly instruction, we need to fetch, decode, execute with the ALU, and write back the result to the register. We can keep all these components busy by pipelining the instructions, as visualized below.

Writing back a result can take more than one cycle, so oftentimes, if an instruction depends on the previous result, the pipeline will have to stall or be flushed and start over. Through clever pipeline designs, prediction methods, and compiler optimizations, we can reduce the chance and impact of this happening.

### Results

We are given a virtual machine image to standardize the environment for running the SuperESCalar Simulator (SESC). SESC simulates processors that run MIPS assembly, and we did have to examine a C application's disassembly. We primarily play around with different configurations for this project, and change the source code in the later projects.

For example, one of the benchmark tests is LU Decomposition, and here we run our simulator under configuration file 'cmp4-noc.conf' with setting '-n64', meaning the test is done on a 64x64 matrix. After a simulation completes, a report is generated, and can then be summarized.

On the top row, we can see 'Exe Time', and 'Sim Time', which are 2.870s and 1.522ms, respectively, meaning it took 2.87 seconds to run the simulation on the machine, and the LU decomposition would have taken 1.522ms given a real processor under the same configuration.

Two other notable metrics, on the middle left, are 'nInst' and 'Cycles', which are 1720455 and 1521941. This corresponds to the total number of MIPS instructions executed, as well as the total cycles, which results in 1.13 instructions per cycle (IPC). I'll continue to address other metrics in later sections.

We can run this same simulated processor on LU Decomposition, but on a 128x128 matrix, and we can see here that it took almost 10x more instructions to complete.

## 💻 Project 2 - Branch Prediction

A branch is an instruction that can cause the processor to start a different instruction sequence. In an 'if else', depending on the condition result, one of two sequences will execute. Switching to an unexpected sequence can cause a pipeline flush, and branches account for roughly 20% of all instructions, so mispredictions can have a major impact on processor performance.

Branch prediction is a strategy to estimate whether or not a branch will be taken. There are many techniques that we discuss in this course, but I'll just touch on a simple implementation, the 1-bit predictor.

A branch history table (BHT) stores the result of the recent branches, whether they were taken (T) or not taken (NT). So if we have two branch instructions, at address 0xABCD and 0xBCDE, and they were recently T and NT, respectively, then our BHT would look like this:

```
// BHT
0xABCD: 1 // Taken
0xBCDE: 0 // Not Taken
```

The next time we see these instructions, the following sequence will reflect the taken or not taken, so we can hopefully avoid a pipeline flush. We can demonstrate this in a simple program.

```c
void foo() {
    for (int i = 0; i < HUGE_NUMBER; i++) {
        // BHT 0xABCD: 1, 'else' instructions are loaded
        if (false) { } else { }
        // BHT 0xBCDE: 0, first instructions are loaded
        if (true) { } else { }
    }
}
```

Let's say our sequence for one branch is T, NT, T, NT, T, NT, and so on. This would be common for a counter with the condition i % 2 == 0. Our 1-bit predictor will get every result wrong, but there is clearly a pattern we should recognize.

Branch prediction can be far more effective than the 1-bit BHT, which cannot recognize even simple patterns. Other strategies include adding more bits, branch predication (executing both outcomes and throwing away the unused one), and complex yet efficient layers of predictors. Speculative execution driven by branch prediction is also what enabled the Spectre and Meltdown attacks to occur.

### Results

For this project, we test different branch prediction strategies, and observe some of their metrics by changing the simulator code.

We're now working with a different benchmark, 'raytrace', and our branch predictor strategy is 'NotTaken', which means we always predict that we will not branch. The simulation summary is shown below. We can see its performance on the second row of values, which starts with 'BPType' 'NotTaken'. The 'Total: 49.63%' represents how many branches were correctly predicted overall, and further down the row, 'BPred: 44.39' is the accuracy of the branch prediction strategy itself. There were a total of 402 million cycles to execute 207 million instructions.

Now, we can run this again with the branch strategy 'Hybrid', which is a tournament predictor (I haven't touched on this, but you can learn more here). Reviewing those same metrics, our BPred is now 84.18% accurate! This reduces the cycle count from 402 million to 250 million to execute the same number of instructions.

The task for the assignment was to track the accuracy of the predictor across varied branch frequencies. Let's say we have branch instructions that each executed 10 times in total: how would a branch predictor perform on these compared to branches that executed over 1000 times? Which would be more important to optimize for overall execution time?

Below is part of the source code we need to alter to keep track of these statistics, though I removed all portions from my submission.

```cpp
// Implementation for the Hybrid predictor
PredType BPHybrid::predict(const Instruction *inst, InstID oracleID, bool doUpdate)
{
    bpredEnergy->inc();

    if( inst->isBranchTaken() )
        return btb.predict(inst, oracleID, doUpdate);

    bool taken = (inst->calcNextInstID() != oracleID);

    ... // Many more lines of code

    // Tournament predictor uses local and global tables
    bool ptaken = metaOut ? localTaken : globalTaken;

    if (taken != ptaken) {
        if (doUpdate)
            btb.updateOnly(inst, oracleID);
        return MissPrediction;
    }

    return ptaken ? btb.predict(inst, oracleID, doUpdate) : CorrectPrediction;
}
```

## 💻 Project 3 - Cache Misses

A cache is a small memory component that enables fast data lookup, which saves time compared to directly reading from main memory. When accessing memory and the data exists in the cache, this is a cache hit, so you can quickly return the requested data. If it does not exist, this is a cache miss, so you have to instead fetch from main memory.

Least Recently Used (LRU) is a common strategy to determine what to keep and kick out from the cache. When the cache is full, the data that has been untouched for the longest period will be overwritten by the next memory access. The intuition of this is that recent data tends to be looked at repeatedly- this can be verified by instruction statistics.

There are two cache miss types we'll consider: compulsory and capacity.

A compulsory miss occurs when we have never accessed that data before: even if the cache were infinitely sized, this miss would still occur.

A capacity miss is a miss that would still occur even in a fully associative LRU cache of the same overall capacity. Cache associativity determines how memory addresses are indexed into the cache. This graphic nicely compares various settings.

### Results

The major requirement of this project involved maintaining a count of compulsory and capacity misses. This involves maintaining our own data structures to distinguish between the miss types.

```cpp
// Method to read from a cache. Many lines from the original source are omitted
{
    if (l && l->canBeRead()) {
        mreq->goUp(hitDelay);
        return;
    }

    if (l && l->isLocked()) {
        Time_t nextTry = nextSlot();
        if (nextTry == globalClock)
            nextTry++;
        return;
    }

    GI(l, !l->isLocked());
}
```

I won't comment much beyond this, as learning about the code was part of the task at hand.

## 💻 Project 4 - Cache Coherence

Multiprocessor systems have cores which maintain their own cache. This means that if we naively use these caches within a parallel execution which operates on shared data, the data in the caches will be inconsistent. This is called an incoherent system.

Cache coherence involves the process and structure for multiple cores to read and write consistent data. A coherent system has the following requirements:

• If a core reads a memory location, the data it receives was written by the last valid write
• If a core writes to a memory location, when another core reads that same memory location, it receives the same value
• All cores should agree on the order of writes to a memory location

The simplest solution to this is to avoid caches altogether, and only read/write to shared memory. This would clearly result in very slow execution times.

One strategy is to use the MSI Protocol, which uses a directory to maintain a state for every cache block, tracked per core. There are three states a block can be in:

• Modified: The block has been modified in the current core, and is inconsistent with other caches as well as the main memory. In this case, we would need to write the data back to memory before evicting the block.
• Shared: The block is unmodified and may also exist in other cores' caches. We can safely replace the data without writing back to memory.
• Invalid: The block is either not present in the current cache or has been modified by another core. If we want to read this block, we need to fetch the data from another cache or the main memory.

Local refers to operations done by the current core, and remote refers to a different core and cache accessing the data. For more details, you can watch this lecture from the course.

### Results

The main requirement for this project was similar to the last: maintaining our own data structure to classify and count coherence misses for multiprocessor simulation. This required many more modifications, as we had to deal with methods beyond just cache reading and writing. Below is the source code for invalidating a cache block.

```cpp
void SMPCache::realInvalidate(PAddr addr, ushort size, bool writeBack)
{
    while(size) {

        Line *l = cache->findLine(addr);

        if (l) {
            nextSlot(); // counts for occupancy to invalidate line
            IJ(l->isValid());
            if (l->isDirty()) {
                invalDirty.inc();
                if(writeBack)
                    ; // (write-back call omitted)
            }
            l->invalidate();
        }
        size -= cache->getLineSize();
    }
}
```

## ✏️ Exams

As mentioned before, exam content follows the course structure and Udacity lectures. Here are some of the other topics:

Instruction Level Parallelism (ILP) - Further strategies to improve pipeline efficiency, including reordering execution based on dependencies. This includes data structures like the re-order buffer and methods like Tomasulo's Algorithm

Compiler ILP - Strategies the compiler can use to achieve better ILP, including instruction scheduling and loop unrolling

Virtual Memory - Techniques and data structures to translate a program's view of memory with the processor's actual storage

Fault Tolerance - Solutions and tradeoffs between robust systems that can recover from memory loss, specifically by RAID configurations

## ⏩ Next Steps

I walked away mostly unsure about modern approaches to HPCA. While I can now read a few blogs and papers and have a general familiarity with the terms, I can't say with much confidence what a person might work with on the job. Still, understanding what goes on under the hood will help you gauge your software's performance and optimize your designs in high-level languages.