Previous Course Projects

Implementing Garbage Collection in Hardware

Garbage collection (GC) in managed languages is typically a software process responsible for reclaiming heap-allocated objects once a program no longer needs them. Even on multi-threaded or multi-core processors, the general-purpose cores themselves are typically in charge of running the GC code. Because GC tracing is an inherently parallel, pointer-chasing task, a simple pipelined core can be poorly suited to it. Accordingly, for workloads that spend a large proportion of their CPU cycles on GC, making collection more efficient can improve both processor performance and energy consumption.

In this paper, we propose a dedicated hardware garbage collection module integrated into a RISC-V RocketChip SoC. By examining its performance and physical characteristics, we assess the trade-offs and potential utility of a dedicated hardware garbage collector for certain workloads. Based on the work of Maas et al., we designed our own GC unit that implements parallel marking and tracing with a block-based sweeper. We also propose a new object layout that exploits shared knowledge between the allocator and the GC hardware. We completed functional verification using Verilator test benches and synthesized our design using the Cadence Genus Synthesis Tool. Finally, we built a performance model from our microbenchmark results to compare against software-based GC on OpenJDK’s JVM running workloads from the DaCapo Java benchmark suite. We found an average performance improvement of 1.62x over software-based GC with an 11.3% increase in SoC area. We demonstrate that memory management is well suited to parallel hardware acceleration and compare our performance, timing, power, and area against the hardware of Maas et al.
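
As a point of reference for what the hardware unit parallelizes, below is a minimal software sketch of one mark-and-sweep cycle. The Object layout, the flat heap list, and the single-threaded worklist are illustrative assumptions; they do not reproduce the object layout or block-based sweeper described above.

    #include <vector>

    // Illustrative object header: a mark bit plus references to other objects.
    // The actual layout shared between the allocator and the GC hardware
    // differs; this only sketches the mark/sweep control flow.
    struct Object {
        bool marked = false;
        std::vector<Object*> refs;   // outgoing pointers discovered by tracing
    };

    // Mark phase: trace everything reachable from the roots using a worklist.
    // The hardware unit runs several such tracers in parallel.
    void mark(const std::vector<Object*>& roots) {
        std::vector<Object*> worklist(roots.begin(), roots.end());
        while (!worklist.empty()) {
            Object* obj = worklist.back();
            worklist.pop_back();
            if (obj == nullptr || obj->marked) continue;
            obj->marked = true;
            for (Object* child : obj->refs) worklist.push_back(child);
        }
    }

    // Sweep phase: walk the heap and reclaim unmarked objects. Here "heap" is
    // just a flat list of allocations standing in for the block-based
    // sweeper's view of memory.
    void sweep(std::vector<Object*>& heap) {
        for (Object*& obj : heap) {
            if (obj == nullptr) continue;
            if (obj->marked) {
                obj->marked = false;   // clear for the next collection cycle
            } else {
                delete obj;            // reclaim unreachable object
                obj = nullptr;
            }
        }
    }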

Custom RISC-V Vector Instruction for Computing Cosine Similarity

Cosine similarity accounts for a significant portion of the work in many vector-based applications, particularly in machine learning and natural language processing. This project proposes adding a custom instruction to the RISC-V Vector ISA extension for efficiently computing cosine similarity, improving performance for these workloads without incurring high power or hardware costs. PARSEC microbenchmarks run on a core augmented with the cosine-similarity functional unit show a 7.95% increase in computations per second. The functional unit was designed at the RTL level in Verilog, and synthesis shows it has an area of 0.213 mm^2 and a power consumption of 230 mW.
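
For reference, the scalar computation that the proposed instruction targets is sketched below; the function name and the vectorization note are illustrative, not the actual instruction semantics.

    #include <cmath>
    #include <cstddef>

    // Scalar reference for the operation the proposed vector instruction fuses:
    // cos(a, b) = (a . b) / (|a| * |b|).
    // A vector implementation would strip-mine the loop and keep the three
    // running sums in vector registers.
    double cosine_similarity(const float* a, const float* b, std::size_t n) {
        double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            dot    += static_cast<double>(a[i]) * b[i];
            norm_a += static_cast<double>(a[i]) * a[i];
            norm_b += static_cast<double>(b[i]) * b[i];
        }
        return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
    }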

Implementing and Integrating the Perceptron Predictor on the Berkeley Out-of-Order Machine (BOOM) RISC-V core

We evaluate the performance of the Perceptron branch predictor, a foundational design that has inspired many subsequent predictors. The Perceptron predictor was introduced to overcome the limitations of traditional table-based predictors such as Two-Level Adaptive or GShare, especially on complex branch patterns. Unlike predictors that rely solely on saturating counters, the Perceptron predictor uses a weighted history-based scheme, allowing it to effectively learn correlations between global branch history and branch outcomes. Despite its potential, the Perceptron predictor’s practical adoption has been limited by concerns about hardware complexity and implementation overhead. By integrating and analyzing the Perceptron branch predictor in the BOOM microarchitecture, we aim to assess its viability as a high-accuracy alternative and evaluate the trade-offs involved in using machine-learning-based branch prediction in modern processors.
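
To make the weighted history-based scheme concrete, here is a minimal software sketch of a perceptron predictor in the style of Jimenez and Lin. The table size, history length, and threshold constant are illustrative choices, not the configuration integrated into BOOM.

    #include <array>
    #include <cstdint>
    #include <cstdlib>

    // One weight vector per table entry, indexed by a hash of the PC.
    // Prediction is the sign of a dot product between the weights and the
    // global history (+1 taken, -1 not taken); training nudges the weights
    // when the prediction is wrong or the sum is below a threshold.
    constexpr int HIST_LEN   = 16;
    constexpr int TABLE_SIZE = 1024;
    constexpr int THRESHOLD  = static_cast<int>(1.93 * HIST_LEN + 14);  // Jimenez & Lin

    struct Perceptron {
        std::array<int, HIST_LEN + 1> w{};   // w[0] is the bias weight
    };

    std::array<Perceptron, TABLE_SIZE> table;
    std::array<int, HIST_LEN> history{};     // +1 taken, -1 not taken

    int output(uint64_t pc) {
        Perceptron& p = table[pc % TABLE_SIZE];
        int y = p.w[0];
        for (int i = 0; i < HIST_LEN; ++i) y += p.w[i + 1] * history[i];
        return y;
    }

    bool predict(uint64_t pc) { return output(pc) >= 0; }

    void train(uint64_t pc, bool taken) {
        int y = output(pc);
        int t = taken ? 1 : -1;
        Perceptron& p = table[pc % TABLE_SIZE];
        if ((y >= 0) != taken || std::abs(y) <= THRESHOLD) {
            p.w[0] += t;
            for (int i = 0; i < HIST_LEN; ++i) p.w[i + 1] += t * history[i];
        }
        // Shift the new outcome into the global history register.
        for (int i = HIST_LEN - 1; i > 0; --i) history[i] = history[i - 1];
        history[0] = t;
    }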

A Survey of Value Prediction in Computer Architecture

Value prediction in computer architecture is a speculative technique aimed at reducing pipeline stalls caused by data dependencies and long-latency operations, such as L2 cache misses and floating-point divisions. By leveraging historical trends, value prediction allows processors to predict the values consumed by dependent instructions, similar in spirit to branch prediction. This paper presents a survey of value prediction techniques, including last-value, stride-based, and context-based prediction schemes. We discuss the design and implementation of a Load Value Prediction Unit (LVPU) in the gem5 simulator and evaluate its impact on performance using benchmarks from the SPEC2017 suite. Our results demonstrate the potential for small performance improvements, particularly with stride-based and context-based predictors. However, the high penalties for mispredictions underscore the importance of accurate prediction mechanisms and recovery strategies. This work highlights the promise of value prediction as a means to enhance processor efficiency while identifying challenges that warrant further research.
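
As an illustration of the stride-based scheme discussed above, the following is a minimal sketch of a per-PC stride predictor with a confidence counter. The table organization, thresholds, and use of an unordered map are simplifying assumptions rather than the LVPU design evaluated in gem5.

    #include <cstdint>
    #include <unordered_map>

    // Per static load (keyed by PC): remember the last value and the last
    // observed stride, and only predict once the same stride has repeated
    // enough times. Misprediction recovery is handled elsewhere.
    struct StrideEntry {
        uint64_t last_value = 0;
        int64_t  stride     = 0;
        int      confidence = 0;          // small saturating confidence counter
    };

    class StridePredictor {
    public:
        // Returns true and fills `prediction` only when confidence is high enough.
        bool predict(uint64_t pc, uint64_t& prediction) const {
            auto it = table_.find(pc);
            if (it == table_.end() || it->second.confidence < kThreshold) return false;
            prediction = it->second.last_value + it->second.stride;
            return true;
        }

        // Called when the load's actual value is known (e.g. at commit).
        void update(uint64_t pc, uint64_t actual) {
            StrideEntry& e = table_[pc];
            int64_t new_stride = static_cast<int64_t>(actual - e.last_value);
            if (new_stride == e.stride) {
                if (e.confidence < kMax) ++e.confidence;   // stride confirmed
            } else {
                e.stride = new_stride;                     // learn the new stride
                e.confidence = 0;                          // and reset confidence
            }
            e.last_value = actual;
        }

    private:
        static constexpr int kThreshold = 3;
        static constexpr int kMax = 7;
        std::unordered_map<uint64_t, StrideEntry> table_;
    };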

Memory Dependence Prediction using Store Sets and Store Vectors

Out-of-order superscalar processors present opportunities for improving functional unit utilization and performance by re-ordering loads and stores. However, the processor has no way of knowing in advance whether a load will read from the same address that an earlier store will write to. Past research has explored mechanisms such as Store Vectors and Store Sets to predict memory dependencies. This study implements Store Vectors, Store Sets with 2-bit saturating counters, and naive scheduling strategies in the gem5 simulator. We validate our implementations and evaluate their performance on a subset of SPEC2017 benchmarks and on our own microbenchmarks written in assembly.
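
For context, a minimal sketch of the Store Sets mechanism (an SSIT plus an LFST, after Chrysos and Emer) is shown below. The table sizes, hashing, and the omission of set merging and the 2-bit confidence counters used in this study are simplifying assumptions, not the gem5 implementation described here.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    // SSIT: maps an instruction's PC hash to a store set ID.
    // LFST: maps a store set ID to the most recent in-flight store of that set.
    constexpr std::size_t SSIT_SIZE = 4096;
    constexpr std::size_t LFST_SIZE = 128;
    constexpr uint32_t NO_SET = UINT32_MAX;

    std::array<uint32_t, SSIT_SIZE> ssit;                      // PC hash -> store set ID
    std::array<std::optional<uint64_t>, LFST_SIZE> lfst;       // set ID -> last store's seq num

    void init() { ssit.fill(NO_SET); lfst.fill(std::nullopt); }

    // On a memory-order violation between a load and a store, put both PCs in
    // the same store set so the load is delayed next time.
    void on_violation(uint64_t load_pc, uint64_t store_pc) {
        uint32_t set = static_cast<uint32_t>(load_pc % LFST_SIZE);
        ssit[load_pc % SSIT_SIZE]  = set;
        ssit[store_pc % SSIT_SIZE] = set;
    }

    // A dispatching store records itself as the newest store of its set.
    void dispatch_store(uint64_t store_pc, uint64_t seq_num) {
        uint32_t set = ssit[store_pc % SSIT_SIZE];
        if (set != NO_SET) lfst[set] = seq_num;
    }

    // A dispatching load returns the store it must wait for, if any.
    std::optional<uint64_t> dispatch_load(uint64_t load_pc) {
        uint32_t set = ssit[load_pc % SSIT_SIZE];
        return (set != NO_SET) ? lfst[set] : std::nullopt;
    }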

Optimizing Instruction Scheduling by Integrating Speculative Wakeup and Select-Free Scheduling

In out-of-order CPUs, the critical path containing wakeup and select operations in instruction scheduling significantly constrains both clock frequency and Instructions Per Cycle (IPC) performance. While conventional approaches to pipelining the scheduling logic can alleviate frequency limitations, they introduce substantial performance penalties, leading to 10%-19% IPC degradation. Previous research has proposed effective approaches to mitigate this IPC degradation by focusing independently on either wakeup logic or select logic. This report investigates the integration of Speculative-Wakeup (SW) logic with Select-Free (SF) logic and analyzes its effectiveness. We implemented both techniques and their integration in the gem5 simulator. Our results, based on SPEC-CPU2017 benchmarks, demonstrate that a machine using this integrated approach achieves an IPC that is 10% higher than machines using either SW logic or SF logic alone, highlighting the complementary nature of these two techniques.
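
To illustrate why wakeup and select form a single-cycle critical loop in the baseline scheduler, here is a toy software model of one scheduler cycle: to issue two dependent instructions back to back, the consumer must be woken in the same cycle its producer is selected, so pipelining the loop costs a cycle per dependence edge. The issue-queue representation and consumer lists are illustrative assumptions and do not reflect the speculative-wakeup or select-free logic implemented in gem5.

    #include <vector>

    struct IQEntry {
        int  srcs_outstanding;   // source operands not yet available
        bool issued = false;
    };

    // One scheduler cycle over the issue queue: select ready entries (up to
    // the issue width), then wake consumers of the selected producers so they
    // can be selected in the very next cycle.
    void scheduler_cycle(std::vector<IQEntry>& iq,
                         const std::vector<std::vector<int>>& consumers,
                         int issue_width) {
        std::vector<int> selected;
        for (int i = 0; i < static_cast<int>(iq.size()); ++i) {
            if (!iq[i].issued && iq[i].srcs_outstanding == 0 &&
                static_cast<int>(selected.size()) < issue_width) {
                iq[i].issued = true;          // select
                selected.push_back(i);
            }
        }
        for (int p : selected)                // wakeup: broadcast producer tags
            for (int c : consumers[p])
                --iq[c].srcs_outstanding;
    }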

Replication of Cyclone Instruction Scheduler

This project replicates the Cyclone scheduler within the gem5 framework, integrating pre-scheduling algorithms and selective replay mechanisms to combine the benefits of static and dynamic scheduling. The implementation introduces key components such as the countdown queue, main queue, and timing table, adhering to the scheduler’s theoretical design. Functional validation confirms the scheduler’s reliability across diverse scenarios, including operand readiness, speculative recovery, and replay mechanisms. Performance evaluations using SPEC2017 benchmarks reveal that while the Cyclone scheduler reduces hardware complexity compared to traditional designs, it encounters challenges such as switchback conflicts under memory-intensive workloads, which impact instructions per cycle (IPC). Optimization techniques, including queue width scaling and retiming tables, demonstrate potential IPC improvements of up to 7% in specific benchmarks. These results highlight the Cyclone scheduler’s feasibility as a hardware-efficient alternative while identifying areas for further optimization, particularly in scalability and workload adaptability.
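
The pre-scheduling step can be summarized with a small sketch: a timing table predicts when each source register becomes ready, and the instruction is placed that many cycles ahead in the countdown queue. The data structures and latency handling below are illustrative assumptions and omit the main queue, switchback handling, and selective replay.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct Inst {
        std::vector<int> srcs;   // source registers
        int dest;                // destination register
        int latency;             // predicted execution latency
    };

    // Timing table: register -> predicted cycle at which its value is ready.
    std::unordered_map<int, uint64_t> timing_table;

    // Compute the cycle at which the instruction can issue and record when its
    // result will be available for dependents. The return value determines the
    // slot in the countdown queue (issue - now cycles ahead). Selective replay
    // fires when a predicted latency turns out to be wrong (e.g. a cache miss).
    uint64_t preschedule(const Inst& inst, uint64_t now) {
        uint64_t issue = now;
        for (int r : inst.srcs) {
            auto it = timing_table.find(r);
            if (it != timing_table.end() && it->second > issue) issue = it->second;
        }
        timing_table[inst.dest] = issue + inst.latency;
        return issue;
    }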

Detecting Resource Deadlocks in ICN Topologies

Deadlock mitigation is an important consideration in interconnection network (ICN) design, especially for coherence protocols. Most prior work on deadlock prevention in ICN topologies has focused on mitigating either protocol deadlocks or routing deadlocks. Much less attention has been given to resource deadlocks, which occur only when the number of virtual networks is reduced and different message types share the same buffer; these are the focus of this project. We explore what resource deadlocks are and analyze them formally. We investigate three topologies in depth (fully connected, unidirectional ring, and bidirectional ring) for a simple request-response protocol, and propose four concrete flow control invariants to mitigate these deadlocks. Using formal reasoning and model checking in Murphi, we show that all four strategies work for fully connected and unidirectional ring topologies. However, only one of the strategies works for bidirectional ring topologies.
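
As one concrete example of a flow-control invariant of this kind (not necessarily one of the four proposed above), a shared buffer can reserve a slot for response messages so that requests never occupy the entire buffer and block the responses needed to drain them. The buffer size, message types, and admission rule below are illustrative assumptions.

    #include <cstddef>
    #include <queue>

    // Toy shared buffer in which requests and responses compete for the same
    // slots. Invariant: requests may use at most size - 1 slots, so a response
    // can always make progress and eventually free a request's resources.
    enum class MsgType { Request, Response };

    class SharedBuffer {
    public:
        explicit SharedBuffer(std::size_t size) : size_(size) {}

        // Admission check implementing the reservation invariant.
        bool can_accept(MsgType t) const {
            if (t == MsgType::Response) return buf_.size() < size_;
            return buf_.size() + 1 < size_;    // keep one slot free for responses
        }

        bool push(MsgType t) {
            if (!can_accept(t)) return false;  // sender must stall (flow control)
            buf_.push(t);
            return true;
        }

        void pop() { if (!buf_.empty()) buf_.pop(); }

    private:
        std::size_t size_;
        std::queue<MsgType> buf_;
    };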

DRRIP and BRRIP Caching

Cache systems have limited capacity and inevitably suffer misses, where a requested block of memory is not currently in the cache. A resident block must then be chosen for eviction, and picking the best block to replace is nontrivial. Many heuristic policies implement naive strategies (Least Recently Used, Least Frequently Used), but each of these is vulnerable to complex cache access patterns such as scanning and thrashing, and may be too simple for programs with more sophisticated control flow. To address this, modern approaches use techniques such as re-reference interval prediction and set dueling. DRRIP (Dynamic Re-Reference Interval Prediction) combines many of these techniques, building on NRU (Not Recently Used). Though it is a seemingly more robust policy, it is still a heuristic, and the relative improvement over the naive policies is of interest. We explore the differences between our implementations of DRRIP, SRRIP, BRRIP, and LRU in cache misses and total cycles across benchmarks from the SPEC 2017 benchmark suite. We also verify correctness and examine behavior under particular cache access patterns using our own microbenchmarks, comparing our implementation against the naive policies.
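
The core RRIP mechanics can be sketched briefly: each block carries a small re-reference prediction value (RRPV), the victim is a block predicted to be re-referenced in the distant future, and the insertion RRPV is what distinguishes SRRIP, BRRIP, and DRRIP. The associativity, counter width, and hit-promotion policy below are illustrative assumptions, not the configuration evaluated in the project.

    #include <array>
    #include <cstdint>

    // 2-bit RRPV per way: 0 means "re-reference expected soon", 3 means
    // "distant". SRRIP inserts new blocks with RRPV 2; BRRIP mostly inserts
    // with RRPV 3; DRRIP uses set dueling to choose between the two at run time.
    constexpr int WAYS = 8;
    constexpr uint8_t MAX_RRPV = 3;

    struct CacheSet {
        std::array<uint8_t, WAYS> rrpv;
        CacheSet() { rrpv.fill(MAX_RRPV); }

        // Victim selection: pick a way with RRPV == 3; if none exists,
        // age every block and try again.
        int find_victim() {
            while (true) {
                for (int w = 0; w < WAYS; ++w)
                    if (rrpv[w] == MAX_RRPV) return w;
                for (int w = 0; w < WAYS; ++w) ++rrpv[w];   // age all blocks
            }
        }

        void on_hit(int way)        { rrpv[way] = 0; }             // promote on re-reference
        void on_fill_srrip(int way) { rrpv[way] = MAX_RRPV - 1; }  // SRRIP insertion
    };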

Comparative Study of YAGS and Hybrid Agree-Perceptron Branch Predictors

This project implements and compares two modern branch predictors: a hybrid agree-perceptron predictor and the YAGS (Yet Another Global Scheme) predictor. The hybrid agree-perceptron predictor combines the perceptron predictor’s ability to learn linearly separable functions with the agree predictor’s simpler structure. The YAGS predictor builds upon traditional global-history predictors by maintaining separate caches for taken and not-taken branches that deviate from a simple biased prediction. We evaluate these predictors using SPEC benchmarks (mcf_s, perlbench) to compare their prediction accuracy, performance impact, and sensitivity to hyperparameter tuning. Our analysis aims to identify the strengths and weaknesses of each predictor and determine optimal hyperparameter values for each. We found that the YAGS predictor generally under-performed our two reference predictors (tournament and gem5’s built-in perceptron), while the agree-perceptron predictor was comparable. This suggests that YAGS may be less adept at adapting to complex branch patterns, while the hybrid predictor is more robust.
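
A structural sketch of the YAGS lookup path is given below; the table sizes, tag width, and hash functions are illustrative assumptions, and the update path, the agree-perceptron hybrid, and gem5 integration details are omitted.

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // A bimodal "choice" table gives the default bias per branch; two small
    // tagged caches record only the branches whose behavior deviates from that
    // bias (the taken cache holds exceptions to a not-taken bias, and vice versa).
    constexpr std::size_t CHOICE_SIZE = 4096;
    constexpr std::size_t CACHE_SIZE  = 1024;

    struct TaggedEntry { uint16_t tag = 0; uint8_t ctr = 2; bool valid = false; };

    std::array<uint8_t, CHOICE_SIZE> choice{};     // 2-bit counters; >= 2 predicts taken
    std::array<TaggedEntry, CACHE_SIZE> taken_cache, not_taken_cache;

    bool predict(uint64_t pc, uint64_t ghr) {
        std::size_t c_idx = pc % CHOICE_SIZE;
        std::size_t idx   = (pc ^ ghr) % CACHE_SIZE;
        uint16_t tag      = static_cast<uint16_t>(pc & 0xFF);
        bool bias_taken   = choice[c_idx] >= 2;

        // Consult the cache that stores exceptions to the current bias.
        const TaggedEntry& e = bias_taken ? not_taken_cache[idx] : taken_cache[idx];
        if (e.valid && e.tag == tag) return e.ctr >= 2;   // exception entry hit
        return bias_taken;                                // fall back to the bias
    }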

Dynamic Verification of Untrusted Memory Operations

Advancements in processor architectures and hardware technologies have significantly enhanced processor performance but have simultaneously increased vulnerability to reliability and security issues. The reduction in transistor sizes has made systems more susceptible to transient faults, while the growing complexity of modern multi-core out-of-order (OoO) processors has created additional opportunities for security breaches. One potential solution to mitigate errors is the use of redundancy through an in-order checker processor. In this setup, a trailing checker verifies the entire lifecycle of instructions executed by a leading untrusted processor, ensuring correct execution. While several fault-tolerant models have been proposed, these architectures often trust load values provided by the untrusted processor, introducing potential vulnerabilities. This research proposes a novel architectural framework that comprehensively verifies instructions, including load instructions. To achieve minimal invasiveness during manufacturing and to optimize power efficiency, the checker die is 3D-stacked on top of the untrusted processor’s die. This new architecture enables several innovative use cases, such as enhanced prevention of fabrication attacks and support for more aggressive microarchitectural designs, which can operate securely under the protective framework provided by this approach.

SPEED-ACC: Scalable Parallel and Energy-Efficient DNA Classification Accelerator

The rapid advancement of genomic sequencing technologies has resulted in an explosion of data, creating substantial computational bottlenecks in DNA analysis workloads. Applications such as DNA classification are particularly impacted due to their reliance on intensive, large-scale pattern matching. Existing solutions are increasingly unable to manage the scale and energy demands of these datasets, highlighting the need for architectures that can perform faster and more efficient pattern matching. To address these challenges, we propose SPEED-ACC, a data-optimized, in-memory CAM-based accelerator designed for parallel and energy-efficient DNA classification. SPEED-ACC harnesses the inherent data distribution in genomic datasets through a novel prefix-based indexing and CAM partitioning scheme that reduces the active search space, enabling significant scalability. We demonstrate results for SPEED-ACC on both 10T-cell and emerging 1T2M-cell CAM designs. Our experimental evaluations show that a single design point of SPEED-ACC simultaneously achieves a 455x/22,700x improvement in sequence throughput and an over 7.1x/2880x improvement in energy efficiency over state-of-the-art hardware/software solutions, making it well suited for scalable DNA classification in the era of large-scale genomic data.
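
A software analogue of the prefix-based indexing idea is sketched below: reference patterns are grouped by their leading bases so a query only needs to be compared against the one partition sharing its prefix. The prefix length, exact-match semantics, and data structures are illustrative assumptions standing in for the physical CAM partitioning described above.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Group reference patterns by their first PREFIX_LEN bases; a lookup only
    // activates the partition matching the query's prefix, shrinking the
    // search space. In SPEED-ACC the partitions are CAM arrays searched in parallel.
    constexpr std::size_t PREFIX_LEN = 4;

    class PrefixPartitionedStore {
    public:
        void insert(const std::string& pattern) {
            partitions_[pattern.substr(0, PREFIX_LEN)].push_back(pattern);
        }

        // Exact-match search restricted to the query's prefix partition.
        bool contains(const std::string& query) const {
            auto it = partitions_.find(query.substr(0, PREFIX_LEN));
            if (it == partitions_.end()) return false;
            for (const std::string& p : it->second)   // only this partition is searched
                if (p == query) return true;
            return false;
        }

    private:
        std::unordered_map<std::string, std::vector<std::string>> partitions_;
    };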