Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques span a range of tradeoffs between latency, reuse, and overhead. In this work, we present a prefetching technique that achieves state-of-the-art performance without...