Low Latency RNN Inference with Cellular Batching
Abstract
Inference on pre-trained neural network models must meet low-latency requirements, which are often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but they do not perform well when serving Recurrent Neural Networks (RNNs) with dynamic dataflow graphs. We propose cellular batching, a technique that improves both the latency and throughput of RNN inference. Unlike existing systems that batch a fixed set of dataflow graphs, cellular batching makes batching decisions at the granularity of an RNN “cell” (a subgraph with shared weights) and dynamically assembles a batched cell for execution as requests join and leave the system. We implemented our approach in a system called BatchMaker. Experiments show that BatchMaker achieves much lower latency and also higher throughput than existing systems.
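The idea of batching at cell granularity can be illustrated with a toy scheduler. This is only a minimal sketch under stated assumptions, not BatchMaker's implementation: the names `Request`, `cell`, and `serve`, the scalar "hidden state", the fixed `max_batch` limit, and the round-robin queue are all hypothetical choices made for illustration. At every scheduling step the loop admits newly arrived requests, runs one batched cell over whatever requests are ready, and releases any request whose sequence is finished.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: int
    tokens: list        # non-empty input sequence, one value per RNN step
    state: float = 0.0  # toy scalar hidden state (assumption for illustration)
    pos: int = 0        # index of the next token to process


def cell(states, inputs):
    """Toy stand-in for one RNN cell with shared weights, applied to a batch."""
    return [0.5 * s + x for s, x in zip(states, inputs)]


def serve(arrivals, max_batch=4):
    """Run cell-granularity batching over a hypothetical arrival schedule.

    arrivals maps a scheduling step to the list of requests arriving then.
    Returns {req_id: (finish_step, final_state)}.
    """
    ready = deque()
    finished = {}
    last_arrival = max(arrivals) if arrivals else -1
    step = 0
    while ready or step <= last_arrival:
        # New requests join the ready queue without waiting for a batch to drain.
        ready.extend(arrivals.get(step, []))
        # Assemble one batched cell from up to max_batch ready requests.
        batch = [ready.popleft() for _ in range(min(max_batch, len(ready)))]
        if batch:
            new_states = cell([r.state for r in batch],
                              [r.tokens[r.pos] for r in batch])
            for r, s in zip(batch, new_states):
                r.state = s
                r.pos += 1
                if r.pos >= len(r.tokens):
                    finished[r.req_id] = (step, r.state)  # leaves immediately
                else:
                    ready.append(r)  # needs more cell executions
        step += 1
    return finished


# A long request arrives at step 0, a short one at step 1.
result = serve({0: [Request(0, [1.0, 1.0, 1.0])], 1: [Request(1, [2.0])]})
```

In this run the short request joins the batch at step 1 and finishes in that same step, while the long request continues until step 2. A scheme that batches whole dataflow graphs would instead make the late arrival wait for the running batch, or make the short request wait for the longest sequence in its batch.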
Similar resources
Tuning Paxos for High-Throughput with Batching and Pipelining
Paxos is probably the most known state machine replication protocol. Two optimizations that can greatly improve its performance are batching and pipelining. Their effectiveness depends significantly on the system properties, mainly network latency and bandwidth, but also on the CPU speed and properties of the application. This makes it hard to know when and how to use each optimization to achie...
Low Latency Anonymity with Mix Rings
We introduce mix rings, a novel peer-to-peer mixnet architecture for anonymity that yields low-latency networking compared to existing mixnet architectures. A mix ring is a cycle of continuous-time mixes that uses carefully coordinated cover traffic and a simple fan-out mechanism to protect the initiator from timing analysis attacks. Key features of the mix ring architecture include decoupling ...
Neural Speed Reading via Skim-RNN
Inspired by the principles of speed reading, we introduce Skim-RNN, a recurrent neural network (RNN) that dynamically decides to update only a small fraction of the hidden state for relatively unimportant input tokens. Skim-RNN gives computational advantage over an RNN that always updates the entire hidden state. Skim-RNN uses the same input and output interfaces as a standard RNN and can be ea...
Low Power and Low Latency Phase-Frequency Detector in Quantum-Dot Cellular Automata Nanotechnology
Nowadays, one of the most important blocks in telecommunication circuits is the frequency synthesizer and the frequency multipliers. Phase-frequency detectors are the inseparable parts of these circuits. In this paper, it has been attempted to design two new structures for phase-frequency detectors in QCA nanotechnology. The proposed structures have the capability of detecting the phase ...
Building a Transparent Batching Layer for Storm
Matthias J. Sax, Malu Castellanos. HP Laboratories, HPL-2013-69. Keywords: streaming data, distributed streaming system, batching, performance, optimization. Storm is a distributed intra-node-parallel stream processing system built for very low latency processing. One major drawback of Storm is its relatively low throughput. In order to increase Storm's throu...