Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

[Submitted on 16 Oct 2024]

by Costin-Andrei Oncescu and 3 other authors

Abstract: While transformers have been at the core of most recent advances in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this; some of them, including long convolution sequence models (LCSMs) such as Hyena, address it at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L \log^2 L)$ time, identify the key properties that make this possible, and propose a general framework that exploits them. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling that reduces memory movement and shares computation. It has the added benefit of allowing almost complete parallelization across the layers of the position-mixing part of the architecture. Empirically, we provide a proof-of-concept implementation for Hyena that achieves up to a $1.6\times$ end-to-end speedup over standard inference by speeding up the position-mixing part by $50\times$.
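As a rough illustration of the idea the abstract describes (a minimal NumPy sketch under assumptions, not the paper's implementation: the function names naive_inference and tiled_inference are hypothetical, and real Hyena inference additionally involves multiple layers, gating, and learned kernels), here is the kind of power-of-two tiling that turns naive quadratic autoregressive convolution inference into a quasilinear procedure:

import numpy as np

def block_conv(a, b):
    # Full linear convolution of two blocks via FFT, so each
    # tile costs O(B log B) instead of O(B^2).
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

def naive_inference(k, u):
    # Standard autoregressive inference: recompute the convolution
    # prefix y_t = sum_{i<=t} k_i * u_{t-i} at every step -> O(L^2) total.
    L = len(u)
    y = np.empty(L)
    for t in range(L):
        y[t] = np.dot(k[: t + 1], u[t::-1])
    return y

def tiled_inference(k, u):
    # Relaxed-convolution-style inference: the (input, kernel) index
    # plane is tiled with power-of-two blocks, and each tile is handled
    # by one FFT convolution as soon as its inputs are available. Every
    # output y[t] is complete by step t, at O(L log^2 L) total cost.
    L = len(u)
    kp = np.concatenate([k, np.zeros(L)])  # pad so kernel slices never run out
    y = np.zeros(3 * L)                    # scratch with room for tile overhang
    for t in range(L):
        y[t] += kp[0] * u[t]               # diagonal term, added directly
        i = t + 1                          # number of inputs seen so far
        B = 1
        while B <= i and i % B == 0:
            # Tile: the B most recent inputs times kernel slice kp[B:2B],
            # contributing only to future outputs y[i], ..., y[i + 2B - 2].
            y[i : i + 2 * B - 1] += block_conv(u[i - B : i], kp[B : 2 * B])
            B *= 2
    return y[:L]

On random inputs the two agree, e.g. np.allclose(tiled_inference(k, u), naive_inference(k, u)). The cost bound follows from the tiling: each of the log L block sizes B contributes L/B tiles of O(B log B) FFT work, i.e. O(L log B) per level and O(L log^2 L) overall, matching the quasilinear claim in the abstract.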

Submission history

From: Costin-Andrei Oncescu
[v1]
Wed, 16 Oct 2024 19:23:46 UTC (1,474 KB)


