Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity

arXiv:2410.01028v1 Announce Type: new
Abstract: We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model. Instead, it relies on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current state of the art for self-speculative decoding, while being a truly plug-and-play method.
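
The abstract does not spell out the "simple rules" it refers to, so the following is only a minimal sketch of one plausible reading of the title: layers whose output is nearly parallel to their input (high cosine similarity) change the hidden state little and might be skipped when forming a draft model for the current context. The function names, the threshold of 0.95, and the toy hidden states are all assumptions made for illustration, not the paper's actual algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_layers_to_skip(layer_inputs, layer_outputs, threshold=0.95):
    """Return indices of layers whose input and output hidden states are nearly parallel.

    Hypothetical rule: a layer is a candidate for skipping in the draft model
    when cosine_similarity(input, output) >= threshold. The threshold value
    is an assumption, not taken from the paper.
    """
    skip = []
    for idx, (h_in, h_out) in enumerate(zip(layer_inputs, layer_outputs)):
        if cosine_similarity(h_in, h_out) >= threshold:
            skip.append(idx)
    return skip


# Toy hidden states for a 3-layer model (made-up numbers, for illustration only).
layer_inputs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
layer_outputs = [[1.0, 0.1], [-1.0, 1.0], [0.0, 2.0]]

skipped = select_layers_to_skip(layer_inputs, layer_outputs)
print(skipped)
```

In this toy case the first and third layers barely rotate their input and are selected, while the second layer produces an orthogonal output and is kept. Because the selection depends on the hidden states of the current input, the set of skipped layers can differ from one context to the next, which is the sense in which the draft model is adapted on the fly rather than fixed in advance.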




By stp2y
