Patch-Level Training for Large Language Models, by Chenze Shao and 2 other authors
Abstract: As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it incurs considerable computational cost because an extensive number of tokens must be processed. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. The model then continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5$\times$ without compromising model performance compared to token-level training. Source code: this https URL.
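To make the idea concrete, the following is a minimal sketch of patch-level next-patch prediction, not the authors' implementation. It assumes patches are formed by mean-pooling K consecutive token embeddings and that the model is trained to predict every token of the next patch; names such as PatchLevelLM, patch_size, and patch_level_loss are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchLevelLM(nn.Module):
    """Toy causal model operating on patches of K tokens (illustrative sketch)."""
    def __init__(self, vocab_size=32000, d_model=512, patch_size=4, n_layers=4, n_heads=8):
        super().__init__()
        self.patch_size = patch_size
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One prediction head slot per token position inside the next patch.
        self.heads = nn.Linear(d_model, patch_size * vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len), seq_len divisible by patch_size.
        b, t = tokens.shape
        k = self.patch_size
        x = self.embed(tokens)                          # (b, t, d)
        patches = x.view(b, t // k, k, -1).mean(dim=2)  # mean-pool K token embeddings into one patch
        causal = nn.Transformer.generate_square_subsequent_mask(t // k)
        h = self.backbone(patches, mask=causal)
        return self.heads(h).view(b, t // k, k, self.vocab_size)

def patch_level_loss(model, tokens):
    # Predict the K tokens of patch i+1 from patches up to i.
    logits = model(tokens)                                  # (b, P, K, V)
    targets = tokens.view(tokens.size(0), -1, model.patch_size)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, model.vocab_size),       # drop prediction after the last patch
        targets[:, 1:].reshape(-1),                         # shift targets by one patch
    )

# Usage sketch: run patch-level training on most of the data, then continue with
# ordinary token-level training so the model matches token-by-token inference.
model = PatchLevelLM()
batch = torch.randint(0, 32000, (2, 64))                    # 64 tokens, patch_size 4 -> 16 patches
loss = patch_level_loss(model, batch)
loss.backward()
```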
Submission history
From: Chenze Shao
[v1] Wed, 17 Jul 2024 15:48:39 UTC (1,345 KB)
[v2] Fri, 13 Sep 2024 03:07:37 UTC (1,360 KB)