21 May
The T5 model (Raffel et al., 2019) is widely used in the NLP community; its base model alone has been downloaded from Hugging Face millions of times. However, T5's tokenizer omits important code-related tokens (such as curly braces), and pretraining datasets released since then offer higher-quality filtering and more diverse domains. In this blog post, we introduce a new version of T5 intended to address those weaknesses: Pile-T5, trained on the Pile (Gao et al., 2020) and using the LLaMA tokenizer (Touvron et al., 2023).

Model Description

Our alternative version replaces…