deeplearning

How Can We Prevent Catastrophic Forgetting and Preserve Knowledge During LLM Fine-tuning?

As LLaMA keeps being updated, the model's capabilities have become more and more refined, but I have also found it harder and harder to fine-tune. The main problem: after fine-tuning, the model visibly loses its original skills, the phenomenon known as "Catastrophic Forgetting." This has become more troublesome since LLaMA-3; a single training iteration can already leave the model unable to stop at the end of a response, and the LLaMA-3 technical report mentions the same issue under "tail repetition."

Many articles online discuss how to pull a model from Huggingface and continue training it without destroying its original abilities. Most of them mention a few approaches: (1) mix historical data into the training set; (2) constrain the objective function so the model cannot drift too far from its original outputs; (3) apply LoRA or other adaptation techniques. In practice, however, these methods either work poorly or cost too much. For example, how would you prepare the "historical data" behind LLaMA-3-Instruct? Such data could easily be far larger than the new data you want to train on. Preparing historical data also conflicts with the rule of thumb that each sample should only be trained on once: retraining on data the model has already learned amounts to repeating the same data, which pushes the model toward ad hoc behavior, where the knowledge is not absorbed sensibly but the answers are memorized mechanically. Adaptation methods such as LoRA are of limited help as well; they merely slow down both "learning new knowledge" and "destroying old knowledge," letting you look for a compromise between the two, and we have never obtained satisfactory results this way. Experts simply avoid the problem: Taiwan LLaMA, for instance, is trained from the base model with its own instruction-tuning dataset.

Solution

The effective solution is a training scheme with the smallest possible modification. People usually generate input and output data according to the chat template and throw all of it at the model. Training this way silently re-teaches the model a lot of knowledge it already has, or accidentally introduces new notions (a stylistic bias, for example). Many people know to use the same model to help generate the training data, but that is not enough. Ideally, only the tokens carrying the key knowledge should be trained, rather than letting every token participate in the loss computation. Experiments show that this lets us skip advanced methods such as DPO and get by with plain full-parameter SFT.

Another interesting point: LoRA cannot protect the model's original abilities, and you do not actually need to protect them. A large model has enough parameters to spread all kinds of high-level knowledge across different weights. I found that as long as the prompt contains a complete description of the knowledge (the reasoning process) and can be selected by the attention QK computation, the knowledge ends up placed correctly in the model; these attention processes decide whether pieces of knowledge contaminate each other. If the preceding tokens carry no reasoning cues, it is as if there were no preceding tokens at all, and the model is being taught to produce a token "out of nothing." Because "out of nothing" knowledge is triggered without any precondition, it noticeably squeezes out the model's existing abilities. Remember that a large language model has an enormous amount of room for storing knowledge: you can introduce a specific piece of knowledge while barely noticing any other ability disappear. By carefully curating the training data, the problem can be solved while keeping exactly the same results across thousands of benchmark questions.

When preparing the data, you cannot simply reuse the intuitions of traditional machine learning. The difference is that in continued training the attention behavior is already established; nothing is competing to learn from scratch. If you prepare many prompt tokens that will not be attended to correctly, they will not be learned as expected. Conversely, unnecessary prompt content should be trimmed, because an LLM will inevitably exhibit some wrong attention behavior, and rather than risk the model learning wrong associations it is better to cut redundant tokens.
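To make the "train only the key-knowledge tokens" idea concrete, here is a minimal sketch of token-level loss masking, assuming a HuggingFace-style causal LM where label positions set to -100 are ignored by the cross-entropy loss. The checkpoint name, the build_masked_example helper, and the exact-match way of locating the key span are illustrative assumptions, not the author's exact pipeline.

import torch
from transformers import AutoTokenizer

# Illustrative placeholder checkpoint; any causal-LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_masked_example(prompt: str, response: str, key_span: str):
    # Tokenize the full prompt + response as one sequence.
    enc = tokenizer(prompt + response, return_tensors="pt")
    input_ids = enc["input_ids"][0]

    # Start with every position excluded from the loss (-100 is ignored
    # by the default cross-entropy loss in transformers).
    labels = torch.full_like(input_ids, -100)

    # Un-mask only the tokens of the key-knowledge span inside the response.
    # (Simplified exact token match; real data may need more careful alignment.)
    prompt_len = len(tokenizer(prompt)["input_ids"])
    key_ids = tokenizer(key_span, add_special_tokens=False)["input_ids"]
    ids = input_ids.tolist()
    for start in range(prompt_len, len(ids) - len(key_ids) + 1):
        if ids[start:start + len(key_ids)] == key_ids:
            labels[start:start + len(key_ids)] = input_ids[start:start + len(key_ids)]
            break

    enc["labels"] = labels.unsqueeze(0)
    return enc

With labels built this way, a plain full-parameter SFT step only updates the model toward the key-knowledge tokens, while the rest of the chat-template text contributes nothing to the loss.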
Read More
Types of Machine Learning you must know!

There are 4 major types of Machine Learning: Supervised Learning (Regression, Classification), Unsupervised Learning (Clustering, Dimensionality Reduction, Anomaly Detection, Association), Semi-Supervised Learning, and Reinforcement Learning. Let's explain them one by one for a clear idea of what exactly they are about! Supervised Learning: If we have a dataset with both input and output, our job is to understand the relationship between them. Then, we can use that understanding to predict the output for new input. This type of learning is called supervised machine learning. Example: Let's take 1000 students' data. Now the ML model will create a mathematical link between the…
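For a concrete picture of supervised learning, here is a minimal sketch assuming scikit-learn; the hours-vs-scores numbers are made up purely for illustration and are not from the post.

# Supervised learning: learn the input -> output relationship, then predict for new input.
from sklearn.linear_model import LinearRegression

hours = [[1], [2], [3], [4], [5], [6]]   # input: hours studied
scores = [35, 45, 55, 62, 70, 80]        # output: exam scores

model = LinearRegression()
model.fit(hours, scores)                 # build the mathematical link between input and output

print(model.predict([[7]]))              # predict the score for a new input of 7 hours

The same pattern applies to classification, except the output is a category label instead of a number.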
Read More
The activation functions in PyTorch (5)

Buy Me a Coffee☕ *Memos: My post explains Step function, Identity and ReLU. My post explains Leaky ReLU, PReLU and FReLU. My post explains ELU, SELU and CELU. My post explains GELU, Mish, SiLU and Softplus. My post explains Vanishing Gradient Problem, Exploding Gradient Problem and Dying ReLU Problem. (1) Tanh: can convert an input value (x) to an output value between -1 and 1. *-1 and 1 are exclusive. Its formula is y = (e^x - e^(-x)) / (e^x + e^(-x)). It is also called the Hyperbolic Tangent Function. It is Tanh() in PyTorch. It is used in: RNN (Recurrent Neural Network). *RNN in PyTorch.…
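A short PyTorch sketch of Tanh, matching the formula above (the input values here are chosen only for demonstration):

import torch
import torch.nn as nn

tanh = nn.Tanh()                      # y = (e^x - e^(-x)) / (e^x + e^(-x))

x = torch.tensor([-2.0, 0.0, 2.0])
print(tanh(x))                        # tensor([-0.9640,  0.0000,  0.9640]) -> stays inside (-1, 1)
print(torch.tanh(x))                  # the functional form gives the same result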
Read More
How Deep Learning Works

Deep Learning is the core of a Machine Learning system; it is how a machine actually learns from data without much human intervention. In this post I am going to discuss how Deep Learning actually works with the data you give it. The basis of a Deep Learning system is Neural Networks; they are the fundamental part of how a machine learns by itself. To understand how a Neural Network learns, you need to understand how a Neural Network is structured. There are mainly 3 (or more) layers in the neural network: 1. Input Layer: Where the data to the network is…
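As a minimal sketch of that 3-layer structure in PyTorch (the layer sizes here are arbitrary, chosen only to illustrate the idea):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),    # input layer -> hidden layer: 4 input features feed 8 hidden neurons
    nn.ReLU(),          # non-linearity applied inside the hidden layer
    nn.Linear(8, 2),    # hidden layer -> output layer: 2 output values (e.g. class scores)
)

x = torch.randn(1, 4)   # one sample with 4 input features
print(model(x))         # a forward pass through the input, hidden, and output layers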
Read More