Training Overview
Overview
The typical LLM training pipeline involves (a) pre-training, (b) supervised fine-tuning (SFT), and (c) RLHF/alignment.
Fig. Typical LLM Training Pipeline. Image Credits (Huyen, 2023)
Pre-Training
Pre-training is performed on very large amounts of low-quality (largely unfiltered) data. Training dataset sizes are growing much faster than the rate at which new data is being generated (Villalobos et al., 2024).
- GPT-3’s dataset (OpenAI): 0.5 trillion tokens
- Gopher’s dataset (DeepMind): 1 trillion tokens
- RedPajama (Together): 1.2 trillion tokens
- LLaMa’s dataset (Meta): 1.4 trillion tokens
One trillion tokens is equivalent to approximately 15 million books.
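As a back-of-the-envelope check on that figure: 10^12 tokens / 15 million books ≈ 67,000 tokens per book, which at roughly 0.75 words per token is about 50,000 words, a plausible length for a book.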
Supervised Fine-Tuning (SFT)
In SFT, the examples are high-quality data in the format (prompt, response), called demonstration data. OpenAI calls supervised finetuning behavior cloning: you demonstrate how the model should behave, and the model clones that behavior. Data scale: 10K - 100K (prompt, response) pairs. SFT allows models to better adhere to specific instructions; a minimal sketch of the SFT training objective follows the examples below.
- InstructGPT: ~14.5k pairs (13k from labelers +1.5k from customers)
- Alpaca: 52K ChatGPT instructions
- Databricks’ Dolly-15k: ~15k pairs, created by Databricks employees
- OpenAssistant: 161k messages in 10k conversations → approximately 88k pairs
- Dialogue-finetuned Gopher: ~5 billion tokens, estimated to be on the order of 10M messages. However, these were filtered from the Internet using heuristics, so they are not of the highest quality.
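SFT is usually implemented as ordinary next-token prediction on the concatenated prompt + response text, commonly with the loss masked out on the prompt tokens so that only the response is "cloned". Below is a minimal PyTorch sketch of that masking, assuming a Hugging Face-style tokenizer and causal LM whose forward pass returns .logits; the -100 ignore-index convention and the response-only masking are common practice, not a specific recipe from the sources above. In real pipelines this is batched and handled by libraries such as Hugging Face TRL, but the idea is the same.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Next-token loss on one (prompt, response) demonstration pair,
    with the prompt tokens masked out so only the response is learned."""
    # Tokenize separately so we know where the prompt ends.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt",
                             add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=1)   # (1, T)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # -100 = ignored by cross_entropy

    logits = model(input_ids).logits                           # (1, T, vocab)
    # Shift so that the logits at position t predict the token at t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```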
Finetuning Methods
- Instruction Finetuning: to get openly available LLMs to follow instructions better, or to specialize them on a subset of instructions or on new instructions
- Continual Pre-training: to take in new knowledge
- Proxy Tuning: "finetuning" LLMs without altering their weights (see the decoding-time sketch after this list)
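Proxy tuning, in particular, works at decoding time: a large base model is steered by the logit offset between a small tuned "expert" and its untuned counterpart, so none of the large model's weights change. The sketch below shows that logit arithmetic; the three-model setup and the combination rule are assumptions based on the published proxy-tuning idea, not something spelled out in this section.

```python
import torch

def proxy_tuned_next_token(base_logits: torch.Tensor,
                           expert_logits: torch.Tensor,
                           antiexpert_logits: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Decoding-time steering: the large base model's logits are shifted by the
    offset between a small tuned 'expert' and its untuned 'anti-expert'.
    No model weights are modified."""
    steered = base_logits + (expert_logits - antiexpert_logits)
    probs = torch.softmax(steered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sampled next-token id

# Toy usage with random logits over a 32k vocabulary:
vocab = 32_000
next_id = proxy_tuned_next_token(torch.randn(1, vocab),
                                 torch.randn(1, vocab),
                                 torch.randn(1, vocab))
```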
RLHF or Alignment
The idea: what if we had a scoring function that, given a prompt and a response, outputs a score for how good that response is? We could then use this scoring function to further train our LLMs towards giving responses with high scores. That's exactly what RLHF does. RLHF consists of two parts:
- Train a reward model to act as a scoring function.
- Optimize LLM to generate responses for which the reward model will give high scores.
Data scale: 100K - 1M examples
- InstructGPT: 50k prompts. Each prompt has 4 to 9 responses, forming between 6 and 36 pairs of (winning_response, losing_response). This means between 300K and 1.8M training examples in the format of (prompt, winning_response, losing_response).
- Constitutional AI, which is suspected to be the backbone of Claude (Anthropic): 318K comparisons – 135K generated by humans, and 183K generated by AI. Anthropic has an older version of their data open-sourced (hh-rlhf), which consists of roughly 170K comparisons.
The alignment step hones the LLM to respond more helpfully and safely to user prompts. In some cases (like InstructGPT), RLHF step 1 includes SFT.
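The reward model in step 1 is typically trained on the (prompt, winning_response, losing_response) comparisons described above with a pairwise ranking loss that pushes the winner's score above the loser's. The sketch below shows this objective in PyTorch; the scalar-score reward model and the -log sigmoid(r_w - r_l) loss follow the InstructGPT-style setup, but the model interface here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, winner_ids, loser_ids):
    """Pairwise ranking loss on (prompt, winning_response, losing_response)
    comparisons. `reward_model` is assumed to map the token ids of
    prompt + response to a single scalar score per sequence."""
    r_w = reward_model(winner_ids)   # scores for prompt + winning response
    r_l = reward_model(loser_ids)    # scores for prompt + losing response
    # Maximize sigmoid(r_w - r_l): winners should score higher than losers.
    return -F.logsigmoid(r_w - r_l).mean()

# Toy usage with a stand-in "reward model" that just averages token ids:
toy_rm = lambda ids: ids.float().mean(dim=-1)
loss = reward_model_loss(toy_rm,
                         torch.randint(0, 100, (4, 16)),   # winning sequences
                         torch.randint(0, 100, (4, 16)))   # losing sequences
```

In step 2, the LLM is then optimized (e.g., with PPO) to produce responses that this reward model scores highly, usually with a penalty for drifting too far from the SFT model.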
Insights into training LLMs
(Biderman et al., 2023) released 8 LLMs (from 70M to 12B parameters) along with training details, analysis and insights:
- Does pretraining on duplicated data (i.e., training for >1 epoch) make a difference? It turns out that deduplicating the data neither benefits nor hurts performance.
- Does training order influence memorization? Unfortunately, it turns out that it does not. "Unfortunately," because if it did, we could mitigate undesirable verbatim memorization simply by reordering the training data.
- Does pretrained term frequency influence task performance? Yes, few-shot accuracy tends to be higher for terms that occur more frequently.
- Does increasing the batch size affect training efficiency and model convergence? Doubling the batch size halves the training time but doesn’t hurt convergence.
Instruction Pre-training
(Cheng et al., 2024) investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. They use an instruction synthesizer model to augment the raw-text token stream with the generated instructions and responses.
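Concretely, each pretraining example can interleave a raw document with the instruction-response pairs synthesized from it, so the model sees both plain text and supervised multitask signal in a single token stream. The sketch below only illustrates that data formatting; the "Question:"/"Answer:" separators and the concatenation scheme are illustrative assumptions, not the exact templates from (Cheng et al., 2024).

```python
def build_instruction_pretraining_example(raw_text: str,
                                          qa_pairs: list[tuple[str, str]]) -> str:
    """Augment a raw pretraining document with instruction-response pairs
    synthesized from it, forming one combined token stream."""
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Question: {instruction}")   # illustrative template
        parts.append(f"Answer: {response}")
    return "\n\n".join(parts)

doc = "The Eiffel Tower, completed in 1889, is about 330 metres tall."
pairs = [("When was the Eiffel Tower completed?", "It was completed in 1889."),
         ("How tall is the Eiffel Tower?", "It is about 330 metres tall.")]
print(build_instruction_pretraining_example(doc, pairs))
```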
Creating Alignment Data from Scratch
MagPie (Xu et al., 2024) proposed a hack to generate a high-quality instruction-finetuning dataset from scratch. It prompts the aligned Llama-3-8B (Instruct) model with only the pre-query template (the chat-template tokens that normally precede a user message); the model then completes the missing user instruction itself, and feeding that instruction back through the full chat template yields the paired response, producing (instruction, response) pairs without any seed prompts or human-written examples.
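The underlying observation is that an aligned chat model, given only the tokens that precede a user turn, will generate a plausible user instruction on its own. Below is a hedged sketch of that two-step loop using Hugging Face transformers; the checkpoint name, Llama-3 chat-template string, and generation settings are assumptions for illustration, and the quality-filtering steps of the MagPie pipeline are omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 1: give the model only the "pre-query" part of the chat template,
# i.e., the tokens up to where a user message would normally start.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).input_ids
out = model.generate(ids, max_new_tokens=128, do_sample=True)
instruction = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Step 2: feed the self-generated instruction back through the full chat
# template to obtain the paired response.
chat = tok.apply_chat_template([{"role": "user", "content": instruction}],
                               add_generation_prompt=True, return_tensors="pt")
resp = model.generate(chat, max_new_tokens=256, do_sample=True)
response = tok.decode(resp[0, chat.shape[1]:], skip_special_tokens=True)
print({"instruction": instruction, "response": response})
```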
References
- Huyen, C. (2023). RLHF: Reinforcement Learning from Human Feedback [Blog]. https://huyenchip.com/2023/05/02/rlhf.html
- Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. https://arxiv.org/abs/2211.04325
- Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S. V. S. N. S., Raff, E., Skowron, A., Sutawika, L., & van der Wal, O. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. https://arxiv.org/abs/2304.01373
- Cheng, D., Gu, Y., Huang, S., Bi, J., Huang, M., & Wei, F. (2024). Instruction Pre-Training: Language Models are Supervised Multitask Learners. https://arxiv.org/abs/2406.14491
- Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., & Lin, B. Y. (2024). Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. https://arxiv.org/abs/2406.08464
- Raschka, S. (2023). LLM Training: RLHF and Its Alternatives [Blog]. https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives
- Chadha, A. (2023). LLM Alignment. Distilled AI.
- Huyen, C. (2024). Predictive Human Preference: From Model Ranking to Model Routing [Blog]. https://huyenchip.com/2024/02/28/predictive-human-preference.html
- OpenAI. (2022). Aligning language models to follow instructions by OpenAI [Blog]. https://openai.com/index/instruction-following/
- Raschka, S. (2024). Instruction Pretraining LLMs [Blog]. https://magazine.sebastianraschka.com/p/instruction-pretraining-llms