Training Overview
Overview
The typical LLM training pipeline involves (a) pre-training, (b) supervised fine-tuning (SFT), and (c) RLHF/alignment.
Fig. Typical LLM Training Pipeline. Image Credits (Huyen, 2023)
Pre-Training
Pre-training is performed on very large amounts of low-quality (largely unfiltered) data. Training dataset sizes are growing much faster than the rate at which new data is being generated (Villalobos et al., 2024).
- GPT-3’s dataset (OpenAI): 0.5 trillion tokens
- Gopher’s dataset (DeepMind): 1 trillion tokens
- RedPajama (Together): 1.2 trillion tokens
- LLaMa’s dataset (Meta): 1.4 trillion tokens
One trillion tokens is equivalent to approximately 15 million books.
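As a back-of-the-envelope check on that figure: 10^12 tokens / 15 million books ≈ 67,000 tokens per book, which at roughly 0.75 words per token is about 50,000 words, a plausible length for a book.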
Supervised Fine-Tuning (SFT)
In SFT, the examples are high-quality data in the format (prompt, response), called demonstration data. OpenAI calls supervised finetuning behavior cloning: you demonstrate how the model should behave, and the model clones that behavior. Data scale: 10K - 100K (prompt, response) pairs. SFT allows models to better adhere to specific instructions; a minimal sketch of the SFT training objective follows the examples below.
- InstructGPT: ~14.5k pairs (13k from labelers +1.5k from customers)
- Alpaca: 52K ChatGPT instructions
- Databricks’ Dolly-15k: ~15k pairs, created by Databricks employees
- OpenAssistant: 161k messages in 10k conversations → approximately 88k pairs
- Dialogue-finetuned Gopher: ~5 billion tokens, estimated to be on the order of 10M messages. However, these were filtered from the Internet using heuristics, so they are not of the highest quality.
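SFT is usually implemented as ordinary next-token prediction on the concatenated prompt + response text, commonly with the loss masked out on the prompt tokens so that only the response is "cloned". Below is a minimal PyTorch sketch of that masking, assuming a Hugging Face-style tokenizer and causal LM whose forward pass returns .logits; the -100 ignore-index convention and the response-only masking are common practice, not a specific recipe from the sources above. In real pipelines this is batched and handled by libraries such as Hugging Face TRL, but the idea is the same.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt: str, response: str) -> torch.Tensor:
    """Next-token loss on one (prompt, response) demonstration pair,
    with the prompt tokens masked out so only the response is learned."""
    # Tokenize separately so we know where the prompt ends.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt",
                             add_special_tokens=False).input_ids

    input_ids = torch.cat([prompt_ids, response_ids], dim=1)   # (1, T)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # -100 = ignored by cross_entropy

    logits = model(input_ids).logits                           # (1, T, vocab)
    # Shift so that the logits at position t predict the token at t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```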
Finetuning Methods
- Instruction Finetuning: to get openly available LLMs to follow instructions better, or to specialize them on a subset of instructions or on new instructions
- Continual Pre-training: to take in new knowledge
- Proxy Tuning: "finetuning" LLMs without altering their weights (see the decoding-time sketch after this list)
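Proxy tuning, in particular, works at decoding time: a large base model is steered by the logit offset between a small tuned "expert" and its untuned counterpart, so none of the large model's weights change. The sketch below shows that logit arithmetic; the three-model setup and the combination rule are assumptions based on the published proxy-tuning idea, not something spelled out in this section.

```python
import torch

def proxy_tuned_next_token(base_logits: torch.Tensor,
                           expert_logits: torch.Tensor,
                           antiexpert_logits: torch.Tensor,
                           temperature: float = 1.0) -> torch.Tensor:
    """Decoding-time steering: the large base model's logits are shifted by the
    offset between a small tuned 'expert' and its untuned 'anti-expert'.
    No model weights are modified."""
    steered = base_logits + (expert_logits - antiexpert_logits)
    probs = torch.softmax(steered / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)   # sampled next-token id

# Toy usage with random logits over a 32k vocabulary:
vocab = 32_000
next_id = proxy_tuned_next_token(torch.randn(1, vocab),
                                 torch.randn(1, vocab),
                                 torch.randn(1, vocab))
```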
RLHF or Alignment
The idea: what if we had a scoring function that, given a prompt and a response, outputs a score for how good that response is? We could then use this scoring function to further train our LLMs towards giving responses with high scores. That's exactly what RLHF does. RLHF consists of two parts:
- Train a reward model to act as a scoring function.
- Optimize LLM to generate responses for which the reward model will give high scores.
Data scale: 100K - 1M examples
- InstructGPT: 50k prompts. Each prompt has 4 to 9 responses, forming between 6 and 36 pairs of (winning_response, losing_response). This means between 300K and 1.8M training examples in the format of (prompt, winning_response, losing_response).
- Constitutional AI, which is suspected to be the backbone of Claude (Anthropic): 318K comparisons – 135K generated by humans, and 183K generated by AI. Anthropic has an older version of their data open-sourced (hh-rlhf), which consists of roughly 170K comparisons.
The alignment step hones the LLM to respond more helpfully and safely to user prompts. In some cases (like InstructGPT), RLHF step 1 includes SFT.
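The reward model in step 1 is typically trained on the (prompt, winning_response, losing_response) comparisons described above with a pairwise ranking loss that pushes the winner's score above the loser's. The sketch below shows this objective in PyTorch; the scalar-score reward model and the -log sigmoid(r_w - r_l) loss follow the InstructGPT-style setup, but the model interface here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, winner_ids, loser_ids):
    """Pairwise ranking loss on (prompt, winning_response, losing_response)
    comparisons. `reward_model` is assumed to map the token ids of
    prompt + response to a single scalar score per sequence."""
    r_w = reward_model(winner_ids)   # scores for prompt + winning response
    r_l = reward_model(loser_ids)    # scores for prompt + losing response
    # Maximize sigmoid(r_w - r_l): winners should score higher than losers.
    return -F.logsigmoid(r_w - r_l).mean()

# Toy usage with a stand-in "reward model" that just averages token ids:
toy_rm = lambda ids: ids.float().mean(dim=-1)
loss = reward_model_loss(toy_rm,
                         torch.randint(0, 100, (4, 16)),   # winning sequences
                         torch.randint(0, 100, (4, 16)))   # losing sequences
```

In step 2, the LLM is then optimized (e.g., with PPO) to produce responses that this reward model scores highly, usually with a penalty for drifting too far from the SFT model.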
Insights into training LLMs
(Biderman et al., 2023) released 8 LLMs (from 70M to 12B parameters) along with training details, analysis and insights:
- Does pretraining on duplicated data (i.e., training for >1 epoch) make a difference? It turns out that deduplicating the data neither benefits nor hurts performance.
- Does training order influence memorization? Unfortunately, it turns out that it does not. "Unfortunately," because if it did, we could mitigate undesirable verbatim memorization simply by reordering the training data.
- Does pretrained term frequency influence task performance? Yes, few-shot accuracy tends to be higher for terms that occur more frequently.
- Does increasing the batch size affect training efficiency and model convergence? Doubling the batch size halves the training time but doesn’t hurt convergence.
Instruction Pre-training
(Cheng et al., 2024) investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. They use an instruction synthesizer model to augment the raw-text token stream with the generated instructions and responses.
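Concretely, each pretraining example can interleave a raw document with the instruction-response pairs synthesized from it, so the model sees both plain text and supervised multitask signal in a single token stream. The sketch below only illustrates that data formatting; the "Question:"/"Answer:" separators and the concatenation scheme are illustrative assumptions, not the exact templates from (Cheng et al., 2024).

```python
def build_instruction_pretraining_example(raw_text: str,
                                          qa_pairs: list[tuple[str, str]]) -> str:
    """Augment a raw pretraining document with instruction-response pairs
    synthesized from it, forming one combined token stream."""
    parts = [raw_text]
    for instruction, response in qa_pairs:
        parts.append(f"Question: {instruction}")   # illustrative template
        parts.append(f"Answer: {response}")
    return "\n\n".join(parts)

doc = "The Eiffel Tower, completed in 1889, is about 330 metres tall."
pairs = [("When was the Eiffel Tower completed?", "It was completed in 1889."),
         ("How tall is the Eiffel Tower?", "It is about 330 metres tall.")]
print(build_instruction_pretraining_example(doc, pairs))
```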
Creating Alignment Data from Scratch
MagPie (Xu et al., 2024) proposed a hack to generate a high-quality instruction-finetuning dataset from scratch. It prompts the aligned Llama-3-8B (Instruct) model with only the pre-query template (the chat-template tokens that normally precede a user message); the model then completes the missing user instruction itself, and feeding that instruction back through the full chat template yields the paired response, producing (instruction, response) pairs without any seed prompts or human-written examples.
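The underlying observation is that an aligned chat model, given only the tokens that precede a user turn, will generate a plausible user instruction on its own. Below is a hedged sketch of that two-step loop using Hugging Face transformers; the checkpoint name, Llama-3 chat-template string, and generation settings are assumptions for illustration, and the quality-filtering steps of the MagPie pipeline are omitted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 1: give the model only the "pre-query" part of the chat template,
# i.e., the tokens up to where a user message would normally start.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).input_ids
out = model.generate(ids, max_new_tokens=128, do_sample=True)
instruction = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

# Step 2: feed the self-generated instruction back through the full chat
# template to obtain the paired response.
chat = tok.apply_chat_template([{"role": "user", "content": instruction}],
                               add_generation_prompt=True, return_tensors="pt")
resp = model.generate(chat, max_new_tokens=256, do_sample=True)
response = tok.decode(resp[0, chat.shape[1]:], skip_special_tokens=True)
print({"instruction": instruction, "response": response})
```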
References
- Huyen, C. (2023). RLHF: Reinforcement Learning from Human Feedback [Blog]. https://huyenchip.com/2023/05/02/rlhf.html
- Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. https://arxiv.org/abs/2211.04325
- Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S. V. S. N. S., Raff, E., Skowron, A., Sutawika, L., & van der Wal, O. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. https://arxiv.org/abs/2304.01373
- Cheng, D., Gu, Y., Huang, S., Bi, J., Huang, M., & Wei, F. (2024). Instruction Pre-Training: Language Models are Supervised Multitask Learners. https://arxiv.org/abs/2406.14491
- Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., & Lin, B. Y. (2024). Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. https://arxiv.org/abs/2406.08464
- Raschka, S. (2023). LLM Training: RLHF and Its Alternatives [Blog]. https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives
- Chadha, A. (2023). LLM Alignment. Distilled AI.
- Huyen, C. (2024). Predictive Human Preference: From Model Ranking to Model Routing [Blog]. https://huyenchip.com/2024/02/28/predictive-human-preference.html
- OpenAI. (2022). Aligning language models to follow instructions by OpenAI [Blog]. https://openai.com/index/instruction-following/
- Raschka, S. (2024). Instruction Pretraining LLMs [Blog]. https://magazine.sebastianraschka.com/p/instruction-pretraining-llms