LLM (Language) Tasks and Datasets

QA

  • SQuAD: Stanford QA Dataset
  • RACE: Reading Comprehension from Examinations

Commonsense Reasoning

  • Story Cloze Test: choose the correct ending for a short four-sentence story
  • SWAG: Situations With Adversarial Generations

Natural Language Inference (NLI) / Text Entailment Goal: determine whether one sentence can be inferred from another (an example follows the dataset list below)

  • RTE (Recognizing Textual Entailment)
  • SNLI (Stanford Natural Language Inference)
  • MNLI (Multi-Genre NLI)
  • QNLI (Question NLI)
  • SciTail
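
Concretely, each NLI example pairs a premise with a hypothesis and a label (entailment, neutral, or contradiction). Below is a minimal sketch of inspecting one SNLI example with the Hugging Face datasets library; the "snli" dataset ID and its field names are taken from the public dataset card and should be treated as assumptions.

```python
# Minimal look at an NLI example using the Hugging Face `datasets` library.
# The "snli" dataset ID and its premise/hypothesis/label fields are assumptions
# based on the public dataset card.
from datasets import load_dataset

snli = load_dataset("snli", split="validation")

example = snli[0]
print(example["premise"])     # the premise sentence
print(example["hypothesis"])  # a sentence that may or may not follow from the premise
print(example["label"])       # 0 = entailment, 1 = neutral, 2 = contradiction (-1 = unlabeled)
```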

Named Entity Recognition (NER)

  • CoNLL 2003 NER task
  • OntoNotes 5.0
  • Reuters Corpus
  • Fine-Grained NER (FGN)

Sentiment Analysis

  • SST (Stanford Sentiment Treebank)
  • IMDb

Semantic Role Labeling (SRL) Goal: model the predicate-argument structure of a sentence; often described as answering “Who did what to whom”.

  • CoNLL-2004 & CoNLL-2005

Sentence Similarity (or Paraphrase Detection)

  • MRPC (Microsoft Research Paraphrase Corpus)
  • QQP (Quora Question Pairs)
  • STS Benchmark

Sentence Acceptability A task to annotate sentences for grammatical acceptability

  • CoLA (Corpus of Linguistic Acceptability)

Text Chunking Divide a text into syntactically correlated groups of words.

  • CoNLL 2000

Part-of-Speech (POS) Tagging

  • Wall Street Journal portion of the Penn Treebank

Machine Translation

  • WMT 2015 English-Czech data (Large)
  • WMT 2014 English-German data (Medium)
  • IWSLT 2015 English-Vietnamese data (Small)

Coreference Resolution Cluster mentions in text that refer to the same underlying real-world entities.

  • CoNLL 2012

Long-range Dependency

  • LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects)
  • Children’s Book Test

Multi-task benchmark

  • GLUE multi-task benchmark
  • decaNLP benchmark

Unsupervised pretraining dataset

  • Books corpus
  • 1B Word Language Model Benchmark
  • English Wikipedia

LMM (Multimodal) Tasks and Datasets

Image Caption Datasets

  • MS COCO
  • NoCaps
  • Conceptual Captions
  • Crisscrossed Captions (CxC)
  • Concadia

Paired Image-Text Datasets

  • ALIGN
  • LTIP*
  • VTP*
  • JFT-300M/JFT-3B*

VQA (Visual Question Answering) Provide an answer to a question given a visual input (image or video).

  • VQAv2
  • OkVQA
  • TextVQA
  • VizWiz

Visual Language Reasoning Infers common-sense information and cognitive understanding given a visual input.

  • VCR (Visual Commonsense Reasoning)
  • NLVR2 (Natural Language for Visual Reasoning)
  • Flickr30K
  • SNLI-VE (Visual Entailment)

Video QA and Understanding

  • MSR-VTT (MSR Video to Text)
  • ActivityNet-QA
  • TGIF (Tumblr GIF)
  • LSMDC (Large Scale Movie Description Challenge)
  • TVQA/+
  • DramaQA
  • VLEP (Video-and-Language Event Prediction)

(*) Internal, non-public datasets

Common Pre-training Strategies

  • Masked Language Modeling is often used when the transformer is trained on text only. A fraction of the input tokens is masked at random, and the model is trained to predict the masked tokens (words); see the sketch after this list.
  • Next Sentence Prediction also works with text-only input and evaluates whether a sentence is an appropriate continuation of the input sentence. By training on both correct and incorrect continuations, the model learns to capture long-term dependencies.
  • Masked Region Modeling masks image regions in a similar way to masked language modeling. The model is then trained to predict the features of the masked region.
  • Image-Text Matching forces the model to predict if a sentence is appropriate for a specific image.
  • Word-Region Alignment finds correlations between image region and words.
  • Masked Region Classification predicts the object class for each masked region.
  • Masked Region Feature Regression learns to regress the masked image region to its visual features.
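
As a concrete illustration of the first strategy above, the sketch below masks a random subset of token IDs and computes cross-entropy loss only on the masked positions. The model interface, the 15% masking rate, and the single [MASK] replacement are illustrative assumptions rather than any specific paper's recipe.

```python
# Minimal masked language modeling (MLM) sketch in PyTorch.
# `model` is assumed to map token IDs to per-token vocabulary logits.
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    labels = input_ids.clone()
    # Pick ~15% of the positions uniformly at random to mask.
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~mask] = -100                    # ignore unmasked positions in the loss
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id     # replace chosen tokens with [MASK]
    logits = model(masked_inputs)           # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),   # flatten to (batch*seq_len, vocab_size)
        labels.view(-1),                    # flatten targets
        ignore_index=-100,                  # only masked positions contribute
    )
```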

RLHF Datasets

  • Anthropic RLHF Dataset on HuggingFace [Dataset]
  • No Robots, a dataset of 10k instructions from HuggingFace, modeled on the SFT dataset described in InstructGPT [Dataset] (see the loading sketch below)
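
A minimal sketch of loading both datasets with the Hugging Face datasets library follows; the repo IDs and field names come from the public dataset cards and are assumptions here.

```python
# Hedged sketch: loading the two datasets above with Hugging Face `datasets`.
# The repo IDs ("Anthropic/hh-rlhf", "HuggingFaceH4/no_robots") and the field
# names are assumptions based on the public dataset cards.
from datasets import load_dataset

# Anthropic helpful/harmless preference data: each row pairs a "chosen"
# conversation with a "rejected" one, the format used for reward modeling.
hh = load_dataset("Anthropic/hh-rlhf", split="train")
print(hh[0]["chosen"][:200])

# No Robots: ~10k human-written instruction/response conversations for SFT.
no_robots = load_dataset("HuggingFaceH4/no_robots", split="train")
print(no_robots[0])
```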

15T Token FineWeb Dataset

(Penedo et al., 2024) describe, create, and publicly release a 15T-token dataset. Based on the Chinchilla scaling laws, a 15T-token dataset should be optimal for a model of roughly 500B parameters. Note that RedPajama contains 20 trillion tokens, but the researchers found that models trained on RedPajama are of poorer quality than those trained on FineWeb, due to the different filtering rules applied. The Llama 3 models (8B, 70B, and 405B) were likewise trained on 15 trillion tokens, but Meta AI’s training dataset is not publicly available.
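
Since the full corpus is far too large to download casually, a hedged sketch of sampling it via streaming with the Hugging Face datasets library is shown below; the repo ID, subset name, and field name are assumptions taken from the public dataset card.

```python
# Hedged sketch: streaming FineWeb instead of downloading 15T tokens.
# The repo ID "HuggingFaceFW/fineweb", the "sample-10BT" subset, and the
# "text" field are assumptions based on the public dataset card.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # a small sample subset rather than the full corpus
    split="train",
    streaming=True,       # iterate lazily; nothing is materialized on disk
)

for i, doc in enumerate(fineweb):
    print(doc["text"][:200])  # each record carries the raw web text
    if i >= 2:
        break
```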

References

  1. Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., von Werra, L., & Wolf, T. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. https://arxiv.org/abs/2406.17557