Tools & Datasets
LLM (Language) Tasks and Datasets
QA
- SQuAD: Stanford QA Dataset
- RACE: Reading Comprehension from Examinations
Commonsense Reasoning
- Story Cloze Test:
- SWAG: Situations With Adversarial Generations
Natural Language Inference (NLI) / Text Entailment Goal: determine whether one sentence can be inferred from another.
- RTE (Recognizing Textual Entailment)
- SNLI (Stanford Natural Language Inference)
- MNLI (Multi-Genre NLI)
- QNLI (Question NLI)
- SciTail
Named Entity Recognition (NER)
- CoNLL 2003 NER task
- OntoNotes 5.0
- Reuter Corpus
- Fine-Grained NER (FGN)
Sentiment Analysis
- SST (Stanford Sentiment Treebank)
- IMDb
Semantic Role Labeling (SRL) Goal: models the predicate-argument structure of a sentence, and is often described as answering “Who did what to whom”.
- CoNLL-2004 & CoNLL-2005
Sentence Similarity (or Paraphrase Detection)
- MRPC (Microsoft Research Paraphrase Corpus)
- QQP (Quora Question Pairs)
- STS Benchmark
Sentence Acceptability The task of annotating sentences for grammatical acceptability.
- CoLA (Corpus of Linguistic Acceptability)
Text Chunking To divide a text into syntactically correlated groups of words.
- CoNLL 2000
Part-of-Speech (POS) Tagging
- Wall Street Journal portion of the Penn Treebank
Machine Translation
- WMT 2015 English-Czech data (Large)
- WMT 2014 English-German data (Medium)
- IWSLT 2015 English-Vietnamese data (Small)
Coreference Resolution Clusters mentions in text that refer to the same underlying real-world entities.
- CoNLL 2012
Long-range Dependency
- LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects)
- Children’s Book Test
Multi-task benchmark
- GLUE multi-task benchmark
- decaNLP benchmark
Unsupervised pretraining dataset
- Books corpus
- 1B Word Language Model Benchmark
- English Wikipedia
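
Most of the public benchmarks above are mirrored on the Hugging Face Hub, so they can be loaded with the `datasets` library. A minimal sketch, assuming the Hub identifiers "squad", "glue" (config "sst2"), and "conll2003" as found in their public listings:

```python
# Minimal sketch: loading a few of the benchmarks above with the Hugging Face
# `datasets` library. The Hub identifiers ("squad", "glue" with config "sst2",
# "conll2003") are assumptions based on their public Hub listings.
from datasets import load_dataset

squad = load_dataset("squad", split="validation")     # QA
sst2 = load_dataset("glue", "sst2", split="train")    # sentiment (a GLUE task)
ner = load_dataset("conll2003", split="test")         # named entity recognition

print(squad[0]["question"], squad[0]["answers"]["text"])
print(sst2[0]["sentence"], sst2[0]["label"])
print(ner[0]["tokens"], ner[0]["ner_tags"])
```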
LMM (Multimodal) Tasks and Datasets
Image Caption Datasets
- MS COCO
- NoCaps
- Conceptual Captions
- Crisscrossed Captions (CxC)
- Concadia
Paired Image-Text Datasets
- ALIGN
- LTIP*
- VTP*
- JFT-300M/JFT-3B*
VQA (Visual Question Answering) Refers to the task of providing an answer to a question given a visual input (image or video); a minimal inference sketch follows the list below.
- VQAv2
- OkVQA
- TextVQA
- VizWiz
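
As a quick illustration of the task, VQA models can be run off the shelf with the `transformers` visual-question-answering pipeline. A minimal sketch, where the checkpoint name (a ViLT model fine-tuned on VQAv2) and the sample COCO image URL are assumptions:

```python
# Hedged sketch of running VQA with a pretrained model via the Hugging Face
# `transformers` pipeline; the checkpoint ("dandelin/vilt-b32-finetuned-vqa",
# a ViLT model fine-tuned on VQAv2) and the example image URL are assumptions.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(
    image="http://images.cocodataset.org/val2017/000000039769.jpg",  # COCO sample image
    question="How many cats are in the picture?",
)
print(result[0]["answer"], result[0]["score"])
```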
Visual Language Reasoning Infers common-sense information and cognitive understanding given a visual input.
- VCR (Visual Commonsense Reasoning)
- NLVR2 (Natural Language for Visual Reasoning)
- Flickr30K
- SNLI-VE (Visual Entailment)
Video QA and Understanding
- MSR-VTT (MSR Video to Text)
- ActivityNet-QA
- TGIF (Tumblr GIF)
- LSMDC (Large Scale Movie Description Challenge)
- TVQA / TVQA+
- DramaQA
- VLEP (Video-and-Language Event Prediction)
(*) Internal, non-public datasets
Common Pre-training Strategies
- Masked Language Modeling is often used when the transformer is trained only on text. Some input tokens are masked at random, and the model is trained to predict the masked tokens (words); see the sketch after this list.
- Next Sentence Prediction also works with text-only input and evaluates whether a sentence is an appropriate continuation of the input sentence. By training on both correct and incorrect continuations, the model learns to capture long-term dependencies.
- Masked Region Modeling masks image regions in a similar way to masked language modeling. The model is then trained to predict the features of the masked region.
- Image-Text Matching forces the model to predict whether a sentence is appropriate for a specific image.
- Word-Region Alignment finds correlations between image regions and words.
- Masked Region Classification predicts the object class for each masked region.
- Masked Region Feature Regression learns to regress the masked image region to its visual features.
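
To make the first, text-only objective concrete, here is a minimal sketch of the random masking step used in masked language modeling. The 15% masking ratio, the token ids, and the `[MASK]` id are illustrative BERT-style assumptions, not values taken from any specific model above.

```python
import random

# Minimal masked-language-modeling sketch: mask ~15% of the tokens and keep
# the originals as labels. The [MASK] id and token ids are assumed,
# illustrative values; a real setup would take them from a tokenizer.
MASK_ID = 103          # assumed [MASK] token id (BERT convention)
IGNORE_INDEX = -100    # positions the loss should ignore

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            inputs.append(MASK_ID)   # the model sees [MASK]
            labels.append(tid)       # and must predict the original token
        else:
            inputs.append(tid)
            labels.append(IGNORE_INDEX)
    return inputs, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7099, 6251, 1012])
print(inputs)
print(labels)
```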
RLHF Datasets
- Anthropic RLHF Dataset on HuggingFace [Dataset]
- No Robots: a set of 10k instructions from HuggingFace, modeled after the SFT dataset described in InstructGPT [Dataset]
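
Both datasets can be pulled from the Hugging Face Hub. A hedged sketch, assuming the repository ids "Anthropic/hh-rlhf" and "HuggingFaceH4/no_robots" and the field names from their public listings:

```python
# Hedged sketch of loading the two RLHF-related datasets above from the
# Hugging Face Hub; the repository ids and field names are assumptions
# based on their public Hub listings.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")               # preference pairs
no_robots = load_dataset("HuggingFaceH4/no_robots", split="train")  # SFT-style instructions

print(hh[0]["chosen"][:200])         # the preferred (human-chosen) response
print(no_robots[0]["messages"][0])   # first chat turn of an instruction example
```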
15T Token FineWeb Dataset
Penedo et al. (2024) describe the creation of a 15T-token dataset and make it publicly available. Based on the Chinchilla scaling laws, a 15T-token dataset should be optimal for training a model of roughly 500B parameters. Note that RedPajama contains 20 trillion tokens, but the researchers found that models trained on RedPajama are of poorer quality than models trained on FineWeb, due to the different filtering rules applied. The Llama 3 models at 8B, 70B, and 405B sizes were also trained on 15 trillion tokens, but Meta AI’s training dataset is not publicly available.
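The "optimal for roughly 500B parameters" figure is a simple tokens-per-parameter calculation. A back-of-the-envelope sketch, assuming a Chinchilla-style ratio of roughly 20-30 training tokens per parameter (the exact ratio is an assumption, not taken from the FineWeb paper):

```python
# Back-of-the-envelope Chinchilla-style sizing: compute-optimal parameter
# count ≈ training tokens / (tokens per parameter). The ratio is an assumed
# rule of thumb; commonly quoted values fall roughly between 20 and 30.
def compute_optimal_params(train_tokens: float, tokens_per_param: float) -> float:
    return train_tokens / tokens_per_param

tokens = 15e12  # FineWeb: 15T tokens
for ratio in (20, 30):
    params_b = compute_optimal_params(tokens, ratio) / 1e9
    print(f"{ratio} tokens/param -> ~{params_b:.0f}B parameters")
# -> ~750B at 20 tokens/param, ~500B at 30 tokens/param
```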
References
- Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., & Wolf, T. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. https://arxiv.org/abs/2406.17557