Published on Sep 19
This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within a single source of dataset) deduplications affect the performance of trained models. (2) Proportions of high-quality/highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train a separate 1.3B Cerebras-GPT model with ALiBi and SwiGLU on each. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
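The distinction between local and global deduplication described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that are not from the paper: exact matching on a hash of normalized text stands in for whatever duplicate detection the authors actually used, and the function names and toy corpus are hypothetical.

```python
import hashlib
from typing import Dict, List


def _fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text.

    Assumption: exact-match hashing is a simplification used for illustration,
    not the duplicate-detection method of the paper.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop duplicates within each source; each source has its own seen-set."""
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        seen = set()
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Drop duplicates across all sources; one seen-set is shared by every source."""
    seen = set()
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy corpus: "Hello world" appears in both sources.
toy = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Paris is in France."],
}
local = local_dedup(toy)
glob = global_dedup(toy)
```

In this toy corpus, local deduplication keeps "Hello world" in both sources because each source is checked on its own, while global deduplication keeps only the first copy it encounters. Note that in this sketch the order in which sources are visited decides which copy survives.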
The paper "SlimPajama-DC: Understanding Data Combinations for LLM Training" examines the influence of varying data combinations on training large language models (LLMs) using the SlimPajama dataset, a rigorously deduplicated multi-source dataset.
SlimPajama Dataset: This dataset is a refined and further deduplicated version of the 1.2T-token RedPajama dataset, reduced to 627B tokens. The aim is to train LLMs on cleaner data with fewer duplicates.
Empirical Analysis: The SlimPajama-DC research examines the inherent traits and best practices when employing SlimPajama in LLM training.
Global vs. Local Deduplication: An important distinction is made between global deduplication (across various data sources) and local deduplication (within a single data source) and how each influences the performance of trained models.
Data Quality and Deduplication: The research evaluates the impact of the proportions of high-quality/highly-deduplicated multi-source datasets when combined.
Performance Metrics: The best configuration outperforms a 1.3B model trained on RedPajama by a significant margin, using the same number of training tokens.
Large-Scale Training Infrastructure: All 1.3B models were trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision.
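To make the point about source proportions concrete, the following is a minimal sketch of how a training mixture might be expressed as per-source sampling weights. Everything in it is an assumption for illustration: the weights are hypothetical and are not one of the six configurations from the paper, and the function names are invented here.

```python
import random
from typing import Dict, List


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale raw per-source weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights: Dict[str, float], n_draws: int, seed: int = 0) -> List[str]:
    """Draw the source of each training example according to the mixture."""
    probs = normalize_weights(weights)
    names = list(probs)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(names, weights=[probs[n] for n in names], k=n_draws)


# Hypothetical mixture, NOT taken from the paper.
mixture = {"web": 6.0, "github": 2.0, "books": 1.0, "wikipedia": 1.0}
probs = normalize_weights(mixture)
draws = sample_sources(mixture, n_draws=10_000)
web_fraction = draws.count("web") / len(draws)
```

Changing the entries of `mixture` and retraining under a fixed token budget is the kind of comparison the configurations in the paper perform; this sketch covers only the sampling step.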
Potential Real-World Impact:
Efficient Model Training: Understanding the effects of data combinations and deduplication on model training can result in more efficient and effective LLM training processes.
Improved LLMs: By refining the dataset used for training, the resultant models could provide more accurate and useful outputs in various NLP applications.
Guidance for Future Research: This empirical analysis offers insights and best practices for researchers and industry professionals in the domain of large-scale language model training.
Resource Allocation: Recognizing the importance of deduplication and data combination can guide organizations in allocating resources for data cleaning and deduplication.
Generalizability: While the paper shows promising results with SlimPajama, it remains to be seen how these findings generalize across other datasets and models.
Given the emphasis on understanding the nuances of data combinations, deduplication, and their effect on training LLMs:
I'd rate the real-world impact of this paper as an 8 out of 10.
This research offers valuable insights into optimizing the data used in training large language models, potentially leading to better models and more efficient training processes. The findings could be especially relevant for organizations and researchers aiming to maximize the performance of their LLMs using limited resources.