Llama 3.1 405B released. π MagPie-Ultra is the first open dataset using Llama 3.1 405B-Instruct FP8 to generate 50,000 synthetic instruction pairs using the MagPie recipe and
@argilla_io
distilabel. It includes challenging instructions for coding math, data analysis, creative writing, advice seeking, or Brainstorming. βοΈ
MagPie datasets are created by prompting LLMs with "empty" prompts that consist only of starting special tokens, allowing the model to auto-regressively generate user queries and corresponding responses, which are then filtered to select high-quality data. π¨βπ
Note: The dataset is unfiltered but includes quality & difficulty scores, embeddings, topics, and safety scores from ArmorRM and LlamaGuard. π‘
βοΈ Pipeline: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1/blob/main/pipeline.py
π€ Dataset: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1
@argilla_io
distilabel. It includes challenging instructions for coding math, data analysis, creative writing, advice seeking, or Brainstorming. βοΈ
MagPie datasets are created by prompting LLMs with "empty" prompts that consist only of starting special tokens, allowing the model to auto-regressively generate user queries and corresponding responses, which are then filtered to select high-quality data. π¨βπ
Note: The dataset is unfiltered but includes quality & difficulty scores, embeddings, topics, and safety scores from ArmorRM and LlamaGuard. π‘
βοΈ Pipeline: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1/blob/main/pipeline.py
π€ Dataset: https://huggingface.co/datasets/argilla/magpie-ultra-v0.1