language:
<p><h1>🐋 The OpenOrca Dataset! 🐋</h1></p>
<a name="dataset-announcement"></a>
We are thrilled to announce the release of the OpenOrca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the Orca paper. It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
Our latest model, the first 7B to score better overall than all previous models below 30B. 98% of Llama2-70b-chat's performance, in a completely open 7B!
Our third model, the first 13B model to score higher than LLaMA1-65B on the HuggingFace Leaderboard! Released in partnership with Platypus.
Our second model, highlighting that we've surpassed the performance reported in the Orca paper. Was #1 at release time, now surpassed by our own OpenOrca-Platypus2-13B. Released in partnership with OpenChat.
OpenOrca-Preview1-13B This model was trained in less than a day, for <$200, with <10% of our data. At release, it beat the current state of the art models on BigBench-Hard and AGIEval. Achieves ~60% of the improvements reported in the Orca paper.
<a name="dataset-summary"></a>
The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope. The data is primarily used for training and evaluation in the field of natural language processing.
<a name="dataset-attribution"></a>
We would like to give special recognition to the following contributors for their significant efforts and dedication:
Teknium
WingLian/Caseus
Eric Hartford
NanoBit
Pankaj
Winddude
Rohan
http://AlignmentLab.ai:
Autometa
Entropi
AtlasUnified
NeverendingToast
NanoBit
WingLian/Caseus
Also of course, as always, TheBloke, for being the backbone of the whole community.
Many thanks to NanoBit and Caseus, makers of Axolotl, for lending us their expertise on the platform that developed and trained manticore, minotaur, and many others!
We are welcoming sponsors or collaborators to help us build these models to the scale they deserve. Please reach out via our socials: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx
Want to visualize our full dataset? Check out our Nomic Atlas Map. <img src="https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B/resolve/main/OpenOrca%20Nomic%20Atlas.png" alt="Atlas Nomic Dataset Map" width="400" height="400" />
<a name="supported-tasks-and-leaderboards"></a>
This dataset supports a range of tasks including language modeling, text generation, and text augmentation. It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing. Further information on leaderboards will be updated as they become available.
<a name="languages"></a>
The language of the data is primarily English.
<a name="dataset-structure"></a>
<a name="data-instances"></a>
A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5. The response is then entered into the response field.
<a name="data-fields"></a>
The fields are:
<a name="data-splits"></a>
The data is unsplit.
<a name="dataset-creation"></a>
<a name="curation-rationale"></a>
The dataset was created to provide a source of augmented text data for researchers and developers. The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4. This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks which all models below 100B parameters had previously performed dramatically worse on.
<a name="source-data"></a>
The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.
<a name="dataset-use"></a>
<a name="use-cases"></a>
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
<a name="usage-caveats"></a>
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements. Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
<a name="getting-started"></a>
This dataset is organized such that it can be naively loaded via Hugging Face datasets library. We recommend using streaming due to the large size of the files. Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
@misc{OpenOrca,
title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces},
author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
year = {2023},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://https://huggingface.co/Open-Orca/OpenOrca}},
}
@misc{mukherjee2023orca,
title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4},
author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah},
year={2023},
eprint={2306.02707},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{longpre2023flan,
title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning},
author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts},
year={2023},
eprint={2301.13688},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@misc{touvron2023llama,
title={Llama 2: Open Foundation and Fine-Tuned Chat Models},
author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom},
year={2023},
eprint= arXiv 2307.09288
}
@software{touvron2023llama,
title={LLaMA: Open and Efficient Foundation Language Models},
author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
journal={arXiv preprint arXiv:2302.13971},
year={2023}
}