Dolly by Databricks
Democratizing the magic of ChatGPT with open models
About Dolly by Databricks
Databricks’ Dolly, a large language model trained on the Databricks Machine Learning Platform, demonstrates that a two-year-old open source model (GPT-J) can, when subjected to just 30 minutes of fine-tuning on a focused corpus of 50k records (Stanford Alpaca, https://gpt3demo.com/apps/stanford-alpaca), exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based. Databricks believes this finding is important because it demonstrates that the ability to create powerful artificial intelligence technologies is vastly more accessible than previously realized.
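To make the fine-tuning setup concrete, the sketch below shows how an Alpaca-style instruction prompt might be assembled for training or inference. The template text follows the publicly released Stanford Alpaca format; it is an illustrative assumption, not Databricks' verbatim training code, and the commented model name is likewise an assumption.

```python
# Sketch of an Alpaca-style instruction prompt, as used to fine-tune GPT-J
# into an instruction-following model. The exact template is assumed from the
# public Stanford Alpaca release, not taken from Databricks' training code.

PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the instruction-following template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

if __name__ == "__main__":
    print(build_prompt("Explain what instruction tuning is."))
    # At inference time, one could feed this prompt to the fine-tuned model,
    # e.g. via Hugging Face transformers (model name is an assumption):
    # from transformers import pipeline
    # generate = pipeline("text-generation", model="databricks/dolly-v1-6b")
    # print(generate(build_prompt("Explain what instruction tuning is.")))
```

The key point is that instruction tuning only reshapes the prompt/response format the base model sees; the 30-minute fine-tune teaches GPT-J to complete the `### Response:` section rather than continue arbitrary web text.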
Training Data, Bias & Objectionable Content
Like all language models, dolly-v1-6b reflects the content and limitations of its training corpora.
- The Pile: GPT-J’s pre-training corpus contains content mostly collected from the public internet, and like most web-scale datasets, it contains content many users would find objectionable. As such, the model is likely to reflect these shortcomings, potentially overtly when it is explicitly asked to produce objectionable content, and sometimes subtly, as in the case of biased or harmful implicit associations.
- Stanford Alpaca: The instruction tuning corpus for dolly-v1-6b can be assumed to share many of these limitations. In addition, it is known to contain factual inaccuracies, semantic and syntactic irregularities, nonsensical responses, and incorrect mathematical calculations, among other data shortcomings. The model outputs will reflect these limitations.