Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

“by Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia and Reynold Xin

April 12, 2023 in Company Blog

Two weeks ago, we released Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity (aka instruction-following). Today, we’re releasing Dolly 2.0, the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

Dolly 2.0 is a 12B parameter language model based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset, crowdsourced among Databricks employees.

We are open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.

databricks-dolly-15k dataset

databricks-dolly-15k contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.

To the best of our knowledge, this dataset is the first open source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT. databricks-dolly-15k was authored by more than 5,000 Databricks employees during March and April of 2023. These training records are natural, expressive and designed to represent a wide range of behaviors, from brainstorming and content generation to information extraction and summarization.

Why did we create a new dataset?

As soon as we released Dolly 1.0, we were inundated by requests from people who wanted to try it out. The number one question that we kept getting was “can I use this commercially?”

A critical step in the creation of Dolly 1.0, or any instruction-following LLM, is to train the model on a dataset of instruction and response pairs. Dolly 1.0 was trained for $30 using a dataset that the Stanford Alpaca team had created using the OpenAI API. That dataset contained output from ChatGPT, and as the Stanford team pointed out, the terms of service seek to prevent anyone from creating a model that competes with OpenAI. So, unfortunately, the answer to this common question was, “probably not!”

As far as we know, all the existing well-known instruction-following models (Alpaca, Koala, GPT4All, Vicuna) suffer from this limitation, prohibiting commercial use. To get around this conundrum, we started looking for ways to create a new dataset not “tainted” for commercial use.

How did we do it?”

And more

Posted on: April 12, 2023, 10:58 pm Category: Uncategorized

By: Stephen Abram
