GPT-J

GPT-3 Democratized. A 6B parameter open-source version of GPT-3

About GPT-J

GPT-J is the open-source alternative to OpenAI's GPT-3. The model is trained on the Pile, is available for use with Mesh Transformer JAX. Now, thanks to Eleuther AI, anyone can download and use a 6B parameter version of GPT-3.

EleutherAI are the creators of GPT-Neo.

GPT-J-6B performs nearly on par with 6.7B GPT-3 (or Curie) on various zero-shot down-streaming tasks.

Zero-Shot Evaluations

Models roughly sorted by performance, or by FLOPs if not available.

| Model | Weights | Training FLOPs | LAMBADA PPL ↓ | LAMBADA Acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ | Dataset Size (GB) | |-----------------|---------|----------------|--- |--- |--- |--- |--- |-------------------| | Chance | ✔ | 0 | ~a lot | ~0% | 50% | 25% | 25% | 0 | | GPT-3-Ada‡ | ✘ | ----- | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% | ----- | | GPT-2-1.5B | ✔ | ----- | 10.63 | 51.21% | 59.4% | 50.9% | 70.8% | 40 | | GPTNeo-1.3B‡ | ✔ | 3.0e21 | 7.50 | 57.2% | 55.0% | 48.9% | 71.1% | 825 | | Megatron-2.5B* | ✘ | 2.4e21 | ----- | 61.7% | ----- | ----- | ----- | 174 | | GPTNeo-2.7B‡ | ✔ | 6.8e21 | 5.63 | 62.2% | 56.5% | 55.8% | 73.0% | 825 | | GPT-3-1.3B*‡ | ✘ | 2.4e21 | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% | ~800 | | GPT-3-Babbage‡ | ✘ | ----- | 5.58 | 62.4% | 59.0% | 54.5% | 75.5% | ----- | | Megatron-8.3B* | ✘ | 7.8e21 | ----- | 66.5% | ----- | ----- | ----- | 174 | | GPT-3-2.7B*‡ | ✘ | 4.8e21 | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% | ~800 | | Megatron-11B† | ✔ | 1.0e22 | ----- | ----- | ----- | ----- | ----- | 161 | | GPT-J-6B‡ | ✔ | 1.5e22 | 3.99 | 69.7% | 65.3% | 66.1% | 76.5% | 825 | | GPT-3-6.7B*‡ | ✘ | 1.2e22 | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% | ~800 | | GPT-3-Curie‡ | ✘ | ----- | 4.00 | 69.3% | 65.6% | 68.5% | 77.9% | ----- | | GPT-3-13B*‡ | ✘ | 2.3e22 | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% | ~800 | | GPT-3-175B*‡ | ✘ | 3.1e23 | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% | ~800 | | GPT-3-Davinci‡ | ✘ | ----- | 3.0 | 75% | 72% | 78% | 80% | ----- |

* represents evaluation numbers reported by their respective authors, all other numbers are provided by running the lm-evaluation-harness either with the released weights or with API access. Due to subtle implementation differences as well as different zero shot task framing, these might not be directly comparable. See this blog post for more details.

The Megatron-11B model provides no comparable metrics, and several implementations using the released weights do not reproduce the generation quality and evaluations. (see 1 2 3) Thus, evaluation was not attempted.

These models have been trained with data which contains possible test set contamination. The OpenAI GPT-3 models failed to deduplicate training data for certain test sets, while the GPT-Neo models as well as this one is trained on The Pile, which has not been deduplicated against any test sets.

Source: https://github.com/kingoflolz/mesh-transformer-jax/blob/master/README.md

Introducing GPT-J -- An Open Source Version Of GPT-3 (NLP News)