About GPT-Code-Clippy (GPT-CC)
GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model called GPT-Codex that is based on GPT-3 and fine-tuned on publicly available code from GitHub.
The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:
- 10+ GitHub stars
- 2+ commits
- Must have a licence
- Exclude forks
- Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contained in The Pile.
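
The selection criteria above can be sketched as a simple filter. This is only an illustration, not the project's actual pipeline: the `Repo` fields, the function names, and the choice to merge with The Pile by taking a union of repository names are all assumptions made for the example, and they do not reflect the real SEART GitHub Search export format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Size threshold taken from the criteria list.
MAX_SIZE_BYTES = 70708


@dataclass(frozen=True)
class Repo:
    """Hypothetical record for one search result (field names are assumed)."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool
    size_bytes: int


def meets_criteria(repo: Repo) -> bool:
    """Return True if the repository passes every listed criterion."""
    return (
        repo.stars >= 10                          # 10+ GitHub stars
        and repo.commits >= 2                     # 2+ commits
        and bool(repo.license)                    # must have a licence
        and not repo.is_fork                      # exclude forks
        and repo.size_bytes < MAX_SIZE_BYTES      # size < 70708 bytes
    )


def build_repo_list(search_results: Iterable[Repo],
                    pile_repo_names: Iterable[str]) -> List[str]:
    """Filter the search results, then merge with repository names from The Pile.

    Merging by name with a set union is an assumption of this sketch.
    """
    selected = {repo.name for repo in search_results if meets_criteria(repo)}
    return sorted(selected | set(pile_repo_names))
```

Repositories that already appear in both sources are listed once, since the merge in this sketch is a set union.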
A full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57