GPT-Code-Clippy (GPT-CC)

An open source version of GitHub Copilot, a language model

About GPT-Code-Clippy (GPT-CC)

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model based on GPT-3 (called GPT-Codex) that is fine-tuned on publicly available code from GitHub.

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:

  • 10+ GitHub stars
  • 2+ commits
  • Must have a licence
  • Exclude forks
  • Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contained in The Pile.
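
The selection criteria above can be sketched as a simple filter over repository metadata. This is only an illustration, not the project's actual pipeline: the `Repo` record, its field names, and the `combine_with_pile` helper are hypothetical and do not reflect the SEART GitHub Search schema. Combining with The Pile is assumed here to be a plain union of repository names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Size threshold taken from the criteria list ("Size < 70708 bytes").
MAX_SIZE_BYTES = 70708


@dataclass
class Repo:
    """Hypothetical metadata record; field names are assumptions."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool
    size_bytes: int


def meets_criteria(repo: Repo) -> bool:
    """Return True if the repository passes every listed criterion."""
    return (
        repo.stars >= 10                      # 10+ GitHub stars
        and repo.commits >= 2                 # 2+ commits
        and bool(repo.license)                # must have a licence
        and not repo.is_fork                  # exclude forks
        and repo.size_bytes < MAX_SIZE_BYTES  # size < 70708 bytes
    )


def select_repos(repos: Iterable[Repo]) -> List[Repo]:
    """Keep only the repositories that satisfy the criteria."""
    return [r for r in repos if meets_criteria(r)]


def combine_with_pile(selected: Iterable[Repo],
                      pile_repo_names: Iterable[str]) -> List[str]:
    """Assumed combination step: union of names, without duplicates."""
    return sorted({r.name for r in selected} | set(pile_repo_names))


if __name__ == "__main__":
    candidates = [
        Repo("a/ok", 25, 40, "MIT", False, 5000),
        Repo("b/fork", 25, 40, "MIT", True, 5000),
        Repo("c/nolicense", 25, 40, None, False, 5000),
        Repo("d/big", 25, 40, "MIT", False, 70708),
    ]
    kept = select_repos(candidates)
    print([r.name for r in kept])                       # ['a/ok']
    print(combine_with_pile(kept, ["e/pile", "a/ok"]))  # ['a/ok', 'e/pile']
```

Expressing the criteria as a single predicate keeps each rule visible on its own line, so a threshold can be adjusted without touching the rest of the selection logic.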

A full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57