Evaluating Large Language Models Trained on Code

Original paper · Chen et al., 2021

Codex, the model behind GitHub Copilot, made waves when it was first released to the public a couple of years ago (through Copilot, of course; Codex itself is not open-source). I feel like I'm at a stage in my research journey where I've covered a lot of the foundational models and am ready for a short detour into alignment and fine-tuning work. Codex is a great starting point, as it's one of the more recognized early adaptations of GPT-3 fine-tuning.

HumanEval

If you are familiar with Copilot, at least in its original form, you know that it was mainly used to generate functions from descriptions (or docstrings). This is precisely how Codex was evaluated: a manually curated set of programming problems called HumanEval was used to benchmark the trained models. The 164 original programming problems are not the only thing that sets HumanEval apart from other coding benchmarks. The benchmark evaluates functional correctness rather than matching samples against a reference solution. This metric is much better at accounting for the vast solution space available for any given programming problem, though it's clear this space still requires further research. HumanEval defines functional correctness through the completion of a set of unit tests, 7.7 per problem on average. The model generates a number of sample solutions for a given problem and is then evaluated by how many of those samples pass all the given unit tests.
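The paper rolls this up into the pass@k metric: the probability that at least one of k sampled solutions passes every unit test. To keep the variance down, the authors actually draw n ≥ k samples per problem and plug the pass count into an unbiased estimator. Here's a small sketch of that estimator; the function follows the formula given in the paper, while the example counts at the bottom are made up:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from the Codex paper.

    n = total samples generated for a problem,
    c = number of those samples that pass all unit tests,
    k = sampling budget we want to score.
    Returns P(at least one of k randomly drawn samples is correct).
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical counts: 200 samples per problem, 13 of them pass the tests
print(pass_at_k(200, 13, 1))   # ≈ 0.065
print(pass_at_k(200, 13, 10))  # ≈ 0.5
```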

Finally, HumanEval consists only of hand-written programming problems. Each problem contains a docstring, a body, and several unit tests, and the problems assess language comprehension, reasoning, algorithms, and simple mathematics. It is very important that the evaluation set is hand-written, as the model has already seen much of the publicly available code during training.
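To make the format concrete, here is a toy problem in the spirit of HumanEval. It is one I made up rather than an actual benchmark item, but it has the same shape: the model only sees the signature and docstring, generates the body, and is judged solely on whether the hidden unit tests pass:

```python
# Prompt shown to the model (signature + docstring):
def running_max(nums: list) -> list:
    """Return a list where element i is the maximum of nums[:i + 1].

    >>> running_max([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    # --- a model-generated completion would start here ---
    result, current = [], float("-inf")
    for x in nums:
        current = max(current, x)
        result.append(current)
    return result


# Hidden unit tests used to check functional correctness:
def check(candidate):
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    assert candidate([-2, -5]) == [-2, -2]
    assert candidate([]) == []

check(running_max)
```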

Codex

Codex was trained on 179GB of Python files collected from public GitHub repositories. Initially, both a from-scratch model and a GPT-3 fine-tune were trained, as the authors theorized that GPT-3's strong natural language representations would benefit the final product. Surprisingly, they did not observe any immediate improvement from the fine-tuned model, but because the fine-tune converged much faster they continued with only the fine-tuned version. Apart from this there isn't much to say about Codex, at least not when it comes to the code fine-tuning stage. The recipe is glaringly simple and the results are shockingly good. Across the 8 models in the Codex family (12M to 12B parameters), performance on HumanEval scales far better than for SOTA models such as GPT-J and GPT-Neo; a Codex model with comparable performance is often 20-30x smaller than its GPT-J / GPT-Neo counterpart.
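For completeness, here's a rough sketch of how this kind of evaluation loop fits together. The generate_completions callable is a stand-in for whatever sampling API the model exposes, and a real harness (like the one described in the paper) executes candidates in an isolated sandbox with time and memory limits rather than a bare exec:

```python
from typing import Callable, List

def evaluate_problem(prompt: str, tests: str, entry_point: str,
                     generate_completions: Callable[[str, int], List[str]],
                     n: int = 200) -> int:
    """Count how many of n sampled completions pass a problem's unit tests.

    `prompt` is the function signature + docstring, `tests` defines a
    check() function, and `entry_point` names the function under test.
    Plain exec() is only acceptable here for trusted toy code; real
    harnesses sandbox candidate programs.
    """
    passed = 0
    for completion in generate_completions(prompt, n):
        program = prompt + completion + "\n" + tests
        scope = {}
        try:
            exec(program, scope)                  # define candidate + check()
            scope["check"](scope[entry_point])    # run the hidden unit tests
            passed += 1
        except Exception:
            pass  # wrong answer, crash, or syntax error all count as failure
    return passed  # feed (n, passed, k) into pass_at_k from earlier
```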