-
Evaluating large language models trained on code
The paper presents the results of the OpenAI Codex evaluation on generating Python code.
-
Execution-based Evaluation for NL2Bash
A set of 50 prompts for execution-based evaluation on the NL2Bash task.