3 datasets found

  • LiveCodeBench

    LiveCodeBench is a contamination-free benchmark for evaluating Large Language Models (LLMs) on code. It continuously collects new problems from competition platforms and covers code generation, self-repair, code execution, and test output prediction.
  • Evol-Instruct-Code-80k

    Evol-Instruct-Code-80k is a dataset of roughly 80,000 code instructions evolved with the Evol-Instruct method, intended for fine-tuning code generation models.
  • HumanEvalFix

    HumanEvalFix is a benchmark for evaluating code repair: given a buggy function and its unit tests, a model must produce a fix. It is part of the HumanEvalPack suite.
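
If these datasets are mirrored on the Hugging Face Hub, they can usually be loaded with the `datasets` library. The sketch below makes that assumption; the repository IDs, config names, and splits are guesses and should be verified on the Hub before use.

    # Minimal sketch, assuming the datasets are hosted on the Hugging Face Hub.
    # The repository IDs, config names, and splits below are assumptions --
    # check the actual Hub pages before relying on them.
    from datasets import load_dataset

    # Assumed ID for the Evol-Instruct code instruction data (train split).
    evol_instruct = load_dataset("nickrosh/Evol-Instruct-Code-80k-v1", split="train")

    # HumanEvalFix is assumed here to be the "python" config of HumanEvalPack;
    # script-based datasets may need trust_remote_code on recent versions.
    humaneval_fix = load_dataset(
        "bigcode/humanevalpack", "python", split="test", trust_remote_code=True
    )

    # LiveCodeBench ships its own loading code and release tags, so it typically
    # needs extra options; consult its Hub page rather than guessing here.

    print(evol_instruct[0])   # expected fields: instruction, output
    print(humaneval_fix[0])   # includes buggy code, canonical fix, and tests

Printing one record from each split is a quick sanity check that the assumed IDs and splits resolve to the expected schema.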