Large Language Models - Groups

TRICOTS

A tool for collecting traces from any Python codebase that uses OpenAI’s API.

Dataset
JSON

MACHIAVELLI Benchmark

A dataset of traces from the MACHIAVELLI environment, including API calls and their outcomes.

Dataset
JSON

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM ...

A structured collection of tests for input-output safeguards, including established failure tests, emerging failure tests, and next-gen architecture tests.

Dataset
JSON

3 datasets found