Eval Results

The evaluation set comprises 40 prompts across 8 categories, tested on 9 models.