Eval Results
40 prompts across 8 categories, tested on 9 models.
Code Generation
5.4/10 - Writing code from specifications
5 prompts, 45 responses
Code Review
6.3/10 - Analyzing and improving existing code
5 prompts, 45 responses
Conversation
7.4/10 - Natural dialogue and sensitive topics
5 prompts, 45 responses
Creative Writing
6.4/10 - Fiction, poetry, and imaginative prose
5 prompts, 45 responses
Factual QA
7.4/10 - Answering factual questions accurately
5 prompts, 45 responses
Instruction Following
6.6/10 - Following complex multi-step instructions
5 prompts, 45 responses
Reasoning
7.5/10 - Logic, math, and multi-step problems
5 prompts, 45 responses
Summarization
7.5/10 - Condensing information clearly
5 prompts, 45 responses
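
The counts in this listing can be cross-checked with a short script. This is a minimal sketch: the scores are copied from the categories above, and it assumes each of the 9 models answered every prompt once, which is what 45 responses per 5 prompts suggests. The unweighted mean it computes is a derived figure for illustration, not a number reported in the results.

```python
# Category scores copied from the listing (each out of 10).
scores = {
    "Code Generation": 5.4,
    "Code Review": 6.3,
    "Conversation": 7.4,
    "Creative Writing": 6.4,
    "Factual QA": 7.4,
    "Instruction Following": 6.6,
    "Reasoning": 7.5,
    "Summarization": 7.5,
}

PROMPTS_PER_CATEGORY = 5
MODELS = 9  # assumption: every model answered every prompt exactly once

total_prompts = len(scores) * PROMPTS_PER_CATEGORY
responses_per_category = PROMPTS_PER_CATEGORY * MODELS
total_responses = len(scores) * responses_per_category

# Every category has the same number of responses, so a plain mean
# over categories weights each response equally.
unweighted_mean = sum(scores.values()) / len(scores)

print(f"prompts: {total_prompts}")
print(f"responses per category: {responses_per_category}")
print(f"total responses: {total_responses}")
print(f"unweighted mean score: {unweighted_mean:.2f}/10")
```

The totals agree with the header line: 8 categories of 5 prompts give 40 prompts, and 5 prompts across 9 models give 45 responses per category.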