Running LLM Comparisons and Tests¶
What You'll Learn¶
How to compare different AI models, test changes to your prompts, and run experiments to optimize your AI applications for better performance and lower costs.
Why Test and Compare?¶
Save Money¶
- Find cheaper models that work just as well for your use case
- Optimize prompts to use fewer tokens and reduce costs
- Identify inefficiencies in your current setup
Improve Quality¶
- Compare model outputs side-by-side
- Test prompt variations to get better results
- Find the right model for each specific task
Reduce Risk¶
- Test before deploying changes to production
- Understand model behavior before committing to a provider
- Validate performance under different conditions
Types of Testing You Can Do¶
Model Comparison¶
Compare how different AI models handle the same tasks:
- GPT-4 vs GPT-3.5 for cost/quality tradeoffs
- Claude vs GPT-4 for different reasoning styles
- Different model sizes within the same provider
Prompt Testing (A/B Testing)¶
Test different versions of your prompts:
- Short vs detailed instructions
- Different tone (formal vs casual)
- Various examples in few-shot prompting
Provider Comparison¶
Compare the same model from different providers:
- OpenAI direct vs OpenRouter for the same model
- Cost differences between providers
- Response time and reliability comparisons
Configuration Testing¶
Test different settings:
- Temperature variations (0.1 vs 0.7 vs 1.0)
- Max token limits
- Different system prompts
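A temperature sweep, for example, might look like the sketch below. It assumes an initialized OpenAI client and a `testEmail` string like the ones used later in this guide; everything except the temperature stays fixed between runs.

```javascript
// Sketch: sweep temperature while holding everything else constant.
// Assumes an initialized OpenAI client and a `testEmail` string (see the examples below).
const temperatures = [0.1, 0.7, 1.0];

for (const temperature of temperatures) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
      { role: 'user', content: testEmail }
    ],
    temperature,
    max_tokens: 150
  });
  console.log(`temperature=${temperature}:`, response.choices[0].message.content);
}
```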
Setting Up Your First Comparison Test¶
Step 1: Define What You're Testing¶
Be specific about:
- The task: "Summarizing customer reviews"
- Success criteria: "Accurate, 2-3 sentences, professional tone"
- What you're comparing: "GPT-4o-mini vs Claude Haiku for cost and quality"
Step 2: Prepare Test Data¶
Create a test set:
- 5-10 examples for quick tests
- 50-100 examples for thorough evaluation
- Real data from your actual use case
- Edge cases that might cause problems
Example test set for email summarization:
Test Case 1: Long customer complaint email
Test Case 2: Short inquiry about pricing
Test Case 3: Technical support request
Test Case 4: Praise and testimonial
Test Case 5: Multi-issue complex email
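A test set like this can live in a small array (or JSON file) so every configuration sees exactly the same inputs. The `content` strings below are placeholders standing in for your real emails.

```javascript
// Sketch: the five example cases as a reusable test set.
// Replace the placeholder content with real emails from your own data.
const emailTestSet = [
  { id: 'complaint-long', content: '<long customer complaint email>' },
  { id: 'pricing-inquiry', content: '<short inquiry about pricing>' },
  { id: 'support-request', content: '<technical support request>' },
  { id: 'testimonial', content: '<praise and testimonial>' },
  { id: 'multi-issue', content: '<multi-issue complex email>' }
];
```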
Step 3: Set Up Test Configurations¶
For each model/prompt you want to test:
Configuration A - GPT-4o-mini:
const testA = {
  model: 'gpt-4o-mini',
  provider: 'openai',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
};
Configuration B - Claude Haiku:
const testB = {
  model: 'claude-3-haiku-20240307',
  provider: 'anthropic',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
};
Step 4: Run the Tests Using RunForge¶
Create experiments to track each test configuration:
// Test Configuration A
const resultA = await runforge.track(
  {
    experiment: 'email-summary-comparison',
    variant: 'gpt-4o-mini',
    test_case: 'customer-complaint-1'
  },
  () => openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
      { role: 'user', content: testEmail }
    ],
    temperature: 0.3,
    max_tokens: 150
  })
);
// Test Configuration B
const resultB = await runforge.track(
  {
    experiment: 'email-summary-comparison',
    variant: 'claude-haiku',
    test_case: 'customer-complaint-1'
  },
  () => anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 150,
    temperature: 0.3,
    messages: [
      { role: 'user', content: `Summarize this email in 2-3 professional sentences:\n\n${testEmail}` }
    ]
  })
);
Analyzing Test Results¶
Quantitative Metrics (RunForge Dashboard)¶
Cost Analysis:
- Total cost per test run
- Cost per successful completion
- Token usage (input + output)
Performance Analysis:
- Average response time
- Success rate
- Error patterns
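If you also record these numbers locally (as in the batch example later in this guide), a few lines of arithmetic produce the headline metrics. The field names `cost`, `latencyMs`, and `success` below are assumptions about how you log each run; adapt them to your own data.

```javascript
// Sketch: headline metrics from locally recorded runs.
// Assumes each entry looks like { cost, latencyMs, success } -- adjust to your own logging.
function summarize(runs) {
  const successes = runs.filter(r => r.success);
  const totalCost = runs.reduce((sum, r) => sum + r.cost, 0);
  return {
    totalCost,
    costPerSuccess: successes.length ? totalCost / successes.length : null,
    avgLatencyMs: runs.reduce((sum, r) => sum + r.latencyMs, 0) / runs.length,
    successRate: successes.length / runs.length
  };
}
```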
Qualitative Assessment (Manual Review)¶
For each output, rate:
- Accuracy: Does it capture the key points?
- Completeness: Is important information missing?
- Tone: Is it appropriate for your use case?
- Consistency: Similar quality across different inputs?
Rating scale example:
- 5 = Perfect, would use as-is
- 4 = Very good, minor tweaks needed
- 3 = Good, but needs some editing
- 2 = Poor, significant problems
- 1 = Unusable
Creating a Scorecard¶
Track results in a simple spreadsheet:
| Test Case | Model | Cost | Time | Accuracy | Tone | Overall |
|---|---|---|---|---|---|---|
| Email 1 | GPT-4 | $0.008 | 1.1s | 5 | 4 | 4.5 |
| Email 1 | Claude | $0.005 | 0.8s | 4 | 5 | 4.5 |
| Email 2 | GPT-4 | $0.006 | 1.3s | 4 | 4 | 4.0 |
| Email 2 | Claude | $0.004 | 0.9s | 5 | 4 | 4.5 |
Common Testing Scenarios¶
Scenario 1: Cost Optimization¶
- Goal: Find the cheapest model that maintains quality
- Method: Test expensive model vs cheaper alternatives
- Success criteria: Less than 20% quality reduction for 50%+ cost savings
Example:
- Current: GPT-4 ($0.03/request, quality score 4.5)
- Test: GPT-4o-mini ($0.006/request, quality score 4.2)
- Result: 80% cost reduction with only 7% quality loss ✅
Scenario 2: Quality Improvement¶
- Goal: Better results without breaking the budget
- Method: Test different prompt strategies and models
- Success criteria: Measurable improvement in output quality
Example:
- Current prompt: "Summarize this"
- Test prompts:
  - "Create a professional 2-sentence summary"
  - "Identify key points and summarize professionally"
  - "Summarize focusing on customer concerns and action items"
Scenario 3: Speed Optimization¶
- Goal: Faster responses for better user experience
- Method: Test models and configurations for response time
- Success criteria: Under 2 seconds average response time
Example:
- Current: Claude Sonnet (3.2s average)
- Test: Claude Haiku (0.9s average)
- Result: 72% speed improvement with similar quality ✅
Scenario 4: Reliability Testing¶
- Goal: Find the most reliable model/provider combination
- Method: Test under load and edge cases
- Success criteria: Over 99% success rate under normal conditions
Advanced Testing Techniques¶
Batch Testing¶
Run many test cases automatically:
const testCases = [
  { id: 'email-1', content: '...', expected_topics: ['refund', 'shipping'] },
  { id: 'email-2', content: '...', expected_topics: ['billing', 'support'] },
  // ... more test cases
];

// `configurations` is your list of model/prompt configs (e.g. testA and testB above);
// `callAI` is a helper that calls the right provider for a given config.
const results = [];

for (const testCase of testCases) {
  for (const config of configurations) {
    const result = await runforge.track(
      {
        experiment: 'batch-comparison-2024-01',
        variant: config.name,
        test_case: testCase.id
      },
      () => callAI(config, testCase.content)
    );

    // Store result for analysis
    results.push({
      testCase: testCase.id,
      config: config.name,
      cost: result.cost,
      tokens: result.tokens,
      latency: result.latency,
      output: result.output
    });
  }
}
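Once the `results` array is filled, a quick group-by gives a per-variant comparison. This sketch assumes the fields pushed in the loop above.

```javascript
// Sketch: per-variant averages from the `results` array collected above.
const byConfig = {};
for (const r of results) {
  (byConfig[r.config] ??= []).push(r);
}

console.table(
  Object.entries(byConfig).map(([config, rows]) => ({
    config,
    runs: rows.length,
    avgCost: rows.reduce((s, r) => s + r.cost, 0) / rows.length,
    avgLatency: rows.reduce((s, r) => s + r.latency, 0) / rows.length
  }))
);
```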
Statistical Significance¶
For important decisions, run enough tests to be confident:
- Small differences: Need 100+ test cases
- Large differences: 20-30 test cases may be enough
- Critical decisions: Consider professional A/B testing tools
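As a rough rule of thumb, you can compare mean quality scores per variant and treat a difference smaller than about two combined standard errors as noise. This is a back-of-the-envelope check, not a substitute for a proper statistical test, and it assumes at least a handful of scores per variant.

```javascript
// Sketch: rough two-variant comparison of quality scores (e.g. 1-5 ratings).
function meanAndStdErr(scores) {
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance = scores.reduce((s, x) => s + (x - mean) ** 2, 0) / (scores.length - 1);
  return { mean, stdErr: Math.sqrt(variance / scores.length) };
}

function roughlySignificant(scoresA, scoresB) {
  const a = meanAndStdErr(scoresA);
  const b = meanAndStdErr(scoresB);
  const combinedErr = Math.sqrt(a.stdErr ** 2 + b.stdErr ** 2);
  // Treat differences under ~2 combined standard errors as "could be noise".
  return Math.abs(a.mean - b.mean) > 2 * combinedErr;
}
```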
Load Testing¶
Test how models perform under stress:
- Concurrent requests: Multiple calls at the same time
- Rate limits: How many requests per minute before errors
- Consistency: Quality maintained under load
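A minimal concurrency check can fire a batch of identical requests at once and count failures. `callAI` below is the same hypothetical helper used in the batch example, and the concurrency level is just an example value.

```javascript
// Sketch: fire N concurrent requests and measure success rate and batch latency.
async function loadTest(config, input, concurrency = 10) {
  const start = Date.now();
  const settled = await Promise.allSettled(
    Array.from({ length: concurrency }, () => callAI(config, input))
  );
  const succeeded = settled.filter(r => r.status === 'fulfilled').length;
  return {
    concurrency,
    successRate: succeeded / concurrency,
    totalTimeMs: Date.now() - start   // wall-clock time for the whole batch
  };
}
```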
Interpreting Results and Making Decisions¶
When to Choose the Cheaper Option¶
Choose the less expensive model when:
- Quality difference is less than 10%
- Cost savings are more than 30%
- The use case isn't mission-critical
- You can implement quality checks elsewhere
When to Pay for Premium¶
Choose the more expensive model when:
- Quality difference is more than 20%
- The use case is customer-facing and critical
- Cost difference is negligible compared to business value
- Reliability is significantly better
Gradual Migration Strategy¶
Don't change everything at once:
1. Start with low-risk use cases (internal tools, non-critical features)
2. Monitor closely for 1-2 weeks
3. Expand gradually to more important applications
4. Keep the old model as fallback during transition (see the sketch below)
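The fallback in step 4 can be as simple as a try/catch wrapper: try the new configuration first and fall back to the proven one on error. `callAI` is again the hypothetical helper from the batch example.

```javascript
// Sketch: call the new model first, fall back to the old one if the call fails.
async function callWithFallback(newConfig, oldConfig, input) {
  try {
    return await callAI(newConfig, input);
  } catch (error) {
    console.warn('New model failed, falling back:', error.message);
    return callAI(oldConfig, input);
  }
}
```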
Setting Up Ongoing Testing¶
Continuous Monitoring¶
Set up regular comparisons:
- Monthly model comparisons: Are new models better?
- Quarterly cost optimization: Can we reduce expenses?
- Performance regression tests: Is quality declining over time?
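One way to keep these comparisons running is a scheduled job. The sketch below uses node-cron purely as an example scheduler; `runBatchComparison` is a hypothetical wrapper around the batch loop shown earlier, and a CI job or cloud cron works just as well.

```javascript
// Sketch: re-run the batch comparison on the first day of each month.
import cron from 'node-cron';

cron.schedule('0 3 1 * *', async () => {   // 03:00 on the 1st of every month
  await runBatchComparison();              // hypothetical wrapper around the batch loop above
});
```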
Automated Quality Checks¶
Simple automated checks:
- Length validation: Summaries should be 2-3 sentences
- Keyword detection: Important terms should be preserved
- Sentiment consistency: Positive inputs shouldn't become negative
- Format compliance: Structured outputs match expected format
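Checks like the first two are easy to express as small functions run over each output. The sentence thresholds and keywords below are examples you would tune for your own use case.

```javascript
// Sketch: simple automated checks over a single model output.
function checkSummary(output, { requiredKeywords = [] } = {}) {
  const sentences = output.split(/[.!?]+/).filter(s => s.trim().length > 0);
  return {
    lengthOk: sentences.length >= 2 && sentences.length <= 3,   // 2-3 sentence rule
    keywordsOk: requiredKeywords.every(k =>
      output.toLowerCase().includes(k.toLowerCase())
    )
  };
}

// Example: a refund complaint summary should still mention the refund.
console.log(
  checkSummary('The customer requests a refund for a late delivery. They expect a reply this week.', {
    requiredKeywords: ['refund']
  })
);
```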
A/B Testing in Production¶
For non-critical applications:
- Send 10% of traffic to the new model
- Compare results for 1 week
- Gradually increase the percentage if results are positive
- Full migration only after confidence is high
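Traffic splitting can start as a simple random routing decision. The 10% share mirrors the suggestion above, and both paths are tracked under the same experiment so the dashboard can compare them; `newConfig`, `currentConfig`, and `callAI` are hypothetical stand-ins for your own configuration objects and provider helper.

```javascript
// Sketch: route ~10% of production traffic to the new configuration.
const NEW_MODEL_SHARE = 0.1;

async function handleRequest(input) {
  const useNewModel = Math.random() < NEW_MODEL_SHARE;
  const config = useNewModel ? newConfig : currentConfig;   // hypothetical config objects

  return runforge.track(
    {
      experiment: 'production-ab-test',
      variant: useNewModel ? 'new-model' : 'current-model'
    },
    () => callAI(config, input)
  );
}
```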
Common Testing Mistakes to Avoid¶
Testing Too Few Examples¶
- Problem: 3-5 examples aren't enough to draw conclusions
- Solution: Use at least 20 examples for meaningful comparisons
Only Testing Happy Path¶
- Problem: Models might fail differently on edge cases
- Solution: Include difficult, ambiguous, or unusual inputs
Ignoring Cost per Token Differences¶
- Problem: Focusing only on total cost, not efficiency
- Solution: Calculate cost per successful output, not just cost per request
Not Testing with Real Data¶
- Problem: Synthetic test data doesn't match real usage
- Solution: Use actual examples from your production environment
Changing Multiple Variables¶
- Problem: Testing new model AND new prompt makes it hard to know what caused changes
- Solution: Test one variable at a time
Documentation and Communication¶
Test Report Template¶
# Model Comparison: Email Summarization
**Date**: January 15, 2024
**Goal**: Find cost-effective alternative to GPT-4
## Test Configuration
- **Models tested**: GPT-4, GPT-4o-mini, Claude Haiku
- **Test cases**: 25 customer emails
- **Prompt**: "Summarize this email in 2-3 professional sentences"
## Results Summary
| Model | Avg Cost | Avg Time | Quality Score | Success Rate |
|-------|----------|----------|---------------|--------------|
| GPT-4 | $0.025 | 1.8s | 4.3/5 | 100% |
| GPT-4o-mini | $0.008 | 1.1s | 4.1/5 | 100% |
| Claude Haiku | $0.006 | 0.9s | 4.0/5 | 98% |
## Recommendation
Switch to GPT-4o-mini for 68% cost savings with minimal quality impact.
Sharing Results with Team¶
- Summary first: Key findings and recommendations upfront
- Show examples: Include best and worst outputs from each model
- Include costs: Always show financial impact
- Migration plan: Specific steps for implementing changes
Next Steps¶
- Monitor with alerts to catch issues during model transitions
- Optimize your dashboard to track the success of your changes
- Explore provider guides for model-specific optimization tips
- Check use case guides for testing strategies specific to your application type