
Running LLM Comparisons and Tests

What You'll Learn

How to compare different AI models, test changes to your prompts, and run experiments to optimize your AI applications for better performance and lower costs.

Why Test and Compare?

Save Money

  • Find cheaper models that work just as well for your use case
  • Optimize prompts to use fewer tokens and reduce costs
  • Identify inefficiencies in your current setup

Improve Quality

  • Compare model outputs side-by-side
  • Test prompt variations to get better results
  • Find the right model for each specific task

Reduce Risk

  • Test before deploying changes to production
  • Understand model behavior before committing to a provider
  • Validate performance under different conditions

Types of Testing You Can Do

Model Comparison

Compare how different AI models handle the same tasks:

  • GPT-4 vs GPT-3.5 for cost/quality tradeoffs
  • Claude vs GPT-4 for different reasoning styles
  • Different model sizes within the same provider

Prompt Testing (A/B Testing)

Test different versions of your prompts:

  • Short vs detailed instructions
  • Different tone (formal vs casual)
  • Different examples in few-shot prompting

Provider Comparison

Compare the same model from different providers:

  • OpenAI direct vs OpenRouter for the same model
  • Cost differences between providers
  • Response time and reliability comparisons

Configuration Testing

Test different settings:

  • Temperature variations (0.1 vs 0.7 vs 1.0)
  • Max token limits
  • Different system prompts
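For example, a temperature sweep is just a loop over the values you want to compare. Here is a minimal sketch, assuming the runforge and openai clients and a testEmail input are set up the way Step 4 below shows:

// Sketch: compare the same prompt and input at three temperatures.
// Assumes runforge, openai, and testEmail are defined as in Step 4 below.
const temperatures = [0.1, 0.7, 1.0];

for (const temperature of temperatures) {
  await runforge.track(
    {
      experiment: 'temperature-sweep',
      variant: `temp-${temperature}`,
      test_case: 'customer-complaint-1'
    },
    () => openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
        { role: 'user', content: testEmail }
      ],
      temperature,
      max_tokens: 150
    })
  );
}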

Setting Up Your First Comparison Test

Step 1: Define What You're Testing

Be specific about:

  • The task: "Summarizing customer reviews"
  • Success criteria: "Accurate, 2-3 sentences, professional tone"
  • What you're comparing: "GPT-4 vs Claude for cost and quality"

Step 2: Prepare Test Data

Create a test set:

  • 5-10 examples for quick tests
  • 50-100 examples for thorough evaluation
  • Real data from your actual use case
  • Edge cases that might cause problems

Example test set for email summarization:

Test Case 1: Long customer complaint email
Test Case 2: Short inquiry about pricing
Test Case 3: Technical support request
Test Case 4: Praise and testimonial
Test Case 5: Multi-issue complex email

Step 3: Set Up Test Configurations

For each model/prompt you want to test:

Configuration A - GPT-4o-mini:

const testA = {
  model: 'gpt-4o-mini',
  provider: 'openai',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
}

Configuration B - Claude Haiku:

const testB = {
  model: 'claude-3-haiku-20240307',
  provider: 'anthropic',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
}

Step 4: Run the Tests Using RunForge

Create experiments to track each test configuration:

// Test Configuration A
const resultA = await runforge.track(
  { 
    experiment: 'email-summary-comparison',
    variant: 'gpt-4o-mini',
    test_case: 'customer-complaint-1'
  }, 
  () => openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
      { role: 'user', content: testEmail }
    ],
    temperature: 0.3,
    max_tokens: 150
  })
);

// Test Configuration B  
const resultB = await runforge.track(
  { 
    experiment: 'email-summary-comparison',
    variant: 'claude-haiku',
    test_case: 'customer-complaint-1'
  },
  () => anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 150,
    temperature: 0.3,
    messages: [
      { role: 'user', content: `Summarize this email in 2-3 professional sentences:\n\n${testEmail}` }
    ]
  })
);

Analyzing Test Results

Quantitative Metrics (RunForge Dashboard)

Cost Analysis:

  • Total cost per test run
  • Cost per successful completion
  • Token usage (input + output)

Performance Analysis:

  • Average response time
  • Success rate
  • Error patterns

Example comparison:

GPT-4o-mini:   $0.08 total, 1.2s avg, 100% success
Claude Haiku:  $0.05 total, 0.9s avg, 98% success  
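The RunForge dashboard surfaces these numbers automatically. If you also keep your own results array (as in the batch-testing example later in this guide), a rough per-variant rollup might look like this sketch; the cost, latency, and error fields are assumptions about what you stored, not a fixed schema:

// Sketch: aggregate per-variant cost, latency, and success rate from a
// results array like the one built in the batch-testing example below.
function summarizeByVariant(results) {
  const byVariant = {};

  for (const r of results) {
    const s = (byVariant[r.config] ??= { runs: 0, cost: 0, latency: 0, failures: 0 });
    s.runs += 1;
    s.cost += r.cost ?? 0;
    s.latency += r.latency ?? 0;
    if (r.error) s.failures += 1; // assumes you recorded an error flag per run
  }

  return Object.entries(byVariant).map(([variant, s]) => ({
    variant,
    totalCost: s.cost.toFixed(4),
    avgLatencyMs: Math.round(s.latency / s.runs),
    successRate: `${(((s.runs - s.failures) / s.runs) * 100).toFixed(1)}%`
  }));
}

console.table(summarizeByVariant(results));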

Qualitative Assessment (Manual Review)

For each output, rate:

  • Accuracy: Does it capture the key points?
  • Completeness: Is important information missing?
  • Tone: Is it appropriate for your use case?
  • Consistency: Similar quality across different inputs?

Rating scale example:

  • 5 = Perfect, would use as-is
  • 4 = Very good, minor tweaks needed
  • 3 = Good, but needs some editing
  • 2 = Poor, significant problems
  • 1 = Unusable

Creating a Scorecard

Track results in a simple spreadsheet:

| Test Case | Model | Cost | Time | Accuracy | Tone | Overall |
|-----------|--------|--------|------|----------|------|---------|
| Email 1 | GPT-4 | $0.008 | 1.1s | 5 | 4 | 4.5 |
| Email 1 | Claude | $0.005 | 0.8s | 4 | 5 | 4.5 |
| Email 2 | GPT-4 | $0.006 | 1.3s | 4 | 4 | 4.0 |
| Email 2 | Claude | $0.004 | 0.9s | 5 | 4 | 4.5 |
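If you'd rather generate the scorecard than type it by hand, a small sketch like this can turn metrics plus your manual ratings into CSV rows for a spreadsheet. The accuracy and tone values still come from human review, and the field names are illustrative:

// Sketch: export scorecard rows to CSV for a spreadsheet.
// Each row combines tracked metrics with manual ratings you entered yourself.
const rows = [
  { testCase: 'Email 1', model: 'GPT-4', cost: 0.008, timeSec: 1.1, accuracy: 5, tone: 4 },
  { testCase: 'Email 1', model: 'Claude', cost: 0.005, timeSec: 0.8, accuracy: 4, tone: 5 }
];

const header = 'Test Case,Model,Cost,Time,Accuracy,Tone,Overall';
const csv = [
  header,
  ...rows.map(r =>
    [r.testCase, r.model, `$${r.cost}`, `${r.timeSec}s`, r.accuracy, r.tone,
     ((r.accuracy + r.tone) / 2).toFixed(1)].join(',')
  )
].join('\n');

console.log(csv);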

Common Testing Scenarios

Scenario 1: Cost Optimization

  • Goal: Find the cheapest model that maintains quality
  • Method: Test your expensive model against cheaper alternatives
  • Success criteria: Less than 20% quality reduction for 50%+ cost savings

Example:

  • Current: GPT-4 ($0.03/request, quality score 4.5)
  • Test: GPT-4o-mini ($0.006/request, quality score 4.2)
  • Result: 80% cost reduction with only 7% quality loss ✅

Scenario 2: Quality Improvement

  • Goal: Better results without breaking the budget
  • Method: Test different prompt strategies and models
  • Success criteria: Measurable improvement in output quality

Example:

  • Current prompt: "Summarize this"
  • Test prompts:
      • "Create a professional 2-sentence summary"
      • "Identify key points and summarize professionally"
      • "Summarize focusing on customer concerns and action items"
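Running prompt variants uses the same tracking pattern as the model comparison in Step 4, just with the prompt as the variable. A minimal sketch, assuming the same clients and testEmail input; the variant names are illustrative:

// Sketch: A/B test prompt variants against the same model and input.
const promptVariants = {
  'short': 'Summarize this',
  'professional-2-sentence': 'Create a professional 2-sentence summary',
  'key-points': 'Identify key points and summarize professionally',
  'concerns-and-actions': 'Summarize focusing on customer concerns and action items'
};

for (const [name, prompt] of Object.entries(promptVariants)) {
  await runforge.track(
    { experiment: 'prompt-comparison', variant: name, test_case: 'customer-complaint-1' },
    () => openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: prompt },
        { role: 'user', content: testEmail }
      ],
      temperature: 0.3,
      max_tokens: 150
    })
  );
}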

Scenario 3: Speed Optimization

  • Goal: Faster responses for better user experience
  • Method: Test models and configurations for response time
  • Success criteria: Under 2 seconds average response time

Example:

  • Current: Claude Sonnet (3.2s average)
  • Test: Claude Haiku (0.9s average)
  • Result: 72% speed improvement with similar quality ✅

Scenario 4: Reliability Testing

  • Goal: Find the most reliable model/provider combination
  • Method: Test under load and with edge cases
  • Success criteria: Over 99% success rate under normal conditions

Advanced Testing Techniques

Batch Testing

Run many test cases automatically:

const testCases = [
  { id: 'email-1', content: '...', expected_topics: ['refund', 'shipping'] },
  { id: 'email-2', content: '...', expected_topics: ['billing', 'support'] },
  // ... more test cases
];

// `configurations` is an array of config objects like testA/testB above;
// `callAI(config, content)` is your helper that calls the matching provider.
const results = [];

for (const testCase of testCases) {
  for (const config of configurations) {
    const result = await runforge.track(
      {
        experiment: 'batch-comparison-2024-01',
        variant: config.name,
        test_case: testCase.id
      },
      () => callAI(config, testCase.content)
    );

    // Store the tracked result for later analysis
    results.push({
      testCase: testCase.id,
      config: config.name,
      cost: result.cost,
      tokens: result.tokens,
      latency: result.latency,
      output: result.output
    });
  }
}

Statistical Significance

For important decisions, run enough tests to be confident:

  • Small differences: Need 100+ test cases
  • Large differences: 20-30 test cases may be enough
  • Critical decisions: Consider professional A/B testing tools
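Before reaching for dedicated tooling, a rough back-of-the-envelope check on your manual quality ratings can tell you whether a difference is likely real. This sketch compares means against roughly two standard errors; it is a sanity check, not a substitute for a proper statistical test:

// Sketch: rough check of whether two variants' mean quality scores
// differ by more than ~2 standard errors (roughly 95% confidence).
function meanAndStdErr(scores) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / (scores.length - 1);
  return { mean, stdErr: Math.sqrt(variance / scores.length) };
}

function looksSignificant(scoresA, scoresB) {
  const a = meanAndStdErr(scoresA);
  const b = meanAndStdErr(scoresB);
  const diff = Math.abs(a.mean - b.mean);
  const combinedErr = Math.sqrt(a.stdErr ** 2 + b.stdErr ** 2);
  return diff > 2 * combinedErr;
}

// Example: manual quality ratings (1-5) for two variants
console.log(looksSignificant([4, 5, 4, 4, 5], [3, 4, 3, 4, 3]));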

Load Testing

Test how models perform under stress:

  • Concurrent requests: Multiple calls at the same time
  • Rate limits: How many requests per minute before errors
  • Consistency: Quality maintained under load
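A minimal concurrency sketch using Promise.allSettled; runSingleTest is a hypothetical helper that wraps one tracked API call:

// Sketch: fire N concurrent requests and count how many succeed.
// runSingleTest(i) is a hypothetical helper wrapping one tracked API call.
async function loadTest(concurrency) {
  const calls = Array.from({ length: concurrency }, (_, i) => runSingleTest(i));
  const outcomes = await Promise.allSettled(calls);

  const succeeded = outcomes.filter(o => o.status === 'fulfilled').length;
  console.log(`${succeeded}/${concurrency} requests succeeded`);
  return succeeded / concurrency;
}

await loadTest(20);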

Interpreting Results and Making Decisions

When to Choose the Cheaper Option

Choose the less expensive model when:

  • Quality difference is less than 10%
  • Cost savings are more than 30%
  • The use case isn't mission-critical
  • You can implement quality checks elsewhere

When to Pay for Premium

Choose the more expensive model when:

  • Quality difference is more than 20%
  • The use case is customer-facing and critical
  • Cost difference is negligible compared to business value
  • Reliability is significantly better

Gradual Migration Strategy

Don't change everything at once:

  1. Start with low-risk use cases (internal tools, non-critical features)
  2. Monitor closely for 1-2 weeks
  3. Expand gradually to more important applications
  4. Keep the old model as a fallback during the transition
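The fallback in step 4 can be a simple try/catch around the new model call. A minimal sketch, assuming the openai and anthropic clients from the setup section:

// Sketch: try the new (cheaper) model first, fall back to the old one on failure.
async function summarizeWithFallback(email) {
  try {
    const response = await anthropic.messages.create({
      model: 'claude-3-haiku-20240307',
      max_tokens: 150,
      temperature: 0.3,
      messages: [
        { role: 'user', content: `Summarize this email in 2-3 professional sentences:\n\n${email}` }
      ]
    });
    return response.content[0].text;
  } catch (error) {
    // Fall back to the previous production model if the new one errors out.
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
        { role: 'user', content: email }
      ],
      temperature: 0.3,
      max_tokens: 150
    });
    return response.choices[0].message.content;
  }
}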

Setting Up Ongoing Testing

Continuous Monitoring

Set up regular comparisons:

  • Monthly model comparisons: Are new models better?
  • Quarterly cost optimization: Can we reduce expenses?
  • Performance regression tests: Is quality declining over time?

Automated Quality Checks

Simple automated checks:

  • Length validation: Summaries should be 2-3 sentences
  • Keyword detection: Important terms should be preserved
  • Sentiment consistency: Positive inputs shouldn't become negative
  • Format compliance: Structured outputs match the expected format
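The first two checks above are easy to automate. A minimal sketch; the sentence-count threshold and keyword list are examples, not recommendations:

// Sketch: simple automated checks on a generated summary.
function checkSummary(summary, requiredKeywords = []) {
  // Length validation: roughly 2-3 sentences.
  const sentenceCount = (summary.match(/[.!?]+/g) || []).length;
  const lengthOk = sentenceCount >= 2 && sentenceCount <= 3;

  // Keyword detection: important terms should be preserved.
  const missingKeywords = requiredKeywords.filter(
    k => !summary.toLowerCase().includes(k.toLowerCase())
  );

  return { lengthOk, missingKeywords, passed: lengthOk && missingKeywords.length === 0 };
}

console.log(checkSummary(
  'The customer requests a refund for a delayed order. Support has offered expedited shipping.',
  ['refund', 'shipping']
));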

A/B Testing in Production

For non-critical applications:

  • Send 10% of traffic to the new model
  • Compare results for 1 week
  • Gradually increase the percentage if results are positive
  • Full migration only after confidence is high
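A minimal traffic-split sketch for the first step in that list. callNewModel and callCurrentModel are hypothetical wrappers around your two configurations, and the tracking metadata follows the pattern from earlier examples:

// Sketch: route ~10% of production traffic to the new model,
// tagging each request so the two variants can be compared in RunForge.
const NEW_MODEL_TRAFFIC = 0.10;

async function handleRequest(email) {
  const useNewModel = Math.random() < NEW_MODEL_TRAFFIC;

  return runforge.track(
    {
      experiment: 'production-ab-email-summary',
      variant: useNewModel ? 'claude-haiku' : 'gpt-4o-mini'
    },
    () => (useNewModel ? callNewModel(email) : callCurrentModel(email))
  );
}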

Common Testing Mistakes to Avoid

Testing Too Few Examples

  • Problem: 3-5 examples aren't enough to draw conclusions
  • Solution: Use at least 20 examples for meaningful comparisons

Only Testing Happy Path

  • Problem: Models might fail differently on edge cases
  • Solution: Include difficult, ambiguous, or unusual inputs

Ignoring Cost per Token Differences

  • Problem: Focusing only on total cost, not efficiency
  • Solution: Calculate cost per successful output, not just cost per request
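A quick illustration of that calculation, with made-up numbers:

// Sketch: cost per successful output, not just cost per request.
const totalCost = 1.00;        // dollars spent across all requests
const totalRequests = 120;
const successfulOutputs = 100; // outputs that passed your quality checks

const costPerRequest = totalCost / totalRequests;              // ~$0.0083
const costPerSuccessfulOutput = totalCost / successfulOutputs; // $0.01

console.log({ costPerRequest, costPerSuccessfulOutput });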

Not Testing with Real Data

  • Problem: Synthetic test data doesn't match real usage
  • Solution: Use actual examples from your production environment

Changing Multiple Variables

  • Problem: Testing new model AND new prompt makes it hard to know what caused changes
  • Solution: Test one variable at a time

Documentation and Communication

Test Report Template

# Model Comparison: Email Summarization
**Date**: January 15, 2024
**Goal**: Find cost-effective alternative to GPT-4

## Test Configuration
- **Models tested**: GPT-4, GPT-4o-mini, Claude Haiku
- **Test cases**: 25 customer emails
- **Prompt**: "Summarize this email in 2-3 professional sentences"

## Results Summary
| Model | Avg Cost | Avg Time | Quality Score | Success Rate |
|-------|----------|----------|---------------|--------------|
| GPT-4 | $0.025 | 1.8s | 4.3/5 | 100% |
| GPT-4o-mini | $0.008 | 1.1s | 4.1/5 | 100% |
| Claude Haiku | $0.006 | 0.9s | 4.0/5 | 98% |

## Recommendation
Switch to GPT-4o-mini for 68% cost savings with minimal quality impact.

Sharing Results with Team

  • Summary first: Key findings and recommendations upfront
  • Show examples: Include best and worst outputs from each model
  • Include costs: Always show financial impact
  • Migration plan: Specific steps for implementing changes

Next Steps