
Running LLM Comparisons and Tests

What You'll Learn

How to compare different AI models, test changes to your prompts, and run experiments to optimize your AI applications for better performance and lower costs.

Why Test and Compare?

Save Money

  • Find cheaper models that work just as well for your use case
  • Optimize prompts to use fewer tokens and reduce costs
  • Identify inefficiencies in your current setup

Improve Quality

  • Compare model outputs side-by-side
  • Test prompt variations to get better results
  • Find the right model for each specific task

Reduce Risk

  • Test before deploying changes to production
  • Understand model behavior before committing to a provider
  • Validate performance under different conditions

Types of Testing You Can Do

Model Comparison

Compare how different AI models handle the same tasks:

  • GPT-4 vs GPT-3.5 for cost/quality tradeoffs
  • Claude vs GPT-4 for different reasoning styles
  • Different model sizes within the same provider

Prompt Testing (A/B Testing)

Test different versions of your prompts:

  • Short vs detailed instructions
  • Different tone (formal vs casual)
  • Different examples in few-shot prompting

Provider Comparison

Compare the same model from different providers:

  • OpenAI direct vs OpenRouter for the same model
  • Cost differences between providers
  • Response time and reliability comparisons

Configuration Testing

Test different settings:

  • Temperature variations (0.1 vs 0.7 vs 1.0)
  • Max token limits
  • Different system prompts
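For example, a temperature sweep is just a loop over the values you want to compare. Here is a minimal sketch, assuming the runforge and openai clients and a testEmail input are set up the way Step 4 below shows:

// Sketch: compare the same prompt and input at three temperatures.
// Assumes runforge, openai, and testEmail are defined as in Step 4 below.
const temperatures = [0.1, 0.7, 1.0];

for (const temperature of temperatures) {
  await runforge.track(
    {
      experiment: 'temperature-sweep',
      variant: `temp-${temperature}`,
      test_case: 'customer-complaint-1'
    },
    () => openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
        { role: 'user', content: testEmail }
      ],
      temperature,
      max_tokens: 150
    })
  );
}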

Setting Up Your First Comparison Test

Step 1: Define What You're Testing

Be specific about:

  • The task: "Summarizing customer reviews"
  • Success criteria: "Accurate, 2-3 sentences, professional tone"
  • What you're comparing: "GPT-4 vs Claude for cost and quality"

Step 2: Prepare Test Data

Create a test set:

  • 5-10 examples for quick tests
  • 50-100 examples for thorough evaluation
  • Real data from your actual use case
  • Edge cases that might cause problems

Example test set for email summarization:

Test Case 1: Long customer complaint email
Test Case 2: Short inquiry about pricing
Test Case 3: Technical support request
Test Case 4: Praise and testimonial
Test Case 5: Multi-issue complex email

Step 3: Set Up Test Configurations

For each model/prompt you want to test:

Configuration A - GPT-4o-mini:

const testA = {
  model: 'gpt-4o-mini',
  provider: 'openai',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
}

Configuration B - Claude Haiku:

const testB = {
  model: 'claude-3-haiku-20240307',
  provider: 'anthropic',
  temperature: 0.3,
  max_tokens: 150,
  prompt: "Summarize this email in 2-3 professional sentences."
}

Step 4: Run the Tests Using RunForge

Create experiments to track each test configuration:

// Test Configuration A
const resultA = await runforge.track(
  { 
    experiment: 'email-summary-comparison',
    variant: 'gpt-4o-mini',
    test_case: 'customer-complaint-1'
  }, 
  () => openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
      { role: 'user', content: testEmail }
    ],
    temperature: 0.3,
    max_tokens: 150
  })
);

// Test Configuration B  
const resultB = await runforge.track(
  { 
    experiment: 'email-summary-comparison',
    variant: 'claude-haiku',
    test_case: 'customer-complaint-1'
  },
  () => anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 150,
    temperature: 0.3,
    messages: [
      { role: 'user', content: `Summarize this email in 2-3 professional sentences:\n\n${testEmail}` }
    ]
  })
);

Analyzing Test Results

Quantitative Metrics (RunForge Dashboard)

Cost Analysis:

  • Total cost per test run
  • Cost per successful completion
  • Token usage (input + output)

Performance Analysis:

  • Average response time
  • Success rate
  • Error patterns

Example comparison:

GPT-4o-mini:   $0.08 total, 1.2s avg, 100% success
Claude Haiku:  $0.05 total, 0.9s avg, 98% success  
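The RunForge dashboard surfaces these numbers automatically. If you also keep your own results array (as in the batch-testing example later in this guide), a rough per-variant rollup might look like this sketch; the cost, latency, and error fields are assumptions about what you stored, not a fixed schema:

// Sketch: aggregate per-variant cost, latency, and success rate from a
// results array like the one built in the batch-testing example below.
function summarizeByVariant(results) {
  const byVariant = {};

  for (const r of results) {
    const s = (byVariant[r.config] ??= { runs: 0, cost: 0, latency: 0, failures: 0 });
    s.runs += 1;
    s.cost += r.cost ?? 0;
    s.latency += r.latency ?? 0;
    if (r.error) s.failures += 1; // assumes you recorded an error flag per run
  }

  return Object.entries(byVariant).map(([variant, s]) => ({
    variant,
    totalCost: s.cost.toFixed(4),
    avgLatencyMs: Math.round(s.latency / s.runs),
    successRate: `${(((s.runs - s.failures) / s.runs) * 100).toFixed(1)}%`
  }));
}

console.table(summarizeByVariant(results));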

Qualitative Assessment (Manual Review)

For each output, rate:

  • Accuracy: Does it capture the key points?
  • Completeness: Is important information missing?
  • Tone: Is it appropriate for your use case?
  • Consistency: Similar quality across different inputs?

Rating scale example:

  • 5 = Perfect, would use as-is
  • 4 = Very good, minor tweaks needed
  • 3 = Good, but needs some editing
  • 2 = Poor, significant problems
  • 1 = Unusable

Creating a Scorecard

Track results in a simple spreadsheet:

| Test Case | Model | Cost | Time | Accuracy | Tone | Overall |
|-----------|--------|--------|------|----------|------|---------|
| Email 1 | GPT-4 | $0.008 | 1.1s | 5 | 4 | 4.5 |
| Email 1 | Claude | $0.005 | 0.8s | 4 | 5 | 4.5 |
| Email 2 | GPT-4 | $0.006 | 1.3s | 4 | 4 | 4.0 |
| Email 2 | Claude | $0.004 | 0.9s | 5 | 4 | 4.5 |
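If you'd rather generate the scorecard than type it by hand, a small sketch like this can turn metrics plus your manual ratings into CSV rows for a spreadsheet. The accuracy and tone values still come from human review, and the field names are illustrative:

// Sketch: export scorecard rows to CSV for a spreadsheet.
// Each row combines tracked metrics with manual ratings you entered yourself.
const rows = [
  { testCase: 'Email 1', model: 'GPT-4', cost: 0.008, timeSec: 1.1, accuracy: 5, tone: 4 },
  { testCase: 'Email 1', model: 'Claude', cost: 0.005, timeSec: 0.8, accuracy: 4, tone: 5 }
];

const header = 'Test Case,Model,Cost,Time,Accuracy,Tone,Overall';
const csv = [
  header,
  ...rows.map(r =>
    [r.testCase, r.model, `$${r.cost}`, `${r.timeSec}s`, r.accuracy, r.tone,
     ((r.accuracy + r.tone) / 2).toFixed(1)].join(',')
  )
].join('\n');

console.log(csv);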

Common Testing Scenarios

Scenario 1: Cost Optimization

  • Goal: Find the cheapest model that maintains quality
  • Method: Test your expensive model against cheaper alternatives
  • Success criteria: Less than 20% quality reduction for 50%+ cost savings

Example:

  • Current: GPT-4 ($0.03/request, quality score 4.5)
  • Test: GPT-4o-mini ($0.006/request, quality score 4.2)
  • Result: 80% cost reduction with only 7% quality loss ✅

Scenario 2: Quality Improvement

  • Goal: Better results without breaking the budget
  • Method: Test different prompt strategies and models
  • Success criteria: Measurable improvement in output quality

Example:

  • Current prompt: "Summarize this"
  • Test prompts:
      • "Create a professional 2-sentence summary"
      • "Identify key points and summarize professionally"
      • "Summarize focusing on customer concerns and action items"
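Running prompt variants uses the same tracking pattern as the model comparison in Step 4, just with the prompt as the variable. A minimal sketch, assuming the same clients and testEmail input; the variant names are illustrative:

// Sketch: A/B test prompt variants against the same model and input.
const promptVariants = {
  'short': 'Summarize this',
  'professional-2-sentence': 'Create a professional 2-sentence summary',
  'key-points': 'Identify key points and summarize professionally',
  'concerns-and-actions': 'Summarize focusing on customer concerns and action items'
};

for (const [name, prompt] of Object.entries(promptVariants)) {
  await runforge.track(
    { experiment: 'prompt-comparison', variant: name, test_case: 'customer-complaint-1' },
    () => openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: prompt },
        { role: 'user', content: testEmail }
      ],
      temperature: 0.3,
      max_tokens: 150
    })
  );
}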

Scenario 3: Speed Optimization

  • Goal: Faster responses for better user experience
  • Method: Test models and configurations for response time
  • Success criteria: Under 2 seconds average response time

Example:

  • Current: Claude Sonnet (3.2s average)
  • Test: Claude Haiku (0.9s average)
  • Result: 72% speed improvement with similar quality ✅

Scenario 4: Reliability Testing

  • Goal: Find the most reliable model/provider combination
  • Method: Test under load and with edge cases
  • Success criteria: Over 99% success rate under normal conditions

Advanced Testing Techniques

Batch Testing

Run many test cases automatically:

const testCases = [
  { id: 'email-1', content: '...', expected_topics: ['refund', 'shipping'] },
  { id: 'email-2', content: '...', expected_topics: ['billing', 'support'] },
  // ... more test cases
];

// `configurations` is an array of config objects like testA/testB above;
// `callAI(config, content)` is your helper that calls the matching provider.
const results = [];

for (const testCase of testCases) {
  for (const config of configurations) {
    const result = await runforge.track(
      {
        experiment: 'batch-comparison-2024-01',
        variant: config.name,
        test_case: testCase.id
      },
      () => callAI(config, testCase.content)
    );

    // Store the tracked result for later analysis
    results.push({
      testCase: testCase.id,
      config: config.name,
      cost: result.cost,
      tokens: result.tokens,
      latency: result.latency,
      output: result.output
    });
  }
}

Statistical Significance

For important decisions, run enough tests to be confident:

  • Small differences: Need 100+ test cases
  • Large differences: 20-30 test cases may be enough
  • Critical decisions: Consider professional A/B testing tools
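Before reaching for dedicated tooling, a rough back-of-the-envelope check on your manual quality ratings can tell you whether a difference is likely real. This sketch compares means against roughly two standard errors; it is a sanity check, not a substitute for a proper statistical test:

// Sketch: rough check of whether two variants' mean quality scores
// differ by more than ~2 standard errors (roughly 95% confidence).
function meanAndStdErr(scores) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / (scores.length - 1);
  return { mean, stdErr: Math.sqrt(variance / scores.length) };
}

function looksSignificant(scoresA, scoresB) {
  const a = meanAndStdErr(scoresA);
  const b = meanAndStdErr(scoresB);
  const diff = Math.abs(a.mean - b.mean);
  const combinedErr = Math.sqrt(a.stdErr ** 2 + b.stdErr ** 2);
  return diff > 2 * combinedErr;
}

// Example: manual quality ratings (1-5) for two variants
console.log(looksSignificant([4, 5, 4, 4, 5], [3, 4, 3, 4, 3]));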

Load Testing

Test how models perform under stress:

  • Concurrent requests: Multiple calls at the same time
  • Rate limits: How many requests per minute before errors
  • Consistency: Quality maintained under load
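A minimal concurrency sketch using Promise.allSettled; runSingleTest is a hypothetical helper that wraps one tracked API call:

// Sketch: fire N concurrent requests and count how many succeed.
// runSingleTest(i) is a hypothetical helper wrapping one tracked API call.
async function loadTest(concurrency) {
  const calls = Array.from({ length: concurrency }, (_, i) => runSingleTest(i));
  const outcomes = await Promise.allSettled(calls);

  const succeeded = outcomes.filter(o => o.status === 'fulfilled').length;
  console.log(`${succeeded}/${concurrency} requests succeeded`);
  return succeeded / concurrency;
}

await loadTest(20);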

Interpreting Results and Making Decisions

When to Choose the Cheaper Option

Choose the less expensive model when:

  • Quality difference is less than 10%
  • Cost savings are more than 30%
  • The use case isn't mission-critical
  • You can implement quality checks elsewhere

When to Pay for Premium

Choose the more expensive model when:

  • Quality difference is more than 20%
  • The use case is customer-facing and critical
  • Cost difference is negligible compared to business value
  • Reliability is significantly better

Gradual Migration Strategy

Don't change everything at once:

  1. Start with low-risk use cases (internal tools, non-critical features)
  2. Monitor closely for 1-2 weeks
  3. Expand gradually to more important applications
  4. Keep the old model as a fallback during the transition
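The fallback in step 4 can be a simple try/catch around the new model call. A minimal sketch, assuming the openai and anthropic clients from the setup section:

// Sketch: try the new (cheaper) model first, fall back to the old one on failure.
async function summarizeWithFallback(email) {
  try {
    const response = await anthropic.messages.create({
      model: 'claude-3-haiku-20240307',
      max_tokens: 150,
      temperature: 0.3,
      messages: [
        { role: 'user', content: `Summarize this email in 2-3 professional sentences:\n\n${email}` }
      ]
    });
    return response.content[0].text;
  } catch (error) {
    // Fall back to the previous production model if the new one errors out.
    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'Summarize this email in 2-3 professional sentences.' },
        { role: 'user', content: email }
      ],
      temperature: 0.3,
      max_tokens: 150
    });
    return response.choices[0].message.content;
  }
}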

Setting Up Ongoing Testing

Continuous Monitoring

Set up regular comparisons:

  • Monthly model comparisons: Are new models better?
  • Quarterly cost optimization: Can we reduce expenses?
  • Performance regression tests: Is quality declining over time?

Automated Quality Checks

Simple automated checks:

  • Length validation: Summaries should be 2-3 sentences
  • Keyword detection: Important terms should be preserved
  • Sentiment consistency: Positive inputs shouldn't become negative
  • Format compliance: Structured outputs match the expected format
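The first two checks above are easy to automate. A minimal sketch; the sentence-count threshold and keyword list are examples, not recommendations:

// Sketch: simple automated checks on a generated summary.
function checkSummary(summary, requiredKeywords = []) {
  // Length validation: roughly 2-3 sentences.
  const sentenceCount = (summary.match(/[.!?]+/g) || []).length;
  const lengthOk = sentenceCount >= 2 && sentenceCount <= 3;

  // Keyword detection: important terms should be preserved.
  const missingKeywords = requiredKeywords.filter(
    k => !summary.toLowerCase().includes(k.toLowerCase())
  );

  return { lengthOk, missingKeywords, passed: lengthOk && missingKeywords.length === 0 };
}

console.log(checkSummary(
  'The customer requests a refund for a delayed order. Support has offered expedited shipping.',
  ['refund', 'shipping']
));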

A/B Testing in Production

For non-critical applications:

  • Send 10% of traffic to the new model
  • Compare results for 1 week
  • Gradually increase the percentage if results are positive
  • Full migration only after confidence is high
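A minimal traffic-split sketch for the first step in that list. callNewModel and callCurrentModel are hypothetical wrappers around your two configurations, and the tracking metadata follows the pattern from earlier examples:

// Sketch: route ~10% of production traffic to the new model,
// tagging each request so the two variants can be compared in RunForge.
const NEW_MODEL_TRAFFIC = 0.10;

async function handleRequest(email) {
  const useNewModel = Math.random() < NEW_MODEL_TRAFFIC;

  return runforge.track(
    {
      experiment: 'production-ab-email-summary',
      variant: useNewModel ? 'claude-haiku' : 'gpt-4o-mini'
    },
    () => (useNewModel ? callNewModel(email) : callCurrentModel(email))
  );
}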

Common Testing Mistakes to Avoid

Testing Too Few Examples

  • Problem: 3-5 examples aren't enough to draw conclusions
  • Solution: Use at least 20 examples for meaningful comparisons

Only Testing Happy Path

  • Problem: Models might fail differently on edge cases
  • Solution: Include difficult, ambiguous, or unusual inputs

Ignoring Cost per Token Differences

  • Problem: Focusing only on total cost, not efficiency
  • Solution: Calculate cost per successful output, not just cost per request
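A quick illustration of that calculation, with made-up numbers:

// Sketch: cost per successful output, not just cost per request.
const totalCost = 1.00;        // dollars spent across all requests
const totalRequests = 120;
const successfulOutputs = 100; // outputs that passed your quality checks

const costPerRequest = totalCost / totalRequests;              // ~$0.0083
const costPerSuccessfulOutput = totalCost / successfulOutputs; // $0.01

console.log({ costPerRequest, costPerSuccessfulOutput });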

Not Testing with Real Data

  • Problem: Synthetic test data doesn't match real usage
  • Solution: Use actual examples from your production environment

Changing Multiple Variables

  • Problem: Testing new model AND new prompt makes it hard to know what caused changes
  • Solution: Test one variable at a time

Documentation and Communication

Test Report Template

# Model Comparison: Email Summarization
**Date**: January 15, 2024
**Goal**: Find cost-effective alternative to GPT-4

## Test Configuration
- **Models tested**: GPT-4, GPT-4o-mini, Claude Haiku
- **Test cases**: 25 customer emails
- **Prompt**: "Summarize this email in 2-3 professional sentences"

## Results Summary
| Model | Avg Cost | Avg Time | Quality Score | Success Rate |
|-------|----------|----------|---------------|--------------|
| GPT-4 | $0.025 | 1.8s | 4.3/5 | 100% |
| GPT-4o-mini | $0.008 | 1.1s | 4.1/5 | 100% |
| Claude Haiku | $0.006 | 0.9s | 4.0/5 | 98% |

## Recommendation
Switch to GPT-4o-mini for 68% cost savings with minimal quality impact.

Sharing Results with Team

  • Summary first: Key findings and recommendations upfront
  • Show examples: Include best and worst outputs from each model
  • Include costs: Always show financial impact
  • Migration plan: Specific steps for implementing changes

Next Steps