Setting Up Monitoring and Alerts¶
What You'll Learn¶
How to set up alerts that notify you as soon as something important happens in your AI applications, such as a cost spike, a rise in errors, or a performance problem.
Why Use Alerts?¶
Avoid Surprises¶
- Cost overruns: Get notified before your bill gets too high
- Service issues: Know about problems before your users do
- Performance drops: Catch slow responses early
- Usage spikes: Understand when your app gets more traffic
Save Time¶
Instead of manually checking your dashboard:
- Automatic monitoring: Alerts watch 24/7
- Smart notifications: Only get alerts that matter
- Quick response: Fix issues faster with immediate notifications
Types of Alerts Available¶
Cost Alerts¶
Budget Warnings:
- 50% of monthly budget reached
- 80% of monthly budget reached
- 100% of monthly budget reached
- Sudden 2x cost increase in 24 hours
Example: "Your Content Generator project has spent $45 of your $50 monthly budget"
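The budget warnings above boil down to two simple checks. Here is a minimal sketch in Python, assuming you have your current spend and budget as numbers. The function names and parameters are hypothetical and only illustrate the arithmetic; they are not part of any product API.

```python
def crossed_budget_thresholds(spent, budget, thresholds=(0.5, 0.8, 1.0)):
    """Return the budget fractions that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spent / budget
    return [t for t in thresholds if used >= t]


def is_cost_spike(last_24h, previous_24h, factor=2.0):
    """True when the last 24 hours cost at least `factor` times the prior 24 hours."""
    return previous_24h > 0 and last_24h >= factor * previous_24h


# $45 of a $50 budget is 90%, so the 50% and 80% warnings have fired.
print(crossed_budget_thresholds(45, 50))                # [0.5, 0.8]
print(is_cost_spike(last_24h=25.0, previous_24h=10.0))  # True
```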
Performance Alerts¶
Response Time Issues:
- Average response time over 3 seconds
- Any single request taking over 10 seconds
- Response time 50% slower than usual
Example: "Your chatbot responses are averaging 4.2 seconds (usually 1.8 seconds)"
Reliability Alerts¶
Error Rate Problems:
- Success rate drops below 99%
- More than 10 errors in the last hour
- Specific error types (rate limits, authentication, etc.)
Example: "Your Customer Support Bot has failed 15 times in the last hour"
Usage Alerts¶
Traffic Changes:
- Requests at 300% of normal (3x the usual rate)
- No activity for 6+ hours (when you expect activity)
- New models or providers being used
Example: "Your app made 1,200 requests in the last hour (usually 400)"
Setting Up Your First Alert¶
Navigate to Alerts¶
- Go to Settings → Alerts or look for a 🔔 bell icon
- Select your project from the dropdown
- Click "Create Alert" or "Add New Alert"
Choose Alert Type¶
You'll see options like:
- Cost/Budget alerts
- Performance alerts
- Reliability alerts
- Usage/Traffic alerts
Let's start with a budget alert:
Configure Budget Alert¶
- Alert Name: "Monthly Budget Warning"
- Alert Type: Budget/Cost
- Threshold: 80% of monthly budget
- Time Period: Current month
- How to notify: Email, SMS, webhook, etc.
Test Your Alert¶
- Review settings: Make sure everything looks right
- Save the alert: Click "Create Alert" or "Save"
- Test notification: Most systems let you send a test alert
- Verify delivery: Check that you received the test notification
Essential Alerts for Every Project¶
1. Budget Alert (High Priority)¶
Alert Name: "Monthly Budget - 80% Warning"
Type: Cost
Condition: Monthly spend > 80% of budget
Notification: Email + SMS
Why this matters: Prevents unexpected bills
2. High Error Rate (High Priority)¶
Alert Name: "Error Rate Spike"
Type: Reliability
Condition: Success rate < 95% for 15 minutes
Notification: Email + Slack
Why this matters: Users are experiencing failures
3. Slow Performance (Medium Priority)¶
Alert Name: "Response Time Warning"
Type: Performance
Condition: Average response > 3 seconds for 30 minutes
Notification: Email
Why this matters: Poor user experience
4. No Activity (Low Priority)¶
Alert Name: "Service Might Be Down"
Type: Usage
Condition: No requests for 4 hours during business hours
Notification: Email
Why this matters: Your application might have stopped working
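If it helps to see the four essential alerts side by side, they can be written as data plus a small evaluation loop. This is only a sketch: the `AlertRule` class, the metric keys, and `firing_alerts` are hypothetical names chosen for illustration, and you would map them to whatever your monitoring tool actually reports.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class AlertRule:
    name: str
    priority: str
    condition: Callable[[Dict[str, Any]], bool]  # receives the current metrics


# Hypothetical metric keys, mirroring the four alerts described above.
RULES = [
    AlertRule("Monthly Budget - 80% Warning", "high",
              lambda m: m["monthly_spend"] > 0.80 * m["monthly_budget"]),
    AlertRule("Error Rate Spike", "high",
              lambda m: m["success_rate_15m"] < 0.95),
    AlertRule("Response Time Warning", "medium",
              lambda m: m["avg_response_s_30m"] > 3.0),
    AlertRule("Service Might Be Down", "low",
              lambda m: m["business_hours"] and m["hours_since_last_request"] >= 4),
]


def firing_alerts(metrics):
    """Return the names of all rules whose condition holds for `metrics`."""
    return [rule.name for rule in RULES if rule.condition(metrics)]


metrics = {
    "monthly_spend": 45.0, "monthly_budget": 50.0,
    "success_rate_15m": 0.87, "avg_response_s_30m": 1.8,
    "business_hours": True, "hours_since_last_request": 0.1,
}
print(firing_alerts(metrics))
```

With these sample metrics, the budget and error rate rules fire, while the response time and no-activity rules stay quiet.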
Advanced Alert Configuration¶
Smart Thresholds¶
Instead of fixed numbers, use dynamic thresholds:
- "50% higher than last week" instead of "over 100 requests"
- "Response time 2x normal" instead of "over 2 seconds"
- "Cost increase 3x typical daily spend" instead of "$50 per day"
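A dynamic threshold compares the current value against a baseline computed from recent history rather than against a fixed number. The sketch below assumes you can obtain a list of recent values yourself; `exceeds_baseline` is a hypothetical helper, and using the mean as the baseline is just one reasonable choice.

```python
from statistics import mean


def exceeds_baseline(current, history, factor):
    """True when `current` is more than `factor` times the mean of `history`."""
    if not history:
        return False  # no baseline yet, so stay quiet rather than guess
    return current > factor * mean(history)


last_week_hourly_requests = [380, 420, 400, 410, 390]  # mean is 400

# "50% higher than last week" corresponds to factor=1.5.
print(exceeds_baseline(1200, last_week_hourly_requests, factor=1.5))  # True
print(exceeds_baseline(500, last_week_hourly_requests, factor=1.5))   # False
```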
Time-Based Conditions¶
Make alerts smarter with time awareness:
- Business hours only: Skip overnight cost alerts if nighttime activity is normal for your app
- Weekend patterns: Use different thresholds for weekends
- Seasonal adjustments: Account for known busy periods
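Time-aware conditions can be sketched with Python's standard `datetime` module. The business hours (9:00 to 17:00, Monday to Friday) and the request limits below are assumptions for illustration, and both function names are hypothetical.

```python
from datetime import datetime, time


def in_business_hours(moment, start=time(9, 0), end=time(17, 0)):
    """True on Monday-Friday between `start` (inclusive) and `end` (exclusive)."""
    return moment.weekday() < 5 and start <= moment.time() < end


def request_threshold(moment, weekday_limit=1000, weekend_limit=300):
    """Pick a different hourly request threshold for weekends."""
    return weekday_limit if moment.weekday() < 5 else weekend_limit


monday_morning = datetime(2024, 1, 8, 10, 30)
saturday_morning = datetime(2024, 1, 6, 10, 30)
print(in_business_hours(monday_morning), request_threshold(monday_morning))
print(in_business_hours(saturday_morning), request_threshold(saturday_morning))
```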
Escalation Rules¶
Set up multiple notification levels that escalate the longer a problem persists:
1. Warning (5 minutes): Email notification
2. Critical (15 minutes): Email + SMS
3. Emergency (30 minutes): Email + SMS + Phone call
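The escalation levels above can be read as a lookup from "how long has this been going on" to "who gets notified and how". The sketch below assumes the minutes refer to how long the issue has been active; the table layout, channel names, and `escalation_for` are hypothetical.

```python
# Ordered from most to least severe so the first match wins.
ESCALATION_LEVELS = [
    (30, "Emergency", ["email", "sms", "phone"]),
    (15, "Critical", ["email", "sms"]),
    (5, "Warning", ["email"]),
]


def escalation_for(minutes_active):
    """Return (level, channels) for an issue active this long, or None if too recent."""
    for min_minutes, level, channels in ESCALATION_LEVELS:
        if minutes_active >= min_minutes:
            return level, channels
    return None


print(escalation_for(3))
print(escalation_for(20))
print(escalation_for(45))
```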
Notification Channels¶
Email Notifications¶
Best for: Non-urgent alerts, detailed information
Setup: Add your email address in notification settings
Pros: Detailed messages, easy to search and archive
Cons: Might be delayed, can get buried in inbox
SMS/Text Messages¶
Best for: Urgent alerts that need immediate attention
Setup: Add phone number and verify
Pros: Immediate delivery, hard to miss
Cons: Character limits, costs money
Slack Integration¶
Best for: Team notifications, keeping everyone informed
Setup: Connect RunForge to your Slack workspace
Pros: Team visibility, conversation context
Cons: Can be noisy, might get lost in busy channels
Webhook/API Integration¶
Best for: Custom integrations, automated responses
Setup: Configure webhook URL in alert settings
Pros: Can trigger automated responses
Cons: Requires technical setup
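To give a feel for the technical setup a webhook involves, here is a minimal receiver built only on Python's standard library. The payload fields (`name`, `severity`, `message`) are assumptions made for this sketch, not a documented schema, so check what your alerting tool actually sends before relying on them.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alert(body):
    """Turn a JSON alert payload (bytes or str) into a one-line summary."""
    payload = json.loads(body)
    # Assumed field names; adjust to the real payload format.
    name = payload.get("name", "unnamed alert")
    severity = payload.get("severity", "unknown")
    message = payload.get("message", "")
    return f"[{severity.upper()}] {name}: {message}"


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            print(summarize_alert(self.rfile.read(length)))
            self.send_response(204)  # accepted, nothing to return
        except (json.JSONDecodeError, AttributeError):
            self.send_response(400)  # body was not a usable JSON object
        self.end_headers()


def serve(port=8080):
    """Start the receiver on localhost. Blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), AlertWebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver locally. A real deployment would sit behind HTTPS and verify that requests genuinely come from your alerting tool.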
Managing Alert Noise¶
Avoid Alert Fatigue¶
- Start conservative: Begin with fewer, more important alerts
- Tune over time: Adjust thresholds based on experience
- Group related alerts: Don't send 10 alerts for the same issue
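One common way to avoid sending 10 alerts for the same issue is a cooldown: after notifying about an alert, stay silent about that same alert for a while. The class below is a hypothetical sketch of that idea with a 15-minute default; timestamps are plain seconds so the logic is easy to follow.

```python
class AlertDeduplicator:
    """Suppress repeat notifications for the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds=900):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> timestamp of the last notification

    def should_notify(self, alert_key, now):
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down, skip this repeat
        self._last_sent[alert_key] = now
        return True


dedup = AlertDeduplicator(cooldown_seconds=900)
print(dedup.should_notify("error-rate", now=0))    # True: first occurrence
print(dedup.should_notify("error-rate", now=60))   # False: repeat within 15 minutes
print(dedup.should_notify("budget", now=60))       # True: a different alert
```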
Alert Prioritization¶
- Critical: Service down, major security issues
- High: Budget exceeded, high error rates
- Medium: Performance degradation, unusual patterns
- Low: Weekly summaries, minor threshold breaches
Quiet Hours¶
Set up "do not disturb" periods:
- Nights and weekends: Unless truly critical
- Maintenance windows: When you expect issues
- Holiday periods: When usage patterns change
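The main subtlety with quiet hours is that a nightly window usually wraps past midnight. The sketch below handles that case and lets critical alerts through regardless. The 22:00 to 07:00 window, the priority label, and both function names are assumptions for illustration.

```python
from datetime import time


def in_quiet_hours(t, start=time(22, 0), end=time(7, 0)):
    """True when `t` falls in the quiet window, which may wrap past midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window spans midnight


def should_send(priority, t):
    """Critical alerts always go out; everything else waits outside quiet hours."""
    return priority == "critical" or not in_quiet_hours(t)


print(should_send("medium", time(3, 0)))    # False: held during quiet hours
print(should_send("critical", time(3, 0)))  # True: critical bypasses quiet hours
print(should_send("medium", time(12, 0)))   # True: outside quiet hours
```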
Responding to Alerts¶
Immediate Response Checklist¶
When you get an alert:
- Read the full message: Don't just glance at the subject
- Check the dashboard: Get current status and context
- Assess severity: Is this urgent or can it wait?
- Take action if needed: Fix the issue or escalate
- Follow up: Make sure the issue is resolved
Common Alert Scenarios¶
Cost Spike Alert¶
Example: "Your project spent $25 in the last hour (usually $3)"
Investigation steps:
1. Check recent activity in your dashboard
2. Look for unusual request patterns
3. Verify your applications are working normally
4. Check if you accidentally made a lot of expensive API calls
Possible actions:
- Pause non-essential services temporarily
- Investigate and fix any runaway processes
- Rotate API keys if you suspect unauthorized usage
High Error Rate Alert¶
Example: "Success rate dropped to 87% in the last 30 minutes"
Investigation steps:
1. Check what errors are happening (rate limits, timeouts, etc.)
2. Look at your provider's status page for outages
3. Review recent code deployments for bugs
4. Check if API keys are still valid
Possible actions:
- Wait if it's a provider outage
- Roll back recent deployments
- Implement retry logic with backoff
- Contact provider support if needed
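"Retry logic with backoff" means waiting a little longer after each failed attempt instead of retrying immediately. Here is a minimal sketch of exponential backoff with jitter. `call_with_retry` and its parameters are hypothetical, and it retries on any exception for simplicity; in practice you would retry only on errors that are worth retrying, such as rate limits and timeouts.

```python
import random
import time


def call_with_retry(func, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `func`, retrying on exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # simplification: narrow this to retryable errors
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
```

The `sleep` parameter exists so the waiting can be swapped out in tests; in normal use you would leave it at its default.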
Performance Degradation Alert¶
Example: "Average response time is 4.2 seconds (usually 1.8 seconds)"
Investigation steps:
1. Check if specific models are slower than others
2. Look for increased request volume
3. Verify your internet connection and infrastructure
4. Check provider status for performance issues
Possible actions:
- Switch to faster models temporarily
- Reduce request volume
- Optimize your prompts to be shorter
- Scale up your infrastructure if needed
Customizing Alerts for Different Use Cases¶
Production Applications¶
- Very sensitive: Low thresholds, immediate notifications
- 24/7 monitoring: Alerts at any time
- Multiple channels: Email, SMS, and team chat
- Escalation: If not responded to in 30 minutes
Development/Testing¶
- Less sensitive: Higher thresholds, daily summaries
- Business hours: No alerts on nights/weekends
- Email only: Less urgent notification methods
- Weekly summaries: Digest of all activities
Personal Projects¶
- Budget-focused: Mainly cost alerts
- Email notifications: No urgent SMS needed
- Higher thresholds: Don't alert for small issues
- Monthly summaries: Overview of usage patterns
Alert Maintenance¶
Regular Review (Monthly)¶
- Check alert history: Which alerts fired? Were they useful?
- Tune thresholds: Adjust based on your normal usage patterns
- Update contacts: Make sure notification info is current
- Review relevance: Remove alerts you no longer need
Seasonal Adjustments¶
- Holiday patterns: Expect different usage during holidays
- Business cycles: Adjust for known busy/slow periods
- Growth: Update thresholds as your usage grows
- Model changes: New models may have different cost/performance profiles
Troubleshooting Alerts¶
Not Receiving Notifications¶
Check these items:
- Is your email/phone number correct?
- Are notifications going to spam/junk folder?
- Is your phone carrier blocking messages?
- Are webhooks/integrations properly configured?
Too Many False Alarms¶
Common fixes:
- Raise thresholds to be less sensitive
- Add time delays (e.g., "only alert after 15 minutes")
- Use percentage-based thresholds instead of fixed numbers
- Consider business hours restrictions
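A time delay such as "only alert after 15 minutes" means the condition has to hold continuously for that long before anything fires, so a brief blip never triggers a notification. The class below is a hypothetical sketch of that behavior, using plain seconds for timestamps.

```python
class SustainedCondition:
    """Fire only after a condition has held continuously for `hold_seconds`."""

    def __init__(self, hold_seconds=900):
        self.hold_seconds = hold_seconds
        self._breach_started = None  # when the current breach began, if any

    def update(self, breached, now):
        if not breached:
            self._breach_started = None  # condition cleared, reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.hold_seconds


check = SustainedCondition(hold_seconds=900)
print(check.update(True, now=0))     # False: breach just started
print(check.update(True, now=600))   # False: only 10 minutes so far
print(check.update(True, now=900))   # True: held for the full 15 minutes
```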
Missing Important Issues¶
Possible solutions:
- Lower thresholds for critical alerts
- Add multiple alert conditions
- Use escalation rules for persistent problems
- Review alert history to find gaps
Best Practices Summary¶
Starting Out¶
- Start with budget alerts: Protect against unexpected costs
- Add error rate monitoring: Know when things break
- Test all notifications: Make sure alerts actually reach you
- Review after one week: Tune based on initial experience
Long-term Success¶
- Tune regularly: Adjust thresholds as you learn normal patterns
- Document responses: Know what to do for each type of alert
- Train your team: Make sure everyone knows how to respond
- Keep it simple: Too many alerts leads to ignoring all of them
Next Steps¶
- Optimize your dashboard to complement your alerts
- Learn about testing to prevent issues that cause alerts
- Explore use cases to understand normal usage patterns for better alert thresholds