Incident Response Plan
Last updated March 2026
Severity Levels
All incidents are classified into one of four severity levels based on user impact and service availability.
| Level | Name | Description | Examples |
|---|---|---|---|
| P1 | Critical | Service is completely down or unusable for all users | Backend API unreachable, all video renders failing, Salesforce callback endpoint returning 500 |
| P2 | Degraded | Service is operational but significantly impaired | Render times 3x slower than normal, intermittent API timeouts, Stripe webhook processing delayed |
| P3 | Minor | A non-critical feature is broken or a small subset of users affected | Single preset type failing, analytics not updating, email notifications not sending |
| P4 | Cosmetic | Visual or UX issue with no functional impact | Styling glitch on player page, typo in narration prompt, minor layout shift in dashboard |
Escalation Path
Incidents follow this escalation chain. Escalate immediately if the current responder cannot resolve the incident within the SLA window.
- First contact: Email hello@purposeforce.org — monitored during business hours (Mon-Fri, 9am-6pm CT)
- On-call engineer: If no response within 30 minutes during business hours, or for P1 incidents reported outside business hours, the on-call engineer is paged automatically
- Engineering lead: If the on-call engineer cannot resolve within the SLA window, or for any P1 lasting longer than 2 hours, escalate to the engineering lead
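The paging rules above can be sketched as a pair of small predicates. This is a minimal illustration, not the actual paging implementation: the function names are hypothetical, and the code assumes "CT" maps to the `America/Chicago` time zone. Only the 30-minute and 2-hour thresholds and the business hours come from the plan.

```typescript
// Hypothetical helper names; only the thresholds come from the escalation path above.
type Severity = "P1" | "P2" | "P3" | "P4";

// Business hours per the plan: Mon-Fri, 9am-6pm Central Time
// (assumed here to mean the America/Chicago zone).
function isBusinessHours(at: Date): boolean {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: "America/Chicago",
    weekday: "short",
    hour: "numeric",
    hour12: false,
  }).formatToParts(at);
  const weekday = parts.find((p) => p.type === "weekday")?.value ?? "";
  // Some runtimes render midnight as "24" when hour12 is false, so wrap it to 0.
  const hour = Number(parts.find((p) => p.type === "hour")?.value ?? "0") % 24;
  const isWeekday = ["Mon", "Tue", "Wed", "Thu", "Fri"].indexOf(weekday) !== -1;
  return isWeekday && hour >= 9 && hour < 18;
}

// True when the on-call engineer should be paged.
function shouldPageOnCall(
  severity: Severity,
  reportedAt: Date,
  minutesWithoutResponse: number
): boolean {
  if (isBusinessHours(reportedAt)) {
    return minutesWithoutResponse >= 30;
  }
  // Outside business hours the plan only specifies automatic paging for P1.
  return severity === "P1";
}

// True when a P1 has run past the 2-hour mark and goes to the engineering lead.
function p1NeedsEngineeringLead(severity: Severity, minutesOpen: number): boolean {
  return severity === "P1" && minutesOpen > 120;
}
```

The plan does not say what happens to P2 through P4 reports that arrive outside business hours, so the sketch leaves them with first contact.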
Response Time SLAs
Response time is measured from when the incident is reported to when the first meaningful acknowledgment or action is taken.
| Severity | Initial Response | Status Update Cadence | Target Resolution |
|---|---|---|---|
| P1 — Critical | 1 hour | Every 30 minutes | 4 hours |
| P2 — Degraded | 4 hours | Every 2 hours | 24 hours |
| P3 — Minor | 1 business day | Daily | 1 week |
| P4 — Cosmetic | Next sprint | Sprint review | Next release |
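For the two severities with clock-based targets, the table can be expressed as data and checked mechanically. The sketch below is illustrative only: the type and function names are hypothetical, and P3 and P4 are left out because their targets depend on business days and sprint schedules rather than elapsed minutes.

```typescript
// Hypothetical names; the minute values are taken from the SLA table above.
type ClockSeverity = "P1" | "P2";

interface SlaTargets {
  initialResponseMin: number;
  statusUpdateMin: number;
  resolutionMin: number;
}

const SLA_TARGETS: Record<ClockSeverity, SlaTargets> = {
  P1: { initialResponseMin: 60, statusUpdateMin: 30, resolutionMin: 240 },
  P2: { initialResponseMin: 240, statusUpdateMin: 120, resolutionMin: 1440 },
};

// Latest time by which the first acknowledgment or action is due.
function responseDeadline(severity: ClockSeverity, reportedAt: Date): Date {
  const minutes = SLA_TARGETS[severity].initialResponseMin;
  return new Date(reportedAt.getTime() + minutes * 60_000);
}

// Response time runs from the report to the first meaningful action.
// If no action has been taken yet, measure against the current time.
function isResponseSlaBreached(
  severity: ClockSeverity,
  reportedAt: Date,
  firstActionAt: Date | null,
  now: Date
): boolean {
  const measuredAt = firstActionAt ?? now;
  return measuredAt.getTime() > responseDeadline(severity, reportedAt).getTime();
}
```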
Common Incidents
Video rendering fails
Symptoms: Videos stuck in "Generating" status indefinitely, or moving to "Failed" status shortly after generation starts.
- Check Vercel function logs — Look for errors in the `/api/pipeline/generate` route. Common issues: Claude API failures, timeout, or malformed data.
- Verify API keys — Confirm the `x-rewind-api-key` header value matches the `Api_Key__c` on the org's `Rewind_License__c` record.
- Check render count — If `Renders_Used_This_Month__c` has reached the tier limit, new generations will be rejected.
- Verify callback connectivity — Ensure the backend can reach the Salesforce org's REST endpoint at `/services/apexrest/rewind/callback`. Check JWT auth configuration.
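The API key and render count checks can be reproduced offline against a copy of the license record. The following is a rough diagnostic sketch, not the backend's actual validation code: the function name, the result labels, and the `tierLimit` parameter are assumptions, since the document does not say where tier limits are stored.

```typescript
// Field names mirror the Salesforce fields named above; everything else is hypothetical.
interface LicenseSnapshot {
  Api_Key__c: string;
  Renders_Used_This_Month__c: number;
}

type GenerateCheck = "api-key-mismatch" | "render-limit-reached" | "ok";

// tierLimit is passed in by the caller because its source is not documented here.
function checkGenerateRequest(
  headerApiKey: string | undefined,
  license: LicenseSnapshot,
  tierLimit: number
): GenerateCheck {
  // The x-rewind-api-key header must match the key stored on the license record.
  if (!headerApiKey || headerApiKey !== license.Api_Key__c) {
    return "api-key-mismatch";
  }
  // Generations are rejected once usage has reached the tier limit.
  if (license.Renders_Used_This_Month__c >= tierLimit) {
    return "render-limit-reached";
  }
  return "ok";
}
```

Running this with the values from a failing org narrows the problem to credentials or quota before digging into the function logs.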
Stripe webhooks failing
Symptoms: License tier not updating after payment, subscription changes not reflected in Salesforce.
- Check webhook secret — Verify the `STRIPE_WEBHOOK_SECRET` environment variable in Vercel matches the signing secret in the Stripe Dashboard under Developers > Webhooks.
- Verify endpoint URL — The webhook endpoint should be `https://rewind.purposeforce.org/api/billing/webhook`. Check for typos or outdated URLs.
- Review Stripe event logs — In Stripe Dashboard, check the webhook event delivery attempts for HTTP status codes and error messages.
- Check Vercel function logs — Look for parsing errors or authentication failures in the webhook handler.
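When the handler logs authentication failures, it can help to inspect the `Stripe-Signature` header from a failed delivery. Stripe sends it as comma-separated `key=value` pairs, with `t` holding a Unix timestamp and `v1` holding a signature. The sketch below only parses the header and checks timestamp age. It does not verify the HMAC, which is best left to the official Stripe library. The function names and the 300-second tolerance are assumptions for illustration.

```typescript
// Hypothetical diagnostic helpers; they do not replace real signature verification.
interface ParsedStripeSignature {
  timestamp: number | null; // Unix seconds from the "t" element, if present
  v1Signatures: string[];   // every "v1" element found in the header
}

function parseStripeSignatureHeader(header: string): ParsedStripeSignature {
  let timestamp: number | null = null;
  const v1Signatures: string[] = [];
  for (const item of header.split(",")) {
    const eq = item.indexOf("=");
    if (eq === -1) continue; // skip malformed elements
    const key = item.slice(0, eq).trim();
    const value = item.slice(eq + 1).trim();
    if (key === "t") {
      const parsed = Number(value);
      timestamp = value !== "" && Number.isFinite(parsed) ? parsed : null;
    } else if (key === "v1") {
      v1Signatures.push(value);
    }
  }
  return { timestamp, v1Signatures };
}

// A stale timestamp suggests a replayed or heavily delayed delivery.
// The 300-second default is an assumed tolerance, not a value from this document.
function isTimestampFresh(
  timestamp: number,
  nowSeconds: number,
  toleranceSeconds: number = 300
): boolean {
  return Math.abs(nowSeconds - timestamp) <= toleranceSeconds;
}
```

A header with no `v1` element or an unparseable `t` points to a proxy or middleware altering the request before it reaches the handler.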
Salesforce callback failing
Symptoms: Generation completes on the backend but narration JSON never arrives in Salesforce. Rewind_Video__c record stays in "Generating" status.
- Check Named Credential — Verify the Named Credential used for callbacks is correctly configured and the authentication provider has a valid refresh token.
- Verify access token — The JWT Bearer or password auth flow in `src/lib/salesforce-auth.ts` may have an expired or revoked token. Check Vercel logs for 401 responses.
- Check org connectivity — Ensure the Salesforce org is accessible (not in maintenance, sandbox not suspended).
- Verify callback payload — The generate pipeline stores narration JSON and theme data on the `Rewind_Video__c` record via the Salesforce REST API.
Rate limit exceeded
Symptoms: API returns 429 status code. Users see "Too many requests" errors.
- Check current usage — The rate limiter allows 30 requests per minute per API key. Verify if a burst of legitimate requests caused the limit.
- Check `Rewind_License__c` render count — Monthly render limits are enforced separately from the per-minute rate limit. Verify `Renders_Used_This_Month__c` vs. the tier limit.
- Review for abuse — Check Vercel logs for repeated requests from a single source that may indicate unauthorized usage or a misconfigured scheduled job.
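To reason about whether a burst would trip the per-minute limit, a sliding-window counter is one way to model 30 requests per minute per API key. This is a sketch under that assumption. The document does not state which algorithm the production limiter uses, and the class name is hypothetical.

```typescript
// Illustrative sliding-window limiter; not necessarily how the backend enforces the limit.
class SlidingWindowLimiter {
  private readonly hits: Map<string, number[]> = new Map();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 30, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if it should receive a 429.
  allow(apiKey: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.hits.get(apiKey) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) {
      recent.push(nowMs); // rejected requests are not counted against the key
    }
    this.hits.set(apiKey, recent);
    return allowed;
  }
}
```

Replaying request timestamps from the logs through a model like this shows whether a legitimate burst alone would have produced the 429s.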
Claude API errors
Symptoms: Video generation starts but fails during narration generation. Errors reference Anthropic or Claude in logs.
- Check Anthropic API key — Verify the `ANTHROPIC_API_KEY` environment variable in Vercel is valid and has not been rotated.
- Check Anthropic rate limits — The Claude API has its own rate limits. If multiple orgs are generating videos simultaneously, the shared key may be throttled. Check status.anthropic.com for service issues.
- Review prompt size — Very large Salesforce datasets can produce prompts that exceed the context window. Check if the failing org has unusually large query results.
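For the prompt size check, a quick estimate is often enough to tell whether an org's data is in the danger zone. The sketch below assumes roughly four characters per token, which is a common rule of thumb and not an exact tokenizer. The token budget is supplied by the caller because this document does not state the model or its context window. All names are hypothetical.

```typescript
// Rough heuristic only: real token counts depend on the model's tokenizer.
const ASSUMED_CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / ASSUMED_CHARS_PER_TOKEN);
}

// tokenBudget is caller-supplied; choose it from the limits of the model in use.
function exceedsTokenBudget(prompt: string, tokenBudget: number): boolean {
  return estimateTokens(prompt) > tokenBudget;
}
```

If the estimate is near or above the budget, the failing org's query results are a likely cause and worth trimming before retrying.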
Recovery Procedures
Restarting the backend
- Open the Vercel dashboard for the Rewind project
- Navigate to Deployments and redeploy the latest production commit
- Monitor the function logs for the first few minutes to confirm healthy operation
- Verify a test render completes successfully from the dev org
Rotating API keys
- Generate a new API key value
- Update the `Api_Key__c` field on the affected `Rewind_License__c` record in Salesforce
- If the backend API key environment variable needs rotation, update it in Vercel and redeploy
- Verify connectivity with a test render
Recovering stuck renders
- Query for stale `Rewind_Video__c` records with status "Rendering" older than 60 minutes
- Update their status to "Failed" so users can retry
- Investigate root cause in Vercel logs before allowing retries
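The stale-record query can be built as a SOQL string with a computed cutoff. This sketch assumes the status lives in a field named `Status__c` and that record age is measured by the standard `CreatedDate` field. Neither detail is confirmed in this document, so adjust both to match the org's schema.

```typescript
// Builds a SOQL query for renders stuck longer than maxAgeMinutes.
// Status__c is an assumed field name; CreatedDate is a standard Salesforce field.
function buildStaleRenderQuery(now: Date, maxAgeMinutes: number = 60): string {
  const cutoff = new Date(now.getTime() - maxAgeMinutes * 60_000);
  // SOQL datetime literals are unquoted; drop the milliseconds from the ISO string.
  const cutoffLiteral = cutoff.toISOString().replace(/\.\d{3}Z$/, "Z");
  return (
    "SELECT Id, CreatedDate FROM Rewind_Video__c " +
    "WHERE Status__c = 'Rendering' " +
    `AND CreatedDate < ${cutoffLiteral}`
  );
}
```

Reviewing the returned records before updating them to "Failed" gives a chance to spot any render that is slow but still progressing.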
Post-Incident Review Template
After any P1 or P2 incident is resolved, complete a post-incident review within 48 hours. Use the following template.
Post-Incident Review
- Incident title: Brief description
- Severity: P1 / P2
- Date & time detected: YYYY-MM-DD HH:MM CT
- Date & time resolved: YYYY-MM-DD HH:MM CT
- Duration: Total time from detection to resolution
- Affected users: Number and description of impacted users/orgs
- Summary: What happened, in plain language
- Root cause: The underlying technical cause
- Timeline: Chronological list of key events and actions taken
- What went well: Things that helped resolve the incident quickly
- What could be improved: Gaps in monitoring, communication, or process
- Action items: Specific follow-up tasks with owners and deadlines
Contact
For all incident reports and operational questions:
- Email: hello@purposeforce.org
- Website: purposeforce.org