MCP Usage Guide

This guide shows you how to use PulseStream’s MCP tools in practice. Each section includes the tool calls involved and realistic example output so you know what to expect.

Use the dashboard for at-a-glance overviews, triage, and post-incident review. It gives you filtered views of incident groups, service health trends, and analysis history.

Use MCP when you are actively debugging a problem and want to ask questions in plain language from your IDE or terminal. MCP is conversational. You can follow up, drill down, and carry context across multiple questions without leaving your editor.

The two complement each other. A typical flow: notice a problem in the dashboard, switch to MCP to investigate, come back to the dashboard to verify things are stable after you fix it.


Before investigating anything, confirm your token and workspace are correct.

Prompt: "Am I connected to PulseStream?"

Tool called: whoami

Example output:

PulseStream Connection Status
=============================
Workspace: production (ws_01abc23def456)
Organization: org_78xyz
Role: Customer
Cluster: cluster-uid-af3b2
Scopes: read:namespaces, read:incidents, read:services, chat
Connection: OK

If the workspace or cluster looks wrong, generate a new token at Settings → Access Tokens and call set_token.
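
Normally your MCP client builds these calls for you from the prompt, but it can help to know what goes over the wire. A minimal sketch, assuming the standard MCP tools/call request shape (the request id is illustrative; whoami takes no arguments):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "whoami",
    "arguments": {}
  }
}
```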


Workflow 1: Investigating an active incident

You get paged. There is an open critical incident. You want to know what is happening and what to do about it.

  1. List open incidents

    Prompt: "Show me all open critical incidents"

    Tool called: list_incident_groups with status=open, severity=critical

    Example output:

    Incident Groups (2 total):
    1. OOMKilled: payment-processor - repeated restarts
    - ID: grp_9kx2mq7p
    - Status: open
    - Severity: critical
    - Incidents: 14
    - Service: payment-processor
    - First seen: 4/3/2026, 1:12:00 AM
    - Last seen: 4/3/2026, 3:47:00 AM
    2. CrashLoopBackOff: order-api - exit code 137
    - ID: grp_4rn8wt1c
    - Status: open
    - Severity: critical
    - Incidents: 6
    - Service: order-api
    - First seen: 4/3/2026, 2:55:00 AM
    - Last seen: 4/3/2026, 3:44:00 AM
  2. Get incident details

    Prompt: "Tell me more about the payment-processor incident"

    Tool called: get_incident_group with group_id=grp_9kx2mq7p

    Example output (truncated):

    {
      "id": "grp_9kx2mq7p",
      "title": "OOMKilled: payment-processor - repeated restarts",
      "status": "open",
      "severity": "critical",
      "incident_count": 14,
      "ai_analyzed": false,
      "service_name": "payment-processor",
      "service_namespace": "production",
      "first_seen": "2026-04-03T01:12:00Z",
      "last_seen": "2026-04-03T03:47:00Z",
      "incidents": [
        {
          "id": "inc_m3z9p",
          "error_message": "OOMKilled",
          "event_count": 8,
          "severity": "critical",
          "last_seen": "2026-04-03T03:47:00Z",
          "source": "kubernetes"
        }
      ]
    }
  3. Run AI root cause analysis

    Prompt: "Analyze the payment-processor incident - what is the root cause?"

    Tool called: analyze_incident_group with group_id=grp_9kx2mq7p

    Example output (streamed while the analysis runs, then delivered as a complete response):

    ## Root Cause Analysis: payment-processor OOMKilled
    **Probable cause:** Memory leak in the Stripe webhook handler. The service is
    consuming unbounded memory when processing large webhook payloads (> 500KB),
    specifically in the event deserialization path introduced in the v2.8.1 deploy
    at 00:45 UTC.
    **Evidence:**
    - 14 OOMKilled events since 01:12 UTC, all within 35-90 minutes of pod start
    - Memory usage climbs steadily from ~180MB to the 512MB limit before each kill
    - Pattern consistent with object accumulation, not a spike load issue
    - The 14 events correlate with the Stripe webhook volume seen during peak hours
    **Affected scope:** 3 replicas of payment-processor in namespace production
    **Recommended actions:**
    1. Temporarily increase memory limit to 1Gi to stop the crash loop
    2. Review webhook payload parsing in src/webhooks/stripe.py - look for list
    accumulation across requests
    3. Deploy a fix before reverting the memory limit increase
    4. Set a memory alert at 80% to catch this pattern earlier next time
    **Related services:** order-api depends on payment-processor for checkout
    confirmation. It will degrade if payment-processor remains unstable.
  4. Ask follow-up questions

    You can ask follow-up questions in the same conversation thread. Pass conversation_id back to preserve context across turns.

    Prompt: "What changed in the last deploy that could have caused this?"

    Tool called: send_group_chat_message with group_id=grp_9kx2mq7p, message="What changed in the last deploy...", conversation_id=<from previous response>

    The AI has context from the prior analysis and can give a more targeted answer without re-examining all the evidence. The raw tool calls behind this workflow are sketched after this list.
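
For reference, here is a rough sketch of what your MCP client sends for steps 1-4 above. Each entry is its own MCP tools/call request (they are grouped in one array here only for readability); the request ids are illustrative, and the conversation_id placeholder stands in for whatever value the analysis response returned.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "list_incident_groups",
      "arguments": { "status": "open", "severity": "critical" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "get_incident_group",
      "arguments": { "group_id": "grp_9kx2mq7p" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "analyze_incident_group",
      "arguments": { "group_id": "grp_9kx2mq7p" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 4,
    "method": "tools/call",
    "params": {
      "name": "send_group_chat_message",
      "arguments": {
        "group_id": "grp_9kx2mq7p",
        "message": "What changed in the last deploy that could have caused this?",
        "conversation_id": "<conversation_id from the analysis response>"
      }
    }
  }
]
```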


Workflow 2: Proactive service investigation

You do not have an active incident, but something feels wrong. Your API gateway latency is elevated and you want to understand why.

Prompt: "Why is latency increasing on the api-gateway service in production?"

Tool called: investigate_service with service_name=api-gateway, namespace=production, message="Why is latency increasing on this service?"

The tool scopes the AI investigation to api-gateway in production and streams back findings. The response includes a conversation_id at the end for follow-ups.

Example output:

## Service Investigation: api-gateway (production)
**Observation:** api-gateway is showing a p99 latency increase of ~340ms starting
around 02:20 UTC. The service itself is healthy (2/2 replicas running), but upstream
calls to payment-processor are timing out intermittently.
**Root cause:** The elevated latency is downstream. api-gateway has a 5-second
timeout on payment-processor calls. When payment-processor pods restart due to
OOMKilled (see grp_9kx2mq7p), in-flight requests to api-gateway time out waiting
for a response that never comes.
**Current state:**
- api-gateway: healthy, 2 replicas, no errors in its own logs
- payment-processor: critical - 14 OOMKilled events since 01:12 UTC
- Dependency chain: api-gateway -> payment-processor -> stripe
**Recommended actions:**
1. Fix payment-processor first - that is the actual problem
2. Consider adding a circuit breaker on api-gateway's payment-processor calls
to fail fast instead of waiting on timeouts
3. Return a 503 with a retry-after header instead of a timeout to callers
---
conversation_id: conv_7tq4mr2x

Follow up: "What would a circuit breaker look like for this dependency?"

Tool called: investigate_service with conversation_id=conv_7tq4mr2x

The AI picks up where it left off, with full context of what it already found.
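
As a rough sketch, the initial investigation and this follow-up correspond to two MCP tools/call requests along these lines. They are grouped in one array only for readability; the ids are illustrative, and the follow-up question is assumed to be passed as message, matching the other calls in this guide.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "service_name": "api-gateway",
        "namespace": "production",
        "message": "Why is latency increasing on this service?"
      }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "conversation_id": "conv_7tq4mr2x",
        "message": "What would a circuit breaker look like for this dependency?"
      }
    }
  }
]
```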


Workflow 3: Namespace health check after a deployment

You just deployed a new version to the staging namespace and want to confirm everything looks healthy before promoting to production.

Prompt: "Check the staging namespace - everything healthy after the deployment?"

Tool called: investigate_namespace with namespace=staging, message="Check overall health after a recent deployment. Are any services unhealthy, crashing, or showing elevated error rates?"

Example output:

## Namespace Health: staging
**Summary:** staging is mostly healthy. One service needs attention.
**Healthy services (4):**
- api-gateway: 2/2 replicas running
- order-api: 3/3 replicas running
- notification-service: 1/1 replicas running
- frontend: 2/2 replicas running
**Needs attention (1):**
- worker-jobs: 1/3 replicas running. 2 pods are in CrashLoopBackOff with
exit code 1. Error in pod logs: "DATABASE_URL environment variable not set".
This suggests a missing or incorrectly named environment variable in the
new deployment's ConfigMap.
**Recommended actions:**
1. Check the ConfigMap for worker-jobs - likely missing DATABASE_URL
2. Compare environment variables between the previous and current deployment
3. Redeploy with the correct env vars
**No new incidents detected** in the last 30 minutes for this namespace.
---
conversation_id: conv_2bp8ks5n
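
For reference, a sketch of the underlying call (the request id is illustrative; the tool name and arguments are as described above):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "investigate_namespace",
    "arguments": {
      "namespace": "staging",
      "message": "Check overall health after a recent deployment. Are any services unhealthy, crashing, or showing elevated error rates?"
    }
  }
}
```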

Workflow 4: Resuming a previous investigation

You were investigating a memory issue yesterday and need to pick up where you left off.

  1. List previous conversations

    Prompt: "Show me my recent investigations"

    Tool called: list_conversations

    Example output:

    Investigation Conversations (3 total):
    1. Service investigation: payment-processor
    - ID: conv_7tq4mr2x
    - Scope: namespace (production)
    - Messages: 6
    - Last: "What would a circuit breaker look like for this dependency?..."
    - Last Activity: 4/3/2026, 3:58:00 AM
    2. Service investigation: worker-jobs
    - ID: conv_2bp8ks5n
    - Scope: namespace (staging)
    - Messages: 4
    - Last: "Can you look at the ConfigMap diff between v2.7 and v2.8?..."
    - Last Activity: 4/2/2026, 11:14:00 PM
    3. Incident investigation: grp_9kx2mq7p
    - ID: conv_5xm3qt9a
    - Scope: cluster
    - Messages: 3
    - Last: "What changed in the last deploy that could have caused this?..."
    - Last Activity: 4/3/2026, 2:21:00 AM
    Use the conversation ID with investigate_cluster, investigate_namespace,
    investigate_service, or send_group_chat_message to resume.
  2. Resume the conversation

    Prompt: "Resume the payment-processor investigation and tell me if the OOM issue is resolved"

    Tool called: investigate_service with service_name=payment-processor, namespace=production, conversation_id=conv_7tq4mr2x, message="Is the OOM issue resolved?"

    The AI has the full prior context, including what it found before, and can give a direct update without starting over. Both calls are sketched after this list.
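
A rough sketch of the two calls behind this workflow, grouped in one array only for readability. The ids are illustrative, and list_conversations is assumed to take no arguments here, though it may accept filters.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "list_conversations",
      "arguments": {}
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "service_name": "payment-processor",
        "namespace": "production",
        "conversation_id": "conv_7tq4mr2x",
        "message": "Is the OOM issue resolved?"
      }
    }
  }
]
```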


Multi-turn conversations: what context is retained

When you pass a conversation_id in a follow-up call, the AI has access to:

  • Every question you asked in that conversation
  • Every answer it gave
  • The service, namespace, or incident group the conversation was scoped to

It does not re-run the full investigation from scratch on every turn. Follow-up answers are faster and more targeted because the AI is building on what it already knows.

The conversation_id is always returned at the end of responses from investigate_namespace, investigate_cluster, investigate_service, and send_group_chat_message. Copy it into your next message to keep the thread going.


Tool reference

Use these when you have an active alert or a known incident.

| Tool | When to use |
| --- | --- |
| list_incident_groups | Start here. See all open incidents, filtered by severity or status. |
| get_incident_group | Get raw details and the list of individual incidents in a group. |
| analyze_incident_group | Trigger AI root cause analysis. Use this once per incident group. |
| send_group_chat_message | Ask follow-up questions about a specific incident. |

Use these for proactive debugging when you suspect a problem but there is no incident yet, or when you want to understand system state without a specific alert.

| Tool | When to use |
| --- | --- |
| investigate_service | You know which service you care about. Most targeted. |
| investigate_namespace | You want a health check of a whole namespace - good after deployments. |
| investigate_cluster | You have no idea where the problem is. Broadest scan. |

Use these to gather context before or alongside an investigation.

| Tool | When to use |
| --- | --- |
| list_services | See all services with their namespace, replica count, and analysis status. |
| get_service_details | Get full details on one service, including codebase analysis summary. |
| list_namespaces | See all namespaces with service counts and analysis state. |
| get_namespace | Get full details on one namespace. |
| get_namespace_analysis | Read a previously completed codebase analysis for a namespace. |
| analyze_namespace | Trigger a fresh codebase analysis. Slow but comprehensive. |

Use these to find and resume previous investigations.

| Tool | When to use |
| --- | --- |
| list_conversations | Find a previous investigation you want to continue. |
| get_conversation | Get metadata on a specific conversation before resuming it. |

Use this for audit and compliance questions.

| Tool | When to use |
| --- | --- |
| query_audit_logs | Search for who did what and when. Useful for post-incident review. |