MCP Usage Guide

This guide shows you how to use PulseStream’s MCP tools in practice. Each section includes the tool calls involved and realistic example output so you know what to expect.

Use the dashboard for at-a-glance overviews, triage, and post-incident review. It gives you filtered views of incident groups, service health trends, and analysis history.

Use MCP when you are actively debugging a problem and want to ask questions in plain language from your IDE or terminal. MCP is conversational. You can follow up, drill down, and carry context across multiple questions without leaving your editor.

The two complement each other. A typical flow: notice a problem in the dashboard, switch to MCP to investigate, come back to the dashboard to verify things are stable after you fix it.


Before investigating anything, confirm your token and workspace are correct.

Prompt: "Am I connected to PulseStream?"

Tool called: whoami

Example output:

PulseStream Connection Status
=============================
Workspace: production (ws_01abc23def456)
Organization: org_78xyz
Role: Customer
Cluster: cluster-uid-af3b2
Scopes: read:namespaces, read:incidents, read:services, chat
Connection: OK

If the workspace or cluster looks wrong, generate a new token at Settings → Access Tokens and call set_token.
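
Normally your MCP client builds these calls for you from the prompt, but it can help to know what goes over the wire. A minimal sketch, assuming the standard MCP tools/call request shape (the request id is illustrative; whoami takes no arguments):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "whoami",
    "arguments": {}
  }
}
```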


Workflow 1: Investigating an active incident

You get paged. There is an open critical incident. You want to know what is happening and what to do about it.

  1. List open incidents

    Prompt: "Show me all open critical incidents"

    Tool called: list_incident_groups with status=open, severity=critical

    Example output:

    Incident Groups (2 total):
    1. OOMKilled: payment-processor - repeated restarts
    - ID: grp_9kx2mq7p
    - Status: open
    - Severity: critical
    - Incidents: 14
    - Service: payment-processor
    - First seen: 4/3/2026, 1:12:00 AM
    - Last seen: 4/3/2026, 3:47:00 AM
    2. CrashLoopBackOff: order-api - exit code 137
    - ID: grp_4rn8wt1c
    - Status: open
    - Severity: critical
    - Incidents: 6
    - Service: order-api
    - First seen: 4/3/2026, 2:55:00 AM
    - Last seen: 4/3/2026, 3:44:00 AM
  2. Get incident details

    Prompt: "Tell me more about the payment-processor incident"

    Tool called: get_incident_group with group_id=grp_9kx2mq7p

    Example output (truncated):

    {
      "id": "grp_9kx2mq7p",
      "title": "OOMKilled: payment-processor - repeated restarts",
      "status": "open",
      "severity": "critical",
      "incident_count": 14,
      "ai_analyzed": false,
      "service_name": "payment-processor",
      "service_namespace": "production",
      "first_seen": "2026-04-03T01:12:00Z",
      "last_seen": "2026-04-03T03:47:00Z",
      "incidents": [
        {
          "id": "inc_m3z9p",
          "error_message": "OOMKilled",
          "event_count": 8,
          "severity": "critical",
          "last_seen": "2026-04-03T03:47:00Z",
          "source": "kubernetes"
        }
      ]
    }
  3. Run AI root cause analysis

    Prompt: "Analyze the payment-processor incident - what is the root cause?"

    Tool called: analyze_incident_group with group_id=grp_9kx2mq7p

    Example output (streamed while the analysis runs, then delivered as a complete response):

    ## Root Cause Analysis: payment-processor OOMKilled
    **Probable cause:** Memory leak in the Stripe webhook handler. The service is
    consuming unbounded memory when processing large webhook payloads (> 500KB),
    specifically in the event deserialization path introduced in the v2.8.1 deploy
    at 00:45 UTC.
    **Evidence:**
    - 14 OOMKilled events since 01:12 UTC, all within 35-90 minutes of pod start
    - Memory usage climbs steadily from ~180MB to the 512MB limit before each kill
    - Pattern consistent with object accumulation, not a spike load issue
    - The 14 events correlate with the Stripe webhook volume seen during peak hours
    **Affected scope:** 3 replicas of payment-processor in namespace production
    **Recommended actions:**
    1. Temporarily increase memory limit to 1Gi to stop the crash loop
    2. Review webhook payload parsing in src/webhooks/stripe.py - look for list
    accumulation across requests
    3. Deploy a fix before reverting the memory limit increase
    4. Set a memory alert at 80% to catch this pattern earlier next time
    **Related services:** order-api depends on payment-processor for checkout
    confirmation. It will degrade if payment-processor remains unstable.
  4. Ask follow-up questions

    You can ask follow-up questions in the same conversation thread. Pass conversation_id back to preserve context across turns.

    Prompt: "What changed in the last deploy that could have caused this?"

    Tool called: send_group_chat_message with group_id=grp_9kx2mq7p, message="What changed in the last deploy...", conversation_id=<from previous response>

    The AI has context from the prior analysis and can give a more targeted answer without re-examining all the evidence. The raw tool calls behind this workflow are sketched after this list.
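
For reference, here is a rough sketch of what your MCP client sends for steps 1-4 above. Each entry is its own MCP tools/call request (they are grouped in one array here only for readability); the request ids are illustrative, and the conversation_id placeholder stands in for whatever value the analysis response returned.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "list_incident_groups",
      "arguments": { "status": "open", "severity": "critical" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "get_incident_group",
      "arguments": { "group_id": "grp_9kx2mq7p" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 3,
    "method": "tools/call",
    "params": {
      "name": "analyze_incident_group",
      "arguments": { "group_id": "grp_9kx2mq7p" }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 4,
    "method": "tools/call",
    "params": {
      "name": "send_group_chat_message",
      "arguments": {
        "group_id": "grp_9kx2mq7p",
        "message": "What changed in the last deploy that could have caused this?",
        "conversation_id": "<conversation_id from the analysis response>"
      }
    }
  }
]
```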


Workflow 2: Proactive service investigation

You do not have an active incident, but something feels wrong. Your API gateway latency is elevated and you want to understand why.

Prompt: "Why is latency increasing on the api-gateway service in production?"

Tool called: investigate_service with service_name=api-gateway, namespace=production, message="Why is latency increasing on this service?"

The tool scopes the AI investigation to api-gateway in production and streams back findings. The response includes a conversation_id at the end for follow-ups.

Example output:

## Service Investigation: api-gateway (production)
**Observation:** api-gateway is showing a p99 latency increase of ~340ms starting
around 02:20 UTC. The service itself is healthy (2/2 replicas running), but upstream
calls to payment-processor are timing out intermittently.
**Root cause:** The elevated latency is downstream. api-gateway has a 5-second
timeout on payment-processor calls. When payment-processor pods restart due to
OOMKilled (see grp_9kx2mq7p), in-flight requests to api-gateway time out waiting
for a response that never comes.
**Current state:**
- api-gateway: healthy, 2 replicas, no errors in its own logs
- payment-processor: critical - 14 OOMKilled events since 01:12 UTC
- Dependency chain: api-gateway -> payment-processor -> stripe
**Recommended actions:**
1. Fix payment-processor first - that is the actual problem
2. Consider adding a circuit breaker on api-gateway's payment-processor calls
to fail fast instead of waiting on timeouts
3. Return a 503 with a retry-after header instead of a timeout to callers
---
conversation_id: conv_7tq4mr2x

Follow up: "What would a circuit breaker look like for this dependency?"

Tool called: investigate_service with conversation_id=conv_7tq4mr2x

The AI picks up where it left off, with full context of what it already found.
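
As a rough sketch, the initial investigation and this follow-up correspond to two MCP tools/call requests along these lines. They are grouped in one array only for readability; the ids are illustrative, and the follow-up question is assumed to be passed as message, matching the other calls in this guide.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "service_name": "api-gateway",
        "namespace": "production",
        "message": "Why is latency increasing on this service?"
      }
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "conversation_id": "conv_7tq4mr2x",
        "message": "What would a circuit breaker look like for this dependency?"
      }
    }
  }
]
```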


Workflow 3: Namespace health check after a deployment

You just deployed a new version to the staging namespace and want to confirm everything looks healthy before promoting to production.

Prompt: "Check the staging namespace - everything healthy after the deployment?"

Tool called: investigate_namespace with namespace=staging, message="Check overall health after a recent deployment. Are any services unhealthy, crashing, or showing elevated error rates?"

Example output:

## Namespace Health: staging
**Summary:** staging is mostly healthy. One service needs attention.
**Healthy services (4):**
- api-gateway: 2/2 replicas running
- order-api: 3/3 replicas running
- notification-service: 1/1 replicas running
- frontend: 2/2 replicas running
**Needs attention (1):**
- worker-jobs: 1/3 replicas running. 2 pods are in CrashLoopBackOff with
exit code 1. Error in pod logs: "DATABASE_URL environment variable not set".
This suggests a missing or incorrectly named environment variable in the
new deployment's ConfigMap.
**Recommended actions:**
1. Check the ConfigMap for worker-jobs - likely missing DATABASE_URL
2. Compare environment variables between the previous and current deployment
3. Redeploy with the correct env vars
**No new incidents detected** in the last 30 minutes for this namespace.
---
conversation_id: conv_2bp8ks5n
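
For reference, a sketch of the underlying call (the request id is illustrative; the tool name and arguments are as described above):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "investigate_namespace",
    "arguments": {
      "namespace": "staging",
      "message": "Check overall health after a recent deployment. Are any services unhealthy, crashing, or showing elevated error rates?"
    }
  }
}
```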

Workflow 4: Resuming a previous investigation

You were investigating a memory issue yesterday and need to pick up where you left off.

  1. List previous conversations

    Prompt: "Show me my recent investigations"

    Tool called: list_conversations

    Example output:

    Investigation Conversations (3 total):
    1. Service investigation: payment-processor
    - ID: conv_7tq4mr2x
    - Scope: namespace (production)
    - Messages: 6
    - Last: "What would a circuit breaker look like for this dependency?..."
    - Last Activity: 4/3/2026, 3:58:00 AM
    2. Service investigation: worker-jobs
    - ID: conv_2bp8ks5n
    - Scope: namespace (staging)
    - Messages: 4
    - Last: "Can you look at the ConfigMap diff between v2.7 and v2.8?..."
    - Last Activity: 4/2/2026, 11:14:00 PM
    3. Incident investigation: grp_9kx2mq7p
    - ID: conv_5xm3qt9a
    - Scope: cluster
    - Messages: 3
    - Last: "What changed in the last deploy that could have caused this?..."
    - Last Activity: 4/3/2026, 2:21:00 AM
    Use the conversation ID with investigate_cluster, investigate_namespace,
    investigate_service, or send_group_chat_message to resume.
  2. Resume the conversation

    Prompt: "Resume the payment-processor investigation and tell me if the OOM issue is resolved"

    Tool called: investigate_service with service_name=payment-processor, namespace=production, conversation_id=conv_7tq4mr2x, message="Is the OOM issue resolved?"

    The AI has the full prior context, including what it found before, and can give a direct update without starting over. Both calls are sketched after this list.
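
A rough sketch of the two calls behind this workflow, grouped in one array only for readability. The ids are illustrative, and list_conversations is assumed to take no arguments here, though it may accept filters.

```json
[
  {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "list_conversations",
      "arguments": {}
    }
  },
  {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
      "name": "investigate_service",
      "arguments": {
        "service_name": "payment-processor",
        "namespace": "production",
        "conversation_id": "conv_7tq4mr2x",
        "message": "Is the OOM issue resolved?"
      }
    }
  }
]
```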


Multi-turn conversations: what context is retained

When you pass a conversation_id in a follow-up call, the AI has access to:

  • Every question you asked in that conversation
  • Every answer it gave
  • The service, namespace, or incident group the conversation was scoped to

It does not re-run the full investigation from scratch on every turn. Follow-up answers are faster and more targeted because the AI is building on what it already knows.

The conversation_id is always returned at the end of responses from investigate_namespace, investigate_cluster, investigate_service, and send_group_chat_message. Copy it into your next message to keep the thread going.


Tool reference

Use these when you have an active alert or a known incident.

| Tool | When to use |
| --- | --- |
| list_incident_groups | Start here. See all open incidents, filtered by severity or status. |
| get_incident_group | Get raw details and the list of individual incidents in a group. |
| analyze_incident_group | Trigger AI root cause analysis. Use this once per incident group. |
| send_group_chat_message | Ask follow-up questions about a specific incident. |

Use these for proactive debugging when you suspect a problem but there is no incident yet, or when you want to understand system state without a specific alert.

| Tool | When to use |
| --- | --- |
| investigate_service | You know which service you care about. Most targeted. |
| investigate_namespace | You want a health check of a whole namespace - good after deployments. |
| investigate_cluster | You have no idea where the problem is. Broadest scan. |

Use these to gather context before or alongside an investigation.

| Tool | When to use |
| --- | --- |
| list_services | See all services with their namespace, replica count, and analysis status. |
| get_service_details | Get full details on one service, including codebase analysis summary. |
| list_namespaces | See all namespaces with service counts and analysis state. |
| get_namespace | Get full details on one namespace. |
| get_namespace_analysis | Read a previously completed codebase analysis for a namespace. |
| analyze_namespace | Trigger a fresh codebase analysis. Slow but comprehensive. |

Use these to find and resume previous investigations.

| Tool | When to use |
| --- | --- |
| list_conversations | Find a previous investigation you want to continue. |
| get_conversation | Get metadata on a specific conversation before resuming it. |

Use this for audit and compliance questions.

| Tool | When to use |
| --- | --- |
| query_audit_logs | Search for who did what and when. Useful for post-incident review. |