Marcin Baniowski Connecting people with machines

Agentic DevOps — From Local IDE to Autonomous Agents

Starting Point: LLMs in the IDE

For smaller projects, managing infrastructure with the help of an LLM from a local IDE is more than sufficient: the right set of permissions, context in CLAUDE.md, skills, and MCP work perfectly well. Mistakes are still possible, such as running an action against the wrong environment or approving a command without reading it carefully, but with a human in the loop they are less likely and generally faster to detect and remediate.

There are situations, however, that require granting LLMs greater autonomy — for example, automating responses to alerts about excessive pod resource consumption or frequent restarts. This calls for a set of tools: Kubernetes, AWS, logs, and metrics, which in turn raises the issue of permissions and security. Each agent must be physically isolated with access limited strictly to the minimum required resources and services. Placing an agent on a single pod with access to multiple environments and read/write levels, relying on prompt instructions as the only safeguard, is a recipe for disaster.

This level of access granularity can significantly complicate configuration — every service, environment, and access level requires a separate service/pod, e.g. stage Kubernetes RO vs. production Kubernetes RO vs. stage Kubernetes RW vs. production AWS RO, and so on. With a potentially large number of agents and other participants (humans), transparent permission management becomes critical.
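The resulting matrix of backends can be enumerated explicitly, for instance in a values file driving the deployments. The names below are illustrative, not the article's actual conventions:

```yaml
# Hypothetical backend matrix: one MCP server pod per
# service x environment x access level.
backends:
  - name: mcp-eks-ro-stage   # Kubernetes, staging, read-only
  - name: mcp-eks-ro-prod    # Kubernetes, production, read-only
  - name: mcp-eks-rw-stage   # Kubernetes, staging, read/write
  - name: mcp-aws-ro-prod    # AWS APIs, production, read-only
```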

The proposed architecture centralizes role management behind a single entry point — the MCP Gateway. In this setup, all access services are deployed as pods in an EKS cluster.

MCP Gateway Architecture

[Figure: Agentic DevOps — Secure MCP Gateway Architecture]

MCP Server Permissions

Each MCP server responsible for accessing specific services must run on a separate pod, per environment, per access level (RW, RO). Permissions are granted via IRSA (IAM Roles for Service Accounts): pods are mapped through Kubernetes Service Accounts to appropriately scoped IAM roles and policies.
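A minimal sketch of the IRSA wiring for a read-only EKS backend. The role ARN, namespace, and image are placeholders:

```yaml
# ServiceAccount annotated with an IAM role ARN (IRSA); pods that mount it
# receive temporary AWS credentials scoped to that role's policies.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-eks-ro
  namespace: mcp
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/mcp-eks-ro
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-eks-ro
  namespace: mcp
spec:
  replicas: 1
  selector:
    matchLabels: {app: mcp-eks-ro}
  template:
    metadata:
      labels: {app: mcp-eks-ro}
    spec:
      serviceAccountName: mcp-eks-ro   # binds the pod to the IAM role above
      containers:
        - name: server
          image: example/mcp-eks-ro:latest
```

One Service Account per backend keeps the blast radius of a compromised pod limited to that single role.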

Key Management

API keys are stored in AWS SSM Parameter Store and automatically synchronized to Kubernetes Secrets.

Example script for creating a key for a new agent:

./create-keys.sh --role dev-readonly --user incident-agent-01


The agent uses the key in the request header:

Authorization: Bearer <base64-encoded-key>

The script generates a key and stores it in SSM Parameter Store in the format hash:role/user_name — a JSON file with aggregated keys is automatically synchronized to a Kubernetes Secret via External Secrets.
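The SSM-to-Secret synchronization can be expressed with an External Secrets Operator resource along these lines. The store name and parameter path are assumptions for illustration:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mcp-gateway-keys
  namespace: mcp
spec:
  refreshInterval: 1m            # re-sync the aggregated key file periodically
  secretStoreRef:
    name: aws-parameter-store    # a SecretStore configured for SSM Parameter Store
    kind: SecretStore
  target:
    name: mcp-gateway-keys       # resulting Kubernetes Secret mounted by the gateway
  data:
    - secretKey: keys.json
      remoteRef:
        key: /mcp-gateway/keys   # SSM parameter holding the aggregated JSON
```

With this in place, revoking an agent is a Parameter Store update; the gateway picks up the change on the next refresh.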

RBAC — Roles and Permissions

Each role is a set of allowed backends:

roleMapping:
  devops-full:
    - eks-rw
    - eks-ro-prod
    - aws-rw
    - costs
    - cloudwatch
    - postgres-rw
    - prometheus-dev
    - prometheus-prod
    - cloudflare-edit
    - argo-sync
    - github-rw
  dev-readonly:
    - eks-ro
    - eks-ro-prod
    - aws-ro
    - costs
    - cloudwatch
    - postgres-ro
    - prometheus-dev
    - prometheus-prod
    - cloudflare-ro
    - argo-view
    - github-ro
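On the gateway side, each backend name from the role mapping resolves to an in-cluster MCP server endpoint. A hypothetical routing table (service names and port assumed) might look like:

```yaml
# Hypothetical gateway config: backend name -> in-cluster MCP server endpoint.
# roleMapping decides *whether* a request may reach a backend;
# this table decides *where* it is proxied.
backendRouting:
  eks-ro: http://mcp-eks-ro.mcp.svc.cluster.local:8080/mcp
  eks-rw: http://mcp-eks-rw.mcp.svc.cluster.local:8080/mcp
  aws-ro: http://mcp-aws-ro.mcp.svc.cluster.local:8080/mcp
  prometheus-dev: http://mcp-prometheus-dev.mcp.svc.cluster.local:8080/mcp
```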

Request Flow


1. Agent -> Gateway: POST /mcp/eks-ro/mcp (Bearer abc123)

2. Gateway (Layer 1):
   abc123 -> role "dev-readonly" ✅
   dev-readonly has access to eks-ro? ✅
   -> proxy to pod mcp-eks-ro

3. Pod mcp-eks-ro (Layer 2):
   Agent calls: eks_get_pods(namespace="production")
   Pod -> k8s API -> GET /api/v1/namespaces/production/pods
   k8s RBAC: Service Account mcp-eks-ro -> ClusterRole mcp-viewer -> get pods ✅

   Agent calls: eks_delete_pod(name="nginx-abc")
   k8s RBAC: mcp-viewer does not have "delete" verb -> 403 FORBIDDEN ❌

4. Pod mcp-aws-ro (Layer 3):
   aws_describe_instances -> ec2:Describe* -> OK ✅
   aws_terminate_instance -> missing ec2:TerminateInstances -> AccessDenied ❌
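The Layer 2 behavior in the flow above follows from a read-only ClusterRole bound to the pod's Service Account, for example:

```yaml
# Read-only ClusterRole: only get/list/watch verbs, so any delete
# attempt from the mcp-eks-ro pod fails with 403 at the k8s API.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcp-viewer
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mcp-eks-ro-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mcp-viewer
subjects:
  - kind: ServiceAccount
    name: mcp-eks-ro
    namespace: mcp
```

Layer 3 works the same way one level down: the pod's IAM role carries only `ec2:Describe*`-style read permissions, so mutating AWS calls are denied regardless of what the agent asks for.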

Example: Incident Response Agent

ALERT: "PodCrashLooping: payment-service in production (47 restarts)"

AGENT: "Pod is crash-looping. Checking status and events."
  └─► eks.get_events(namespace="production", name="payment-service")
  ◄── [{reason: "OOMKilled", count: 47}]

AGENT: "OOMKilled. Checking memory usage."
  └─► prometheus.query("container_memory_usage_bytes{pod=~'payment-service.*'}")
  ◄── 500MB

AGENT: "Checking the memory limit."
  └─► eks.get_resource(kind="Deployment", name="payment-service")
  ◄── limits.memory: 512Mi

AGENT: "500/512MB — close to the limit. Checking logs."
  └─► eks.get_pod_logs(pod="payment-service-abc", tail_lines=200)
  ◄── ["DB connection pool exhausted", "Retrying query...", "OOM"]

AGENT: "Connection pool exhausted — database issue."
  └─► postgres.diagnostics(env="production")
  ◄── {active_connections: 98, max_connections: 100,
       slow_queries: [{query: "SELECT ... JOIN ...", mean_ms: 4500}]}

AGENT -> generates report -> Slack + optionally a PR with a fix

Every tool call passes through the MCP Gateway and is subject to permission verification. An agent with the dev-readonly role can collect data for a report but cannot restart pods or scale deployments.

This flow can be extended — for example, by submitting PRs to the ArgoCD repository with a proposal to increase resource limits, or even committing directly to trigger an automatic sync and deployment with updated memory or CPU values, e.g. on staging.
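Such a PR could be as small as a patch to the Deployment's resource limits in the ArgoCD repository. The values below are illustrative:

```yaml
# Proposed fix for the OOMKilled payment-service: raise the memory limit.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      containers:
        - name: payment-service
          resources:
            limits:
              memory: 1Gi   # was 512Mi; pod peaked at ~500Mi before OOMKill
```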

Audit and Observability

Two complementary levels:

Layer         | What it logs                                              | Tooling
--------------|-----------------------------------------------------------|-------------------------------------
Gateway audit | Who called which MCP tool, with what parameters, duration | Loki (stdout -> Promtail -> Grafana)
LLM trace     | Full agent loop: model, tokens, reasoning, tool calls     | Langfuse

Example of a structured gateway log entry:

{
  "type": "mcp_tool_call",
  "user": "incident-agent-01",
  "role": "dev-readonly",
  "backend": "eks-ro",
  "tool_name": "eks_get_pods",
  "tool_args": {"namespace": "production"},
  "http_status": 200,
  "duration_ms": 245
}

Gateway audit works for all clients (IDE, Claude Code, headless agents). Langfuse requires client-side instrumentation — it can be added when ready and does not block the MCP rollout.

Conclusions

Moving from interactive LLM usage in an IDE to autonomous agents is not a matter of prompts — it is an architectural shift in security. Key principles:

  1. Physical isolation over prompt instructions — an agent cannot do what it has no credentials for
  2. Defense in depth — three independent layers (gateway, k8s RBAC, AWS IAM) provide protection even if one is compromised
  3. Least privilege — each agent receives exactly the permissions it needs
  4. Full audit trail — every tool call is logged with identity, role, and parameters