Starting Point: LLMs in the IDE
For smaller projects, managing infrastructure with the help of an LLM from a local IDE is more than sufficient. The right set of permissions, context in CLAUDE.md, skills, and MCP work perfectly well. Mistakes are still possible — running an action against the wrong environment, or approving a command without reading it carefully — but such mishaps are less likely and generally faster to detect and remediate.
There are situations, however, that require granting LLMs greater autonomy — for example, automating responses to alerts about excessive pod resource consumption or frequent restarts. This calls for a set of tools: Kubernetes, AWS, logs, and metrics, which in turn raises the issue of permissions and security. Each agent must be physically isolated with access limited strictly to the minimum required resources and services. Placing an agent on a single pod with access to multiple environments and read/write levels, relying on prompt instructions as the only safeguard, is a recipe for disaster.
This level of access granularity can significantly complicate configuration — every service, environment, and access level requires a separate service/pod, e.g. stage Kubernetes RO vs. production Kubernetes RO vs. stage Kubernetes RW vs. production AWS RO, and so on. With a potentially large number of agents and other participants (humans), transparent permission management becomes critical.
The proposed architecture centralizes role management behind a single entry point — the MCP Gateway. In this setup, all access services are deployed as pods in an EKS cluster.
MCP Gateway Architecture

MCP Server Permissions
Each MCP server responsible for accessing specific services must run on a separate pod, per environment, per access level (RW, RO). Permissions are granted via IRSA — pods are mapped through Service Accounts to appropriately scoped IAM role policies.
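One way this might look in manifests — a minimal sketch, assuming a ClusterRole named mcp-viewer bound to the read-only EKS pod's Service Account (the account ID, namespace, and resource lists below are illustrative, not taken from the actual deployment):

```yaml
# Service Account for the read-only EKS MCP pod, mapped to an IAM role via IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mcp-eks-ro
  namespace: mcp
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/mcp-eks-ro
---
# Read-only ClusterRole: "get/list/watch" only, no "delete" or "update" verbs
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcp-viewer
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mcp-eks-ro-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mcp-viewer
subjects:
  - kind: ServiceAccount
    name: mcp-eks-ro
    namespace: mcp
```

An RW variant would be a separate Service Account bound to a broader role and a separate IAM policy — never the same pod with a wider role.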
Key Management
API keys are stored in AWS SSM Parameter Store and automatically synchronized to Kubernetes Secrets.
Example script for creating a key for a new agent:

```
./create-keys.sh --role dev-readonly --user incident-agent-01
```
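A plausible sketch of what create-keys.sh does, assuming it generates a random key and stores only its SHA-256 hash together with the role/user suffix (the parameter path and hashing scheme here are assumptions, not the actual script):

```shell
#!/usr/bin/env bash
set -euo pipefail

ROLE="dev-readonly"       # from --role
USER="incident-agent-01"  # from --user

# Generate a random API key and derive its SHA-256 hash
KEY=$(openssl rand -hex 32)
HASH=$(printf '%s' "$KEY" | sha256sum | cut -d' ' -f1)

# Store only the hash, never the plaintext key
# aws ssm put-parameter --name "/mcp/keys/${USER}" \
#   --type SecureString --value "${HASH}:${ROLE}/${USER}" --overwrite

echo "API key (hand to the agent once): ${KEY}"
echo "Stored value: ${HASH}:${ROLE}/${USER}"
```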
The agent uses the key in the request header:
Authorization: Bearer <base64-encoded-key>
The script generates a key and stores it in SSM Parameter Store in the format hash:role/user_name — a JSON file with aggregated keys is automatically synchronized to a Kubernetes Secret via External Secrets.
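The SSM-to-Secret synchronization could be declared roughly like this — a sketch assuming the External Secrets Operator with a ClusterSecretStore backed by Parameter Store (the store name and parameter path are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: mcp-gateway-keys
  namespace: mcp
spec:
  refreshInterval: 1m            # pick up new agent keys within a minute
  secretStoreRef:
    name: aws-parameter-store    # ClusterSecretStore for SSM (assumed name)
    kind: ClusterSecretStore
  target:
    name: mcp-gateway-keys       # Kubernetes Secret mounted by the gateway
  data:
    - secretKey: keys.json
      remoteRef:
        key: /mcp/gateway/keys   # aggregated JSON of hash:role/user entries
```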
RBAC — Roles and Permissions
Each role is a set of allowed backends:
```yaml
roleMapping:
  devops-full:
    - eks-rw
    - eks-ro-prod
    - aws-rw
    - costs
    - cloudwatch
    - postgres-rw
    - prometheus-dev
    - prometheus-prod
    - cloudflare-edit
    - argo-sync
    - github-rw
  dev-readonly:
    - eks-ro
    - eks-ro-prod
    - aws-ro
    - costs
    - cloudwatch
    - postgres-ro
    - prometheus-dev
    - prometheus-prod
    - cloudflare-ro
    - argo-view
    - github-ro
```
Request Flow
```
1. Agent -> Gateway: POST /mcp/eks-ro/mcp (Bearer abc123)

2. Gateway (Layer 1):
   abc123 -> role "dev-readonly" ✅
   dev-readonly has access to eks-ro? ✅
   -> proxy to pod mcp-eks-ro

3. Pod mcp-eks-ro (Layer 2):
   Agent calls: eks_get_pods(namespace="production")
   Pod -> k8s API -> GET /api/v1/namespaces/production/pods
   k8s RBAC: Service Account mcp-eks-ro -> ClusterRole mcp-viewer -> get pods ✅

   Agent calls: eks_delete_pod(name="nginx-abc")
   k8s RBAC: mcp-viewer does not have "delete" verb -> 403 FORBIDDEN ❌

4. Pod mcp-aws-ro (Layer 3):
   aws_describe_instances -> ec2:Describe* -> OK ✅
   aws_terminate_instance -> missing ec2:TerminateInstances -> AccessDenied ❌
```
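The gateway's Layer 1 check reduces to a small lookup: hash the Bearer token, resolve it to a role/user entry, and test the requested backend against the role's allowlist. A minimal sketch (the in-memory dictionaries stand in for the SSM-synced Secret and the roleMapping config; function and variable names are illustrative):

```python
import hashlib

# Key store entries have the form sha256(key) -> "role/user",
# as synced from SSM Parameter Store via External Secrets.
KEY_STORE = {
    hashlib.sha256(b"abc123").hexdigest(): "dev-readonly/incident-agent-01",
}

# Subset of roleMapping: which backends each role may reach.
ROLE_MAPPING = {
    "dev-readonly": {"eks-ro", "eks-ro-prod", "aws-ro", "costs", "cloudwatch",
                     "postgres-ro", "prometheus-dev", "prometheus-prod",
                     "cloudflare-ro", "argo-view", "github-ro"},
}

def authorize(bearer_token: str, backend: str) -> tuple[bool, str]:
    """Layer 1 check for a request to /mcp/<backend>/mcp."""
    entry = KEY_STORE.get(hashlib.sha256(bearer_token.encode()).hexdigest())
    if entry is None:
        return False, "unknown key"
    role, _, user = entry.partition("/")
    if backend not in ROLE_MAPPING.get(role, set()):
        return False, f"role {role} may not access backend {backend}"
    return True, entry

print(authorize("abc123", "eks-ro"))  # allowed: dev-readonly includes eks-ro
print(authorize("abc123", "eks-rw"))  # denied: eks-rw is not in the role's set
```

Note that even a bug here only widens routing, not capability: Layers 2 and 3 (k8s RBAC and IAM) still block writes independently.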
Example: Incident Response Agent
```
ALERT: "PodCrashLooping: payment-service in production (47 restarts)"

AGENT: "Pod is crash-looping. Checking status and events."
  └─► eks.get_events(namespace="production", name="payment-service")
  ◄── [{reason: "OOMKilled", count: 47}]

AGENT: "OOMKilled. Checking memory usage."
  └─► prometheus.query("container_memory_usage_bytes{pod=~'payment-service.*'}")
  ◄── 500MB

AGENT: "Checking the memory limit."
  └─► eks.get_resource(kind="Deployment", name="payment-service")
  ◄── limits.memory: 512Mi

AGENT: "500/512MB — close to the limit. Checking logs."
  └─► eks.get_pod_logs(pod="payment-service-abc", tail_lines=200)
  ◄── ["DB connection pool exhausted", "Retrying query...", "OOM"]

AGENT: "Connection pool exhausted — database issue."
  └─► postgres.diagnostics(env="production")
  ◄── {active_connections: 98, max_connections: 100,
       slow_queries: [{query: "SELECT ... JOIN ...", mean_ms: 4500}]}

AGENT -> generates report -> Slack + optionally a PR with a fix
```
Every tool call passes through the MCP Gateway and is subject to permission verification. An agent with the dev-readonly role can collect data for a report but cannot restart pods or scale deployments.
This flow can be extended — for example, by submitting PRs to the ArgoCD repository with a proposal to increase resource limits, or even committing directly to trigger an automatic sync and deployment with updated memory or CPU values, e.g. on staging.
Audit and Observability
Two complementary levels:
| Layer | What it logs | Tooling |
|---|---|---|
| Gateway audit | Who called which MCP tool, with what parameters, duration | Loki (stdout -> Promtail -> Grafana) |
| LLM trace | Full agent loop: model, tokens, reasoning, tool calls | Langfuse |
Example of a structured gateway log entry:
```json
{
  "type": "mcp_tool_call",
  "user": "incident-agent-01",
  "role": "dev-readonly",
  "backend": "eks-ro",
  "tool_name": "eks_get_pods",
  "tool_args": {"namespace": "production"},
  "http_status": 200,
  "duration_ms": 245
}
```
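Because the entries are structured JSON, auditing a single agent in Grafana is a one-line LogQL query — assuming the gateway pods carry an app="mcp-gateway" label (label and field names here are illustrative):

```
{app="mcp-gateway"} | json | type = "mcp_tool_call" | user = "incident-agent-01"
```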
Gateway audit works for all clients (IDE, Claude Code, headless agents). Langfuse requires client-side instrumentation — it can be added when ready and does not block the MCP rollout.
Conclusions
Moving from interactive LLM usage in an IDE to autonomous agents is not a matter of prompts — it is an architectural shift in security. Key principles:
- Physical isolation over prompt instructions — an agent cannot do what it has no credentials for
- Defense in depth — three independent layers (gateway, k8s RBAC, AWS IAM) provide protection even if one is compromised
- Least privilege — each agent receives exactly the permissions it needs
- Full audit trail — every tool call is logged with identity, role, and parameters