Tool Safety

xops.bot classifies every DevOps tool command by risk level. This classification drives approval behavior -- whether a command runs automatically, requires your approval, or is blocked entirely depends on its risk level and your current safety mode.

Risk Classification

Every command that xops.bot can execute is assigned one of four risk levels:

| Risk Level | Meaning | Examples |
|---|---|---|
| LOW | Read-only, no side effects. Retrieves information without changing anything. | kubectl get, docker ps, aws describe |
| MEDIUM | Diagnostic or local operations. May access resources but does not modify remote state. | terraform plan, ansible --check, terraform init |
| HIGH | Mutations that modify state. Creates, updates, or reconfigures infrastructure. | kubectl apply, docker run, aws ec2 run-instances |
| CRITICAL | Destructive operations that cannot be easily undone. Deletes resources, removes data, or tears down infrastructure. | kubectl delete, docker rm, terraform destroy |

Commands not explicitly classified inherit the tool's default risk level (typically MEDIUM).
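The fallback rule can be pictured as a lookup with a per-tool default. The following is a minimal sketch, not xops.bot's actual implementation: the names `Risk`, `CLASSIFIED`, `TOOL_DEFAULT`, and `classify` are hypothetical, and the MEDIUM defaults are assumed from "typically MEDIUM" above.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# A few entries copied from the tables on this page; the real rule set is larger.
CLASSIFIED = {
    ("kubectl", "get"): Risk.LOW,
    ("kubectl", "apply"): Risk.HIGH,
    ("kubectl", "delete"): Risk.CRITICAL,
    ("terraform", "plan"): Risk.MEDIUM,
}

# Assumption: both tools default to MEDIUM ("typically MEDIUM").
TOOL_DEFAULT = {"kubectl": Risk.MEDIUM, "terraform": Risk.MEDIUM}


def classify(tool: str, command: str) -> Risk:
    """Return the explicit risk level, or the tool's default if unclassified."""
    return CLASSIFIED.get((tool, command), TOOL_DEFAULT[tool])


print(classify("kubectl", "delete").name)                 # CRITICAL
print(classify("kubectl", "some-unlisted-command").name)  # MEDIUM (tool default)
```

Using an ordered enum keeps comparisons such as `risk >= Risk.HIGH` readable if later logic needs them.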

How Safety Modes Interact with Risk Levels

Your safety mode determines what happens when a command at each risk level is executed:

| Risk Level | Safe Mode | Standard Mode | Full Mode |
|---|---|---|---|
| LOW | Allowed (prompted) | Auto-execute | Auto-execute |
| MEDIUM | Allowed (prompted) | Auto-execute | Auto-execute |
| HIGH | Blocked | Requires approval | Executes with awareness |
| CRITICAL | Blocked | Requires approval + confirmation | Requires awareness |

Key behaviors:

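  • Safe Mode blocks all mutations (HIGH and CRITICAL). LOW and MEDIUM commands still run, but even read-only commands prompt for confirmation first. Use this for production monitoring and on-call investigation.
  • Standard Mode (default) runs LOW and MEDIUM commands without prompting. HIGH commands require your explicit "yes" before executing. CRITICAL commands ask twice: once for approval and once for confirmation.
  • Full Mode runs everything without prompts. For HIGH and CRITICAL commands the bot notes the risk level in its response but does not block execution. Only use this in trusted development environments.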
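The mode and risk-level matrix amounts to a two-key lookup. As a rough sketch (the names `MATRIX`, `decide`, and `is_blocked` are hypothetical, and the cell text is copied from the table above rather than from any xops.bot API):

```python
# Rows are risk levels, columns are safety modes; cell text mirrors the table above.
MATRIX = {
    "LOW": {
        "safe": "allowed (prompted)",
        "standard": "auto-execute",
        "full": "auto-execute",
    },
    "MEDIUM": {
        "safe": "allowed (prompted)",
        "standard": "auto-execute",
        "full": "auto-execute",
    },
    "HIGH": {
        "safe": "blocked",
        "standard": "requires approval",
        "full": "executes with awareness",
    },
    "CRITICAL": {
        "safe": "blocked",
        "standard": "requires approval + confirmation",
        "full": "requires awareness",
    },
}


def decide(risk: str, mode: str = "standard") -> str:
    """Look up the behavior for a risk level under a safety mode."""
    return MATRIX[risk][mode]


def is_blocked(risk: str, mode: str) -> bool:
    """True when the command would not run at all in this mode."""
    return decide(risk, mode) == "blocked"


print(decide("HIGH"))                  # requires approval
print(is_blocked("CRITICAL", "safe"))  # True
```

Defaulting `mode` to "standard" reflects that Standard Mode is described as the default.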

For detailed mode configuration, see Safety Configuration.

Per-Tool Reference

kubectl

Kubernetes command-line tool. 35 classified commands.

| Command | Risk | Description |
|---|---|---|
| get | LOW | List resources in tabular or JSON/YAML format |
| describe | LOW | Show detailed information about a resource |
| logs | LOW | Print container logs |
| top | LOW | Display resource usage (CPU/memory) |
| apply | HIGH | Apply a configuration to a resource |
| create | HIGH | Create a resource from a file or stdin |
| scale | HIGH | Set a new size for a deployment or replica set |
| delete | CRITICAL | Delete resources by name, label, or file |
| drain | CRITICAL | Drain a node in preparation for maintenance |

Risk modifier: kubectl apply --dry-run lowers the effective risk from HIGH to LOW, because a dry run validates without mutating.

docker

Docker container runtime. 38 classified commands.

| Command | Risk | Description |
|---|---|---|
| ps | LOW | List containers |
| images | LOW | List images |
| logs | LOW | Fetch the logs of a container |
| stats | LOW | Display container resource usage |
| run | HIGH | Create and start a new container |
| start | HIGH | Start a stopped container |
| stop | HIGH | Stop a running container |
| push | HIGH | Push an image to a registry |
| rm | CRITICAL | Remove one or more containers |
| rmi | CRITICAL | Remove one or more images |
| system prune | CRITICAL | Remove all unused data |

aws

AWS Command Line Interface. 36 classified commands.

| Command | Risk | Description |
|---|---|---|
| describe | LOW | Describe AWS resources |
| list | LOW | List AWS resources |
| sts get-caller-identity | LOW | Check current IAM identity |
| ec2 run-instances | HIGH | Launch new EC2 instances |
| ec2 stop-instances | HIGH | Stop running EC2 instances |
| s3 cp | HIGH | Copy objects to/from S3 |
| iam create | HIGH | Create IAM resources |
| ec2 terminate-instances | CRITICAL | Permanently terminate EC2 instances |
| s3 rm | CRITICAL | Delete objects from S3 |
| rds delete | CRITICAL | Delete an RDS database instance |
| cloudformation delete-stack | CRITICAL | Delete a CloudFormation stack |

terraform

HashiCorp Terraform infrastructure as code. 26 classified commands.

| Command | Risk | Description |
|---|---|---|
| version | LOW | Print Terraform version |
| validate | LOW | Validate configuration files |
| show | LOW | Show the current state or a saved plan |
| state list | LOW | List resources in the state |
| plan | MEDIUM | Generate an execution plan |
| init | MEDIUM | Initialize a working directory |
| apply | HIGH | Apply changes to infrastructure |
| import | HIGH | Import existing infrastructure into state |
| state rm | HIGH | Remove items from the state |
| destroy | CRITICAL | Destroy all managed infrastructure |
| workspace delete | CRITICAL | Delete a Terraform workspace |

Risk modifier: terraform plan -out=tfplan lowers the effective risk. A saved plan enables review before apply.

ansible

Ansible automation platform. 18 classified commands.

| Command | Risk | Description |
|---|---|---|
| ansible --version | LOW | Print version information |
| ansible --list-hosts | LOW | List hosts matching a pattern |
| ansible-inventory --list | LOW | List inventory in JSON format |
| ansible-doc | LOW | Show module documentation |
| ansible-playbook --check | MEDIUM | Dry-run a playbook without changes |
| ansible-galaxy install | MEDIUM | Install roles or collections |
| ansible (ad-hoc) | HIGH | Run ad-hoc commands on hosts |
| ansible-playbook | HIGH | Execute a playbook |
| ansible-pull | HIGH | Pull and execute a playbook from VCS |

Risk modifier: ansible-playbook --check lowers the risk from HIGH to MEDIUM. Check mode simulates without making changes.
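The risk modifiers documented for kubectl apply --dry-run and ansible-playbook --check can be sketched as flag-based overrides of a base risk level. This is an illustrative model only: `Modifier`, `MODIFIERS`, `has_flag`, and `effective_risk` are hypothetical names, and the terraform plan -out modifier is left out because this page does not state its resulting level.

```python
import shlex
from typing import NamedTuple, Optional


class Modifier(NamedTuple):
    tool: str
    subcommand: Optional[str]  # None when the tool binary is itself the command
    flag: str
    base: str       # risk level without the flag
    effective: str  # risk level when the flag is present


# Values taken from the "Risk modifier" notes on this page.
MODIFIERS = [
    Modifier("kubectl", "apply", "--dry-run", "HIGH", "LOW"),
    Modifier("ansible-playbook", None, "--check", "HIGH", "MEDIUM"),
]


def has_flag(args: list, flag: str) -> bool:
    """Match both `--flag` and `--flag=value` spellings."""
    return any(a == flag or a.startswith(flag + "=") for a in args)


def effective_risk(command_line: str) -> Optional[str]:
    """Return the effective risk for commands this sketch knows about.

    Simplification: the subcommand must be the first argument, so global
    flags placed before it are not handled.
    """
    argv = shlex.split(command_line)
    tool, args = argv[0], argv[1:]
    for m in MODIFIERS:
        if m.tool != tool:
            continue
        if m.subcommand is not None and (not args or args[0] != m.subcommand):
            continue
        return m.effective if has_flag(args, m.flag) else m.base
    return None  # not covered by this sketch


print(effective_risk("kubectl apply -f deploy.yaml --dry-run=client"))  # LOW
print(effective_risk("ansible-playbook site.yml --check"))              # MEDIUM
print(effective_risk("ansible-playbook site.yml"))                      # HIGH
```

The file names `deploy.yaml` and `site.yml` are placeholders.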

promtool

Prometheus tooling CLI for metrics queries and config validation. 22 classified commands.

| Command | Risk | Description |
|---|---|---|
| query instant | LOW | Execute instant PromQL query |
| query range | LOW | Execute range PromQL query |
| check config | LOW | Validate Prometheus configuration files |
| check rules | LOW | Validate alerting and recording rule files |
| test rules | LOW | Unit test alerting and recording rules |
| tsdb analyze | LOW | Analyze TSDB block churn, cardinality, and compaction |
| push metrics | MEDIUM | Push metrics to Prometheus remote write endpoint |
| tsdb bench write | HIGH | Run write benchmarks against TSDB |
| tsdb create-blocks-from | HIGH | Create TSDB blocks from external data sources |

Almost entirely read-only. The only non-LOW commands are push metrics (MEDIUM) and two TSDB write operations (HIGH). No CRITICAL commands.

logcli

Grafana Loki command-line tool for log queries. 6 classified commands.

| Command | Risk | Description |
|---|---|---|
| query | LOW | Run LogQL query for logs over a time range |
| instant-query | LOW | Run instant LogQL query |
| labels | LOW | Find values for a given label |
| series | LOW | Query log streams matching label selectors |
| stats | LOW | Query index statistics |
| volume | LOW | Query aggregate volumes |

Entirely read-only. logcli is a pure query tool with no write operations.

jaeger

Jaeger distributed tracing query via HTTP API. 5 classified commands.

| Command | Risk | Description |
|---|---|---|
| get-services | LOW | List all services that have reported traces |
| get-operations | LOW | List operations for a given service |
| find-traces | LOW | Search traces by service, operation, time, duration |
| get-trace | LOW | Retrieve a single trace by trace ID |
| get-dependencies | LOW | Service dependency graph for a time range |

Entirely read-only. Jaeger has no dedicated CLI -- queries use curl against the Jaeger HTTP API v3. These are conceptual command names mapping to API endpoints.

Workspace Tool Assignments

Each workspace has access to specific tools based on its agent's domain:

| Workspace | Agent | Tools |
|---|---|---|
| k8s-agent | K8s Bot | kubectl, docker |
| platform-agent | Platform Bot | terraform, ansible, aws |
| finops-agent | FinOps Bot | aws |
| rca-agent | RCA Bot | kubectl, promtool, logcli, jaeger |
| incident-agent | Incident Bot | kubectl, promtool, logcli, jaeger |

An agent can only execute commands for tools assigned to its workspace. K8s Bot cannot run terraform apply, and Platform Bot cannot run kubectl delete.
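The workspace restriction behaves like an allowlist check that runs before any risk classification. A minimal sketch, assuming the assignments in the table above (`WORKSPACE_TOOLS` and `can_use` are hypothetical names, not part of xops.bot):

```python
# Tool assignments copied from the workspace table above.
WORKSPACE_TOOLS = {
    "k8s-agent": {"kubectl", "docker"},
    "platform-agent": {"terraform", "ansible", "aws"},
    "finops-agent": {"aws"},
    "rca-agent": {"kubectl", "promtool", "logcli", "jaeger"},
    "incident-agent": {"kubectl", "promtool", "logcli", "jaeger"},
}


def can_use(workspace: str, tool: str) -> bool:
    """True if the tool is assigned to the workspace; unknown workspaces get nothing."""
    return tool in WORKSPACE_TOOLS.get(workspace, set())


print(can_use("k8s-agent", "terraform"))    # False
print(can_use("platform-agent", "kubectl")) # False
print(can_use("rca-agent", "logcli"))       # True
```

Treating an unknown workspace as having no tools makes the check fail closed.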

Practical Examples

Checking pod status in production (Safe Mode)

You ask K8s Bot: "Show me the pods in the payments namespace."

The bot runs kubectl get pods -n payments. Even though get is LOW risk, Safe Mode prompts you to confirm the read operation. You approve, and the bot displays the pod list.

Deploying a new version (Standard Mode)

You ask K8s Bot: "Deploy the new payments image v2.3.1."

The bot prepares kubectl set image deployment/payments payments=payments:v2.3.1. The subcommand here is set, which is classified as HIGH. Standard Mode shows you the exact command and asks for approval. You review it and type "yes". The bot executes the command and confirms the rollout.
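One way to picture how a full command line maps to a classified command such as set, system prune, or state rm is longest-prefix matching on the arguments after the tool name. This is only a sketch of that idea; how xops.bot actually parses commands is not described here, and `RULES` and `match_command` are hypothetical names.

```python
import shlex
from typing import Optional, Tuple

# A handful of rules taken from this page, keyed by tool and subcommand words.
RULES = {
    "kubectl": {("get",): "LOW", ("set",): "HIGH", ("delete",): "CRITICAL"},
    "docker": {("ps",): "LOW", ("system", "prune"): "CRITICAL"},
    "terraform": {("state", "list"): "LOW", ("state", "rm"): "HIGH"},
}


def match_command(command_line: str) -> Optional[Tuple[str, str]]:
    """Return (matched subcommand, risk) using the longest matching prefix."""
    argv = shlex.split(command_line)
    tool, args = argv[0], argv[1:]
    rules = RULES.get(tool, {})
    # Try the longest candidate first so "state rm" is preferred over "state".
    for n in range(len(args), 0, -1):
        key = tuple(args[:n])
        if key in rules:
            return " ".join(key), rules[key]
    return None  # unclassified in this sketch


print(match_command("kubectl set image deployment/payments payments=payments:v2.3.1"))
# ('set', 'HIGH')
print(match_command("docker system prune"))
# ('system prune', 'CRITICAL')
```

`shlex.split` is used so that quoted arguments stay together as a single token.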

Cleaning up dev containers (Full Mode)

You ask K8s Bot: "Prune all stopped containers and dangling images."

The bot runs docker system prune. This is CRITICAL, but Full Mode executes with awareness -- the bot notes the risk level in its response but does not block execution. Use Full Mode only in trusted development environments where cleanup is expected.

Investigating a latency spike (Standard Mode)

You ask RCA Bot: "There's a latency spike in the payments service. Help me investigate."

The bot starts with metrics: promtool query instant http://prometheus:9090 'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payments"}[5m]))'. This is LOW risk, so Standard Mode auto-executes. The bot identifies that the spike started at 10:45.

Next, logs: logcli query '{app="payments"} |= "error"' --addr=http://loki:3100 --since=1h. Also LOW risk, auto-executes. The bot finds connection timeout errors to the database.

Finally, traces: curl 'http://jaeger:16686/api/v3/traces?query.service_name=payments&query.duration_min=2s'. The URL is quoted so that the shell does not treat the & as a background operator. LOW risk, auto-executes. The bot traces a slow request and finds the database span taking 8 seconds instead of the usual 50ms.
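The trace query URL can also be built programmatically, which avoids shell quoting issues and handles URL encoding. The sketch below only reproduces the two query parameters shown in the curl example. `find_traces_url` is a hypothetical helper, and a real Jaeger deployment may expect additional parameters.

```python
from urllib.parse import urlencode


def find_traces_url(base_url: str, service_name: str, duration_min: str) -> str:
    """Build the find-traces URL used in the curl example above."""
    query = urlencode({
        "query.service_name": service_name,
        "query.duration_min": duration_min,
    })
    return f"{base_url}/api/v3/traces?{query}"


print(find_traces_url("http://jaeger:16686", "payments", "2s"))
# http://jaeger:16686/api/v3/traces?query.service_name=payments&query.duration_min=2s
```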

The bot correlates: latency spike at 10:45, database connection errors in logs starting at 10:44, traces showing database spans as the bottleneck. Root cause: database connection pool exhaustion.