Troubleshooting Runbook

When something breaks in a dstack-cloud deployment, the root cause usually falls into one of these categories: attestation mismatch, KMS unavailability, governance hold-up, or infrastructure issues. This runbook covers the most common failure modes and how to diagnose them.

RA-TLS Connection Failures

Symptoms

Workload logs show “RA-TLS handshake failed”
KMS logs show “connection from unverified peer”
Workload cannot obtain keys

Diagnosis

# Check workload logs
dstack-cloud logs

# Check KMS logs
cd kms-prod
dstack-cloud logs

Common Causes and Fixes

Cause	Fix
Workload attestation invalid	Verify the workload’s measurements match what is registered on-chain. Run `dstack-cloud status` to get current measurements.
KMS attestation invalid	Verify KMS is running in a genuine TEE. Check `dstack-cloud status` for the KMS instance.
Clock skew between workload and KMS	Certificate validity checks during the RA-TLS handshake assume roughly synchronized clocks. Check NTP configuration on both sides.
Certificate expired	Check that the RA-TLS certificates have not expired. Restart the CVM to regenerate.
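
Clock skew is quick to confirm once you have a timestamp from each side. The sketch below compares two Unix timestamps against a tolerance. The function name `clock_skew_ok` and the 30-second default are illustrative assumptions, not values defined by dstack-cloud.

```sh
# clock_skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SKEW_SECONDS]
# Succeeds when the two Unix timestamps differ by at most MAX_SKEW_SECONDS.
# The 30-second default is an assumed tolerance, not a dstack-cloud setting.
clock_skew_ok() {
  skew=$(( $1 - $2 ))
  if [ "$skew" -lt 0 ]; then
    skew=$(( 0 - skew ))
  fi
  [ "$skew" -le "${3:-30}" ]
}
```

For example, `clock_skew_ok "$(date +%s)" "$remote_epoch" || echo "clock skew too large"`, where `remote_epoch` holds the output of `date +%s` captured on the other machine.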

Attestation Verification Failures

Symptoms

KMS refuses to dispatch keys
Logs show “measurement not authorized” or “attestation verification failed”

Diagnosis

# Get current measurements
dstack-cloud status
# Note the RTMR3 / OS_IMAGE_HASH

# Check on-chain authorization
cast call <DstackKms_ADDRESS> \
  "isAuthorized(bytes32)(bool)" \
  0xYOUR_MEASUREMENT_HASH \
  --rpc-url $RPC_URL
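
A truncated or unprefixed hash is an easy way to get a confusing answer from the query above. The following sketch validates the hash format before calling `cast`. The helper names `is_bytes32` and `check_authorized` are hypothetical; the contract function signature is the one used above, and `RPC_URL` is assumed to be set in the environment.

```sh
# is_bytes32 VALUE
# Succeeds when VALUE is "0x" followed by exactly 64 hex digits.
is_bytes32() {
  case "$1" in
    0x*) hex=${1#0x} ;;
    *) return 1 ;;
  esac
  [ "${#hex}" -eq 64 ] || return 1
  case "$hex" in
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  return 0
}

# check_authorized CONTRACT_ADDRESS MEASUREMENT_HASH
# Runs the isAuthorized query only after the hash passes the format check.
# Requires Foundry's cast on PATH and RPC_URL in the environment.
check_authorized() {
  if ! is_bytes32 "$2"; then
    echo "not a bytes32 hex value: $2" >&2
    return 2
  fi
  cast call "$1" "isAuthorized(bytes32)(bool)" "$2" --rpc-url "$RPC_URL"
}
```

Usage would look like `check_authorized <DstackKms_ADDRESS> 0xYOUR_MEASUREMENT_HASH`, with the same placeholders as in the diagnosis commands.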

Common Causes and Fixes

Cause	Fix
Measurement not registered on-chain	Register the measurement via governance. See Register Workload Measurements.
Measurement changed after update	Application code or Docker image changed. Register the new measurement.
KMS pointing to wrong contract	Verify that the `KMS_CONTRACT_ADDR` environment variable is set to the intended `DstackKms` contract address.
RPC returns stale state	Check RPC provider health. Switch to a backup RPC endpoint.

CVM / Enclave Startup Failures

Symptoms

`dstack-cloud deploy` succeeds but the CVM exits immediately
`dstack-cloud status` shows “ERROR” or “STOPPED”

Diagnosis

# Check logs for the reason
dstack-cloud logs

# Check resource allocation (GCP)
gcloud compute instances describe <INSTANCE_NAME>

# Check resource allocation (Nitro)
sudo nitro-cli describe-enclaves   # nitro-cli is the binary installed by the aws-nitro-enclaves-cli package

Common Causes and Fixes

Cause	Fix
Insufficient memory	Allocate more memory. On GCP, use a larger machine type. On Nitro, increase `--memory` in `allocate-enclaves`.
Invalid Docker image	Verify the image exists and is accessible. Use SHA256 digests for pinned images.
Container crash loop	Check application logs. The container may have a runtime error.
OS image incompatible	Ensure the OS image version matches the dstack-cloud CLI version.
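
The advice to pin images by SHA256 digest can be checked mechanically before deploying. This is a minimal sketch; the function name `is_digest_pinned` is hypothetical, and it only inspects the shape of the image reference, not whether the image exists in the registry.

```sh
# is_digest_pinned IMAGE_REF
# Succeeds when IMAGE_REF ends in "@sha256:" plus 64 lowercase hex digits.
# This checks the reference format only; it does not contact any registry.
is_digest_pinned() {
  case "$1" in
    *@sha256:*) digest=${1##*@sha256:} ;;
    *) return 1 ;;
  esac
  [ "${#digest}" -eq 64 ] || return 1
  case "$digest" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}
```

For example, `is_digest_pinned "$IMAGE" || echo "warning: $IMAGE is not pinned by digest"`, where `IMAGE` is whatever image reference your deployment uses.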

GCP-specific

Cause	Fix
Confidential VM quota exceeded	Request quota increase in GCP Console.
VM not booting as TDX	Verify the VM has `confidential-compute: enabled` in GCP Console.

Nitro-specific

Cause	Fix
Enclave image (EIF) too large	Reduce Docker image size. Use multi-stage builds.
Nitro driver not installed	Install: `sudo apt-get install -y aws-nitro-enclaves-cli`
Enclave resource limit exceeded	Run `allocate-enclaves` with higher values and retry.

On-chain Authorization Failures

Symptoms

KMS logs show “workload not authorized”
Keys are not dispatched despite correct attestation

Diagnosis

# Check if measurement is authorized on-chain
cast call <DstackKms_ADDRESS> "isAuthorized(bytes32)(bool)" 0xHASH --rpc-url $RPC_URL

# Check DstackKms contract state
cast call <DstackKms_ADDRESS> "owner()(address)" --rpc-url $RPC_URL

Common Causes and Fixes

Cause	Fix
Measurement registered on wrong contract	Verify the KMS is configured to use the correct `DstackKms` address.
Governance transaction not yet executed	Check the Safe for pending transactions. Wait for the timelock to expire, then execute the transaction.
Measurement was revoked	Check the Safe transaction history. If revoked by mistake, re-register via governance.
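
When comparing the result of `owner()` with the address you expect, note that tools may print addresses in mixed-case checksum form, so a plain string comparison can report a false mismatch. The sketch below compares two addresses ignoring letter case. The function name `same_address` and the variable `EXPECTED_OWNER` are hypothetical; which address should own the contract depends on your governance setup.

```sh
# same_address A B
# Succeeds when two hex addresses are equal ignoring the case of hex letters.
same_address() {
  a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$a" = "$b" ]
}
```

For example: `same_address "$(cast call <DstackKms_ADDRESS> "owner()(address)" --rpc-url $RPC_URL)" "$EXPECTED_OWNER" || echo "unexpected owner"`.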

KMS Unavailable

Symptoms

Workloads cannot connect to KMS
`dstack-cloud status` shows KMS as stopped or unreachable

Diagnosis

# Check KMS status
cd kms-prod
dstack-cloud status

# Check KMS logs
dstack-cloud logs

# Test connectivity
curl -k https://<KMS_URL>:12001/health

Common Causes and Fixes

Cause	Fix
KMS CVM stopped	Restart: `dstack-cloud start`
KMS bootstrap not completed	Complete the bootstrap procedure. See Run a dstack-kms CVM on GCP.
Network issue	Verify firewall rules. Check VSOCK proxy on Nitro.
KMS out of memory	Allocate more resources. Check `dstack-cloud logs` for OOM errors.
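
After restarting the KMS, the health endpoint may take a moment to respond, so a single `curl` can fail spuriously. The following is a small retry wrapper, offered as a sketch; the function name `retry` and the attempt and delay values in the usage line are assumptions, while the port and `/health` path are the ones used in the diagnosis commands.

```sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
# Sleeps DELAY_SECONDS between tries; returns 1 if every try failed.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while [ "$retry_n" -le "$retry_max" ]; do
    if "$@"; then
      return 0
    fi
    retry_n=$(( retry_n + 1 ))
    if [ "$retry_n" -le "$retry_max" ]; then
      sleep "$retry_delay"
    fi
  done
  return 1
}
```

For example: `retry 5 2 curl -ksf --max-time 5 "https://<KMS_URL>:12001/health"`.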

Governance Transactions Stuck

Symptoms

Governance proposal not advancing
Transaction in Safe queue not executing

Diagnosis

Check the Safe web interface for transaction status
Check if the timelock has expired
Verify the account executing the transaction has enough ETH to pay for gas

Common Causes and Fixes

Cause	Fix
Not enough signatures	Contact missing signers. If a signer is unavailable, consider adding a new signer (requires governance).
Timelock not yet expired	Check the exact expiry time. Wait.
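Executing account cannot pay gas	Send ETH to the account that executes the Safe transaction. By default, execution gas is paid by that account, not drawn from the Safe's own balance.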
Transaction will revert	Simulate the transaction before executing. The contract state may have changed since the proposal was created. Cancel and re-submit.
Stale transaction in queue	Cancel the stale transaction through the Safe interface. Submit a new one.
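
To check the exact timelock expiry, a little arithmetic on Unix timestamps is enough. The sketch below assumes you already know when the transaction was queued and the timelock delay in seconds; where those values come from depends on your timelock setup, and the function name `timelock_remaining` is hypothetical.

```sh
# timelock_remaining QUEUED_AT DELAY_SECONDS [NOW]
# Prints the seconds left until QUEUED_AT + DELAY_SECONDS, or 0 once it has passed.
# QUEUED_AT and NOW are Unix timestamps; NOW defaults to the current time.
timelock_remaining() {
  now=${3:-$(date +%s)}
  left=$(( $1 + $2 - now ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}
```

For example, `timelock_remaining "$QUEUED_AT" 86400` prints how many seconds remain of an assumed 24-hour delay.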

VSOCK Proxy Failures (Nitro-specific)

Symptoms

Enclave cannot reach KMS or external services
`dstack-cloud logs` shows network timeout errors

Diagnosis

# Check if socat is running
ps aux | grep '[s]ocat'   # the bracket keeps grep from matching its own process

# Check VSOCK proxy logs
journalctl -u vsock-proxy -f   # if running as systemd service

# Test VSOCK connectivity from the host
echo "test" | socat - VSOCK-CONNECT:1:8000

Common Causes and Fixes

Cause	Fix
socat not running	Start the VSOCK proxy. Check `prelaunch.sh` for the proxy startup command.
Wrong VSOCK port	Verify the VSOCK port matches between the Enclave and the proxy.
socat crashed	Restart socat. Check system logs for crash reason. Consider running as a systemd service with auto-restart.
Port conflict	Another process is using the same port. Change the proxy port configuration.

Emergency Operations

Revoke a Compromised Measurement

1. Draft a governance transaction to remove the measurement from `DstackKms`
2. Request expedited approval from all signers
3. Wait for the timelock (cannot be bypassed)
4. Execute after the delay
5. Verify the measurement is no longer authorized

KMS Key Compromise

If the KMS root key may have been compromised:

1. Stop the KMS immediately: `dstack-cloud stop`
2. Audit all workloads that received keys from the compromised KMS
3. Rotate affected application keys
4. Deploy a new KMS instance with fresh measurements
5. Register the new KMS measurements on-chain
6. Revoke the old KMS measurements
7. Restart workloads against the new KMS

Full System Recovery

1. Stop all CVMs and KMS instances
2. Verify blockchain state is consistent
3. Redeploy from known-good configuration
4. Re-register measurements if needed
5. Verify end-to-end key delivery
6. Review governance activity for suspicious transactions

Diagnostic Commands Cheat Sheet

# Check deployment status
dstack-cloud status

# View logs
dstack-cloud logs
dstack-cloud logs --follow
dstack-cloud logs --container <name>

# Check measurements
dstack-cloud status | grep -iE "measurement|hash|rtmr"   # -i so RTMR3 / OS_IMAGE_HASH also match

# On-chain queries (using cast)
cast call <ADDR> "isAuthorized(bytes32)(bool)" 0xHASH --rpc-url $RPC_URL
cast call <ADDR> "owner()(address)" --rpc-url $RPC_URL

# GCP diagnostics
gcloud compute instances describe <NAME>
gcloud logging read "resource.type=gce_instance" --limit 50

# Nitro diagnostics
sudo nitro-cli describe-enclaves
sudo amazon-nitro-enclaves-cli allocate-enclaves --cpu-count 2 --memory 4096

Next Steps

Monitoring and Alerting — Set up proactive monitoring
Upgrade Procedures — Upgrade versions to fix known issues
