Troubleshooting Runbook
When something breaks in a dstack-cloud deployment, the root cause usually falls into one of these categories: attestation mismatch, KMS unavailability, governance hold-up, or infrastructure issues. This runbook covers the most common failure modes and how to diagnose them.
RA-TLS Connection Failures
Symptoms
- Workload logs show “RA-TLS handshake failed”
- KMS logs show “connection from unverified peer”
- Workload cannot obtain keys
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| Workload attestation invalid | Verify the workload’s measurements match what is registered on-chain. Run dstack-cloud status to get current measurements. |
| KMS attestation invalid | Verify KMS is running in a genuine TEE. Check dstack-cloud status for the KMS instance. |
| Clock skew between workload and KMS | RA-TLS requires relatively synchronized clocks. Check NTP configuration on both sides. |
| Certificate expired | Check that the RA-TLS certificates have not expired. Restart the CVM to regenerate. |
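
The clock-skew row can be checked by comparing Unix timestamps sampled on the workload and the KMS host at roughly the same moment (for example, the output of `date -u +%s` on each side). A minimal sketch follows. The function name and the 30-second tolerance are illustrative assumptions, not values defined by dstack-cloud or RA-TLS.

```python
def clock_skew_ok(workload_epoch: float, kms_epoch: float,
                  tolerance_s: float = 30.0) -> bool:
    """Return True if two Unix timestamps differ by at most tolerance_s.

    tolerance_s is an assumed threshold for illustration only; use the
    value your deployment actually tolerates.
    """
    return abs(workload_epoch - kms_epoch) <= tolerance_s


if __name__ == "__main__":
    # Timestamps sampled on each host at (roughly) the same moment.
    workload_time = 1_700_000_000
    kms_time = 1_700_000_012
    if clock_skew_ok(workload_time, kms_time):
        print("clocks are within tolerance")
    else:
        print("clock skew too large: check NTP on both sides")
```
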
Attestation Verification Failures
Symptoms
- KMS refuses to dispatch keys
- Logs show “measurement not authorized” or “attestation verification failed”
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| Measurement not registered on-chain | Register the measurement via governance. See Register Workload Measurements. |
| Measurement changed after update | Application code or Docker image changed. Register the new measurement. |
| KMS pointing to wrong contract | Verify KMS_CONTRACT_ADDR environment variable. |
| RPC returns stale state | Check RPC provider health. Switch to a backup RPC endpoint. |
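
When comparing the measurement reported by a workload against the registered values, formatting differences (a `0x` prefix, upper versus lower case, stray whitespace) can make identical values look different. The sketch below assumes measurements are exchanged as hex strings. The helper names are hypothetical and are not part of the dstack-cloud CLI.

```python
from typing import Iterable


def normalize_measurement(value: str) -> str:
    """Lowercase a hex measurement and drop whitespace and an optional 0x prefix."""
    v = value.strip().lower()
    return v[2:] if v.startswith("0x") else v


def is_registered(reported: str, registered: Iterable[str]) -> bool:
    """Return True if the reported measurement matches any registered value."""
    allowed = {normalize_measurement(m) for m in registered}
    return normalize_measurement(reported) in allowed


if __name__ == "__main__":
    # Placeholder values for illustration; real measurements are much longer.
    registered_on_chain = ["0xABCDEF01", "0x12345678"]
    reported_by_workload = "abcdef01\n"
    print(is_registered(reported_by_workload, registered_on_chain))
```
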
CVM / Enclave Startup Failures
Symptoms
- dstack-cloud deploy succeeds but CVM exits immediately
- dstack-cloud status shows “ERROR” or “STOPPED”
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| Insufficient memory | Allocate more memory. On GCP, use a larger machine type. On Nitro, increase --memory in allocate-enclaves. |
| Invalid Docker image | Verify the image exists and is accessible. Use SHA256 digests for pinned images. |
| Container crash loop | Check application logs. The container may have a runtime error. |
| OS image incompatible | Ensure the OS image version matches the dstack-cloud CLI version. |
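
To confirm that an image reference is pinned by digest rather than by a mutable tag, a simple pattern check is often enough. This is a minimal sketch: `is_digest_pinned` is a hypothetical helper, and the image references are made-up examples.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}\Z")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference ends with a full SHA256 digest."""
    return _DIGEST_RE.search(image_ref) is not None


if __name__ == "__main__":
    refs = [
        "registry.example.com/app:latest",
        "registry.example.com/app@sha256:" + "0" * 64,
    ]
    for ref in refs:
        status = "pinned" if is_digest_pinned(ref) else "NOT pinned"
        print(f"{status}: {ref}")
```
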
GCP-specific
| Cause | Fix |
|---|---|
| Confidential VM quota exceeded | Request quota increase in GCP Console. |
| VM not booting as TDX | Verify the VM has confidential-compute: enabled in GCP Console. |
Nitro-specific
| Cause | Fix |
|---|---|
| Enclave image (EIF) too large | Reduce Docker image size. Use multi-stage builds. |
| Nitro driver not installed | Install: sudo apt-get install -y aws-nitro-enclaves-cli |
| Enclave resource limit exceeded | Run allocate-enclaves with higher values and retry. |
On-chain Authorization Failures
Symptoms
- KMS logs show “workload not authorized”
- Keys are not dispatched despite correct attestation
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| Measurement registered on wrong contract | Verify the KMS is configured to use the correct DstackKms address. |
| Governance transaction not yet executed | Check the Safe for pending transactions. Wait for timelock. |
| Measurement was revoked | Check the Safe transaction history. If revoked by mistake, re-register via governance. |
KMS Unavailable
Symptoms
- Workloads cannot connect to KMS
- dstack-cloud status shows KMS as stopped or unreachable
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| KMS CVM stopped | Restart: dstack-cloud start |
| KMS bootstrap not completed | Complete the bootstrap procedure. See Run a dstack-kms CVM on GCP. |
| Network issue | Verify firewall rules. Check VSOCK proxy on Nitro. |
| KMS out of memory | Allocate more resources. Check dstack-cloud logs for OOM errors. |
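
A quick way to separate a network issue from a stopped KMS is to test plain TCP reachability from the workload's network. The sketch below assumes the KMS is exposed on a TCP host and port that you already know. The host and port shown are placeholders. On Nitro, where traffic goes through the VSOCK proxy, this only tests the TCP side of the path.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder endpoint; replace with your KMS address.
    kms_host, kms_port = "127.0.0.1", 8000
    if can_connect(kms_host, kms_port):
        print("KMS endpoint is reachable over TCP")
    else:
        print("cannot reach KMS endpoint: check firewall rules and KMS status")
```

A successful connection only shows that something is listening; it says nothing about attestation or key dispatch.
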
Governance Transactions Stuck
Symptoms
- Governance proposal not advancing
- Transaction in Safe queue not executing
Diagnosis
- Check the Safe web interface for transaction status
- Check if the timelock has expired
- Verify the Safe has sufficient gas
Common Causes and Fixes
| Cause | Fix |
|---|---|
| Not enough signatures | Contact missing signers. If a signer is unavailable, consider adding a new signer (requires governance). |
| Timelock not yet expired | Check the exact expiry time. Wait. |
| Safe out of gas | Send ETH to the Safe address. |
| Transaction will revert | Simulate the transaction before executing. The contract state may have changed since the proposal was created. Cancel and re-submit. |
| Stale transaction in queue | Cancel the stale transaction through the Safe interface. Submit a new one. |
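
For the timelock row, the remaining wait is the queue time plus the delay, minus the current time. The sketch below is plain date arithmetic. The queue time and delay are inputs you must read from your own governance setup, and the values in the example are invented.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timelock_remaining(queued_at: datetime, delay: timedelta,
                       now: Optional[datetime] = None) -> timedelta:
    """Return the time left before the timelock expires (zero if expired)."""
    if now is None:
        now = datetime.now(timezone.utc)
    remaining = (queued_at + delay) - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Invented example values: queued at midnight UTC with a 48-hour delay.
    queued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    left = timelock_remaining(queued, timedelta(hours=48))
    if left == timedelta(0):
        print("timelock expired: transaction can be executed")
    else:
        print(f"timelock still active, remaining: {left}")
```
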
VSOCK Proxy Failures (Nitro-specific)
Symptoms
- Enclave cannot reach KMS or external services
- dstack-cloud logs shows network timeout errors
Diagnosis
Common Causes and Fixes
| Cause | Fix |
|---|---|
| socat not running | Start the VSOCK proxy. Check prelaunch.sh for the proxy startup command. |
| Wrong VSOCK port | Verify the VSOCK port matches between the Enclave and the proxy. |
| socat crashed | Restart socat. Check system logs for crash reason. Consider running as a systemd service with auto-restart. |
| Port conflict | Another process is using the same port. Change the proxy port configuration. |
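
To check for a port conflict on the TCP side of the proxy, try to bind the port yourself: if the bind fails, something else already holds it. This is a minimal sketch that assumes the proxy uses a TCP port on the parent instance. It does not detect conflicts on the VSOCK side, and `tcp_port_in_use` is a hypothetical helper.

```python
import socket


def tcp_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails.

    A failed bind usually means another process holds the port, but it can
    also mean a permission problem (for example, ports below 1024).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == "__main__":
    proxy_port = 8000  # Placeholder; use the port from your proxy configuration.
    if tcp_port_in_use(proxy_port):
        print(f"port {proxy_port} is already in use")
    else:
        print(f"port {proxy_port} is free")
```
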
Emergency Operations
Revoke a Compromised Measurement
- Draft a governance transaction to remove the measurement from DstackKms
- Request expedited approval from all signers
- Wait for the timelock (cannot be bypassed)
- Execute after the delay
- Verify the measurement is no longer authorized
KMS Key Compromise
If the KMS root key may have been compromised:
- Stop the KMS immediately: dstack-cloud stop
- Audit all workloads that received keys from the compromised KMS
- Rotate affected application keys
- Deploy a new KMS instance with fresh measurements
- Register the new KMS measurements on-chain
- Revoke the old KMS measurements
- Restart workloads against the new KMS
Full System Recovery
- Stop all CVMs and KMS instances
- Verify blockchain state is consistent
- Redeploy from known-good configuration
- Re-register measurements if needed
- Verify end-to-end key delivery
- Review governance activity for suspicious transactions
Diagnostic Commands Cheat Sheet
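
As a first triage step, the log messages quoted in the symptom lists of this runbook can be mapped back to the matching section. The sketch below does a case-insensitive substring match on saved log text, for example output captured from dstack-cloud logs. The exact log format is an assumption, so treat a match as a hint rather than a diagnosis.

```python
from typing import Dict, List

# Log messages quoted in this runbook, mapped to the section that covers them.
SIGNATURES: Dict[str, str] = {
    "RA-TLS handshake failed": "RA-TLS Connection Failures",
    "connection from unverified peer": "RA-TLS Connection Failures",
    "measurement not authorized": "Attestation Verification Failures",
    "attestation verification failed": "Attestation Verification Failures",
    "workload not authorized": "On-chain Authorization Failures",
}


def triage(log_text: str) -> List[str]:
    """Return the runbook sections whose quoted messages appear in log_text."""
    lowered = log_text.lower()
    sections: List[str] = []
    for needle, section in SIGNATURES.items():
        if needle.lower() in lowered and section not in sections:
            sections.append(section)
    return sections


if __name__ == "__main__":
    sample = "2024-01-01T00:00:00Z ERROR RA-TLS handshake failed\n"
    for section in triage(sample):
        print(f"see section: {section}")
```
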
Next Steps
- Monitoring and Alerting — Set up proactive monitoring
- Upgrade Procedures — Upgrade versions to fix known issues

