
Troubleshooting Runbook

When something breaks in a dstack-cloud deployment, the root cause usually falls into one of these categories: attestation mismatch, KMS unavailability, governance hold-up, or infrastructure issues. This runbook covers the most common failure modes and how to diagnose them.

RA-TLS Connection Failures

Symptoms

  • Workload logs show “RA-TLS handshake failed”
  • KMS logs show “connection from unverified peer”
  • Workload cannot obtain keys

Diagnosis

# Check workload logs
dstack-cloud logs

# Check KMS logs
cd kms-prod
dstack-cloud logs

Common Causes and Fixes

  • Workload attestation invalid: Verify the workload’s measurements match what is registered on-chain. Run dstack-cloud status to get current measurements.
  • KMS attestation invalid: Verify the KMS is running in a genuine TEE. Check dstack-cloud status for the KMS instance.
  • Clock skew between workload and KMS: RA-TLS requires relatively synchronized clocks. Check the NTP configuration on both sides.
  • Certificate expired: Check that the RA-TLS certificates have not expired. Restart the CVM to regenerate them.
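For the clock-skew case, a quick sanity check is to compare epoch timestamps from both machines. A minimal sketch, assuming bash; the helper name and 30-second default tolerance are illustrative, and in practice the remote timestamp would come from the KMS host (e.g. over SSH):

```shell
# Hypothetical skew check: compare two epoch timestamps (seconds) and
# flag skew beyond a tolerance. The remote timestamp could be obtained
# with something like: ssh kms-host date -u +%s
clock_skew_ok() {
  local local_ts=$1 remote_ts=$2 tolerance=${3:-30}
  local diff=$(( local_ts - remote_ts ))
  diff=${diff#-}                      # absolute value
  if [ "$diff" -le "$tolerance" ]; then
    echo "ok (skew ${diff}s)"
  else
    echo "skew ${diff}s exceeds ${tolerance}s"
  fi
}
```

If the skew exceeds the tolerance your RA-TLS stack enforces, fix NTP on whichever side has drifted before retrying the handshake.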

Attestation Verification Failures

Symptoms

  • KMS refuses to dispatch keys
  • Logs show “measurement not authorized” or “attestation verification failed”

Diagnosis

# Get current measurements
dstack-cloud status
# Note the RTMR3 / OS_IMAGE_HASH

# Check on-chain authorization
cast call <DstackKms_ADDRESS> \
  "isAuthorized(bytes32)(bool)" \
  0xYOUR_MEASUREMENT_HASH \
  --rpc-url $RPC_URL
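A common pitfall with the query above is passing the measurement in the wrong form: isAuthorized(bytes32) expects a 0x-prefixed, 64-hex-character value, and hashes copied from status output can differ in case or prefix. A small normalization helper, assuming bash; the function name is hypothetical:

```shell
# Hypothetical helper: normalize a measurement hash to the 0x-prefixed,
# lowercase, 32-byte form expected by isAuthorized(bytes32).
# Prints the normalized hash, or fails if the length is wrong.
normalize_hash() {
  local h=$1
  h=${h#0x}                               # strip any 0x prefix
  h=$(printf '%s' "$h" | tr 'A-F' 'a-f')  # lowercase hex digits
  if [ ${#h} -ne 64 ]; then
    echo "error: expected 64 hex chars, got ${#h}" >&2
    return 1
  fi
  echo "0x$h"
}

# Usage with the on-chain check (addresses/vars as in your deployment):
# cast call <DstackKms_ADDRESS> "isAuthorized(bytes32)(bool)" \
#   "$(normalize_hash "$MEASUREMENT")" --rpc-url $RPC_URL
```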

Common Causes and Fixes

  • Measurement not registered on-chain: Register the measurement via governance. See Register Workload Measurements.
  • Measurement changed after an update: The application code or Docker image changed. Register the new measurement.
  • KMS pointing to the wrong contract: Verify the KMS_CONTRACT_ADDR environment variable.
  • RPC returns stale state: Check RPC provider health. Switch to a backup RPC endpoint.

CVM / Enclave Startup Failures

Symptoms

  • dstack-cloud deploy succeeds but the CVM exits immediately
  • dstack-cloud status shows “ERROR” or “STOPPED”

Diagnosis

# Check logs for the reason
dstack-cloud logs

# Check resource allocation (GCP)
gcloud compute instances describe <INSTANCE_NAME>

# Check resource allocation (Nitro)
sudo amazon-nitro-enclaves-cli describe-enclaves

Common Causes and Fixes

  • Insufficient memory: Allocate more memory. On GCP, use a larger machine type; on Nitro, increase --memory in allocate-enclaves.
  • Invalid Docker image: Verify the image exists and is accessible. Use SHA256 digests for pinned images.
  • Container crash loop: Check the application logs. The container may have a runtime error.
  • OS image incompatible: Ensure the OS image version matches the dstack-cloud CLI version.
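Pinning by digest can look like the following in a compose-style app definition. This is a sketch: the service name and image are placeholders, and the digest is deliberately left as one; obtain the real value with docker inspect or docker images --digests.

```yaml
# Digest-pinned image reference: unlike a tag, a digest cannot be
# silently repointed, so the deployed bytes (and thus the measurement)
# stay stable across deploys.
services:
  app:
    image: myorg/myapp@sha256:<digest>   # placeholder digest
```

Note that moving from a tag to a digest (or updating the digest) changes the workload measurement, so the new measurement must be registered via governance afterwards.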

GCP-specific

  • Confidential VM quota exceeded: Request a quota increase in the GCP Console.
  • VM not booting as TDX: Verify the VM has confidential-compute enabled in the GCP Console.

Nitro-specific

  • Enclave image (EIF) too large: Reduce the Docker image size. Use multi-stage builds.
  • Nitro driver not installed: Install it with sudo apt-get install -y aws-nitro-enclaves-cli
  • Enclave resource limit exceeded: Run allocate-enclaves with higher values and retry.
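For the memory-related Nitro failures, a quick pre-flight comparison of the allocated enclave memory against the EIF size can save a failed launch. A sketch, assuming bash; the 2x headroom factor is an assumed safety margin, not an official AWS figure:

```shell
# Hypothetical pre-flight check: does the allocated enclave memory leave
# headroom over the EIF size? All sizes in MiB.
eif_fits() {
  local alloc_mib=$1 eif_mib=$2 factor=${3:-2}
  if [ $(( eif_mib * factor )) -le "$alloc_mib" ]; then
    echo "ok"
  else
    echo "increase --memory to at least $(( eif_mib * factor )) MiB"
  fi
}
```

The EIF size in MiB can be read from the file itself, e.g. with du -m on the .eif; the allocated value is whatever you passed to allocate-enclaves.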

On-chain Authorization Failures

Symptoms

  • KMS logs show “workload not authorized”
  • Keys are not dispatched despite correct attestation

Diagnosis

# Check if measurement is authorized on-chain
cast call <DstackKms_ADDRESS> "isAuthorized(bytes32)(bool)" 0xHASH --rpc-url $RPC_URL

# Check DstackKms contract state
cast call <DstackKms_ADDRESS> "owner()(address)" --rpc-url $RPC_URL

Common Causes and Fixes

  • Measurement registered on the wrong contract: Verify the KMS is configured to use the correct DstackKms address.
  • Governance transaction not yet executed: Check the Safe for pending transactions. Wait for the timelock.
  • Measurement was revoked: Check the Safe transaction history. If it was revoked by mistake, re-register via governance.

KMS Unavailable

Symptoms

  • Workloads cannot connect to KMS
  • dstack-cloud status shows KMS as stopped or unreachable

Diagnosis

# Check KMS status
cd kms-prod
dstack-cloud status

# Check KMS logs
dstack-cloud logs

# Test connectivity
curl -k https://<KMS_URL>:12001/health

Common Causes and Fixes

  • KMS CVM stopped: Restart it with dstack-cloud start
  • KMS bootstrap not completed: Complete the bootstrap procedure. See Run a dstack-kms CVM on GCP.
  • Network issue: Verify firewall rules. On Nitro, check the VSOCK proxy.
  • KMS out of memory: Allocate more resources. Check dstack-cloud logs for OOM errors.
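The health probe can fail transiently while the KMS is still starting, so it helps to retry a few times before concluding the KMS is down. A generic wrapper, assuming bash; the attempt count and the usage line are illustrative:

```shell
# Generic retry wrapper for flaky checks such as the KMS health probe.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0                       # success: stop retrying
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1                                 # all attempts failed
}

# Example against the health endpoint (URL as in your deployment):
# retry 5 2 curl -fsk "https://<KMS_URL>:12001/health"
```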

Governance Transactions Stuck

Symptoms

  • Governance proposal not advancing
  • Transaction in Safe queue not executing

Diagnosis

  1. Check the Safe web interface for transaction status
  2. Check if the timelock has expired
  3. Verify the Safe has sufficient gas

Common Causes and Fixes

  • Not enough signatures: Contact the missing signers. If a signer is unavailable, consider adding a new signer (requires governance).
  • Timelock not yet expired: Check the exact expiry time and wait.
  • Safe out of gas: Send ETH to the Safe address.
  • Transaction will revert: Simulate the transaction before executing; the contract state may have changed since the proposal was created. Cancel and re-submit.
  • Stale transaction in queue: Cancel the stale transaction through the Safe interface and submit a new one.
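To distinguish "timelock not yet expired" from a genuinely stuck transaction, it can help to compute the remaining delay from the queue timestamp and the configured timelock delay. How you obtain those two values depends on your timelock contract; the helper below is a sketch, assuming bash:

```shell
# Hypothetical helper: seconds remaining on a timelock, given the queue
# timestamp and the delay (both epoch seconds / seconds). Prints 0 once
# the timelock has expired. The current time defaults to `date -u +%s`.
timelock_remaining() {
  local queued_at=$1 delay=$2 now=${3:-$(date -u +%s)}
  local eta=$(( queued_at + delay - now ))
  [ "$eta" -lt 0 ] && eta=0
  echo "$eta"
}
```

A nonzero result means the transaction simply has to wait; a result of 0 with the transaction still queued points at signatures, gas, or a reverting call instead.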

VSOCK Proxy Failures (Nitro-specific)

Symptoms

  • Enclave cannot reach KMS or external services
  • dstack-cloud logs shows network timeout errors in the enclave

Diagnosis

# Check if socat is running
ps aux | grep socat

# Check VSOCK proxy logs
journalctl -u vsock-proxy -f   # if running as systemd service

# Test VSOCK connectivity from the host
echo "test" | socat - VSOCK-CONNECT:1:8000

Common Causes and Fixes

  • socat not running: Start the VSOCK proxy. Check prelaunch.sh for the proxy startup command.
  • Wrong VSOCK port: Verify the VSOCK port matches between the enclave and the proxy.
  • socat crashed: Restart socat and check the system logs for the crash reason. Consider running it as a systemd service with auto-restart.
  • Port conflict: Another process is using the same port. Change the proxy port configuration.
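Running the proxy under systemd with auto-restart can look like the unit below. This is a sketch: the unit name, socat addresses, and ports are placeholders; take the real proxy command from your prelaunch.sh.

```ini
# /etc/systemd/system/vsock-proxy.service (hypothetical path/name)
[Unit]
Description=VSOCK proxy for Nitro enclave
After=network-online.target

[Service]
# Placeholder forwarding rule; replace with the command from prelaunch.sh
ExecStart=/usr/bin/socat VSOCK-LISTEN:8000,fork TCP:127.0.0.1:8000
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
```

Enable it with systemctl daemon-reload followed by systemctl enable --now vsock-proxy; crashes then show up in journalctl -u vsock-proxy and the proxy comes back automatically.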

Emergency Operations

Revoke a Compromised Measurement

  1. Draft a governance transaction to remove the measurement from DstackKms
  2. Request expedited approval from all signers
  3. Wait for the timelock (cannot be bypassed)
  4. Execute after the delay
  5. Verify the measurement is no longer authorized

KMS Key Compromise

If the KMS root key may have been compromised:
  1. Stop the KMS immediately: dstack-cloud stop
  2. Audit all workloads that received keys from the compromised KMS
  3. Rotate affected application keys
  4. Deploy a new KMS instance with fresh measurements
  5. Register the new KMS measurements on-chain
  6. Revoke the old KMS measurements
  7. Restart workloads against the new KMS

Full System Recovery

  1. Stop all CVMs and KMS instances
  2. Verify blockchain state is consistent
  3. Redeploy from known-good configuration
  4. Re-register measurements if needed
  5. Verify end-to-end key delivery
  6. Review governance activity for suspicious transactions

Diagnostic Commands Cheat Sheet

# Check deployment status
dstack-cloud status

# View logs
dstack-cloud logs
dstack-cloud logs --follow
dstack-cloud logs --container <name>

# Check measurements
dstack-cloud status | grep -E "measurement|hash|rtmr"

# On-chain queries (using cast)
cast call <ADDR> "isAuthorized(bytes32)(bool)" 0xHASH --rpc-url $RPC_URL
cast call <ADDR> "owner()(address)" --rpc-url $RPC_URL

# GCP diagnostics
gcloud compute instances describe <NAME>
gcloud logging read "resource.type=gce_instance"

# Nitro diagnostics
sudo amazon-nitro-enclaves-cli describe-enclaves
sudo amazon-nitro-enclaves-cli allocate-enclaves --cpu-count 2 --memory 4096

Next Steps