Monitoring and Alerting

You can’t secure what you can’t see. This page covers the key metrics to watch, log collection patterns, and alert rules that help you catch attestation failures, governance anomalies, or KMS downtime before they become incidents.

Key Metrics

KMS Metrics

These metrics tell you whether KMS is healthy and delivering keys promptly:
| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Key request success rate | Percentage of key requests that succeed | < 99% |
| Key dispatch latency (p50/p99) | Time from request to key delivery | p99 > 5s |
| Attestation verification success rate | Percentage of attestation verifications that pass | < 99% |
| On-chain sync lag | Time between on-chain state change and KMS picking it up | > 60s |
| KMS uptime | Percentage of time KMS is reachable | < 99.9% |
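
To make two of the thresholds above concrete, here is a minimal sketch that checks the key request success rate and the p99 dispatch latency against the table's values. The function names and the metric labels it returns are illustrative, not part of dstack-cloud, and the sketch assumes you already collect request counts and latency samples (in seconds) from your own instrumentation.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def kms_alerts(successes, total, latencies_s):
    """Return the (hypothetical) names of breached KMS thresholds."""
    breached = []
    # Key request success rate: alert below 99%.
    if total and successes / total < 0.99:
        breached.append("key_request_success_rate")
    # Key dispatch latency: alert when p99 exceeds 5 seconds.
    if latencies_s and percentile(latencies_s, 99) > 5.0:
        breached.append("key_dispatch_latency_p99")
    return breached
```

Nearest-rank is only one of several percentile definitions; a metrics backend may interpolate and report slightly different values for the same samples.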

CVM / Enclave Metrics

Track workload health and resource usage:
| Metric | Description | Alert Threshold |
| --- | --- | --- |
| CVM uptime | Percentage of time the CVM is running | < 99% |
| CVM boot time | Time from deploy command to CVM ready | > 5 minutes |
| RA-TLS handshake success rate | Percentage of successful RA-TLS connections | < 99% |
| Container restart count | Number of container restarts within the CVM | > 0 |
| Memory usage | CVM memory utilization | > 90% |

Governance Metrics

Governance metrics help you catch stalled proposals or signer inactivity:
| Metric | Description | Alert Threshold |
| --- | --- | --- |
| Pending proposals | Number of governance proposals awaiting execution | > 0 for > 24h |
| Signer participation rate | Percentage of signers active in last 7 days | < 80% |
| Timelock queue depth | Number of transactions in the timelock queue | > 3 |
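
The pending-proposal and signer-participation metrics can be derived from timestamps alone. The sketch below assumes a hypothetical record shape (`id`, `executed`, `submitted_at`) and a mapping from signer to last-activity time; neither comes from dstack-cloud or the Safe API, so adapt the field names to whatever your data source returns.

```python
from datetime import datetime, timedelta, timezone

def stale_proposals(proposals, now, max_age=timedelta(hours=24)):
    """Return IDs of unexecuted proposals submitted more than max_age ago."""
    return [
        p["id"]
        for p in proposals
        if not p["executed"] and now - p["submitted_at"] > max_age
    ]

def signer_participation(last_active, now, window=timedelta(days=7)):
    """Fraction of signers whose last activity falls inside the window."""
    if not last_active:
        return 0.0
    active = sum(1 for seen in last_active.values() if now - seen <= window)
    return active / len(last_active)
```

An alert would fire when `stale_proposals(...)` is non-empty or when `signer_participation(...)` drops below 0.8, matching the thresholds in the table.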

Log Collection

dstack-cloud Logs

Use the built-in log viewer:
# View recent logs
dstack-cloud logs

# Follow logs in real-time
dstack-cloud logs --follow

# View logs for a specific container
dstack-cloud logs --container <container-name>

KMS Logs

KMS logs are available through dstack-cloud logs when the KMS is deployed as a dstack CVM. Key log patterns to monitor:
| Log Pattern | Meaning |
| --- | --- |
| attestation verification failed | A workload’s attestation was rejected |
| measurement not authorized | The workload’s measurement is not registered on-chain |
| key dispatched | A key was successfully delivered to a workload |
| on-chain sync completed | KMS synced its on-chain state |
| RA-TLS handshake failed | TLS connection with attestation failed |
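
One way to turn these patterns into counters is a plain substring match over log lines. The sketch below assumes each pattern appears verbatim (same casing) somewhere in the line; the actual log line format is not specified here, so treat the matching rule as an assumption to verify against real output.

```python
from collections import Counter

# Patterns taken from the table above.
PATTERNS = (
    "attestation verification failed",
    "measurement not authorized",
    "key dispatched",
    "on-chain sync completed",
    "RA-TLS handshake failed",
)

def count_patterns(lines):
    """Count how many lines contain each known KMS log pattern."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
    return counts
```

`count_patterns` accepts any iterable of strings, so a small wrapper script could pass it `sys.stdin` and read the output of `dstack-cloud logs --follow` through a pipe.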

Infrastructure Logs

GCP:
  • Cloud Logging: gcloud logging read "resource.type=gce_instance AND resource.labels.instance_id=<INSTANCE_ID>"
  • Serial port output for boot diagnostics
AWS Nitro:
  • EC2 instance system logs
  • VSOCK proxy logs (if running as a systemd service)

Dashboard Configuration

  1. KMS Health
    • Request rate (requests/min)
    • Success rate (%)
    • Latency histogram (p50, p95, p99)
    • Active connections
  2. CVM Health
    • Number of running CVMs
    • Per-CVM status (running/stopped/error)
    • Boot time trend
  3. Governance
    • Pending proposals count
    • Recent governance transactions
    • Signer activity heatmap

Integration with Datadog

Datadog integration relies on a Datadog agent on each host machine that collects custom metrics from dstack-cloud. The key metric namespaces are dstack.cloud.cvm.*, dstack.cloud.kms.*, and dstack.cloud.governance.*. To set it up:
  1. Set up a Datadog agent on each host machine
  2. Configure custom metrics collection for KMS, CVM, and governance events
  3. Create dashboards and monitors based on the metrics above
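
For the custom metrics step, one option is to emit DogStatsD datagrams to the local agent, which listens on UDP port 8125 by default. The sketch below uses only the Python standard library. The metric name in the usage note sits under the dstack.cloud.kms.* namespace mentioned above, but its suffix and the tag are hypothetical; this page does not define individual metric names.

```python
import socket

def format_metric(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram to the agent's DogStatsD listener over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))
```

For example, `send_metric(format_metric("dstack.cloud.kms.key_dispatch_latency", 0.42, "h", ["env:prod"]))` would submit a histogram sample. UDP is fire-and-forget, so a missing or misconfigured agent does not raise an error on the sending side.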

Alert Rules

Critical Alerts (Page immediately)

| Alert | Condition | Response |
| --- | --- | --- |
| KMS down | KMS unreachable for > 2 minutes | Check CVM/Enclave status. Restart if needed. |
| Attestation failure spike | > 10 attestation failures in 5 minutes | May indicate a compromised or misconfigured workload. Investigate immediately. |
| Governance: suspicious transaction | New proposal to revoke critical measurements | Review the proposal. Alert all signers. |
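
The attestation failure spike rule is a sliding-window count. A minimal sketch, assuming you feed it one timestamp (in seconds) per observed failure in non-decreasing order; the class name is illustrative and the defaults mirror the "> 10 in 5 minutes" condition above:

```python
from collections import deque

class SpikeDetector:
    """Flag when more than `threshold` events fall inside a sliding window."""

    def __init__(self, threshold=10, window_s=300):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()

    def record(self, timestamp_s):
        """Record one failure and return True if the alert should fire."""
        self.events.append(timestamp_s)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= timestamp_s - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold
```

In most deployments the monitoring backend evaluates this kind of rule for you; the sketch is mainly useful for understanding the semantics or for a lightweight standalone check.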

Warning Alerts (Page during business hours)

| Alert | Condition | Response |
| --- | --- | --- |
| High key dispatch latency | p99 > 5s for > 10 minutes | Check KMS load. Scale if needed. |
| CVM restart loop | CVM restarted > 3 times in 1 hour | Investigate container health. Check resource limits. |
| On-chain sync lag | KMS on-chain state > 60s behind | Check RPC provider health. Switch to backup RPC. |
| VSOCK proxy failure (Nitro) | VSOCK proxy process not running | Restart the proxy. Check host health. |
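
Conditions such as "p99 > 5s for > 10 minutes" only fire when a breach persists, which avoids alerting on a single bad sample. A minimal sketch of that hold logic, with an illustrative class name and the assumption that you evaluate the condition periodically and pass in the current time in seconds:

```python
class SustainedCondition:
    """Fire only after a condition has been continuously true for hold_s seconds."""

    def __init__(self, hold_s=600):
        self.hold_s = hold_s
        self.breach_started = None

    def update(self, breached, now_s):
        """Feed one evaluation; return True once the breach has outlasted hold_s."""
        if not breached:
            # Any healthy sample resets the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now_s
        return now_s - self.breach_started > self.hold_s
```

A single healthy evaluation resets the timer here; some alerting systems instead tolerate brief recoveries, so check how your backend defines "for" durations before relying on identical behavior.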

Info Alerts (Log only)

| Alert | Condition |
| --- | --- |
| New governance proposal | Any new proposal submitted to the Safe |
| CVM deployed | New CVM deployment completed |
| Measurement registered | New measurement added on-chain |

Escalation Policies

| Severity | Escalation | Response Time |
| --- | --- | --- |
| Critical | Page on-call SRE + notify security team | 15 minutes |
| Warning | Notify platform team via Slack/Teams | 2 hours |
| Info | Log to incident channel | Next business day |

Use Incident.io or similar tools to manage incident lifecycle.

On-chain Monitoring

Monitor the blockchain for governance activity:
  • Safe Transaction Service: Subscribe to the Safe’s transaction feed for real-time notifications
  • The Graph / Dune Analytics: Query governance transaction history for reporting
  • Block explorer alerts: Set up watch-only notifications for DstackKms and DstackApp contract events

Next Steps

  • Runbook — Step-by-step troubleshooting for common incidents
  • Upgrade Procedures — How to upgrade CVMs, KMS, and contracts