## Prerequisites
- Phala Cloud account with sufficient credits for GPU instances
- A model you want to deploy (Hugging Face model ID or custom weights)
- Basic familiarity with Docker Compose and LLM serving
## Choose your GPU hardware
Pick a GPU based on your model’s VRAM requirements. Larger models need more VRAM, and you can scale up to 8 GPUs per instance for models that support tensor parallelism.

| GPU type | VRAM per GPU | Best for |
|---|---|---|
| H200 | 141 GB | Most 70B+ models, long context |
| B200 | 180 GB | Largest models, maximum throughput |
GPU instances incur hourly charges starting from provisioning. Check current pricing on the GPU TEE dashboard before launching.
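As a rough way to size a model against the table above: transformer weights in fp16/bf16 take about 2 bytes per parameter, before counting KV cache and activation overhead. This back-of-the-envelope helper is an illustration only, not part of Phala’s tooling:

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM for model weights alone (fp16/bf16 ~ 2 bytes per parameter)."""
    return params_billion * bytes_per_param

# 70B parameters in bf16 -> ~140 GB of weights, which barely fits one 141 GB H200;
# KV cache and activations push real deployments to 2+ GPUs or quantized weights.
print(weight_vram_gb(70))  # 140.0
```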
## Deploy your instance
### Launch the GPU TEE wizard
Sign in to cloud.phala.com, click GPU TEE in the navigation bar, then click Start Building. Select your GPU type and count.
### Choose the vLLM template
In the deployment configuration, select vLLM as your template. This gives you a production-ready inference server with OpenAI-compatible API endpoints out of the box.

If you prefer full control, choose Custom Configuration and provide your own Docker Compose file. See the custom vLLM compose example below.
### Configure your model
Set the model you want to serve. For Hugging Face models, use the model ID directly (e.g., `meta-llama/Llama-3.1-70B-Instruct`). Add any required environment variables, such as `HUGGING_FACE_HUB_TOKEN`, as encrypted secrets if your model needs authentication.

### Custom vLLM Docker Compose
If you chose Custom Configuration, here’s a Docker Compose file for serving a model with vLLM. This example serves Qwen 2.5 7B on a single GPU. Adjust `--tensor-parallel-size` to match your GPU count; with 4 GPUs, for example, pass `--tensor-parallel-size 4`.
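A compose file along these lines is a reasonable starting point (a sketch assuming vLLM’s official `vllm/vllm-openai` image; adjust the model, cache path, and flags to your deployment):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: >
      --model Qwen/Qwen2.5-7B-Instruct
      --tensor-parallel-size 1
      --max-model-len 8192
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    ports:
      - "8000:8000"
    volumes:
      - /var/cache/huggingface:/root/.cache/huggingface
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Keep `HUGGING_FACE_HUB_TOKEN` as an encrypted secret rather than hard-coding it in the file.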
## Access the OpenAI-compatible API
Once your instance is running, vLLM exposes an OpenAI-compatible API. Find your instance’s URL in the dashboard under your GPU TEE instance details.
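For example, with curl (the URL here is a placeholder; substitute your instance’s actual endpoint, and use the model name you configured):

```shell
curl https://<your-instance-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello from a TEE."}]
  }'
```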
## Verify TEE attestation

Your GPU TEE instance runs on genuine Intel TDX + NVIDIA confidential compute hardware. Verify this to confirm your model is actually running inside a trusted execution environment.

### Check GPU TEE status
Open JupyterLab (if using the Jupyter template) or SSH into your instance and check the GPU confidential-compute status.
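One way to check (flag names vary by driver release; the `conf-compute` subcommand is available on confidential-compute-capable NVIDIA drivers):

```shell
# Query confidential-compute state from the NVIDIA driver
nvidia-smi conf-compute -f

# Broader dump if the subcommand differs on your driver version
nvidia-smi -q | grep -i -A3 "confidential"
```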
The output should show `CC State: ON` and `CPU CC Capabilities: INTEL TDX`.

### Run local GPU verification
Install NVIDIA’s attestation tools and run the local verifier. Successful output confirms your GPUs are genuine NVIDIA hardware with confidential compute enabled.
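NVIDIA ships a local verifier as part of its nvtrust attestation tooling; a typical invocation looks like this (package and module names per NVIDIA’s attestation docs; check them against the current release):

```shell
pip install nv-local-gpu-verifier
python3 -m verifier.cc_admin
```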
### Verify the Intel TDX quote
Fetch and verify the CPU attestation through Phala’s verification API. You can also paste the quote into the TEE Attestation Explorer for a visual breakdown.
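A sketch of fetching a quote from inside the CVM via the dstack guest agent (the socket path and RPC name are assumptions that differ across dstack versions; consult Phala’s verification docs for the exact verify endpoint):

```shell
# Fetch a TDX quote from the dstack (tappd) guest agent inside the instance
curl --unix-socket /var/run/tappd.sock \
  -X POST "http://localhost/prpc/Tappd.TdxQuote?json" \
  -H "Content-Type: application/json" -d '{}' > quote.json
```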
## Add a custom domain
Expose your model API on your own domain by adding dstack-ingress to your Docker Compose. This handles DNS and TLS certificates automatically.
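A minimal sketch of the idea (the service, image, and variable names here are placeholders, not the exact dstack-ingress interface; see the networking guides for the real configuration):

```yaml
services:
  dstack-ingress:
    image: dstack/ingress:latest   # placeholder; use the image from Phala's guide
    environment:
      - DOMAIN=api.example.com     # your custom domain
      - TARGET=vllm:8000           # forward to the vLLM service
```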
## Next steps

- **Verify attestation**: Programmatically verify your GPU TEE hardware and software stack
- **Verify signatures**: Confirm inference responses are authentic using cryptographic signatures
- **Performance benchmark**: See how TEE mode performs compared to native GPU execution
- **Networking guides**: Expose services, set up gRPC, and configure TCP endpoints

