# Deploy AI Models in GPU TEE

> End-to-end guide for deploying custom LLMs on GPU TEE with vLLM and accessing them via OpenAI-compatible API

Deploy a custom LLM on Phala Cloud GPU TEE and serve it through an OpenAI-compatible API. This guide walks you through the full workflow from choosing hardware to making your first inference request with verified TEE attestation.

For pre-deployed models that don't require infrastructure setup, use [On-demand API](/phala-cloud/confidential-ai/confidential-model/confidential-ai-api) or [Dedicated Models](/phala-cloud/confidential-ai/confidential-gpu/model-template) instead.

## Prerequisites

* Phala Cloud account with sufficient credits for GPU instances
* A model you want to deploy (Hugging Face model ID or custom weights)
* Basic familiarity with Docker Compose and LLM serving

## Choose your GPU hardware

Pick a GPU based on your model's VRAM requirements. Larger models need more VRAM, and you can scale up to 8 GPUs per instance for models that support tensor parallelism.

| GPU type | VRAM per GPU | Best for                           |
| -------- | ------------ | ---------------------------------- |
| H200     | 141 GB       | Most 70B+ models, long context     |
| B200     | 180 GB       | Largest models, maximum throughput |

A 7B model fits comfortably on a single GPU. A 70B model typically needs 2-4 GPUs depending on quantization. Check your model's documentation for exact VRAM requirements.

GPU instances incur hourly charges starting from provisioning. Check current pricing on the [GPU TEE](https://cloud.phala.com) dashboard before launching.

## Deploy your instance

Sign in to [cloud.phala.com](https://cloud.phala.com), click **GPU TEE** in the navigation bar, then click **Start Building**. Select your GPU type and count.

In the deployment configuration, select **vLLM** as your template.
This gives you a production-ready inference server with OpenAI-compatible API endpoints out of the box.

If you prefer full control, choose **Custom Configuration** and provide your own Docker Compose file. See the [custom vLLM compose example](#custom-vllm-docker-compose) below.

Set the model you want to serve. For Hugging Face models, use the model ID directly (e.g., `meta-llama/Llama-3.1-70B-Instruct`). Add any required environment variables like `HUGGING_FACE_HUB_TOKEN` as encrypted secrets if your model needs authentication.

Review the pricing summary and click **Launch Instance**. GPU provisioning takes approximately 1 day. Monitor the status in your dashboard as it progresses through **Preparing**, **Starting**, and **Running**.

## Custom vLLM Docker Compose

If you chose Custom Configuration, here's a Docker Compose file for serving a model with vLLM. This example serves Qwen 2.5 7B on a single GPU.

```yaml theme={"system"}
services:
  vllm:
    image: vllm/vllm-openai:v0.8.5@sha256:abc123...  # Pin to digest for attestation
    ports:
      - "8000:8000"
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model qwen/Qwen2.5-7B-Instruct
      --tensor-parallel-size 1
      --max-model-len 32768
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

For multi-GPU serving, increase `--tensor-parallel-size` to match your GPU count. For example, with 4 GPUs:

```yaml theme={"system"}
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 65536
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
```

Pin your Docker images by SHA256 digest (not tags) for attestation verification. Tags like `vllm/vllm-openai:latest` are mutable and break the chain of trust. See [Verify Your Application](/phala-cloud/attestation/verify-your-application) for details.

## Access the OpenAI-compatible API

Once your instance is running, vLLM exposes an OpenAI-compatible API.
Find your instance's URL in the dashboard under your GPU TEE instance details, and substitute it for `<your-instance-url>` in the examples.

```bash curl theme={"system"}
curl https://<your-instance-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "What is confidential computing?"}
    ],
    "max_tokens": 256
  }'
```

```python Python theme={"system"}
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-instance-url>/v1",
    api_key="not-needed"  # No auth by default; add if configured
)

response = client.chat.completions.create(
    model="qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "What is confidential computing?"}
    ],
    max_tokens=256
)

print(response.choices[0].message.content)
```

```typescript TypeScript theme={"system"}
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://<your-instance-url>/v1',
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'qwen/Qwen2.5-7B-Instruct',
  messages: [
    { role: 'user', content: 'What is confidential computing?' }
  ],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
```

You can also check the available models and server health:

```bash theme={"system"}
# List loaded models
curl https://<your-instance-url>/v1/models

# Health check
curl https://<your-instance-url>/health
```

## Verify TEE attestation

Your GPU TEE instance runs on genuine Intel TDX + NVIDIA confidential compute hardware. Verify this to confirm your model is actually running inside a trusted execution environment.

Open JupyterLab (if using the Jupyter template) or SSH into your instance and run:

```bash theme={"system"}
nvidia-smi conf-compute -q
```

Confirm `CC State: ON` and `CPU CC Capabilities: INTEL TDX` in the output.

Install NVIDIA's attestation tools and run the verifier:

```bash theme={"system"}
pip install nv-local-gpu-verifier nv_attestation_sdk
python -m verifier.cc_admin
```

Successful output confirms your GPUs are genuine NVIDIA hardware with confidential compute enabled.
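
The `CC State` and `CPU CC Capabilities` check from the `nvidia-smi conf-compute -q` step can also be scripted. The following is a minimal Python sketch, assuming the command prints one `Key: Value` pair per line (the exact layout may differ between driver versions). The function names are hypothetical, and only the two fields named in this guide are checked.

```python
import subprocess

# Fields this guide says to confirm. Matching on exact key names is an
# assumption about the output layout of `nvidia-smi conf-compute -q`.
REQUIRED_FIELDS = {
    "CC State": "ON",
    "CPU CC Capabilities": "INTEL TDX",
}


def parse_conf_compute(output: str) -> dict:
    """Collect 'Key: Value' lines from the command output into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def confidential_compute_enabled(output: str) -> bool:
    """Return True only if every required field has the expected value."""
    fields = parse_conf_compute(output)
    return all(
        fields.get(key, "").upper() == expected
        for key, expected in REQUIRED_FIELDS.items()
    )


def check_local_gpu() -> bool:
    """Run nvidia-smi on this machine and evaluate its output."""
    result = subprocess.run(
        ["nvidia-smi", "conf-compute", "-q"],
        capture_output=True,
        text=True,
        check=True,
    )
    return confidential_compute_enabled(result.stdout)
```

Call `check_local_gpu()` from inside the instance, for example in a JupyterLab cell. This only inspects reported state; it does not replace the cryptographic verification performed by `python -m verifier.cc_admin`.
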
Fetch and verify the CPU attestation through Phala's verification API:

```bash theme={"system"}
# Get the attestation quote from your instance
curl https://<your-instance-url>/attestation -o attestation.json

# Verify through Phala's API
curl -X POST https://cloud-api.phala.com/api/v1/attestations/verify \
  -H "Content-Type: application/json" \
  -d @attestation.json
```

You can also paste the quote into the [TEE Attestation Explorer](https://proof.t16z.com/) for a visual breakdown.

For programmatic attestation verification in your application, see [Verify Attestation](/phala-cloud/confidential-ai/verify/verify-attestation).

## Add a custom domain

Expose your model API on your own domain by adding dstack-ingress to your Docker Compose. This handles DNS and TLS certificates automatically.

```yaml theme={"system"}
services:
  dstack-ingress:
    image: dstacktee/dstack-ingress:2.2@sha256:d05a7b343c37c1cca1bba8dbf7e8f3c6d2118158af2d41c455103796db4f67f0
    ports:
      - "443:443"
    environment:
      - DOMAIN=ai.mycompany.com
      - TARGET_ENDPOINT=vllm:8000
      - CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN}
      - GATEWAY_DOMAIN=_.${DSTACK_GATEWAY_DOMAIN}
      - CERTBOT_EMAIL=${CERTBOT_EMAIL}
      - SET_CAA=true
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock
      - /var/run/tappd.sock:/var/run/tappd.sock
      - cert-data:/etc/letsencrypt
      - evidences:/evidences

  vllm:
    image: vllm/vllm-openai:v0.8.5
    # ... your vLLM config from above

volumes:
  cert-data:
  evidences:
```

See [Set Up a Custom Domain](/phala-cloud/networking/setup-custom-domain) for the full guide including DNS provider configuration and troubleshooting.

## Next steps

* Programmatically verify your GPU TEE hardware and software stack
* Confirm inference responses are authentic using signed receipts
* See how TEE mode performs compared to native GPU execution
* Expose services, set up gRPC, and configure TCP endpoints