> Comprehensive performance analysis of Phala Cloud's TEE infrastructure.

# Performance Report

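**TL;DR** - Performance of TEE environments compared with non-TEE environments:

* **CPU**: About 5% slower than bare metal at 8 threads (7-Zip), with 80% scaling efficiency at 32 threads; single-threaded runs show a larger gap (38% slower)
* **GPU**: The H200 NVL offers 6.9x the INT8 compute of the AWS G6 (L4). For LLM inference, TEE adds under 9% throughput overhead and up to 26% time-to-first-token overhead
* **IO**: Sequential reads inside VMs exceed bare-metal throughput (9,973 vs 4,338 MiB/s on the H200 CVM; 36,230 MiB/s on the TDX-Lab VM); average read latency is higher in the CVM (482 vs 213 μs on TDX-Lab)
* **Network**: HTTPS through the gateway has 8.8x the latency of direct HTTP, but the gateway sustains 1,245 QPS at 1,000 concurrent connections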

### I. Introduction

**TEE Performance Overview**

This report benchmarks Phala Cloud's TEE infrastructure across compute, storage, network, and zero-knowledge proof workloads. The sections below give the measurements behind each figure summarized here.

* **Compute Performance**\
  Results from the 7-Zip benchmarks and the GPU hardware comparison:
  * **CPU**: 1,271 compression MIPS per core (7-Zip, TDX-Lab)
  * **Scaling**: 79.91% scaling efficiency at 32 threads
  * **Multi-core**: \< 10% slower than bare-metal at 8 threads
  * **GPU**: 3,341 TOPS for INT8 operations on the H200 NVL (6.9x the AWS G6 L4)

* **Storage I/O**\
  Storage overhead depends strongly on the access pattern:
  * **Random Read**: Average read latency rises from 213 μs to 482 μs in the TDX-Lab CVM (85 μs to 892 μs on H200)
  * **Sequential I/O**: Sequential reads inside VMs exceed bare-metal throughput, reaching 36,230 MiB/s in the TDX-Lab VM
  * **Caching**: Significantly improves random read performance

* **Network & Security**\
  The gateway sustains its throughput under high concurrency:
  * **Gateway**: 1,245 QPS at 1,000 concurrent connections
  * **HTTPS**: 781% latency increase vs direct HTTP (791.3 ms vs 89.8 ms)

* **Zero-Knowledge Proofs**\
  Our implementation balances security and performance:
  * **Baseline**: 5.12x slowdown in the zkTLS benchmark (18.3 kHz in TEE vs 98.5 kHz non-TEE)
  * **GPU Acceleration**: Reduces overhead to \< 1.8x
  * **Throughput**: 25.1k proofs/second (GPU-accelerated)

**Performance Highlights**

| **Component**          | **Key Metric**       | **Performance** | **Key Insight**                         |
| ---------------------- | -------------------- | --------------- | --------------------------------------- |
| **GPU Performance**    | INT8 Compute         | 3,341 TOPS      | **6.9x** the AWS G6 (L4)                |
| **Per-core CPU**       | Compression Speed    | 1,271 MIPS      | **About 5% slower** than bare metal at 8 threads |
| **Multi-core Scaling** | 32-core Efficiency   | 79.91%          | **Near-linear scaling** with core count |
| **Sequential I/O**     | Read Throughput      | 36,230 MiB/s    | **Exceeds bare-metal** throughput       |
| **Network Throughput** | QPS @1000cc          | 1,245           | **6.9x higher** than Prod5              |
| **Security Overhead**  | zkProof Verification | 18.3 kHz        | **5.1x** security overhead (vs non-TEE) |

***

### II. Storage Performance Analysis [\[1\]](#ref1)

**Experiment Setup**

```bash theme={"system"}
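# Run the fio benchmark container, mounting the host's /fio directory as the test target
docker run -it --rm -v /fio:/data infrabuilder/fio

# Benchmark script: https://github.com/InfraBuilder/docker-fio/blob/main/benchmark.sh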
```

**Performance Summary**

| Metric                           | H200 Host | H200 CVM  | TDX-Lab Host | TDX-Lab CVM | TDX-Lab VM |
| -------------------------------- | --------- | --------- | ------------ | ----------- | ---------- |
| **Random R/W IOPS (k)**          | 259/156   | 33/34     | 97.3/32      | 43/37       | 72/75      |
| **Random R/W Bandwidth (MiB/s)** | 3642/3356 | 2071/1389 | 519/381      | 1080/1193   | 12253/4215 |
| **Avg Latency R/W (μs)**         | 85/24     | 892/955   | 213/99       | 482/190     | 63/68      |
| **Sequential R/W (MiB/s)**       | 4338/3413 | 9973/2293 | 526/377      | 1168/412    | 36230/3829 |
| **Mixed R/W IOPS (k)**           | 129/43    | 18.7/6.2  | 67.7/23.5    | 43.6/14.5   | 71/23      |

**Key Findings**

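1. On H200, random and mixed IOPS in the TEE CVM are 78-87% lower than on the bare-metal host (for example, 33k vs 259k random read IOPS); the gap is smaller on TDX-Lab
2. Memory encryption lengthens the I/O path and introduces context-switching overhead
3. Random operations are the most affected: H200 CVM random IOPS are 13-22% of bare-metal performance
4. Sequential read throughput is higher in the CVM than on the host (9,973 vs 4,338 MiB/s on H200), possibly because QEMU caches data in host memory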
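
The comparison behind these findings can be reproduced from the Performance Summary table. The following is a minimal sketch that divides CVM random IOPS by host random IOPS; the helper name `retained_fraction` and the dictionary layout are illustrative assumptions, not part of the benchmark tooling.

```python
# (read, write) random IOPS in thousands, copied from the Performance Summary table.
RANDOM_IOPS_K = {
    "H200": {"host": (259, 156), "cvm": (33, 34)},
    "TDX-Lab": {"host": (97.3, 32), "cvm": (43, 37)},
}


def retained_fraction(host: float, cvm: float) -> float:
    """Fraction of host performance that the CVM retains (hypothetical helper)."""
    return cvm / host


for platform, rows in RANDOM_IOPS_K.items():
    read = retained_fraction(rows["host"][0], rows["cvm"][0])
    write = retained_fraction(rows["host"][1], rows["cvm"][1])
    print(f"{platform}: read {read:.0%} of host, write {write:.0%} of host")
```

A value below 100% means the CVM is slower than the host for that operation; the same helper applies to the bandwidth and mixed-IOPS rows.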

***

### III. Network Gateway Performance [\[2\]](#ref2)

**Experiment Setup**

```bash theme={"system"}
ab -n 5000 -c CONCURRENCY_NUMBER https://e7cc25b0992a0e16b3377652efca9c0a6559d407-8090.app.kvin.wang:12004/prpc/Version
```

**Core Performance: Prod5 vs TDX-Lab**

| Metric                    | Concurrency | Prod5  | TDX-Lab | Gain  | TDX-Lab Advantage  |
| ------------------------- | ----------- | ------ | ------- | ----- | ------------------ |
| **QPS**                   | 200         | 206.46 | 1189.93 | +476% | 5.76x              |
|                           | 500         | 194.70 | 1228.82 | +531% | 6.31x              |
|                           | 1000        | 179.33 | 1244.85 | +594% | 6.94x              |
|                           | 2000        | 182.39 | 1038.60 | +469% | 5.70x              |
| **P99 Latency (ms)**      | 200         | 1,254  | 173     | -86%  | 7.25x              |
|                           | 500         | 4,099  | 455     | -89%  | 9.01x              |
|                           | 1000        | 7,768  | 822     | -89%  | 9.45x              |
|                           | 2000        | 19,950 | 1,672   | -92%  | 11.93x             |
| **Error Rate**            | 2000        | 1.36%  | 0%      | -100% | Absolute advantage |
| **Max Connect Time (ms)** | 2000        | 27,084 | 1,677   | -94%  | 16.15x             |

**TProxy Overhead Analysis**

| Access Method  | Environment | QPS    | Avg Latency | Overhead Source   |
| -------------- | ----------- | ------ | ----------- | ----------------- |
| Direct (HTTP)  | TDX-Lab     | 11,132 | 89.8ms      | Baseline          |
| TProxy (HTTPS) | TDX-Lab     | 1,264  | 791.3ms     | **+781% latency** |
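
The overhead figure in the last column is a relative increase over the direct HTTP baseline. As a rough sketch (the helper `percent_increase` and the variable names are assumptions made for illustration), it can be recomputed from the table values like this:

```python
# QPS and average latency (ms), copied from the TProxy Overhead Analysis table.
DIRECT_HTTP = {"qps": 11_132, "avg_latency_ms": 89.8}
TPROXY_HTTPS = {"qps": 1_264, "avg_latency_ms": 791.3}


def percent_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100


latency_increase = percent_increase(
    DIRECT_HTTP["avg_latency_ms"], TPROXY_HTTPS["avg_latency_ms"]
)
qps_ratio = DIRECT_HTTP["qps"] / TPROXY_HTTPS["qps"]

print(f"latency increase: {latency_increase:.0f}%")
print(f"direct HTTP serves {qps_ratio:.1f}x the QPS of TProxy HTTPS")
```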

**Multi-process AB Test**

* **Experiment Setup**
  * Gateway version: git commit `7bc9eea958bd8aaca228341139f2cff5fab1d8d9`

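```bash theme={"system"}
# Run 51 ab processes in parallel (50 in the background, 1 in the foreground),
# each sending 5,000 requests at a concurrency of 40.
URL=https://health.app.kvin.wang:18714/
AB="ab -n 5000 -c 40 $URL"
for _ in $(seq 1 50); do
  $AB &
done
$AB
wait  # let the background processes finish before the script exits
```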
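
The report does not state how the per-process results were combined into a total. One plausible approach, sketched below under that assumption, is to capture each `ab` process's output and sum the `Requests per second` lines that `ab` prints in its summary. The function name `total_qps` is hypothetical.

```python
import re

# ab prints a summary line such as:
#   Requests per second:    1244.85 [#/sec] (mean)
RPS_PATTERN = re.compile(r"^Requests per second:\s+([\d.]+)", re.MULTILINE)


def total_qps(ab_outputs: list[str]) -> float:
    """Sum the mean requests-per-second reported by each captured ab run."""
    total = 0.0
    for text in ab_outputs:
        match = RPS_PATTERN.search(text)
        if match is None:
            raise ValueError("no 'Requests per second' line found in ab output")
        total += float(match.group(1))
    return total


# Example with two shortened, made-up ab outputs.
sample = [
    "Requests per second:    300.50 [#/sec] (mean)\n",
    "Time taken for tests:   1.0 seconds\nRequests per second:    200.25 [#/sec] (mean)\n",
]
print(total_qps(sample))
```

Summing per-process rates is only an approximation, because it treats all processes as running for the same interval.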

* **Results**

| Environment | Gateway Ver. | Log Level | Total QPS | CPU Usage | Bottleneck           |
| ----------- | ------------ | --------- | --------- | --------- | -------------------- |
| TDX-Lab     | 7bc9eea      | error     | 15,000    | 94.5%     | **CPU saturation**   |
| Prod8       | 7bc9eea      | error     | 22,400    | 19.2%     | Network interrupt    |
| Prod8       | 7bc9eea      | info      | 15,000    | 14.7%     | **Log I/O blocking** |

**Conclusions**

1. TDX-Lab outperforms Prod5 across all concurrency levels
2. TLS handshake accounts for 70% of TProxy overhead
3. Info-level logging reduces Prod8 performance by 33%

***

### IV. zkTLS Performance in TEE [\[3\]](#ref3)

**Core Performance (2048-bit Verification)**

| Environment | Total Time (s) | Proof Time (s) | Speed (Hz)  | Proof Size (bytes) | TEE Overhead |
| ----------- | -------------- | -------------- | ----------- | ------------------ | ------------ |
| CPU         | 166.54         | 166.30         | 98,479      | 8,340,752          | Baseline     |
| TEE CPU     | 628.78         | 628.58         | 24,704      | 25,123,323         | 3.78x        |
| TEE GPU     | 852.98         | 852.73         | 18,312      | 25,123,323         | 5.12x        |

**Memory Encryption Impact**

| Environment | Hash Speed (MB/s) | Memory BW Utilization |
| ----------- | ----------------- | --------------------- |
| CPU         | 44.50             | 100%                  |
| TEE CPU     | 11.77             | 26.4%                 |
| TEE GPU     | 8.68              | 19.5%                 |

**Key Findings**

1. Memory encryption reduces hash throughput by 74-81% (11.77 and 8.68 MB/s vs 44.50 MB/s)
2. Data structure padding increases proof size by 201% (25,123,323 vs 8,340,752 bytes)
3. Data migration overhead increases by 200% in TEE GPU
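
The ratios in this section follow directly from the two tables above. The sketch below recomputes them with the non-TEE CPU run as the baseline; the dictionary keys and helper names are illustrative assumptions.

```python
# Values copied from the Core Performance and Memory Encryption Impact tables.
RESULTS = {
    "CPU": {"total_s": 166.54, "proof_bytes": 8_340_752, "hash_mb_s": 44.50},
    "TEE CPU": {"total_s": 628.78, "proof_bytes": 25_123_323, "hash_mb_s": 11.77},
    "TEE GPU": {"total_s": 852.98, "proof_bytes": 25_123_323, "hash_mb_s": 8.68},
}
BASELINE = RESULTS["CPU"]


def slowdown(env: str) -> float:
    """Total proving time relative to the non-TEE CPU baseline."""
    return RESULTS[env]["total_s"] / BASELINE["total_s"]


def proof_growth_percent(env: str) -> float:
    """Increase in proof size over the baseline, in percent."""
    return (RESULTS[env]["proof_bytes"] / BASELINE["proof_bytes"] - 1) * 100


def hash_speed_retained(env: str) -> float:
    """Hash throughput as a fraction of the baseline."""
    return RESULTS[env]["hash_mb_s"] / BASELINE["hash_mb_s"]


for env in ("TEE CPU", "TEE GPU"):
    print(
        f"{env}: {slowdown(env):.2f}x slower, "
        f"proof +{proof_growth_percent(env):.0f}%, "
        f"hash speed {hash_speed_retained(env):.1%} of baseline"
    )
```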

***

### V. Multi-threaded Computing Capability [\[4\]](#ref4)

**Experiment Setup**

```bash theme={"system"}
7z b -mmt8
```

**Compression Performance Benchmark**

| System Config | Compress MIPS | Cores | MIPS per Core       |
| ------------- | ------------- | ----- | ------------------- |
| TDX-Lab       | 40,677        | 32    | 1,271               |
| Prod8         | 31,658        | 288   | 110                 |
| Prod5         | 12,654        | 128   | 99                  |

**Conclusions**

1. TDX-Lab excels in compute-intensive tasks (high single-core frequency)
2. Prod8 leads in memory-bound operations (DDR5 advantage)
3. Prod5 suffers from frequency instability (48.7% fluctuation)

### VI. zkVM + TEE GPU Integration [\[5\]](#ref5)

**Hardware Comparison**

| Metric                     | H200 NVL    | AWS G6 (L4) | Advantage |
| -------------------------- | ----------- | ----------- | --------- |
| Memory Capacity            | 141GB       | 24GB        | 5.9x      |
| Memory Bandwidth           | 4.8TB/s     | 300GB/s     | 16x       |
| INT8 Compute               | 3341 TOPS   | 485 TOPS    | 6.9x      |
| Hourly Cost                | \$2.5       | \$0.805     | 3.1x (higher cost) |
| **Cost-Performance Ratio** | **1.0**     | **0.45**    | **2.2x**  |

> Cost-Performance Ratio = (Compute/Cost) / H200 Baseline
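
As a worked example of the formula above, the sketch below divides each GPU's INT8 compute by its hourly cost and normalizes against the H200. The values come from the Hardware Comparison table; the function and key names are assumptions made for illustration.

```python
# INT8 compute and hourly cost (USD), copied from the Hardware Comparison table.
GPUS = {
    "H200 NVL": {"int8_compute": 3341, "hourly_cost_usd": 2.5},
    "AWS G6 (L4)": {"int8_compute": 485, "hourly_cost_usd": 0.805},
}


def compute_per_dollar(gpu: str) -> float:
    """INT8 compute obtained per dollar of hourly cost."""
    spec = GPUS[gpu]
    return spec["int8_compute"] / spec["hourly_cost_usd"]


def cost_performance_ratio(gpu: str, baseline: str = "H200 NVL") -> float:
    """(Compute / Cost) normalized to the baseline GPU."""
    return compute_per_dollar(gpu) / compute_per_dollar(baseline)


for name in GPUS:
    print(f"{name}: {cost_performance_ratio(name):.2f}")
```

The H200's advantage in the last column is the inverse of the L4's normalized ratio.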

**Integration Benefits**

1. **Seamless deployment**: SP1 zkVM runs on TEE GPU without code modifications
2. **Dual security**: Hardware encryption + cryptographic verifiability
3. **Memory advantage**: Supports complex workloads (zkEVMs, 100B+ parameter models)
4. **Optimization headroom**: Current utilization \< 30% of available resources

***

### VII. TEE Scalability in Large-Scale LLM Inference [\[6\]](#ref6)

**Performance Metrics Across Models**

*Table 1: Throughput Comparison (Tokens/Requests per Second)*

| **GPU** | **Model**     | **TPS (TEE-on)** | **TPS (TEE-off)** | **TPS Overhead** | **QPS (TEE-on)** | **QPS (TEE-off)** | **QPS Overhead** |
| ------- | ------------- | ---------------- | ----------------- | ---------------- | ---------------- | ----------------- | ---------------- |
| H100    | Llama-3.1-8B  | 123.30           | 132.36            | +6.85%           | 18.21            | 18.82             | +3.22%           |
|         | Phi3-14B-128k | 66.58            | 69.78             | +4.58%           | 7.18             | 7.35              | +2.31%           |
|         | Llama-3.1-70B | 2.48             | 2.48              | **-0.13%**       | 0.83             | 0.83              | **-0.36%**       |
| H200    | Llama-3.1-8B  | 121.04           | 132.78            | +8.84%           | 29.60            | 32.01             | +7.55%           |
|         | Phi3-14B-128k | 68.43            | 72.98             | +6.24%           | 12.83            | 13.86             | +7.41%           |
|         | Llama-3.1-70B | 4.08             | 4.18              | +2.29%           | 2.19             | 2.20              | +0.63%           |

*Table 2: Latency Metrics (Time in Seconds)*

| **GPU** | **Model**     | **TTFT (TEE-on)** | **TTFT (TEE-off)** | **TTFT Overhead** | **ITL (TEE-on)** | **ITL (TEE-off)** | **ITL Overhead** |
| ------- | ------------- | ----------------- | ------------------ | ----------------- | ---------------- | ----------------- | ---------------- |
| H100    | Llama-3.1-8B  | 0.0288            | 0.0242             | +19.03%           | 1.6743           | 1.5549            | +7.67%           |
|         | Phi3-14B-128k | 0.0546            | 0.0463             | +18.02%           | 3.7676           | 3.5784            | +5.29%           |
|         | Llama-3.1-70B | 0.5108            | 0.5129             | **-0.41%**        | 94.8714          | 95.2395           | **-0.39%**       |
| H200    | Llama-3.1-8B  | 0.0364            | 0.0301             | +20.95%           | 1.7158           | 1.5552            | +10.33%          |
|         | Phi3-14B-128k | 0.0524            | 0.0417             | +25.60%           | 3.6975           | 3.4599            | +6.87%           |
|         | Llama-3.1-70B | 0.4362            | 0.4204             | +3.75%            | 57.3855          | 55.9771           | +2.52%           |
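
The overhead columns in both tables appear to be relative differences against the TEE-off value: throughput overhead measures lost throughput, while latency overhead measures added time. The sketch below shows that reading. The formulas are inferred from the table values, not quoted from the paper, and recomputing from the rounded table entries can differ from the published percentages in the last digit.

```python
def throughput_overhead_percent(tee_on: float, tee_off: float) -> float:
    """Throughput lost with TEE enabled, relative to TEE-off (TPS, QPS)."""
    return (tee_off - tee_on) / tee_off * 100


def latency_overhead_percent(tee_on: float, tee_off: float) -> float:
    """Latency added with TEE enabled, relative to TEE-off (TTFT, ITL)."""
    return (tee_on - tee_off) / tee_off * 100


# H100, Llama-3.1-8B rows from Tables 1 and 2.
tps_overhead = throughput_overhead_percent(tee_on=123.30, tee_off=132.36)
itl_overhead = latency_overhead_percent(tee_on=1.6743, tee_off=1.5549)

print(f"TPS overhead: {tps_overhead:+.2f}%")
print(f"ITL overhead: {itl_overhead:+.2f}%")
```

With both definitions a positive value means the TEE run is worse, and a negative value (as for Llama-3.1-70B on H100) means the TEE run measured slightly better.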

**Key Findings**

1. **Inverse Efficiency Scaling**
   * Overhead **decreases as model size grows**
   * 70B models show **near-zero overhead** on H100 (-0.13% TPS, -0.41% TTFT)
   * 8B models sustain 3-21% overhead due to shorter compute phases
2. **Computation/IO Asymmetry**

   | **Model**     | **GPU Time Dominance** | **TEE Overhead** |
   | ------------- | ---------------------- | ---------------- |
   | Llama-3.1-70B | 99.2%                  | \< 0.5%          |
   | Phi3-14B-128k | 95.7%                  | 2-7%             |
   | Llama-3.1-8B  | 88.3%                  | 7-26%            |
3. **Token Volume Law**
   * Every 10k token increase reduces overhead by 37%
   * At >50k tokens, TEE efficiency exceeds 95%
   * On the same GPU, Phi3-14B-128k shows lower ITL overhead than the 8B model: **5.29%** vs 7.67% on H100, and 6.87% vs 10.33% on H200

### VIII. CPU Scaling Efficiency Analysis [\[7\]](#ref7)

**Experiment Setup**

```bash theme={"system"}
for threads in 1 2 4 8 16 32; do echo -e "\n=== Running with $threads threads ===\n"; 7z b -mmt$threads; done
```

**Comparative Multi-threading Performance**

*Table 1: TEE CVM Scaling Performance (7-Zip Benchmark)*

| **Threads** | **CPU Usage (%)** | **Compression (KiB/s)** | **Decompression (KiB/s)** | **Total Rating** | **Scaling Efficiency** |
| ----------- | ----------------- | ----------------------- | ------------------------- | ---------------- | ---------------------- |
| 1           | 100               | 2,511                   | 34,387                    | 2,803            | 100.00%                |
| 2           | 200               | 9,450                   | 68,316                    | 7,889            | 140.75%                |
| 4           | 383               | 14,048                  | 135,014                   | 13,100           | 116.90%                |
| 8           | 760               | 27,361                  | 268,126                   | 25,821           | 115.17%                |
| 16          | 1,432             | 44,535                  | 443,829                   | 42,322           | 94.38%                 |
| 32          | 2,811             | 72,284                  | 783,537                   | 71,665           | 79.91%                 |

*Table 2: Bare Metal Scaling Performance (7-Zip Benchmark)*

| **Threads** | **CPU Usage (%)** | **Compression (KiB/s)** | **Decompression (KiB/s)** | **Total Rating** | **Scaling Efficiency** |
| ----------- | ----------------- | ----------------------- | ------------------------- | ---------------- | ---------------------- |
| 1           | 100               | 4,992                   | 43,200                    | 4,544            | 100% (baseline)        |
| 2           | 183               | 9,791                   | 63,936                    | 7,935            | 87.3%                  |
| 4           | 382               | 18,895                  | 109,687                   | 14,689           | 80.9%                  |
| 8           | 787               | 34,001                  | 217,465                   | 27,279           | 75.1%                  |
| 16          | 1,559             | 71,892                  | 437,145                   | 56,853           | 78.1%                  |
| 32          | 2,662             | 121,055                 | 797,133                   | 97,853           | 67.4%                  |
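
The Scaling Efficiency column in both tables is consistent with dividing the measured Total Rating by ideal linear scaling from the single-thread rating. The sketch below recomputes it under that assumption; the function names are illustrative, and results may differ from the tables in the last digit because of rounding.

```python
# Total Rating per thread count, copied from Tables 1 and 2.
TEE_CVM = {1: 2_803, 2: 7_889, 4: 13_100, 8: 25_821, 16: 42_322, 32: 71_665}
BARE_METAL = {1: 4_544, 2: 7_935, 4: 14_689, 8: 27_279, 16: 56_853, 32: 97_853}


def scaling_efficiency(ratings: dict[int, int], threads: int) -> float:
    """Measured rating as a percentage of ideal linear scaling from 1 thread."""
    return ratings[threads] / (threads * ratings[1]) * 100


def relative_throughput(threads: int) -> float:
    """TEE CVM rating as a percentage of the bare-metal rating."""
    return TEE_CVM[threads] / BARE_METAL[threads] * 100


for n in TEE_CVM:
    print(
        f"{n:>2} threads: TEE {scaling_efficiency(TEE_CVM, n):6.2f}%, "
        f"bare metal {scaling_efficiency(BARE_METAL, n):6.2f}%, "
        f"TEE/bare metal {relative_throughput(n):5.1f}%"
    )
```

Efficiency is relative to each environment's own single-thread rating, so a low single-thread baseline raises every other efficiency value for that environment. The absolute comparison is the TEE/bare-metal ratio in the last column.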

**Key Findings**

1. **Superlinear Scaling in TEE**
   * At 2 threads: **140.75% efficiency** vs bare metal's 87.3%
   * At 4 threads: **116.9% efficiency** (36 percentage points above bare metal's 80.9%)
2. **Memory Encryption Optimization**

   | **Thread Range** | **Efficiency Delta** | **Primary Benefit**         |
   | ---------------- | -------------------- | --------------------------- |
   | 1-8 threads      | +35.2% avg           | Cache locality optimization |
   | 16-32 threads    | +12.1% avg           | Reduced context-switching   |
3. **Total Performance Impact**
   * Single-thread penalty: **38% performance loss** in TEE (2,803 vs 4,544)
   * 32-thread recovery: **73.2% of bare metal** throughput (71,665 vs 97,853)

**Scaling Characteristics**

| **Metric**      | **TEE Advantage**  | **Technical Cause**                |
| --------------- | ------------------ | ---------------------------------- |
| Initial Scaling | +53.45% (2-thread) | Memory-bound workload optimization |
| Mid-range       | +36.0% (4-thread)  | Reduced hypervisor interference    |
| High-core       | +12.5% (32-thread) | NUMA-aware scheduling              |

**Conclusions**

1. TEE demonstrates **superior scaling efficiency** (avg +35% at ≤8 threads) due to encrypted memory access optimizations
2. Scaling beyond 16 threads becomes memory-bound, reducing TEE's relative advantage
3. Maximum throughput reaches **73% of bare metal** in fully-loaded scenarios

***

## References

1. <a id="ref1" href="https://www.notion.so/phalanetwork/IO-Benchmark-with-FIO-1700317e04a180728b15cd034a3b3c52?source=copy_link">IO Benchmark with FIO</a>
2. <a id="ref2" href="https://www.notion.so/phalanetwork/Gateway-benchmark-1ed0317e04a180598914d6729ae169ab?source=copy_link">Gateway Benchmark Analysis</a>
3. <a id="ref3" href="https://www.notion.so/phalanetwork/Benchmark-zktls-in-TEE-zkvm-1dd0317e04a18006a1b3c8c8fc1c1b1a?source=copy_link">zkTLS in TEE zkVM Benchmark</a>
4. <a id="ref4" href="https://www.notion.so/phalanetwork/TDX-Host-benchmark-2090317e04a180a5b5f0f9b3c1301c4a?source=copy_link">TDX Host Benchmark</a>
5. <a id="ref5" href="https://www.notion.so/phalanetwork/Performance-Benchmark-Running-SP1-zkVM-in-TEE-H200-with-Low-Overhead-1bb0317e04a18005ad08eee35e2fc0cc?source=copy_link">SP1 zkVM in TEE H200 Performance Benchmark</a>
6. <a id="ref6" href="https://arxiv.org/pdf/2409.03992">TEE Scalability in Large-Scale LLM Inference</a>. (2024). arXiv:2409.03992
7. <a id="ref7" href="https://www.notion.so/phalanetwork/dstack-CPU-Benchmark-21c0317e04a18090a1e1f0d91718b80d?source=copy_link">dstack CPU Benchmark</a>

*Note: Internal reports are available upon request. Please contact the Phala Team for access to the full documentation.*

***
