perf(api-server): tune gallery load shedding

kdletters
2026-05-19 01:00:33 +08:00
parent 3eb292b403
commit 8038b6a6ee
22 changed files with 1178 additions and 80 deletions

@@ -13,7 +13,8 @@ Docker Compose
└─ k6 started on demand when profile=loadtest; load-tests nginx from inside the compose network
```
The container simulation parameters are fixed to values sampled from the `genarrative-release` server: 2 vCPU / 2 GiB RAM / 4096 soft nofile / 768 worker_connections. Compose applies them as `spacetimedb cpus=1.0 mem_limit=768m`, `api-server cpus=2.0 mem_limit=1g`, `nginx cpus=0.25 mem_limit=128m`, `otelcol cpus=0.25 mem_limit=128m`, and `k6 cpus=0.5 mem_limit=512m`.
The container simulation parameters are fixed to values sampled from the `genarrative-release` server: 2 vCPU / 2 GiB RAM / 4096 soft nofile / 768 worker_connections. Compose applies them as `spacetimedb cpus=1.0 mem_limit=896m`, `api-server cpus=2.0 mem_limit=1g`, `nginx cpus=0.5 mem_limit=128m`, `otelcol cpus=0.25 mem_limit=128m`, and `k6 cpus=1.0 mem_limit=512m`. SpacetimeDB also sets `--page_pool_max_size=402653184`, which leaves more non-page-pool memory for reducers, subscriptions, and the runtime.
The `api-server` container defaults to `GENARRATIVE_API_WORKER_THREADS=4` so that Tokio has more I/O scheduling workers within the 2 vCPU quota; this value does not exceed the `cpus=2.0` CPU cap set in compose.
The Collector image is `otel/opentelemetry-collector-contrib:0.151.0`.
If the production server enables the Collector, it is managed by `deploy/systemd/otelcol-contrib.service` and `deploy/otelcol/genarrative-debug.yaml`, not by the container image.
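As a rough check of the SpacetimeDB memory budget, the sketch below subtracts the page pool cap from the container memory limit. The helper name `headroom_mib` is hypothetical. The previous page pool value `536870912` is taken from the compose change in this commit. The sketch assumes Docker's `m` suffix means MiB and treats the page pool as the only fixed reservation, which is a simplification.

```bash
#!/usr/bin/env bash
# Hypothetical helper: memory left outside the page pool, in MiB.
# Args: container mem_limit in MiB, page_pool_max_size in bytes.
headroom_mib() {
  local mem_limit_mib=$1 page_pool_bytes=$2
  echo $(( mem_limit_mib - page_pool_bytes / 1024 / 1024 ))
}

# Before this commit: mem_limit=768m, page pool capped at 536870912 bytes.
echo "before: $(headroom_mib 768 536870912) MiB"
# After this commit: mem_limit=896m, page pool capped at 402653184 bytes.
echo "after:  $(headroom_mib 896 402653184) MiB"
```

Under these assumptions the non-page-pool budget doubles from 256 MiB to 512 MiB, even though the container limit grows by only 128 MiB.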
@@ -52,6 +53,10 @@ GENARRATIVE_SPACETIME_TOKEN=
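On Linux Docker Engine, to reach an in-container service from the host CLI, use `http://127.0.0.1:13101` directly; services inside the containers talk to each other through `http://spacetimedb:3101`.
## Build toolchain
The `api-server` container image builds only the Linux release API binary, not `spacetime-module`. The current `api-server -> spacetime-client -> spacetimedb-sdk 2.2.0` dependency chain requires Rust 1.93, so the Rust builder in `deploy/container/api-server.Dockerfile` is pinned to `rust:1.93-bookworm`. If pulling from Docker Hub fails locally, you can first prepare a local builder image with the same name, but do not commit temporary bootstrap containers or private registry credentials to the repository.
## Startup and verification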
```bash
@@ -125,7 +130,19 @@ spacetime publish genarrative-loadtest --server http://127.0.0.1:13101 --module-
After publishing, run `npm run container:up` and then `npm run container:k6`. If `GENARRATIVE_SPACETIME_DATABASE` in `deploy/container/api-server.env` has been changed to another database name, update the database name in the publish command as well.
To drive 1000 HTTP req/s, set `PEAK_RPS` to `500`; to drive 5000 HTTP req/s, set `PEAK_RPS` to `2500` and also raise `PREALLOCATED_VUS` / `MAX_VUS`, then watch whether bandwidth, Nginx `limit_conn`, or api-server backpressure becomes the limit first.
To drive 1000 HTTP req/s, set `PEAK_RPS` to `500`; to drive 5000 HTTP req/s, set `PEAK_RPS` to `2500` and also raise `PREALLOCATED_VUS` / `MAX_VUS`, then watch whether bandwidth, Nginx `limit_conn` / `limit_req`, or api-server per-group backpressure becomes the limit first. The container Nginx currently uses `genarrative_gallery_rps` for public gallery lists, `genarrative_api_rps` for public detail and ordinary API routes, and `genarrative_admin_rps` for admin APIs; on the api-server side these correspond to `GENARRATIVE_API_GALLERY_MAX_CONCURRENT_REQUESTS`, `GENARRATIVE_API_DETAIL_MAX_CONCURRENT_REQUESTS`, and `GENARRATIVE_API_ADMIN_MAX_CONCURRENT_REQUESTS`.
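The `PEAK_RPS` values above follow from each k6 iteration issuing two HTTP requests, so the target HTTP rate is halved. A minimal sketch of that conversion, where the function name `peak_rps_for` is hypothetical and the default of two requests per iteration is inferred from the numbers in this section rather than read from the k6 script:

```bash
#!/usr/bin/env bash
# Hypothetical helper: PEAK_RPS needed for a target HTTP request rate.
# Args: target HTTP req/s, requests per k6 iteration (assumed default: 2).
peak_rps_for() {
  local target_http_rps=$1 requests_per_iteration=${2:-2}
  # Round up so the target rate is reached even when it does not divide evenly.
  echo $(( (target_http_rps + requests_per_iteration - 1) / requests_per_iteration ))
}

echo "PEAK_RPS for 1000 HTTP req/s: $(peak_rps_for 1000)"
echo "PEAK_RPS for 5000 HTTP req/s: $(peak_rps_for 5000)"
```

If the k6 scenario changes how many endpoints each iteration hits, pass the new count as the second argument instead of relying on the default.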
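2C / 2G container load-test conclusion from 2026-05-19: for public gallery lists, `limit_conn=320` together with `GENARRATIVE_API_GALLERY_MAX_CONCURRENT_REQUESTS=320` is currently the most stable ceiling. With host-side k6 targeting `http://127.0.0.1:18080`, `PEAK_RPS=1000` is equivalent to a two-endpoint mixed load of about 2000 HTTP req/s. At the 320 setting there were no dropped iterations, no 5xx, and no OOM, with `151710` 200 responses and `34310` 429 responses; the 200 responses had `request_time p95=0.292s`. Raising the limit to 336 / 352 does not effectively saturate api-server CPU. Instead, the 200 count falls, p95 rises to about 0.31s / 0.32s, and the SpacetimeDB memory tail approaches `880MiB / 896MiB`, so downstream memory reaches the danger zone first. Do not raise public list concurrency further just to reduce "idle CPU"; the next step should be to reduce the SpacetimeDB tracking writes that follow successful list requests, or to optimize downstream state, rather than to widen ingress concurrency.
### Memory sampling
When investigating API container memory, first compare `/proc/$pid/smaps_rollup` and the cgroup current/peak values before and after the load test; do not treat the total usage shown in Windows Task Manager as a per-process conclusion: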
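To compare load-shedding levels between runs, it can help to express the 429 count as a share of all answered requests. The sketch below does that for the 320 setting, using the `151710` / `34310` counts reported above; the function name `shed_percent` is hypothetical, and it assumes `awk` is available on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage of answered requests that were shed with 429.
# Args: count of 200 responses, count of 429 responses.
shed_percent() {
  local ok=$1 shed=$2
  awk -v ok="$ok" -v shed="$shed" \
    'BEGIN { printf "%.1f\n", 100 * shed / (ok + shed) }'
}

# Counts from the 320 setting of the 2026-05-19 run.
echo "shed at 320: $(shed_percent 151710 34310)%"
```

At the 320 setting this comes to roughly 18% of requests being shed. The counts for the 336 / 352 settings are not given in this section, so they are not computed here.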
```bash
docker exec genarrative-container-loadtest-api-server-1 sh -c 'pid=$(pidof api-server); grep VmRSS /proc/$pid/status; grep RssAnon /proc/$pid/status; cat /proc/$pid/smaps_rollup | grep Anonymous; echo cgroup_current=$(cat /sys/fs/cgroup/memory.current); echo cgroup_peak=$(cat /sys/fs/cgroup/memory.peak)'
```
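Memory spikes that also reproduce on `/healthz` should first be investigated at the connection layer, service clone, or allocator high-water mark, and should not be attributed directly to SpacetimeDB procedures, the works list cache, or business DTOs. Verified on 2026-05-18: after `AppState` was changed to a shallow-copied `Arc<AppStateInner>`, an in-container load test directly against `api-server:8082/healthz` at 500 HTTP req/s with `PREALLOCATED_VUS=100` for 30 seconds completed `15001` requests with `http_req_failed=0` and `dropped_iterations=0`. API process RSS rose from about 18 MiB to about 52 MiB, the cgroup peak was about 47 MiB, and the 1 GiB-scale spike did not recur.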
## OTLP

@@ -1,4 +1,4 @@
FROM rust:1.88-bookworm AS rust-builder
FROM rust:1.93-bookworm AS rust-builder
WORKDIR /workspace
COPY server-rs ./server-rs
@@ -36,6 +36,7 @@ COPY apps/admin-web/package.json ./apps/admin-web/package.json
RUN npm ci
COPY index.html metadata.json tsconfig.json vite.config.ts ./
COPY scripts/vite-cli.mjs scripts/admin-web-build.mjs ./scripts/
COPY src ./src
COPY public ./public
COPY media ./media

@@ -7,8 +7,11 @@ GENARRATIVE_API_HOST=0.0.0.0
GENARRATIVE_API_PORT=8082
GENARRATIVE_API_LOG=info,tower_http=info
GENARRATIVE_API_LISTEN_BACKLOG=1024
GENARRATIVE_API_WORKER_THREADS=2
GENARRATIVE_API_WORKER_THREADS=4
GENARRATIVE_API_MAX_CONCURRENT_REQUESTS=512
GENARRATIVE_API_GALLERY_MAX_CONCURRENT_REQUESTS=320
GENARRATIVE_API_DETAIL_MAX_CONCURRENT_REQUESTS=64
GENARRATIVE_API_ADMIN_MAX_CONCURRENT_REQUESTS=16
GENARRATIVE_OTEL_ENABLED=false
OTEL_SERVICE_NAME=genarrative-api

@@ -3,6 +3,7 @@ name: genarrative-container-loadtest
services:
spacetimedb:
image: clockworklabs/spacetime:v2.2.0
user: root
command:
[
"start",
@@ -11,11 +12,11 @@ services:
"--data-dir",
"/var/lib/spacetimedb",
"--page_pool_max_size",
"536870912",
"402653184",
"--non-interactive",
]
cpus: "1.0"
mem_limit: 768m
mem_limit: 896m
ports:
- "${GENARRATIVE_CONTAINER_SPACETIME_PORT:-13101}:3101"
volumes:
@@ -73,7 +74,7 @@ services:
context: ../..
dockerfile: deploy/container/api-server.Dockerfile
target: nginx-runtime
cpus: "0.25"
cpus: "0.5"
mem_limit: 128m
depends_on:
api-server:
@@ -111,7 +112,7 @@ services:
k6:
image: grafana/k6:0.52.0
profiles: ["loadtest"]
cpus: "0.5"
cpus: "1.0"
mem_limit: 512m
depends_on:
nginx:

@@ -21,6 +21,9 @@ http {
}
limit_conn_zone $binary_remote_addr zone=genarrative_api_conn:10m;
limit_req_zone $binary_remote_addr zone=genarrative_gallery_rps:10m rate=2400r/s;
limit_req_zone $binary_remote_addr zone=genarrative_api_rps:10m rate=300r/s;
limit_req_zone $binary_remote_addr zone=genarrative_admin_rps:10m rate=30r/s;
sendfile on;
keepalive_timeout 65;
@@ -48,6 +51,8 @@ http {
error_log /var/log/nginx/genarrative.error.log warn;
limit_conn_status 429;
limit_conn_log_level warn;
limit_req_status 429;
limit_req_log_level warn;
root /srv/genarrative/web;
index index.html;
@@ -55,6 +60,7 @@ http {
location ^~ /admin/api/ {
default_type application/json;
limit_conn genarrative_api_conn 64;
limit_req zone=genarrative_admin_rps burst=16 nodelay;
proxy_pass http://genarrative_api/admin/api/;
proxy_http_version 1.1;
@@ -82,9 +88,90 @@ http {
try_files $uri =404;
}
location = /api/runtime/puzzle/gallery {
default_type application/json;
limit_conn genarrative_api_conn 320;
limit_req zone=genarrative_gallery_rps burst=256 nodelay;
proxy_pass http://genarrative_api;
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
add_header X-Accel-Buffering no always;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Request-Id $request_id;
}
location = /api/runtime/custom-world-gallery {
default_type application/json;
limit_conn genarrative_api_conn 320;
limit_req zone=genarrative_gallery_rps burst=256 nodelay;
proxy_pass http://genarrative_api;
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
add_header X-Accel-Buffering no always;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Request-Id $request_id;
}
location ~ ^/api/runtime/puzzle/gallery/[^/]+$ {
default_type application/json;
limit_conn genarrative_api_conn 32;
limit_req zone=genarrative_api_rps burst=32 nodelay;
proxy_pass http://genarrative_api;
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
add_header X-Accel-Buffering no always;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Request-Id $request_id;
}
location ~ ^/api/runtime/custom-world-gallery/[^/]+/[^/]+$ {
default_type application/json;
limit_conn genarrative_api_conn 32;
limit_req zone=genarrative_api_rps burst=32 nodelay;
proxy_pass http://genarrative_api;
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 3600s;
proxy_send_timeout 3600s;
add_header X-Accel-Buffering no always;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Forwarded-Host $host;
proxy_set_header X-Request-Id $request_id;
}
location ~ ^/api(?:/|$) {
default_type application/json;
limit_conn genarrative_api_conn 64;
limit_req zone=genarrative_api_rps burst=64 nodelay;
proxy_pass http://genarrative_api;
proxy_http_version 1.1;