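Improve dynamic scaling of external generation workers

Add the external generation controller process role and its systemd service.
Add the queue statistics procedure and the spacetime-client bindings.
Update the worker/controller conventions in the production deployment script, health inspection, and server provisioning.
Add a container worker smoke script and sync the operations docs and team memory.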
@@ -97,6 +97,64 @@ npm run container:up -- --scale external-generation-worker=1 external-generation

Dynamic scaling verification requires `GENARRATIVE_EXTERNAL_GENERATION_MODE=queue`. In `inline` mode, generation requests are executed synchronously by `api-server` and are never consumed by these worker instances.

### External Generation Worker Isolated Smoke Test

To verify worker mode in isolation on the local machine without reusing `deploy/container/api-server.env`, use the dedicated script:

```bash
npm run container:worker-smoke -- smoke
```
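
The script generates a gitignored `deploy/container/worker-smoke/api-server.env` plus port state, and uses a separate compose project, a separate SpacetimeDB data volume, and separate host ports to run `build -> up-spacetime -> publish -> up -> enqueue -> api-update -> enqueue`. The test job uses the `worker_smoke_unsupported` type and does not touch the real VectorEngine, LLM, or OSS. The expected result is that a worker claims the queued job and then takes the failure branch for an unsupported job type, which verifies queue claim, lease, the failure write-back path, and API / worker process isolation. `external_generation_job` is a private table, so the script confirms consumption from the job_id and the unsupported record in the worker logs rather than bypassing permissions with CLI SQL. By default `smoke` starts only `api-server` and `external-generation-worker`, which avoids building unrelated frontend / Nginx images; to verify Nginx as well, run `up --with-nginx` as a separate step.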
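
The log-based confirmation described above can be approximated by hand. The following is a minimal sketch, not the smoke script's actual check: the helper name `job_failed_as_unsupported` is hypothetical, and the worker log format is not shown in this document, so matching on the job id plus the word "unsupported" is an assumption.

```bash
# Hypothetical helper: read log text on stdin and succeed only if at least
# one line mentions both the given job id and the word "unsupported".
# The real worker log format is an assumption here.
job_failed_as_unsupported() {
  local job_id="$1"
  awk -v id="$job_id" '
    index($0, id) && index(tolower($0), "unsupported") { found = 1 }
    END { exit !found }
  '
}

# Demonstration against canned text (not real worker output).
sample_log='INFO claimed job_id=job-123
WARN job_id=job-123 failed: unsupported job type worker_smoke_unsupported'

if printf '%s\n' "$sample_log" | job_failed_as_unsupported "job-123"; then
  echo "consumed"
else
  echo "not consumed"
fi
```

Against a running smoke environment, the same helper could be fed from `npm run container:worker-smoke -- logs external-generation-worker`, with the job id taken from the `enqueue` step.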

For step-by-step troubleshooting, run:

```bash
npm run container:worker-smoke -- init --force
npm run container:worker-smoke -- build
npm run container:worker-smoke -- up-spacetime
npm run container:worker-smoke -- publish
npm run container:worker-smoke -- up
npm run container:worker-smoke -- enqueue before-update
npm run container:worker-smoke -- api-update
npm run container:worker-smoke -- enqueue after-update
npm run container:worker-smoke -- status
```

To reset the isolated ports or database data:

```bash
npm run container:worker-smoke -- smoke --force
```
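
By default, `container:worker-smoke` packages the local `spacetime` 2.4.1 CLI into a lightweight SpacetimeDB image, so the first smoke run does not have to pull the large official image; regular `npm run container:*` load tests still default to `clockworklabs/spacetime:v2.4.1`. If fetching crates.io dependencies inside the container is unreliable during the Docker build stage, the in-container Cargo can reuse the local Cargo cache to build the current binary, which is then copied into a temporary smoke image. By default this mode uses `rust:1.93-bookworm` as the builder and a Debian bookworm smoke runtime to host the build output. Set `GENARRATIVE_WORKER_SMOKE_CARGO_IMAGE` to change the builder image and `GENARRATIVE_WORKER_SMOKE_LOCAL_BASE_IMAGE` to change the runtime base image:

```bash
npm run container:worker-smoke -- smoke --local-binary
```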

`api-update` only runs `--force-recreate api-server` and checks that the `external-generation-worker` container ID is unchanged. To rebuild the API image at the same time, use:

```bash
npm run container:worker-smoke -- api-update --build
```
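
The container ID check that `api-update` performs can be reproduced manually. This is a rough sketch under stated assumptions: the helper `same_container_ids` is hypothetical, and the commented `docker compose` commands assume Docker Compose v2 run against the right project; the smoke script's own project and file flags are not shown in this document.

```bash
# Hypothetical helper: compare two newline-separated container ID lists,
# ignoring order. Returns 0 when both lists are identical.
same_container_ids() {
  local before after
  before="$(printf '%s\n' "$1" | sort)"
  after="$(printf '%s\n' "$2" | sort)"
  [ "$before" = "$after" ]
}

# Intended use (assumed commands, adjust project flags as needed):
#   before="$(docker compose ps -q external-generation-worker)"
#   docker compose up -d --force-recreate --no-deps api-server
#   after="$(docker compose ps -q external-generation-worker)"
#   same_container_ids "$before" "$after" || echo "worker was recreated" >&2

# Self-contained demonstration with made-up IDs.
if same_container_ids "aaa111
bbb222" "bbb222
aaa111"; then
  echo "unchanged"
fi
```

Sorting both lists first keeps the comparison stable when several worker replicas are running and `ps -q` returns them in a different order.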

Verify dynamic worker scaling:

```bash
npm run container:worker-smoke -- scale 3
npm run container:worker-smoke -- ps
npm run container:worker-smoke -- enqueue scaled-workers
npm run container:worker-smoke -- scale 1
```

View logs or clean up the isolated environment:

```bash
npm run container:worker-smoke -- logs external-generation-worker
npm run container:worker-smoke -- down -v
```

Stop:

```bash
@@ -2,7 +2,7 @@ name: genarrative-container-loadtest

 services:
   spacetimedb:
-    image: clockworklabs/spacetime:v2.4.1
+    image: ${GENARRATIVE_CONTAINER_SPACETIME_IMAGE:-clockworklabs/spacetime:v2.4.1}
     user: root
     command:
       [
@@ -44,7 +44,7 @@ services:
     cpus: "2.0"
     mem_limit: 1g
     env_file:
-      - ./api-server.env
+      - ${GENARRATIVE_CONTAINER_API_ENV_FILE:-./api-server.env}
     environment:
       GENARRATIVE_API_HOST: 0.0.0.0
       GENARRATIVE_API_PORT: 8082
@@ -77,7 +77,7 @@ services:
     cpus: "2.0"
     mem_limit: 1g
     env_file:
-      - ./api-server.env
+      - ${GENARRATIVE_CONTAINER_API_ENV_FILE:-./api-server.env}
     environment:
       GENARRATIVE_PROCESS_ROLE: external-generation-worker
       GENARRATIVE_TRACKING_OUTBOX_DIR: /var/lib/genarrative/tracking-outbox-worker

deploy/env/external-generation-controller.env.example (vendored, normal file, 13 lines changed)
@@ -0,0 +1,13 @@
# Copy to /etc/genarrative/external-generation-controller.env, then adjust for the machine's capacity.
# The controller only manages systemd worker instances; SpacetimeDB settings and external provider keys still come from api-server.env.
# The systemd unit forcibly sets GENARRATIVE_PROCESS_ROLE=external-generation-controller.

GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_MIN_WORKERS=1
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_MAX_WORKERS=8
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_TARGET_JOBS_PER_WORKER=2
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_POLL_INTERVAL_MS=10000
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_SCALE_DOWN_IDLE_ROUNDS=6
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_SERVICE_TEMPLATE=genarrative-external-generation-worker@{}.service
GENARRATIVE_EXTERNAL_GENERATION_CONTROLLER_DRY_RUN=false
GENARRATIVE_API_LOG=info,tower_http=info
OTEL_SERVICE_NAME=genarrative-external-generation-controller
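
To make the settings above concrete, here is an illustrative sketch of one plausible scaling rule. It is not the controller's actual code, which is not part of this diff. The rule assumed here is: desired workers = ceil(pending jobs / `TARGET_JOBS_PER_WORKER`), clamped to `[MIN_WORKERS, MAX_WORKERS]`. The function names, the use of a pending-job count as input, and 1-based instance numbering for the `{}` placeholder are all assumptions.

```bash
# Assumed rule: ceil(pending / per_worker), clamped to [min, max].
desired_workers() {
  local pending="$1" min="$2" max="$3" per_worker="$4"
  local desired=$(( (pending + per_worker - 1) / per_worker ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# Expand the "{}" placeholder of a service template for instances 1..count.
# Numbering from 1 is an assumption, not confirmed by the source.
worker_units() {
  local template="$1" count="$2" i=1
  while [ "$i" -le "$count" ]; do
    printf '%s\n' "$template" | sed "s/{}/$i/"
    i=$((i + 1))
  done
}

# 5 pending jobs with the example defaults (min 1, max 8, 2 jobs per worker).
n="$(desired_workers 5 1 8 2)"
worker_units 'genarrative-external-generation-worker@{}.service' "$n"
```

Under this rule an empty queue still keeps `MIN_WORKERS` instances. The sketch does not model `SCALE_DOWN_IDLE_ROUNDS`, which suggests the real controller waits for several idle polls before scaling down.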
@@ -0,0 +1,27 @@
[Unit]
Description=Genarrative External Generation Worker Controller
After=network-online.target spacetimedb.service
Wants=network-online.target
Requires=spacetimedb.service

[Service]
Type=simple
WorkingDirectory=/opt/genarrative/current
EnvironmentFile=/etc/genarrative/api-server.env
EnvironmentFile=-/etc/genarrative/external-generation-controller.env
ExecStart=/usr/bin/env GENARRATIVE_PROCESS_ROLE=external-generation-controller GENARRATIVE_TRACKING_OUTBOX_DIR=/var/lib/genarrative/tracking-outbox/controller OTEL_SERVICE_NAME=genarrative-external-generation-controller /opt/genarrative/current/api-server
Restart=always
RestartSec=5
KillSignal=SIGINT
TimeoutStopSec=120
LimitNOFILE=65535
TasksMax=512

# The controller must call systemctl to manage worker@N instances, so it does not drop privileges to the genarrative user.
# It only reuses the api-server release package and SpacetimeDB configuration; it does not run external generation jobs itself.
PrivateTmp=true
ProtectSystem=full
ReadWritePaths=/opt/genarrative /var/lib/genarrative

[Install]
WantedBy=multi-user.target