docs: ignore local load test artifacts

2026-05-12 14:43:57 +08:00
parent ed7a6f48d0
commit cf074837a4
2 changed files with 211 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -35,3 +35,8 @@ temp*build*/
 # Local load-test data extracted from private migration files
 scripts/loadtest/data/*.local.json
 # Local load-test run artifacts
 scripts/loadtest/data/k6-*.log
 scripts/loadtest/data/k6-*summary*.md
 scripts/loadtest/data/latest-*-prefix.txt
--- a/.hermes/plans/2026-05-12_0616-remote-works-list-loadtest-diagnosis.md
+++ b/.hermes/plans/2026-05-12_0616-remote-works-list-loadtest-diagnosis.md
@@ -0,0 +1,206 @@
 # 远端作品列表压测排查报告
 时间：2026-05-12 06:16 CST
 目标：`http://82.157.175.59`
 SSH：远端生产机 root 账号（具体私钥路径仅保留在本机环境，不写入仓库）
 ## 背景
 远端 `k6-works-list.js` 压测中：
 - smoke 通过。
 - baseline 10 VU：无 HTTP 错误，但 p95/p99 超阈值。
 - 50 RPS spike：`http_req_failed` / `works_list_shape_error_rate` 约 21.99%。
 - 100 RPS spike：`http_req_failed` / `works_list_shape_error_rate` 约 25.47%。
 - 从 k6 check 看，失败主要集中在 `puzzle_gallery_list`，`custom_world_gallery_list` 基本正常。
 ## 已完成排查
 ### 1. 服务器进程与资源
 远端服务监听：
 - Rust api-server：`127.0.0.1:8082`，systemd 服务 `genarrative-api.service`。
 - SpacetimeDB：`127.0.0.1:3101`，systemd 服务 `spacetimedb.service`。
 - Nginx：公网 80 反代 `/api/*` 到 `127.0.0.1:8082`。
 服务器规格/状态：
 - 2 vCPU。
 - 内存约 1.9GiB。
 - Swap 约 1.9GiB，已有约 600MiB 使用。
 - `/` 磁盘约 69%。
 - Rust api-server 当前 CPU 不高。
 - SpacetimeDB 当前 CPU 不高。
 发现一个独立异常：
 - PM2 下旧 `server-node` 进程 `genarrative` 正在重启风暴。
 - cwd：`/work/Genarrative/server-node`
 - 错误：连接 `127.0.0.1:5432` PostgreSQL 被拒绝。
 - PM2 restart 次数已超过 33 万。
 - 该进程不是当前公网 `/api/*` 使用的 Rust api-server，但会制造额外 CPU/内存/日志抖动。
 ### 2. 压测窗口服务端日志
 子任务聚合了 2026-05-12 04:50-05:05 的 nginx 与 api-server 日志。
 nginx access：
 - `/api/runtime/puzzle/gallery`：4661 次，全部 200。
 - `/api/runtime/custom-world-gallery`：4659 次，全部 200。
 api-server journal：
 `/api/runtime/puzzle/gallery`：
 - completed：4661
 - status：200 全部
 - slow_request：0
 - latency_ms：min 13 / p50 30 / p90 43 / p95 50 / p99 62 / max 88
 `/api/runtime/custom-world-gallery`：
 - completed：4659
 - status：200 全部
 - slow_request：0
 - latency_ms：min 0 / p50 1 / p90 5 / p95 7 / p99 13 / max 49
 结论：
 - 在服务端视角，两个接口在该窗口都没有 5xx，也没有慢请求。
 - 这与 k6 客户端侧 30s timeout / failed check 存在明显不一致。
 - 需要进一步区分：客户端侧网络/连接耗尽/本机 k6 执行环境问题，还是 k6 统计混合/响应解析问题。
 ### 3. k6 脚本行为
 文件：`scripts/loadtest/k6-works-list.js`
 无 `AUTH_TOKEN` 时，每轮 iteration 顺序请求两个接口：
 1. `GET /api/runtime/puzzle/gallery`
 2. `GET /api/runtime/custom-world-gallery`
 `DETAIL_RATIO=0` 时不会请求详情。
 `works_list_shape_error_rate` 不只代表字段结构错误，只要下面任意 check 失败都会计入：
 - status is 200
 - returns json object
 - has collection
 - list item shape
 因此 timeout、非 JSON、非 200、响应结构不符合都会表现为 shape error。
 数据文件实际路径：
 - `scripts/loadtest/data/works-list.local.json`
 脚本里 `data/works-list.local.json` 是相对 k6 脚本文件解析的，因此本身合理。
 ### 4. 代码层疑似瓶颈
 虽然这次远端服务端日志没有复现慢请求，但代码层仍发现一个真实性能隐患。
 `/api/runtime/puzzle/gallery` 调用链：
 - `server-rs/crates/api-server/src/app.rs:1192`
 - `server-rs/crates/api-server/src/puzzle.rs:1385-1409`
 - `server-rs/crates/spacetime-client/src/puzzle.rs:367-381`
 - `server-rs/crates/spacetime-module/src/puzzle.rs:430-443`
 - `server-rs/crates/spacetime-module/src/puzzle.rs:1393-1404`
 关键实现：
 - `list_puzzle_gallery_tx` 对 `puzzle_work_profile().iter()` 全表扫描。
 - 再过滤 `publication_status == Published`。
 - 对每个公开作品调用 `build_puzzle_work_profile_from_row_with_recent_count`。
 - 该函数调用 `count_recent_public_work_plays(ctx, "puzzle", &row.profile_id, now_micros)`。
 `count_recent_public_work_plays`：
 - 文件：`server-rs/crates/spacetime-module/src/runtime/profile.rs:1296-1321`
 - 当前实现对 `public_work_play_daily_stat().iter()` 全表扫描过滤。
 - 但表定义已有复合索引：
  - `server-rs/crates/spacetime-module/src/runtime/profile.rs:242-248`
  - `by_public_work_play_daily_stat_work_day(source_type, profile_id, played_day)`
 - 当前统计函数未使用该索引。
 复杂度风险：
 ```text
 puzzle gallery ~= O(puzzle_work_profile 全表扫描 + Published作品数 * public_work_play_daily_stat 全表扫描)
 ```
 `custom-world-gallery` 与 puzzle 的差异：
 - custom-world 使用 `CustomWorldGalleryEntry` 公开读模型表。
 - puzzle 直接从 `puzzle_work_profile` 即席拼装。
 - 两者都调用 recent count，但 puzzle 更容易受作品表规模和统计表规模影响。
 ## 当前判断
 本次排查有两个层面的结论：
 1. 生产服务端日志没有证明 `puzzle/gallery` 在 04:50-05:05 窗口真的 30s 慢或 5xx。
   - api-server 记录的 p95 只有 50ms。
   - nginx 看到两个接口都是 200。
   - 所以 k6 侧的 30s timeout 需要进一步从客户端网络、连接池、Windows/k6 执行环境、summary 混合统计角度验证。
 2. 代码层确实存在可修的性能隐患。
   - `count_recent_public_work_plays` 未使用已有索引。
   - puzzle gallery 对每个作品重复做 recent count。
   - puzzle gallery 未使用 `publication_status` 索引或读模型。
 ## 建议下一步
 ### A. 先处理服务器 PM2 重启风暴
 建议确认旧 Node 服务是否仍需要。
 如果不需要，应停止并禁用 PM2 中的旧 `server-node`：
 ```bash
 PM2_HOME=/home/ubuntu/.pm2 pm2 stop genarrative
 PM2_HOME=/home/ubuntu/.pm2 pm2 delete genarrative
 PM2_HOME=/home/ubuntu/.pm2 pm2 save
 ```
 这是生产侧操作，执行前需要确认。
 ### B. 单接口短压验证客户端/服务端不一致
 不要继续用混合脚本大压。
 建议新增或临时使用单接口 k6 脚本，分别只测：
 - `/api/runtime/puzzle/gallery`
 - `/api/runtime/custom-world-gallery`
 并在同一时间窗口并行采集：
 - k6 客户端 summary
 - nginx access 请求数/状态码
 - api-server journal latency
 - 本机到服务器网络错误/timeout
 目标是确认 timeout 是不是发生在客户端侧连接/网络，而不是服务端处理慢。
 ### C. 修复代码性能隐患
 优先级建议：
 1. `count_recent_public_work_plays` 改为使用 `by_public_work_play_daily_stat_work_day` 复合索引，或至少改成批量统计，避免 N 次全表扫描。
 2. `list_puzzle_gallery_tx` 使用 `by_puzzle_work_publication_status` 索引查询 Published，或参考 custom-world 建立 `puzzle_gallery_entry` 公开读模型。
 3. gallery 列表页不要实时逐条扫描统计表，可维护读模型或批量聚合 `recent_play_count_7d`。
 ### D. 调整 k6 脚本输出
 建议 k6 summary 按 endpoint tag 输出或新增单接口模式，否则 overall 指标会把 puzzle/custom-world 混在一起。
 建议增加：
 - `ENDPOINT=puzzle_gallery_list`
 - `ENDPOINT=custom_world_gallery_list`
 让脚本只跑一个 endpoint，避免诊断时混淆。