diff --git a/.hermes/shared-memory/pitfalls.md b/.hermes/shared-memory/pitfalls.md index 1b904428..2f0d3fd5 100644 --- a/.hermes/shared-memory/pitfalls.md +++ b/.hermes/shared-memory/pitfalls.md @@ -15,6 +15,22 @@ - 关联:相关文件、文档、提交或 Issue ``` +## 生产冷备份后 API 不能只依赖 SpacetimeDB 自恢复 + +- 现象:release 机器 `03:20` 冷备份后,`spacetimedb.service` 已恢复,但作品列表、创作入口配置或公开 gallery 继续超时 / 502 / 504,`genarrative-api.service` 保持 stopped。 +- 原因:`genarrative-api.service` 配置了 `Requires=spacetimedb.service`,冷备份停止 `spacetimedb.service` 时 API 会被 systemd 依赖关系一并停止;如果 `genarrative-database-backup.service` 只传 `--stop-service spacetimedb.service` 而漏掉 `--restart-service-after genarrative-api.service`,备份脚本只会恢复数据库,不会再拉起 API。 +- 处理:生产冷备份 unit 和发布脚本必须带 `--restart-service-after genarrative-api.service`;仓库用 `npm run check:production-ops` 检查 systemd 模板、API build/deploy 归档和健康巡检链路。现场修复后执行 `systemctl daemon-reload`,但不要为了验证而手动触发冷备份。 +- 验证:`systemctl cat genarrative-database-backup.service` 应包含该参数;`systemctl is-active spacetimedb.service genarrative-api.service nginx.service` 全为 `active`;`curl -fsS http://127.0.0.1:3101/v1/ping`、`/healthz`、`/readyz` 和代表性 `/api/runtime/puzzle/gallery` 均成功。 +- 关联:`deploy/systemd/genarrative-database-backup.service`、`scripts/database-backup-to-oss.mjs`、`scripts/ops/production-health-patrol.mjs`、`docs/【开发运维】本地开发验证与生产运维-2026-05-15.md`。 + +## SpacetimeDB 45 秒超时要看 api-server 记录的阶段 + +- 现象:release 上 Nginx 能立刻连到 `api-server`,但 `/api/runtime/*/gallery`、`/api/creation-entry/config` 等请求在约 `GENARRATIVE_SPACETIME_PROCEDURE_TIMEOUT_SECONDS` 后返回 `502` / `504`。 +- 原因:旧日志只能看到 HTTP 总耗时和最终状态,无法区分卡在连接池、SDK 建连、等待 `on_connect`、订阅 read model、等待 procedure / reducer 回调还是本地订阅 cache 读取。 +- 处理:`spacetime-client` 内置阶段化健康检查和失败日志;`/readyz` 用 `GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS` 短窗口检查 SpacetimeDB 连接租约,业务失败日志包含 `operation_kind`、`operation_name`、`spacetime_stage`、`elapsed_ms`。 +- 验证:`/readyz` 失败时看 `details.spacetime.stage`;业务请求超时时查 `journalctl -u genarrative-api.service` 中同一时间窗口的 `SpacetimeDB client operation failed`,优先按 `pool_acquire`、`connect_build`、`connect_handshake`、`read_model_subscribe`、`procedure_result`、`reducer_result`、`read_cache` 分阶段处理。 +- 关联:`server-rs/crates/spacetime-client/src/lib.rs`、`server-rs/crates/api-server/src/health.rs`、`docs/【开发运维】本地开发验证与生产运维-2026-05-15.md`。 + ## 新建草稿扣费不能和入口卡泥点配置分离 - 现象:后台修改创作入口的 `mudPointCost` 后,入口卡和前置余额提示可能显示新数值,但用户真实钱包流水仍按代码常量扣除。 diff --git a/deploy/systemd/genarrative-health-patrol.service b/deploy/systemd/genarrative-health-patrol.service new file mode 100644 index 00000000..2ef7d26c --- /dev/null +++ b/deploy/systemd/genarrative-health-patrol.service @@ -0,0 +1,20 @@ +[Unit] +Description=Genarrative Production Health Patrol +After=network-online.target genarrative-api.service spacetimedb.service nginx.service +Wants=network-online.target +ConditionPathExists=/opt/genarrative/current/scripts/ops/production-health-patrol.mjs + +[Service] +Type=oneshot +User=root +Group=root +WorkingDirectory=/opt/genarrative/current +EnvironmentFile=-/etc/genarrative/health-patrol.env +ExecStart=/usr/bin/node /opt/genarrative/current/scripts/ops/production-health-patrol.mjs --status-file /var/lib/genarrative/health-patrol/status.json +TimeoutStartSec=30 + +# 巡检只读 systemd、HTTP 和 journal;只允许写入自己的最近一次状态文件。 +NoNewPrivileges=true +PrivateTmp=true +ProtectSystem=full +ReadWritePaths=/var/lib/genarrative/health-patrol diff --git a/deploy/systemd/genarrative-health-patrol.timer b/deploy/systemd/genarrative-health-patrol.timer new file mode 100644 index 00000000..e29bfdc2 --- /dev/null +++ b/deploy/systemd/genarrative-health-patrol.timer @@ -0,0 +1,13 @@ +[Unit] +Description=Run Genarrative Production Health Patrol + +[Timer] +OnBootSec=2min +OnCalendar=*-*-* *:0/5:00 +Persistent=true +RandomizedDelaySec=30 +AccuracySec=30s +Unit=genarrative-health-patrol.service + +[Install] +WantedBy=timers.target diff --git a/docs/【开发运维】本地开发验证与生产运维-2026-05-15.md b/docs/【开发运维】本地开发验证与生产运维-2026-05-15.md index 4797b4d4..d2b630ac 100644 --- a/docs/【开发运维】本地开发验证与生产运维-2026-05-15.md +++ b/docs/【开发运维】本地开发验证与生产运维-2026-05-15.md @@ -96,6 +96,7 @@ npm run admin-web:typecheck ```bash npm run check:encoding npm run check:spacetime-schema +npm run check:production-ops npm run check:server-rs-ddd npm run lint:eslint npm run typecheck @@ -217,7 +218,7 @@ UI 相关修改要重点验证: npm run database:backup:oss -- --data-dir /stdb --stop-service spacetimedb.service --restart-service-after genarrative-api.service ``` -脚本会将数据目录打包成 `tar.gz`,上传到 `oss://///-.tar.gz`。生产建议做冷备份:传入 `--stop-service spacetimedb.service`,脚本会在打包前停止服务、打包后恢复服务,再上传 OSS;因 `genarrative-api.service` 依赖 `spacetimedb.service`,生产定时冷备份还必须传入 `--restart-service-after genarrative-api.service`,确保备份后 API 随数据库一起恢复。由于 OSS 上传可能受服务器带宽限制,`Genarrative-Stdb-Module-Publish` 默认使用 `DATABASE_BACKUP_MODE=async`:先在 publish 前用 `--defer-upload` 生成本地冷备份和 `.manifest.json`,随后继续执行 publish;发布脚本退出前会用后台 `node ... --upload-archive ` 上传同一份发布前备份,不等待上传完成。发布脚本在校验 wasm 后、执行 `spacetime publish` 前会等待显式 `SPACETIME_SERVER_URL` 的 `/v1/ping` 就绪,默认最多等待 `60` 秒;如生产机器冷备份恢复 `spacetimedb.service` 较慢,可临时设置 `GENARRATIVE_STDB_PUBLISH_READY_TIMEOUT_SECONDS` 调整等待时间。需要强一致发布闸门时改用 `DATABASE_BACKUP_MODE=sync`(等价脚本参数 `--backup-mode sync`),备份会在 publish 前同步打包并上传,失败会阻断 publish;确认已有其他备份窗口时才使用 `DATABASE_BACKUP_MODE=skip`(兼容脚本参数 `--skip-backup`)。若业务不能接受停机窗口,应先规划 SpacetimeDB 原生快照或主备策略,不要直接在写入中的数据目录上做热拷贝并当作强一致备份。 +脚本会将数据目录打包成 `tar.gz`,上传到 `oss://///-.tar.gz`。生产建议做冷备份:传入 `--stop-service spacetimedb.service`,脚本会在打包前停止服务、打包后恢复服务,再上传 OSS;因 `genarrative-api.service` 依赖 `spacetimedb.service`,生产定时冷备份还必须传入 `--restart-service-after genarrative-api.service`,确保备份后 API 随数据库一起恢复。`2026-06-10` release 故障就是现场 unit 漏掉该参数,`03:20` 冷备份停止 SpacetimeDB 后 API 被依赖关系一并停止,备份脚本只恢复了 SpacetimeDB,API 直到人工重启前都不可用;后续现场变更、provision 模板和 Jenkins 归档都必须通过 `npm run check:production-ops` 防止回退。由于 OSS 上传可能受服务器带宽限制,`Genarrative-Stdb-Module-Publish` 默认使用 `DATABASE_BACKUP_MODE=async`:先在 publish 前用 `--defer-upload` 生成本地冷备份和 `.manifest.json`,随后继续执行 publish;发布脚本退出前会用后台 `node ... --upload-archive ` 上传同一份发布前备份,不等待上传完成。发布脚本在校验 wasm 后、执行 `spacetime publish` 前会等待显式 `SPACETIME_SERVER_URL` 的 `/v1/ping` 就绪,默认最多等待 `60` 秒;如生产机器冷备份恢复 `spacetimedb.service` 较慢,可临时设置 `GENARRATIVE_STDB_PUBLISH_READY_TIMEOUT_SECONDS` 调整等待时间。需要强一致发布闸门时改用 `DATABASE_BACKUP_MODE=sync`(等价脚本参数 `--backup-mode sync`),备份会在 publish 前同步打包并上传,失败会阻断 publish;确认已有其他备份窗口时才使用 `DATABASE_BACKUP_MODE=skip`(兼容脚本参数 `--skip-backup`)。若业务不能接受停机窗口,应先规划 SpacetimeDB 原生快照或主备策略,不要直接在写入中的数据目录上做热拷贝并当作强一致备份。 生产环境变量模板在 `deploy/env/api-server.env.example`: @@ -234,6 +235,17 @@ GENARRATIVE_DATABASE_BACKUP_OSS_ACCESS_KEY_SECRET= `GENARRATIVE_DATABASE_BACKUP_OSS_BUCKET` 为空时会回退 `ALIYUN_OSS_BUCKET`;AccessKey 默认复用 `ALIYUN_OSS_ACCESS_KEY_ID` / `ALIYUN_OSS_ACCESS_KEY_SECRET`,也可用 `GENARRATIVE_DATABASE_BACKUP_OSS_ACCESS_KEY_ID` / `GENARRATIVE_DATABASE_BACKUP_OSS_ACCESS_KEY_SECRET` 为备份 bucket 单独配置最小权限账号。`Genarrative-Server-Provision` 会创建 `/var/lib/genarrative/database-backups` 并归属 `genarrative:genarrative`,同时安装并启用 `genarrative-database-backup.timer`。手动检查定时器:`systemctl list-timers genarrative-database-backup.timer`;手动触发一次:`systemctl start genarrative-database-backup.service`。 +冷备份后必须做一次只读验收,不要只看 `genarrative-database-backup.service` 是否成功退出: + +```bash +systemctl is-active spacetimedb.service genarrative-api.service nginx.service +curl -fsS --max-time 5 http://127.0.0.1:3101/v1/ping +curl -fsS --max-time 5 http://127.0.0.1:8082/healthz +curl -fsS --max-time 5 http://127.0.0.1:8082/readyz +curl -fsS --max-time 5 http://127.0.0.1/api/creation-entry/config >/dev/null +curl -fsS --max-time 5 http://127.0.0.1/api/runtime/puzzle/gallery >/dev/null +``` + ## 生产运维 生产部署当前口径: @@ -244,11 +256,32 @@ Nginx 负责站点和反向代理 Jenkins 按 web / api / Spacetime module / build / deploy / publish 拆分 ``` +### 生产健康巡检 + +`Genarrative-Server-Provision` 会安装并启用 `genarrative-health-patrol.timer`,默认每 5 分钟运行一次 `genarrative-health-patrol.service`。巡检脚本随 API release 归档到 `/opt/genarrative/current/scripts/ops/production-health-patrol.mjs`,只读检查: + +- `genarrative-api.service`、`spacetimedb.service`、`nginx.service` 是否 active。 +- API 直连 `/healthz`、`/readyz`。 +- SpacetimeDB 直连 `/v1/ping`。 +- 默认直连 API 端口检查 `/api/creation-entry/config`、`/api/runtime/puzzle/gallery`、`/api/runtime/custom-world-gallery`;如需走 Nginx / 公网域名,在 `/etc/genarrative/health-patrol.env` 配置 `GENARRATIVE_HEALTH_PATROL_PUBLIC_BASE_URL=https://<域名>`。 +- 最近 15 分钟 `genarrative-api.service`、`spacetimedb.service`、`nginx.service` 的 `err..alert` 日志。 + +巡检输出总状态 `OK / WARNING / CRITICAL`;只有 `CRITICAL` 默认让 systemd service 失败,`WARNING` 只写日志和状态文件,避免历史日志噪声把 timer 长期打成失败。最近一次结果写入 `/var/lib/genarrative/health-patrol/status.json`。手动执行: + +```bash +systemctl start genarrative-health-patrol.service +systemctl status genarrative-health-patrol.service --no-pager +journalctl -u genarrative-health-patrol.service -n 80 --no-pager +cat /var/lib/genarrative/health-patrol/status.json +``` + +如需接外部告警,可在 `/etc/genarrative/health-patrol.env` 配置 `GENARRATIVE_HEALTH_PATROL_WEBHOOK_URL`;脚本只会在 `WARNING` 或 `CRITICAL` 时向该 webhook 发送 JSON。未配置 webhook 时,告警来源是 systemd 失败状态、journal 和状态文件。 + `Genarrative-Web-Build` 的主站构建失败若出现 Rollup 报错 `"xxx" is not exported by "src/services/publicWorkCode.ts"`,优先按前端公开作品号工具缺失处理,而不是排查 Jenkins 节点环境。修复时要让 `publicWorkCode.ts` 的 `buildPublicWorkCode` 与 `isSamePublicWorkCode` 成对导出,并补 `src/services/publicWorkCode.test.ts` 覆盖对应玩法前缀;随后用 `npm run build:production-release -- --component web --name <临时名>` 复现 Jenkins web 构建路径。 `Genarrative-Web-Build` 会把 `build//web.tar.gz`、`web.tar.gz.sha256`、`release-manifest.json` 和 `scripts/deploy/production-web-deploy.sh` 直接归档为 Jenkins 构建产物;`Genarrative-Web-Deploy` 只通过 `copyArtifacts` 从指定上游构建复制这些产物和部署脚本,不再在目标机器 checkout Git,再执行随构建归档的 `scripts/deploy/production-web-deploy.sh`。Web 发布不再读取构建机本地缓存目录,也不再通过 release agent `rsync` 回构建机拉取大包;如果 deploy 找不到 `web.tar.gz`,应先检查上游 Web Build 是否按同一 `BUILD_VERSION` 成功归档产物。 -`Genarrative-Api-Build` 的 Jenkins 归档产物必须包含 `build//api-server`、`api-server.sha256`、`release-manifest.json`、`scripts/database-backup-to-oss.mjs`、`scripts/deploy/production-api-deploy.sh`、`scripts/deploy/maintenance-on.sh` 和 `scripts/deploy/maintenance-off.sh`。`deploy/systemd/genarrative-database-backup.service` 从 `/opt/genarrative/current/scripts/database-backup-to-oss.mjs` 执行冷备份,`Genarrative-Api-Deploy` 会从上游 API 构建产物复制部署脚本和备份脚本,不再在目标机器 checkout Git;如果 API 发布后 current release 中缺少该脚本,应先检查 `Genarrative-Api-Build` 的 `archiveArtifacts` 和 `Genarrative-Api-Deploy` 的 `copyArtifacts` 过滤器是否仍包含 `build//scripts/database-backup-to-oss.mjs`,不要只在部署机工作区手工补文件。 +`Genarrative-Api-Build` 的 Jenkins 归档产物必须包含 `build//api-server`、`api-server.sha256`、`release-manifest.json`、`scripts/database-backup-to-oss.mjs`、`scripts/ops/production-health-patrol.mjs`、`scripts/deploy/production-api-deploy.sh`、`scripts/deploy/maintenance-on.sh` 和 `scripts/deploy/maintenance-off.sh`。`deploy/systemd/genarrative-database-backup.service` 从 `/opt/genarrative/current/scripts/database-backup-to-oss.mjs` 执行冷备份,`deploy/systemd/genarrative-health-patrol.service` 从 `/opt/genarrative/current/scripts/ops/production-health-patrol.mjs` 执行巡检;`Genarrative-Api-Deploy` 会从上游 API 构建产物复制部署脚本、备份脚本和巡检脚本,不再在目标机器 checkout Git。如果 API 发布后 current release 中缺少这些脚本,应先检查 `Genarrative-Api-Build` 的 `archiveArtifacts` 和 `Genarrative-Api-Deploy` 的 `copyArtifacts` 过滤器是否仍包含 `build//scripts/database-backup-to-oss.mjs` 与 `build//scripts/ops/production-health-patrol.mjs`,不要只在部署机工作区手工补文件。 `Genarrative-Stdb-Module-Build` 的 Jenkins 归档产物必须包含 `build//spacetime_module.wasm`、`spacetime_module.wasm.sha256`、`release-manifest.json`、`scripts/deploy/production-stdb-publish.sh`、`scripts/deploy/maintenance-on.sh`、`scripts/deploy/maintenance-off.sh` 和 `scripts/database-backup-to-oss.mjs`。`Genarrative-Stdb-Module-Publish` 只通过 `copyArtifacts` 复制这些产物和发布脚本,不再在目标机器 checkout Git;如果 publish 前备份脚本缺失,应先检查 Stdb Build 的归档列表和 Stdb Publish 的复制过滤器。 @@ -275,7 +308,8 @@ dev 服务器上的 Gitea 内网入口固定为 `http://10.2.0.10/GenarrativeAI/ - `api-server` 生产模板默认 `GENARRATIVE_API_LISTEN_BACKLOG=1024`、`GENARRATIVE_API_WORKER_THREADS=4`;本地未设置 worker threads 时继续使用 Tokio 默认值。 - `GENARRATIVE_API_MAX_CONCURRENT_REQUESTS=512` 开启应用内 HTTP 并发背压;`GENARRATIVE_API_GALLERY_MAX_CONCURRENT_REQUESTS=320`、`GENARRATIVE_API_DETAIL_MAX_CONCURRENT_REQUESTS=64`、`GENARRATIVE_API_ADMIN_MAX_CONCURRENT_REQUESTS=16` 分别限制公开列表、公开详情和后台 API 热路径。超过许可时直接返回 `429 Too Many Requests` 和 `Retry-After: 1`,`/healthz` 与 `/readyz` 不受该限制。这些值不是 RPS 限速;如果压测中 429 上升但内存和 p95 收敛,说明背压正在保护进程。直连 `api-server` 的极高 RPS 压测若出现 `connection refused`,通常已经打到 TCP 监听 / accept 层,应同时检查 backlog、Nginx upstream keepalive 和前置限流。 -- `api-server` 正常运行时 `/healthz` 返回进程存活状态,`/readyz` 返回是否仍接收新流量;收到 `SIGINT` / `SIGTERM` 后会先把 readiness 标记为不可用,再让 Axum 停止接新连接并等待已有 HTTP 请求排空。systemd 仍以 `KillSignal=SIGINT` 停服务,`TimeoutStopSec=90` 作为长请求排空上限。 +- `api-server` 正常运行时 `/healthz` 只返回进程存活状态,`/readyz` 会同时检查进程是否仍接收新流量和 SpacetimeDB 连接租约是否健康;收到 `SIGINT` / `SIGTERM` 后会先把 readiness 标记为不可用,再让 Axum 停止接新连接并等待已有 HTTP 请求排空。systemd 仍以 `KillSignal=SIGINT` 停服务,`TimeoutStopSec=90` 作为长请求排空上限。 +- SpacetimeDB 健康检查默认使用 `GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS=2` 的短等待窗口,和业务 procedure 的 `GENARRATIVE_SPACETIME_PROCEDURE_TIMEOUT_SECONDS` 分开。`/readyz` 失败时 `details.spacetime.stage` 会标出当前卡住阶段:`pool_acquire`、`connect_build`、`connect_handshake`、`read_model_subscribe`、`procedure_result`、`reducer_result` 或 `read_cache`;`elapsedMs` / `timeoutMs` 用于确认是否命中健康检查窗口。业务请求日志也会写入 `operation_kind`、`operation_name`、`spacetime_stage` 和 `elapsed_ms`,后续 45 秒超时不再只靠 Nginx `request_time=45s` 推断。 - `genarrative-api.service` 设置 `LimitNOFILE=65535`、`TasksMax=2048`;上线后用 `systemctl show genarrative-api.service -p LimitNOFILE -p TasksMax -p TimeoutStopUSec` 和 `cat /proc/$(pidof api-server)/limits` 核对。 - Server provision 不再通过 Windows helper 下载,也不再通过 Linux build 节点中转工具包。`Prepare Provision Tools` 在目标 dev / release agent 工作区内先检查 `/usr/local/bin/otelcol-contrib` 与 `${SPACETIME_ROOT}/bin/current`:版本已满足时直接复用目标机现有文件生成 `provision-tools/`,只有缺失或版本不匹配时才使用 `PROVISION_DOWNLOADS_DIR` 里的本地包或从配置的下载源准备 SpacetimeDB `2.4.1` / `otelcol-contrib 0.151.0`;如果目标服务器下载需要代理,在 `PROVISION_DOWNLOAD_PROXY` 配置目标机可访问的 HTTP 代理。 - 除 `Genarrative-Server-Provision` 外,`Genarrative-Stdb-Module-Build`、`Genarrative-Web-Build`、`Genarrative-Api-Build`、`Genarrative-*Deploy`、`Genarrative-Database-Import/Export`、`Genarrative-Full-Build-And-Deploy` 和 `Genarrative-Notify-Email` 的生产流水线现都以 Linux agent 为主,仍按各自 Jenkinsfile 的 checkout 口径执行。Server provision 不使用公网备用 Git 源。 diff --git a/jenkins/Jenkinsfile.production-api-build b/jenkins/Jenkinsfile.production-api-build index d399ed2f..91aa966b 100644 --- a/jenkins/Jenkinsfile.production-api-build +++ b/jenkins/Jenkinsfile.production-api-build @@ -104,7 +104,7 @@ pipeline { stage('Archive') { steps { - archiveArtifacts artifacts: "build/${env.EFFECTIVE_BUILD_VERSION}/api-server,build/${env.EFFECTIVE_BUILD_VERSION}/api-server.sha256,build/${env.EFFECTIVE_BUILD_VERSION}/release-manifest.json,build/${env.EFFECTIVE_BUILD_VERSION}/scripts/database-backup-to-oss.mjs,scripts/deploy/production-api-deploy.sh,scripts/deploy/maintenance-on.sh,scripts/deploy/maintenance-off.sh", fingerprint: true + archiveArtifacts artifacts: "build/${env.EFFECTIVE_BUILD_VERSION}/api-server,build/${env.EFFECTIVE_BUILD_VERSION}/api-server.sha256,build/${env.EFFECTIVE_BUILD_VERSION}/release-manifest.json,build/${env.EFFECTIVE_BUILD_VERSION}/scripts/database-backup-to-oss.mjs,build/${env.EFFECTIVE_BUILD_VERSION}/scripts/ops/production-health-patrol.mjs,scripts/deploy/production-api-deploy.sh,scripts/deploy/maintenance-on.sh,scripts/deploy/maintenance-off.sh", fingerprint: true } } diff --git a/jenkins/Jenkinsfile.production-api-deploy b/jenkins/Jenkinsfile.production-api-deploy index db7809ce..152c181f 100644 --- a/jenkins/Jenkinsfile.production-api-deploy +++ b/jenkins/Jenkinsfile.production-api-deploy @@ -65,7 +65,7 @@ pipeline { copyArtifacts( projectName: params.BUILD_JOB_NAME, selector: specific(params.BUILD_NUMBER_TO_DEPLOY), - filter: "build/${params.BUILD_VERSION}/api-server,build/${params.BUILD_VERSION}/api-server.sha256,build/${params.BUILD_VERSION}/release-manifest.json,build/${params.BUILD_VERSION}/scripts/database-backup-to-oss.mjs,scripts/deploy/production-api-deploy.sh,scripts/deploy/maintenance-on.sh,scripts/deploy/maintenance-off.sh", + filter: "build/${params.BUILD_VERSION}/api-server,build/${params.BUILD_VERSION}/api-server.sha256,build/${params.BUILD_VERSION}/release-manifest.json,build/${params.BUILD_VERSION}/scripts/database-backup-to-oss.mjs,build/${params.BUILD_VERSION}/scripts/ops/production-health-patrol.mjs,scripts/deploy/production-api-deploy.sh,scripts/deploy/maintenance-on.sh,scripts/deploy/maintenance-off.sh", target: '.', fingerprintArtifacts: true ) diff --git a/package.json b/package.json index 82883e4f..b9055ebf 100644 --- a/package.json +++ b/package.json @@ -27,6 +27,7 @@ "clean": "node -e \"require('fs').rmSync('dist', { recursive: true, force: true })\"", "check:encoding": "node scripts/check-encoding.mjs", "check:spacetime-schema": "node scripts/check-spacetime-schema-guard.mjs", + "check:production-ops": "node scripts/check-production-ops-guardrails.mjs", "assets:child-motion-demo": "node scripts/generate-child-motion-demo-assets.mjs", "assets:match3d-style-references": "node scripts/generate-match3d-style-references.mjs", "check:visual-novel-vn11": "node scripts/check-visual-novel-vn11-negative-scan.mjs", @@ -37,7 +38,7 @@ "lint:guardrails": "npm run lint:eslint", "typecheck": "tsc -p tsconfig.typecheck-guardrails.json --noEmit", "typecheck:guardrails": "npm run typecheck", - "lint": "npm run check:encoding && npm run check:spacetime-schema && npm run lint:eslint && npm run typecheck", + "lint": "npm run check:encoding && npm run check:spacetime-schema && npm run check:production-ops && npm run lint:eslint && npm run typecheck", "lint:fix": "eslint . --ext .ts,.tsx,.js,.mjs,.cjs --fix && prettier --write .", "format": "prettier --write .", "format:check": "prettier --check .", diff --git a/scripts/build-production-release.sh b/scripts/build-production-release.sh index 80068924..19924c54 100644 --- a/scripts/build-production-release.sh +++ b/scripts/build-production-release.sh @@ -445,7 +445,7 @@ if [[ "${BUILD_SPACETIME}" -eq 1 ]]; then write_migration_bootstrap_secret_file fi -mkdir -p "${TARGET_DIR}/scripts" "${TARGET_DIR}/deploy" +mkdir -p "${TARGET_DIR}/scripts" "${TARGET_DIR}/scripts/ops" "${TARGET_DIR}/deploy" cp "${SCRIPT_DIR}/deploy/maintenance-on.sh" "${TARGET_DIR}/scripts/maintenance-on.sh" cp "${SCRIPT_DIR}/deploy/maintenance-off.sh" "${TARGET_DIR}/scripts/maintenance-off.sh" cp "${SCRIPT_DIR}/deploy/maintenance-status.sh" "${TARGET_DIR}/scripts/maintenance-status.sh" @@ -466,6 +466,7 @@ copy_required_file "${SCRIPT_DIR}/spacetime-migration-common.mjs" "${TARGET_DIR} copy_required_file "${SCRIPT_DIR}/spacetime-authorize-migration-operator.mjs" "${TARGET_DIR}/scripts/spacetime-authorize-migration-operator.mjs" "数据库迁移授权脚本" copy_required_file "${SCRIPT_DIR}/spacetime-revoke-migration-operator.mjs" "${TARGET_DIR}/scripts/spacetime-revoke-migration-operator.mjs" "数据库迁移撤权脚本" copy_required_file "${SCRIPT_DIR}/database-backup-to-oss.mjs" "${TARGET_DIR}/scripts/database-backup-to-oss.mjs" "数据库 OSS 备份脚本" +copy_required_file "${SCRIPT_DIR}/ops/production-health-patrol.mjs" "${TARGET_DIR}/scripts/ops/production-health-patrol.mjs" "生产健康巡检脚本" copy_required_dir "${REPO_ROOT}/deploy/systemd" "${TARGET_DIR}/deploy/systemd" "systemd 配置" copy_required_dir "${REPO_ROOT}/deploy/nginx" "${TARGET_DIR}/deploy/nginx" "Nginx 配置" @@ -485,7 +486,7 @@ cat >"${TARGET_DIR}/README.md" <&2 @@ -346,6 +348,19 @@ if [[ ! -f "${BACKUP_SCRIPT_SOURCE}" ]]; then fi cp "${BACKUP_SCRIPT_SOURCE}" "${RELEASE_DIR}/scripts/database-backup-to-oss.mjs" chmod 0644 "${RELEASE_DIR}/scripts/database-backup-to-oss.mjs" +if [[ ! -f "${HEALTH_PATROL_SCRIPT_SOURCE}" ]]; then + if [[ -f "${WORKSPACE_HEALTH_PATROL_SCRIPT_SOURCE}" ]]; then + echo "[production-api-deploy] 发布产物缺少 scripts/ops/production-health-patrol.mjs,回退使用部署工作区脚本;请重新触发包含该脚本的 API 构建。" >&2 + HEALTH_PATROL_SCRIPT_SOURCE="${WORKSPACE_HEALTH_PATROL_SCRIPT_SOURCE}" + else + echo "[production-api-deploy] 未找到生产健康巡检脚本,跳过复制;genarrative-health-patrol.service 会因脚本缺失而跳过执行。" >&2 + HEALTH_PATROL_SCRIPT_SOURCE="" + fi +fi +if [[ -n "${HEALTH_PATROL_SCRIPT_SOURCE}" ]]; then + cp "${HEALTH_PATROL_SCRIPT_SOURCE}" "${RELEASE_DIR}/scripts/ops/production-health-patrol.mjs" + chmod 0644 "${RELEASE_DIR}/scripts/ops/production-health-patrol.mjs" +fi if [[ -f "${SOURCE_DIR}/release-manifest.json" ]]; then cp "${SOURCE_DIR}/release-manifest.json" "${RELEASE_DIR}/release-manifest.api-server.json" diff --git a/scripts/jenkins-server-provision.sh b/scripts/jenkins-server-provision.sh index 5d3535ed..9f399e84 100755 --- a/scripts/jenkins-server-provision.sh +++ b/scripts/jenkins-server-provision.sh @@ -732,10 +732,20 @@ render_database_backup_service() { deploy/systemd/genarrative-database-backup.service } +render_health_patrol_service() { + local current_escaped + current_escaped="$(escape_sed_replacement "${CURRENT_LINK}")" + sed \ + -e "s|/opt/genarrative/current|${current_escaped}|g" \ + deploy/systemd/genarrative-health-patrol.service +} + require_path deploy/systemd/spacetimedb.service require_path deploy/systemd/genarrative-api.service require_path deploy/systemd/genarrative-database-backup.service require_path deploy/systemd/genarrative-database-backup.timer +require_path deploy/systemd/genarrative-health-patrol.service +require_path deploy/systemd/genarrative-health-patrol.timer require_path deploy/systemd/otelcol-contrib.service require_path deploy/otelcol/genarrative-debug.yaml require_path deploy/nginx/genarrative.conf @@ -754,7 +764,7 @@ echo "[server-provision] target=${DEPLOY_TARGET}, dry_run=${DRY_RUN}, nginx_conf run_cmd id require_root_for_real_provision install_nginx_brotli_modules -run_cmd mkdir -p "${SPACETIME_ROOT}" "${RELEASE_ROOT}" "$(dirname "${CURRENT_LINK}")" "$(dirname "${WEB_LINK}")" /etc/genarrative /var/lib/genarrative/maintenance /var/lib/genarrative/auth /var/lib/genarrative/tracking-outbox /var/lib/genarrative/database-backups +run_cmd mkdir -p "${SPACETIME_ROOT}" "${RELEASE_ROOT}" "$(dirname "${CURRENT_LINK}")" "$(dirname "${WEB_LINK}")" /etc/genarrative /var/lib/genarrative/maintenance /var/lib/genarrative/auth /var/lib/genarrative/tracking-outbox /var/lib/genarrative/database-backups /var/lib/genarrative/health-patrol if ! id spacetimedb >/dev/null 2>&1; then run_cmd useradd --system --home-dir "${SPACETIME_ROOT}" --shell /usr/sbin/nologin spacetimedb @@ -786,14 +796,18 @@ sync_spacetime_install "${SPACETIME_ROOT}" spacetimedb_service="$(mktemp)" api_service="$(mktemp)" database_backup_service="$(mktemp)" +health_patrol_service="$(mktemp)" render_spacetimedb_service >"${spacetimedb_service}" render_api_service >"${api_service}" render_database_backup_service >"${database_backup_service}" +render_health_patrol_service >"${health_patrol_service}" install_file "${spacetimedb_service}" /etc/systemd/system/spacetimedb.service 0644 install_file "${api_service}" /etc/systemd/system/genarrative-api.service 0644 install_file "${database_backup_service}" /etc/systemd/system/genarrative-database-backup.service 0644 install_file deploy/systemd/genarrative-database-backup.timer /etc/systemd/system/genarrative-database-backup.timer 0644 -rm -f "${spacetimedb_service}" "${api_service}" "${database_backup_service}" +install_file "${health_patrol_service}" /etc/systemd/system/genarrative-health-patrol.service 0644 +install_file deploy/systemd/genarrative-health-patrol.timer /etc/systemd/system/genarrative-health-patrol.timer 0644 +rm -f "${spacetimedb_service}" "${api_service}" "${database_backup_service}" "${health_patrol_service}" if [[ ! -f "${API_ENV_FILE}" ]]; then echo "+ create ${API_ENV_FILE} from example" @@ -828,7 +842,7 @@ if [[ "${ENABLE_SERVICES}" == "true" ]]; then if [[ "${ENABLE_OTELCOL:-true}" == "true" ]]; then run_cmd systemctl enable otelcol-contrib.service fi - run_cmd systemctl enable spacetimedb.service genarrative-api.service genarrative-database-backup.timer + run_cmd systemctl enable spacetimedb.service genarrative-api.service genarrative-database-backup.timer genarrative-health-patrol.timer if [[ "${ENABLE_OTELCOL:-true}" == "true" ]]; then run_cmd systemctl restart otelcol-contrib.service fi diff --git a/scripts/ops/production-health-patrol.mjs b/scripts/ops/production-health-patrol.mjs new file mode 100644 index 00000000..219d8e29 --- /dev/null +++ b/scripts/ops/production-health-patrol.mjs @@ -0,0 +1,477 @@ +#!/usr/bin/env node + +import {execFile} from 'node:child_process'; +import http from 'node:http'; +import https from 'node:https'; +import {mkdir, writeFile} from 'node:fs/promises'; +import {dirname} from 'node:path'; + +const STATUS_RANK = { + OK: 0, + WARNING: 1, + CRITICAL: 2, +}; + +const DEFAULT_PUBLIC_PATHS = [ + '/api/creation-entry/config', + '/api/runtime/puzzle/gallery', + '/api/runtime/custom-world-gallery', +]; + +const DEFAULT_SERVICES = [ + 'genarrative-api.service', + 'spacetimedb.service', + 'nginx.service', +]; + +function usage() { + console.log(`Usage: + node scripts/ops/production-health-patrol.mjs [options] + +Options: + --api-base-url API direct base URL, default http://127.0.0.1:8082 + --spacetime-base-url SpacetimeDB base URL, default http://127.0.0.1:3101 + --public-base-url Nginx/public base URL, default http://127.0.0.1 + --public-path Public API path to probe; repeatable + --status-file Write the last patrol result as JSON + --timeout-ms HTTP/command timeout, default 5000 + --slow-ms Mark successful probes slower than this as WARNING, default 3000 + --fail-on-warning Exit 1 when the total status is WARNING + --skip-journal Skip recent journal error scan + --json Print JSON instead of text +`); +} + +function readBoolEnv(name, fallback = false) { + const value = process.env[name]; + if (!value) { + return fallback; + } + return ['1', 'true', 'yes', 'on'].includes(value.trim().toLowerCase()); +} + +function parsePositiveInt(raw, fallback) { + const value = Number.parseInt(String(raw ?? ''), 10); + return Number.isFinite(value) && value > 0 ? value : fallback; +} + +function parseArgs(argv) { + const config = { + apiBaseUrl: + process.env.GENARRATIVE_HEALTH_PATROL_API_BASE_URL || + 'http://127.0.0.1:8082', + spacetimeBaseUrl: + process.env.GENARRATIVE_HEALTH_PATROL_SPACETIME_BASE_URL || + 'http://127.0.0.1:3101', + publicBaseUrl: + process.env.GENARRATIVE_HEALTH_PATROL_PUBLIC_BASE_URL || + process.env.GENARRATIVE_HEALTH_PATROL_API_BASE_URL || + 'http://127.0.0.1:8082', + publicPaths: [], + statusFile: process.env.GENARRATIVE_HEALTH_PATROL_STATUS_FILE || '', + timeoutMs: parsePositiveInt( + process.env.GENARRATIVE_HEALTH_PATROL_TIMEOUT_MS, + 5000, + ), + slowMs: parsePositiveInt( + process.env.GENARRATIVE_HEALTH_PATROL_SLOW_MS, + 3000, + ), + failOnWarning: readBoolEnv('GENARRATIVE_HEALTH_PATROL_FAIL_ON_WARNING'), + skipJournal: readBoolEnv('GENARRATIVE_HEALTH_PATROL_SKIP_JOURNAL'), + json: false, + webhookUrl: process.env.GENARRATIVE_HEALTH_PATROL_WEBHOOK_URL || '', + }; + + for (let index = 0; index < argv.length; index += 1) { + const arg = argv[index]; + switch (arg) { + case '-h': + case '--help': + usage(); + process.exit(0); + break; + case '--api-base-url': + config.apiBaseUrl = requireValue(argv, ++index, arg); + break; + case '--spacetime-base-url': + config.spacetimeBaseUrl = requireValue(argv, ++index, arg); + break; + case '--public-base-url': + config.publicBaseUrl = requireValue(argv, ++index, arg); + break; + case '--public-path': + config.publicPaths.push(requireValue(argv, ++index, arg)); + break; + case '--status-file': + config.statusFile = requireValue(argv, ++index, arg); + break; + case '--timeout-ms': + config.timeoutMs = parsePositiveInt(requireValue(argv, ++index, arg), 5000); + break; + case '--slow-ms': + config.slowMs = parsePositiveInt(requireValue(argv, ++index, arg), 3000); + break; + case '--fail-on-warning': + config.failOnWarning = true; + break; + case '--skip-journal': + config.skipJournal = true; + break; + case '--json': + config.json = true; + break; + default: + throw new Error(`未知参数: ${arg}`); + } + } + + if (config.publicPaths.length === 0) { + config.publicPaths = DEFAULT_PUBLIC_PATHS; + } + + return config; +} + +function requireValue(argv, index, flag) { + const value = argv[index]; + if (!value || value.startsWith('--')) { + throw new Error(`${flag} 缺少参数值`); + } + return value; +} + +function joinUrl(baseUrl, path) { + const base = baseUrl.endsWith('/') ? baseUrl.slice(0, -1) : baseUrl; + const suffix = path.startsWith('/') ? path : `/${path}`; + return `${base}${suffix}`; +} + +function maxStatus(checks) { + return checks.reduce((current, check) => { + return STATUS_RANK[check.status] > STATUS_RANK[current] ? check.status : current; + }, 'OK'); +} + +function checkResult(name, status, summary, details = {}) { + return { + name, + status, + summary, + ...details, + }; +} + +function runCommand(command, args, timeoutMs) { + return new Promise((resolve) => { + execFile( + command, + args, + { + timeout: timeoutMs, + windowsHide: true, + maxBuffer: 256 * 1024, + }, + (error, stdout, stderr) => { + resolve({ + command: [command, ...args].join(' '), + code: + typeof error?.code === 'number' + ? error.code + : error + ? 1 + : 0, + signal: error?.signal || '', + stdout: String(stdout || ''), + stderr: String(stderr || ''), + timedOut: Boolean(error?.killed), + error: error ? error.message : '', + }); + }, + ); + }); +} + +async function checkService(serviceName, timeoutMs) { + const result = await runCommand( + 'systemctl', + ['is-active', serviceName], + timeoutMs, + ); + const state = result.stdout.trim() || result.stderr.trim() || result.error; + if (result.code === 0 && state === 'active') { + return checkResult(`service:${serviceName}`, 'OK', 'active', { + command: result.command, + }); + } + + return checkResult( + `service:${serviceName}`, + 'CRITICAL', + `服务状态异常: ${state || `exit ${result.code}`}`, + { + command: result.command, + stderr: result.stderr.trim(), + }, + ); +} + +function requestUrl(url, timeoutMs) { + return new Promise((resolve) => { + const startedAt = Date.now(); + const parsed = new URL(url); + const client = parsed.protocol === 'https:' ? https : http; + const request = client.request( + parsed, + { + method: 'GET', + timeout: timeoutMs, + headers: { + 'User-Agent': 'genarrative-health-patrol/1.0', + Accept: 'application/json,text/plain,*/*', + }, + }, + (response) => { + let body = ''; + response.setEncoding('utf8'); + response.on('data', (chunk) => { + if (body.length < 2048) { + body += chunk; + } + }); + response.on('end', () => { + resolve({ + elapsedMs: Date.now() - startedAt, + statusCode: response.statusCode || 0, + body: body.slice(0, 2048), + }); + }); + }, + ); + + request.on('timeout', () => { + request.destroy(new Error(`timeout after ${timeoutMs}ms`)); + }); + request.on('error', (error) => { + resolve({ + elapsedMs: Date.now() - startedAt, + error: error.message, + }); + }); + request.end(); + }); +} + +async function checkHttp(name, url, config) { + const result = await requestUrl(url, config.timeoutMs); + const curlCommand = `curl -fsS --max-time ${Math.ceil(config.timeoutMs / 1000)} ${url}`; + + if (result.error) { + return checkResult(name, 'CRITICAL', `请求失败: ${result.error}`, { + command: curlCommand, + elapsedMs: result.elapsedMs, + }); + } + + const ok = result.statusCode >= 200 && result.statusCode < 300; + if (!ok) { + return checkResult( + name, + 'CRITICAL', + `HTTP ${result.statusCode},耗时 ${result.elapsedMs}ms`, + { + command: curlCommand, + elapsedMs: result.elapsedMs, + body: result.body.trim(), + }, + ); + } + + if (result.elapsedMs > config.slowMs) { + return checkResult( + name, + 'WARNING', + `HTTP ${result.statusCode} 但耗时偏高: ${result.elapsedMs}ms`, + { + command: curlCommand, + elapsedMs: result.elapsedMs, + }, + ); + } + + return checkResult(name, 'OK', `HTTP ${result.statusCode} ${result.elapsedMs}ms`, { + command: curlCommand, + elapsedMs: result.elapsedMs, + }); +} + +async function checkRecentJournal(config) { + const args = [ + '-u', + 'genarrative-api.service', + '-u', + 'spacetimedb.service', + '-u', + 'nginx.service', + '--since', + '15 minutes ago', + '-p', + 'err..alert', + '--no-pager', + '-o', + 'short-iso', + '-n', + '20', + ]; + const result = await runCommand('journalctl', args, config.timeoutMs); + + if (result.code !== 0) { + return checkResult('journal:recent-errors', 'WARNING', '无法读取最近错误日志', { + command: result.command, + stderr: result.stderr.trim() || result.error, + }); + } + + const lines = result.stdout + .split('\n') + .map((line) => line.trim()) + .filter((line) => line && line !== '-- No entries --'); + + if (lines.length === 0) { + return checkResult('journal:recent-errors', 'OK', '最近 15 分钟无 err..alert 日志', { + command: result.command, + }); + } + + return checkResult( + 'journal:recent-errors', + 'WARNING', + `最近 15 分钟有 ${lines.length} 条 err..alert 日志`, + { + command: result.command, + lines, + }, + ); +} + +async function writeStatusFile(statusFile, payload) { + if (!statusFile) { + return; + } + await mkdir(dirname(statusFile), {recursive: true}); + await writeFile(statusFile, `${JSON.stringify(payload, null, 2)}\n`, 'utf8'); +} + +async function notifyWebhook(config, payload) { + if (!config.webhookUrl || payload.status === 'OK') { + return; + } + + const body = JSON.stringify(payload); + const parsed = new URL(config.webhookUrl); + const client = parsed.protocol === 'https:' ? https : http; + + await new Promise((resolve) => { + const request = client.request( + parsed, + { + method: 'POST', + timeout: config.timeoutMs, + headers: { + 'Content-Type': 'application/json', + 'Content-Length': Buffer.byteLength(body), + }, + }, + (response) => { + response.resume(); + response.on('end', resolve); + }, + ); + request.on('timeout', () => { + request.destroy(new Error(`timeout after ${config.timeoutMs}ms`)); + }); + request.on('error', (error) => { + console.error(`[health-patrol] webhook notify failed: ${error.message}`); + resolve(); + }); + request.end(body); + }); +} + +function printText(payload) { + console.log(`[health-patrol] ${payload.status} ${payload.checkedAt}`); + for (const check of payload.checks) { + console.log(`[${check.status}] ${check.name}: ${check.summary}`); + if (check.command && check.status !== 'OK') { + console.log(` command: ${check.command}`); + } + if (check.stderr) { + console.log(` stderr: ${check.stderr}`); + } + if (check.body) { + console.log(` body: ${check.body}`); + } + if (Array.isArray(check.lines) && check.lines.length > 0) { + for (const line of check.lines) { + console.log(` ${line}`); + } + } + } +} + +async function main() { + const config = parseArgs(process.argv.slice(2)); + const checks = []; + + for (const serviceName of DEFAULT_SERVICES) { + checks.push(await checkService(serviceName, config.timeoutMs)); + } + + checks.push(await checkHttp('api:/healthz', joinUrl(config.apiBaseUrl, '/healthz'), config)); + checks.push(await checkHttp('api:/readyz', joinUrl(config.apiBaseUrl, '/readyz'), config)); + checks.push( + await checkHttp( + 'spacetimedb:/v1/ping', + joinUrl(config.spacetimeBaseUrl, '/v1/ping'), + config, + ), + ); + + for (const path of config.publicPaths) { + checks.push( + await checkHttp(`public:${path}`, joinUrl(config.publicBaseUrl, path), config), + ); + } + + if (!config.skipJournal) { + checks.push(await checkRecentJournal(config)); + } + + const payload = { + status: maxStatus(checks), + checkedAt: new Date().toISOString(), + host: process.env.HOSTNAME || '', + checks, + }; + + await writeStatusFile(config.statusFile, payload); + await notifyWebhook(config, payload); + + if (config.json) { + console.log(JSON.stringify(payload, null, 2)); + } else { + printText(payload); + } + + if (payload.status === 'CRITICAL') { + process.exit(2); + } + if (payload.status === 'WARNING' && config.failOnWarning) { + process.exit(1); + } +} + +main().catch((error) => { + console.error(`[health-patrol] CRITICAL ${error instanceof Error ? error.message : String(error)}`); + process.exit(2); +}); diff --git a/server-rs/crates/api-server/src/app.rs b/server-rs/crates/api-server/src/app.rs index a68f6db5..d9b4b0e3 100644 --- a/server-rs/crates/api-server/src/app.rs +++ b/server-rs/crates/api-server/src/app.rs @@ -269,6 +269,7 @@ mod tests { }; use reqwest::Client; use serde_json::Value; + use spacetime_client::{SpacetimeClientHealthSnapshot, SpacetimeClientStage}; use time::OffsetDateTime; use tokio::net::TcpListener; use tower::ServiceExt; @@ -724,6 +725,45 @@ mod tests { ); } + #[tokio::test] + async fn readyz_reports_spacetime_health_stage() { + let state = AppState::new(AppConfig::default()).expect("state should build"); + state.set_test_spacetime_health(SpacetimeClientHealthSnapshot { + ok: false, + stage: SpacetimeClientStage::ProcedureResult, + checked_at_micros: 1_713_680_000_000_000, + elapsed_ms: 2_000, + timeout_ms: 2_000, + error: Some("SpacetimeDB procedure 调用超时".to_string()), + last_success_at_micros: Some(1_713_679_999_000_000), + last_error: Some("SpacetimeDB procedure 调用超时".to_string()), + }); + let app = build_router(state); + + let response = app + .oneshot( + Request::builder() + .uri("/readyz") + .header("x-request-id", "req-ready-spacetime") + .body(Body::empty()) + .expect("readyz request should build"), + ) + .await + .expect("readyz request should succeed"); + + assert_eq!(response.status(), StatusCode::SERVICE_UNAVAILABLE); + let body = read_json_response(response).await; + assert_eq!(body["error"]["details"]["reason"], "spacetime_unhealthy"); + assert_eq!( + body["error"]["details"]["spacetime"]["stage"], + "procedure_result" + ); + assert_eq!( + body["error"]["details"]["spacetime"]["timeoutMs"], + Value::from(2_000) + ); + } + #[tokio::test] async fn creative_agent_draft_edit_rejects_unconfirmed_template_session() { let app = build_internal_creative_agent_app(); diff --git a/server-rs/crates/api-server/src/config.rs b/server-rs/crates/api-server/src/config.rs index 46a5f9f0..3ca4a2a6 100644 --- a/server-rs/crates/api-server/src/config.rs +++ b/server-rs/crates/api-server/src/config.rs @@ -12,6 +12,7 @@ use platform_speech::{ const DEFAULT_INTERNAL_API_SECRET: &str = "genarrative-dev-internal-bridge"; const SPACETIME_LOCAL_CONFIG_FILE: &str = "spacetime.local.json"; +const DEFAULT_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS: u64 = 2; pub(crate) const DEFAULT_VECTOR_ENGINE_IMAGE_REQUEST_TIMEOUT_MS: u64 = 1_000_000; // 集中管理 api-server 的启动配置,避免入口层直接散落环境变量解析逻辑。 @@ -118,6 +119,7 @@ pub struct AppConfig { pub spacetime_token: Option, pub spacetime_pool_size: u32, pub spacetime_procedure_timeout: Duration, + pub spacetime_health_check_timeout: Duration, pub llm_provider: LlmProvider, pub llm_base_url: String, pub llm_api_key: Option, @@ -276,6 +278,9 @@ impl Default for AppConfig { spacetime_token: None, spacetime_pool_size: 4, spacetime_procedure_timeout: Duration::from_secs(30), + spacetime_health_check_timeout: Duration::from_secs( + DEFAULT_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS, + ), llm_provider: LlmProvider::Ark, llm_base_url: String::new(), llm_api_key: None, @@ -704,6 +709,12 @@ impl AppConfig { config.spacetime_procedure_timeout = Duration::from_secs(spacetime_procedure_timeout_seconds); } + if let Some(spacetime_health_check_timeout_seconds) = + read_first_duration_seconds_env(&["GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS"]) + { + config.spacetime_health_check_timeout = + Duration::from_secs(spacetime_health_check_timeout_seconds); + } if let Some(llm_provider) = read_first_llm_provider_env(&["GENARRATIVE_LLM_PROVIDER", "LLM_PROVIDER"]) @@ -1610,6 +1621,26 @@ mod tests { } } + #[test] + fn from_env_reads_spacetime_health_check_timeout() { + let _guard = ENV_LOCK + .get_or_init(|| Mutex::new(())) + .lock() + .expect("env lock should not poison"); + + unsafe { + std::env::remove_var("GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS"); + std::env::set_var("GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS", "3"); + } + + let config = AppConfig::from_env(); + assert_eq!(config.spacetime_health_check_timeout.as_secs(), 3); + + unsafe { + std::env::remove_var("GENARRATIVE_SPACETIME_HEALTH_CHECK_TIMEOUT_SECONDS"); + } + } + #[test] fn default_keeps_structured_llm_web_search_disabled() { let config = AppConfig::default(); diff --git a/server-rs/crates/api-server/src/health.rs b/server-rs/crates/api-server/src/health.rs index ee83a012..c6df5026 100644 --- a/server-rs/crates/api-server/src/health.rs +++ b/server-rs/crates/api-server/src/health.rs @@ -10,6 +10,7 @@ use crate::{ api_response::json_success_body, http_error::AppError, request_context::RequestContext, state::AppState, }; +use spacetime_client::SpacetimeClientHealthSnapshot; pub async fn health_check(Extension(request_context): Extension) -> Json { json_success_body( @@ -25,23 +26,49 @@ pub async fn readiness_check( State(state): State, Extension(request_context): Extension, ) -> Response { - if state.is_ready() { + if !state.is_ready() { + return AppError::from_status(StatusCode::SERVICE_UNAVAILABLE) + .with_message("api-server 正在退出,不再接收新流量") + .with_details(json!({ + "reason": "api_server_draining", + "ready": false, + })) + .into_response_with_context(Some(&request_context)); + } + + let spacetime_health = state.spacetime_health_check().await; + if spacetime_health.ok { return json_success_body( Some(&request_context), json!({ "ok": true, "ready": true, "service": "genarrative-api-server", + "spacetime": spacetime_health_to_json(&spacetime_health), }), ) .into_response(); } AppError::from_status(StatusCode::SERVICE_UNAVAILABLE) - .with_message("api-server 正在退出,不再接收新流量") + .with_message("SpacetimeDB 连接健康检查失败,api-server 暂不接收新流量") .with_details(json!({ - "reason": "api_server_draining", + "reason": "spacetime_unhealthy", "ready": false, + "spacetime": spacetime_health_to_json(&spacetime_health), })) .into_response_with_context(Some(&request_context)) } + +fn spacetime_health_to_json(snapshot: &SpacetimeClientHealthSnapshot) -> Value { + json!({ + "ok": snapshot.ok, + "stage": snapshot.stage.as_str(), + "checkedAtMicros": snapshot.checked_at_micros, + "elapsedMs": snapshot.elapsed_ms, + "timeoutMs": snapshot.timeout_ms, + "error": snapshot.error, + "lastSuccessAtMicros": snapshot.last_success_at_micros, + "lastError": snapshot.last_error, + }) +} diff --git a/server-rs/crates/api-server/src/state.rs b/server-rs/crates/api-server/src/state.rs index 49d0e381..3858914a 100644 --- a/server-rs/crates/api-server/src/state.rs +++ b/server-rs/crates/api-server/src/state.rs @@ -31,7 +31,9 @@ use platform_wechat::{WechatClient, WechatConfig, pay::WechatPayClient}; use serde_json::Value; use shared_contracts::creation_entry_config::CreationEntryConfigResponse; use shared_contracts::creative_agent::CreativeAgentSessionSnapshot; -use spacetime_client::{SpacetimeClient, SpacetimeClientConfig, SpacetimeClientError}; +use spacetime_client::{ + SpacetimeClient, SpacetimeClientConfig, SpacetimeClientError, SpacetimeClientHealthSnapshot, +}; use time::OffsetDateTime; use tokio::sync::{Semaphore, broadcast}; use tracing::{info, warn}; @@ -242,6 +244,8 @@ pub struct AppStateInner { refresh_cookie_config: RefreshCookieConfig, #[cfg(test)] test_creation_entry_config: Arc>>, + #[cfg(test)] + test_spacetime_health: Arc>>, oss_client: Option, #[cfg_attr(test, allow(dead_code))] auth_store: InMemoryAuthStore, @@ -418,6 +422,10 @@ impl AppState { test_creation_entry_config: Arc::new(Mutex::new(Some( crate::creation_entry_config::test_creation_entry_config_response(), ))), + #[cfg(test)] + test_spacetime_health: Arc::new(Mutex::new(Some( + SpacetimeClientHealthSnapshot::healthy_for_test(), + ))), oss_client, auth_store, password_entry_service, @@ -467,6 +475,30 @@ impl AppState { self.ready.store(false, Ordering::Release); } + pub async fn spacetime_health_check(&self) -> SpacetimeClientHealthSnapshot { + #[cfg(test)] + if let Some(snapshot) = self + .test_spacetime_health + .lock() + .expect("test spacetime health should lock") + .clone() + { + return snapshot; + } + + self.spacetime_client + .health_check(self.config.spacetime_health_check_timeout) + .await + } + + #[cfg(test)] + pub(crate) fn set_test_spacetime_health(&self, snapshot: SpacetimeClientHealthSnapshot) { + *self + .test_spacetime_health + .lock() + .expect("test spacetime health should lock") = Some(snapshot); + } + pub async fn upsert_creation_entry_type_config( &self, input: module_runtime::CreationEntryTypeAdminUpsertInput, diff --git a/server-rs/crates/spacetime-client/src/lib.rs b/server-rs/crates/spacetime-client/src/lib.rs index 20361ba8..6b43db21 100644 --- a/server-rs/crates/spacetime-client/src/lib.rs +++ b/server-rs/crates/spacetime-client/src/lib.rs @@ -105,8 +105,8 @@ pub mod auth; pub mod bark_battle; pub use bark_battle::{ BarkBattleDraftConfigUpsertRecordInput, BarkBattleDraftCreateRecordInput, - BarkBattleRunFinishRecordInput, BarkBattleRunStartRecordInput, - BarkBattleWorkDeleteRecordInput, BarkBattleWorkPublishRecordInput, + BarkBattleRunFinishRecordInput, BarkBattleRunStartRecordInput, BarkBattleWorkDeleteRecordInput, + BarkBattleWorkPublishRecordInput, }; pub mod big_fish; pub mod combat; @@ -132,7 +132,7 @@ use std::{ sync::atomic::{AtomicBool, Ordering}, sync::{Arc, Mutex}, thread::JoinHandle, - time::Duration, + time::{Duration, Instant}, }; use module_ai::{ @@ -241,6 +241,7 @@ use tokio::{ sync::{OwnedSemaphorePermit, RwLock, Semaphore, oneshot}, time::timeout, }; +use tracing::warn; use crate::module_bindings::*; @@ -253,6 +254,60 @@ pub struct SpacetimeClientConfig { pub procedure_timeout: Duration, } +#[derive(Clone, Copy, Debug, PartialEq, Eq)] +pub enum SpacetimeClientStage { + Ready, + PoolAcquire, + ConnectBuild, + ConnectHandshake, + ReadModelSubscribe, + ProcedureResult, + ReducerResult, + ReadCache, +} + +impl SpacetimeClientStage { + pub fn as_str(self) -> &'static str { + match self { + Self::Ready => "ready", + Self::PoolAcquire => "pool_acquire", + Self::ConnectBuild => "connect_build", + Self::ConnectHandshake => "connect_handshake", + Self::ReadModelSubscribe => "read_model_subscribe", + Self::ProcedureResult => "procedure_result", + Self::ReducerResult => "reducer_result", + Self::ReadCache => "read_cache", + } + } +} + +#[derive(Clone, Debug, PartialEq, Eq)] +pub struct SpacetimeClientHealthSnapshot { + pub ok: bool, + pub stage: SpacetimeClientStage, + pub checked_at_micros: i64, + pub elapsed_ms: u64, + pub timeout_ms: u64, + pub error: Option, + pub last_success_at_micros: Option, + pub last_error: Option, +} + +impl SpacetimeClientHealthSnapshot { + pub fn healthy_for_test() -> Self { + Self { + ok: true, + stage: SpacetimeClientStage::Ready, + checked_at_micros: current_unix_micros(), + elapsed_ms: 0, + timeout_ms: 0, + error: None, + last_success_at_micros: Some(current_unix_micros()), + last_error: None, + } + } +} + #[derive(Clone, Debug, PartialEq, Eq)] pub struct AuthStoreSnapshotRecord { pub snapshot_json: Option, @@ -270,6 +325,7 @@ pub struct AuthStoreSnapshotImportRecord { pub struct SpacetimeClient { config: SpacetimeClientConfig, pool: Arc, + health_state: Arc>, creation_entry_config_cache: Arc>>, custom_world_gallery_legacy_sync_attempted: Arc, } @@ -296,6 +352,24 @@ struct SpacetimeConnectionPool { permits: Arc, } +#[derive(Debug, Default)] +struct SpacetimeClientHealthState { + last_success_at_micros: Option, + last_error: Option, +} + +#[derive(Debug)] +struct SpacetimeStageError { + stage: SpacetimeClientStage, + error: SpacetimeClientError, +} + +impl SpacetimeStageError { + fn new(stage: SpacetimeClientStage, error: SpacetimeClientError) -> Self { + Self { stage, error } + } +} + struct PooledConnectionSlot { connection: Option, in_use: bool, @@ -341,6 +415,7 @@ impl SpacetimeClient { Self { config, pool, + health_state: Arc::new(RwLock::new(SpacetimeClientHealthState::default())), creation_entry_config_cache: Arc::new(RwLock::new(None)), custom_world_gallery_legacy_sync_attempted: Arc::new(AtomicBool::new(false)), } @@ -354,29 +429,58 @@ impl SpacetimeClient { where T: Send + 'static, { + let started_at = Instant::now(); let metrics_guard = telemetry::begin_procedure(procedure); let (sender, receiver) = oneshot::channel(); let result_sender = Arc::new(Mutex::new(Some(sender))); - let final_result = match self.acquire_connection().await { + let final_result = match self + .acquire_connection_with_timeout(self.config.procedure_timeout) + .await + { Ok(lease) => { - let result = if let Some(connection) = lease.connection.as_ref() { + let (result, failed_stage) = if let Some(connection) = lease.connection.as_ref() { call(&connection.connection, result_sender.clone()); - match timeout(self.config.procedure_timeout, receiver).await { - Ok(inner) => match inner { - Ok(value) => value, - Err(_) => Err(SpacetimeClientError::ConnectDropped), + let stage = SpacetimeClientStage::ProcedureResult; + ( + match timeout(self.config.procedure_timeout, receiver).await { + Ok(inner) => match inner { + Ok(value) => value, + Err(_) => Err(SpacetimeClientError::ConnectDropped), + }, + Err(_) => Err(Self::resolve_timeout_error(Some(connection), stage)), }, - Err(_) => Err(Self::resolve_timeout_error(Some(connection))), - } + stage, + ) } else { - Err(SpacetimeClientError::Runtime( - "SpacetimeDB 连接租约缺少连接".to_string(), - )) + ( + Err(SpacetimeClientError::Runtime( + "SpacetimeDB 连接租约缺少连接".to_string(), + )), + SpacetimeClientStage::ProcedureResult, + ) }; self.release_connection(lease).await; + if let Err(error) = &result { + log_spacetime_client_failure( + "procedure", + procedure, + failed_stage, + started_at, + error, + ); + } result } - Err(error) => Err(error), + Err(error) => { + log_spacetime_client_failure( + "procedure", + procedure, + error.stage, + started_at, + &error.error, + ); + Err(error.error) + } }; metrics_guard.finish(&final_result); @@ -388,29 +492,58 @@ impl SpacetimeClient { procedure: &'static str, call: impl FnOnce(&DbConnection, ReducerResultSender) + Send + 'static, ) -> Result<(), SpacetimeClientError> { + let started_at = Instant::now(); let metrics_guard = telemetry::begin_procedure(procedure); let (sender, receiver) = oneshot::channel(); let result_sender = Arc::new(Mutex::new(Some(sender))); - let final_result = match self.acquire_connection().await { + let final_result = match self + .acquire_connection_with_timeout(self.config.procedure_timeout) + .await + { Ok(lease) => { - let result = if let Some(connection) = lease.connection.as_ref() { + let (result, failed_stage) = if let Some(connection) = lease.connection.as_ref() { call(&connection.connection, result_sender.clone()); - match timeout(self.config.procedure_timeout, receiver).await { - Ok(inner) => match inner { - Ok(value) => value, - Err(_) => Err(SpacetimeClientError::ConnectDropped), + let stage = SpacetimeClientStage::ReducerResult; + ( + match timeout(self.config.procedure_timeout, receiver).await { + Ok(inner) => match inner { + Ok(value) => value, + Err(_) => Err(SpacetimeClientError::ConnectDropped), + }, + Err(_) => Err(Self::resolve_timeout_error(Some(connection), stage)), }, - Err(_) => Err(Self::resolve_timeout_error(Some(connection))), - } + stage, + ) } else { - Err(SpacetimeClientError::Runtime( - "SpacetimeDB 连接租约缺少连接".to_string(), - )) + ( + Err(SpacetimeClientError::Runtime( + "SpacetimeDB 连接租约缺少连接".to_string(), + )), + SpacetimeClientStage::ReducerResult, + ) }; self.release_connection(lease).await; + if let Err(error) = &result { + log_spacetime_client_failure( + "reducer", + procedure, + failed_stage, + started_at, + error, + ); + } result } - Err(error) => Err(error), + Err(error) => { + log_spacetime_client_failure( + "reducer", + procedure, + error.stage, + started_at, + &error.error, + ); + Err(error.error) + } }; metrics_guard.finish(&final_result); @@ -425,11 +558,22 @@ impl SpacetimeClient { where T: Send + 'static, { + let started_at = Instant::now(); let metrics_guard = telemetry::begin_read(read_name); - let lease = match self.acquire_connection().await { + let lease = match self + .acquire_connection_with_timeout(self.config.procedure_timeout) + .await + { Ok(lease) => lease, Err(error) => { - let final_result = Err(error); + log_spacetime_client_failure( + "read", + read_name, + error.stage, + started_at, + &error.error, + ); + let final_result = Err(error.error); metrics_guard.finish(&final_result); return final_result; } @@ -443,6 +587,15 @@ impl SpacetimeClient { }; self.release_connection(lease).await; + if let Err(error) = &final_result { + log_spacetime_client_failure( + "read", + read_name, + SpacetimeClientStage::ReadCache, + started_at, + error, + ); + } metrics_guard.finish(&final_result); final_result } @@ -455,14 +608,75 @@ impl SpacetimeClient { self.creation_entry_config_cache.read().await.clone() } - async fn acquire_connection(&self) -> Result { - let permit = timeout( - self.config.procedure_timeout, - self.pool.permits.clone().acquire_owned(), - ) - .await - .map_err(|_| SpacetimeClientError::Timeout)? - .map_err(|error| SpacetimeClientError::Runtime(error.to_string()))?; + pub async fn health_check(&self, probe_timeout: Duration) -> SpacetimeClientHealthSnapshot { + let timeout = if probe_timeout.is_zero() { + DEFAULT_PROCEDURE_TIMEOUT + } else { + probe_timeout + }; + let started_at = Instant::now(); + let checked_at_micros = current_unix_micros(); + let result = self.acquire_connection_with_timeout(timeout).await; + match result { + Ok(lease) => { + self.release_connection(lease).await; + let mut health_state = self.health_state.write().await; + health_state.last_success_at_micros = Some(checked_at_micros); + health_state.last_error = None; + SpacetimeClientHealthSnapshot { + ok: true, + stage: SpacetimeClientStage::Ready, + checked_at_micros, + elapsed_ms: duration_millis_u64(started_at.elapsed()), + timeout_ms: duration_millis_u64(timeout), + error: None, + last_success_at_micros: health_state.last_success_at_micros, + last_error: health_state.last_error.clone(), + } + } + Err(error) => { + log_spacetime_client_failure( + "health_check", + "spacetime_connection", + error.stage, + started_at, + &error.error, + ); + let mut health_state = self.health_state.write().await; + let error_message = error.error.to_string(); + health_state.last_error = Some(error_message.clone()); + SpacetimeClientHealthSnapshot { + ok: false, + stage: error.stage, + checked_at_micros, + elapsed_ms: duration_millis_u64(started_at.elapsed()), + timeout_ms: duration_millis_u64(timeout), + error: Some(error_message), + last_success_at_micros: health_state.last_success_at_micros, + last_error: health_state.last_error.clone(), + } + } + } + } + + async fn acquire_connection_with_timeout( + &self, + operation_timeout: Duration, + ) -> Result { + let permit = timeout(operation_timeout, self.pool.permits.clone().acquire_owned()) + .await + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::PoolAcquire, + SpacetimeClientError::Timeout, + ) + })? + .map_err(|error| { + SpacetimeStageError::new( + SpacetimeClientStage::PoolAcquire, + SpacetimeClientError::Runtime(error.to_string()), + ) + })?; loop { for (slot_index, slot) in self.pool.slots.iter().enumerate() { @@ -480,7 +694,7 @@ impl SpacetimeClient { let connection = if let Some(connection) = reusable_connection { connection } else { - match self.build_pooled_connection().await { + match self.build_pooled_connection(operation_timeout).await { Ok(connection) => connection, Err(error) => { let mut slot_guard = self.pool.slots[slot_index].lock().await; @@ -502,7 +716,10 @@ impl SpacetimeClient { } } - async fn build_pooled_connection(&self) -> Result { + async fn build_pooled_connection( + &self, + operation_timeout: Duration, + ) -> Result { let config = self.config.clone(); let broken = Arc::new(AtomicBool::new(false)); let (sender, receiver) = oneshot::channel::>(); @@ -510,7 +727,7 @@ impl SpacetimeClient { let broken_flag = broken.clone(); let disconnect_sender = connect_sender.clone(); let connection = timeout( - self.config.procedure_timeout, + operation_timeout, tokio::task::spawn_blocking(move || { DbConnection::builder() .with_uri(config.server_url) @@ -534,17 +751,41 @@ impl SpacetimeClient { }), ) .await - .map_err(|_| SpacetimeClientError::Timeout)? - .map_err(|error| SpacetimeClientError::Runtime(error.to_string()))??; + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::ConnectBuild, + SpacetimeClientError::Timeout, + ) + })? + .map_err(|error| { + SpacetimeStageError::new( + SpacetimeClientStage::ConnectBuild, + SpacetimeClientError::Runtime(error.to_string()), + ) + })? + .map_err(|error| SpacetimeStageError::new(SpacetimeClientStage::ConnectBuild, error))?; let runner = connection.run_threaded(); - timeout(self.config.procedure_timeout, receiver) + timeout(operation_timeout, receiver) .await - .map_err(|_| SpacetimeClientError::Timeout)? - .map_err(|_| SpacetimeClientError::ConnectDropped)??; + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::ConnectHandshake, + SpacetimeClientError::Timeout, + ) + })? + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::ConnectHandshake, + SpacetimeClientError::ConnectDropped, + ) + })? + .map_err(|error| { + SpacetimeStageError::new(SpacetimeClientStage::ConnectHandshake, error) + })?; let read_model_subscriptions = self - .subscribe_cached_read_models(&connection, broken.clone()) + .subscribe_cached_read_models(&connection, broken.clone(), operation_timeout) .await?; Ok(PooledConnection { @@ -559,7 +800,8 @@ impl SpacetimeClient { &self, connection: &DbConnection, broken: Arc, - ) -> Result, SpacetimeClientError> { + operation_timeout: Duration, + ) -> Result, SpacetimeStageError> { let mut subscriptions = Vec::new(); for query in [ "SELECT * FROM public_work_gallery_entry", @@ -576,7 +818,13 @@ impl SpacetimeClient { "SELECT * FROM big_fish_gallery_view", ] { let subscription = self - .subscribe_cached_read_model_query(connection, broken.clone(), query, true) + .subscribe_cached_read_model_query( + connection, + broken.clone(), + query, + true, + operation_timeout, + ) .await?; subscriptions.push(subscription); } @@ -597,7 +845,13 @@ impl SpacetimeClient { "SELECT * FROM asset_object", ] { if let Ok(subscription) = self - .subscribe_cached_read_model_query(connection, broken.clone(), query, false) + .subscribe_cached_read_model_query( + connection, + broken.clone(), + query, + false, + operation_timeout, + ) .await { subscriptions.push(subscription); @@ -613,7 +867,8 @@ impl SpacetimeClient { broken: Arc, query: &'static str, mark_broken_on_error: bool, - ) -> Result { + operation_timeout: Duration, + ) -> Result { let (sender, receiver) = oneshot::channel::>(); let applied_sender = Arc::new(Mutex::new(Some(sender))); let on_applied_sender = applied_sender.clone(); @@ -635,10 +890,23 @@ impl SpacetimeClient { }) .subscribe(query); - timeout(self.config.procedure_timeout, receiver) + timeout(operation_timeout, receiver) .await - .map_err(|_| SpacetimeClientError::Timeout)? - .map_err(|_| SpacetimeClientError::ConnectDropped)??; + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::ReadModelSubscribe, + SpacetimeClientError::Timeout, + ) + })? + .map_err(|_| { + SpacetimeStageError::new( + SpacetimeClientStage::ReadModelSubscribe, + SpacetimeClientError::ConnectDropped, + ) + })? + .map_err(|error| { + SpacetimeStageError::new(SpacetimeClientStage::ReadModelSubscribe, error) + })?; Ok(subscription) } @@ -658,7 +926,10 @@ impl SpacetimeClient { } // 超时后必须统一归还租约;若连接已先一步断开则回传断线,否则标记坏连接并回传超时。 - fn resolve_timeout_error(connection: Option<&PooledConnection>) -> SpacetimeClientError { + fn resolve_timeout_error( + connection: Option<&PooledConnection>, + _stage: SpacetimeClientStage, + ) -> SpacetimeClientError { if let Some(connection) = connection { if connection.is_broken() { return SpacetimeClientError::ConnectDropped; @@ -681,6 +952,27 @@ fn current_public_work_day() -> i64 { current_unix_micros().div_euclid(PUBLIC_WORK_PLAY_DAY_MICROS) } +fn duration_millis_u64(duration: Duration) -> u64 { + duration.as_millis().min(u64::MAX as u128) as u64 +} + +fn log_spacetime_client_failure( + operation_kind: &'static str, + operation_name: &'static str, + stage: SpacetimeClientStage, + started_at: Instant, + error: &SpacetimeClientError, +) { + warn!( + operation_kind, + operation_name, + spacetime_stage = stage.as_str(), + elapsed_ms = duration_millis_u64(started_at.elapsed()), + error = %error, + "SpacetimeDB client operation failed" + ); +} + fn public_work_recent_play_counts( connection: &DbConnection, source_type: &str,