chore: add loadtest observability setup
@@ -113,6 +113,17 @@ $env:WORKS_DATA="data/works-list.local.json"
npm run loadtest:k6:works -- --summary-trend-stats="avg,min,med,p(90),p(95),p(99),max"
```

## How the 50 HTTP req/s target is counted

By default, each iteration of `k6-works-list.js` requests two public list endpoints in sequence: `/api/runtime/puzzle/gallery` and `/api/runtime/custom-world-gallery`. `PEAK_RPS` sets the iteration rate of the `ramping-arrival-rate` executor, so a target of roughly 50 HTTP req/s means setting `PEAK_RPS` to `25`. Passing `AUTH_TOKEN` or setting `DETAIL_RATIO` above 0 adds requests to each iteration, so the value must be recalculated.
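
The conversion described above can be sketched as a small helper. This is illustrative only: `iterationRateFor` is a hypothetical name that does not exist in `k6-works-list.js`, and it assumes every iteration issues the same number of HTTP requests.

```javascript
// Hypothetical helper, not part of k6-works-list.js.
// Converts a target HTTP request rate into the iteration rate to pass as PEAK_RPS,
// assuming each iteration issues a fixed number of HTTP requests.
function iterationRateFor(targetHttpRps, requestsPerIteration) {
  if (!(targetHttpRps > 0) || !(requestsPerIteration > 0)) {
    throw new RangeError('both arguments must be positive numbers');
  }
  // Round up so the achieved HTTP rate does not fall below the target.
  return Math.ceil(targetHttpRps / requestsPerIteration);
}

// Default run: two public list requests per iteration.
console.log(iterationRateFor(50, 2)); // 25
```

When `AUTH_TOKEN` or `DETAIL_RATIO` adds requests, one option is to pass the measured average number of requests per iteration (k6's `http_reqs` count divided by its `iterations` count from a short trial run) instead of `2`.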

Acceptance targets:

- `http_req_failed < 1%`
- `http_req_duration p95 < 2000ms`
- `dropped_iterations = 0`
- No new 502 responses from Nginx during the load-test window
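
As a rough sketch, the k6-side targets above could be checked against a run summary as follows. `checkAcceptance` is a hypothetical function, not part of the repository; it assumes `metrics` has the shape of the `data.metrics` object that k6 passes to `handleSummary()`, and it assumes that a missing `dropped_iterations` metric means no iterations were dropped. The sample values are made up for illustration.

```javascript
// Hypothetical checker, not part of the repository.
// `metrics` is assumed to follow the shape of `data.metrics` in k6's handleSummary().
function checkAcceptance(metrics) {
  const failedRate = metrics.http_req_failed?.values?.rate;
  const p95 = metrics.http_req_duration?.values?.['p(95)'];
  // Assumption: a missing dropped_iterations metric means nothing was dropped.
  const dropped = metrics.dropped_iterations?.values?.count ?? 0;

  const failures = [];
  // A missing rate or p95 is undefined, fails the comparison, and is reported.
  if (!(failedRate < 0.01)) failures.push(`http_req_failed rate ${failedRate} is not < 1%`);
  if (!(p95 < 2000)) failures.push(`http_req_duration p95 ${p95} is not < 2000ms`);
  if (dropped !== 0) failures.push(`dropped_iterations is ${dropped}, expected 0`);
  return failures;
}

// Made-up sample values, for illustration only.
const sample = {
  http_req_failed: {values: {rate: 0.002}},
  http_req_duration: {values: {'p(95)': 850}},
};
console.log(checkAcceptance(sample)); // []
```

The Nginx 502 target cannot be observed from k6 and still has to be checked against the Nginx logs for the same time window.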

## Smoke

```bash
@@ -151,17 +162,38 @@ BASE_URL=http://127.0.0.1:8787 \
WORKS_DATA=data/works-list.local.json \
SCENARIO=spike \
START_RPS=5 \
PEAK_RPS=100 \
HOLD=2m \
PEAK_RPS=25 \
HOLD=60s \
DETAIL_RATIO=0 \
npm run loadtest:k6:works
```

Default thresholds:

- `http_req_failed < 5%`
- `http_req_failed < 1%`
- `http_req_duration p95 < 2000ms`
- `works_list_shape_error_rate < 5%`
- `dropped_iterations = 0`
- `works_list_shape_error_rate < 1%`

PowerShell:

```powershell
$env:BASE_URL="https://genarrative.world"
$env:WORKS_DATA="data/works-list.local.json"
$env:SCENARIO="spike"
$env:START_RPS="5"
$env:PEAK_RPS="25"
$env:HOLD="60s"
$env:END_RPS="5"
$env:DETAIL_RATIO="0"
npm run loadtest:k6:works -- --summary-trend-stats="avg,min,med,p(90),p(95),p(99),max"
```

For regression testing of a production release, the same set of environment variables can be used:

```bash
SCENARIO=spike START_RPS=5 PEAK_RPS=25 HOLD=60s END_RPS=5 DETAIL_RATIO=0 npm run loadtest:k6:works
```

## Load testing the personal works list with a logged-in session
@@ -197,6 +229,96 @@ npm run loadtest:k6:works
- If the personal works list returns 401, confirm that `AUTH_TOKEN` is an access token the current api-server recognizes.
- If every detail request returns 404, confirm that data matching `WORKS_DATA` has been imported into the target environment.

## Collecting data during the load-test window

Nginx upstream timing:

```bash
sudo tail -f /var/log/nginx/genarrative.access.log
sudo tail -f /var/log/nginx/genarrative.error.log
```

api-server and SpacetimeDB logs:

```bash
sudo journalctl -u genarrative-api.service -f
sudo journalctl -u spacetimedb.service -f
```

OpenTelemetry is disabled by default in api-server. To verify OTLP traces / metrics / logs, first start an `otelcol-contrib` instance with the debug exporter on the server itself, listening only on `127.0.0.1`:

```bash
npm run otel:debug
```

To forward local data to the Rider OpenTelemetry panel, first enable a fixed OTLP server port in Rider's OpenTelemetry settings, for example `17011`, then run:

```bash
RIDER_OTLP_GRPC_ENDPOINT=127.0.0.1:17011 npm run otel:rider
```

The script generates a temporary collector config under `.codex-temp/otelcol/` and, by default, receives the OTLP HTTP data that api-server sends to `http://127.0.0.1:4318`. To change the ports, set:

- `OTELCOL_OTLP_HTTP_ENDPOINT`, default `127.0.0.1:4318`
- `OTELCOL_OTLP_GRPC_ENDPOINT`, default `127.0.0.1:4317`
- `RIDER_OTLP_GRPC_ENDPOINT`, default `127.0.0.1:17011`
- `OTELCOL_BIN`, default `otelcol-contrib`

The equivalent debug collector config is:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
      http:
        endpoint: 127.0.0.1:4318

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
    logs:
      receivers: [otlp]
      exporters: [debug]
```

```bash
otelcol-contrib --config /etc/otelcol-contrib/genarrative-debug.yaml
```

Then enable the following in `/etc/genarrative/api-server.env`:

```env
GENARRATIVE_OTEL_ENABLED=true
OTEL_SERVICE_NAME=genarrative-api
OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318
```

Note that `api-server` currently uses the OTLP HTTP exporter, so `OTEL_EXPORTER_OTLP_ENDPOINT` must point to the Collector's HTTP base endpoint, `http://127.0.0.1:4318`. Do not change it to the Collector gRPC port `4317`, and do not point it directly at Rider's gRPC port; data reaches Rider only through the Collector started by `npm run otel:rider`, which forwards to `RIDER_OTLP_GRPC_ENDPOINT`.
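
The endpoint rule above could be enforced with a small pre-flight check along these lines. This is a sketch under stated assumptions: `checkOtlpHttpEndpoint` is a hypothetical function that exists in neither api-server nor `run-otelcol.mjs`, and the gRPC ports it rejects are simply the defaults quoted in this README (`4317` for the Collector, `17011` for Rider).

```javascript
// Hypothetical pre-flight check, not part of api-server or run-otelcol.mjs.
// The default gRPC ports are the ones quoted in this README; override them
// if the OTELCOL_* / RIDER_* variables were changed.
function checkOtlpHttpEndpoint(value, {grpcPorts = ['4317', '17011']} = {}) {
  let url;
  try {
    url = new URL(value);
  } catch {
    return `not a valid URL: ${value}`;
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return `expected an http(s) URL, got ${url.protocol}`;
  }
  if (grpcPorts.includes(url.port)) {
    return `port ${url.port} is a gRPC port; point at the Collector HTTP port instead`;
  }
  return null; // no problem found
}

console.log(checkOtlpHttpEndpoint('http://127.0.0.1:4318')); // null
console.log(checkOtlpHttpEndpoint('http://127.0.0.1:4317')); // gRPC port message
```

The function returns `null` when nothing looks wrong and a short message otherwise, so a caller can decide whether to warn or abort.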

OTLP logs are an additional remote-observability channel and do not replace local logs: api-server logs are still read via `journalctl` / `logs/api-server/`, and Nginx logs are still read from files. Log levels are still controlled by `GENARRATIVE_API_LOG` / `RUST_LOG`, for example `info,tower_http=info,spacetime_client=info`.

Rider's Logs panel shows only the fields of the OTLP log event itself; it does not automatically flatten all attributes of the parent span onto every log record. The request-completion log directly carries `request_id`, `http.request.method`, `http.route`, `url.scheme`, `url.path`, `http.response.status_code`, `status_class`, `latency_ms`, and `slow_request`; the fuller request chain is still viewed in the Traces panel, correlated by the same trace/span.

Helper commands for production regression:

```bash
systemctl show genarrative-api.service -p LimitNOFILE -p TasksMax
cat /proc/$(pidof api-server)/limits
ss -ltnp | grep 8082
curl -sS http://127.0.0.1:8082/healthz
```

## Verification commands

```bash

@@ -56,20 +56,22 @@ const scenarioOptions = {
    scenarios: {
      spike: {
        executor: 'ramping-arrival-rate',
        startRate: Number(__ENV.START_RPS || 5),
        preAllocatedVUs: Number(__ENV.PREALLOCATED_VUS || 50),
        maxVUs: Number(__ENV.MAX_VUS || 200),
        timeUnit: '1s',
        stages: [
          { target: Number(__ENV.START_RPS || 5), duration: __ENV.RAMP_UP || '30s' },
          { target: Number(__ENV.PEAK_RPS || 100), duration: __ENV.HOLD || '2m' },
          { target: Number(__ENV.PEAK_RPS || 25), duration: __ENV.RAMP_UP || '30s' },
          { target: Number(__ENV.PEAK_RPS || 25), duration: __ENV.HOLD || '2m' },
          { target: Number(__ENV.END_RPS || 5), duration: __ENV.RAMP_DOWN || '30s' },
        ],
      },
    },
    thresholds: {
      http_req_failed: ['rate<0.05'],
      http_req_failed: ['rate<0.01'],
      http_req_duration: ['p(95)<2000'],
      works_list_shape_error_rate: ['rate<0.05'],
      dropped_iterations: ['count==0'],
      works_list_shape_error_rate: ['rate<0.01'],
    },
  },
};

119
scripts/run-otelcol.mjs
Normal file
@@ -0,0 +1,119 @@
import {spawn} from 'node:child_process';
import {mkdirSync, writeFileSync} from 'node:fs';
import path from 'node:path';

const [, , rawMode = 'debug', ...args] = process.argv;
const mode = rawMode.trim();
const printConfigOnly = args.includes('--print-config');

const supportedModes = new Set(['debug', 'rider']);
if (!supportedModes.has(mode)) {
  console.error('[otelcol] mode must be one of: debug, rider');
  process.exit(1);
}

const otlpHttpEndpoint = readEnv('OTELCOL_OTLP_HTTP_ENDPOINT', '127.0.0.1:4318');
const otlpGrpcEndpoint = readEnv('OTELCOL_OTLP_GRPC_ENDPOINT', '127.0.0.1:4317');
const riderEndpoint = readEnv('RIDER_OTLP_GRPC_ENDPOINT', '127.0.0.1:17011');
const debugVerbosity = readEnv('OTELCOL_DEBUG_VERBOSITY', 'detailed');
const otelcolBin = readEnv('OTELCOL_BIN', 'otelcol-contrib');

const configText = buildConfig(mode);
const configDir = path.resolve('.codex-temp', 'otelcol');
const configPath = path.join(configDir, `genarrative-${mode}.yaml`);
mkdirSync(configDir, {recursive: true});
writeFileSync(configPath, configText, 'utf8');

console.log(`[otelcol] wrote ${configPath}`);
console.log(`[otelcol] receiving OTLP HTTP at http://${otlpHttpEndpoint}`);
console.log(`[otelcol] receiving OTLP gRPC at ${otlpGrpcEndpoint}`);
if (mode === 'rider') {
  console.log(`[otelcol] forwarding traces/metrics/logs to Rider OTLP gRPC at ${riderEndpoint}`);
}
console.log(
  '[otelcol] api-server env: GENARRATIVE_OTEL_ENABLED=true OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318'
);

if (printConfigOnly) {
  console.log(configText);
  process.exit(0);
}

const child = spawn(otelcolBin, ['--config', configPath], {
  cwd: process.cwd(),
  env: process.env,
  stdio: 'inherit',
});

const stopChild = () => {
  if (!child.killed) {
    child.kill();
  }
};

for (const signal of ['SIGINT', 'SIGTERM', 'SIGHUP']) {
  process.on(signal, () => {
    stopChild();
    process.exit(130);
  });
}

process.on('exit', stopChild);

child.on('error', (error) => {
  console.error(`[otelcol] failed to start ${otelcolBin}: ${error.message}`);
  console.error('[otelcol] install otelcol-contrib and make sure it is on PATH, or set OTELCOL_BIN.');
  process.exit(1);
});

child.on('exit', (code, signal) => {
  if (signal) {
    console.error(`[otelcol] exited by signal: ${signal}`);
    process.exit(1);
  }
  process.exit(code ?? 0);
});

function readEnv(key, fallback) {
  const value = process.env[key]?.trim();
  return value ? value : fallback;
}

function buildConfig(selectedMode) {
  const exporters =
    selectedMode === 'rider'
      ? `  otlp/rider:
    endpoint: ${riderEndpoint}
    tls:
      insecure: true
  debug:
    verbosity: ${debugVerbosity}`
      : `  debug:
    verbosity: ${debugVerbosity}`;

  const pipelineExporters = selectedMode === 'rider' ? '[otlp/rider, debug]' : '[debug]';

  return `receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${otlpGrpcEndpoint}
      http:
        endpoint: ${otlpHttpEndpoint}

exporters:
${exporters}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: ${pipelineExporters}
    metrics:
      receivers: [otlp]
      exporters: ${pipelineExporters}
    logs:
      receivers: [otlp]
      exporters: ${pipelineExporters}
`;
}