From a4dd25304c1545d298af08d3a07c8d659b8ea8f3 Mon Sep 17 00:00:00 2001
From: ZhenYi <434836402@qq.com>
Date: Sun, 26 Apr 2026 00:10:45 +0800
Subject: [PATCH] docs: add monitoring metrics operations document
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Covers: endpoints, metrics list, Prometheus scrape config,
K8s probe YAML, Alertmanager alert rule examples
---
 docs/metrics.md | 262 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 262 insertions(+)
 create mode 100644 docs/metrics.md

diff --git a/docs/metrics.md b/docs/metrics.md
new file mode 100644
index 0000000..7ac32d5
--- /dev/null
+++ b/docs/metrics.md
@@ -0,0 +1,262 @@
+# Monitoring Metrics
+
+## Overview
+
+This project exposes Prometheus metrics for the following services:
+
+| Service | Endpoint | Port | HTTP server |
+|---------|----------|------|-------------|
+| app (main site) | `GET /metrics` | 8080 | actix-web |
+| gitserver | `GET :8021/metrics` | 8021 | actix-web |
+| git-hook | `GET :8083/metrics` | 8083 | hyper (built-in) |
+| email-worker | `GET :8084/metrics` | 8084 | hyper (built-in) |
+
+## Scrape configuration
+
+### Prometheus scrape config
+
+```yaml
+scrape_configs:
+  - job_name: 'gitdata-app'
+    static_configs:
+      - targets: ['app:8080']
+    metrics_path: '/metrics'
+
+  - job_name: 'gitdata-gitserver'
+    static_configs:
+      - targets: ['gitserver:8021']
+    metrics_path: '/metrics'
+
+  - job_name: 'gitdata-git-hook'
+    static_configs:
+      - targets: ['git-hook:8083']
+    metrics_path: '/metrics'
+
+  - job_name: 'gitdata-email-worker'
+    static_configs:
+      - targets: ['email-worker:8084']
+    metrics_path: '/metrics'
+```
+
+## Health check endpoints
+
+### gitserver
+- **Endpoint**: `GET /health`
+- **Port**: 8021
+- **Response**:
+  ```json
+  { "status": "ok", "db": "ok", "cache": "ok" }
+  ```
+  or `503` when either db or cache is unavailable
+
+### git-hook
+- **Endpoint**: `GET /health`
+- **Port**: 8083
+- **Response**: same as gitserver
+
+### email-worker
+- **Endpoint**: `GET /health`
+- **Port**: 8084
+- **Response**: same as gitserver
+
+## Metrics reference
+
+### git-hook metrics (`/metrics`, port 8083)
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `hook_tasks_total` | Counter | `task_type` | Total task executions, per task type |
+| `hook_tasks_success_total` | Counter | `task_type` | Tasks completed successfully |
+| `hook_tasks_failed_total` | Counter | `task_type` | Failed tasks (excluding retries) |
+| `hook_tasks_locked_total` | Counter | - | Times a task was re-queued because the repository was locked by another worker |
+| `hook_tasks_retried_total` | Counter | - | Retries triggered |
+| `hook_tasks_exhausted_total` | Counter | - | Tasks dropped after exhausting their retries |
+| `hook_sync_branches_changed_total` | Counter | - | Total branch changes produced by syncs |
+| `hook_sync_tags_changed_total` | Counter | - | Total tag changes produced by syncs |
+
+**Task types (`task_type`)**:
+- `Sync`: full sync (refs + commits + tags + LFS + fsck + gc + skills)
+- `Fsck`: repository integrity check only
+- `Gc`: garbage collection only
+
+**Usage examples**:
+```promql
+# Task success ratio (cumulative since process start)
+hook_tasks_success_total / hook_tasks_total
+
+# Failures per second, Sync tasks only
+rate(hook_tasks_failed_total{task_type="Sync"}[5m])
+
+# Repository lock frequency
+rate(hook_tasks_locked_total[15m])
+```
+
+---
+
+### email-worker metrics (`/metrics`, port 8084)
+
+| Metric | Type | Labels | Description |
+|--------|------|--------|-------------|
+| `email_queued_total` | Counter | - | Emails written to the Redis Stream (producer side) |
+| `email_consumed_total` | Counter | - | Emails consumed from the queue (counted per batch) |
+| `email_batch_size` | Counter | - | Cumulative sum of consumed batch sizes |
+| `email_validation_skipped_total` | Counter | - | Emails skipped because recipient address validation failed |
+| `email_build_errors_total` | Counter | - | Failures while building the email message |
+| `email_send_attempts_total` | Counter | - | SMTP send attempts (including retries) |
+| `email_sent_total` | Counter | - | Emails sent successfully |
+| `email_send_failures_total` | Counter | - | Emails that ultimately failed after 3 retries |
+
+**Usage examples**:
+```promql
+# Email send success ratio
+email_sent_total / (email_sent_total + email_send_failures_total)
+
+# Consume-to-enqueue ratio (sustained values below 1 mean the backlog is growing)
+rate(email_consumed_total[5m]) / rate(email_queued_total[5m])
+
+# Validation failure ratio
+rate(email_validation_skipped_total[5m]) / rate(email_queued_total[5m])
+```
+
+---
+
+### app / gitserver metrics (`/metrics`)
+
+Exported by the `observability` crate. The general-purpose metrics are:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `ai_calls_total` | Counter | Total AI chat calls |
+| `ai_calls_success` | Counter | Successful AI calls |
+| `ai_calls_failure` | Counter | Failed AI calls |
+| `ai_input_tokens_total` | Counter | Cumulative input tokens |
+| `ai_output_tokens_total` | Counter | Cumulative output tokens |
+| `ai_function_calls_total` | Counter | AI function/tool calls |
+| `http_requests_total` | Counter | Total HTTP requests |
+| `http_request_duration_ms_total` | Counter | Cumulative HTTP request duration (ms) |
+| `http_requests_by_status_class` | Gauge | Requests grouped by status class (`2xx`, `4xx`, `5xx`) |
+
+## Kubernetes probe configuration
+
+### gitserver
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8021
+  initialDelaySeconds: 5
+  periodSeconds: 10
+  timeoutSeconds: 3
+  failureThreshold: 3
+
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 8021
+  initialDelaySeconds: 5
+  periodSeconds: 5
+  timeoutSeconds: 3
+  failureThreshold: 3
+```
+
+### git-hook
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8083
+  initialDelaySeconds: 10
+  periodSeconds: 15
+  timeoutSeconds: 5
+  failureThreshold: 3
+
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 8083
+  initialDelaySeconds: 5
+  periodSeconds: 10
+  timeoutSeconds: 3
+  failureThreshold: 3
+```
+
+### email-worker
+
+```yaml
+livenessProbe:
+  httpGet:
+    path: /health
+    port: 8084
+  initialDelaySeconds: 10
+  periodSeconds: 30
+  timeoutSeconds: 5
+  failureThreshold: 3
+
+readinessProbe:
+  httpGet:
+    path: /health
+    port: 8084
+  initialDelaySeconds: 5
+  periodSeconds: 15
+  timeoutSeconds: 3
+  failureThreshold: 3
+```
+
+## Example alerting rules (Alertmanager / Grafana)
+
+```yaml
+# Git Hook worker down
+- alert: GitHookDown
+  expr: up{job="gitdata-git-hook"} == 0
+  for: 1m
+  labels:
+    severity: critical
+  annotations:
+    summary: "Git Hook worker is unreachable"
+
+# Git Hook task failure rate too high
+- alert: GitHookHighFailureRate
+  expr: |
+    rate(hook_tasks_failed_total[5m])
+      / rate(hook_tasks_total[5m]) > 0.1
+  for: 5m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Git Hook task failure rate is above 10%"
+
+# Email Worker down
+- alert: EmailWorkerDown
+  expr: up{job="gitdata-email-worker"} == 0
+  for: 1m
+  labels:
+    severity: critical
+  annotations:
+    summary: "Email Worker is unreachable"
+
+# Email send failure rate too high
+- alert: EmailHighFailureRate
+  expr: |
+    rate(email_send_failures_total[5m])
+      / (rate(email_sent_total[5m]) + rate(email_send_failures_total[5m])) > 0.05
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Email send failure rate is above 5%"
+```
+
+## Sitemap endpoints
+
+| Endpoint | Description |
+|----------|-------------|
+| `GET /sitemap.xml` | sitemapindex that references all child sitemaps |
+| `GET /sidemap/static` | Static pages (home, auth, marketing pages) |
+| `GET /sidemap/users` | Public user pages |
+| `GET /sidemap/projects` | Public project pages |
+| `GET /sidemap/repos` | Public repository pages |
+| `GET /robots.txt` | robots.txt, declares the Sitemap location |
+
+User, project, and repository sitemap data is cached in Redis (8h TTL). The Sitemap URL in `robots.txt` is built at request time from the `APP_DOMAIN_URL` environment variable and always uses the `https://` prefix.
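
Reviewer note on the `robots.txt` behavior described in the Sitemap section: the following is a minimal Python sketch of how a Sitemap URL could be derived from `APP_DOMAIN_URL` with a forced `https://` prefix. It is not the project's implementation. The function names `sitemap_url` and `robots_txt`, the stripping of an existing `http://` or `https://` scheme, and the `User-agent` / `Allow` lines are all assumptions made for illustration; only the environment variable name, the `/sitemap.xml` path, and the forced `https://` prefix come from the document.

```python
import os


def sitemap_url(domain: str) -> str:
    """Build an https:// sitemap URL from a configured domain value.

    Hypothetical helper: the real service may normalize differently.
    """
    host = domain.strip()
    lowered = host.lower()
    # Drop any scheme the operator included so https:// can be forced.
    for scheme in ("https://", "http://"):
        if lowered.startswith(scheme):
            host = host[len(scheme):]
            break
    # Avoid a double slash before the path.
    return f"https://{host.rstrip('/')}/sitemap.xml"


def robots_txt(domain: str) -> str:
    """Render a robots.txt body that declares the sitemap location.

    The User-agent and Allow lines are placeholders, not taken from the project.
    """
    return f"User-agent: *\nAllow: /\nSitemap: {sitemap_url(domain)}\n"


if __name__ == "__main__":
    # "example.com" is a placeholder default for local experimentation.
    print(robots_txt(os.environ.get("APP_DOMAIN_URL", "example.com")))
```

Normalizing the scheme in one place keeps the generated URL stable whether `APP_DOMAIN_URL` is set to a bare host, an `http://` URL, or an `https://` URL with a trailing slash.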