gitdataai/docs/metrics.md
ZhenYi a4dd25304c docs: add monitoring metrics operations document
Covers: endpoints, metric list, Prometheus scrape configuration,
K8s probe YAML, Alertmanager alert rule examples
2026-04-26 00:10:45 +08:00


# Monitoring Metrics
## Overview
This project exposes Prometheus metrics for the following services:
| Service | Endpoint | Port | Type |
|------|------|------|------|
| app (main site) | `GET /metrics` | 8080 | actix-web |
| gitserver | `GET :8021/metrics` | 8021 | actix-web |
| git-hook | `GET :8083/metrics` | 8083 | hyper (built-in) |
| email-worker | `GET :8084/metrics` | 8084 | hyper (built-in) |
## Scrape Configuration
### Prometheus scrape configuration
```yaml
scrape_configs:
  - job_name: 'gitdata-app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'
  - job_name: 'gitdata-gitserver'
    static_configs:
      - targets: ['gitserver:8021']
    metrics_path: '/metrics'
  - job_name: 'gitdata-git-hook'
    static_configs:
      - targets: ['git-hook:8083']
    metrics_path: '/metrics'
  - job_name: 'gitdata-email-worker'
    static_configs:
      - targets: ['email-worker:8084']
    metrics_path: '/metrics'
```
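To spot-check what a scrape target returns, the text exposition format can be parsed by hand. The following is a minimal sketch, not the project's tooling: `parse_samples` is a hypothetical helper that handles only simple `name{labels} value` lines (no `}` inside label values), and the sample payload, including its `HELP` text and values, is invented for illustration.

```python
import re

# Matches `name value` or `name{labels} value`; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

def parse_samples(text: str) -> dict:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no samples
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[name + (labels or "")] = float(value)
    return samples

# Illustrative payload only; the values are made up.
payload = """\
# HELP hook_tasks_total Total tasks executed
# TYPE hook_tasks_total counter
hook_tasks_total{task_type="Sync"} 120
hook_tasks_locked_total 4
"""
print(parse_samples(payload))
```

In practice the payload would come from an HTTP GET against one of the targets listed above; the parsing step is kept separate so it can be exercised without a running service.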
## Health Check Endpoints
### gitserver
- **Endpoint**: `GET /health`
- **Port**: 8021
- **Response**:
```json
{ "status": "ok", "db": "ok", "cache": "ok" }
```
Returns `503` when either the db or the cache is unavailable.
### git-hook
- **Endpoint**: `GET /health`
- **Port**: 8083
- **Response**: same as above
### email-worker
- **Endpoint**: `GET /health`
- **Port**: 8084
- **Response**: same as above
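A client that polls `/health` usually wants to know which dependency failed, not just that the status was `503`. The sketch below shows one way to interpret the response. It assumes that an unhealthy component carries any value other than `"ok"`; the document only shows the healthy body, so the `"error"` value used here is hypothetical, as is the helper name `failing_components`.

```python
import json

def failing_components(status_code: int, body: str) -> list:
    """Return the components reported as unhealthy in a /health response."""
    if status_code == 200:
        return []
    report = json.loads(body)
    # Assumption: any value other than "ok" marks a failed component.
    return sorted(
        name for name, state in report.items()
        if name != "status" and state != "ok"
    )

# Hypothetical 503 body; the real error values may differ.
print(failing_components(503, '{"status": "error", "db": "ok", "cache": "error"}'))
```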
## Metrics Reference
### git-hook metrics (`/metrics`, port 8083)
| Metric | Type | Labels | Description |
|--------|------|------|------|
| `hook_tasks_total` | Counter | `task_type` | Total executions per task type |
| `hook_tasks_success_total` | Counter | `task_type` | Tasks completed successfully |
| `hook_tasks_failed_total` | Counter | `task_type` | Failed tasks (excluding retries) |
| `hook_tasks_locked_total` | Counter | - | Times a task was re-queued because the repository was locked by another worker |
| `hook_tasks_retried_total` | Counter | - | Times a retry was triggered |
| `hook_tasks_exhausted_total` | Counter | - | Tasks dropped after retries were exhausted |
| `hook_sync_branches_changed_total` | Counter | - | Total branch changes produced by syncs |
| `hook_sync_tags_changed_total` | Counter | - | Total tag changes produced by syncs |
**Task types (`task_type`)**:
- `Sync`: full sync (refs + commits + tags + LFS + fsck + gc + skills)
- `Fsck`: repository integrity check only
- `Gc`: garbage collection only
**Usage examples**:
```promql
# Task success ratio (lifetime counters, per task_type)
hook_tasks_success_total / hook_tasks_total
# Failures per second (by type)
rate(hook_tasks_failed_total{task_type="Sync"}[5m])
# Repository lock frequency
rate(hook_tasks_locked_total[15m])
```
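Dividing raw counters gives a ratio over the whole process lifetime, which reacts slowly to a recent burst of failures. A ratio of counter increases over a window, comparable in spirit to `increase(hook_tasks_success_total[5m]) / increase(hook_tasks_total[5m])`, is usually more useful for dashboards. The sketch below illustrates the difference with made-up sample values; `windowed_ratio` is a hypothetical helper and does not correct for counter resets the way Prometheus does.

```python
def windowed_ratio(num_start, num_end, den_start, den_end):
    """Ratio of two counter increases over one window, or None if undefined."""
    num_delta = num_end - num_start
    den_delta = den_end - den_start
    if den_delta <= 0:
        # No tasks ran in the window, or the counter was reset.
        return None
    return num_delta / den_delta

# Hypothetical samples taken at the start and end of a 5-minute window:
# success counter 980 -> 990, total counter 980 -> 1000.
lifetime = 990 / 1000                          # raw counter ratio
recent = windowed_ratio(980, 990, 980, 1000)   # 10 successes out of 20 tasks
print(lifetime, recent)
```

With these numbers the lifetime ratio still reads 0.99 while the windowed ratio has already dropped to 0.5.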
---
### email-worker metrics (`/metrics`, port 8084)
| Metric | Type | Labels | Description |
|--------|------|------|------|
| `email_queued_total` | Counter | - | Total emails written to the Redis Stream (producer side) |
| `email_consumed_total` | Counter | - | Total emails consumed from the queue (incremented per batch) |
| `email_batch_size` | Counter | - | Cumulative sum of consumed batch sizes |
| `email_validation_skipped_total` | Counter | - | Emails skipped because recipient address validation failed |
| `email_build_errors_total` | Counter | - | Email message build failures |
| `email_send_attempts_total` | Counter | - | Total SMTP send attempts (including retries) |
| `email_sent_total` | Counter | - | Emails sent successfully |
| `email_send_failures_total` | Counter | - | Emails that ultimately failed after 3 retries |
**Usage examples**:
```promql
# Email send success ratio
email_sent_total / (email_sent_total + email_send_failures_total)
# Consumption-to-production ratio (below 1 means the queue backlog is growing)
rate(email_consumed_total[5m]) / rate(email_queued_total[5m])
# Validation failure ratio
rate(email_validation_skipped_total[5m]) / rate(email_queued_total[5m])
```
---
### app / gitserver metrics (`/metrics`)
Exported by the `observability` crate, which includes the following common metrics:
| Metric | Type | Description |
|--------|------|------|
| `ai_calls_total` | Counter | Total AI chat calls |
| `ai_calls_success` | Counter | Successful AI calls |
| `ai_calls_failure` | Counter | Failed AI calls |
| `ai_input_tokens_total` | Counter | Cumulative input tokens |
| `ai_output_tokens_total` | Counter | Cumulative output tokens |
| `ai_function_calls_total` | Counter | AI function/tool calls |
| `http_requests_total` | Counter | Total HTTP requests |
| `http_request_duration_ms_total` | Counter | Cumulative HTTP request duration (ms) |
| `http_requests_by_status_class` | Gauge | Request count by status class (`2xx`, `4xx`, `5xx`) |
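Because request duration is exported as a cumulative counter rather than a histogram, only a mean latency can be derived: the increase in `http_request_duration_ms_total` divided by the increase in `http_requests_total` over the same window. In PromQL this would be along the lines of `rate(http_request_duration_ms_total[5m]) / rate(http_requests_total[5m])`. The sketch below works through the arithmetic with invented sample values; `mean_latency_ms` is a hypothetical helper.

```python
def mean_latency_ms(duration_start, duration_end, count_start, count_end):
    """Mean request latency in ms over one window, or None if no requests."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # idle window or counter reset
    return (duration_end - duration_start) / requests

# Hypothetical samples: 100 requests took 12,000 ms in total during the window.
print(mean_latency_ms(50_000, 62_000, 1_000, 1_100))
```

Percentiles such as p95 cannot be recovered from these two counters; that would require histogram buckets, which this metric list does not include.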
## Kubernetes Probe Configuration
### gitserver
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8021
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8021
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
### git-hook
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8083
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```
### email-worker
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8084
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8084
  initialDelaySeconds: 5
  periodSeconds: 15
  timeoutSeconds: 3
  failureThreshold: 3
```
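The probe settings determine roughly how long a hung service can go unnoticed before the kubelet restarts it. As a rough estimate only, the sketch below assumes a simplified model: `failureThreshold` consecutive failures are needed, one per `periodSeconds`, and the final probe takes its full `timeoutSeconds` to fail. The `Probe` class and its method are hypothetical, and actual kubelet timing can differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Probe:
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def approx_detection_seconds(self) -> int:
        # Assumed model: one failed probe per period until the threshold is
        # reached, plus the timeout of the last probe.
        return self.period_seconds * self.failure_threshold + self.timeout_seconds

# Liveness probe values taken from the YAML above.
liveness = {
    "gitserver": Probe(period_seconds=10, timeout_seconds=3, failure_threshold=3),
    "git-hook": Probe(period_seconds=15, timeout_seconds=5, failure_threshold=3),
    "email-worker": Probe(period_seconds=30, timeout_seconds=5, failure_threshold=3),
}
for name, probe in liveness.items():
    print(name, probe.approx_detection_seconds())
```

Under this model, email-worker has the slowest detection of the three, which follows from its 30-second liveness period.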
## Example Alert Rules (Alertmanager / Grafana)
```yaml
# Git Hook worker down
- alert: GitHookDown
  expr: up{job="gitdata-git-hook"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Git Hook worker is unreachable"
# Git Hook task failure ratio too high
- alert: GitHookHighFailureRate
  expr: |
    rate(hook_tasks_failed_total[5m])
    / rate(hook_tasks_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Git Hook task failure ratio exceeds 10%"
# Email worker down
- alert: EmailWorkerDown
  expr: up{job="gitdata-email-worker"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Email Worker is unreachable"
# Email send failure ratio too high
- alert: EmailHighFailureRate
  expr: |
    rate(email_send_failures_total[5m])
    / (rate(email_sent_total[5m]) + rate(email_send_failures_total[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Email send failure ratio exceeds 5%"
```
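The `for:` clause means an alert only fires once its expression has stayed true across consecutive evaluations for the given duration; a single evaluation where the condition clears resets the pending state. The following is a simplified simulation of that behavior, not Prometheus code: `firing_time`, the 60-second evaluation interval, and the sample sequence are all assumptions for illustration.

```python
def firing_time(evaluations, for_seconds):
    """Return the first timestamp at which the alert fires, or None.

    evaluations: list of (timestamp_seconds, condition_true) in time order.
    """
    active_since = None
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None       # condition cleared: pending state resets
            continue
        if active_since is None:
            active_since = timestamp  # alert becomes pending
        if timestamp - active_since >= for_seconds:
            return timestamp          # held long enough: alert fires
    return None

# Evaluated every 60 s; the condition flaps once, then stays true.
evals = [(0, True), (60, False), (120, True), (180, True),
         (240, True), (300, True), (360, True), (420, True)]
print(firing_time(evals, for_seconds=300))
```

Here the brief recovery at 60 s restarts the wait, so a `for: 5m` rule does not fire until 420 s rather than 300 s.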
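## Sitemap Endpoints
| Endpoint | Description |
|------|------|
| `GET /sitemap.xml` | `sitemapindex` that references all child sitemaps |
| `GET /sidemap/static` | Static pages (home, auth, marketing pages) |
| `GET /sidemap/users` | Public user pages |
| `GET /sidemap/projects` | Public project pages |
| `GET /sidemap/repos` | Public repository pages |
| `GET /robots.txt` | `robots.txt` that declares the Sitemap location |

The user, project, and repository sitemap data is cached in Redis (8h TTL). The Sitemap URL in `robots.txt` is read dynamically from the `APP_DOMAIN_URL` environment variable and always uses the `https://` prefix.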
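The `https://` enforcement can be sketched as follows. This is an illustration of the described behavior, not the service's actual code: the helper names, the `User-agent`/`Allow` lines, and the `example.com` fallback are assumptions.

```python
import os

def sitemap_url(domain_url: str) -> str:
    """Strip any scheme from the configured domain and force https://."""
    host = domain_url.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]
            break
    return "https://" + host.rstrip("/") + "/sitemap.xml"

def robots_txt(domain_url: str) -> str:
    # The User-agent/Allow lines are placeholders; only the Sitemap line
    # reflects behavior described in this document.
    return "User-agent: *\nAllow: /\nSitemap: " + sitemap_url(domain_url) + "\n"

# "example.com" is a placeholder default, not the project's domain.
print(robots_txt(os.environ.get("APP_DOMAIN_URL", "example.com")))
```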