docs: add monitoring metrics operations document
Covers: endpoints, metrics list, Prometheus scrape configuration, K8s probe YAML, Alertmanager alert rule examples
parent d593354ba9
commit a4dd25304c

docs/metrics.md (new file, 262 lines)

# Monitoring Metrics Documentation

## Overview

This project exposes Prometheus metrics for the following services:

| Service | Endpoint | Port | HTTP server |
|---------|----------|------|-------------|
| app (main site) | `GET /metrics` | 8080 | actix-web |
| gitserver | `GET :8021/metrics` | 8021 | actix-web |
| git-hook | `GET :8083/metrics` | 8083 | hyper (built-in) |
| email-worker | `GET :8084/metrics` | 8084 | hyper (built-in) |

## Scrape Configuration

### Prometheus scrape configuration

```yaml
scrape_configs:
  - job_name: 'gitdata-app'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/metrics'

  - job_name: 'gitdata-gitserver'
    static_configs:
      - targets: ['gitserver:8021']
    metrics_path: '/metrics'

  - job_name: 'gitdata-git-hook'
    static_configs:
      - targets: ['git-hook:8083']
    metrics_path: '/metrics'

  - job_name: 'gitdata-email-worker'
    static_configs:
      - targets: ['email-worker:8084']
    metrics_path: '/metrics'
```

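Before pointing Prometheus at a target, it can help to fetch `/metrics` by hand and check what comes back. The sketch below parses a small subset of the Prometheus text exposition format. The sample payload and its values are invented for illustration, and `parse_metrics` is a hypothetical helper, not part of this project.

```python
import re

# Matches lines such as: hook_tasks_total{task_type="Sync"} 42
# Simplified: assumes label values contain no commas, quotes, or escapes.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')


def parse_metrics(text):
    """Parse exposition text into {(name, sorted label tuple): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = []
        for pair in (raw_labels or "").split(","):
            if "=" in pair:
                key, val = pair.split("=", 1)
                labels.append((key.strip(), val.strip().strip('"')))
        samples[(name, tuple(sorted(labels)))] = float(value)
    return samples


# Illustrative payload; the numbers are invented.
SAMPLE = '''
# TYPE hook_tasks_total counter
hook_tasks_total{task_type="Sync"} 42
hook_tasks_total{task_type="Gc"} 7
hook_tasks_locked_total 3
'''

parsed = parse_metrics(SAMPLE)
print(parsed[("hook_tasks_total", (("task_type", "Sync"),))])
```

Keying samples by name plus sorted labels keeps series with different `task_type` values apart, which mirrors how Prometheus identifies a time series.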
## Health Check Endpoints

### gitserver

- **Endpoint**: `GET /health`
- **Port**: 8021
- **Response**:
  ```json
  { "status": "ok", "db": "ok", "cache": "ok" }
  ```
  or `503` when either the db or the cache is unavailable

### git-hook

- **Endpoint**: `GET /health`
- **Port**: 8083
- **Response**: same as gitserver

### email-worker

- **Endpoint**: `GET /health`
- **Port**: 8084
- **Response**: same as gitserver

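A client-side check of these endpoints might look like the sketch below. It assumes the healthy response uses status `200` (the document only states the `503` case) and that a non-ok component would show up as a value other than `"ok"`. `is_healthy` is a hypothetical helper.

```python
import json


def is_healthy(status_code, body):
    """Return True only for a 200 response whose JSON body reports all-ok."""
    if status_code != 200:  # assumption: 200 is the healthy status code
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return all(payload.get(key) == "ok" for key in ("status", "db", "cache"))


print(is_healthy(200, '{ "status": "ok", "db": "ok", "cache": "ok" }'))
print(is_healthy(503, ""))
```

Checking the body as well as the status code guards against a proxy answering `200` on behalf of a service that is down.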
## Metrics List

### git-hook metrics (`/metrics`, port 8083)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `hook_tasks_total` | Counter | `task_type` | Total number of task executions, by task type |
| `hook_tasks_success_total` | Counter | `task_type` | Number of tasks completed successfully |
| `hook_tasks_failed_total` | Counter | `task_type` | Number of failed tasks (excluding retries) |
| `hook_tasks_locked_total` | Counter | - | Number of times a task was re-queued because the repository was locked by another worker |
| `hook_tasks_retried_total` | Counter | - | Number of retries triggered |
| `hook_tasks_exhausted_total` | Counter | - | Number of tasks dropped after retries were exhausted |
| `hook_sync_branches_changed_total` | Counter | - | Total number of branch changes produced by syncs |
| `hook_sync_tags_changed_total` | Counter | - | Total number of tag changes produced by syncs |

**Task types (`task_type`)**:

- `Sync`: full sync (refs + commits + tags + LFS + fsck + gc + skills)
- `Fsck`: repository integrity check only
- `Gc`: garbage collection only

**Usage examples**:

```promql
# Task success ratio (raw counters, so cumulative since the process started)
hook_tasks_success_total / hook_tasks_total

# Failures per second (by type)
rate(hook_tasks_failed_total{task_type="Sync"}[5m])

# Repository lock frequency
rate(hook_tasks_locked_total[15m])
```

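The first query above divides raw counter values, so it describes everything since the counters last reset, while the `rate(...)` queries look only at a recent window. The arithmetic behind a windowed success ratio can be sketched as follows. `counter_increase` and `success_ratio` are hypothetical helpers, the sample values are invented, and the reset handling is a simplified two-sample version of what Prometheus does internally.

```python
def counter_increase(previous, current):
    """Increase between two counter samples, tolerating one reset to zero."""
    if current >= previous:
        return current - previous
    # The counter went backwards, so the process restarted; count from zero.
    return current


def success_ratio(total_prev, total_curr, ok_prev, ok_curr):
    """Share of tasks that succeeded in the window, or None with no traffic."""
    total = counter_increase(total_prev, total_curr)
    if total == 0:
        return None  # nothing ran in the window; avoid dividing by zero
    return counter_increase(ok_prev, ok_curr) / total


# Two scrapes of hook_tasks_total and hook_tasks_success_total (invented values):
# 20 tasks ran in the window and 18 of them succeeded.
print(success_ratio(100, 120, 90, 108))
```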

---

### email-worker metrics (`/metrics`, port 8084)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `email_queued_total` | Counter | - | Total number of emails written to the Redis Stream (producer side) |
| `email_consumed_total` | Counter | - | Total number of emails consumed from the queue (incremented per batch) |
| `email_batch_size` | Counter | - | Cumulative sum of consumed batch sizes |
| `email_validation_skipped_total` | Counter | - | Number of emails skipped because recipient address validation failed |
| `email_build_errors_total` | Counter | - | Number of failures while building the email message |
| `email_send_attempts_total` | Counter | - | Total number of SMTP send attempts (including retries) |
| `email_sent_total` | Counter | - | Number of emails sent successfully |
| `email_send_failures_total` | Counter | - | Number of emails that ultimately failed after 3 retries |

**Usage examples**:

```promql
# Email send success ratio
email_sent_total / (email_sent_total + email_send_failures_total)

# Queue backlog ratio (consumed vs. queued; below 1 means the backlog is growing)
rate(email_consumed_total[5m]) / rate(email_queued_total[5m])

# Validation failure ratio
rate(email_validation_skipped_total[5m]) / rate(email_queued_total[5m])
```

---

### app / gitserver metrics (`/metrics`)

Exported by the `observability` crate. The following general-purpose metrics are included:

| Metric | Type | Description |
|--------|------|-------------|
| `ai_calls_total` | Counter | Total number of AI chat calls |
| `ai_calls_success` | Counter | Number of successful AI calls |
| `ai_calls_failure` | Counter | Number of failed AI calls |
| `ai_input_tokens_total` | Counter | Cumulative number of input tokens |
| `ai_output_tokens_total` | Counter | Cumulative number of output tokens |
| `ai_function_calls_total` | Counter | Number of AI function/tool calls |
| `http_requests_total` | Counter | Total number of HTTP requests |
| `http_request_duration_ms_total` | Counter | Cumulative HTTP request duration (ms) |
| `http_requests_by_status_class` | Gauge | Number of requests by status class (`2xx`, `4xx`, `5xx`) |

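Because `http_request_duration_ms_total` is documented as a running sum, latency can only be recovered from it as an average: the increase in total duration divided by the increase in request count over the same window. A minimal sketch with invented sample values follows; `average_latency_ms` is a hypothetical helper.

```python
def average_latency_ms(duration_prev, duration_curr, requests_prev, requests_curr):
    """Mean request latency in ms between two scrapes, or None if unknown."""
    requests = requests_curr - requests_prev
    if requests <= 0:
        return None  # no new requests, or a counter reset between scrapes
    return (duration_curr - duration_prev) / requests


# 3000 ms of request time spread over 60 requests between two scrapes.
print(average_latency_ms(12_000, 15_000, 400, 460))
```

The same calculation in PromQL would be `rate(http_request_duration_ms_total[5m]) / rate(http_requests_total[5m])`. An average hides tail latency, so percentiles cannot be derived from these two counters.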
## Kubernetes Probe Configuration

### gitserver

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8021
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8021
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

### git-hook

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8083
  initialDelaySeconds: 10
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```

### email-worker

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8084
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8084
  initialDelaySeconds: 5
  periodSeconds: 15
  timeoutSeconds: 3
  failureThreshold: 3
```

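The probe settings above determine roughly how long a persistently failing container keeps running before Kubernetes acts on it: about `periodSeconds * failureThreshold`. The sketch below works this out for the liveness probes. It is an approximation that ignores `timeoutSeconds` and scheduling jitter, and `detection_window_seconds` is a hypothetical helper.

```python
def detection_window_seconds(period_seconds, failure_threshold):
    """Approximate time to notice a persistent failure (ignores timeouts)."""
    return period_seconds * failure_threshold


# (periodSeconds, failureThreshold) copied from the liveness probes above.
LIVENESS = {
    "gitserver": (10, 3),
    "git-hook": (15, 3),
    "email-worker": (30, 3),
}

for service, (period, threshold) in LIVENESS.items():
    print(service, detection_window_seconds(period, threshold))
```

With these values, email-worker tolerates the longest outage before a restart, which fits a background worker where a slow restart is less visible than on gitserver.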
## Alert Rule Examples (Alertmanager / Grafana)

```yaml
# Git Hook Worker Down
- alert: GitHookDown
  expr: up{job="gitdata-git-hook"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Git Hook worker is unreachable"

# Git Hook task failure rate too high
- alert: GitHookHighFailureRate
  expr: |
    rate(hook_tasks_failed_total[5m])
    / rate(hook_tasks_total[5m]) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Git Hook task failure rate exceeds 10%"

# Email Worker Down
- alert: EmailWorkerDown
  expr: up{job="gitdata-email-worker"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Email Worker is unreachable"

# Email send failure rate too high
- alert: EmailHighFailureRate
  expr: |
    rate(email_send_failures_total[5m])
    / (rate(email_sent_total[5m]) + rate(email_send_failures_total[5m])) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Email send failure rate exceeds 5%"
```

## Sitemap Service Endpoints

| Endpoint | Description |
|----------|-------------|
| `GET /sitemap.xml` | Sitemap index that references all child sitemaps |
| `GET /sidemap/static` | Static pages (home page, auth, marketing pages) |
| `GET /sidemap/users` | Public user pages |
| `GET /sidemap/projects` | Public project pages |
| `GET /sidemap/repos` | Public repository pages |
| `GET /robots.txt` | robots.txt, which declares the Sitemap location |

Sitemap data for users, projects, and repositories is cached in Redis (8h TTL). The Sitemap URL in `robots.txt` is read dynamically from the `APP_DOMAIN_URL` environment variable and always uses the `https://` prefix.

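The `https://` normalization described above could work along these lines. This is a sketch of the idea, not the project's actual implementation: `sitemap_url` and `sitemap_line` are hypothetical helpers, and the `example.com` fallback is a placeholder.

```python
import os


def sitemap_url(domain_url):
    """Build the sitemap index URL, forcing an https:// prefix."""
    host = domain_url.strip()
    for scheme in ("https://", "http://"):
        if host.lower().startswith(scheme):
            host = host[len(scheme):]  # drop whatever scheme was configured
            break
    return "https://" + host.rstrip("/") + "/sitemap.xml"


def sitemap_line(domain_url):
    """Render the Sitemap directive as it would appear in robots.txt."""
    return "Sitemap: " + sitemap_url(domain_url)


# APP_DOMAIN_URL comes from the document; the fallback is a placeholder.
print(sitemap_line(os.environ.get("APP_DOMAIN_URL", "example.com")))
```

Stripping any configured scheme before prepending `https://` means a value such as `http://example.com/` and a bare `example.com` both yield the same URL.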