Commit Graph

29 Commits

Author SHA1 Message Date
ZhenYi
1fed9fc8ab fix(git/hook): address review findings — fs blocking, redis timeout, backoff, slog
- sync/mod.rs: wrap scan_skills_from_dir in spawn_blocking to avoid
  blocking the async executor; use to_path_buf() to get owned PathBuf
- pool/worker.rs: replace 500ms poll sleep with cancellation-is_cancelled
  check (eliminates artificial latency); add exponential backoff on Redis
  errors (1s base, 32s cap, reset on success)
- pool/redis.rs: add 5s timeout on pool.get() for all three methods
  (next, ack_raw, nak_with_retry) to prevent indefinite blocking on
  unresponsive Redis
- sync/gc.rs: add comment explaining why git gc --auto non-zero exit
  is benign
- webhook_dispatch.rs: remove unnecessary format! wrappers in slog macros
- config/hook.rs: document max_concurrent intent (K8s operator/HPA, not
  the single-threaded worker itself)
2026-04-17 13:20:31 +08:00
ZhenYi
ef61b193c4 fix(git/hook): refine Redis queue worker, remove dead code, fix warnings
- pool/mod.rs: pass shared http_client Arc to HookWorker
- worker.rs: remove double-locking (sync() manages its own lock),
  await all webhook handles before returning, share http_client,
  hoist namespace query out of loop
- redis.rs: atomic NAK via Lua script (LREM + LPUSH in one eval)
- sync/lock.rs: increase LOCK_TTL from 60s to 300s for large repos
- sync/mod.rs: split sync/sync_work, fsck_only/fsck_work, gc_only/gc_work
  so callers can choose locked vs lock-free path; run_gc + sync_skills
  outside the DB transaction
- hook/mod.rs: remove unused http field from HookService
- ssh/mod.rs, http/mod.rs: remove unused HookService/http imports
2026-04-17 13:05:07 +08:00
ZhenYi
8fb2436f22 feat(git): add Redis-backed hook worker with per-repo distributed locking
- pool/worker.rs: single-threaded consumer that BLMPOPs from Redis queues
  sequentially. K8s replicas provide HA — each pod runs one worker.
- pool/redis.rs: RedisConsumer with BLMOVE atomic dequeue, ACK/NAK, and
  retry-with-json support.
- pool/types.rs: HookTask, TaskType, PoolConfig (minimal — no pool metrics).
- sync/lock.rs: Redis SET NX EX per-repo lock to prevent concurrent workers
  from processing the same repo. Lock conflicts are handled by requeueing
  without incrementing retry count.
- hook/mod.rs: HookService.start_worker() spawns the background worker.
- ssh/mod.rs / http/mod.rs: ReceiveSyncService RPUSHes to Redis queue.
  Both run_http and run_ssh call start_worker() to launch the consumer.
- Lock conflicts (GitError::Locked) in the worker are requeued without
  incrementing retry_count so another worker can pick them up.
2026-04-17 12:33:58 +08:00
ZhenYi
eeb99bf628 refactor(git): drop hook pool, sync execution is now direct and sequential
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
- Remove entire pool/ directory (RedisConsumer, CpuMonitor, LogStream, HookTask, TaskType)
- Remove Redis distributed lock (acquire_lock/release_lock) — K8s StatefulSet
  scheduling guarantees exclusive access per repo shard
- Remove sync/lock.rs, sync/remote.rs, sync/status.rs (dead code)
- Remove hook/event.rs (GitHookEvent was never used)
- New HookService exposes sync_repo / fsck_repo / gc_repo directly
- ReceiveSyncService now calls HookService inline instead of LPUSH to Redis queue
- sync/mod.rs: git2 operations wrapped in spawn_blocking for Send safety
  (git2 types are not Send — async git2 operations must not cross await points)
- scripts/push.js: drop 'frontend' from docker push list (embedded into static binary)
2026-04-17 12:22:09 +08:00
ZhenYi
7e42139989 feat(frontend): embed SPA assets into app binary at compile time
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
- Add libs/frontend crate: build.rs runs pnpm build, copies dist/ to
  OUT_DIR/dist_blobs/, generates frontend.rs with lazy_static! map
- libs/api/dist.rs serves embedded assets via serve_frontend handler
- Register /{path:.*} SPA fallback in route.rs (after /api/*)
- Remove frontend container from deploy: docker/frontend.Dockerfile,
  deploy/templates/frontend-*.yaml, values.yaml frontend section
- Update ingress: gitdata.ai root now routes to app service
- Update scripts: build.js removes frontend step, deploy.js removes frontend
2026-04-17 12:04:34 +08:00
ZhenYi
3de4fff11d feat(service): improve model sync and harden git HTTP/SSH stability
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
Model sync:
- Filter OpenRouter models by what the user's AI client can actually access,
  before upserting metadata (avoids bloating with inaccessible models).
- Fall back to direct endpoint sync when no OpenRouter metadata matches
  (handles Bailian/MiniMax and other non-OpenRouter providers).

Git stability fixes:
- SSH: add 5s timeout on stdin flush/shutdown in channel_eof and
  cleanup_channel to prevent blocking the event loop on unresponsive git.
- SSH: remove dbg!() calls from production code paths.
- HTTP auth: pass proper Logger to SshAuthService instead of discarding
  all auth events to slog::Discard.

Dependencies:
- reqwest: add native-tls feature for HTTPS on Windows/Linux/macOS.
2026-04-17 00:13:40 +08:00
ZhenYi
39d126d843 refactor(service): replace tracing with slog in agent sync module
Some checks are pending
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
All log calls in sync.rs now use slog macros:
- sync_once: uses logger passed as parameter (sourced from AppService.logs)
- sync_upstream_models (HTTP API): uses self.logs
- Remove use of tracing::warn/info/error entirely

The periodic background sync task and the HTTP API handler now
write to the same slog logger as the rest of the application.
2026-04-16 22:52:03 +08:00
ZhenYi
3a30150a41 refactor(service): remove all hardcoded model-name inference from OpenRouter sync
Drop all hard-coded model-name lookup tables that hardcoded specific
model names and prices:
- infer_context_length: remove GPT-4o/o1/Claude/etc. fallback table
- infer_max_output: remove GPT-4o/o1/etc. output token limits
- infer_pricing_fallback: remove entire hardcoded pricing table
- infer_capability_list: derive from architecture.modality only,
  no longer uses model name strings

Also fix stats: if upsert_version fails, skip counting and continue
rather than counting model but not version (which caused
versions_created=0 while pricing_created>0 inconsistency).
2026-04-16 22:47:24 +08:00
ZhenYi
0a998affbb refactor(git): remove SSH rate limiting
SSH is deployed inside Kubernetes cluster where rate limiting
at the application layer is unnecessary. Remove all SSH rate
limiter code:
- SshRateLimiter from SSHandle and SSHServer structs
- is_user_allowed checks in auth_publickey, auth_publickey_offered
- is_repo_access_allowed in exec_request
- is_ip_allowed in server::new_client
- rate_limiter module and start_cleanup
2026-04-16 22:40:59 +08:00
ZhenYi
9368df54da feat(service): auto-sync OpenRouter models on app startup and every 10 minutes
- Add `start_sync_task()` in agent/sync.rs: spawns a background task
  that syncs immediately on app startup, then every 10 minutes.
- `sync_once()` performs a single pass; errors are logged and swallowed
  so the periodic task never stops.
- Remove authentication requirement from OpenRouter API (no API key needed).
- Call `service.start_sync_task()` from main.rs after AppService init.
- Also update the existing `sync_upstream_models` (HTTP API) to remove
  the now-unnecessary API key requirement for consistency.
2026-04-16 22:35:34 +08:00
ZhenYi
329b526bfb fix(git): add storage_path existence pre-check to run_fsck and run_gc
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
Consistent with run_sync: fail fast before blocking the thread pool
if the repo storage path does not exist on disk.
2026-04-16 22:24:14 +08:00
ZhenYi
d26e947f8e fix(git): add storage_path existence pre-check in run_sync
Fail fast if repo storage_path does not exist on disk, before
blocking the thread pool with spawn_blocking. This prevents
invalid tasks from consuming blocking threads while failing.
2026-04-16 22:23:17 +08:00
ZhenYi
8a0d2885f7 fix(git): correct pool panic detection and add failure diagnostic logs
- Fix match pattern: `Ok(Ok(Err(_)))` must be treated as NOT a panic,
  since execute_task_body returns Err(()) on task failure (not a panic).
  Previously the Err path incorrectly set panicked=true, which could
  cause pool to appear unhealthy.
- Add "task failed" log before retry/discard decision so real error
  is visible in logs (previously only last_error on exhausted-retries
  path was logged, which required hitting 5 retries to see the cause).
- Convert 3 remaining pool/mod.rs shorthand logs to format!() pattern.
2026-04-16 22:21:30 +08:00
ZhenYi
bbf2d75fba fix(git): harden hook pool retry, standardize slog log format
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
- Add retry_count to HookTask with serde(default) for backwards compat
- Limit hook task retries to MAX_RETRIES=5, discard after limit to prevent
  infinite requeue loops that caused 'task nack'd and requeued' log spam
- Add nak_with_retry() in RedisConsumer to requeue with incremented count
- Standardize all slog logs: replace "info!(l, "msg"; "k" => v)" shorthand
  with "info!(l, "{}", format!("msg k={}", v))" across ssh/authz.rs,
  ssh/handle.rs, ssh/server.rs, hook/webhook_dispatch.rs, hook/pool/mod.rs
2026-04-16 21:41:35 +08:00
ZhenYi
02847ef1db fix(git): downgrade russh to 0.50.4, remove flate2 feature, fix log format
- Downgrade russh from 0.55.0 to 0.50.4
- Remove unused flate2 feature from russh dependency
- Use info!(logger, "{}", format!(...)) for channel lifecycle log messages
2026-04-16 20:58:01 +08:00
ZhenYi
1090359951 fix(git): add SSH channel lifecycle logging and fix password auth username check
- Remove user=="git" restriction from auth_password: the actual user is
  determined by the token, not the SSH username, matching Gitea's approach
- Add channel_open_session logging with explicit flush to verify
  CHANNEL_OPEN_CONFIRMATION reaches the client
- Add pty_request handler (reject with log) so git clients that request
  a PTY are handled gracefully instead of falling through to default
- Add subsystem_request handler (log + accept) so git subsystems are
  visible in logs
- Prefix unused variables with _ to eliminate warnings
2026-04-16 20:40:17 +08:00
ZhenYi
f5ab554d6b fix(git): add LFS upload size limits and fix HTTP rate limiter read/write counter
- Add LFS_MAX_OBJECT_SIZE (50 GiB) and validate object sizes in both the
  batch advisory check and the upload_object streaming loop to prevent
  unbounded disk usage from malicious clients
- Fix HTTP rate limiter: track read_count and write_count separately so
  a burst of writes cannot exhaust the read budget (previously all
  operations incremented read_count regardless of type)
2026-04-16 20:14:13 +08:00
ZhenYi
cef4ff1289 fix(git): harden HTTP and SSH git transports for robustness
HTTP:
- Return Err(...) instead of Ok(HttpResponse::...) for error cases so
  actix returns correct HTTP status codes instead of 200
- Add 30s timeout on info_refs and handle_git_rpc git subprocess calls
- Add 1MB pre-PACK limit to prevent memory exhaustion on receive-pack
- Enforce branch protection rules (forbid push/force-push/deletion/tag)
- Simplify graceful shutdown (remove manual signal handling)

SSH:
- Fix build_git_command: use block match arms so chained .arg() calls
  are on the Command, not the match expression's () result
- Add MAX_RETRIES=5 to forward() data-pump loop to prevent infinite
  spin on persistent network failures
- Fall back to raw path if canonicalize() fails instead of panicking
- Add platform-specific git config paths (/dev/null on unix, NUL on win)
- Start rate limiter cleanup background task so HashMap doesn't grow
  unbounded over time
- Derive Clone on RateLimiter so SshRateLimiter::start_cleanup works
2026-04-16 20:11:18 +08:00
ZhenYi
677e88980b fix(room): add edit rollback, clean stream channels on room shutdown/idle 2026-04-16 19:28:23 +08:00
ZhenYi
c89f01b718 feat(room): improve robustness — optimistic send, atomic seq, jitter reconnect
Backend:
- Atomic seq assignment via Redis Lua script: INCR + GET run atomically
  inside a Lua script, preventing duplicate seqs under concurrent requests.
  DB reconciliation only triggers on cross-server handoff (rare path).
- Broadcast channel capacity: 10,000 → 100,000 to prevent message drops
  under high-throughput rooms.

Frontend:
- Optimistic sendMessage: adds message to UI immediately (marked
  isOptimistic=true) so user sees it instantly. Replaces with
  server-confirmed message on success, marks as isOptimisticError on
  failure. Fire-and-forget to IndexedDB for persistence.
- Seq-based dedup in onRoomMessage: replaces optimistic message by
  matching seq, preventing duplicates when WS arrives before REST confirm.
- Reconnect jitter: replaced deterministic backoff with full jitter
  (random within backoff window), preventing thundering herd on server
  restart.
- Visual WS status dot in room header: green=connected, amber
  (pulsing)=connecting, red=error/disconnected.
- isPending check extended to cover both old 'temp-' prefix and new
  isOptimistic flag, showing 'Sending...' / 'Failed' badges.
2026-04-16 19:23:06 +08:00
ZhenYi
431f40063f fix(ws): allow APP_DOMAIN_URL and APP_STATIC_DOMAIN origins
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
validate_origin() only allowed localhost origins by default, causing
production WebSocket connections to be rejected. Now it reads
APP_DOMAIN_URL and APP_STATIC_DOMAIN from env and automatically
adds their http/https/ws/wss variants to the allowed origins list.

Also add APP_DOMAIN_URL to the production configmap.
2026-04-16 18:51:52 +08:00
ZhenYi
df87a65cbb fix(room): accept room_public JSON key for HTTP fallback
Some checks are pending
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
The frontend WebSocket client sends room_public, but the HTTP fallback
sends it as a JSON body parsed directly into RoomCreateRequest/RoomUpdateRequest
which expects public. Add #[serde(rename = "room_public")] so both
ws params and HTTP JSON body work consistently.
2026-04-16 18:27:53 +08:00
ZhenYi
8b433c9169 feat(repo): return ssh/https clone URLs from backend
Some checks are pending
CI / Frontend Lint & Type Check (push) Waiting to run
CI / Frontend Build (push) Blocked by required conditions
CI / Rust Lint & Check (push) Waiting to run
CI / Rust Tests (push) Waiting to run
Add ssh_clone_url and https_clone_url to ProjectRepositoryItem,
constructed from config.ssh_domain() and config.git_http_domain().
Update frontend RepoInfo interface and header.tsx to use the
server-provided URLs instead of hardcoded values.
2026-04-16 18:20:48 +08:00
ZhenYi
a09ff66191 refactor(room): remove NATS, use Redis pub/sub for message queue
- Remove async-nats from Cargo.toml dependencies
- Rename nats_publish_failed metric → redis_publish_failed
- Update queue lib doc comment: Redis Streams + Redis Pub/Sub
- Add Paused/Cancelled task statuses to agent_task model
- Add issue_id and retry_count fields to agent_task
- Switch tool executor Mutex from std::sync → tokio::sync (async context)
- Add timeout/rate-limited/retryable/tool-not-found error variants
2026-04-16 17:24:04 +08:00
ZhenYi
b3b74a2396 fix: remove trigger function from message search migration
The PL/pgSQL trigger function caused SQL parsing issues with
split_sql_statements. Keep only the table, column, and index.
2026-04-15 11:39:39 +08:00
ZhenYi
8e7f3b211e fix: use dollar-quoting for PL/pgSQL trigger function
The previous single-quote syntax with escaped quotes was split by
split_sql_statements on semicolons inside the function body.
Use $$ quoting to avoid quote escaping issues.
2026-04-15 11:31:40 +08:00
ZhenYi
b6697258ab fix(migrate): reorder migrations so workspace ALTER runs after project CREATE
m20260411_000003_add_workspace_id_to_project was running before
m20250628_000013_create_project, causing "relation project does not exist".
Move all project table CREATEs before workspace migrations.
2026-04-15 11:19:01 +08:00
ZhenYi
3a8b8c9cf8 remove all foreign key constraints from SQL migrations
Foreign keys at the database level cause issues with deployment flexibility.
Keep only indexes for query performance; enforce referential integrity at
the application level.
2026-04-15 11:03:17 +08:00
ZhenYi
93cfff9738 init 2026-04-15 09:08:09 +08:00