Observability Audit

Logging — current state & improvement plan

Every request currently emits a log line that ships to Better Stack, errors only reach a human if someone happens to be looking, and the Better Stack flusher is silently dropping batches in production. This doc maps what exists today and proposes a four-phase fix: tame the noise, alert errors to Slack/Discord, harden the pipeline, then keep it honest.

Stack · tracing + tracing-subscriber → stdout (Railway) + custom Better Stack layer
Prod · Railway loquent.io / loquent-app / prod
Sources · src/bases/logging.rs · src/bases/betterstack.rs · src/main.rs · live Railway logs

TL;DR

The prod fallback filter is info,tower_http=debug — it deliberately turns per-request logs on in production. That's the noise. One-line fix, plus a smarter status-aware request logger.

Live bug: the Better Stack flusher failed 5× today in prod and flush() clears the batch even on failure — those logs are gone. Errors can vanish before any alert could fire.

No alerting exists anywhere. Plan: a small in-app AlertLayer (mirrors betterstack.rs) that posts ERROR events to a Slack/Discord webhook with dedup + rate-limiting. Better Stack native alerts as backstop (Slack yes, Discord no).

5xx responses are only logged if the handler remembered to log. Add one response-middleware that logs every 5xx with method + path, so nothing 500s silently.

958 log call sites, generally good structured-field discipline — but 356 error! sites include recoverable paths that should be warn!. Hygiene pass so that error == page-worthy.

No request-id correlation. Optional Phase 3 add: tower-http request-id propagated into the span, so one request's logs group together in Better Stack.

01 Current pipeline

Solid foundation: structured fields, JSON in prod, a hand-rolled Better Stack layer with batching. The problems are in the filter defaults, the failure paths, and what happens after the logs land.

tracing::error!/warn!/info!/debug!      tower_http TraceLayer
        (958 call sites)                (per-request span + events)
               │                                    │
               └──────────────┬─────────────────────┘
                              ▼
               tracing_subscriber::registry
                              │
          global EnvFilter (RUST_LOG or fallback)
            dev : debug,hyper=info,sea_orm=info,tower=info,tower_http=debug
            prod: info,tower_http=debug   ◀── per-request logs ON in prod
                              │
               ┌──────────────┴──────────────┐
               ▼                             ▼
           fmt layer                 BetterstackLayer (custom)
     dev: pretty / prod: JSON        mpsc(10k, try_send → silent drop)
               │                             │
               ▼                             ▼
     stdout → Railway logs           flusher thread (blocking reqwest)
   (7-day retention, no alerts)      batch 100 / flush 2s
                                     on HTTP error: eprintln +
                                     batch.clear()   ◀── LOG LOSS
                                             │
                                             ▼
                                     Better Stack (EU ingest host)
                                     dashboards · no alerts configured

Subscriber configuration logging.rs

	Dev (`debug_assertions`)	Production
Format	pretty, colored	JSON (`fmt::layer().json()`)
Fallback filter	`debug,hyper=info,sea_orm=info,tower=info,tower_http=debug`	`info,tower_http=debug`
Override	`RUST_LOG` env var wins in both modes (prod value on Railway unverified — see Q5)
Better Stack	Same layer in both modes, active when `BETTERSTACK_SOURCE_TOKEN` is set; receives everything the global filter passes
Panics	panic hook routes to `tracing::error!` ✓ (so panics will flow through any future AlertLayer for free)

Per-request logging main.rs:212–233

TraceLayer with a custom debug_span!("http_request", method, uri) and a custom on_response that logs status + latency_ms at DEBUG. Subtle but important: the custom closures live in main.rs, so their events carry target loquent — while the layer's default on_request (“started processing request”) keeps target tower_http::*.

Worst of both worlds under the prod fallback filter

The tower_http=debug directive enables the default “started processing request” event (target tower_http) for every request — but the useful custom one (status, latency; target loquent, DEBUG) is filtered out by the info default. Net effect: a noise line per request with no status/latency attached. If Railway sets RUST_LOG=debug instead, we get both lines per request. Either way: noise.
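The interaction between the two directives can be reproduced with a small model. This is a simplified sketch, not tracing-subscriber's actual implementation: it only handles `target=level` and bare-level directives, picks the longest matching target prefix, and ignores span and field directives. The `Level` enum, `is_enabled`, and the full target name `tower_http::trace::on_request` are illustrative assumptions.

```rust
/// Log levels ordered from least to most verbose, so `a <= b` means
/// "a is at least as severe as b".
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

fn parse_level(s: &str) -> Option<Level> {
    match s.to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Returns true if an event with `target` at `level` passes `directives`
/// (e.g. "info,tower_http=debug"). The directive with the longest matching
/// target prefix wins; a bare level acts as the default for every target.
pub fn is_enabled(directives: &str, target: &str, level: Level) -> bool {
    let mut best: Option<(usize, Level)> = None;
    for directive in directives.split(',').map(str::trim).filter(|d| !d.is_empty()) {
        let (prefix, max) = match directive.split_once('=') {
            Some((t, l)) => (t, parse_level(l)),
            None => ("", parse_level(directive)),
        };
        // Directives this model does not understand are skipped.
        let max = match max {
            Some(m) => m,
            None => continue,
        };
        if target.starts_with(prefix) && best.map_or(true, |(len, _)| prefix.len() >= len) {
            best = Some((prefix.len(), max));
        }
    }
    best.map_or(false, |(_, max)| level <= max)
}
```

Under `info,tower_http=debug` the model lets a DEBUG event with a `tower_http::…` target through while rejecting a DEBUG event with target `loquent`, which is the mismatch described above. Switching the fallback to `info,tower_http=info` rejects both.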

Emission inventory

958 · tracing call sites in src/
356 · error!
316 · warn!
190 · info!
debug! + trace! · the remainder (96, if the four categories account for all 958 sites)

Heaviest modules: mods/twilio (246 — webhooks log 1–3 lines per event), mods/billing (102), mods/meta (95), mods/assistant (65), mods/plan (50). Field discipline is good (error = %e, org_id, call_sid, …) per .claude/rules/logging.md; no #[instrument], by convention.

02 The noise problem

Sampled live prod traffic (Railway, last hour). The request stream is dominated by polling and connection churn — none of it tells us anything when healthy, and all of it generates app log lines that ship to Better Stack.

23:45:54  GET   /api/notifications/unread-count  200  47ms      ← client polling, ~2×/5min/client
23:46:32  POST  /twilio/events                   200  32ms      ← every Twilio callback
23:46:47  GET   /api/realtime                    0    53019ms   ← WS reconnect churn, every 25–120s
23:48:12  GET   /api/assistant/ws                0    122355ms  ← assistant WS, same churn
23:56:28  POST  /resend/events                   200  9ms       ← email webhook

Three compounding costs:

Signal drowning. A real error! sits between hundreds of identical request lines. Live Tail in Better Stack is unusable without a saved filter.
Ingestion spend. Better Stack bills per GB ingested. Request lines + Twilio per-event info! chatter are the bulk of the volume while carrying near-zero diagnostic value when everything is healthy.
Alert hygiene debt. Any future "alert on ERROR" rule is only as good as the error stream. Today, recoverable degradations (e.g. the six context-injection failures in twilio_stream_route.rs:301–385 that explicitly continue processing) log at error! — they'd page.

The principle for the fix

A request log line should exist only when it carries information: 5xx → error, 4xx → warn, slow → warn, everything else → debug (visible in dev, absent in prod). Railway's edge already records every request (method/path/status/latency, seen above) — the app doesn't need to duplicate the happy path.

03 Reliability gaps

Found while auditing — these are not theoretical. Gap #1 fired five times today.

1 · Better Stack flusher drops batches on failure (observed in prod today)

Prod deploy logs show [betterstack] flush error: error sending request for url (https://s2331916.eu-fsn-3.betterstackdata.com/) at 15:39, 16:19, 19:33, 19:37 and 20:41 UTC. In betterstack.rs:90–105, flush() runs batch.clear() on every path — success, HTTP error, network error. Up to 100 log events vanish per failed flush. Separately, try_send at line 202 silently drops events when the 10k channel is full. If an outage coincides with ingest flakiness, the evidence is deleted.
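One way to make channel-full drops visible rather than silent is to wrap the sender in a counter. The sketch below uses only the standard library and assumes a `std::sync::mpsc` bounded channel; the real layer may use a different channel type, and `CountingSender`, `send_or_count`, and `take_dropped` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

/// Sending half of a bounded log queue that never blocks the caller and
/// counts every event it has to drop.
pub struct CountingSender<T> {
    tx: SyncSender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> CountingSender<T> {
    /// Creates a queue holding at most `capacity` pending events.
    pub fn bounded(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        let sender = Self {
            tx,
            dropped: Arc::new(AtomicU64::new(0)),
        };
        (sender, rx)
    }

    /// Enqueues without blocking. Returns false and bumps the drop counter
    /// if the queue is full or the receiver is gone.
    pub fn send_or_count(&self, event: T) -> bool {
        match self.tx.try_send(event) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Returns the number of drops since the last call and resets it to
    /// zero, so the flusher can report the figure once per flush.
    pub fn take_dropped(&self) -> u64 {
        self.dropped.swap(0, Ordering::Relaxed)
    }
}
```

The flusher could call `take_dropped()` on each tick and emit a single line when the result is non-zero, which keeps the hot path to one atomic increment.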

2 · 5xx responses are not centrally logged

AppError → HTTP conversion (error.rs:41–65) maps Database/Io/Http/Internal to 500 without logging. An error is visible only if the handler logged before returning. One missed call site = a silent 500. There are 300+ API endpoints relying on this discipline.
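The gap is easier to see with the decision isolated in one function. The following is a standalone model, not the code in error.rs: only the four 500-mapped variant names come from the audit, while `NotFound`, the `String` payloads, and `server_error_line` are hypothetical. It shows a single choke point that produces a log line for every 5xx outcome regardless of what the handler did.

```rust
/// Hypothetical mirror of the application error type.
#[derive(Debug)]
pub enum AppError {
    Database(String),
    Io(String),
    Http(String),
    Internal(String),
    /// Added for contrast; not taken from the audit.
    NotFound,
}

impl AppError {
    /// HTTP status the error converts to.
    pub fn status(&self) -> u16 {
        match self {
            AppError::NotFound => 404,
            AppError::Database(_)
            | AppError::Io(_)
            | AppError::Http(_)
            | AppError::Internal(_) => 500,
        }
    }
}

/// Builds the log line for a 5xx outcome, or None when the error maps to a
/// client-side status and needs no central error log.
pub fn server_error_line(method: &str, path: &str, err: &AppError) -> Option<String> {
    let status = err.status();
    if status >= 500 {
        Some(format!("status={status} method={method} path={path} error={err:?}"))
    } else {
        None
    }
}
```

In the real service the equivalent check would live in a response middleware or in the conversion itself, emitting through `tracing::error!` instead of returning a string.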

3 · No path from ERROR to a human

No Slack/Discord/email/webhook anywhere in the codebase; no alert rules configured in Better Stack. Today, learning about a prod error requires opening the Better Stack dashboard and looking.

4 · No request correlation

No request-id. When a request emits 4 log lines across service layers, nothing ties them together in Better Stack. Nice-to-have, not urgent — structured fields (call_sid, org_id) partially compensate.

04 Alerting options

Two viable routes — they compose. The in-app layer is the primary recommendation because today's flusher failures prove the Better Stack path can't be the only wire to a human.

Option	Slack	Discord	Latency	Effort	Fails when…
A · In-app `AlertLayer` tracing Layer → webhook POST, mirrors `betterstack.rs` pattern	✓ incoming webhook	✓ webhook (native)	seconds	M · ~200 lines	app itself is down (no process = no alert)
B · Better Stack alert rules threshold/anomaly alerts on a saved query, zero code	✓ native integration	✗ not supported (email/Slack/Teams + Uptime escalations only)	~1–3 min (confirmation window)	S · config only	ingest is failing — exactly what we observed today

Recommended · A as primary

Instant, carries full context (message, target, structured fields), Discord and Slack work with the same ~5-line body difference, and it keeps working when Better Stack ingest hiccups. Panics already route through tracing::error!, so they alert for free. Needs in-layer dedup + rate-limiting so one error loop can't flood the channel.

B as backstop, not primary

Worth configuring anyway (10 minutes, no code): an error-rate threshold alert catches classes of failure the in-app layer can't — sustained error spikes, and "the app is up but wedged". But it's blind during ingest failures and can't reach Discord.

AlertLayer design points

Trigger: events at ERROR only (the level audit in Phase 1 makes this trustworthy). Optional ALERT_MIN_LEVEL escape hatch.
Dedup: key = (target, message); within a 5-min window send the first, then aggregate ("…and 14 more" in the next flush).
Rate limit: hard cap (e.g. 10 sends/min) → overflow becomes a single summary message. Discord webhooks tolerate ~30 req/min; Slack ~1/sec.
Transport: same proven shape as betterstack.rs — bounded mpsc + dedicated flusher thread, try_send so logging never blocks a request.
Config: ALERT_WEBHOOK_URL + ALERT_WEBHOOK_FORMAT=slack|discord (or auto-detect from the URL host). Absent → layer disabled, dev stays quiet.
Payload: level, message, target, env (prod/staging), and the event's structured fields — enough to act without opening Better Stack.
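The dedup and rate-limit points above can be sketched as a small state machine that the flusher thread consults for each event. This is one possible shape under stated assumptions: `AlertGate`, `Decision`, and the method names are hypothetical, time is passed in explicitly so the logic can be tested without sleeping, and a rate-limited event is not remembered for dedup, so it gets another chance on its next occurrence.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// What the flusher should do with one incoming ERROR event.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// First occurrence in its window: post it now.
    Send,
    /// Repeat inside the dedup window: counted for a later "and N more".
    Aggregated,
    /// Over the per-minute cap: counted for a single summary message.
    RateLimited,
}

pub struct AlertGate {
    dedup_window: Duration,
    max_per_minute: usize,
    /// (target, message) -> (window start, repeats suppressed so far)
    seen: HashMap<(String, String), (Instant, u64)>,
    /// Timestamps of sends within the last minute.
    recent_sends: Vec<Instant>,
    rate_limited: u64,
}

impl AlertGate {
    pub fn new(dedup_window: Duration, max_per_minute: usize) -> Self {
        Self {
            dedup_window,
            max_per_minute,
            seen: HashMap::new(),
            recent_sends: Vec::new(),
            rate_limited: 0,
        }
    }

    /// Classifies one event observed at `now`.
    pub fn check(&mut self, target: &str, message: &str, now: Instant) -> Decision {
        let key = (target.to_string(), message.to_string());
        if let Some((start, repeats)) = self.seen.get_mut(&key) {
            if now.duration_since(*start) < self.dedup_window {
                *repeats += 1;
                return Decision::Aggregated;
            }
        }
        // Forget sends older than one minute before applying the cap.
        self.recent_sends
            .retain(|sent| now.duration_since(*sent) < Duration::from_secs(60));
        if self.recent_sends.len() >= self.max_per_minute {
            self.rate_limited += 1;
            return Decision::RateLimited;
        }
        self.seen.insert(key, (now, 0));
        self.recent_sends.push(now);
        Decision::Send
    }

    /// Repeats suppressed for this key in its current window.
    pub fn suppressed_repeats(&self, target: &str, message: &str) -> u64 {
        self.seen
            .get(&(target.to_string(), message.to_string()))
            .map_or(0, |(_, repeats)| *repeats)
    }

    /// Events dropped by the cap since the last call; resets the counter.
    pub fn take_rate_limited(&mut self) -> u64 {
        std::mem::take(&mut self.rate_limited)
    }
}
```

With the values from the design points this would be `AlertGate::new(Duration::from_secs(300), 10)`. A production version would also need to evict expired keys and read the repeat counts before a window rolls over.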

05 Improvement plan

Four phases, independently shippable, in value order. Phase 1+2 together are roughly a day of work and deliver the two things asked for: less noise, fast error notifications.

P1 · Kill the noise (S · ~½ day)

Replace the on_response closure with a status-aware logger (5xx → error!, 4xx → warn!, slower than 2s → warn!, else debug!) and silence the default on_request noise by dropping tower_http=debug from the prod fallback filter (keep tower_http=info). Skip WS upgrade paths (/api/realtime, /api/assistant/ws) and exclude 401/404 from warn! to avoid scanner noise. Demote the known recoverable error! sites (twilio context-injection, etc.) to warn!. Verify the Railway RUST_LOG value while in there.
P2 · Error alerts → Slack/Discord (M · ~½–1 day)

New src/bases/alerting.rs — the AlertLayer described above, registered in logging::init() beside the Better Stack layer. Env-gated, deduped, rate-limited, ERROR-only. Includes panics automatically via the existing hook. Plus the 10-minute backstop: a Better Stack threshold alert (ERROR count > N in 5 min → Slack/email).
P3 · Pipeline hardening (M · ~1 day)

Three fixes: (a) Better Stack flusher retries with backoff (2 retries, then drop with an explicit drop-count log) instead of batch.clear() on failure; count channel-full drops too. (b) Central 5xx logging — one middleware after the router logs every 5xx with method+path, ending reliance on per-handler discipline (the status-aware P1 logger may already cover this — confirm before building twice). (c) Request-id: tower-http SetRequestIdLayer + put the id on the http_request span so Better Stack groups a request's lines.
P4 · Hygiene & cost (ongoing · L · incremental)

Audit the 356 error! sites against the rule “error == operation failed AND a human should care” — most churn is in mods/twilio. Decide which per-event info! webhook logs earn INFO vs DEBUG. Check Better Stack ingestion volume before/after P1 to quantify the savings. Consider per-module filter directives (e.g. sea_orm=warn) in the prod fallback.

06 Code sketches

Shapes, not final code — to make the plan concrete.

P1 · Status-aware request logging main.rs

// Replaces the current on_response closure. Happy path → debug (absent in prod).
// Assumes `std::time::Duration` and `tracing::Span` are already imported in main.rs.
.on_response(|res: &http::Response<_>, latency: Duration, span: &Span| {
    let status = res.status().as_u16();
    let latency_ms = latency.as_millis() as u64;
    match status {
        500..=599 => tracing::error!(parent: span, status, latency_ms, "request failed"),
        // 401/404 are scanner + auth-expiry noise: they skip this arm and
        // fall through to the slow/debug arms below
        400..=499 if status != 401 && status != 404 => {
            tracing::warn!(parent: span, status, latency_ms, "request rejected")
        }
        _ if latency_ms > 2_000 => tracing::warn!(parent: span, status, latency_ms, "slow request"),
        _ => tracing::debug!(parent: span, status, latency_ms, "finished processing request"),
    }
})
// + logging.rs prod fallback: "info,tower_http=info"  (drop the =debug)
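The closure above does not yet skip the WebSocket paths that P1 calls for. Pulling the decision into a pure function makes that rule, and the 2-second threshold, unit-testable. This is a sketch: `RequestLevel`, `classify`, and the constants are hypothetical names, and exact path matching is an assumption about how the routes are mounted.

```rust
/// Level a finished request should be logged at.
#[derive(Debug, PartialEq, Eq)]
pub enum RequestLevel {
    Error,
    Warn,
    Debug,
}

/// Long-lived WebSocket routes named in the audit; never logged per request.
const WS_PATHS: [&str; 2] = ["/api/realtime", "/api/assistant/ws"];

/// Requests slower than this are worth a warning even when they succeed.
const SLOW_MS: u64 = 2_000;

/// Returns the level for a finished request, or None if it should not be
/// logged at all.
pub fn classify(path: &str, status: u16, latency_ms: u64) -> Option<RequestLevel> {
    if WS_PATHS.contains(&path) {
        return None;
    }
    Some(match status {
        500..=599 => RequestLevel::Error,
        // 401/404 are treated as noise and fall through to the arms below.
        400..=499 if status != 401 && status != 404 => RequestLevel::Warn,
        _ if latency_ms > SLOW_MS => RequestLevel::Warn,
        _ => RequestLevel::Debug,
    })
}
```

The closure would then match on `classify(path, status, latency_ms)` and call the corresponding tracing macro, which requires the request path to be available on the span or in a request extension.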

P2 · AlertLayer src/bases/alerting.rs (new)

// Same bones as betterstack.rs: Layer impl + bounded channel + flusher thread.
// AlertLayer, AlertEvent and the `sender` field are new types this file would define.
use tracing::{Event, Level, Subscriber};
use tracing_subscriber::{layer::Context, Layer};

impl<S: Subscriber> Layer<S> for AlertLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // Only ERROR events are forwarded; everything else is ignored here.
        if *event.metadata().level() != Level::ERROR { return; }
        let _ = self.sender.try_send(AlertEvent::from(event)); // never blocks
    }
}
// flusher thread: dedup by (target, message) in 5-min window, cap 10 sends/min,
// then POST. Discord: {"content": text} · Slack: {"text": text}
// env: ALERT_WEBHOOK_URL, ALERT_WEBHOOK_FORMAT=discord|slack
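The format switch and the optional auto-detection from the URL host could look like the sketch below. It is standard-library only, so JSON escaping is done by hand; a real implementation would more likely build the body with serde_json. `WebhookFormat`, `detect_format`, and `payload` are hypothetical names, and the host list is an assumption about the webhook URLs in use.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WebhookFormat {
    Slack,
    Discord,
}

/// Guesses the format from the webhook URL host; None means the caller
/// should fall back to ALERT_WEBHOOK_FORMAT.
pub fn detect_format(url: &str) -> Option<WebhookFormat> {
    let rest = url.strip_prefix("https://")?;
    let host = rest.split('/').next()?;
    match host {
        "hooks.slack.com" => Some(WebhookFormat::Slack),
        "discord.com" | "discordapp.com" => Some(WebhookFormat::Discord),
        _ => None,
    }
}

/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the request body: Slack expects {"text": …}, Discord {"content": …}.
pub fn payload(format: WebhookFormat, text: &str) -> String {
    let key = match format {
        WebhookFormat::Slack => "text",
        WebhookFormat::Discord => "content",
    };
    format!("{{\"{}\":\"{}\"}}", key, json_escape(text))
}
```

Keeping the difference down to one key name is what makes the format-agnostic option cheap: everything upstream of `payload` (dedup, rate limiting, text rendering) is shared.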

P3 · Flusher retry betterstack.rs

// `Client` is reqwest's blocking client; 3 attempts = 1 try + 2 retries.
fn flush(client: &Client, url: &str, token: &str, batch: &mut Vec<LogEvent>) {
    for attempt in 0..3u32 {
        match client.post(url).bearer_auth(token).json(&*batch).send() {
            Ok(r) if r.status().is_success() => { batch.clear(); return; }
            // back off 500 ms, then 1 s, before the next attempt
            _ if attempt < 2 => std::thread::sleep(Duration::from_millis(500 << attempt)),
            _ => {}
        }
    }
    eprintln!("[betterstack] dropping {} events after 3 failed flushes", batch.len());
    batch.clear(); // bounded loss, now explicit + counted
}
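The retry loop in the flush sketch can also be factored into a helper whose sleep is injectable, which lets the backoff schedule be tested without waiting. This is a sketch under assumptions: `retry_with_backoff` is a hypothetical name, and the operation always runs at least once even if `max_attempts` is 0.

```rust
use std::time::Duration;

/// Runs `op` up to `max_attempts` times, calling `sleep(base * 2^attempt)`
/// after each failure that will be retried. Returns the first success or
/// the last error.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(base * 2u32.pow(attempt));
                attempt += 1;
            }
        }
    }
}
```

In the flusher the last argument would be `std::thread::sleep`, and the `Err` result is the point at which the batch is dropped and the drop count logged.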

07 Open questions

Decisions needed before implementation. Recommendations pre-marked.

Where do error alerts go — Slack or Discord?

Build format-agnostic (ALERT_WEBHOOK_FORMAT env), wire up whichever channel the team actually watches. The payload formatting difference is ~5 lines.
Slack only — also unlocks Better Stack's native integration as the backstop channel.
Discord only — fine for the in-app layer; the Better Stack backstop falls back to email.

What triggers an alert?

Every ERROR event, deduped + rate-limited (first occurrence immediately, repeats aggregated). Depends on the P1/P4 level audit so ERROR means it.
Only HTTP 5xx responses — simpler, but misses background jobs, webhook handlers that swallow to 200, and AI-loop failures.
Better Stack threshold only (no in-app layer) — zero code, but blind during ingest failures like today's and ~minutes slower.

Per-request logs in prod — keep any?

Only exceptions: 5xx / unexpected 4xx / slow (>2s). Healthy traffic disappears from app logs; Railway's edge log still records every request if ever needed.
Keep INFO for all non-polling routes — preserves an in-Better-Stack request trail, but keeps most of the noise and ingestion cost.

Is request-id correlation (P3c) worth it now?

Yes, while touching the TraceLayer anyway — it's ~20 lines with tower-http's request-id feature and makes every future incident cheaper to debug.
Defer — org_id/call_sid fields cover most correlation needs today.

Verify production env (blocked during audit)

Check on Railway: is RUST_LOG set on loquent-app/prod, and to what? (Determines which of the two noise mechanisms in §1 is live.) Reading prod variables needed explicit approval, so this audit used the code-side fallbacks. BETTERSTACK_INGESTING_HOST is confirmed set (EU host visible in flush errors).