TL;DR
The prod fallback filter is info,tower_http=debug — it deliberately turns per-request logs on in production. That's the noise. One-line fix, plus a smarter status-aware request logger.
Live bug: the Better Stack flusher failed 5× today in prod and flush() clears the batch even on failure — those logs are gone. Errors can vanish before any alert could fire.
No alerting exists anywhere. Plan: a small in-app AlertLayer (mirrors betterstack.rs) that posts ERROR events to a Slack/Discord webhook with dedup + rate-limiting. Better Stack native alerts as backstop (Slack yes, Discord no).
5xx responses are only logged if the handler remembered to log. Add one response-middleware that logs every 5xx with method + path, so nothing 500s silently.
958 log call sites, generally good structured-field discipline — but 356 error! sites include recoverable paths that should be warn!. Hygiene pass so that error == page-worthy.
No request-id correlation. Optional Phase 3 add: tower-http request-id propagated into the span, so one request's logs group together in Better Stack.
01 Current pipeline
Solid foundation: structured fields, JSON in prod, a hand-rolled Better Stack layer with batching. The problems are in the filter defaults, the failure paths, and what happens after the logs land.
Subscriber configuration logging.rs
| | Dev (debug_assertions) | Production |
|---|---|---|
| Format | pretty, colored | JSON (fmt::layer().json()) |
| Fallback filter | debug,hyper=info,sea_orm=info,tower=info,tower_http=debug | info,tower_http=debug |
| Override | RUST_LOG env var wins in both modes (prod value on Railway unverified — see Q5) | |
| Better Stack | Same layer in both modes, active when BETTERSTACK_SOURCE_TOKEN is set; receives everything the global filter passes | |
| Panics | panic hook routes to tracing::error! ✓ (so panics will flow through any future AlertLayer for free) | |
Per-request logging main.rs:212–233
TraceLayer with a custom debug_span!("http_request", method, uri) and a custom on_response that logs status + latency_ms at DEBUG. Subtle but important: the custom closures live in main.rs, so their events carry target loquent — while the layer's default on_request (“started processing request”) keeps target tower_http::*.
Worst of both worlds under the prod fallback filter
The tower_http=debug directive enables the default “started processing request” event (target tower_http) for every request — but the useful custom one (status, latency; target loquent, DEBUG) is filtered out by the info default. Net effect: a noise line per request with no status/latency attached. If Railway sets RUST_LOG=debug instead, we get both lines per request. Either way: noise.
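The filtering behaviour described above can be reproduced with a small model. This is a simplified, std-only sketch of how a comma-separated directive string resolves (most specific matching target prefix wins, a bare level is the default); it is not the real tracing-subscriber EnvFilter, and the `Level` enum and `enabled` function are hypothetical names introduced for illustration.

```rust
/// Log levels, declared least to most verbose so the derived ordering
/// means "larger = noisier".
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level { Error, Warn, Info, Debug, Trace }

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

/// Simplified model of a filter such as "info,tower_http=debug":
/// `target=level` directives apply to that target and its submodules,
/// the longest matching prefix wins, and a bare level is the fallback.
pub fn enabled(filter: &str, target: &str, level: Level) -> bool {
    let mut default = None;
    let mut best: Option<(usize, Level)> = None;
    for directive in filter.split(',') {
        match directive.split_once('=') {
            None => {
                if let Some(l) = parse_level(directive) {
                    default = Some(l);
                }
            }
            Some((prefix, lvl)) => {
                let prefix = prefix.trim();
                let hit = target == prefix
                    || target.starts_with(format!("{prefix}::").as_str());
                if let (true, Some(l)) = (hit, parse_level(lvl)) {
                    if best.map_or(true, |(len, _)| prefix.len() > len) {
                        best = Some((prefix.len(), l));
                    }
                }
            }
        }
    }
    // Assumption: with no matching directive at all, only errors pass.
    let max = best.map(|(_, l)| l).or(default).unwrap_or(Level::Error);
    level <= max
}
```

Under this model, `enabled("info,tower_http=debug", "loquent", Level::Debug)` is false while the same call with a `tower_http::…` target is true, which is the "noise line without status/latency" outcome. Swapping the filter to `info,tower_http=info` turns both off.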
Emission inventory
Heaviest modules: mods/twilio (246 — webhooks log 1–3 lines per event), mods/billing (102), mods/meta (95), mods/assistant (65), mods/plan (50). Field discipline is good (error = %e, org_id, call_sid, …) per .claude/rules/logging.md; no #[instrument], by convention.
02 The noise problem
Sampled live prod traffic (Railway, last hour). The request stream is dominated by polling and connection churn — none of it tells us anything when healthy, and all of it generates app log lines that ship to Better Stack.
Three compounding costs:
- Signal drowning. A real error! sits between hundreds of identical request lines. Live Tail in Better Stack is unusable without a saved filter.
- Ingestion spend. Better Stack bills per GB ingested. Request lines + Twilio per-event info! chatter are the bulk of the volume while carrying near-zero diagnostic value when everything is healthy.
- Alert hygiene debt. Any future "alert on ERROR" rule is only as good as the error stream. Today, recoverable degradations (e.g. the six context-injection failures in twilio_stream_route.rs:301–385 that explicitly continue processing) log at error!, so they would page.
The principle for the fix
A request log line should exist only when it carries information: 5xx → error, 4xx → warn, slow → warn, everything else → debug (visible in dev, absent in prod). Railway's edge already records every request (method/path/status/latency, seen above) — the app doesn't need to duplicate the happy path.
03 Reliability gaps
Found while auditing — these are not theoretical. Gap #1 fired five times today.
1 · Better Stack flusher drops batches on failure (observed in prod today)
Prod deploy logs show [betterstack] flush error: error sending request for url (https://s2331916.eu-fsn-3.betterstackdata.com/) at 15:39, 16:19, 19:33, 19:37 and 20:41 UTC. In betterstack.rs:90–105, flush() runs batch.clear() on every path — success, HTTP error, network error. Up to 100 log events vanish per failed flush. Separately, try_send at line 202 silently drops events when the 10k channel is full. If an outage coincides with ingest flakiness, the evidence is deleted.
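One way to make the channel-full drops visible is to count them at the send site. The sketch below is std-only and assumes a `std::sync::mpsc::sync_channel`; the actual channel type in betterstack.rs was not inspected here, and `CountingSender`, `offer`, and `take_dropped` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Wraps a bounded sender so a failed `try_send` is counted, not silent.
pub struct CountingSender<T> {
    tx: SyncSender<T>,
    dropped: AtomicU64,
}

impl<T> CountingSender<T> {
    /// `capacity` is the queue bound (10k in the audited code).
    pub fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped: AtomicU64::new(0) }, rx)
    }

    /// Never blocks. Returns true if the item was queued; otherwise the
    /// item is discarded (queue full or receiver gone) and the counter bumps.
    pub fn offer(&self, item: T) -> bool {
        match self.tx.try_send(item) {
            Ok(()) => true,
            Err(_) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Reads and resets the drop counter, e.g. once per flush cycle,
    /// so the flusher can emit a single "dropped N events" line.
    pub fn take_dropped(&self) -> u64 {
        self.dropped.swap(0, Ordering::Relaxed)
    }
}
```

The flusher thread could call `take_dropped()` after each flush and write the count to stderr when it is non-zero, so a saturated queue leaves a trace even if the events themselves are lost.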
2 · 5xx responses are not centrally logged
AppError → HTTP conversion (error.rs:41–65) maps Database/Io/Http/Internal to 500 without logging. An error is visible only if the handler logged before returning. One missed call site = a silent 500. There are 300+ API endpoints relying on this discipline.
3 · No path from ERROR to a human
No Slack/Discord/email/webhook anywhere in the codebase; no alert rules configured in Better Stack. Today, learning about a prod error requires opening the Better Stack dashboard and looking.
4 · No request correlation
No request-id. When a request emits 4 log lines across service layers, nothing ties them together in Better Stack. Nice-to-have, not urgent — structured fields (call_sid, org_id) partially compensate.
04 Alerting options
Two viable routes — they compose. The in-app layer is the primary recommendation because today's flusher failures prove the Better Stack path can't be the only wire to a human.
| Option | Slack | Discord | Latency | Effort | Fails when… |
|---|---|---|---|---|---|
| A · In-app AlertLayer: tracing Layer → webhook POST, mirrors betterstack.rs pattern | ✓ incoming webhook | ✓ webhook (native) | seconds | M · ~200 lines | app itself is down (no process = no alert) |
| B · Better Stack alert rules: threshold/anomaly alerts on a saved query, zero code | ✓ native integration | ✗ not supported (email/Slack/Teams + Uptime escalations only) | ~1–3 min (confirmation window) | S · config only | ingest is failing, which is exactly what we observed today |
Recommended · A as primary
Instant, carries full context (message, target, structured fields), supports both Discord and Slack with only a ~5-line difference in the payload body, and keeps working when Better Stack ingest hiccups. Panics already route through tracing::error!, so they alert for free. Needs in-layer dedup + rate-limiting so one error loop can't flood the channel.
B as backstop, not primary
Worth configuring anyway (10 minutes, no code): an error-rate threshold alert catches classes of failure the in-app layer can't — sustained error spikes, and "the app is up but wedged". But it's blind during ingest failures and can't reach Discord.
AlertLayer design points
- Trigger: events at ERROR only (the level audit in Phase 1 makes this trustworthy). Optional ALERT_MIN_LEVEL escape hatch.
- Dedup: key = (target, message); within a 5-min window send the first, then aggregate ("…and 14 more" in the next flush).
- Rate limit: hard cap (e.g. 10 sends/min) → overflow becomes a single summary message. Discord webhooks tolerate ~30 req/min; Slack ~1/sec.
- Transport: same proven shape as betterstack.rs: bounded mpsc + dedicated flusher thread, try_send so logging never blocks a request.
- Config: ALERT_WEBHOOK_URL + ALERT_WEBHOOK_FORMAT=slack|discord (or auto-detect from the URL host). Absent → layer disabled, dev stays quiet.
- Payload: level, message, target, env (prod/staging), and the event's structured fields, enough to act without opening Better Stack.
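The dedup and rate-limit points above can be sketched as a small gate that the flusher thread consults before each POST. This is a minimal std-only sketch under stated assumptions: `AlertGate` and `Decision` are hypothetical names, time is injected as an `Instant` so the logic is testable, and building the actual summary message from `repeats` and `overflow` is left out.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Post now. `repeats` is how many identical events were suppressed
    /// during the key's previous dedup window ("…and N more").
    Send { repeats: u64 },
    /// Same (target, message) already sent inside the dedup window.
    Deduped,
    /// Per-minute send cap reached; counted in `overflow` for a summary.
    RateLimited,
}

pub struct AlertGate {
    window: Duration,
    max_per_minute: usize,
    /// key -> (start of current window, events suppressed in it)
    seen: HashMap<(String, String), (Instant, u64)>,
    /// timestamps of sends within the last minute
    sent_at: Vec<Instant>,
    pub overflow: u64,
}

impl AlertGate {
    pub fn new(window: Duration, max_per_minute: usize) -> Self {
        Self { window, max_per_minute, seen: HashMap::new(), sent_at: Vec::new(), overflow: 0 }
    }

    pub fn decide(&mut self, target: &str, message: &str, now: Instant) -> Decision {
        let key = (target.to_owned(), message.to_owned());
        let mut repeats = 0;
        if let Some((start, suppressed)) = self.seen.get_mut(&key) {
            if now.duration_since(*start) < self.window {
                *suppressed += 1;
                return Decision::Deduped;
            }
            repeats = *suppressed; // window expired: report what was held back
        }
        // Sliding one-minute window for the hard cap.
        self.sent_at.retain(|t| now.duration_since(*t) < Duration::from_secs(60));
        if self.sent_at.len() >= self.max_per_minute {
            self.overflow += 1;
            return Decision::RateLimited;
        }
        self.sent_at.push(now);
        self.seen.insert(key, (now, 0));
        Decision::Send { repeats }
    }
}
```

With the values suggested above, the gate would be built as `AlertGate::new(Duration::from_secs(300), 10)`. A real layer would also want to evict stale keys from `seen` so the map stays bounded.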
05 Improvement plan
Four phases, independently shippable, in value order. Phases 1 and 2 together are roughly a day of work and deliver the two things asked for: less noise, fast error notifications.
- P1 · Kill the noise (S · ~½ day)
  Replace the on_response closure with a status-aware logger (5xx → error!, 4xx → warn!, >2s → warn!, else debug!) and silence the default on_request noise by dropping tower_http=debug from the prod fallback filter (keep tower_http=info). Skip WS upgrade paths (/api/realtime, /api/assistant/ws) and exclude 401/404 from warn! to avoid scanner noise. Demote the known recoverable error! sites (twilio context-injection, etc.) to warn!. Verify the Railway RUST_LOG value while in there.
- P2 · Error alerts → Slack/Discord (M · ~½–1 day)
  New src/bases/alerting.rs: the AlertLayer described above, registered in logging::init() beside the Better Stack layer. Env-gated, deduped, rate-limited, ERROR-only. Includes panics automatically via the existing hook. Plus the 10-minute backstop: a Better Stack threshold alert (ERROR count > N in 5 min → Slack/email).
- P3 · Pipeline hardening (M · ~1 day)
  Three fixes: (a) Better Stack flusher retries with backoff (2 retries, then drop with an explicit drop-count log) instead of batch.clear() on failure; count channel-full drops too. (b) Central 5xx logging: one middleware after the router logs every 5xx with method+path, ending reliance on per-handler discipline (the status-aware P1 logger may already cover this; confirm before building twice). (c) Request-id: tower-http SetRequestIdLayer + put the id on the http_request span so Better Stack groups a request's lines.
- P4 · Hygiene & cost, ongoing (L · incremental)
  Audit the 356 error! sites against the rule “error == operation failed AND a human should care”; most churn is in mods/twilio. Decide which per-event info! webhook logs earn INFO vs DEBUG. Check Better Stack ingestion volume before/after P1 to quantify the savings. Consider per-module filter directives (e.g. sea_orm=warn) in the prod fallback.
06 Code sketches
Shapes, not final code — to make the plan concrete.
P1 · Status-aware request logging main.rs
    // Replaces the current on_response closure. Happy path → debug (absent in prod).
    .on_response(|res: &http::Response<_>, latency: Duration, span: &Span| {
        let status = res.status();
        let ms = latency.as_millis() as u64;
        match status.as_u16() {
            // Server errors always surface, regardless of per-handler logging.
            500..=599 => tracing::error!(parent: span, status = status.as_u16(), latency_ms = ms, "request failed"),
            // 401/404 are scanner + auth-expiry noise; they fall through to the arms below.
            400..=499 if status != 401 && status != 404 => {
                tracing::warn!(parent: span, status = status.as_u16(), latency_ms = ms, "request rejected")
            }
            // Anything slower than 2 s is worth a line even when it succeeded.
            _ if ms > 2_000 => tracing::warn!(parent: span, status = status.as_u16(), latency_ms = ms, "slow request"),
            _ => tracing::debug!(parent: span, status = status.as_u16(), latency_ms = ms, "finished processing request"),
        }
    })
    // + logging.rs prod fallback: "info,tower_http=info" (drop the =debug)
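To make the P1 policy unit-testable without spinning up tower-http, the decision itself could be pulled into a pure function that the closure calls. This is a sketch only: `RequestLevel` and `request_level` are hypothetical names, and letting 5xx win over the WebSocket skip is an assumption (the plan says only "skip WS upgrade paths"), chosen so that a failing upgrade still shows up.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RequestLevel {
    Error,
    Warn,
    Debug,
    /// Emit nothing at all (WebSocket upgrade paths).
    Skip,
}

/// Paths named in the plan as WS upgrade endpoints.
const WS_PATHS: [&str; 2] = ["/api/realtime", "/api/assistant/ws"];
/// Slow-request threshold in milliseconds (strictly greater than 2 s warns).
const SLOW_MS: u64 = 2_000;

/// Maps one finished request to the level its log line should use.
pub fn request_level(path: &str, status: u16, latency_ms: u64) -> RequestLevel {
    // Assumption: server errors are never skipped, even on WS paths.
    if (500..=599).contains(&status) {
        return RequestLevel::Error;
    }
    if WS_PATHS.iter().any(|p| path.starts_with(p)) {
        return RequestLevel::Skip;
    }
    match status {
        // 401/404 are excluded from warn to avoid scanner noise.
        400..=499 if status != 401 && status != 404 => RequestLevel::Warn,
        _ if latency_ms > SLOW_MS => RequestLevel::Warn,
        _ => RequestLevel::Debug,
    }
}
```

The on_response closure would then reduce to a `match request_level(...)` that picks the matching tracing macro, and the threshold and exclusion rules can be covered by plain unit tests.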
P2 · AlertLayer src/bases/alerting.rs (new)
    // Same bones as betterstack.rs: Layer impl + bounded channel + flusher thread.
    impl<S: Subscriber> Layer<S> for AlertLayer {
        fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
            // Only ERROR events are forwarded; everything else returns immediately.
            if event.metadata().level() != &Level::ERROR { return; }
            let _ = self.sender.try_send(AlertEvent::from(event)); // never blocks
        }
    }
    // flusher thread: dedup by (target, message) in 5-min window, cap 10 sends/min,
    // then POST. Discord: {"content": text} · Slack: {"text": text}
    // env: ALERT_WEBHOOK_URL, ALERT_WEBHOOK_FORMAT=discord|slack
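The format switch and the URL-host auto-detection mentioned above could look roughly like this. It is a std-only sketch: `WebhookFormat`, `detect_format`, and `webhook_body` are hypothetical names, the host list covers only the common Slack and Discord webhook hosts, and the hand-rolled JSON escaping stands in for whatever JSON crate the project already uses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WebhookFormat {
    Slack,
    Discord,
}

/// Guesses the payload format from the webhook URL's host.
/// Returns None for unknown hosts so the caller can require
/// ALERT_WEBHOOK_FORMAT to be set explicitly.
pub fn detect_format(url: &str) -> Option<WebhookFormat> {
    let rest = url.strip_prefix("https://")?;
    let host = rest.split('/').next()?;
    match host {
        "hooks.slack.com" => Some(WebhookFormat::Slack),
        "discord.com" | "discordapp.com" => Some(WebhookFormat::Discord),
        _ => None,
    }
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds the request body: Slack expects {"text": …}, Discord {"content": …}.
pub fn webhook_body(format: WebhookFormat, text: &str) -> String {
    let key = match format {
        WebhookFormat::Slack => "text",
        WebhookFormat::Discord => "content",
    };
    format!("{{\"{}\":\"{}\"}}", key, json_escape(text))
}
```

For example, `webhook_body(WebhookFormat::Slack, "db down")` yields `{"text":"db down"}`. In the real layer, serializing through the project's JSON library would be the safer choice than manual escaping.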
P3 · Flusher retry betterstack.rs
    fn flush(client: &Client, url: &str, token: &str, batch: &mut Vec<LogEvent>) {
        // Up to 3 attempts: the initial send plus 2 retries with 500 ms / 1 s backoff.
        for attempt in 0..3 {
            match client.post(url).bearer_auth(token).json(batch).send() {
                Ok(r) if r.status().is_success() => { batch.clear(); return; }
                _ if attempt < 2 => std::thread::sleep(Duration::from_millis(500 << attempt)),
                _ => {}
            }
        }
        eprintln!("[betterstack] dropping {} events after 3 failed flushes", batch.len());
        batch.clear(); // bounded loss, now explicit + counted
    }
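A variant of the retry loop with the transport injected as a closure makes the drop accounting testable without a network. This is a sketch under assumptions: `flush_with_retry` is a hypothetical name, the closure stands in for the HTTP POST, and the function returns the number of events dropped so the caller can log or accumulate it.

```rust
use std::time::Duration;

/// Tries `send` up to `max_attempts` times with doubling backoff.
/// Clears the batch either way and returns how many events were dropped
/// (0 when a send succeeded).
pub fn flush_with_retry<T, E>(
    batch: &mut Vec<T>,
    max_attempts: u32,
    base_backoff: Duration,
    mut send: impl FnMut(&[T]) -> Result<(), E>,
) -> usize {
    for attempt in 0..max_attempts {
        if send(batch.as_slice()).is_ok() {
            batch.clear();
            return 0;
        }
        // Sleep between attempts only, not after the final failure.
        if attempt + 1 < max_attempts {
            std::thread::sleep(base_backoff * 2u32.pow(attempt));
        }
    }
    let dropped = batch.len();
    batch.clear(); // bounded loss: the caller receives the count
    dropped
}
```

Called as `flush_with_retry(&mut batch, 3, Duration::from_millis(500), |events| post(events))`, where `post` is whatever performs the request, this matches the 500 ms / 1 s schedule in the sketch above while letting tests pass `Duration::ZERO` and a fake sender.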
07 Open questions
Decisions needed before implementation. Recommendations pre-marked.
Where do error alerts go — Slack or Discord?
- Build format-agnostic (ALERT_WEBHOOK_FORMAT env), wire up whichever channel the team actually watches. The payload formatting difference is ~5 lines.
- Slack only: also unlocks Better Stack's native integration as the backstop channel.
- Discord only: fine for the in-app layer; the Better Stack backstop falls back to email.
What triggers an alert?
- Every ERROR event, deduped + rate-limited (first occurrence immediately, repeats aggregated). Depends on the P1/P4 level audit so ERROR means it.
- Only HTTP 5xx responses: simpler, but misses background jobs, webhook handlers that swallow to 200, and AI-loop failures.
- Better Stack threshold only (no in-app layer): zero code, but blind during ingest failures like today's and ~minutes slower.
Per-request logs in prod — keep any?
- Only exceptions: 5xx / unexpected 4xx / slow (>2s). Healthy traffic disappears from app logs; Railway's edge log still records every request if ever needed.
- Keep INFO for all non-polling routes — preserves an in-Better-Stack request trail, but keeps most of the noise and ingestion cost.
Is request-id correlation (P3c) worth it now?
- Yes, while touching the TraceLayer anyway — it's ~20 lines with tower-http's request-id feature and makes every future incident cheaper to debug.
- Defer: org_id/call_sid fields cover most correlation needs today.
Verify production env (blocked during audit)
- Check on Railway: is RUST_LOG set on loquent-app/prod, and to what? (Determines which of the two noise mechanisms in §1 is live.) Reading prod variables needed explicit approval, so this audit used the code-side fallbacks. BETTERSTACK_INGESTING_HOST is confirmed set (EU host visible in flush errors).