Loquent · Logging Improvements Mode B · 2026-06-11 · brainstorm — not yet a spec
Observability Audit

Logging — current state & improvement plan

Every request currently emits a log line that ships to Better Stack, errors only reach a human if someone happens to be looking, and the Better Stack flusher is silently dropping batches in production. This doc maps what exists today and proposes a four-phase fix: tame the noise, alert errors to Slack/Discord, harden the pipeline, then keep it honest.

Stack · tracing + tracing-subscriber → stdout (Railway) + custom Better Stack layer
Prod · Railway loquent.io / loquent-app / prod
Sources · src/bases/logging.rs · src/bases/betterstack.rs · src/main.rs · live Railway logs

TL;DR

01 · The prod fallback filter is info,tower_http=debug, which deliberately turns per-request logs on in production. That's the noise. One-line fix, plus a smarter status-aware request logger.

02 · Live bug: the Better Stack flusher failed 5× today in prod, and flush() clears the batch even on failure, so those logs are gone. Errors can vanish before any alert could fire.

03 · No alerting exists anywhere. Plan: a small in-app AlertLayer (mirrors betterstack.rs) that posts ERROR events to a Slack/Discord webhook with dedup + rate-limiting. Better Stack native alerts as backstop (Slack yes, Discord no).

04 · 5xx responses are only logged if the handler remembered to log. Add one response middleware that logs every 5xx with method + path, so nothing 500s silently.

05 · 958 log call sites with generally good structured-field discipline, but the 356 error! sites include recoverable paths that should be warn!. Hygiene pass so that error == page-worthy.

06 · No request-id correlation. Optional Phase 3 add: tower-http request-id propagated into the span, so one request's logs group together in Better Stack.

01 · Current pipeline

Solid foundation: structured fields, JSON in prod, a hand-rolled Better Stack layer with batching. The problems are in the filter defaults, the failure paths, and what happens after the logs land.

tracing::error!/warn!/info!/debug!      tower_http TraceLayer
        (958 call sites)              (per-request span + events)
               │                                   │
               └──────────────┬────────────────────┘
                              ▼
               tracing_subscriber::registry
                              │
        global EnvFilter (RUST_LOG or fallback)
          dev : debug,hyper=info,sea_orm=info,tower=info,tower_http=debug
          prod: info,tower_http=debug   ◀── per-request logs ON in prod
                              │
               ┌──────────────┴──────────────┐
               ▼                             ▼
           fmt layer                BetterstackLayer (custom)
   dev: pretty / prod: JSON         mpsc(10k, try_send → silent drop)
               │                             │
               ▼                             ▼
     stdout → Railway logs          flusher thread (blocking reqwest)
 (7-day retention, no alerts)       batch 100 / flush 2s
                                             │   on HTTP error: eprintln +
                                             ▼   batch.clear()  ◀── LOG LOSS
                                    Better Stack (EU ingest host)
                                    dashboards · no alerts configured

Subscriber configuration logging.rs

Setting         | Dev (debug_assertions)                                    | Production
Format          | pretty, colored                                           | JSON (fmt::layer().json())
Fallback filter | debug,hyper=info,sea_orm=info,tower=info,tower_http=debug | info,tower_http=debug
Override        | RUST_LOG env var wins in both modes (prod value on Railway unverified, see Q5)
Better Stack    | Same layer in both modes, active when BETTERSTACK_SOURCE_TOKEN is set; receives everything the global filter passes
Panics          | panic hook routes to tracing::error! ✓ (so panics will flow through any future AlertLayer for free)

Per-request logging main.rs:212–233

TraceLayer with a custom debug_span!("http_request", method, uri) and a custom on_response that logs status + latency_ms at DEBUG. Subtle but important: the custom closures live in main.rs, so their events carry target loquent — while the layer's default on_request (“started processing request”) keeps target tower_http::*.

Worst of both worlds under the prod fallback filter

The tower_http=debug directive enables the default “started processing request” event (target tower_http) for every request — but the useful custom one (status, latency; target loquent, DEBUG) is filtered out by the info default. Net effect: a noise line per request with no status/latency attached. If Railway sets RUST_LOG=debug instead, we get both lines per request. Either way: noise.
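
The interaction can be checked with a toy model of target-prefix directives. This is a simplified, dependency-free sketch and not the real EnvFilter: it assumes only `level` and `target=level` directives, prefix matching on the target, and "longest matching prefix wins". The `Directive` type and the `parse` and `enabled` helpers are hypothetical names introduced for illustration.

```rust
// Toy model of filter directives such as "info,tower_http=debug".
// Declaration order gives Error < Warn < Info < Debug < Trace.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Level { Error, Warn, Info, Debug, Trace }

struct Directive {
    target: Option<String>, // None = the bare default level
    max: Level,             // most verbose level this directive lets through
}

fn parse_level(s: &str) -> Option<Level> {
    match s.trim().to_ascii_lowercase().as_str() {
        "error" => Some(Level::Error),
        "warn" => Some(Level::Warn),
        "info" => Some(Level::Info),
        "debug" => Some(Level::Debug),
        "trace" => Some(Level::Trace),
        _ => None,
    }
}

// Unparseable parts are skipped rather than reported, to keep the sketch short.
fn parse(spec: &str) -> Vec<Directive> {
    spec.split(',')
        .filter_map(|part| match part.split_once('=') {
            Some((target, level)) => Some(Directive {
                target: Some(target.trim().to_string()),
                max: parse_level(level)?,
            }),
            None => Some(Directive { target: None, max: parse_level(part)? }),
        })
        .collect()
}

// The most specific (longest) matching target prefix decides.
fn enabled(directives: &[Directive], target: &str, level: Level) -> bool {
    directives
        .iter()
        .filter(|d| d.target.as_deref().map_or(true, |t| target.starts_with(t)))
        .max_by_key(|d| d.target.as_ref().map_or(0, |t| t.len()))
        .map_or(false, |d| level <= d.max)
}
```

Under this model, `info,tower_http=debug` lets a DEBUG event with a `tower_http::…` target through while a DEBUG event with target `loquent` is dropped, which is the "noise line without status or latency" effect described above. Switching to `info,tower_http=info` drops both.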

Emission inventory

958 tracing call sites in src/: 356 error! · 316 warn! · 190 info! · 96 debug! + trace!

Heaviest modules: mods/twilio (246 — webhooks log 1–3 lines per event), mods/billing (102), mods/meta (95), mods/assistant (65), mods/plan (50). Field discipline is good (error = %e, org_id, call_sid, …) per .claude/rules/logging.md; no #[instrument], by convention.

02 · The noise problem

Sampled live prod traffic (Railway, last hour). The request stream is dominated by polling and connection churn — none of it tells us anything when healthy, and all of it generates app log lines that ship to Better Stack.

23:45:54  GET   /api/notifications/unread-count  200  47ms      ← client polling, ~2×/5min/client
23:46:32  POST  /twilio/events                   200  32ms      ← every Twilio callback
23:46:47  GET   /api/realtime                    0    53019ms   ← WS reconnect churn, every 25–120s
23:48:12  GET   /api/assistant/ws                0    122355ms  ← assistant WS, same churn
23:56:28  POST  /resend/events                   200  9ms       ← email webhook

Three compounding costs:

  • Signal drowning. A real error! sits between hundreds of identical request lines. Live Tail in Better Stack is unusable without a saved filter.
  • Ingestion spend. Better Stack bills per GB ingested. Request lines + Twilio per-event info! chatter are the bulk of the volume while carrying near-zero diagnostic value when everything is healthy.
  • Alert hygiene debt. Any future "alert on ERROR" rule is only as good as the error stream. Today, recoverable degradations (e.g. the six context-injection failures in twilio_stream_route.rs:301–385 that explicitly continue processing) log at error! — they'd page.

The principle for the fix

A request log line should exist only when it carries information: 5xx → error, 4xx → warn, slow → warn, everything else → debug (visible in dev, absent in prod). Railway's edge already records every request (method/path/status/latency, seen above) — the app doesn't need to duplicate the happy path.

03 · Reliability gaps

Found while auditing — these are not theoretical. Gap #1 fired five times today.

1 · Better Stack flusher drops batches on failure (observed in prod today)

Prod deploy logs show [betterstack] flush error: error sending request for url (https://s2331916.eu-fsn-3.betterstackdata.com/) at 15:39, 16:19, 19:33, 19:37 and 20:41 UTC. In betterstack.rs:90–105, flush() runs batch.clear() on every path — success, HTTP error, network error. Up to 100 log events vanish per failed flush. Separately, try_send at line 202 silently drops events when the 10k channel is full. If an outage coincides with ingest flakiness, the evidence is deleted.

2 · 5xx responses are not centrally logged

AppError → HTTP conversion (error.rs:41–65) maps Database/Io/Http/Internal to 500 without logging. An error is visible only if the handler logged before returning. One missed call site = a silent 500. There are 300+ API endpoints relying on this discipline.

3 · No path from ERROR to a human

No Slack/Discord/email/webhook anywhere in the codebase; no alert rules configured in Better Stack. Today, learning about a prod error requires opening the Better Stack dashboard and looking.

4 · No request correlation

No request-id. When a request emits 4 log lines across service layers, nothing ties them together in Better Stack. Nice-to-have, not urgent — structured fields (call_sid, org_id) partially compensate.

04 · Alerting options

Two viable routes — they compose. The in-app layer is the primary recommendation because today's flusher failures prove the Better Stack path can't be the only wire to a human.

Option | Slack | Discord | Latency | Effort | Fails when…
A · In-app AlertLayer (tracing Layer → webhook POST, mirrors betterstack.rs pattern) | ✓ incoming webhook | ✓ webhook (native) | seconds | M · ~200 lines | app itself is down (no process = no alert)
B · Better Stack alert rules (threshold/anomaly alerts on a saved query, zero code) | ✓ native integration | ✗ not supported (email/Slack/Teams + Uptime escalations only) | ~1–3 min (confirmation window) | S · config only | ingest is failing, exactly what we observed today

Recommended · A as primary

Instant, carries full context (message, target, structured fields), and keeps working when Better Stack ingest hiccups. Discord and Slack share one code path apart from a ~5-line difference in the request body. Panics already route through tracing::error!, so they alert for free. Needs in-layer dedup + rate-limiting so one error loop can't flood the channel.

B as backstop, not primary

Worth configuring anyway (10 minutes, no code): an error-rate threshold alert catches classes of failure the in-app layer can't — sustained error spikes, and "the app is up but wedged". But it's blind during ingest failures and can't reach Discord.

AlertLayer design points

  • Trigger: events at ERROR only (the level audit in Phase 1 makes this trustworthy). Optional ALERT_MIN_LEVEL escape hatch.
  • Dedup: key = (target, message); within a 5-min window send the first, then aggregate ("…and 14 more" in the next flush).
  • Rate limit: hard cap (e.g. 10 sends/min) → overflow becomes a single summary message. Discord webhooks tolerate ~30 req/min; Slack ~1/sec.
  • Transport: same proven shape as betterstack.rs — bounded mpsc + dedicated flusher thread, try_send so logging never blocks a request.
  • Config: ALERT_WEBHOOK_URL + ALERT_WEBHOOK_FORMAT=slack|discord (or auto-detect from the URL host). Absent → layer disabled, dev stays quiet.
  • Payload: level, message, target, env (prod/staging), and the event's structured fields — enough to act without opening Better Stack.
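
The dedup and rate-limit points above could be combined into one gate that the flusher thread consults before each POST. What follows is a minimal sketch under stated assumptions, not the planned implementation: `AlertGate`, `Decision`, and the method names are hypothetical, time is passed in explicitly so the logic can be tested, and the "…and N more" count is carried on the next send for the same key rather than flushed on a timer.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Decision {
    /// Post this alert; `suppressed_before` feeds the "…and N more" suffix.
    Send { suppressed_before: u32 },
    /// Same (target, message) already sent inside the dedup window.
    Duplicate,
    /// Over the per-minute cap; counted for a later summary message.
    RateLimited,
}

struct AlertGate {
    window: Duration,
    max_per_minute: usize,
    // (target, message) -> (window start, duplicates suppressed since then)
    seen: HashMap<(String, String), (Instant, u32)>,
    sent: Vec<Instant>, // timestamps of sends within the last minute
    overflow: u32,      // alerts dropped by the rate cap since the last summary
}

impl AlertGate {
    fn new(window: Duration, max_per_minute: usize) -> Self {
        Self { window, max_per_minute, seen: HashMap::new(), sent: Vec::new(), overflow: 0 }
    }

    fn check(&mut self, target: &str, message: &str, now: Instant) -> Decision {
        let key = (target.to_string(), message.to_string());
        let mut suppressed_before = 0;
        if let Some((start, count)) = self.seen.get_mut(&key) {
            if now.saturating_duration_since(*start) < self.window {
                *count += 1;
                return Decision::Duplicate;
            }
            suppressed_before = *count; // window expired: report what was held back
        }
        self.seen.insert(key, (now, 0));

        // Sliding one-minute window for the hard cap.
        self.sent
            .retain(|t| now.saturating_duration_since(*t) < Duration::from_secs(60));
        if self.sent.len() >= self.max_per_minute {
            self.overflow += 1;
            return Decision::RateLimited;
        }
        self.sent.push(now);
        Decision::Send { suppressed_before }
    }

    /// Reads and resets the overflow count, e.g. to build the summary message.
    fn take_overflow(&mut self) -> u32 {
        std::mem::take(&mut self.overflow)
    }
}
```

With the values from the list, the gate would be built as `AlertGate::new(Duration::from_secs(300), 10)`. The `seen` map grows with distinct messages, so a real layer would also need to evict expired keys.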

05 · Improvement plan

Four phases, independently shippable, in value order. Phases 1 and 2 together are roughly a day of work and deliver the two things asked for: less noise and fast error notifications.

  1. P1 · Kill the noise · S · ~½ day

    Replace the on_response closure with a status-aware logger (5xx → error!, 4xx → warn!, ≥2s → warn!, else debug!) and silence the default on_request noise by dropping tower_http=debug from the prod fallback filter (keep tower_http=info). Skip WS upgrade paths (/api/realtime, /api/assistant/ws) and exclude 401/404 from warn! to avoid scanner noise. Demote the known recoverable error! sites (twilio context-injection, etc.) to warn!. Verify the Railway RUST_LOG value while in there.

  2. P2 · Error alerts → Slack/Discord · M · ~½–1 day

    New src/bases/alerting.rs — the AlertLayer described above, registered in logging::init() beside the Better Stack layer. Env-gated, deduped, rate-limited, ERROR-only. Includes panics automatically via the existing hook. Plus the 10-minute backstop: a Better Stack threshold alert (ERROR count > N in 5 min → Slack/email).

  3. P3 · Pipeline hardening · M · ~1 day

    Three fixes: (a) Better Stack flusher retries with backoff (2 retries, then drop with an explicit drop-count log) instead of batch.clear() on failure; count channel-full drops too. (b) Central 5xx logging — one middleware after the router logs every 5xx with method+path, ending reliance on per-handler discipline (the status-aware P1 logger may already cover this — confirm before building twice). (c) Request-id: tower-http SetRequestIdLayer + put the id on the http_request span so Better Stack groups a request's lines.

  4. P4 · Hygiene & cost (ongoing) · L · incremental

    Audit the 356 error! sites against the rule “error == operation failed AND a human should care” — most churn is in mods/twilio. Decide which per-event info! webhook logs earn INFO vs DEBUG. Check Better Stack ingestion volume before/after P1 to quantify the savings. Consider per-module filter directives (e.g. sea_orm=warn) in the prod fallback.

06 · Code sketches

Shapes, not final code — to make the plan concrete.

P1 · Status-aware request logging main.rs

// Replaces the current on_response closure. Happy path → debug (absent in prod).
// Assumes `use std::time::Duration;` and `use tracing::Span;` are in scope in main.rs.
.on_response(|res: &http::Response<_>, latency: Duration, span: &Span| {
    let status = res.status().as_u16();
    let ms = latency.as_millis() as u64;
    match status {
        500..=599 => tracing::error!(parent: span, status, latency_ms = ms, "request failed"),
        // 401/404 are scanner + auth-expiry noise, so they fall through to the arms below
        400..=499 if status != 401 && status != 404 => {
            tracing::warn!(parent: span, status, latency_ms = ms, "request rejected")
        }
        // ≥2s counts as slow, matching the Phase 1 threshold
        _ if ms >= 2_000 => tracing::warn!(parent: span, status, latency_ms = ms, "slow request"),
        _ => tracing::debug!(parent: span, status, latency_ms = ms, "finished processing request"),
    }
})
// + logging.rs prod fallback: "info,tower_http=info"  (drop the =debug)

P2 · AlertLayer src/bases/alerting.rs (new)

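// Same bones as betterstack.rs: Layer impl + bounded channel + flusher thread.
// Needs: use tracing::{Event, Level, Subscriber};
//        use tracing_subscriber::layer::{Context, Layer};
impl<S: Subscriber> Layer<S> for AlertLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        if *event.metadata().level() != Level::ERROR { return; }
        // try_send never blocks; if the queue is full the alert is dropped
        let _ = self.sender.try_send(AlertEvent::from(event));
    }
}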
// flusher thread: dedup by (target, message) in 5-min window, cap 10 sends/min,
// then POST — Discord: {"content": text} · Slack: {"text": text}
// env: ALERT_WEBHOOK_URL, ALERT_WEBHOOK_FORMAT=discord|slack
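
To make the "~5-line body difference" concrete, here is a dependency-free sketch of format detection and payload construction. It assumes the usual webhook hosts (`hooks.slack.com` for Slack incoming webhooks, `discord.com` or the legacy `discordapp.com` for Discord), which should be verified against the real URLs before relying on auto-detection. `WebhookFormat`, `detect_format`, `json_escape`, and `payload` are hypothetical names, and a real implementation would more likely build the body with serde_json than with a hand-written escaper.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum WebhookFormat { Slack, Discord }

/// Guesses the format from the URL host; None means "set ALERT_WEBHOOK_FORMAT explicitly".
fn detect_format(url: &str) -> Option<WebhookFormat> {
    let host = url.strip_prefix("https://")?.split('/').next()?;
    match host {
        "hooks.slack.com" => Some(WebhookFormat::Slack),
        "discord.com" | "discordapp.com" => Some(WebhookFormat::Discord),
        _ => None,
    }
}

/// Escapes a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be \u-escaped in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// The only per-provider difference: Slack reads "text", Discord reads "content".
fn payload(format: WebhookFormat, text: &str) -> String {
    let key = match format {
        WebhookFormat::Slack => "text",
        WebhookFormat::Discord => "content",
    };
    format!("{{\"{}\":\"{}\"}}", key, json_escape(text))
}
```

For example, `payload(WebhookFormat::Discord, "boom")` yields `{"content":"boom"}`, and the same text with `WebhookFormat::Slack` yields `{"text":"boom"}`.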

P3 · Flusher retry betterstack.rs

// Assumes reqwest::blocking::Client and std::time::Duration are imported in betterstack.rs.
fn flush(client: &Client, url: &str, token: &str, batch: &mut Vec<LogEvent>) {
    for attempt in 0..3u32 {
        match client.post(url).bearer_auth(token).json(&*batch).send() {
            Ok(r) if r.status().is_success() => { batch.clear(); return; }
            // 1 initial try + 2 retries, backing off 500ms then 1000ms
            _ if attempt < 2 => std::thread::sleep(Duration::from_millis(500 << attempt)),
            _ => {}
        }
    }
    eprintln!("[betterstack] dropping {} events after 3 failed flushes", batch.len());
    batch.clear(); // bounded loss, now explicit + counted
}
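
The other half of P3(a), counting channel-full drops, might look like the wrapper below. This is a sketch that assumes a std `sync_channel`; the channel type actually used in betterstack.rs is not confirmed here, and `CountingSender` with its methods is a hypothetical name. The point is only that a failed `try_send` increments a counter the flusher can report, instead of being discarded with `let _ =`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::sync::Arc;

struct CountingSender<T> {
    inner: SyncSender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> CountingSender<T> {
    fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { inner: tx, dropped: Arc::new(AtomicU64::new(0)) }, rx)
    }

    /// Never blocks. Returns true if the item was queued; otherwise counts the drop.
    fn send_or_count(&self, item: T) -> bool {
        match self.inner.try_send(item) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Shared handle so the flusher thread can read the same counter.
    fn counter(&self) -> Arc<AtomicU64> {
        Arc::clone(&self.dropped)
    }

    /// Reads and resets the counter, e.g. once per flush for a drop-count log line.
    fn take_dropped(&self) -> u64 {
        self.dropped.swap(0, Ordering::Relaxed)
    }
}
```

A flusher holding the handle from `counter()` could then print something like "dropped N events (channel full)" alongside the flush-failure message, so both loss paths become visible in the Railway logs.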

07 · Open questions

Decisions needed before implementation. Recommendations pre-marked.

Q1 · Where do error alerts go: Slack or Discord?

  • Slack only: also unlocks Better Stack's native integration as the backstop channel.
  • Discord only: fine for the in-app layer; the Better Stack backstop falls back to email.

Q2 · What triggers an alert?

  • Only HTTP 5xx responses: simpler, but misses background jobs, webhook handlers that swallow errors and return 200, and AI-loop failures.
  • Better Stack threshold only (no in-app layer): zero code, but blind during ingest failures like today's and ~1–3 min slower.

Q3 · Per-request logs in prod: keep any?

  • Keep INFO for all non-polling routes: preserves an in-Better-Stack request trail, but keeps most of the noise and ingestion cost.

Q4 · Is request-id correlation (P3c) worth it now?

  • Defer: org_id/call_sid fields cover most correlation needs today.

Q5 · Verify production env (blocked during audit)