Fail-Open LLM Architecture: Why Your Reviewer Stage Should Never Block a Decision

Author: Sascha

The problem

You added a reviewer LLM to your pipeline for good reasons. Quality matters, and a second pass catches self-contradictions the primary model misses. But the moment you wire a second model into production, you have a new failure surface: what happens when the reviewer itself is slow, errored, or unreachable?

The instinct is to treat the reviewer as a gate — no pass from the reviewer, no decision released. That instinct is wrong.

The pattern: fail-open

When the reviewer fails — rate limit, timeout, network error, bad JSON — the primary decision should pass through unmodified with a logged warning. Not held, not retried-until-exhausted, not queued-for-review. Passed through.

Here's the pattern:


try:
    review = review_decision(decision, market_state, personality)
    if review.verdict == "reject":
        decision = downgrade_to_hold(decision, review.reasons)
    elif review.verdict == "adjust":
        decision = apply_adjustments(decision, review)
    # "approve" falls through unchanged
except Exception as e:
    logger.warning("reviewer failed, pass-through: %s", e)
    # decision proceeds unchanged — this is the spec, not a bug



The except Exception block is the spec, not the bug.
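Pass-through is cleanest when the reviewer call also has a hard deadline, so a hung reviewer degrades the same way as an errored one. A minimal sketch of that wrapper, assuming a thread-pool executor and a None-means-approve-unchanged convention (the timeout value and function names are illustrative, not from the original):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout
import logging

logger = logging.getLogger(__name__)
_executor = ThreadPoolExecutor(max_workers=4)

def review_with_deadline(decision, review_fn, timeout_s=2.0):
    """Run the reviewer with a hard deadline; on any failure, fail open."""
    future = _executor.submit(review_fn, decision)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        future.cancel()  # best effort; a running worker may still finish
        logger.warning("reviewer timed out after %.1fs, pass-through", timeout_s)
        return None  # caller treats None as "approve unchanged"
    except Exception as e:
        logger.warning("reviewer failed, pass-through: %s", e)
        return None
```

With this shape, the caller never distinguishes "reviewer crashed" from "reviewer too slow": both degrade to the same pass-through branch.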

Why this is right

One: you already validated the primary decision. The reviewer is bonus quality, not a prerequisite. If the primary's output passes schema validation, sanity checks, and risk bounds, it's good enough to ship. The reviewer is additive confidence, not permission.
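That local validation gate can be an ordinary chain of checks, no second model required. A sketch of the schema/sanity/risk-bounds sequence (the field names, action vocabulary, and bounds are illustrative assumptions):

```python
def validate_primary(decision: dict) -> list[str]:
    """Return a list of validation errors; an empty list means safe to ship."""
    errors = []
    # Schema: required field present with the right type
    if not isinstance(decision.get("action"), str):
        errors.append("missing or non-string 'action'")
    # Sanity: action drawn from a closed vocabulary
    if decision.get("action") not in {"buy", "sell", "hold"}:
        errors.append("unknown action")
    # Risk bounds: position size capped regardless of model confidence
    size = decision.get("size", 0)
    if not (0 <= size <= 1.0):
        errors.append("size outside [0, 1] bounds")
    return errors
```

If this returns empty, the decision is shippable on its own; the reviewer only ever makes it better.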

Two: fail-closed creates correlated failure. When your model provider has a regional blip, your primary and your reviewer both see elevated latency. If either gates the other, you lose both: primary times out waiting for reviewer, system freezes. Fail-open degrades gracefully to "we shipped what the primary said" instead of "we shipped nothing."

Three: infrastructure timidity compounds. If every stage of your pipeline can block every other stage, your system's reliability becomes the product of every component's reliability. Five stages at 99% each give you 95%. Fail-open pipelines approach the reliability of the most-critical stage alone.
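The compounding is quick to verify in a line of arithmetic:

```python
def chained_reliability(stages):
    """Reliability of a pipeline in which every stage can block the rest:
    the product of each stage's individual reliability."""
    r = 1.0
    for reliability in stages:
        r *= reliability
    return r

print(round(chained_reliability([0.99] * 5), 3))  # five 99% stages → 0.951
```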

When fail-open is wrong

Two cases where you do want fail-closed:

  • Financial transactions. A reviewer that detects "sending $50K to unknown wallet" should block, not warn. But this is a rules engine, not an LLM. Don't put an LLM on the critical path of money movement.
  • Legal or compliance text. GDPR consent flows, medical advice, tax filings. These need human review on errors, not LLM silent-bypass. Queue and escalate.

For everything else — trading signals, content quality, customer service responses, recommendations — fail-open.
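That split is worth encoding as an explicit dispatch rather than scattering it through handlers. A sketch, where the category names and the escalate hook are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

# Categories that must fail closed: hold and escalate on reviewer failure.
FAIL_CLOSED = {"payment", "compliance"}

def on_reviewer_failure(decision, error, escalate):
    """Route a reviewer failure: fail closed for regulated categories,
    fail open (pass-through) for everything else."""
    if decision["category"] in FAIL_CLOSED:
        escalate(decision, error)  # queue for human review; nothing ships
        return None
    logger.warning("reviewer failed, pass-through: %s", error)
    return decision  # shipped exactly as the primary produced it
```

Putting the category set in one place makes the fail-closed surface auditable instead of implicit in each handler's try/except.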

Metrics to watch

If you ship fail-open, track:

  • reviewer_success_rate — fraction of decisions that got reviewed
  • reviewer_adjustment_rate — of those, fraction the reviewer adjusted
  • reviewer_rejection_rate — fraction the reviewer fully rejected
  • reviewer_fallthrough_rate — fraction that bypassed the reviewer due to timeout or error

If reviewer_fallthrough_rate creeps above 5%, your reviewer stack has a reliability problem that is silently degrading quality. Investigate, but don't fix it by wiring in a gate. Fix it by making the reviewer faster or more available.
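All four rates fall out of plain per-decision outcome counters. A sketch, assuming the counter names below and using reviewed decisions as the denominator for the adjustment and rejection rates (matching "of those" above):

```python
from collections import Counter

def reviewer_rates(outcomes: Counter) -> dict:
    """Derive the four reviewer-health rates from outcome counts."""
    total = sum(outcomes.values())
    reviewed = total - outcomes["fallthrough"]  # decisions the reviewer saw
    return {
        "reviewer_success_rate": reviewed / total,
        "reviewer_adjustment_rate": outcomes["adjust"] / reviewed if reviewed else 0.0,
        "reviewer_rejection_rate": outcomes["reject"] / reviewed if reviewed else 0.0,
        "reviewer_fallthrough_rate": outcomes["fallthrough"] / total,
    }

rates = reviewer_rates(Counter(approve=880, adjust=40, reject=10, fallthrough=70))
if rates["reviewer_fallthrough_rate"] > 0.05:  # 70/1000 = 7%
    print("alert: reviewer stack is silently degrading")
```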

Closing

The first principle of fault-tolerant systems: every stage should degrade to the simplest correct behavior when its neighbor fails. For LLM reviewer stages, that's pass-through with a warning. Engineering for that explicit case is what separates production-grade AI systems from demos that fall over under load.



A3E Ecosystem builds AI-native trading and content infrastructure.

 