7UNIT / CASE STUDY 004

AI Systems · Global · 2025

LLM-powered assistants built for real operating constraints.

THE PROBLEM

Most LLM integrations fail the same way. They work in a demo. They break in production. The failure modes are predictable: context window exhaustion, no graceful fallback when the model is uncertain, no human handoff path, no cost visibility, and no tenant isolation in multi-tenant systems. We have built AI assistant systems specifically around these failure modes.

WHAT WE BUILT

A pattern — not a single product. LLM-powered assistant systems designed around the constraints that kill most implementations in production. Applied across multiple client contexts including WhatsApp inquiry agents, document intelligence pipelines, and engineering team coordination agents.

ARCHITECTURE DECISIONS

Session memory without token blowout

Naive implementations keep the full conversation history in context for every API call. At scale this means every conversation eventually hits the context limit and fails. We implement session-scoped memory with selective compression — preserving intent and key facts while managing token consumption.

Rationale: A session that fails at message 40 because the context window is full is not a production system. Memory management is a first-class concern, not an afterthought.

Trade-off accepted: Slightly more complex memory architecture in exchange for conversations that do not fail at scale.
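One way to picture session-scoped memory with selective compression is the minimal sketch below. It is an illustration under stated assumptions, not the production implementation: the `SessionMemory` class, the `budget` and `keep_recent` parameters, the characters-per-token estimate, and the `naive_summarize` stand-in are all hypothetical. In practice the summarizer would more plausibly be a model call that preserves intent and key facts.

```python
from dataclasses import dataclass, field
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real system would use
    # the model provider's tokenizer or token-counting facility instead.
    return max(1, len(text) // 4)


@dataclass
class SessionMemory:
    # (previous summary, turns being dropped) -> new summary. Hypothetical hook.
    summarize: Callable[[str, list[str]], str]
    budget: int = 2000      # assumed token budget for conversation history
    keep_recent: int = 6    # most recent turns kept verbatim (must be >= 1)
    summary: str = ""
    turns: list[str] = field(default_factory=list)

    def add(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")
        self._compress_if_needed()

    def token_count(self) -> int:
        parts = ([self.summary] if self.summary else []) + self.turns
        return sum(estimate_tokens(p) for p in parts)

    def _compress_if_needed(self) -> None:
        # Only compress when over budget and there are older turns to fold in.
        if self.token_count() <= self.budget or len(self.turns) <= self.keep_recent:
            return
        old = self.turns[:-self.keep_recent]
        self.turns = self.turns[-self.keep_recent:]
        self.summary = self.summarize(self.summary, old)

    def context(self) -> list[str]:
        # What would be sent to the model: compressed summary, then recent turns.
        head = [f"summary: {self.summary}"] if self.summary else []
        return head + self.turns


def naive_summarize(previous: str, old_turns: list[str]) -> str:
    # Placeholder summarizer: keeps a prefix of each old turn and caps the
    # overall length. It does not understand intent or facts.
    pieces = ([previous] if previous else []) + [t[:40] for t in old_turns]
    return " | ".join(pieces)[-400:]
```

With this shape, the number of verbatim turns stays bounded regardless of conversation length, and the summary is the only part that carries older content forward.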

Human handoff with full context transfer

When the model reaches its confidence boundary, it hands off to a human — with the full conversation context, the model's last reasoning state, and a suggested next action. The human agent does not start from scratch.

Rationale: A handoff that drops context is not a handoff — it is a restart. The human agent should be able to read the conversation and continue without asking the customer to repeat themselves.

Trade-off accepted: More complex handoff state management in exchange for seamless human escalation.
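The handoff described above can be sketched as a single packet that travels with the escalation. This is a hedged illustration: the `HandoffPacket` fields, the `build_handoff` function, and the numeric `CONFIDENCE_THRESHOLD` are assumptions introduced here. The source does not say how the confidence boundary is measured, so a score between 0 and 1 is assumed purely for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.6  # assumed value, for illustration only


@dataclass(frozen=True)
class HandoffPacket:
    conversation_id: str
    tenant_id: str
    transcript: list[dict]   # full message history, oldest first
    last_reasoning: str      # the model's last reasoning state
    suggested_action: str    # what the model would recommend doing next
    reason: str              # why the handoff was triggered


def build_handoff(
    conversation_id: str,
    tenant_id: str,
    transcript: list[dict],
    confidence: float,
    last_reasoning: str,
    suggested_action: str,
) -> Optional[HandoffPacket]:
    """Return a packet for the human agent, or None if no handoff is needed."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return None
    return HandoffPacket(
        conversation_id=conversation_id,
        tenant_id=tenant_id,
        transcript=list(transcript),  # copy so later turns do not mutate the packet
        last_reasoning=last_reasoning,
        suggested_action=suggested_action,
        reason=f"confidence {confidence:.2f} below {CONFIDENCE_THRESHOLD:.2f}",
    )


def to_json(packet: HandoffPacket) -> str:
    # Serialised form, e.g. for a ticketing or agent-inbox system (hypothetical).
    return json.dumps(asdict(packet))
```

Keeping the transcript, reasoning, and suggested action in one immutable object means the human agent receives everything in a single read, rather than reconstructing it from separate stores.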

Cost attribution per tenant per conversation

In multi-tenant AI systems, token cost must be attributed to the tenant that incurred it. We instrument every LLM call with tenant_id and conversation_id, and monthly cost reports per tenant are built in from the start.

Rationale: Without cost attribution, one high-volume tenant can make the entire system unprofitable without any individual transaction appearing unusual. Attribution from the first API call means cost anomalies are visible immediately.

Trade-off accepted: Additional instrumentation overhead in exchange for full cost visibility per tenant.
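A minimal sketch of per-call attribution and a monthly roll-up follows. Only `tenant_id` and `conversation_id` come from the text above; the `UsageRecord` shape, the `monthly_report` function, and the prices are assumptions. The prices in particular are placeholders and not real provider rates. Token counts are assumed to come from the usage metadata returned with each model response.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal

# Placeholder prices per million tokens. NOT real provider rates.
PRICE_PER_MTOK = {"input": Decimal("3.00"), "output": Decimal("15.00")}


@dataclass(frozen=True)
class UsageRecord:
    tenant_id: str
    conversation_id: str
    input_tokens: int
    output_tokens: int
    at: datetime

    @property
    def cost(self) -> Decimal:
        total = (
            self.input_tokens * PRICE_PER_MTOK["input"]
            + self.output_tokens * PRICE_PER_MTOK["output"]
        )
        return total / Decimal(1_000_000)


def monthly_report(records: list[UsageRecord], year: int, month: int) -> dict[str, Decimal]:
    """Total cost per tenant for one calendar month."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for r in records:
        if (r.at.year, r.at.month) == (year, month):
            totals[r.tenant_id] += r.cost
    return dict(totals)


def conversation_costs(records: list[UsageRecord], tenant_id: str) -> dict[str, Decimal]:
    """Total cost per conversation for one tenant."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for r in records:
        if r.tenant_id == tenant_id:
            totals[r.conversation_id] += r.cost
    return dict(totals)
```

Because every record carries both identifiers, the same data answers "which tenant is expensive" and "which conversation made them expensive" without a second instrumentation pass. `Decimal` is used instead of floats so that small per-call amounts sum without rounding drift.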

Graceful degradation

When the LLM API is unavailable or returns an error, the system degrades gracefully — queue and retry for async workflows, immediate human handoff for synchronous conversations.

Rationale: An AI system that fails hard when the model API is unavailable is not production-grade. Graceful degradation means the business continues to operate even when the AI layer is temporarily unavailable.

Trade-off accepted: More complex fallback logic in exchange for operational continuity under failure conditions.
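The two fallback paths above can be sketched as a small router. This is an illustration under assumptions: `FallbackRouter`, `Outcome`, and the in-memory queue are hypothetical names and structures, and a production system would likely use a durable queue and catch the API client's specific error types rather than a bare `Exception`.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ANSWERED = "answered"
    QUEUED = "queued_for_retry"
    HANDED_OFF = "handed_off_to_human"


@dataclass
class Result:
    outcome: Outcome
    text: str = ""


@dataclass
class FallbackRouter:
    call_model: Callable[[str], str]               # hypothetical model-call hook
    retry_queue: deque = field(default_factory=deque)
    handoffs: list = field(default_factory=list)

    def handle(self, prompt: str, synchronous: bool) -> Result:
        try:
            return Result(Outcome.ANSWERED, self.call_model(prompt))
        except Exception as exc:  # sketch only: narrow this in real code
            if synchronous:
                # A live conversation cannot wait: escalate to a human now.
                self.handoffs.append((prompt, repr(exc)))
                return Result(Outcome.HANDED_OFF)
            # Async work can wait: park it and retry later.
            self.retry_queue.append(prompt)
            return Result(Outcome.QUEUED)

    def drain(self) -> list[Result]:
        """Retry each queued prompt once; failures are re-queued by handle()."""
        pending, self.retry_queue = self.retry_queue, deque()
        return [self.handle(p, synchronous=False) for p in pending]
```

The design choice here is that the caller declares whether the request is synchronous, so the decision between retrying and escalating is made at the failure site rather than scattered across call sites.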

OUTCOMES

WhatsApp inquiry agent: 340+ daily conversations handled without human agent involvement
Cost per conversation: attributed, tracked, and reportable per tenant
Human handoff: always with full context transfer — customer never repeats themselves

STACK

Claude API · FastAPI · PostgreSQL · WhatsApp Business API · Redis
