A SaaS company with a failed LLM prototype engaged Verttx for a project rescue. Nine weeks later the product was live and performing against agreed KPIs.

A B2B SaaS company building an AI-powered contract analysis product for legal operations teams had spent 14 months and $680,000 with a development partner attempting to build an LLM-based feature that could extract key obligations, flag risk clauses, and summarise commercial contracts in plain language. At the point they engaged Verttx, they had a prototype that worked on the 40 contracts used during development, failed unpredictably on real client contracts, hallucinated clause summaries with enough frequency to be unusable in a legal context, and had a response latency of 8-12 seconds per document — four times the product's acceptable threshold.
The company's CEO had given the product team a final 90-day window to produce something deployable before the AI feature roadmap would be restructured entirely. Verttx conducted a two-week technical assessment, agreed a rebuild scope, and had a production system live nine weeks after the initial engagement call. The system processes real client contracts in production, achieving 94% accuracy on obligation extraction and a 340-millisecond average API response latency.
The technical assessment surfaced three distinct root causes behind the prototype's failure. The first was architectural. The previous team had built a single-prompt architecture — sending an entire contract to the LLM in one API call with a large system prompt asking the model to perform all extraction and analysis tasks simultaneously. This approach works adequately on short, well-structured documents. Commercial contracts range from 8 pages to 140 pages, contain highly variable structure, and include clause types and legal terminology that require different extraction strategies. A single-prompt architecture that worked on the 40 development contracts — which had been selected to be representative — performed inconsistently on the real contract population, which was significantly more variable.
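In outline, the failed pattern looks like the sketch below: one call, one general prompt, the whole document. The prompt wording and the `call_llm` stub are illustrative assumptions for this sketch, not the previous vendor's actual code.

```python
# Illustrative reconstruction of the single-prompt anti-pattern. Everything
# here is an assumption made for the sketch, not recovered vendor code.

SYSTEM_PROMPT = """You are a contract analyst. For the contract below:
1. Extract every obligation, its responsible party, and any deadline.
2. Flag risk clauses such as indemnification and liability caps.
3. Summarise the commercial terms in plain language.
Return all three outputs as a single JSON object."""

def call_llm(system_prompt: str, user_content: str) -> str:
    """Stand-in for a chat-completion call to any LLM provider."""
    raise NotImplementedError("wire up your provider's client here")

def analyse_contract(contract_text: str) -> str:
    # One call carries the full document and every task at once; short,
    # regular documents survive this, long and variable contracts do not.
    return call_llm(SYSTEM_PROMPT, contract_text)
```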
The second root cause was the absence of a retrieval layer. The previous system was generating responses entirely from the LLM's parametric knowledge about contract structures, supplemented by the contract text passed in the prompt. When a contract contained unusual clause formulations, jurisdiction-specific language, or industry-specific terminology that differed from the model's training distribution, the model would generate plausible-sounding but factually incorrect summaries rather than flagging uncertainty. In a legal operations context, a confident wrong answer is worse than no answer. The product's early pilot users — in-house legal teams at two mid-sized companies — had caught three instances of incorrect obligation extraction during the pilot period and had withdrawn from the beta programme citing accuracy concerns.
The third root cause was the latency problem. The previous architecture made sequential API calls — processing each section of a contract one after another — with no parallelisation and no caching of intermediate results. A 40-page contract required 12-18 sequential API calls, each waiting for the previous one to complete, producing the 8-12 second response times that made the feature unusable in a legal workflow where users expected document analysis to feel instantaneous.
Verttx's two-week assessment reviewed the existing codebase, the prompt architecture, the evaluation dataset, and the pilot feedback from the two withdrawn beta users. The assessment concluded that the core LLM integration logic — approximately 30% of the existing codebase — was worth preserving with refactoring. The prompt architecture, the evaluation framework, and the document processing pipeline were discarded entirely. The company's CTO received a written assessment documenting exactly what was being kept, what was being discarded, and why — before any rebuild work began and before any new budget was committed.
The rebuild replaced the single-prompt architecture with a decomposed pipeline of six specialised processing stages: document ingestion and structure detection, clause segmentation and classification, per-clause extraction using clause-type-specific prompts, cross-clause dependency mapping, risk flag identification, and plain-language summary generation. Each stage uses a prompt optimised for that specific task rather than a general prompt attempting all tasks simultaneously. Clause-type-specific prompts — developed using examples from the company's own contract library — outperform general prompts on extraction accuracy by 23 percentage points on the held-out evaluation set.
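A minimal sketch of the decomposed pipeline's shape follows. The six stage names mirror the stages described above; the `Clause` dataclass, the prompt registry, and every signature are illustrative assumptions, not the delivered implementation.

```python
# Sketch of the decomposed pipeline: six narrow stages chained together,
# with one prompt per clause type instead of one general prompt.

from dataclasses import dataclass, field

@dataclass
class Clause:
    clause_type: str              # e.g. "indemnification", "termination"
    text: str
    extraction: dict = field(default_factory=dict)

# Clause-type-specific prompts (wording here is invented for the sketch).
CLAUSE_PROMPTS: dict[str, str] = {
    "indemnification": "Extract the indemnifying party, scope, and carve-outs ...",
    "termination": "Extract notice periods, termination triggers, and cure rights ...",
    # ... remaining clause types omitted in this sketch
}

def detect_structure(raw_document: bytes) -> list[str]: ...
def segment_clauses(sections: list[str]) -> list[Clause]: ...
def extract(clause: Clause) -> Clause: ...    # uses CLAUSE_PROMPTS[clause.clause_type]
def map_dependencies(clauses: list[Clause]) -> dict: ...
def flag_risks(clauses: list[Clause], deps: dict) -> list[str]: ...
def summarise(clauses: list[Clause], risks: list[str]) -> str: ...

def analyse(raw_document: bytes) -> str:
    sections = detect_structure(raw_document)
    clauses = segment_clauses(sections)
    clauses = [extract(c) for c in clauses]
    deps = map_dependencies(clauses)
    risks = flag_risks(clauses, deps)
    return summarise(clauses, risks)
```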
A Retrieval-Augmented Generation layer was added between the clause segmentation stage and the extraction stage. The RAG index contains 4,200 annotated clause examples across 34 clause types — indemnification, limitation of liability, IP assignment, termination, force majeure, governing law, and 28 others — drawn from the company's historical contract library and supplemented with publicly available contract databases. When the extraction model encounters a clause, it retrieves the three most similar annotated examples from the index before generating its extraction output. This grounds the model's response in verified examples rather than parametric knowledge, reducing hallucination on unusual clause formulations from a pilot-period rate of approximately 8% to 0.6% in production.
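The retrieval step can be sketched in a few self-contained lines. The toy bag-of-words similarity below stands in for the real embedding index, and the record shapes are assumptions. The grounding pattern is the part that matters: retrieve the k nearest verified examples, then prepend them to the extraction prompt.

```python
# Simplified retrieval step: find the three most similar annotated clause
# examples and build a grounded extraction prompt from them.

from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the production index would use a
    learned embedding model behind a vector store."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_examples(clause_text: str, index: list[dict], k: int = 3) -> list[dict]:
    """Return the k most similar annotated clause examples from the index."""
    query = embed(clause_text)
    ranked = sorted(index, key=lambda ex: cosine(query, embed(ex["text"])), reverse=True)
    return ranked[:k]

def build_extraction_prompt(clause_text: str, examples: list[dict]) -> str:
    """Ground the extraction in retrieved examples, not parametric knowledge."""
    shots = "\n\n".join(
        f"Clause: {ex['text']}\nExtraction: {ex['annotation']}" for ex in examples
    )
    return (
        "Extract the obligations from the clause below, following the format of "
        f"these verified examples.\n\n{shots}\n\nClause: {clause_text}\nExtraction:"
    )
```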
The six pipeline stages were restructured to run in parallel where dependencies allowed. Clause segmentation across all contract sections runs simultaneously rather than sequentially. Extraction across all identified clauses runs in parallel batches of eight, with results assembled after all batch calls complete. Intermediate results are cached so that re-analysis of a previously processed contract — triggered by a user editing a clause flag or requesting a re-summary — reuses cached stage outputs rather than reprocessing the full document. These changes reduced average end-to-end processing time from 8-12 seconds to 340 milliseconds for contracts up to 60 pages, and under 900 milliseconds for the longest contracts in the production document population.
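The batching and caching pattern is straightforward to show in outline. In the sketch below, which assumes a generic async client stub, the batch size of eight comes from the case study; the cache keying and everything else is illustrative.

```python
# Parallel extraction in batches of eight, with a cache of intermediate
# stage outputs so re-analysis skips previously processed clauses.

import asyncio
import hashlib

_stage_cache: dict[str, dict] = {}  # stage outputs, keyed by stage + content hash

def _cache_key(stage: str, payload: str) -> str:
    return stage + ":" + hashlib.sha256(payload.encode()).hexdigest()

async def extract_clause(clause_text: str) -> dict:
    """Stand-in for one per-clause extraction API call."""
    await asyncio.sleep(0.1)  # simulate network latency
    return {"clause": clause_text[:40], "obligations": []}

async def extract_all(clauses: list[str], batch_size: int = 8) -> list[dict]:
    """Run extraction in parallel batches, reusing cached outputs."""
    results: list[dict] = []
    for i in range(0, len(clauses), batch_size):
        batch = clauses[i : i + batch_size]
        pending = [c for c in batch if _cache_key("extract", c) not in _stage_cache]
        fresh = await asyncio.gather(*(extract_clause(c) for c in pending))
        for clause, output in zip(pending, fresh):
            _stage_cache[_cache_key("extract", clause)] = output
        results.extend(_stage_cache[_cache_key("extract", c)] for c in batch)
    return results

# Sixteen clauses finish in roughly two batch round-trips instead of sixteen
# sequential calls; a repeat run returns entirely from cache.
# asyncio.run(extract_all([f"Clause {n} ..." for n in range(16)]))
```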
The system went live in production nine weeks after the initial engagement call. In the first 30 days of production operation, the system processed 2,840 real client contracts submitted by users across the company's paying customer base. Response accuracy on obligation extraction — measured against a gold-standard evaluation set of 200 manually annotated contracts — reached 94%, against a product requirement of 90% and a previous prototype accuracy of 71% on the same evaluation set.
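For readers who want to reproduce a figure like the 94% on their own data, one simple recall-style scoring function is sketched below. The set-of-strings record shape is an assumption, and the production evaluation framework is certainly more involved.

```python
# One simple way to score obligation extraction against gold annotations,
# pooled over an evaluation set.

def obligation_accuracy(predicted: list[set[str]], gold: list[set[str]]) -> float:
    """Pooled fraction of gold-standard obligations the system recovered."""
    recovered = sum(len(p & g) for p, g in zip(predicted, gold))
    total = sum(len(g) for g in gold)
    return recovered / total if total else 0.0

# A perfect match scores 1.0; each missed obligation lowers the pooled score.
assert obligation_accuracy(
    [{"pay within 30 days", "notify breaches within 5 days"}],
    [{"pay within 30 days", "notify breaches within 5 days"}],
) == 1.0
```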
The two beta users who had withdrawn from the pilot during the previous development engagement were re-invited to a production trial. Both converted to paying customers within 30 days. The company relaunched its AI feature publicly four weeks after the production go-live. In the three months following relaunch, the AI contract analysis feature was cited as the primary purchase driver in 34% of new customer acquisition conversations tracked by the sales team — up from 0% during the 14 months the prototype had been in development without a launchable product.
The company's CEO presented the outcome to the board at the quarterly review. The AI feature, which had been a $680,000 liability on the previous vendor's watch, had become the product's primary competitive differentiator within 13 weeks of Verttx's engagement starting. Net revenue retention in the quarter of relaunch improved by 8 percentage points as existing customers upgraded to tiers that included the AI feature.
The complete rebuilt system — the decomposed pipeline architecture, all six processing stages, the RAG index and retrieval infrastructure, the evaluation framework, and every prompt — was transferred to the company's engineering team at handover, giving them full ownership of a production AI system they can maintain, iterate on, and extend without any dependency on Verttx.
"We had burned fourteen months and most of our AI budget on something we couldn't ship. When we came to Verttx we needed someone to tell us honestly whether what we had was salvageable and what it would actually take to get to production. They gave us that honest answer in two weeks and delivered in nine. The two beta users who had left came back and converted." — CEO, B2B SaaS Company
Production system live nine weeks after engagement. 94% obligation extraction accuracy against a product requirement of 90% and a previous prototype rate of 71%. Average API response latency of 340 milliseconds, down from 8-12 seconds. Hallucination rate fell from 8% to 0.6%. Both withdrawn beta users converted to paying customers within 30 days of relaunch. The AI feature was cited as the primary purchase driver in 34% of new customer acquisition conversations in the three months following launch. Net revenue retention improved 8 percentage points in the relaunch quarter.