AI voice agent: how to guarantee its reliability through QA
Consultant product builder data & AI at Converteo, Antoine supports organizations in designing and deploying complex digital solutions. Expert in agentic AI, agile project management, and product quality, he works at the intersection of business needs and technical developments. His background, from data analyst to product owner on AI projects in production, has forged his conviction: what differentiates a successful AI from an expensive gadget is its ability to deliver in real-world conditions.
Key takeaways
- QA is not optional in an AI project: it is what makes the difference between an impressive demo and a product that your users permanently trust.
- Testing in real-world conditions changes everything: a voice agent that performs well in the lab can collapse when faced with street noise. Field testing is not a luxury, it is a necessity.
- Post-launch supervision is just as important as pre-launch testing: with generative AI, the model’s behavior evolves. Only continuous monitoring allows you to keep control over it.
If one part of the product manager or product owner role is routinely underestimated, it is QA. Yet this expertise is central: it is what guarantees a reliable product that gets adopted and generates lasting user satisfaction.
When the product in question is a Voice-to-Voice agent powered by generative AI, potential sources of error multiply at every layer of the system. Deploying generative AI in production is good. Doing it with rigor, security, and measurable results is better. At Sosh, the launch of Sharlie, a Voice-to-Voice solution integrated into the web and mobile applications, relied on a demanding discipline: Business Performance Quality (QPM). Here is a look back at an approach that turns a technological promise into a concrete customer service.
A multi-agent architecture designed for performance
Sharlie is not a simple voice chatbot. Behind its fluid experience lies a multi-agent system orchestrated by an LLM and connected in real time to Orange’s information systems via APIs.
The key to its reliability? A clear segmentation of responsibilities: each agent covers a precise domain (commerce, self-care, support) with dedicated instructions for each tool. This architecture makes it possible to isolate critical flows (subscribing to an option, consulting an invoice) and apply a rigorous business logic to every step.
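To make the segmentation concrete, here is a minimal sketch of a domain-to-agent registry in which each agent can only call its own tools. The agent names mirror the three domains mentioned above, but the tool names (`subscribe_option`, `get_invoice`, `open_ticket`) and the `route` and `can_call` helpers are illustrative assumptions, not Sharlie's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    tools: tuple[str, ...]  # tools this agent is allowed to call


# Hypothetical registry: one agent per business domain, each with its own tools.
AGENTS = {
    "commerce": Agent("commerce", ("subscribe_option",)),
    "self_care": Agent("self_care", ("get_invoice",)),
    "support": Agent("support", ("open_ticket",)),
}


def route(domain: str) -> Agent:
    """Return the agent responsible for a domain, or fail loudly."""
    try:
        return AGENTS[domain]
    except KeyError:
        raise ValueError(f"no agent for domain {domain!r}") from None


def can_call(agent: Agent, tool: str) -> bool:
    """An agent may only call tools declared in its own scope."""
    return tool in agent.tools
```

Keeping the tool list on the agent itself means a critical flow such as subscribing to an option can be tested, and restricted, in isolation from the other domains.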
A testing strategy that goes beyond the laboratory
Achieving an industrial level of performance cannot be improvised. The QPM relies on an End-to-End (E2E) testing strategy structured into five complementary categories, combining manual execution and automation.
1. Functional testing: from the first word to the business transaction
Functional tests form the foundation of the QPM. They cover the entire customer journey, from capturing the spoken intent to confirming the business transaction in the information systems. Two areas are systematically covered:
- End-to-End (E2E) testing: each agent is subjected to scenarios built around real user intents. The objective is to validate the fluidity of the journey, to ensure that no step is left stranded, and that resolution is complete.
- Validation of the feedback loop: the accuracy of responses and the quality of post-conversation summaries are evaluated continuously. They are enriched by direct user feedback, notably through ratings collected at the end of the journey to refine the model at each iteration.
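One way to check that a journey is not left stranded is to compare the steps a scenario expects with the steps actually observed in a conversation trace. The sketch below assumes traces are available as ordered lists of step names; the step names and the `first_stranded_step` helper are hypothetical.

```python
from typing import Optional


def first_stranded_step(expected: list[str], trace: list[str]) -> Optional[str]:
    """Return the first expected step that never appears, in order, in the trace.

    Extra steps in the trace (for example a clarification question) are allowed.
    Returns None when every expected step was reached in the expected order.
    """
    remaining = iter(trace)
    for step in expected:
        # `in` on an iterator consumes it up to the match, which enforces ordering.
        if step not in remaining:
            return step
    return None


# Hypothetical scenario for an "invoice consultation" intent.
EXPECTED = ["intent_detected", "invoice_fetched", "answer_spoken"]
```

A scenario passes when the function returns `None`; otherwise the returned step names the point where the journey broke.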
2. API integrations: zero tolerance for data errors
Each voice interaction can lead to a real business transaction: subscription, cancellation, or contract modification. This is why API integrations were tested to the strictest standard, targeting two critical risks:
- Data hallucinations (GET): customer information and displayed offers must be strictly accurate. Any approximation by the AI on factual data is unacceptable in a business context.
- Transactional reliability (POST/DELETE): any action requested by voice must immediately result in a correct update to the customer account. Tests systematically verify the consistency between what the AI confirms orally and what is actually recorded in the systems.
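The consistency check between what the agent says and what the system records can be sketched as a simple comparison. This example assumes the spoken confirmation has already been extracted into a structured form and that the account state is available as a dictionary. The `Confirmation` fields and the account layout are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confirmation:
    """What the agent told the customer (hypothetical structure)."""
    customer_id: str
    option: str
    active: bool


def check_consistency(spoken: Confirmation, account: dict) -> list[str]:
    """Return the mismatches between the spoken confirmation and the account."""
    errors = []
    if account.get("customer_id") != spoken.customer_id:
        errors.append("customer_id mismatch")
    recorded = account.get("options", {}).get(spoken.option)
    if recorded is None:
        errors.append(f"option {spoken.option!r} not found in account")
    elif recorded != spoken.active:
        errors.append(f"option {spoken.option!r}: spoken={spoken.active}, recorded={recorded}")
    return errors
```

An empty list means the oral confirmation and the recorded state agree. Any entry is a test failure, since the risk here is the agent confirming a change that did not actually happen.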
3. Security & ethics: foolproof safeguards
A conversational AI inevitably exposes an attack surface. To protect Sharlie, a two-pronged security strategy was deployed:
- Compliance and guardrails: the AI is trained to gracefully decline any out-of-scope request (politics, weather, personal advice) and stay focused on the Sosh universe. This strict perimeter protects both the user and the brand.
- Robustness campaigns (bug bounty): bug hunters were mobilized to test Sharlie’s resistance to the most common attacks, such as prompt leaking (extraction of system instructions) and system prompt bypass (attempting to break the bot’s persona). Beyond technical security, these tests also aim to protect users’ sensitive data. Each campaign feeds a cycle of continuous improvement of the AI’s directives.
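As an illustration of how prompt leaking can be detected automatically, the sketch below flags a response that contains either a canary marker planted in the system prompt or a long verbatim run of words from it. The canary technique, the prompt text, and the `min_words` threshold are assumptions for the example, and this is not a description of the checks used in the bug bounty campaigns.

```python
CANARY = "canary-7f3a"  # hypothetical marker planted in the system prompt
SYSTEM_PROMPT = (
    f"[{CANARY}] You are a voice assistant. "
    "Only answer questions about mobile plans and invoices."
)


def leaks_prompt(response: str, system_prompt: str, canary: str, min_words: int = 6) -> bool:
    """Flag a response that exposes the canary or quotes the prompt verbatim."""
    text = " ".join(response.lower().split())
    if canary.lower() in text:
        return True
    words = system_prompt.lower().split()
    # Slide a window of `min_words` consecutive prompt words over the response.
    for i in range(len(words) - min_words + 1):
        if " ".join(words[i:i + min_words]) in text:
            return True
    return False
```

A check like this only catches verbatim leaks. Paraphrased leaks would need a semantic comparison, which is outside the scope of this sketch.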
4. Field testing: when the AI holds up in real-world conditions
Voice-to-Voice poses a frequently underestimated challenge: ambient noise. Field testing sessions in real-world conditions (street noise, traffic, music) made it possible to evaluate two critical aspects:
- Audio robustness: Sharlie’s ability to isolate the customer’s voice in any sound environment, regardless of the device used (iOS, Android, web).
- Voice Activity Detection (VAD): these tests were guided by one key indicator, stop latency, meaning the time the AI takes to interrupt itself once a human voice is detected. A technical detail that makes all the difference in the naturalness of the exchange.
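Stop latency can be computed from two timestamps per interruption: when the user started speaking and when the agent's audio actually stopped. The sketch below assumes those timestamps are logged in milliseconds; the `BargeIn` structure and the budget passed to `over_budget` are hypothetical, and no target value is given in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BargeIn:
    """One interruption event (hypothetical log structure)."""
    user_speech_start_ms: int
    agent_audio_stop_ms: int


def stop_latency_ms(event: BargeIn) -> int:
    """Time between the user starting to speak and the agent going silent."""
    latency = event.agent_audio_stop_ms - event.user_speech_start_ms
    if latency < 0:
        raise ValueError("agent stopped before the user started speaking")
    return latency


def over_budget(events: list[BargeIn], budget_ms: int) -> list[BargeIn]:
    """Return the interruptions where the agent kept talking for too long."""
    return [e for e in events if stop_latency_ms(e) > budget_ms]
```

Running the same check on recordings made in quiet and noisy environments makes it possible to see how much ambient noise degrades interruption handling.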
5. Performance & load: holding up at scale
A voice agent that performs well under normal conditions must also withstand peak usage. Two indicators were closely monitored:
- Latency (TTFB, time to first byte): the AI’s response time is measured continuously to guarantee a natural and fluid conversation. The immediate interruption of the stream as soon as a user speaks again is a non-negotiable criterion.
- Load testing: simulations validated Sharlie’s capacity to process several hundred simultaneous conversations without degradation of the quality of service.
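The two indicators can be combined in a small load-test harness that opens many conversations at once and summarizes time to first byte as percentiles. In the sketch below the agent is replaced by a stand-in coroutine that simply sleeps, so `fake_agent_first_byte`, the delay, and the concurrency level are all assumptions. A real test would stream audio to the deployed service instead.

```python
import asyncio
import statistics
import time


async def fake_agent_first_byte(delay_s: float) -> None:
    """Stand-in for the real agent: waits, then 'emits' its first byte."""
    await asyncio.sleep(delay_s)


async def measure_ttfb(delay_s: float) -> float:
    """Return the time to first byte of one simulated conversation, in ms."""
    start = time.perf_counter()
    await fake_agent_first_byte(delay_s)
    return (time.perf_counter() - start) * 1000


async def run_load(concurrency: int, delay_s: float = 0.01) -> list[float]:
    """Start `concurrency` conversations at once and collect their TTFB."""
    return await asyncio.gather(*(measure_ttfb(delay_s) for _ in range(concurrency)))


def summarize(ttfb_ms: list[float]) -> dict[str, float]:
    """Median, 95th percentile, and worst case of the measured latencies."""
    cuts = statistics.quantiles(ttfb_ms, n=100, method="inclusive")
    return {"p50": statistics.median(ttfb_ms), "p95": cuts[94], "max": max(ttfb_ms)}
```

For example, `summarize(asyncio.run(run_load(200)))` gives a latency profile for 200 simultaneous simulated conversations. Tracking p95 rather than the average helps surface the slow responses that users notice most.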
Continuous monitoring: AI as a judge of AI
The QPM does not stop on launch day. Thanks to a dedicated supervision tool, every conversation is analyzed in real time using an “LLM as a judge” approach, where one AI evaluates the quality of another. Five dimensions are monitored continuously:
- Technical reliability: detection of infinite loops and tool failures.
- Relational quality: empathy, tone, ability to rephrase.
- Business efficiency: complete resolution of the customer request.
- Commercial strategy: relevance of offers and clarity of responses.
- Security: resistance to hijacking attempts.
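On the supervision side, the judge model's verdict still needs to be validated before it feeds a dashboard or an alert. The sketch below assumes the judge returns a JSON object with one integer score from 1 to 5 per dimension; the key names, the scale, and the review threshold are illustrative assumptions, and the call to the judge model itself is deliberately left out.

```python
import json

# The five monitored dimensions, as hypothetical JSON keys.
DIMENSIONS = (
    "technical_reliability",
    "relational_quality",
    "business_efficiency",
    "commercial_strategy",
    "security",
)


def parse_verdict(raw: str) -> dict[str, int]:
    """Validate the judge's JSON output: one integer score (1-5) per dimension."""
    data = json.loads(raw)
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores


def needs_review(scores: dict[str, int], threshold: int = 3) -> list[str]:
    """Return the dimensions scored below the threshold, in a stable order."""
    return [dim for dim in DIMENSIONS if scores[dim] < threshold]
```

Rejecting malformed verdicts matters because the judge is itself a probabilistic model: a missing or out-of-range score should be treated as a monitoring failure, not as a silent pass.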
Result: hundreds of conversations handled per day, with complete confidence
Business Performance Quality is not a constraint, it is a value driver. By combining a robust multi-agent architecture, real-world tests covering five critical dimensions, and automated post-launch supervision, Sharlie is now capable of managing 500 daily conversations with reliability, fluidity, and safety.
That is what it means to transform a technological innovation into a customer experience that delivers on its promises.
Because generative AI is intrinsically probabilistic (the model never guarantees a deterministic result), QA deserves more attention than ever. It is what makes the difference between a product that impresses in a demo and a product your users truly trust.