AI voice agent: how to guarantee its reliability through QA

Article Agentic 04.06.2026
By Antoine Margueritte

A data & AI product builder consultant at Converteo, Antoine supports organizations in designing and deploying complex digital solutions. An expert in agentic AI, agile project management, and product quality, he works at the intersection of business needs and technical development. His career path, from data analyst to product owner on AI projects in production, has forged one conviction: what separates a successful AI from an expensive gadget is its ability to deliver in real-world conditions.

Key takeaways

  • QA is not optional in an AI project: it is what makes the difference between an impressive demo and a product that your users trust over the long term.
  • Testing in real-world conditions changes everything: a voice agent that performs well in the lab can collapse when faced with street noise. Field testing is not a luxury, it is a necessity.
  • Post-launch supervision is just as important as pre-launch testing: with generative AI, the model’s behavior evolves over time. Only continuous monitoring lets you keep it under control.

If one part of the product manager's or product owner's role is often underestimated, it is QA. Yet this expertise is central: it is what guarantees a reliable product that gets adopted and generates lasting user satisfaction.

When the product in question is a Voice-to-Voice agent powered by generative AI, potential sources of error multiply at every layer of the system. Deploying generative AI in production is good. Doing it with rigor, security, and measurable results is better. At Sosh, the launch of Sharlie, a Voice-to-Voice solution integrated into the web and mobile applications, relied on a demanding discipline: Business Performance Quality (QPM). Here is a look back at an approach that turns a technological promise into a concrete customer service.

A multi-agent architecture designed for performance

Sharlie is not a simple voice chatbot. Behind its fluid experience lies a multi-agent system orchestrated by an LLM and connected in real time to Orange’s information systems via APIs.

The key to its reliability? A clear segmentation of responsibilities: each agent covers a precise domain (commerce, self-care, support) with dedicated instructions for each tool. This architecture makes it possible to isolate critical flows (subscribing to an option, consulting an invoice) and to apply rigorous business logic at every step.
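The segmentation of responsibilities described above can be pictured as a small registry. The Python sketch below is purely illustrative: the `Agent` class, the tool names, and the `route` and `can_call` helpers are assumptions made for the example, not Sharlie's actual implementation, in which routing is handled by an LLM orchestrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """One specialised agent: a domain, its instructions, and the tools it may call."""
    domain: str
    instructions: str
    tools: tuple[str, ...] = ()


# Hypothetical registry: the domains come from the article, the tool names are invented.
AGENTS = {
    "commerce": Agent("commerce", "Handle offers and options.", ("subscribe_option",)),
    "self-care": Agent("self-care", "Handle account consultation.", ("get_invoice",)),
    "support": Agent("support", "Handle troubleshooting.", ("open_ticket",)),
}


def route(domain: str) -> Agent:
    """Return the agent responsible for a domain, or fail loudly if none exists."""
    try:
        return AGENTS[domain]
    except KeyError:
        raise ValueError(f"no agent covers domain {domain!r}") from None


def can_call(agent: Agent, tool: str) -> bool:
    """An agent may only call the tools declared for its own domain."""
    return tool in agent.tools
```

Keeping the tool list per agent is what makes a critical flow testable in isolation: a test can assert that the support agent can never trigger a subscription.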

A testing strategy that goes beyond the laboratory

An industrial level of performance cannot be improvised. QPM relies on an End-to-End (E2E) testing strategy structured into five complementary categories, combining manual execution and automation.

1. Functional testing: from the first word to the business transaction

Functional tests form the foundation of QPM. They cover the entire customer journey, from capturing the spoken intent to confirming the business transaction in the information systems. Two areas are systematically covered:

  • End-to-End (E2E) testing: each agent is subjected to scenarios built around real user intents. The objective is to validate the fluidity of the journey and to ensure that no step is left hanging and that the request is fully resolved.
  • Validation of the feedback loop: the accuracy of responses and the quality of post-conversation summaries are evaluated continuously. This evaluation is enriched by direct user feedback, notably ratings collected at the end of the journey, which help refine the model at each iteration.
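An E2E scenario of the kind described above can be expressed as an intent plus the steps the journey must reach. The following is a minimal sketch under stated assumptions: the `Scenario` structure, the step names, and the `evaluate` function are hypothetical, and the observed steps would in practice come from the agent's conversation logs.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """A real user intent plus the journey steps that must all be reached."""
    intent: str
    expected_steps: list[str]


def evaluate(scenario: Scenario, observed_steps: list[str], resolved: bool) -> dict:
    """Report which expected steps were never observed and whether the run passes."""
    # Presence check only: this sketch does not verify the order of the steps.
    missing = [s for s in scenario.expected_steps if s not in observed_steps]
    return {
        "intent": scenario.intent,
        "missing_steps": missing,
        "passed": resolved and not missing,
    }
```

A run passes only when every expected step was reached and the request was marked as resolved, which mirrors the two objectives stated in the first bullet.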

2. API integrations: zero tolerance for data errors

Each voice interaction can lead to a real business transaction: subscription, cancellation, or contract modification. This is why API integrations were tested to the strictest standard, targeting two critical risks:

  • Data hallucinations (GET): customer information and displayed offers must be strictly accurate. Any approximation by the AI on factual data is unacceptable in a business context.
  • Transactional reliability (POST/DELETE): any vocal action must immediately result in a correct update in the customer account. Tests systematically verify the consistency between what the AI confirms orally and what is actually recorded in the systems.
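The consistency check between what the AI confirms orally and what is recorded can be sketched as follows. This is an illustration only: `FakeAccountAPI` is an in-memory stand-in invented for the example, and its method names are assumptions, not Orange's real API.

```python
class FakeAccountAPI:
    """In-memory stand-in for the real information system (hypothetical)."""

    def __init__(self) -> None:
        self._options: dict[str, set[str]] = {}

    def post_option(self, customer_id: str, option: str) -> None:
        """Simulate a POST that subscribes the customer to an option."""
        self._options.setdefault(customer_id, set()).add(option)

    def delete_option(self, customer_id: str, option: str) -> None:
        """Simulate a DELETE that cancels an option."""
        self._options.get(customer_id, set()).discard(option)

    def get_options(self, customer_id: str) -> set[str]:
        """Simulate a GET that returns the options currently recorded."""
        return set(self._options.get(customer_id, set()))


def confirmation_matches_record(
    api: FakeAccountAPI, customer_id: str, confirmed_options: list[str]
) -> bool:
    """True when what the agent said it did equals what the system stores."""
    return api.get_options(customer_id) == set(confirmed_options)
```

In a real campaign, `confirmed_options` would be extracted from the agent's spoken confirmation, and the comparison would run against a test account after each transactional scenario.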

3. Security & ethics: foolproof safeguards

A conversational AI inevitably exposes an attack surface. To protect Sharlie, a two-pronged security strategy was deployed:

  • Compliance and guardrails: the AI is trained to gracefully decline any out-of-scope request (politics, weather, personal advice) and stay focused on the Sosh universe. This strict perimeter protects both the user and the brand.
  • Robustness campaigns (bug bounty): bug hunters were mobilized to test Sharlie’s resistance to the most common attacks, such as prompt leaking (extraction of system instructions) and system prompt bypass (attempting to break the bot’s persona). Beyond technical security, these tests also aim to protect users’ sensitive data. Each campaign feeds a cycle of continuous improvement of the AI’s directives.
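Prompt-leaking tests like those mentioned above are often automated with a canary token: a unique marker is planted in the system prompt under test, and any reply containing it counts as a leak. The sketch below illustrates that general technique; it is not a description of the bug bounty campaigns run on Sharlie, and the `CANARY` value and the `agent` callable are hypothetical.

```python
from typing import Callable

# Hypothetical marker planted in the system prompt of the agent under test.
CANARY = "CANARY-7f3a"


def leaks_system_prompt(reply: str, canary: str = CANARY) -> bool:
    """A reply leaks if it contains the canary planted in the system prompt."""
    return canary.lower() in reply.lower()


def run_campaign(agent: Callable[[str], str], attacks: list[str]) -> list[str]:
    """Send each attack to the agent and return the attacks that caused a leak."""
    return [attack for attack in attacks if leaks_system_prompt(agent(attack))]
```

Each attack returned by `run_campaign` becomes a regression case, which is one way to feed the continuous improvement cycle of the AI's directives.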

4. Field testing: when the AI holds up in real-world conditions

Voice-to-Voice poses a frequently underestimated challenge: ambient noise. Field testing sessions in real-world conditions (street noise, traffic, music) made it possible to evaluate two critical aspects:

  • Audio robustness: Sharlie’s ability to isolate the customer’s voice in any sound environment, regardless of the device used (iOS, Android, web).
  • Voice Activity Detection (VAD): one key indicator guided these tests, stop latency, meaning the AI’s ability to interrupt itself instantly as soon as a human voice is detected. It is a technical detail that makes all the difference to how natural the exchange feels.
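Stop latency can be computed from two timestamps per interruption: the moment the VAD detects the user's voice and the moment the agent's audio actually stops. The sketch below assumes those timestamps are available in session logs; the `BargeIn` structure and the budget value passed to `summarize` are assumptions, since the article does not give a target figure.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class BargeIn:
    """Timestamps, in seconds, for one user interruption (hypothetical log format)."""
    user_speech_detected_at: float
    agent_audio_stopped_at: float


def stop_latency_ms(event: BargeIn) -> float:
    """Delay between detecting the user's voice and silencing the agent."""
    return (event.agent_audio_stopped_at - event.user_speech_detected_at) * 1000.0


def summarize(events: list[BargeIn], budget_ms: float) -> dict:
    """Aggregate stop latencies and compare the worst case with a chosen budget."""
    latencies = [stop_latency_ms(e) for e in events]
    return {
        "median_ms": median(latencies),
        "worst_ms": max(latencies),
        "within_budget": max(latencies) <= budget_ms,
    }
```

Comparing the same summary across quiet and noisy recordings shows whether street noise or music degrades the agent's ability to yield the floor.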

5. Performance & load: holding up at scale

A voice agent that performs well under normal conditions must also withstand peak usage. Two indicators were closely monitored:

  • Latency (TTFB, time to first byte): the AI’s response time is measured continuously to guarantee a natural and fluid conversation. Immediately interrupting the stream as soon as the user speaks again is a non-negotiable criterion.
  • Load testing: simulations validated Sharlie’s capacity to process several hundred simultaneous conversations without degradation of the quality of service.
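A load simulation of this kind can be prototyped with concurrent sessions and a percentile report on TTFB. The sketch below is a toy model: `fake_session` replaces a real conversation with a fixed delay, and the function names and session counts are assumptions, not the tooling used for Sharlie.

```python
import asyncio
import math
import time


async def fake_session(delay_s: float) -> float:
    """Simulate one conversation turn and return its time to first byte in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(delay_s)  # stands in for the wait until the first audio byte
    return time.perf_counter() - start


async def run_load(n_sessions: int, delay_s: float) -> list[float]:
    """Run n sessions concurrently and collect their TTFB values."""
    return await asyncio.gather(*(fake_session(delay_s) for _ in range(n_sessions)))


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, enough for a quick latency report."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

A typical call would be `percentile(asyncio.run(run_load(300, 0.2)), 95)`. Tracking a high percentile rather than the average is a common choice here, because a few slow turns are enough to break the feeling of a natural conversation.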

Continuous monitoring: AI as a judge of AI

QPM does not stop on launch day. Thanks to a dedicated supervision tool, every conversation is analyzed in real time using an “LLM-as-a-judge” approach, in which one AI evaluates the quality of another. Five dimensions are monitored continuously:

  1. Technical reliability: detection of infinite loops and tool failures.
  2. Relational quality: empathy, tone, ability to rephrase.
  3. Business efficiency: complete resolution of the customer request.
  4. Commercial strategy: relevance of offers and clarity of responses.
  5. Security: resistance to hijacking attempts.
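The five dimensions above can be turned into a simple scoring contract between the supervision tool and the judge model. The sketch below is an assumption-laden illustration: the 1 to 5 scale, the `alert_below` threshold, and `stub_judge` (which replaces a real LLM call with a trivial loop detector) are all hypothetical.

```python
from typing import Callable

DIMENSIONS = (
    "technical_reliability",
    "relational_quality",
    "business_efficiency",
    "commercial_strategy",
    "security",
)

# A judge maps (transcript, dimension) to a score from 1 (poor) to 5 (excellent).
Judge = Callable[[str, str], int]


def evaluate_conversation(transcript: str, judge: Judge, alert_below: int = 3) -> dict:
    """Score one conversation on every dimension and flag the weak ones."""
    scores = {dim: judge(transcript, dim) for dim in DIMENSIONS}
    for dim, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned {score} for {dim}, expected 1 to 5")
    flagged = [dim for dim, score in scores.items() if score < alert_below]
    return {"scores": scores, "flagged": flagged}


def stub_judge(transcript: str, dimension: str) -> int:
    """Placeholder for a real LLM call: treats a repeated consecutive line as a loop."""
    lines = transcript.splitlines()
    looped = any(a == b for a, b in zip(lines, lines[1:]))
    if dimension == "technical_reliability" and looped:
        return 1
    return 4
```

Validating the judge's output range before aggregating matters because the judge is itself a probabilistic model and can return malformed scores.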

Result: hundreds of conversations processed per day, in total confidence

Business Performance Quality is not a constraint, it is a value driver. By combining a robust multi-agent architecture, real-world tests covering five critical dimensions, and automated post-launch supervision, Sharlie is now capable of managing 500 daily conversations with reliability, fluidity, and safety.

That is what it means to transform a technological innovation into a customer experience that delivers on its promises.

Because AI operates in an intrinsically probabilistic way (the model never guarantees a deterministic result), QA deserves redoubled attention. More than ever, it is what makes the difference between a product that impresses in a demo and a product your users truly trust.
