Agentic AI Projects: From Deterministic to Probabilistic
Key Takeaways
- Traditional digital systems are deterministic: a test that passes once generally passes the next time. Generative AI systems are probabilistic: they can work 99 times and break on the 100th. This doesn’t just change the tooling; it changes the entire mindset.
- Three elements define this shift: upgrading data governance for LLMs, running a massive volume of probabilistic tests, and pairing deterministic logs with LLM-based evaluations to monitor the agent.
- The right mindset for decision-makers: explicitly accept, at the executive sponsor level, that you cannot guarantee 100% accuracy. This alignment is what makes these projects possible.
For two decades, digital teams have operated in a deterministic world. A tested web page works. A validated workflow does the exact same thing twice. A bug is a bug: it reproduces, you patch it, and it disappears. This predictability shaped our methods, tools, roles, and trade-offs.
Generative AI systems shatter this logic. Asking the same model the same question twice can yield two different answers. An agent that nails a response in 99 cases might derail on the 100th. This isn’t a bug you patch; it is a structural property of the technology.
Shifting from deterministic to probabilistic is not just an IT issue. It demands a fundamental mindset shift across the entire organization: engineering, business units, governance, and executive sponsors.
Upgrading Data Governance
This is the most underestimated topic, and likely the most critical one long-term. In most large enterprises, data governance was built for digital channels. Content is structured to be displayed on websites, read by human users, and indexed by search engines.
Tomorrow, these same datasets will be read, interpreted, and relayed by LLMs. And what is legible to a human is not necessarily legible to a model. The example from the Orange team working on the “Sharlie” project is eye-opening: on certain questions regarding mobile plans covering Morocco or Switzerland, the agent failed half the time. The root cause wasn’t the model’s lack of intelligence. It was that the input data was formatted for a website: a large block of text that made perfect sense to a human, but completely confused the agent.
The lesson applies broadly. The entire data chain (APIs, knowledge bases, business content, product specs) must be progressively overhauled so that its outputs are readable by an LLM by design. That raises concrete questions:
- How do you structure a table?
- How do you explicitly define an eligibility condition?
- How do you format a pricing sheet hierarchy?
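One possible way to approach these questions is to store each fact as structured data and render it as short lines, one fact per line, before handing it to the model. The sketch below is illustrative only: the plan name, price, coverage rules, and field names are all hypothetical and are not taken from the Sharlie project.

```python
from dataclasses import dataclass, field


@dataclass
class RoamingRule:
    """One explicit eligibility condition: whether a plan covers some countries."""
    countries: list[str]
    included: bool
    note: str = ""


@dataclass
class Plan:
    name: str
    monthly_price_eur: float
    roaming: list[RoamingRule] = field(default_factory=list)


def render_for_llm(plan: Plan) -> str:
    """Flatten a plan into short, unambiguous lines, one fact per line."""
    lines = [
        f"Plan: {plan.name}",
        f"Monthly price (EUR): {plan.monthly_price_eur:.2f}",
    ]
    for rule in plan.roaming:
        status = "included" if rule.included else "not included"
        for country in rule.countries:
            line = f"Roaming in {country}: {status}"
            if rule.note:
                line += f" ({rule.note})"
            lines.append(line)
    return "\n".join(lines)


# Hypothetical catalogue entry, invented purely for this example.
example_plan = Plan(
    name="Example 20GB",
    monthly_price_eur=19.99,
    roaming=[
        RoamingRule(countries=["Switzerland"], included=True, note="data capped at 10 GB"),
        RoamingRule(countries=["Morocco"], included=False),
    ],
)

print(render_for_llm(example_plan))
```

The design choice here is that every country gets its own explicit line, so the model never has to infer coverage from a paragraph written for a web page.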
Industrializing Validation Through Volume
In a deterministic world, you test a few representative cases and a handful of edge cases. In a probabilistic world, that is no longer enough. If an error only occurs once every hundred or thousand times, manual QA is very unlikely to catch it.
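The arithmetic behind this can be sketched in a few lines. The example assumes every test run is independent and fails with the same fixed probability, which is a simplification; the 95% confidence target and the figure of 20 manual runs are illustrative choices, not values from the project.

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Probability of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of independent runs that surfaces at least one
    failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


for rate in (0.01, 0.001):
    print(
        f"failure rate {rate}: "
        f"20 runs detect it with probability {detection_probability(rate, 20):.3f}, "
        f"{runs_needed(rate)} runs needed for 95% confidence"
    )
```

Under these assumptions, 20 manual runs have roughly an 18% chance of surfacing a 1-in-100 failure, while about 300 runs are needed to reach 95%, and about 3,000 for a 1-in-1,000 failure.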
You have to industrialize the generation and evaluation of conversations. In practice, this requires two complementary layers: virtual customers that automatically replay massive volumes of diverse conversations, and an “LLM-as-a-Judge” that grades response quality against explicit criteria.
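A minimal sketch of these two layers follows. Everything in it is a hypothetical stand-in: `stub_agent` and `stub_judge` replace what would be model calls in a real system, and the question templates are invented. It shows only the shape of the loop, not a production harness.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reason: str


def stub_agent(question: str) -> str:
    """Hypothetical agent: a real one would call the deployed voice agent."""
    return f"Answer to: {question}"


def stub_judge(question: str, answer: str) -> Verdict:
    """Hypothetical judge: a real one would be an LLM prompted with
    explicit grading criteria. Here we only check the answer echoes the topic."""
    on_topic = question in answer
    return Verdict(passed=on_topic, reason="on topic" if on_topic else "off topic")


def virtual_customer_questions(
    templates: list[str], countries: list[str], n: int, seed: int = 0
) -> list[str]:
    """Generate n varied questions by sampling templates and slot values."""
    rng = random.Random(seed)
    return [rng.choice(templates).format(country=rng.choice(countries)) for _ in range(n)]


def run_campaign(
    questions: list[str],
    agent: Callable[[str], str],
    judge: Callable[[str, str], Verdict],
) -> tuple[float, list[tuple[str, str]]]:
    """Replay every question; return the pass rate and the failing cases."""
    failures = []
    for q in questions:
        verdict = judge(q, agent(q))
        if not verdict.passed:
            failures.append((q, verdict.reason))
    pass_rate = 1.0 - len(failures) / len(questions)
    return pass_rate, failures


templates = ["Does my plan cover {country}?", "How much is roaming in {country}?"]
countries = ["Morocco", "Switzerland"]
questions = virtual_customer_questions(templates, countries, n=500)
pass_rate, failures = run_campaign(questions, stub_agent, stub_judge)
print(f"pass rate: {pass_rate:.1%}, failures: {len(failures)}")
```

Keeping the agent and the judge as plain callables means either one can be swapped for a real model client without touching the campaign loop.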
For a voice agent, this infrastructure becomes the equivalent of an A/B testing suite for a website: a continuous monitoring tool that flags deviations before they ever reach real customers.
The challenge isn’t just technical; it’s organizational:
- Who defines the grading criteria?
- Who interprets the results?
- Who makes the call to halt a deployment when the quality score drops?
These responsibilities either did not exist in legacy digital organizations or existed with far less intensity.
Monitoring by Combining Deterministic and Probabilistic Data
Observability is also changing fundamentally. On a traditional app, we rely on logs: structured, deterministic, and easy to query. You know an API call failed, latency crossed a threshold, or a specific error fired at a specific time.
For a voice agent, these logs still exist and remain critical. But they tell you nothing about the quality of the interaction. A call can be technically flawless (no errors, low latency, clean transcription) yet a complete disaster for the customer (off-topic answer, inappropriate tone, factual hallucination).
Modern monitoring therefore blends two data sources:
- Deterministic logs to track platform health (latency, error rates, tool calls, endless agent loops).
- LLM evaluations to track qualitative health (answer relevance, brand tone adherence, factual accuracy, implicit customer satisfaction).
When you cross-reference these two sources, you get a much clearer picture of what is actually happening. A latency spike that coincides with a quality drop points to a systemic issue, not just an isolated hallucination. This combination is what makes observability truly actionable.
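The cross-referencing can be sketched as a simple join on a call identifier followed by a per-window classification. The record layouts, thresholds, and sample values below are all hypothetical; a real pipeline would read from a log store and an evaluation service.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical platform logs: (call_id, minute_bucket, latency_ms).
logs = [
    ("c1", 0, 400), ("c2", 0, 450),
    ("c3", 1, 2200), ("c4", 1, 2500),
    ("c5", 2, 420), ("c6", 2, 480),
]

# Hypothetical LLM evaluation scores: (call_id, quality in [0, 1]).
evals = [
    ("c1", 0.9), ("c2", 0.85),
    ("c3", 0.4), ("c4", 0.5),
    ("c5", 0.3), ("c6", 0.9),
]


def classify_windows(logs, evals, latency_limit_ms=1500, quality_floor=0.7):
    """Join logs and evaluations by call_id, then label each minute bucket."""
    scores = dict(evals)
    buckets = defaultdict(list)
    for call_id, minute, latency in logs:
        if call_id in scores:  # keep only calls present in both sources
            buckets[minute].append((latency, scores[call_id]))

    labels = {}
    for minute in sorted(buckets):
        rows = buckets[minute]
        slow = mean(lat for lat, _ in rows) > latency_limit_ms
        poor = mean(score for _, score in rows) < quality_floor
        if slow and poor:
            labels[minute] = "systemic"       # latency and quality degrade together
        elif poor:
            labels[minute] = "quality_only"   # platform healthy, answers are not
        elif slow:
            labels[minute] = "latency_only"
        else:
            labels[minute] = "healthy"
    return labels


print(classify_windows(logs, evals))
```

In this invented sample, the middle window is the one where both signals degrade together, while the last window shows a quality drop on a healthy platform, which is the case logs alone would miss.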
The Decisive Role of Executive Sponsors
Everything above can be engineered. But it will fail if the project’s executive sponsors haven’t internalized this paradigm shift.
The classic trap: demanding the project team guarantee 100% accuracy. In a probabilistic world, that is impossible, and promising it is dishonest. Conversely, explicitly agreeing at the governance level that certain critical topics demand 100% accuracy (and must therefore bypass the AI or include a hard safety net), while other areas tolerate a managed margin of error, is the absolute prerequisite for moving forward.
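One way to make such a governance decision concrete is to encode it as an explicit routing policy. The topics and route names below are hypothetical examples, not a recommendation for any specific business; the point is only that the risk profile becomes a reviewable artifact rather than an implicit expectation.

```python
from enum import Enum


class Route(Enum):
    DETERMINISTIC = "deterministic"          # scripted answer or human handoff, no generation
    AI_WITH_SAFETY_NET = "ai_with_safety_net"  # generated answer, checked before delivery
    AI = "ai"                                # generated answer, managed margin of error


# Hypothetical policy, assumed to be agreed at the governance level.
POLICY = {
    "contract_termination": Route.DETERMINISTIC,
    "billing_dispute": Route.AI_WITH_SAFETY_NET,
    "plan_information": Route.AI,
}


def route_for(topic: str) -> Route:
    """Return the agreed route for a topic; unknown topics take the safest path."""
    return POLICY.get(topic, Route.DETERMINISTIC)


for topic in ("plan_information", "contract_termination", "something_new"):
    print(topic, "->", route_for(topic).value)
```

Defaulting unknown topics to the deterministic route is a deliberately conservative assumption: anything the sponsors have not explicitly accepted risk for stays out of the generative path.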
This executive alignment is often the dividing line between projects that hit production and those that die as POCs. The goal is not perfect alignment, which is unrealistic with technology this new, but real alignment on the risk profile required to go live.
Moving from deterministic to probabilistic is not a problem you can dump solely on the IT team. It impacts data, validation methods, observability, and governance. It requires business units to embrace a degree of uncertainty, and engineering teams to make that uncertainty manageable through new tooling.
Teams that grasp the scale of this shift are building the right muscle memory today. They know that a successful voice agent project isn’t one where everything works perfectly on day one; it’s a project where you can quickly detect what is breaking and fix it just as quickly. That capability, far more than the specific foundation model you choose, is what will drive competitive advantage in the years ahead.