Data Foundations: How to Structure Your Data for LLMs

Article Data Governance 05.06.2026
By David Guede

David Guede, Data / AI Partner and Agentic AI Expert at Converteo, specializes in deploying production-ready AI architectures. He helps enterprises industrialize intelligent agents, transforming complex business processes into real drivers of performance and competitive advantage.

Key Takeaways

  • Your content, product listings, and APIs were structured for humans browsing a website. When an LLM reads them to answer a customer, that structure breaks down.
  • Fixing this is a foundational project that starts with one question: which data do we need to restructure so an LLM can actually use it?

Imagine the scene. A team builds a high-quality voice agent on a modern platform, using a top-tier model. It works perfectly in the demo. You push it to QA, and suddenly, on very specific topics, the agent hallucinates half the time.

The first instinct is to blame the model. It’s too limited. Poorly prompted. Badly configured. Sometimes, that’s partly true. But experience shows that in most cases, the real issue lies elsewhere: it comes from the input data you feed the model.

This anecdote comes directly from the “Sharlie” project at Orange. When asked about mobile plans covering destinations like Morocco or Switzerland, the agent consistently stumbled. Digging deeper, the team didn’t uncover a model flaw. They discovered that the data feeding the model was designed for a website—and it became completely illegible the second it was parsed by an LLM.

This is probably one of the most enduring challenges generative AI poses to any organization. And one of the least discussed.

Why Your Data Isn’t Built for LLMs

Over the years, enterprise content has been structured for a single purpose: to be displayed to humans, on screens, within digital journeys. That has driven some fundamental design choices.

Content optimized for the human eye. A well-crafted pricing sheet highlights the attractive price, uses design for visual hierarchy, and tucks T&Cs into a “good to know” sidebar. To a human looking at the page, everything makes sense. But when an LLM receives that exact same content via text or API, the visual hierarchy vanishes, and the information turns into a jumbled mess.
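A minimal Python sketch of that loss, assuming a made-up pricing snippet (the markup, class names, and prices are invented for illustration). It strips the tags the way a naive text extraction would, and the headline price and the sidebar condition come out as one undifferentiated string.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and discards every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def flatten(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page fragment: the big headline and the small-print sidebar
# are only distinguishable through markup and styling.
PAGE = """
<div class="offer">
  <h1 class="price">Mobile plan 20 GB: 9.99 EUR/month</h1>
  <aside class="good-to-know"><small>Price valid for 12 months, then 15.99 EUR/month.</small></aside>
</div>
"""

flat = flatten(PAGE)
print(flat)
```

Nothing in the flattened string says which sentence was the headline and which was the fine print. That signal lived in the layout.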

APIs built for apps, not models. An API powering a website usually returns a massive block of data, structured so the front-end can pick and choose what to display. That’s highly efficient for a developer. It’s a trap for an LLM, which has to reverse-engineer context out of a format built for a completely different system.
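To make the trap concrete, here is a hedged sketch built on an invented front-end payload. The field names, zone codes, and the `describe_for_model` helper are all assumptions for illustration, not any real API. The point is that part of the meaning (what "Z3" stands for, which fields are display-only) lives in front-end code rather than in the data itself.

```python
# Hypothetical front-end payload: every field name, code, and value is invented.
FRONTEND_PAYLOAD = {
    "offerId": "PLAN_20",
    "price": {"amt": 999, "cur": "EUR", "promoMonths": 12, "amtAfter": 1599},
    "zones": ["Z1", "Z3"],
    "ui": {"badge": "BEST", "showRoamingBlock": False},
}

# In this sketch the meaning of the zone codes lives in front-end code,
# so a model reading the raw payload never sees it.
FRONTEND_ZONE_LABELS = {"Z1": "France", "Z3": "European Union"}


def describe_for_model(payload: dict, zone_labels: dict) -> str:
    """Render the payload as self-describing lines, dropping display-only fields."""
    price = payload["price"]
    zones = ", ".join(zone_labels[code] for code in payload["zones"])
    return "\n".join([
        f"Offer: {payload['offerId']}",
        f"Price: {price['amt'] / 100:.2f} {price['cur']} per month "
        f"for the first {price['promoMonths']} months, "
        f"then {price['amtAfter'] / 100:.2f} {price['cur']} per month",
        f"Usable in: {zones}",
    ])


context = describe_for_model(FRONTEND_PAYLOAD, FRONTEND_ZONE_LABELS)
print(context)
```

The raw payload is perfectly usable by a front-end that already knows the conventions. The rendered version states them, which is what a model needs.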

Implicit business rules. A lot of content relies on unspoken corporate conventions. “When we say ‘international,’ that excludes Europe.” “If a condition isn’t listed, standard terms apply by default.” Human reps learn these rules over years of experience. An LLM doesn’t know them, and it cannot guess them.
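One possible way to write such conventions down, sketched in Python. The glossary wording, the `STANDARD_TERMS` values, and both helper functions are hypothetical. The idea is only that a term's meaning and the "default applies" rule become data the agent can be given, instead of knowledge held in people's heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    term: str
    meaning: str


# Hypothetical entries that spell out the two conventions quoted above.
GLOSSARY = [
    GlossaryEntry("international", "destinations outside Europe; European destinations are excluded"),
    GlossaryEntry("unlisted condition", "when a condition is not listed for an offer, standard terms apply"),
]

# Hypothetical standard terms; real values would come from the business.
STANDARD_TERMS = {"commitment": "no minimum commitment"}


def glossary_block(entries) -> str:
    """Format the glossary as plain lines that can be placed in the agent's context."""
    return "\n".join(f'- "{entry.term}" means: {entry.meaning}' for entry in entries)


def resolve_condition(offer_conditions: dict, key: str) -> str:
    """Return the condition for an offer and say explicitly when the default was used."""
    if key in offer_conditions:
        return f"{key}: {offer_conditions[key]} (listed for this offer)"
    return f"{key}: {STANDARD_TERMS[key]} (not listed, standard terms apply)"


print(glossary_block(GLOSSARY))
print(resolve_condition({}, "commitment"))
print(resolve_condition({"commitment": "12 months"}, "commitment"))
```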

Add these three factors together, and it’s easy to see why even a highly capable model can fail on specific questions: it is correctly executing the prompt using raw material that was never designed for it.

3 Initiatives to Structure Your Data for LLMs

Three core initiatives emerge.

  1. Rethinking specific APIs for LLMs. This doesn’t mean rebuilding everything from scratch. It means identifying the most critical APIs for your conversational agents and building a structured variant specifically designed for model ingestion. A pricing API, for instance, should explicitly state what is included, what is excluded, in which countries, and under what conditions, in a format the model can parse instantly.
  2. Restructuring business content. Product specs, pricing policies, terms and conditions: anything that serves as the agent’s knowledge base must be progressively rewritten to serve two readers, human and machine. This is heavy lifting that involves content, product, and customer service teams.
  3. Making implicit rules explicit. Everything passed down through word-of-mouth between human reps, everything that is “just known” across the company without being documented, must be put in black and white to become actionable for an agent. This exercise adds massive value far beyond AI: it frequently exposes inconsistencies, outdated policies, and siloed knowledge.
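The first initiative, a pricing variant built for model ingestion, could look something like the sketch below. The `PlanForModel` and `DestinationCoverage` types are hypothetical, and the plan, prices, and coverage details are invented placeholders, not taken from any real offer. What matters is the shape: for each destination, what is included, what is excluded, and under which conditions, with nothing left to inference.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DestinationCoverage:
    country: str
    included: list    # services covered by the plan in this country
    excluded: list    # services explicitly not covered
    conditions: list  # limits or caveats, written as full sentences


@dataclass
class PlanForModel:
    plan_name: str
    monthly_price_eur: float
    destinations: list  # list of DestinationCoverage


def to_model_json(plan: PlanForModel) -> str:
    """Serialize the plan, nested records included, as readable JSON."""
    return json.dumps(asdict(plan), indent=2, ensure_ascii=False)


# Invented example data, for illustration only.
PLAN = PlanForModel(
    plan_name="Example Plan 20 GB",
    monthly_price_eur=9.99,
    destinations=[
        DestinationCoverage(
            country="Switzerland",
            included=["calls", "SMS"],
            excluded=["mobile data"],
            conditions=["Calls and SMS are included only to numbers in the home country."],
        ),
        DestinationCoverage(
            country="Morocco",
            included=[],
            excluded=["calls", "SMS", "mobile data"],
            conditions=["All usage is billed per use."],
        ),
    ],
)

print(to_model_json(PLAN))
```

An empty `included` list is deliberate: stating that nothing is covered is different from saying nothing at all, which is the gap a model tends to fill with a guess.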

Data Foundations: A Governance Issue

This initiative goes way beyond the IT department. It strikes at the core of enterprise data governance.

Who is responsible for ensuring a product listing is optimized for LLM usage? Today, usually no one. The listing exists for the website, and ownership ends there. Tomorrow, we will need clear data owners capable of arbitrating between the “human” version and the “machine” version of the same content, and of keeping both in sync over time.

This also requires new workflows. When a product feature changes or a pricing condition updates, how do you guarantee that change propagates everywhere—including into the databases your AI agents rely on? Without that process, you accumulate silent drift that will inevitably trigger hallucinations in production.
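One simple way to catch that drift, sketched under assumptions: each machine-facing record stores a fingerprint of the human-facing text it was derived from, and a periodic check flags records whose source has changed since. The content ids, the record layout, and the `find_stale` helper are hypothetical. A real workflow would plug into whatever CMS and agent knowledge base are actually in place.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and case, so cosmetic edits do not count."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_stale(human_pages: dict, machine_records: dict) -> list:
    """Return ids whose machine version is missing or was built from older source text."""
    stale = []
    for content_id, page_text in human_pages.items():
        record = machine_records.get(content_id)
        if record is None or record["source_fingerprint"] != fingerprint(page_text):
            stale.append(content_id)
    return sorted(stale)


# Hypothetical content: ids and wording are invented.
old_page = "Plan 20 GB: 9.99 EUR per month."
human_pages = {
    "plan-20": "Plan 20 GB: 10.99 EUR per month.",  # price changed on the website
    "plan-50": "Plan 50 GB: 19.99 EUR per month.",
}
machine_records = {
    "plan-20": {"source_fingerprint": fingerprint(old_page)},
    "plan-50": {"source_fingerprint": fingerprint(human_pages["plan-50"])},
}

stale_ids = find_stale(human_pages, machine_records)
print(stale_ids)
```

A check like this does not fix the machine version. It only turns silent drift into a visible task for the data owner.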

This is one of the most structural challenges AI brings to large enterprises today, demanding a gradual overhaul of how content is structured and governed.

The good news is you don’t have to boil the ocean. You can start with the content your production agents hit the most, measure the impact, and scale from there. The bad news is you cannot avoid it: as long as the raw material is flawed, no model—no matter how advanced—will perform well. And until you tackle this data debt, your voice agent’s ROI will remain capped… not by the tech, but by the data.
