Data Foundations: How to Structure Your Data for LLMs
David Guede, Data / AI Partner and Agentic AI Expert at Converteo, specializes in deploying production-ready AI architectures. He helps enterprises industrialize intelligent agents, transforming complex business processes into real drivers of performance and competitive advantage.
Key Takeaways
- Your content, product listings, and APIs were structured for humans browsing a website. When an LLM reads them to answer a customer, they break down.
- Fixing this is a foundational project that starts with one question: which data do we need to restructure so an LLM can actually use it?
Imagine the scene. A team builds a high-quality voice agent on a modern platform, using a top-tier model. It works perfectly in the demo. You push it to QA, and suddenly, on very specific topics, the agent hallucinates half the time.
The first instinct is to blame the model. It’s too limited. Poorly prompted. Badly configured. Sometimes, that’s partly true. But experience shows that in most cases, the real issue lies elsewhere: it comes from the input data you feed the model.
This anecdote comes directly from the “Sharlie” project at Orange. When asked about mobile plans covering destinations like Morocco or Switzerland, the agent consistently stumbled. Digging deeper, the team didn’t uncover a model flaw. They discovered that the data feeding the model was designed for a website—and it became completely illegible the second it was parsed by an LLM.
This is probably one of the most enduring challenges generative AI poses to any organization. And one of the least discussed.
Why Your Data Isn’t Built for LLMs
Over the years, enterprise content has been structured for a single purpose: to be displayed to humans, on screens, within digital journeys. That has driven some fundamental design choices.
Content optimized for the human eye. A well-crafted pricing sheet highlights the attractive price, uses design for visual hierarchy, and tucks T&Cs into a “good to know” sidebar. To a human looking at the page, everything makes sense. But when an LLM receives that exact same content via text or API, the visual hierarchy vanishes, and the information turns into a jumbled mess.
APIs built for apps, not models. An API powering a website usually returns a massive block of data, structured so the front-end can pick and choose what to display. That’s highly efficient for a developer. It’s a trap for an LLM, which has to reverse-engineer context out of a format built for a completely different system.
Implicit business rules. A lot of content relies on unspoken corporate conventions. “When we say ‘international,’ that excludes Europe.” “If a condition isn’t listed, standard terms apply by default.” Human reps learn these rules over years of experience. An LLM doesn’t know them, and it cannot guess them.
Add these three factors together, and it’s easy to see why even a highly capable model can fail on specific questions: it is correctly executing the prompt using raw material that was never designed for it.
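To make the first factor concrete, here is a minimal sketch of what happens when a page built for the eye is reduced to the plain text a model might receive. The HTML snippet, the plan name, and the price are invented for illustration, and the `flatten` helper is a hypothetical stand-in for whatever extraction step sits in front of your agent.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and ignores every tag, class, and style."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace between tags
            self.chunks.append(text)


def flatten(html: str) -> str:
    """Return the page as one line of text, as a naive extractor would."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical pricing page: the exclusion lives in a small sidebar.
PAGE = """
<div class="offer">
  <h1>Travel Plan</h1>
  <p class="price">19.99/month</p>
  <aside class="good-to-know"><small>Roaming not included outside Europe</small></aside>
</div>
"""

flat = flatten(PAGE)
print(flat)
```

On screen, the heading, the price, and the sidebar each carry a different weight. In the flattened string they are three fragments in a row, and nothing tells the model that the last one is a restriction on the first two.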
3 Initiatives to Structure Your Data for LLMs
Fixing this comes down to three core initiatives.
- Rethinking specific APIs for LLMs. This doesn’t mean rebuilding everything from scratch. It means identifying the most critical APIs for your conversational agents and building a structured variant specifically designed for model ingestion. A pricing API, for instance, should explicitly state what is included, what is excluded, in which countries, and under what conditions, in a format the model can parse instantly.
- Restructuring business content. Product specs, pricing policies, terms and conditions: anything serving as the agent’s knowledge base must be progressively rewritten to serve two readers, the human and the machine. This is heavy lifting that impacts content, product, and customer service teams.
- Making implicit rules explicit. Everything passed down through word-of-mouth between human reps, everything that is “just known” across the company without being documented, must be put in black and white to become actionable for an agent. This exercise adds massive value far beyond AI: it frequently exposes inconsistencies, outdated policies, and siloed knowledge.
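As one possible shape for such a structured variant, the sketch below models a pricing record that states inclusions, exclusions, and conditions outright, then renders it as labeled lines for a model. Every name here (`PlanCoverage`, `render_for_model`, the field names, the placeholder countries) is an assumption made for illustration, not a description of any real API. The condition about "international" reuses the article's own example of an unspoken convention.

```python
from dataclasses import dataclass, field


@dataclass
class PlanCoverage:
    """Hypothetical LLM-oriented view of one pricing offer."""

    plan_name: str
    monthly_price_eur: float
    included_countries: list[str]
    excluded_countries: list[str]
    # Rules that human reps "just know", written down explicitly.
    conditions: list[str] = field(default_factory=list)


def render_for_model(plan: PlanCoverage) -> str:
    """Render the record as labeled lines, one fact per line."""
    lines = [
        f"Plan: {plan.plan_name}",
        f"Monthly price (EUR): {plan.monthly_price_eur:.2f}",
        "Included countries: " + ", ".join(plan.included_countries),
        "Excluded countries: " + ", ".join(plan.excluded_countries),
    ]
    lines.extend(f"Condition: {rule}" for rule in plan.conditions)
    return "\n".join(lines)


# Placeholder values only; real coverage data must come from the business.
plan = PlanCoverage(
    plan_name="Travel Plan",
    monthly_price_eur=19.99,
    included_countries=["Country A"],
    excluded_countries=["Country B"],
    conditions=["In this catalog, 'international' excludes Europe."],
)

context = render_for_model(plan)
print(context)
```

The design choice worth noting is that exclusions are a field of their own rather than something the model has to infer from a missing entry, which is exactly the kind of default a human rep would apply without thinking.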
Data Foundations: A Governance Issue
This initiative goes well beyond the IT department. It strikes at the core of enterprise data governance.
Who is responsible for ensuring a product listing is optimized for LLM usage? Today, usually no one. The listing exists for the website, and ownership stops there. Tomorrow, we will need clear data owners capable of arbitrating between the “human” version and the “machine” version of the same content, and of keeping both in sync over time.
This also requires new workflows. When a product feature changes or a pricing condition updates, how do you guarantee that change propagates everywhere—including into the databases your AI agents rely on? Without that process, you accumulate silent drift that will inevitably trigger hallucinations in production.
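One simple way to catch that drift, sketched here under assumptions, is to store with each machine version a fingerprint of the human source it was derived from, then flag every record whose source has since changed. The function names, the `source_fingerprint` key, and the sample content are all hypothetical. A real workflow would plug into your CMS or product catalog rather than in-memory dictionaries.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so reflowing is not a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_stale(human_sources: dict[str, str],
               machine_records: dict[str, dict]) -> list[str]:
    """Return ids whose machine version is missing or built from older text."""
    stale = []
    for content_id, source_text in human_sources.items():
        record = machine_records.get(content_id)
        if record is None or record["source_fingerprint"] != fingerprint(source_text):
            stale.append(content_id)
    return sorted(stale)


# Hypothetical content: the machine version was built from the current text.
human = {"plan-travel": "Price 19.99. Roaming excluded outside Europe."}
machine = {"plan-travel": {"source_fingerprint": fingerprint(human["plan-travel"])}}
in_sync = find_stale(human, machine)

# The website copy is updated, but the agent's knowledge base is not.
human["plan-travel"] = "Price 21.99. Roaming excluded outside Europe."
after_update = find_stale(human, machine)
print(in_sync, after_update)
```

A check like this does not fix the content. It only tells the data owner which machine versions need a review after a change on the human side.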
This is one of the deepest structural challenges AI brings to large enterprises today, and it demands a gradual overhaul of how content is structured and governed.
The good news is you don’t have to boil the ocean. You can start with the content your production agents hit the most, measure the impact, and scale from there. The bad news is you cannot avoid it: as long as the raw material is flawed, no model—no matter how advanced—will perform well. And until you tackle this data debt, your voice agent’s ROI will remain capped… not by the tech, but by the data.