Join our host Ben Lorica and Douwe Kiela, cofounder of Contextual AI and author of the first paper on RAG, to find out why RAG remains as relevant as ever. No matter what you call it, retrieval is at the heart of generative AI. Find out why, and how to build effective RAG-based systems.
About the Generative AI in the Real World podcast: In 2023, ChatGPT put AI on everyone's agenda. In 2025, the challenge is turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.
Check out other episodes of this podcast on the O'Reilly learning platform.
Timestamps
- 0:00: Introduction to Douwe Kiela, cofounder and CEO of Contextual AI.
- 0:25: Today's topic is RAG. With frontier models advertising huge context windows, many developers wonder whether RAG is becoming obsolete. What's your take?
- 1:03: We have a blog post: isragdeadyet.com. If something keeps getting pronounced dead, it will never die. Long-context models solve a similar problem to RAG: how to get the relevant information into the language model. But it's wasteful to use the full context all the time. If you want to know who the headmaster is in Harry Potter, do you have to read all the books?
- 2:04: What will probably work best is RAG plus long-context models. The real solution is to use RAG, find as much relevant information as you can, and put it into the language model. The dichotomy between RAG and long context isn't a real thing.
- 2:48: One of the main issues may be that RAG systems are annoying to build, and long-context systems are easy. But if you can make RAG easy too, it's much more efficient.
- 3:07: Reasoning models make it even worse in terms of cost and latency. And if you're talking about something with a lot of usage and high repetition, it doesn't make sense.
- 3:39: You've been talking about RAG 2.0, which seems natural: emphasize systems over models. I've long warned people that RAG is a complicated system to build because there are so many knobs to turn. Few developers have the skills to systematically turn those knobs. Can you unpack what RAG 2.0 means for teams building AI applications?
- 4:22: The language model is only a small part of a much bigger system. If the system doesn't work, you can have an amazing language model and it's still not going to get the right answer. If you start from that observation, you can think of RAG as a system in which all the model components can be optimized together.
- 5:40: What you're describing is similar to what other parts of AI are trying to do: an end-to-end system. How early in the pipeline does your vision start?
- 6:07: We have two core concepts. One is a data store: that's really extraction, where we do layout segmentation. We collate all of that information and chunk it, store it in the data store, and then the agents sit on top of the data store. The agents use a mixture of retrievers, followed by a reranker and a grounded language model. (A minimal pipeline sketch appears after the timestamps.)
- 7:02: What about embeddings? Are they automatically chosen? If you go to Hugging Face, there are, like, 10,000 embeddings.
- 7:15: We save you a lot of that effort. Opinionated orchestration is one way to think about it.
- 7:31: Two years ago, when RAG started becoming mainstream, a lot of developers focused on chunking. We had rules of thumb and shared stories. This eliminates a lot of that trial and error.
- 8:06: We basically have two APIs: one for ingestion and one for querying. Querying is contextualized in your data, which we've ingested. (A client sketch of this two-call design appears after the timestamps.)
- 8:25: One thing that's underestimated is document parsing. A lot of people overfocus on embedding and chunking. Try to find a PDF extraction library for Python. There are so many of them, and you can't tell which ones are good. They're all terrible.
- 8:54: We have our stand-alone component APIs. Our document parser is available separately. Some areas, like finance, have extremely complex layouts. Nothing off the shelf works, so we had to roll our own solution. Since we know this will be used for RAG, we process the document to make it maximally useful. We don't just extract raw information. We also extract the document hierarchy. That's extremely relevant as metadata when you're doing retrieval. (A hierarchy-extraction sketch appears after the timestamps.)
- 10:11: There are open source libraries. What drove you to build your own, which I assume also encompasses OCR?
- 10:45: It encompasses OCR; it has VLMs, complex layout segmentation, different extraction models. It's a very complex system. Open source systems are good for getting started, but you need to build for production, not for the demo. You need to make it work on a million PDFs. We see a lot of projects die on the way to productization.
- 12:15: It's not just a question of data extraction; there's structure inside these documents that you can leverage. A lot of people early on were focused on chunking. My intuition was that extraction was the key.
- 12:48: If your information extraction is bad, you can chunk all you want and it won't do anything. Then you can embed all you want, but that won't do anything.
- 13:27: What are you using for scale? Ray?
- 13:32: For scale, we're just using our own systems. Everything is Kubernetes under the hood.
- 13:52: In the early part of the pipeline, what structures are you looking for? You mention hierarchy. People are also excited about knowledge graphs. Can you extract graph information?
- 14:12: GraphRAG is an interesting concept. In our experience, it doesn't make a big difference if you do GraphRAG the way the original paper proposes, which is essentially data augmentation. With Neo4j, you can generate queries in a query language, which is essentially text-to-SQL.
- 15:08: It presupposes you have a decent knowledge graph.
- 15:17: And that you have a decent text-to-query language model. That's structured retrieval. You have to first turn your unstructured data into structured data.
- 15:43: I wanted to talk about retrieval itself. Is retrieval still a big deal?
- 16:07: It's the hard problem. The way we solve it is still using a hybrid: a mixture of retrievers. There are different retrieval modalities you can choose. At the first stage, you want to cast a wide net. Then you put that into the reranker, and the rerankers do all the smart stuff. You want fast first-stage retrieval, and you rerank after that. It makes a big difference to give your reranker instructions. You might want to tell it to favor recency. If the CEO wrote it, I want to prioritize that. Or I want it to look at data hierarchies. You need some rules to capture how you want to rank data. (The pipeline sketch after the timestamps illustrates instruction-aware reranking.)
- 17:56: Your retrieval step is complex. How does it affect latency? And how does it affect explainability and transparency?
- 18:17: You have observability on all of these stages. In terms of latency, it's not that bad because you narrow the funnel gradually. Latency is one of many parameters.
- 18:52: One of the things a lot of people don't understand is that RAG doesn't completely protect you from hallucination. You can give the language model all the relevant information, but the language model might still be opinionated. What's your solution to hallucination?
- 19:37: A general-purpose language model needs to satisfy many different constraints. It needs to be able to hallucinate; it needs to be able to talk about things that aren't in the ground-truth context. With RAG you don't want that. We've taken open source base models and trained them to be grounded in the context only. The language models are very good at saying, "I don't know." That's really important. Our model can't talk about anything it doesn't have context on. We call it our grounded language model (GLM). (A prompt-level approximation appears after the timestamps.)
- 20:37: Two things have happened in recent months: reasoning and multimodality.
- 20:54: Both are super important for RAG in general. I'm very happy that multimodality is finally getting the attention it deserves. A lot of data is multimodal: videos and complex layouts. Qualcomm is one of our customers; their data is very complex: circuit diagrams, code, tables. You need to extract the information the right way and make sure the whole pipeline works.
- 22:00: Reasoning: I think people are still underestimating how much of a paradigm shift inference-time compute is. We're doing a lot of work on domain-agnostic planners and making sure you have agentic capabilities where you can understand what you want to retrieve. RAG becomes one of the tools for the domain-agnostic planner. Retrieval is the way you make systems work on top of your data.
- 22:42: Inference-time compute will be slower and more expensive. Is your system engineered so that you only use it when you need to?
- 22:56: We're a platform where people can build their own agents, so you can build what you want. We have "think mode," where you use the reasoning model, and the standard RAG mode, where it just does RAG with lower latency.
- 23:18: With reasoning models, people seem to become much more relaxed about latency constraints.
- 23:40: You describe a system that's optimized end to end. That implies that I don't have to do fine-tuning. You don't have to, but you can if you want.
- 24:02: What would fine-tuning buy me at this point? If I do fine-tuning, the ROI would be small.
- 24:20: It depends on how much a few extra percent of performance is worth to you. For some of our customers, that can be a big difference. Fine-tuning versus RAG is another false dichotomy. The answer has always been both. The same is true of MCP and long context.
- 25:17: My suspicion is that with your system I'm going to do less fine-tuning.
- 25:20: Out of the box, our system will be pretty good. But we do help our customers squeeze out maximum performance.
- 25:37: These still fit into the same kind of supervised fine-tuning: Here are some labeled examples.
- 25:52: We don't need that many. It's not labels so much as examples of the behavior you want. We use synthetic data pipelines to get a good enough training set. We're seeing pretty good gains with that. It's really about capturing the domain better.
- 26:28: "I don't need RAG because I have agents." Aren't deep research tools just doing what a RAG system is supposed to do?
- 26:51: They're using RAG under the hood. MCP is just a protocol; you'd be doing RAG with MCP.
- 27:25: With these deep research tools, the agent is supposed to go out and find relevant sources. In other words, it's doing what a RAG system is supposed to do, but it's not called RAG.
- 27:55: I would still call that RAG. The agent is the generator. You're augmenting the G with the R. If you want to get these systems to work on top of your data, you need retrieval. That's what RAG is really about.
- 28:33: The main difference is the end product. A lot of people use these to generate a report or slides that they can edit.
- 28:53: Isn't the difference just inference-time compute, the ability to do active retrieval versus passive retrieval? You always retrieve. You can make that more active; you can let the model decide when and what you want to retrieve. But you're still retrieving. (An active-retrieval loop sketch appears after the timestamps.)
- 29:45: There's a class of agents that don't retrieve. They don't work yet, but that's the vision for agents going forward.
- 30:11: It's starting to work. The tool used in that example is retrieval; the other tool is calling an API. What these reasoners are doing is just calling APIs as tools.
- 30:40: At the end of the day, Google's original vision is what matters: organize all the world's information.
- 30:48: A key difference between the old approach and the new approach is that we now have the G: generative answers. We don't have to reason over the retrievals ourselves anymore.
- 31:19: What parts of your platform are open source?
- 31:27: We've open-sourced some of our earlier work, and we've published a lot of our research.
- 31:52: One of the topics I'm watching: I think supervised fine-tuning is a solved problem. But reinforcement fine-tuning is still a UX problem. What's the right way to interact with a domain expert?
- 32:25: Gathering that feedback is crucial. We do that as part of our system. You can train these dynamic query paths using the reinforcement signal.
- 32:52: In the next 6 to 12 months, what would you like to see from the foundation model builders?
- 33:08: It would be nice if longer context actually worked. You'll still need RAG. The other thing is VLMs. VLMs are good, but they're still not great, especially when it comes to fine-grained chart understanding.
- 33:43: With your platform, can you bring your own model, or do you supply the model?
- 33:51: We have our own models for the retrieval and contextualization stack. You can bring your own language model, but our GLM typically works better than what you can bring yourself.
- 34:09: Are you seeing adoption of the Chinese models?
- 34:13: Yes and no. DeepSeek was a very important existence proof. We don't deploy them for production customers.
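Code sketches
The sketches below illustrate ideas discussed in the episode. They are toy, self-contained Python, not Contextual AI's actual APIs or components; every class, function, and field name is invented for illustration.

First, the pipeline from 6:07 and 16:07: a mixture of retrievers casts a wide net in a fast first stage, then an instruction-aware reranker narrows the funnel. The scoring functions are crude stand-ins for BM25 and dense embeddings.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    author: str = ""
    year: int = 2020

def keyword_score(query: str, doc: Doc) -> float:
    """Crude term-overlap score; a stand-in for BM25."""
    q, d = set(query.lower().split()), set(doc.text.lower().split())
    return len(q & d) / (len(q) or 1)

def vector_score(query: str, doc: Doc) -> float:
    """Toy character-bigram similarity; a stand-in for a dense retriever."""
    def bigrams(s: str) -> set[str]:
        return {s[i:i + 2] for i in range(len(s) - 1)}
    q, d = bigrams(query.lower()), bigrams(doc.text.lower())
    return len(q & d) / (len(q | d) or 1)

def first_stage(query: str, corpus: list[Doc], k: int = 20) -> list[Doc]:
    """Mixture of retrievers: merge each retriever's top hits (wide net)."""
    by_kw = sorted(corpus, key=lambda d: keyword_score(query, d), reverse=True)[:k]
    by_vec = sorted(corpus, key=lambda d: vector_score(query, d), reverse=True)[:k]
    merged, seen = [], set()
    for doc in by_kw + by_vec:
        if id(doc) not in seen:
            seen.add(id(doc))
            merged.append(doc)
    return merged

def rerank(query: str, candidates: list[Doc], favor_recency: bool = False,
           priority_author: str = "", top_n: int = 5) -> list[Doc]:
    """Instruction-aware reranking: relevance plus rule-based boosts such as
    'favor recency' or 'prioritize documents the CEO wrote'. A trained
    reranker would take such instructions as natural language."""
    def score(doc: Doc) -> float:
        s = 0.5 * keyword_score(query, doc) + 0.5 * vector_score(query, doc)
        if favor_recency:
            s += 0.05 * (doc.year - 2020)      # newer documents float up
        if priority_author and doc.author == priority_author:
            s += 1.0                           # hard boost for the named author
        return s
    return sorted(candidates, key=score, reverse=True)[:top_n]
```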
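The two-API surface from 8:06, building on the sketch above. `RagClient` is a hypothetical name; the point is that parsing, chunking, embedding, retrieval, reranking, and grounded generation all hide behind two calls. Here the generation step is faked by returning the top-ranked chunk.

```python
class RagClient:
    """Hypothetical two-call client: ingest() and query()."""

    def __init__(self) -> None:
        self.corpus: list[Doc] = []

    def ingest(self, text: str, author: str = "", year: int = 2020) -> None:
        # A real ingestion API would do layout segmentation, hierarchy
        # extraction, chunking, and embedding server-side.
        self.corpus.append(Doc(text=text, author=author, year=year))

    def query(self, question: str) -> str:
        candidates = first_stage(question, self.corpus)
        top = rerank(question, candidates, favor_recency=True)
        # A grounded LM would generate an answer from `top`; this sketch
        # returns the best-ranked chunk, or refuses.
        return top[0].text if top else "I don't know."

client = RagClient()
client.ingest("Albus Dumbledore is the headmaster of Hogwarts.", year=2024)
print(client.query("Who is the headmaster in Harry Potter?"))
```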
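Document hierarchy as retrieval metadata (8:54): each chunk should carry the breadcrumb of headings above it so that retrieval and reranking can see where a passage lives. A minimal sketch, using a markdown-heading heuristic as a stand-in for real layout segmentation and OCR:

```python
def chunks_with_hierarchy(markdown: str) -> list[dict]:
    """Attach a 'section' breadcrumb to every text line of a document."""
    path: list[str] = []                      # current heading breadcrumb
    chunks: list[dict] = []
    for line in markdown.splitlines():
        if line.startswith("#"):              # heading: update the breadcrumb
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            path = path[:level - 1] + [title]
        elif line.strip():                    # body text: emit a chunk
            chunks.append({"text": line.strip(),
                           "section": " > ".join(path)})
    return chunks

doc = "# Annual Report\n## Risk Factors\nSupply chain exposure grew.\n"
print(chunks_with_hierarchy(doc))
# [{'text': 'Supply chain exposure grew.',
#   'section': 'Annual Report > Risk Factors'}]
```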
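The GLM at 19:37 is trained to answer only from retrieved context and to say "I don't know" otherwise. With an off-the-shelf model you can only approximate that behavior at the prompt level; this is that weaker substitute, not how the GLM itself works:

```python
GROUNDED_PROMPT = """Answer using ONLY the context below.
If the context does not contain the answer, reply exactly: I don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(chunks: list[str], question: str) -> str:
    """Assemble retrieved chunks into a grounding prompt for any chat model."""
    return GROUNDED_PROMPT.format(context="\n---\n".join(chunks),
                                  question=question)
```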
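Active versus passive retrieval (28:53): the model decides when and what to retrieve instead of retrieving once up front. The control flow is the point here; `llm_step` is a toy stand-in for a tool-calling model, and the loop reuses the hypothetical `RagClient` above.

```python
def llm_step(question: str, notes: list[str]) -> tuple[str, str]:
    """Toy stand-in for a tool-calling model: answer once the gathered
    evidence mentions a substantive question term, else refine the query."""
    terms = [t for t in question.lower().split() if len(t) > 3]
    for note in notes:
        if any(t in note.lower() for t in terms):
            return ("answer", note)
    return ("retrieve", question + " background")

def agent_answer(question: str, client: RagClient, max_steps: int = 3) -> str:
    notes: list[str] = []
    query = question
    for _ in range(max_steps):
        notes.append(client.query(query))      # retrieval as one tool call
        action, payload = llm_step(question, notes)
        if action == "answer":
            return payload
        query = payload                        # model refines its own query
    return "I don't know."

print(agent_answer("Who is the headmaster in Harry Potter?", client))
```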