Advanced Retrieval-Augmented Generation (RAG): A Technical Overview

Authored by G. V. Ranjith Rayalu
Artificial Intelligence & Data Science Engineer


1. Introduction

Retrieval-Augmented Generation (RAG) has emerged as a powerful technique to mitigate hallucinations in large language models and to provide guardrails for generating factually accurate, context-rich responses. The core idea is to retrieve relevant information from a knowledge base using embeddings, TF-IDF, or a combination of both, and then to feed the retrieved pieces of text (or “chunks”) into the language model so that its answers are grounded in real data.

In this document, you’ll find an overview of:

• Why RAG helps reduce hallucinations and provide guardrails
• The embeddings store and percentile-based semantic chunking
• RAG components and data sources (question–answer pairs, NCERT, ALT/PLT content, learning outcomes, glossary)
• Advanced enhancements: TF-IDF integration, ranker upgrades, query rewriting, and model caching
• Graph RAG for global and multi-hop queries
• Use cases and future directions


2. Why RAG?

1. Reducing Hallucinations

Large language models sometimes generate plausible but incorrect information, known as “hallucinations.” By grounding the model with vetted reference material, RAG forces reliance on real, verifiable data rather than guesswork.

2. Providing Guardrails

RAG allows you to limit the model’s output to the context found in your curated documents. This approach enforces factual consistency and can help filter out inappropriate or irrelevant content.


3. Embeddings Store & Percentile-Based Chunking

At the core of any RAG system is a strategy for encoding and storing text. Here’s the essence of your percentile-based approach:

3.1 Semantic Chunking and Distance Thresholds

• Text is split by basic punctuation into sentences.
• Consecutive chunks are merged unless their “distance” (often cosine or Euclidean) exceeds a certain percentile threshold.
• For example, if the distance values between consecutive sentences are [0.3, 0.6, 0.9, 0.5] and the 95th-percentile threshold (computed over all consecutive-sentence distances in the document) works out to 0.7, only the pair with distance 0.9 exceeds it, so a new chunk boundary is created there. A minimal code sketch follows below.
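
A minimal sketch of this approach in Python. The embedding function embed_fn is assumed to map one sentence to a vector; the sentence-splitting regex and the default percentile are illustrative choices, not the system's exact settings:

```python
import re
import numpy as np

def percentile_chunk(text, embed_fn, percentile=95.0):
    """Split text into sentences, then merge consecutive sentences into
    chunks until the cosine distance between neighbours exceeds the
    chosen percentile of all neighbour distances."""
    # 1. Naive sentence split on basic punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    # 2. Embed every sentence and normalise for cosine comparisons.
    vecs = np.asarray([embed_fn(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    # 3. Cosine distance between each pair of consecutive sentences.
    dists = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)

    # 4. Document-relative threshold: the chosen percentile of those distances.
    threshold = np.percentile(dists, percentile)

    # 5. Start a new chunk wherever the distance exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for sent, d in zip(sentences[1:], dists):
        if d > threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```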

3.2 Why Use Percentiles?

A percentile threshold adapts to each document’s own distribution of sentence-to-sentence distances, so chunk boundaries are set relative to how that particular text varies rather than by a fixed absolute cutoff. Raising the percentile (e.g., from 90% to 95%) yields fewer, larger chunks; lowering it splits more aggressively.

3.3 Cosine vs. Euclidean Distance

Some workflows use both cosine and Euclidean distance to measure text similarity. Cosine distance measures direction rather than magnitude, making it especially robust for text vectors, where absolute vector length matters less than semantic similarity.
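
A small illustration of the difference, with plain NumPy vectors standing in for sentence embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not vector length."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line distance: sensitive to magnitude as well as direction."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.linalg.norm(a - b))

# Two vectors pointing the same way but with different lengths:
u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_distance(u, v))     # ~0.0  -> treated as "the same" under cosine
print(euclidean_distance(u, v))  # ~3.74 -> penalised for the length difference
```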


4. RAG Components & Data Sources

Your system is designed to handle multiple data sources in a unified manner:

4.1 Question-Answer Pairs

4.2 NCERT

Challenges in Previous Extraction:
PDF extraction often failed in math-heavy or chemistry texts, creating messy or incomplete captures. Moreover, some end-of-chapter questions erroneously appeared in RAG outputs.

Solutions:
• Switched to Mathpix for high-fidelity OCR, extracting math in LaTeX and other content in Markdown.
• Removed end-of-exercise questions from the knowledge base.
• Implemented topic-based chunking with percentile thresholds (e.g., 90% for math, 95% otherwise).
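
One possible way to wire up those subject-specific thresholds, reusing the percentile_chunk sketch from Section 3.1 (the dictionary keys and values below are illustrative):

```python
# Hypothetical per-subject percentile settings, matching the values above.
CHUNKING_PERCENTILE = {
    "math": 90.0,      # math-heavy text shifts topic more abruptly, so split sooner
    "default": 95.0,   # other subjects tolerate longer, more cohesive chunks
}

def chunk_for_subject(text, subject, embed_fn):
    """Route a document to the percentile threshold configured for its subject."""
    pct = CHUNKING_PERCENTILE.get(subject, CHUNKING_PERCENTILE["default"])
    return percentile_chunk(text, embed_fn, percentile=pct)
```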

4.3 ALT/PLT Content

The context in ALT/PLT documents is highly interrelated, so a simple percentile threshold (like 90%) was creating unwieldy results. You implemented hierarchical chunking instead, so that chunks stay manageable without severing closely related context.

4.4 Learning Outcomes

Many outcomes are similar. To avoid returning duplicates, K-Means clustering is applied on the retrieved outcomes. Only the cluster centers—up to five—are returned.
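
One way this deduplication could look with scikit-learn’s KMeans, returning the outcome nearest each cluster centre as its representative. The cap of five and the clustering parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def dedupe_outcomes(outcomes, embeddings, max_results=5):
    """Cluster similar learning outcomes and keep one representative per cluster.

    outcomes   : list of outcome strings retrieved for a query
    embeddings : matching array of shape (n_outcomes, dim)
    """
    X = np.asarray(embeddings, dtype=float)
    k = min(max_results, len(outcomes))
    if k <= 1:
        return outcomes[:k]

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # For each cluster, return the outcome closest to the cluster centre.
    representatives = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        centre = km.cluster_centers_[c]
        best = idx[np.argmin(np.linalg.norm(X[idx] - centre, axis=1))]
        representatives.append(outcomes[best])
    return representatives
```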

4.5 Glossary

Here, each term is combined with its definition. A high threshold (~98%) is used, preventing unrelated terms from merging into the same chunk. Future improvements might involve more advanced methods of dictionary extraction and matching.
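
A tiny sketch of how a term might be paired with its definition before the high-threshold merging described above is applied; the formatting of the combined string is an assumption:

```python
def glossary_units(glossary):
    """Pair each term with its definition so they are embedded as one unit."""
    return [f"{term}: {definition}" for term, definition in glossary.items()]

# The resulting units can then go through the same percentile-based merging
# as in Section 3.1, with a high (98th-percentile) threshold so unrelated
# terms rarely end up in the same chunk.
```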


5. Advanced RAG Enhancements

5.1 TF-IDF Integration

Complementary Strengths: Embeddings capture semantic relationships, while TF-IDF focuses on exact term usage. Combining them provides the best of both worlds.
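
A hedged sketch of one way to blend the two signals, using scikit-learn’s TfidfVectorizer for the lexical side and precomputed embeddings for the semantic side. The 0.5 weighting is illustrative, not the system’s actual setting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(query, chunks, chunk_embeddings, query_embedding, alpha=0.5):
    """Blend lexical (TF-IDF) and semantic (embedding) relevance.

    alpha weights the embedding score; (1 - alpha) weights the TF-IDF score.
    """
    # Lexical side: fit TF-IDF on the chunks and score the query against them.
    vectorizer = TfidfVectorizer()
    chunk_tfidf = vectorizer.fit_transform(chunks)
    query_tfidf = vectorizer.transform([query])
    lexical = cosine_similarity(query_tfidf, chunk_tfidf)[0]

    # Semantic side: cosine similarity between query and chunk embeddings.
    E = np.asarray(chunk_embeddings, dtype=float)
    q = np.asarray(query_embedding, dtype=float)
    semantic = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-12)

    return alpha * semantic + (1.0 - alpha) * lexical
```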

5.2 Ranker Upgrades

The current ranker uses fixed weights and ranks. Future versions could incorporate specialized rankers (e.g., BERT-based cross-encoders) to become more query-aware and yield higher-fidelity results.
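
For instance, a query-aware reranking stage could look roughly like this with the sentence-transformers library; the checkpoint name is a commonly used public cross-encoder, not necessarily the one this system would adopt:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=5,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Re-score retrieved chunks with a cross-encoder that reads the query
    and each chunk together, so its scores adapt to the query."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```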

5.3 Query Rewriting

Query rewriting merges context from previous queries, fixes grammar or spelling, and can reduce unnecessary RAG calls if the new query is similar to the old one. It ensures the user’s intent is captured accurately while retrieving the most relevant documents.
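
A minimal sketch of this flow, assuming a generic llm callable for the rewrite and a similarity_fn for deciding whether a fresh retrieval is needed; both callables and the 0.9 skip threshold are illustrative:

```python
def rewrite_query(conversation, previous_query, new_query, llm, similarity_fn,
                  skip_threshold=0.9):
    """Rewrite a follow-up query so it is self-contained, and decide whether
    a fresh RAG call is actually needed.

    llm           : callable taking a prompt string and returning text
    similarity_fn : callable returning a similarity in [0, 1] between two queries
    """
    prompt = (
        "Rewrite the latest user question so it stands on its own, fixing "
        "grammar and spelling and resolving pronouns using the conversation "
        "below. Return only the rewritten question.\n\n"
        f"Conversation:\n{conversation}\n\nLatest question: {new_query}"
    )
    rewritten = llm(prompt).strip()

    # If the rewritten query barely differs from the previous one, reuse the
    # previously retrieved context instead of issuing another RAG call.
    needs_retrieval = similarity_fn(rewritten, previous_query) < skip_threshold
    return rewritten, needs_retrieval
```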

5.4 Model Caching

APIs for large models like GPT can cache long prompt prefixes. If a follow-up query reuses a large shared context, the model can reference already processed tokens, speeding up inference and improving response consistency. In practice, chunk updates and system-prompt changes often break the 1024-token prefix overlap needed for caching, but selectively calling RAG only when needed can preserve more context for caching.
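
One way to keep prompts cache-friendly is to hold the stable material in an unchanged prefix and append only new material at the end, as in this sketch; the function and its arguments are illustrative, not any specific provider’s API:

```python
def build_prompt(system_prompt, cached_chunks, new_chunks, question):
    """Keep the stable part of the prompt (system prompt + chunks already used
    in the conversation) first and unchanged, so providers that cache long
    shared prefixes can reuse the already processed tokens. Only newly
    retrieved chunks and the latest question are appended at the end."""
    parts = [system_prompt]
    parts += cached_chunks   # unchanged across turns -> cache-friendly prefix
    parts += new_chunks      # appended only when RAG was actually called
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```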


6. Graph RAG

6.1 Why Graph RAG?

Graph RAG is especially helpful for global or multi-hop queries. A knowledge graph of entity-relation-entity triples can unify information from across an entire chapter or multiple documents. Queries like “Explain the whole chapter” or “How does friction relate to sliding friction and rolling friction?” benefit greatly from a graph-based approach.

6.2 Basic Knowledge Graph

An entity-relation-entity structure forms the backbone of Graph RAG. By searching for an entity in the graph, the system retrieves connected nodes and edges, culminating in a relevant subgraph of knowledge.
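
A minimal sketch using networkx, with a handful of illustrative friction-related triples echoing the example query above:

```python
import networkx as nx

# A tiny entity-relation-entity graph (the triples are illustrative).
triples = [
    ("friction", "has_type", "sliding friction"),
    ("friction", "has_type", "rolling friction"),
    ("rolling friction", "is_smaller_than", "sliding friction"),
    ("friction", "opposes", "relative motion"),
]

G = nx.MultiDiGraph()
for head, relation, tail in triples:
    G.add_edge(head, tail, relation=relation)

def retrieve_subgraph(graph, entity, hops=1):
    """Return the triples within `hops` edges of the queried entity."""
    nodes = nx.ego_graph(graph.to_undirected(), entity, radius=hops).nodes
    sub = graph.subgraph(nodes)
    return [(u, d["relation"], v) for u, v, d in sub.edges(data=True)]

# Searching for "friction" returns its connected types and properties.
print(retrieve_subgraph(G, "friction"))
```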

6.3 Analysis & Observations


7. Use Cases & Future Directions

7.1 Summarize Entire Chapters

Graph RAG with hierarchical chunking can provide global summaries. Alternatively, precomputed abstractive or extractive summaries can quickly answer such queries.

7.2 Handling Multi-Hop or Complex Queries

Graph-based indexing simplifies referencing multiple sections of large texts. Combining chunk-level embeddings with a knowledge graph further refines retrieval for complex, multi-step questions.

7.3 Hard Datasets & Edge Cases

Evaluate performance on:

7.4 Practical Enhancements


8. Conclusion

Your RAG system intricately combines percentile-based semantic chunking, TF-IDF, and embedding-based retrieval to ensure high-quality, contextually accurate answers. In specialized domains like NCERT, mathematics, and ALT/PLT content, you’ve tailored chunking thresholds, introduced hierarchical chunking, and leveraged improved OCR (Mathpix) for more accurate text extraction.

Graph RAG represents a significant leap for handling broader, multi-faceted queries by relying on entity–relation networks. While it improves recall and context synergy, it demands careful token management and accurate entity extraction.

Overall, your system’s flexibility—through query rewriting, caching, and combined rankers—positions it well for professional-level usage in Q&A retrieval, entire chapter summarization, and beyond.


9. Key Takeaways