Advanced Retrieval-Augmented Generation (RAG): A Technical Overview

Authored by G. V. Ranjith Rayalu
Artificial Intelligence & Data Science Engineer


1. Introduction

Retrieval-Augmented Generation (RAG) has emerged as a powerful technique to mitigate hallucinations in large language models and to provide guardrails for generating factually accurate, context-rich responses. The core idea is to retrieve relevant information from a knowledge base using embeddings, TF-IDF, or a combination of both, and then to feed the retrieved pieces of text (or “chunks”) into the language model so that its answers are grounded in real data.

In this document, you’ll find an overview of:

• Why RAG helps reduce hallucinations and provide guardrails
• The embeddings store and percentile-based semantic chunking
• RAG components and data sources (question–answer pairs, NCERT, ALT/PLT content, learning outcomes, glossary)
• Advanced enhancements: TF-IDF integration, ranker upgrades, query rewriting, and model caching
• Graph RAG for global and multi-hop queries
• Use cases and future directions


2. Why RAG?

1. Reducing Hallucinations

Large language models sometimes generate plausible but incorrect information, known as “hallucinations.” By grounding the model with vetted reference material, RAG forces reliance on real, verifiable data rather than guesswork.

2. Providing Guardrails

RAG allows you to limit the model’s output to the context found in your curated documents. This approach enforces factual consistency and can help filter out inappropriate or irrelevant content.


3. Embeddings Store & Percentile-Based Chunking

At the core of any RAG system is a strategy for encoding and storing text. Here’s the essence of your percentile-based approach:

3.1 Semantic Chunking and Distance Thresholds

• Text is split by basic punctuation into sentences.
• Consecutive chunks are merged unless their “distance” (often cosine or Euclidean) exceeds a certain percentile threshold.
• For example, if the distance values between consecutive sentences are [0.3, 0.6, 0.9, 0.5] and the 95th-percentile threshold (computed over all consecutive-sentence distances in the document) works out to 0.7, only the pair with distance 0.9 exceeds it, so a new chunk boundary is created there. A minimal code sketch follows below.
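
A minimal sketch of this approach in Python. The embedding function embed_fn is assumed to map one sentence to a vector; the sentence-splitting regex and the default percentile are illustrative choices, not the system's exact settings:

```python
import re
import numpy as np

def percentile_chunk(text, embed_fn, percentile=95.0):
    """Split text into sentences, then merge consecutive sentences into
    chunks until the cosine distance between neighbours exceeds the
    chosen percentile of all neighbour distances."""
    # 1. Naive sentence split on basic punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    # 2. Embed every sentence and normalise for cosine comparisons.
    vecs = np.asarray([embed_fn(s) for s in sentences], dtype=float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    # 3. Cosine distance between each pair of consecutive sentences.
    dists = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)

    # 4. Document-relative threshold: the chosen percentile of those distances.
    threshold = np.percentile(dists, percentile)

    # 5. Start a new chunk wherever the distance exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for sent, d in zip(sentences[1:], dists):
        if d > threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```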

3.2 Why Use Percentiles?

A percentile threshold adapts to each document’s own distribution of sentence-to-sentence distances, so chunk boundaries are set relative to how that particular text varies rather than by a fixed absolute cutoff. Raising the percentile (e.g., from 90% to 95%) yields fewer, larger chunks; lowering it splits more aggressively.

3.3 Cosine vs. Euclidean Distance

Some workflows use both cosine and Euclidean distance to measure text similarity. Cosine distance measures direction rather than magnitude, making it especially robust for text vectors, where absolute vector length matters less than semantic similarity.
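
A small illustration of the difference, with plain NumPy vectors standing in for sentence embeddings:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not vector length."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Straight-line distance: sensitive to magnitude as well as direction."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.linalg.norm(a - b))

# Two vectors pointing the same way but with different lengths:
u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(cosine_distance(u, v))     # ~0.0  -> treated as "the same" under cosine
print(euclidean_distance(u, v))  # ~3.74 -> penalised for the length difference
```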


4. RAG Components & Data Sources

Your system is designed to handle multiple data sources in a unified manner:

4.1 Question-Answer Pairs

4.2 NCERT

Challenges in Previous Extraction:
PDF extraction often failed in math-heavy or chemistry texts, creating messy or incomplete captures. Moreover, some end-of-chapter questions erroneously appeared in RAG outputs.

Solutions:
• Switched to Mathpix for high-fidelity OCR, extracting math in LaTeX and other content in Markdown.
• Removed end-of-exercise questions from the knowledge base.
• Implemented topic-based chunking with percentile thresholds (e.g., 90% for math, 95% otherwise).
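
One possible way to wire up those subject-specific thresholds, reusing the percentile_chunk sketch from Section 3.1 (the dictionary keys and values below are illustrative):

```python
# Hypothetical per-subject percentile settings, matching the values above.
CHUNKING_PERCENTILE = {
    "math": 90.0,      # math-heavy text shifts topic more abruptly, so split sooner
    "default": 95.0,   # other subjects tolerate longer, more cohesive chunks
}

def chunk_for_subject(text, subject, embed_fn):
    """Route a document to the percentile threshold configured for its subject."""
    pct = CHUNKING_PERCENTILE.get(subject, CHUNKING_PERCENTILE["default"])
    return percentile_chunk(text, embed_fn, percentile=pct)
```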

4.3 ALT/PLT Content

The context in ALT/PLT documents is highly interrelated, so a simple percentile threshold (like 90%) was creating unwieldy results. You implemented hierarchical chunking instead, so that chunks stay manageable without severing closely related context.

4.4 Learning Outcomes

Many outcomes are similar. To avoid returning duplicates, K-Means clustering is applied on the retrieved outcomes. Only the cluster centers—up to five—are returned.
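
One way this deduplication could look with scikit-learn’s KMeans, returning the outcome nearest each cluster centre as its representative. The cap of five and the clustering parameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def dedupe_outcomes(outcomes, embeddings, max_results=5):
    """Cluster similar learning outcomes and keep one representative per cluster.

    outcomes   : list of outcome strings retrieved for a query
    embeddings : matching array of shape (n_outcomes, dim)
    """
    X = np.asarray(embeddings, dtype=float)
    k = min(max_results, len(outcomes))
    if k <= 1:
        return outcomes[:k]

    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # For each cluster, return the outcome closest to the cluster centre.
    representatives = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        centre = km.cluster_centers_[c]
        best = idx[np.argmin(np.linalg.norm(X[idx] - centre, axis=1))]
        representatives.append(outcomes[best])
    return representatives
```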

4.5 Glossary

Here, each term is combined with its definition. A high threshold (~98%) is used, preventing unrelated terms from merging into the same chunk. Future improvements might involve more advanced methods of dictionary extraction and matching.
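
A tiny sketch of how a term might be paired with its definition before the high-threshold merging described above is applied; the formatting of the combined string is an assumption:

```python
def glossary_units(glossary):
    """Pair each term with its definition so they are embedded as one unit."""
    return [f"{term}: {definition}" for term, definition in glossary.items()]

# The resulting units can then go through the same percentile-based merging
# as in Section 3.1, with a high (98th-percentile) threshold so unrelated
# terms rarely end up in the same chunk.
```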


5. Advanced RAG Enhancements

5.1 TF-IDF Integration

Complementary Strengths: Embeddings capture semantic relationships, while TF-IDF focuses on exact term usage. Combining them provides the best of both worlds.
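
A hedged sketch of one way to blend the two signals, using scikit-learn’s TfidfVectorizer for the lexical side and precomputed embeddings for the semantic side. The 0.5 weighting is illustrative, not the system’s actual setting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(query, chunks, chunk_embeddings, query_embedding, alpha=0.5):
    """Blend lexical (TF-IDF) and semantic (embedding) relevance.

    alpha weights the embedding score; (1 - alpha) weights the TF-IDF score.
    """
    # Lexical side: fit TF-IDF on the chunks and score the query against them.
    vectorizer = TfidfVectorizer()
    chunk_tfidf = vectorizer.fit_transform(chunks)
    query_tfidf = vectorizer.transform([query])
    lexical = cosine_similarity(query_tfidf, chunk_tfidf)[0]

    # Semantic side: cosine similarity between query and chunk embeddings.
    E = np.asarray(chunk_embeddings, dtype=float)
    q = np.asarray(query_embedding, dtype=float)
    semantic = (E @ q) / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-12)

    return alpha * semantic + (1.0 - alpha) * lexical
```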

5.2 Ranker Upgrades

The current ranker uses fixed weights and ranks. Future versions could incorporate specialized rankers (e.g., BERT-based cross-encoders) to become more query-aware and yield higher-fidelity results.
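
For instance, a query-aware reranking stage could look roughly like this with the sentence-transformers library; the checkpoint name is a commonly used public cross-encoder, not necessarily the one this system would adopt:

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, top_k=5,
           model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Re-score retrieved chunks with a cross-encoder that reads the query
    and each chunk together, so its scores adapt to the query."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```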

5.3 Query Rewriting

Query rewriting merges context from previous queries, fixes grammar or spelling, and can reduce unnecessary RAG calls if the new query is similar to the old one. It ensures the user’s intent is captured accurately while retrieving the most relevant documents.
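
A minimal sketch of this flow, assuming a generic llm callable for the rewrite and a similarity_fn for deciding whether a fresh retrieval is needed; both callables and the 0.9 skip threshold are illustrative:

```python
def rewrite_query(conversation, previous_query, new_query, llm, similarity_fn,
                  skip_threshold=0.9):
    """Rewrite a follow-up query so it is self-contained, and decide whether
    a fresh RAG call is actually needed.

    llm           : callable taking a prompt string and returning text
    similarity_fn : callable returning a similarity in [0, 1] between two queries
    """
    prompt = (
        "Rewrite the latest user question so it stands on its own, fixing "
        "grammar and spelling and resolving pronouns using the conversation "
        "below. Return only the rewritten question.\n\n"
        f"Conversation:\n{conversation}\n\nLatest question: {new_query}"
    )
    rewritten = llm(prompt).strip()

    # If the rewritten query barely differs from the previous one, reuse the
    # previously retrieved context instead of issuing another RAG call.
    needs_retrieval = similarity_fn(rewritten, previous_query) < skip_threshold
    return rewritten, needs_retrieval
```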

5.4 Model Caching

APIs for large models like GPT can cache long prompt prefixes. If a follow-up query reuses a large shared context, the model can reference already processed tokens, speeding up inference and improving response consistency. In practice, chunk updates and system-prompt changes often break the 1024-token prefix overlap needed for caching, but selectively calling RAG only when needed can preserve more context for caching.
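
One way to keep prompts cache-friendly is to hold the stable material in an unchanged prefix and append only new material at the end, as in this sketch; the function and its arguments are illustrative, not any specific provider’s API:

```python
def build_prompt(system_prompt, cached_chunks, new_chunks, question):
    """Keep the stable part of the prompt (system prompt + chunks already used
    in the conversation) first and unchanged, so providers that cache long
    shared prefixes can reuse the already processed tokens. Only newly
    retrieved chunks and the latest question are appended at the end."""
    parts = [system_prompt]
    parts += cached_chunks   # unchanged across turns -> cache-friendly prefix
    parts += new_chunks      # appended only when RAG was actually called
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)
```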


6. Graph RAG

6.1 Why Graph RAG?

Graph RAG is especially helpful for global or multi-hop queries. A knowledge graph of entity-relation-entity triples can unify information from across an entire chapter or multiple documents. Queries like “Explain the whole chapter” or “How does friction relate to sliding friction and rolling friction?” benefit greatly from a graph-based approach.

6.2 Basic Knowledge Graph

An entity-relation-entity structure forms the backbone of Graph RAG. By searching for an entity in the graph, the system retrieves connected nodes and edges, culminating in a relevant subgraph of knowledge.
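
A minimal sketch using networkx, with a handful of illustrative friction-related triples echoing the example query above:

```python
import networkx as nx

# A tiny entity-relation-entity graph (the triples are illustrative).
triples = [
    ("friction", "has_type", "sliding friction"),
    ("friction", "has_type", "rolling friction"),
    ("rolling friction", "is_smaller_than", "sliding friction"),
    ("friction", "opposes", "relative motion"),
]

G = nx.MultiDiGraph()
for head, relation, tail in triples:
    G.add_edge(head, tail, relation=relation)

def retrieve_subgraph(graph, entity, hops=1):
    """Return the triples within `hops` edges of the queried entity."""
    nodes = nx.ego_graph(graph.to_undirected(), entity, radius=hops).nodes
    sub = graph.subgraph(nodes)
    return [(u, d["relation"], v) for u, v, d in sub.edges(data=True)]

# Searching for "friction" returns its connected types and properties.
print(retrieve_subgraph(G, "friction"))
```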

6.3 Analysis & Observations


7. Use Cases & Future Directions

7.1 Summarize Entire Chapters

Graph RAG with hierarchical chunking can provide global summaries. Alternatively, precomputed abstractive or extractive summaries can quickly answer such queries.

7.2 Handling Multi-Hop or Complex Queries

Graph-based indexing simplifies referencing multiple sections of large texts. Combining chunk-level embeddings with a knowledge graph further refines retrieval for complex, multi-step questions.

7.3 Hard Datasets & Edge Cases

Evaluate performance on:

7.4 Practical Enhancements


8. Conclusion

Your RAG system intricately combines percentile-based semantic chunking, TF-IDF, and embedding-based retrieval to ensure high-quality, contextually accurate answers. In specialized domains like NCERT, mathematics, and ALT/PLT content, you’ve tailored chunking thresholds, introduced hierarchical chunking, and leveraged improved OCR (Mathpix) for more accurate text extraction.

Graph RAG represents a significant leap for handling broader, multi-faceted queries by relying on entity–relation networks. While it improves recall and context synergy, it demands careful token management and accurate entity extraction.

Overall, your system’s flexibility—through query rewriting, caching, and combined rankers—positions it well for professional-level usage in Q&A retrieval, entire chapter summarization, and beyond.


9. Key Takeaways