Authored by G. V. Ranjith Rayalu
Artificial Intelligence & Data Science Engineer
Retrieval-Augmented Generation (RAG) has emerged as a powerful technique to mitigate hallucinations in large language models and to provide guardrails for generating factually accurate, context-rich responses. The core idea is to retrieve relevant information from a knowledge base, using embeddings, TF-IDF, or a combination of the two, and then feed the retrieved pieces of text (or "chunks") into the language model so that its answers are grounded in real data.
In this document, you'll find an overview of:
• how grounding the model in retrieved context mitigates hallucinations and acts as a guardrail,
• percentile-based semantic chunking and the distance metrics behind it,
• combining TF-IDF and embedding retrieval, ranking, and reranking,
• query rewriting and prompt caching, and
• Graph RAG for global, multi-hop queries.
Large language models sometimes generate plausible but incorrect information, known as "hallucinations." By grounding the model in vetted reference material, RAG forces it to rely on real, verifiable data rather than guesswork.
RAG allows you to limit the model’s output to the context found in your curated documents. This approach enforces factual consistency and can help filter out inappropriate or irrelevant content.
At the core of any RAG system is a strategy for encoding and storing text. Here’s the essence of your percentile-based approach:
• Text is split by basic punctuation into sentences.
• Consecutive chunks are merged unless their "distance" (often cosine or Euclidean) exceeds a certain percentile threshold.
• For example, if the distances between consecutive sentences are [0.3, 0.6, 0.9, 0.5], the 95th-percentile threshold works out to roughly 0.86, so only the pair with distance 0.9 exceeds it and creates a new chunk boundary.
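Below is a minimal sketch of this percentile-based splitting, assuming a hypothetical embed() helper that returns one vector per sentence (any sentence-embedding model could stand in for it):

```python
import re
import numpy as np

def embed(sentences):
    """Placeholder: return one embedding vector per sentence.
    Swap in whichever sentence-embedding model the pipeline uses."""
    raise NotImplementedError

def percentile_chunks(text, percentile=95):
    # Split on basic sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return [text]

    vectors = np.asarray(embed(sentences), dtype=float)

    # Cosine distance between each pair of consecutive sentences.
    a, b = vectors[:-1], vectors[1:]
    cos_sim = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    distances = 1.0 - cos_sim

    # A new chunk starts wherever the distance exceeds the percentile threshold.
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```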
Some workflows use both cosine and Euclidean distance to measure text similarity. Cosine distance depends on direction rather than magnitude, making it especially robust for text vectors, where absolute vector length is less relevant than semantic similarity.
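To make the difference concrete, here is how the two distances behave on a vector and a scaled copy of it (plain NumPy, no retrieval library assumed):

```python
import numpy as np

def cosine_distance(u, v):
    # Depends only on direction: unchanged if a vector is rescaled.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    # Depends on magnitude as well as direction.
    return float(np.linalg.norm(u - v))

u = np.array([1.0, 2.0, 3.0])
v = 2 * u                           # same direction, twice the length
print(cosine_distance(u, v))        # 0.0
print(euclidean_distance(u, v))     # ~3.74
```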
Your system is designed to handle multiple data sources in a unified manner: the ranked results from each source are merged with a weighted reciprocal-rank-fusion score, weight × (1 / (k + rank)), so that higher-weighted sources and lower (better) ranks dominate the final ordering.
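A sketch of that fusion step, assuming each source returns a best-first list of chunk IDs; the weights and the constant k below are illustrative, not your production values:

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Merge ranked lists using weight * (1 / (k + rank)) per document."""
    scores = {}
    for source, docs in ranked_lists.items():
        w = weights.get(source, 1.0)
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w * (1.0 / (k + rank))
    # Higher fused score first: high-weight sources and low ranks dominate.
    return sorted(scores, key=scores.get, reverse=True)

# Example: embedding results weighted above TF-IDF results.
merged = weighted_rrf(
    {"embedding": ["c3", "c1", "c7"], "tfidf": ["c1", "c9", "c3"]},
    weights={"embedding": 0.7, "tfidf": 0.3},
)
```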
Challenges in Previous Extraction:
PDF extraction often failed in math-heavy or chemistry texts, creating messy or incomplete captures.
Moreover, some end-of-chapter questions erroneously appeared in RAG outputs.
Solutions:
• Switched to Mathpix for high-fidelity OCR, extracting math in LaTeX and other content in Markdown.
• Removed end-of-exercise questions from the knowledge base.
• Implemented topic-based chunking with percentile thresholds (e.g., the 90th percentile for math-heavy content, the 95th otherwise).
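Tying this back to the chunker sketched earlier, the per-topic thresholds can be expressed as a small configuration; the mapping below simply mirrors the example values above:

```python
# Illustrative per-domain percentile thresholds (from the bullets above).
CHUNKING_PERCENTILES = {
    "math": 90,       # math-heavy content splits a little more aggressively
    "default": 95,    # everything else
}

def chunk_document(text, domain="default"):
    percentile = CHUNKING_PERCENTILES.get(domain, CHUNKING_PERCENTILES["default"])
    return percentile_chunks(text, percentile=percentile)
```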
The context in ALT/PLT documents is highly interrelated, so a simple percentile threshold (such as the 90th percentile) was producing unwieldy chunks. You implemented two refinements:
Many of the retrieved outcomes are similar. To avoid returning near-duplicates, K-Means clustering is applied to them, and only the cluster centers (up to five) are returned.
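One way this could be realized with scikit-learn, returning the outcome nearest to each cluster center as its representative (a sketch; outcome_vectors would come from the same embedding model used for retrieval):

```python
import numpy as np
from sklearn.cluster import KMeans

def deduplicate_outcomes(outcomes, outcome_vectors, max_results=5):
    """Cluster near-duplicate outcomes and keep one representative per cluster."""
    vectors = np.asarray(outcome_vectors, dtype=float)
    n_clusters = min(max_results, len(outcomes))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)

    representatives = []
    for center in km.cluster_centers_:
        # Pick the actual outcome closest to each cluster center.
        idx = int(np.argmin(np.linalg.norm(vectors - center, axis=1)))
        representatives.append(outcomes[idx])
    return representatives
```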
For dictionary-style content, each term is combined with its definition. A high threshold (around the 98th percentile) prevents unrelated terms from merging into the same chunk. Future improvements might involve more advanced methods of dictionary extraction and matching.
Complementary Strengths: Embeddings capture semantic relationships, while TF-IDF focuses on exact term usage. Combining them provides the best of both worlds.
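A compact way to illustrate that complementarity is a blended score, reusing the embed() placeholder from the earlier sketch; the 50/50 weighting is only an assumption:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(query, chunks, alpha=0.5):
    """Blend lexical (TF-IDF) and semantic (embedding) similarity per chunk."""
    # Exact term overlap.
    tfidf = TfidfVectorizer().fit(chunks + [query])
    lexical = cosine_similarity(tfidf.transform([query]), tfidf.transform(chunks))[0]

    # Semantic similarity via the embedding model (embed() as sketched earlier).
    q_vec = np.asarray(embed([query]), dtype=float)
    c_vecs = np.asarray(embed(chunks), dtype=float)
    semantic = cosine_similarity(q_vec, c_vecs)[0]

    return alpha * semantic + (1 - alpha) * lexical
```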
The current ranker uses fixed weights and ranks. Future versions could incorporate specialized rankers (e.g., BERT-based cross-encoders) to become more query-aware and yield higher-fidelity results.
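As one concrete direction, a cross-encoder rerank pass could look like the sketch below; the checkpoint name is just a commonly used public model, not necessarily what your system would adopt:

```python
from sentence_transformers import CrossEncoder

def rerank(query, chunks, top_k=5):
    # The cross-encoder scores each (query, chunk) pair jointly, so the
    # ranking is query-aware rather than based on fixed weights.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```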
Query rewriting merges context from previous queries, fixes grammar or spelling, and can reduce unnecessary RAG calls if the new query is similar to the old one. It ensures the user’s intent is captured accurately while retrieving the most relevant documents.
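One way to wire that decision, assuming a hypothetical rewrite_with_llm() call and reusing the embed() placeholder from earlier; the 0.9 similarity cutoff is illustrative:

```python
import numpy as np

def rewrite_and_maybe_retrieve(new_query, history, last_query_vec, retriever,
                               similarity_cutoff=0.9):
    """Rewrite the query with conversational context, then decide whether a
    fresh RAG call is actually needed."""
    # Hypothetical LLM call: merges prior context, fixes grammar and spelling.
    rewritten = rewrite_with_llm(new_query, history)
    query_vec = np.asarray(embed([rewritten]), dtype=float)[0]

    if last_query_vec is not None:
        sim = float(np.dot(query_vec, last_query_vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(last_query_vec)))
        if sim >= similarity_cutoff:
            # Close enough to the previous query: skip retrieval and reuse the
            # previous context, which also keeps the prompt prefix cache-friendly.
            return rewritten, None

    return rewritten, retriever(rewritten)
```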
Large models like GPT support prompt caching over a shared prefix of the context window. If your follow-up query reuses a large shared prefix, the model can reference already-processed tokens, speeding up inference and improving response consistency. In practice, chunk updates and system-prompt changes often break the minimum 1024-token prefix overlap, but selectively calling RAG only when needed preserves more of the prompt for caching.
Graph RAG is especially helpful for global or multi-hop queries. A knowledge graph of entity-relation-entity triples can unify information from across an entire chapter or multiple documents. Queries like “Explain the whole chapter” or “How does friction relate to sliding friction and rolling friction?” benefit greatly from a graph-based approach.
An entity-relation-entity structure forms the backbone of Graph RAG. By searching for an entity in the graph, the system retrieves connected nodes and edges, culminating in a relevant subgraph of knowledge.
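A small sketch of that lookup with networkx, assuming the entity-relation-entity triples have already been extracted; the friction triples below are illustrative:

```python
import networkx as nx

# Illustrative entity-relation-entity triples extracted from a chapter.
triples = [
    ("friction", "has_type", "sliding friction"),
    ("friction", "has_type", "rolling friction"),
    ("rolling friction", "is_less_than", "sliding friction"),
    ("friction", "opposes", "relative motion"),
]

graph = nx.DiGraph()
for head, relation, tail in triples:
    graph.add_edge(head, tail, relation=relation)

def retrieve_subgraph(entity, hops=2):
    # Collect nodes within `hops` edges of the entity (ignoring direction)
    # and return the induced subgraph as grounding context.
    nearby = nx.single_source_shortest_path_length(
        graph.to_undirected(), entity, cutoff=hops
    )
    return graph.subgraph(nearby.keys())

for head, tail, data in retrieve_subgraph("friction").edges(data=True):
    print(head, data["relation"], tail)
```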
Graph RAG with hierarchical chunking can provide global summaries. Alternatively, precomputed abstractive or extractive summaries can quickly answer such queries.
Graph-based indexing simplifies referencing multiple sections of large texts. Combining chunk-level embeddings with a knowledge graph further refines retrieval for complex, multi-step questions.
Evaluate performance on:
Your RAG system intricately combines percentile-based semantic chunking, TF-IDF, and embedding-based retrieval to ensure high-quality, contextually accurate answers. In specialized domains like NCERT, mathematics, and ALT/PLT content, you’ve tailored chunking thresholds, introduced hierarchical chunking, and leveraged improved OCR (Mathpix) for more accurate text extraction.
Graph RAG represents a significant leap for handling broader, multi-faceted queries by relying on entity–relation networks. While it improves recall and context synergy, it demands careful token management and accurate entity extraction.
Overall, your system’s flexibility—through query rewriting, caching, and combined rankers—positions it well for professional-level usage in Q&A retrieval, entire chapter summarization, and beyond.