RAG Theory Essentials

RAG Pipeline Architecture

To understand how Retrieval-Augmented Generation (RAG) systems work, you need to follow the complete journey of a user query as it moves through the pipeline. The process begins with query embedding. When you submit a question or prompt, the system first transforms your input into a numeric vector using a pre-trained embedding model. This vector captures the semantic meaning of your query, allowing the system to compare it with stored representations of documents.
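The embedding step can be sketched as follows. This is a toy, self-contained stand-in: the `embed` function below hashes words into a fixed-size normalized vector purely for illustration, where a real pipeline would call a pre-trained embedding model instead.

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: hash each word into a fixed-size vector,
    then L2-normalize. A real system would use a pre-trained
    embedding model here instead of hashing."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

query_vector = embed("What is retrieval-augmented generation?")
print(len(query_vector))  # fixed dimensionality, regardless of query length
```

The key property is that every query, short or long, maps to a vector of the same fixed dimensionality, so it can be compared against every stored document vector.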

Next comes the retrieval stage. Here, the system uses the embedded query to search a vector database containing document chunks. It calculates similarity scores between your query vector and each document vector, then retrieves the top-k most relevant chunks based on these scores.
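A minimal sketch of this stage, using cosine similarity over plain Python lists in place of a real vector database:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Score every stored document vector against the query,
    then return the indices of the top-k most similar chunks."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy index of three document-chunk vectors
index = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(retrieve([1.0, 0.0, 0.0], index, k=2))  # → [0, 2]
```

A production vector database performs the same ranking but uses approximate nearest-neighbor indexes so it does not have to score every stored vector.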

After retrieval, the pipeline performs context selection. Not all retrieved chunks are equally useful, so the system may filter, rank, or combine them to select the most pertinent information. This ensures that only the most relevant context is passed on to the next stage.

Finally, the generation phase uses a large language model (LLM) to produce an answer. The LLM receives your original query along with the selected context chunks and generates a response that is both contextually grounded and fluent. This end-to-end flow makes RAG pipelines highly effective for open-domain question answering and other knowledge-intensive tasks.
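Before the LLM is called, the query and selected chunks are typically assembled into a single prompt. The template below is one plausible layout, not a required format, and the actual LLM call is left out since it depends on the provider:

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine the user's query with the selected context chunks
    into a single grounded prompt for the LLM."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_prompt("What is RAG?",
                      ["RAG combines retrieval and generation.",
                       "Chunks are stored in a vector database."])
print(prompt)
```

The resulting string is what gets sent to the model; numbering the chunks also makes it easy to ask the model to cite which passage supported its answer.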

Vanilla RAG
  • Uses a single retrieval pass to fetch relevant context chunks;
  • Relies on the initial similarity scores for context selection;
  • Passes the top-k retrieved chunks directly to the generator;
  • Simpler, faster, but may include irrelevant or redundant information.
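The bullet points above can be wired together into a minimal end-to-end vanilla RAG sketch. Everything here is a toy stand-in (the hash-based `embed`, the three-document corpus, and returning the prompt instead of calling a real LLM):

```python
import math

DOCS = [
    "RAG combines a retriever with a generator.",
    "Vector databases store document chunk embeddings.",
    "Bananas are rich in potassium.",
]

def embed(text: str, dim: int = 16) -> list[float]:
    # Toy bag-of-words hash embedding; a real pipeline uses a trained model.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def vanilla_rag(query: str, docs: list[str], k: int = 2) -> str:
    q = embed(query)
    index = [embed(d) for d in docs]  # single retrieval pass, no reranking
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q, index[i]), reverse=True)
    context = [docs[i] for i in ranked[:k]]  # top-k passed straight through
    return f"Context: {' '.join(context)}\nQuestion: {query}"

print(vanilla_rag("How does RAG use a retriever?", DOCS))
```

Because the top-k chunks go straight to the generator with no second look, an off-topic chunk that happens to score well will end up in the prompt, which is exactly the weakness the next two variants address.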
RAG with Reranking
  • Adds an extra step after initial retrieval;
  • Uses a secondary model to rerank the retrieved chunks based on relevance;
  • Improves the quality of selected context but adds computational overhead.
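The reranking step can be sketched as a second-pass scorer over the chunks returned by the first retrieval. The word-overlap scorer below is a deliberately crude stand-in; a real pipeline would use a cross-encoder model here:

```python
def rerank(query: str, chunks: list[str], top_n: int = 2) -> list[str]:
    """Second-pass scoring: reorder first-stage results by a finer
    relevance signal, then keep only the best top_n.
    Toy scorer = word overlap with the query; a real reranker
    would be a cross-encoder model."""
    q_words = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q_words & set(c.lower().split())),
                  reverse=True)[:top_n]

chunks = ["Bananas are yellow.",
          "A vector database stores embeddings.",
          "Databases can be relational."]
print(rerank("vector database storage", chunks))
```

The pattern is the important part: a cheap first stage casts a wide net, then a more expensive scorer examines each query-chunk pair closely, trading extra latency for cleaner context.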
Multi-hop Retrieval RAG
  • Performs multiple retrieval steps;
  • Each step may use information from previous chunks or partial answers;
  • Enables handling of complex, multi-part queries;
  • More accurate for reasoning tasks, but increases complexity and latency.
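The multi-hop loop can be sketched as repeated retrieval, where each hop augments the query with the evidence gathered so far. The keyword-matching `toy_retrieve` and the two-entry knowledge base are purely illustrative stand-ins for real vector search:

```python
def multi_hop(query, retrieve_fn, hops: int = 2) -> list[str]:
    """Retrieve repeatedly; each hop extends the query with the
    evidence collected so far, so later hops can target the
    pieces that are still missing."""
    context = []
    current = query
    for _ in range(hops):
        chunk = retrieve_fn(current, context)
        if chunk is None:
            break  # nothing new found; stop early
        context.append(chunk)
        current = query + " " + " ".join(context)
    return context

KB = {
    "capital": "Paris is the capital of France.",
    "france": "France is in Europe.",
}

def toy_retrieve(query: str, already_used: list[str]):
    # Stand-in for a vector search: matches on keywords and
    # skips chunks that were retrieved on an earlier hop.
    for keyword, chunk in KB.items():
        if keyword in query.lower() and chunk not in already_used:
            return chunk
    return None

print(multi_hop("What continent is the capital of France on?", toy_retrieve))
```

Note how the second hop only finds the Europe chunk because the first hop's answer ("France") was folded back into the query, which is the mechanism that lets multi-hop RAG chain together facts no single retrieval would surface.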

1. Which stage of the RAG pipeline is responsible for transforming a user's input into a vector representation?

2. What is the primary function of the retrieval stage in a RAG pipeline?

3. How does RAG with reranking differ from vanilla RAG?

4. Which architectural variant is best suited for answering complex questions that require reasoning across multiple pieces of information?


Section 2. Chapter 2
