We then handle user requests by using the embedding model to create an embedding for the query. We use that embedding to run an approximate nearest neighbor (ANN) similarity search on the vector store and retrieve the matching fragments. Next, we use the RAG prompt template to combine the results with the original query, and then send the complete input to the LLM.
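The query path above can be sketched in Python as follows. This is a minimal, self-contained illustration under stated assumptions, not the author's implementation. `embed` is a hypothetical bag-of-words stand-in for a real embedding model. Exact brute-force cosine search is swapped in for a true ANN index. `VectorStore`, `RAG_TEMPLATE`, `call_llm`, and `answer` are all hypothetical names, and `call_llm` only echoes the prompt it would send.

```python
import math
import re
from collections import Counter

# Hypothetical RAG prompt template; the actual template is not given in the text.
RAG_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
)


def embed(text: str) -> dict[str, float]:
    """Stand-in for the embedding model: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class VectorStore:
    """Toy in-memory vector store: exact brute-force search stands in for an ANN index."""

    def __init__(self, fragments: list[str]) -> None:
        # Embed every fragment once, at indexing time.
        self.entries = [(embed(f), f) for f in fragments]

    def search(self, query_vec: dict[str, float], k: int = 2) -> list[str]:
        """Return the k fragments most similar to the query vector."""
        ranked = sorted(
            self.entries, key=lambda entry: cosine(query_vec, entry[0]), reverse=True
        )
        return [fragment for _, fragment in ranked[:k]]


def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; this stub just echoes the prompt it would send."""
    return prompt


def answer(query: str, store: VectorStore, k: int = 2) -> str:
    query_vec = embed(query)                 # 1. embed the user query
    matches = store.search(query_vec, k)     # 2. similarity search on the vector store
    prompt = RAG_TEMPLATE.format(            # 3. combine results with the original query
        context="\n".join(f"- {m}" for m in matches),
        query=query,
    )
    return call_llm(prompt)                  # 4. send the complete input to the LLM


if __name__ == "__main__":
    store = VectorStore([
        "Paris is the capital of France.",
        "The mitochondria is the powerhouse of the cell.",
        "Python lists are dynamic arrays.",
    ])
    print(answer("What is the capital of France?", store, k=1))
```

Because retrieval is isolated in `VectorStore.search`, replacing the brute-force loop with a real ANN index (for example an HNSW graph) would leave the embed, template, and LLM steps unchanged. Exact search is used here only to keep the sketch dependency-free and deterministic.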

A nice, quick rundown of how to implement retrieval-augmented generation (RAG).