RAG, which stands for Retrieval-Augmented Generation, is a technique that improves an LLM's (Large Language Model's) responses by having it pull relevant passages from a smaller, more specific knowledge base at query time, rather than relying only on what it learned during training. LLMs like ChatGPT are typically trained on huge swaths of the internet (billions of data points), which makes them broadly knowledgeable but also prone to small errors and hallucinations on specific topics.
Here is an example of a situation where RAG could be used and be helpful:
I want to build a US state tour guide chatbot, which contains general information about US states, such as their capitals, populations, and main tourist attractions. To do this, I can download the Wikipedia pages of these US states and index their text so the LLM retrieves its answers from these specific pages.
Creating your RAG LLM
One of the most popular tools for building RAG systems is LlamaIndex, which:
- Simplifies the integration between LLMs and external data sources
- Allows developers to structure, index, and query their data in a way that is optimized for LLM consumption
- Works with many types of data, such as PDFs and text files
- Helps construct a RAG pipeline that retrieves and injects relevant chunks of data into a prompt before passing it to the LLM for generation
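To make that pipeline concrete, here is a minimal local sketch using LlamaIndex's open-source quickstart classes, VectorStoreIndex and SimpleDirectoryReader (the "data" folder name is an assumption matching the folder used later in this tutorial, and an OPENAI_API_KEY environment variable is assumed). In the rest of this article, I'll build the hosted LlamaCloud version instead:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every file in ./data, chunk it, and build a local vector index
# (uses OpenAI embeddings by default, so OPENAI_API_KEY must be set)
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve relevant chunks and generate an answer in one call
response = index.as_query_engine().query("What is the capital of Texas?")
print(response.response)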
Download your data
Start by getting the data you want your model to draw from. To download PDFs from Wikipedia (whose text is licensed under CC BY-SA 4.0) in the right format, make sure you click Print and then “Save as PDF.”
Don’t just use Wikipedia’s built-in PDF export; LlamaCloud won’t like the format it’s in and will reject your files.
For the purposes of this article and to keep things simple, I’ll only download the pages for the following five popular destinations (four states plus one federal district):
- Florida
- California
- Washington D.C.
- New York
- Texas
Make sure to save these all in a folder where your project can easily access them. I saved them in one called “data”.
Get necessary API keys
Before you create your custom states database, there are two API keys you’ll need to generate:
- One from OpenAI, to access a base LLM
- One from LlamaCloud, to access the hosted index you upload custom data to
Once you have these API keys, store them in a .env file in your project.
# .env file
LLAMA_API_KEY=""
OPENAI_API_KEY=""
Create an Index and Upload your data
Create a LlamaCloud account. Once you’re in, find the Index section and click “Create” to create a new index.
An index stores and manages document indexes remotely so they can be queried via an API without needing to rebuild or store them locally.
Here’s how it works:
- When you create your index, there will be a place where you can upload files to feed into the model’s database. Upload your PDFs here.
- LlamaIndex parses and chunks the documents.
- It creates an index (e.g., vector index, keyword index).
- This index is stored in LlamaCloud.
- You can then query it using an LLM through the API.
The next thing you need to do is configure an embedding model. Despite the similar name, an embedding model is not the LLM that writes your answers: it converts each chunk of your documents (and, later, each query) into a numerical vector, so the index can find the chunks most relevant to a question.
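As a quick illustration of what an embedding model does, here is a short sketch (assuming the llama-index-embeddings-openai package is installed and OPENAI_API_KEY is set):

from llama_index.embeddings.openai import OpenAIEmbedding

# Turn a piece of text into a vector of floats
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
vector = embed_model.get_text_embedding("Florida's capital is Tallahassee.")
print(len(vector))  # texts with similar meanings get similar vectors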
When you’re creating a new index you want to select “Create a new OpenAI embedding”:

When you create your new embedding you’ll have to provide your OpenAI API key and name your model.

Once you have created your model, leave the other index settings as their defaults and hit “Create Index” at the bottom.
It may take a few minutes to parse and store all the documents, so make sure every document has finished processing before you try to run a query. The status shows on the right side of the screen in a box labeled “Index Files Summary.”
Accessing your model via code
Once you’ve created your index, you’ll also get an Organization ID. For cleaner code, add your Organization ID and Index Name to your .env file. Then, retrieve all the necessary variables to initialize your index in your code:
import os
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# Connect to the hosted index (assumes your .env has been loaded as shown earlier)
index = LlamaCloudIndex(
    name=os.getenv("INDEX_NAME"),
    project_name="Default",
    organization_id=os.getenv("ORG_ID"),
    api_key=os.getenv("LLAMA_API_KEY"),
)
Query your index and ask a question
To do this, you’ll need to define a query (prompt) and then generate a response by calling the index:
query = "What state has the highest population?"
response = index.as_query_engine().query(query)
# Print out just the text part of the response
print(response.response)
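A nice side effect of querying this way is that the response object also carries the retrieved chunks, via LlamaIndex’s standard source_nodes attribute, so you can check what an answer was grounded in (the file_name metadata key is an assumption; it is populated by most loaders):

# Inspect which document chunks supported the answer
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.metadata.get("file_name"))
    print(node_with_score.node.get_content()[:100])  # first 100 characters of the chunk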
Having a longer conversation with your bot
By querying the LLM the way we just did above, you can easily access information from the documents you loaded. However, if you ask a follow-up question like “Which one has the least?” without context, the model won’t remember what your original question was. This is because we haven’t programmed it to keep track of the chat history.
In order to do this, you need to:
- Create memory using ChatMemoryBuffer
- Create a chat engine and add the created memory using ContextChatEngine
To create a chat engine:
from llama_index.core.chat_engine import ContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# Create a retriever from the index
retriever = index.as_retriever()

# Set up memory to hold the running conversation
memory = ChatMemoryBuffer.from_defaults(token_limit=2000)

# Create a chat engine with memory
chat_engine = ContextChatEngine.from_defaults(
    retriever=retriever,
    memory=memory,
    llm=OpenAI(model="gpt-4o"),
)
Next, feed your query into your chat engine:
# To query:
response = chat_engine.chat("What is the population of New York?")
print(response.response)
This gives the response: “As of 2024, the estimated population of New York is 19,867,248.”
I can then ask a follow-up question:
response = chat_engine.chat("What about California?")
print(response.response)
This gives the following response: “As of 2024, the population of California is 39,431,263.” As you can see, the model remembered that our previous question was about population and responded accordingly.
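Because the memory persists across calls, the conversation can continue for as many turns as you like. A simple loop turns this into a back-and-forth chat (a minimal sketch; type “quit” to stop):

# Simple multi-turn loop on top of the chat engine
while True:
    user_input = input("You: ")
    if user_input.strip().lower() == "quit":
        break
    reply = chat_engine.chat(user_input)
    print("Bot:", reply.response)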

Conclusion
Retrieval-Augmented Generation is an efficient way to ground an LLM in specific data without retraining it. LlamaCloud offers a simple and straightforward way to build your own RAG pipeline and query the model that lies underneath.
The code I used for this tutorial was written in a notebook, but it can also be wrapped in a Streamlit app to create a more natural back-and-forth conversation with a chatbot. I’ve included the Streamlit code here on my GitHub.
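If you want a sense of what that looks like, here is a minimal sketch using Streamlit’s chat widgets (st.chat_input and st.chat_message); it assumes the chat_engine from earlier is built in the same script, and it is an illustration rather than the exact code in my repo:

import streamlit as st

st.title("US State Tour Guide")

# Persist the conversation across Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask about a US state"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    # chat_engine is the ContextChatEngine built above
    # (in a real app, cache it with st.cache_resource so it isn't rebuilt on every rerun)
    answer = chat_engine.chat(prompt).response
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)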
Thanks for reading
- Connect with me on LinkedIn
- Buy me a coffee to support my work!
- I offer 1:1 data science tutoring, career coaching/mentoring, writing advice, resume reviews & more on Topmate!