One of the promises of retrieval-augmented generation (RAG) is that it allows AI systems to answer questions using up-to-date or domain-specific information, without retraining the model. But most RAG pipelines still treat documents and information as flat and disconnected—retrieving isolated chunks based on vector similarity, with no sense of how those chunks relate.

To remedy RAG's blindness to these often obvious connections between documents and chunks, developers have turned to graph RAG approaches, but many have found that the benefits of graph RAG were not worth the added complexity of implementing it.

In our recent article on the open-source Graph RAG Project and GraphRetriever, we introduced a new, simpler approach that combines your existing vector search with lightweight, metadata-based graph traversal, which doesn’t require graph construction or storage. The graph connections can be defined at runtime—or even query-time—by specifying which document metadata values you would like to use to define graph “edges,” and these connections are traversed during retrieval in graph RAG.

In this article, we expand on one of the use cases in the Graph RAG Project documentation—a demo notebook can be found here—which is a simple but illustrative example: searching movie reviews from a Rotten Tomatoes dataset, automatically connecting each review with its local subgraph of related information, and then putting together query responses with full context and relationships between movies, reviews, reviewers, and other data and metadata attributes.

The dataset: Rotten Tomatoes reviews and movie metadata

The dataset used in this case study comes from a public Kaggle dataset titled “Massive Rotten Tomatoes Movies and Reviews”. It includes two primary CSV files:

  • rotten_tomatoes_movies.csv — containing structured information on over 200,000 movies, including fields like title, cast, directors, genres, language, release date, runtime, and box office earnings.
  • rotten_tomatoes_movie_reviews.csv — a collection of nearly 2 million user-submitted movie reviews, with fields such as review text, rating (e.g., 3/5), sentiment classification, review date, and a reference to the associated movie.

Each review is linked to a movie via a shared movie_id, creating a natural relationship between unstructured review content and structured movie metadata. This makes it a perfect candidate for demonstrating GraphRetriever’s ability to traverse document relationships using metadata alone—no need to manually build or store a separate graph.

By treating metadata fields such as movie_id, genre, or even shared actors and directors as graph edges, we can build a connected retrieval flow that enriches each query with related context automatically.
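As a plain-Python illustration of this idea (with hypothetical data, not the library's API), following a metadata edge is just a lookup from one document's metadata value into another document's matching field:

```python
# Illustrative sketch: metadata values acting as graph "edges".
# The data and field names below mirror the dataset's structure but are
# hypothetical examples, not real records.
movies = [
    {"movie_id": "addams_family", "title": "The Addams Family", "genre": "Comedy"},
]
reviews = [
    {"reviewed_movie_id": "addams_family", "text": "A witty family comedy."},
]

# An "edge" exists wherever review["reviewed_movie_id"] == movie["movie_id"].
movies_by_id = {m["movie_id"]: m for m in movies}

def related_movie(review):
    """Follow the metadata edge from a review to its movie record."""
    return movies_by_id.get(review["reviewed_movie_id"])

print(related_movie(reviews[0])["title"])  # The Addams Family
```

GraphRetriever generalizes exactly this kind of lookup across any metadata fields you name, without requiring you to build or store the graph yourself.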

The challenge: putting movie reviews in context

A common goal in AI-powered search and recommendation systems is to let users ask natural, open-ended questions and get meaningful, contextual results. With a large dataset of movie reviews and metadata, we want to support full-context responses to prompts like:

  • “What are some good family movies?”
  • “What are some recommendations for exciting action movies?”
  • “What are some classic movies with amazing cinematography?”

A great answer to each of these prompts requires subjective review content along with some semi-structured attributes like genre, audience, or visual style. To give a good answer with full context, the system needs to:

  1. Retrieve the most relevant reviews based on the user’s query, using vector-based semantic similarity
  2. Enrich each review with full movie details—title, release year, genre, director, etc.—so the model can present a complete, grounded recommendation
  3. Connect this information with other reviews or movies that provide an even broader context, such as: What are other reviewers saying? How do other movies in the genre compare?

A traditional RAG pipeline might handle step 1 well—pulling relevant snippets of text. But, without knowledge of how the retrieved chunks relate to other information in the dataset, the model’s responses can lack context, depth, or accuracy. 

How graph RAG addresses the challenge

Given a user’s query, a plain RAG system might recommend a movie based on a small set of directly semantically relevant reviews. But graph RAG and GraphRetriever can easily pull in relevant context—for example, other reviews of the same movies or other movies in the same genre—to compare and contrast before making recommendations.

From an implementation standpoint, graph RAG provides a clean, two-step solution:

Step 1: Build a standard RAG system

First, just like with any RAG system, we embed the document text using a language model and store the embeddings in a vector database. Each embedded review may include structured metadata, such as reviewed_movie_id, rating, and sentiment—information we’ll use to define relationships later. Each embedded movie description includes metadata such as movie_id, genre, release_year, director, etc.

This allows us to handle typical vector-based retrieval: when a user enters a query like “What are some good family movies?”, we can quickly fetch reviews from the dataset that are semantically related to family movies. Connecting these with broader context occurs in the next step.

Step 2: Add graph traversal with GraphRetriever

Once the semantically relevant reviews are retrieved in step 1 using vector search, we can then use GraphRetriever to traverse connections between reviews and their related movie records.

Specifically, the GraphRetriever:

  • Fetches relevant reviews via semantic search (RAG)
  • Follows metadata-based edges (like reviewed_movie_id) to retrieve more information that is directly related to each review, such as movie descriptions and attributes, data about the reviewer, etc.
  • Merges the content into a single context window for the language model to use when generating an answer

A key point: no pre-built knowledge graph is needed. The graph is defined entirely in terms of metadata and traversed dynamically at query time. If you want to expand the connections to include shared actors, genres, or time periods, you just update the edge definitions in the retriever config—no need to reprocess or reshape the data.
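Such an expanded configuration might look like the following sketch. The `genre` self-edge and the `max_depth` value are illustrative assumptions; they presume both document types carry a `genre` metadata field:

```python
# Sketch: expanded edge definitions for the retriever config.
# The ("genre", "genre") self-edge is an illustrative assumption;
# vectorstore is the vector store created for the RAG system.
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

retriever = GraphRetriever(
    store=vectorstore,
    edges=[
        ("reviewed_movie_id", "movie_id"),  # review -> its movie
        ("genre", "genre"),                 # movies sharing a genre
    ],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=2),
)
```

Note that only the `edges` list changes; the documents in the vector store are untouched.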

So, when a user asks about exciting action movies with some specific qualities, the system can bring in datapoints like the movie’s release year, genre, and cast, improving both relevance and readability. When someone asks about classic movies with amazing cinematography, the system can draw on reviews of older films and pair them with metadata like genre or era, giving responses that are both subjective and grounded in facts.

In short, GraphRetriever bridges the gap between unstructured opinions (subjective text) and structured context (connected metadata)—producing query responses that are more intelligent, trustworthy, and complete.

GraphRetriever in action

To show how GraphRetriever can connect unstructured review content with structured movie metadata, we walk through a basic setup using a sample of the Rotten Tomatoes dataset. This involves three main steps: creating a vector store, converting raw data into LangChain documents, and configuring the graph traversal strategy.

See the example notebook in the Graph RAG Project for complete, working code.

Create the vector store and embeddings

We begin by embedding and storing the documents, just like we would in any RAG system. Here, we are using OpenAIEmbeddings and the Astra DB vector store:

from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

COLLECTION = "movie_reviews_rotten_tomatoes"
vectorstore = AstraDBVectorStore(
    embedding=OpenAIEmbeddings(),
    collection_name=COLLECTION,
)

The structure of data and metadata

We store and embed document content as we usually would for any RAG system, but we also preserve structured metadata for use in graph traversal. The document content is kept minimal (review text, movie title, description), while the rich structured data is stored in the “metadata” fields in the stored document object.
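As a simplified sketch of that split (the notebook uses LangChain `Document` objects; here we use a plain dictionary, and the column names are illustrative), a raw CSV row can be divided into minimal content plus rich metadata like this:

```python
# Sketch: splitting a raw CSV row into minimal embeddable content plus
# rich structured metadata. Column names are illustrative assumptions;
# see the notebook for the exact fields and LangChain Document usage.
def row_to_document(row: dict) -> dict:
    # Keep only the text we want embedded as document content.
    content = f"{row['title']}: {row['description']}"
    # Everything else is preserved as structured metadata.
    metadata = {k: v for k, v in row.items() if k != "description"}
    metadata["doc_type"] = "movie_info"
    return {"page_content": content, "metadata": metadata}

row = {
    "movie_id": "addams_family",
    "title": "The Addams Family",
    "description": "A macabre family comedy.",
    "genre": "Comedy",
}
doc = row_to_document(row)
print(doc["metadata"]["doc_type"])  # movie_info
```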

This is example JSON from one movie document in the vector store:

> pprint(documents[0].metadata)

{'audienceScore': '66',
 'boxOffice': '$111.3M',
 'director': 'Barry Sonnenfeld',
 'distributor': 'Paramount Pictures',
 'doc_type': 'movie_info',
 'genre': 'Comedy',
 'movie_id': 'addams_family',
 'originalLanguage': 'English',
 'rating': '',
 'ratingContents': '',
 'releaseDateStreaming': '2005-08-18',
 'releaseDateTheaters': '1991-11-22',
 'runtimeMinutes': '99',
 'soundMix': 'Surround, Dolby SR',
 'title': 'The Addams Family',
 'tomatoMeter': '67.0',
 'writer': 'Charles Addams,Caroline Thompson,Larry Wilson'}

Note that graph traversal with GraphRetriever uses only the attributes in this metadata field, does not require a specialized graph DB, and does not use any LLM calls or other expensive operations.

Configure and run GraphRetriever

The GraphRetriever traverses a simple graph defined by metadata connections. In this case, we define an edge from each review to its corresponding movie using the directional relationship between reviewed_movie_id (in reviews) and movie_id (in movie descriptions).

We use an “eager” traversal strategy, which is one of the simplest traversal strategies. See the Graph RAG Project documentation for more details about strategies.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

retriever = GraphRetriever(
    store=vectorstore,
    edges=[("reviewed_movie_id", "movie_id")],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1),
)

In this configuration:

  • start_k=10: retrieves 10 review documents using semantic search
  • adjacent_k=10: allows up to 10 adjacent documents to be pulled at each step of graph traversal
  • select_k=100: up to 100 total documents can be returned
  • max_depth=1: the graph is only traversed one level deep, from review to movie

Note that in this simple example, because each review links to exactly one reviewed movie, graph traversal would have stopped at depth 1 regardless of this parameter. See the Graph RAG Project for examples of more sophisticated traversal.

Invoking a query

You can now run a natural language query, such as:

INITIAL_PROMPT_TEXT = "What are some good family movies?"

query_results = retriever.invoke(INITIAL_PROMPT_TEXT)

And with a little sorting and reformatting of text—see the notebook for details—we can print a basic list of the retrieved movies and reviews, for example:

 Movie Title: The Addams Family
 Movie ID: addams_family
 Review: A witty family comedy that has enough sly humour to keep adults chuckling throughout.

 Movie Title: The Addams Family
 Movie ID: the_addams_family_2019
 Review: ...The film's simplistic and episodic plot put a major dampener on what could have been a welcome breath of fresh air for family animation.

 Movie Title: The Addams Family 2
 Movie ID: the_addams_family_2
 Review: This serviceable animated sequel focuses on Wednesday's feelings of alienation and benefits from the family's kid-friendly jokes and road trip adventures.
 Review: The Addams Family 2 repeats what the first movie accomplished by taking the popular family and turning them into one of the most boringly generic kids films in recent years.

 Movie Title: Addams Family Values
 Movie ID: addams_family_values
 Review: The title is apt. Using those morbidly sensual cartoon characters as pawns, the new movie Addams Family Values launches a witty assault on those with fixed ideas about what constitutes a loving family. 
 Review: Addams Family Values has its moments -- rather a lot of them, in fact. You knew that just from the title, which is a nice way of turning Charles Addams' family of ghouls, monsters and vampires loose on Dan Quayle.

We can then pass the above output to the LLM for generation of a final response, using the full set of information from the reviews as well as the linked movies.
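The formatted text above is produced by grouping retrieved reviews under their movies. A minimal plain-Python sketch of that grouping (the document structures here are illustrative; see the notebook for the real implementation) looks like this:

```python
# Sketch: grouping retrieved reviews under their movies to build the
# formatted text passed to the final prompt. Data structures here are
# illustrative stand-ins for the retrieved LangChain documents.
results = [
    {"doc_type": "movie_info", "movie_id": "addams_family",
     "title": "The Addams Family"},
    {"doc_type": "review", "reviewed_movie_id": "addams_family",
     "text": "A witty family comedy."},
]

# Index the movie documents, then list each movie's reviews beneath it.
movies = {d["movie_id"]: d for d in results if d["doc_type"] == "movie_info"}

lines = []
for movie_id, movie in movies.items():
    lines.append(f"Movie Title: {movie['title']}")
    lines.append(f"Movie ID: {movie_id}")
    for d in results:
        if d.get("reviewed_movie_id") == movie_id:
            lines.append(f"Review: {d['text']}")

formatted_text = "\n".join(lines)
print(formatted_text)
```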

Setting up the final prompt and LLM call looks like this:

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from pprint import pprint

MODEL = ChatOpenAI(model="gpt-4o", temperature=0)

VECTOR_ANSWER_PROMPT = PromptTemplate.from_template("""

A list of Movie Reviews appears below. Please answer the Initial Prompt text
(below) using only the listed Movie Reviews.

Please include all movies that might be helpful to someone looking for movie
recommendations.

Initial Prompt:
{initial_prompt}

Movie Reviews:
{movie_reviews}
""")

formatted_prompt = VECTOR_ANSWER_PROMPT.format(
    initial_prompt=INITIAL_PROMPT_TEXT,
    movie_reviews=formatted_text,
)

result = MODEL.invoke(formatted_prompt)

print(result.content)

And, the final response from the graph RAG system might look like this:

Based on the reviews provided, "The Addams Family" and "Addams Family Values" are recommended as good family movies. "The Addams Family" is described as a witty family comedy with enough humor to entertain adults, while "Addams Family Values" is noted for its clever take on family dynamics and its entertaining moments.

Keep in mind that this final response was the result of the initial semantic search for reviews mentioning family movies—plus expanded context from documents that are directly related to these reviews. By expanding the window of relevant context beyond simple semantic search, the LLM and the overall graph RAG system are able to put together more complete and more helpful responses.

Try it yourself

The case study in this article shows how to:

  • Blend unstructured and structured data in your RAG pipeline
  • Use metadata as a dynamic knowledge graph without building or storing one
  • Improve the depth and relevance of AI-generated responses by surfacing connected context

In short, this is Graph RAG in action: adding structure and relationships to make LLMs not just retrieve, but build context and reason more effectively. If you’re already storing rich metadata alongside your documents, GraphRetriever gives you a practical way to put that metadata to work—with no additional infrastructure.

We hope this inspires you to try GraphRetriever on your own data—it’s all open-source—especially if you’re already working with documents that are implicitly connected through shared attributes, links, or references.

You can explore the full notebook and implementation details here: Graph RAG on Movie Reviews from Rotten Tomatoes.
