I explain how to build an app that generates multiple-choice questions (MCQs) on any user-defined subject. The app retrieves Wikipedia articles related to the user's request and uses retrieval-augmented generation (RAG) to prompt a chat model to generate the questions.

I will demonstrate how the app works, explain how Wikipedia articles are retrieved, and show how they are used to invoke a chat model. Afterwards, I explain the key components of this app in more detail. The code of the app is available here.

App Demo

App Demo

The GIF above shows the user entering the learning context, a generated MCQ, and the feedback displayed after the user submits an answer.

Start Screen

On the first screen the user describes the context of the MCQs to be generated. After pressing "Submit Context", the app searches for Wikipedia articles whose content matches the user query.

Question Screen

The app splits each Wikipedia page into sections and scores them based on how closely they match the user query. These scores are used to sample the context of the next question, which is displayed on the next screen together with four answer choices. The user can select a choice and submit it via "Submit Answer". It is also possible to skip the question via "Next Question"; in this case the question is considered not to meet the user's expectations, and its context is avoided when generating subsequent questions. To end the session the user can choose "End MCQ".

Answer Screen

The screen shown after the user submits an answer indicates whether the answer was correct and provides an additional explanation. From there, the user can either get a new question via "Next Question" or end the session with "End MCQ".

End Session Screen

The end session screen shows how many questions were answered correctly and incorrectly. Additionally, it contains the number of questions the user rejected via "Next Question". If the user selects "Start New Session", the start screen is displayed again, where a new context for the next session can be provided.

Concept

The aim of this app is to produce high-quality and up-to-date questions on any user-defined topic. User feedback is taken into account to ensure that the generated questions meet the user's expectations.

To retrieve high-quality and up-to-date context, Wikipedia articles are selected with respect to the user's query. Each article is split into sections, and every section is scored based on its similarity to the user query. If the user rejects a question, the respective section score is downgraded to reduce the likelihood of sampling this section again.

This process can be separated into two workflows:

  1. Context Retrieval
  2. Question Generation

Both are described below.

Context Retrieval

The workflow that derives the MCQ context from Wikipedia based on the user query is shown below.

Context Retrieval Workflow

The user inserts the query that describes the context of the MCQs at the start screen. An example of the user query could be: “Ask me anything about stars and planets”.

To search Wikipedia efficiently, this query is converted into keywords. For the query above, the keywords are: "Stars", "Planets", "Astronomy", "Solar System", and "Galaxy".

For each keyword a Wikipedia search is executed, and the top three pages of each search are selected. Not every one of these 15 pages is a good fit for the query provided by the user. To remove irrelevant pages at the earliest possible stage, the vector similarity between the embedded user query and each page excerpt is calculated. Pages whose similarity is below a threshold are filtered out. In our example, 3 of the 15 pages were removed.
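
The sketch below illustrates how such a similarity filter could look. It assumes a sentence-transformers embedding model and a simple list of page dictionaries; the concrete embedding model and data structures are assumptions for illustration, not the app's actual code:

from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

def filter_pages_by_excerpt(user_query, pages, threshold=0.3):
    # Keep only pages whose excerpt is sufficiently similar to the user query
    query_embedding = embedding_model.encode(user_query, convert_to_tensor=True)
    kept_pages = []
    for page in pages:  # each page is assumed to be a dict with 'title' and 'excerpt' keys
        excerpt_embedding = embedding_model.encode(page['excerpt'], convert_to_tensor=True)
        similarity = util.cos_sim(query_embedding, excerpt_embedding).item()
        if similarity >= threshold:
            kept_pages.append({**page, 'similarity': similarity})
    return kept_pages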

The remaining pages are read and divided into sections. As not all of a page's content may relate to the user query, splitting the pages into sections allows selecting the parts of a page that fit the user query particularly well. Hence, for each section the vector similarity against the user query is calculated, and sections with low similarity are filtered out. The remaining 12 pages contained 305 sections, of which 244 were kept after filtering.

The last step of the retrieval workflow is to assign a score to each section with respect to the vector similarity. This score will later be used to sample sections for the question generation.

Question Generation

The workflow to generate a new MCQ is shown below:

Question Generation Workflow

The first step is to sample one section with respect to the section scores. The text of this section is inserted, together with the user query, into a prompt that invokes a chat model. The chat model returns a JSON-formatted response that contains the question, the answer choices, and an explanation of the correct answer. In case the provided context is not suitable for generating an MCQ that addresses the user query, the chat model is instructed to return a keyword indicating that the question generation was not successful.

If the question generation was successful, the question and the answer choices are displayed to the user. Once the user submits an answer, it is evaluated whether the answer was correct, and the explanation of the correct answer is shown. To generate a new question, the same workflow is repeated.

In case the question generation was not successful, or the user rejected the question by clicking on "Next Question", the score of the section that was selected to generate the prompt is downgraded. Hence, it is less likely that this section will be selected again.
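
A minimal sketch of this sampling and downgrading step is shown below. The multiplicative downgrade factor is an assumption for illustration; the app actually derives scores from rejection counts, as described in the Context Scoring section:

import random

def sample_section(sections):
    # Sample one section with probability proportional to its score
    weights = [section['score'] for section in sections]
    return random.choices(sections, weights=weights, k=1)[0]

def downgrade_section(section, factor=0.5):
    # Reduce the score of a section after a rejected or failed question
    section['score'] *= factor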

Key Components

Next, I will explain some key components of the workflows in more detail.

Extracting Wiki Articles

Wikipedia articles are extracted in two steps: first, a search is run to find suitable pages. After filtering the search results, the remaining pages are read section by section.

Search requests are sent to this URL. Additionally, a header containing the requestor's contact information and a parameter dictionary with the search query and the number of pages to be returned are passed. The output is in JSON format and can be converted to a dictionary. The code below shows how to run the request:

import os
import requests

# WIKI_SEARCH_URL points to the Wikipedia search endpoint
headers = {'User-Agent': os.getenv('WIKI_USER_AGENT')}
parameters = {'q': search_query, 'limit': number_of_results}
response = requests.get(WIKI_SEARCH_URL, headers=headers, params=parameters)
page_info = response.json()['pages']

After filtering the search results based on the pages' excerpts, the text of the remaining pages is imported using wikipediaapi:

import os
import wikipediaapi

def get_wiki_page_sections_as_dict(page_title, sections_exclude=SECTIONS_EXCLUDE):
    # The user agent identifies the requestor, as required by the Wikipedia API
    wiki_wiki = wikipediaapi.Wikipedia(user_agent=os.getenv('WIKI_USER_AGENT'), language='en')
    page = wiki_wiki.page(page_title)

    if not page.exists():
        return None

    def sections_to_dict(sections, parent_titles=[]):
        # Map concatenated section titles to section texts, starting with the page summary
        result = {'Summary': page.summary}
        for section in sections:
            if section.title in sections_exclude: continue
            section_title = ": ".join(parent_titles + [section.title])
            if section.text:
                result[section_title] = section.text
            # Recurse into subsections and merge their entries
            result.update(sections_to_dict(section.sections, parent_titles + [section.title]))
        return result

    return sections_to_dict(page.sections)

To access Wikipedia articles, the app uses wikipediaapi.Wikipedia, which requires a user-agent string for identification. It returns a WikipediaPage object that contains a summary of the page as well as the page sections, each with a title and text. Sections are organized hierarchically: each section object contains its own list of sections, which are its subsections. The function above reads all sections of a page and returns a dictionary that maps the concatenation of section and subsection titles to the respective text.
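
A hypothetical call of this function could look as follows, where "Solar System" is just an example page title:

sections = get_wiki_page_sections_as_dict("Solar System")
if sections:
    for title, text in sections.items():
        print(f"{title}: {text[:80]}...")  # print each section title with a short text preview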

Context Scoring

Sections that fit the user query better should get a higher probability of being selected. This is achieved by assigning a score to each section, which is used as a weight for sampling the sections. This score is calculated as follows:

\[s_{section}=w_{rejection}s_{rejection}+(1-w_{rejection})s_{sim}\]

Each section's score combines two factors in a weighted sum: how often it has been rejected and how closely its content matches the user query. The rejection score itself consists of two components: the number of rejections of the section's page relative to the highest number of page rejections, and the number of rejections of this section relative to the highest number of section rejections:

\[s_{rejection}=1-\frac{1}{2}\left( \frac{n_{page(s)}}{\max_{page}n_{page}} + \frac{n_s}{\max_{s}n_s} \right)\]
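
In code, this scoring could be sketched as follows; the weight w_rejection = 0.5 is an illustrative assumption:

def rejection_score(n_page, max_page, n_section, max_section):
    # 1.0 means never rejected; approaches 0.0 for the most-rejected page and section
    page_term = n_page / max_page if max_page else 0.0
    section_term = n_section / max_section if max_section else 0.0
    return 1 - 0.5 * (page_term + section_term)

def section_score(s_rejection, s_sim, w_rejection=0.5):
    # Weighted sum of the rejection score and the query similarity
    return w_rejection * s_rejection + (1 - w_rejection) * s_sim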

Prompt Engineering

Prompt engineering is a crucial aspect of the Learning App's functionality. The app uses two prompts to:

  • Get keywords for the Wikipedia page search
  • Generate MCQs for the sampled context

The template of the keyword generation prompt is shown below:

KEYWORDS_TEMPLATE = """
You're an assistant to generate keywords to search for Wikipedia articles that contain content the user wants to learn. 
For a given user query return at most {n_keywords} keywords. Make sure every keyword is a good match to the user query. 
Rather provide fewer keywords than keywords that are less relevant.

Instructions:
- Return the keywords separated by commas 
- Do not return anything else
"""

This system message is combined with a human message containing the user query to invoke the LLM.

The parameter n_keywords sets the maximum number of keywords to be generated. The instructions ensure that the response can easily be converted to a list of keywords. Despite these instructions, the LLM often returns the maximum number of keywords, including some less relevant ones.
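
A minimal sketch of this invocation, assuming a LangChain chat model (the actual model and wrapper used by the app may differ), could look like this:

from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4o-mini")  # hypothetical model choice

def get_keywords(user_query, n_keywords=5):
    messages = [
        SystemMessage(content=KEYWORDS_TEMPLATE.format(n_keywords=n_keywords)),
        HumanMessage(content=user_query),
    ]
    response = llm.invoke(messages)
    # The prompt asks for a comma-separated list, so a simple split is sufficient
    return [keyword.strip() for keyword in response.content.split(",") if keyword.strip()]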

The MCQ prompt contains the sampled section and invokes the chat model to respond with a question, answer choices, and an explanation of the correct answer in a machine-readable format.

MCQ_TEMPLATE = """
You are a learning app that generates multiple-choice questions based on educational content. The user provided the 
following request to define the learning content:

"{user_query}"

Based on the user request, following context was retrieved:

"{context}"

Generate a multiple-choice question directly based on the provided context. The correct answer must be explicitly stated 
in the context and should always be the first option in the choices list. Additionally, provide an explanation for why 
the correct answer is correct.
Number of answer choices: {n_choices}
{previous_questions}{rejected_questions}
The JSON output should follow this structure (for number of choices = 4):

{{"question": "Your generated question based on the context", "choices": ["Correct answer (this must be the first choice)","Distractor 1","Distractor 2","Distractor 3"], "explanation": "A brief explanation of why the correct answer is correct."}}

Instructions:
- Generate one multiple-choice question strictly based on the context.
- Provide exactly {n_choices} answer choices, ensuring the first one is the correct answer.
- Include a concise explanation of why the correct answer is correct.
- Do not return anything else than the json output.
- The provided explanation should not assume the user is aware of the context. Avoid formulations like "As stated in the text...".
- The response must be machine readable and not contain line breaks.
- Check if it is possible to generate a question based on the provided context that is aligned with the user request. If it is not possible set the generated question to "{fail_keyword}".
"""

The inserted parameters are:

  • user_query: text of user query
  • context: text of sampled section
  • n_choices: number of answer choices
  • previous_questions: instruction to not repeat previous questions with list of all previous questions
  • rejected_questions: instruction to avoid questions of similar nature or context with list of rejected questions
  • fail_keyword: keyword that indicates that question could not be generated

Including previous questions reduces the chance that the chat model repeats questions. Additionally, by providing rejected questions, the user's feedback is considered when generating new questions. The example in the template ensures that the generated output is in the correct format so that it can easily be converted to a dictionary. Setting the correct answer as the first choice avoids the need for an additional output field that indicates the correct answer. When the choices are shown to the user, their order is shuffled. The last instruction defines what output should be provided in case it is not possible to generate a question matching the user query. Using a standardized keyword makes it easy to identify when the question generation has failed.
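
The sketch below shows how the model's response could be parsed and the choices shuffled. The field names follow the template above, while the function itself and the fail keyword value are illustrative assumptions:

import json
import random

FAIL_KEYWORD = "FAILED"  # hypothetical value of the fail keyword

def parse_mcq_response(response_text):
    # Convert the model's JSON answer into a dict and shuffle the choices
    mcq = json.loads(response_text)
    if mcq['question'] == FAIL_KEYWORD:
        return None  # signal that no suitable question could be generated
    correct_answer = mcq['choices'][0]  # the correct answer is always the first choice
    shuffled_choices = mcq['choices'][:]
    random.shuffle(shuffled_choices)
    return {
        'question': mcq['question'],
        'choices': shuffled_choices,
        'correct_answer': correct_answer,
        'explanation': mcq['explanation'],
    }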

Streamlit App

The app is built using Streamlit, an open-source app framework in Python. Streamlit provides many functions that add page elements with only one line of code. For example, the element in which the user writes the query is created via:

context_text = st.text_area("Enter the context for MCQ questions:")

where context_text contains the string the user has written. Buttons are created with st.button or st.radio, where the returned variable indicates whether the button has been pressed or which value has been selected.
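
For example, the answer selection and submission could be sketched as follows; the variable names are illustrative and assume a parsed MCQ dictionary like the one above:

import streamlit as st

choice = st.radio("Select your answer:", mcq['choices'])
if st.button("Submit Answer"):
    st.write("Correct!" if choice == mcq['correct_answer'] else "Not quite.")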

The page is generated top-down by a script that defines each element sequentially. Every time the user interacts with the page, e.g. by clicking a button, the script can be re-run with st.rerun(). When re-running the script, it is important to carry over information from the previous run. This is done via st.session_state, which can hold any objects. For example, the MCQ generator instance is assigned to the session state as:

st.session_state.mcq_generator = MCQGenerator()

so that when the context retrieval workflow has been executed, the found context is available to generate an MCQ on the next page.
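
A minimal sketch of this session-state pattern could look as follows; the MCQGenerator methods and the screen variable are assumptions for illustration:

import streamlit as st

if 'mcq_generator' not in st.session_state:
    # Create the generator once and keep it across script re-runs
    st.session_state.mcq_generator = MCQGenerator()

if st.button("Submit Context"):
    st.session_state.mcq_generator.retrieve_context(context_text)  # hypothetical method
    st.session_state.screen = "question"
    st.rerun()  # re-run the script to render the question screen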

Enhancements

There are many options to enhance this app. Beyond Wikipedia, users could also upload their own PDFs to generate questions from custom materials such as lecture slides or textbooks. This would enable users to generate questions on any content; for example, the app could be used to prepare for exams by uploading course materials.

Another possible improvement is to optimize the context selection to minimize the number of questions rejected by the user. Instead of updating scores, an ML model could be trained to predict how likely a question is to be rejected, based on features such as similarity to accepted and rejected questions. Every time another question is rejected, this model could be retrained.

Also, the generated questions could be saved so that when a user wants to repeat the learning exercise, these questions can be used again. An algorithm could be applied to select previously incorrectly answered questions more frequently, to focus on improving the learner's weaknesses.

Summary

This article showcases how retrieval-augmented generation (RAG) can be used to build an interactive learning app that generates high-quality, context-specific multiple-choice questions from Wikipedia articles. By combining keyword-based search, semantic filtering, prompt engineering, and a feedback-driven scoring system, the app dynamically adapts to user preferences and learning goals. Leveraging tools like Streamlit enables rapid prototyping and deployment, making this an accessible framework for educators, students, and developers alike. With further enhancements—such as custom document uploads, adaptive question sequencing, and machine learning-based rejection prediction—the app holds strong potential as a versatile platform for personalized learning and self-assessment.

Further Reading

To learn more about RAG I can recommend these articles from Shaw Talebi and Avishek Biswas. Harrison Hoffman wrote two excellent tutorials on embeddings and vector databases and on building an LLM RAG Chatbot. How to manage state in Streamlit can be found in Baertschi's article.

If not stated otherwise, all images were created by the author.
