datasets and are looking for quick insights without too much manual grind, you’ve come to the right place.

In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis next to impossible. Local Large Language Models can transform your raw DataFrame statistics into polished, readable reports in seconds — minutes at worst. This approach eliminates the tedious process of analyzing data by hand and writing executive reports, especially if the data structure doesn’t change.

Pandas handles the heavy lifting of data extraction while LLMs convert your technical outputs into presentable reports. You’ll still need to write functions that pull key statistics from your datasets, but it’s a one-time effort.

This guide assumes you have Ollama installed locally. If you don’t, you can still use third-party LLM vendors, but I won’t explain how to connect to their APIs.

Table of contents:

  • Dataset Introduction and Exploration
  • The Boring Part: Extracting Summary Statistics
  • The Cool Part: Working with LLMs
  • What You Could Improve

Dataset Introduction and Exploration

For this guide, I’m using the MBA admissions dataset from Kaggle. Download it if you want to follow along.

The dataset is licensed under the Apache 2.0 license, which means you can use it freely for both personal and commercial projects.

To get started, you’ll need a few Python libraries installed on your system.

Image 1 – Required Python libraries and versions (image by author)

Once you have everything installed, import the necessary libraries in a new script or a notebook:

import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal

Dataset loading and preprocessing

Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset shape, and shows how many missing values exist in each column:

df = pd.read_csv("data/MBA.csv")

# Basic dataset info
print(f"Dataset shape: {df.shape}\n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)
Image 2 – Basic dataset statistics (image by author)

Since data cleaning isn’t the main focus of this article, I’ll keep the preprocessing minimal. The dataset only has a couple of missing values that need attention:

df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")

That’s it! Let’s see how to go from this to a meaningful report next.

The Boring Part: Extracting Summary Statistics

Even with all the advances in AI capability and availability, you probably don’t want to send your entire dataset to an LLM provider. There are a couple of good reasons why.

It could consume way too many tokens, which translates directly to higher costs. Processing large datasets can take a long time, especially when you’re running models locally on your own hardware. You might also be dealing with sensitive data that shouldn’t leave your organization.

Some manual work is still the way to go.

This approach requires you to write a function that extracts key elements and statistics from your Pandas DataFrame. You’ll have to write this function from scratch for different datasets, but the core idea transfers easily between projects.

The get_summary_context_message() function takes in a DataFrame and returns a formatted multi-line string with a detailed summary. Here’s what it includes:

  • Total application count and gender distribution
  • International vs domestic applicant breakdown
  • GPA and GMAT score quartile statistics
  • Admission rates by academic major (sorted by rate)
  • Admission rates by work industry (top 8 industries)
  • Work experience analysis with categorical breakdowns
  • Key insights highlighting top-performing categories

Here’s the complete source code for the function:

def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.
    
    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, professional backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
    
    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
    work_exp_category_stats = []

    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Build the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)
    
Total Applications: {total_applications:,} people applied to the MBA program.

Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

Academic Performance Statistics:

GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}

GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}

Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary

Once you’ve defined the function, simply call it and print the results:

print(get_summary_context_message(df))
Image 3 – Extracted findings and statistics from the dataset (image by author)

Now let’s move on to the fun part.

The Cool Part: Working with LLMs

This is where things get interesting and your manual data extraction work pays off.

Python helper function for working with LLMs

If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

Image 4 – Available Ollama models (image by author)

If you want to use something like ChatGPT through OpenAI API, you can still do that. You’ll just need to modify the function below to set up your API key and return the appropriate instance from Langchain.

Regardless of the option you choose, a call to get_llm() with a test message shouldn’t return an error:

def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Create and configure a ChatOllama instance for local LLM inference.
    
    This function initializes a ChatOllama client configured to connect to a
    local Ollama server. The client is set up with deterministic output
    (temperature=0) for consistent responses across multiple calls with the
    same input.
    
    Parameters
    ----------
    model_name : str, optional
        The name of the Ollama model to use for chat completions.
        Must be a valid model name that is available on the local Ollama
        installation. Default is "mistral:latest".
    
    Returns
    -------
    ChatOllama
        A configured ChatOllama instance ready for chat completions.
    """
    return ChatOllama(
        model=model_name, base_url="http://localhost:11434", temperature=0
    )


print(get_llm().invoke("test").content)
Image 5 – LLM test message (image by author)

Summarization prompt

This is where you can get creative and write ultra-specific instructions for your LLM. I’ve decided to keep things light for demonstration purposes, but feel free to experiment here.

There isn’t a single right or wrong prompt.

Whatever you do, make sure to include the format arguments using curly brackets – these values will be filled dynamically later:

SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.

You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}

This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:


{context}


Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.

Report:
"""

Summarization Python function

With the prompt and the get_llm() functions declared, the only thing left is to connect the dots. The get_report_summary() function takes in arguments that will fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.

You can choose between Markdown or HTML formats:

def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Generate an AI-powered summary report from a pandas DataFrame.
    
    This function analyzes a dataset and generates a comprehensive summary report
    using a large language model (LLM). It first extracts statistical context
    from the dataset, then uses an LLM to create a human-readable report in the
    specified format.
    
    Parameters
    ----------
    dataset : pd.DataFrame
        The pandas DataFrame to analyze and summarize.
    dataset_name : str
        A descriptive name for the dataset that will be included in the
        generated report for context and identification.
    dataset_source : str
        Information about the source or origin of the dataset.
    report_format : {"markdown", "html"}, optional
        The desired output format for the generated report. Options are:
        - "markdown" : Generate report in Markdown format (default)
        - "html" : Generate report in HTML format
    
    Returns
    -------
    str
        A formatted summary report.
    
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content

Using the function is straightforward – just pass in the dataset, its name, and source. The report format defaults to Markdown:

md_report = get_report_summary(
    dataset=df, 
    dataset_name="MBA Admissions (2025)",
    dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
)
print(md_report)
Image 6 – Final report in Markdown format (image by author)

The HTML report is just as detailed, but could use some styling. Maybe you could ask the LLM to handle that as well!

Image 7 – Final report in HTML format (image by author)

What You Could Improve

I could have easily turned this into a 30-minute read by optimizing every detail of the pipeline, but I kept it simple for demonstration purposes. You don’t have to (and shouldn’t) stop here though.

Here are the things you can improve to make this pipeline even more powerful:

  • Write a function that saves the report (Markdown or HTML) directly to disk. This way you can automate the entire process and generate reports on a schedule without manual intervention.
  • In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more presentable. You could even provide your company’s brand colors and fonts to maintain consistency across all your data reports.
  • Expand the prompt to follow more specific instructions. You might want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
  • Expand the get_llm() function so it can connect both to Ollama and other vendors like OpenAI, Anthropic, or Google. This gives you flexibility to switch between local and cloud-based models depending on your needs.
  • Do literally anything in the get_summary_context_message() function since it serves as the foundation for all context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and data insights that matter to your specific use case.

I hope this minimal example has set you on the right track to automate your own data reporting workflows.

Share.

Comments are closed.