Graph Coloring for Data Science: A Comprehensive Guide

is coloring a picture of a flower that has six petals arranged in a circle. She wants to color each of the six petals with exactly one of the following four colors: red, orange, yellow, and blue. No two neighboring petals can have the same color. Not all four colors need to be used. How many ways are there for Rita to color the flower petals while satisfying these constraints? This was the basis of problem #25 in the 2025 Cayley Contest, and it happens to be a specific example of a class of combinatorial data science problems related to the notion of graph coloring. In the following sections, we will solve Rita’s concrete problem, derive its general open and closed-form solutions, and look at some interesting practical applications in industry.

Note: All figures and formulas in the following sections have been created by the author of this article.

A Theoretical Puzzle

To solve Rita’s problem, let us begin by visualizing the flower petals as a cyclical graph consisting of 6 nodes connected by edges as shown in Figure 1:

Figure 1: An uncolored cycle of flower petals

Figure 2 shows some valid colorings (also called proper colorings) of the petals:

Let P(n, k) be the number of ways we can color a cycle of n nodes with k colors, such that no neighboring nodes have the same color.

Now, consider what happens if we break the cycle into a chain of n nodes. How many ways P_chain(n, k) are there to color a chain of n nodes with k colors, such that no neighboring nodes have the same color? For the starting (left-most) node in the chain, we have a choice of k colors. But for each of the following n – 1 nodes, we have a choice of only k – 1 colors, since one of the colors will have already been taken by the preceding node. This intuition is illustrated in Figure 3 below:

Thus, we have:

However, notice that in some of these valid colorings, the first and last nodes in the chain will share the same color – if we subtract these cases from P_chain(n, k), then we would obtain P(n, k) as required. Furthermore, notice that the cases to subtract are equivalent to P(n – 1, k), i.e., the number of ways to color a cycle of n – 1 nodes with k colors, such that no neighboring nodes have the same color. This so-called deletion-contraction maneuver is illustrated in Figure 4 below:

Figure 5 below shows the base cases for P(n, k), for a given value k:

Pulling all of these insights together yields the following first-order recurrence relation for positive integers n > 3 and k, with base cases as described above:

Now, solving Rita’s problem amounts to evaluating the recurrence P(n, k) for n = 6 and k = 4. Since the numbers in this case are relatively small, we can carry out the evaluation by expanding P(6, 4) as follows:

Using the expression for the base case P(3, k), note that:

As a result:

Thus, there are exactly 732 ways for Rita to color her flower petals while satisfying the given constraints.

The following Python function (compatible with Python versions ≥ 3.9) operationalizes the recurrence to let us quickly evaluate P(n, k) for larger input values:

def num_proper_colorings(n: int, k: int) -> int:
    """
    Iteratively compute the number of proper colorings of a cycle graph with n nodes and k colors.

    Parameters:
    - n (int): Number of nodes in the cycle graph.
    - k (int): Number of available colors.

    Returns:
    - int: Number of proper colorings.
    """
    if n == 1:
        return k
    elif n == 2:
        return k * (k - 1)
    elif n == 3:
        return k * (k - 1) * (k - 2)

    # Initialize base case num_proper_colorings(3, k)
    num_prev = k * (k - 1) * (k - 2) 

    for i in range(4, n + 1):
        current = k * (k - 1)**(i - 1) - num_prev
        num_prev = current

    return num_prev

Graph coloring can also be operationalized using backtracking, a useful technique for exploring the solution space of various types of data science problems and incrementally constructing candidate solutions. This article provides an intuitive introduction to backtracking, and the following video shows how backtracking can be applied to graph coloring problems in particular:

Closed-Form Solution

The iterative Python function shown above has a time complexity of O(n) with respect to number of nodes n in the cyclical graph. However, if we can find an analytical or closed-form solution to P(n, k), we could evaluate the result directly; the corresponding Python function would have a time complexity of just O(1) and thus represent a substantial time saving as n grows very large. Time complexity and the merits of closed-form solutions are discussed in more depth in this article on algorithmic thinking for data scientists.

To find a simplified closed-form solution, let us rearrange our recurrence relation as follows:

At this point, we can make use of a neat principle in linear algebra: the general solution of a linear algebraic equation f(x) = y is the sum x_h + x_p of the general solution x_h of its homogeneous counterpart f(x) = 0 and a particular solution x_p of f(x) = y. To remind ourselves of why this works, we can substitute x_h + x_p into f(x) = y to get f(x_h + x_p) = y. Since f(x) is a linear function, f(x_h + x_p) = f(x_h) + f(x_p) = y. We know that f(x_h) = 0 and f(x_p) = y. By substitution, f(x_h) + f(x_p) = 0 + y = y. The equation y = y is trivially true, confirming that x_h + x_p is indeed the general solution of f(x) = y. Thus, our task is now to:

Find a general solution x_h of the homogeneous equation P(n, k) + P(n – 1, k) = 0
Find a particular solution x_p to our recurrence P(n, k) + P(n – 1, k) = k(k – 1)^{(n – 1)}
Combine x_h and x_p to derive the general closed-form solution of our recurrence

So, let us start by solving the following homogeneous equation:

Note that, for simplicity, we let:

Each term seems to cancel the previous one, so it must alternate in sign at every step, i.e., there exists a constant term C (which we will derive shortly) such that:

Next, since the right-hand side of the original recurrence is a multiple of (k – 1)^{(n – 1)}, let us try the following particular solution P‘:

A is a constant term that we can derive by plugging P’ into the left-hand side of the original recurrence as follows:

This means A = 1, so we have:

Combining the particular and homogeneous solutions gives us:

Now, we can use our base case for n = 3 to derive C:

Substituting C back into the combined solution gives us the following closed-form solution:

We can verify that this indeed gives us the correct solution for the Cayley contest problem:

Practical Applications

Graph coloring is a fundamental problem in graph theory that involves assigning labels (or “colors”) to the nodes of a graph such that no two adjacent nodes share the same color. In essence, graph coloring attempts to partition a graph into independent sets that may be treated uniformly (i.e., all elements of a set may be assigned the same color) without violating adjacency constraints. This type of problem framing can be applied in a wide range of data science use cases involving constraint-based optimization and resource allocation. We look at some of these applications below.

Scheduling and Timetabling

Graph coloring can be used to solve scheduling problems, where tasks or events must be arranged without conflicts. Graphs used to model such scenarios are often called conflict graphs.

Consider the intuitive case of designing school timetables. Each class can be represented by a node in the graph, and an edge connects two classes if some students go to both classes. Time slots are represented by colors. A proper coloring of the graph – in which adjacent nodes do not share the same color – ensures that no student is faced with the predicament of having to attend two classes at the same time. The case of exam scheduling poses a similar problem, since exams must be assigned time slots in such a way that no student has conflicting exam times. Graph coloring can be used to solve this type of problem and minimize the number of time slots needed overall.

Graph coloring can also help solve problems of constraint-based scheduling in industrial settings. For example, product assembly in a car manufacturing plant typically involves various tasks, such as painting, wiring electronics, chassis fitting, and installing engines. Each task requires specialized tooling, workstations, and skilled workers, and there may be certain restrictions on task ordering. Painting should not happen immediately before wiring, since residual paint fumes could damage sensitive electronics. Engine installation and chassis fitting may require some of the same equipment (e.g., lifts or alignment rigs) that is in short supply and cannot be used simultaneously. To apply graph coloring, we can model each task as a node in a graph, with an edge connecting two nodes if the corresponding tasks are in conflict (i.e., the tasks cannot be scheduled consecutively). Colors represent distinct time slots or stages of assembly. Proper graph coloring ensures that conflicting tasks are not scheduled back-to-back; this helps optimize the manufacturing workflow, reduce downtime and resource bottlenecks, and prevent costly errors.

Clustering and Feature Selection

In data mining and machine learning (ML), clustering algorithms group data points together based on shared characteristics or relationships. Graph coloring offers a natural approach to clustering by treating the data as a graph, where nodes represent individual data points and edges indicate some relationship between the respective nodes (e.g., similarity, class membership). Graph coloring enables us to partition the data into independent sets (i.e., groups of nodes that are not directly connected) for effective detection of clusters; this method may be particularly useful in social network analysis, biological data modeling, and recommendation systems, where relationships between entities can be quite complex. Proper graph coloring helps ensure that each cluster is internally cohesive while being distinct from other clusters, providing a clean and interpretable structure for downstream analysis. Interested readers can check out this article and this book for a deep dive into graph-theoretic representations of data for feature engineering.

Finally, feature selection is an important consideration in building efficient and interpretable ML models, especially in the context of high-dimensional data (e.g., as seen in domains such as genomics and finance). Training a model on many features is computationally expensive, and highly correlated features tend to hold redundant information, which can lead to inefficient training and overfitting. Graph coloring offers an elegant solution: construct a graph where each node represents a feature, and edges connect pairs of highly correlated features. A proper coloring of such a graph ensures that no two highly correlated features are assigned the same color, allowing the selection of one representative feature per color. This technique reduces dimensionality while preserving informational diversity, leading to simpler models that can potentially generalize better.

The Wrap

Graph coloring, while rooted in combinatorial mathematics, has practical relevance that extends well beyond theoretical puzzles. Starting from a math contest problem involving flower petals, we derived general open and closed-form solutions for the proper coloring of cyclical graphs, and looked at how graph coloring can be applied to a wide range of data science problems. The key to such practical applications lies in smart problem framing: if the problem is framed as a graph in the right way – with careful consideration given to the definition of nodes, edges, and coloring constraints – then the solution approach may become readily apparent. To borrow a quote that is often (mis)attributed to Einstein, “if [you] had an hour to solve a problem, [you should] spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” Ultimately, as the field of data science continues to evolve, techniques such as graph coloring will likely become increasingly relevant in both research and applied settings.