Make it work, then make it beautiful, then if you really, really have to, make it fast. 90 percent of the time, if you make it beautiful, it will already be fast. So really, just make it beautiful!
— Joe Armstrong (co-designer of the Erlang programming language)
This is a new article about Python in the series "Data Science: From School to Work." Since the beginning, you have learned how to manage your Python project with UV, how to write clean code using PEP and SOLID principles, how to handle errors and use loguru to log your code, and how to write tests.
Now you are in a position to create working, production-ready code. But code is never perfect and can always be improved. A final (optional, but highly recommended) step in creating code is optimization.
To optimize your code, you need to be able to track what's going on inside it. To do so, we use tools called profilers. They generate profiles of your code: sets of statistics that describe how often and for how long various parts of the program were executed. Profilers make it possible to identify bottlenecks and parts of the code that consume too many resources. In other words, they show where your code should be optimized.
Today, there is such a proliferation of profilers in Python that the default profiler in PyCharm is called yappi, for "Yet Another Python Profiler".
This article is therefore not an exhaustive list of all existing profilers. Instead, I present one tool for each aspect of the code we want to profile: memory, time, and CPU/GPU consumption. Other packages will be mentioned with some references, but will not be detailed.
I – Memory profilers
Memory profiling is the technique of monitoring and evaluating a program's memory utilization while it runs. This method helps developers find memory leaks, optimize memory utilization, and understand their programs' memory consumption patterns. Memory profiling is crucial to prevent applications from using more memory than necessary, causing sluggish performance or crashes.
1/ memory-profiler
memory_profiler is an easy-to-use Python module designed to profile the memory usage of a script. It depends on the psutil module. To install the package, simply type:
pip install memory_profiler # (in your virtual environment)
# or if you use uv (what I encourage)
uv add memory_profiler
Profiling an executable
One of the advantages of this package is that it is not limited to pythonic use. It installs the mprof command, which allows monitoring the activity of any executable.
For instance, you can monitor the memory consumption of applications like ollama by running this command:
mprof run ollama run gemma3:4b
# or with uv
uv run mprof run ollama run gemma3:4b
To see the result, you have to install matplotlib first. Then, you can plot the recorded memory profile of your executable by running:
mprof plot
# or with uv
uv run mprof plot
The graph then looks like this:

Profiling Python code
Let's get back to what brings us here: the monitoring of Python code.
memory_profiler works in a line-by-line mode using a simple decorator, @profile. First, you decorate the function of interest, then you run the script. The output will be written directly to the terminal. Consider the following monitoring.py script:
@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()
It is important to note that it is not necessary to import the package (from memory_profiler import profile) at the beginning of the script. In that case, you have to pass specific arguments to the Python interpreter:
python -m memory_profiler monitoring.py # (in your virtual environment)
# or
uv run -m memory_profiler monitoring.py
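Alternatively, if you do add the import at the top of the script, it can be run with a plain python command. A minimal sketch:
from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a

if __name__ == '__main__':
    my_func()  # run simply with: python monitoring.py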
And you get the following output, with line-by-line details:

The output is a table with five columns.
- Line #: The line number of the profiled code.
- Mem usage: The memory usage of the Python interpreter after executing that line.
- Increment: The change in memory usage compared to the previous line.
- Occurrences: The number of times that line was executed.
- Line Contents: The actual source code.
This output is very detailed and allows very fine monitoring of a specific function.
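Note that the mprof command shown earlier also works on plain Python scripts. As a quick sketch using the same monitoring.py, you can record memory usage over time and plot it:
mprof run monitoring.py
mprof plot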
Important: Unfortunately, this package is no longer actively maintained. The creator is looking for a substitute.
2/ tracemalloc
tracemalloc is a built-in module in Python that tracks memory allocations and deallocations. It provides an easy-to-use interface for capturing and analyzing memory usage snapshots, making it an invaluable tool for any Python developer.
It offers the following details:
- Shows where each object was allocated by providing a traceback.
- Gives memory allocation statistics by file and line number, including the overall size, count, and average size of memory blocks.
- Allows you to compare two snapshots to identify potential memory leaks.
The package tracemalloc may be useful to identify memory leaks in your code.
Personally, I find it less intuitive to set up than the other packages presented in this article. The official documentation is a good place to go further.
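To give an idea of the workflow, here is a minimal sketch (the profiled list comprehension is just a placeholder) that starts tracing, takes a snapshot, and prints the top allocation sites:
import tracemalloc

tracemalloc.start()

# code you want to profile (placeholder example)
data = [str(i) * 10 for i in range(100_000)]

snapshot = tracemalloc.take_snapshot()
# Group allocations by file and line number, largest first
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)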
II – Time profilers
Time profiling is the process of measuring the time spent in different parts of a program. By identifying performance bottlenecks, you can focus your optimization efforts on the parts of the code that will have the most significant impact.
1/ line-profiler
The line-profiler package is quite similar to memory-profiler, but it serves a different purpose. It is designed to profile specific functions by measuring the execution time of each line within those functions. To use LineProfiler effectively, you need to explicitly specify which functions you want it to profile, by simply adding the @profile decorator above them.
To install it, just type:
pip install line_profiler # (in your virtual environment)
# or
uv add line_profiler
Consider the following script, named monitoring.py:
@profile
def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

def print_statement(idx: int):
    if idx == 0:
        print("Starting array creation!")
    elif idx == 1:
        print("Array created successfully!")
    else:
        raise ValueError("Invalid index provided!")

@profile
def main():
    print_statement(0)
    create_list(400000)
    print_statement(1)

if __name__ == "__main__":
    main()
To measure the execution time of the functions main() and create_list(), we add the decorator @profile.
The easiest way to get a time profile of this script is to use the kernprof script:
kernprof -lv monitoring.py # (in your virtual environment)
# or
uv run kernprof -lv monitoring.py
It will create a binary file named your_script.py.lprof. The -v argument shows the output directly in the terminal.
Otherwise, you can view the results later like so:
python -m line_profiler monitoring.py.lprof # (in your virtual environment)
# or
uv run python -m line_profiler monitoring.py.lprof
It provides the following information:

There are two tables, one per profiled function. Each table contains the following information:
- Line #: The line number in the file.
- Hits: The number of times that line was executed.
- Time: The total amount of time spent executing the line in the timer’s units. In the header information before the tables, you will see a line “Timer unit:” giving the conversion factor to seconds. It may be different on different systems.
- Per Hit: The average amount of time spent executing the line once, in the timer's units.
- % Time: The percentage of time spent on that line relative to the total amount of recorded time spent in the function.
- Line Contents: The actual source code.
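If you prefer not to go through kernprof, line_profiler also exposes a programmatic API. Here is a minimal sketch reusing create_list() from the script above (without the @profile decorator):
from line_profiler import LineProfiler

def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

profiler = LineProfiler()
wrapped = profiler(create_list)  # wrap the function to be profiled
wrapped(400000)
profiler.print_stats()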
2/ cProfile
Python comes with two built-in profilers:
- cProfile: A C extension with reasonable overhead that makes it suitable for profiling long-running programs. It is recommended for most users.
- profile: A pure Python module whose interface is imitated by cProfile, but which adds significant overhead to profiled programs. It can be a valuable tool when you need to extend or customize the profiling functionality.
The base syntax is cProfile.run(statement, filename=None, sort=-1). The filename argument can be passed to save the output to a file. And the sort argument can be used to specify how the output should be printed. By default, it is set to -1 (no sorting).
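As a quick sketch of these two arguments (assuming a main() function is defined, as in the script below):
import cProfile

# Print the profile sorted by cumulative time
cProfile.run("main()", sort="cumtime")

# Or save the raw stats to a file for later analysis with pstats
cProfile.run("main()", filename="main_profile.prof")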
For instance, if you modify the monitoring script like this:
import cProfile

def create_list(lst_len: int):
    arr = []
    for i in range(0, lst_len):
        arr.append(i)

def print_statement(idx: int):
    if idx == 0:
        print("Starting array creation!")
    elif idx == 1:
        print("Array created successfully!")
    else:
        raise ValueError("Invalid index provided!")

def main():
    print_statement(0)
    create_list(400000)
    print_statement(1)

if __name__ == "__main__":
    cProfile.run("main()")
we have the following output:

First, we have the script outputs: print_statement(0) and print_statement(1).
Then, we have the profiler output. The first line shows the number of function calls and the time it took to run. The second line is a reminder of the sort parameter. Then, the profiler provides a table with six columns:
- ncalls: Shows the number of calls made
- tottime: Total time taken by the given function. Note that time spent in calls to sub-functions is excluded.
- percall: The quotient of tottime divided by ncalls.
- cumtime: Unlike tottime, this includes time spent in this and all subfunctions that the higher-level function calls. It is most useful and is accurate for recursive functions.
- percall: The percall following cumtime is calculated as the quotient of cumtime divided by primitive calls. The primitive calls include all the calls that were not included through recursion.
- filename:lineno(function): The file name, line number, and name of the function.
The first and the last rows of the table come from cProfile. The other rows are about the script.
You can customize the output by using the Profile() class. First, you initialize an instance of the Profile class and use the methods enable() and disable() to, respectively, start and stop collecting profiling data. Then, the pstats module can be used to manipulate the results collected by the profiler object.
To sort the output by cumulative time instead of by the standard name, the previous code can be rewritten like this:
import cProfile, pstats

# ...
# Same as before

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats('cumtime')
    stats.print_stats()
And the output becomes:

As you can see, the table is now sorted by cumtime. And the two cProfile rows from the previous table are no longer present.
Visualize profiling with SnakeViz
The output is very easy to analyze. But it can become unreadable if the profiled code gets too big. Another way to analyze the output is to visualize the data instead of reading it. To do so, we use the SnakeViz package. To install it, simply type:
pip install snakeviz # (in your virtual environment)
# or
uv add snakeviz
Then, replace stats.print_stats() with stats.dump_stats("profile.prof") to save the profiling data.
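For reference, here is a minimal sketch of what the end of the script looks like after this change:
import cProfile, pstats

# ... same functions as before

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    main()
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.dump_stats("profile.prof")  # binary stats file read by SnakeViz
Now, you can get a visualization of your profiling by typing: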
snakeviz profile.prof
It launches a browser interface from which you can choose between two data visualizations: Icicle and Sunburst.


It is easier to read than the print_stats() output because you can interact with each element by moving your mouse over it. For instance, you can get more details about the function create_list().

Create a call graph with gprof2dot
A call graph is a visual representation of the relationships between functions or methods in a program, showing which functions call others and how long each function or method takes. It can be seen as a map of your code. To install the package, simply type:
pip install gprof2dot # (in your virtual environment)
# or
uv add gprof2dot
Then execute your script by typing:
python -m cProfile -o monitoring.pstats .\monitoring.py # (in your virtual environment)
# or
uv run python -m cProfile -o monitoring.pstats .\monitoring.py
It will create a monitoring.pstats file that can be turned into a call graph using the following command (it requires Graphviz's dot tool):
gprof2dot -f pstats monitoring.pstats | dot -Tpng -o monitoring.png # (in your virtual environment)
# or
uv run gprof2dot -f pstats monitoring.pstats | dot -Tpng -o monitoring.png
The call graph is then saved into a PNG file named monitoring.png.

3/ Other interesting packages
a/ PyCallGraph
PyCallGraph is a Python module that creates call graph visualizations. To use it, you have to install the package, as well as Graphviz, which it relies on to render the graphs. To create a call graph of your code, simply run it inside a PyCallGraph context, like this:
from pycallgraph import PyCallGraph
from pycallgraph.output import GraphvizOutput

with PyCallGraph(output=GraphvizOutput()):
    main()  # code you want to profile
Then, you get a PNG of the call graph of your code, named pycallgraph.png by default.
I’ve made the call graph of the previous example:

In each box, you have the name of the function, the time spent in it, and the number of calls. Like with SnakeViz, the graph may be very complex if your code has many dependencies. But the colors indicate the bottlenecks. In complex code, it is very interesting to study it to see the dependencies and relationships.
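If the graph gets too busy, PyCallGraph can filter what is traced. Here is a minimal sketch using its Config and GlobbingFilter classes (the excluded patterns are placeholders to adapt to your code):
from pycallgraph import PyCallGraph, Config, GlobbingFilter
from pycallgraph.output import GraphvizOutput

# Exclude noisy modules from the trace
config = Config()
config.trace_filter = GlobbingFilter(exclude=["pycallgraph.*", "random.*"])

with PyCallGraph(output=GraphvizOutput(), config=config):
    main()  # code you want to profile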
b/ PyInstrument
PyInstrument is also a Python profiler that is very easy to use. You can add the profiler to your script by surrounding the code like this:
from pyinstrument import Profiler
profiler = Profiler()
profiler.start()
# code you want to profile
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
The output gives:

It is less detailed than cProfile, but it is also more readable. Your functions are highlighted and sorted by time. But the true interest of PyInstrument comes with its HTML output. To get it, simply type in the terminal:
pyinstrument --html .\monitoring.py
# or
uv run pyinstrument --html .\monitoring.py
It launches a browser interface from which you can choose between two data visualizations: Call stack and Timeline.


Here, the profile is more detailed and you have many options to filter.
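As a side note (a sketch based on pyinstrument's Profiler API), you can also save the HTML report from within the script instead of using the command line:
# after profiler.stop(), instead of printing to the terminal:
with open("profile.html", "w") as f:
    f.write(profiler.output_html())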
III – CPU/GPU profilers
CPU and GPU profiling is the process of analyzing the utilization and performance of a program on the central processing unit (CPU) and graphics processing unit (GPU). By measuring how many resources are spent on different parts of the code on these processing units, developers can identify performance bottlenecks, understand where their code is being executed, and optimize their application to achieve better performance and efficiency.
As far as I know, there is only one package that can profile GPU power consumption.
1/ Scalene
Scalene is a high-performance CPU, GPU, and memory profiler designed specifically for Python. It is an open-source package that provides detailed insights. It is designed to be fast, accurate, and easy to use, making it an excellent tool for developers looking to optimize their code. Its main features are:
- CPU/GPU Profiling: Scalene provides detailed information on CPU/GPU usage, including the time spent in different parts of your code. It can help you identify performance bottlenecks and optimize your code for better execution times.
- Memory Profiling: Scalene tracks memory allocation and deallocation, helping you understand how your code uses memory. This is particularly useful for identifying memory leaks or optimizing memory-intensive applications.
- Line-by-Line Profiling: Scalene provides line-by-line profiling, which gives you a detailed breakdown of the time spent in each line of your code. This feature is invaluable for pinpointing performance issues.
- Visualization: Scalene includes a graphical interface for visualizing profiling results, making it easier to understand and navigate the data.
To highlight all the advantages of Scalene, I have developed functions whose sole purpose is to consume memory (memory_waster()), CPU (cpu_waster()) and GPU (gpu_convolution()). All of them are in a script named scalene_tuto.py.
import random
import copy
import math

import cupy as cp
import numpy as np


def memory_waster():
    """Wastes memory but in a controlled way"""
    memory_hogs = []

    # Create moderately sized redundant data structures
    for i in range(100):
        garbage_data = []
        for j in range(1000):
            waste = f"Useless string #{j} repeated " * 10
            garbage_data.append(waste)
            garbage_data.append(
                {
                    "id": j,
                    "data": waste,
                    "numbers": [random.random() for _ in range(50)],
                    "range_data": list(range(100)),
                }
            )
        memory_hogs.append(garbage_data)

    for iteration in range(4):
        print(f"Creating copy #{iteration}...")
        memory_copy = copy.deepcopy(memory_hogs)
        memory_hogs.extend(memory_copy)

    return memory_hogs


def cpu_waster():
    meaningless_result = 0
    for i in range(10000):
        for j in range(10000):
            temp = (i**2 + j**2) * random.random()
            temp = temp / (random.random() + 0.01)
            temp = abs(temp**0.5)
            meaningless_result += temp

            # Some trigonometric operations
            angle = random.random() * math.pi
            temp += math.sin(angle) * math.cos(angle)

        if i % 100 == 0:
            random_mess = [random.randint(1, 1000) for _ in range(1000)]  # Smaller list
            random_mess.sort()
            random_mess.reverse()
            random_mess.sort()

    return meaningless_result


def gpu_convolution():
    image_size = 128
    kernel_size = 64

    image = np.random.random((image_size, image_size)).astype(np.float32)
    kernel = np.random.random((kernel_size, kernel_size)).astype(np.float32)

    image_gpu = cp.asarray(image)
    kernel_gpu = cp.asarray(kernel)

    result = cp.zeros_like(image_gpu)
    for y in range(kernel_size // 2, image_size - kernel_size // 2):
        for x in range(kernel_size // 2, image_size - kernel_size // 2):
            pixel_value = 0
            for ky in range(kernel_size):
                for kx in range(kernel_size):
                    iy = y + ky - kernel_size // 2
                    ix = x + kx - kernel_size // 2
                    pixel_value += image_gpu[iy, ix] * kernel_gpu[ky, kx]
            result[y, x] = pixel_value

    result_cpu = cp.asnumpy(result)
    cp.cuda.Stream.null.synchronize()
    return result_cpu


def main():
    print("\n1/ Wasting some memory (controlled)...")
    _ = memory_waster()

    print("\n2/ Wasting CPU cycles (controlled)...")
    _ = cpu_waster()

    print("\n3/ Wasting GPU cycles (controlled)...")
    _ = gpu_convolution()


if __name__ == "__main__":
    main()
For the GPU function, you have to install cupy according to your CUDA version (run nvcc --version to get it):
pip install cupy-cuda12x # (in your virtual environment)
# or
uv add cupy-cuda12x
Further details on installing cupy can be found in the documentation.
To run Scalene, use the command:
scalene scalene_tuto.py
# or
uv run scalene scalene_tuto.py
It profiles CPU, GPU, and memory by default. If you only want one or some of them, use the flags --cpu, --gpu, and --memory, as shown below.
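For instance (a sketch; these flags can be combined):
scalene --cpu --memory scalene_tuto.py
# or
uv run scalene --cpu --memory scalene_tuto.py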
Scalene provides line-level and function-level profiling. And it has two interfaces: the Command Line Interface (CLI) and the web interface.
Important: On Windows, it is better to use Scalene with Ubuntu through WSL. Otherwise, the profiler does not retrieve memory consumption information.
a) Command Line Interface
By default, Scalene's output is the web interface. To obtain the CLI instead, add the flag --cli.
scalene scalene_tuto.py --cli
# or
uv run scalene scalene_tuto.py --cli
You have the following results:


By default, the code is displayed in dark mode. So if, like me, you work in light mode, the result isn’t very pretty.
The visualization is categorized into three distinct colors, each representing a different profiling metric.
- The blue section represents CPU profiling, which provides a breakdown of the time spent executing Python code, native code (such as C or C++), and system-related tasks (like I/O operations).
- The green section is dedicated to memory profiling, showing the percentage of memory allocated by Python code, as well as the overall memory usage over time and its peak values.
- The yellow section focuses on GPU profiling, displaying the GPU's running time and the volume of data copied between the GPU and CPU, measured in MB/s. It is worth noting that GPU profiling is currently limited to NVIDIA GPUs.
b) The web interface
The web interface is divided into three parts.



The color code is the same as in the command line interface. But some icons are added:
- 💥: Optimizable code region (performance indication in the Function Profile section).
- ⚡: Optimizable lines of code.
c) AI Suggestions
One of the great advantages of Scalene is the ability to use AI to improve the slowness and/or overconsumption you have identified. It currently supports the OpenAI API, Amazon Bedrock, Azure OpenAI, and local models through Ollama.

After selecting your tools, you just have to click on 💥 or ⚡ if you want to optimize a part of the code or just a line.
I tested it with codellama:7b-python from Ollama to optimize the gpu_convolution() function. Unfortunately, as mentioned in the interface:
Note that optimizations are AI-generated and may not be correct.
None of the suggested optimizations worked. But the codebase was not conducive to optimization, since it was artificially complicated: just removing the unnecessary lines saves time and memory. Also, I used a small model, which could be the reason.
Even though my tests were inconclusive, I think this option can be interesting and will surely continue to improve.
Conclusion
Nowadays, we are less concerned about the resource consumption of our developments, and these optimization deficits can quickly accumulate, making the code slow, too slow for production, and sometimes even requiring the purchase of more powerful hardware.
Code profiling tools are indispensable when it comes to identifying areas in need of optimization.
The combination of the memory profiler and line profiler provides a very good initial analysis: easy to set up, with easy-to-understand reports.
Tools such as cProfile and Scalene are complete and have graphical representations, but require more time to analyze. Finally, the AI optimization option offered by Scalene is a real asset, even if in my case the model used was not sufficient to provide anything relevant.
Curious about Python & Data Science?
Follow me for more tutorials and insights!