Ahh, the sea.

During a vacation on the Mediterranean Sea, I found myself lying on the beach, staring into the waves. Lady Luck was having a good day: the sun glared down from a blue, cloudless sky, heating the sand and the salty sea around me. For the first time in a while, I had downtime. There was nothing related to ML in the remote region I was in, where the rough roads would have scared away anybody used to the even pavements of Western countries.

Then, away from work and, partially, civilization, somewhere between zoning out and full-on daydreaming, my thoughts began to drift. In our day-to-day business, we are too, well, busy to spend time doing nothing. But “nothing” is a strong word here: as my thoughts drifted, I first recalled recent events, then pondered work, and then, eventually, arrived at machine learning.

Maybe traces of my previous article, where I reflected on 6.5 years of “doing” ML, were still lingering in the back of my mind. Or maybe it was simply the complete absence of anything technical around me, where the sea was my only companion. Whatever the reason, I mentally started replaying the years behind me. What had gone well? What had gone sideways? And, most importantly, what do I wish someone had told me at the beginning?

This post is a collection of those things. It’s not meant to be a list of dumb mistakes that I urge others to avoid at all costs. Instead, it’s my attempt to write down the things that would have made my journey a bit smoother (but only a bit; uncertainty is necessary to make the future just that: the future). Parts of my list overlap with my previous post, and for good reason: some lessons are worth repeating and worth reading twice.

Here’s Part 1 of that list. Part 2 is currently buried in my sandy, seawater-stained notebook. My plan is to follow up with it in the next couple of weeks, once I have enough time to turn it into a quality article.

1. Doing ML Mostly Means Preparing Data

This is a point I try not to think too much about, or it will tell me: you did not do your homework.

When I started out, my internal monologue was something like: “I just want to do ML.” Whatever that meant. I had visions of plugging neural networks together, combining methods, and running large-scale training. While I did all of that at one point or another, I found that “doing ML” often means spending a lot of time just preparing the data so that you can eventually train a machine learning model. Model training, ironically, is often the shortest and final part of the whole process.

Thus, every time I finally get to the model training step, I breathe a mental sigh of relief, because it means I’ve made it through the invisible part: preparing the data. There’s nothing “sellable” about data preparation; in my experience, it goes entirely unnoticed (as long as it’s done well enough).

Here’s the usual pattern for it:

  • You have a project.
  • You get a real-world dataset. (If you work with a well-curated benchmark dataset, then you’re lucky!)
  • You want to train a model.
  • But first… data cleaning, fixing, merging, validating.

Let me give you a personal example, one that I’ve since told as a funny story (which it is now; back then, it meant redoing a few days of machine learning work under time pressure).

I once worked on a project where I wanted to predict vegetation density (using the NDVI index) from ERA5 weather data. ERA5 is a massive gridded dataset, freely available from the European Centre for Medium-Range Weather Forecasts. I merged this dataset with NDVI satellite data from NOAA (basically, the American weather agency), carefully aligned the resolutions, and everything seemed fine—no shape mismatches, no errors were thrown.

Then, I called the data preparation done and trained a Vision Transformer model on the combined dataset. A few days later, I visualized the results and… surprise! The model thought Earth was upside down. Literally—my input data was right-side up, but the target vegetation density was flipped at the equator.

What had happened? A subtle bug in my resolution translation flipped the latitude orientation of the vegetation data. I hadn’t noticed it because I was spending a lot of time on data preparation already, and wanted to get to the “fun part” quickly.
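In hindsight, a coordinate-aware merge plus one cheap orientation check would have caught the flip immediately. Here is a minimal sketch of what I mean, using xarray; the file names, variable names, and grid details are hypothetical rather than the actual project code:

```python
import xarray as xr

# A minimal sketch, not the actual project code: file names, variable names,
# and grid details are hypothetical.
era5 = xr.open_dataset("era5_weather.nc")   # ERA5 grids often store latitude from 90 down to -90
ndvi = xr.open_dataset("noaa_ndvi.nc")      # other products store latitude from -90 up to 90

def lat_descending(ds: xr.Dataset) -> bool:
    """True if the latitude coordinate runs from north to south."""
    lat = ds["latitude"].values
    return bool(lat[0] > lat[-1])

# My bug boiled down to this: the two grids stored latitude in opposite
# orders, and my hand-rolled resolution translation ignored that, flipping
# the targets across the equator.
if lat_descending(era5) != lat_descending(ndvi):
    ndvi = ndvi.sortby("latitude", ascending=not lat_descending(era5))

# Coordinate-aware interpolation onto the coarser ERA5 grid.
ndvi_on_era5 = ndvi.interp(latitude=era5["latitude"], longitude=era5["longitude"])

# And nothing beats looking at the data before training, e.g.:
# ndvi_on_era5["ndvi"].isel(time=0).plot()
```

The check costs one line; the silent flip cost me days.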

This kind of mistake drives home an important point: real-world ML projects are data projects. Especially outside academic research, you’re not working with CIFAR or ImageNet. You’re working with messy, incomplete, partially labeled, multi-source datasets that require:

  • Cleaning
  • Aligning
  • Normalizing
  • Debugging
  • Visual inspection

And that list is non-exhaustive. Then you repeat all of the above.
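To make that list a bit more concrete, here is a hedged sketch of the kind of pre-training sanity checks I now run; the arrays and their names are hypothetical, not taken from a specific project:

```python
import numpy as np

def sanity_check(features: np.ndarray, targets: np.ndarray) -> None:
    # Cleaning: how much of the data is missing or non-finite?
    print("NaN fraction (features):", float(np.isnan(features).mean()))
    print("NaN fraction (targets): ", float(np.isnan(targets).mean()))

    # Aligning: do the sample counts still match after all the merging?
    assert features.shape[0] == targets.shape[0], "feature/target count mismatch"

    # Normalizing: are the value ranges roughly what the model will expect?
    print("feature range:", np.nanmin(features), "to", np.nanmax(features))
    print("target range: ", np.nanmin(targets), "to", np.nanmax(targets))

    # Visual inspection is the step I skipped in the story above; a single
    # plt.imshow(targets[0]) would have revealed the flipped Earth instantly.
```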

Getting the data right is the work. Everything else builds on that (sadly invisible) foundation.

2. Writing Papers Is Like Preparing a Sales Pitch

Some papers just read well. You might not be able to explain why, but they have a flow, a logic, a clarity that’s hard to ignore. That’s rarely by accident*. For me, it turned out that writing papers resembles crafting a very specific kind of sales pitch. You’re selling your idea, your approach, your insight to a skeptical audience.

This was a surprising realization for me.

When I started out, I assumed most papers looked and felt the same. All of them were “scientific writing” to me. But over time, as I read more papers, I began to notice the differences. It’s like that saying: to outsiders, all sheep look the same; to the shepherd, each one is distinct.

For example, compare these two papers that I came across recently:

Both use machine learning. But they speak to different audiences, with different levels of abstraction, different narrative styles, and even different motivations. The first assumes that technical novelty is central; the second focuses on relevance for applications. And then, obviously, there is the visual difference between the two.

The more papers you read, the more you realize: there’s not one way to write a “good” paper. There are many ways, and the way varies depending on the audience.

And unless you’re one of those very rare brilliant minds (think Terence Tao or someone of that caliber), you’ll likely need support to write well. Especially when tailoring a paper for a specific conference or journal. In practice, that means working closely with a senior ML person who understands the field.

Crafting a good paper is like preparing a sales pitch. You need to:

  • Frame the problem the right way
  • Understand your audience (i.e., the target venue)
  • Emphasize the parts that resonate most
  • And polish until the message sticks

3. Bug Fixing Is the Way Forward

Years ago, I had a romantic idea of ML as exploring elegant models, inventing new activation functions, or crafting clever loss functions. That may be true for a small set of researchers. But for me, progress often looked like: “Why doesn’t this code run?” Or, even more frustrating: “This code ran just a few seconds ago; why does it no longer run now?”

Let’s say your project requires using Vision Transformers on environmental satellite data (i.e., the model side of Section 1 above). You have two options:

  1. Implement everything from scratch (not recommended unless you’re feeling particularly adventurous, or need to do it for course credit).
  2. Find an existing implementation and adapt it.

In 99% of cases, option 2 is the obvious choice. But “just plug in your data” almost never works (see the sketch right after the list below). You’ll run into:

  • Different compute environments
  • Assumptions about input shapes
  • Preprocessing quirks (such as data normalization)
  • Hard-coded dependencies (of which I am guilty, too)
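To make this concrete, here is roughly what option 2 can look like for the ViT-on-weather-data example from Section 1. This is a hedged sketch that assumes the timm library; the model name, channel count, and shapes are illustrative, not the actual project setup:

```python
import timm
import torch

# Reuse a pretrained Vision Transformer for gridded environmental data
# instead of RGB images. Hypothetical numbers throughout.
model = timm.create_model(
    "vit_base_patch16_224",
    pretrained=True,
    in_chans=6,      # e.g. six weather variables instead of three RGB channels
    num_classes=1,   # a single regression output instead of ImageNet classes
)

# The surprises from the list above tend to hide right here: the implementation
# assumes 224x224 inputs, channels-first tensors, and ImageNet-style normalization.
x = torch.randn(8, 6, 224, 224)   # (batch, channels, height, width)
out = model(x)
print(out.shape)                   # torch.Size([8, 1])
```

On paper, that is a few lines of adaptation; in practice, each hidden assumption is its own debugging session.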

Quickly, your day can become an endless series of debugging, backtracking, testing edge cases, modifying dataloaders, checking GPU memory**, and rerunning scripts. Then, slowly, things begin to work. Eventually, your model trains.

But it’s not fast. It’s bug fixing your way forward.

4. I (Almost Certainly) Won’t Make That Breakthrough

You’ve definitely heard of them. The Transformer paper. GANs. Stable Diffusion. There’s a small part of me that thinks: maybe I’ll be the one to write the next transformative paper. And sure, someone has to. But statistically, it probably won’t be me. Or you, apologies. And that’s fine.

The works that cause a field to change rapidly are exceptional by definition. And that directly implies that most work, even good work, is barely recognized. Sometimes, I still hope that one of my projects will “blow up.” But, so far, most haven’t. Some didn’t even get published. But, hey, that’s not failure; it’s the baseline. If you expect every paper to be a home run, you are in the fast lane to disappointment.

Closing Thoughts

To me, machine learning often appears as a sleek, cutting-edge field: one where breakthroughs are just around the corner and where “doing” it means smart people making magic with GPUs and math. But in my day-to-day work, it’s rarely like that.

More often, my day-to-day work consists of:

  • Handling messy datasets
  • Debugging code pulled from GitHub
  • Redrafting papers, over and over
  • Not producing novel results, again

And that’s okay.


Footnotes

The previous article mentioned: https://towardsdatascience.com/lessons-learned-after-6-5-years-of-machine-learning/

* If you are interested, my favorite paper is this one: https://arxiv.org/abs/2103.09762. I read it one year ago on a Friday afternoon.

** To this day, I still get email notifications about how clearing GPU memory is impossible in TensorFlow. This 5-year-old GitHub issue gives the details.
