CrowdStrike data scientists are members of a team of cybersecurity researchers that recently released EMBER2024, an update to EMBER, the popular open source malware benchmark dataset originally released in 2018. 

The EMBER2024 dataset includes metadata, labels, and calculated features for over 3.2 million files from six different file formats. It provides data scientists conducting cybersecurity research with an extensive, modern dataset to support the training and evaluation of machine learning models for malware detection, including a collection of advanced malware that has demonstrated its ability to evade antivirus products. 

An academic paper, EMBER2024: A Benchmark Dataset for Holistic Evaluation of Malware Classifiers, details this new dataset and was presented at the SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-2025) in Toronto in August 2025. The paper also includes 14 benchmark models trained on different subsets of the data and varying classification tasks. 

There are many barriers to releasing public datasets in the cybersecurity field, including preserving customer privacy and hiding defender capabilities from attackers. Because of this, CrowdStrike researchers were excited for the opportunity to help update this very popular dataset. In this post, researchers can learn more about what this dataset provides and the new research enabled by it.

Original EMBER Dataset (2018): An Influential Resource for Malware Classification

The original EMBER dataset was a labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable (PE) files. Released in 2018, it was accompanied by an academic paper co-authored by a CrowdStrike data scientist who is part of the EMBER2024 team. The paper was subsequently updated the following year.

The goal of EMBER was to invigorate research in the field of malware classification, just as other benchmark datasets had done for image classification. It has helped to significantly advance malware detection in cybersecurity products, including the CrowdStrike Falcon® Platform. As of this writing, the paper has been cited in academic research over 700 times since its original publication in 2018, reflecting just how influential EMBER has been in the field of ML training for cybersecurity. Researchers have used EMBER to measure how quickly malware classifiers degrade over time, to explore adversarial machine learning attacks and defenses, and to build educational projects. Last year, CrowdStrike researchers augmented the data with tags and leaf similarity information to create EMBERSim, an effort to make it easier to build binary code similarity techniques using benign data.

EMBER2024 builds on the innovative and influential original, delivering a leap forward in capability.

EMBER2024: Updated to Help Train the Next Generation of Cybersecurity ML Researchers

With an ongoing industry shift to ML-based malware detection, the importance of innovative tools like EMBER has only increased.

A team of researchers from multiple organizations — including a member of the CrowdStrike Data Science team who co-created the original EMBER dataset — recently undertook the project of updating and improving EMBER. They had ambitious plans to expand and extend the original dataset in many different ways, ending up with more than 3.2 million files from six file formats. Figure 1 shows how many files of each type are included in EMBER2024. The dataset features seven different types of labels and tags that support training classifiers on seven common tasks, including malicious/benign detection, malware family classification, and malware behavior identification. Source code is included that allows researchers to replicate the feature calculation, model training, and file collection techniques used to construct the dataset. A supplemental release also includes the raw bytes and disassembly for 16.3 million functions from malicious files, identified and compiled with the FLARE team’s capa tool.

Figure 1. File type stats for the EMBER2024 dataset

| File Type | Train     | Test    | Challenge | Total     |
|-----------|-----------|---------|-----------|-----------|
| Win32     | 1,560,000 | 360,000 | 3,225     | 1,923,225 |
| Win64     | 520,000   | 120,000 | 814       | 640,814   |
| .NET      | 260,000   | 60,000  | 805       | 320,805   |
| APK       | 208,000   | 48,000  | 256       | 256,256   |
| PDF       | 52,000    | 12,000  | 805       | 64,805    |
| ELF       | 26,000    | 6,000   | 386       | 32,386    |

CrowdStrike’s contribution to the project was updating the original feature calculation code to make it easier to use. The EMBER2018 features require version 0.9.0 of the LIEF library: upgrading LIEF produces features that may not be equivalent to those calculated with 0.9.0, but LIEF 0.9.0 in turn requires Python 3.6, which is long out of date and unsupported. One of EMBER’s main use cases is teaching students how to work with machine learning in cybersecurity, and this outdated dependency was introducing them to the pain of Python packaging and versioning instead.

To solve this problem, the feature calculation code was updated to use the most recent version of the pefile library instead of LIEF. Because pefile is pure Python, a single pinned version of pefile is more likely to remain installable on future Python releases. Future versions of pefile are also unlikely to introduce breaking changes to the calculated features, so pinning a specific pefile version can be deferred as long as possible. While making this change, the repository also switched to more modern Python tooling (polars, uv, etc.).
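To give a feel for what format-agnostic feature calculation looks like, here is a minimal, dependency-free sketch of two features in the spirit of EMBER's raw-byte features (a normalized byte histogram and byte entropy). This is an illustrative assumption about the style of the features, not the project's actual code; the real implementation lives in the released source.

```python
import math
from collections import Counter

def byte_histogram(data: bytes) -> list[float]:
    """Normalized 256-bin histogram of byte values (illustrative sketch)."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits (0.0 to 8.0)."""
    return -sum(p * math.log2(p) for p in byte_histogram(data) if p > 0)

sample = bytes(range(256)) * 4           # uniform byte values
print(len(byte_histogram(sample)))       # 256 fixed-length bins
print(round(byte_entropy(sample), 2))    # 8.0, the maximum for bytes
```

Features like these are attractive for a benchmark because they are cheap to compute, identical across file formats, and produce fixed-length vectors suitable for model training.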

In addition to the dependency update, EMBER2024 features now include information about a file’s Rich header and Authenticode signature, as well as any warnings the pefile module emits while attempting to parse the PE file format. Figure 2 shows the categories of features that are calculated along with examples of all of the metadata included. A full description of all changes to the feature calculation can be found in the paper and the source code.
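Variable-length metadata like section names or imported function names cannot feed a fixed-input model directly; feature calculation of this kind typically maps them onto fixed-length vectors with the hashing trick. The sketch below shows that idea in dependency-free form; the bucket count and md5-based indexing are illustrative choices, not the project's actual implementation.

```python
import hashlib

def hash_features(tokens: list[str], dim: int = 16) -> list[int]:
    """Map a variable-length list of strings (e.g. section names) onto a
    fixed-length count vector via the hashing trick (illustrative sketch)."""
    vec = [0] * dim
    for tok in tokens:
        # md5 is used only as a stable, deterministic bucket index
        idx = int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim
        vec[idx] += 1
    return vec

sections = [".text", ".rdata", ".data", ".rsrc", ".text"]
print(sum(hash_features(sections)))  # 5: every token lands in some bucket
```

The output vector has the same length no matter how many sections or imports a file has, which is what makes such features usable across millions of heterogeneous samples.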

