The rise of cybercrime has made fraudulent webpage detection an essential task in keeping the internet safe. Risks such as the theft of private information, malware, and viruses are associated with everyday online activity across email, social media applications, and websites. Cybercriminals use these web threats, called malicious URLs, to lure users to web pages that appear real or legitimate.

This paper explores the development of a deep learning system based on a transformer architecture to detect malicious URLs, with the aim of improving on an existing method, Long Short-Term Memory (LSTM). Devlin et al. (2019) introduced BERT, a language representation model developed at Google on top of the transformer architecture proposed in 2017; it is capable of making more accurate predictions than recurrent neural network systems such as Long Short Term Memory (LSTM) and Gated Recurrent Units (GRU). In this project, I compared BERT's performance with LSTM as a text classification technique. Using a processed dataset containing over 600,000 URLs, a pre-trained model was fine-tuned, and the results were compared using performance metrics such as accuracy, precision, recall, and F1 score (Y. E. Seyyar et al., 2022). The LSTM algorithm achieved an accuracy of 91.36% and an F1 score of 0.90 (higher than BERT's) when classifying both unusual and common requests.

Keywords: Malicious URLs, Long Short Term Memory, phishing, benign, Bidirectional Encoder Representations from Transformers (BERT).

1.0 Introduction

As the Web has become more usable, the number of Internet users has grown steadily over the years. With so many digital devices connected to the internet, there has also been a rise in phishing threats delivered through websites, social media, emails, applications, and more. Morgan (2024) reported that more than $9.5 trillion was lost globally to cybercrime, including leaks of private information.

Innovative approaches have therefore been introduced over the years to automate the task of keeping internet usage and data safe. The Symantec 2016 Internet Security Threat Report (cited in Vanhoenshoven et al., 2016) shows that scammers are behind most cyber-attacks involving corporate data breaches on browsers and websites, as well as a large share of malware attempts that bait users through the Uniform Resource Locator (URL).

Structure of a URL (Image by author)

In recent years, blacklisting, reputation-based systems, and machine learning algorithms have been used by cybersecurity professionals to improve malware detection and make the web safer. Google's statistics reported that over 9,500 suspicious web pages are blacklisted and blocked per day. The existence of these malicious web pages represents a significant risk to the information security of web applications, particularly those that handle sensitive data (Sankaran et al., 2021). Because it is easy to implement, blacklisting has become the standard approach, and it also keeps the false-positive rate low. The problem, however, is that it is extremely difficult to keep an extensive list of malicious URLs up to date, given that new URLs are created every day. To circumvent filters and trick users, cybercriminals have devised ingenious methods, such as obfuscating a URL so that it appears legitimate.

Artificial Intelligence (AI) has seen significant advances and applications across many domains, including cybersecurity. One critical task in cybersecurity is detecting and preventing malicious URLs, which can otherwise lead to serious consequences such as data breaches, identity theft, and financial losses. Given the dynamic and ever-changing nature of cyber threats, detecting malicious URLs is a difficult task.

This project aims to develop a deep learning system for text classification called Malicious URL Detection using pre-trained Bidirectional Encoder Representations from Transformers (BERT). Can the BERT model outperform existing techniques in malicious URL detection? The expected outcome of this study is to demonstrate the effectiveness of the BERT model in detecting malicious URLs and compare its performance with recurrent neural network techniques such as LSTM. I used evaluation metrics such as accuracy, precision, recall, and F1-score to compare the models’ performance.

2.0. Background

The existing literature proposes a range of methods for detecting harmful URLs, including machine learning methods such as Random Forest, Multi-Layer Perceptron, and Support Vector Machines, and deep learning methods such as LSTM and convolutional neural networks (CNNs). These methods have drawbacks, however: they rely on traditional, hand-crafted features and struggle with complex data, which can lead to overfitting.

2.1. Related works

To reduce the time spent fetching page content or processing its text, Kan and Thi (2005) categorised websites based on their URLs alone. Each URL was parsed into several tokens, from which classification features were collected; these features also modelled token dependencies in sequential order. They concluded that the classification rate increased when high-quality URL segmentation was combined with feature extraction. This approach paved the way for further research on complex deep learning models for text classification. Treating the task as a binary text classification problem, Vanhoenshoven et al. (2016) developed models for detecting malicious URLs and evaluated classifiers including Naive Bayes, Support Vector Machines, and Multi-Layer Perceptrons. Subsequently, text embedding methods based on transformers have produced state-of-the-art results in NLP tasks. A similar model was devised by Maneriker et al. (2021), who pre-trained and fine-tuned an existing transformer architecture using only URL data. Their URL dataset included 1.29 million entries for training and 1.78 million entries for testing. The original BERT architecture relies on a masked language modelling pre-training objective, which is not needed in this report.

For the classification process, the BERT and RoBERTa algorithms were fine-tuned, and the results were evaluated and compared, leading to a model called URLTran (URL Transformers) that uses transformers to significantly improve malicious URL detection at very low false positive rates compared with other deep learning networks. With this method, the URLTran model achieved an 86.8% true positive rate (TPR) against the best baseline's 71.20%, a relative improvement of 21.9%. The method classifies a detected URL as either benign or malicious.

Additionally, an RNN-based model was proposed by Ren et al. (2019), in which extracted URLs were converted into character-level word vectors using pre-trained Word2Vec and then classified with a Bi-LSTM (bi-directional long short-term memory) network. After validation and evaluation, the model achieved 98% accuracy and an F1 score of 95.9%. This model outperformed almost all of the compared NLP techniques, but it processes a sequence one token at a time, which motivates an improved model using BERT that processes the entire input sequence at once. Although these models have demonstrated some improvement with big data, they are not without limitations. RNNs, for instance, can struggle with the sequential nature of long text data, while CNNs often fail to capture long-term dependencies (Alzubaidi et al., 2021). As the volume and complexity of textual data on the web continue to increase, current models may become inadequate.

3.0. Objectives

This project highlights the importance of a bidirectional pre-trained model for text classification. Radford et al. (2018) pre-trained unidirectional language models, while Peters et al. (2017) used a shallow concatenation of independently trained left-to-right and right-to-left language models (see also Devlin et al., 2019). In contrast, I used a pre-trained BERT model, which achieves state-of-the-art performance on a wide range of sentence-level and token-level tasks (Han et al., 2021), with the aim of outperforming RNN architectures and thereby reducing the need for those frameworks. In this case, the hyper-parameters of the LSTM algorithm were not extensively fine-tuned.

Specifically, this research paper emphasises:

  1. Developing an LSTM model and a pre-trained BERT model to detect (classify) whether a URL is unsafe or not.
  2. Comparing the results of the base model (LSTM) and the pre-trained BERT model using evaluation metrics such as recall, accuracy, F1 score, and precision, to determine whether the base model performs better.
  3. Exploiting the fact that BERT automatically learns latent representations of words and characters in context, so the only remaining task is to fine-tune the BERT model to improve on the baseline. Fine-tuning a pre-trained model is a computationally simpler alternative to building and training resource-intensive, computationally expensive architectures from scratch.
  4. Completing analysis, model development, and evaluation in about seven weeks, with the aim of achieving a significantly reduced training runtime with Google's BERT model.

4.0. Methodology

This section explains all the processes involved in implementing a deep learning system for detecting malicious URLs. Here, a transformer-based framework was developed from an NLP sequence perspective (Rahali and Akhloufi, 2021) and used to statistically analyse a public dataset.

Figure 4.0. Methodology Process (Adapted from Rahali and Akhloufi, 2021)

4.1. The dataset

The dataset used for this report was compiled and extracted from Kaggle (license info). It was prepared for the classification of webpages (URLs) as malicious or benign, and consists of URL entries used for training, validation, and testing.

Image by author (code visualisation)

To investigate the data using deep learning models, a large dataset of 651,191 URL entries was retrieved from PhishTank, PhishStorm, and a malware domain blacklist. It contains:

  • Benign URLs: safe web pages to browse. Exactly 428,103 entries were known to be secure.
  • Defacement URLs: webpages used by cybercriminals or hackers to clone real and secure websites. The dataset contains 96,457 such URLs.
  • Phishing URLs: links disguised as genuine to trick users into providing personal and sensitive information, which risks the loss of funds. 94,111 entries in the dataset were flagged as phishing URLs.
  • Malware URLs: links designed to manipulate users into downloading them as software or applications, thereby exploiting vulnerabilities. There are 32,520 malware webpage links in the dataset.
Table 4.1. The types of URLs and their fraction of the dataset (Image by author)
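As a quick check on the figures above, the class balance can be read directly from the raw file. The snippet below is a minimal sketch that assumes the common Kaggle layout, i.e. a file named malicious_phish.csv with columns named url and type; the actual file and column names may differ.

```python
# Minimal sketch: load the URL dataset and inspect the class balance.
# The file name and the 'url'/'type' column names are assumptions based on
# the usual Kaggle layout for this dataset.
import pandas as pd

df = pd.read_csv("malicious_phish.csv")   # assumed columns: 'url', 'type'
print(df.shape)                           # expected: (651191, 2)
print(df["type"].value_counts())          # benign, defacement, phishing, malware counts
```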

4.2. Feature extraction

For the URL dataset, feature extraction was used to transform the raw input data into a format supported by machine learning algorithms (Li et al., 2020). Feature extraction converts categorical data into numerical features, while feature selection chooses a subset of relevant features from the original dataset (Dash and Liu, 1997; Tang and Liu, 2014).
View data analysis and model development file here. The following steps were taken:

1. Combining the phishing, malware, and defacement URLs into a single malicious class for better selection. All URLs are then labelled either benign or malicious.

2. Converting the URL types from categorical variables into numerical values. This is a crucial step because deep learning model training requires numerical inputs. Benign and malicious URLs are encoded as 0 and 1, respectively, and stored in a new column called “Category”.

3. Computing a ‘url_len’ feature that records the length of each URL. The ‘process_tld’ function was then used to extract the top-level domain (TLD) of each URL.

4. Checking potential indicators for URL classification, such as the presence of specific characters [‘@’, ‘?’, ‘-‘, ‘=’, ‘.’, ‘#’, ‘%’, ‘+’, ‘$’, ‘!’, ‘*’, ‘,’, ‘//’]; a count for each character was added as a column to the dataset. The ‘abnormal_url’ function then performs a binary check to flag abnormalities in each URL.

5. Deriving further features from every entry, such as the number of letters and digits, the use of HTTPS, the use of a URL shortening service, and the presence of an IP address. These provide more information for training the model. A hedged sketch of these feature functions is shown below.
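The snippet below is a hedged sketch of the feature-extraction steps listed above. The function names (process_tld, abnormal_url) mirror those mentioned in the text, but the implementations shown are assumptions rather than the author's actual code; the original may, for example, rely on a dedicated TLD-parsing library.

```python
# Hedged sketch of the feature-extraction steps in section 4.2 (assumed implementations).
import re
from urllib.parse import urlparse

import pandas as pd

SPECIAL_CHARS = ['@', '?', '-', '=', '.', '#', '%', '+', '$', '!', '*', ',', '//']

def process_tld(url: str) -> str:
    """Rough top-level-domain extraction (the report's version may use a TLD library)."""
    netloc = urlparse(url if "://" in url else "http://" + url).netloc
    return netloc.split(".")[-1] if netloc else ""

def abnormal_url(url: str) -> int:
    """Binary flag: 1 if the parsed hostname does not appear in the URL string."""
    hostname = urlparse(url if "://" in url else "http://" + url).netloc
    return 0 if hostname and re.search(re.escape(hostname), url) else 1

def having_ip_address(url: str) -> int:
    """Binary flag: 1 if the URL uses a raw IPv4 address instead of a domain name."""
    return 1 if re.search(r"(\d{1,3}\.){3}\d{1,3}", url) else 0

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    # Steps 1-2: collapse phishing/defacement/malware into one malicious class.
    df["Category"] = (df["type"] != "benign").astype(int)      # benign = 0, malicious = 1
    # Step 3: URL length and top-level domain.
    df["url_len"] = df["url"].str.len()
    df["tld"] = df["url"].apply(process_tld)
    # Step 4: per-character counts of suspicious symbols, plus the abnormality flag.
    for ch in SPECIAL_CHARS:
        df[ch] = df["url"].str.count(re.escape(ch))
    df["abnormal_url"] = df["url"].apply(abnormal_url)
    # Step 5: further lexical features.
    df["https"] = df["url"].str.startswith("https").astype(int)
    df["digits"] = df["url"].str.count(r"\d")
    df["letters"] = df["url"].str.count(r"[A-Za-z]")
    df["ip_address"] = df["url"].apply(having_ip_address)
    return df
```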

4.3. Classification – model development and training

Using the pre-labelled data, the model learns the association between labels and text during training. This stage involves identifying the URL types in the dataset; as an NLP technique, it requires assigning texts (words) to sentences and queries (Minaee et al., 2021). A recurrent neural network architecture is then defined and optimised. The data was split into an 80% training set and a 20% testing set while preserving the class balance (a sketch of the split follows below). The texts were represented using word embeddings for both the LSTM and the pre-trained BERT models, and the dependent variable is the encoded URL type (“Category”), since this is a binary classification task.
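A minimal sketch of the 80/20 split described above, assuming the df DataFrame and the Category label column from section 4.2; stratifying on the label is one way to preserve the class balance when splitting.

```python
# Minimal sketch: stratified 80/20 split of the URL text and its binary label.
# `df` is assumed to come from the feature-extraction sketch in section 4.2.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["url"], df["Category"],
    test_size=0.2,            # 80% training, 20% testing
    random_state=42,
    stratify=df["Category"],  # keep the benign/malicious ratio in both sets
)
```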

4.3.1. Long short-term memory model

LSTM is a popular architecture for this task because of its ability to capture long-term dependencies, and it was combined here with word2vec-style embeddings (Mikolov et al., 2013), which can be trained on billions of words. After preprocessing and feature extraction, the data was set up for LSTM model training, testing, and validation. The sequence length and the number and size of the layers (input and output) were decided before training, and hyperparameters such as the number of epochs, the learning rate, and the batch size were tuned to achieve optimal performance.

The memory cell of a typical LSTM unit has three gates: an input gate, a forget gate, and an output gate (Feng et al., 2020). Unlike a feedforward neural network, the output of a neuron at one time step can be fed back to the same neuron as input at the next (Do et al., 2021). To prevent overfitting, dropout is applied between successive layers. The first layer is an embedding layer, which creates dense vector representations of the words in the input text. Only one LSTM layer was used in this architecture because of the long training time; a sketch of the resulting model is shown below.
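A hedged sketch of such an architecture in Keras follows: an embedding layer, a single LSTM layer, dropout, and a sigmoid output. The vocabulary size, sequence length, and layer widths are illustrative assumptions, not the tuned values reported in Table 5.0.

```python
# Hedged sketch of the single-layer LSTM classifier (assumed sizes, not the tuned values).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed tokenizer vocabulary size
MAX_LEN = 128        # assumed padded sequence length for each URL

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                         # padded token-ID sequences
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),  # dense vector representations
    layers.LSTM(64),                                        # single LSTM layer, as in the paper
    layers.Dropout(0.2),                                    # dropout to limit overfitting
    layers.Dense(1, activation="sigmoid"),                  # benign (0) vs malicious (1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```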

4.3.2. BERT model

Researchers proposed the BERT architecture for NLP tasks because it offers higher overall performance than RNNs and LSTMs. A pre-trained BERT model was used in this project to process text sequences and capture the semantic information of the input, which can help reduce training time and improve the accuracy of malicious URL detection. After pre-processing, the URL data was converted into sequences of tokens, which were then fed into the BERT model for processing (Chang et al., 2021). Because of the large number of data entries in this project, the BERT model was fine-tuned to learn the relevant features of each type of URL. Once trained, the model was used to classify URLs as malicious (phishing) or benign with improved accuracy and performance.

Google’s BERT model architecture (Song et al, 2021)

Figure 4.3.2 describes the processes involved in model training with the BERT algorithm. A tokenization phase is required to split the text into sub-word tokens: raw text is first separated into words, which are then converted to unique integer IDs via a lookup table. WordPiece tokenization (Song et al., 2020) was implemented using the BertTokenizer class. The tokenizer includes the BERT token-splitting algorithm and a WordPieceTokenizer (Rahali and Akhloufi, 2023); it accepts words (sentences) as input and outputs token IDs.
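A minimal tokenization sketch follows, assuming the Hugging Face transformers implementation of BertTokenizer and the bert-base-uncased vocabulary (the report does not state which implementation or checkpoint was used); the example URLs are purely illustrative.

```python
# Minimal sketch: WordPiece tokenization of URLs with an assumed BERT checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

urls = ["http://example.com/login", "http://secure-update.example.biz/verify?id=123"]
encoded = tokenizer(
    urls,
    padding="max_length",   # pad shorter sequences up to max_length
    truncation=True,        # truncate anything longer than the limit
    max_length=64,          # well under BERT's 512-token cap
    return_tensors="tf",    # use "pt" for PyTorch instead
)
print(encoded["input_ids"].shape)   # (2, 64) matrix of token IDs
print(tokenizer.tokenize(urls[0]))  # WordPiece sub-tokens for inspection
```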

5.0. Experiments

Specific hyper-parameters were used for BERT, while an LSTM model with a single hidden layer was tuned based on its performance on the validation set. Because the dataset is unbalanced, only 522,214 entries were used, consisting of 417,792 training entries and 104,422 testing entries (an 80/20 train-test split).

The parameters used for training are described below:

Table 5.0. Hyperparameters used in the Keras library for the LSTM and BERT models (Image by author)

5.1. LSTM (baseline)

The results indicated that a dropout rate of 0.2 and a batch size of 1024 achieved a training accuracy of 91.23% and a validation accuracy of 91.36%. Only one LSTM layer was used in the architecture because of the long training time (an average of 25.8 minutes); adding more layers to the network increases the computational cost and, in this case, reduced the model's overall performance.
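The sketch below illustrates how the URLs could be vectorised and the model trained with the reported batch size of 1024. The character-level TextVectorization layer and the epoch count are assumptions (the report mentions word embeddings and lists the tuned values in Table 5.0); model, X_train, and y_train carry over from the earlier sketches.

```python
# Hedged sketch: character-level vectorisation of the URLs and training with batch size 1024.
from tensorflow.keras import layers

MAX_LEN = 128  # must match the input length assumed in the model sketch

vectorizer = layers.TextVectorization(
    split="character",              # character-level tokens suit URL strings
    output_mode="int",
    output_sequence_length=MAX_LEN,
    max_tokens=20000,
)
vectorizer.adapt(X_train.values)          # learn the vocabulary from the training URLs
X_train_ids = vectorizer(X_train.values)  # integer token-ID matrix

history = model.fit(
    X_train_ids, y_train.values,
    validation_split=0.2,
    epochs=5,           # illustrative; the tuned value is in Table 5.0
    batch_size=1024,    # the larger batch shortened training time
)
```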

LSTM algorithm experiment setup (Do et al, 2021)

5.2. Pre-trained BERT model

The input data was tokenized, but a drawback emerged: the classifier head could not be initialised from the pre-trained checkpoint, so some layers started from randomly initialised weights. The model therefore requires further fine-tuning on the sequence-classification task before it can be used. Expectations were not fully met because of the computational complexity involved, although the approach is still expected to deliver excellent performance once properly fine-tuned.
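The snippet below illustrates the checkpoint behaviour described above: loading a pre-trained BERT encoder into a sequence-classification head leaves the new classifier layers uninitialised, which the library reports as a warning. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is named in the report.

```python
# Hedged sketch: attaching a binary classification head to a pre-trained BERT encoder.
from transformers import TFBertForSequenceClassification

bert_model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,   # benign vs malicious
)
# The library warns here that the classification layers were not found in the
# checkpoint and are newly initialised; they must be fine-tuned before use.
```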

6.0. Results

The experimental outcomes of the two models are evaluated using performance metrics. These metrics show how well each model performed on the test data and are presented to assess the proposed approach's effectiveness in detecting malicious web pages.

6.1. Performance Metrics

To evaluate the performance of the proposed models, a confusion matrix was used to derive the evaluation measures below.

Table 6.1 Binary classification of actual and predicted outcomes
  • True Positive (TP): samples that are accurately predicted malicious (phishing) (Amanullah et al., 2020).
  • True Negative (TN): samples that are accurately predicted as benign URLs.
  • False Positive (FP): samples that are incorrectly predicted as phishing URLs.
  • False Negative (FN): instances that are incorrectly predicted as benign URLs.
    Accuracy = (TP + TN) / (TP + TN + FP + FN)
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)
    F1-score = (2 × Precision × Recall) / (Precision + Recall)
Table 6.2. Classification report for the developed models (Image by author)
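A minimal sketch of how these metrics can be computed with scikit-learn; y_test and y_pred are assumed to hold the true and predicted 0/1 labels for the held-out URLs.

```python
# Minimal sketch: confusion matrix, accuracy, and per-class precision/recall/F1.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))   # counts of TN, FP, FN, TP
print(accuracy_score(y_test, y_pred))     # (TP + TN) / (TP + TN + FP + FN)
print(classification_report(y_test, y_pred, target_names=["benign", "malicious"]))
```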

The LSTM model achieved an accuracy of 91.36% and a loss of 0.25, while the pre-trained BERT model achieved a lower accuracy (75.9%) than expected as a result of hardware malfunction.

6.2. Validation

The LSTM performed well: based on its validation accuracy, it detects malicious URLs roughly nine times out of ten.

Accuracy validation and loss validation (LSTM). Image by author

However, the pre-trained BERT model could not meet the higher expectations, owing to the imbalance and size of the dataset.

Confusion matrix for LSTM and BERT models (Image by author)

7.0. Conclusion

Overall, LSTM models can be a powerful tool for modelling sequential data and making predictions based on temporal dependencies. However, it is important to carefully consider the nature of the data and the problem at hand before choosing an LSTM model, and to set up and tune the model properly to achieve the best results. Because of the large dataset, increasing the batch size to 1024 shortened the training time and improved the model's validation accuracy; this could be a result of not tokenizing the inputs during training and testing. BERT's maximum sequence length is 512 tokens, which can be inconvenient for some applications: sequences shorter than the limit must be padded with extra tokens, while longer ones must be truncated (Rahali and Akhloufi, 2021). In addition, to understand words and sentences better, BERT needs modified embeddings that represent context at the character level. Although these capabilities performed well with complex word embeddings, they can also lead to longer training times on larger datasets. Further research is therefore required to detect patterns more effectively during malicious URL detection.

References

  • Alzubaidi, L., Zhang, J., Humaidi, A. J., Duan, Y., Santamaría, J., Fadhel, M. A., & Farhan, L. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1), 1-74. https://doi.org/10.1186/s40537-021-00444-8
  • Amanullah, M. A., Habeeb, R. A. A., Nasaruddin, F. H., Gani, A., Ahmed, E., Nainar, A. S. M., Akim, N. M., & Imran, M. (2020). Deep learning and big data technologies for IoT security. Computer Communications, 151, 495-517. https://doi.org/10.1016/j.comcom.2020.01.016
  • Chang, W., Du, F., and Wang, Y. (2021). “Research on Malicious URL Detection Technology Based on BERT Model,” IEEE 9th International Conference on Information, Communication and Networks (ICICN), Xi’an, China, pp. 340-345, doi: 10.1109/ICICN52636.2021.9673860.
  • Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1(1-4), 131-156. https://doi.org/10.1016/S1088-467X(97)00008-5
  • Do, N.Q., Selamat, A., Krejcar, O., Yokoi, T. and Fujita, H. (2021). Phishing webpage classification via deep learning-based algorithms: an empirical study. Applied Sciences, 11(19), p.9210.
  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Feng, J., Zou, L., Ye, O., & Han, J. (2020). Web2Vec: Phishing webpage detection method based on multidimensional features driven by deep learning. IEEE Access, 8, 221214-221224. https://doi.org/10.1109/ACCESS.2020.3043188
  • Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A., Zhang, L., Han, W., Huang, M., Jin, Q., Lan, Y., Liu, Y., Liu, Z., Lu, Z., Qiu, X., Song, R., . . . Zhu, J. (2021). Pre-trained models: Past, present and future. AI Open, 2, 225- 250. https://doi.org/10.1016/j.aiopen.2021.08.002
  • Morgan, S. (2024). 2024 Cybersecurity Almanac: 100 Facts, Figures, Predictions and Statistics. Cybersecurity Ventures. https://cybersecurityventures.com/2024-cybersecurity-almanac/
  • Kan, M.-Y., & Thi, H. (2005). Fast webpage classification using URL features. 325-326. https://doi.org/10.1145/1099554.1099649
  • Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. S., & He, L. (2020). A survey on text classification: From shallow to deep learning. arXiv preprint arXiv:2008.00364.
  • Maneriker, P., Stokes, J. W., Lazo, E. G., Carutasu, D., Tajaddodianfar, F., & Gururajan, A. (2021). URLTran: Improving phishing URL detection using transformers. arXiv preprint arXiv:2106.05256.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
  • Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M. and Gao, J. (2021). Deep Learning–based Text Classification. ACM Computing Surveys, 54(3), pp.1–40. doi:https://doi.org/10.1145/3439726.
  • Peters, M.E., Ammar, W., Bhagavatula, C. and Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. arXiv:1705.00108 [cs]. [online] Available at: https://arxiv.org/abs/1705.00108.
  • Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training. [online] Available at: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf.
  • Rahali, A. & Akhloufi, M. A. (2021) MalBERT: Using transformers for cybersecurity and malicious software detection. arXiv Preprint arXiv:2103.03806
  • Ren, F., Jiang, Z., & Liu, J. (2019). A Bi-Directional Lstm Model with Attention for Malicious URL Detection. 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 1, 300-305.
  • Sankaran, M., Mathiyazhagan, S., ., P., Dharmaraj, M. (2021). ‘Detection Of Malicious Urls Using Machine Learning Techniques’, Int. J. of Aquatic Science, 12(3), pp. 1980- 1989
  • Song, X., Salcianu, A., Song, Y., Dopson, D., & Zhou, D. (2020). Fast WordPiece tokenization. arXiv preprint arXiv:2012.15524.
  • Tang, J., Alelyani, S., & Liu, H. (2014). Feature selection for classification: A review. In Data Classification: Algorithms and Applications, p. 37.
  • Vanhoenshoven, F., Napoles, G., Falcon, R., Vanhoof, K. & Koppen, M. (2016) Detecting malicious URLs using machine learning techniques. IEEE.
  • Seyyar, Y. E., Yavuz, A. G., & Ünver, H. M. (2022). An attack detection framework based on BERT and deep learning. IEEE Access, 10, 68633-68644. https://doi.org/10.1109/ACCESS.2022.3185748