A Filter-Based Feature Selection Framework to Detect Phishing URLs Using Stacking Ensemble Machine Learning

  • Nimra Bari
  • Tahir Saleem
  • Munam Shah
  • Abdulmohsen Algarni
  • Asma Patel
  • Insaf Ullah

Research output: Contribution to journal › Article › peer-review


Abstract

Phishing is an online attack designed to obtain sensitive information such as credit card and bank account numbers, passwords, and usernames. Several anti-phishing solutions exist, such as heuristic detection, visual similarity detection, blacklists and whitelists, and machine learning (ML). However, phishing attempts remain a problem, and establishing an effective anti-phishing strategy is a work in progress. Furthermore, while most anti-phishing solutions achieve high accuracy on a given dataset, they suffer from an increased number of false positives and are ineffective against zero-hour attacks. A high False Positive Rate (FPR) is costly, because phishing sites that are considered genuine can cause people who visit them to lose a lot of money. Feature selection is critical when developing phishing detection strategies. Good feature selection helps improve accuracy, whereas duplicate features increase noise in the dataset and reduce the accuracy of the algorithm. Therefore, a combination of filter-based feature selection methods is proposed to detect phishing attacks, including constant feature removal, duplicate feature removal, quasi-constant feature removal, correlated feature removal, mutual information extraction, and Analysis of Variance (ANOVA) testing. The technique has been tested with different machine learning classifiers, namely Random Forest, Artificial Neural Network (ANN), AdaBoost, Extreme Gradient Boosting (XGBoost), Logistic Regression, Decision Trees, Gradient Boosting Classifiers, and Support Vector Machine (SVM), and with two types of ensemble models, stacking and majority voting, to achieve a low false positive rate. Stacked ensemble classifiers (gradient boosting, random forest, support vector machine) achieve 1.31% FPR and 98.17% accuracy on Dataset 1, 3.47% FPR and 96.47% accuracy on Dataset 2, and 2.81% FPR and 97.61% accuracy on Dataset 3.
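The first four filter steps named in the abstract (constant, quasi-constant, duplicate, and correlated feature removal) and the FPR metric can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the thresholds, the toy feature columns, and the choice of label 1 as the positive class are all assumptions made for illustration, and the mutual information and ANOVA stages are not shown.

```python
from collections import Counter
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / sqrt(var_a * var_b)


def filter_features(columns, quasi_threshold=0.99, corr_threshold=0.9):
    """Return the names of the columns that survive the filter steps.

    columns: dict mapping feature name -> list of numeric values.
    Both thresholds are illustrative assumptions, not the paper's settings.
    """
    kept = []
    for name, values in columns.items():
        # Constant and quasi-constant removal: drop a column whose most
        # common value covers at least quasi_threshold of the rows
        # (a constant column has a share of exactly 1.0).
        top_share = Counter(values).most_common(1)[0][1] / len(values)
        if top_share >= quasi_threshold:
            continue
        # Duplicate removal: drop a column identical to one already kept.
        if any(values == columns[k] for k in kept):
            continue
        # Correlated removal: keep only the first column of a correlated group.
        if any(abs(pearson(values, columns[k])) > corr_threshold for k in kept):
            continue
        kept.append(name)
    return kept


def false_positive_rate(y_true, y_pred, positive=1):
    """FPR = FP / (FP + TN), computed over the truly negative samples."""
    negatives = [p for t, p in zip(y_true, y_pred) if t != positive]
    if not negatives:
        return 0.0
    return sum(1 for p in negatives if p == positive) / len(negatives)


# Hypothetical toy features; none of these names come from the datasets.
toy = {
    "has_ip": [0, 1, 0, 1, 1, 0, 1, 0],
    "always_zero": [0, 0, 0, 0, 0, 0, 0, 0],      # constant
    "has_ip_copy": [0, 1, 0, 1, 1, 0, 1, 0],      # duplicate of has_ip
    "rare_flag": [0, 0, 0, 0, 0, 0, 0, 1],        # quasi-constant (7 of 8 rows)
    "url_length": [10, 52, 12, 60, 55, 14, 58, 11],  # tracks has_ip closely
    "num_dots": [1, 2, 3, 1, 2, 3, 1, 2],
}
selected = filter_features(toy, quasi_threshold=0.85)
fpr = false_positive_rate([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0])
```

On this toy input only `has_ip` and `num_dots` survive, and the FPR is 0.25 (one of four negatives is flagged as positive). In practice the later stages would usually come from a library; scikit-learn, for example, provides `mutual_info_classif`, `f_classif`, and `StackingClassifier`, though the abstract does not say which tooling was used.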
Original language: English
Pages (from-to): 1167-1187
Number of pages: 21
Journal: Computer Modeling in Engineering and Sciences
Volume: 145
Issue number: 1
DOIs
Publication status: Published - 30 Oct 2025

Bibliographical note

Copyright © 2025 The Author(s). Published by Tech Science Press.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Data Access Statement

The datasets used in this study are publicly available from the following sources:

• Dataset 1: Sourced from Mohammad et al. (2012) and available at the UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Phishing+Websites (accessed on 21 August 2025).

• Dataset 2: Sourced from Buber (2019). The data was collected from PhishTank (https://www.phishtank.com/, accessed on 21 August 2025) and OpenPhish (https://openphish.com/, accessed on 21 August 2025), which are publicly accessible platforms.

• Dataset 3: Sourced from Hannousse (2021). This dataset is a benchmark for machine learning-based phishing detection and is available for research purposes. DOI: 10.1016/j.engappai.2021.104347 (accessed on 21 August 2025).

Funding

This research was financially supported by the Deanship of Scientific Research and Graduate Studies at King Khalid University under research grant number (R.G.P.2/21/46) and in part by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, under Grant KFU253116.
