Using Feature Selection and Classification Scheme
for Automating Phishing Email Detection

Isredza Rahmi A HAMID¹, Jemal ABAWAJY¹, Tai-hoon KIM²
¹School of Information Technology, Deakin University,
Waurn Ponds, VIC., 3217, Australia,
iraha@deakin.edu.au, Jemal@deakin.edu.au
²School of Computing and Information Science, University of Tasmania,
Centenary Building, room 350, Private Bag 87 Hobart TAS 7001
(Corresponding Author)
taihoonn@empas.com

Abstract: Email has become a critical communication medium for most organizations. Unfortunately, email-borne attacks on computer networks are causing considerable economic losses worldwide. Existing phishing email blocking appliances have little effect in weeding out the vast majority of phishing emails. At the same time, online criminals are becoming more dangerous and sophisticated. Phishing emails are more active than ever before and put average computer users and organizations at risk of significant data, brand and financial loss. In this paper, we propose a hybrid feature selection approach that combines content-based and behaviour-based features. The approach mines attacker behaviour from the email header. On a publicly available test corpus, our hybrid feature selection achieves an accuracy of 94%.

Keywords: Internet Security, Behavior-based, Feature Selection, Phishing.

CITE THIS PAPER AS:
Isredza Rahmi A HAMID, Jemal ABAWAJY, Tai-hoon KIM, Using Feature Selection and Classification Scheme for Automating Phishing Email Detection, Studies in Informatics and Control, ISSN 1220-1766, vol. 22 (1), pp. 61-70, 2013. https://doi.org/10.24846/v22i1y201307

Introduction

Phishing emails have become a common problem in recent years. According to Islam and Abawajy [22], “phishing attacks continue to pose serious risks for consumers and businesses as well as threatening global security and the economy”. This calls for the development of effective countermeasures against email-borne phishing attacks in order to safeguard critical infrastructures such as banking.

Phishing is a type of semantic attack in which victims are sent emails that deceive them into providing sensitive information, such as account numbers, passwords, or other personal details, to the phisher. Normally, phishers send a large number of fake emails pretending to be from a legitimate and well-known business organization. Generally, the email content urges the victim to update personal information to avoid losing access rights to services provided by the organization.

In reality, these emails lure the user to a bogus web site set up by the attacker. According to the Anti-Phishing Working Group phishing trend report, the number of phishing attacks through email increased from about 170,000 in 2005 to about 440,000 in 2009 [2]. According to a Gartner survey, approximately 109 million U.S. adults have received phishing emails, with the average loss per victim estimated at $1,244.

Phishing email detection has drawn a lot of attention from researchers. Several good anti-phishing techniques, such as content-based [6], [11], [16] and behaviour-based [7], [5], [13] approaches, have been developed to address the phishing problem. However, phishing attacks continue to be a serious problem. This is because phishing has become more and more complicated, and phishers continually change the way they perpetrate attacks in order to defeat anti-phishing techniques. Moreover, most phishing emails are nearly identical to normal emails. Therefore, existing anti-phishing techniques such as the content-based approach are not able to curb phishing attacks. Furthermore, most existing email filtering approaches are static and can easily be defeated by modifying the contents of emails and link strings.

In this paper, we present an approach to detecting phishing email using hybrid features that combine content-based and behaviour-based approaches. The main objective of this paper is to identify behaviour-based features in phishing emails which cannot be disguised by an attacker. By analyzing the attacker’s pattern, we observe that email which tends to come from more than one domain could indicate abnormal activity.

A domain server that handles more than one type of email domain could likewise indicate abnormal email. This information is obtained by analyzing the email header, which is usually neglected by other approaches. We analyze the Message-ID tag and the sender email address in order to mine the attacker’s behaviour. This study applies the proposed hybrid feature selection to a data set of 6923 emails, drawn from the Nazario [14] phishing email collection (2004 to 2007) and, as ham emails, the SpamAssassin [17] corpus.
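The header analysis described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `domain_of` and `header_domains`, the exact-match comparison, and the sample addresses are assumptions made for illustration only. The sketch uses Python's standard `email` package to read the sender address and the Message-ID tag, and reports whether their domains differ.

```python
from email import message_from_string
from email.utils import parseaddr


def domain_of(value: str) -> str:
    """Return the lower-cased text after the last '@', or '' if there is none."""
    if "@" not in value:
        return ""
    domain = value.rpartition("@")[2]
    # Message-ID values are wrapped in angle brackets, e.g. <id@host>.
    return domain.strip("<> \t\r\n").lower()


def header_domains(raw_email: str) -> dict:
    """Extract the sender domain and the Message-ID domain from a raw email.

    Hypothetical helper: the comparison below is a simple exact match,
    which is an assumption and not the feature definition of the paper.
    """
    msg = message_from_string(raw_email)
    _, sender = parseaddr(msg.get("From", ""))
    sender_domain = domain_of(sender)
    message_id_domain = domain_of(msg.get("Message-ID", ""))
    differ = bool(
        sender_domain
        and message_id_domain
        and sender_domain != message_id_domain
    )
    return {
        "sender_domain": sender_domain,
        "message_id_domain": message_id_domain,
        "domains_differ": differ,
    }


# Made-up sample message for illustration.
sample = (
    "From: Bank Support <support@bank.example>\n"
    "Message-ID: <123@mail.other.example>\n"
    "Subject: Please update your account\n"
    "\n"
    "Body text.\n"
)
result = header_domains(sample)
```

An exact-match comparison also flags legitimate mail whose Message-ID is generated by a subdomain of the sender domain, so a practical feature would likely need a looser comparison, for example on the registered domain only.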

The results show that the proposed hybrid feature selection approach is effective in identifying and classifying phishing email.

The remainder of this paper is organized as follows. Section 2 describes related research on phishing email detection approaches proposed in recent years. Section 3 examines the phishing email feature selection approach, including the data and feature set used in the experiment and the hybrid feature selection algorithm. Section 4 presents the performance analysis results and the effectiveness of the proposed hybrid feature selection. Section 5 concludes the work and discusses directions for future work.

REFERENCES

1. BERGHOLZ, A., G. PAASS, F. REICHARTZ, S. STROBEL, J. H. CHUNG, Improved Phishing Detection using Model-based Features, In Proc. of the International Conference on E-mail and Anti-Spam, 2008.
2. The Anti-Phishing Working Group. Available: http://www.apwg.org/
3. LIU, C., S. STAMM, Fighting Unicode-Obfuscated Spam, Proceedings of E-Crime Research, ACM, New York, USA, 2007, pp. 45-59.
4. TOOLAN, F., J. CARTHY, Phishing Detection using Classifier Ensemble, In eCrime Researchers Summit, 2009, pp. 1-9.
5. TOOLAN, F., J. CARTHY, Feature Selection for Spam and Phishing Detection, In eCrime Researchers Summit (eCrime), 2010, pp. 1-12.
6. FETTE, I., N. SADEH, A. TOMASIC, Learning to Detect Phishing Emails, Proceedings of the 16th International Conference on World Wide Web (WWW ’07), ACM, New York, USA, 2007, pp. 649-656.
7. ZHANG, J., Z. DU, W. LIU, A Behaviour-based Detection Approach to Mass-Mailing Host, In Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, vol. 4, 2007, pp. 2140-2144.
8. MA, L., B. OFOGHANI, P. WATTERS, S. BROWN, Detecting Phishing Emails using Hybrid Features, Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing, 2009, pp. 493-497.
9. ZHOU, L., Y. SHI, D. ZHANG, A Statistical Language Modelling Approach to Online Deception Detection, IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 8, 2007, pp. 1077-1081.
10. BAZARGANIGILANI, M., Phishing E-Mail Detection using Ontology Concept and Naïve Bayes Algorithm, International Journal of Research and Reviews in Computer Science (IJRRCS), vol. 2, no. 2, 2011, pp. 249-252.
11. CHANDRASEKARAN, M., K. NARAYANAN, S. UPADHYAYA, Phishing Email Detection Based on Structural Properties, Proceedings of the Cyber Security Conference, 2006.
12. CHANDRASEKARAN, M., V. SHANKARANARAYANAN, S. UPADHYAYA, CUSP: Customizable and Usable Spam Filters for Detecting Phishing Emails, Proceedings of the 3rd Annual Symposium on Information Assurance (ASIA ’08), Albany, NY, 2008, pp. 10-17.
13. SYED, N. A., N. FEAMSTER, A. GRAY, Learning To Predict Bad Behaviour, NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security, 2008.
14. NAZARIO, J., Phishing Corpus, Available: http://www.monkey.org/jose/wiki/doku.php?id=phishingcorpus.
15. BASNET, R. B., A. H. SUNG, Classifying Phishing Emails using Confidence-Weighted Linear Classifiers, International Conference on Information Security and Artificial Intelligence (ISAI 2010), 2010, pp. 108-112.
16. ABU-NIMEH, S., D. NAPPA, X. WANG, S. NAIR, Comparison of Machine Learning Techniques for Phishing Detection, Proceedings of the APWG eCrime Researchers Summit, Pittsburgh, ACM, New York, USA, 2007, pp. 60-69.
17. SpamAssassin public corpus, Available: http://spamassassin.apache.org/publiccorpus.
18. GANSTERER, W. N., D. POLZ, E-Mail Classification for Phishing Defence, in LNCS Advances, vol. 5478, 2009, pp. 449-460.
19. A HAMID, I. R., J. H. ABAWAJY, Hybrid Feature Selection for Phishing Email Detection, The 11th International Conference on Algorithms and Architectures for Parallel Processing, Springer, Berlin, Germany, 2011, pp. 266-275.
20. PENG, Y., G. KOU, D. ERGU, W. WU, Y. SHI, An Integrated Feature Selection and Classification Scheme, Studies in Informatics and Control, ISSN 1220-1766, vol. 21 (3), 2012, pp. 241-248.
21. FERCHICHI, S. E., K. LAABIDI, S. ZIDI, Genetic Algorithm and Tabu Search for Feature Selection, Studies in Informatics and Control, ISSN 1220-1766, vol. 18 (2), 2009, pp. 181-187.
22. ISLAM, R., J. H. ABAWAJY, A Multi-tier Phishing Detection and Filtering Approach, Journal of Network and Computer Applications, vol. 36 (1), 2013, pp. 324-336.