Android Malware Detection: Building Useful Representations

Luiza Sayfullina, Emil Eirola, Dmitri Komashinskiy, Paolo Palumbo, Juha Karhunen

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Scientific, peer-review



The problem of proactively detecting Android malware has proven to be a challenging one. The challenges stem from a variety of issues, but recent literature has shown that the task is hard to solve with high accuracy when only a restricted, fixed set of features, such as permissions, is used. The opposite approach of including all available features is also problematic, as it causes the feature space to grow beyond a reasonable size. In this paper we focus on finding an efficient way to select a representative feature space that preserves its discriminative power on unseen data. We go beyond traditional approaches like Principal Component Analysis, which is too heavy for large-scale problems with millions of features. In particular, we show that many feature groups that can be extracted from Android application packages, such as features extracted from the manifest file or strings extracted from the Dalvik Executable (DEX), should be filtered and used in classification separately. Our proposed dimensionality reduction scheme is applied to each group separately and consists of raw string preprocessing, feature selection via log-odds, and finally random projections. With the size of the feature space growing exponentially as a function of the training set's size, our approach decreases the size of the feature space by several orders of magnitude, which in turn makes accurate classification possible in a real-world scenario. After reducing the dimensionality, we use the feature groups in a lightweight ensemble of logistic classifiers. We evaluated the proposed classification scheme on real malware data provided by an antivirus vendor and achieved a state-of-the-art 88.24% true positive rate and a reasonably low 0.04% false positive rate with a significantly compressed feature space on a balanced test set of 10,000 samples.
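
The per-group reduction described in the abstract (log-odds feature selection followed by a random projection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive smoothing constant `alpha`, the use of the absolute log-odds ratio as a ranking score, and the dense Gaussian projection matrix are all assumptions made for illustration. Raw string preprocessing and the ensemble of logistic classifiers are left out.

```python
import math
import random


def log_odds_scores(rows, labels, alpha=1.0):
    """Absolute smoothed log-odds ratio of each binary feature (malware vs. benign).

    rows: list of equal-length 0/1 lists; labels: 1 = malware, 0 = benign.
    alpha is an assumed additive-smoothing constant that keeps the odds finite.
    """
    n_mal = sum(labels)
    n_ben = len(labels) - n_mal
    scores = []
    for j in range(len(rows[0])):
        mal_hits = sum(row[j] for row, y in zip(rows, labels) if y == 1)
        ben_hits = sum(row[j] for row, y in zip(rows, labels) if y == 0)
        p = (mal_hits + alpha) / (n_mal + 2 * alpha)  # P(feature | malware)
        q = (ben_hits + alpha) / (n_ben + 2 * alpha)  # P(feature | benign)
        scores.append(abs(math.log(p / (1 - p)) - math.log(q / (1 - q))))
    return scores


def select_top_k(scores, k):
    """Indices of the k highest-scoring features, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def random_projection(rows, out_dim, seed=0):
    """Project rows to out_dim dimensions using a dense Gaussian random matrix."""
    rng = random.Random(seed)
    in_dim = len(rows[0])
    scale = 1.0 / math.sqrt(out_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
              for _ in range(in_dim)]
    return [[sum(row[i] * matrix[i][d] for i in range(in_dim))
             for d in range(out_dim)]
            for row in rows]


def reduce_group(rows, labels, k, out_dim, seed=0):
    """Reduce one feature group: keep the top-k features, then project them."""
    keep = select_top_k(log_odds_scores(rows, labels), k)
    filtered = [[row[j] for j in keep] for row in rows]
    return keep, random_projection(filtered, out_dim, seed)
```

Under these assumptions, `reduce_group` would be called once per feature group (for example, once for manifest features and once for DEX strings), and each reduced group would then feed its own classifier. At the scale the abstract mentions, a sparse matrix representation would be needed in place of the nested lists used here.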
Original language: English
Title of host publication: 2016 15th IEEE International Conference on Machine Learning and Applications, ICMLA 2016, Proceedings
Subtitle of host publication: Anaheim, California, USA, December 18-20, 2016.
ISBN (Print): 978-1-5090-6166-2
Publication status: Published - 2017
MoE publication type: A4 Article in a conference publication
Event: IEEE International Conference on Machine Learning and Applications - Anaheim, United States
Duration: 18 Dec 2016 to 20 Dec 2016
Conference number: 15


Conference: IEEE International Conference on Machine Learning and Applications
Abbreviated title: ICMLA
Country: United States


  • Android
  • Dimensionality reduction
  • Feature selection
  • Logistic regression
  • Malware classification
  • Random projection
