EMA: Explainable Malware Analysis

EMA is a Recovery and Resilience Plan project aimed at explainable malware analysis. On this project, KInIT collaborates with two universities in Slovakia – Comenius University and the Slovak University of Technology in Bratislava.

With the ever-increasing number of new malware samples appearing daily, artificial intelligence, and machine learning methods in particular, has gained prominence in malware analysis. These methods can process large volumes of data and achieve very high success rates in tasks such as malware detection and classification. Their main disadvantage, however, lies in their opaque nature: they cannot provide human-understandable justifications for their outputs. This is addressed by explainable AI (XAI), a branch of research that provides a number of methods to overcome this problem. The project’s overall objective is to enhance the applicability of XAI methodology in malware analysis.

The project sets out to complete multiple tasks to accomplish the main objective. First, our partners at STU will build a malware data-gathering pipeline that can use various malware analysis tools to collect rich malware features, mainly arising from dynamic malware analysis. Afterward, an ontological representation will be created to support the later use of XAI methods. Alongside the creation of ontologies, data breakdown strategies will be used to reduce the dataset’s size. XAI methods will then be applied to the created dataset(s) to generate explanations for the task of malware detection. Quantitative and qualitative measures will then be used to evaluate the usefulness of the generated explanations in practice.

KInIT contributes to the EMA project in the following areas:

  • Data breakdown – every day, tens of thousands to hundreds of thousands of new malware samples are created. To cope with this volume of samples, data breakdown strategies have to be employed. Our main objective is to break down the data using clustering approaches (a minimal sketch follows this list).
  • Contrastive learning to improve clustering quality – the better the quality of the created clusters, the more trustworthy, useful, and reliable the solution is for malware experts. We aim to improve on the clustering results achievable with unsupervised representations by applying contrastive and self-supervised learning approaches (see the loss sketch after this list).
  • Cluster explanation – knowing what separates one cluster from all other clusters supports the decisions of malware experts and assists them in generating useful malware signatures (unique identifiers for a specific malware strain/family); a possible approach is sketched below.
  • Measuring the quality of explanations – when presenting generated explanations to malware experts, it is worthwhile to have a methodology for assessing the explanations quantitatively (one candidate metric is sketched below).
  • Human-centered evaluation of explanations – once the quality of explanations has been evaluated and improved through quantitative analysis, a qualitative evaluation conducted together with malware experts is the next step in assessing the usefulness of the generated explanations.
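
To make the data breakdown idea concrete, here is a minimal sketch of clustering-based dataset reduction. It assumes numeric malware feature vectors and uses scikit-learn's MiniBatchKMeans; the function name, cluster count, and number of representatives per cluster are illustrative choices, not the project's actual pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

def break_down(features: np.ndarray, n_clusters: int = 50, per_cluster: int = 20):
    """Cluster malware feature vectors and keep a few representatives per cluster.

    `features`, `n_clusters` and `per_cluster` are illustrative; real values
    depend on the gathered malware dataset.
    """
    X = StandardScaler().fit_transform(features)
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0).fit(X)
    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # keep the samples closest to the cluster centre as cluster representatives
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        kept.extend(members[np.argsort(dists)[:per_cluster]].tolist())
    return km, sorted(kept)  # fitted model + indices of the reduced dataset
```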
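
For the contrastive learning direction, one common self-supervised objective is the SimCLR-style NT-Xent loss, which pulls two augmented views of the same sample together and pushes all other samples apart. The sketch below is a plain NumPy version for illustration only; how views are produced from malware features is left open, and the temperature value is an assumption.

```python
import numpy as np

def nt_xent_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.5) -> float:
    """SimCLR-style NT-Xent loss for two views (N x D each) of the same batch."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise embeddings
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # the positive for row i is row i + n, and vice versa
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), pos].mean())
```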
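
One simple way to explain what separates a cluster from all others is to fit a sparse one-vs-rest surrogate model and read off its strongest features. The sketch below uses L1-regularised logistic regression from scikit-learn; the surrogate choice and the regularisation strength are illustrative assumptions, not the project's chosen method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def explain_cluster(X: np.ndarray, labels: np.ndarray, cluster_id: int,
                    feature_names: list, top_k: int = 5):
    """Return the features whose weights best separate `cluster_id` from the rest."""
    y = (labels == cluster_id).astype(int)              # one-vs-rest target
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    weights = clf.coef_.ravel()
    top = np.argsort(np.abs(weights))[::-1][:top_k]     # strongest weights first
    return [(feature_names[i], float(weights[i])) for i in top]
```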
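
A common quantitative proxy for explanation quality is deletion fidelity: if the features an explanation ranks highest are removed, a faithful explanation should cause a large drop in the detector's malware score. The sketch below assumes a scikit-learn-style classifier exposing predict_proba and a ranked feature list produced by some XAI method; both are assumptions for illustration.

```python
import numpy as np

def deletion_fidelity(model, x: np.ndarray, ranked_features: list, k: int = 10,
                      baseline: float = 0.0) -> float:
    """Drop in the malware score after erasing the k top-ranked features.

    `model` is assumed to expose predict_proba (scikit-learn convention) and
    `ranked_features` is assumed to come from some XAI method.
    """
    original = float(model.predict_proba(x.reshape(1, -1))[0, 1])
    perturbed = x.copy()
    perturbed[ranked_features[:k]] = baseline           # erase the explained features
    degraded = float(model.predict_proba(perturbed.reshape(1, -1))[0, 1])
    return original - degraded                          # larger drop -> more faithful
```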

Partners:

Project team

  • Daniela Chudá – Researcher
  • Martin Mocko – PhD Student
  • Jaroslav Kopčan – Research Engineer