Data-driven analysis in pursuit of a business or strategic goal, commonly referred to as Data Mining, is like finding a needle in a haystack. But when the ‘needle’ in question is a piece of malware or a vulnerability that has to be found well before attackers find it, the whole perspective changes. A needle’s prick costs you a drop of blood at most; a vulnerability or piece of malware that goes undetected can have an enormous and devastating impact on an organization.
Data Mining is the analysis of the large datasets at an organization’s disposal (internal, external or publicly available research datasets) using interdisciplinary techniques from Statistics, Computer Science and Machine Learning, with the goal of discovering valuable insights in otherwise not-so-useful voluminous data. Data Mining & Data Analytics already have mature applications in areas such as Bioinformatics, Healthcare, Customer Behaviour Analysis and Web Mining. Data Mining techniques are also being leveraged for various Cybersecurity applications, and many successful, mature use cases already exist in this domain. Continuous R&D advances in Computer Science & Machine Learning are helping to devise even more sophisticated malware and vulnerability detection systems.
Data Mining & Sandboxing
In addition to common malware analysis techniques such as static and dynamic file analysis, available malware detection tools and sandboxes now use hybrid analysis based on data mining and AI. Novel machine learning methods are being researched to analyse the behaviour of malware in sandboxes, which aids early detection by uncovering the code obfuscation, polymorphism and mutation techniques that malware uses to evade detection. For malicious code to run in a sandbox, it has to be loaded into memory and executed, and hence it must be unpacked and decrypted. Sandboxes provide an emulated, isolated computing environment in which malicious and benign executables can be run in a controlled manner and their runtime behaviour observed. The evasion patterns exposed this way are difficult to detect through static file-level analysis, but dynamic analysis of the binaries at runtime can pick them up using previously learnt patterns.
The use of both supervised and unsupervised learning is being explored for identifying, classifying and clustering similar groups of malware by analysing their behaviour in real time. The behaviour models learnt over time expedite the overall malware detection process, which shortens an organization's response time for damage control and helps thwart future attacks. Even without sandboxes, data mining techniques can be applied to the static analysis of malware files. Distinguishing code features can be learnt and used to train models that later discover similar static patterns in new malware, patterns that are otherwise difficult to detect using traditional signature-based or heuristic-based approaches.
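To make the idea of learning static features concrete, here is a minimal sketch of supervised static classification. It assumes byte n-gram counts as the features and a 1-nearest-neighbour rule with cosine similarity as the model. Both are illustrative choices, not a method taken from any particular tool, and the function names and toy byte strings are made up for the example.

```python
from collections import Counter
import math

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a binary blob (a simple static feature)."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(sample: bytes, labelled: list) -> str:
    """Return the label of the most similar labelled sample (1-nearest neighbour)."""
    features = byte_ngrams(sample)
    best = max(labelled, key=lambda item: cosine(features, byte_ngrams(item[0])))
    return best[1]

# Toy, made-up byte strings standing in for real binaries (assumption).
training = [
    (b"\x90\x90\x90\xeb\xfe\x90\x90\xeb\xfe", "malicious"),
    (b"hello world, plain text config file", "benign"),
]

print(classify(b"\x90\x90\xeb\xfe\x90\x90", training))
```

A real system would use far richer features (imports, section entropy, opcode sequences) and a properly validated model; the sketch only shows the learn-from-labelled-samples loop in miniature.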
Listed below are some of the research efforts along these lines that can be extended to develop scalable solutions:
- Creating system profile dictionaries of applications based on the functionalities they declare at the time of download, and comparing them with similar profiles built from their actual observed behaviour
- Analysing publicly available dynamic sandboxes for gaining malware detection intelligence
- Adversary-resistant deep learning models with human-like intelligence for malware detection
- Ensemble learning for predicting Android malware families
- Sequence classification methods
- Graph-based malware classification using machine learning
- Differential pattern mining for abnormal event detection
- Detecting zero-day polymorphic worms using Jaccard similarity
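As a rough illustration of the last item in the list, Jaccard similarity measures how much two sets overlap: the size of their intersection divided by the size of their union. The sketch below assumes payloads are compared as sets of byte 3-grams. The payloads, the `THRESHOLD` value and the helper names are all hypothetical, and this is not a reconstruction of any specific published detector.

```python
def ngram_set(payload: bytes, n: int = 3) -> set:
    """Return the set of distinct overlapping byte n-grams in a payload."""
    return {payload[i:i + n] for i in range(len(payload) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up payloads: a "known" worm body, a lightly mutated variant, and normal traffic.
known_worm = b"GET /cgi-bin/x?cmd=\x90\x90\x90\xcc\xcc\xeb\xfe/bin/sh"
variant = b"GET /cgi-bin/y?cmd=\x90\x90\x41\xcc\xcc\xeb\xfe/bin/sh"
normal = b"GET /index.html HTTP/1.1 Host: example.org"

THRESHOLD = 0.5  # hypothetical cut-off; it would need tuning on real data

for name, payload in [("variant", variant), ("normal", normal)]:
    score = jaccard(ngram_set(known_worm), ngram_set(payload))
    verdict = "suspicious" if score >= THRESHOLD else "ok"
    print(name, round(score, 2), verdict)
```

The appeal of a set-based measure is that small mutations change only a few n-grams, so a variant still overlaps heavily with the sample it was derived from.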
For behaviour analysis of malware using data mining & machine learning, what matters most is selecting the right feature sets from the static and dynamic execution reports of malicious and benign software applications. Three categories of feature sets should be considered: static only, dynamic only, and a mix of the two, since extracting and working with dynamic features alone is computationally expensive. Feature engineering techniques, from dimensionality reduction methods such as PCA and LDA to feature selection metrics such as Information Gain and Chi-square, should be applied to the available malware analysis datasets when training models to classify probable malware files. As in any machine learning or data mining application, the quality of the training datasets governs the performance and accuracy of the deployed system, which in this case has to tell malicious software applications from benign ones in real time. The training dataset should also include non-malicious applications whose system-level dependencies, network activities, registry values, file parameters such as process threads, and other features are similar to those of a malicious application; this improves accuracy on the two-class problem of separating malicious from non-malicious applications.
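To show what a feature selection metric does in practice, here is a small sketch of Information Gain: the reduction in label entropy obtained by splitting the samples on one feature. The feature names (`writes_registry`, `opens_socket`) and the four toy samples are invented for the example; real values would come from static or dynamic analysis reports.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels) -> float:
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data (assumption): one row per sample, binary features from analysis reports.
labels = ["malicious", "malicious", "benign", "benign"]
features = {
    "writes_registry": [1, 1, 0, 0],  # separates the two classes perfectly
    "opens_socket": [1, 0, 1, 0],     # tells us nothing about the class
}

# Rank features from most to least informative.
ranked = sorted(features, key=lambda name: information_gain(features[name], labels), reverse=True)
for name in ranked:
    print(name, round(information_gain(features[name], labels), 3))
```

Features with a gain near zero are candidates for removal, which keeps the model smaller and reduces the cost of extracting expensive dynamic features that do not help.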
Data Mining & Machine Learning techniques are also being applied in the Cybersecurity domain to many other use cases and tasks, such as:
- Intrusion detection, and anomaly detection in database transactions as well as in cyber-physical systems
- Detecting malicious mobile applications and mobile botnets on app stores
- Spam detection
- Detecting fake profiles, misinformation and hate crime, and filtering content
However, one must not forget that all this is a double-edged sword. Adversaries are also equipping themselves with Data Mining and AI/ML-based techniques to outsmart these systems. Cybersecurity solutions are themselves vulnerable to machine-learning-based adversarial attacks, e.g. samples generated using deep learning that fool the system and slip past detection. As they say, it is forever going to be a cat-and-mouse game!