2. An overview of machine learning and anomaly detection research
Machine learning has been one of the most rapidly advancing technologies for years, and it
continues to grow alongside advances in computational power, artificial intelligence (AI) and
the Internet of Things (IoT). In the domain of cyber security, machine learning has greatly
influenced the way networks are protected, which is crucial in the era of Internet services.
Intrusion Detection Systems (IDS) are now capable of recognising unknown attacks that try to
penetrate the network, by scanning the traffic for anomalies. Anomalies in the network are all
instances in the data that do not conform to the behaviour exhibited by normal traffic [1].
These do not necessarily have to be malicious attacks, as performance-related anomalies also
occur in the network (traffic overload, malfunctioning devices, etc.). However, anomalies in
data can translate to significant and often critical problems with the information passed
through the network. In network security, the anomalies that researchers and the relevant
systems look for are security-related, meaning that they stem from malicious actions against
the network. These intrusions into the network aim to compromise the confidentiality, integrity
or availability of a system or service by bypassing the security mechanisms built into the
network’s infrastructure. As a result, security experts use IDSs to protect the network
from outside threats.
An IDS is a software and/or hardware system that monitors the events occurring in a network
and analyses them for signs of intrusion (malicious activity). IDSs can be signature-based, in
which case they can only detect known attacks and need constant updates from vendors to keep
up with rapidly growing new malware, or anomaly-based, in which case they capture any
deviation from normal behaviour and are better at recognising previously unknown attacks.
However, anomaly-based systems generate a large number of false alarms, due to the
limitations of their capabilities and training.
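
The contrast between the two IDS types can be illustrated with a minimal Python sketch. It is
not taken from any real IDS: the signature list, the single "requests per second" feature, the
traffic values and the three-standard-deviation threshold are all assumptions chosen for
illustration.

```python
from statistics import mean, stdev

# Hypothetical byte patterns standing in for a vendor-supplied signature database.
KNOWN_SIGNATURES = [b"/etc/passwd", b"' OR 1=1"]


def signature_alert(payload: bytes) -> bool:
    """Signature-based check: alert only if a known pattern appears in the payload."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


class RateAnomalyDetector:
    """Anomaly-based check: learn a baseline request rate from normal traffic only."""

    def __init__(self, normal_rates, threshold=3.0):
        self.mu = mean(normal_rates)
        self.sigma = stdev(normal_rates)
        self.threshold = threshold  # assumed cut-off, in standard deviations

    def alert(self, rate: float) -> bool:
        # Flag any rate that deviates too far from the learned baseline.
        return abs(rate - self.mu) / self.sigma > self.threshold


# Assumed requests-per-second samples observed during normal operation.
normal_rates = [98, 102, 100, 97, 103, 101, 99, 100]
detector = RateAnomalyDetector(normal_rates)

# A previously unseen flood: the payload matches no signature, but the rate is abnormal.
novel_payload, novel_rate = b"GET /index.html", 900
missed_by_signature = not signature_alert(novel_payload)
caught_by_anomaly = detector.alert(novel_rate)

# A legitimate but unusual traffic spike is flagged as well: a false alarm.
false_alarm = detector.alert(140)
```

In this toy setting the signature check misses the unseen flood while the anomaly detector
catches it, and the same detector also raises an alert on a harmless spike, which is the false
alarm problem described above.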
Anomaly detection IDSs rely heavily on machine learning, since their function is to classify data
based on what is considered normal traffic and deviations from it. Their dependence on training
is the reason their capabilities are still limited. There are four categories of machine learning
model that can be applied to anomaly detection: a) supervised, b) semi-supervised, c)
unsupervised and d) hybrid training models. In supervised training, the IDS model trains on
labelled data from a dataset that contains both normal and malicious traffic, and any unseen
instance is compared against the model to determine which class it belongs to. In semi-supervised
models, the training data contains only normal instances, so the model cannot differentiate
between attack classes when it encounters malicious traffic; it can only separate normal from
abnormal events. With unsupervised training methods, the model does not require any labelled
training data, which would make it the most widely applicable approach, but the unlabelled
nature of the data makes its performance harder to quantify and its results less useful. The
hybrid approach combines features of the aforementioned methods, with the aim of producing
better results for large-scale applications.
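
The difference between the first three training regimes can be sketched in a few lines of
Python. This is a toy illustration under stated assumptions, not a recommended detector: the
two flow features, the sample values, the nearest-centroid classifier and the distance
thresholds are all hypothetical, and the hybrid case is left out.

```python
from math import dist
from statistics import mean, median

# Hypothetical features per flow: (packets per second, mean packet size in bytes).
labelled = [
    ((100, 500), "normal"), ((110, 520), "normal"), ((90, 480), "normal"),
    ((900, 60), "dos"), ((950, 64), "dos"),
    ((300, 40), "probe"), ((320, 44), "probe"),
]


def centroid(points):
    """Component-wise mean of a list of feature tuples."""
    return tuple(mean(axis) for axis in zip(*points))


# a) Supervised: learn one centroid per labelled class, then assign the nearest class.
def fit_supervised(data):
    classes = {}
    for x, y in data:
        classes.setdefault(y, []).append(x)
    return {label: centroid(pts) for label, pts in classes.items()}


def predict_supervised(model, x):
    return min(model, key=lambda label: dist(x, model[label]))


# b) Semi-supervised: train on normal flows only; the output is normal vs abnormal.
def fit_semi_supervised(normal_points, slack=3.0):
    c = centroid(normal_points)
    radius = slack * max(dist(p, c) for p in normal_points)  # assumed tolerance
    return c, radius


def predict_semi_supervised(model, x):
    c, radius = model
    return "normal" if dist(x, c) <= radius else "abnormal"


# c) Unsupervised: no labels; assume most flows are normal and flag distant outliers.
def unsupervised_outliers(points, factor=3.0):
    c = tuple(median(axis) for axis in zip(*points))
    distances = [dist(p, c) for p in points]
    cutoff = factor * median(distances)  # assumed cut-off
    return [p for p, d in zip(points, distances) if d > cutoff]


supervised = fit_supervised(labelled)
semi = fit_semi_supervised([x for x, y in labelled if y == "normal"])
unlabelled = [(100, 500), (110, 520), (90, 480), (105, 510),
              (95, 490), (102, 505), (900, 60)]

new_flow = (880, 70)
print(predict_supervised(supervised, new_flow))   # names an attack class
print(predict_semi_supervised(semi, new_flow))    # only says normal or abnormal
print(unsupervised_outliers(unlabelled))          # flows far from the bulk of the data
```

The supervised model can name the attack class of the new flow, the semi-supervised model can
only report that it is abnormal, and the unsupervised routine merely singles out flows that lie
far from the majority, with no labels available to check whether it is right.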