All the models developed and compared in this thesis are supervised learning algorithms,
so the next section explains in more detail how this kind of machine learning works in our
context.
2.1. Supervised machine learning
Supervised anomaly detection systems rely on knowledge acquired during training. They build
a predictive model that compares each new instance with the known classes (normal or
abnormal traffic) and classifies the event accordingly.
Supervised machine learning is defined by the use of labelled data for training [5][6], which
helps the model classify or predict accurately once it is in use. Labelled means that the
training dataset contains, for each entry, both the input data and the outcome that entry
should produce. As input data is fed to it during training, the model adjusts its internal
parameters (for example, the weights of the connections between the nodes of a neural
network) so that the outcome it produces matches the correct outcome as closely as possible.
The mismatch is measured by a loss function, which quantifies the deviation between the
produced result and the correct result. The goal of training is to minimize the loss function.
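As a minimal sketch of the training process described above, the following Python example fits a logistic regression classifier by gradient descent on the binary cross-entropy loss. The choice of model, the two feature names, the toy values, and the hyperparameters are illustrative assumptions, not the models or data used in this thesis.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that instance x belongs to class 1 (abnormal)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias, samples, labels):
    """Mean binary cross-entropy between predictions and correct labels."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(samples, labels):
        p = min(max(predict_proba(weights, bias, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)

def train(samples, labels, learning_rate=0.5, epochs=500):
    """Adjust weights and bias by gradient descent to reduce the loss."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    history = []  # loss after each epoch
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y  # produced minus correct
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        history.append(mean_loss(weights, bias, samples, labels))
    return weights, bias, history

# Hypothetical, already-scaled features: [packet_rate, failed_login_ratio].
samples = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
           [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
labels = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = abnormal

weights, bias, history = train(samples, labels)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in samples]
print("loss after first epoch:", history[0])
print("loss after last epoch:", history[-1])
print("predictions:", predictions)
```

The recorded loss values shrink from one epoch to the next, which is the sense in which training minimizes the loss function.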
Supervised learning applies to two categories of problems: classification and regression. In
regression problems, the model needs to learn the relationship between the dependent
variables and the independent ones in order to predict a continuous value. It is usually
applied when we want to make projections, such as weather forecasts, stock prices, or
business revenue. In classification problems, on the other hand, the model needs to learn
which features characterise each class and assign the input data to the right category, for
example by moving unwanted emails to the spam folder.
Anomaly detection is a classification problem, and some of the most common supervised
learning methods for classification are the ones used in our project, which are analysed in
Section 3, Classification models analysis.
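To make the distinction between regression and classification concrete, the sketch below contrasts the two kinds of output on toy data. A least-squares line stands in for regression and a nearest-centroid rule stands in for classification; both techniques and all values are assumptions chosen for brevity, not the methods analysed in Section 3.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def fit_centroids(samples, labels):
    """Classification: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    def squared_distance(label):
        return sum((c - xi) ** 2 for c, xi in zip(centroids[label], x))
    return min(centroids, key=squared_distance)

# Regression output is a continuous value.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
projection = slope * 5 + intercept

# Classification output is one of a fixed set of categories.
centroids = fit_centroids(
    [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]],
    ["normal", "normal", "abnormal", "abnormal"],
)
category = classify(centroids, [0.85, 0.7])
print(projection, category)
```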
Supervised learning differs fundamentally from unsupervised learning in that it uses labelled
data. Unsupervised learning methods instead try to discover patterns in unlabelled data,
clustering similar instances or finding associations between them. Each approach has its own
advantages and disadvantages; those that apply to supervised learning methods are outlined
below.
Disadvantages and challenges of supervised learning:
- Creating labels for some datasets can be time-consuming, or even impossible in some
cases, due to the limited information available on the data.
- Irrelevant input features can greatly hinder the performance of the model, as can
unlikely, incomplete, or out-of-bounds values in the training data.
- In classification applications, representing all classes in a balanced way is a
challenge, and the performance of the model degrades when the data is imbalanced.
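The effect of class imbalance mentioned in the last point can be illustrated with a short Python sketch. It assumes a hypothetical dataset of 95 normal and 5 abnormal instances and a trivial model that always predicts the majority class; the inverse-frequency class weights at the end are one common mitigation heuristic, shown here as an assumption rather than as the approach taken in this thesis.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted class is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of truly positive instances that were detected."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in relevant) / len(relevant)

def balanced_class_weights(y_true):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y_true)
    return {label: len(y_true) / (len(counts) * count)
            for label, count in counts.items()}

# Hypothetical imbalanced labels and a model that always answers "normal".
y_true = ["normal"] * 95 + ["abnormal"] * 5
y_pred = ["normal"] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall on abnormal:", recall(y_true, y_pred, "abnormal"))
print("class weights:", balanced_class_weights(y_true))
```

The accuracy looks high even though no abnormal instance is detected, which is why a per-class metric such as recall is more informative on imbalanced data. The weights give each rare instance a larger contribution to the loss when a model supports per-class weighting.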