1. Introduction
In today’s world, most processes and services of everyday life pass through the Internet.
Networking has advanced greatly in the past few years and will continue to do so with the
large-scale implementation of 5G and 6G, which are already being tested and researched. Because of the
important role that networks and the Internet play in our society, cyber security has become
vital for the protection of our data and devices. Intrusion Detection Systems (IDS) are an
important part of cyber security and of the network’s infrastructure, as they can detect
malicious programs and users, prevent them from breaching the network, and stop various kinds
of attacks before they pose a danger. With the rapid growth of machine learning and artificial
intelligence (AI), IDSs have shifted from signature-based techniques, which work by recognising
specific patterns in mostly known attacks, to more abstract anomaly-based detection, which
classifies traffic as either normal (safe) or abnormal (dangerous).
Anomalies in a network can be caused by malicious activities that take advantage of network
services, traffic overload, malfunctioning devices, and the compromising of various network
parameters [1]. They can be performance-related (e.g., traffic flooding because of a
malfunctioning node) or security-related (e.g., intentional flooding of the network resources
so that legitimate users cannot access the services). Anomaly detection systems can detect any
kind of deviation from normal behaviour, so they are better than classical signature-based
systems at catching novel and unknown attacks; however, this comes at the cost of raising
more false alarms.
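
The contrast between the two detection approaches, and the false-alarm trade-off mentioned above, can be illustrated with a minimal sketch. It is not the method used in this thesis: the signature strings, the packet rates, and the three-standard-deviation threshold are all made-up assumptions chosen only to make the idea concrete.

```python
from statistics import mean, stdev

# Hypothetical signatures: payload fragments of attacks that are already known.
KNOWN_SIGNATURES = ("/etc/passwd", "' OR '1'='1")


def signature_alert(payload):
    """Signature-based check: alert only if a known pattern occurs in the payload."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def fit_baseline(normal_rates):
    """Learn the mean and sample standard deviation of normal traffic rates."""
    return mean(normal_rates), stdev(normal_rates)


def anomaly_alert(rate, baseline, threshold=3.0):
    """Anomaly-based check: alert if the rate lies more than `threshold`
    standard deviations away from the learned baseline (assumed rule)."""
    mu, sigma = baseline
    return abs(rate - mu) > threshold * sigma


# Made-up packet rates (packets per second) observed during normal operation.
normal_rates = [98, 102, 100, 97, 103, 101, 99, 100]
baseline = fit_baseline(normal_rates)

# A novel flood matches no known signature but deviates strongly from the baseline.
flood_by_signature = signature_alert("GET /index.html")
flood_by_anomaly = anomaly_alert(900, baseline)

# A legitimate traffic burst also deviates from the baseline: a false alarm.
burst_by_anomaly = anomaly_alert(130, baseline)

# Traffic close to the baseline raises no alert.
quiet_by_anomaly = anomaly_alert(104, baseline)
```

In this toy setting the flood is missed by the signature check and caught by the anomaly check, while the harmless burst is flagged as well, which is the extra false alarm that anomaly-based detection pays for its wider coverage.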
The NSL-KDD dataset has been one of the most commonly used network traffic datasets ever since its
creation in 2009 [2][3][4]. It is still used in research as a benchmark for network traffic
classification models, including in all of the papers cited here. It therefore provides an excellent
dataset for comparing the different machine learning models tested: it is a reliable source of
different types of attack labels and offers a high difficulty level of attacks in both the training
and the test sets. In addition, the differences between the two subsets provide a good real-world
test of the models’ abilities to classify correctly.
In this thesis, our objective is to use the NSL-KDD dataset to compare five of the most
commonly used supervised learning classification models: logistic regression, k-nearest
neighbours, decision tree, Gaussian Naïve Bayes, and the multi-layer perceptron. To this
end, section 2 provides a brief introduction to machine learning techniques for
anomaly detection, as well as relevant research that has been carried out in the past couple
of years; it also discusses the advantages of the NSL-KDD dataset. Section 3 gives more
information on the five algorithms that are used for our experiment. In section 4, three
instances of the dataset are first created in order to compare the different classification
scenarios (multiclass, binary, and 4-class classification); the NSL-KDD dataset is then analysed
and pre-processed, and subsequently fed into the different models, which are optimised for best
accuracy scores. Lastly, in section 5, the models are evaluated and the results of this research
are discussed, while section 6 focuses on the problems that anomaly detection still faces, as well
as future work and research on the topic.
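
As a rough illustration of how several instances of one labelled dataset can be derived for different classification scenarios, the sketch below relabels a handful of NSL-KDD style connection labels. The sample list is made up, and the grouping into attack families (DoS, Probe, R2L, U2R) follows the categories commonly documented for NSL-KDD; whether the 4-class scenario of this thesis uses exactly these groups is an assumption, since this section does not define them.

```python
from collections import Counter

# A small made-up sample of connection labels in the NSL-KDD naming style.
labels = [
    "normal", "neptune", "smurf", "ipsweep",
    "normal", "guess_passwd", "buffer_overflow",
]

# Assumed mapping from individual attack labels to attack families.
ATTACK_FAMILY = {
    "neptune": "DoS",
    "smurf": "DoS",
    "ipsweep": "Probe",
    "guess_passwd": "R2L",
    "buffer_overflow": "U2R",
}


def to_binary(label):
    """Binary scenario: every label other than 'normal' becomes 'abnormal'."""
    return "normal" if label == "normal" else "abnormal"


def to_family(label):
    """Grouped scenario: replace each attack label by its (assumed) family."""
    if label == "normal":
        return "normal"
    return ATTACK_FAMILY.get(label, "unknown")


# Multiclass scenario: the original labels are kept unchanged.
multiclass = list(labels)
binary = [to_binary(label) for label in labels]
grouped = [to_family(label) for label in labels]

print(Counter(multiclass))
print(Counter(binary))
print(Counter(grouped))
```

Each relabelled list can then be paired with the same feature matrix, so that the same models are trained and compared under every scenario.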