1. Introduction

In today’s world, most everyday processes and services pass through the Internet.
Networking has advanced greatly in the past few years and will continue to do so, with the
wide deployment of 5G and with 6G already being researched and tested. Because of the
important role that networks and the Internet play in our society, cyber security has become
vital for the protection of our data and devices. Intrusion Detection Systems (IDSs) are an
important part of cyber security and of a network’s infrastructure, as they can detect
malicious programs and users, prevent them from breaching the network, and stop various
kinds of attacks before they pose a danger. With the rapid growth of machine learning and
artificial intelligence (AI), IDSs have shifted from signature-based techniques, which work by
recognising specific patterns of mostly known attacks, to more abstract anomaly-based
detection, which classifies traffic as either normal (safe) or abnormal (dangerous).

Anomalies in a network can be caused by malicious activities that take advantage of network
services, by traffic overload, by malfunctioning devices, or by the compromise of various
network parameters [1]. They can be performance-related (e.g., traffic flooding because of a
malfunctioning node) or security-related (e.g., intentional flooding of the network resources
so that legitimate users cannot access the services). Anomaly detection systems can detect any
kind of deviation from normal behaviour, so they are better than classical signature-based
systems at catching novel and unknown attacks; however, this comes at the cost of raising
more false alarms.
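
The trade-off described above can be illustrated with a minimal sketch. It is not the detection method used in this thesis: the signature strings, the traffic rates, and the three-standard-deviation threshold are all hypothetical values chosen for illustration. The signature check only fires on patterns it already knows, while the anomaly check fires on any large deviation from a baseline learned from normal traffic, including a harmless burst.

```python
from statistics import mean, stdev

# Hypothetical signature database: exact patterns of already-known attacks.
KNOWN_SIGNATURES = {"GET /etc/passwd", "' OR 1=1 --"}


def signature_alert(payload):
    """Flag a payload only if it contains a known attack pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def fit_baseline(normal_rates):
    """Learn the mean and sample standard deviation of normal traffic rates."""
    return mean(normal_rates), stdev(normal_rates)


def anomaly_alert(rate, baseline, threshold=3.0):
    """Flag any rate that deviates from the baseline by more than
    `threshold` standard deviations (threshold is an assumed value)."""
    mu, sigma = baseline
    return abs(rate - mu) > threshold * sigma


# Hypothetical packets-per-second measurements taken during normal operation.
normal_rates = [98, 102, 100, 101, 99, 100, 103, 97]
baseline = fit_baseline(normal_rates)

known_attack = signature_alert("GET /etc/passwd HTTP/1.1")  # known pattern
novel_attack = signature_alert("GET /new-exploit HTTP/1.1")  # unseen pattern
flood = anomaly_alert(500, baseline)         # novel flooding attack
benign_burst = anomaly_alert(107, baseline)  # legitimate spike in traffic
typical = anomaly_alert(101, baseline)       # ordinary traffic
```

In this toy setting the signature check misses the unseen payload, whereas the anomaly check catches the flood but also flags the benign burst, which is the false-alarm cost mentioned above.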

The NSL-KDD dataset has been one of the most commonly used network traffic datasets ever
since its creation in 2009 [2][3][4]. It is still used in research as a benchmark for network
traffic classification models, as in all of the papers cited here. Thus, it provided an
excellent basis for comparing the different machine learning models tested, offering a reliable
source of different types of attack labels and a high difficulty level of attacks in both the
training and the test sets. In addition, the differences between the two subsets made for a
good real-world test of the models’ abilities to classify correctly.

In this thesis, our objective is to use the NSL-KDD dataset to compare five of the most
commonly used supervised learning classification models: logistic regression, k-nearest
neighbours, decision tree, Gaussian Naïve Bayes, and the multi-layer perceptron. For
this purpose, section 2 provides a brief introduction to machine learning techniques for
anomaly detection, as well as relevant research that has been carried out in the past couple
of years; it also discusses the advantages of the NSL-KDD dataset. Section 3 gives more
information on the five algorithms that are used in our experiment. In section 4, after three
instances of the dataset are created in order to compare the different classification scenarios
(multiclass, binary, and 4-class classification), the NSL-KDD dataset is first analysed and
pre-processed, and then fed into the different models, which are optimised for the best
accuracy scores. In section 5, the models are evaluated and the results of this research are
discussed. Lastly, section 6 focuses on the problems that anomaly detection is still facing, as
well as future work and research on the topic.
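
As a rough illustration of how one labelled dataset can yield more than one classification scenario, the sketch below collapses per-attack labels into a binary normal/attack label. It is only an assumption about how such an instance might be derived, not the procedure used in section 4: the helper name `to_binary` and the five sample labels are hypothetical, and the 4-class grouping is not sketched because it is not defined in this introduction.

```python
from collections import Counter


def to_binary(label):
    """Collapse a multiclass label into 'normal' or 'attack'."""
    return "normal" if label == "normal" else "attack"


# Hypothetical sample of multiclass labels; neptune, smurf, and satan are
# attack names that appear in NSL-KDD.
multiclass_labels = ["normal", "neptune", "smurf", "normal", "satan"]

# Binary instance of the same records.
binary_labels = [to_binary(label) for label in multiclass_labels]

# Class counts per scenario, useful for checking class balance.
multiclass_counts = Counter(multiclass_labels)
binary_counts = Counter(binary_labels)
```

The multiclass instance keeps every attack name as its own class, while the binary instance only asks whether a record is an attack at all.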