2.3. The advantages of the NSL-KDD dataset

To close this section, it is worth explaining where the NSL-KDD dataset came from and why it
was chosen. The NSL-KDD was created in 2009 in an effort to overcome some of the limitations
of its predecessors, DARPA (1998) and KDDCup99 (1999). Like the original KDDCup99, it is a
publicly available dataset of network traffic records, and it contains a selected subset of the
data in KDDCup99 [1]. That subset was selected by filtering out the problematic instances of
KDDCup99 while following data mining best practices in building the new dataset. The main
advantages of using this dataset are:

- It does not include redundant records in the training set, so classifiers are not biased
toward the more frequent records.

- There are no duplicate records in the test set, so the measured performance is not biased
in favor of methods that have better detection rates on the frequent records.

- The number of records selected from each difficulty level is inversely proportional to
the percentage of records at that level in the original KDDCup99. As a result, the
classification rates of different machine learning methods vary over a wider range.

- Unlike KDDCup99, which contains millions of records, both KDDTrain+ and KDDTest+ have a
reasonable number of records, making it affordable to run experiments on the complete
sets instead of on a small random portion. This is why the evaluation results of different
research groups are consistent and comparable (as is the case with our models).
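
The two selection ideas in the list above, removing duplicates and sampling each difficulty
level in inverse proportion to its share of the data, can be illustrated with a minimal Python
sketch. It is not the procedure used to build NSL-KDD. The function names, the toy records,
and the specific rule "keep a fraction of 1 minus the level's share" are assumptions made
here for illustration only.

```python
import random
from collections import Counter, defaultdict


def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def inverse_proportional_sample(records, level_of, seed=0):
    """Sample each difficulty level in inverse proportion to its share.

    Assumed rule (illustrative): a level that holds a fraction `share`
    of all records keeps a fraction `1 - share` of its own records.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for rec in records:
        groups[level_of(rec)].append(rec)

    total = len(records)
    sample = []
    for group in groups.values():
        share = len(group) / total
        keep = round(len(group) * (1.0 - share))
        sample.extend(rng.sample(group, keep))
    return sample


# Toy data (hypothetical): the last field is a made-up difficulty label.
toy = [("tcp", "http", i, "easy") for i in range(8)] * 3
toy += [("udp", "dns", i, "hard") for i in range(2)]

unique = deduplicate(toy)
sample = inverse_proportional_sample(unique, level_of=lambda r: r[-1])
counts = Counter(r[-1] for r in sample)
```

In this toy case the frequent "easy" level is cut back much more strongly than the rare "hard"
level, so a method cannot score well simply by handling the most common records.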

The NSL-KDD is not a perfect dataset: it is quite outdated, and it is synthetic. There is,
however, much value in the few good datasets that are available, even if they are old. Firstly,
they are already labelled, a process that is very time consuming and sometimes even
impossible. Labels allow researchers to test supervised learning methods and to validate the
unsupervised models more frequently used today. Secondly, benchmark datasets like NSL-KDD
are used to validate and evaluate new approaches to intrusion detection and to compare
different methods, old and new. They are also the only way to make experiments repeatable
over the years, especially because they are publicly available to all researchers. Finally, a
feature-rich dataset like NSL-KDD lets different approaches tune different parameters and
extract features for more lightweight models, or it can simply provide a base on which new
datasets can be built.

Network traffic datasets are valuable assets for IDS research. However, none of them can
fully represent real-world traffic, which is constantly evolving and in which new attacks keep
appearing (while others have yet to be discovered). Apart from the privacy and security
concerns that hinder the collection of real data, realistic simulations are also difficult to
produce. Evaluation of IDS datasets is challenged by the difficulty of collecting attack and
victim scripts, by the speed at which attacks evolve and are produced, and by the many
different network services, which not only make traffic more complex but also open new gaps
for exploitation.