4. Characteristics and pre-processing of the NSL-KDD dataset

The NSL-KDD dataset consists of several subsets. The main ones are KDDTrain+ and KDDTest+,
which have 125,973 and 22,544 rows respectively, a test-to-training ratio of 17.9%. These two
files contain the full training and test datasets in .csv format, including an attack type
label and a difficulty level (ranging from 1 to 21) for each record. In addition, the dataset
provides KDDTest-21, a subset of the test set that includes only the records with a difficulty
level lower than 21/21, and KDDTrain+_20Percent, a subset of 25,192 records randomly taken
from the training set. All records of KDDTest-21 and KDDTrain+_20Percent are also included in
the larger KDDTest+ and KDDTrain+ files respectively, so the main files contain all the
information in the dataset.
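The proportions quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the row counts are the ones stated in the text, while the names (TRAIN_ROWS, ratio_percent, and so on) are illustrative and not part of the dataset.

```python
# Row counts as quoted in the text above.
TRAIN_ROWS = 125_973     # KDDTrain+
TEST_ROWS = 22_544       # KDDTest+
TRAIN_20_ROWS = 25_192   # KDDTrain+_20Percent


def ratio_percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Test rows relative to training rows (the 17.9% figure in the text).
test_to_train = ratio_percent(TEST_ROWS, TRAIN_ROWS)

# Test rows relative to all rows (training + test), a different quantity.
test_share_of_total = ratio_percent(TEST_ROWS, TRAIN_ROWS + TEST_ROWS)

# Size of the random training subset relative to the full training set.
subset_share = ratio_percent(TRAIN_20_ROWS, TRAIN_ROWS)

print(f"test / train:         {test_to_train:.1f}%")
print(f"test / (train+test):  {test_share_of_total:.1f}%")
print(f"20Percent / train:    {subset_share:.1f}%")
```

Note that 17.9% is the ratio of test rows to training rows; measured against the combined total of both files, the test set accounts for a smaller share.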

The datasets consist of records of network traffic, as seen by a simple IDS network. Each
record (row) has 43 features (columns): the first 41 (#0 to #40) are characteristics of the
traffic, #41 is the attack label, and #42 is the difficulty level of the record.
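The column layout described above can be sketched as a small helper that splits one record into its three parts. This is a minimal illustration using only the Python standard library; it assumes that the files have no header row and that fields are comma-separated. The sample line is a synthetic placeholder, not a real record from the dataset, and the helper name split_record is hypothetical.

```python
import csv
import io

N_FEATURES = 41       # columns #0..#40: traffic characteristics
LABEL_COL = 41        # column #41: attack label
DIFFICULTY_COL = 42   # column #42: difficulty level
N_COLUMNS = 43


def split_record(row):
    """Split one parsed CSV row into (features, label, difficulty)."""
    if len(row) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(row)}")
    features = row[:N_FEATURES]
    label = row[LABEL_COL]
    difficulty = int(row[DIFFICULTY_COL])
    return features, label, difficulty


# Synthetic placeholder record (NOT real data): 41 dummy feature values,
# followed by a label and a difficulty level.
sample_line = ",".join(["0"] * N_FEATURES + ["normal", "20"])

reader = csv.reader(io.StringIO(sample_line))
features, label, difficulty = split_record(next(reader))

print(len(features), label, difficulty)
```

To process an actual file, the io.StringIO object would be replaced by an open file handle, and split_record would be applied to every row yielded by the reader.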

4.1. The attack labels (traffic type)

In total, there are 39 attacks (40 different labels including the normal traffic) that belong to
four classes: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R) and Probe
attacks. The four classes differ in their objectives, in the way they infect the network, and
in how they are distributed in the dataset. Normal traffic forms a fifth category, which,
naturally, occurs more often than any of the attack classes.

Denial of Service (DoS): DoS attacks flood the network with abnormal traffic so that normal
traffic cannot reach it. As a result, the network will most likely shut down to protect itself
from the volume of data trying to pass through the IDS.

Remote to Local (R2L): as the name suggests, an R2L attack tries to gain local access to a
system or network from a remote machine that does not normally have such access; the attacker
tries to “hack” their way into the network.

User to Root (U2R): in this attack, a normal user account tries to gain privileged super-user
(root) access by exploiting vulnerabilities and gaps in the devices of the system or network.

Probe: probe or surveillance attacks try to steal information from a network, such as client
information, banking data, passwords, or other personal data passing through it.
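Grouping individual attack labels into the five categories described above can be sketched as a simple lookup. The mapping below is deliberately partial: it covers only a handful of well-known NSL-KDD label names, which are not listed in this section, and the names LABEL_TO_CLASS and traffic_class are hypothetical. The example label list is synthetic and does not reflect the real class distribution.

```python
from collections import Counter

# Partial, illustrative mapping from attack label to class.
# A complete mapping would need all 39 attack labels.
LABEL_TO_CLASS = {
    "normal": "Normal",
    "neptune": "DoS",
    "smurf": "DoS",
    "teardrop": "DoS",
    "ipsweep": "Probe",
    "nmap": "Probe",
    "portsweep": "Probe",
    "satan": "Probe",
    "guess_passwd": "R2L",
    "ftp_write": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
}


def traffic_class(label: str) -> str:
    """Return the class of a label, or 'Unknown' if it is not mapped."""
    return LABEL_TO_CLASS.get(label, "Unknown")


# Synthetic labels (NOT real data), used only to show the counting step.
example_labels = ["normal", "neptune", "normal", "nmap", "rootkit", "smurf"]
class_counts = Counter(traffic_class(lbl) for lbl in example_labels)

print(class_counts)
```

Returning "Unknown" for unmapped labels, rather than raising an error, makes it easy to spot labels that the partial mapping does not yet cover.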