4. Characteristics and pre-processing of the NSL-KDD dataset
The NSL-KDD dataset consists of several subsets. The main ones are KDDTrain+ and KDDTest+,
which have 125,973 and 22,544 rows respectively, giving a test-to-training ratio of 17.9%.
These two files contain the full training and test datasets in .csv format, including an attack
type label and a difficulty level (ranging from 1 to 21) for each record. In addition to these
two, the dataset contains KDDTest-21, a subset of the test set that includes only the records
with a difficulty level lower than 21/21, and KDDTrain+_20Percent, a subset of 25,192 records
randomly taken from the training set. The records of KDDTest-21 and KDDTrain+_20Percent are
all included in the bigger datasets, KDDTest+ and KDDTrain+ respectively, so all the
information of the dataset is present in the main files.
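The proportions quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are our own, and the row counts are the ones stated in the text.

```python
# Row counts as stated in the text (not read from the files themselves).
KDD_TRAIN_PLUS_ROWS = 125_973
KDD_TEST_PLUS_ROWS = 22_544
KDD_TRAIN_20PCT_ROWS = 25_192


def ratio_percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Size of the test set relative to the training set.
test_to_train = ratio_percent(KDD_TEST_PLUS_ROWS, KDD_TRAIN_PLUS_ROWS)

# Share of KDDTrain+ that is covered by KDDTrain+_20Percent.
subset_share = ratio_percent(KDD_TRAIN_20PCT_ROWS, KDD_TRAIN_PLUS_ROWS)

print(f"test/train ratio: {test_to_train}%")
print(f"KDDTrain+_20Percent share of KDDTrain+: {subset_share}%")
```

The second figure confirms that KDDTrain+_20Percent covers roughly one fifth of the training records, as its name suggests.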
The datasets consist of records of network traffic, as seen by a simple IDS network. Each
record (row) has 43 columns: the first 41 (#0 − #40) are features of the traffic, #41 is the
attack label, and #42 is the difficulty level of the record.
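As a minimal sketch of this column layout, a record can be split into its three parts as shown here. The helper `split_record` and the sample row are hypothetical: the row is a synthetic placeholder, not a real record, and the code assumes comma-separated fields without a header line.

```python
import csv
import io

N_FEATURES = 41        # columns #0 to #40: traffic features
LABEL_INDEX = 41       # column #41: attack label
DIFFICULTY_INDEX = 42  # column #42: difficulty level
N_COLUMNS = 43


def split_record(row):
    """Split one 43-column row into (features, label, difficulty)."""
    if len(row) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} columns, got {len(row)}")
    features = row[:N_FEATURES]
    label = row[LABEL_INDEX]
    difficulty = int(row[DIFFICULTY_INDEX])
    return features, label, difficulty


# Synthetic placeholder row (NOT real data): 41 dummy feature values,
# followed by a label and a difficulty level.
sample_line = ",".join(["0"] * N_FEATURES + ["normal", "20"])
row = next(csv.reader(io.StringIO(sample_line)))

features, label, difficulty = split_record(row)
print(len(features), label, difficulty)
```

Keeping the label and the difficulty level out of the feature list at load time helps avoid leaking them into a model as inputs.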
4.1. The attack labels (traffic type)
In total, there are 39 attacks (40 different labels including the normal traffic) that belong to
four classes: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R) and Probe
attacks. The four classes of the NSL-KDD dataset differ in their objectives, the way they
compromise the network, and how they are distributed in the dataset. A fifth category of the
dataset is the normal traffic, which, naturally, occurs more often than any single attack class.
Denial of Service (DoS): DoS attacks flood the network with abnormal traffic, so that
legitimate traffic can no longer reach it. As a result, the network will most likely shut down in
order to protect itself from the volume of data trying to pass through the IDS.
Remote to Local (R2L): as the name suggests, in an R2L attack a remote machine that has no
legitimate access tries to gain local access to a system or network, i.e. the attacker tries to
“hack” their way into the network.
User to Root (U2R): in this attack, a normal user account tries to gain privileged access as a
super-user (root access) by exploiting vulnerabilities in the devices of the system or network.
Probe: probe or surveillance attacks scan a network to gather information about it, such as
active hosts, open ports and running services. This information can later be used to mount
further attacks, for example to reach passwords or other personal data passing through the
network.
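The grouping of attack labels into the four classes can be sketched as a simple lookup. The table below is deliberately partial and is an assumption based on the commonly used assignment of labels to classes, not a list taken from this text; labels missing from it fall back to "Unknown". The names `PARTIAL_LABEL_TO_CLASS` and `to_class` are hypothetical, and the label list at the end is a made-up example, not data from the files.

```python
from collections import Counter

# Partial, illustrative mapping (assumption): only a handful of the
# 39 attack labels are listed here.
PARTIAL_LABEL_TO_CLASS = {
    # Denial of Service
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "teardrop": "DoS", "pod": "DoS", "land": "DoS",
    # Probe
    "ipsweep": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "satan": "Probe",
    # Remote to Local
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L",
    "phf": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # User to Root
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "rootkit": "U2R", "perl": "U2R",
}


def to_class(label: str) -> str:
    """Map a raw label to Normal, DoS, Probe, R2L, U2R or Unknown."""
    label = label.strip().lower()
    if label == "normal":
        return "Normal"
    return PARTIAL_LABEL_TO_CLASS.get(label, "Unknown")


# Made-up label sequence, used only to show the class counting.
labels = ["normal", "neptune", "smurf", "nmap",
          "guess_passwd", "rootkit", "normal"]
class_counts = Counter(to_class(label) for label in labels)
print(class_counts)
```

Applying the same counting to the label column of KDDTrain+ or KDDTest+ would give the class distribution of each file, provided the mapping is first extended to all labels.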