4.2. The features of NSL-KDD
Apart from the label, which is found in the 42nd column (#41), and the last column, which
holds the severity/difficulty level (#42, removed during the pre-processing phase), the
remaining features carry the information that the IDS uses to determine the type of traffic
and assign the appropriate label to it. These features can be categorised by the information
they contain and by the way that information is extracted from the packets arriving at the
network.
There are four categories by which the features are grouped [25]:
Intrinsic features (columns #0 − #8) contain the basic information about the incoming packet,
taken from the header without needing to inspect the payload.
Content features (columns #9 − #21) hold information about the incoming packets on a
per-connection basis, which gives the IDS access to the payload.
Time-based features (columns #22 − #30) analyse the traffic over a 2-second window, and
mostly contain rates and counts (e.g., of connection attempts, port numbers, connections
that activate certain flags, etc.) rather than information from the packets themselves.
Lastly, host-based features (columns #31 − #40) are similar to the previous category, but
instead of analysing traffic inside the 2-second window, they gather information over a series
of connections (e.g., the percentage of connections with the same destination host
address/port number), in order to assess attacks that span longer than that window.
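The grouping above can be written down as column ranges, which is convenient when selecting features by category. The following is a minimal sketch that assumes the 0-based column numbering used in this section; the names `FEATURE_GROUPS` and `group_of` are illustrative and do not come from the dataset or from [25].

```python
# Column ranges (0-based) for the four feature categories described above.
# The dictionary and function names are illustrative assumptions.
FEATURE_GROUPS = {
    "intrinsic": range(0, 9),      # columns #0 - #8
    "content": range(9, 22),       # columns #9 - #21
    "time_based": range(22, 31),   # columns #22 - #30
    "host_based": range(31, 41),   # columns #31 - #40
}
LABEL_COLUMN = 41        # attack label
DIFFICULTY_COLUMN = 42   # severity/difficulty level, dropped in pre-processing


def group_of(column: int) -> str:
    """Return the category name for a 0-based column index."""
    for name, columns in FEATURE_GROUPS.items():
        if column in columns:
            return name
    if column == LABEL_COLUMN:
        return "label"
    if column == DIFFICULTY_COLUMN:
        return "difficulty"
    raise ValueError(f"column {column} is outside the 43 dataset columns")


# Sanity check: the four groups together cover the 41 features exactly once.
covered = sorted(c for columns in FEATURE_GROUPS.values() for c in columns)
assert covered == list(range(41))
```

For example, `group_of(25)` falls in the time-based range, while `group_of(41)` identifies the label column.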
A table of all 41 features, with a brief explanation of each, can be found in Annex A: table
of the NSL-KDD features.
The features of the NSL-KDD dataset hold different kinds of data, so the data must be
pre-processed before correlations can be found and investigated, or before the data is fed
into a model. More specifically, there are four types of data in the dataset: categorical
(columns #1, #2, #3, #41), binary (columns #6, #11, #13, #19 − #21), discrete (columns #7, #8,
#14, #22 − #40, #42) and continuous (columns #0, #4, #5, #9, #10, #12, #15 − #18). Binary,
discrete, and continuous values, being numerical, can be left as they are, but the
categorical values, being stored as strings, are not suitable in that form for further
analysis or for finding relationships among the data.
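As an illustration of what turning the string-valued columns into numbers can look like, the sketch below applies one-hot encoding to columns #1, #2 and #3. One-hot encoding is only one common option and is an assumption here, not necessarily the encoding used later in this work; the helper names and the shortened toy rows (first five columns only) are made up for the example.

```python
# Categorical feature columns named in the text (the label, #41, is left out).
CATEGORICAL_COLUMNS = (1, 2, 3)


def fit_categories(records, columns=CATEGORICAL_COLUMNS):
    """Collect the sorted set of distinct string values seen in each column."""
    return {c: sorted({r[c] for r in records}) for c in columns}


def one_hot_encode(record, categories):
    """Replace each categorical value by a 0/1 vector; keep numbers as floats."""
    encoded = []
    for index, value in enumerate(record):
        if index in categories:
            encoded.extend(1.0 if value == known else 0.0
                           for known in categories[index])
        else:
            encoded.append(float(value))
    return encoded


# Toy rows, invented for illustration: columns #0 - #4 only.
toy_rows = [
    [0, "tcp", "http", "SF", 181],
    [0, "udp", "domain_u", "SF", 105],
    [2, "tcp", "ftp", "S0", 0],
]

categories = fit_categories(toy_rows)
encoded_rows = [one_hot_encode(row, categories) for row in toy_rows]
```

Each encoded row now contains only numbers: the protocol column expands into two positions, the service column into three, and the flag column into two, so every five-field toy row becomes a nine-element numeric vector.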
4.2.1. Categorical features
The categorical variables in the dataset, apart from the attack label column (#41) that has
already been investigated, describe three features of the connection: protocol (col. #1),
service (col. #2) and flags (col. #3). In this section, each of these features in the dataset is
going to be briefly analysed, to get an idea of what the network traffic looks like, what is more