4.2. The features of NSL-KDD

Apart from the label, which is found in the 42nd column (#41), and the last column, which
holds the severity/difficulty level (#42, removed during the pre-processing phase), the
features represent the information that the IDS uses to determine the type of traffic and
assign the appropriate label to it. These features can be categorised by the information
they contain and by the way it is extracted from the packets arriving at the network.

There are four categories by which the features are grouped [25]:

Intrinsic features (columns #0 − #8) contain information taken from the packet header,
without needing to inspect the payload, and hold the basic information about the incoming
packet.

Content features (columns #9 − #21) hold connection-level information about the incoming
packets and give the IDS access to the payload.

Time-based features (columns #22 − #30) are computed from traffic analysed over a 2-second
window and mostly contain rates and counts (e.g., of connection attempts, port numbers, or
connections that activate certain flags) rather than information from the packets themselves.

Lastly, host-based features (columns #31 − #40) are similar to the time-based features, but
instead of analysing traffic inside the 2-second window, they gather information over a series
of connections (e.g., the percentage of connections with the same destination host address or
port number), in order to assess attacks that span longer than the 2-second window.
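
The grouping above can be written down as a small lookup, which is handy when selecting one category of features at a time. The following is a minimal sketch in Python: the column ranges are taken from the text, while the names `FEATURE_GROUPS`, `group_of` and `split_record` are hypothetical and introduced only for illustration.

```python
# Column ranges as stated in the text (0-based column indices).
FEATURE_GROUPS = {
    "intrinsic": range(0, 9),     # columns #0 - #8
    "content": range(9, 22),      # columns #9 - #21
    "time_based": range(22, 31),  # columns #22 - #30
    "host_based": range(31, 41),  # columns #31 - #40
}
LABEL_COLUMN = 41       # attack label
DIFFICULTY_COLUMN = 42  # severity/difficulty level, removed in pre-processing


def group_of(column):
    """Return the name of the category that a column index belongs to."""
    if column == LABEL_COLUMN:
        return "label"
    if column == DIFFICULTY_COLUMN:
        return "difficulty"
    for name, columns in FEATURE_GROUPS.items():
        if column in columns:
            return name
    raise ValueError(f"column index out of range: {column}")


def split_record(record):
    """Split one 43-field record into per-category lists of feature values."""
    return {
        name: [record[i] for i in columns]
        for name, columns in FEATURE_GROUPS.items()
    }
```

For example, `group_of(25)` returns `"time_based"`, and `split_record` applied to a parsed row of the dataset returns the 41 feature values arranged by category, leaving out the label and the difficulty level.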

A table of all the 41 features, with a brief explanation of each, can be found in Annex A: table
of the NSL-KDD features.

The features of the NSL-KDD dataset hold different kinds of data, so the data must be
pre-processed before correlations can be investigated or the data fed into a model. More
specifically, there are four types of data in the dataset: categorical (columns #1, #2, #3,
#41), binary (columns #6, #11, #13, #19 − #21), discrete (columns #7, #8, #14, #22 − #40,
#42) and continuous (columns #0, #4, #5, #9, #10, #12, #15 − #18). Binary, discrete, and
continuous values, being numerical, can be left as they are, but the categorical values,
being in string form, are not suitable for further analysis or for finding relationships
among the data.
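
To illustrate why the string-valued columns need special handling, the sketch below records the type of each column as listed above and converts the three categorical feature columns into numerical indicator columns. One-hot encoding is used here only as one common option and is an assumption; this section does not state which encoding is applied. The function names and the toy rows (shortened to five fields, with made-up values) are hypothetical.

```python
# Column indices per data type, copied from the text (0-based, #0 to #42).
COLUMN_TYPES = {
    "categorical": [1, 2, 3, 41],
    "binary": [6, 11, 13, 19, 20, 21],
    "discrete": [7, 8, 14, *range(22, 41), 42],
    "continuous": [0, 4, 5, 9, 10, 12, 15, 16, 17, 18],
}

# Categorical feature columns to encode: protocol, service, flag.
# The label (#41) is categorical too, but it is not handled here.
CATEGORICAL_FEATURES = (1, 2, 3)


def fit_vocabularies(records, columns=CATEGORICAL_FEATURES):
    """Collect the sorted distinct string values seen in each categorical column."""
    return {c: sorted({r[c] for r in records}) for c in columns}


def one_hot_encode(record, vocabularies):
    """Replace each categorical field with a 0/1 indicator per known value."""
    encoded = []
    for i, value in enumerate(record):
        if i in vocabularies:
            # A value not present in the vocabulary yields all zeros.
            encoded.extend(1 if value == v else 0 for v in vocabularies[i])
        else:
            encoded.append(value)
    return encoded


# Toy rows (illustrative only): duration, protocol, service, flag, src_bytes.
toy_rows = [
    [0, "tcp", "http", "SF", 181],
    [0, "udp", "private", "SF", 105],
    [0, "tcp", "private", "S0", 0],
]
vocabularies = fit_vocabularies(toy_rows)
encoded_rows = [one_hot_encode(row, vocabularies) for row in toy_rows]
```

In this sketch the vocabularies are meant to be collected from the training data only, so a value that first appears in the test data is encoded as all zeros rather than raising an error. The numerical columns pass through unchanged, in line with the text.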

4.2.1. Categorical features

The categorical variables found in the dataset, apart from the attack label column (#41) that
has already been investigated, have to do with three features of the connection: protocol (col.
#1), service (col. #2) and flags (col. #3). In this section, each of these features in the dataset is
going to be briefly analysed, to get an idea of what the network traffic looks like, what is more