4.3.3. X and Y components, scaling the data
With the correlation of the NSL-KDD dataset calculated, we have seen how the target
labels relate to the features in each classification scenario, i.e. which features most
affect the model's prediction. The next step is to split the dataframes into X and Y components,
separating the features (columns 0 − 40) from the target labels (column 41), so that they
can be used as input and output respectively for the models we use.
New instances of the dataframes are created by copying the feature columns of the original
training and test dataframes into x-train/x-test dataframe variables, and the last column
(#41) into y-train/y-test variables (columns #19 and #42 are deleted). This is repeated for all
the different classification cases, even though the X component is essentially the same in all of
them. These variables are created from the dataframes as they were before one-hot encoding,
since it is easier to load them without having to account for the extra columns the encoding
creates; the encoding therefore has to be applied again, but only on the X component
(the categorical columns 1, 2 and 3), leaving the Y dataframe a single categorical column
of traffic-type labels.
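The split and re-encoding described above can be sketched as follows. This is a minimal illustration with a toy dataframe standing in for the NSL-KDD data; the column names (`protocol_type`, `service`, `flag`, `label`) are the usual NSL-KDD names, but the values and frame shape here are invented for the example.

```python
import pandas as pd

# Toy stand-in for the pre-encoding NSL-KDD dataframe; in the real
# dataset columns 1-3 are the categorical protocol_type, service, flag.
train = pd.DataFrame({
    "duration":      [0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp"],
    "service":       ["http", "dns", "ftp"],
    "flag":          ["SF", "SF", "S0"],
    "label":         ["normal", "normal", "anomaly"],
})

# Split the features (all but the last column) from the target label.
x_train = train.iloc[:, :-1].copy()
y_train = train.iloc[:, -1].copy()

# One-hot encode only the categorical feature columns; pandas expands
# each one into indicator columns of 1s and 0s, one per unique label,
# placed after the remaining numeric columns.
x_train = pd.get_dummies(x_train, columns=["protocol_type", "service", "flag"])
```

The Y component is left untouched, so it stays a single column of class labels ready to be fed to the model as the target.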
Next comes the alignment of the training and test arrays: after one-hot encoding, the training set
is a 125973 × 121 (records × columns) array, while the test set is 22544 × 115. As they are, the
two arrays cannot be used by the same model; they must be uniform, since the structure of the
model expects input of a specific shape.
As mentioned above, with the one-hot encoding algorithm used, the categorical columns are
moved to the end of the feature space and expanded into one indicator column of 1s and
0s per unique label. The alignment must therefore be performed in a way that does not disturb
the classes of each previously categorical variable. With the pandas method .align() it was possible
to add the extra columns at the right place (with the 'outer' join option) and fill them with the value
0 (with the fill_value option), so that the categories not found in one of the dataframes
were inserted among the rest of that categorical feature's columns while conforming to the one-
hot encoding scheme. As a result, the datasets became 125973 × 121 and
22544 × 121 respectively, with no NaN values in either of them.
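The alignment step can be sketched as below. The frames are toy stand-ins (far smaller than the real 125973 × 121 and 22544 × 115 arrays), built so that the test set is missing one category (`icmp`) that the training set has, which is exactly the situation `.align()` resolves.

```python
import pandas as pd

# Toy one-hot encoded frames: the test set lacks the "icmp" category,
# so its frame has fewer columns than the training frame.
x_train = pd.DataFrame({
    "duration":           [0, 1],
    "protocol_type_icmp": [0, 1],
    "protocol_type_tcp":  [1, 0],
})
x_test = pd.DataFrame({
    "duration":          [3],
    "protocol_type_tcp": [1],
})

# The 'outer' join inserts any column missing on either side at the
# right place, and fill_value=0 marks the absent category with 0s,
# which keeps the one-hot semantics and leaves no NaN values behind.
x_train, x_test = x_train.align(x_test, join="outer", axis=1, fill_value=0)
```

After the call, both frames share the same column set in the same order, so the same model can consume either one.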
After the alignment of the training and test datasets, the dataframes were scaled using the
StandardScaler from the sklearn.preprocessing library [29]. Scaling the numerical data is an
important step in pre-processing, as their different ranges and dimensions introduce bias when
the weights of the model are calculated. The range of the different features, seen in Annex A
(Table ), spans 0 − 1,379,963,888 in the case of column 4, but only 0/1 for the binary variables,
or even hundredths for the float variables that represent rates. However few the
datapoints with values this far apart may be, the model needs to be trained and validated on data
of similar scale, so that the relationships between each record and each feature can be better
recognised.
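The scaling step can be sketched as follows. The array values are invented for illustration (one column mimicking the huge byte-count range, one binary column); the key point is that the scaler is fitted on the training data only, and the same fitted transform is then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features with wildly different ranges, e.g. a byte count next
# to a 0/1 indicator variable.
x_train = np.array([[0.0,         1.0],
                    [1_000_000.0, 0.0],
                    [500_000.0,   1.0]])

# Fit on the training set: each column is centred to mean 0 and
# scaled to unit variance using the training statistics.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)

# The test set is transformed with the *training* mean and standard
# deviation, so both sets live on the same scale.
x_test = np.array([[250_000.0, 0.0]])
x_test_scaled = scaler.transform(x_test)
```

Reusing the fitted scaler on the test set is important: fitting a second scaler on the test data would leak its statistics into evaluation and put the two sets on subtly different scales.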