
The standard scaler standardizes each feature independently: it subtracts the feature's mean,
so that the scaled feature has mean 0, and divides by its standard deviation, so that the scaled
feature has unit variance (variance = 1). Unit variance fixes the spread of the data, not its
range, so scaled values can still fall outside [−1, 1]. The scaling is calculated as:

$$x' = \frac{x - \mu}{s}$$

Equation 6: standard scaling equation

Where 𝑥′ is the new scaled data, 𝑥 is the data to be scaled, 𝜇 is the mean of the training samples
and 𝑠 is the standard deviation of the training samples, both computed separately for each feature.
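As a minimal sketch of Equation 6, the scaling can be reproduced by hand with NumPy. The toy matrix `X_train` below is hypothetical and is not taken from the project's dataset; the population standard deviation (`ddof=0`) is assumed, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy training matrix: 4 samples (rows), 2 features (columns).
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, 40.0]])

# Per-feature mean (mu) and standard deviation (s), computed column-wise.
mu = X_train.mean(axis=0)
s = X_train.std(axis=0)  # ddof=0, i.e. population standard deviation

# Equation 6 applied element-wise: x' = (x - mu) / s
X_scaled = (X_train - mu) / s

# Each scaled feature should now have mean ~0 and variance ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.var(axis=0))
```

Both features end up on the same scale even though the second one was originally ten times larger than the first.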

For the training data, standard scaling occurs in two steps: the fitting phase and the
transformation phase. The fit(data) function computes the mean and standard deviation of
each (numeric) feature, while transform(data) is called after fitting to scale the data
using the parameters computed by fit. This way, the scaler is trained (calculates 𝜇, 𝑠) on
the training set and then, with those parameters fixed, the same transformation is applied to
the test data. It is therefore important to apply both fit and transform to the training data,
but only transform to the test data, so that the model is not biased by information from the
test data. Although the training and test sets may in theory have very similar means and
standard deviations, the model should not be influenced by the distribution of the test set
while it is training. In addition, the test data should be scaled according to the training
set's distribution parameters, so that any divergence from the training data is preserved and
the model is truly tested on unseen records.
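The fit-on-train, transform-on-test pattern described above can be sketched with scikit-learn's `StandardScaler`. This is an illustrative example under assumptions, not the project's actual code: the arrays `X_train` and `X_test` are hypothetical toy data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 2 numeric features, already split into train and test.
X_train = np.array([[0.0, 100.0],
                    [2.0, 300.0],
                    [4.0, 500.0]])
X_test = np.array([[1.0, 200.0],
                   [6.0, 700.0]])

scaler = StandardScaler()

# Fit (compute mu and s) and transform in one call, on the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Only transform the test data: it is scaled with the training mu and s.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # per-feature training means
print(scaler.scale_)   # per-feature training standard deviations
print(X_test_scaled)
```

Because the test set is scaled with the training parameters, its scaled features need not have mean 0 or variance 1; that difference is exactly the divergence from the training distribution that should remain visible to the model.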

After aligning and scaling the data, the dataset looks like this:

[[-0.11024922 -0.0076786  -0.00491864 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00773737 -0.00491864 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00776224 -0.00491864 ... -0.01972622 -1.21190076 -0.04643159]
 ...
 [-0.11024922 -0.00738219 -0.00482315 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00776224 -0.00491864 ... -0.01972622 -1.21190076 -0.04643159]
 [-0.11024922 -0.00773652 -0.00491864 ... -0.01972622  0.82515007 -0.04643159]]

Figure 29: training dataset (multiclass) after standard scaling

Now that the data has been scaled, the training and test datasets are ready to be fed into
the models for the three classification scenarios (multiclass, binary, and 4-class
classification), so that intrusion detection performance can be measured.

To sum up, this section described the whole pre-processing phase of this project. We saw how
the data was loaded into dataframes and how the three kinds of classification were created,
by changing the traffic type labels in column #41 into ‘normal’ and ‘abnormal’ in the case of
binary classification, or into ‘normal’ plus the ‘DoS’, ‘Probe’, ‘U2R’ and ‘R2L’ attack types in
the case of the 4 attack classes. Then, the distribution of the categorical features that describe
the type of traffic was analysed, and the categorical variables of the dataframes were encoded
into numerical values via one-hot encoding. Correlation calculations showed that the features
most strongly correlated with the type of traffic are the time- and host-based ones. After that,
the datasets were split into X and