The standard scaler standardizes each feature: it subtracts the feature's mean and divides by
its standard deviation, so that the scaled feature has a mean of 0 and a variance of 1 (unit
variance). The scaling is calculated as:
$$x' = \frac{x - \mu}{\sigma}$$
Equation 6: standard scaling equation
Where $x'$ is the new scaled data, $x$ is the data to be scaled, $\mu$ is the mean of the training samples
and $\sigma$ is the standard deviation.
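As a small worked illustration of the scaling equation, consider a single feature with the made-up values 2, 4, 6 and 8 (not taken from the project's dataset). The example assumes the population standard deviation (division by the number of samples), which the text does not state explicitly:

```latex
\begin{align*}
\mu &= \frac{2 + 4 + 6 + 8}{4} = 5 \\
\sigma &= \sqrt{\frac{(-3)^2 + (-1)^2 + 1^2 + 3^2}{4}} = \sqrt{5} \approx 2.236 \\
x' &= \frac{x - \mu}{\sigma}: \quad
2 \mapsto -1.342, \quad 4 \mapsto -0.447, \quad 6 \mapsto 0.447, \quad 8 \mapsto 1.342
\end{align*}
```

The scaled values have a mean of 0 and a variance of (1.8 + 0.2 + 0.2 + 1.8) / 4 = 1, while their range (about 2.68) is not constrained to 1.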
Standard scaling of the training data occurs in two steps: the fitting phase and the
transformation phase. The fit(data) function computes the mean and standard deviation of
each (numeric) feature, while transform(data) is called after fitting to scale the data using
the parameters computed by fit. This way, the scaler is trained (calculates $\mu$, $\sigma$) on
the training set and then, with those parameters fixed, the same transformation is applied to
the test data. Thus, it is important to apply both fit and transform to the training data, but
only transform to the test data, so that the model is not biased by information from the test
data. While in theory the training and test sets might have very similar means and standard
deviations, we shouldn't let the model be influenced by the test set's distribution while it is
training. Also, the test data should be scaled according to the training set's distribution
parameters, so that any divergence from the training data is preserved and the model is truly
tested on unseen records.
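The fit-on-train, transform-on-test pattern described above can be sketched as follows. This is a minimal illustration and not the project's code: the class name SimpleStandardScaler and the toy values are hypothetical, and the population standard deviation is assumed. A library scaler exposing fit and transform would be used in the same way.

```python
from statistics import fmean, pstdev


class SimpleStandardScaler:
    """Hypothetical minimal scaler that mirrors the fit/transform split."""

    def fit(self, rows):
        # Learn one mean and one standard deviation per feature (column).
        columns = list(zip(*rows))
        self.mean_ = [fmean(col) for col in columns]
        # Fall back to 1.0 for constant columns to avoid division by zero.
        self.scale_ = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        # Apply x' = (x - mu) / sigma with the parameters learned in fit().
        return [
            [(x - mu) / sigma for x, mu, sigma in zip(row, self.mean_, self.scale_)]
            for row in rows
        ]


# Toy data (illustrative values only).
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
X_test = [[4.0, 40.0]]

scaler = SimpleStandardScaler().fit(X_train)  # fit on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      # reuse the training mu and sigma
```

Because the test rows are scaled with the training parameters, the scaled training columns have zero mean while the scaled test values generally do not, which keeps any shift between the two sets visible to the model.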
After aligning and scaling the data, the dataset looks like this:
[[-0.11024922 -0.0076786  -0.00491864 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00773737 -0.00491864 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00776224 -0.00491864 ... -0.01972622 -1.21190076 -0.04643159]
 ...
 [-0.11024922 -0.00738219 -0.00482315 ... -0.01972622  0.82515007 -0.04643159]
 [-0.11024922 -0.00776224 -0.00491864 ... -0.01972622 -1.21190076 -0.04643159]
 [-0.11024922 -0.00773652 -0.00491864 ... -0.01972622  0.82515007 -0.04643159]]
Figure 29: training dataset (multiclass) after standard scaling
Now that the data has been scaled, the training and test datasets are ready to be fed into
models for the three classification scenarios (multiclass, binary, and 4-class classification),
so that intrusion detection performance can be measured.
To sum up, this section describes the whole pre-processing phase of this project. We saw how
the data was loaded into dataframes and how the three kinds of classification were created,
by changing the traffic type labels in column #41 into "normal" and "abnormal" in the case of
binary classification, or into "normal", "DoS", "Probe", "U2R" and "R2L" in the case of the 4
attack classes. Then, the distribution of the categorical features, which describe the type of
traffic, was analysed, and the categorical variables of the dataframes were encoded into
numerical values via one-hot encoding. Correlation calculations showed that the features that
most strongly affect the type of traffic are the time- and host-based ones. After that, the datasets were split into X and