The correlation ranges from −1 to +1. As it approaches +1, the two columns have a directly
proportional relationship (when A ↑ then B ↑). Conversely, as it approaches −1, the two
columns have an inversely proportional relationship (when A ↑ then B ↓). When two features
have no relationship to each other, so that a change in one does not influence the other, the
correlation is 0. For the visualisation in the following graphs, a palette was chosen that
highlights correlations the closer they are to +1 and −1, while correlations close to 0 are
dark.
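The three cases above can be illustrated with a small sketch. It assumes the Pearson coefficient, which the text does not name explicitly (it is the default of pandas' `DataFrame.corr`); the helper `pearson` and the toy lists are hypothetical and only serve the illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in a]    # B increases when A increases
falling = [10 - 3 * x for x in a]  # B decreases when A increases
unrelated = [1, -1, 0, -1, 1]      # no linear trend with respect to A

r_up = pearson(a, rising)       # 1.0
r_down = pearson(a, falling)    # -1.0
r_none = pearson(a, unrelated)  # 0.0
print(r_up, r_down, r_none)
```

Real features rarely reach exactly ±1 or 0; the values in the heatmaps fall somewhere in between.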
The correlation is calculated pairwise, so the result of this metric is a matrix with
dimensions #columns × #columns. This is why the diagonal highlighted in all the graphs has
a correlation equal to 1: it is the correlation between each column and itself.
Because one-hot encoding produces a different number of columns for each dataset, the
correlation matrices also differ in size, as shown in Table 6 below:
Table 6: Correlation matrix dimensions
Dataset                       Dimensions
Multiclass training set       144 × 144
Multiclass test set           153 × 153
Binary class training set     123 × 123
Binary test set               117 × 117
4-class training set          126 × 126
4-class test set              120 × 120
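The effect of one-hot encoding on the matrix size can be reproduced in miniature. The sketch below assumes pandas is used for encoding and correlation; the column names and values are hypothetical toy data, not taken from the datasets in Table 6.

```python
import pandas as pd

# Hypothetical toy data: the test set contains a service value ("ftp")
# that never occurs in the training set.
train = pd.DataFrame({
    "duration": [0, 2, 5, 1],
    "service": ["http", "smtp", "http", "smtp"],
})
test = pd.DataFrame({
    "duration": [3, 0, 4, 1],
    "service": ["http", "ftp", "smtp", "ftp"],
})

# One-hot encoding: every distinct categorical value becomes its own column.
train_enc = pd.get_dummies(train, columns=["service"], dtype=float)
test_enc = pd.get_dummies(test, columns=["service"], dtype=float)

# Pairwise correlation: one row and one column per encoded feature.
train_corr = train_enc.corr()
test_corr = test_enc.corr()

print(train_corr.shape)  # (3, 3)
print(test_corr.shape)   # (4, 4)
```

Each set is encoded independently here, so a category that appears in only one of them changes the number of columns, and with it the dimensions of the correlation matrix. The diagonal of both matrices is 1, since each column is compared with itself.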
Using the seaborn Python library, the matrices were visualised as heatmaps, in which the hue
becomes lighter the closer the correlation is to ±1 and darker the closer it is to 0.
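A heatmap of this kind might be produced roughly as follows. This is a sketch under assumptions: the exact palette used for the figures is not stated, so `sns.diverging_palette(..., center="dark")` is only one option that is dark near 0 and lighter towards ±1, and the dataframe `df` is random toy data standing in for an encoded dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for one of the encoded dataframes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
df["f5"] = 0.9 * df["f0"] + rng.normal(scale=0.3, size=200)  # strongly positive
df["f4"] = -df["f1"]                                         # perfectly negative

corr = df.corr()

# Diverging palette with a dark centre: values near 0 are dark,
# values near -1 and +1 are lighter.
cmap = sns.diverging_palette(240, 10, center="dark", as_cmap=True)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap=cmap, square=True, ax=ax)
ax.set_title("Correlation heatmap (toy data)")
# Call fig.savefig("correlation_heatmap.png") or plt.show() to view the result.
```

Fixing `vmin=-1`, `vmax=1` and `center=0` keeps the colour scale identical across datasets, so heatmaps of different sizes remain comparable.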
As expected, since all the features apart from the traffic type are the same, all the
dataframes behave the same way, with higher intercorrelation among the time-based and
host-based features (columns #21 – #40). These are also the features that seem to be
influenced the most by the categorical features, as can be seen in the bottom left of the
heatmaps. The correlation heatmaps also show that in the multiclass set (Figure 23,
Figure 24) the correlation in the last columns, where the encoded traffic types are, is lower
than in both the binary and the 4-class classifications; this is expected, given how the
types of classification compare. What is interesting in the multiclass case is that the
attacks show no correlation to each other, whether they belong to the same category or to other
classes. In fact, the correlation between different classes of attacks in the 4-class classification
(Figure 27, Figure 28) seems to be higher than that between the separate multiclass attacks
that belong to the same class.
Thanks to the correlation between the features, especially the correlation between the traffic
type and the other features of each record, we can understand what makes the model classify
something as a DoS or a U2R attack, or as normal traffic.
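One way to read such relationships off a correlation matrix is to take the column of an encoded traffic type and rank the remaining features by the strength of their correlation with it. The sketch below assumes pandas; the column names (`count`, `serror_rate`, `duration`, `label_dos`) and the values are hypothetical toy data, not results from the datasets discussed here.

```python
import pandas as pd

# Hypothetical toy records; names and values are illustrative only.
df = pd.DataFrame({
    "count":       [200, 180, 3, 5, 220, 2],
    "serror_rate": [1.0, 0.9, 0.0, 0.1, 1.0, 0.0],
    "duration":    [0, 0, 12, 30, 0, 7],
    "label_dos":   [1, 1, 0, 0, 1, 0],  # one-hot encoded traffic type
})

corr = df.corr()

# Correlation of every feature with the encoded class, strongest first
# (ranked by absolute value, so strong negative correlations also surface).
ranking = (
    corr["label_dos"]
    .drop("label_dos")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(ranking)
```

A ranking like this indicates which features move together with a traffic type; it describes the data rather than the internals of any particular model.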