case of the services feature (col. #2), where the training set contains 70 different services
and the test set contains 64. This mismatch was addressed in the later steps of the
pre-processing, when the subsets were prepared to be fed into the models.
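The text does not state how the mismatch between the training and test categories was resolved. A minimal sketch of one common approach, shown here as an assumption rather than as the method actually used, is to one-hot encode both sets and then reindex the test frame on the training columns. The toy frames and the names `train`, `test`, `train_enc`, `test_enc` and `test_aligned` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical data);
# column 2 plays the role of the services feature.
train = pd.DataFrame({0: [0, 2, 5], 2: ["http", "aol", "smtp"]})
test = pd.DataFrame({0: [1, 3], 2: ["http", "smtp"]})

# One-hot encode the categorical column in each set separately.
train_enc = pd.get_dummies(train, columns=[2])
test_enc = pd.get_dummies(test, columns=[2])

# Reindex the test frame on the training columns: services that never occur
# in the test set become all-zero columns, and the column order is identical.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

With this alignment, both frames expose the same feature columns, so a model fitted on the training set can be applied to the test set without a shape mismatch.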
Table 5: Labels of the dataframe columns before and after one-hot encoding

Labels of columns before one-hot encoding (in total: 42):
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]

Labels of columns in the multiclass training dataframe, after one-hot encoding (in total: 144):
[0, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
'1_icmp', '1_tcp', '1_udp',
'2_IRC', '2_X11', '2_Z39_50', '2_aol', '2_auth', '2_bgp', '2_courier',
'2_csnet_ns', '2_ctf', '2_daytime', '2_discard', '2_domain', '2_domain_u',
'2_echo', '2_eco_i', '2_ecr_i', '2_efs', '2_exec', '2_finger', '2_ftp',
'2_ftp_data', '2_gopher', '2_harvest', '2_hostnames', '2_http', '2_http_2784',
'2_http_443', '2_http_8001', '2_imap4', '2_iso_tsap', '2_klogin', '2_kshell',
'2_ldap', '2_link', '2_login', '2_mtp', '2_name', '2_netbios_dgm',
'2_netbios_ns', '2_netbios_ssn', '2_netstat', '2_nnsp', '2_nntp', '2_ntp_u',
'2_other', '2_pm_dump', '2_pop_2', '2_pop_3', '2_printer', '2_private',
'2_red_i', '2_remote_job', '2_rje', '2_shell', '2_smtp', '2_sql_net', '2_ssh',
'2_sunrpc', '2_supdup', '2_systat', '2_telnet', '2_tftp_u', '2_tim_i',
'2_time', '2_urh_i', '2_urp_i', '2_uucp', '2_uucp_path', '2_vmnet', '2_whois',
'3_OTH', '3_REJ', '3_RSTO', '3_RSTOS0', '3_RSTR', '3_S0', '3_S1', '3_S2',
'3_S3', '3_SF', '3_SH',
'41_back', '41_buffer_overflow', '41_ftp_write', '41_guess_passwd', '41_imap',
'41_ipsweep', '41_land', '41_loadmodule', '41_multihop', '41_neptune',
'41_nmap', '41_normal', '41_perl', '41_phf', '41_pod', '41_portsweep',
'41_rootkit', '41_satan', '41_smurf', '41_spy', '41_teardrop',
'41_warezclient', '41_warezmaster']

Labels of columns in the binary classification training dataframe, after one-hot encoding (in total: 123):
[0, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
'1_icmp', '1_tcp', '1_udp',
'2_IRC', '2_X11', '2_Z39_50', '2_aol', '2_auth', '2_bgp', '2_courier',
'2_csnet_ns', '2_ctf', '2_daytime', '2_discard', '2_domain', '2_domain_u',
'2_echo', '2_eco_i', '2_ecr_i', '2_efs', '2_exec', '2_finger', '2_ftp',
'2_ftp_data', '2_gopher', '2_harvest', '2_hostnames', '2_http', '2_http_2784',
'2_http_443', '2_http_8001', '2_imap4', '2_iso_tsap', '2_klogin', '2_kshell',
'2_ldap', '2_link', '2_login', '2_mtp', '2_name', '2_netbios_dgm',
'2_netbios_ns', '2_netbios_ssn', '2_netstat', '2_nnsp', '2_nntp', '2_ntp_u',
'2_other', '2_pm_dump', '2_pop_2', '2_pop_3', '2_printer', '2_private',
'2_red_i', '2_remote_job', '2_rje', '2_shell', '2_smtp', '2_sql_net', '2_ssh',
'2_sunrpc', '2_supdup', '2_systat', '2_telnet', '2_tftp_u', '2_tim_i',
'2_time', '2_urh_i', '2_urp_i', '2_uucp', '2_uucp_path', '2_vmnet', '2_whois',
'3_OTH', '3_REJ', '3_RSTO', '3_RSTOS0', '3_RSTR', '3_S0', '3_S1', '3_S2',
'3_S3', '3_SF', '3_SH',
'41_abnormal', '41_normal']

Labels of columns in the 4-class classification training dataframe, after one-hot encoding (in total: 126):
[0, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
'1_icmp', '1_tcp', '1_udp',
'2_IRC', '2_X11', '2_Z39_50', '2_aol', '2_auth', '2_bgp', '2_courier',
'2_csnet_ns', '2_ctf', '2_daytime', '2_discard', '2_domain', '2_domain_u',
'2_echo', '2_eco_i', '2_ecr_i', '2_efs', '2_exec', '2_finger', '2_ftp',
'2_ftp_data', '2_gopher', '2_harvest', '2_hostnames', '2_http', '2_http_2784',
'2_http_443', '2_http_8001', '2_imap4', '2_iso_tsap', '2_klogin', '2_kshell',
'2_ldap', '2_link', '2_login', '2_mtp', '2_name', '2_netbios_dgm',
'2_netbios_ns', '2_netbios_ssn', '2_netstat', '2_nnsp', '2_nntp', '2_ntp_u',
'2_other', '2_pm_dump', '2_pop_2', '2_pop_3', '2_printer', '2_private',
'2_red_i', '2_remote_job', '2_rje', '2_shell', '2_smtp', '2_sql_net', '2_ssh',
'2_sunrpc', '2_supdup', '2_systat', '2_telnet', '2_tftp_u', '2_tim_i',
'2_time', '2_urh_i', '2_urp_i', '2_uucp', '2_uucp_path', '2_vmnet', '2_whois',
'3_OTH', '3_REJ', '3_RSTO', '3_RSTOS0', '3_RSTR', '3_S0', '3_S1', '3_S2',
'3_S3', '3_SF', '3_SH',
'41_DoS', '41_Probe', '41_R2L', '41_U2R', '41_normal']
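The label pattern in Table 5 (untouched numeric columns first, then indicator columns named after the original column index and the category, such as '1_tcp') is what `pandas.get_dummies` produces when it is applied to a dataframe with integer column labels. The sketch below illustrates this on a toy frame; that the encoding was done with exactly this call is an assumption, and the sample values are invented for illustration.

```python
import pandas as pd

# Toy records with integer column labels (hypothetical values).
# Columns 1, 2, 3 and 41 stand for protocol, service, flag and traffic label.
df = pd.DataFrame({
    0: [0, 0, 2],
    1: ["tcp", "udp", "icmp"],
    2: ["http", "domain_u", "eco_i"],
    3: ["SF", "SF", "REJ"],
    4: [181, 105, 8],
    41: ["normal", "neptune", "normal"],
})

# Encode only the categorical columns; each category becomes a new column
# named "<original column label>_<category>".
encoded = pd.get_dummies(df, columns=[1, 2, 3, 41])

print(list(encoded.columns))
```

The non-encoded columns keep their integer labels and stay at the front, while the indicator columns are appended per source column in sorted category order, which mirrors the ordering shown in the table.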
4.3.2. Correlation
The encoded dataframe is now ready to have its correlation measured, since all the variables
can be taken into account. The pandas DataFrame.corr method [28] is used to compute the pairwise
correlation of all the columns in the dataset, so that the relationships between the features
can be explored. If the categorical variables had not been converted to numerical ones, they
would not have been considered in the calculations, so even the most interesting label, the type
of traffic, which is the target of the model, would not have been correlated with the features
of the dataset.
Correlation was calculated for all the dataframes, so there are six figures in total, for multiclass
(Figure
, Figure 26) and 4-class (Figure
classification.
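As a minimal sketch of the correlation step described in this section, the example below calls `DataFrame.corr` on a small encoded frame and ranks the columns by the strength of their correlation with one label column. The toy data and the names `encoded`, `corr` and `label_corr` are hypothetical and only illustrate the call; they are not taken from the dataset.

```python
import pandas as pd

# Toy encoded dataframe (hypothetical values): two numeric features
# plus the two indicator columns of a binary traffic label.
encoded = pd.DataFrame({
    0: [0, 1, 2, 3, 4, 5],
    4: [10, 12, 9, 40, 42, 41],
    "41_abnormal": [0, 0, 0, 1, 1, 1],
    "41_normal": [1, 1, 1, 0, 0, 0],
})

# Pairwise correlation between all columns (Pearson is the default method).
corr = encoded.corr()

# Correlation of every other column with one label column,
# ordered by absolute value, strongest first.
label_corr = (
    corr["41_abnormal"]
    .drop("41_abnormal")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(corr.shape)
print(label_corr)
```

Depending on the pandas version, columns that still hold strings are either silently skipped by `corr` or cause an error unless `numeric_only=True` is passed, which is why the encoding step has to come first.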