Feature engineering

We have now explored the data and removed everything that was not usable. In the next step we use domain knowledge to try to create additional features.

The goal of feature engineering is to make your data better suited to the problem at hand.

Consider measures of "apparent temperature", such as the heat index and the wind chill. These quantities try to capture the temperature as perceived by humans, based on air temperature, humidity, and wind speed, all of which we can measure directly. You can think of an apparent temperature as the result of a kind of feature engineering: an attempt to make the observed data more relevant to what we actually care about, namely how it really feels outside.
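As a rough illustration of this idea, the sketch below derives a wind chill feature from air temperature and wind speed. It assumes the US National Weather Service wind chill formula (temperature in °F, wind speed in mph, intended for temperatures at or below 50 °F and wind speeds of at least 3 mph). The `weather` DataFrame, its column names, and the helper `wind_chill_f` are made up for this example.

```python
import pandas as pd

# Hypothetical measurements; the column names are assumptions for this sketch.
weather = pd.DataFrame(
    {
        "air_temp_f": [0.0, 20.0, 40.0],
        "wind_mph": [15.0, 10.0, 5.0],
    }
)


def wind_chill_f(temp_f, wind_mph):
    """NWS wind chill in °F (meant for temp <= 50 °F and wind >= 3 mph)."""
    v = wind_mph ** 0.16
    return 35.74 + 0.6215 * temp_f - 35.75 * v + 0.4275 * temp_f * v


# New feature: how cold it feels, derived from two directly measured columns.
weather["wind_chill_f"] = wind_chill_f(weather["air_temp_f"], weather["wind_mph"])
print(weather)
```

The engineered column combines two raw measurements into a single quantity that is closer to what a model of human comfort would need.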

Combining features

We can combine features mathematically, or apply a mathematical operation to a single feature. For example, if we have a length, we can compute an area from it, which may expose a linear relationship with the target variable.

from pathlib import Path

# Installed packages
import pandas as pd
import numpy as np
from pandas_profiling.utils.cache import cache_file

# Read car dataset
file_name = cache_file(
    "cars.csv",
    "https://raw.githubusercontent.com/sanithps98/Automobile-Dataset-Analysis/master/module_5_auto.csv",
)
autos = pd.read_csv(file_name)

autos.head()

	Unnamed: 0	Unnamed: 0.1	symboling	normalized-losses	make	aspiration	num-of-doors	body-style	drive-wheels	engine-location	...	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price	city-L/100km	horsepower-binned	gas
0	0	0	3	122	alfa-romero	std	two	convertible	rwd	front	...	9.0	111.0	5000.0	21	27	13495.0	11.190476	Medium	1
1	1	1	3	122	alfa-romero	std	two	convertible	rwd	front	...	9.0	111.0	5000.0	21	27	16500.0	11.190476	Medium	1
2	2	2	1	122	alfa-romero	std	two	hatchback	rwd	front	...	9.0	154.0	5000.0	19	26	16500.0	12.368421	Medium	1
3	3	3	2	164	audi	std	four	sedan	fwd	front	...	10.0	102.0	5500.0	24	30	13950.0	9.791667	Medium	1
4	4	4	2	164	audi	std	four	sedan	4wd	front	...	8.0	115.0	5500.0	18	22	17450.0	13.055556	Medium	1

5 rows × 31 columns

autos["stroke_ratio"] = autos.stroke / autos.bore

autos[["stroke", "bore", "stroke_ratio"]].head()

	stroke	bore	stroke_ratio
0	2.68	3.47	0.772334
1	2.68	3.47	0.772334
2	3.47	2.68	1.294776
3	3.40	3.19	1.065831
4	3.40	3.19	1.065831

We can go even further and apply known formulas. This is why domain knowledge is crucial here.
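One such known formula is the swept volume of an engine: the volume of a cylinder, π · (bore / 2)² · stroke, times the number of cylinders. The sketch below applies it to a small made-up `engines` DataFrame; the column name `num_cylinders` and the values are assumptions for illustration, not columns taken from the car dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical engines; bore and stroke in inches, cylinder count as an integer.
engines = pd.DataFrame(
    {
        "bore": [3.47, 3.19],
        "stroke": [2.68, 3.40],
        "num_cylinders": [4, 5],
    }
)

# Swept volume of one cylinder is pi * (bore / 2)^2 * stroke;
# multiplying by the number of cylinders gives the total displacement.
engines["displacement"] = (
    np.pi
    * (0.5 * engines["bore"]) ** 2
    * engines["stroke"]
    * engines["num_cylinders"]
)
print(engines)
```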

Counting

We can combine several features into one more general feature, for example by counting how many of them are present.

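# `accidents` is assumed to be an already-loaded DataFrame with one
# boolean (or 0/1) column per roadway feature listed below.
roadway_features = ["Amenity", "Bump", "Crossing", "GiveWay",
    "Junction", "NoExit", "Railway", "Roundabout", "Station", "Stop",
    "TrafficCalming", "TrafficSignal"]
# Count how many roadway features are present for each accident
accidents["RoadwayFeatures"] = accidents[roadway_features].sum(axis=1)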


Grouping

We can group rows by a shared key and apply aggregations to each group.

Geospatial features

For example, the average sale price of houses within a region can be such a feature.

[Figure: geospatial features]
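A minimal sketch of such a regional average, assuming a made-up `houses` DataFrame with hypothetical `region` and `sale_price` columns:

```python
import pandas as pd

# Hypothetical house sales; names and values are assumptions for this sketch.
houses = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "sale_price": [200_000, 240_000, 150_000, 170_000, 160_000],
    }
)

# transform("mean") computes the mean per region and broadcasts it back
# to every row, so the result lines up with the original DataFrame.
houses["region_mean_price"] = (
    houses.groupby("region")["sale_price"].transform("mean")
)
print(houses)
```

If the sale price is also the prediction target, it is usually safer to compute such averages on the training split only, so that information from the test rows does not leak into the feature.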

Temporal features

For example, we can group purchase transactions by week and add the week as a column.

[Figure: time-based features]
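The weekly grouping could look roughly like the sketch below. The `purchases` DataFrame and its column names are hypothetical; weeks here run from Monday to Sunday, which is the pandas default for the `"W"` period.

```python
import pandas as pd

# Hypothetical purchase transactions; names and values are assumptions.
purchases = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-01-02 09:15", "2023-01-04 17:30", "2023-01-09 11:00"]
        ),
        "amount": [20.0, 35.0, 12.5],
    }
)

# Map every timestamp to the first day of the week it falls in.
purchases["week"] = purchases["timestamp"].dt.to_period("W").dt.start_time

# Aggregate the transactions per week.
weekly = purchases.groupby("week", as_index=False)["amount"].sum()
print(weekly)
```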

# Standard Library Imports
from pathlib import Path

import pandas as pd
from pandas_profiling.utils.cache import cache_file

# Read the Titanic Dataset
file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
titanic = pd.read_csv(file_name)

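# Import only the PyCaret functions used below
from pycaret.classification import setup, get_config

# Let PyCaret preprocess the data; silent=True skips the interactive
# confirmation of the inferred column types
clf1 = setup(titanic, target = 'Survived', silent=True, session_id=123, log_experiment=False, experiment_name='Survived')

# Inspect the transformed feature matrix
X = get_config('X')
X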

	Description	Value
0	session_id	123
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	0: 0, 1: 1
4	Original Data	(891, 12)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	8
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 712)
12	Transformed Test Set	(268, 712)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	Survived
21	USI	fdd0
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

	Age	Fare	Pclass_1	Pclass_2	Pclass_3	Name_Abbott Mr. Rossmore Edward	Name_Abelson Mr. Samuel	Name_Abelson Mrs. Samuel (Hannah Wizosky)	Name_Adahl Mr. Mauritz Nils Martin	Name_Adams Mr. John	...	Cabin_E8	Cabin_F E69	Cabin_F33	Cabin_F4	Cabin_G6	Cabin_T	Cabin_not_available	Embarked_C	Embarked_Q	Embarked_S
0	22.000000	7.250000	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
1	38.000000	71.283302	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
2	26.000000	7.925000	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
3	35.000000	53.099998	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
4	35.000000	8.050000	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	27.000000	13.000000	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
887	19.000000	30.000000	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
888	29.885691	23.450001	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	1.0
889	26.000000	30.000000	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
890	32.000000	7.750000	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0

891 rows × 712 columns

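# One-hot encoding Name adds a sparse column per distinct name (see the
# Name_... columns above), so drop it before preprocessing again
titanic.drop(columns="Name",inplace=True)

# Encode Embarked as an ordinal feature with the given order instead of one-hot columns
clf1 = setup(titanic, target = 'Survived', silent=True, session_id=123, log_experiment=False, experiment_name='Survived', ordinal_features = {'Embarked' :['S', 'C', 'Q']})
X = get_config('X')
X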

	Description	Value
0	session_id	123
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	0: 0, 1: 1
4	Original Data	(891, 11)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	7
8	Ordinal Features	True
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 566)
12	Transformed Test Set	(268, 566)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	Survived
21	USI	f8ef
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

	Age	Fare	Embarked	Pclass_1	Pclass_2	Pclass_3	Sex_male	SibSp_0	SibSp_1	SibSp_2	...	Cabin_E49	Cabin_E67	Cabin_E68	Cabin_E8	Cabin_F E69	Cabin_F2	Cabin_F33	Cabin_G6	Cabin_T	Cabin_not_available
0	22.000000	7.250000	1.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
1	38.000000	71.283302	2.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	26.000000	7.925000	1.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
3	35.000000	53.099998	1.0	1.0	0.0	0.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	35.000000	8.050000	1.0	0.0	0.0	1.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	27.000000	13.000000	1.0	0.0	1.0	0.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
887	19.000000	30.000000	1.0	1.0	0.0	0.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
888	29.885691	23.450001	1.0	0.0	0.0	1.0	0.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
889	26.000000	30.000000	2.0	1.0	0.0	0.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
890	32.000000	7.750000	3.0	0.0	0.0	1.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0

891 rows × 566 columns

(WIP) Dimension reduction

Clustering makes it possible to relate features to one another.
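One possible way to do this, sketched here under assumptions, is to cluster rows on a few related columns and use the cluster label as a single new feature. The example uses scikit-learn's `KMeans`, which is a choice made for this sketch and not something prescribed by the text; the `cars` DataFrame and its values are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical cars; two related numeric features.
cars = pd.DataFrame(
    {
        "horsepower": [60, 65, 70, 200, 210, 220],
        "curb_weight": [1800, 1850, 1900, 3500, 3600, 3700],
    }
)

# Fit k-means on both columns and store the cluster label as one feature
# that summarizes them. random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cars["cluster"] = kmeans.fit_predict(cars[["horsepower", "curb_weight"]])
print(cars)
```

On real data the columns usually need to be scaled first, because k-means relies on distances and a column with large values would otherwise dominate.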
