Data Cleaning

Now that we have gotten to know the data, we can start cleaning it. We want the data we continue working with to be as clean as possible. We need to decide on a strategy for missing values and for duplicate records.

Missing values

A first step is always to remove irrelevant columns. After each cleaning step we can go back to EDA, look at the data again and try to understand it even better.

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
data = datasets.load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

# Drop a column we consider irrelevant; drop() already returns a new DataFrame
df = df.drop(columns=['ZN'])

Optionally, remove columns with too many missing values.

NA_val = df.isna().sum()
NA_val

CRIM       0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
dtype: int64

def na_filter(na, n_rows, threshold=0.4):
    """Return the columns whose fraction of missing values is below the threshold."""
    return [col for col in na.index if na[col] / n_rows < threshold]

# Keep only the columns that pass the threshold
df = df[na_filter(NA_val, df.shape[0])]
df.columns

Index(['CRIM', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

Since we have no missing values, all columns are kept here.

We can also handle missing data by filling in the empty cells with a value. Typically we look at the distribution of the remaining data in that column and use its mean. Keep in mind that by doing so you are introducing your own assumptions into the data. It is best to validate this missing-data strategy with domain experts.
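As a minimal sketch of the mean-imputation strategy described above, the cell below fills the gaps in a small made-up DataFrame. The column names and values are hypothetical and do not come from the Boston dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "rooms": [4.0, 6.0, np.nan, 8.0],
    "tax": [200.0, np.nan, 300.0, 400.0],
})

# Fill each missing cell with the mean of the observed values in its column
toy_imputed = toy.fillna(toy.mean())

print(toy_imputed)
```

Every imputed cell now carries the column mean, which is exactly the kind of assumption that should be checked with domain experts.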

Outliers

Using EDA, we check whether there are any strange values in our dataset and remove them if necessary.

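Suppose our dataset may only contain adults; then we can remove the records with an age below 18 (the code below also applies an upper bound of 75). Note that in the Boston dataset the AGE column is actually the proportion of owner-occupied units built before 1940, so this filter is purely illustrative.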

df = df[df['AGE'].between(18, 75)]
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

	CRIM	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
count	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000	217.000000
mean	0.601116	6.688848	0.050691	0.475648	6.402037	47.441935	5.109055	5.691244	320.110599	17.941014	384.657834	8.848986
std	1.975536	4.643262	0.219874	0.058566	0.604022	16.571513	1.989007	5.267965	110.757479	1.908030	42.826531	3.748327
min	0.006320	0.460000	0.000000	0.385000	4.973000	18.400000	1.986500	1.000000	187.000000	12.600000	3.650000	2.470000
25%	0.044170	3.410000	0.000000	0.433000	5.927000	33.100000	3.554900	4.000000	254.000000	16.600000	388.650000	6.050000
50%	0.081990	5.320000	0.000000	0.458000	6.312000	47.200000	4.779400	4.000000	289.000000	18.300000	393.370000	8.440000
75%	0.188360	8.140000	0.000000	0.515000	6.728000	61.800000	6.458400	5.000000	345.000000	19.200000	396.900000	11.410000
max	15.575700	25.650000	1.000000	0.655000	8.337000	75.000000	12.126500	24.000000	666.000000	22.000000	396.900000	21.140000

Rows with null values are best removed as well.

df = df.dropna(axis=0)

df.shape

(217, 12)

We can also let models detect the outliers for us. We will not go into further detail here, but the following are examples of ML models for outlier detection. When training these models you have to pass the expected fraction of outliers, so it is important to get a feeling for the outliers through EDA first.

Isolation Forest
k-Nearest Neighbors Detector

In an ML pipeline these outliers can then receive special attention. Typically these ML models are used when we cannot filter out the outliers feature by feature. We go into this a bit further in the part on model training.
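To make the "fraction of outliers" parameter concrete, here is a minimal sketch with scikit-learn's IsolationForest, where that fraction is called contamination. The data is synthetic and hypothetical: 200 points around the origin plus 10 points placed far away.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 200 "normal" points plus 10 points far from the bulk
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=10.0, size=(10, 2))
X_toy = np.vstack([inliers, outliers])

# contamination is the expected fraction of outliers (here 10 out of 210)
detector = IsolationForest(contamination=10 / 210, random_state=0)
labels = detector.fit_predict(X_toy)  # 1 = inlier, -1 = outlier

# Keep only the rows the model considers inliers
X_clean = X_toy[labels == 1]
print(f"flagged {np.sum(labels == -1)} of {len(X_toy)} rows as outliers")
```

The model flags roughly the fraction of rows you ask for, whether or not they are real outliers, which is why an EDA-based estimate of that fraction matters.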

Duplicates

We must avoid duplicate records at all times.

df=df.drop_duplicates()

df.shape

(217, 12)

Sometimes duplicates are not exact copies, and ML-based techniques are needed to deduplicate the data.

The dedupe library (dedupe.io) can help here. It uses ML to try to find duplicates.

Record Linkage
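Before reaching for ML-based deduplication, a cheaper first pass is to normalize the key columns and only then look for duplicates. The sketch below does this on hypothetical records (the names and cities are made up); it only catches formatting differences, not typos.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 describe the same person, written differently
people = pd.DataFrame({
    "name": ["Jan Peeters", "  jan peeters ", "An Maes"],
    "city": ["Gent", "gent", "Brugge"],
})

# Build normalized helper keys: trimmed, lower-cased, single spaces
keys = people.apply(
    lambda col: col.str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)

# Keep the first occurrence of each normalized (name, city) pair
deduplicated = people[~keys.duplicated()]
print(deduplicated)
```

The original spelling of the surviving rows is left untouched; only the comparison happens on the normalized keys.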

Text

If columns contain text, we may need to standardize that text because too many similar terms have been used for the same thing.

# Installed packages
import pandas as pd
from pandas_profiling.utils.cache import cache_file

# Read the Pakistan Intellectual Capital dataset
file_name = cache_file(
    "pakistan_intellectual_capital.csv",
    "https://raw.githubusercontent.com/bencmbit/datasets/master/pakistan_intellectual_capital.csv",
)
df = pd.read_csv(file_name)

df.head()

	Unnamed: 0	S#	Teacher Name	University Currently Teaching	Department	Province University Located	Designation	Terminal Degree	Graduated from	Country	Year	Area of Specialization/Research Interests	Other Information
0	2	3	Dr. Abdul Basit	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	Software Engineering & DBMS	NaN
1	4	5	Dr. Waheed Noor	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	DBMS	NaN
2	5	6	Dr. Junaid Baber	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	Information processing, Multimedia mining	NaN
3	6	7	Dr. Maheen Bakhtyar	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	NLP, Information Retrieval, Question Answering...	NaN
4	24	25	Samina Azim	Sardar Bahadur Khan Women's University	Computer Science	Balochistan	Lecturer	BS	Balochistan University of Information Technolo...	Pakistan	2005.0	VLSI Electronics DLD Database	NaN

countries = df['Country'].unique()
countries.sort()
countries

array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)

We convert everything to lowercase and remove leading and trailing whitespace.

df['Country'] = df['Country'].str.lower()
df['Country'] = df['Country'].str.strip()
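The effect of these two steps can be checked on a handful of the raw spellings from the array shown above. This is only a small sketch on a hand-picked subset, not on the full column.

```python
import pandas as pd

# A few of the raw spellings from the unique-values array shown above
raw = pd.Series([" Germany", "germany", " USA", "USA", " New Zealand", "New Zealand"])

# Same two steps as in the cell above: lowercase, then strip whitespace
cleaned = raw.str.lower().str.strip()

print(raw.nunique(), "->", cleaned.nunique())  # 6 -> 3
```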

We see that some countries are spelled in similar but different ways. We will try to remove these inconsistencies.

import fuzzywuzzy
from fuzzywuzzy import process

fuzzywuzzy.process.extract("usa", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

[(' USA', 100),
 ('USA', 100),
 ('USofA', 75),
 ('Austria', 60),
 ('Australia', 50),
 ('Spain', 50),
 ('Urbana', 44),
 ('UK', 40),
 ('Malaysia', 36),
 ('Pakistan', 36)]

We can now write a function that finds the fuzzy matches and replaces them.

def replace_matches_in_column(df, column, string_to_match, min_ratio=47):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    # note: with min_ratio=47, 'austria', 'australia' and 'spain' (scores 60, 50, 50
    # in the output above) also count as matches for 'usa'
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # find the rows that contain any of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # replace all close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

replace_matches_in_column(df=df, column='Country', string_to_match="usa")

All done!

df['Country'].unique()

array(['thailand', 'pakistan', 'germany', 'usa', 'uk', 'china', 'france',
       'southkorea', 'malaysia', 'sweden', 'italy', 'canada', 'norway',
       'ireland', 'new zealand', 'urbana', 'portugal',
       'russian federation', 'finland', 'netherland', 'greece', 'turkey',
       'south korea', 'macau', 'singapore', 'japan', 'hongkong',
       'saudi arabia', 'mauritius', 'scotland'], dtype=object)
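The resulting array no longer contains 'austria', 'australia' or 'spain': with a threshold of 47 they were replaced by 'usa' as well. The sketch below illustrates how much the threshold matters. It uses difflib from the standard library as a stand-in scorer, not fuzzywuzzy's token_sort_ratio, so scores may differ from fuzzywuzzy in general; the candidate list and the alternative threshold of 70 are assumptions for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Similarity score from 0 to 100 based on difflib's SequenceMatcher."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Hand-picked candidates taken from the country list above
candidates = ["usa", "usofa", "austria", "australia", "spain", "uk"]

# Compare which candidates count as a match for "usa" at two thresholds
for threshold in (47, 70):
    matched = [c for c in candidates if similarity("usa", c) >= threshold]
    print(threshold, matched)
```

With these inputs the lower threshold also pulls in austria, australia and spain, whereas 70 keeps only usa and usofa. It is worth inspecting the matches before replacing anything.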

Scaling

We transform the data to a fixed range, such as 0-1 or 0-100. This is very important when we use ML algorithms that rely on the distance between two points. Scaling gives every feature the same weight when its value changes. Suppose one feature is weight in grams and another is height in meters; we then scale both features to a value between 0 and 1 so that they can be compared more easily.

from mlxtend.preprocessing import minmax_scaling

original_data = np.random.exponential(size = 1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")

[Figure: two histograms, "Original Data" (left) and "Scaled data" (right)]
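The grams-versus-meters example from the text can be worked out by hand with the min-max formula (x - min) / (max - min). The measurements below are hypothetical and min_max is a helper written for this sketch, not a library function.

```python
import numpy as np

# Hypothetical measurements: weight in grams, height in meters
weight_g = np.array([50_000.0, 65_000.0, 80_000.0, 95_000.0])
height_m = np.array([1.50, 1.65, 1.80, 1.95])

def min_max(x: np.ndarray) -> np.ndarray:
    """Rescale x linearly so that its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

print(min_max(weight_g))
print(min_max(height_m))
```

Although the raw values differ by several orders of magnitude, both features end up on the same 0-1 scale, so neither dominates a distance calculation.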

Normalization

Scaling only changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described by a normal distribution.

In general, you only want to normalize your data if you are going to use a machine learning or statistics technique that assumes your data is normally distributed. Some examples are t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes.

The method used here to normalize is called the Box-Cox transformation. Let's take a look at what normalizing some data looks like:

from scipy import stats

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")

[Figure: two histograms, "Original Data" (left) and "Normalized data" (right)]
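Without the plots, the effect of Box-Cox can also be read off the skewness of the data. The sketch below uses its own seeded sample (an assumption, so the numbers differ from the cell above). It also shows why the earlier cell indexes normalized_data[0]: stats.boxcox returns the transformed values together with the fitted lambda. Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(size=1000)  # strictly positive, right-skewed

# boxcox returns the transformed values and the fitted lambda
transformed, fitted_lambda = stats.boxcox(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after: ", stats.skew(transformed))
print("fitted lambda:  ", fitted_lambda)
```

A skewness close to zero after the transformation indicates a much more symmetric distribution, although it does not prove normality.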

After the data cleaning steps it is a good idea to run EDA again. EDA combined with data cleaning is an iterative process.

There is also a low-code library, PyCaret, that can make all of this somewhat easier.

# Read the Titanic dataset (cache_file was imported earlier)
file_name = cache_file(
    "titanic.csv",
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
)
titanic = pd.read_csv(file_name)

titanic.drop(columns="Name",inplace=True)

from pycaret.classification import *

clf1 = setup(titanic, target = 'Survived', silent=True, session_id=123, log_experiment=False, experiment_name='Survived')

	Description	Value
0	session_id	123
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	0: 0, 1: 1
4	Original Data	(891, 11)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	7
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 568)
12	Transformed Test Set	(268, 568)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	Survived
21	USI	8fe0
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

X = get_config('X')
X.shape

(891, 568)

clf1 = setup(titanic, target = 'Survived', normalize = True ,silent=True, session_id=123, log_experiment=False, experiment_name='Survived')

	Description	Value
0	session_id	123
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	0: 0, 1: 1
4	Original Data	(891, 11)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	7
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 567)
12	Transformed Test Set	(268, 567)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	Survived
21	USI	6472
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	True
30	Normalize Method	zscore
31	Transformation	False
32	Transformation Method	None
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

X = get_config('X')
X.head().apply(lambda s: s.apply(lambda x: format(x, 'f')))

	Age	Fare	Pclass_1	Pclass_3	Sex_female	SibSp_0	SibSp_1	...	Cabin_not_available	Embarked_C	Embarked_S
0	-0.620440	-0.480292	0.000000	1.000000	0.000000	0.000000	1.000000	...	1.000000	0.000000	1.000000
1	0.638427	0.699093	1.000000	0.000000	1.000000	0.000000	1.000000	...	0.000000	1.000000	0.000000
2	-0.305723	-0.467859	0.000000	1.000000	1.000000	1.000000	0.000000	...	1.000000	0.000000	1.000000
3	0.402390	0.364187	1.000000	0.000000	1.000000	0.000000	1.000000	...	0.000000	0.000000	1.000000
4	0.402390	-0.465557	0.000000	1.000000	0.000000	1.000000	0.000000	...	1.000000	0.000000	1.000000

5 rows × 567 columns

X.Fare.plot.hist()

[Figure: histogram of the normalized Fare feature, y-axis "Frequency"]

clf1 = setup(titanic, target = 'Survived', transformation = True ,silent=True, session_id=123, log_experiment=False, experiment_name='Survived')

	Description	Value
0	session_id	123
1	Target	Survived
2	Target Type	Binary
3	Label Encoded	0: 0, 1: 1
4	Original Data	(891, 11)
5	Missing Values	True
6	Numeric Features	3
7	Categorical Features	7
8	Ordinal Features	False
9	High Cardinality Features	False
10	High Cardinality Method	None
11	Transformed Train Set	(623, 567)
12	Transformed Test Set	(268, 567)
13	Shuffle Train-Test	True
14	Stratify Train-Test	False
15	Fold Generator	StratifiedKFold
16	Fold Number	10
17	CPU Jobs	-1
18	Use GPU	False
19	Log Experiment	False
20	Experiment Name	Survived
21	USI	2bbe
22	Imputation Type	simple
23	Iterative Imputation Iteration	None
24	Numeric Imputer	mean
25	Iterative Imputation Numeric Model	None
26	Categorical Imputer	constant
27	Iterative Imputation Categorical Model	None
28	Unknown Categoricals Handling	least_frequent
29	Normalize	False
30	Normalize Method	None
31	Transformation	True
32	Transformation Method	yeo-johnson
33	PCA	False
34	PCA Method	None
35	PCA Components	None
36	Ignore Low Variance	False
37	Combine Rare Levels	False
38	Rare Level Threshold	None
39	Numeric Binning	False
40	Remove Outliers	False
41	Outliers Threshold	None
42	Remove Multicollinearity	False
43	Multicollinearity Threshold	None
44	Remove Perfect Collinearity	True
45	Clustering	False
46	Clustering Iteration	None
47	Polynomial Features	False
48	Polynomial Degree	None
49	Trignometry Features	False
50	Polynomial Threshold	None
51	Group Features	False
52	Feature Selection	False
53	Feature Selection Method	classic
54	Features Selection Threshold	None
55	Feature Interaction	False
56	Feature Ratio	False
57	Interaction Threshold	None
58	Fix Imbalance	False
59	Fix Imbalance Method	SMOTE

X = get_config('X')
X.head().apply(lambda s: s.apply(lambda x: format(x, 'f')))

	Age	Fare	Pclass_1	Pclass_3	Sex_female	SibSp_0	SibSp_1	...	Cabin_not_available	Embarked_C	Embarked_S
0	-0.585410	-0.839834	0.000000	1.000000	0.000000	0.000000	1.000000	...	1.000000	0.000000	1.000000
1	0.658780	1.306386	1.000000	0.000000	1.000000	0.000000	1.000000	...	0.000000	1.000000	0.000000
2	-0.261811	-0.753659	0.000000	1.000000	1.000000	1.000000	0.000000	...	1.000000	0.000000	1.000000
3	0.434577	1.046076	1.000000	0.000000	1.000000	0.000000	1.000000	...	0.000000	0.000000	1.000000
4	0.434577	-0.738490	0.000000	1.000000	0.000000	1.000000	0.000000	...	1.000000	0.000000	1.000000
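The setup tables above report "zscore" as the normalize method and "yeo-johnson" as the transformation method. As a rough sketch of what those two options do, the cell below applies the scikit-learn counterparts to a hypothetical skewed feature. It assumes PyCaret's options behave like StandardScaler and PowerTransformer, which the tables suggest but do not prove.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
fare_like = rng.exponential(scale=30.0, size=(500, 1))  # hypothetical skewed feature

# z-score: subtract the mean, divide by the standard deviation
z_scored = StandardScaler().fit_transform(fare_like)

# Yeo-Johnson power transform (scikit-learn standardizes the result by default)
yeo_johnson = PowerTransformer(method="yeo-johnson").fit_transform(fare_like)

print("z-score     mean/std:", z_scored.mean(), z_scored.std())
print("yeo-johnson mean/std:", yeo_johnson.mean(), yeo_johnson.std())
```

Both outputs have mean 0 and standard deviation 1, but only the power transform changes the shape of the distribution; the z-score keeps the original skew.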