# Data Preprocessing Cheatsheet
1. Handling Missing Values
- Identify Missing Values: df.isnull().sum()
- Drop Rows with Missing Values: df.dropna()
- Fill Missing Values with a Specific Value: df.fillna(value)
- Fill Missing Values with the Mean: df.fillna(df.mean()) (use df.median() for the median, or df.mode().iloc[0] for the mode)
- Interpolate Missing Values: df.interpolate()
- Forward Fill or Backward Fill: df.ffill() or df.bfill()
2. Data Transformation
- Standardization (Z-Score Normalization): (df - df.mean()) / df.std()
- Min-Max Normalization: (df - df.min()) / (df.max() - df.min())
- Log Transformation: np.log(df) (values must be positive)
- Square Root Transformation: np.sqrt(df)
- Power Transformation (e.g., Box-Cox): scipy.stats.boxcox(df['column']) (expects 1-D positive data; returns the transformed array and the fitted lambda)
3. Feature Encoding
- One-Hot Encoding: pd.get_dummies(df)
- Label Encoding: sklearn.preprocessing.LabelEncoder()
- Binary Encoding: category_encoders.BinaryEncoder()
- Frequency Encoding: df.groupby('column').size() / len(df)
- Mean (Target) Encoding: df.groupby('category')['target'].mean()
4. Handling Categorical Data
- Convert to Category Type: df['column'].astype('category')
- Ordinal Encoding: df['column'].cat.codes
- Using Pandas' cut for Binning: pd.cut(df['column'], bins)
- Using Pandas' qcut for Quantile Binning: pd.qcut(df['column'], q)
By: Waleed Mousa
5. Feature Scaling
- Robust Scaler: sklearn.preprocessing.RobustScaler()
- MaxAbsScaler: sklearn.preprocessing.MaxAbsScaler()
- Normalizer: sklearn.preprocessing.Normalizer()
6. Feature Selection
- Variance Threshold: sklearn.feature_selection.VarianceThreshold()
- SelectKBest: sklearn.feature_selection.SelectKBest()
- Recursive Feature Elimination: sklearn.feature_selection.RFE()
- SelectFromModel: sklearn.feature_selection.SelectFromModel()
- Correlation Matrix with Heatmap: sns.heatmap(df.corr(), annot=True)
7. Handling Outliers
- IQR Method: Q1 = df.quantile(0.25); Q3 = df.quantile(0.75); IQR = Q3 - Q1; df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
- Z-Score Method: df[((df - df.mean()).abs() / df.std() < 3).all(axis=1)]
- Winsorizing: scipy.stats.mstats.winsorize(data, limits=[0.05, 0.05])
8. Text Preprocessing (NLP)
- Tokenization: nltk.word_tokenize(text)
- Removing Stop Words: nltk.corpus.stopwords.words('english')
- Stemming: nltk.stem.PorterStemmer()
- Lemmatization: nltk.stem.WordNetLemmatizer()
- TF-IDF Vectorization: sklearn.feature_extraction.text.TfidfVectorizer()
9. Time Series Data
- DateTime Conversion: pd.to_datetime(df['column'])
- Set DateTime as Index: df.set_index('datetime_column')
- Resampling for Time Series Aggregation: df.resample('D').mean()
- Time Series Decomposition: statsmodels.tsa.seasonal.seasonal_decompose(df['column'])
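Resampling needs a DatetimeIndex; a minimal sketch with made-up half-daily readings:

```python
import pandas as pd

# Two days of half-daily readings (illustrative)
idx = pd.date_range("2024-01-01", periods=4, freq="12h")
df = pd.DataFrame({"value": [1.0, 3.0, 5.0, 7.0]}, index=idx)

daily = df.resample("D").mean()  # aggregate to daily averages
```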
10. Data Splitting
- Train-Test Split: sklearn.model_selection.train_test_split()
- K-Fold Cross-Validation: sklearn.model_selection.KFold()
- Stratified Sampling: sklearn.model_selection.StratifiedKFold()
11. Data Cleaning
- Trimming Whitespace: df['column'].str.strip()
- Replacing Values: df.replace(old_value, new_value)
- Dropping Columns: df.drop(columns=['column_to_drop'])
- Renaming Columns: df.rename(columns={'old_name': 'new_name'})
- Converting Data Types: df.astype({'column': 'new_type'})
12. Image Data Preprocessing
- Resizing Images: cv2.resize(image, (width, height))
- Normalizing Pixel Values: image / 255.0
- Image Augmentation: ImageDataGenerator() in keras.preprocessing.image
- Grayscale Conversion: cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
13. Dimensionality Reduction
- Principal Component Analysis (PCA): sklearn.decomposition.PCA()
- t-SNE: sklearn.manifold.TSNE()
- LDA: sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
14. Dealing with Imbalanced Data
- Random Over-Sampling: imblearn.over_sampling.RandomOverSampler()
- Random Under-Sampling: imblearn.under_sampling.RandomUnderSampler()
- SMOTE: imblearn.over_sampling.SMOTE()
15. Combining Features
- Polynomial Features: sklearn.preprocessing.PolynomialFeatures()
- Concatenating Features: np.concatenate([feature1, feature2], axis=1)
16. Handling Multivariate Data
- Granger Causality Test: statsmodels.tsa.stattools.grangercausalitytests()
- Vector AutoRegression (VAR): statsmodels.tsa.api.VAR()
17. Signal Processing
- Fourier Transform: np.fft.fft()
- Wavelet Transform: pywt.Wavelet()
18. Error Metrics
- Mean Squared Error (MSE): sklearn.metrics.mean_squared_error()
- Mean Absolute Error (MAE): sklearn.metrics.mean_absolute_error()
- R-Squared: sklearn.metrics.r2_score()
19. Data Wrangling
- Pivot Tables: df.pivot_table()
- Stacking and Unstacking: df.stack(), df.unstack()
- Melting Data: pd.melt(df)
20. Advanced DataFrame Operations
- Apply Functions: df.apply(lambda x: ...)
- GroupBy Operations: df.groupby('column').aggregate(function)
- Merge and Join DataFrames: pd.merge(df1, df2, on='key'), df1.join(df2, on='key')
21. Sequence Data Processing
- Padding Sequences: keras.preprocessing.sequence.pad_sequences()
- One-Hot Encoding for Sequences: keras.utils.to_categorical()
22. Data Verification
- Assert DataFrame Equality: pd.testing.assert_frame_equal(df1, df2)
- Assert Series Equality: pd.testing.assert_series_equal(s1, s2)
23. Data Aggregation
- Cumulative Sum: df.cumsum()
- Cumulative Product: df.cumprod()
- Weighted Average: np.average(values, weights=weights)
24. Geospatial Data
- Coordinate Transformation: geopandas.GeoDataFrame()
- Spatial Join: geopandas.sjoin()
- Distance Calculation: geopy.distance.distance(coord1, coord2)
25. Handling JSON Data
- Normalize JSON: pd.json_normalize(json_data)
- Read JSON: pd.read_json('file.json')
- To JSON: df.to_json()
26. Handling XML Data
- Parse XML: xml.etree.ElementTree.parse('file.xml')
- Find Elements in XML: tree.findall('path')
27. Probability Distributions
- Normal Distribution: np.random.normal()
- Uniform Distribution: np.random.uniform()
- Binomial Distribution: np.random.binomial()
28. Hypothesis Testing
- t-Test: scipy.stats.ttest_ind()
- ANOVA Test: scipy.stats.f_oneway()
- Chi-Squared Test: scipy.stats.chi2_contingency()
29. Database Interaction
- Read SQL Query: pd.read_sql_query('SELECT * FROM table', connection)
- Write to SQL: df.to_sql('table', connection)
30. Data Profiling
- Descriptive Statistics: df.describe()
- Correlation Analysis: df.corr()
- Unique Value Counts: df['column'].value_counts()
- Comprehensive Reports: ydata_profiling.ProfileReport(df) (the package formerly published as pandas_profiling)
31. Advanced Handling of Missing Values
- KNN Imputation: from sklearn.impute import KNNImputer; imputer = KNNImputer(n_neighbors=5); df_imputed = imputer.fit_transform(df)
- Iterative Imputation: from sklearn.experimental import enable_iterative_imputer; from sklearn.impute import IterativeImputer; imputer = IterativeImputer(); df_imputed = imputer.fit_transform(df)
32. Feature Engineering
- Lag Features for Time Series: df['lag_feature'] = df['feature'].shift(1)
- Rolling Window Features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()
- Expanding Window Features: df['expanding_mean'] = df['feature'].expanding().mean()
- Datetime Features Extraction: df['hour'] = df['datetime'].dt.hour
- Binning Numeric Features: pd.cut(df['numeric_feature'], bins=3, labels=False)
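Lag, rolling, and expanding features side by side on a toy series (window of 3 chosen for illustration):

```python
import pandas as pd

df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0, 5.0]})

df["lag_1"] = df["feature"].shift(1)                           # previous observation
df["rolling_mean_3"] = df["feature"].rolling(window=3).mean()  # mean over the last 3 rows
df["expanding_mean"] = df["feature"].expanding().mean()        # mean over all rows so far
```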
33. Data Normalization for Text
- Removing Punctuation: df['text'].str.replace(r'[^\w\s]', '', regex=True)
- Removing Numbers: df['text'].str.replace(r'\d+', '', regex=True)
- Converting to Lowercase: df['text'].str.lower()
- Removing Whitespace: df['text'].str.strip()
34. Advanced Text Preprocessing
- Removing HTML Tags: df['text'].str.replace(r'<.*?>', '', regex=True)
- Removing URLs: df['text'].str.replace(r'http\S+|www\.\S+', '', regex=True)
- Using NLTK for Tokenization: df['text'].apply(nltk.word_tokenize)
- Using spaCy for Lemmatization: nlp = spacy.load('en_core_web_sm'); df['text'].apply(lambda t: [tok.lemma_ for tok in nlp(t)])
35. Advanced Feature Scaling
- Quantile Transformer: sklearn.preprocessing.QuantileTransformer()
- Power Transformer: sklearn.preprocessing.PowerTransformer(method='yeo-johnson')
36. Balancing Data
- Oversampling with SMOTE-NC for Categorical Features: imblearn.over_sampling.SMOTENC(categorical_features=[0, 2, 3])
- Cluster-Based Under-Sampling: imblearn.under_sampling.ClusterCentroids()
37. Feature Selection Based on Model
- L1 Regularization for Feature Selection: sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear')
- Tree-Based Feature Selection: sklearn.ensemble.ExtraTreesClassifier()
38. Data Discretization
- Discretization into Quantiles: pd.qcut(df['feature'], q=4)
- K-Means Discretization: sklearn.preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
39. Dealing with Date and Time
- Time Delta Calculation: (df['date_end'] - df['date_start']).dt.days
- Extracting Day of Week: df['date'].dt.dayofweek
- Setting Frequency in Time Series: df.asfreq('D')
40. Handling Geospatial Data
- Creating Geospatial Features: geopandas.GeoDataFrame(df, geometry=geopandas.points_from_xy(df.longitude, df.latitude))
- Calculating Distance Between Points: df['geometry'].distance(other_point)
41. Advanced NLP Techniques
- Named Entity Recognition (NER) with spaCy: nlp = spacy.load('en_core_web_sm'); df['text'].apply(lambda t: nlp(t).ents)
- Topic Modeling with Latent Dirichlet Allocation (LDA): gensim.models.LdaMulticore(corpus, num_topics=10)
42. Data Decomposition
- Singular Value Decomposition (SVD): scipy.linalg.svd(matrix)
- Non-Negative Matrix Factorization (NMF): sklearn.decomposition.NMF(n_components=2)
43. Advanced Image Preprocessing
- Edge Detection (Canny): cv2.Canny(image, threshold1, threshold2)
- Image Thresholding: cv2.threshold(image, threshold, max_value, cv2.THRESH_BINARY)
44. Handling JSON and Complex Data Types
- Flattening Nested JSON: pd.json_normalize(data, sep='_')
- Parsing JSON Strings in a DataFrame: df['json_col'].apply(json.loads)
45. Working with Time Series and Sequences
- Differencing a Time Series: df['value'].diff(periods=1)
- Creating Cumulative Features: df['cumulative_sum'] = df['value'].cumsum()
46. Data Validation
- Asserting DataFrame Equality: pd.testing.assert_frame_equal(df1, df2)
- Checking DataFrame Schema with Pandera: schema.validate(df), where schema is a pandera.DataFrameSchema
47. Custom Transformations
- Applying Custom Functions: df.apply(lambda row: custom_function(row), axis=1)
- Vectorized String Operations: df['text'].str.cat(sep=' ')
48. Feature Extraction from Time Series
- Fourier Transform for Periodicity: np.fft.fft(df['time_series'])
- Autocorrelation Plot: pd.plotting.autocorrelation_plot(df['time_series'])
49. Working with APIs and Remote Data
- Reading Data from a REST API: pd.read_json(api_endpoint)
- Loading Data from Cloud Storage (e.g., AWS S3): pd.read_csv('s3://bucket_name/file.csv')
50. Advanced Data Aggregation
- Weighted Moving Average: df['value'].rolling(window=5).apply(lambda x: np.average(x, weights=[0.1, 0.2, 0.3, 0.2, 0.2])) (the weights list must match the window length)
- Cumulative Maximum or Minimum: df['cumulative_max'] = df['value'].cummax()
- GroupBy with Custom Aggregation Functions: df.groupby('group').agg({'value': ['mean', 'std', custom_agg_function]})
- Pivot Table with Multiple Aggregates: df.pivot_table(index='group', values='value', aggfunc=['mean', 'sum', 'count'])
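A minimal sketch of multi-aggregate groupby and a running maximum (toy groups and values):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})

agg = df.groupby("group").agg({"value": ["mean", "sum"]})  # multiple aggregates per group
df["cumulative_max"] = df["value"].cummax()                # running maximum

# agg has a MultiIndex on its columns: ('value', 'mean') and ('value', 'sum')
```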