ML in Two Worlds - Scikit-Learn vs PySpark (Blog-2)
Hey! In this blog, I'm exploring the basic implementation of Logistic Regression in PySpark while drawing parallels with LR in Scikit-learn, and sharing my questions and learnings along the way.
Quick Recap of my previous post "Blog 1": TL;DR: I explored the basic implementation of Linear Regression in Spark and drew some parallels with the popular and well-known Scikit-learn package. We found that Spark expects features to be assembled into vectors, the preferred standard for distributed computing. We learned that Spark inherently partitions data and performs distributed computation without any external specification. We also compared evaluation in Spark and Scikit-learn, where one is straightforward while the other provides fine-grained details.
Diving into Logistic Regression in Scikit-learn vs PySpark:
Here, we are going to use the popular Titanic dataset. Why? The Titanic dataset has both numerical and categorical columns, along with null values, so we can experiment with different data cleaning and preprocessing steps.
1. Data Loading and Preprocessing:
Scikit-learn Approach:
In Scikit-learn, we use pandas for data loading, which simplifies the process of handling CSV files. The OneHotEncoder is particularly useful for converting categorical variables into a format that logistic regression can understand, showcasing Scikit-learn's straightforward approach to preprocessing.
PySpark's Approach:
Spark's approach to data loading and preprocessing is designed for distributed data, handling large datasets efficiently across clusters. It uses StringIndexer and OneHotEncoder for categorical encoding, and VectorAssembler to combine all features into a single vector column, a requirement for MLlib's algorithms.
Question: Why does Spark require using StringIndexer and OneHotEncoder for categorical data, unlike Scikit-learn's direct OneHotEncoder application?
In Spark, machine learning algorithms work with numerical data, not text, and MLlib models expect features to be assembled into vectors. So StringIndexer first converts text categories into numeric indices; OneHotEncoder then converts those indices into a binary vector representation that the model can consume. This two-step approach lets Spark process data efficiently across distributed systems. If you skip the StringIndexer step, Spark's OneHotEncoder will fail because it expects numeric indices as input.
2. Model Training
Scikit-learn Approach:
Scikit-learn's logistic regression is intuitive, with train_test_split for data splitting and LogisticRegression for model training. It's suitable for quick iterations and experiments on smaller datasets.
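As a quick sketch of that workflow (using synthetic features from `make_classification` as a stand-in for the preprocessed Titanic columns):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed Titanic features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit and score in two lines: no vector assembly or pipeline required
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```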
Spark's MLlib Equivalent:
PySpark utilizes a Pipeline to encapsulate the preprocessing and model training steps.
Question: Why does Spark use a Pipeline for model training, and can it train models without one like Scikit-learn's straightforward fit method?
Spark's Pipeline encapsulates all preprocessing and model training steps into a single workflow. This is for:
Efficiency: Reduces redundant passes over the data as it moves through the workflow, which matters for performance in distributed systems.
Reproducibility: Guarantees that the same preprocessing steps are applied in the same order, enhancing model consistency across runs.
Manually handling each step increases the risk of errors and inconsistencies, especially in distributed environments where data is partitioned across multiple nodes. It could also lead to performance inefficiencies due to repeated data passes.
3. Model Evaluation
Scikit-learn Approach:
Scikit-learn offers a range of metrics for model evaluation, including the ROC AUC score, which facilitates a straightforward assessment of the model. In addition to this, Scikit-learn provides many other important metrics such as Accuracy Score, Precision, Recall (Sensitivity), F1 Score, Confusion Matrix, and Classification Report, under the sklearn.metrics module.
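A minimal sketch of pulling several of those metrics from one fitted model (again on synthetic stand-in data rather than the real Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# ROC AUC needs probabilities; the other metrics use hard predictions
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```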
Spark's MLlib Equivalent:
Spark's MLlib also offers a range of evaluation metrics, including AUC score, through the BinaryClassificationEvaluator. This supports a thorough assessment of model performance in a distributed computing context.
Additionally, Spark MLlib provides a variety of metrics for evaluating classification models. Here are some of them:
Binary Classification Metrics: For binary classification problems, Spark MLlib provides metrics such as area under the ROC curve (AUC-ROC) and area under the precision-recall curve (AUC-PR).
Multiclass Classification Metrics: For multiclass classification problems, Spark MLlib provides a suite of metrics including precision, recall, F-measure, weighted precision, weighted recall, weighted F-measure, and accuracy.
Multilabel Metrics: These are for multilabel classification problems. They include precision, recall, F-measure, accuracy, and Hamming loss.
Ranking Metrics: For ranking systems (such as recommendations), Spark MLlib provides precision at k, recall at k, NDCG at k, and MAP (mean average precision).
A Few Interesting Questions I Still Have:
1. How does the choice of data preprocessing techniques affect model performance in Scikit-learn vs. PySpark?
2. In what scenarios would the efficiency and scalability of PySpark's distributed computing outweigh Scikit-learn's simplicity?
3. How do the evaluation metrics provided by Scikit-learn and PySpark compare in terms of providing insights into model performance?
Check out my previous blog:
Linear Regression in PySpark vs Scikit-learn: Blog 1