Accelerated Data Analytics: Machine Learning with GPU-Accelerated Pandas and Scikit-learn

This post provides technical best practices for:

  • Accelerating basic ML techniques, such as classification, clustering, and regression
  • Preprocessing time series data and training ML models efficiently with RAPIDS, a suite of open-source libraries for executing data science and analytics pipelines entirely on GPUs
  • Understanding algorithm performance and which evaluation metrics to use for each ML task

Accelerating data science pipelines with GPUs

GPU-accelerated data analytics is made possible with RAPIDS cuDF, a GPU DataFrame library, and RAPIDS cuML, a GPU-accelerated ML library.

cuDF is a Python GPU DataFrame library built on the Apache Arrow columnar memory format for loading, joining, aggregating, filtering, and manipulating data. It has an API similar to pandas, an open-source software library built on top of Python specifically for data manipulation and analysis. This makes it a useful tool for data analytics workflows, including data preprocessing and exploratory tasks to prepare dataframes for ML. For more information on how you can accelerate your data analytics pipeline with cuDF, refer to the series on accelerated data analytics.

Once your data is preprocessed, cuDF integrates with cuML, which uses GPU acceleration to provide a large set of ML algorithms that can execute complex ML tasks at scale, often much faster than CPU-based frameworks like scikit-learn.

cuML provides a straightforward API closely mirroring the scikit-learn API, making it easy to integrate into existing ML projects. With cuDF and cuML, data scientists and data analysts working on ML projects get the easy interactivity of the most popular open-source data science tools with the power of GPU acceleration across the data pipeline. This shortens adoption time and helps move ML workflows forward.

Note: This resource serves as an introduction to ML with cuML and cuDF, demonstrating common algorithms for learning purposes. It’s not intended as a definitive guide for feature engineering or model building. Each ML scenario is unique and might require custom techniques. Always consider your problem specifics when building ML models.

Understanding the Meteonet dataset

Before diving into the analysis, it is important to understand the structure and content of the Meteonet dataset, which is well-suited for time series analysis. This dataset is a comprehensive collection of weather data that is immensely beneficial for researchers and data scientists in meteorology. 

An overview of the Meteonet dataset and the meaning of each column is provided below:

  1. number_sta: A unique identifier for each weather station.
  2. lat and lon: Latitude and longitude of the weather station, representing its geographical location.
  3. height_sta: Height of the weather station above sea level in meters.
  4. date: Date and time of data recording, essential for time series analysis.
  5. dd: Wind direction in degrees, indicating the direction from which the wind is coming.
  6. ff: Wind speed, measured in meters per second.
  7. precip: Amount of precipitation measured in millimeters.
  8. hu: Humidity, represented as a percentage indicating the concentration of water vapor in the air.
  9. td: Dew point temperature in degrees Celsius, indicating when the air becomes saturated with moisture.
  10. t: Air temperature in degrees Celsius.
  11. psl: Atmospheric pressure at sea level in hPa (hectopascals).

Machine learning with RAPIDS 

This tutorial covers the acceleration of two fundamental ML tasks with cuDF and cuML: classification and regression.

Installation

Before analyzing the Meteonet dataset, install and set up RAPIDS cuDF and cuML. Refer to the RAPIDS Installation Guide for instructions based on your system requirements. 

Classification

Classification is a type of ML algorithm used to predict a categorical value based on a set of features. In this case, the goal is to predict weather conditions (such as sunny, cloudy, or rainy) and wind direction using temperature, humidity, and other factors.

Random forest is a powerful and versatile ML method capable of performing both regression and classification tasks. This section uses the cuML Random Forest Classifier to classify the weather conditions and wind direction at a certain time and location. The accuracy of the model can be used to evaluate its performance.

For this tutorial, 3 years of northwest station data have been consolidated into a single file named NW_data.csv. To see the complete steps for combining the data, visit the Introduction to Machine Learning Using cuML notebook on GitHub.

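import cudf
from cuml.ensemble import RandomForestClassifier as cuRF
from cuml.metrics import accuracy_score
from cuml.model_selection import train_test_split

# Load data and drop rows with missing values
df = cudf.read_csv('./NW_data.csv').dropna()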
        

To prepare the data for classification, perform preprocessing tasks such as converting the date column to datetime format and extracting the hour.

# Convert date column to datetime and extract hour
df['date'] = cudf.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour

# Drop the original 'date' column
df = df.drop(['date'], axis=1)
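
As a rough illustration of what the hour extraction produces, here is a plain-Python sketch that does the same thing with the standard library instead of cuDF. The sample timestamps and their format string are assumptions made for the example; the actual date format in NW_data.csv may differ.

```python
from datetime import datetime

# Hypothetical sample timestamps; the real column's format may differ
sample_dates = [
    "2016-01-01 00:06:00",
    "2016-01-01 13:42:00",
    "2016-01-02 23:54:00",
]


def extract_hours(dates, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse each timestamp string and return its hour of day (0-23)."""
    return [datetime.strptime(d, fmt).hour for d in dates]


hours = extract_hours(sample_dates)
print(hours)  # [0, 13, 23]
```

Reducing a timestamp to its hour discards the day and the year, so the model can learn daily cycles but not seasonal ones.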
        

Create two new categorical columns: wind_direction and weather_condition. 

For wind_direction, discretize the dd column (wind direction in degrees) into four categories: north (0-90 degrees), east (90-180 degrees), south (180-270 degrees), and west (270-360 degrees). The lowest bin edge is set to -0.1 so that a reading of exactly 0 degrees falls into the first bin.

# Discretize wind direction
df['wind_direction'] = cudf.cut(df['dd'], bins=[-0.1, 90, 180, 270, 360], labels=['N', 'E', 'S', 'W'])
        

For weather_condition, discretize the precip column (which is the amount of precipitation) into three categories: sunny (no rain), cloudy (little rain), and rainy (more rain).

# Discretize weather condition based on precipitation amount
df['weather_condition'] = cudf.cut(df['precip'], bins=[-0.1, 0.1, 1, float('inf')], labels=['sunny', 'cloudy', 'rainy'])
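
To see which bin a boundary value lands in, the following plain-Python sketch mimics right-closed binning, where each interval excludes its left edge and includes its right edge. It assumes that cudf.cut follows the pandas default of right-closed intervals; the helper name cut_right_closed is hypothetical and is not part of cuDF.

```python
from bisect import bisect_left


def cut_right_closed(value, bins, labels):
    """Return the label of the interval (bins[i], bins[i+1]] that contains
    value, or None when value falls outside every bin."""
    idx = bisect_left(bins, value) - 1
    if 0 <= idx < len(labels):
        return labels[idx]
    return None


# Same bin edges and labels as the cuDF calls above
wind_bins = [-0.1, 90, 180, 270, 360]
wind_labels = ['N', 'E', 'S', 'W']
precip_bins = [-0.1, 0.1, 1, float('inf')]
precip_labels = ['sunny', 'cloudy', 'rainy']

wind = [cut_right_closed(v, wind_bins, wind_labels)
        for v in [0, 90, 90.5, 270, 360, 400]]
weather = [cut_right_closed(v, precip_bins, precip_labels)
           for v in [0.0, 0.1, 0.2, 1.0, 5.0]]

print(wind)     # ['N', 'N', 'E', 'S', 'W', None]
print(weather)  # ['sunny', 'sunny', 'cloudy', 'cloudy', 'rainy']
```

Note that a reading of exactly 90 degrees is labeled north and a precipitation of exactly 0.1 mm is labeled sunny, because each upper edge belongs to the bin below it. Values outside all bins get no label.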
        

Then convert these categorical columns into numerical labels that the RandomForestClassifier can work with using .cat.codes.

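# Convert the 'wind_direction' and 'weather_condition' categories to integer codes
# (cast to int32, the label dtype documented for the cuML random forest classifier)
df['wind_direction'] = df['wind_direction'].astype('category').cat.codes.astype('int32')
df['weather_condition'] = df['weather_condition'].astype('category').cat.codes.astype('int32')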
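
Because the model predicts integer codes rather than names, it helps to know how to map them back. The sketch below is plain Python with hypothetical helper names (encode, decode). It assumes the codes follow the order of the labels passed to the binning step and that missing values receive the code -1, as in pandas-style categoricals.

```python
# Category order matches the labels given when binning wind direction
wind_categories = ['N', 'E', 'S', 'W']


def encode(labels, categories):
    """Map each label to its position in categories; unknown or missing -> -1."""
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup.get(label, -1) for label in labels]


def decode(codes, categories):
    """Map integer codes back to labels; codes outside the range -> None."""
    return [categories[c] if 0 <= c < len(categories) else None for c in codes]


codes = encode(['E', 'N', None, 'W'], wind_categories)
print(codes)                           # [1, 0, -1, 3]
print(decode(codes, wind_categories))  # ['E', 'N', None, 'W']
```

Keeping the category list around makes it possible to report predictions as 'N', 'E', 'S', or 'W' instead of 0 to 3.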
        

Model training

Now that preprocessing is done, the next step is to define a function that trains and evaluates a random forest classifier for a given target column:

# Raw column each label was binned from. It is excluded from the features
# so the model cannot simply read the answer back from its inputs.
LABEL_SOURCE = {'wind_direction': 'dd', 'weather_condition': 'precip'}

def train_and_evaluate(target):
    # Split into features and target
    X = df.drop([target, LABEL_SOURCE[target]], axis=1)
    y = df[target]

    # Split the dataset into training set and test set (80% / 20%)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Define the model
    model = cuRF()

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy for predicting {target} is {accuracy}")

    return model
        

Now that the function is ready, the next step is to train one model per target by passing the target column name:

# Train and evaluate models
weather_condition_model = train_and_evaluate('weather_condition')
wind_direction_model = train_and_evaluate('wind_direction')
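
The test_size=0.2 argument holds out 20% of the rows for evaluation. The sketch below shows the idea of a shuffled holdout split in plain Python. It is not cuML's implementation; the helper name holdout_split and the rounding rule are assumptions made for illustration.

```python
import random


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a test_size fraction of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducible splits
    n_test = int(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 rows of the dataframe
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed keeps the split, and therefore the reported score, the same from run to run. Without a fixed seed, accuracy can vary slightly between runs.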
        

This tutorial uses the cuML Random Forest Classifier to classify weather conditions and wind direction in the northwest dataset. Preprocessing steps include converting the date column, discretizing wind direction and weather conditions, and converting categorical columns to numerical labels. The models were trained and evaluated using accuracy as the evaluation metric.
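
Accuracy is the fraction of test rows whose predicted label matches the true label. When one class dominates, it is worth comparing against a trivial baseline that always predicts the most frequent class. The sketch below is plain Python with made-up label codes, not output from the models above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical label codes: 0 = sunny, 1 = cloudy, 2 = rainy
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 2, 0]

print(accuracy(y_true, y_pred))   # 0.9
print(majority_baseline(y_true))  # 0.8
```

In this made-up case, 90% accuracy is only ten points better than always answering "sunny", which is why accuracy is best read alongside the class distribution.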

Regression

Regression is an ML algorithm used to predict a continuous value based on a set of features. For example, you could use regression to predict the price of a house based on its features, such as the number of bedrooms, the square footage, and the location.

Linear regression is a popular algorithm for predicting a quantitative response. For this tutorial, use the cuML implementation of linear regression to predict temperature and humidity at different times and locations. The R^2 score can be used to evaluate the performance of your regression models.

Start by importing the required libraries for this section:

import cudf
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.metrics import r2_score
from cuml.model_selection import train_test_split
from cuml.preprocessing import LabelEncoder
        

Next, load the NW dataset by reading the NW_data.csv file into a dataframe and dropping any rows with missing values:

# Load data
df = cudf.read_csv('./NW_data.csv').dropna()
        

For detailed steps on downloading NW_data.csv, see the Introduction to Machine Learning Using cuML notebook on GitHub.

For many ML algorithms, categorical input data must be converted to numeric forms. For this example, number_sta, which signifies ‘station number,’ is converted using LabelEncoder, which assigns unique numeric values to each category.

Next, numeric features must be normalized to prevent the model from being biased by the variable scales. 

Then transform the ‘date’ column into an ‘hour’ feature, as weather patterns often correlate with the time of day. Finally, drop the ‘date’ column, as the models used cannot process this directly.

# Convert categorical variables to numeric variables
le = LabelEncoder()
df['number_sta'] = le.fit_transform(df['number_sta'])

# Normalize numeric features
numeric_columns = ['lat', 'lon', 'height_sta', 'dd', 'ff', 'hu', 'td', 't', 'psl']
for col in numeric_columns:
    if df[col].dtype != 'object':
        df[col] = (df[col] - df[col].mean()) / df[col].std()
    else:
        print(f"Skipping normalization for non-numeric column: {col}")


# Convert date column to datetime and extract hour
df['date'] = cudf.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour

# Drop the original 'date' column
df = df.drop(['date'], axis=1)
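
The normalization loop applies a z-score to each column: subtract the mean, then divide by the standard deviation. The plain-Python sketch below shows the same arithmetic on a few made-up values. It assumes the sample standard deviation (ddof=1), which is the pandas-style default.

```python
from statistics import mean, stdev


def zscore(values):
    """Center values on their mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (ddof=1)
    return [(v - mu) / sigma for v in values]


temps = [2.0, 4.0, 6.0, 8.0]  # hypothetical temperatures in degrees Celsius
scaled = zscore(temps)
print([round(s, 3) for s in scaled])  # [-1.162, -0.387, 0.387, 1.162]
```

After scaling, every column has mean 0 and standard deviation 1, so a feature measured in hPa no longer outweighs one measured in meters per second simply because its numbers are larger.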
        

Model training and performance

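With preprocessing done, the next step is to define a function that trains a linear regression model for a given target column. It is called twice, once for temperature and once for humidity.

To evaluate the performance of the regression model, use R^2, the coefficient of determination. R^2 measures the proportion of variance in the target that the model explains: a score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean. A higher R^2 indicates a model that better predicts the data.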

def train_and_evaluate(target):
    # Split into features and target
    X = df.drop(target, axis=1)
    y = df[target]

    # Split the dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Define the model
    model = cuLinearRegression()

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, predictions)
    print(f"R^2 score for predicting {target} is {r2}")

    return model
        

Now that the function is written, the next step is to train the model with the following call, specifying the target variable:

# Train and evaluate models
temperature_model = train_and_evaluate('t')
humidity_model = train_and_evaluate('hu')
        

This example demonstrates how to use cuML linear regression to predict temperature and humidity using the northwest dataset. The R^2 score was used to evaluate the performance of the regression models. Model performance can be further improved by exploring techniques such as feature selection, regularization, and more advanced models.
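
For reference, R^2 is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The following plain-Python sketch uses made-up values to show how the score behaves; it is an illustration of the formula, not the cuML implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [10.0, 12.0, 14.0, 16.0]  # hypothetical observed temperatures

perfect = r2(y_true, y_true)                    # every prediction exact
mean_only = r2(y_true, [13.0] * 4)              # always predict the mean
close = r2(y_true, [11.0, 12.0, 14.0, 15.0])    # small errors on two rows

print(perfect, mean_only, round(close, 3))  # 1.0 0.0 0.9
```

A model that predicts worse than the mean produces a negative R^2, so the score is not bounded below by 0.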


Conclusion

GPU-accelerated machine learning with cuDF and cuML can drastically speed up your data science pipelines. With faster data preprocessing using cuDF and the cuML scikit-learn-compatible API, it is easy to start leveraging the power of GPUs for machine learning. 
