Mambular Tabular Deep Learning Series 1: Multi-Layer Perceptron (MLP)

Introduction

This series of articles introduces Mambular, an open-source Python package that simplifies deep learning for tabular data. Through this series, we will explore various models, preprocessing techniques, hyperparameter optimization, and other strategies for improving tabular deep learning workflows. We will begin with the basics, showcasing how easy it is to fit a Multi-Layer Perceptron (MLP) — including some advanced architectures — to any tabular data problem. The topics we’ll cover include:

  • MLP

  • FT-Transformer

  • Tabular ResNet

  • SAINT

  • NODE

  • TabM

  • Mambular

  • TabulaRNN

  • Advanced Preprocessing

  • Hyperparameter Optimization

All models, preprocessing techniques, and related tools are implemented in the Mambular package. Stay up-to-date with ongoing development as new models, preprocessing methods, and features are continuously added: https://github.com/basf/mamba-tabular

Throughout this series, we will use the same dataset and fit each model to it. This approach allows us to report the performance of each new model and compare it with other techniques, such as gradient boosting.

What is Tabular Data?

Tabular data is structured data arranged in rows and columns, like what you’d see in an Excel spreadsheet or a SQL database. Each row often represents an entity (e.g., a customer, transaction, or patient), while columns represent attributes (e.g., age, income, or diagnosis).

Traditionally, models like XGBoost and LightGBM have dominated this space because they work well with a mix of categorical (e.g., gender: male/female) and numerical (e.g., income: $40,000) data. However, these models don’t always capture deeper patterns in the data, such as complex interactions between columns. That’s where Tabular Deep Learning comes in.

Why is Tabular Deep Learning Challenging?

Unlike gradient-boosted decision trees (GBDTs), tabular deep learning demands careful preprocessing and feature engineering. Raw numerical inputs can degrade model performance when features span very different scales, for example incomes in the tens of thousands alongside indicators that only range from 0 to 1. To address this, tabular deep learning requires both sophisticated model architectures and advanced preprocessing.

Mambular is designed to simplify this process. It integrates automatic preprocessing and provides customizable, pre-defined model architectures, enabling users to train models as easily as (or even more easily than) training a GBDT.

Automatic Data Preprocessing

Mambular’s preprocessing pipeline is seamlessly integrated into the training process. When fitting a model, the package automatically transforms raw tabular data into a format optimized for deep learning. It distinguishes between categorical and numerical features and applies suitable preprocessing techniques:

  • Categorical Features: Supports integer encoding or one-hot encoding.

  • Numerical Features: Includes common methods such as standardization but also advanced techniques like Piecewise Linear Encodings (PLE) (Gorishniy et al., 2022). Other built-in methods include min-max scaling, basis expansion, quantile transformation, and more.

Tabular Deep Learning with an MLP

How Does an MLP Work?

A Multi-Layer Perceptron (MLP) is a type of feedforward neural network that maps input data to outputs using a series of layers. Here’s a breakdown of its components:

  • Input Layer:  This layer receives preprocessed tabular data as input. The data can include both categorical and numerical features.

  • Hidden Layers:  These consist of one or more layers in which each neuron computes a weighted sum of its inputs, adds a bias term, and applies a non-linear activation function (e.g., ReLU, sigmoid, or tanh). Hidden layers are responsible for learning complex patterns and representations, enabling the MLP to model non-linear relationships in the data. Dropout, a regularization technique applied to hidden layers, randomly sets a fraction of neurons to zero during training; this prevents overfitting by encouraging the model to rely on multiple pathways instead of memorizing patterns.

  • Output Layer:  Produces the final prediction. For regression tasks, this is typically a single node with a linear activation. For classification tasks, this could be multiple nodes with a softmax activation.
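To make these components concrete, here is a minimal PyTorch sketch of such an MLP for regression. This is an illustrative stand-alone example, not Mambular's internal implementation; the hidden sizes and dropout rate simply mirror the configuration used later in this article.

import torch.nn as nn

# A plain MLP for tabular data: preprocessed features go in,
# a single regression output comes out.
def make_mlp(n_features: int, hidden_sizes=(256, 256, 128), dropout=0.3) -> nn.Sequential:
    layers, in_dim = [], n_features
    for h in hidden_sizes:
        # weighted sum + bias, non-linear activation, then dropout
        layers += [nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(dropout)]
        in_dim = h
    layers.append(nn.Linear(in_dim, 1))  # linear output node for regression
    return nn.Sequential(*layers)

mlp = make_mlp(n_features=8)
print(mlp)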

Embeddings for Tabular Data

Another important development in tabular DL in recent years is feature embeddings. Why use embeddings? Embeddings allow the model to represent data in a dense and continuous numerical space. For categorical features, embeddings assign a unique vector representation to each category, enabling the model to handle them numerically while capturing relationships between categories. For example:

  • The categories “red,” “blue,” and “green” might be mapped to embedding vectors like [0.1, 0.3], [0.5, 0.2], and [0.2, 0.4].

Embeddings for Numerical Features: While embeddings are commonly used for categorical data, they can also be applied to numerical features. Numerical embeddings can be implemented using techniques such as:

  • Simple Linear Embeddings: A linear transformation of numerical values, optionally followed by an activation function to capture non-linear relationships.

  • Periodic Linear ReLU (PLR): Each numerical value x is first mapped to periodic features fᵢ(x) = concat[sin(v), cos(v)] with v = [2π c₁ x, …, 2π cₖ x], where c₁, …, cₖ are learned frequency coefficients; a linear layer and a ReLU activation are then applied on top. The trigonometric mapping lets the model capture non-linear structure in a single scalar feature.

These embeddings help the MLP effectively utilize numerical data by learning representations that align with the underlying patterns in the data.
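As a small illustration of the periodic part of PLR embeddings, the snippet below computes concat[sin(v), cos(v)] for a batch of scalar values. This is a minimal sketch with randomly initialized frequency coefficients c; in a real model these coefficients are learned, and a linear layer plus ReLU follows.

import math
import torch

def periodic_embedding(x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    # x: (batch,) scalar feature values, c: (k,) frequency coefficients
    v = 2 * math.pi * x[:, None] * c[None, :]                 # (batch, k)
    return torch.cat([torch.sin(v), torch.cos(v)], dim=-1)    # (batch, 2k)

x = torch.tensor([0.1, 0.5, 2.3])
c = torch.randn(4)   # learned parameters in practice
print(periodic_embedding(x, c).shape)  # torch.Size([3, 8])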

Model Fitting

Now that we have outlined the concept of a simple MLP for tabular problems, let’s move on to model fitting. The dataset and packages are publicly available, so everything can be copied and run locally or in a Google Colab notebook, provided the necessary packages are installed. We will start by installing the mambular package, loading the dataset, and fitting XGBoost as a baseline to compare the simplicity of training a GBDT model with the ease of fitting a tabular deep learning model using Mambular.

Install Mambular
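Mambular is available on PyPI, so a single pip command is enough (xgboost and scikit-learn are installed as well for the baseline and the examples below):

pip install mambular xgboost scikit-learn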

Prepare the Data
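The original data-loading cell is not reproduced here. As an illustrative stand-in, the sketch below uses scikit-learn's public California Housing regression dataset and creates a train/test split; any tabular DataFrame with a numeric target would work the same way.

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Illustrative dataset: replace with the tabular data of your choice.
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)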

First, let's fit an XGBoost model. XGBoost is a well-known and widely used model for tabular problems, and it often outperforms tabular DL models on many tasks.

XGBoost
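A minimal baseline fit with XGBoost's scikit-learn interface, evaluated with mean squared error (hyperparameters are left at their defaults for simplicity):

from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)

mse_xgb = mean_squared_error(y_test, xgb.predict(X_test))
print(f"XGBoost test MSE: {mse_xgb:.4f}")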

Train an MLP Model with Mambular

Now let’s compare that to fitting an MLP using Mambular. As a start, let’s not use embeddings and simply standardize the numerical features. We use the default ReLU activation functions, set dropout to 0.3, and use hidden layer sizes of [256, 256, 128].
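A sketch of that configuration with Mambular's scikit-learn-style interface is shown below. MLPRegressor and the fit/predict API come from the package; the exact keyword names (layer_sizes, dropout, numerical_preprocessing, max_epochs) follow Mambular's config conventions but should be checked against the installed version and treated as assumptions here.

from mambular.models import MLPRegressor
from sklearn.metrics import mean_squared_error

# MLP with standardized numerical features and no feature embeddings.
# Keyword names are assumptions based on Mambular's default MLP config.
mlp_model = MLPRegressor(
    layer_sizes=[256, 256, 128],
    dropout=0.3,
    numerical_preprocessing="standardization",
)
mlp_model.fit(X_train, y_train, max_epochs=200)

mse_mlp = mean_squared_error(y_test, mlp_model.predict(X_test))
print(f"MLP test MSE: {mse_mlp:.4f}")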

Already a good fit, and just as easy to set up as an XGBoost model. Let’s try out some different parameters to see how they affect performance. First, let’s leverage PLR embeddings and keep the other model parameters identical to the ones before:
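A sketch of the same model with PLR embeddings switched on. The embedding-related keyword names (use_embeddings, embedding_type) are assumptions about the config and may differ slightly in your Mambular version.

# Same MLP, but numerical features now pass through PLR embeddings.
mlp_plr = MLPRegressor(
    layer_sizes=[256, 256, 128],
    dropout=0.3,
    numerical_preprocessing="standardization",
    use_embeddings=True,       # assumed flag enabling feature embeddings
    embedding_type="plr",      # assumed name for the PLR embedding option
)
mlp_plr.fit(X_train, y_train, max_epochs=200)
print(mean_squared_error(y_test, mlp_plr.predict(X_test)))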

A small improvement compared to not using embeddings. Finally, let’s try different preprocessing while still using embeddings. We will use PLE (Piecewise Linear Encoding) for preprocessing together with a simple linear embedding layer. How PLE works will be explained in more detail in the next part of our series:
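A sketch of this third configuration, switching the numerical preprocessing to PLE and the embedding layer to a simple linear one (same caveat as above on the exact keyword names):

# PLE preprocessing for numerical features plus simple linear embeddings.
mlp_ple = MLPRegressor(
    layer_sizes=[256, 256, 128],
    dropout=0.3,
    numerical_preprocessing="ple",   # assumed name for Piecewise Linear Encoding
    use_embeddings=True,
    embedding_type="linear",         # assumed name for the linear embedding layer
)
mlp_ple.fit(X_train, y_train, max_epochs=200)
print(mean_squared_error(y_test, mlp_ple.predict(X_test)))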

Again, a small improvement :) However, it is not quite at the level of XGBoost. Try playing around with some more parameters to improve performance, or even use the built-in Bayesian HPO from Mambular.

Below we have summarized the results and added a simple linear regression and a random forest for comparison. Throughout this series, we will add the results of each introduced method to this table:

Conclusion

Tabular DL has long been behind GBDTs in performance. More importantly, however, it has been far behind in terms of usability. Mambular’s automatic preprocessing makes it easy to implement deep learning models for tabular data. This article introduced the MLP model, but stay tuned for future installments in this series, where we’ll dive into other powerful models. With Mambular, you can unlock the full potential of your tabular data for deep learning tasks.

GitHub Repo: https://github.com/basf/mamba-tabular

How to Cite This Paper

If you’d like to reference this work in your own research or writing, you can use the following citation:

BibTeX:

@article{thielmann2024mambular,
  title   = {Mambular: A Sequential Model for Tabular Deep Learning},
  author  = {Thielmann, A. F. and Kumar, M. and Weisser, C. and Reuter, A. and S{\"a}fken, B. and Samiee, S.},
  journal = {arXiv preprint arXiv:2408.06291},
  year    = {2024}
}

Plain Text: Thielmann, A. F., Kumar, M., Weisser, C., Reuter, A., Säfken, B., & Samiee, S. (2024). Mambular: A Sequential Model for Tabular Deep Learning. arXiv. Available at: https://arxiv.org/abs/2408.06291.
