Detecting Medical Fraud (Part 1) — Data Acquisition & Preprocessing
View original article on Medium
See the full project on GitHub
Background
Millions of Americans rely on federally subsidized healthcare to afford medical procedures, medication, and assistive devices. According to the Centers for Medicare & Medicaid Services (CMS), the United States government spent more than a trillion dollars on its healthcare system in 2018, and spending is projected to grow each year. Instead of billing patients directly, physicians and medical institutions are paid by insurance companies and government funds. With a tremendous amount of money flowing through the system, it is inevitable that some will try to exploit it. Fraudulent schemes range from billing for services that were never provided to organized crime infiltrating the Medicare program. The Federal Bureau of Investigation estimates that more than ten percent of total health spending consists of fraudulent billing.
Data
To address the issue of detecting fraudulent entities, CMS released large, multidimensional datasets for the different parts of its Medicare/Medicaid program. For this experiment, only the Provider Utilization and Payment Data (Part B) is used to understand and detect healthcare fraud. Part B helps cover doctors' services and outpatient care, and it is where a high percentage of fraud occurs, since overbilling is one of the most common forms of fraudulent activity. Each observation in the dataset corresponds to a procedure performed by a specific physician, and each physician is identified by a unique National Provider Identifier (NPI) code. There are 8,910,479 observations in total.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

file = "part_b_utilization.txt"  # placeholder path to the tab-delimited Part B file
CMS = pd.read_csv(file, delimiter="\t")
CMS = CMS.iloc[1:]  # drop the first record (a notice row in the raw file, not real data)
CMS['npi'] = pd.to_numeric(CMS['npi'], errors='coerce')  # the notice row forces object dtype
CMS.dropna(subset=['npi'], inplace=True)
# keep rows with a valid NPI and exclude drug-related HCPCS codes
CMS = CMS.loc[(CMS.npi != 0) & (CMS['hcpcs_drug_indicator'] == 'N')]
To know which physicians and organizations are fraudulent, the Department of Health and Human Services’ Office of the Inspector General (OIG) established the List of Excluded Individuals and Entities (LEIE), which is maintained monthly. The OIG is required by law to exclude anyone from federally funded healthcare programs if they are convicted of Medicare/Medicaid fraud. The LEIE is a dataset where each person has a set of features including their unique NPI code.
LEIE = pd.read_csv("LEIE.csv")
NPI = list(LEIE.NPI)

# label a row as fraudulent if its physician appears on the exclusion list
CMS["fraud"] = 0
CMS.loc[CMS['npi'].isin(NPI), 'fraud'] = 1
CMS.sort_values(by=['fraud'], inplace=True)
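As a quick sanity check (a small addition for illustration, not part of the original pipeline), we can measure how rare the positive label is; this is the imbalance discussed in the next section:

# How imbalanced is the labeled data?
n_fraud = int(CMS["fraud"].sum())
print(f"Fraudulent rows: {n_fraud} of {len(CMS)} "
      f"({100 * n_fraud / len(CMS):.3f}%)")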
The figures above do not show any trend or pattern that clearly separates the fraudulent cases from the normal ones; it is difficult to find the red dots in the sea of green ones. This is because medical fraud comes in a variety of forms, and many fraudulent cases go unnoticed because they closely resemble normal procedures.
Anomaly Detection
The scatterplots above plot an equal number of fraudulent and normal cases, but in reality only 0.017% of the entire dataset is fraudulent. Therefore, this is not a simple binary classification problem but an anomaly detection one. An anomaly detection (AD) problem arises when we need to detect extremely rare events within highly imbalanced data. Essentially, it's like trying to find a tiny needle in a really large haystack. Common AD techniques include Autoencoders, K-Nearest Neighbors, Local Outlier Factor, and Isolation Forests.
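To make the idea concrete, here is a minimal sketch (not from the original project) of one of these techniques, an Isolation Forest, applied to a toy imbalanced dataset; the data and the contamination value are assumptions chosen purely for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 10,000 "normal" points around the origin, 5 far-away "anomalies"
normal = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))
anomalies = rng.normal(loc=6.0, scale=0.5, size=(5, 2))
data = np.vstack([normal, anomalies])

# contamination is the assumed fraction of anomalies in the data
iso = IsolationForest(contamination=0.0005, random_state=42)
labels = iso.fit_predict(data)  # -1 = anomaly, 1 = normal
print("flagged as anomalous:", (labels == -1).sum())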
Preprocessing
After understanding the characteristics of the dataset using exploratory data analysis, it is time to prepare it for the modeling phase. The dataset needs to be transformed and split to create the inputs for the detection model. Since this is an AD problem, there are certain criteria to keep in mind during the preprocessing phase.
Data Encoding
There are many variables that contain a finite set of labeled values. These categorical variables need to be converted into numerical values before the model can accept them. The Label Encoder and One-Hot Encoder help us achieve that goal. Without going into too much detail, the Label Encoder turns each labeled value into an integer, and the One-Hot Encoder expands each variable into a binarized column for each unique value. As a result, the overall feature space grows by the number of unique labeled values, an expansion that the dataset's large number of observations can support.
# integer-encode the categorical columns
obj_cols = CMS.select_dtypes(include='object').columns  # assumption: every object column is categorical
le = LabelEncoder()
CMS[obj_cols] = CMS[obj_cols].apply(lambda col: le.fit_transform(col))

# separate normal and fraudulent observations; drop the label so it cannot leak into the features
X = CMS.loc[CMS.fraud == 0].drop(columns='fraud').values
Xf = CMS.loc[CMS.fraud == 1].drop(columns='fraud').values
y = CMS['fraud'].values
del CMS  # free memory; the raw frame is no longer needed

# one-hot encode the categorical column at index 1
# (categorical_features requires scikit-learn < 0.22; see the note below)
ohe = OneHotEncoder(categorical_features=[1], sparse=False)
X = ohe.fit_transform(X)
Xf = ohe.transform(Xf)
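Note that the categorical_features argument was removed in scikit-learn 0.22, so the snippet above only runs on older versions. A minimal sketch of the modern equivalent (scikit-learn 1.2 or later), assuming the categorical column sits at index 1 as in the original call:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 1, pass every other column through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(sparse_output=False,
                              handle_unknown="ignore"), [1])],
    remainder="passthrough")
X = ct.fit_transform(X)
Xf = ct.transform(Xf)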
Train/Test Split
After the data has been encoded, it is time to split it into train and test sets. Since the end objective is to detect an extremely small group of fraudulent physicians, the model needs to learn the complexities and fundamental components that represent normal, non-fraudulent physicians. The training set therefore consists of only normal physicians, while the testing set contains all of the fraudulent physicians shuffled together with the remaining normal ones.
# train on normal physicians only
X_train, X_test_norm = train_test_split(X, test_size=0.3)
y_train = np.zeros(len(X_train))

# the test set mixes the held-out normal physicians with every fraudulent one
X_test = np.concatenate((X_test_norm, Xf), axis=0)
y_test_norm = list(np.zeros(len(X_test_norm)))
yf_test = list(np.ones(len(Xf)))
y_test = np.array(y_test_norm + yf_test)
X_test, y_test = shuffle(X_test, y_test, random_state=0)
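A short sanity check (illustrative, not in the original post) confirms that the split behaves as described:

# the training labels should be all zeros; the test set mixes both classes
print("train:", X_train.shape, " test:", X_test.shape)
print("fraud fraction in test set:", y_test.mean())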
Standardization
The final preprocessing step is to standardize the data: every variable is rescaled to have a mean of 0 and a standard deviation of 1 (z = (x − μ) / σ). This step is important when the input features have large differences between their ranges, which are cumbersome for many models, especially distance-based ones, because features with larger scales dominate the distance calculations. Keeping everything on the same scale also helps models train faster and more reliably.
# fit the scaler on the training set only, then apply it to both sets
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)
Part 2: What to Expect
Now that the dataset is cleaned and preprocessed, it is time to build our anomaly detection model. The second part of this two-part project overview discusses the model I chose to detect fraudulent physicians and the results it achieved on the train and test sets.
References
[1] U.S. Government, U.S. Centers for Medicare & Medicaid Services. The Official U.S. Government Site for Medicare
[2] List of Excluded Individuals/Entities, U.S. Government, U.S. Department of Health and Human Services, Office of the Inspector General
[3] IMAGE: Detecting Value-Added Tax Evasion by Business Entities of Kazakhstan, Assylbekov, Zhenisbek & Melnykov, Igor & Bekishev, Rustam & Baltabayeva, Assel & Bissengaliyeva, Dariya & Mamlin, Eldar. (2016)