Detecting Medical Fraud (Part 1) — Data Acquisition & Preprocessing
View original article on Medium
See the full project on GitHub
Background
Millions of Americans rely on federally subsidized healthcare to afford medical procedures, medication, and assistive devices. According to the Centers for Medicare & Medicaid Services (CMS), the United States government spent more than a trillion dollars on its healthcare system in 2018, and spending is projected to grow each year. Instead of billing patients directly, physicians and medical institutions are paid by insurance companies and government funds. With a tremendous amount of money flowing through the system, it is inevitable that some will try to exploit it. Fraudulent schemes range from billing for services that were never provided to organized crime infiltrating the Medicare program. The Federal Bureau of Investigation estimates that more than ten percent of total health spending consists of fraudulent billing.
Data
To address the issue of detecting fraudulent entities, CMS released large, multidimensional datasets for the different parts of its Medicare/Medicaid program. For this experiment, only the Provider Utilization and Payment Data (Part B) is used to understand and detect healthcare fraud. Part B helps cover doctors' services and outpatient care, and it is where a high percentage of fraud occurs, since overbilling is one of the most common forms of fraudulent activity. Each observation in the dataset corresponds to a procedure performed by a specific physician, and each physician is identified by a unique National Provider Identifier (NPI) code. There are 8,910,479 observations in total.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

file = "part_b_utilization.txt"  # placeholder path to the tab-delimited Part B file
CMS = pd.read_csv(file, delimiter="\t")
CMS = CMS.iloc[1:]  # drop the first record (a notice row in the raw file, not real data)
CMS['npi'] = pd.to_numeric(CMS['npi'], errors='coerce')  # the notice row forces object dtype
CMS.dropna(subset=['npi'], inplace=True)
# keep rows with a valid NPI and exclude drug-related HCPCS codes
CMS = CMS.loc[(CMS.npi != 0) & (CMS['hcpcs_drug_indicator'] == 'N')]
To know which physicians and organizations are fraudulent, the Department of Health and Human Services’ Office of the Inspector General (OIG) established the List of Excluded Individuals and Entities (LEIE), which is maintained monthly. The OIG is required by law to exclude anyone from federally funded healthcare programs if they are convicted of Medicare/Medicaid fraud. The LEIE is a dataset where each person has a set of features including their unique NPI code.
LEIE = pd.read_csv("LEIE.csv")
NPI = list(LEIE.NPI)

# label a row as fraudulent if its physician appears on the exclusion list
CMS["fraud"] = 0
CMS.loc[CMS['npi'].isin(NPI), 'fraud'] = 1
CMS.sort_values(by=['fraud'], inplace=True)
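As a quick sanity check (a small addition for illustration, not part of the original pipeline), we can measure how rare the positive label is; this is the imbalance discussed in the next section:

# How imbalanced is the labeled data?
n_fraud = int(CMS["fraud"].sum())
print(f"Fraudulent rows: {n_fraud} of {len(CMS)} "
      f"({100 * n_fraud / len(CMS):.3f}%)")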
The figures above do not show any trend or pattern that clearly separates the fraudulent cases from the normal ones; it is difficult to find the red dots in the sea of green ones. This is because medical fraud comes in a variety of forms, and many fraudulent cases go unnoticed because they closely resemble normal procedures.
Anomaly Detection
The scatterplots above plot an equal number of fraudulent and normal cases, but in reality only 0.017% of the entire dataset is fraudulent. Therefore, this is not a simple binary classification problem but an anomaly detection one. An anomaly detection (AD) problem arises when we need to detect extremely rare events within highly imbalanced data. Essentially, it's like trying to find a tiny needle in a really large haystack. Common AD techniques include Autoencoders, K-Nearest Neighbors, Local Outlier Factor, and Isolation Forests.
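To make the idea concrete, here is a minimal sketch (not from the original project) of one of these techniques, an Isolation Forest, applied to a toy imbalanced dataset; the data and the contamination value are assumptions chosen purely for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 10,000 "normal" points around the origin, 5 far-away "anomalies"
normal = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))
anomalies = rng.normal(loc=6.0, scale=0.5, size=(5, 2))
data = np.vstack([normal, anomalies])

# contamination is the assumed fraction of anomalies in the data
iso = IsolationForest(contamination=0.0005, random_state=42)
labels = iso.fit_predict(data)  # -1 = anomaly, 1 = normal
print("flagged as anomalous:", (labels == -1).sum())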
Preprocessing
After understanding the characteristics of the dataset using exploratory data analysis, it is time to prepare it for the modeling phase. The dataset needs to be transformed and split to create the inputs for the detection model. Since this is an AD problem, there are certain criteria to keep in mind during the preprocessing phase.
Data Encoding
There are many variables that contain a finite set of labeled values. These categorical variables need to be converted into numerical values before the model can accept them. The Label Encoder and One-Hot Encoder help us achieve that goal. Without going into too much detail, the Label Encoder turns each labeled value into an integer, and the One-Hot Encoder expands each variable into a binarized column for each unique value. As a result, the overall feature space grows by the number of unique labeled values, an expansion that the dataset's large number of observations can support.
# integer-encode the categorical columns
obj_cols = CMS.select_dtypes(include='object').columns  # assumption: every object column is categorical
le = LabelEncoder()
CMS[obj_cols] = CMS[obj_cols].apply(lambda col: le.fit_transform(col))

# separate normal and fraudulent observations; drop the label so it cannot leak into the features
X = CMS.loc[CMS.fraud == 0].drop(columns='fraud').values
Xf = CMS.loc[CMS.fraud == 1].drop(columns='fraud').values
y = CMS['fraud'].values
del CMS  # free memory; the raw frame is no longer needed

# one-hot encode the categorical column at index 1
# (categorical_features requires scikit-learn < 0.22; see the note below)
ohe = OneHotEncoder(categorical_features=[1], sparse=False)
X = ohe.fit_transform(X)
Xf = ohe.transform(Xf)
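Note that the categorical_features argument was removed in scikit-learn 0.22, so the snippet above only runs on older versions. A minimal sketch of the modern equivalent (scikit-learn 1.2 or later), assuming the categorical column sits at index 1 as in the original call:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 1, pass every other column through unchanged
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(sparse_output=False,
                              handle_unknown="ignore"), [1])],
    remainder="passthrough")
X = ct.fit_transform(X)
Xf = ct.transform(Xf)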
Train/Test Split
After the data has been encoded, it is time to split it into train and test sets. Since the end objective is to detect an extremely small group of fraudulent physicians, the model needs to learn the complexities and fundamental components that represent normal, non-fraudulent physicians. The training set therefore consists of only normal physicians, while the testing set contains all of the fraudulent physicians shuffled together with the remaining normal ones.
# train on normal physicians only
X_train, X_test_norm = train_test_split(X, test_size=0.3)
y_train = np.zeros(len(X_train))

# the test set mixes the held-out normal physicians with every fraudulent one
X_test = np.concatenate((X_test_norm, Xf), axis=0)
y_test_norm = list(np.zeros(len(X_test_norm)))
yf_test = list(np.ones(len(Xf)))
y_test = np.array(y_test_norm + yf_test)
X_test, y_test = shuffle(X_test, y_test, random_state=0)
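A short sanity check (illustrative, not in the original post) confirms that the split behaves as described:

# the training labels should be all zeros; the test set mixes both classes
print("train:", X_train.shape, " test:", X_test.shape)
print("fraud fraction in test set:", y_test.mean())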
Standardization
The final preprocessing step is to standardize the data: every variable is rescaled to have a mean of 0 and a standard deviation of 1 (z = (x − μ) / σ). This step is important when the input features have large differences between their ranges, which are cumbersome for many models, especially distance-based ones, because features with larger scales dominate the distance calculations. Keeping everything on the same scale also helps models train faster and more reliably.
# fit the scaler on the training set only, then apply it to both sets
stdsc = StandardScaler()
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)
Part 2: What to Expect
Now that the dataset is cleaned and preprocessed, it is time to build our anomaly detection model. The second part of this two-part project overview discusses the model I chose to detect fraudulent physicians and the results it achieved on the train and test sets.
References
[1] U.S. Government, U.S. Centers for Medicare & Medicaid Services. The Official U.S. Government Site for Medicare
[2] List of Excluded Individuals/Entities, U.S. Government, U.S. Department of Health and Human Services, Office of the Inspector General
[3] IMAGE: Detecting Value-Added Tax Evasion by Business Entities of Kazakhstan, Assylbekov, Zhenisbek & Melnykov, Igor & Bekishev, Rustam & Baltabayeva, Assel & Bissengaliyeva, Dariya & Mamlin, Eldar. (2016)