Learning on the Border: Active Learning in Imbalanced Data Classification Seyda Ertekin, Jian Huang, Leon Bottou, C. Lee Giles, CIKM'07 Presenter: Ping-Hua Yang
Abstract This paper is concerned with the class imbalance problem, which is known to hinder the learning performance of classification algorithms. The paper demonstrates that active learning is capable of addressing the class imbalance problem by providing the learner with more balanced classes.
Outline Introduction Related work Methodology Performance metrics Datasets Experiments and empirical evaluation Conclusions
Introduction A training dataset is called imbalanced if at least one of its classes is represented by significantly fewer instances than the others. Examples of applications that may exhibit the class imbalance problem: predicting pre-term births, identifying fraudulent credit card transactions, text categorization, classification of protein databases, detecting certain objects in satellite images.
Introduction In classification tasks it is generally more important to correctly classify the minority class instances, since mispredicting a rare event can result in more serious consequences. However, in classification problems with imbalanced data, the minority class examples are more likely to be misclassified than the majority class examples, due to the design principles of machine learning algorithms. This paper proposes a framework with high prediction performance to overcome this serious data mining problem. The paper proposes several methods: using an active learning strategy to deal with the class imbalance problem, and an SVM-based active learning selection strategy.
Introduction A common recent research direction for overcoming the class imbalance problem is to resample the original training dataset to create more balanced classes.
Related work Assign distinct misclassification costs ([P. Domingos, 1999], [M. Pazzani, C. Merz, P. Murphy, K. Ali, T. Hume, C. Brunk, 1994]): the misclassification penalty for the positive class is assigned a higher value than that for the negative class. This method requires tuning to come up with good penalty parameters for the misclassified examples. Resample the original training dataset ([N. V. Chawla, 2002], [N. Japkowicz, 1995], [M. Kubat, 1997], [C. X. Ling, 1998]), either by over-sampling the minority class or under-sampling the majority class. Under-sampling may discard potentially useful data; over-sampling may suffer from over-fitting, and the increased number of samples lengthens the training time of the learning process.
Related work Use recognition-based, instead of discrimination-based, inductive learning ([N. Japkowicz, 1995], [B. Raskutti, 2004]): these methods attempt to measure the amount of similarity between a query object and the target class. Their major drawback is the need to tune the similarity threshold. SMOTE, the synthetic minority over-sampling technique ([N. V. Chawla, 2002]): the minority class is over-sampled by creating synthetic examples rather than by over-sampling with replacement. Preprocessing the data with SMOTE may lead to improved prediction performance, but it brings more computational cost and an increased number of training examples.
Methodology Active learning has access to a vast pool of unlabeled examples, and it tries to make a clever choice in selecting the most informative example to obtain its label. The strategy of selecting instances within the margin addresses imbalanced dataset classification very well.
Methodology
Support Vector Machines SVMs are well known for their strong theoretical foundations, generalization performance, and ability to handle high-dimensional data. Using the training set, an SVM builds an optimum hyper-plane, obtained by minimizing the objective function below (equation 1). w: the normal vector of the hyper-plane, y_i: labels, Φ(·): mapping from input space to feature space, b: offset, ξ_i: slack variables.
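The equation image on this slide did not survive extraction; the objective it refers to is the standard soft-margin SVM primal (a reconstruction, following the slide's notation):

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\lVert \mathbf{w}\rVert^{2} + C\sum_{i=1}^{N}\xi_{i}
\quad \text{s.t.} \quad y_{i}\big(\mathbf{w}\cdot\Phi(x_{i})+b\big) \ge 1-\xi_{i},\quad \xi_{i}\ge 0 \tag{1}
\]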
Support Vector Machines The dual representation of equation 1, where K(x_i, x_j) = Φ(x_i)·Φ(x_j) and the α_i are Lagrange multipliers. After solving the QP problem, the normal vector w of the hyper-plane can be represented in terms of the α_i, as reconstructed below (equations 3 and 5).
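Again reconstructing the missing equation images, the standard dual of the soft-margin problem and the resulting expansion of w are:

\[
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}K(x_{i},x_{j})
\quad \text{s.t.} \quad 0\le\alpha_{i}\le C,\quad \sum_{i=1}^{N}\alpha_{i}y_{i}=0 \tag{3}
\]

\[
\mathbf{w} = \sum_{i=1}^{N}\alpha_{i}y_{i}\Phi(x_{i}) \tag{5}
\]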
Support Vector Machines
Active Learning In equation 5, only the support vectors have an effect on the SVM solution: if the SVM is retrained on a new set of data consisting only of those support vectors, the learner will find the same hyper-plane. This paper focuses on a form of selection strategy called SVM-based active learning, in which the most informative instance is the one closest to the hyper-plane. More complex selection methods exist that account for a possibly non-symmetric version space.
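In kernel form this selection rule can be written as follows (a standard formulation consistent with the dual expansion above; since ||w|| is constant across candidates, minimizing |f(x)| suffices):

\[
f(x) = \sum_{i}\alpha_{i}y_{i}K(x_{i},x) + b,
\qquad
x^{*} = \arg\min_{x \in \mathcal{U}} \frac{\lvert f(x)\rvert}{\lVert \mathbf{w}\rVert}
\]

where \(\mathcal{U}\) denotes the pool of unlabeled candidate instances.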
Active Learning with Small Pools The basic working principle of SVM active learning: learn an SVM on the existing training data, select the instance closest to the hyper-plane, add the newly selected instance to the training set, and train again. In classical active learning, the search for the most informative instance is performed over the entire dataset; for large datasets, searching the entire training set is a very time-consuming and computationally expensive task. The "59 trick" avoids a full search through the entire dataset and instead locates an approximately most informative sample.
Active Learning with Small Pools The selection method picks L (L << number of training instances) random training samples in each iteration and selects the best among them: pick a random subset X_L, L << N, then select the closest sample x_i from X_L, based on the condition that x_i is among the top p% closest instances in X_N with probability (1 − η). For p = 0.05 and η = 0.05, L = 59 samples suffice, since 0.95^59 ≈ 0.05; hence the "59 trick" (see the sketch below).
Active Learning with Small Pools
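A minimal sketch of this small-pool selection (illustrative names only; svm_decision stands for any trained decision function, e.g. scikit-learn's decision_function, and X_unlabeled is assumed to be a NumPy array):

```python
import numpy as np

def select_from_small_pool(svm_decision, X_unlabeled, pool_size=59, rng=None):
    """Pick the candidate closest to the hyper-plane from a small random pool.

    With pool_size = 59 the winner is among the top 5% closest instances of
    the full unlabeled set with ~95% probability, since 0.95**59 ~= 0.048.
    """
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(X_unlabeled), size=min(pool_size, len(X_unlabeled)),
                     replace=False)
    # |f(x)| is proportional to the distance to the hyper-plane, so the
    # smallest value marks the most informative candidate in the pool.
    distances = np.abs(svm_decision(X_unlabeled[idx]))
    return idx[np.argmin(distances)]
```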
Online SVM for Active Learning LASVM is an online kernel classifier which relies on the traditional soft-margin SVM formulation but requires fewer computational resources. LASVM's model is continually modified as it processes training instances one by one: each LASVM iteration receives a fresh training example and tries to optimize the dual cost function in equation (3) using feasible-direction searches. A new informative instance selected by active learning can therefore be integrated into the existing model without retraining on all the samples repeatedly.
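For concreteness, below is a minimal end-to-end loop built around this selection step. It is a sketch, not the paper's implementation: it retrains a scikit-learn SVC at every iteration, which is exactly the cost LASVM's online updates avoid; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def active_learning_loop(X, y, n_seed=20, n_queries=200, pool_size=59, seed=0):
    """SVM active learning with small-pool selection (sketch).

    Assumes X is a NumPy array and y holds binary labels; the random seed
    set must contain both classes for the first fit to succeed.
    """
    rng = np.random.default_rng(seed)
    labeled = list(rng.choice(len(X), size=n_seed, replace=False))
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    model = SVC(kernel="rbf", C=1.0)
    for _ in range(n_queries):
        model.fit(X[labeled], y[labeled])
        # Small pool: search 59 random candidates instead of the full set.
        pool = rng.choice(unlabeled, size=min(pool_size, len(unlabeled)),
                          replace=False)
        pick = int(pool[np.argmin(np.abs(model.decision_function(X[pool])))])
        labeled.append(pick)        # query the label of the chosen instance
        unlabeled.remove(pick)
    return model.fit(X[labeled], y[labeled])
```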
Active Learning with Early Stopping A theoretically sound point to stop training is when the examples in the margin are exhausted. To check whether there are still unseen training instances in the margin, the distance of the newly selected instance is compared to that of the support vectors of the current model; if the new instance lies farther from the hyper-plane than the margin support vectors, the margin is exhausted. A practical implementation of this idea is to track the number of support vectors during the active learning training process (see the sketch below).
Active Learning with Early Stopping
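A sketch of this margin-exhaustion check, assuming the usual SVM convention that margin support vectors satisfy |f(x)| = 1 (names are illustrative):

```python
import numpy as np

def margin_exhausted(model, X_candidates):
    """Early-stopping check (sketch): margin support vectors satisfy
    |f(x)| = 1, so if even the closest remaining candidate has |f(x)| >= 1,
    no unseen instance lies inside the margin and training can stop."""
    return np.abs(model.decision_function(X_candidates)).min() >= 1.0
```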
Performance Metrics Classification accuracy is not a good metric for evaluating classifiers in applications with the class imbalance problem: in the non-separable case, if the misclassification penalty C is very small, the SVM learner simply tends to classify every example as negative. G-means: the geometric mean of sensitivity and specificity, g = sqrt(sensitivity × specificity), where sensitivity = TruePos./(TruePos. + FalseNeg.) and specificity = TrueNeg./(TrueNeg. + FalsePos.). Receiver Operating Characteristic (ROC) curve: a plot of the true positive rate against the false positive rate as the decision threshold is changed.
Performance Metrics Area under the ROC Curve (AUC): a numerical measure of a model's discrimination performance; it shows how successfully the model separates the positive and negative classes. Precision-Recall Break-Even Point (PRBEP): the accuracy of the positive class at the threshold where precision equals recall.
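These metrics can be computed from SVM decision values, for example with scikit-learn; the sketch below assumes binary labels in {0, 1} and y_score taken from decision_function.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_recall_curve

def imbalance_metrics(y_true, y_score, threshold=0.0):
    """G-means, AUC and PRBEP from decision values (sketch)."""
    y_pred = (y_score > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)                  # true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    prbep = precision[np.argmin(np.abs(precision - recall))]
    return {"g_means": np.sqrt(sensitivity * specificity),
            "auc": roc_auc_score(y_true, y_score),
            "prbep": float(prbep)}
```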
Datasets
Experiments and Empirical evaluation (figure slides)
Conclusions The results of this paper offer a better understanding of the effect of active learning on imbalanced datasets. By focusing the learning on the instances around the classification boundary, more balanced class distributions can be provided to the learner in the earlier steps of learning.
