Introduction
Random forest is one of the most successful ensemble methods, with performance on par
with boosting and support vector machines. The method is fast, robust to noise, resists
overfitting, and offers ways to interpret and visualize its output. We study several ways to
increase the strength of the individual trees in the forest or to reduce the correlation between
them. Using several attribute evaluation methods instead of a single one produces promising
results, and in most comparable cases, margin-weighted voting instead of ordinary voting yields
improvements that are statistically significant across multiple data sets.
Machine learning (ML) is becoming increasingly important, and with the rapid growth of
medical data it has become a key technology in healthcare. However, because healthcare data
are complex, incomplete, and high-dimensional, early and accurate detection of disease remains
a challenge. Data preprocessing is an essential step in machine learning: its purpose is to
supply clean, well-structured data that improves prediction accuracy. This study summarizes
popular data preprocessing steps based on their usage, popularity, and the literature. The
selected preprocessing methods are then applied to the raw data before the classifier uses it for
prediction.
Corporate data mining faces the challenge of discovering systematic knowledge in big
data streams to support management decision-making. Although research in operations
research, direct marketing, and machine learning focuses on the analysis and design of data
mining algorithms, the interaction between data mining and the preceding stage of data
preprocessing has not been studied in detail. This paper studies how different preprocessing
techniques for attribute scaling, sampling, categorical coding, and continuous attribute coding
affect the performance of decision trees, neural networks, and support vector machines.
Problem statement
This work uses machine learning to predict breast cancer cases from patient treatment
history and health data, using the Wisconsin Breast Cancer (Diagnostic) data set. Breast cancer
is among the leading causes of cancer death in women, and breast cancer risk prediction can
inform screening and preventive measures. Previous work found that adding inputs to the
widely used Gail model can improve its ability to predict breast cancer risk. However, those
models use simple statistical architectures, and the additional inputs come from expensive
and/or invasive procedures. In contrast, we developed machine learning models that use readily
accessible personal health data to predict five-year breast cancer risk: one model using only the
Gail model inputs, and one using the Gail model inputs plus other personal health data related
to breast cancer risk.
The basic goals of cancer prediction and prognosis differ from those of cancer detection
and diagnosis. Cancer prediction/prognosis concerns three key questions: 1) predicting cancer
susceptibility (i.e., risk assessment); 2) predicting cancer recurrence; and 3) predicting cancer
survival. In the first case, one tries to predict the likelihood of developing a certain type of
cancer before the disease occurs. In the second case, one tries to predict the likelihood of the
cancer returning after the disease has apparently been eliminated.
In the third case, one tries to predict the outcome (life expectancy, survival, progression,
tumor drug sensitivity) after the disease is diagnosed. In the latter two cases, the success of the
prognostic prediction obviously depends in part on the success or quality of the diagnosis.
However, a prognosis can only be made after a medical diagnosis, and prognostic prediction
must consider more than the diagnosis alone.
Using multi-factor analysis of variance over various performance indicators and method
parameters, a real-world direct-marketing data set is used to evaluate the impact of different
preprocessing options. Our case-based analysis provides empirical evidence that data
preprocessing has a significant impact on prediction accuracy, and certain schemes prove
inferior to competing ones. In addition, it was found that: (1) the selected methods are as
sensitive to different data representations as they are to method parameterization, which
indicates the potential for improving performance through effective preprocessing; (2) the
influence of a preprocessing scheme differs by method, indicating that method-specific "best
practice" settings can promote excellent results for a given method; and (3) the sensitivity of an
algorithm to preprocessing is therefore an important criterion for method evaluation and
selection, which in predictive data mining must be considered alongside the traditional criteria
of predictive ability and computational efficiency.
To maximize the prediction accuracy of data mining, research in management science
and machine learning is mainly devoted to building competitive classifiers and tuning algorithm
parameters effectively. Classification algorithms are usually tested in extensive benchmark
experiments on pre-processed data sets to evaluate their prediction accuracy and computational
efficiency. In contrast, research in data preprocessing (DPP) focuses on developing algorithms
for specific DPP tasks. Although feature selection, resampling, and continuous attribute
discretization have been analyzed in detail, few publications survey the data projections
affecting categorical attributes and scaling. More importantly, there is no detailed analysis of
their interaction with prediction accuracy in data mining, especially in the field of corporate
direct marketing.
3.1. Preprocessing methods
This dissertation considers the three main standard preprocessing steps of NLP:
stemming, punctuation removal, and stop word removal. In stemming, we obtain the stem form
of each word in the data set, i.e., the part of the word to which affixes can be appended.
Stemming algorithms are language specific and differ in performance and accuracy; many
different methods can be used, such as affix-removal stemming, n-gram stemming, and
table-lookup stemming. Another important NLP preprocessing step is removing punctuation,
which divides text into sentences, paragraphs, and phrases, and therefore affects the results of
any text processing method, especially methods that depend on word and phrase frequencies,
because punctuation occurs frequently in text. Finally, before any NLP processing, stop words
are removed: a group of frequently used words that carry little information on their own, such
as articles, determiners, and prepositions. By deleting these very common words from the text,
we can focus on the important words.
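As a concrete illustration, the three steps can be sketched in Python. The stop-word list and the suffix-stripping stemmer below are deliberately tiny stand-ins, not real linguistic resources; an actual pipeline would use a library such as NLTK.

```python
import string

# Illustrative stop-word list and naive suffix-stripping stemmer
# (a simplified form of affix-removal stemming).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    # Strip the first matching suffix, keeping a minimal stem length.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase and tokenize on whitespace.
    tokens = text.lower().split()
    # 3. Remove stop words, then 4. stem the remaining tokens.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The patients are walking, and the doctors tested samples."))
# → ['patient', 'walk', 'doctor', 'test', 'sampl']
```

Note that a crude stemmer happily produces non-words like "sampl"; stems only need to be consistent, not readable.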
Important Hyperparameters
The hyperparameters of a random forest are used either to improve the predictive ability
of the model or to make the model faster. The following describes the hyperparameters of
sklearn's built-in random forest implementation:
1. Increasing the Predictive Power
 n_estimators: the number of trees the algorithm builds before taking the majority vote or
averaging the predictions. Generally, more trees improve performance and make predictions
more stable, but they also slow down computation.
 max_features: the maximum number of features the random forest is allowed to try at an
individual split. Sklearn offers several options.
 min_samples_leaf: the minimum number of samples required to be at a leaf node.
2. Increasing the Model's Speed
 n_jobs: tells the engine how many processors it is allowed to use. A value of 1 means only
one processor can be used; a value of -1 means there is no limit.
 random_state: makes the output of the model reproducible. A model with a fixed
random_state value, the same hyperparameters, and the same training data will always
produce the same results.
 oob_score: (also known as OOB sampling) a random forest cross-validation method. Under
bootstrap sampling, about one-third of the data is not used to train each tree and can instead
be used to evaluate its performance. These samples are called out-of-bag (OOB) samples.
The approach is very similar to leave-one-out cross-validation, but with almost no additional
computational burden.
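A minimal sketch of how these hyperparameters are set with sklearn's RandomForestClassifier; the synthetic data and the particular values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, just to make the sketch runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,      # more trees: better, more stable predictions, slower
    max_features="sqrt",   # features tried at each split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # use all available processors
    random_state=42,       # reproducible output
    oob_score=True,        # evaluate on out-of-bag samples
)
model.fit(X, y)
print(f"OOB accuracy: {model.oob_score_:.3f}")
```

With oob_score=True, the fitted model exposes `oob_score_`, an accuracy estimate computed only from samples each tree never saw during training.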
Why is Random Forest So Cool?
Impressive in Versatility
Whether you have a regression task or a classification task, random forest is a suitable model
for your needs. It can handle binary, categorical, and numerical features, and hardly any
preprocessing is required: the data do not need to be rescaled or transformed.
Parallelizable
Random forests are parallelizable, meaning the work can be split across multiple
processors, which shortens computation time. Boosted models, by contrast, are sequential and
take longer to compute. Specifically, in Python, to train on multiple cores, pass the parameter
n_jobs=-1; -1 means use all available processors.
Great with High Dimensionality
Random forest handles high-dimensional data well because each tree works with only a
subset of the features.
Quick Prediction/Training Speed
Training is fast relative to the number of features because each split considers only a subset of
them, so we can easily work with hundreds of features. Prediction is significantly faster than
training because the generated forest can be saved and reused.
Robust to Outliers and Non-linear Data
Random forest deals with outliers by essentially binning them, and it is similarly indifferent to
nonlinear features.
Handles Unbalanced Data
It has methods for balancing errors when the classes are imbalanced. Random forest
minimizes the overall error rate, so with an unbalanced data set the larger class tends to get a
lower error rate while the smaller class gets a higher one.
Low Bias, Moderate Variance
Each individual decision tree has high variance but low bias. Since the random forest
averages all of its trees, it averages away much of that variance, yielding a model with low bias
and moderate variance.
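This variance-averaging argument can be checked numerically: averaging B independent, equally noisy predictors cuts the variance by roughly a factor of B (in a real forest, correlation between trees limits the reduction). A small simulation with assumed toy numbers:

```python
import numpy as np

# Each "tree" is a noisy estimate of the same target value (here 0.0);
# the ensemble mean over 100 trees is far less variable than one tree.
rng = np.random.default_rng(0)
n_trees, n_repeats = 100, 2000

# Rows: repeated draws of an ensemble of 100 independent predictions.
preds = rng.normal(loc=0.0, scale=1.0, size=(n_repeats, n_trees))

single_tree_var = preds[:, 0].var()      # ~1.0
ensemble_var = preds.mean(axis=1).var()  # ~1.0 / 100
print(single_tree_var, ensemble_var)
```

With fully independent predictors the ensemble variance comes out near 1/100 of the single-tree variance, matching the 1/B rule above.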
Advantages of using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. In the next two
sections, we cover the advantages and disadvantages of using random forests for classification
and regression. The random forest algorithm is less prone to bias because there are multiple
trees and each tree is trained on a different subset of the data. Basically, the algorithm relies on
the power of the "crowd", which reduces its overall bias.
The algorithm is also very stable. Even if new data points are introduced into the data
set, the overall algorithm is not affected much, because new data may influence one tree but is
unlikely to affect all trees. The random forest algorithm works well when the data contain both
categorical and numerical features, and it can also cope with missing values and poorly scaled
data (although we performed feature scaling in this work for demonstration purposes only).
Drawbacks
Model interpretability: random forest models are not very interpretable; they act like black
boxes. For very large data sets, the trees can take up a lot of memory. The model can tend to
overfit, so the hyperparameters should be tuned; random forests have been observed to overfit
on some data sets with noisy classification/regression tasks. Random forest is also more
complex and computationally expensive than a single decision tree, and because of this
complexity it requires more training time than comparable algorithms.
Materials and methods
Data
The model was trained and evaluated on the PLCO data set, which was generated as
part of a randomized, controlled, prospective study of the effectiveness of different prostate,
lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out
a baseline questionnaire detailing their previous and current health status. All processing of this
data set was done in Python (version 3.6.7).
We initially downloaded the data of all women in the PLCO data set: 78,215 women
aged 50-78. We chose to exclude women who met any of the following conditions:
1. Missing data on whether they had been diagnosed with breast cancer and the time of
diagnosis
2. Diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic, but with no information about place of birth
5. Missing data for the 13 selected predictors
We excluded women diagnosed with breast cancer before the baseline questionnaire
because BCRAT is not suitable for women with a personal history of breast cancer.
BCRAT is also unsuitable for women who have received chest radiotherapy, carry
BRCA1 or BRCA2 gene mutations, or have lobular carcinoma in situ, ductal carcinoma in situ,
or rare cancer-predisposing syndromes such as Li-Fraumeni syndrome. Since the PLCO data
set contains no data for these conditions, we assume they do not apply to any women in the
data set. Because only the PLCO white, black, and Hispanic race/ethnicity categories match the
BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity.
We excluded subjects who identified as Hispanic but had no data on their place of birth,
because the BCRAT implementation uses different breast cancer composite rates for US-born
and foreign-born Hispanic women. After removing subjects based on these conditions, 64,739
women remained.
We trained a set of machine learning models on five of the usual seven BCRAT inputs.
These five inputs, including age, age at menarche, age at first live birth, number of first-degree
relatives with breast cancer, and race/ethnicity, are the only traditional BCRAT inputs available
in the PLCO data set. We compared the machine learning models with BCRAT given these
same five inputs.
The models with a broader set of predictors take the five BCRAT inputs plus eight
additional factors. These additional predictors were selected based on their availability in the
PLCO data set and their association with breast cancer risk: age at menopause, an indicator of
current hormone use, years of hormone use, BMI, pack-years of smoking, years of birth control
use, number of live births, and an indicator of personal cancer history.
To facilitate training and testing of the models, we made limited modifications to the
predictor variables. First, we assigned numeric values to the categorical variables. The PLCO
data set records age at menarche, age at first live birth, age at menopause, years of hormone
use, and years of birth control use as categorical variables. For example, age at menarche is
coded as: 1 for under 10 years old, 2 for 10-11, 3 for 12-13, 4 for 14-15, and 5 for 16 or older.
For categorical values representing a maximum age or less (for example, under ten years old),
we set the variable's value to that upper bound (for example, ten years old). Similarly, for
values representing a minimum age or above (for example, 16 years old or above), we set it to
that lower bound (for example, 16 years old). For values covering a closed range (for example,
12-13 years old), we set the variable's value to the midpoint of the range (for example, 12.5
years old).
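The recoding rules above can be sketched as a simple lookup table; the numeric code values used here follow the age-at-menarche example but are illustrative assumptions, not the exact PLCO codes:

```python
# Hypothetical mapping of PLCO age-at-menarche category codes to numeric
# ages: bounds for open-ended categories, midpoints for closed ranges.
MENARCHE_AGE = {
    1: 10.0,   # "under 10": upper bound of the range
    2: 10.5,   # "10-11": midpoint
    3: 12.5,   # "12-13": midpoint
    4: 14.5,   # "14-15": midpoint
    5: 16.0,   # "16 or older": lower bound of the open-ended range
}

def recode_menarche(code):
    return MENARCHE_AGE[code]

print([recode_menarche(c) for c in (1, 3, 5)])  # → [10.0, 12.5, 16.0]
```

Each of the five categorical age variables would get its own such table, built from its specific category definitions.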
After recoding the categorical variables, we made additional adjustments to the age at
first live birth and race/ethnicity variables entered into the machine learning models. For the
BCRAT model, we set the age at first live birth of nulliparous women to 98 (as specified by the
"BCRA" package (version 2.1) in R (version 3.4.3), which implements BCRAT) and used
different race/ethnicity category values for foreign-born and US-born Hispanic women. For the
machine learning models, we set the age at first live birth of nulliparous women to their current
age, and represented race/ethnicity with two indicators: one for white women and one for black
women. Each woman is classified as exactly one race/ethnicity (white, black, or Hispanic), so
beyond the white and black indicators we do not need a Hispanic indicator: a Hispanic
woman's white and black indicators are both 0. For the machine learning models, we did not
distinguish between Hispanic women born in the United States and Hispanic women born
abroad.

More Related Content

DOCX
Introductionedited
PDF
An efficient feature selection algorithm for health care data analysis
PDF
Explainable AI in Drug Hunting
PDF
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
PDF
Machine learning meets user analytics - Metageni tech talk
PDF
BIOMARKER EXTRACTION FROM EMR / EHR DATA - ASHISH SHARMA & KAIWEN ZHONG
PDF
Chronic Kidney Disease Prediction Using Machine Learning
PDF
INDUSTRIALIZING INDUSTRIALIZING MACHINE LEARNING IN PHARMA: Challenges, Use C...
Introductionedited
An efficient feature selection algorithm for health care data analysis
Explainable AI in Drug Hunting
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Machine learning meets user analytics - Metageni tech talk
BIOMARKER EXTRACTION FROM EMR / EHR DATA - ASHISH SHARMA & KAIWEN ZHONG
Chronic Kidney Disease Prediction Using Machine Learning
INDUSTRIALIZING INDUSTRIALIZING MACHINE LEARNING IN PHARMA: Challenges, Use C...

What's hot (19)

PDF
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
PDF
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
PDF
Efficiency of Prediction Algorithms for Mining Biological Databases
PDF
IRJET- Disease Prediction System
PDF
IRJET - Employee Performance Prediction System using Data Mining
PDF
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
PDF
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
PDF
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
PDF
IRJET-Survey on Data Mining Techniques for Disease Prediction
PDF
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
PDF
Controlling informative features for improved accuracy and faster predictions...
PDF
Assess data reliability from a set of criteria using the theory of belief fun...
PDF
Research Method EMBA chapter 12
PDF
Exploratory data analysis data visualization
DOCX
Final Report
PDF
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
PDF
IRJET- A Review of Data Cleaning and its Current Approaches
PDF
IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
PDF
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
Efficiency of Prediction Algorithms for Mining Biological Databases
IRJET- Disease Prediction System
IRJET - Employee Performance Prediction System using Data Mining
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
IRJET-Survey on Data Mining Techniques for Disease Prediction
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
Controlling informative features for improved accuracy and faster predictions...
Assess data reliability from a set of criteria using the theory of belief fun...
Research Method EMBA chapter 12
Exploratory data analysis data visualization
Final Report
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
IRJET- A Review of Data Cleaning and its Current Approaches
IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
Ad

Similar to Dissertation (20)

PDF
Supervised learning techniques and applications
PDF
Anomaly detection via eliminating data redundancy and rectifying data error i...
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
PDF
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
PDF
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
PDF
HEALTH PREDICTION ANALYSIS USING DATA MINING
DOCX
Running Head Data Mining in The Cloud .docx
PDF
Guide To Predictive Analytics with Machine Learning.pdf
DOCX
Machine Learning Approaches and its Challenges
PDF
Data Science - Part V - Decision Trees & Random Forests
PDF
A genetic algorithm-based feature selection approach for diabetes prediction
PDF
Software Cost Estimation Using Clustering and Ranking Scheme
PDF
Data Science Interview Questions PDF By ScholarHat
PDF
IRJET- Medical Data Mining
PPTX
Empowering Business Growth with Predictive Analytic - Statswork
PDF
Fundamentals of data science presentation
PDF
SW-Asset-Predictive Analytics Models.pdf
PDF
The potential of predictive Analytics Models
Supervised learning techniques and applications
Anomaly detection via eliminating data redundancy and rectifying data error i...
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Application
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
HEALTH PREDICTION ANALYSIS USING DATA MINING
Running Head Data Mining in The Cloud .docx
Guide To Predictive Analytics with Machine Learning.pdf
Machine Learning Approaches and its Challenges
Data Science - Part V - Decision Trees & Random Forests
A genetic algorithm-based feature selection approach for diabetes prediction
Software Cost Estimation Using Clustering and Ranking Scheme
Data Science Interview Questions PDF By ScholarHat
IRJET- Medical Data Mining
Empowering Business Growth with Predictive Analytic - Statswork
Fundamentals of data science presentation
SW-Asset-Predictive Analytics Models.pdf
The potential of predictive Analytics Models
Ad

More from Mefratechnologies (9)

DOCX
Cyber bullying
DOCX
Pgbm161+module+guide+oct+2020+starts
DOCX
Impact of hrm on organization growth thesis
PPTX
Poster template assessment 1 uncc300 sem 2 2020 (editable file) (2)
DOCX
Addition text
PPTX
Poster template for global health council edited
PPTX
Poster template for global health council
DOCX
DOCX
Final charter edited
Cyber bullying
Pgbm161+module+guide+oct+2020+starts
Impact of hrm on organization growth thesis
Poster template assessment 1 uncc300 sem 2 2020 (editable file) (2)
Addition text
Poster template for global health council edited
Poster template for global health council
Final charter edited

Recently uploaded (20)

PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
Transcultural that can help you someday.
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Database Infoormation System (DBIS).pptx
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
A Complete Guide to Streamlining Business Processes
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
Managing Community Partner Relationships
PDF
annual-report-2024-2025 original latest.
PDF
[EN] Industrial Machine Downtime Prediction
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPT
DATA COLLECTION METHODS-ppt for nursing research
PDF
Mega Projects Data Mega Projects Data
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Transcultural that can help you someday.
ISS -ESG Data flows What is ESG and HowHow
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
Introduction-to-Cloud-ComputingFinal.pptx
Database Infoormation System (DBIS).pptx
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Acceptance and paychological effects of mandatory extra coach I classes.pptx
A Complete Guide to Streamlining Business Processes
STERILIZATION AND DISINFECTION-1.ppthhhbx
Managing Community Partner Relationships
annual-report-2024-2025 original latest.
[EN] Industrial Machine Downtime Prediction
Optimise Shopper Experiences with a Strong Data Estate.pdf
DATA COLLECTION METHODS-ppt for nursing research
Mega Projects Data Mega Projects Data
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Topic 5 Presentation 5 Lesson 5 Corporate Fin
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx

Dissertation

  • 1. Introduction Random forest is one of the most successful integration methods, showing excellent performance at the level of promotion and support vector machines. The method is fast, anti- noise, does not over fit, and provides the possibility to interpret and visualize its output. We will study some possibilities to increase the strength of individual trees in the forest or reduce their correlation. Using several attribute evaluation methods instead of just one method will produce promising results. On the other hand, in most similar cases, using marginal weighted voting instead of ordinary voting can provide improvements that are statistically significant across multiple data sets. Nowadays, machine learning (ML) is becoming more and more important, and with the rapid growth of medical data and information quality, it has become a key technology. However, due to complex, incomplete and multi-dimensional healthcare data, early and accurate detection of diseases remains a challenge. Data preprocessing is an important step in machine learning. The main purpose of machine learning is to provide processed data to improve prediction accuracy. This study summarizes popular data preprocessing steps based on usage, popularity, and literature. After that, the selected preprocessing method is applied to the original data, and then the classifier uses it for prediction. Company data mining faces the challenge of discovering systematic knowledge in big data streams to support management decision-making. Although the research on operations research, direct marketing and machine learning focuses on the analysis and design of data mining algorithms, the interaction between data mining and the previous stage of data preprocessing has not been studied in detail. This paper studies the effects of different preprocessing techniques of attribute scaling, sampling, classification coding and continuous
  • 2. attribute coding on the performance of decision trees, neural networks and support vector machines. Problem statement. Using machine learning to predict breast cancer cases through patient treatment history and health data. We will be using the Data set of Wisconsin breast cancer (diagnosis) center. Among women, breast cancer is the leading cause of death. Breast cancer risk prediction can provide information for screening and preventive measures. Previous work found that adding input to the widely used Gaelic model can improve its ability to predict breast cancer risk. However, these models use simple statistical architectures, and other inputs come from expensive and/or invasive processes. In contrast, we have developed a machine learning model that uses highly accessible personal health data to predict breast cancer risk over five years. We created a machine learning model using only Gail model input and a model using Gail model input and other personal health data related to breast cancer risk. The basic goals of cancer prediction and prognosis are different from those of cancer detection and diagnosis. In cancer prediction/prognosis, one is related to three key points of prediction: 1) cancer susceptibility prediction (ie risk assessment); 2) cancer recurrence prediction and 3) cancer survival rate prediction. In the first case, people are trying to predict the likelihood of developing a certain type of cancer before the disease occurs. In the second case, people are trying to predict the possibility of developing cancer after the disease has clearly disappeared. In the third case, people try to predict the outcome (life expectancy, survival, progression, tumor drug sensitivity) after the disease is diagnosed. In the latter two cases, the success of
  • 3. prognostic prediction obviously depends in part on the success or quality of the diagnosis. However, the prognosis of the disease can only be achieved after medical diagnosis, and prognosis prediction must consider more than simple diagnosis. Using multi-factor analysis of variances of various performance indicators and method parameters, direct marketing can be used to evaluate the impact of different preprocessing options on real-world data sets. Our case-based analysis provides empirical evidence that data preprocessing will have a significant impact on the accuracy of predictions, and certain solutions have proven to be inferior to competing methods. In addition, it was also found that: (1) The selected method proved to be as sensitive to different data representation methods as method parameterization, which indicates the potential of improving performance through effective preprocessing; (2) The influence of the preprocessing scheme depends on the method. Different, indicating that the use of different "best practice" settings can promote the excellent results of a specific method; (3) Therefore, the sensitivity of the algorithm to preprocessing is an important criterion for method evaluation and selection. In predictive data mining, it needs to be different from traditional methods. The predictive ability and calculation efficiency index are considered together. In order to maximize the prediction accuracy of data mining, the research of management science and machine learning is mainly devoted to enhancing competitive classifiers and effectively adjusting algorithm parameters. Classification algorithms are usually tested in extensive benchmark experiments, using pre-processed data sets to evaluate the impact on prediction accuracy and computational efficiency. In contrast, the research focus in DPP is on the development of algorithms for specific DPP tasks. 
Although feature selection, resampling, and continuous-attribute discretization have been analyzed in detail, few publications survey the data projections affecting categorical attributes and scaling. More importantly, there is no detailed analysis of their interaction with prediction accuracy in data mining, especially in the field of company direct marketing.

3.1. Preprocessing methods

This dissertation considers three standard NLP preprocessing steps: stemming, punctuation removal, and stop-word removal. In stemming, we obtain the stem of each word in the data set, i.e., the part of the word to which affixes can be attached. Stemming algorithms are language-specific and differ in performance and accuracy. Many different approaches can be used, such as affix-removal stemming, n-gram stemming, and table-lookup stemming. Another important NLP preprocessing step is removing punctuation, which is used to divide text into sentences, paragraphs, and phrases; because punctuation marks occur so frequently in text, they affect the results of any text-processing method, especially methods that depend on word and phrase frequencies. Finally, before any NLP processing, the most common words, known as stop words, are removed. Stop words are frequently used words that carry little additional information, such as articles, determiners, and prepositions. By removing these very common words from the text, we can focus on the important words.

Important Hyperparameters

The hyperparameters of a random forest are used either to improve the predictive ability of the model or to make the model faster. The following describes the hyperparameters of sklearn's built-in random forest function:

1. Increasing the Predictive Power
· n_estimators: the number of trees the algorithm builds before taking the majority vote or averaging the predictions. In general, more trees improve performance and make predictions more stable, but they also slow down computation.
· max_features: the maximum number of features the random forest is allowed to try in an individual tree. Sklearn offers several options.
· min_samples_leaf: the minimum number of samples required to be at a leaf node.

2. Increasing the Model's Speed

· n_jobs: tells the engine how many processors it is allowed to use. If the value is 1, only one processor is used; the value -1 means there is no limit.
· random_state: makes the output of the model reproducible. A model with a fixed random_state value, the same hyperparameters, and the same training data will always produce the same results.
· oob_score: enables out-of-bag (OOB) estimation, a random-forest cross-validation method. In this sampling scheme, about one-third of the data is not used to train each tree and can instead be used to evaluate its performance. These samples are called out-of-bag samples. The approach is very similar to leave-one-out cross-validation but carries almost no additional computational burden.

Why is Random Forest So Cool?

Impressive in Versatility
Whether you have a regression task or a classification task, random forest is a suitable model for your needs. It can handle binary, categorical, and numerical features. Hardly any preprocessing is required: the data does not need to be rescaled or transformed.

Parallelizable

Random forests are parallelizable, meaning the work can be split across multiple processors, which shortens computation time. Boosted models, by contrast, are sequential and take longer to compute. Specifically, in Python, pass the parameter n_jobs=-1 to use all available processors.

Great with High Dimensions

Random forest is suitable for high-dimensional data because each tree deals with only a subset of the features.

Quick Prediction/Training Speed

Each tree trains quickly because only part of the features are considered at each split, so we can easily use hundreds of features. Prediction is significantly faster than training because the generated forest can be saved for future use.

Robust to Outliers and Non-linear Data

Random forest handles outliers by essentially binning them, and it is likewise indifferent to non-linear features.

Handles Unbalanced Data
Random forest has a method for balancing error when the classes are imbalanced overall. It tries to minimize the overall error rate, so with an unbalanced data set the larger class tends to get a lower error rate while the smaller class gets a higher error rate.

Low Bias, Moderate Variance

Each individual decision tree has high variance but low bias. Because the random forest averages all of its trees, it also averages out the variance, giving a model with low bias and moderate variance.

Advantages of using Random Forest

As with any algorithm, there are advantages and disadvantages to using it. In the next two sections, we cover the advantages and disadvantages of using random forests for classification and regression. The random forest algorithm is not strongly biased, because there are multiple trees and each tree is trained on a subset of the data; basically, the algorithm relies on the power of the "crowd", so its overall bias is reduced. The algorithm is very stable: even if new data points are introduced into the data set, the overall algorithm is not affected much, since new data may affect one tree but is unlikely to affect all trees. The random forest algorithm works well when the data has both categorical and numerical features. It can also work well when the data has missing values or has not been well scaled (although we performed feature scaling in this article for demonstration purposes only).

Drawbacks
Model interpretability: random forest models are not very interpretable; they are like black boxes. For very large data sets, the trees can take up a lot of memory. Random forests can tend to overfit, so you should tune the hyperparameters; it has been observed that they overfit on certain data sets with noisy classification/regression tasks. They are more complex and computationally expensive than the decision tree algorithm and, due to this complexity, require more training time than other comparable algorithms.

Materials and methods

Data

The model was trained and evaluated on the PLCO data set. This data set was generated as part of a randomized, controlled, prospective study to determine the effectiveness of different prostate, lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out a baseline questionnaire detailing their previous and current health status. All processing of this data set was done in Python (version 3.6.7). We initially downloaded the data of all women in the PLCO data set, comprising 78,215 women aged 50-78. We chose to exclude women who met any of the following conditions:

1. Missing data on whether they had been diagnosed with breast cancer and the time of diagnosis
2. Diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but with no information about place of birth
5. Missing data for any of the 13 selected predictors

We excluded women who had been diagnosed with breast cancer before the baseline questionnaire because BCRAT is not suitable for women with a personal history of breast cancer. BCRAT is also not suitable for women who have received chest radiotherapy for breast cancer, carry BRCA1 or BRCA2 gene mutations, or have lobular carcinoma in situ, ductal carcinoma in situ, or certain rare cancer-predisposing syndromes, such as Li-Fraumeni syndrome. Since the PLCO data set contains no data for these conditions, we assume they do not apply to any women in the data set. Because only the PLCO white, black, and Hispanic race/ethnicity categories match the BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity. We excluded subjects who identified as Hispanic but had no data on their place of birth because the BCRAT implementation uses different breast cancer composite incidence rates for US-born and foreign-born Hispanic women. After removing subjects based on these conditions, the number of women was reduced to 64,739.

We trained a set of machine learning models that use five of the usual seven BCRAT inputs. These five inputs (age, age at menarche, age at first live birth, number of first-degree relatives with breast cancer, and race/ethnicity) are the only traditional BCRAT inputs available in the PLCO data set. We compared these machine learning models against BCRAT given the same five inputs. Our models with a broader set of predictors take the five BCRAT inputs plus eight additional factors. These additional predictors were selected based on their availability in the PLCO data set and their correlation with breast cancer risk: age at menopause, an indicator of current hormone use, years of hormone use, BMI, smoking pack-years, years of birth-control use, number of live births, and an indicator of personal cancer history.

To facilitate training and testing of the models, we made limited modifications to the predictor variables. First, we assigned numeric values to the categorical variables. The PLCO data set records age at menarche, age at first live birth, age at menopause, years of hormone use, and years of birth-control use as categorical variables. For example, age at menarche is coded as: 1 for under 10 years old; 2 for 10-11 years old; 3 for 12-13 years old; 4 for 14-15 years old; and 5 for 16 years old or older. For categorical values representing a range below some upper limit (for example, under ten years old), we set the variable to that upper limit (for example, ten years old). Similarly, for values representing a minimum age or above (for example, 16 years old or older), we set the variable to that minimum (for example, 16 years old). For values covering a closed range (for example, 12-13 years old), we set the variable to the midpoint of the range (for example, 12.5 years old). After modifying the categorical variables, we made some adjustments to the age at first live birth and race/ethnicity variables entered into the machine learning models. For the BCRAT model, we set the age-at-first-live-birth variable of nulliparous women to 98 (as the implementation of BCRAT in the "BCRA" package (version 2.1) in R (version 3.4.3) specifies) and provided different race/ethnicity category values for foreign-born and US-born Hispanic women.
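The recoding rules above can be sketched as a simple mapping. This is an illustrative reconstruction; the code values follow the age-at-menarche example in the text, but the mapping table and function name are ours, not the study's actual code.

```python
# Hypothetical sketch of the categorical recoding described above:
# each age-at-menarche category code maps to a numeric age
# (upper limit for "under X", midpoint for closed ranges,
# lower limit for "X or older").
MENARCHE_CODE_TO_AGE = {
    1: 10.0,   # "under 10 years old"  -> upper limit of the range
    2: 10.5,   # "10-11 years old"     -> midpoint
    3: 12.5,   # "12-13 years old"     -> midpoint
    4: 14.5,   # "14-15 years old"     -> midpoint
    5: 16.0,   # "16 or older"         -> lower limit of the range
}

def recode_menarche(code):
    """Map a PLCO age-at-menarche category code to a numeric age."""
    return MENARCHE_CODE_TO_AGE[code]

ages = [recode_menarche(c) for c in [3, 5, 1]]
print(ages)  # [12.5, 16.0, 10.0]
```

The same pattern applies to the other categorical age variables (age at first live birth, age at menopause, and the hormone and birth-control durations), each with its own code table.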
For the machine learning models, we set the age-at-first-live-birth variable of nulliparous women to their current age, and we used two indicator variables to represent race/ethnicity: one for white women and one for black women. Each woman is classified under only one race/ethnicity (white, black, or Hispanic), so given the white and black indicators we do not need a separate Hispanic indicator; a Hispanic woman's white and black indicators are both 0. For the machine learning models, we did not distinguish between Hispanic women born in the United States and Hispanic women born abroad.
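As a closing illustration, the hyperparameters discussed earlier (n_estimators, max_features, min_samples_leaf, n_jobs, random_state, oob_score) can be combined in a single scikit-learn call. This is a generic sketch on synthetic data, not the study's actual configuration; the parameter values shown are arbitrary examples.

```python
# Minimal sketch (assumes scikit-learn): a random forest configured with
# the hyperparameters discussed in the text, trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,      # more trees: better, more stable, but slower
    max_features="sqrt",   # max number of features tried at each split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # -1: use all available processors
    random_state=42,       # makes the output reproducible
    oob_score=True,        # out-of-bag estimate, a built-in cross-validation
)
clf.fit(X, y)

# The OOB score comes "for free" from the ~1/3 of samples each tree
# never saw during training.
print(f"OOB score: {clf.oob_score_:.3f}")
```

With oob_score=True, the fitted estimator exposes the out-of-bag accuracy as clf.oob_score_, so no separate validation split is needed for a rough performance estimate.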