Introduction
Random forest is one of the most successful ensemble methods, with performance on par
with boosting and support vector machines. The method is fast, robust to noise, resists
overfitting, and offers ways to interpret and visualize its output. We study several ways to
increase the strength of the individual trees in the forest or to reduce the correlation between
them. Using several attribute evaluation methods instead of a single one produces promising
results, and in most comparable cases, margin-weighted voting instead of ordinary voting yields
improvements that are statistically significant across multiple data sets.
Machine learning (ML) is becoming increasingly important, and with the rapid growth of
medical data it has become a key technology in healthcare. However, because healthcare data
are complex, incomplete, and high-dimensional, early and accurate detection of disease remains
a challenge. Data preprocessing is an essential step in machine learning: its purpose is to
supply clean, well-structured data that improves prediction accuracy. This study summarizes
popular data preprocessing steps based on their usage, popularity, and the literature. The
selected preprocessing methods are then applied to the raw data before the classifier uses it for
prediction.
Corporate data mining faces the challenge of discovering systematic knowledge in big
data streams to support management decision-making. Although research in operations
research, direct marketing, and machine learning focuses on the analysis and design of data
mining algorithms, the interaction between data mining and the preceding stage of data
preprocessing has not been studied in detail. This paper studies how different preprocessing
techniques for attribute scaling, sampling, categorical coding, and continuous attribute coding
affect the performance of decision trees, neural networks, and support vector machines.
Problem statement
This work uses machine learning to predict breast cancer cases from patient treatment
history and health data, using the Wisconsin Breast Cancer (Diagnostic) data set. Breast cancer
is among the leading causes of cancer death in women, and breast cancer risk prediction can
inform screening and preventive measures. Previous work found that adding inputs to the
widely used Gail model can improve its ability to predict breast cancer risk. However, those
models use simple statistical architectures, and the additional inputs come from expensive
and/or invasive procedures. In contrast, we developed machine learning models that use readily
accessible personal health data to predict five-year breast cancer risk: one model using only the
Gail model inputs, and one using the Gail model inputs plus other personal health data related
to breast cancer risk.
The basic goals of cancer prediction and prognosis differ from those of cancer detection
and diagnosis. Cancer prediction/prognosis concerns three key questions: 1) predicting cancer
susceptibility (i.e., risk assessment); 2) predicting cancer recurrence; and 3) predicting cancer
survival. In the first case, one tries to predict the likelihood of developing a certain type of
cancer before the disease occurs. In the second case, one tries to predict the likelihood of the
cancer returning after the disease has apparently been eliminated.
In the third case, one tries to predict the outcome (life expectancy, survival, progression,
tumor drug sensitivity) after the disease is diagnosed. In the latter two cases, the success of the
prognostic prediction obviously depends in part on the success or quality of the diagnosis.
However, a prognosis can only be made after a medical diagnosis, and prognostic prediction
must consider more than the diagnosis alone.
Using multi-factor analysis of variance over various performance indicators and method
parameters, a real-world direct-marketing data set is used to evaluate the impact of different
preprocessing options. Our case-based analysis provides empirical evidence that data
preprocessing has a significant impact on prediction accuracy, and certain schemes prove
inferior to competing ones. In addition, it was found that: (1) the selected methods are as
sensitive to different data representations as they are to method parameterization, which
indicates the potential for improving performance through effective preprocessing; (2) the
influence of a preprocessing scheme differs by method, indicating that method-specific "best
practice" settings can promote excellent results for a given method; and (3) the sensitivity of an
algorithm to preprocessing is therefore an important criterion for method evaluation and
selection, which in predictive data mining must be considered alongside the traditional criteria
of predictive ability and computational efficiency.
To maximize the prediction accuracy of data mining, research in management science
and machine learning is mainly devoted to building competitive classifiers and tuning algorithm
parameters effectively. Classification algorithms are usually tested in extensive benchmark
experiments on pre-processed data sets to evaluate their prediction accuracy and computational
efficiency. In contrast, research in data preprocessing (DPP) focuses on developing algorithms
for specific DPP tasks. Although feature selection, resampling, and continuous attribute
discretization have been analyzed in detail, few publications survey the data projections
affecting categorical attributes and scaling. More importantly, there is no detailed analysis of
their interaction with prediction accuracy in data mining, especially in the field of corporate
direct marketing.
3.1. Preprocessing methods
This dissertation considers the three main standard preprocessing steps of NLP:
stemming, punctuation removal, and stop word removal. In stemming, we obtain the stem form
of each word in the data set, i.e., the part of the word to which affixes can be appended.
Stemming algorithms are language specific and differ in performance and accuracy; many
different methods can be used, such as affix-removal stemming, n-gram stemming, and
table-lookup stemming. Another important NLP preprocessing step is removing punctuation,
which divides text into sentences, paragraphs, and phrases, and therefore affects the results of
any text processing method, especially methods that depend on word and phrase frequencies,
because punctuation occurs frequently in text. Finally, before any NLP processing, stop words
are removed: a group of frequently used words that carry little information on their own, such
as articles, determiners, and prepositions. By deleting these very common words from the text,
we can focus on the important words.
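As a concrete illustration, the three steps can be sketched in Python. The stop-word list and the suffix-stripping stemmer below are deliberately tiny stand-ins, not real linguistic resources; an actual pipeline would use a library such as NLTK.

```python
import string

# Illustrative stop-word list and naive suffix-stripping stemmer
# (a simplified form of affix-removal stemming).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    # Strip the first matching suffix, keeping a minimal stem length.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # 1. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase and tokenize on whitespace.
    tokens = text.lower().split()
    # 3. Remove stop words, then 4. stem the remaining tokens.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The patients are walking, and the doctors tested samples."))
# → ['patient', 'walk', 'doctor', 'test', 'sampl']
```

Note that a crude stemmer happily produces non-words like "sampl"; stems only need to be consistent, not readable.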
Important Hyperparameters
The hyperparameters of a random forest are used either to improve the predictive ability
of the model or to make the model faster. The following describes the hyperparameters of
sklearn's built-in random forest implementation:
1. Increasing the Predictive Power
 n_estimators: the number of trees the algorithm builds before taking the majority vote or
averaging the predictions. Generally, more trees improve performance and make predictions
more stable, but they also slow down computation.
 max_features: the maximum number of features the random forest is allowed to try at an
individual split. Sklearn offers several options.
 min_samples_leaf: the minimum number of samples required to be at a leaf node.
2. Increasing the Model's Speed
 n_jobs: tells the engine how many processors it is allowed to use. A value of 1 means only
one processor can be used; a value of -1 means there is no limit.
 random_state: makes the output of the model reproducible. A model with a fixed
random_state value, the same hyperparameters, and the same training data will always
produce the same results.
 oob_score: (also known as OOB sampling) a random forest cross-validation method. Under
bootstrap sampling, about one-third of the data is not used to train each tree and can instead
be used to evaluate its performance. These samples are called out-of-bag (OOB) samples.
The approach is very similar to leave-one-out cross-validation, but with almost no additional
computational burden.
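A minimal sketch of how these hyperparameters are set with sklearn's RandomForestClassifier; the synthetic data and the particular values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, just to make the sketch runnable.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,      # more trees: better, more stable predictions, slower
    max_features="sqrt",   # features tried at each split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # use all available processors
    random_state=42,       # reproducible output
    oob_score=True,        # evaluate on out-of-bag samples
)
model.fit(X, y)
print(f"OOB accuracy: {model.oob_score_:.3f}")
```

With oob_score=True, the fitted model exposes `oob_score_`, an accuracy estimate computed only from samples each tree never saw during training.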
Why is Random Forest So Cool?
Impressive in Versatility
Whether you have a regression task or a classification task, random forest is a suitable model
for your needs. It can handle binary, categorical, and numerical features, and hardly any
preprocessing is required: the data do not need to be rescaled or transformed.
Parallelizable
Random forests are parallelizable, meaning the work can be split across multiple
processors, which shortens computation time. Boosted models, by contrast, are sequential and
take longer to compute. Specifically, in Python, to train on multiple cores, pass the parameter
n_jobs=-1; -1 means use all available processors.
Great with High Dimensionality
Random forest handles high-dimensional data well because each tree works with only a
subset of the features.
Quick Prediction/Training Speed
Training is fast relative to the number of features because each split considers only a subset of
them, so we can easily work with hundreds of features. Prediction is significantly faster than
training because the generated forest can be saved and reused.
Robust to Outliers and Non-linear Data
Random forest deals with outliers by essentially binning them, and it is similarly indifferent to
nonlinear features.
Handles Unbalanced Data
It has methods for balancing errors when the classes are imbalanced. Random forest
minimizes the overall error rate, so with an unbalanced data set the larger class tends to get a
lower error rate while the smaller class gets a higher one.
Low Bias, Moderate Variance
Each individual decision tree has high variance but low bias. Since the random forest
averages all of its trees, it averages away much of that variance, yielding a model with low bias
and moderate variance.
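This variance-averaging argument can be checked numerically: averaging B independent, equally noisy predictors cuts the variance by roughly a factor of B (in a real forest, correlation between trees limits the reduction). A small simulation with assumed toy numbers:

```python
import numpy as np

# Each "tree" is a noisy estimate of the same target value (here 0.0);
# the ensemble mean over 100 trees is far less variable than one tree.
rng = np.random.default_rng(0)
n_trees, n_repeats = 100, 2000

# Rows: repeated draws of an ensemble of 100 independent predictions.
preds = rng.normal(loc=0.0, scale=1.0, size=(n_repeats, n_trees))

single_tree_var = preds[:, 0].var()      # ~1.0
ensemble_var = preds.mean(axis=1).var()  # ~1.0 / 100
print(single_tree_var, ensemble_var)
```

With fully independent predictors the ensemble variance comes out near 1/100 of the single-tree variance, matching the 1/B rule above.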
Advantages of using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. In the next two
sections, we cover the advantages and disadvantages of using random forests for classification
and regression. The random forest algorithm is less prone to bias because there are multiple
trees and each tree is trained on a different subset of the data. Basically, the algorithm relies on
the power of the "crowd", which reduces its overall bias.
The algorithm is also very stable. Even if new data points are introduced into the data
set, the overall algorithm is not affected much, because new data may influence one tree but is
unlikely to affect all trees. The random forest algorithm works well when the data contain both
categorical and numerical features, and it can also cope with missing values and poorly scaled
data (although we performed feature scaling in this work for demonstration purposes only).
Drawbacks
Model interpretability: random forest models are not very interpretable; they act like black
boxes. For very large data sets, the trees can take up a lot of memory. The model can tend to
overfit, so the hyperparameters should be tuned; random forests have been observed to overfit
on some data sets with noisy classification/regression tasks. Random forest is also more
complex and computationally expensive than a single decision tree, and because of this
complexity it requires more training time than comparable algorithms.
Materials and methods
Data
The model was trained and evaluated on the PLCO data set, which was generated as
part of a randomized, controlled, prospective study of the effectiveness of different prostate,
lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out
a baseline questionnaire detailing their previous and current health status. All processing of this
data set was done in Python (version 3.6.7).
We initially downloaded the data of all women in the PLCO data set: 78,215 women
aged 50-78. We chose to exclude women who met any of the following conditions:
1. Missing data on whether they had been diagnosed with breast cancer and the time of
diagnosis
2. Diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic, but with no information about place of birth
5. Missing data for the 13 selected predictors
We excluded women diagnosed with breast cancer before the baseline questionnaire
because BCRAT is not suitable for women with a personal history of breast cancer.
BCRAT is also unsuitable for women who have received chest radiotherapy, carry
BRCA1 or BRCA2 gene mutations, or have lobular carcinoma in situ, ductal carcinoma in situ,
or rare cancer-predisposing syndromes such as Li-Fraumeni syndrome. Since the PLCO data
set contains no data for these conditions, we assume they do not apply to any women in the
data set. Because only the PLCO white, black, and Hispanic race/ethnicity categories match the
BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity.
We excluded subjects who identified as Hispanic but had no data on their place of birth,
because the BCRAT implementation uses different breast cancer composite rates for US-born
and foreign-born Hispanic women. After removing subjects based on these conditions, 64,739
women remained.
We trained a set of machine learning models on five of the usual seven BCRAT inputs.
These five inputs, including age, age at menarche, age at first live birth, number of first-degree
relatives with breast cancer, and race/ethnicity, are the only traditional BCRAT inputs available
in the PLCO data set. We compared the machine learning models with BCRAT given these
same five inputs.
The models with a broader set of predictors take the five BCRAT inputs plus eight
additional factors. These additional predictors were selected based on their availability in the
PLCO data set and their association with breast cancer risk: age at menopause, an indicator of
current hormone use, years of hormone use, BMI, pack-years of smoking, years of birth control
use, number of live births, and an indicator of personal cancer history.
To facilitate training and testing of the models, we made limited modifications to the
predictor variables. First, we assigned numeric values to the categorical variables. The PLCO
data set records age at menarche, age at first live birth, age at menopause, years of hormone
use, and years of birth control use as categorical variables. For example, age at menarche is
coded as: 1 for under 10 years old, 2 for 10-11, 3 for 12-13, 4 for 14-15, and 5 for 16 or older.
For categorical values representing a maximum age or less (for example, under ten years old),
we set the variable's value to that upper bound (for example, ten years old). Similarly, for
values representing a minimum age or above (for example, 16 years old or above), we set it to
that lower bound (for example, 16 years old). For values covering a closed range (for example,
12-13 years old), we set the variable's value to the midpoint of the range (for example, 12.5
years old).
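The recoding rules above can be sketched as a simple lookup table; the numeric code values used here follow the age-at-menarche example but are illustrative assumptions, not the exact PLCO codes:

```python
# Hypothetical mapping of PLCO age-at-menarche category codes to numeric
# ages: bounds for open-ended categories, midpoints for closed ranges.
MENARCHE_AGE = {
    1: 10.0,   # "under 10": upper bound of the range
    2: 10.5,   # "10-11": midpoint
    3: 12.5,   # "12-13": midpoint
    4: 14.5,   # "14-15": midpoint
    5: 16.0,   # "16 or older": lower bound of the open-ended range
}

def recode_menarche(code):
    return MENARCHE_AGE[code]

print([recode_menarche(c) for c in (1, 3, 5)])  # → [10.0, 12.5, 16.0]
```

Each of the five categorical age variables would get its own such table, built from its specific category definitions.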
After recoding the categorical variables, we made additional adjustments to the age at
first live birth and race/ethnicity variables entered into the machine learning models. For the
BCRAT model, we set the age at first live birth of nulliparous women to 98 (as specified by the
"BCRA" package (version 2.1) in R (version 3.4.3), which implements BCRAT) and used
different race/ethnicity category values for foreign-born and US-born Hispanic women. For the
machine learning models, we set the age at first live birth of nulliparous women to their current
age, and represented race/ethnicity with two indicators: one for white women and one for black
women. Each woman is classified as exactly one race/ethnicity (white, black, or Hispanic), so
beyond the white and black indicators we do not need a Hispanic indicator: a Hispanic
woman's white and black indicators are both 0. For the machine learning models, we did not
distinguish between Hispanic women born in the United States and Hispanic women born
abroad.

More Related Content

DOCX
Introductionedited
PDF
An efficient feature selection algorithm for health care data analysis
PDF
Explainable AI in Drug Hunting
PDF
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
PDF
Machine learning meets user analytics - Metageni tech talk
PDF
BIOMARKER EXTRACTION FROM EMR / EHR DATA - ASHISH SHARMA & KAIWEN ZHONG
PDF
Chronic Kidney Disease Prediction Using Machine Learning
PDF
INDUSTRIALIZING INDUSTRIALIZING MACHINE LEARNING IN PHARMA: Challenges, Use C...
Introductionedited
An efficient feature selection algorithm for health care data analysis
Explainable AI in Drug Hunting
Enactment Ranking of Supervised Algorithms Dependence of Data Splitting Algor...
Machine learning meets user analytics - Metageni tech talk
BIOMARKER EXTRACTION FROM EMR / EHR DATA - ASHISH SHARMA & KAIWEN ZHONG
Chronic Kidney Disease Prediction Using Machine Learning
INDUSTRIALIZING INDUSTRIALIZING MACHINE LEARNING IN PHARMA: Challenges, Use C...

What's hot (19)

PDF
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
PDF
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
PDF
Efficiency of Prediction Algorithms for Mining Biological Databases
PDF
IRJET- Disease Prediction System
PDF
IRJET - Employee Performance Prediction System using Data Mining
PDF
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
PDF
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
PDF
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
PDF
IRJET-Survey on Data Mining Techniques for Disease Prediction
PDF
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
PDF
Controlling informative features for improved accuracy and faster predictions...
PDF
Assess data reliability from a set of criteria using the theory of belief fun...
PDF
Research Method EMBA chapter 12
PDF
Exploratory data analysis data visualization
DOCX
Final Report
PDF
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
PDF
IRJET- A Review of Data Cleaning and its Current Approaches
PDF
IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
PDF
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
IRJET- Hybrid Architecture of Heart Disease Prediction System using Genetic N...
CATEGORIZATION OF FACTORS AFFECTING CLASSIFICATION ALGORITHMS SELECTION
Efficiency of Prediction Algorithms for Mining Biological Databases
IRJET- Disease Prediction System
IRJET - Employee Performance Prediction System using Data Mining
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
AN EFFICIENT PSO BASED ENSEMBLE CLASSIFICATION MODEL ON HIGH DIMENSIONAL DATA...
Comparative Analysis of Machine Learning Algorithms for their Effectiveness i...
IRJET-Survey on Data Mining Techniques for Disease Prediction
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
Controlling informative features for improved accuracy and faster predictions...
Assess data reliability from a set of criteria using the theory of belief fun...
Research Method EMBA chapter 12
Exploratory data analysis data visualization
Final Report
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
IRJET- A Review of Data Cleaning and its Current Approaches
IRJET- Classification of Chemical Medicine or Drug using K Nearest Neighb...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
Ad

Similar to Dissertation (20)

PDF
Supervised learning techniques and applications
PDF
Anomaly detection via eliminating data redundancy and rectifying data error i...
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
Analysis of Common Supervised Learning Algorithms Through Application
PDF
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
PDF
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
PDF
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
PDF
HEALTH PREDICTION ANALYSIS USING DATA MINING
DOCX
Running Head Data Mining in The Cloud .docx
PDF
Guide To Predictive Analytics with Machine Learning.pdf
DOCX
Machine Learning Approaches and its Challenges
PDF
Data Science - Part V - Decision Trees & Random Forests
PDF
A genetic algorithm-based feature selection approach for diabetes prediction
PDF
Software Cost Estimation Using Clustering and Ranking Scheme
PDF
Data Science Interview Questions PDF By ScholarHat
PDF
IRJET- Medical Data Mining
PPTX
Empowering Business Growth with Predictive Analytic - Statswork
PDF
Fundamentals of data science presentation
PDF
SW-Asset-Predictive Analytics Models.pdf
PDF
The potential of predictive Analytics Models
Supervised learning techniques and applications
Anomaly detection via eliminating data redundancy and rectifying data error i...
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Application
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
An Efficient PSO Based Ensemble Classification Model on High Dimensional Data...
Improving Prediction Accuracy Results by Using Q-Statistic Algorithm in High ...
HEALTH PREDICTION ANALYSIS USING DATA MINING
Running Head Data Mining in The Cloud .docx
Guide To Predictive Analytics with Machine Learning.pdf
Machine Learning Approaches and its Challenges
Data Science - Part V - Decision Trees & Random Forests
A genetic algorithm-based feature selection approach for diabetes prediction
Software Cost Estimation Using Clustering and Ranking Scheme
Data Science Interview Questions PDF By ScholarHat
IRJET- Medical Data Mining
Empowering Business Growth with Predictive Analytic - Statswork
Fundamentals of data science presentation
SW-Asset-Predictive Analytics Models.pdf
The potential of predictive Analytics Models
Ad

More from Mefratechnologies (9)

DOCX
Cyber bullying
DOCX
Pgbm161+module+guide+oct+2020+starts
DOCX
Impact of hrm on organization growth thesis
PPTX
Poster template assessment 1 uncc300 sem 2 2020 (editable file) (2)
DOCX
Addition text
PPTX
Poster template for global health council edited
PPTX
Poster template for global health council
DOCX
DOCX
Final charter edited
Cyber bullying
Pgbm161+module+guide+oct+2020+starts
Impact of hrm on organization growth thesis
Poster template assessment 1 uncc300 sem 2 2020 (editable file) (2)
Addition text
Poster template for global health council edited
Poster template for global health council
Final charter edited

Recently uploaded (20)

PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
Transcultural that can help you someday.
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
Database Infoormation System (DBIS).pptx
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
A Complete Guide to Streamlining Business Processes
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
Managing Community Partner Relationships
PDF
annual-report-2024-2025 original latest.
PDF
[EN] Industrial Machine Downtime Prediction
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PPT
DATA COLLECTION METHODS-ppt for nursing research
PDF
Mega Projects Data Mega Projects Data
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Transcultural that can help you someday.
ISS -ESG Data flows What is ESG and HowHow
Market Analysis -202507- Wind-Solar+Hybrid+Street+Lights+for+the+North+Amer...
Introduction-to-Cloud-ComputingFinal.pptx
Database Infoormation System (DBIS).pptx
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Acceptance and paychological effects of mandatory extra coach I classes.pptx
A Complete Guide to Streamlining Business Processes
STERILIZATION AND DISINFECTION-1.ppthhhbx
Managing Community Partner Relationships
annual-report-2024-2025 original latest.
[EN] Industrial Machine Downtime Prediction
Optimise Shopper Experiences with a Strong Data Estate.pdf
DATA COLLECTION METHODS-ppt for nursing research
Mega Projects Data Mega Projects Data
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
Topic 5 Presentation 5 Lesson 5 Corporate Fin
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx

Dissertation

  • 1. Introduction Random forest is one of the most successful integration methods, showing excellent performance at the level of promotion and support vector machines. The method is fast, anti- noise, does not over fit, and provides the possibility to interpret and visualize its output. We will study some possibilities to increase the strength of individual trees in the forest or reduce their correlation. Using several attribute evaluation methods instead of just one method will produce promising results. On the other hand, in most similar cases, using marginal weighted voting instead of ordinary voting can provide improvements that are statistically significant across multiple data sets. Nowadays, machine learning (ML) is becoming more and more important, and with the rapid growth of medical data and information quality, it has become a key technology. However, due to complex, incomplete and multi-dimensional healthcare data, early and accurate detection of diseases remains a challenge. Data preprocessing is an important step in machine learning. The main purpose of machine learning is to provide processed data to improve prediction accuracy. This study summarizes popular data preprocessing steps based on usage, popularity, and literature. After that, the selected preprocessing method is applied to the original data, and then the classifier uses it for prediction. Company data mining faces the challenge of discovering systematic knowledge in big data streams to support management decision-making. Although the research on operations research, direct marketing and machine learning focuses on the analysis and design of data mining algorithms, the interaction between data mining and the previous stage of data preprocessing has not been studied in detail. This paper studies the effects of different preprocessing techniques of attribute scaling, sampling, classification coding and continuous
  • 2. attribute coding on the performance of decision trees, neural networks and support vector machines. Problem statement. Using machine learning to predict breast cancer cases through patient treatment history and health data. We will be using the Data set of Wisconsin breast cancer (diagnosis) center. Among women, breast cancer is the leading cause of death. Breast cancer risk prediction can provide information for screening and preventive measures. Previous work found that adding input to the widely used Gaelic model can improve its ability to predict breast cancer risk. However, these models use simple statistical architectures, and other inputs come from expensive and/or invasive processes. In contrast, we have developed a machine learning model that uses highly accessible personal health data to predict breast cancer risk over five years. We created a machine learning model using only Gail model input and a model using Gail model input and other personal health data related to breast cancer risk. The basic goals of cancer prediction and prognosis are different from those of cancer detection and diagnosis. In cancer prediction/prognosis, one is related to three key points of prediction: 1) cancer susceptibility prediction (ie risk assessment); 2) cancer recurrence prediction and 3) cancer survival rate prediction. In the first case, people are trying to predict the likelihood of developing a certain type of cancer before the disease occurs. In the second case, people are trying to predict the possibility of developing cancer after the disease has clearly disappeared. In the third case, people try to predict the outcome (life expectancy, survival, progression, tumor drug sensitivity) after the disease is diagnosed. In the latter two cases, the success of
  • 3. prognostic prediction obviously depends in part on the success or quality of the diagnosis. However, the prognosis of the disease can only be achieved after medical diagnosis, and prognosis prediction must consider more than simple diagnosis. Using multi-factor analysis of variances of various performance indicators and method parameters, direct marketing can be used to evaluate the impact of different preprocessing options on real-world data sets. Our case-based analysis provides empirical evidence that data preprocessing will have a significant impact on the accuracy of predictions, and certain solutions have proven to be inferior to competing methods. In addition, it was also found that: (1) The selected method proved to be as sensitive to different data representation methods as method parameterization, which indicates the potential of improving performance through effective preprocessing; (2) The influence of the preprocessing scheme depends on the method. Different, indicating that the use of different "best practice" settings can promote the excellent results of a specific method; (3) Therefore, the sensitivity of the algorithm to preprocessing is an important criterion for method evaluation and selection. In predictive data mining, it needs to be different from traditional methods. The predictive ability and calculation efficiency index are considered together. In order to maximize the prediction accuracy of data mining, the research of management science and machine learning is mainly devoted to enhancing competitive classifiers and effectively adjusting algorithm parameters. Classification algorithms are usually tested in extensive benchmark experiments, using pre-processed data sets to evaluate the impact on prediction accuracy and computational efficiency. In contrast, the research focus in DPP is on the development of algorithms for specific DPP tasks. 
Although feature selection, resampling, and continuous-attribute discretization have been analyzed in detail, few publications survey the data projections affecting categorical attributes and scaling. More importantly, there is no detailed analysis of their interaction with prediction accuracy in data mining, especially in the field of company direct marketing.

3.1. Preprocessing methods

This dissertation considers three standard NLP preprocessing steps: stemming, punctuation removal, and stop-word removal. In stemming, we obtain the stem of each word in the data set, i.e., the part of the word to which affixes can be attached. Stemming algorithms are language-specific and differ in performance and accuracy. Many different approaches can be used, such as affix-removal stemming, n-gram stemming, and table-lookup stemming. Another important NLP preprocessing step is removing punctuation, which is used to divide text into sentences, paragraphs, and phrases; because punctuation marks occur so frequently in text, they affect the results of any text-processing method, especially methods that depend on word and phrase frequencies. Finally, before any NLP processing, the most common words, known as stop words, are removed. Stop words are frequently used words that carry little additional information, such as articles, determiners, and prepositions. By removing these very common words from the text, we can focus on the important words.

Important Hyperparameters

The hyperparameters of a random forest are used either to improve the predictive ability of the model or to make the model faster. The following describes the hyperparameters of sklearn's built-in random forest function:

1. Increasing the Predictive Power
· n_estimators: the number of trees the algorithm builds before taking the majority vote or averaging the predictions. In general, more trees improve performance and make predictions more stable, but they also slow down computation.
· max_features: the maximum number of features the random forest is allowed to try in an individual tree. Sklearn offers several options.
· min_samples_leaf: the minimum number of samples required to be at a leaf node.

2. Increasing the Model's Speed

· n_jobs: tells the engine how many processors it is allowed to use. If the value is 1, only one processor is used; the value -1 means there is no limit.
· random_state: makes the output of the model reproducible. A model with a fixed random_state value, the same hyperparameters, and the same training data will always produce the same results.
· oob_score: enables out-of-bag (OOB) estimation, a random-forest cross-validation method. In this sampling scheme, about one-third of the data is not used to train each tree and can instead be used to evaluate its performance. These samples are called out-of-bag samples. The approach is very similar to leave-one-out cross-validation but carries almost no additional computational burden.

Why is Random Forest So Cool?

Impressive in Versatility
Whether you have a regression task or a classification task, random forest is a suitable model for your needs. It can handle binary, categorical, and numerical features. Hardly any preprocessing is required: the data does not need to be rescaled or transformed.

Parallelizable

Random forests are parallelizable, meaning the work can be split across multiple processors, which shortens computation time. Boosted models, by contrast, are sequential and take longer to compute. Specifically, in Python, pass the parameter n_jobs=-1 to use all available processors.

Great with High Dimensions

Random forest is suitable for high-dimensional data because each tree deals with only a subset of the features.

Quick Prediction/Training Speed

Each tree trains quickly because only part of the features are considered at each split, so we can easily use hundreds of features. Prediction is significantly faster than training because the generated forest can be saved for future use.

Robust to Outliers and Non-linear Data

Random forest handles outliers by essentially binning them, and it is likewise indifferent to non-linear features.

Handles Unbalanced Data
Random forest has a method for balancing error when the classes are imbalanced overall. It tries to minimize the overall error rate, so with an unbalanced data set the larger class tends to get a lower error rate while the smaller class gets a higher error rate.

Low Bias, Moderate Variance

Each individual decision tree has high variance but low bias. Because the random forest averages all of its trees, it also averages out the variance, giving a model with low bias and moderate variance.

Advantages of using Random Forest

As with any algorithm, there are advantages and disadvantages to using it. In the next two sections, we cover the advantages and disadvantages of using random forests for classification and regression. The random forest algorithm is not strongly biased, because there are multiple trees and each tree is trained on a subset of the data; basically, the algorithm relies on the power of the "crowd", so its overall bias is reduced. The algorithm is very stable: even if new data points are introduced into the data set, the overall algorithm is not affected much, since new data may affect one tree but is unlikely to affect all trees. The random forest algorithm works well when the data has both categorical and numerical features. It can also work well when the data has missing values or has not been well scaled (although we performed feature scaling in this article for demonstration purposes only).

Drawbacks
Model interpretability: random forest models are not very interpretable; they are like black boxes. For very large data sets, the trees can take up a lot of memory. Random forests can tend to overfit, so you should tune the hyperparameters; it has been observed that they overfit on certain data sets with noisy classification/regression tasks. They are more complex and computationally expensive than the decision tree algorithm and, due to this complexity, require more training time than other comparable algorithms.

Materials and methods

Data

The model was trained and evaluated on the PLCO data set. This data set was generated as part of a randomized, controlled, prospective study to determine the effectiveness of different prostate, lung, colorectal, and ovarian cancer screenings. Participants enrolled in the study and filled out a baseline questionnaire detailing their previous and current health status. All processing of this data set was done in Python (version 3.6.7). We initially downloaded the data of all women in the PLCO data set, comprising 78,215 women aged 50-78. We chose to exclude women who met any of the following conditions:

1. Missing data on whether they had been diagnosed with breast cancer and the time of diagnosis
2. Diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but with no information about place of birth
5. Missing data for any of the 13 selected predictors

We excluded women who had been diagnosed with breast cancer before the baseline questionnaire because BCRAT is not suitable for women with a personal history of breast cancer. BCRAT is also not suitable for women who have received chest radiotherapy for breast cancer, carry BRCA1 or BRCA2 gene mutations, or have lobular carcinoma in situ, ductal carcinoma in situ, or certain rare cancer-predisposing syndromes, such as Li-Fraumeni syndrome. Since the PLCO data set contains no data for these conditions, we assume they do not apply to any women in the data set. Because only the PLCO white, black, and Hispanic race/ethnicity categories match the BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity. We excluded subjects who identified as Hispanic but had no data on their place of birth because the BCRAT implementation uses different breast cancer composite incidence rates for US-born and foreign-born Hispanic women. After removing subjects based on these conditions, the number of women was reduced to 64,739.

We trained a set of machine learning models that use five of the usual seven BCRAT inputs. These five inputs (age, age at menarche, age at first live birth, number of first-degree relatives with breast cancer, and race/ethnicity) are the only traditional BCRAT inputs available in the PLCO data set. We compared these machine learning models against BCRAT given the same five inputs. Our models with a broader set of predictors take the five BCRAT inputs plus eight additional factors. These additional predictors were selected based on their availability in the PLCO data set and their correlation with breast cancer risk: age at menopause, an indicator of current hormone use, years of hormone use, BMI, smoking pack-years, years of birth-control use, number of live births, and an indicator of personal cancer history.

To facilitate training and testing of the models, we made limited modifications to the predictor variables. First, we assigned numeric values to the categorical variables. The PLCO data set records age at menarche, age at first live birth, age at menopause, years of hormone use, and years of birth-control use as categorical variables. For example, age at menarche is coded as: 1 for under 10 years old; 2 for 10-11 years old; 3 for 12-13 years old; 4 for 14-15 years old; and 5 for 16 years old or older. For categorical values representing a range below some upper limit (for example, under ten years old), we set the variable to that upper limit (for example, ten years old). Similarly, for values representing a minimum age or above (for example, 16 years old or older), we set the variable to that minimum (for example, 16 years old). For values covering a closed range (for example, 12-13 years old), we set the variable to the midpoint of the range (for example, 12.5 years old). After modifying the categorical variables, we made some adjustments to the age at first live birth and race/ethnicity variables entered into the machine learning models. For the BCRAT model, we set the age-at-first-live-birth variable of nulliparous women to 98 (as the implementation of BCRAT in the "BCRA" package (version 2.1) in R (version 3.4.3) specifies) and provided different race/ethnicity category values for foreign-born and US-born Hispanic women.
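The recoding rules above can be sketched as a simple mapping. This is an illustrative reconstruction; the code values follow the age-at-menarche example in the text, but the mapping table and function name are ours, not the study's actual code.

```python
# Hypothetical sketch of the categorical recoding described above:
# each age-at-menarche category code maps to a numeric age
# (upper limit for "under X", midpoint for closed ranges,
# lower limit for "X or older").
MENARCHE_CODE_TO_AGE = {
    1: 10.0,   # "under 10 years old"  -> upper limit of the range
    2: 10.5,   # "10-11 years old"     -> midpoint
    3: 12.5,   # "12-13 years old"     -> midpoint
    4: 14.5,   # "14-15 years old"     -> midpoint
    5: 16.0,   # "16 or older"         -> lower limit of the range
}

def recode_menarche(code):
    """Map a PLCO age-at-menarche category code to a numeric age."""
    return MENARCHE_CODE_TO_AGE[code]

ages = [recode_menarche(c) for c in [3, 5, 1]]
print(ages)  # [12.5, 16.0, 10.0]
```

The same pattern applies to the other categorical age variables (age at first live birth, age at menopause, and the hormone and birth-control durations), each with its own code table.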
For the machine learning models, we set the age-at-first-live-birth variable of nulliparous women to their current age, and we used two indicator variables to represent race/ethnicity: one for white women and one for black women. Each woman is classified under only one race/ethnicity (white, black, or Hispanic), so given the white and black indicators we do not need a separate Hispanic indicator; a Hispanic woman's white and black indicators are both 0. For the machine learning models, we did not distinguish between Hispanic women born in the United States and Hispanic women born abroad.
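As a closing illustration, the hyperparameters discussed earlier (n_estimators, max_features, min_samples_leaf, n_jobs, random_state, oob_score) can be combined in a single scikit-learn call. This is a generic sketch on synthetic data, not the study's actual configuration; the parameter values shown are arbitrary examples.

```python
# Minimal sketch (assumes scikit-learn): a random forest configured with
# the hyperparameters discussed in the text, trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

clf = RandomForestClassifier(
    n_estimators=200,      # more trees: better, more stable, but slower
    max_features="sqrt",   # max number of features tried at each split
    min_samples_leaf=2,    # minimum samples required at a leaf node
    n_jobs=-1,             # -1: use all available processors
    random_state=42,       # makes the output reproducible
    oob_score=True,        # out-of-bag estimate, a built-in cross-validation
)
clf.fit(X, y)

# The OOB score comes "for free" from the ~1/3 of samples each tree
# never saw during training.
print(f"OOB score: {clf.oob_score_:.3f}")
```

With oob_score=True, the fitted estimator exposes the out-of-bag accuracy as clf.oob_score_, so no separate validation split is needed for a rough performance estimate.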