SPARK MACHINE LEARNING
Certification Course Academic Year (2017-2018)
Done by:
K Teja Sreenivas
INTRODUCTION:
– Machine learning is a type of artificial intelligence (AI) that
allows software applications to become more accurate in
predicting outcomes without being explicitly programmed.
The basic premise of machine learning is to build
algorithms that can receive input data and use statistical
learning to predict an output value within an acceptable
range.
– Machine learning algorithms are often categorized as
being supervised or unsupervised.
MACHINE LEARNING TYPES:
[Figure: machine learning types]
LIFE CYCLE IN DESIGNING A
MACHINE LEARNING MODEL
1. Data collection
2. Data processing
3. Feature Engineering
4. Model Building
5. Model Evaluation
6. Model Deployment
SPARK FOR MACHINE LEARNING:
• Spark is a distributed data processing engine, not a file system. It is commonly used in
place of Hadoop MapReduce and can read data stored in HDFS. Big Data workloads run across
network clusters and have become essential in several industries. The broad use of Hadoop
and MapReduce shows how quickly this technology is evolving, and the growing adoption of
Apache Spark is testament to this fact.
• Apache Spark offers advantages for Big Data applications over older technologies such as
Hadoop MapReduce, notably in-memory processing. The Apache Spark features are as follows:
1. Holistic framework
2. Speed
3. Easy to use
4. Enhanced support
PROBLEM STATEMENT:
Predict the annual return obtained
with a given set of concept weights.
The performance of each weight set
is simulated using US stock
market historical data.
DATA SET ATTRIBUTE INFORMATION:
• The inputs are the weights of the stock-picking concepts as follows
X1=the weight of the Large B/P concept
X2=the weight of the Large ROE concept
X3=the weight of the Large S/P concept
X4=the weight of the Large Return Rate in the last quarter concept
X5=the weight of the Large Market Value concept
X6=the weight of the Small systematic Risk concept
• The outputs are the investment performance indicators (normalized) as follows
Y1=Annual Return
Y2=Excess Return
Y3=Systematic Risk
Y4=Total Risk
Y5=Abs. Win Rate
Y6=Rel. Win Rate
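The attributes listed above correspond to the CSV column names used in the code later in the deck. The sketch below spells out one plausible pairing. The slides never state it explicitly, so the mapping (X1 to X6 in listed order, Y1 and Y3 for the two output columns) and the helper name `model_columns` are assumptions made for illustration.

```python
# Assumed pairing of the data set attributes with the CSV column names
# that appear in the training code; the slides do not state it explicitly.
INPUT_COLUMNS = {
    "X1": "Large_BnP",
    "X2": "Large_ROE",
    "X3": "Large_SnP",
    "X4": "Large_Return_Rate_last_quarter",
    "X5": "Large_Market_Value",
    "X6": "Small_systematic_Risk",
}
OUTPUT_COLUMNS = {
    "Y1": "Annual_Return",     # prediction target in this project
    "Y3": "systematic_risks",  # used as an extra input by the training code
}


def model_columns():
    """Return (feature columns, label column) as used by the training code."""
    features = list(INPUT_COLUMNS.values()) + [OUTPUT_COLUMNS["Y3"]]
    return features, OUTPUT_COLUMNS["Y1"]


features, label = model_columns()
print(len(features), label)
```

Note that `systematic_risks` is one of the performance indicators rather than a weight, so using it as a feature means the model sees one output while predicting another.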
TERMINOLOGY:
• P/B ratio: The price-to-book ratio, or P/B ratio, is a financial ratio used to compare a company's current market price to its
book value. It is also sometimes known as the market-to-book ratio. The B/P (book-to-price) input in the data set is its inverse.
• ROE: Return on equity (ROE) is the amount of net income returned as a percentage of shareholder equity. Return on
equity measures a corporation's profitability by revealing how much profit a company generates with the money
shareholders have invested.
• S&P 500: The S&P 500 measures the value of the stocks of 500 of the largest corporations by market capitalization listed on the
New York Stock Exchange or Nasdaq. Standard & Poor's intention is to have an index that provides a quick look at the stock
market and economy. The S/P input in the data set is a ratio (it appears to denote sales-to-price), not this index.
• Return Rate: A rate of return is the gain or loss on an investment over a specified time period, expressed as a percentage
of the investment's cost. Gains on investments are defined as income received plus any capital gains realized on the sale of
the investment.
• Market Value: The amount for which something can be sold on a given market.
• Systematic Risk: Systematic risk is the risk inherent to the entire market or market segment. Systematic risk, also known
as “undiversifiable risk,” “volatility,” or “market risk,” affects the overall market, not just a particular stock or industry. This
type of risk is both unpredictable and impossible to completely avoid.
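The ratio definitions above reduce to simple arithmetic. The sketch below works through them on made-up numbers; the function names and the figures are hypothetical and are not taken from the data set.

```python
def price_to_book(market_price_per_share, book_value_per_share):
    """P/B ratio: market price divided by book value."""
    return market_price_per_share / book_value_per_share


def return_on_equity(net_income, shareholder_equity):
    """ROE as a percentage of shareholder equity."""
    return net_income / shareholder_equity * 100.0


def rate_of_return(initial_cost, final_value, income=0.0):
    """Gain or loss plus income received, as a percentage of the cost."""
    return (final_value - initial_cost + income) / initial_cost * 100.0


# Made-up figures chosen only to keep the arithmetic easy to follow.
pb = price_to_book(50.0, 20.0)                    # 50 / 20 = 2.5
bp = 1.0 / pb                                     # book-to-price is the inverse
roe = return_on_equity(25.0, 200.0)               # 25 / 200 = 12.5 %
ror = rate_of_return(200.0, 240.0, income=10.0)   # (40 + 10) / 200 = 25 %
print(pb, roe, ror)
```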
SOFTWARE TOOLS USED:
• SPARK
• SPYDER
• ANACONDA
• JUPYTER
• PYTHON
• VIRTUAL MACHINE
• HDFS
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler, VectorIndexer

# Reuse the running SparkContext (or start one) and wrap it for DataFrame access.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Data collection: read the training CSV; every column arrives as a string.
data = sqlContext.read.csv('/home/tej/Documents/ML with spark/train.csv', header=True, sep=',')
data.show(n=5)

# Data processing: keep the input columns and the label, and cast them to float.
feature_cols = ['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
                'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks']
label_col = 'Annual_Return'
X_train = data.select([col(c).cast('float').alias(c) for c in feature_cols + [label_col]])

# Feature engineering: pack the inputs into one vector column, then index any
# feature with at most 7 distinct values as categorical.
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
X_train = assembler.transform(X_train)
featureIndexer = VectorIndexer(inputCol='features', outputCol='indexedFeatures', maxCategories=7).fit(X_train)
X_train = featureIndexer.transform(X_train)
from pyspark.ml.regression import LinearRegression

# Model building: fit a linear regression of Annual_Return on the indexed feature vector.
linear_reg = LinearRegression(labelCol='Annual_Return', featuresCol='indexedFeatures')
linear_reg_model = linear_reg.fit(X_train)
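For intuition about what fitting a linear regression estimates, a single-feature ordinary least squares fit can be written in plain Python. This is a sketch on made-up points, not the Spark implementation, which handles many features and distributed data. `fit_line` is a hypothetical helper introduced only for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With seven input columns, the fitted model has one coefficient per feature plus an intercept, but the idea of minimizing squared error is the same.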
from pyspark.sql.functions import col

# Model evaluation: load the held-out test CSV and prepare it like the training data.
test_data = sqlContext.read.csv('/home/tej/Documents/ML with spark/test.csv', header=True, sep=',')
test_cols = ['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
             'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks', 'Annual_Return']
X_test = test_data.select([col(c).cast('float').alias(c) for c in test_cols])

# Reuse the assembler and the VectorIndexer fitted on the training data.
# Refitting the indexer on the test set could map feature values differently
# from the mapping the model was trained on.
X_test = assembler.transform(X_test)
X_test = featureIndexer.transform(X_test)

# Score the test set and compare the predictions with the actual annual returns.
linear_predictions = linear_reg_model.transform(X_test)
linear_predictions.show()
linear_predictions.select('Annual_Return', 'prediction').show()
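One way to put a number on how close the `prediction` column is to `Annual_Return` is the mean absolute error. The sketch below computes it in plain Python on made-up values; the helper name and the numbers are hypothetical and do not come from the deck's output. On the predictions DataFrame itself, `pyspark.ml.evaluation.RegressionEvaluator` with `metricName='mae'` should give the same kind of figure.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values, standing in for the Annual_Return and prediction columns.
actual = [0.50, 0.25, 0.75, 1.00]
predicted = [0.75, 0.25, 0.50, 1.00]
mae = mean_absolute_error(actual, predicted)
print(mae)  # (0.25 + 0 + 0.25 + 0) / 4 = 0.125
```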
CONCLUSION:
• From the final output it is clear that, by training a linear model on the data set, we
obtained predictions of annual returns with less than 0.1 unit
error on average.
KEY LEARNING:
• We have learnt the basic uses of machine learning and the use of Spark
in implementing a machine learning model.
• The various phases involved in designing a machine learning model were
understood and implemented using a linear regression model.
THANK YOU!
