Spark machine learning

SPARK MACHINE LEARNING
Certification Course Academic Year (2017-2018)
Done by:
K Teja Sreenivas

INTRODUCTION:
– Machine learning is a type of artificial intelligence (AI) that
allows software applications to become more accurate at
predicting outcomes without being explicitly programmed.
The basic premise of machine learning is to build
algorithms that receive input data and use statistical
learning to predict an output value within an acceptable
range.
– Machine learning algorithms are often categorized as
supervised or unsupervised.
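
As a small illustration of supervised learning, the sketch below fits a straight line to labelled examples by least squares and then predicts an unseen value. The data points and the helper names (fit_line, predict) are made up for this example and are not part of the original slides.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical labelled training data that follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)


def predict(x):
    """Predict the output for a new input using the fitted line."""
    return slope * x + intercept


print(predict(4.0))
```

The model is "supervised" because every training input comes with its known output; an unsupervised method would receive only the inputs.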

LIFE CYCLE IN DESIGNING A
MACHINE LEARNING MODEL
 1. Data collection
 2. Data processing
 3. Feature Engineering
 4. Model Building
 5. Model Evaluation
 6. Model Deployment

SPARK FOR MACHINE LEARNING:
• Spark is a distributed data-processing engine, not a file system. It is often used in
place of Hadoop MapReduce and can read data stored in HDFS. Big Data workloads run
across clusters of networked machines and are essential in several industries. The broad
use of Hadoop and MapReduce shows how this technology keeps evolving, and the
growing adoption of Apache Spark is testament to that.
• Apache Spark offers advantages for Big Data applications over older Big Data
technologies such as Hadoop MapReduce. Its main features are as follows:
1. Holistic framework
2. Speed
3. Easy to use
4. Enhanced support

PROBLEM STATEMENT:
Predict annual returns from sets of
stock-picking concept weights whose
performance was simulated on
historical US stock market data.

DATA SET ATTRIBUTE INFORMATION:
• The inputs are the weights of the stock-picking concepts as follows
X1=the weight of the Large B/P concept
X2=the weight of the Large ROE concept
X3=the weight of the Large S/P concept
X4=the weight of the Large Return Rate in the last quarter concept
X5=the weight of the Large Market Value concept
X6=the weight of the Small systematic Risk concept
• The outputs are the investment performance indicators (normalized) as follows
Y1=Annual Return
Y2=Excess Return
Y3=Systematic Risk
Y4=Total Risk
Y5=Abs. Win Rate
Y6=Rel. Win Rate

TERMINOLOGY:
• P/B ratio: The price-to-book ratio, or P/B ratio, is a financial ratio used to compare a company's current market price to its
book value. It is also sometimes known as the market-to-book ratio. The data set uses its inverse, B/P (book-to-price).
• ROE: Return on equity (ROE) is the amount of net income returned as a percentage of shareholder equity. Return on
equity measures a corporation's profitability by revealing how much profit a company generates with the money
shareholders have invested.
• S/P: In this data set, S/P most likely denotes the sales-to-price ratio (sales per share divided by share price), not the S&P 500
index. For reference, the S&P 500 tracks the stocks of 500 large US corporations listed on exchanges such as the New York
Stock Exchange and Nasdaq, and is meant to give a quick look at the stock market and economy.
• Return Rate: A rate of return is the gain or loss on an investment over a specified time period, expressed as a percentage
of the investment's cost. Gains on investments are defined as income received plus any capital gains realized on the sale of
the investment.
• Market Value: The amount for which something can be sold on a given market.
• Systematic Risk: Systematic risk is the risk inherent to the entire market or market segment. Systematic risk, also known
as “undiversifiable risk,” “volatility,” or “market risk,” affects the overall market, not just a particular stock or industry. This
type of risk is both unpredictable and impossible to completely avoid.
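
The ratio definitions above can be sketched as short Python functions. The function names and the sample figures are hypothetical and only illustrate the arithmetic; they do not come from the data set.

```python
def price_to_book(price_per_share, book_value_per_share):
    """P/B ratio: market price relative to book value."""
    return price_per_share / book_value_per_share


def return_on_equity(net_income, shareholder_equity):
    """ROE: net income as a percentage of shareholder equity."""
    return net_income * 100.0 / shareholder_equity


def rate_of_return(initial_cost, final_value, income=0.0):
    """Gain or loss (plus income received) as a percentage of the initial cost."""
    return (final_value - initial_cost + income) * 100.0 / initial_cost


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
pb = price_to_book(50.0, 20.0)
bp = 1.0 / pb  # B/P is the inverse of P/B
roe = return_on_equity(15.0, 100.0)
ror = rate_of_return(200.0, 210.0, income=10.0)

print(pb, bp, roe, ror)
```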

SOFTWARE TOOLS USED:
• SPARK
• SPYDER
• ANACONDA
• JUPYTER
• PYTHON
• VIRTUAL MACHINE
• HDFS

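from pyspark import SparkContext
from pyspark.sql import SQLContext
# The slides assume an existing SparkContext named sc; create or reuse one here.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Data collection: read the training CSV (every column is loaded as a string).
data = sqlContext.read.csv('/home/tej/Documents/ML with spark/train.csv', header=True, sep=',')
data.show(n=5)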
# Data processing: keep the input columns and the label, casting each from string to float.
columns = ['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
           'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks', 'Annual_Return']
X_train = data.select([data[c].cast('float').alias(c) for c in columns])
# Feature engineering: assemble the input columns into a single feature vector.
from pyspark.ml.feature import VectorAssembler, VectorIndexer
assembler = VectorAssembler(
    inputCols=['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
               'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks'],
    outputCol='features')
X_train = assembler.transform(X_train)
# Features with at most 7 distinct values are treated as categorical and indexed.
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=7).fit(X_train)
X_train = featureIndexer.transform(X_train)

# Model building: fit a linear regression on the indexed features.
from pyspark.ml.regression import LinearRegression
linear_reg = LinearRegression(labelCol='Annual_Return', featuresCol='indexedFeatures')
linear_reg_model = linear_reg.fit(X_train)

# Load the test set and cast its columns to float, as was done for the training data.
test_data = sqlContext.read.csv('/home/tej/Documents/ML with spark/test.csv', header=True, sep=',')
test_columns = ['Large_BnP', 'Large_ROE', 'Large_SnP', 'Large_Return_Rate_last_quarter',
                'Large_Market_Value', 'Small_systematic_Risk', 'systematic_risks', 'Annual_Return']
X_test = test_data.select([test_data[c].cast('float').alias(c) for c in test_columns])
# Reuse the assembler and the indexer fitted on the training data, so that the
# test features are encoded exactly as the model saw them during training.
X_test = assembler.transform(X_test)
X_test = featureIndexer.transform(X_test)

linear_predictions = linear_reg_model.transform(X_test)
linear_predictions.show()
linear_predictions.select('Annual_Return','prediction').show()
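
To put a number on how close the predictions are to the true annual returns, the label and prediction columns can be compared with an error metric. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) in plain Python. The sample values are hypothetical and are not results from these slides. In Spark itself, pyspark.ml.evaluation.RegressionEvaluator with metricName set to "mae" or "rmse" computes the same metrics directly on the prediction DataFrame.

```python
from math import sqrt


def mean_absolute_error(labels, predictions):
    """Average absolute difference between true values and predictions."""
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def root_mean_squared_error(labels, predictions):
    """Square root of the average squared difference."""
    squared = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    return sqrt(squared / len(labels))


# Hypothetical (Annual_Return, prediction) values, for illustration only.
labels = [0.50, 0.25, 0.75, 1.00]
predictions = [0.25, 0.25, 1.00, 0.50]

mae = mean_absolute_error(labels, predictions)
rmse = root_mean_squared_error(labels, predictions)
print(mae, rmse)
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of model quality.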

CONCLUSION:
• The final output shows that a linear model trained on this data set predicts
annual returns with an error of less than 0.1 units on average.
KEY LEARNING:
• We learnt the basic uses of machine learning and how Spark is used
in the implementation of a machine learning model.
• The various phases involved in designing a machine learning model were
understood and implemented using a linear regression model.
