This document provides an overview of Spark, including:
- Spark Streaming's processing model chops live data streams into small micro-batches and treats each batch as an RDD, so the usual transformations and actions apply to streaming data (see the streaming sketch after this list).
- Resilient Distributed Datasets (RDDs) are Spark's primary abstraction: an immutable, distributed collection of objects that can be operated on in parallel.
- An example word count program illustrates how to create and manipulate RDDs to count the frequency of words in a text file; a minimal batch sketch appears after this list.
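
As a rough illustration of the micro-batch model described above, here is a minimal Spark Streaming sketch using the DStream API. The socket source on `localhost:9999` and the one-second batch interval are placeholder assumptions for the example, not taken from the document:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")

    // Each 1-second micro-batch of the stream becomes an RDD under the hood.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: lines of text arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary RDD-style transformations apply to each batch.
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // emit the counts computed for each batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```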
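
And here is a minimal batch word count over RDDs, of the kind the document walks through later. The input path `input.txt` is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("WordCountSketch")
    val sc = new SparkContext(conf)

    // textFile creates an RDD of lines; each transformation returns a new immutable RDD.
    val counts = sc.textFile("input.txt")     // placeholder path
      .flatMap(_.split("\\s+"))               // split each line into words
      .map(word => (word, 1))                 // pair each word with a count of 1
      .reduceByKey(_ + _)                     // sum counts per word in parallel

    // collect() is an action: it triggers execution and returns results to the driver.
    counts.collect().foreach(println)
    sc.stop()
  }
}
```

Transformations like `flatMap` and `reduceByKey` are lazy; nothing runs until the `collect()` action forces evaluation.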