Meetup Data Analysis
By,
Sushmanth Sagala
Spark Project – UpX Academy
Project Information
 Domain: Social
 Technologies used: Spark Streaming, Spark, Spark MLlib
 Dataset: http://stream.meetup.com/2/open_events
 Meetup is an online social networking portal that facilitates offline group meetings
in various localities around the world. Meetup allows members to find and join
groups unified by a common interest, such as politics, books and games.
Sample Events Dataset
Business Questions
 Streaming / Spark SQL
 Load the streaming data
 Count the number of events happening in a city, e.g. Hyderabad
 Count the number of free events
 Count the events in Technology category
 Count the number of Big data events happening in US
 Find the average duration of Technology events
 Spark MLlib
 Group the events by their category (k-means clustering)
Q1. Load the streaming data
 A custom receiver loads data from the external URL.
 An asynchronous HTTP request reads the data from the streaming URL.
 def onStart() {
    // Configure an async HTTP client that never times out on the long-lived stream.
    val cf = new AsyncHttpClientConfig.Builder()
    cf.setRequestTimeout(Integer.MAX_VALUE)
    cf.setReadTimeout(Integer.MAX_VALUE)
    cf.setPooledConnectionIdleTimeout(Integer.MAX_VALUE)
    client = new AsyncHttpClient(cf.build())
    // Pipe the HTTP response body to a reader thread.
    inputPipe = new PipedInputStream(1024 * 1024)
    outputPipe = new PipedOutputStream(inputPipe)
    val producerThread = new Thread(new DataConsumer(inputPipe))
    producerThread.start()
    // Stream each body part into the pipe as it arrives.
    client.prepareGet(url).execute(new AsyncHandler[Unit] {
      def onBodyPartReceived(bodyPart: HttpResponseBodyPart) = {
        bodyPart.writeTo(outputPipe)
        AsyncHandler.STATE.CONTINUE
      }
      ….
    })
  }
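 The onStart method above lives inside a custom receiver class; a minimal sketch of that shell (field names follow the snippet above, the storage level is an assumption, and the full implementation is in the linked repository):
import java.io.{PipedInputStream, PipedOutputStream}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import com.ning.http.client.AsyncHttpClient

class MeetupReceiver(url: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  @transient private var client: AsyncHttpClient = _
  @transient private var inputPipe: PipedInputStream = _
  @transient private var outputPipe: PipedOutputStream = _

  def onStart() { /* ... as shown above ... */ }

  def onStop() {
    // Release the HTTP client and the pipes when the receiver is stopped.
    if (client != null) client.close()
    if (outputPipe != null) outputPipe.close()
    if (inputPipe != null) inputPipe.close()
  }
}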
Q1. Load the streaming data
 The DataConsumer class extends Runnable to read the stream data and store it.
val bufferedReader = new BufferedReader(new InputStreamReader(inputStream))
var input = bufferedReader.readLine()
while (input != null) {
  store(input)
  input = bufferedReader.readLine()
}
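 A sketch of the Runnable shell around that read loop, assuming DataConsumer is an inner class of the receiver so that store() and isStopped() are in scope:
import java.io.{BufferedReader, InputStream, InputStreamReader}

// Hypothetical shell; in the repository this sits inside the receiver class so that
// store() and isStopped() (both inherited from Receiver) are available.
class DataConsumer(inputStream: InputStream) extends Runnable {
  override def run(): Unit = {
    val bufferedReader = new BufferedReader(new InputStreamReader(inputStream))
    var input = bufferedReader.readLine()
    while (!isStopped() && input != null) {
      store(input)                        // hand each JSON line to Spark Streaming
      input = bufferedReader.readLine()
    }
  }
}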
 Defining the case classes used to extract the respective data.
case class EventDetails(id: String, name: String, city: String, country: String,
payment_required: Int, cat_id: Int, cat_name: String, duration: Long)
case class Venue(name: Option[String], address1: Option[String], city: Option[String],
state: Option[String], zip: Option[String], country: Option[String], lon: Option[Float], lat:
Option[Float])
case class Event(id: String, name: Option[String], eventUrl: Option[String], description:
Option[String], duration: Option[Long], rsvpLimit: Option[Int], paymentRequired:
Option[Int], status: Option[String])
case class Group(id: Option[String], name: Option[String], city: Option[String], state:
Option[String], country: Option[String])
case class Category(name: Option[String], id: Option[Int], shortname: Option[String])
Q1. Load the streaming data
 The parseEvent method uses the Json4s library to extract the JSON data and build an EventDetails instance.
val json = parse(eventJson).camelizeKeys
val event = json.extract[Event]
val venue = (json \ "venue").extract[Venue]
val group = (json \ "group").extract[Group]
val category = (json \ "group" \ "category").extract[Category]
EventDetails(event.id, event.name.getOrElse(""), venue.city.getOrElse(""),
  venue.country.getOrElse(""), event.paymentRequired.getOrElse(0),
  category.id.getOrElse(0), category.shortname.getOrElse(""),
  event.duration.getOrElse(10800000L))
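 Json4s extraction needs an implicit Formats in scope; a sketch of how the snippet above could be wrapped (the Option return matches the flatMap(parseEvent) call below; the error handling is an assumption):
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

def parseEvent(eventJson: String): Option[EventDetails] = {
  try {
    val json = parse(eventJson).camelizeKeys
    val event = json.extract[Event]
    val venue = (json \ "venue").extract[Venue]
    val group = (json \ "group").extract[Group]
    val category = (json \ "group" \ "category").extract[Category]
    Some(EventDetails(event.id, event.name.getOrElse(""), venue.city.getOrElse(""),
      venue.country.getOrElse(""), event.paymentRequired.getOrElse(0),
      category.id.getOrElse(0), category.shortname.getOrElse(""),
      event.duration.getOrElse(10800000L)))
  } catch {
    case _: Exception => None   // skip records that fail to parse
  }
}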
 Starting the event stream with a batch interval of 2 seconds:
val ssc=new StreamingContext(conf, Seconds(2))
val eventStream = ssc.receiverStream(new
MeetupReceiver("http://stream.meetup.com/2/open_events")).flatMap(parseEvent)
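 The conf passed to StreamingContext is a SparkConf that is not shown on the slide; a minimal local setup could look like this (the app name and local[*] master are assumptions — a receiver needs at least two cores):
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MeetupStream").setMaster("local[*]")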
Stateful Stream
 Using a windowed stream to aggregate across intervals of the stream (window = slide, so successive windows do not overlap).
 Window and Slide interval = 10 sec
 Batch interval = 2 sec
 val windowEventStream = eventStream.window(Seconds(10),Seconds(10))
windowEventStream.cache()
 Custom functions to accumulate running sums when using updateStateByKey.
 def updateSumFunc(values: Seq[Int], state: Option[Int]): Option[Int] = {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }
 def updateSumFunc2f(values: Seq[Double], state: Option[Double]): Option[Double] = {
    val currentCount = values.sum
    val previousCount = state.getOrElse(0.0)
    Some(currentCount + previousCount)
  }
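 Because updateStateByKey keeps running state across batches, the streaming context also needs a checkpoint directory, and the job has to be started explicitly. A minimal sketch (the checkpoint path is illustrative):
ssc.checkpoint("checkpoint")   // required for updateStateByKey state recovery
// ... define the windowed streams and outputs shown on the following slides ...
ssc.start()
ssc.awaitTermination()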
Q2. Count the number of events happening in a city, e.g. Hyderabad
 Filtering the list of events happening in a given city, say “New York”.
 Reducing the events to get the number of events happening in this city for the
current Window computation.
 Aggregating the events count across the Window intervals using
updateStateByKey.
 val cityEventsStream = windowEventStream.filter{event => event.city == "New
York"}.map{event =>
(event.city,1)}.reduceByKey(_+_).updateStateByKey(updateSumFunc _)
 Printing the count of number of events happening in “New York” during each
Window interval.
 cityEventsStream.foreachRDD(rdd => {rdd.foreach{case (city, count) =>
println("No. of Events happening in %s city::%s".format(city, count))}})
Q3. Count the number of free events
 Filtering the list of free events using the condition payment_required == 0.
 Reducing the events to get the number of free events happening for the current
Window computation.
 Aggregating the events count across the Window intervals using
updateStateByKey.
 val freeEventsStream = windowEventStream.filter{event =>
event.payment_required == 0}.map{event =>
("Free",1)}.reduceByKey(_+_).updateStateByKey(updateSumFunc _)
 Printing the count of number of free events happening during each Window
interval.
 freeEventsStream.foreachRDD(rdd => {rdd.foreach{case (free, count) =>
println("No. of Free Events happening::%s".format(count))}})
Q4. Count the events in Technology category
 Filtering the list of Technology events happening.
 Reducing the events to get the number of Technology events happening for the current
Window computation.
 Aggregating the events count across the Window intervals using updateStateByKey.
 Reusing the Technology events count for Q6 by storing it in a driver-side variable (see the note after the code below).
 val techEventsStream = windowEventStream.filter{event => event.cat_name == "tech"}
 var techCount = 0
 val countTexhEventsStream = techEventsStream.map{event =>
(event.cat_name,1)}.reduceByKey(_+_).updateStateByKey(updateSumFunc _)
 Printing the count of number of Technology events happening during each Window
interval.
 countTexhEventsStream.foreachRDD(rdd => {rdd.foreach{case (cat_name, count) =>
techCount = count; println("No. of %s Events happening::%s".format(cat_name,count))}})
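 Note: techCount is assigned inside rdd.foreach, whose closure runs as tasks on the executors, so the driver-side variable is not updated reliably on a cluster. A sketch of a safer variant that first collects the (tiny) result back to the driver:
countTexhEventsStream.foreachRDD { rdd =>
  // collect() is cheap here: at most one (category, count) pair per window
  rdd.collect().foreach { case (cat_name, count) =>
    techCount = count
    println("No. of %s Events happening::%s".format(cat_name, count))
  }
}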
Q5. Count the number of Big data events
happening in US
 Filtering the list of Big data events happening in “US”.
 Reducing the events to get the number of Big data events happening in US for the
current Window computation.
 Aggregating the events count across the Window intervals using
updateStateByKey.
 val bigDataUSEventsStream = windowEventStream.filter{event => event.country
== "us" && event.name.toLowerCase.indexOf("big data") >= 0}.map{event =>
("Big Data",1)}.reduceByKey(_+_).updateStateByKey(updateSumFunc _)
 Printing the count of number of Big data events happening in “US” during each
Window interval.
 bigDataUSEventsStream.foreachRDD(rdd => {rdd.foreach{case (name, count) =>
println("No. of %s Events happening in US::%s".format(name,count))}})
Q6. Find the average duration of
Technology events
 Reducing the Technology events to get the event duration for the current Window
computation.
 Aggregating the events duration across the Window intervals using updateStateByKey.
 Computing and printing the average duration of Technology events during each Window interval.
 val sumDurTechEventsStream = techEventsStream.map{event => (event.cat_name + "
Events", event.duration.toDouble /
60000.0)}.reduceByKey(_+_).updateStateByKey(updateSumFunc2f _)
 sumDurTechEventsStream.foreachRDD(rdd => {
    rdd.map{ case (x: String, y: Double) => (x, y / techCount.toDouble) }
      .foreach{ case (cat_name: String, avg: Double) =>
        val hrs = (avg / 60.0).toInt
        val min = (avg % 60).toInt
        println("Avg duration of %s happening::%d hours %d minutes".format(cat_name, hrs, min))
      }
  })
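 For example, a cumulative total of 900 event-minutes across 6 Technology events gives an average of 150 minutes, printed as 2 hours 30 minutes.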
Sample output screenshot
Q7. Group the events by their category (k-means clustering)
 Building a recommendation model by using k-means clustering on events.
 Group-member recommendations are based on clustering the event categories together with the RSVP responses to those events.
 Parsing historical events.
 val eventsHistory = ssc.sparkContext.textFile("data/events/events.json", 1).flatMap(parseHisEvent)
 Parsing historical RSVPs.
 case class Member(memberName: Option[String], memberId: Option[String])
 case class MemberEvent(eventId: Option[String], eventName: Option[String], eventUrl:
Option[String], time: Option[Long])
 val json=parse(rsvpJson).camelizeKeys
 val member = (json \ "member").extract[Member]
 val event = (json \ "event").extract[MemberEvent]
 val response = (json \ "response").extract[String]
 (member, event, response)
 val rsvpHistory = ssc.sparkContext.textFile("data/rsvps/rsvps.json", 1).flatMap(parseRsvp)
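 The extraction lines above form the body of parseRsvp; a sketch of the complete method, assuming (as with parseEvent) an Option return so that flatMap can drop malformed records:
def parseRsvp(rsvpJson: String): Option[(Member, MemberEvent, String)] = {
  try {
    val json = parse(rsvpJson).camelizeKeys
    val member = (json \ "member").extract[Member]
    val event = (json \ "event").extract[MemberEvent]
    val response = (json \ "response").extract[String]
    Some((member, event, response))
  } catch {
    case _: Exception => None   // skip records that fail to parse
  }
}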
Q7. Group the events by their category (k-means clustering)
 Broadcasting a dictionary of English words to the executors.
 val localDictionary = Source.fromURL(getClass.getResource("/wordsEn.txt")).getLines.zipWithIndex.toMap
 val dictionary = ssc.sparkContext.broadcast(localDictionary)
 Feature extraction takes the 10 most popular words from each event description to form the event's category vector.
 def eventToVector(dictionary: Map[String, Int], description: String): Option[Vector] = {
    val wordsIterator = breakToWords(description)
    val topWords = popularWords(wordsIterator)
    if (topWords.size == 10) Some(Vectors.sparse(dictionary.size, topWords)) else None
  }
 val eventVectors=eventsHistory.flatMap{
event=>eventToVector(dictionary.value,event.description.getOrElse("")) }
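 breakToWords and popularWords are not shown on the slides; one plausible sketch, assuming the dictionary (word → index) is captured from the enclosing scope and that “popular” means the most frequent dictionary words in the description:
// Hypothetical helpers; the real ones are in the linked repository.
val dict: Map[String, Int] = dictionary.value   // word -> index, from the broadcast above

def breakToWords(description: String): Iterator[String] =
  """[a-z']+""".r.findAllIn(description.toLowerCase)

def popularWords(words: Iterator[String]): Seq[(Int, Double)] =
  words.toSeq
    .flatMap(w => dict.get(w).map(idx => (idx, 1.0)))   // keep only dictionary words
    .groupBy(_._1)
    .mapValues(_.map(_._2).sum)                         // frequency per word index
    .toSeq
    .sortBy(-_._2)
    .take(10)                                           // the 10 most frequent words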
Q7. Group the events by their category (k-means clustering)
 Training the k-means clustering model on the historical event vectors (k = 10 clusters, 2 iterations).
 val eventClusters = KMeans.train(eventVectors, 10, 2)
 Keying the historical events and the RSVP member records by event ID so that they can be joined.
 val eventHistoryById=eventsHistory.map{event=>(event.id,
event.description.getOrElse(""))}.reduceByKey{(first: String, second: String)=>first}
 val membersByEventId=rsvpHistory.flatMap{ case(member, memberEvent,
response) => memberEvent.eventId.map{id=>(id,(member, response))} }
 val rsvpEventInfo=membersByEventId.join(eventHistoryById)
 Example: (eventId, ((member, response), description))
 (221069430, ((Member(Some(Susan Beck),Some(101089292)), yes), ‘…’))
 (221149038, ((Member(Some(Tracy Ramey),Some(153724262), no), ‘…’))
Q7. Group the events by their category (k-means clustering)
 Predicting the Event cluster based on the trained model.
 val memberEventInfo = rsvpEventInfo.flatMap{ case (eventId, ((member, response), description)) =>
    eventToVector(dictionary.value, description).map{ eventVector =>
      val eventCluster = eventClusters.predict(eventVector)
      (eventCluster, (member, response))
    }
  }
 Clustering members into groups based on the predictions.
 val memberGroups = memberEventInfo
    .filter{ case (cluster, (member, memberResponse)) => memberResponse == "yes" }
    .map{ case (cluster, (member, memberResponse)) => (cluster, member) }
    .groupByKey()
    .map{ case (cluster, memberItr) => (cluster, memberItr.toSet) }
Q7. Group the events by their category (k-means clustering)
 Member Recommendations based on the clustering.
 val recommendations = memberEventInfo.join(memberGroups).map{ case (cluster, ((member, memberResponse), members)) =>
    (member.memberName, members - member)
  }
 Example: (member.memberName, members)
 (Some(Rosie),Set(Member(Some(Derek),Some(84715352)), Member(Some(Pastor
Jim Billetdeaux),Some(7569836)), Member(Some(Tom),Some(11503256)),
Member(Some(Haeran Dempsey),Some(10724391)),
Member(Some(Jane),Some(130609252)), Member(Some(Cathy),Some(42921402))))
Sample output screenshot
Conclusion
 Meetup Streaming data loaded and analysed successfully.
 Streaming event data was loaded through Spark Streaming using a custom receiver driven by asynchronous HTTP requests.
 Historical event and RSVP data were analysed with Spark MLlib to build group-member recommendations based on a k-means clustering model.
 Code: https://github.com/ssushmanth/meetup-stream