R Statistics with MongoDB

Dr. Markus Schmidberger
October 14th, 2013 Munich, Germany
Email: markus@mongosoup.de
Twitter: @cloudHPC

Outline

Introduction to Big Data, MongoSoup and R
R statistics with MongoDB and Examples
Summary & Questions

Big Data
Wikipedia: "… a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …"
storing
processing

Storing: NoSQL - MongoDB

databases using looser consistency models to store data
German MongoDB as a Service: MongoSoup
cloudControl Add-On
currently running on AWS EU-Region (Ireland)
all features available: shared / dedicated hosting, replica set, sharding
24/7 support available

MongoSoup in < 5 min

go to cloudControl: www.cloudcontrol.com
add an account and a billing address
create a new app, e.g. “rmongodb”
install cloudControl command line tools: cctrlapp
enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
go to the cloudControl Web-Console-AddOns and get your credentials
https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
Check my Strata London Tutorial “Big Data Analyses with R”

Introduction to R

R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs under different OS
support via huge community

www.r-project.org

huge online libraries with > 5000 R packages: http://cran.r-project.org
possibility to write personalized code and to contribute new packages
really famous since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”

RStudio IDE

http://www.rstudio.com

R as calculator

(5+5) - 1 * 3
[1] 7
x <- 3
x
[1] 3
x^2 + 4
[1] 13

y <- c(1,2,3)
y
[1] 1 2 3
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

x[3:7]

[1] 3 4 5 6 7
mean(x)
[1] 5.5
help("mean")
?mean

Many Statistical Functions

(cl <- kmeans(dat, 4))
K-means clustering with 4 clusters of sizes 21, 18, 30, 31
Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507
Clustering vector:
  [1] 4 2 4 4 2 4 4 ...
 [36] 4 4 4 4 4 4 4 ...
 [71] 1 3 1 1 3 3 3 ...
Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS = 83.0 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"
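
The slides use dat and cl without showing where they come from. A minimal sketch of one possible setup, assuming dat is simulated data made of two Gaussian clouds of 50 points each (similar to the example on R's kmeans help page). The seed, the means and the standard deviation are assumptions, so the numbers will differ from the output on the slide:

```r
# Assumption: `dat` is simulated data, two Gaussian clouds of 50 points each.
set.seed(42)  # arbitrary seed, only for reproducibility
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# Store the result so that later calls can use cl$cluster and cl$centers.
cl <- kmeans(dat, centers = 4)

cl$size            # number of points per cluster
table(cl$cluster)  # the same counts, derived from the assignment vector
```

Storing the result in cl is what makes the later plotting calls (cl$cluster, cl$centers) work.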

plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)

R Shiny - easy web application

developed by RStudio
turns R analyses into interactive web applications that anyone can use
let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
easily incorporate any number of outputs like plots, tables, and summaries
no HTML or JavaScript knowledge is necessary, only R
http://www.rstudio.com/shiny/

R and Databases
SQL provides a standard language to filter, aggregate, group, and sort data
SQL in new places: Hive, Impala, …
ODBC provides an SQL interface to non-database data (Excel, CSV, text files)
R stores relational data in data.frames (extended lists)
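
To make the link between SQL and data frames concrete, here is a small sketch in base R on the built-in iris data. The SQL statements in the comments are only rough equivalents, and the variable names are chosen for illustration:

```r
data(iris)

# A data.frame is a list of equally long column vectors.
is.list(iris)
length(iris)  # number of columns
nrow(iris)    # number of rows

# roughly: SELECT * FROM iris WHERE Species = 'setosa' AND Sepal.Length > 5.5
setosa_long <- subset(iris, Species == "setosa" & Sepal.Length > 5.5)

# roughly: SELECT Species, AVG(Sepal.Length) FROM iris GROUP BY Species
avg_len <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# roughly: SELECT * FROM iris ORDER BY Sepal.Length DESC LIMIT 3
top3 <- head(iris[order(-iris$Sepal.Length), ], n = 3)
```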

data(iris)
head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
class(iris)
[1] "data.frame"

R package: sqldf

running SQL statements on R data frames
library(sqldf)
sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sqldf("select count(*) from iris")
  count(*)
1      150

Other relational R packages

RMySQL package provides an interface to MySQL
RPostgreSQL package provides an interface to PostgreSQL
ROracle package provides an interface for Oracle
RJDBC package provides access to databases through a JDBC interface
RSQLite package provides access to SQLite (SQLite engine is included)
One big problem: all of these packages read the full result into R memory
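
One general way around the memory problem is to process a result in chunks and keep only running aggregates. The following is a sketch of that idea in base R, not something taken from the slides: a text connection stands in for a large result, and the column name and chunk size are hypothetical.

```r
# Stand-in for a large query result: CSV text with one numeric column.
# (Assumption: in practice this would be a file or a database cursor.)
csv_lines <- c("pop", as.character(1:1000))
con <- textConnection(csv_lines)

header <- readLines(con, n = 1)  # consume the header line
chunk_size <- 100                # hypothetical chunk size
total <- 0
n <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # nothing left to read
  values <- as.numeric(lines)
  total <- total + sum(values)   # update running aggregates only
  n <- n + length(values)
}
close(con)

running_mean <- total / n
```

Only one chunk is held in memory at a time; the mean is computed from the running sum and count.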

R and MongoDB

on CRAN there are two packages to connect R with MongoDB
rmongodb (supported by MongoDB, Inc.)
  powerful for big data
  difficult to use due to BSON objects
RMongo
  easy to use
  limited functionality
  reads full results into R memory
  does not work on Mac OS X

R package: RMongo

library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
dbShowCollections(mongo)
dbGetQuery(mongo, "zips", "{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
dbDisconnect(mongo)

R package: rmongodb

developed on top of the C driver supported by MongoDB
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de",
                      db="cc_JwQcDLJSYQJb",
                      username="JwQcDLJSYQJb",
                      password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0

mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")
[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp"  "cc_JwQcDLJSYQJb.test"
mongo <- mongo.disconnect(mongo)

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
query <- mongo.bson.from.buffer(buf)
query
state : 2    AL

res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res
city : 2     ACMAR
loc : 4
    0 : 1    -86.515570
    1 : 1    33.584132
pop : 16     6055
state : 2    AL
_id : 2      35004

out <- mongo.bson.to.list(res)
out$loc
[1] -86.52  33.58
typeof(out$loc)
[1] "double"
out$pop
[1] 6055
out$state
[1] "AL"

cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)
res <- NULL
while (mongo.cursor.next(cursor)){
  value <- mongo.cursor.value(cursor)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)
head(res, n=4)
       city         loc       pop   state _id
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
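
The result of the cursor loop is a matrix of lists, which is awkward for statistics. A possible way to get a regular data frame is sketched below in base R. The records are hypothetical stand-ins shaped like the output of mongo.bson.to.list(); the first one uses the values shown on the slides, while the coordinates of the second are made up. The helper to_row and the column names lon, lat and id are assumptions for illustration.

```r
# Hypothetical records, shaped like the output of mongo.bson.to.list().
records <- list(
  list(city = "ACMAR", loc = c(-86.51557, 33.584132), pop = 6055,
       state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", loc = c(-86.9, 33.6), pop = 10616,  # loc is made up
       state = "AL", `_id` = "35005")
)

# Flatten one record into a one-row data.frame, splitting loc into lon/lat.
to_row <- function(r) {
  data.frame(city = r$city, lon = r$loc[1], lat = r$loc[2],
             pop = r$pop, state = r$state, id = r[["_id"]],
             stringsAsFactors = FALSE)
}

# Collect all rows first and bind once, instead of calling rbind() per record.
zips <- do.call(rbind, lapply(records, to_row))
str(zips)
mean(zips$pop)
```

Binding once at the end avoids copying the growing result on every iteration, which matters for large cursors.
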
It is all about creating BSON query or field objects

b <- mongo.bson.from.list(
  list(name="Fred", age=29, city="Boston"))
b
name : 2     Fred
age : 1      29.000000
city : 2     Boston
mongo.bson.to.list(b)
$name
[1] "Fred"
$age
[1] 29
$city
[1] "Boston"

?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object
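buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
# first pipeline stage; BSON array elements are keyed "0", "1", ...
mongo.bson.buffer.start.object(buf, "0")
mongo.bson.buffer.start.object(buf, "$group")
mongo.bson.buffer.append(buf, "_id", "$state")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$sum", "$pop")
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $group
mongo.bson.buffer.finish.object(buf)  # stage "0"
# second pipeline stage
mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.start.object(buf, "$match")
mongo.bson.buffer.start.object(buf, "totalPop")
mongo.bson.buffer.append(buf, "$gte", 10000)  # numeric, not the string "10000"
mongo.bson.buffer.finish.object(buf)  # totalPop
mongo.bson.buffer.finish.object(buf)  # $match
mongo.bson.buffer.finish.object(buf)  # stage "1"
mongo.bson.buffer.finish.object(buf)  # pipeline array
query <- mongo.bson.from.buffer(buf)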
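
Because the nesting is hard to follow in a sequence of buffer calls, it can help to write down the intended command as nested R lists first and print it. The sketch below uses base R only. The helper to_json is hypothetical and written just for inspection, and the convention that unnamed lists stand for arrays and named lists for objects is an assumption of this sketch, not a statement about how rmongodb converts lists.

```r
# Intended command: group zips by state, sum pop, keep totals >= 10000.
pipeline <- list(
  list(`$group` = list(`_id` = "$state", totalPop = list(`$sum` = "$pop"))),
  list(`$match` = list(totalPop = list(`$gte` = 10000)))
)
command <- list(aggregate = "zips", pipeline = pipeline)

# Hypothetical helper: render a nested list as JSON-like text.
to_json <- function(x) {
  if (is.list(x)) {
    parts <- vapply(x, to_json, character(1))
    if (is.null(names(x))) {
      # unnamed list: treated as an array
      paste0("[", paste(parts, collapse = ", "), "]")
    } else {
      # named list: treated as an object
      paste0("{", paste0('"', names(x), '": ', parts, collapse = ", "), "}")
    }
  } else if (is.character(x)) {
    paste0('"', x, '"')
  } else {
    as.character(x)
  }
}

cat(to_json(command), "\n")
```

The printed text shows the pipeline as an array with two stages, which is the shape the buffer calls have to reproduce.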

CCP Web Analytics Challenge

# an empty BSON document as query matches all documents
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# return only the fields user and type
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
res <- NULL
while (mongo.cursor.next(out)){
  value <- mongo.cursor.value(out)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}

boxplot(as.integer(table(unlist(res[,2]))), cex=4, horizontal=TRUE,
        main="Number of actions per user")
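
The counting step inside the boxplot call can be tried without a database. In the sketch below, res is a hypothetical stand-in for the matrix built by the cursor loop (one row per action, with user in the second column), and the user names, seed and action type are made up. boxplot.stats() returns the five numbers a boxplot would draw, so no graphics device is needed.

```r
# Hypothetical stand-in for `res`: a list matrix with one row per action.
set.seed(1)  # arbitrary seed, only for reproducibility
users <- sample(paste0("user", 1:20), size = 200, replace = TRUE)
res <- cbind(`_id` = as.list(seq_along(users)),
             user  = as.list(users),
             type  = as.list(rep("click", length(users))))

# Count rows per user; this is what table(unlist(res[,2])) computes.
actions_per_user <- table(unlist(res[, 2]))

# Five-number summary that the boxplot would display.
boxplot.stats(as.integer(actions_per_user))$stats
```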

Shiny Mongo
R based MongoDB User Interface
R packages shiny and rmongodb
less than 200 lines of code
DEMO: http://localhost:8100

https://github.com/comsysto/ShinyMongo

Summary
R is a powerful statistical tool to analyse many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
start playing around with R, Big Data and MongoDB
http://www.r-project.org
http://www.mongodb.org
http://www.mongosoup.de

See you soon

thanks a lot for your attention
there are R trainings in December 2013 in Munich
http://comsysto.com/events.html#r
we are hosting many events and meetups
meet you at the MongoSoup booth

Email: markus@mongosoup.de
Twitter: @cloudHPC
