BIG DATA BIG ANALYTICS
A OHRI
Pre-Agenda
-Presenter Introduction
-Audience Introduction
-Expectations
--------------------------------------------
Presenter Introduction
www.linkedin.com/in/ajayohri
Working with Analytics since 2004
Educated at IIM Lucknow, DCE, U Tenn
Author (R for Business Analytics (Springer))
Blogger at www.decisionstats.com


Interviewed 100+ Analytics leaders
Audience Introduction

● Affiliation-Academic/ Govt/Private
● Years of working with Big Data-
● Specific Interest Area in Analytics-
Great Expectations
From You
1. No mobile rings, no sleeping (discreet sleeping),
2. Please take notes using pencil, parchment, paper, pen,
computer, tablet, stylus, mobile, etc.,
3. Please ask questions at the END (from the notes taken at
Step 2)
From Me
1. Breadth of Case Studies (!)
2. Open Source focus (R mostly, Clojure, Python)
3. Actionable ideas are useful!
i.e., "I spent 3 hours in X talk, but I did learn to do Y", or "I am now interested in trying out Z"
Agenda
-Presenter Introduction
-Audience Identification
-Expectations

--------------------------------------------
-Big Data
-Big Data Analytics using R
       -Case Study 1 (Amazon AWS, SAP Hana DB)
-Big Data Analytics using other tools
       -Case Study 2 (BigML.com, Picloud.com)
--------------------------------------------
Big Data
What is Big Data?
"Big data" is a term applied to data sets whose size is beyond the ability of
commonly used software tools to capture, manage, and process the data within
a tolerable elapsed time. Examples include web logs, RFID, sensor networks,
social networks, social data (due to the social data revolution), Internet text and
documents, Internet search indexing, call detail records, astronomy,
atmospheric science, genomics, biogeochemical, biological, and other complex
and often interdisciplinary scientific research, military surveillance, medical
records, photography archives, video archives, and large-scale e-commerce.

IBM - http://www-01.ibm.com/software/data/bigdata/

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the
data in the world today has been created in the last two years alone. This data
comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos, purchase transaction records,
and cell phone GPS signals to name a few. This data is big data.
Big Data
What is Big Data?



Big Data Conferences
--O'Reilly's Strata
--Hadoop World
--Many many conferences......including ours
Thought for Today
Data that is classified as Big Data in 2012 will
be classified as Little Data by 2018.

True ---------- False?
What is Cloud Computing?
Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of
configurable computing resources (e.g., networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort
or service provider interaction. This cloud model is
composed of five essential characteristics, three service
models, and four deployment models.
-- National Institute of Standards and Technology
http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
Cloud Computing and
Big Data Analytics
Were it not for cloud computing, the cost of computing
on Big Data would be too high.

The cloud runs predominantly on X OS and, as of 2012,
needs customized solutions.

Open source analytics solutions (OS-Analytics) are
more easily customized.
Sources of Big Data
--Internet
------Server Logs, Clickstream, Analytics

--Social Media

--Governments and UN bodies

--Internal Data from customers
Storing Big Data for R
--Lots of RAM (?!)
--RDBMS
--Documents (CouchDB, MongoDB)


--HDFS (Hadoop)
Storing Big Data for R
--Documents (CouchDB, MongoDB)

Package RMongo provides an R interface to a Java client
for MongoDB (http://en.wikipedia.org/wiki/MongoDB)
databases, which are queried using JavaScript rather than
SQL. Package rmongodb is another client, built on
MongoDB's C driver.

R can talk to CouchDB through Couch's RESTful HTTP API:
construct HTTP calls with RCurl, then move on to the
R4CouchDB package for a higher-level interface.
https://github.com/wactbprot/R4CouchDB
http://digitheadslabnotebook.blogspot.in/2010/10/couchdb-and-r.html
Big Data Packages in R - 1/2
http://cran.r-project.org/web/views/HighPerformanceComputing.html

●   The biglm package by Lumley uses incremental computations to offer lm()
    and glm() functionality for data sets stored outside of R's main memory.
●   The ff package by Adler et al. offers file-based access to data sets that are
    too large to be loaded into memory, along with a number of higher-level
    functions.
●   The bigmemory package by Kane and Emerson permits storing large
    objects such as matrices in memory (as well as via files) and uses external
    pointer objects to refer to them. This permits transparent access from R
    without bumping against R's internal memory limits. Several R processes
    on the same computer can also share big memory objects.
●   The HadoopStreaming package provides a framework for writing map/reduce
    scripts for use in Hadoop Streaming. It also facilitates operating on data
    in a streaming fashion, without Hadoop.
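The "incremental computations" idea in the biglm bullet can be sketched in a few lines. This is a minimal illustration in Python (one of the open-source languages this talk lists), not biglm's actual algorithm: it keeps running sums for a one-predictor regression and never holds more than one chunk in memory. The class name `IncrementalLinearFit` and the helper `in_chunks` are made up for this sketch.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[float, float]  # one (x, y) observation


class IncrementalLinearFit:
    """Fit y = intercept + slope * x from chunks, keeping only running sums."""

    def __init__(self) -> None:
        self.n = 0
        self.sum_x = 0.0
        self.sum_y = 0.0
        self.sum_xx = 0.0
        self.sum_xy = 0.0

    def update(self, chunk: Sequence[Pair]) -> None:
        # Fold one chunk into the sufficient statistics, then forget it.
        for x, y in chunk:
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y

    def coefficients(self) -> Tuple[float, float]:
        # Closed-form least squares solution from the accumulated sums.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return intercept, slope


def in_chunks(data: List[Pair], size: int) -> Iterator[List[Pair]]:
    # Stand-in for reading a large file piece by piece.
    for start in range(0, len(data), size):
        yield data[start:start + size]


data = [(float(x), 1.0 + 2.0 * x) for x in range(10)]
fit = IncrementalLinearFit()
for chunk in in_chunks(data, 3):
    fit.update(chunk)
print(fit.coefficients())
```

The point of the design is that memory use depends on the chunk size, not on the number of rows, which is what lets a regression run on data stored outside main memory.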
Big Data Packages in R - 2/2
●   http://cran.r-project.org/web/packages/biganalytics/
This package extends the bigmemory package with various analytics.
The functions bigkmeans and binit may also be used with native R objects.
●   http://cran.r-project.org/web/packages/bigtabulate/index.html
This package extends the bigmemory package with table- and split-like support
for big.matrix objects. The functions may also be used with regular R matrices
to improve speed and memory efficiency.
●   http://cran.at.r-project.org/web/packages/synchronicity/index.html
For mutex (locking) support for advanced shared-memory usage, see
synchronicity.
https://r-forge.r-project.org/R/?group_id=556 lists more projects. For linear
algebra support, see bigalgebra.
Big Data and Revolution
Analytics
Primary - RevoScaleR package / XDF format

Also sponsored RHadoop
https://github.com/RevolutionAnalytics/RHadoop
RHadoop - rhdfs package
https://github.com/decisionstats/RHadoop/wiki/rhdfs
Overview
This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and
modify files stored in HDFS. The following functions are part of this package:
   ●    File Manipulations: hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get
   ●    File Read/Write: hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file
   ●    Directory: hdfs.dircreate, hdfs.mkdir
   ●    Utility: hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists
   ●    Initialization: hdfs.init, hdfs.defaults
http://hadoop.apache.org/hdfs/
Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple
replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
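To make "multiple replicas of data blocks" concrete, here is a toy sketch in Python. It is an assumption-laden illustration only: the round-robin placement, the node names, and the functions `split_into_blocks`, `place_replicas`, and `blocks_still_available` are invented for this example and are not HDFS's real placement policy or API.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    # Cut a file into fixed-size blocks; the last block may be shorter.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    # Toy round-robin placement (an assumption, not the HDFS policy):
    # each block goes to `replication` distinct nodes.
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def blocks_still_available(placement: Dict[int, List[str]], failed_node: str) -> bool:
    # True if every block keeps at least one replica after a node fails.
    return all(any(node != failed_node for node in holders) for holders in placement.values())


blocks = split_into_blocks(b"abcdefghij", 4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
print(placement)
print(blocks_still_available(placement, "n2"))
```

With three replicas per block, losing any single node leaves every block readable, which is the reliability property the slide refers to.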
RHadoop - rhbase package
https://github.com/decisionstats/RHadoop/wiki/rhbase
Overview
This R package provides basic connectivity to HBase, using the Thrift server. R programmers can browse, read, write, and modify
tables stored in HBase. The following functions are part of this package:
   ●    Table Manipulation: hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table
   ●    Read/Write: hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan
   ●    Utility: hb.list.tables
   ●    Initialization: hb.defaults, hb.init
http://hbase.apache.org/

HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.
RHadoop - rmr package
Overview

This R package allows an R programmer to perform
statistical analysis via MapReduce on a Hadoop cluster.

● Average flight delay (Orbitz): original and updated
  version with presentation
● Network analysis: original and a summary
Also see https://github.com/decisionstats/RHadoop/wiki/Tutorial
for logistic regression and k-means.
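The MapReduce pattern behind an "average flight delay" job can be sketched on a single machine. The Python below is a minimal illustration, not rmr code and not the Orbitz analysis: the record layout (carrier, delay in minutes) and the function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]      # (carrier, delay in minutes), hypothetical layout
Partial = Tuple[float, int]     # (sum of delays, count)


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, Partial]]:
    # Emit one key/value pair per record.
    for carrier, delay in records:
        yield carrier, (delay, 1)


def shuffle(pairs: Iterable[Tuple[str, Partial]]) -> Dict[str, List[Partial]]:
    # Group values by key, as the framework does between map and reduce.
    groups: Dict[str, List[Partial]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values: List[Partial]) -> float:
    # Combine the partial sums and counts into an average.
    total = sum(delay for delay, _ in values)
    count = sum(n for _, n in values)
    return total / count


def average_delay(records: Iterable[Record]) -> Dict[str, float]:
    grouped = shuffle(map_phase(records))
    return {carrier: reduce_phase(values) for carrier, values in grouped.items()}


flights = [("AA", 10.0), ("AA", 20.0), ("UA", 5.0)]
print(average_delay(flights))  # {'AA': 15.0, 'UA': 5.0}
```

Carrying (sum, count) pairs rather than averages is what lets the reduce step be applied to partial results from many machines without losing correctness.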
Big Data Social Network
 Analysis
Analyzing a big social network using R and
distributed graph engines
http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
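The linked post is about graph degree distributions. As a small-scale sketch of what that statistic is, the Python below counts how many vertices have each degree in an undirected edge list. The edge list and the function name `degree_distribution` are made up for illustration; at real scale the counting would be pushed to the distributed engine.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def degree_distribution(edges: Iterable[Edge]) -> Dict[int, int]:
    # First pass: degree of each vertex (undirected, so both ends count).
    degree: Counter = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Second pass: how many vertices share each degree value.
    return dict(Counter(degree.values()))


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
print(degree_distribution(edges))
```

Both passes are simple counts by key, which is why this statistic maps naturally onto a MapReduce job.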
Big Data Social Media
Analysis
Can be used for customers (and also for latent influencers) -
http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/
Big Data Social Media
Analysis
The R package twitteR
(http://cran.r-project.org/web/packages/twitteR/index.html)
can be used for prototyping, but Twitter's API is rate
limited to 1500 per hour(?)/day, so we can use the
Datasift API: http://datasift.com/pricing#costs
Big Data Social Media
Analysis
How does information propagate through a
social network?
http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
Big Data Social Network
Analysis
Can be used for terrorists (and also for potential protestors) -
Drew Conway, http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf

Primary focus is on three aspects of network analysis:
1. Identifying leadership and key actors
2. Revealing underlying structure and intra-network community structure
3. Evolution and decay of social networks
Big Data and Revolution
Analytics
Primary - RevoScaleR package / XDF format
Also sponsored RHadoop

● For a case study, see UpStream Software (slide 16):
http://www.revolutionanalytics.com/news-events/free-webinars/2012/how-big-data-is-changing-retail-marketing-analytics/

● Big data GLMs (you might find the chart on this page
  useful):
http://blog.revolutionanalytics.com/2012/06/big-data-generalized-linear-models-with-revolution-r-enterprise.html

● Data distillation with Hadoop and R:
http://blog.revolutionanalytics.com/2012/06/data-distillation-with-hadoop-and-r.html

● Analysis of the million-row movie data set (building
  recommendation engines):
http://blog.revolutionanalytics.com/2012/04/simple-tools-for-building-a-recommendation-engine.html
Big Data and Revolution
Analytics
The marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits,
emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
More R and Hadoop Case
Studies
A few examples where R and Hadoop are used for data distillation:
 ● Using robust regression on a series of raw voice-over-IP packets to
    calculate how long participants talk during a phone conversation.
 ● Using graph theory (and R's igraph package) to quantify the number of
    close friends of members of a social network.
 ● Orbitz uses R and Hadoop to extract flights and hotels that will be
    presented during a travel search, based on previous transactions.
 ● Using k-means clustering to extract similar "groups" of transactions, which
    are then aggregated and used as the record level for structured analysis.
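The last bullet relies on k-means clustering. A minimal one-dimensional sketch in Python shows the assign-then-average loop; the transaction amounts and starting centers are made-up values, and `kmeans_1d` is a name invented here rather than any package's function.

```python
from typing import List, Sequence


def nearest(value: float, centers: Sequence[float]) -> int:
    # Index of the center closest to this value.
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values: Sequence[float], centers: Sequence[float], max_iter: int = 100) -> List[float]:
    current = list(centers)
    for _ in range(max_iter):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters: List[List[float]] = [[] for _ in current]
        for v in values:
            clusters[nearest(v, current)].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        updated = [sum(c) / len(c) if c else current[i] for i, c in enumerate(clusters)]
        if updated == current:
            break
        current = updated
    return current


amounts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(amounts, centers=[0.0, 5.0])
print(centers)                                   # [2.0, 11.0]
print([nearest(a, centers) for a in amounts])    # [0, 0, 0, 1, 1, 1]
```

Once each transaction carries a cluster label, the labels can serve as the grouping key for the aggregation step the bullet describes.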
Using RDBMS (Big?) Data
through R
--RDBMS - RODBC package
http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases
RMySQL       http://cran.r-project.org/web/packages/RMySQL/index.html
ROracle      http://cran.r-project.org/web/packages/ROracle/index.html
RPostgreSQL  http://cran.r-project.org/web/packages/RPostgreSQL/index.html
DBI          http://cran.r-project.org/web/packages/DBI/index.html
RSQLite      http://cran.r-project.org/web/packages/RSQLite/index.html
Using RDBMS (is it Big
Data?)
through R
--RDBMS - RODBC package
http://cran.r-project.org/web/packages/RODBC/RODBC.pdf
> library(RODBC)
> odbcDataSources(type = c("all", "user", "system"))
                SQLServer              PostgreSQL30             PostgreSQL35W
             "SQL Server"    "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)"
                    MySQL
  "MySQL ODBC 5.1 Driver"
Querying Big Data
--RDBMS-SQL

--Hadoop-Pig (but many ways)
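For the RDBMS route, the useful habit is to let SQL do the aggregation so that only a small result crosses into the analysis session. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the `orders` table, its columns, and the function `top_customers` are hypothetical and stand in for whatever database the R packages above would connect to.

```python
import sqlite3
from typing import List, Tuple


def top_customers(conn: sqlite3.Connection, limit: int) -> List[Tuple[str, float]]:
    # The database sums and ranks; only `limit` rows come back to the client.
    cursor = conn.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer ORDER BY total DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 45.0)],
)
print(top_customers(conn, 2))  # [('alice', 50.0), ('carol', 45.0)]
```

The same connect, query, fetch sequence applies when the client is R; only the driver package changes.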
Big Data Analytics
- Challenges

--Traditional statistical theory grew up when data was
constrained

--Traditional analytics programming was NOT parallel
processing

--Shortage of trained people
Big Data Analytics
- Solutions

--Teaching more parallel programming and algorithms

--More focus on data reduction techniques such as clustering and
segmentation than on hypothesis testing. Sampling,
anyone?

--Training more data scientists
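On "Sampling, anyone?": one standard way to draw a fixed-size uniform sample from data too large to hold in memory is reservoir sampling (Algorithm R). The Python sketch below is offered as one possible answer to that question, not as a method the talk prescribes; the stream here is just a range of integers.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    # Keep the first k items, then replace a random slot with
    # probability k / (i + 1) for the item at zero-based position i.
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


rng = random.Random(42)  # seeded so the run is repeatable
sample = reservoir_sample(range(10000), 5, rng)
print(sample)
```

The stream is read once and only k items are ever stored, so the sample can be taken while the data is being loaded or scanned.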
Big Data Analytics
- Tools used
-Why R

-High Performance Computing
http://cran.r-project.org/web/views/HighPerformanceComputing.html

-Big Data Within R
http://www.slideshare.net/bytemining/r-hpc
Using R (interfaces)
--Using R Studio for easier development


--Using Rattle GUI for straight off the shelf data
mining and Using R Commander for Extensions

--Using Revolution Analytics RPE
-----Example of Snippets
Using R
--Using R for text mining
---Text Mining from Twitter Case Study
---Datasift Export to Amazon S3




--Using R for geo-coded analysis
---Hana DB

--Using R for Graphical Analysis of Big Data
TablePlot
3D using R Commander

--Using R for forecasting
Using Plugin R Commander E -Pack
Existing Big Data Case
Studies
Departure of Aeroplanes - SAP Hana 200m
http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html

R using SAP Hana
http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/

SAP Hana DB uses R
http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
Oracle R Enterprise
Case Studies and Examples
http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
Revolution Analytics
RevoScaleR package
Using the RevoScaleR package to extract time series data from time-stamped logs (in
this case, the "US Domestic Flights From 1990 to 2009" dataset on
Infochimps):
Analyzing time series data of all sorts is a fundamental business analytics task
to which the R language is beautifully suited. In addition to the time series
functions built into the base stats library, there are dozens of R packages devoted to
time series...
We have shown how to use data manipulation functions of the RevoScaleR package
to extract time-stamped data from a large data file, aggregate it, and form it into
monthly time series that can easily be analyzed with standard R functions.

http://www.inside-r.org/howto/extracting-time-series-large-data-sets
http://blog.revolutionanalytics.com/2011/09/how-to-extract-time-series-from-large-timestamped-logs-with-r.html
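The extract, aggregate, and form-a-monthly-series steps can be sketched without RevoScaleR. The Python below is a minimal illustration under assumptions: the log format (an ISO date, a comma, and a passenger count) and the function `monthly_totals` are invented for this example and do not reflect the layout of the Infochimps dataset.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def monthly_totals(lines: Iterable[str]) -> List[Tuple[str, int]]:
    # Stream over the log one line at a time, adding each count
    # to the bucket for its calendar month.
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:
        stamp, passengers = line.strip().split(",")
        month = datetime.strptime(stamp, "%Y-%m-%d").strftime("%Y-%m")
        totals[month] += int(passengers)
    # Sorting the "YYYY-MM" keys gives the series in time order.
    return sorted(totals.items())


log = ["1990-01-03,120", "1990-01-17,80", "1990-02-01,95"]
print(monthly_totals(log))  # [('1990-01', 200), ('1990-02', 95)]
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so the full log never needs to be loaded at once; the small monthly result is what gets handed to standard time series functions.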
Using R on Amazon -Case
Study
--Bioconductor in the Cloud

--Custom Amazon Instance




--Concerns for non-American users of Amazon
Using BigML on cloud
Case Study
Classification using Clojure on Cloud
https://bigml.com/gallery/models/fraud_and_crime

--Concerns about depending on third-party tools
--Example: Cloudnumbers.com
Using Google APIs
https://code.google.com/apis/console/?pli=1

Google Storage API

Google Prediction API

Introduction to other APIs

----Concerns for users of Google APIs
Using Google APIs case
study
Google Storage API
Google Prediction API
http://code.google.com/p/google-prediction-api-r-client/
Using Google APIs case
study
Introduction to other Big Data Google APIs

----Concerns for users of Google APIs
Using Python - PiCloud
http://www.picloud.com/
Privacy hazards of big data
analytics.
Big Brother -1984 --- 2012

They know where you are (mobiles)
They know what you are looking for (internet)
They know your past (financial history +social media)
They can use your medical history
Laws authorize them (Patriot Act?)

--Example: Emotional Analysis of Images http://www.affectiva.com/
References and
Acknowledgements
David Smith,      Revolution Analytics
David Champagne, Revolution Analytics
All R Bloggers,Developers, Packagers
Blag - SAP Hana Analytics
Charlie Berger -and Oracle R Team
Jim Kobielus -IBM Big Data Team
R Development Core Team (2012). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
Thanks
Book - R for Business
Analytics
http://www.springer.com/statistics/book/978-1-4614-4342-1

Big data Big Analytics

  • 2. Pre- Agenda -Presenter Introduction -Audience Introduction -Expectations --------------------------------------------
  • 3. Presenter Introduction www.linkedin.com/in/ajayohri Working with Analytics since 2004 Educated at IIM Lucknow, DCE, U Tenn Author (R for Business Analytics (Springer)) Blogger at www.decisionstats.com Interviewed 100+ Analytics leaders
  • 4. Audience Introduction ● Affiliation-Academic/ Govt/Private ● Years of working with Big Data- ● Specific Interest Area in Analytics-
  • 5. Great Expectations From You 1.No mobile rings , no sleeping (discreet sleeping), 2.Please take notes using pencil,parchment, paper,pen, computer,tablet,stylus,mobile etc, 3.Please ask Questions in the END(from notes taken at Step 2) From Me 1 Breadth of Case Studies (!) 2 Open Source focus (R mostly, clojure, python) 3 Actionable Ideas are useful ! i.e I spent 3 hours in X talk but I did learn to do Y, or I am now interested in trying out Z
  • 6. Agenda -Presenter Introduction -Audience Identification -Expectations -------------------------------------------- -Big Data -Big Data Analytics using R -Case Study 1(Amazon AWS,SAP Hana DB) -Big Data Analytics using other tools -Case Study 2 (BigML.com, Picloud.com) --------------------------------------------
  • 7. Big Data What is Big Data? "Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce. IBM- http://www-01.ibm.com/software/data/bigdata/ Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.
  • 8. Big Data What is Big Data? Big Data Conferences --O'Reilly's Strata --Hadoop World --Many many conferences......including ours
  • 9. Thought for Today In 2012 , data that is classified as Big Data will be classified as Little Data by 2018 True ----------False ?
  • 10. What is Cloud Computing? Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf National Institute of Standards and Technology --
  • 11. Cloud Computing and Big Data Analytics Cost of computing Big Data would be too much, but for cloud computing. Cloud runs on X OS predominantly, and needs customized solutions as of 2012 Open source solutions (OS- Analytics) are more easily customized
  • 12. Sources of Big Data --Internet ------Server Logs,Clickstream,Analytics --Social Media --Governments and UN bodies --Internal Data from customers
  • 13. Storing Big Data for R --Lots of RAM (?!) --RDBMS --Documents (Couch DB ,MongoDB) --HDFS (Hadoop)
  • 14. Storing Big Data for R --Documents (Couch DB ,MongoDB) Package RMongo provides an R interface to a Java client for `MongoDB' (http://en.wikipedia.org/wiki/MongoDB) databases, which are queried using JavaScript rather than SQL. Package rmongodb is another client using mongodb's C driver. https://github.com/wactbprot/R4CouchDB R talking to CouchDB using Couch's ReSTful HTTP API. construct HTTP calls with RCurl, then move on to the R4CouchDB package for a higher level interface. http://digitheadslabnotebook.blogspot.in/2010/10/couchdb- and-r.html
  • 15. Big Data Packages in R- 1/2 http://cran.r-project.org/web/views/HighPerformanceComputing.html ● The biglm package by Lumley uses incremental computations to offers lm() and glm() functionality to data sets stored outside of R's main memory. ● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions. ● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them. This permits transparent access from R without bumping against R's internal memory limits. Several R processes on the same computer can also shared big memory objects. ● The HadoopStreaming Provides a framework for writing map/reduce scripts for use in Hadoop Streaming. Also facilitates operating on data in a streaming fashion, without Hadoo
  • 16. Big Data Packages in R -2/2 ● http://cran.r-project.org/web/packages/biganalytics/ This package extends the bigmemory package with various analytics. Functions bigkmeans and binit may also be used with native R objects. ● http://cran.r-project.org/web/packages/bigtabulate/index.html This package extends the bigmemory package with table- and split-like support for big.matrix objects. The functions may also be used with regular R matrices for improving speed and memory-efficiency. ● http://cran.at.r-project.org/web/packages/synchronicity/index.html For mutex (locking) support for advanced shared-memory usage, see synchronicity. https://r-forge.r-project.org/R/?group_id=556 lists more projects. For linear algebra support, see bigalgebra.
  • 17. Big Data and Revolution Analytics Primary -RevoScaleR package /XDF format Also sponsored RHadoop https://github.com/RevolutionAnalytics/RHadoop
  • 18. RHadoop -rhdfs package rhdfs- https://github.com/decisionstats/RHadoop/wiki/rhdfs Overview This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS. The following functions are part of this package ● File Manipulations ● hdfs.copy, hdfs.move, hdfs.rename, hdfs.delete, hdfs.rm, hdfs.del, hdfs.chown, hdfs.put, hdfs.get ● File Read/Write ● hdfs.file, hdfs.write, hdfs.close, hdfs.flush, hdfs.read, hdfs.seek, hdfs.tell, hdfs.line.reader, hdfs.read.text.file ● Directory ● hdfs.dircreate, hdfs.mkdir ● Utility ● hdfs.ls, hdfs.list.files, hdfs.file.info, hdfs.exists ● Initialization ● hdfs.init, hdfs.defaults http://hadoop.apache.org/hdfs/ Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
  • 19. RHadoop -rhbase package rhbase- https://github.com/decisionstats/RHadoop/wiki/rhbase Overview This R package provides basic connectivity to HBASE, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBASE. The following functions are part of this package ● Table Manipulation ● hb.new.table, hb.delete.table, hb.describe.table, hb.set.table.mode, hb.regions.table ● Read/Write ● hb.insert, hb.get, hb.delete, hb.insert.data.frame, hb.get.data.frame, hb.scan ● Utility ● hb.list.tables ● Initialization ● hb.defaults, hb.init http://hbase.apache.org/ HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.
  • 20. RHadoop -rmr package rmr- Overview This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster. ● Average flight delay (Orbitz): original and updated version with presentation ● Network analysis: original and a summary Also see https://github.com/decisionstats/RHadoop/wiki/Tutorial for logistic regression and k-means
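To make the MapReduce pattern behind an "average flight delay" job concrete, here is a minimal single-machine sketch in Python. It is not the rmr API and does not talk to Hadoop: `run_mapreduce`, `map_flight`, `reduce_delays` and the `FLIGHTS` records are hypothetical names and made-up data, used only to show the map, shuffle and reduce steps.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]      # (carrier code, delay in minutes)
Partial = Tuple[float, int]     # (sum of delays, number of flights)


def map_flight(record: Record) -> Iterator[Tuple[str, Partial]]:
    """Map step: emit one (key, value) pair per input record."""
    carrier, delay = record
    yield carrier, (delay, 1)


def reduce_delays(carrier: str, values: List[Partial]) -> Tuple[str, float]:
    """Reduce step: combine all values that share a key into a mean delay."""
    total = sum(delay for delay, _ in values)
    count = sum(n for _, n in values)
    return carrier, total / count


def run_mapreduce(
    records: Iterable[Record],
    mapper: Callable[[Record], Iterator[Tuple[str, Partial]]],
    reducer: Callable[[str, List[Partial]], Tuple[str, float]],
) -> Dict[str, float]:
    """Local stand-in for the framework: map, group by key, then reduce."""
    groups: Dict[str, List[Partial]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # the "shuffle" phase
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Made-up sample data, for illustration only.
FLIGHTS: List[Record] = [
    ("AA", 10.0), ("AA", 20.0),
    ("UA", 5.0), ("UA", 15.0), ("UA", 25.0),
]

if __name__ == "__main__":
    print(run_mapreduce(FLIGHTS, map_flight, reduce_delays))
```

Emitting (sum, count) pairs instead of raw delays keeps the values combinable, so on a real cluster partial results from different nodes could be merged before the final division.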
  • 21. Big Data Social Network Analysis Analyzing A Big Social Network using R and distributed graph engines http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/
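A degree distribution, the quantity computed in the linked post, simply counts how many nodes have each number of connections. The Python sketch below shows the calculation on an in-memory edge list; the function name `degree_distribution` and the four-edge toy graph are assumptions for illustration, and a real social network would need the distributed setup the post describes.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def degree_distribution(edges: Iterable[Edge]) -> Dict[int, int]:
    """Return {degree: number of nodes with that degree} for an undirected graph."""
    degree: Counter = Counter()
    for u, v in edges:
        degree[u] += 1      # each edge adds one to both endpoints
        degree[v] += 1
    return dict(Counter(degree.values()))


# Toy undirected graph, made up for illustration.
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]

if __name__ == "__main__":
    for deg, nodes in sorted(degree_distribution(EDGES).items()):
        print(deg, nodes)
```

The per-edge counting step is what makes this easy to distribute: each edge contributes independently, so the counts can be accumulated in a map/reduce job and only the small summary table brought back for plotting.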
  • 22. Big Data Social Media Analysis Can be used for Customers (and also for latent influencers)- http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/
  • 23. Big Data Social Media Analysis R package twitteR (http://cran.r-project.org/web/packages/twitteR/index.html) can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so we can use the Datasift API http://datasift.com/pricing#costs
  • 24. Big Data Social Media Analysis How does information propagate through a social network? http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
  • 25. Big Data Social Network Analysis Can be used for Terrorists (and also for potential protestors)- Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf Primary focus is on three aspects of network analysis 1. Identifying leadership and key actors 2. Revealing underlying structure and intra-network community structure 3. Evolution and decay of social networks
  • 26. Big Data and Revolution Analytics Primary -RevoScaleR package /XDF format Also sponsored RHadoop ● For a case study, UpStream software (slide 16): http://www.revolutionanalytics.com/news-events/free-webinars/2012/how-big-data-is-changing-retail-marketing-analytics/ ● Big data GLMs (you might find the chart on this page useful): http://blog.revolutionanalytics.com/2012/06/big-data-generalized-linear-models-with-revolution-r-enterprise.html ● Data distillation with Hadoop and R: http://blog.revolutionanalytics.com/2012/06/data-distillation-with-hadoop-and-r.html ● Analysis of the million row movie data set (building recommendation engines): http://blog.revolutionanalytics.com/2012/04/simple-tools-for-building-a-recommendation-engine.html
  • 27. Big Data and Revolution Analytics Marketing analytics company UpStream Software used map-reduce to convert transactions from Omniture logs (web visits, emails clicked on, ads displayed) into customer behaviors: response to an offer, research into a product, purchases.
  • 28. More R and Hadoop Case Studies A few examples where R and Hadoop are used for data distillation: ● Using robust regression on a series of raw voice-over-IP packets to calculate how long participants talk during a phone conversation. ● Using graph theory (and R's igraph package) to quantify the number of close friends of members of a social network. ● Orbitz uses R and Hadoop to extract flights and hotels that will be presented during a travel search, based on previous transactions. ● Using k-means clustering to extract similar "groups" of transactions, which are then aggregated and used as the record level for structured analysis.
  • 29. Using RDBMS (Big?) Data through R --RDBMS -RODBC Package http://cran.r-project.org/doc/manuals/R-data.html#Relational-databases http://cran.r-project.org/web/packages/RMySQL/index.html RMySQL http://cran.r-project.org/web/packages/ROracle/index.html ROracle http://cran.r-project.org/web/packages/RPostgreSQL/index.html RPostgreSQL http://cran.r-project.org/web/packages/DBI/index.html DBI http://cran.r-project.org/web/packages/RSQLite/index.html RSQLite
  • 30. Using RDBMS (is it Big Data?) through R --RDBMS -RODBC Package http://cran.r-project.org/web/packages/RODBC/RODBC.pdf > library(RODBC) > odbcDataSources(type = c("all", "user", "system")) SQLServer PostgreSQL30 PostgreSQL35W "SQL Server" "PostgreSQL ANSI(x64)" "PostgreSQL Unicode(x64)" MySQL "MySQL ODBC 5.1 Driver"
  • 32. Big Data Analytics - Challenges ---Traditional statistics theory grew up when data was constrained --Traditional analytics programming was NOT parallel processing --Shortage of trained people
  • 33. Big Data Analytics - Solutions ---Teaching more parallel programming and algorithms --More focus on data reduction techniques like clustering , segmentation than on hypothesis testing. Sampling, anyone? --Training more data scientists
  • 34. Big Data Analytics - Tools used -Why R -High Performance Computing http://cran.r-project.org/web/views/HighPerformanceComputing.html -Big Data Within R http://www.slideshare.net/bytemining/r-hpc
  • 35. Using R (interfaces) --Using R Studio for easier development --Using Rattle GUI for straight off the shelf data mining and Using R Commander for Extensions --Using Revolution Analytics RPE -----Example of Snippets
  • 36. Using R --Using R for text mining ---Text Mining from Twitter Case Study ---Datasift Export to Amazon S3 --Using R for geo-coded analysis ---Hana DB --Using R for Graphical Analysis of Big Data TablePlot 3D using R Commander --Using R for forecasting Using Plugin R Commander E -Pack
  • 37. Existing Big Data Case Studies Departure of Aeroplanes-SAP Hana 200m http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html R using SAP Hana http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/
  • 38. SAP Hana DB uses R http://scn.sap.com/community/in-memory-business-data-management/blog/2011/11/28/dealing-with-r-and-hana
  • 39. Oracle R Enterprise Case Studies and Examples http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html
  • 41. Revolution Analytics RevoScaleR package The RevoScaleR package can extract time series data from time-stamped logs (in this case, the "US Domestic Flights From 1990 to 2009" dataset on Infochimps): Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into the base stats library there are dozens of R packages devoted to time series... We have shown how to use data manipulation functions of the RevoScaleR package to extract time stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions. http://www.inside-r.org/howto/extracting-time-series-large-data-sets http://blog.revolutionanalytics.com/2011/09/how-to-extract-time-series-from-large-timestamped-logs-with-r.html
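The "extract, aggregate, form into monthly time series" step can be illustrated without RevoScaleR. The Python sketch below rolls time-stamped records up to monthly totals in one streaming pass; it is a stand-in for the idea, not the RevoScaleR functions used in the linked posts, and `monthly_totals`, the `LOG` records and the passenger counts are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

LogRow = Tuple[date, int]            # (flight date, passengers carried)
MonthKey = Tuple[int, int]           # (year, month)


def monthly_totals(rows: Iterable[LogRow]) -> List[Tuple[MonthKey, int]]:
    """Aggregate time-stamped rows into a chronologically ordered monthly series.

    Rows are consumed one at a time, so the input can be a generator reading
    a file far larger than memory; only one total per month is kept.
    """
    totals: Dict[MonthKey, int] = defaultdict(int)
    for day, passengers in rows:
        totals[(day.year, day.month)] += passengers
    return sorted(totals.items())


# Made-up log records, deliberately out of order.
LOG: List[LogRow] = [
    (date(1990, 1, 5), 100),
    (date(1990, 1, 20), 50),
    (date(1990, 2, 1), 70),
    (date(1990, 1, 31), 25),
]

if __name__ == "__main__":
    for (year, month), total in monthly_totals(LOG):
        print(f"{year}-{month:02d}", total)
```

Once the data is reduced to one value per month, the resulting series is small enough to hand to ordinary time series tools, which is the point the slide makes about standard R functions.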
  • 42. Using R on Amazon -Case Study --Bioconductor in the Cloud --Custom Amazon Instance --Concerns for non-American users of Amazon
  • 43. Using BigML on cloud Case Study Classification using Clojure on Cloud https://bigml.com/gallery/models/fraud_and_crime --Concerns on depending on third party tools --Example Cloudnumbers.com
  • 44. Using Google APIs https://code.google.com/apis/console/?pli=1 Google Storage API Google Prediction API Introduction to other APIs ----Concerns to users of Google APIs
  • 45. Using Google APIs case study Google Storage API Google Prediction API http://code.google.com/p/google-prediction-api-r-client/
  • 46. Using Google APIs case study Introduction to other Big Data Google APIs ----Concerns to users of Google APIs
  • 47. Using Python- PiCloud http://www.picloud.com/
  • 48. Privacy hazards of big data analytics. Big Brother -1984 --- 2012 They know where you are (mobiles) They know what you are looking for (internet) They know your past (financial history + social media) They can use your medical history Laws authorize them (Patriot Act?) --example Emotional Analysis of Images http://www.affectiva.com/
  • 49. References and Acknowledgements David Smith, Revolution Analytics David Champagne, Revolution Analytics All R Bloggers, Developers, Packagers Blag - SAP Hana Analytics Charlie Berger - and Oracle R Team Jim Kobielus - IBM Big Data Team R Development Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.
  • 51. Book- R for Business Analytics http://www.springer.com/statistics/book/978-1-4614-4342-1