Data-Intensive Computing
B. Ramamurthy
08/05/2025
B.Ramamurthy 2016
 The phrase was initially coined by the National Science Foundation (NSF)
 What is it?
 Volume, velocity, variety, veracity (uncertainty) (Gartner, IBM)
 How is it addressed?
 Why now?
 What do you expect to extract by processing such large data?
 Intelligence for decision making
 What is different now?
 Storage models, processing models
 Big Data, analytics and cloud infrastructures
 Summary
Data-intensive computing
Motivation
 Tremendous advances have taken place in statistical methods and tools,
machine learning and data mining approaches, and internet-based
dissemination tools for analysis and visualization.
 Many tools are open source and freely available for anybody to use.
 Is there an easy entry-point into learning these technologies?
 Can we make these tools easily accessible to the students, researchers
and decision makers similar to how “office” productivity software is
used?
High Level Goals for the course
 Understand foundations of data analytics so that you can interpret and
communicate results and make informed decisions
 Study and learn to apply common statistical methods and machine
learning algorithms to solve business problems
 Learn to work with popular tools to analyze and visualize data; more
importantly encourage consistency across departments on analytics/tools used
 Work with the cloud for data storage and for deployment of applications
 Learn methods for mastering and applying emerging concepts and
technologies for continuous data-driven improvements to your
research/work/business processes
 Transform complex analytics into routine processes
Newer kinds of Data
 New kinds of data from different sources (see p.23 of Data Science book) :
tweets, geo location, emails, blogs
 Two major types: structured and unstructured data
 Structured data: data collected and stored according to a well-defined schema;
e.g., real-time stock quotes
 Unstructured data: messages from social media, news, talks, books, letters,
manuscripts, court documents..
 “Regardless of their differences, they work in tandem in any effective big data
operation. Companies wishing to make the most of their data should use tools
that utilize the benefits of both.”5
 We will discuss methods for analyzing both structured and unstructured data
 Bioinformatics data: from about 3.3 billion base pairs in a
human genome to huge numbers of protein sequences and
the analysis of their behaviors
 The internet: web logs, Facebook, Twitter, maps, blogs, etc.:
analytics …
 Financial applications: that analyze volumes of data for trends
and other deeper knowledge
 Health Care: huge amount of patient data, drug and treatment
data
 The universe: the Hubble Ultra Deep Field shows hundreds of
galaxies, each with billions of stars; Sloan Digital Sky Survey:
http://www.sdss.org/
Data Deluge: smallest to largest
Big-data Problem Solving Approaches
 Algorithmic: after all, we have been working toward this forever: scalable/tractable algorithms
 High-performance computing (HPC, multi-core): CCR has 16-CPU, 32-core machines
with 128 GB RAM: OpenMP, MPI, etc.
 GPGPU programming: general-purpose graphics processors (NVIDIA)
 Statistical packages like R running on parallel threads on powerful machines
 Machine learning algorithms on supercomputers
 Hadoop MapReduce-style parallel processing
 Spark-style approaches providing in-memory computing models
Processing Granularity
• Single-core: single processor; multi-processor
• Multi-core: single processor; multi-processor
• Cluster: processors (single- or multi-core) with shared memory;
processors with distributed memory
• Grid of clusters: embarrassingly parallel processing; MapReduce with a
distributed file system; cloud computing
Levels of parallelism, from small data sizes to large:
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level
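The "embarrassingly parallel" end of this spectrum can be sketched with Python's standard multiprocessing pool (an illustrative sketch, not from the slides; the function name and inputs are made up):

```python
from multiprocessing import Pool

def square(n):
    # Each input is independent: no shared state, no inter-task communication.
    # That independence is what makes the workload "embarrassingly parallel".
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:           # four worker processes
        results = pool.map(square, range(8))  # inputs are split across workers
    print(results)                            # [0, 1, 4, 9, 16, 25, 36, 49]
```

MapReduce generalizes this pattern by adding a grouping (shuffle) step between a parallel map phase and a reduce phase.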
Contributing factors:
• Heavy societal involvement
• Powerful multi-core processors
• Superior software methodologies
• Virtualization leveraging the powerful hardware
• Wider bandwidth for communication
• Proliferation of devices
• Explosion of domain applications
A Golden Era in Computing
 Intelligence is a set of discoveries made by federating/processing information
collected from diverse sources.
 Information is a cleansed form of raw data.
 For statistically significant information we need a reasonable amount of data.
 For gathering good intelligence we need a large amount of information.
 As pointed out by Jim Gray in the Fourth Paradigm book, an enormous amount of
data is generated by millions of experiments and applications.
 Thus intelligence applications are invariably data-heavy, data-driven and
data-intensive.
 Data is gathered from the web (public or private, covert or overt) and generated
by a large number of domain applications.
Intelligence and Scale of Data
Intelligence (or origins of Big-data computing?)
 Search for Extra Terrestrial Intelligence (seti@home
project)
 The Wow! signal: http://www.bigear.org/wow.htm
 Google search: how is it different from regular search in existence before it?
 It took advantage of the fact that hyperlinks within web pages form an underlying structure
that can be mined to determine the importance of various pages.
 Restaurant and menu suggestions: instead of “Where would you like to go?”, “Would you like to
go to CityGrille?”
 Learning capacity from previous data: habits, profiles, and other information gathered over
time.
 Collaborative and interconnected world capable of inference: Facebook friend suggestions
 Large-scale data requiring indexing
 …Did you know Amazon plans to ship things before you order them?
Characteristics of intelligent applications
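The hyperlink-mining idea behind Google search can be sketched as a tiny power-iteration PageRank (an illustrative Python sketch over a made-up three-page link graph, not Google's actual implementation):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the pages it links to (a toy graph)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Each page shares its rank equally among its outgoing links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# Toy web: both "about" and "blog" link to "home", so "home" ranks highest.
graph = {"home": ["about", "blog"], "about": ["home"], "blog": ["home"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # home
```

The point of the slide holds in miniature: importance falls out of the link structure itself, with no need to inspect page content.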
Data-intensive application
characteristics
• Aggregated content (raw data)
• Algorithms (thinking)
• Reference structures (knowledge)
• Data structures (infrastructure)
• Models
 Aggregated content: a large amount of data pertinent to the specific application;
each piece of information is typically connected to many other pieces. Ex: DBs
 Reference structures: structures that provide one or more structural and
semantic interpretations of the content. Reference structures about a specific
domain of knowledge come in three flavors: dictionaries, knowledge bases, and
ontologies
 Algorithms: modules that allow the application to harness the information
hidden in the data. Applied on aggregated content; sometimes
require a reference structure. Ex: MapReduce
 Data structures: newer data structures to leverage the scale and the WORM
characteristics; ex: MS Azure, Apache Hadoop, Google BigTable
Basic Elements
 Search engines
 Recommendation systems:
 CineMatch of Netflix Inc. movie recommendations
 Amazon.com: book/product recommendations
 Biological systems: high-throughput sequencing (HTS)
 Analysis: disease-gene match
 Query/search for gene sequences
 Space exploration
 Financial analysis
Examples of data-intensive applications
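Recommenders such as CineMatch typically rely on similarity between users or items; a minimal user-based sketch (illustrative Python with made-up ratings, not Netflix's or Amazon's actual algorithm):

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two users' rating dictionaries.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

ratings = {
    "ann": {"matrix": 5, "inception": 4},
    "bob": {"matrix": 5, "inception": 5, "up": 4},
    "eve": {"up": 5, "frozen": 4},
}

# Recommend to "ann": items liked by her most similar user that she hasn't seen.
best = max((u for u in ratings if u != "ann"),
           key=lambda u: cosine(ratings["ann"], ratings[u]))
recs = [m for m in ratings[best] if m not in ratings["ann"]]
print(recs)  # ['up']
```

Production systems work the same way in spirit, but over millions of users and items, which is what makes them data-intensive.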
 Social networking sites
 Mashups : applications that draw upon content
retrieved from external sources to create entirely new
innovative services.
 Portals
 Wikis: content aggregators; linked data; excellent data
and fertile ground for applying concepts discussed in
the text
 Media-sharing sites
 Online gaming
 Biological analysis
 Space exploration
More intelligent data-intensive
applications
 Statistical inference
 Machine learning is the capability of a software system to generalize
based on past experience and to use these generalizations to provide
answers to questions about old, new and future data.
 Data mining
 Soft computing
 Deep learning
 We also need algorithms that are specially designed for the emerging
storage models and data characteristics.
Algorithms
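The "generalize from past experience" idea can be illustrated with a minimal nearest-centroid classifier (a Python sketch on made-up 2-D data; real systems use far richer models):

```python
def train(samples):
    """samples: dict label -> list of (x, y) points. Returns per-class centroids."""
    centroids = {}
    for label, pts in samples.items():
        n = len(pts)
        centroids[label] = (sum(p[0] for p in pts) / n,
                            sum(p[1] for p in pts) / n)
    return centroids

def predict(centroids, point):
    # Generalization: a new, unseen point is assigned to the nearest centroid.
    def dist2(c):
        return (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2
    return min(centroids, key=lambda label: dist2(centroids[label]))

past = {"spam": [(9, 8), (8, 9)], "ham": [(1, 2), (2, 1)]}
model = train(past)
print(predict(model, (8.5, 7.5)))  # spam
```

Note the two phases: training summarizes past data, and prediction answers questions about data the model has never seen.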
• The internet introduced a new challenge in the form of web logs and web crawler data:
large scale, “peta scale”
• But observe that this type of data has a uniquely different characteristic from
transactional data such as “customer order” or “bank account” data:
• The data type is “write once, read many (WORM)”;
• Privacy-protected healthcare and patient information;
• Historical financial data;
• Other historical data
 Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management system.
• Built-in features for fault-tolerance, load balancing, data-transfer and
aggregation,…
• Clusters of distributed nodes for storage and computing.
• Computing is inherently parallel
Different Type of Storage
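A WORM-style &lt;key, value&gt; store can be sketched as an append-only dictionary (an illustrative Python sketch; real systems such as HDFS or BigTable add replication, fault tolerance, and distribution across cluster nodes):

```python
class WormStore:
    """Write-once-read-many <key, value> store: writes never overwrite."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # WORM discipline: a key may be written exactly once.
        if key in self._data:
            raise KeyError(f"{key!r} already written (WORM: no overwrite)")
        self._data[key] = value

    def get(self, key):
        return self._data[key]

store = WormStore()
store.put("log-0001", "GET /index.html 200")
print(store.get("log-0001"))
```

Because records are never updated in place, such stores can skip locking and in-place-update machinery and instead optimize for large sequential writes and reads.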
 The special <key, value> store originated from the Google File System (GFS)
 The Hadoop Distributed File System (HDFS) is the open-source version of this.
(Currently an Apache project)
 Parallel processing of the data using the MapReduce (MR) programming model
 Challenges
 Formulation of MR algorithms
 Proper use of the features of the infrastructure (Ex: sort)
 Best practices in using MR and HDFS
 An extensive ecosystem consisting of other components such as column-based
stores (HBase, BigTable), big data warehousing (Hive), workflow languages, etc.
Big-data Concepts
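The MR programming model can be illustrated with an in-memory word count (a single-process Python sketch of the map, shuffle, and reduce phases; real Hadoop distributes each phase across a cluster):

```python
from collections import defaultdict

def map_phase(line):
    # map: emit (word, 1) for every word in one input record
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # reduce: combine all intermediate values for one key
    return word, sum(counts)

lines = ["big data big ideas", "data driven decisions"]

# shuffle: group intermediate pairs by key (Hadoop does this between phases)
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result["big"], result["data"])  # 2 2
```

Formulating an MR algorithm means expressing the computation entirely as these per-record map and per-key reduce steps, so the framework can parallelize both freely.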
 We have witnessed explosion in algorithmic solutions.
 “In pioneer days they used oxen for heavy pulling, when one couldn’t
budge a log they didn’t try to grow a larger ox. We shouldn’t be trying
for bigger computers, but for more systems of computers.” Grace
Hopper
 What you cannot achieve by an algorithm can be achieved by more data.
 Big data, if analyzed right, gives you better answers: the Centers for Disease
Control's prediction of flu vs. prediction of flu through “search” data 2 full
weeks before the onset of flu season! http://www.google.org/flutrends/
Data & Analytics
 Cloud is a facilitator for Big Data computing and is
indispensable in this context
 Cloud provides processor, software, operating systems,
storage, monitoring, load balancing, clusters and other
requirements as a service
 Cloud offers accessibility to Big Data computing
 Cloud computing models:
 platform (PaaS): Microsoft Azure, Google App Engine (GAE)
 software (SaaS): e.g., Google Apps
 infrastructure (IaaS): Amazon Web Services (AWS)
 Services-based application programming interface (API)
Cloud Computing
Top Ten Largest Databases
[Bar chart: top ten largest databases (2007), in terabytes: LOC, CIA, Amazon,
YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, Climate]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
Top Ten Largest Databases in 2007 vs
Facebook's cluster in 2010
[The same 2007 bar chart, in terabytes, compared against Facebook's
21-petabyte cluster in 2010]
Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
Data Strategy
 In this era of big data, what is your data strategy?
 Strategy as in simple “Planning for the data challenge”
 It is not only about big data: all sizes and forms of data
 Data collection from customers used to be an elaborate task: surveys and other
such instruments
 Nowadays data is available in abundance: thanks to the technological advances
as well as the social networks
 Data is also generated by many of your own business processes and applications
 Data strategy means many different things: we will discuss this next
Components of a data Strategy1
 Data integration
 Metadata
 Data modeling
 Organizational roles and responsibilities
 Performance and metrics
 Security and privacy
 Structured data management
 Unstructured data management
 Business intelligence
 Data analysis and visualization
 Tapping into social data
This course will provide training in emerging technologies, tools, environments and APIs available
for developing and implementing one or more of these components.
Data Strategy for newer kinds of data
 How will you collect data? Aggregate data? What are your sources? (e.g.,
social media)
 How will you store the data? And where?
 How will you use the data? Analyze it? Analytics? Data mining?
Pattern recognition?
 How will you present or report the data to the stakeholders and
decision makers? Visualization?
 Archive the data for provenance and accountability.
Tools for Analytics
 Elaborate tools with nifty visualizations and expensive licensing fees. Ex:
Tableau, Tom Sawyer
 Software that you can buy for data analytics: Brilig; small and affordable
but short-lived
 Open-source tools: Gephi; sporadic support
 Open-source freeware with excellent community involvement: the R
system
 Some desirable characteristics of the tools: simple, quick to apply,
intuitive, useful, gentle learning curve
 A demo to prove this point: data → actions/decisions
Demo: Exam1 Grade: Traditional reporting 1
All students (mean, median, mode):
        Q1     Q2     Q3     Q4     Q5     Total
Mean    16.7   13.9    9.6   18.5   13.7   72.4
Median  20.0   16.0    9.0   19.0   17.0   76.0
Mode    20.0   20.0   15.0   25.0   20.0   90.0

Version 1 (mean, percent):
        Q1     Q2     Q3     Q4     Q5     Total
Mean    16.0   14.2    9.6   19.4   14.0   73.2
        80.1%  71.1%  64.0%  77.4%  70.2%  73.2%

Version 2 (mean, percent):
        Q1     Q2     Q3     Q4     Q5     Total
Mean    17.3   13.6    9.7   17.6   13.3   71.5
        86.7%  67.8%  64.6%  70.3%  66.7%  71.5%

Question 1..5, total; mean, median, mode; mean ver. 1, mean ver. 2
Traditional approach 2: points vs #students
Distribution of exam1 points
Individual questions analyzed..
Interpretation and action/decisions
R-code
data2 <- read.csv(file.choose())        # pick the CSV file interactively
exam1 <- data2$midterm
hist(exam1, col = rainbow(8))           # histogram of exam scores
boxplot(data2, col = rainbow(6))        # one box per column of the data frame
boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "sienna"))
# $stats holds the five-number summary, one column per box
fn <- boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "pink"))$stats
text(5.55, fn[1, 6], paste("Minimum =", fn[1, 6]), adj = 0, cex = .7)
text(5.55, fn[2, 6], paste("LQuartile =", fn[2, 6]), adj = 0, cex = .7)
text(5.0,  fn[3, 6], paste("Median =", fn[3, 6]), adj = 0, cex = .7)
text(5.55, fn[4, 6], paste("UQuartile =", fn[4, 6]), adj = 0, cex = .7)
text(5.55, fn[5, 6], paste("Maximum =", fn[5, 6]), adj = 0, cex = .7)
grid(nx = NA, ny = NULL)                # horizontal grid lines only
Demo Details
 Grade data is stored in an Excel file, a common input format
 Converted this file to CSV
 Start an RStudio project
 Read in the CSV data (using a file-chooser option) into data2
 boxplot(data2)
 That is it.
 You can now add legends, colors, and labels to make it presentable.
 Export the plot as an image or PDF to report the results
Today’s Topic: Exploratory data analysis (EDA)
 The R Programming language
 The R project for statistical computing
 R Studio integrated development environment (IDE)
 Data analysis with R: charts, plots, maps, packages
 Also look at the CRAN: Comprehensive R Archive Network
 Understanding your data
 Basic statistical analysis
 Chapter 1 : What is Data Science?
 Chapter 2: Exploratory Data Analysis and Data Science Process
 R is a software package for statistical computing.
 R is an interpreted language
 It is open source with a high level of contribution from the community
 “R is very good at plotting graphics, analyzing data, and fitting
statistical models using data that fits in the computer’s memory.”
 “It’s not as good at storing data in complicated structures, efficiently
querying data, or working with data that doesn’t fit in the computer’s
memory.”
R Language
R Programming Language3,4
 R is a popular language for statistical analysis of data, visualization and
reporting.
 It is a complete “programming” language.
 R is free software: GNU General Public License (GPL)
 RStudio is a powerful IDE for R.
 R is not a tool for data acquisition/collection/data entry. This is a major
point on which it differs from Excel and other data-input applications.
 There are many packages available for statistical analysis, such as SAS
and SPSS, but they are expensive (per-user licensing) and proprietary.
 R is open source and can do pretty much what SAS can do, but for free.
 R is considered one of the best statistical tools in the world.
 People can submit their own R packages/libraries using the latest cutting-edge
techniques.
 To date R has almost 5,000 packages in the CRAN (Comprehensive R
Archive Network, the site that maintains the R project) repository.
 R is great for exploratory data analysis (EDA): for understanding the
nature of your data and quickly creating useful visualizations
Why R?
 An R package is a set of related functions
 To use a package you need to load it into R
 R offers a large number of packages for various vertical and horizontal
domains:
 Horizontal: display graphics, statistical packages, machine learning
 Verticals: wide variety of industries: analyzing stock market data,
modeling credit risks, social sciences, automobile data
R Packages
 A package is a collection of functions and data files bundled together.
 In order to use the components of a package it needs to be installed in
the local library of the R environment.
 Loading packages
 Custom packages
 Building packages
 Activity: explore what R packages are available, if any, for your domain:
http://cran.r-project.org/web/packages/available_packages_by_name.html
 Later on, try to create a custom package for your business domain.
R Packages
 Library Package Class
 R also provides many data sets for exploring its features
Library
 R Basics, fundamentals
 The R language
 Working with data
 Statistics with R language
 R syntax
 R Control structures
 R Objects
 R formulas
 Install and use packages
 Quick overview and tutorial
Learning R
R Studio
 Let's examine the RStudio environment
Input Data sources
 Data for analytics can come from many different sources: a simple .csv
file, a relational database, XML-based web documents, sources in the
cloud (Dropbox, storage drives).
 Today we will examine how to input data into R from a CSV file and by
scraping web files.
 This will allow you to input any web data and Excel data you have into R
for processing and analytics.
 We will discuss ODBC and cloud sources in a later lecture.
Features of RStudio
Regions of RStudio: (i) console, (ii) data, (iii) script, (iv) plots and packages
 Primary feature: a project is a collection of files (data, graphs, R scripts);
let's create a new project
R supports all the basic arithmetic (+, -, etc.) and variables
Vectors: collections of elements of the same type; a very important data element
 Creating a vector; changing a vector; factoring a vector
 x <- c(1, 4, 9, 19)
Calling a function: mean(x)
Missing data: NA (not available), NULL (absence of anything)
 z <- c(8, NA, 19)
 z <- c(8, NULL, 18)
 znew <- na.omit(z)
Features (contd.)
Ingesting (reading) data into R
 Reading csv
 Reading from the web
 We will spend some time here to plan your data collection strategy
Data included with R
 Lots of historical data (old data is easy to publicize/declassify)
Simple commands to work with data sets
 summary(data)
 head(data)
References
[1] S. Adelman, L. Moss, M. Abai. Data Strategy. Addison-Wesley, 2005.
[2] T. Davenport. A Predictive Analytics Primer. Harvard Business Review, Sept. 2, 2014.
http://blogs.hbr.org/2014/09/a-predictive-analytics-primer/
[3] The R Project. http://www.r-project.org/
[4] J.P. Lander. R for Everyone: Advanced Analytics and Graphics. Addison-Wesley, 2014.
[5] M. NemSchoff. A Quick Guide to Structured and Unstructured Data. Smart Data Collective,
June 28, 2014.
Summary
 We are entering a watershed moment in the internet era.
 At its core and center, this involves big data analytics and tools that
provide intelligence in a timely manner to support decision making.
 Newer storage models, processing models, and approaches have
emerged.
 We will learn about these and develop software using these newer
approaches to data.

More Related Content

PPTX
selected topics in CS-CHaaapteerobe.pptx
PPTX
Big data
PPTX
Big data
PPTX
Big data and data mining
PDF
Big-Data-Analytics.8592259.powerpoint.pdf
PPTX
PPTX
Unit 1 - Introduction to Big Data and hadoop.pptx
PDF
Introduction to Data Analytics and data analytics life cycle
selected topics in CS-CHaaapteerobe.pptx
Big data
Big data
Big data and data mining
Big-Data-Analytics.8592259.powerpoint.pdf
Unit 1 - Introduction to Big Data and hadoop.pptx
Introduction to Data Analytics and data analytics life cycle

Similar to DataJan27.pptxDataFoundationsPresentation (20)

PDF
Data minig with Big data analysis
PDF
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
PPT
Big data analytics, survey r.nabati
PDF
AIS 3 - EDITED.pdf
PDF
Review of big data analytics (bda) architecture trends and analysis
PPTX
BigData
PDF
Big Data for Library Services (2017)
PPTX
PDF
KIT-601 Lecture Notes-UNIT-1.pdf
PDF
Real World Application of Big Data In Data Mining Tools
PDF
Big Data Mining - Classification, Techniques and Issues
PDF
Big Data Testing Using Hadoop Platform
PPTX
Big Data Session 1.pptx
PDF
Identifying and analyzing the transient and permanent barriers for big data
PPTX
Chapter1-Introduction Εισαγωγικές έννοιες
PPTX
Data mining with big data
PPTX
Big Data PPT by Rohit Dubey
DOCX
Big data (word file)
DOCX
Handling and Analyzing Big Data_ A Professional Guide
PDF
An Comprehensive Study of Big Data Environment and its Challenges.
Data minig with Big data analysis
KIT-601-L-UNIT-1 (Revised) Introduction to Data Analytcs.pdf
Big data analytics, survey r.nabati
AIS 3 - EDITED.pdf
Review of big data analytics (bda) architecture trends and analysis
BigData
Big Data for Library Services (2017)
KIT-601 Lecture Notes-UNIT-1.pdf
Real World Application of Big Data In Data Mining Tools
Big Data Mining - Classification, Techniques and Issues
Big Data Testing Using Hadoop Platform
Big Data Session 1.pptx
Identifying and analyzing the transient and permanent barriers for big data
Chapter1-Introduction Εισαγωγικές έννοιες
Data mining with big data
Big Data PPT by Rohit Dubey
Big data (word file)
Handling and Analyzing Big Data_ A Professional Guide
An Comprehensive Study of Big Data Environment and its Challenges.
Ad

Recently uploaded (20)

PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Database Infoormation System (DBIS).pptx
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Computer network topology notes for revision
PDF
Foundation of Data Science unit number two notes
PPT
Quality review (1)_presentation of this 21
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
Lecture1 pattern recognition............
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
IB Computer Science - Internal Assessment.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Introduction to Knowledge Engineering Part 1
1_Introduction to advance data techniques.pptx
Database Infoormation System (DBIS).pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Computer network topology notes for revision
Foundation of Data Science unit number two notes
Quality review (1)_presentation of this 21
oil_refinery_comprehensive_20250804084928 (1).pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Lecture1 pattern recognition............
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx
climate analysis of Dhaka ,Banglades.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Ad

DataJan27.pptxDataFoundationsPresentation

  • 1. 08/05/2025 B . R A M A M U R T HY B.Ramamurthy 2016 1 Data-Intensive Computing
  • 2. 08/05/2025 2  The phrase was initially coined by National Science Foundation (NSF)  What is it?  Volume, velocity, variety, veracity (uncertainty) (Gartner, IBM)  How is it addressed?  Why now?  What do you expect to extract by processing this large data?  Intelligence for decision making  What is different now?  Storage models, processing models  Big Data, analytics and cloud infrastructures  Summary Data-intensive computing
  • 3. 08/05/2025 B.Ramamurthy 2016 Motivation 3  Tremendous advances have taken place in statistical methods and tools, machine learning and data mining approaches, and internet based dissemination tools for analysis and visualization.  Many tools are open source and freely available for anybody to use.  Is there an easy entry-point into learning these technologies?  Can we make these tools easily accessible to the students, researchers and decision makers similar to how “office” productivity software is used?
  • 4. 08/05/2025 B.Ramamurthy 2016 High Level Goals for the course  Understand foundations of data analytics so that you can interpret and communicate results and make informed decisions  Study and learn to apply common statistical methods and machine learning algorithms to solve business problems  Learn to work with popular tools to analyze and visualize data; more importantly encourage consistency across departments on analytics/tools used  Working with cloud for data storage and for deployment of applications  Learn methods for mastering and applying emerging concepts and technologies for continuous data-driven improvements to your research/work/business processes  Transform complex analytics into routine processes 4
  • 5. 08/05/2025 B.Ramamurthy 2016 Newer kinds of Data  New kinds of data from different sources (see p.23 of Data Science book) : tweets, geo location, emails, blogs  Two major types: structured and unstructured data  Structured data: data collected and stored according to well defined schema; Realtime stock quotes  Unstructured data: messages from social media, news, talks, books, letters, manuscripts, court documents..  “Regardless of their differences, they work in tandem in any effective big data operation. Companies wishing to make the most of their data should use tools that utilize the benefits of both.”5  We will discuss methods for analyzing both structured and unstructured data 5
  • 6.  Bioinformatics data: from about 3.3 billion base pairs in a human genome to huge number of sequences of proteins and the analysis of their behaviors  The internet: web logs, facebook, twitter, maps, blogs, etc.: Analytics …  Financial applications: that analyze volumes of data for trends and other deeper knowledge  Health Care: huge amount of patient data, drug and treatment data  The universe: The Hubble ultra deep telescope shows 100s of galaxies each with billions of stars: Sloan Digital Sky Survey: http://www.sdss.org/ 08/05/2025 6 Data Deluge: smallest to largest
  • 7. Big-data Problem Solving Approaches  Algorithmic: after all we have working towards this for ever: scalable/tracktable  High Performance computing (HPC: multi-core) CCR has machines that are: 16 CPU , 32 core machine with 128GB RAM: openmp, MPI, etc.  GPGPU programming: general purpose graphics processor (NVIDIA)  Statistical packages like R running on parallel threads on powerful machines  Machine learning algorithms on super computers  Hadoop MapReduce like parallel processing.  Spark like approaches providing in-memory computing models
  • 8. Processing Granularity • Single-core, single processor • Single-core, multi-processor Si n gl e- c o re • Multi-core, single processor • Multi-core, multi-processor Multi- core • Cluster of processors (single or multi-core) with shared memory • Cluster of processors with distributed memory Cluster Grid of clusters Embarrassingly parallel processing MapReduce, distributed file system Cloud computing 8 08/05/2025 Bina Ramamurthy 2011 Pipelined Instruction level Concurrent Thread level Service Object level Indexed File level Mega Block level Virtual System Level Data size: small Data size: large
  • 10. 08/05/2025 10  Intelligence is a set of discoveries made by federating/processing information collected from diverse sources.  Information is a cleansed form of raw data.  For statistically significant information we need reasonable amount of data.  For gathering good intelligence we need large amount of information.  As pointed out by Jim Grey in the Fourth Paradigm book enormous amount of data is generated by the millions of experiments and applications.  Thus intelligence applications are invariably data-heavy, data-driven and data- intensive.  Data is gathered from the web (public or private, covert or overt), generated by large number of domain applications. Intelligence and Scale of Data
  • 11. Intelligence (or origins of Big-data computing?)  Search for Extra Terrestrial Intelligence (seti@home project)  The Wow signal http://www.bigear.org/wow.htm 08/05/2025 11
  • 12. 08/05/2025 12  Google search: How is different from regular search in existence before it?  It took advantage of the fact the hyperlinks within web pages form an underlying structure that can be mined to determine the importance of various pages.  Restaurant and Menu suggestions: instead of “Where would you like to go?” “Would you like to go to CityGrille”?  Learning capacity from previous data of habits, profiles, and other information gathered over time.  Collaborative and interconnected world inference capable: facebook friend suggestion  Large scale data requiring indexing  …Do you know amazon is going to ship things before you order? Here Characteristics of intelligent applications
  • 14. 08/05/2025 14  Aggregated content: large amount of data pertinent to the specific application; each piece of information is typically connected to many other pieces. Ex: DBs  Reference structures: Structures that provide one or more structural and semantic interpretations of the content. Reference structure about specific domain of knowledge come in three flavors: dictionaries, knowledge bases, and ontologies  Algorithms: modules that allows the application to harness the information which is hidden in the data. Applied on aggregated content and some times require reference structure Ex: MapReduce  Data Structures: newer data structures to leverage the scale and the WORM characteristics; ex: MS Azure, Apache Hadoop, Google BigTable Basic Elements
  • 15. 08/05/2025 15  Search engines  Recommendation systems:  CineMatch of Netflix Inc. movie recommendations  Amazon.com: book/product recommendations  Biological systems: high throughput sequences (HTS)  Analysis: disease-gene match  Query/search for gene sequences  Space exploration  Financial analysis Examples of data-intensive applications
• 16. More intelligent data-intensive applications
 Social networking sites
 Mashups: applications that draw upon content retrieved from external sources to create entirely new, innovative services
 Portals
 Wikis: content aggregators; linked data; excellent data and fertile ground for applying the concepts discussed in the text
 Media-sharing sites
 Online gaming
 Biological analysis
 Space exploration
• 17. Algorithms
 Statistical inference
 Machine learning: the capability of a software system to generalize based on past experience, and the use of these generalizations to provide answers to questions about old, new, and future data
 Data mining
 Soft computing
 Deep learning
 We also need algorithms that are specially designed for the emerging storage models and data characteristics.
• 18. Different Types of Storage
• The internet introduced a new challenge in the form of web logs and web-crawler data: large, "peta-scale" data.
• But observe that this type of data has a uniquely different characteristic than your transactional "customer order" or "bank account" data: it is "write once, read many" (WORM). Other examples: privacy-protected healthcare and patient information; historical financial data; other historical data.
 Relational file systems and tables are insufficient.
• Large <key, value> stores (files) and storage management systems
• Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
• Clusters of distributed nodes for storage and computing
• Computing is inherently parallel
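The <key, value> storage idea above can be sketched in a few lines of R: an environment with hashing enabled behaves like a tiny in-memory key-value store. This is only an illustration of the storage model on one machine, not HDFS or any production store, and the key names are invented.

```r
# A minimal in-memory <key, value> store sketched with an R environment.
# (Illustration only; real big-data stores are distributed, fault-tolerant,
# and persistent.)
store <- new.env(hash = TRUE)

# write once ...
assign("order:1001", list(customer = "A", total = 42.5), envir = store)

# ... read many
rec <- get("order:1001", envir = store)
rec$total                               # 42.5
exists("order:9999", envir = store)     # FALSE: key never written
```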
• 19. Big-data Concepts
 The special <key, value> store originated with the Google File System (GFS); the Hadoop Distributed File System (HDFS) is its open-source counterpart (currently an Apache project).
 Parallel processing of the data uses the MapReduce (MR) programming model.
 Challenges: formulation of MR algorithms; proper use of the features of the infrastructure (ex: sort); best practices in using MR and HDFS
 An extensive ecosystem of other components: column-based stores (HBase, BigTable), big-data warehousing (Hive), workflow languages, etc.
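The MR programming model itself can be sketched in plain R: a map phase that turns each input line into (word, 1) pairs, then a shuffle-and-reduce phase that sums the counts per key. This is a single-machine sketch of the model, not Hadoop code, and the sample lines are made up.

```r
# MapReduce word count, sketched on one machine in base R.
lines <- c("big data big ideas", "data intensive computing")

# Map phase: split each line into words; each word stands for a (word, 1) pair
words <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Shuffle + reduce phase: group identical keys and sum their counts
counts <- tapply(rep(1, length(words)), words, sum)
counts[["big"]]    # 2
counts[["data"]]   # 2
```

In a real MR job the map and reduce functions run on many cluster nodes and the framework performs the shuffle; the logic per key is the same.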
• 20. Data & Analytics
 We have witnessed an explosion in algorithmic solutions.
 "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
 What you cannot achieve with a better algorithm you can often achieve with more data.
 Big data, if analyzed right, gives you better answers: flu prediction from "search" data ran two full weeks ahead of the Centers for Disease Control's detection of the onset of flu season: http://www.google.org/flutrends/
• 21. Cloud Computing
 The cloud is a facilitator for big-data computing and is indispensable in this context.
 The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters, and other requirements as a service.
 The cloud offers accessibility to big-data computing.
 Cloud computing models:
 platform as a service (PaaS): Microsoft Azure, Google App Engine (GAE)
 software as a service (SaaS): e.g., web-based office and email suites
 infrastructure as a service (IaaS): Amazon Web Services (AWS)
 Services-based application programming interfaces (APIs)
• 22. Top Ten Largest Databases
 [Bar chart: the top ten largest databases (2007), in terabytes, scale 0-7000 TB; bars for LOC, CIA, Amazon, YouTube, ChoicePoint, Sprint, Google, AT&T, NERSC, and a climate data center]
 Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/
• 23. Top Ten Largest Databases in 2007 vs. Facebook's cluster in 2010
 [Bar chart: the same top-ten-largest-databases (2007) data as the previous slide, in terabytes, contrasted with Facebook's 21-petabyte cluster in 2010]
 Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world
• 24. Data Strategy
 In this era of big data, what is your data strategy?
 "Strategy" simply meaning planning for the data challenge
 It is not only about big data: all sizes and forms of data matter.
 Data collection from customers used to be an elaborate task: surveys and other such instruments.
 Nowadays data is available in abundance, thanks to technological advances as well as social networks.
 Data is also generated by many of your own business processes and applications.
 Data strategy means many different things: we will discuss this next.
• 25. Components of a Data Strategy [1]
 Data integration  Metadata  Data modeling  Organizational roles and responsibilities  Performance and metrics  Security and privacy  Structured data management  Unstructured data management  Business intelligence  Data analysis and visualization  Tapping into social data
 This course will provide training in emerging technologies, tools, environments, and APIs available for developing and implementing one or more of these components.
• 26. Data Strategy for newer kinds of data
 How will you collect and aggregate data? What are your sources? (E.g., social media)
 How and where will you store the data?
 How will you use the data? Analyze it? Analytics? Data mining? Pattern recognition?
 How will you present or report the data to stakeholders and decision makers? Visualization?
 How will you archive the data for provenance and accountability?
• 27. Tools for Analytics
 Elaborate tools with nifty visualizations and expensive licensing fees. Ex: Tableau, Tom Sawyer
 Software that you can buy for data analytics: Brilig, small and affordable but short-lived
 Open-source tools: Gephi, with sporadic support
 Open-source freeware with excellent community involvement: the R system
 Some desirable characteristics of such tools: simple, quick to apply, intuitive, useful, with a gentle learning curve
 A demo to prove this point: data  actions/decisions
• 28. Demo: Exam 1 grades, traditional reporting

                  Q1     Q2     Q3     Q4     Q5    Total
  Mean           16.7   13.9    9.6   18.5   13.7   72.4
  Median         20.0   16.0    9.0   19.0   17.0   76.0
  Mode           20.0   20.0   15.0   25.0   20.0   90.0
  Mean (ver. 1)  16.0   14.2    9.6   19.4   14.0   73.2
    (as %)       80.1%  71.1%  64.0%  77.4%  70.2%  73.2%
  Mean (ver. 2)  17.3   13.6    9.7   17.6   13.3   71.5
    (as %)       86.7%  67.8%  64.6%  70.3%  66.7%  71.5%

 Questions 1-5 plus total; mean, median, mode; then the mean for exam versions 1 and 2
• 29. Traditional approach 2: points vs. number of students
 [Histogram: distribution of exam 1 points]
• 32. R code

data2 <- read.csv(file.choose())   # pick the grades csv file interactively
exam1 <- data2$midterm
hist(exam1, col = rainbow(8))      # histogram of midterm scores
boxplot(data2, col = rainbow(6))   # one box per column
boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "sienna"))
# $stats holds the five-number summary, one column per box
fn <- boxplot(data2, col = c("orange", "green", "blue", "grey", "yellow", "pink"))$stats
# annotate the five-number summary of column 6 (the totals)
text(5.55, fn[1, 6], paste("Minimum =", fn[1, 6]), adj = 0, cex = .7)
text(5.55, fn[2, 6], paste("LQuartile =", fn[2, 6]), adj = 0, cex = .7)
text(5.0, fn[3, 6], paste("Median =", fn[3, 6]), adj = 0, cex = .7)
text(5.55, fn[4, 6], paste("UQuartile =", fn[4, 6]), adj = 0, cex = .7)
text(5.55, fn[5, 6], paste("Maximum =", fn[5, 6]), adj = 0, cex = .7)
grid(nx = NA, ny = NULL)           # horizontal grid lines only
• 33. Demo Details
 Grade data stored in an Excel file in a common input format
 Converted this file to csv
 Start an RStudio project
 Read the csv data (using a file-chooser option) into data2
 boxplot(data2)
 That is it.
 You can now add legends, colors, and labels to make it presentable.
 Export the plot as an image or PDF to report the results.
• 34. Today's Topic: Exploratory data analysis (EDA)
 The R programming language: the R project for statistical computing
 The RStudio integrated development environment (IDE)
 Data analysis with R: charts, plots, maps, packages
 Also look at CRAN, the Comprehensive R Archive Network
 Understanding your data
 Basic statistical analysis
 Chapter 1: What is Data Science?
 Chapter 2: Exploratory Data Analysis and the Data Science Process
• 35. R Language
 R is a software package for statistical computing.
 R is an interpreted language.
 It is open source, with a high level of contribution from the community.
 "R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer's memory."
 "It's not as good at storing data in complicated structures, efficiently querying data, or working with data that doesn't fit in the computer's memory."
• 36. R Programming Language [3, 4]
 R is a popular language for statistical analysis of data, visualization, and reporting.
 It is a complete programming language.
 R is free software under the GNU General Public License (GPL).
 RStudio is a powerful IDE for R.
 R is not a tool for data acquisition/collection/data entry. This is a major point on which it differs from Excel and other data-input applications.
• 37. Why R?
 There are many packages for statistical analysis, such as SAS and SPSS, but they are expensive (per-user licensing) and proprietary.
 R is open source and can do pretty much what SAS can do, but for free.
 R is considered one of the best statistical tools in the world.
 People can submit their own R packages/libraries using the latest cutting-edge techniques.
 To date, R has almost 5,000 packages in the CRAN (Comprehensive R Archive Network, the site that maintains the R project) repository.
 R is great for exploratory data analysis (EDA): for understanding the nature of your data and quickly creating useful visualizations.
• 38. R Packages
 An R package is a set of related functions.
 To use a package you need to load it into R.
 R offers a large number of packages for various horizontal and vertical domains:
 Horizontal: display graphics, statistical packages, machine learning
 Vertical: a wide variety of industries: analyzing stock-market data, modeling credit risk, social sciences, automobile data
• 39. R Packages (contd.)
 A package is a collection of functions and data files bundled together.
 To use the components of a package, it must be installed in the local library of the R environment.
 Loading packages  Custom packages  Building packages
 Activity: explore what R packages are available, if any, for your domain: http://cran.r-project.org/web/packages/available_packages_by_name.html
 Later on, try to create a custom package for your business domain.
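The install-once, load-each-session workflow looks like this in practice. The example loads the base package tools, which ships with every R installation, so nothing needs to be downloaded; the ggplot2 lines are just an example of the CRAN install step and are left commented out.

```r
# Base packages ship with R and only need to be loaded:
library(tools)
"tools" %in% loadedNamespaces()   # TRUE once loaded

# CRAN packages must be installed once before they can be loaded:
# install.packages("ggplot2")    # downloads from CRAN (one time)
# library(ggplot2)               # then load it in each session
```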
• 40. Library
 Library  Package  Class
 R also provides many data sets for exploring its features.
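For example, the bundled data sets can be listed with data() and loaded by name; mtcars, the classic 1974 Motor Trend car data, is one of them:

```r
# Load a bundled data set and take a quick look at it
data(mtcars)      # 1974 Motor Trend car data, shipped with R
dim(mtcars)       # 32 rows, 11 columns
head(mtcars, 2)   # peek at the first two rows
```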
• 41. Learning R
 R basics and fundamentals
 The R language
 Working with data
 Statistics with the R language
 R syntax
 R control structures
 R objects
 R formulas
 Installing and using packages
 A quick overview and tutorial
• 42. R Studio
 Let's examine the RStudio environment.
• 43. Input Data sources
 Data for analytics can come from many different sources: a simple .csv file, a relational database, XML-based web documents, sources on the cloud (Dropbox, storage drives).
 Today we will examine how to input data into R from a csv file and by scraping web files.
 This will allow you to bring any web data and Excel data you have into R for processing and analytics.
 We will discuss ODBC and cloud sources in a later lecture.
• 44. Features of RStudio
 Regions of RStudio: (i) console, (ii) data, (iii) script, (iv) plots and packages
 Primary feature: a project is a collection of files (data, graphs, R scripts); let's create a new project.
 R supports all the basic arithmetic (+, -, ...) and variables.
 Vectors: collections of elements of the same type; a very important data element
 Creating a vector; changing a vector; factoring a vector: x <- c(1, 4, 9, 19)
 Calling a function: mean(x)
 Missing data: NA (not available) vs. NULL (the absence of anything)
 z <- c(8, NA, 19)
 z <- c(8, NULL, 18)
 znew <- na.omit(z)
• 45. Features (contd.)
 Ingesting (reading) data into R: reading csv files; reading from the web
 We will spend some time here to plan your data-collection strategy.
 Data included with R: a lot of historical data (old data is easy to publicize/declassify)
 Simple commands to work with data sets: summary(data), head(data)
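A minimal round trip through csv shows the whole ingest-then-inspect pattern; the tiny two-column data frame written to a temporary file here is just a stand-in for a real grade file.

```r
# Write a small csv, read it back, and inspect it: the same steps
# used for real grade data (the file here is a temporary stand-in).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Q1 = c(16, 20), Q2 = c(14, 16)), tmp, row.names = FALSE)

data2 <- read.csv(tmp)
summary(data2)   # min / quartiles / mean for each column
head(data2)      # first rows of the data frame
```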
• 46. References
 [1] S. Adelman, L. Moss, M. Abai. Data Strategy. Addison-Wesley, 2005.
 [2] T. Davenport. A Predictive Analytics Primer. Harvard Business Review, Sept. 2, 2014. http://blogs.hbr.org/2014/09/a-predictive-analytics-primer/
 [3] The R Project, http://www.r-project.org/
 [4] J. P. Lander. R for Everyone: Advanced Analytics and Graphics. Addison-Wesley, 2014.
 [5] M. NemSchoff. A Quick Guide to Structured and Unstructured Data. Smart Data Collective, June 28, 2014.
• 47. Summary
 We are at a watershed moment in the internet era.
 At its core and center are big-data analytics and tools that provide intelligence in a timely manner to support decision making.
 Newer storage models, processing models, and approaches have emerged.
 We will learn about these and develop software using these newer approaches to data.