Social Media Analytics using Azure Technologies
Koray Kocabaş
#sqlsatistanbul
What do we need?
Just a quick blog post, update on LinkedIn, or a tweet on Twitter is all we need.
Session Evaluations
Evaluate sessions and get a chance for the raffle:
http://spoke.at/sqlsat451
About Me...
Koray Kocabaş
Data Platform (SQL Server) MVP
Yemeksepeti Business Intelligence
Bahcesehir University Instructor
@koraykocabas
https://tr.linkedin.com/in/koraykocabas
Blog: http://www.misjournal.com
E-Mail: koraykocabas@outlook.com
The Data Deluge
What kinds of solutions use Big Data?
• Clickstream analysis to find buying patterns
• Sentiment analysis for text data
• Fraud detection; forensic analysis
• Machine learning
• Healthcare research
• Predictive Maintenance
Just dream it. Data is everywhere!
Twitter launched in 2006
Monthly active users:
• ~316 million (August)
• ~320 million (October)
80% of users are on mobile!
Tweets per second: 6,000
Tweets per day: ~500 million
Tweets per year: ~200 billion
Twitter generates a lot of data (12 TB per day)
90% of buyers trust peer recommendations
55% of Twitter users are female
The average Twitter user has 27 followers
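As a quick sanity check, the per-day and per-year tweet volumes can be derived from the per-second rate. The sketch below uses only the 6,000 tweets-per-second figure from the slide; the variable names are illustrative.

```python
# Rough consistency check of the tweet-volume figures quoted on the slide.
TWEETS_PER_SECOND = 6_000          # figure taken from the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds

tweets_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY
tweets_per_year = tweets_per_day * 365

print(f"tweets per day:  {tweets_per_day:,}")
print(f"tweets per year: {tweets_per_year:,}")
```

The results come out at roughly 518 million per day and 189 billion per year, which is in line with the rounded ~500 million and ~200 billion figures.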
Why is it so popular?
Twitter Problems
• Event-based data
• Unstructured data
• Detailed event information
• Streaming
• Who is the influencer?
Dashboards for Tweets
• TweetTracker
• TweetArchivist
• Radian6
• Sysomos
• TweetDeck
• Hootsuite
PROBLEMS...
1. Collect Twitter Data & Get Simple Information
2. Data Enrichment
3. Store Semi-Structured Data
4. Analyze Semi-Structured Data
5. Visualize Meaningful Results
Collect Twitter Data & Get Simple Information
Real-Time Analytics
Intake millions of events per second
Process data from connected devices/apps
Detect patterns and anomalies in streaming data
Transform, augment, correlate, temporal operations
No hardware (PaaS offering)
Up and running in a few clicks (and within minutes)
No performance tuning
Efficiently pay only for usage
Not paying for idle resources
Low startup costs
Scale from small to large when required
Only SQL queries needed (versus thousands of lines of code in other solutions, such as Apache Storm)
Stream Analytics Query Language Functions
DML Statements
• SELECT
• FROM
• WHERE
• GROUP BY
• HAVING
• CASE
• JOIN
• UNION
Windowing Extensions
• Tumbling Window
• Hopping Window
• Sliding Window
• Duration
Aggregate Functions
• SUM
• COUNT
• AVG
• MIN
• MAX
Scaling Functions
• WITH
• PARTITION BY
Date and Time Functions
• DATENAME
• DATEPART
• DAY
• MONTH
• YEAR
• DATETIMEFROMPARTS
• DATEDIFF
• DATEADD
String Functions
• LEN
• CONCAT
• CHARINDEX
• SUBSTRING
Statistical Functions
• VAR
• VARP
• STDEV
Tumbling Windows
[Figure: event timeline from 0 to 30 seconds; per-window counts shown: 4, 4, 5]
The count of tweets every 10 seconds
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, TumblingWindow(second,10)
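To make the tumbling-window semantics concrete, here is a minimal Python sketch that mimics the grouping on an in-memory list. It is not how Stream Analytics is implemented; the function name, the sample events, and the convention that a window covers (end - size, end] are assumptions made for illustration.

```python
from collections import Counter


def tumbling_counts(events, size):
    """Count events per (window_end, topic) using non-overlapping windows.

    events: iterable of (timestamp_seconds, topic) pairs.
    Assumed convention: a window covers (end - size, end], so an event
    at exactly t = 10 belongs to the window ending at 10.
    """
    counts = Counter()
    for ts, topic in events:
        window_end = -(-ts // size) * size  # round up to a multiple of size
        counts[(window_end, topic)] += 1
    return dict(counts)


# Hypothetical sample data: (seconds since start, topic)
sample_events = [
    (1, "sql"), (3, "sql"), (4, "azure"), (9, "sql"),
    (12, "sql"), (15, "azure"), (18, "azure"),
    (21, "sql"), (30, "sql"),
]

tumbling_result = tumbling_counts(sample_events, size=10)
for (window_end, topic), count in sorted(tumbling_result.items()):
    print(window_end, topic, count)
```

Each event lands in exactly one window, which is what distinguishes a tumbling window from the hopping and sliding variants.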
Hopping Windows
[Figure: event timeline from 0 to 30 seconds]
Every 5 seconds, give me the count of tweets over the last 10 seconds, by topic
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second,10,5)
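A hopping window differs from a tumbling one in that windows overlap, so one event can be counted several times. The Python sketch below illustrates this on an in-memory list; the function name, the sample events, and the (end - size, end] boundary convention are assumptions, not part of the service.

```python
from collections import Counter


def hopping_counts(events, size, hop):
    """Count events per (window_end, topic) using overlapping windows.

    Windows end at every multiple of `hop` and cover (end - size, end].
    With size=10 and hop=5 each event is counted in two windows.
    """
    counts = Counter()
    for ts, topic in events:
        end = -(-ts // hop) * hop      # first window end at or after ts
        while end - size < ts:         # window (end - size, end] contains ts
            counts[(end, topic)] += 1
            end += hop
    return dict(counts)


# Hypothetical sample data: (seconds since start, topic)
hop_events = [(1, "sql"), (3, "sql"), (7, "sql"), (12, "sql")]

hopping_result = hopping_counts(hop_events, size=10, hop=5)
for (window_end, topic), count in sorted(hopping_result.items()):
    print(window_end, topic, count)
```

Setting `hop` equal to `size` reduces this to the tumbling case, where every event is counted once.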
Sliding Windows
[Figure: event timeline from 0 to 30 seconds]
Alert when a topic's tweet count rises above a threshold of 8 within any 5-second span
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second,5)
HAVING Count(*)>8
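The sliding-window query with its HAVING filter can be approximated in Python as follows. This is a simplified sketch: it only re-evaluates the count when an event arrives, whereas a real engine may also evaluate when events leave the window. The function name and sample data are hypothetical.

```python
from collections import defaultdict, deque


def sliding_alerts(events, size, threshold):
    """Return (timestamp, topic, count) whenever a topic's event count over
    the last `size` seconds exceeds `threshold`, checked at each arrival."""
    recent = defaultdict(deque)   # topic -> timestamps still inside the window
    alerts = []
    for ts, topic in sorted(events):
        window = recent[topic]
        window.append(ts)
        # Drop timestamps that fell out of the window (ts - size, ts].
        while window[0] <= ts - size:
            window.popleft()
        if len(window) > threshold:
            alerts.append((ts, topic, len(window)))
    return alerts


# Hypothetical burst: nine tweets between t=1.0 and t=5.0, then one at t=6.5
burst = [(1.0 + 0.5 * i, "sql") for i in range(9)] + [(6.5, "sql")]

alerts = sliding_alerts(burst, size=5, threshold=8)
print(alerts)
```

Only the ninth tweet of the burst pushes the 5-second count past 8; by the time the later tweet arrives, the oldest ones have already dropped out of the window.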
Stream Analytics
Event Hub
Data Enrichment
Data
• Local storage: upload data from PC…
• Cloud storage: Azure Storage, Azure Table, Hive, etc.
Azure Machine Learning
• ML Studio (web IDE)
• Workspace: experiments, datasets, trained models, notebooks, access settings
• ML Web Services (REST API services)
• Azure ML Gallery (community)
• Azure Marketplace (applications store)
• Manage API
Consumers
• Excel
• Business apps
Business problem, Modeling, Deployment, Business value
Data, Model, API
https://sites.google.com/site/miningtwitter/questions/sentiment/sentiment
http://www.slideshare.net/ajayohri/twitter-analysis-by-kaify-rais
Sentiment140 (formerly known as "Twitter Sentiment")
allows you to discover the sentiment of a brand, product,
or topic on Twitter.
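For intuition only, sentiment scoring can be sketched as a simple word-list lookup in Python. This is not how Sentiment140 works and would be far less accurate than a trained classifier; the word lists and the function name are made up for illustration.

```python
import re

# Tiny hypothetical lexicons; a real system would use a trained model.
POSITIVE = {"good", "great", "love", "awesome", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}


def naive_sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in [
    "I love the new portal, great job!",
    "Terrible latency, I hate waiting",
    "The session starts at nine",
]:
    print(naive_sentiment(tweet), "|", tweet)
```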
SQL Server 2016 CTP 3.1
Revolution R Open 3.2.2 for Revolution R Enterprise
Revolution R Enterprise 7.5.0
Revolution R Enterprise is able to deliver speeds 42 times faster than competing technology from SAS.
Microsoft announced on January 23, 2015 that they had reached an agreement to purchase Revolution Analytics for an as yet undisclosed amount.
The Klout Score is a number between 1 and 100 that represents your influence.
Klout collects and normalizes more than 12 billion signals a day
Hive data warehouse of more than 1 trillion rows
Klout was acquired by Lithium Technologies for $200 million
Store Semi-Structured Data
Analyze Semi-Structured Data
Developed by Facebook; later adopted by Apache as an open source project.
A data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis
Integrates Hadoop with BI and visualization tools
Provides a SQL-like language called HiveQL to query data
Supports CREATE INDEX and partitioning
Often said not to support UPDATE (that isn't correct)
Hive provides users, groups, and roles, but it is not designed for high security.
Access via console (hive>), script, ODBC/JDBC, SQuirreL, HUE, web interface, etc.
Most popular business intelligence tools support Hive
Data Types
Primitive Data Types: int, bigint, float, double, boolean, decimal, string, timestamp, date etc.
Complex Data Types: arrays, maps, structs
ARRAY<string>: workplace: istanbul, ankara
STRUCT<sex:string,age:int> : Female,25
MAP<string,int>: SOLR:92
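One way to picture the three complex types is through their closest Python counterparts. The sketch below is only an analogy: the column names `person` and `scores` are hypothetical (only `workplace` appears on the slide), and the HiveQL accessors are shown in comments for comparison.

```python
from typing import NamedTuple


class Person(NamedTuple):
    """Python stand-in for STRUCT<sex:string,age:int>."""
    sex: str
    age: int


# One row with the example values from the slide.
row = {
    "workplace": ["istanbul", "ankara"],   # ARRAY<string>  -> list
    "person": Person("Female", 25),        # STRUCT<...>    -> named tuple
    "scores": {"SOLR": 92},                # MAP<string,int> -> dict
}

first_workplace = row["workplace"][0]   # HiveQL: workplace[0]
age = row["person"].age                 # HiveQL: person.age
solr_score = row["scores"]["SOLR"]      # HiveQL: scores['SOLR']

print(first_workplace, age, solr_score)
```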
Hive | RDBMS
SQL interface | SQL interface
Focus on analytics | May focus on online or analytics
No transactions | Transactions usually supported
Partition adds, no random inserts | Random insert and update supported
Distributed processing via map/reduce | Distributed processing varies by vendor (if available)
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Often built on proprietary hardware (especially when scaling out)
Low cost per petabyte | What's a petabyte? :) (note: Are you sure?)
http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
Originally developed at Yahoo! (with huge contributions from Hortonworks and Twitter)
A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
Processes large semi-structured data sets using Hadoop MapReduce
Write complex MapReduce jobs using a simple scripting language (Pig Latin)
Pig provides a number of aggregation functions (AVG, COUNT, SUM, MAX, MIN, etc.)
Developers can write user-defined functions (UDFs)
Access via console (grunt), script, Java, HUE (Hadoop User Experience by Cloudera)
Easy to use and efficient
Data Types
Simple Data Types: int, float, double, chararray (UTF-8), bytearray
Complex Data Types: map (Key,Value), Tuple, Bag (list of tuples)
Commands
Loading: LOAD, STORE, DUMP
Filtering: FILTER, FOREACH, DISTINCT
Grouping: JOIN, GROUP, COGROUP, CROSS
Ordering: ORDER, LIMIT
Merging & Split: UNION, SPLIT
SQL script | Pig script
SELECT * FROM TABLE | A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
SELECT col1+col2, col3 FROM TABLE | B = FOREACH A GENERATE col1+col2, col3;
SELECT col1+col2, col3 FROM TABLE WHERE col3>10 | C = FILTER B BY col3>10;
SELECT col1, col2, SUM(col3) FROM X GROUP BY col1, col2 | D = GROUP A BY (col1, col2);
 | E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
... HAVING SUM(col3) > 5 | F = FILTER E BY $2 > 5;
... ORDER BY col1 | G = ORDER F BY $0;
SELECT DISTINCT col1 FROM TABLE | I = FOREACH A GENERATE col1;
 | J = DISTINCT I;
SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1 | K = GROUP A BY col1;
 | L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
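The step-by-step style of Pig Latin (relations D through G above) maps naturally onto a chain of small transformations. Here is a rough Python equivalent of the GROUP BY / SUM / HAVING / ORDER BY pipeline, run on hypothetical in-memory rows rather than on Hadoop.

```python
from collections import defaultdict

# Hypothetical input rows: (col1, col2, col3)
rows = [(1, 1, 4), (1, 1, 3), (2, 1, 2), (2, 2, 9), (1, 2, 1)]

# D = GROUP A BY (col1, col2);
groups = defaultdict(list)
for col1, col2, col3 in rows:
    groups[(col1, col2)].append(col3)

# E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
summed = [(c1, c2, sum(values)) for (c1, c2), values in groups.items()]

# F = FILTER E BY $2 > 5;   (SQL: HAVING SUM(col3) > 5)
filtered = [r for r in summed if r[2] > 5]

# G = ORDER F BY $0;        (SQL: ORDER BY col1)
ordered = sorted(filtered, key=lambda r: r[0])

print(ordered)
```

As in the Pig script, every intermediate result has a name, which makes it easy to inspect any step on its own.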
Ohhh Finally Demo Time!
Visualize Meaningful Results
• Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning
• Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive
• Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming
• Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration
• Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics