PROJECT 1: Analyzing clickstream data
On a website, clickstream analysis (sometimes called clickstream analytics) is the process of collecting, analyzing, and
reporting aggregate data about which pages visitors visit and in what order. That order is the result of the succession
of mouse clicks each visitor makes (that is, the clickstream).
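As a rough illustration of what "which pages, in what order" means, the sketch below rebuilds each visitor's page sequence from raw click events. The event tuples, field layout and the function name build_clickstreams are hypothetical and are not taken from the project's Omniture data or Hive scripts.

```python
from collections import defaultdict

# Hypothetical click events: (visitor_id, timestamp_seconds, page_url).
# These values are made up for illustration only.
events = [
    ("v1", 30, "/products"),
    ("v2", 12, "/home"),
    ("v1", 10, "/home"),
    ("v1", 55, "/checkout"),
    ("v2", 40, "/products"),
]


def build_clickstreams(events):
    """Group events by visitor and order each visitor's pages by time."""
    by_visitor = defaultdict(list)
    for visitor_id, ts, url in events:
        by_visitor[visitor_id].append((ts, url))
    # Sorting the (timestamp, url) pairs puts each visitor's clicks in order.
    return {
        visitor: [url for _, url in sorted(clicks)]
        for visitor, clicks in by_visitor.items()
    }


clickstreams = build_clickstreams(events)
```

In the project itself this grouping would presumably be done in Hive over the log tables; the Python version only shows the idea on a small in-memory list.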
Download Link
1. Loading the data files into HDFS
2. Starting the new Beeline shell (HiveServer2)
3. Creating new database – alabs_db
4. Creating and loading HIVE table – users
Sagnik_AnalytixLabs_Projects
5. All 3 HIVE base tables – omniturelogs, products and users created
6. Content of HIVE script – webanalytics.sql
7. Using webanalytics.sql, omniture and webanalytics tables are created
8. Creating omniture2 view
PROJECT 2: Sentiment Analysis/Opinion Mining
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and
computational linguistics to identify and extract subjective information in source materials. Sentiment analysis is widely
applied to reviews and social media for a variety of applications, ranging from marketing to customer service.
Data Download Link
Tableau Link
1. Loading the data files into HDFS
2. Content of twitter_conf.conf file
3. Executing the TwitterAgent Flume agent using twitter_conf.conf file
4. Twitter data moved to HDFS
5. Content of tweets.sql file
6. Executing tweets.sql to create tables and views for analysis
7. Tables and views for analysis are created
Tweets ID sentiment
PROJECT 3: Lending Club Loan Analysis
Lending Club is a US peer-to-peer lending company. Lending Club operates an online lending platform that enables
borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans. Lending Club is the
world's largest peer-to-peer lending platform.
Data Download Link
Tableau Link
1. Loading the data files into HDFS
2. Content of loan_analysis.sql file
3. Tables and view created using loan_analysis.sql
PROJECT 4: HVAC Temperature Analysis
HVAC (heating, ventilation and air conditioning) equipment needs a control system to regulate the operation of
a heating and/or air conditioning system. Usually a sensing device is used to compare the actual state (e.g. temperature)
with a target state. The control system then decides what action has to be taken.
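The compare-then-act loop described above can be sketched in a few lines. This is only an illustration: the function name decide_action, the tolerance of 1.0 degree and the action labels are assumptions, not values from the project's sensor data or sensor_analysis.sql.

```python
def decide_action(actual_temp, target_temp, tolerance=1.0):
    """Compare the sensed temperature with the target and pick an action.

    ASSUMPTION: the tolerance band and the action labels ("COOL", "HEAT",
    "IDLE") are hypothetical; a real controller defines its own.
    """
    diff = actual_temp - target_temp
    if diff > tolerance:
        return "COOL"   # room is warmer than the target band
    if diff < -tolerance:
        return "HEAT"   # room is colder than the target band
    return "IDLE"       # within tolerance, nothing to do


action_warm = decide_action(75.0, 70.0)
action_cold = decide_action(65.0, 70.0)
action_ok = decide_action(70.5, 70.0)
```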
Data Download Link
Tableau Link
1. Loading the data files into HDFS
2. Content of sensor_analysis.sql file
3. Tables and view created using sensor_analysis.sql
PROJECT 5: Upsell Analysis
Upselling is a sales technique whereby a seller induces the customer to purchase more expensive items, upgrades or other
add-ons in an attempt to make a more profitable sale.
Data Download Link
1. Sample data
2. Content of upsell_analysis.sql file
A
B
C
3. A
3.1 What is A doing?
• Concatenates first name and last name into a single field – name
• Assigns each customer a category
• Calculates the total amount spent by the customer in each category
• Orders customers by the total amount spent, in descending order
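The steps listed for A can be sketched in Python as follows. The actual logic lives in upsell_analysis.sql, which is not reproduced here, so the input rows, column order and the function name step_a are hypothetical.

```python
from collections import defaultdict

# Hypothetical purchase rows: (first_name, last_name, category, amount).
purchases = [
    ("Ann", "Lee", "Books", 20.0),
    ("Ann", "Lee", "Games", 50.0),
    ("Bob", "Ray", "Books", 15.0),
    ("Ann", "Lee", "Books", 10.0),
]


def step_a(rows):
    """Return (name, category, total_spent) rows, highest total first."""
    totals = defaultdict(float)
    for first, last, category, amount in rows:
        name = f"{first} {last}"            # concatenate first and last name
        totals[(name, category)] += amount  # total per customer and category
    return sorted(
        ((name, cat, total) for (name, cat), total in totals.items()),
        key=lambda row: row[2],
        reverse=True,                       # descending by amount spent
    )


result_a = step_a(purchases)
```

In HiveQL the same result would typically come from CONCAT, GROUP BY with SUM, and ORDER BY ... DESC, though the exact query used in the project is not shown.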
4. B
4.1 What is B doing?
• Extracts name from A
• Assigns each customer their categories using the COLLECT_LIST() function, which converts multiple rows into a
single row of array datatype
• Assigns each customer the amount spent in each of those categories
• Calculates the overall total amount spent by each customer across all categories
• Determines the recommended category for each customer based on the amount spent per category
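A minimal Python sketch of the B steps is shown below, using plain lists to mimic what COLLECT_LIST() produces. The input rows and the function name step_b are hypothetical. The slides do not state how the recommended category is chosen, so picking the category with the highest spend is an assumption made only for this example.

```python
# Hypothetical output of step A: (name, category, total_spent_in_category).
rows_a = [
    ("Ann Lee", "Games", 50.0),
    ("Ann Lee", "Books", 30.0),
    ("Bob Ray", "Books", 15.0),
]


def step_b(rows):
    """Collapse per-category rows into one record per customer."""
    customers = {}
    for name, category, amount in rows:
        entry = customers.setdefault(name, {"categories": [], "amounts": []})
        entry["categories"].append(category)  # like COLLECT_LIST(category)
        entry["amounts"].append(amount)       # like COLLECT_LIST(amount)
    for entry in customers.values():
        entry["total"] = sum(entry["amounts"])
        # ASSUMPTION: recommend the category with the highest spend.
        # The real rule in upsell_analysis.sql may differ.
        top = entry["amounts"].index(max(entry["amounts"]))
        entry["recommended"] = entry["categories"][top]
    return customers


result_b = step_b(rows_a)
```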
4.2 Sample data of B
5. Sample data after C
PROJECT 6: Web Log Analysis
An access log is a list of all the requests for individual files that people have made to a website. These files
include the HTML files, their embedded graphic images and any other associated files that get transmitted. The access
log (sometimes referred to as the "raw data") can be analysed and summarized by another program.
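To make "analysed and summarized by another program" concrete, the sketch below parses a few lines and counts requests per file. It assumes the logs follow the Apache Common Log Format; the sample lines, the regular expression and the function name summarize are illustrative and do not come from the project's data or from log_processing.pig.

```python
import re
from collections import Counter

# Pattern for the Apache Common Log Format (assumed format, see lead-in).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical sample lines, made up for illustration.
sample_lines = [
    '10.0.0.1 - - [10/Oct/2015:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2015:13:55:40 -0700] "GET /logo.png HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2015:13:56:01 -0700] "GET /index.html HTTP/1.1" 304 -',
]


def summarize(lines):
    """Count requests per requested path, skipping lines that do not parse."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            hits[match.group("path")] += 1
    return hits


hits = summarize(sample_lines)
```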
Data Download Link
Tableau Link
1. Accessing Apache access logs using Flume
1.1 flume.conf
1.2 Extract web log data using the following command:
/usr/lib/flume-ng/bin/flume-ng agent -n source_agent -c conf -f /usr/lib/flume-ng/conf/flume.conf
2. Sample log data
3. Moving log file to HDFS
4. PIG script – log_processing.pig
4.1 Content
4.2 Execution
5. Creating HIVE table on the processed log data
