Open Source Solution for Data Analyst Workflow

idBigData Meetup #17
SQL Big Data Analytics
Open Source Solution for Big Data Analyst Workflow
Institut Teknologi Bandung, 28 September 2017
Sigit Prasetyo

sigit.prasetyo@idbigdata.com
@sigitpras303
linkedin.com/in/sigitprasetyo303
flikr.com/photografer-kw3

idBigData.com IDBigData idBigData @idBigData hub.idBigData.com
Data-Driven Company
A data-driven company is an organization where every person who can
use data to make better decisions has access to the data they need
when they need it.
Being data-driven is not about seeing a few canned reports at the
beginning of every day or week; it's about giving the business
decision makers the power to explore data independently, even
if they're working with big or disparate data sources.
https://www.infoworld.com/article/3074322/big-data/what-is-a-data-driven-company.html

Moneyball

Data Journey
01 Data Collection
02 Data Preparation
03 Data Exploration
04 Data Formatting
05 Data Presentation

What Is a Data Analyst?
Data Analysts are experienced data professionals in their organization who
can query and process data, provide reports, summarize and
visualize data.
They have a strong understanding of how to leverage existing tools and
methods to solve a problem, and help people from across the company
understand specific queries with ad-hoc reports and charts.
Skills: Data Analysts need a baseline
understanding of some core skills: statistics,
data munging, data visualization, and exploratory
data analysis.
https://cognitiveclass.ai/blog/data-scientist-vs-data-engineer/
Tools: Microsoft Excel, SPSS, SPSS Modeler,
SAS, SAS Miner, SQL, Microsoft Access,
Tableau, SSAS

Big Data Data Analyst Certification
Required Skills
Prepare the Data
Use Extract, Transform, Load (ETL) processes to
prepare data for queries.
Provide Structure to the Data
Use Data Definition Language (DDL) statements
to create or alter structures in the metastore for
use by Hive and Impala.
Data Analysis
Use Query Language (QL) statements in Hive and
Impala to analyze data on the cluster.
Certification Exam Subject Areas
1. Extract, Transform, and Load Data with Apache
Pig
2. Manipulate Data with Apache Pig
3. Create tables and load data in Apache Hive
4. Query data with Apache Hive
5. SQL Queries with Drill
6. Working with Self-Describing Data
7. Advanced Topics including Troubleshooting

Why SQL?
SQL: Structured Query Language
A very high-level language
(Almost) every application uses a database
SQL developers are easier to find
The easiest entry point into Hadoop

SQL On Hadoop
Apache Drill: Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
Apache Phoenix: OLTP and operational analytics for Apache Hadoop
Apache Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
Apache Impala: The open source, native analytic database for Apache Hadoop*
Apache Tajo: A big data warehouse system on Hadoop
Apache HAWQ: Apache Hadoop Native SQL. Advanced, MPP, elastic query engine and analytic database for enterprises*
Presto: Distributed SQL Query Engine for Big Data

Why not Excel?
Easy to use
Flat database
(Almost) complete tool for data analysts (formulas, statistics, charts)
What if...
Bigger data
Complex relational data

Let’s Play Lego
Read simple to complex data
Data exploration + Ad Hoc Query
Data visualization
Machine Learning
HDFS + MAPREDUCE + HIVE + ZEPPELIN

SQL Data Analytics Sandbox
VirtualBox
Linux Mint OS 18.2
Apache Hadoop Vanilla, Single Node
YARN - Resource Management
HDFS - Hadoop Distributed File System
MapReduce - Execution Engine
Data Preparation
Data Exploration
Apache Zeppelin
https://github.com/project303/dasb

Apache Hive
Initially developed by Facebook
Included in most Hadoop distros (Cloudera, Hortonworks, MapR, Yava)
Built-in functions and user-defined functions (UDFs)
Transactional (ACID)
Supports indexes
Supports a procedural language
Machine Learning - Hivemall*
Supported execution engines
- MapReduce
- Apache Tez
- Spark
JDBC connection support

Apache Zeppelin
Interactive Notebook
Web Front End
Multiple Interpreters
Built-in Visualization

Proof Of Concept
Perform Squid Access Log Data Analysis.
Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and
more. It reduces bandwidth and improves response times by caching and
reusing frequently-requested web pages.
Scenario:
Load the access.log data into HDFS
Use Hive to check whether it contains anything unusual

Know Your Data
Data Format: a text file in which each line contains 10 space-separated fields
remotehost rfc931 authuser [date] "request" status size referer agent tcp_code
Field Description :
1. Remotehost
Remote hostname (or IP number if the DNS hostname is not
available, or if DNSLookup is Off).
2. Rfc931
The remote logname of the user.
3. Authuser (user ID)
The username with which the user has authenticated.
Always NULL ("-") for Squid logs.
4. [date]
Date and time of the request.
5. "Request"
The request line exactly as it came from the client. GET,
HEAD, POST, etc. for HTTP requests. ICP_QUERY for ICP
requests.
6. Status
The HTTP status code returned to the client. See the HTTP
status codes for a complete list.
7. Size
The content length of the data transferred, in bytes.
8. Referer
9. Agent
The application that accessed the internet.
10. TCP Code
The "cache result" of the request. This describes whether the
request was a cache hit or miss, and whether the object was
refreshed.

Know Your Data
Sample Data :
192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-" "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
192.168.6.103 - - [17/Sep/2017:00:01:14 +0700] "POST http://netmarbleslog.netmarble.com/ HTTP/1.0" 200 299 "-" "okhttp/2.5.0" TCP_MISS:DIRECT
Remotehost : 192.168.6.129
[date] : [17/Sep/2017:00:00:21 +0700]
"Request" :
"GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1"
Status : 200
Size : 862
Agent :
"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)"
TCP Code : TCP_MISS:DIRECT

Starting Apache Zeppelin

Accessing Zeppelin

Preparation

Load Data To HDFS

Create External Table

RegexSerDe
Sample Data :
192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] "GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" 200 862 "-" "Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" TCP_MISS:DIRECT
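
The table definition from this slide is not reproduced in the transcript. As a rough illustration of what a RegexSerDe-style pattern does, the sketch below applies one capture group per column to the first sample line. The pattern, the field names, and the helper parse_line are assumptions modeled on the ten-field format described earlier, not the expression used in the talk.

```python
import re

# One capture group per column, in the order given by the format line.
# Assumed pattern for illustration; the talk's actual regex is not shown.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\d+) '
    r'"([^"]*)" "([^"]*)" (\S+)'
)

FIELDS = [
    "remotehost", "rfc931", "authuser", "date", "request",
    "status", "size", "referer", "agent", "tcp_code",
]


def parse_line(line):
    """Return a dict of the ten fields, or None when the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample = (
    '192.168.6.129 - - [17/Sep/2017:00:00:21 +0700] '
    '"GET http://api.account.xiaomi.com/pass/v2/safe/user/coreInfo? HTTP/1.1" '
    '200 862 "-" '
    '"Dalvik/2.1.0 (Linux; U; Android 5.1.1; 2014817 MIUI/V8.5.1.0.LHJMIED)" '
    'TCP_MISS:DIRECT'
)

record = parse_line(sample)
for name in FIELDS:
    print(name, "=", record[name])
```

Every captured value comes back as a string, which is also how a regex-based SerDe typically exposes its columns; casting status and size to numbers would be a later step, for example in a view.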

View Table Content

Create View

Let’s Tell The Story

Monday Traffic Behaviour
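
The query and chart behind this slide are not in the transcript. A minimal sketch of the same idea, assuming the goal is a count of requests per hour for one weekday: parse the [date] field and group by hour. The timestamps below are made up for illustration, and requests_per_hour is a hypothetical helper, not code from the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical [date] values in the log's timestamp format (not from the talk).
timestamps = [
    "17/Sep/2017:23:59:58 +0700",  # Sunday
    "18/Sep/2017:08:15:02 +0700",  # Monday
    "18/Sep/2017:08:47:30 +0700",  # Monday
    "18/Sep/2017:13:05:11 +0700",  # Monday
    "19/Sep/2017:09:00:00 +0700",  # Tuesday
]


def requests_per_hour(stamps, weekday):
    """Count requests per hour of day for one weekday (Monday is 0)."""
    counts = Counter()
    for stamp in stamps:
        moment = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        if moment.weekday() == weekday:
            counts[moment.hour] += 1
    return dict(sorted(counts.items()))


monday_traffic = requests_per_hour(timestamps, weekday=0)
print(monday_traffic)  # {8: 2, 13: 1}
```

In Hive the equivalent would presumably be a GROUP BY on the hour extracted from the date column, with Zeppelin's built-in charts rendering the result.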

IP Traffic Behaviour
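
The slide's result is not in the transcript. One plausible reading of "IP traffic behaviour" is requests and bytes transferred per remote host, sketched below. The rows are invented for illustration and traffic_by_host is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical (remotehost, size in bytes) pairs, not data from the talk.
rows = [
    ("192.168.6.129", 862),
    ("192.168.6.103", 299),
    ("192.168.6.129", 1500),
    ("192.168.6.77", 120),
    ("192.168.6.129", 40),
]


def traffic_by_host(pairs):
    """Sum request count and bytes per host, busiest host first."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for host, size in pairs:
        totals[host]["requests"] += 1
        totals[host]["bytes"] += size
    return sorted(
        totals.items(), key=lambda item: item[1]["bytes"], reverse=True
    )


ranking = traffic_by_host(rows)
for host, stats in ranking:
    print(host, stats["requests"], stats["bytes"])
```

A host that sits far above the rest in either column is the kind of "something uncommon" the scenario asks about.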

Agent Name
Status → 403 Forbidden
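
The slide's query is not in the transcript. Assuming it lists which agents received 403 Forbidden responses, the filter-then-count step could look like the sketch below. The rows, including the agent name ExampleBot/1.0, are invented for illustration, and agents_with_status is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (status, agent) pairs, not data from the talk.
rows = [
    (200, "okhttp/2.5.0"),
    (403, "ExampleBot/1.0"),
    (403, "ExampleBot/1.0"),
    (403, "okhttp/2.5.0"),
    (200, "Dalvik/2.1.0"),
]


def agents_with_status(pairs, status):
    """Count agents among requests with the given status, most frequent first."""
    return Counter(
        agent for code, agent in pairs if code == status
    ).most_common()


forbidden = agents_with_status(rows, 403)
print(forbidden)  # [('ExampleBot/1.0', 2), ('okhttp/2.5.0', 1)]
```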

The Most Used Agent

Thank You & Stay Connected
s.id/idbigdata
Credit for icon
Gregor Cresnar
www.flaticon.com/authors/gregor-cresnar
Prosymbols
www.flaticon.com/authors/prosymbols
Freepik
www.freepik.com
Pavel Kozlov
www.flaticon.com/authors/pavel-kozlov
Yannick
www.flaticon.com/authors/yannick
Dave Gandy
www.flaticon.com/authors/dave-gandy
SimpleIcon
www.flaticon.com/authors/simpleicon
