Working with large tables:
processing and analytics with the Big Data Cluster
Enrico Daga
enrico.daga@open.ac.uk - @enridaga
Knowledge Media Institute - The Open University
http://isds.kmi.open.ac.uk/
OU Research Software Engineers - October 2018
Objective
• To introduce the concept of distributed computing
• To show how to use the Big Data Cluster
• To get a taste of some tools for data processing
• To understand the differences from more traditional approaches (e.g. a Relational Data Warehouse)
Background
• Projects:
• MK:Smart and the MK Data Hub
• CityLABS
• Data science activity @ OU
Outline
• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data
Many different types of data objects are tables or can be translated and manipulated as data tables:
• Excel Documents, Relational databases -> Tables
• Text Documents -> Word Vectors -> Tables
• Web Data -> Graph -> Tables
• JSON -> Tree -> Graph -> Tables
• …
Tables can be large
• Web Server Logs
  • Thousands each day even for a small Web site, billions for large ones
• Social Media
  • 500M tweets every day
• Search Engines
  • Based on word / document statistics …
  • Google indexes contain hundreds of billions of documents
Many other cases:
• Stock Exchange
• Black Boxes
• Power Grid
• Transport
• …
Tables can be large
• Most operations on tabular data require scanning all the rows in the table:
  • Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF:
https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
Distributed computing
• An approach based on the distribution of data and the
parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
• to retrieve data from each node
• to perform atomic operations
• to compose the result
(Figure: MapReduce word-count flow - https://en.wikipedia.org/wiki/File:WordCountFlow.JPG)
Apache Hadoop
• Open Source project derived from Google’s MapReduce
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers
http://hadoop.apache.org/
Apache Hadoop
KMi Big Data Cluster
A private environment for large scale data processing and analytics.
The stack (from the Cloudera Open Source distribution):
• HDFS - Hadoop Distributed File System
• Hadoop MapReduce libraries
• HIVE, PIG, HCatalog
• HBase, SPARK
• HUE Workbench
• Zookeeper, YARN, …
https://www.cloudera.com/products/open-source.html
HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
http://gethue.com/
Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL (HiveQL)
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL offers most of the operators familiar to SQL developers
• Applies MapReduce underneath
https://hive.apache.org/
Apache Pig
• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin) - see the sketch below
• Load/Save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc…
• Applies MapReduce underneath
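
As a flavour of Pig Latin, below is a minimal word-count sketch (the same job illustrated by the MapReduce word-count figure referenced earlier). The input and output paths are illustrative and this is not part of the workshop material; Pig compiles the script into MapReduce jobs behind the scenes.

  -- word count over plain-text files on HDFS (paths are illustrative)
  lines       = LOAD '/data/books' USING TextLoader() AS (line:chararray);
  words       = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  word_groups = GROUP words BY word;                 -- the "shuffle" phase
  counts      = FOREACH word_groups GENERATE
                  group AS word, COUNT(words) AS n;  -- the "reduce" phase
  ordered     = ORDER counts BY n DESC;
  STORE ordered INTO '/data/wordcount';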
Caveat
• Read/write operations to disk are slow and consume resources
• Reading and merging from multiple files is expensive
• Hardware, file system and I/O errors do happen
Caveat
• Relational database design principles are NOT recommended,
e.g.:
• Integrity constraints
• De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
Hands-On!
• Gutenberg project
• Public domain books
• ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg
project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
Computing TF-IDF
• TF: term frequency
  • count of term hits adjusted for the document length
  • tf(t,d) = count(t,d) / len(d)
  • {doc, ”cat”, hits=5, len=2000} -> 5 / 2000 = 0.0025
• IDF: inverse document frequency
  • N = number of documents in the collection (D)
  • divided by the number of documents containing the term
  • in log scale
• We can’t do this easily on a laptop …
  • e.g. Gutenberg English sums to ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
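
Putting the definitions above together, for a term t, a document d and a corpus D of N documents:

  tf(t,d)      = count(t,d) / len(d)
  idf(t,D)     = log( N / |{d in D : t occurs in d}| )
  tfidf(t,d,D) = tf(t,d) * idf(t,D)

For the illustrative values above, tf = 5 / 2000 = 0.0025; if, say, "cat" occurred in 100 of 1000 documents, idf = log(1000/100) ≈ 2.3 (natural log) and tf-idf ≈ 0.0058. These document counts are made up purely to show the arithmetic.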
Step 1/4 - Generate Term Vectors
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract (12, 1.000, 20% -> <NUMBER>)

Input: gutenberg_docs
doc_id | text
Gutenberg-1 | …
Gutenberg-2 | …
Gutenberg-3 | …
…

Output: gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…

Lookup book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
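
A rough Pig Latin sketch of this step. The paths are illustrative, and nlp.TermVector is a hypothetical UDF standing in for the stop-word removal, POS tagging, stemming and number abstraction; it is not the workshop's actual script (see the GitHub link at the end).

  -- register the jar providing the (hypothetical) NLP UDF
  REGISTER 'nlp-udfs.jar';

  docs  = LOAD '/data/gutenberg_docs' USING PigStorage('\t')
          AS (doc_id:chararray, text:chararray);

  -- nlp.TermVector(text) is assumed to return a bag of (position, word) tuples
  terms = FOREACH docs GENERATE doc_id,
          FLATTEN(nlp.TermVector(text)) AS (position:int, word:chararray);

  STORE terms INTO '/data/gutenberg_terms' USING PigStorage('\t');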
Step 2/4 - Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d), computed for each term in each doc.

gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…
Gutenberg-1 | 5425 | note[VBP]

count(t,d) -> doc_word_counts
doc_id | word | num_doc_wrd_usages
Gutenberg-1 | call[VB] | 2
Gutenberg-1 | world[NN] | 22
Gutenberg-1 | combine[VBZ] | 2
…

len(d) -> usage_bag (doc_word_counts plus a doc_size column, e.g. 2377270 for the rows above)

count(t,d) / len(d) -> term_freqs
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…
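A Pig Latin sketch of this step; relation and field names follow the tables above, but it is an illustration under the same assumptions as before, not necessarily the workshop's exact script.

  terms = LOAD '/data/gutenberg_terms' USING PigStorage('\t')
          AS (doc_id:chararray, position:int, word:chararray);

  -- count(t,d): how many times each word occurs in each document
  by_doc_word     = GROUP terms BY (doc_id, word);
  doc_word_counts = FOREACH by_doc_word GENERATE
                      FLATTEN(group) AS (doc_id, word),
                      COUNT(terms) AS num_doc_wrd_usages;

  -- len(d): attach the total number of terms of each document to its rows
  by_doc    = GROUP doc_word_counts BY doc_id;
  usage_bag = FOREACH by_doc GENERATE
                FLATTEN(doc_word_counts) AS (doc_id, word, num_doc_wrd_usages),
                SUM(doc_word_counts.num_doc_wrd_usages) AS doc_size;

  -- tf(t,d) = count(t,d) / len(d)
  term_freqs = FOREACH usage_bag GENERATE
                 doc_id, word AS term,
                 (double) num_doc_wrd_usages / (double) doc_size AS term_freq;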
Step 3/4 - Compute Inverse Document Frequency (IDF)

term_freqs
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…

count doc_id having term -> term_usages (term_freqs plus a num_docs_with_term column, e.g. 11234, 5436, 3987)

idf = log(48790 / num_docs_with_term), with N = 48790 documents -> term_usages_idf
doc_id | term | term_freq | idf
Gutenberg-5307 | will[MD] | 0.01055794688540567 | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0073364195024229134 | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.006226481496521292 | 0.11554635054423526
…
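Continuing the sketch from the previous step, again as an illustration rather than the workshop's exact script:

  -- number of documents containing each term
  by_term     = GROUP term_freqs BY term;
  term_usages = FOREACH by_term GENERATE
                  FLATTEN(term_freqs) AS (doc_id, term, term_freq),
                  COUNT(term_freqs) AS num_docs_with_term;

  -- idf(t) = log(N / num_docs_with_term), with N = 48790 documents in the corpus
  term_usages_idf = FOREACH term_usages GENERATE
                      doc_id, term, term_freq,
                      LOG(48790.0 / (double) num_docs_with_term) AS idf;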
Step 4/4 - Compute TF/IDF

term_usages_idf
doc_id | term | term_freq | idf
Gutenberg-5307 | will[MD] | 0.01055794688540567 | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0073364195024229134 | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.006226481496521292 | 0.11554635054423526
…

term_freq * idf -> tfidf
doc_id | term | tf_idf
Gutenberg-5307 | will[MD] | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.11554635054423526
…
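The final step of the sketch multiplies the two statistics and stores the result (output path illustrative):

  -- tf_idf = term_freq * idf, then persist the final table
  tfidf = FOREACH term_usages_idf GENERATE
            doc_id, term, term_freq * idf AS tf_idf;
  STORE tfidf INTO '/data/gutenberg_tfidf' USING PigStorage('\t');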
Let’s go …
• Step by step instructions at the following link:
• https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary
• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You got a taste of state-of-the-art tools for data processing using the MK Data Hub Hadoop Cluster
• We saw how to compute TF/IDF on a corpus of documents with HIVE and PIG
Acknowledgments
