Working with large tables:
processing and analytics with the Big Data Cluster
Enrico Daga
enrico.daga@open.ac.uk - @enridaga
Knowledge Media Institute - The Open University
http://isds.kmi.open.ac.uk/
OU Research Software Engineers - October 2018
Objective
• To introduce the concept of distributed computing
• To show how to use the Big Data Cluster
• To get a taste of some tools for data processing
• To understand the differences from more traditional approaches (e.g. a Relational Data Warehouse)
Background
• Projects:
• MK:Smart and the MK Data Hub
• CityLABS
• Data science activity @ OU
Outline
• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data
Many different types of data objects are tables or can be translated and manipulated as data tables:
• Excel Documents, Relational databases -> Tables
• Text Documents -> Word Vectors -> Tables
• Web Data -> Graph -> Tables
• JSON -> Tree -> Graph -> Tables
• …
Tables can be large
• Web Server Logs
  • Thousands each day even for a small Web site, billions for large ones
• Social Media
  • 500M tweets every day
• Search Engines
  • Based on word / document statistics …
  • Google indexes contain hundreds of billions of documents
Many other cases:
• Stock Exchange
• Black Boxes
• Power Grid
• Transport
• …
Tables can be large
• Most operations on tabular data require scanning all the rows in the table:
  • Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF:
https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
Distributed computing
• An approach based on the distribution of data and the
parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
• to retrieve data from each node
• to perform atomic operations
• to compose the result
(Figure: MapReduce word-count flow - https://en.wikipedia.org/wiki/File:WordCountFlow.JPG)
Apache Hadoop
• Open Source project derived from Google’s MapReduce
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers
http://hadoop.apache.org/
Apache Hadoop
KMi Big Data Cluster
A private environment for large scale data processing and analytics.
The stack (from the Cloudera Open Source distribution):
• HDFS - Hadoop Distributed File System
• Hadoop MapReduce libraries
• HIVE, PIG, HCatalog
• HBase, SPARK
• HUE Workbench
• Zookeeper, YARN, …
https://www.cloudera.com/products/open-source.html
HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
http://gethue.com/
Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL (HiveQL)
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL offers most of the operators familiar to SQL developers
• Applies MapReduce underneath
https://hive.apache.org/
Apache Pig
• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin) - see the sketch below
• Load/Save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc…
• Applies MapReduce underneath
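
As a flavour of Pig Latin, below is a minimal word-count sketch (the same job illustrated by the MapReduce word-count figure referenced earlier). The input and output paths are illustrative and this is not part of the workshop material; Pig compiles the script into MapReduce jobs behind the scenes.

  -- word count over plain-text files on HDFS (paths are illustrative)
  lines       = LOAD '/data/books' USING TextLoader() AS (line:chararray);
  words       = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  word_groups = GROUP words BY word;                 -- the "shuffle" phase
  counts      = FOREACH word_groups GENERATE
                  group AS word, COUNT(words) AS n;  -- the "reduce" phase
  ordered     = ORDER counts BY n DESC;
  STORE ordered INTO '/data/wordcount';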
Caveat
• Read/write operations to disk are slow and consume resources
• Reading and merging from multiple files is expensive
• Hardware, file system and I/O errors do happen
Caveat
• Relational database design principles are NOT recommended,
e.g.:
• Integrity constraints
• De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
Hands-On!
• Gutenberg project
• Public domain books
• ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg
project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
Computing TF-IDF
• TF: term frequency
  • count of term hits adjusted for the document length
  • tf(t,d) = count(t,d) / len(d)
  • {doc, ”cat”, hits=5, len=2000} -> 5 / 2000 = 0.0025
• IDF: inverse document frequency
  • N = number of documents in the collection (D)
  • divided by the number of documents containing the term
  • in log scale
• We can’t do this easily on a laptop …
  • e.g. Gutenberg English sums to ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
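
Putting the definitions above together, for a term t, a document d and a corpus D of N documents:

  tf(t,d)      = count(t,d) / len(d)
  idf(t,D)     = log( N / |{d in D : t occurs in d}| )
  tfidf(t,d,D) = tf(t,d) * idf(t,D)

For the illustrative values above, tf = 5 / 2000 = 0.0025; if, say, "cat" occurred in 100 of 1000 documents, idf = log(1000/100) ≈ 2.3 (natural log) and tf-idf ≈ 0.0058. These document counts are made up purely to show the arithmetic.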
Step 1/4 - Generate Term Vectors
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract (12, 1.000, 20% -> <NUMBER>)

Input: gutenberg_docs
doc_id | text
Gutenberg-1 | …
Gutenberg-2 | …
Gutenberg-3 | …
…

Output: gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…

Lookup book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
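
A rough Pig Latin sketch of this step. The paths are illustrative, and nlp.TermVector is a hypothetical UDF standing in for the stop-word removal, POS tagging, stemming and number abstraction; it is not the workshop's actual script (see the GitHub link at the end).

  -- register the jar providing the (hypothetical) NLP UDF
  REGISTER 'nlp-udfs.jar';

  docs  = LOAD '/data/gutenberg_docs' USING PigStorage('\t')
          AS (doc_id:chararray, text:chararray);

  -- nlp.TermVector(text) is assumed to return a bag of (position, word) tuples
  terms = FOREACH docs GENERATE doc_id,
          FLATTEN(nlp.TermVector(text)) AS (position:int, word:chararray);

  STORE terms INTO '/data/gutenberg_terms' USING PigStorage('\t');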
Step 2/4 - Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d), computed for each term in each doc.

gutenberg_terms
doc_id | position | word
Gutenberg-1 | 0 | note[VBP]
Gutenberg-1 | 1 | file[NN]
Gutenberg-1 | 2 | combine[VBZ]
…
Gutenberg-1 | 5425 | note[VBP]

count(t,d) -> doc_word_counts
doc_id | word | num_doc_wrd_usages
Gutenberg-1 | call[VB] | 2
Gutenberg-1 | world[NN] | 22
Gutenberg-1 | combine[VBZ] | 2
…

len(d) -> usage_bag (doc_word_counts plus a doc_size column, e.g. 2377270 for the rows above)

count(t,d) / len(d) -> term_freqs
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…
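A Pig Latin sketch of this step; relation and field names follow the tables above, but it is an illustration under the same assumptions as before, not necessarily the workshop's exact script.

  terms = LOAD '/data/gutenberg_terms' USING PigStorage('\t')
          AS (doc_id:chararray, position:int, word:chararray);

  -- count(t,d): how many times each word occurs in each document
  by_doc_word     = GROUP terms BY (doc_id, word);
  doc_word_counts = FOREACH by_doc_word GENERATE
                      FLATTEN(group) AS (doc_id, word),
                      COUNT(terms) AS num_doc_wrd_usages;

  -- len(d): attach the total number of terms of each document to its rows
  by_doc    = GROUP doc_word_counts BY doc_id;
  usage_bag = FOREACH by_doc GENERATE
                FLATTEN(doc_word_counts) AS (doc_id, word, num_doc_wrd_usages),
                SUM(doc_word_counts.num_doc_wrd_usages) AS doc_size;

  -- tf(t,d) = count(t,d) / len(d)
  term_freqs = FOREACH usage_bag GENERATE
                 doc_id, word AS term,
                 (double) num_doc_wrd_usages / (double) doc_size AS term_freq;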
Step 3/4 - Compute Inverse Document Frequency (IDF)

term_freqs
doc_id | term | term_freq
Gutenberg-1 | call[VB] | 1.791697274828445E-5
Gutenberg-1 | world[NN] | 1.791697274828445E-5
Gutenberg-1 | combine[VBZ] | 8.958486374142224E-6
…

count doc_id having term -> term_usages (term_freqs plus a num_docs_with_term column, e.g. 11234, 5436, 3987)

idf = log(48790 / num_docs_with_term), with N = 48790 documents -> term_usages_idf
doc_id | term | term_freq | idf
Gutenberg-5307 | will[MD] | 0.01055794688540567 | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0073364195024229134 | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.006226481496521292 | 0.11554635054423526
…
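Continuing the sketch from the previous step, again as an illustration rather than the workshop's exact script:

  -- number of documents containing each term
  by_term     = GROUP term_freqs BY term;
  term_usages = FOREACH by_term GENERATE
                  FLATTEN(term_freqs) AS (doc_id, term, term_freq),
                  COUNT(term_freqs) AS num_docs_with_term;

  -- idf(t) = log(N / num_docs_with_term), with N = 48790 documents in the corpus
  term_usages_idf = FOREACH term_usages GENERATE
                      doc_id, term, term_freq,
                      LOG(48790.0 / (double) num_docs_with_term) AS idf;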
Step 4/4 - Compute TF/IDF

term_usages_idf
doc_id | term | term_freq | idf
Gutenberg-5307 | will[MD] | 0.01055794688540567 | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0073364195024229134 | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.006226481496521292 | 0.11554635054423526
…

term_freq * idf -> tfidf
doc_id | term | tf_idf
Gutenberg-5307 | will[MD] | 0.09273305662791352
Gutenberg-5307 | must[MD] | 0.0927780327905548
Gutenberg-5307 | good[JJ] | 0.11554635054423526
…
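The final step of the sketch multiplies the two statistics and stores the result (output path illustrative):

  -- tf_idf = term_freq * idf, then persist the final table
  tfidf = FOREACH term_usages_idf GENERATE
            doc_id, term, term_freq * idf AS tf_idf;
  STORE tfidf INTO '/data/gutenberg_tfidf' USING PigStorage('\t');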
Let’s go …
• Step by step instructions at the following link:
• https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary
• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You got a taste of state-of-the-art tools for data processing using the MK Data Hub Hadoop Cluster
• We saw how to compute TF/IDF on a corpus of documents with HIVE and PIG
Acknowledgments
