Big Data in the “Real World”
Edward Capriolo
What is “big data”?
● Big data is a collection of data sets so large and
complex that it becomes difficult to process
using traditional data processing applications.
● The challenges include capture, curation,
storage, search, sharing, transfer, analysis,
and visualization.
http://en.wikipedia.org/wiki/Big_data
Big Data Challenges
● The challenges include:
– capture
– curation
– storage
– search
– sharing
– transfer
– analysis
– visualization
– large
– complex
What is “big data” exactly?
● What is considered "big data" varies depending on
the capabilities of the organization managing the
set, and on the capabilities of the applications that
are traditionally used to process and analyze the
data set in its domain.
● As of 2012, limits on the size of data sets that are
feasible to process in a reasonable amount of
time were on the order of exabytes of data.
http://en.wikipedia.org/wiki/Big_data
Big Data Qualifiers
● varies
● capabilities
● traditionally
● feasibly
● reasonably
● [somptha]bytes of data
My first “big data” challenge
● Real time news delivery platform
● Ingest news as text and provide full text search
● Qualifiers
– Reasonable: Real time search was < 1 second
– Capabilities: small company, <100 servers
● Big Data challenges
– Storage: roughly 300 GB for 60 days of data
– Search: searches of thousands of terms
Traditionally
● Data was placed in MySQL
● MySQL full text search
● Easy to insert
● Easy to search
● Worked great!
– Until it hit real-world load
Feasibly in hardware
(circa 2008)
● 300 GB of data and 16 GB of RAM
● ...MySQL stores an in-memory binary tree of the keys.
Using this tree, MySQL can calculate the count of matching
rows with reasonable speed. But speed declines
logarithmically as the number of terms increases.
● The platters revolve at 15,000 RPM or so, which works out
to 250 revolutions per second. Average latency is listed as
2.0ms
● As the speed of an HDD increases the power it takes to run
it increases disproportionately
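(The arithmetic behind that figure: 15,000 RPM / 60 = 250 revolutions per second, so one revolution takes 4 ms; on average the head waits about half a revolution for the right sector to come around, which is where the ~2.0 ms rotational latency comes from.)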
http://serverfault.com/questions/190451/what-is-the-throughput-of-15k-rpm-sas-drive
http://thessdguy.com/why-dont-hdds-spin-faster-than-15k-rpm/
http://dev.mysql.com/doc/internals/en/full-text-search.html
“Big Data” is about giving up things
● In theoretical computer science, the CAP theorem states
that it is impossible for a distributed computer system to
simultaneously provide all three of the following guarantees:
– Consistency (all nodes see the same data at the same time)
– Availability (a guarantee that every request receives a response
about whether it was successful or failed)
– Partition tolerance (the system continues to operate despite
arbitrary message loss or failure of part of the system)
http://en.wikipedia.org/wiki/CAP_theorem
http://www.youtube.com/watch?v=I4yGBfcODmU
Multi-Master solution
● Write the data to N MySQL servers and round-robin reads between them (sketched below)
– Good: More machines to serve reads
– Bad: Requires Nx hardware
– Hard: Keeping machines loaded with the same data, especially auto-generated IDs
– Hard: What about when the data does not even fit on a single machine?
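A minimal sketch of the read side of this setup, assuming N identically loaded MySQL servers exposed as JDBC DataSources (class and field names here are illustrative, not from the original system):

import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import javax.sql.DataSource;

// Round-robins read connections across N identically loaded MySQL servers.
// Writes still have to hit every server, which is the "Nx hardware" cost above.
public class RoundRobinReads {
    private final List<DataSource> servers;   // one DataSource per MySQL server
    private final AtomicLong counter = new AtomicLong();

    public RoundRobinReads(List<DataSource> servers) {
        this.servers = servers;
    }

    public Connection nextReadConnection() throws SQLException {
        int index = (int) Math.floorMod(counter.getAndIncrement(), (long) servers.size());
        return servers.get(index).getConnection();
    }
}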
Sharding
● Rather than replicating all data to all machines
● Replicate data to select machines (routing sketched below)
– Good: localized data
– Good: better caching
– Hard: Joins across shards
– Hard: Management
– Hard: Failure
● Parallel RDBMS = $$$
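The routing referenced above is the easy part; a sketch of hashing a shard key to a server follows (the hard items listed, cross-shard joins, management, failure, and re-sharding when servers are added, are exactly what it leaves out):

import java.util.List;

// Picks which shard (MySQL server) owns a given key, e.g. a user id.
// Adding or removing shards moves most keys around; that is one of the "Hard" items.
public class ShardRouter {
    private final List<String> shardJdbcUrls;

    public ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    public String shardFor(String key) {
        int index = Math.floorMod(key.hashCode(), shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }
}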
Life lesson
“applications that are traditionally used to”
● How did we solve our problem?
– We switched to Lucene (see the sketch below)
● A tool designed for full text search
● Eventually sharded Lucene
● When you hold a hammer:
– Not everything is a nail
● Understand what you really need
● Understand reasonable and feasible
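For flavor, a minimal index-and-search sketch against Lucene's core Java API (class names are from modern Lucene releases; the 2008-era API we actually used looked different):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class NewsSearch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();        // in-memory index, just for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one news story as a full-text "body" field.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Acme Corp announces quarterly earnings", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Run a parsed full-text query against it.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("body", analyzer).parse("earnings"), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}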
Big Data Challenge #2
● Large, high-volume web site
● Process its logs and produce reports
● Big Data challenges
– Storage: Store GB of data a day for years
– Analysis, visualization: support reports of the existing system
● Qualifiers
– Reasonable to want daily reports in less than one day
– Honestly, it needs to be faster (reruns, etc.)
Enter hadoop
● Hadoop (0.17.X) was fairly new at the time
● Use cases of map reduce were emerging
– Hive had just been open sourced by Facebook
● Many database vendors were calling
map/reduce “a step backwards”
– They had solved these problems “in the 80s”
Hadoop file system HDFS
● Distributed redundant storage
– No single point of failure (NoSPOF) across the board
● Commodity hardware vs buying a big
SAN/NAS device
● We already had processes that scp'ed data to servers; they were easily adapted to place files into HDFS instead (see the sketch below)
● HDFS made storing huge data sets easy
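Swapping the scp step for an HDFS write is only a few lines with the Hadoop FileSystem API; a sketch (the namenode address and paths are made up):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a local log file into HDFS, replacing the old scp-to-a-server step.
public class PutLogIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        fs.copyFromLocalFile(new Path("/var/log/httpd/access_log"),
                             new Path("/data/weblogs/2009-01-10/access_log"));
        fs.close();
    }
}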
Map Reduce
● As a proof of concept I wrote a group/count application that would group and count on a column in our logs
● Was able to show linear speed-up with increased nodes
Winning (why Hadoop kicked arse)
● Data capture, curation
– bulk loading data into an RDBMS is expensive (indexes, overhead)
– bulk loading into Hadoop is just a network copy
● Data analysis
– The RDBMS would not parallelize queries (even across partitions)
– Some queries could cause severe locking and performance degradation
http://hardcourtlessons.blogspot.com/2010/05/definition-of-winning.html
Enter Hive
● Capture - NO
● Curation - YES
● Storage - YES
● Search - YES
● Sharing - YES
● Transfer - NO
● Analysis - YES
● Visualization - NO
Logging from Apache to Hive
Sample program group and count
Source data looks like
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm
In case you're the math type
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> list(v3)
A mapper
A reducer
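The two slides above were code screenshots; a minimal group/count job in Hadoop's Java mapreduce API looks roughly like this (the 0.17-era mapred API differed, and the field handling assumes the colon-delimited sample lines shown earlier):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlGroupCount {

    // A mapper: emit (url, 1) for every log line.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(":");
            url.set(fields[fields.length - 1]);   // last colon-delimited field is the url
            ctx.write(url, ONE);
        }
    }

    // A reducer (also used as the combiner): sum the 1s for each url.
    public static class UrlReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(url, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "url group/count");
        job.setJarByClass(UrlGroupCount.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(UrlReducer.class);
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The Hive query on the next slide expresses the same group/count as one line of SQL.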
Hive style
hive> CREATE TABLE web_data (
  sdate STRING, stime STRING,
  envvar STRING, remotelogname STRING, servername STRING,
  localip STRING, literaldash STRING, method STRING,
  url STRING, querystring STRING, status STRING,
  literalzero STRING, bytessent INT, header STRING, timetoserver INT,
  useragent STRING, cookie STRING, referer STRING);

SELECT url, count(1) FROM web_data GROUP BY url;
Life lessons volume 2
● Feasible and reasonable were completely different than in case #1
● Query: from seconds -> hours
● Size: from GB to TB
● Feasible: from 4 nodes to 15
Big Data Challenge #3
(work at m6d)
● Large, high-volume ad serving site
● Process its logs and produce reports
● Support data science and biz-dev users
● Big Data challenges
– Storage: Store and process terabytes of data
● Complex data types, encoded data
– Analysis, visualization: support reports of existing system
● Qualifiers
– Reasonable: ad hoc, daily, hourly, weekly, monthly reports
Data data everywhere
● We have to use cookies in many places
● Cookies have limited size
● Cookies have complex values encoded
Some encoding tricks we might do
Raw cookie layout:
– LastSeen: long (64 bits)
– Segment: int (32 bits)
– Literal ','
– Segment: int (32 bits)
– Zipcode: int (32 bits)
Tricks (see the sketch below):
● Choose a relevant epoch and use a single byte for LastSeen
● Use a byte for the # of segments
● Use a 4-byte radix-encoded number
● ... and so on
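A sketch of that kind of packing, using a hypothetical layout (one byte of days-since-epoch for LastSeen, one byte for the segment count, four bytes per segment id):

import java.nio.ByteBuffer;

// Packs a few cookie fields into a compact binary value.
// The layout is hypothetical: 1 byte of "days since our epoch" for LastSeen,
// 1 byte for the number of segments, then 4 bytes per segment id.
public class CookiePacker {
    private static final long EPOCH_MILLIS = 1230768000000L;   // 2009-01-01, a "relevant epoch"
    private static final long MILLIS_PER_DAY = 86_400_000L;

    public static byte[] pack(long lastSeenMillis, int[] segments) {
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 * segments.length);
        buf.put((byte) ((lastSeenMillis - EPOCH_MILLIS) / MILLIS_PER_DAY)); // 1 byte instead of 8
        buf.put((byte) segments.length);                                    // # of segments
        for (int segment : segments) buf.putInt(segment);                   // 4 bytes each
        return buf.array();
    }

    public static void unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        long lastSeenMillis = EPOCH_MILLIS + (buf.get() & 0xFF) * MILLIS_PER_DAY;
        int count = buf.get() & 0xFF;
        System.out.print("lastSeen=" + lastSeenMillis + " segments=");
        for (int i = 0; i < count; i++) System.out.print(buf.getInt() + " ");
        System.out.println();
    }
}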
Getting at embedded data
● Write N UDFs for each object, like:
– getLastSeenForCookie(String)
– getZipcodeForCookie(String)
– ...
● But this would have made a huge toolkit
● Traditionally you do not want to break first
normal form
Struct solution
● Hive has a struct type like a C struct
● A struct is a list of name/value pairs
● Structs can contain other structs
● This gives us a real ability to do object mapping
● UDFs can return struct types (sketched below)
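A sketch of such a UDF returning a struct<lastSeen:bigint, zipcode:int>; the class name lines up with the registration on the next slide, but the decoding logic here is purely illustrative:

import java.util.Arrays;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Decodes an encoded cookie string into struct<lastSeen:bigint, zipcode:int>.
public class ParseCookieIntoStruct extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("parseCookie takes exactly one argument");
        }
        // Declare the return type so queries can do parseCookie(col).lastSeen
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("lastSeen", "zipcode"),
                Arrays.<ObjectInspector>asList(
                        PrimitiveObjectInspectorFactory.javaLongObjectInspector,
                        PrimitiveObjectInspectorFactory.javaIntObjectInspector));
    }

    @Override
    public Object evaluate(DeferredObject[] args) throws HiveException {
        Object arg = args[0].get();
        if (arg == null) {
            return null;
        }
        // Hypothetical encoding: "lastSeen|zipcode", e.g. "1232582400000|10001".
        String[] parts = arg.toString().split("\\|");
        return Arrays.asList(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
    }

    @Override
    public String getDisplayString(String[] children) {
        return "parseCookie(" + children[0] + ")";
    }
}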
Using a UDF
● add jar myjar.jar;
● create temporary function parseCookie as 'com.md6.ParseCookieIntoStruct';
● select parseCookie(encodedColumn).lastSeen from mydata;
LATERAL VIEW + EXPLODE
SELECT client_id, entry.spendcreativeid
FROM datatable
LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry
WHERE hit_date=20110321 AND mid=001406;
3214498023360851706 215286
3214498023360851706 195785
3214498023360851706 128640
All that data might boil down to...
Life lessons volume #3
● Big data is not only batch or real-time
● Big data is feedback loops
– Machine learning
– Ad hoc performance checks
● Generated SQL tables periodically synced to
web server
● Data shared between sections of an
organization to make business decisions
