Big Data for Oracle Professionals
Arup Nanda
Big Data Explorer
Time
Growth
Tweet @ArupNanda
Hadoop
Map/Reduce
YARN
NoSQL
Spark
Flume.
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400] "GET /news/news.html HTTP/1.0" 200 16716 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
ppp931.on.bellglobal.com - - [26/Apr/2000:00:16:12 -0400] "GET /download/windows/asctab31.zip HTTP/1.0" 200 1540096 "http://www.htmlgoodies.com/downloads/freeware/webdevelopment/15.html" "Mozilla/4.7 [en]C-SYMPA (Win95; U)"
123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET /pics/wpaper.gif HTTP/1.0" 200 6248 "http://www.jafsoft.com/asctortf/" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - - [26/Apr/2000:00:23:47 -0400] "GET /asctortf/ HTTP/1.0" 200 8130 "http://search.netscape.com/Computers/Data_Formats/Document/Text/RTF" "Mozilla/4.05 (Macintosh; I; PPC)"
123.123.123.123 - -
fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400] "GET /contacts.html HTTP/1.0" 200 4595 "-" "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
petabytes
unpredictable format
transient.
Metadata Repository
Volume
Variety
Velocity
CUSTOMERS
CUST_ID
NAME
ADDRESS
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSE
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSES
CUST_ID
NAME
CURRENT
CUSTOMERS
CUST_ID
NAME
ADDRESS
SPOUSES
CUST_ID
NAME
CURRENT
EMPLOYERS
CUST_ID
NAME
CURRENT
Name = Data
Relationship status = Data
Married to = Data
In a relationship with = Data
Friends = Data, Data, Data
Likes = Data, Data
Mutually Exclusive, Maybe not?
Multiple Data Points
First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School
First Name John
Spouse Jane
Child Jill
Goes to Acme School
First Name Martha
Child goes to Acme School
First Name Martha
Child goes to Acme School
Teacher Mrs Gillen
Teacher Mrs Gillen
Jill
First Name John
Spouse Jane
Child Jill
Goes to Acme School
Teacher Mr Fullmeister
First Name Irene
Boyfriend Henry
Works at Starwood
Hobby Photography
Ex-Spouse Jane
First Name Irene
Key Value
Key-Value Pair
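The profiles on the preceding slides can be pictured as collections of key-value pairs. The following is a minimal Python sketch, assuming each profile is held as a plain dictionary. The names `profiles` and `keys_in_use` and the added "Hobby" value are illustrative assumptions, not part of the original slides; the other attribute values are taken from the John, Martha and Irene examples.

```python
# Each profile is a set of key-value pairs; no two profiles need the same keys.
profiles = [
    {"First Name": "John", "Spouse": "Jane", "Child": "Jill",
     "Goes to": "Acme School", "Teacher": "Mr Fullmeister"},
    {"First Name": "Martha", "Child goes to": "Acme School",
     "Teacher": "Mrs Gillen"},
    {"First Name": "Irene", "Boyfriend": "Henry", "Works at": "Starwood",
     "Hobby": "Photography", "Ex-Spouse": "Jane"},
]


def keys_in_use(records):
    """Return every key that appears in at least one record, sorted."""
    return sorted({key for record in records for key in record})


# Adding an attribute is just adding a key: no new column, no new child table.
profiles[0]["Hobby"] = "Surfing"  # hypothetical value, for illustration only

print(keys_in_use(profiles))
```

Unlike the CUSTOMERS, SPOUSES and EMPLOYERS tables earlier, nothing has to be declared up front; the price is that a program reading the data must cope with keys that are missing.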
John Smith and his wife Jane,
along with their daughter Jill,
were strolling on the beach
when they heard a crash. John
ran towards …
Map
begin
  like_count := 0
  no_comment := 0
  while (there_are_remaining_posts) loop
    get post
    extract status of "like" for the specific post
    if status = "like" then
      like_count := like_count + 1
    else
      no_comment := no_comment + 1
    end if
  end loop
end
Counter()
Counter() Counter() Counter()
Counter() Counter() Counter()
Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Likes=300, No Comments=900
Reduce
Map/Reduce
Map: Dividing the work among different nodes
Reduce: Collating the results to get final answer
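The split between Map and Reduce can be sketched in a few lines of Python. This is an illustration only: it runs in a single process, whereas Hadoop would run each `counter` call on a separate node. The function and variable names are assumptions made for this sketch; the counts match the figures on the slide.

```python
from functools import reduce


def counter(posts):
    """Map step: count likes and no-comments in one node's share of posts."""
    likes = sum(1 for status in posts if status == "like")
    return {"likes": likes, "no_comments": len(posts) - likes}


def combine(total, partial):
    """Reduce step: add one node's partial counts to the running total."""
    return {key: total[key] + partial[key] for key in total}


# Three nodes, each holding its own slice of the posts (figures from the slide).
node_posts = [
    ["like"] * 100 + ["none"] * 300,
    ["like"] * 50 + ["none"] * 350,
    ["like"] * 150 + ["none"] * 250,
]

partials = [counter(posts) for posts in node_posts]  # Map
totals = reduce(combine, partials, {"likes": 0, "no_comments": 0})  # Reduce
print(totals)  # {'likes': 300, 'no_comments': 900}
```

Each `counter` call needs only its own slice of the data, which is why the work can be spread across nodes that share nothing.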
Counter() Counter() Counter()
Likes=100, No Comments=300
Likes=50, No Comments=350
Likes=150, No Comments=250
Likes=300, No Comments=900
• Divide the workload
• Submit and track the jobs
• If a job fails, restart it on another node
• …
Hadoop
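The "restart it on another node" behaviour in the list above can be pictured with a toy sketch. This is not Hadoop's scheduler; `run_with_failover`, `count_likes` and the node names are hypothetical, and a node failure is simulated by raising `RuntimeError`.

```python
def run_with_failover(task, nodes):
    """Try the task on each node in turn and return the first success."""
    failures = []
    for node in nodes:
        try:
            return node, task(node)
        except RuntimeError as exc:
            # Record the failure, then move on to the next node.
            failures.append(f"{node}: {exc}")
    raise RuntimeError("job failed on every node: " + "; ".join(failures))


def count_likes(node):
    """Hypothetical job; node1 is pretended to be down."""
    if node == "node1":
        raise RuntimeError("node unavailable")
    return 100


print(run_with_failover(count_likes, ["node1", "node2", "node3"]))  # ('node2', 100)
```

Restarting elsewhere only works because another node holds a copy of the same data, which is what the redundant storage described on the HDFS slides provides.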
Resource Management
Applications
YARN
Yet Another Resource Negotiator
Map Reduce v2.
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3 | 2 3 1 | 3 1 2
Hadoop Distributed Filesystem (HDFS)
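The diagram shows every block stored on more than one node's local filesystem. The following toy Python sketch spreads replicas round-robin to illustrate the idea. It is an assumption-laden simplification: real HDFS placement is decided by the NameNode and takes rack layout into account, and the function names here are made up for illustration.

```python
def place_blocks(num_blocks, num_nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > num_nodes:
        raise ValueError("replication factor cannot exceed number of nodes")
    nodes = {node: [] for node in range(1, num_nodes + 1)}
    for block in range(1, num_blocks + 1):
        for copy in range(replication):
            # Copies of a block go to consecutive nodes, wrapping around.
            node = (block - 1 + copy) % num_nodes + 1
            nodes[node].append(block)
    return nodes


def surviving_blocks(nodes, failed_node):
    """Blocks still readable when one node is lost."""
    return {block
            for node, blocks in nodes.items() if node != failed_node
            for block in blocks}


layout = place_blocks(num_blocks=4, num_nodes=4, replication=2)
print(layout)
print(surviving_blocks(layout, failed_node=1))
```

With two copies of each block, losing any single node still leaves every block readable, which is the sense in which storage is redundant by design.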
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3 | 2 3 1 | 3 1 2
Comparison with RAC
• Not shared storage
• Data is discrete
• Version control not required
• Concurrency not required
• Transactional integrity across nodes not required.
Advantages of Hadoop
• Processors need not be super-fast
• Immensely scalable
• Storage is redundant by design
• No RAID level required.
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3 | 2 3 1 | 3 1 2
Scalable?
ACID Properties
Reliability at a cost
Large overhead in data processing
Website logs
Combine with structured data
SOAP Messages
Twitter, Facebook …
Data Access: through programs
NoSQL Databases
Key Value
Key Value DB
Key Document
Document DB.
Key Value
Key Value
Key Value
{
empID:1,
empName:Larry,
salary:infinity
}
SQL-interface required
Hive
HiveQL
Creating a Hive Table
create table accounts (
accno int,
accname string,
balance float
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/hive/db1.db/accounts'
select count(*) as cnt
from store_sales ss
join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
join store s on (s.s_store_sk = ss.ss_store_sk)
where
t.t_hour = 8
and t.t_minute >= 30
and hd.hd_dep_count = 2
order by cnt;
HiveQL
Map/Reduce: Divide the work and collate the results. Needs development in Java, Python, Ruby, etc.
Pig: A framework to work on the dataset in parallel
Pig Latin: Scripting language for Pig
SQL
select category, avg(pagerank)
from urls
where pagerank > 0.2
group by category
having count(*) > 1000000
Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls)>1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
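The Pig Latin script names each intermediate result, where the SQL states the whole query at once. The same step-by-step data flow can be sketched in plain Python. The list-of-dictionaries input and the function name are assumptions made for this illustration, and nothing here runs in parallel the way Pig would.

```python
from collections import defaultdict
from statistics import mean


def avg_pagerank_by_category(urls, min_pagerank=0.2, min_count=1_000_000):
    """Mirror the four Pig Latin statements, one step per statement."""
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u["pagerank"])

    # big_groups = FILTER groups BY COUNT(good_urls) > 1000000
    big_groups = {cat: ranks for cat, ranks in groups.items()
                  if len(ranks) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {cat: mean(ranks) for cat, ranks in big_groups.items()}


# Tiny made-up sample, with the group-size threshold lowered to 1.
sample = [
    {"category": "news", "pagerank": 0.5},
    {"category": "news", "pagerank": 0.25},
    {"category": "news", "pagerank": 0.1},
    {"category": "blog", "pagerank": 0.9},
]
print(avg_pagerank_by_category(sample, min_count=1))
```

Each intermediate name (`good_urls`, `groups`, `big_groups`) can be inspected on its own, which is the procedural style the slide contrasts with declarative SQL.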
HBase: A database built on Hadoop
HiveQL: An SQL-like (but not the same) query language
Pig: Procedural logic without M/R code.
normal programming
languages, e.g. Python
YARN
Map/Reduce code in Java
Spark
Counter() Counter() Counter()
Filesystem Filesystem Filesystem
1 2 3 | 2 3 1 | 3 1 2
Hadoop processing in files
Memory is cheaper
Interactive processing needs faster access.
Spark
Core
SparkShell SparkSQL MLlib SparkR PySpark
Can use Java, Python or Scala
Divide and conquer is the key
Non-shared division of data is important
Local access
Redundancy
Hadoop is a framework
You have to write the programs
Big data is batch-oriented
Hive is SQL-like
Pig Latin is a 4GL-like scripting language
Spark uses memory
Oh, I so want to Learn!
Cloudera – prebuilt VMs
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cloudera_quickstart_vm.html
Hortonworks – prebuilt VMs
https://hortonworks.com/downloads/#sandbox
Thanks!
arup.blogspot.com @ArupNanda
