Big Data with Hadoop, Spark and BigQuery
Google Cloud Next Extended 2017
Speaker: Imam Raza
Speaker.bio.toString()
Senior Software Architect @Folio3
Specialities:
Designing scalable enterprise software architecture.
Designing scalable mobile apps.
IBM Big Data certified professional.
MongoDB certified professional.
About this presentation
me.loveQuestion == true. Let's have an interactive session.
The content is based on industry experience.
There will be some lab sessions.
We will switch gears now and then with interesting Silicon Valley facts.
Agenda
What is Big Data?
What are the Big Data components?
What is Hadoop?
What is Spark?
What is BigQuery?
Designing scalable vs. fashionable applications.
What is Big Data?
Google Trends on Big Data
Big Data Definition
Five Vs of Big Data
1st V: Velocity
2nd V: Volume
Growth in Global Data
How big is a Zettabyte?
3rd V: Variety
4th V: Veracity
5th V: Value
Value
Big Data Business Implementation
Recommendation Engines
Netflix show “House of Cards” was an immediate hit
Big data business applications
Better understand and target customers
Understand and optimize Business process
Improving Health
Improving security and Law enforcement
Improving sports performance
Improving and optimizing Cities and Countries
Types of Big Data Sources
Structured Data (RDBMS, Spreadsheets)
Unstructured Data (raw data)
Semi-Structured Data (XML, JSON)
Switching gears
A mandatory book for Silicon Valley graduates looking for jobs.
Big Data Ecosystem
Big Data Tools
● Hadoop
● YARN
● Mesos
● Spark
● BigQuery
● BigSQL
● Kafka
● Hive
● Pig
● Sqoop
● ZooKeeper
● HBase
● Shark
● Cassandra
● MongoDB
● CouchDB
● Cloudera
● Pentaho
● etc
Big Data with Hadoop, Spark and BigQuery (Google Cloud Next Extended 2017, Karachi)
Spark Hadoop System
Hadoop
Hadoop is an open-source software framework that supports data-intensive distributed applications.
A Hadoop cluster is composed of a single master node and multiple worker nodes.
Hadoop Primary Components
HDFS (Hadoop Distributed File System): stores large amounts of data.
MapReduce programming model: processes large amounts of data.
HDFS
Moving Code to Data Philosophy
If code and data are on different machines, one of them must be moved to the other machine before the code can be executed on the data.
If the code is smaller than the data, it is better to send the code to the machine holding the data than the other way around, assuming all the machines are equally fast.
In the world of Big Data, the code is almost always smaller than the data.
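The trade-off above can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 1 Gbit/s link, the 10 MB job archive, and the 1 TB dataset are assumed figures, not numbers from the talk, and the helper name `transfer_seconds` is hypothetical.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Time to push `size_bytes` over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Assumed, illustrative figures (not from the talk).
LINK_SPEED = 1e9   # 1 Gbit/s network link
CODE_SIZE = 10e6   # a 10 MB job archive
DATA_SIZE = 1e12   # a 1 TB dataset

code_time = transfer_seconds(CODE_SIZE, LINK_SPEED)
data_time = transfer_seconds(DATA_SIZE, LINK_SPEED)

print(f"Move code to data: {code_time:.2f} s")
print(f"Move data to code: {data_time:.0f} s")
```

Under these assumptions, shipping the code takes a fraction of a second while shipping the data takes hours, which is why Hadoop schedules work on the nodes that already hold the data.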
Job/Task Management
MapReduce
MapReduce Example
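The example on the original slide was a figure and is not reproduced here. As a stand-in, the sketch below shows word count, the conventional MapReduce illustration, written in plain Python. It runs in a single process and does not use Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the three stages of the model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data with hadoop", "big data with spark"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network; only the division of labour between the three stages carries over from this sketch.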
                        Hadoop / MapReduce                      RDBMS
Size of data            Petabytes                               Gigabytes
Integrity of data       Low                                     High (referential, typed)
Data schema             Dynamic                                 Static
Access method           Batch                                   Interactive and batch
Scaling                 Linear                                  Nonlinear (worse than linear)
Data structure          Unstructured                            Structured
Normalization of data   Not required                            Required
Query response time     Has latency (due to batch processing)   Can be near immediate
Apache Spark
Apache Spark is a cluster computing technology designed for fast computation.
It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation, including interactive queries and stream processing.
Apache Spark features
Speed: according to the Spark project's own figures, Spark can run an application in a Hadoop cluster up to 100 times faster than MapReduce in memory, and up to 10 times faster when running on disk.
Multi-language support: provides built-in APIs in Java, Scala, and Python.
Advanced analytics: supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Apache Spark Libs
Apache Spark Lab Session
Via http://datascientistworkbench.com
Switching gears
Silicon Valley wakes up early in the morning.
Big Query
BigQuery
A service that enables interactive analysis of massively large datasets.
Based on Dremel, a scalable, interactive ad hoc query system for analysis of read-only nested data.
Works in conjunction with Google Cloud Storage.
Has a RESTful web service interface.
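To make the RESTful interface concrete, here is a minimal sketch that builds (but does not send) an HTTP request for the `jobs.query` method of the BigQuery REST API. The project ID `my-project`, the placeholder access token, and the helper name `build_query_request` are assumptions for illustration; the query runs against the public `bigquery-public-data.samples.shakespeare` sample table. Sending the request for real requires a valid OAuth 2.0 access token and a Google Cloud project with BigQuery enabled.

```python
import json
import urllib.request

# Hypothetical project ID; replace with a real Google Cloud project.
PROJECT_ID = "my-project"

# jobs.query endpoint of the BigQuery REST API (v2).
URL = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/queries"

# Standard SQL over the public Shakespeare sample table.
SQL = """
SELECT word, SUM(word_count) AS total
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY word
ORDER BY total DESC
LIMIT 10
"""


def build_query_request(access_token: str) -> urllib.request.Request:
    """Build the POST request for a synchronous query; nothing is sent here."""
    body = json.dumps({"query": SQL, "useLegacySql": False}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


# The token below is a placeholder, not a working credential.
request = build_query_request("PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` with a real token would submit the query. In practice the web console, the command-line tools, or the language libraries are usually more convenient than calling the REST API directly.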
BigQuery
You can issue SQL queries over big data
Interactive web interface
Response times kept as low as possible
Auto-scales under the hood
BigQuery
SaaS (/ PaaS)
Interfacing:
REST API
Web console
Command line tools
Language libraries
Insert only
BigQuery Lab Session
Via https://bigquery.cloud.google.com
Switching gears
Zareen's is a Pakistani restaurant near Google in Mountain View.
1477 Plymouth Street, Suite C
Mountain View, CA 94043
http://www.zareensrestaurant.com/
Designing Scalable vs. Fashionable Apps
References
https://cloud.google.com/bigquery/public-data/
https://bigquery.cloud.google.com
IBM Big Data virtual lab (https://datascientistworkbench.com/)
IBM Big Data University (http://bigdatauniversity.com)
Questions

Editor's Notes

  • #10: …refers to the vast amounts of data generated every second. We are not talking Terabytes but Zettabytes or Brontobytes. If we take all the data generated in the world between the beginning of time and 2000, the same amount of data will soon be generated every minute. New big data tools use distributed systems so that we can store and analyse data across databases that are dotted around anywhere in the world.
  • #11: Quintillion 10^18
  • #13: How big is a zettabyte? One bit is binary. It's either a one or a zero. Eight bits make up one byte, and 1024 bytes make up one kilobyte. 1024 kilobytes make up one megabyte. Large videos and DVDs will be in gigabytes, where 1024 megabytes make up one gigabyte of storage space. These days we have USB memory sticks that can store a few dozen gigabytes of information, while computers and hard drives now store terabytes of information. One terabyte is 1024 gigabytes. 1024 terabytes make up one petabyte, and 1024 petabytes make up an exabyte. Think of a big urban city or a busy international airport like Heathrow, JFK, O'Hare, Dubai, or O. R. Tambo in Johannesburg. And now we're talking petabytes and exabytes. All those airplanes are capturing and transmitting data. All the people in those airports have mobile devices. Also consider the security cameras and all the staff in and around the airport. A digital universe study conducted by IDC claimed digital information reached 0.8 zettabytes last year and predicted this number would grow to 35 zettabytes by 2020. It is predicted that by 2020, one tenth of the world's data will be produced by machines, and most of the world's data will be produced in emerging markets. It is also predicted that the amount of data produced will increasingly outpace available storage. Advances in cloud computing have contributed
  • #14: Refers to the different types of data we can now use. In the past we only focused on structured data that neatly fitted into tables or relational databases, such as financial data. In fact, 80% of the world’s data is unstructured (text, images, video, voice, etc.). With big data technology we can now analyse and bring together data of different types such as messages, social media conversations, photos, sensor data, video or voice recordings.
  • #15: Big Data veracity refers to the biases, noise and abnormality in data, that is, to the messiness or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content), but technology now allows us to work with this type of data.
  • #20: The first season of the show was released in 2013 and it was an immediate hit. At the time, the New York Times reported that Netflix executives knew that House of Cards would be a hit before they even filmed it. But how did they know that? Big data. Netflix has a lot of data. Netflix knows the time of day when movies are watched. It logs when users pause, rewind and fast forward. It has ratings from millions of users as well as information on the searches they make. By looking at all this data, Netflix knew that many of its users had streamed the work of David Fincher and that films featuring Kevin Spacey had always done well. It knew that the British version of House of Cards had also done well. It also knew that people who liked Fincher also liked Spacey. All this information suggested that buying the series would be a good bet for the company, and in fact it was. In other words, thanks to big data, Netflix knows what people want before they do.
  • #21: Better understand and target customers: To better understand and target customers, companies expand their traditional data sets with social media data, browser logs, text analytics or sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Using big data, telecom companies can now better predict customer churn, retailers can predict what products will sell, and car insurance companies can understand how well their customers actually drive.
    Understand and optimize business processes: Big data is also increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictive models generated from social media data, web search trends and weather forecasts. Another example is supply chain or delivery route optimization using data from geographic positioning and radio frequency identification sensors.
    Improving health: The computing power of big data analytics enables us to find new cures and better understand and predict disease patterns. We can use all the data from smart watches and wearable devices to better understand links between lifestyles and diseases. Big data analytics also allow us to monitor and predict epidemics and disease outbreaks, simply by listening to what people are saying (e.g. “Feeling rubbish today - in bed with a cold”) or searching for on the Internet (e.g. “cures for flu”).
    Improving security and law enforcement: Security services use big data analytics to foil terrorist plots and detect cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions.
    Improving sports performance: Most elite sports have now embraced big data analytics. Many use video analytics to track the performance of every player in a football or baseball game, sensor technology is built into sports equipment such as basketballs or golf clubs, and many elite sports teams track athletes outside of the sporting environment, using smart technology to track nutrition and sleep, as well as social media conversations to monitor emotional wellbeing.
  • #28: Apache Kafka, ElasticSearch, Cassandra, Mesos
  • #29: Apache Kafka, ElasticSearch, Cassandra, Mesos