DATA SCIENCE BIG DATA
Yazan abu al failat
AGENDA
• Data science and scientists
• Cycle of big data management
• Big data architecture
WHAT IS DATA SCIENCE
• Data Science deals with both structured and unstructured
data. It is a field that encompasses anything related
to data cleansing, preparation, and analysis.
• Data Science is an umbrella term for techniques used
when trying to extract insights and information from
data.
WHAT DO DATA SCIENTISTS DO?
• Data scientists combine statistics, mathematics,
programming, and problem-solving. They capture
data in ingenious ways and analyze it to
find patterns, alongside the activities of
cleansing, preparing, and aligning the data.
THE CYCLE OF BIG DATA MANAGEMENT
(FUNCTIONAL REQUIREMENTS)
Implementation
WHAT ELSE IS NEEDED TO
SUPPORT THE FUNCTIONAL REQUIREMENTS
• Your needs will depend on the nature of the analysis
you are supporting.
• Performance: the right amount of computational power and
speed.
• Your architecture also has to have the right amount of
redundancy so that you are protected from
unanticipated latency and downtime.
• To address these points, consider the questions on the
next slide.
ASK THE FOLLOWING
• How much data will my organization need to manage today
and in the future?
• How often will my organization need to manage data in
real time or near real time?
• How much risk can my organization afford? Is my industry
subject to strict security requirements?
• How important is speed to my need to manage data?
• How certain or precise does the data need to be?
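The first question can be made concrete with a rough projection. The sketch below assumes a constant yearly growth rate, which is a simplification, and the starting volume and growth rate are hypothetical numbers rather than figures from these slides.

```python
def projected_volume_tb(current_tb: float, yearly_growth: float, years: int) -> float:
    """Project data volume assuming compound growth at a constant yearly rate."""
    return current_tb * (1 + yearly_growth) ** years


# Hypothetical inputs: 50 TB today, growing 40% per year.
for year in range(6):
    print(f"year {year}: {projected_volume_tb(50, 0.40, year):.1f} TB")
```

Even a crude estimate like this shows why storage and compute sized only for today's volume may be outgrown within a few years.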
BIG DATA ARCHITECTURE
• A big data management
architecture must
include a variety of
services that enable
companies to make
use of myriad data
sources in a fast and
effective manner.
INTERFACES
• What makes big data big is the fact that it relies on picking
up lots of data from lots of sources.
• Therefore, open application programming interfaces (APIs)
will be core to any big data architecture.
• Keep in mind that interfaces exist at every level and
between every layer of the stack.
• Without integration services, big data can’t happen.
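One way to picture an interface layer is a single contract that every source adapter implements, so the rest of the stack does not care where records come from. The sketch below is only an illustration: the `DataSource` protocol, the two adapter classes, and the sample data are hypothetical and do not correspond to any specific product's API.

```python
from __future__ import annotations

import csv
import io
import json
from typing import Iterable, Iterator, Protocol


class DataSource(Protocol):
    """Hypothetical common interface that every source adapter exposes."""

    def records(self) -> Iterator[dict]: ...


class CsvSource:
    """Adapter for structured, tabular text."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[dict]:
        yield from csv.DictReader(io.StringIO(self._text))


class JsonLinesSource:
    """Adapter for semi-structured data, one JSON object per line."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[dict]:
        for line in self._text.splitlines():
            if line.strip():
                yield json.loads(line)


def ingest(sources: Iterable[DataSource]) -> list[dict]:
    """Pull records from every source through the same interface."""
    return [record for source in sources for record in source.records()]


crm = CsvSource("customer,city\nalice,Amman\nbob,Irbid\n")
sensors = JsonLinesSource('{"sensor": "t1", "temp_c": 21.5}\n')
all_records = ingest([crm, sensors])
print(len(all_records), "records ingested")
```

Adding a new source then means writing one more adapter, not changing the code that consumes the records.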
REDUNDANT PHYSICAL INFRASTRUCTURE
• Without the availability of robust physical infrastructures, big data
would probably not have emerged as such an important trend.
• To support an unanticipated or unpredictable volume of data, a
physical infrastructure for big data has to be different from that for
traditional data.
• The physical infrastructure is based on a distributed computing
model. This means that data may be physically stored in many
different locations and can be linked together through networks,
the use of a distributed file system, and various big data analytic tools
and applications.
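A minimal sketch of the idea of spreading data across locations with redundancy: each block is assigned to more than one node, so a single failure does not make it unreadable. The node names, replica count, and hash-based placement rule are assumptions chosen for illustration. Real distributed file systems use more elaborate placement policies.

```python
from __future__ import annotations

import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical storage nodes
REPLICAS = 2  # each block is stored on this many distinct nodes


def placement(key: str, nodes: list[str] = NODES, replicas: int = REPLICAS) -> list[str]:
    """Pick `replicas` consecutive nodes, starting at a hash-derived position."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def readable_from(key: str, failed: set[str]) -> list[str]:
    """Nodes that still hold a copy of `key` after some nodes have failed."""
    return [node for node in placement(key) if node not in failed]


for block in ["orders-2024-01", "orders-2024-02", "clicks-2024-01"]:
    print(block, "->", placement(block))
```

With two replicas, losing any one node still leaves one readable copy of every block.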
REDUNDANT PHYSICAL INFRASTRUCTURE
• Redundancy is important because we are dealing with so much data from
so many different sources.
• Redundancy comes in many forms. If your company has created a private
cloud, you will want to have redundancy built within the private
environment so that it can scale out to support changing workloads.
• If your company wants to contain internal IT growth, it may use external
cloud services to augment its internal resources. In some cases, this
redundancy may come in the form of a Software as a Service (SaaS) offering
that allows companies to do sophisticated data analysis as a service.
• The SaaS approach offers lower costs, quicker startup, and seamless
evolution of the underlying technology.
SECURITY INFRASTRUCTURE
• The more important big data analysis becomes to
companies, the more important it will be to secure that data.
• You will need to take into account:
• who is allowed to see the data
• under what circumstances they are allowed to see the data.
• You will need to be able to verify the identity of users
as well as protect the identity of the people the data describes, such as patients.
• These types of security requirements need to be part of the
big data architecture.
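The two questions above (who may see the data, and under what circumstances) can be sketched as a small policy check, combined with pseudonymization to protect patient identity. Everything here is hypothetical: the roles, purposes, policy table, and key are illustrative, and user authentication is assumed to have happened before this code runs.

```python
import hashlib
import hmac

# Hypothetical policy: which role may read which data set, and for what purpose.
POLICY = {
    ("clinician", "patient_records"): {"treatment"},
    ("analyst", "patient_records"): {"research"},
}

# Assumption: in practice this key would come from a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def may_read(role: str, dataset: str, purpose: str) -> bool:
    """Who is allowed to see the data, and under what circumstances."""
    return purpose in POLICY.get((role, dataset), set())


def pseudonymize(patient_id: str) -> str:
    """Replace an identifier with a keyed hash so the raw ID is never exposed."""
    return hmac.new(SECRET_KEY, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def view_for(role: str, purpose: str, record: dict) -> dict:
    """Return the version of a record that this role may see, or refuse."""
    if not may_read(role, "patient_records", purpose):
        raise PermissionError(f"{role} may not read patient_records for {purpose}")
    if role == "analyst":
        return {**record, "patient_id": pseudonymize(record["patient_id"])}
    return record


record = {"patient_id": "P-1001", "diagnosis": "asthma"}
print(view_for("analyst", "research", record))
```

The analyst can still group records by the pseudonymized ID, but cannot recover the original identifier without the key.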
OPERATIONAL DATA SOURCES
• Traditionally, an operational data source consisted of highly structured
data managed by a company in a relational database.
• But as the world changes, it is important to
understand that operational data now has to
encompass a broader set of data sources,
including unstructured sources such as
social media.
OPERATIONAL DATA SOURCES
• New approaches to data management are emerging
in the big data world, including document,
graph, columnar, and geospatial database architectures.
• These are collectively referred to as NoSQL, or “not only SQL”, databases.
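To make two of these models more tangible, the sketch below holds the same records first as self-contained documents and then in a columnar layout. It only illustrates the data layouts with plain Python structures. It is not the storage format of any particular database, and the sample records are made up.

```python
import json

# Document model: each record is self-contained and may have its own fields.
documents = [
    {"id": 1, "name": "Alice", "tags": ["vip"], "address": {"city": "Amman"}},
    {"id": 2, "name": "Bob"},  # no tags or address: no fixed schema is required
]


def to_columns(rows, fields):
    """Columnar layout: one list per field, so a scan touches only the fields it needs."""
    return {field: [row.get(field) for row in rows] for field in fields}


columns = to_columns(documents, ["id", "name"])
print(json.dumps(documents[0]))
print(columns)
```

The document form suits records whose shape varies, while the columnar form suits analytical scans over a few fields of many records.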
OPERATIONAL DATA SOURCES
CHARACTERISTICS
• All these operational data sources have several characteristics in
common:
• ✓ They represent systems of record that keep track of the critical
data required for real-time, day-to-day operation of the business.
• ✓ They are continually updated based on transactions happening
within business units and from the web.
• ✓ For these sources to provide an accurate representation of the
business, they must blend structured and unstructured data.
• ✓ These systems also must be able to scale to support thousands of
users on a consistent basis. These might include transactional
e-commerce systems, customer relationship management systems, or
call center applications.
PERFORMANCE
• Your data architecture also needs to perform in
concert with your organization’s supporting
infrastructure.
• A complex analytical model might take days to run using a
traditional server configuration. However, using a
distributed computing model, what took days might
now take minutes.
• Performance might also determine the kind of
database you would use.
PERFORMANCE – GRAPH DATABASE
• For highly connected data, a graph database might be a better choice, as it is
specifically designed to separate the “nodes” (entities) from
their “properties” (the information that defines each entity) and
the “edges” (the relationships between nodes).
• Using the right database will also improve performance.
Typically the graph database will be used in scientific and
technical applications.
GRAPH EXAMPLE
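A small sketch of the node, property, and edge structure described on the previous slide, using plain Python structures in place of a real graph database. The people, company, and relationship names are hypothetical.

```python
from collections import deque

# Nodes carry properties; edges are stored separately as (source, relationship, target).
nodes = {
    "alice": {"type": "person", "city": "Amman"},
    "bob": {"type": "person", "city": "Irbid"},
    "carol": {"type": "person", "city": "Zarqa"},
    "acme": {"type": "company"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("carol", "WORKS_AT", "acme"),
]


def neighbors(node, relationship=None):
    """Targets reachable from `node` over one outgoing edge, optionally filtered by type."""
    return [t for s, r, t in edges if s == node and relationship in (None, r)]


def shortest_path(start, goal):
    """Breadth-first search over outgoing edges; returns a list of nodes or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(shortest_path("alice", "acme"))  # ['alice', 'bob', 'carol', 'acme']
```

Questions such as "how is Alice connected to Acme?" become a traversal over edges rather than a chain of table joins, which is where a graph model tends to pay off.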
ORGANIZING DATA SERVICES AND
TOOLS
• Not all the data that organizations use is operational.
• A growing amount of data comes from a variety of sources
that aren’t quite as organized or straightforward, including
data that comes from machines or sensors, and massive
public and private data sources.
ORGANIZING DATA SERVICES AND TOOLS
• In the past, most companies weren’t able to either capture or store
this vast amount of data. It was simply too expensive or too
overwhelming.
• Even if companies were able to capture the data, they did not have
the tools to do anything with it. Very few tools could make sense of
these vast amounts of data. The tools that did exist were complex to
use and did not produce results in a reasonable time frame.
• In the end, those who really wanted to analyze this data were
forced to work with snapshots of it. This has the undesirable
effect of missing important events because they were not captured in a
particular snapshot.
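The snapshot problem can be shown in a few lines. The readings, threshold, and sampling interval below are hypothetical values chosen so that the one important event falls between snapshots.

```python
# Hypothetical sensor readings, one per minute; the spike at index 7 is the event of interest.
readings = [20, 21, 20, 22, 21, 20, 21, 95, 22, 21, 20, 21]
THRESHOLD = 90


def alerts(values):
    """Positions (within `values`) of readings above the threshold."""
    return [i for i, v in enumerate(values) if v > THRESHOLD]


snapshot = readings[::5]   # keep only every 5th reading
print(alerts(snapshot))    # [] : the spike is not in the snapshot
print(alerts(readings))    # [7] : the full data set contains it
```

Analysis on the snapshot reports nothing unusual, while the full data shows the spike.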
MAPREDUCE, HADOOP, AND BIG TABLE
• Emerging companies with very large user networks needed to find new technologies that
would allow them to store, access, and analyze huge amounts of
data in near real time,
so that they could monetize the benefits of owning this much data
about participants in their networks.
• Their resulting solutions are transforming the data management
market.
• The innovations MapReduce, Hadoop, and Big Table introduced a new
generation of data management.
• These technologies address one of the most fundamental
problems: the capability to process massive amounts of data.
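As a starting point for the MapReduce programming model, here is a single-process word count split into map, shuffle, and reduce steps. It is a sketch of the model only. A framework such as Hadoop runs these phases in parallel across many machines, and the function names and input splits here are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


splits = ["big data is big", "data about data"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread over as many workers as the data requires.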
ACTIVITY
What are MapReduce, Big Table,
and Hadoop?
