Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S.A.Shivarkar
Assistant Professor
Contact No. 8275032712
Email: shivarkarsandipcomp@sanjivani.org.in
Subject: Foundation of Data Science (PECO311B)
Unit-I: Introduction to Data Science
DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon
Introduction to Data Science
• Defining data science and big data
• Recognizing the different types of data
• Gaining insight into the data science process
• Data Science Process: Overview, Different steps
• Machine Learning Definition and Relation with Data Science
Course Objective: To introduce the basics of data science
Course Outcome (CO1): Understand the concept and process of data science
Data Science Definition
Data science is the field of applying advanced analytics techniques and scientific
principles to extract valuable information from data for business decision-making,
strategic planning and other uses.
Data science is the study of data. It involves developing methods of recording,
storing, and analyzing data to effectively extract useful information. The goal of
data science is to gain insights and knowledge from any type of data,
both structured and unstructured.
Uses of Data Science
• In the healthcare industry, physicians use Data Science to analyze data from
wearable trackers to monitor their patients’ well-being and make vital decisions. Data
Science also enables hospital managers to reduce waiting times and enhance care.
• Retailers use Data Science to enhance customer experience and retention.
• Data Science is widely used in the banking and finance sectors for fraud detection
and personalized financial advice.
Benefits and Uses of Data Science
• Transportation providers use Data Science to improve their customers’ journeys.
For instance, Transport for London maps customer journeys to offer
personalized travel details and uses statistical data to manage unexpected circumstances.
• Construction companies use Data Science for better decision making by tracking
activities, including the average time to complete tasks, materials-based expenses, and
more.
• Data Science enables capturing and analyzing massive volumes of data from manufacturing
processes, data that has so far gone untapped.
• With Data Science, one can analyze massive graphical, temporal, and geospatial
data to draw insights. It also helps in seismic interpretation and reservoir characterization.
• Data Science enables firms to leverage social media content to obtain real-time media
content usage patterns. This lets the firms create target audience-specific content,
measure content performance, and recommend on-demand content.
• Data Science helps study utility consumption in the energy and utility domain. Such studies
allow better control of utility use and enhanced consumer feedback.
• Data Science applications in the public service field include health-related research, financial
market analysis, fraud detection, energy exploration, environmental protection, and more.
Benefits of Data Science
1) Increases business predictability
• When a company invests in structuring its data, it can carry out what is called predictive
analysis. With the help of a data scientist, technologies such as
Machine Learning and Artificial Intelligence can be applied to the data that the company has
to produce more precise analyses of what is to come.
• This increases the predictability of the business and lets you make decisions today that
will positively impact its future.
2) Ensures real-time intelligence
• The data scientist can work with RPA (Robotic Process Automation) professionals to
identify the different data sources of the business and create automated dashboards that
query all this data in real time in an integrated manner.
• This intelligence is essential for the company’s managers to make faster and more accurate
decisions.
3) Favors the marketing and sales area
• Data scientists can integrate data from different sources, bringing even more accurate insights
to their team. Can you imagine obtaining the entire customer journey map, covering all the
touch points your customer has had with your brand? This is possible with Data Science.
4) Improves data security
• One of the benefits of Data Science is the work done in the area of data security, where
the possibilities are broad.
• Data scientists work on fraud prevention systems, for example, to keep the company’s
customers safer. They can also study recurring patterns of behavior in a
company’s systems to identify possible architectural flaws.
5) Helps interpret complex data
• Data Science is a great solution when we want to cross-reference different data to understand the
business and the market better. Depending on the tools we use to collect data, we can combine data
from “physical” and virtual sources for better visualization.
6) Facilitates the decision-making process
• One of the benefits of Data Science is improving the decision-making process. This is because
we can create tools to view data in real time, giving business managers more agility. This is
done both through dashboards and through the projections made possible by the data scientist’s treatment
of data.
“Data is the new science. Big data holds the answers.”
– Pat Gelsinger, CEO, VMware
Data Science and Big Data
Big data includes:
• Structured data: transaction data, OLTP, RDBMS, and other structured formats
• Semi-structured data: text files, system log files, XML files, etc.
• Unstructured data: web pages, sensor data, mobile data, online data sources,
digital audio and video feeds, digital images, tweets, blogs,
emails, social networks, and other sources
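To make the three categories concrete, the following minimal sketch holds one tiny record set in each form. All data values, tag names, and column names are made up for illustration; only Python standard-library modules are used.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields (hypothetical data).
structured = io.StringIO("order_id,amount\n1,250.0\n2,99.5\n")
rows = list(csv.DictReader(structured))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: tags describe the content, but fields may vary per record.
xml_text = "<orders><order id='1'><amount>250.0</amount></order><order id='2'/></orders>"
root = ET.fromstring(xml_text)
ids = [order.get("id") for order in root.findall("order")]
with_amount = [o.get("id") for o in root.findall("order") if o.find("amount") is not None]

# Unstructured: free text with no schema; only generic text operations apply directly.
tweet = "Loving the new phone, battery lasts all day!"
word_count = len(tweet.split())

print(total, ids, with_amount, word_count)
```

Note how the second XML order has no amount at all: semi-structured formats tolerate this, whereas the CSV header fixes the fields for every row.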
Difference Between Data Science and Big Data
Meaning:
Big Data:
• Large volumes of data that can’t be handled using a normal database
program.
• Characterized by velocity, volume, and variety.
Data Science:
• A data-focused scientific activity, similar in nature to data mining.
• Harnesses the potential of big data to support business decisions.
• Includes approaches to process big data.
Concept:
Big Data:
• Includes all formats and types of data.
• Diverse data types are generated from several different sources.
Data Science:
• Helps organizations make decisions.
• Provides techniques to help extract insights and information from large
datasets.
• A specialized approach that involves scientific programming tools, techniques,
and models to process big data.
Basis of Formation:
Big Data:
• Data is generated from system logs.
• Data is created in organizations: emails, spreadsheets, databases, transactions, online
discussion forums.
• Video and audio streams, including live feeds.
• Electronic devices: RFID, sensors, and so on.
• Internet traffic and users.
Data Science:
• Working applications are built by programming the developed models.
• Captures complex patterns from big data and develops models.
• Is related to data analysis, preparation, and filtering.
• Applies scientific methods to find the knowledge in big data.
Approach:
Big Data:
• To understand the market and gain new customers.
• To find sustainability.
• To establish realistic ROI and metrics.
• To leverage datasets for the advantage of the business.
• To gain competitiveness.
• To develop business agility.
Data Science:
• Data visualization and prediction.
• Data acquisition, preparation, processing, publishing, preservation, or destruction.
• Programming skills, such as SQL and NoSQL, and platforms such as Hadoop.
• State-of-the-art algorithms and techniques for data mining.
• Extensive use of statistics, mathematics, and other tools.
Application Areas:
Big Data:
• Security and law enforcement.
• Research and development.
• Commerce, sports, and health.
• Performance optimization.
• Optimizing business processes.
• Telecommunications and financial services.
Data Science:
• Web development.
• Fraud and risk detection.
• Image and speech recognition.
• Search and recommender systems, etc.
Different types of Data
In data science and big data we will come across many different types of data, and
each of them tends to require different tools and techniques. The main categories
of data are these:
1) Structured
2) Unstructured
3) Natural language
4) Machine-generated
5) Audio, video, and images
6) Streaming
1) Structured data:
• Data that conforms to a data model, has a well-defined structure, follows a consistent
order, and can be easily accessed and used by a person or a computer program.
• Structured data is usually stored in well-defined schemas such as databases.
• It is generally tabular, with columns and rows that clearly define its attributes.
• SQL (Structured Query Language) is often used to manage structured data stored in
databases.
Example: names, dates, addresses, credit card numbers, stock information,
geolocation, etc.
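As an illustration of how SQL works with structured data, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, joined DATE)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Asha", "Pune", "2023-01-15"),
        ("Ravi", "Mumbai", "2022-11-02"),
        ("Meera", "Pune", "2024-03-09"),
    ],
)

# Because every row follows the same schema, a declarative query can filter and sort it.
pune_customers = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY joined", ("Pune",)
).fetchall()
conn.close()

print(pune_customers)
```

The query names columns rather than positions in a file, which is only possible because the structure is known in advance.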
2) Unstructured data:
• Data that does not conform to a data model and has no easily identifiable
structure, so it cannot be used easily by a computer program.
• Unstructured data is not organised in a pre-defined manner and does not have a
pre-defined data model, so it is not a good fit for a mainstream relational database.
Example: videos, texts, images, document files, audio materials, email contents, etc.
3) Natural language:
• Natural language processing (also known as computational linguistics) is the
scientific study of language from a computational perspective, with a focus on the
interactions between natural (human) languages and computers.
• Natural Language Processing, or NLP for short, is broadly defined as the
automatic manipulation of natural language, such as speech and text, by software.
Example: personal voice assistants such as Siri, Cortana, Alexa, etc.
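A very small instance of "automatic manipulation of natural language by software" is splitting text into word tokens and counting them. The sketch below assumes English text and a deliberately naive tokenizer; the function name and the sample sentence are hypothetical, and real NLP systems use far richer processing.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical voice-assistant request.
freq = word_frequencies("Play the news, then play the weather report.")
print(freq.most_common(2))
```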
4) Machine-generated data:
• Machine-generated data (MGD) is information that is produced by mechanical or
digital devices.
• The term is often used to describe the data generated by an organization’s
industrial control systems as well as by mechanical devices that are designed to carry out
a single function.
Example: web server logs, call detail records, financial instrument trades, network
event logs, etc.
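Web server logs are a typical case: each line is written by software in a fixed layout, so a pattern can turn it into fields. The sketch below assumes the widely used Common Log Format; the function name and the sample line are made up for illustration.

```python
import re

# Pattern for the Common Log Format: host, identity, user, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the named fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Made-up log entry.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
record = parse_log_line(sample)
print(record["host"], record["path"], record["status"])
```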
5) Audio, video, and image data:
• Audio, image, and video are data types that pose specific challenges to a data
scientist.
• Audio and video serve purposes ranging from enhancing the experience of Web pages (e.g.
background audio) to serving music, family videos, presentations, etc.
• The Web Content Accessibility Guidelines recommend always providing
alternatives for time-based media, such as captions, descriptions, or sign language.
6) Streaming data:
• Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously.
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, e-commerce purchases, in-game
player activity, and information from social networks, financial trading floors, or geospatial
services.
Example: location data, fraud detection, real-time stock trades, marketing, sales,
and business analytics, customer/user activity, monitoring and reporting on
internal IT systems, etc.
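Because streaming data never arrives as one complete file, it is usually processed record by record. The following sketch keeps a running mean that is updated as each value arrives; the generator `sensor_readings` is a hypothetical stand-in for a live source, and the readings are made up.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, one value at a time."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

def sensor_readings():
    """Stand-in for a live source; a real stream would not have a known end."""
    for reading in [20.0, 22.0, 21.0, 25.0]:
        yield reading

means = list(running_mean(sensor_readings()))
print(means)
```

Only the count and the running total are kept in memory, so the same approach works however long the stream runs.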
Data Science Process
The data science process typically consists of six steps:
1) Setting the research goal
2) Retrieving data
3) Data preparation
4) Data exploration
5) Data modeling or model building
6) Presentation and automation
1) Setting the research goal:
• Data science is mostly applied in the context of an organization.
• When the business asks you to perform a data science project, you’ll first prepare a
project charter.
• This charter contains information such as what you’re going to research, how the
company benefits from that, what data and resources you need, a timetable, and
deliverables.
2) Retrieving data:
• The second step is to collect data. You’ve stated in the project charter which data
you need and where you can find it.
• In this step you ensure that you can use the data in your program, which means
checking the existence and quality of the data and your access to it.
• Data can also be delivered by third-party companies and takes many forms, ranging
from Excel spreadsheets to different types of databases.
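The checks in the retrieval step can be sketched as a small function that reports whether the expected columns exist and how many values are missing. The function name, the column names, and the sample extract are assumptions made for this example; `io.StringIO` stands in for a file opened from disk.

```python
import csv
import io

def check_source(handle, required_columns):
    """Report whether a CSV source has the needed columns and how many values are missing."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing_columns = [c for c in required_columns if c not in columns]
    rows = list(reader)
    empty_cells = sum(1 for row in rows for c in columns if not (row[c] or "").strip())
    return {"rows": len(rows), "missing_columns": missing_columns, "empty_cells": empty_cells}

# Hypothetical third-party extract; io.StringIO stands in for an opened file.
extract = io.StringIO("customer_id,region,revenue\n1,West,1200\n2,,950\n3,East,\n")
report = check_source(extract, ["customer_id", "revenue", "signup_date"])
print(report)
```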
3) Data preparation:
• Data collection is an error-prone process; in this phase you enhance the quality of
the data and prepare it for use in subsequent steps.
• This phase consists of three sub-phases:
• Data cleansing removes false values from a data source and inconsistencies across data sources.
• Data integration enriches data sources by combining information from multiple
data sources.
• Data transformation ensures that the data is in a suitable format for use in your
models.
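The three sub-phases can be sketched on a toy example. The two sources, their field names, and the rule that a non-positive amount is invalid are all assumptions made for illustration.

```python
# Two hypothetical sources describing the same customers.
sales = [
    {"customer_id": 1, "amount": "250"},
    {"customer_id": 2, "amount": "-1"},   # -1 is assumed here to be an invalid placeholder
    {"customer_id": 3, "amount": "99.5"},
]
regions = {1: "west", 2: "East", 3: "EAST"}

# Cleansing: drop records whose amount is not a positive number.
cleaned = [r for r in sales if float(r["amount"]) > 0]

# Integration: enrich each sale with the region from the second source,
# making the inconsistent region spelling uniform along the way.
integrated = [{**r, "region": regions[r["customer_id"]].title()} for r in cleaned]

# Transformation: convert the amount from text to a number for use in a model.
prepared = [{**r, "amount": float(r["amount"])} for r in integrated]

print(prepared)
```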
4) Data exploration:
• Data exploration is concerned with building a deeper understanding of your data.
• You try to understand how variables interact with each other, the distribution of
the data, and whether there are outliers.
• To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes by the abbreviation EDA, for Exploratory
Data Analysis.
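A minimal sketch of the descriptive-statistics part of EDA, using Python's statistics module. The order counts are made up, and the 1.5 × IQR rule used to flag outliers is one common convention, not the only one.

```python
import statistics

# Hypothetical daily order counts; 95 is deliberately extreme.
orders = [12, 15, 14, 13, 16, 15, 14, 95]

mean = statistics.mean(orders)
median = statistics.median(orders)

# Interquartile range (IQR) rule: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1
outliers = [x for x in orders if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(mean, median, outliers)
```

The gap between the mean and the median is itself a hint that the distribution is skewed by an extreme value.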
5) Data modeling or model building:
• In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question.
• You select a technique from the fields of statistics, machine learning, operations
research, and so on.
• Building a model is an iterative process that involves selecting the variables for
the model, executing the model, and model diagnostics.
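One pass of the select, execute, diagnose loop can be sketched with a one-variable linear regression fitted by ordinary least squares. The data and variable names are hypothetical, and linear regression is chosen here only as a simple example of a statistical technique.

```python
# Hypothetical data: advertising spend (selected predictor) and resulting sales (target).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """Model diagnostic: share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

slope, intercept = fit_line(spend, sales)                 # execute the model
fit_quality = r_squared(spend, sales, slope, intercept)   # diagnose it
print(round(slope, 2), round(intercept, 2), round(fit_quality, 3))
```

If the diagnostic were poor, the next iteration would try different variables or a different technique.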
6) Presentation and automation:
• Finally, you present the results to your business.
• These results can take many forms, ranging from presentations to research reports.
• Sometimes you’ll need to automate the execution of the process because the
business will want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
Machine Learning relation with Data Science:

Introduction to Data Science: data science process

  • 1. Sanjivani Rural Education Society’s Sanjivani College of Engineering, Kopargaon-423 603 (An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune) NACC ‘A’ Grade Accredited, ISO 9001:2015 Certified Department of Computer Engineering (NBA Accredited) Prof. S.A.Shivarkar Assistant Professor Contact No.8275032712 Email- shivarkarsandipcomp@sanjivani.org.in Subject- Foundation of Data Science (PECO311B) Unit-I: Introduction to Data Science
  • 2. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 2 Introduction to Data Science  Defining data science and big data  Recognizing the different types of data  Gaining insight into the data science process  Data Science Process: Overview, Different steps  Machine Learning Definition and Relation with Data Science
  • 3. Course Objectives : To introduce the data science basics Course Outcome(CO1) : Understand concept and process of data science Introduction to Data Science
  • 4. Defining data science and big data Recognizing the different types of data Gaining insight into the data science process Data Science Process: Overview, Different steps Machine Learning Definition and Relation with Data Science Introduction to Data Science
  • 6. Data Science Definition Data science is the field of applying advanced analytics techniques and scientific principles to extract valuable information from data for business decision-making, strategic planning and other uses. Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to effectively extract useful information. The goal of data science is to gain insights and knowledge from any type of data- both structured and unstructured.
  • 7. Uses of Data Science  In the healthcare industry, physicians use Data Science to analyze data from wearable trackers to ensure their patients’ well-being and make vital decisions. Data Science also enables hospital managers to reduce waiting time and enhance care. Retailers use Data Science to enhance customer experience and retention.  Data Science is widely used in the banking and finance sectors for fraud detection and personalized financial advice. Benefits and Uses of Data Science
  • 8.  Transportation providers use Data Science to enhance the transportation journeys of their customers. For instance, Transport for London maps customer journeys offering personalized transportation details, and manages unexpected circumstances using statistical data.  Construction companies use Data Science for better decision making by tracking activities, including average time for completing tasks, materials-based expenses, and more.  Data Science enables trapping and analyzing massive data from manufacturing processes, which has gone untapped so far. Benefits and Uses of Data Science
  • 9.  With Data Science, one can analyze massive graphical data, temporal data, and geospatial data to draw insights. It also helps in seismic interpretation and reservoir characterization.  Data Science facilitates firms to leverage social media content to obtain real-time media content usage patterns. This enables the firms to create target audience-specific content, measure content performance, and recommend on-demand content.  Data Science helps study utility consumption in the energy and utility domain. This study allows for better control of utility use and enhanced consumer feedback.  Data Science applications in the public service field include health-related research, financial market analysis, fraud detection, energy exploration, environmental protection, and more. Benefits and Uses of Data Science
  • 10. Benefits of Data Science 1)Increases business predictability  When a company invests in structuring its data, it can work with what we call predictive analysis. With the help of the data scientist, it is possible to use technologies such as Machine Learning and Artificial Intelligence to work with the data that the company has and, in this way, carry out more precise analyses of what is to come.  Thus, you increase the predictability of the business and can make decisions today that will positively impact the future of your business. Benefits and Uses of Data Science
  • 11. 2) Ensures real-time intelligence  The data scientist can work with RPA(Robotic Process Automation) professionals to identify the different data sources of their business and create automated dashboards, which search all this data in real-time in an integrated manner.  This intelligence is essential for the managers of your company to make more accurate and faster decisions. Benefits and Uses of Data Science
  • 12. 3) Favors the marketing and sales area  Data scientists can integrate data from different sources, bringing even more accurate insights to their team. Can you imagine obtaining the entire customer journey map considering all the touch points your customer had with your brand? This is possible with Data Science. 4) Improves data security One of the benefits of Data Science is the work done in the area of data security. In that sense, there is a world of possibilities. The data scientists work on fraud prevention systems, for example, to keep your company’s customers safer. On the other hand, he can also study recurring patterns of behavior in a company’s systems to identify possible architectural flaws. Benefits and Uses of Data Science
  • 13. 5) Helps interpret complex data  Data Science is a great solution when we want to cross different data to understand the business and the market better. Depending on the tools we use to collect data, we can mix data from “physical” and virtual sources for better visualization. 6) Facilitates the decision-making process  One of the benefits of Data Science is improving the decision-making process. This is because we can create tools to view data in real-time, allowing more agility for business managers. This is done both by dashboards and by the projections that are possible with the data scientist’s treatment of data. Benefits and Uses of Data Science
  • 14. “Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, VMware Data Science and Big Data
  • 15. Big data includes: Structured data: transaction data, OLTP, RDBMS, and other structured formats Semi-Structured: text files, system log files, XML files, etc.  Unstructured data: web pages, sensor data, mobile data, online data, sources, digital audio, and video feeds, digital images, tweets, blogs, emails, social networks, and other sources Data Science and Big Data
  • 16. Difference Between Data Science and Big Data Meaning: Big Data: Large volumes of data that can’t be handled using a normal database program. Characterized by velocity, volume, and variety. Data Science: Data focused scientific activity. Similar in nature to data mining. Harnesses the potential of big data to support business decisions. Includes approaches to process big data.
  • 17. Concept: Big Data: Includes all formats and types of data. Diverse data types are generated from several different sources. Data Science: Helps organizations make decisions. Provides techniques to help extract insights and information to create large datasets. A specialized approach that involves scientific programming tools, techniques, and models to process big data. Difference Between Data Science and Big Data
  • 18. Basis of Formation: Big Data: Data is generated from system logs. Data is created in organizations – emails, spreadsheets, DB, transactions, Online discussion forums. Video and audio streams that include live feeds. Electronic devices – RFID, sensors, and so on. Internet traffic and users. Data Science: Working apps are made by programming developed models. It captures complex patterns from big data and developed models. It is related to data analysis, preparation, and filtering. Applies scientific methods to find the knowledge in big data. Difference Between Data Science and Big Data
  • 19. Approach: Big Data: To understand the market and to gain new customers. To find sustainability. To establish realistic ROI and metrics. To leverage datasets for the advantage of the business. To gain competitiveness. To develop business agility. Data Science: Data Visualization and prediction. Data destroy, preserve, publishing, processing, preparation, or acquisition. Programming skills, like NoSQL, SQL, and Hadoop platforms. State-of-the-art algorithms and techniques for data mining. Involves the extensive use of statistics, mathematics, and other tools. Difference Between Data Science and Big Data
  • 20. Application Areas: Big Data: Security and law enforcement. Research and development. Commerce, Sports and health. Performance optimization Optimizing business processes. Telecommunications and Financial services. Data Science: Web development. Fraud and risk detection. Image and speech recognition. Search recommenders etc. Difference Between Data Science and Big Data
  • 21. Different types of Data In data science and big data we will come across many different types of data, and each of them tends to require different tools and techniques. The main categories of data are these: 1)Structured 2)Unstructured 3)Natural language 4)Machine-generated 5)Audio, video, and images 6)Streaming
  • 22. Different types of Data 1) Structured data: Data which conforms to a data model, has a well define structure, follows a consistent order and can be easily accessed and used by a person or a computer program. Structured data is usually stored in well-defined schemas such as Databases. It is generally tabular with column and rows that clearly define its attributes.  SQL (Structured Query language) is often used to manage structured data stored in Databases. Example: names, dates, addresses, credit card numbers, stock information, geolocation, etc.
  • 23. Different types of Data 2) Unstructured data:  Data which does not conforms to a data model and has no easily identifiable structure such that it can not be used by a computer program easily.  Unstructured data is not organised in a pre-defined manner or does not have a pre- defined data model, thus it is not a good fit for a mainstream relational database. Example: Videos, texts, images, document files, audio materials, email contents etc.
  • 24. Different types of Data 3) Natural language: Natural language processing (also known as Computational Linguistics) is the scientific study of language from a computational perspective, with a focus on the interactions between natural (human) languages and computers. Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. Example: personal voice assistants like Siri, Cortana, Alexa, etc.
  • 25. Different types of Data 4) Machine-generated data: Machine-generated data (MGD) is information that is produced by mechanical or digital devices. The term is often used to describe the data that is generated by an organization's industrial control systems as well as mechanical devices that are designed to carry out a single function. Example: web server logs, call detail records, financial instrument trades, network event logs, etc.
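As an illustration of working with machine-generated data, the sketch below parses one web server log line into named fields. It assumes the widely used Common Log Format; the sample line, the pattern name LOG_PATTERN, and the helper parse_log_line are hypothetical and not taken from the slides.

```python
import re

# Assumed layout: Common Log Format (host, identity, user, time, request, status, size).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical sample line, invented for illustration.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
parsed = parse_log_line(sample)
print(parsed)
```

Machine-generated records tend to follow a rigid layout like this one, so a single pattern can usually turn large volumes of log text into structured rows.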
  • 26. Different types of Data 5) Audio, video and image data: Audio, image, and video are data types that pose specific challenges to a data scientist. Audio and video are used for purposes ranging from enhancing the experience of Web pages (e.g. background audio) to serving music, family videos, presentations, etc. The Web Content Accessibility Guidelines recommend always providing alternatives for time-based media, such as captions, descriptions, or sign language.
  • 27. Different types of Data 6) Streaming data: Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously. Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, etc. Example: location data, fraud detection, real-time stock trades, marketing, sales, and business analytics, customer/user activity, monitoring and reporting on internal IT systems, etc.
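The defining feature of streaming data is that records are processed as they arrive rather than after the whole dataset is available. The sketch below imitates this with a Python generator that updates a running mean one record at a time. The function names running_mean and trade_prices and the price values are hypothetical; a real stream would come from a message queue or socket, not a hard-coded list.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, updated per record."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def trade_prices():
    # Stand-in for a live feed; the prices are invented for illustration.
    for price in [100.0, 102.0, 101.0, 105.0]:
        yield price


means = list(running_mean(trade_prices()))
print(means)
```

Only the count and the running total are kept in memory, so the same approach works for a stream that never ends.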
  • 28. Data Science Process The data science process typically consists of six steps: 1) Setting the research goal 2) Retrieving data 3) Data preparation 4) Data exploration 5) Data modelling or model building 6) Presentation and automation
  • 30. Data Science Process 1) Setting the research goal: Data science is mostly applied in the context of an organization. When the business asks you to perform a data science project, you'll first prepare a project charter. This charter contains information such as what you're going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables.
  • 31. Data Science Process 2) Retrieving data: The second step is to collect data. You've stated in the project charter which data you need and where you can find it. In this step you ensure that you can use the data in your program, which means checking the existence, quality, and accessibility of the data. Data can also be delivered by third-party companies and takes many forms, ranging from Excel spreadsheets to different types of databases.
  • 32. Data Science Process 3) Data preparation: Data collection is an error-prone process; in this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three sub-phases: data cleansing removes false values from a data source and inconsistencies across data sources, data integration enriches data sources by combining information from multiple data sources, and data transformation ensures that the data is in a suitable format for use in your models.
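The three sub-phases of data preparation can be sketched in plain Python as follows. The records, the field names, and the rule that a negative amount counts as a false value are all assumptions made for illustration; real projects define their own cleansing rules.

```python
# Hypothetical raw records from two sources, invented for illustration.
sales = [
    {"customer_id": 1, "amount": "250.0"},
    {"customer_id": 2, "amount": "-1"},     # assumed to be a false value
    {"customer_id": 3, "amount": "120.5"},
    {"customer_id": 3, "amount": "120.5"},  # exact duplicate
]
customers = {1: "Pune", 2: "Mumbai", 3: "Nashik"}

# 1) Data cleansing: drop negative amounts and exact duplicates.
seen = set()
cleaned = []
for row in sales:
    key = (row["customer_id"], row["amount"])
    if float(row["amount"]) < 0 or key in seen:
        continue
    seen.add(key)
    cleaned.append(row)

# 2) Data integration: enrich each sale with the customer's city.
integrated = [dict(row, city=customers[row["customer_id"]]) for row in cleaned]

# 3) Data transformation: convert amounts from text to numbers for modeling.
prepared = [
    {"customer_id": r["customer_id"], "city": r["city"], "amount": float(r["amount"])}
    for r in integrated
]
print(prepared)
```

Each sub-phase takes the output of the previous one, so the order cleansing, integration, transformation is kept explicit here.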
  • 33. Data Science Process 4) Data exploration: Data exploration is concerned with building a deeper understanding of your data. You try to understand how variables interact with each other, the distribution of the data, and whether there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.
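A small sketch of the descriptive-statistics part of EDA, using Python's standard statistics module. The measurements are hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention chosen for this example, not something prescribed by the slides.

```python
import statistics

# Hypothetical measurements, invented for illustration.
values = [12, 14, 13, 15, 14, 13, 48]

mean = statistics.mean(values)
median = statistics.median(values)

# Quartiles describe the spread; the interquartile range (IQR) is q3 - q1.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed convention: flag points beyond 1.5 * IQR from the quartiles.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("mean:", mean, "median:", median, "outliers:", outliers)
```

The gap between the mean and the median already hints that one value is pulling the average upward, which the outlier check then confirms.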
  • 34. Data Science Process 5) Data modeling or model building: In this phase you use models, domain knowledge, and insights about the data you found in the previous steps to answer the research question. You select a technique from the fields of statistics, machine learning, operations research, and so on. Building a model is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics.
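The loop of selecting variables, executing the model, and running diagnostics can be illustrated with the simplest possible case: a one-variable linear regression fitted by ordinary least squares. The data points are hypothetical, and linear regression is only one of many techniques that could be chosen at this step.

```python
# Hypothetical data: x is the selected input variable, y is the target.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Execute the model: ordinary least squares for y = slope * x + intercept.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Model diagnostics: R squared, the share of variance the model explains.
predictions = [slope * xi + intercept for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print("slope:", slope, "intercept:", intercept, "R^2:", r_squared)
```

If the diagnostic were poor, the next iteration would go back to variable selection or try a different technique.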
  • 35. Data Science Process 6) Presentation and automation: Finally, you present the results to your business. These results can take many forms, ranging from presentations to research reports. Sometimes you'll need to automate the execution of the process because the business will want to use the insights you gained in another project or enable an operational process to use the outcome from your model.
  • 36. Machine Learning relation with Data Science: