Data Analytics (BCA Science: VI)
By Prof. Vrushali Solanke
What is Data Science?
• Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data that is processed using scientific methods, different technologies, and algorithms.
• It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.
• Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is closely tied to the future of artificial intelligence.
Introduction of Data Science and Data Analytics
Data science component:
Data science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.
Data science is all about:
Examples:
• You have probably seen smart watches, or maybe you use one yourself. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.
• Tesla is famous for using data science (e.g. deep learning) for its self-driving cars.
Need for data science:
The following are some of the main reasons for using data science technology:
• With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
• Data science technology is being adopted by many companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms to give consumers a better experience.
• Data science is helping to automate transportation, for example by creating self-driving cars, which are the future of transportation.
• Data science can help with different kinds of prediction, such as surveys, elections, flight ticket confirmation, etc.
• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinctive business advantage.
• Data science can help you detect fraud using advanced machine learning algorithms, which helps prevent significant monetary losses.
• It allows you to build intelligent capabilities into machines. For example, you can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to make better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your business.
Basics of data:
• Data: Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.
• There are 3 main categories of data:
1) Structured data
2) Unstructured data
3) Semi-structured data
Structured data:
• Structured data usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making them simple to search. Data may be human- or machine-generated, as long as it is created within an RDBMS structure. This format is easy to search, both with human-written queries and with algorithms that use the data type (such as alphabetical or numeric, currency or date) and field names.
• Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
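As a minimal sketch of how SQL queries structured data, the Python snippet below uses the standard-library sqlite3 module with a temporary in-memory database. The table name `sales`, its columns, and the sample rows are hypothetical and serve only to illustrate the idea.

```python
import sqlite3

def total_sales_by_item(rows):
    """Load (item, quantity, unit_price) rows into a table and total the revenue per item."""
    conn = sqlite3.connect(":memory:")  # temporary in-memory relational database
    conn.execute("CREATE TABLE sales (item TEXT, quantity INTEGER, unit_price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # Field names and types make the data directly searchable with SQL.
    cursor = conn.execute(
        "SELECT item, SUM(quantity * unit_price) FROM sales GROUP BY item ORDER BY item"
    )
    result = cursor.fetchall()
    conn.close()
    return result

# Hypothetical sales transactions
sample_rows = [("pen", 10, 1.5), ("book", 2, 12.0), ("pen", 4, 1.5)]
totals = total_sales_by_item(sample_rows)
```

Because every row follows the same schema, the query can group and sum without inspecting the text of any individual record.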
What Is Unstructured Data?
• Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via predefined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored in a non-relational (NoSQL) database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: it is sometimes referred to as semi-structured. However, its message field is unstructured and traditional analytical tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, phone recordings, collaboration software (like Microsoft Teams, Google Docs, etc.).
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data:
Machine-generated data is information that is automatically created by a computer, process, application, or other machine without human intervention.
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic
imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Semi-structured data:
Semi-structured data maintains internal tags and markings that
identify separate data elements, which enables information
grouping and hierarchies.
Email is a very common example of a semi-structured data
type. Email’s native metadata enables classification and
keyword searching without any additional tools.
Sharing sensor data is a growing use case, as are Web-based
data sharing and transport: electronic data interchange (EDI),
many social media platforms, document markup languages,
and NoSQL databases.
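To illustrate why email counts as semi-structured, here is a small sketch using Python's standard-library email module. The addresses and message text are made up for the example. The headers behave like named fields, while the body remains free text.

```python
from email import message_from_string

# Hypothetical raw email: header fields, a blank line, then the free-text body
raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly sales report\n"
    "\n"
    "Hi Bob, the numbers look good this quarter. Let's talk on Monday.\n"
)

message = message_from_string(raw_email)

# Structured part: metadata fields can be looked up by name.
sender = message["From"]
subject = message["Subject"]

# Unstructured part: the body is plain text with no predefined fields.
body = message.get_payload()

# Keyword search on the metadata needs no text analysis of the body.
is_sales_related = "sales" in subject.lower()
```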
Examples of Semi-structured Data
• Markup language XML: This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web.
• Open standard JSON (JavaScript Object Notation): JSON is another semi-structured data interchange format. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers.
• NoSQL: Semi-structured data is also an important element of many NoSQL (“not only SQL”) databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice for storing information that does not easily fit into the record and table format, such as text with varying lengths. It also allows for easier data exchange between databases. Some newer NoSQL databases like MongoDB and Couchbase also incorporate semi-structured documents by natively storing them in the JSON format.
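The short sketch below shows the same hypothetical record in JSON and in XML, read with Python's standard json and xml.etree.ElementTree modules. The record contents and tag names are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record in two semi-structured formats
json_text = '{"name": "Asha", "skills": ["Python", "SQL"]}'
xml_text = "<person><name>Asha</name><skill>Python</skill><skill>SQL</skill></person>"

# JSON: name/value pairs become a dict, ordered value lists become a list.
record = json.loads(json_text)
json_skills = record["skills"]

# XML: tags identify the separate data elements.
root = ET.fromstring(xml_text)
xml_name = root.find("name").text
xml_skills = [element.text for element in root.findall("skill")]
```

In both cases the tags or names travel with the data, so no separate schema is needed to read the record.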
Differences between Structured, Semi-structured and Unstructured data:

| Properties | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data |
| Transaction management | Mature transactions and various concurrency techniques | Transactions are adapted from DBMS and are not mature | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole |
| Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Most flexible; there is no schema |
| Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | Most scalable |
| Robustness | Very robust | New technology, not very widespread | - |
| Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible |
Basics of Data Science:
Data science provides an understanding of the data, i.e. the term is used in the sense of information extraction.
Insight: It is gained by analysing data and information to understand what is going on in a particular situation.
Data: It is a raw, unorganized set of information.
Data → Information → Insights
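The progression from data to information to insight can be sketched in a few lines of Python. The step counts and the daily target below are hypothetical values, in the spirit of the smart watch example.

```python
from statistics import mean

# Data: raw, unorganized values (hypothetical daily step counts)
raw_steps = {"Mon": 4200, "Tue": 8100, "Wed": 3900, "Thu": 9500, "Fri": 7600}

# Information: the data organized and summarized
average_steps = mean(raw_steps.values())

# Insight: what the information says about the situation
target = 6000  # hypothetical daily goal
days_below_target = [day for day, steps in raw_steps.items() if steps < target]
```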
Need for data science:
• Today, most data is unstructured or semi-structured.
• Sources of this data include financial logs, text files, multimedia forms, sensors, and instruments.
• Simple BI tools are not capable of processing this huge volume and variety of data.
• This is why we require more complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights from it.
• Data science is all about uncovering findings from data.
• Example: Netflix.
What is Data Science?
• Data science means turning raw data into insights in order to make better decisions.
• Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
• It is the art and science of extracting actionable insights from raw data.
• Data science is also known as data-driven science.
Data science is often defined using the well-known Venn diagram by Drew Conway.
Basic areas in Data science:
• Mathematics and statistics
• Computer programming
• Domain knowledge
Data science can add value in the following ways:
1. It empowers management and officers to make better decisions.
2. It helps direct actions based on trends, which helps in defining goals.
3. It helps staff adopt best practices and focus on issues that matter.
4. It helps identify opportunities and supports decision making with quantifiable data.
5. It helps identify and refine the target audience for a business.
What is data science? Is it statistics or machine learning?
• Statistics is a tool or method for data science, while data science is a wide domain in which statistical methods are an essential component.
• Not every statistician is a data scientist, and not every data scientist is a statistician.
• Machine learning can be defined as the practice of using algorithms to take in data, learn from it, and then forecast future trends for that topic.
• Data science uses machine learning as a tool to provide insights.
• Data science is a multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems.
| Sr. No. | Features | BI | Data Science |
|---|---|---|---|
| 1 | Data sources | Structured (usually SQL, often Data Warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text) |
| 2 | Approach | Statistics and visualization | Statistics, Machine Learning, Graph Analysis, NLP |
| 3 | Focus | Past and present | Present and future |
| 4 | Tools | Pentaho, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R |
| Sr. No. | Machine Learning | Data Science |
|---|---|---|
| 1 | A subset of AI that focuses on a narrow range of activities | Not exactly a subset of machine learning, but it uses machine learning to analyze data and make future predictions |
| 2 | Develops new (individual) models | Explores many models; builds and tunes hybrids |
| 3 | Proves mathematical properties of models | Seeks to understand empirical properties of models |
| 4 | Improves/validates on a few relatively clean, small datasets | Develops/uses tools that can handle massive datasets |
| 5 | Produces predictions | Produces insights |
| 6 | Is a part of data science | An all-encompassing term that includes aspects of machine learning for functionality |
Data Scientist:
• A data scientist is a professional responsible for collecting, analysing, and interpreting large amounts of data to identify ways to help a business improve its operations.
• A data scientist has sufficient knowledge of business needs, domain knowledge, analytical skills, and programming expertise to manage end-to-end scientific methods at each stage of working with big data.
• Responsibilities of a data scientist:
1) Collecting large amounts of unruly data and transforming it into a more usable format.
2) Solving business-related problems using data-driven techniques.
3) Working with a variety of programming languages.
4) Having a solid grasp of statistics, including statistical tests and distributions.
5) Knowing the leading analytical techniques, such as machine learning, deep learning, and text analytics.
6) Looking for order and patterns in data, as well as spotting trends.
Categories of data scientist:
• Data scientists can be classified into 4 categories:
1) Data Developer: Developer, Engineer
2) Data Researcher: Researcher, Scientist, Statistician
3) Data Creative: Jack of all trades, Artist, Hacker
4) Data Businessperson: Leader, Businessperson, Entrepreneur
Skills for Data Scientist:
• Programming skills: A data scientist should have command of a programming language like R or Python and a database query language like SQL.
• Statistics: A data scientist should know about statistical tests, distributions, and maximum likelihood estimators.
• Machine learning: At large companies with huge amounts of data (e.g. Netflix, Google, Amazon), it may be essential to be familiar with ML methods like k-nearest neighbors, random forests, etc.
• Multivariable calculus and linear algebra: These concepts matter most at companies where the product is defined by data.
• Data visualization and communication: This is especially important at younger companies that are making data-driven decisions for the first time. It is important not just to be familiar with visualization tools (e.g. Tableau) but also to understand the principles behind visually encoding data.
Data Science Process:
• The process of discovering useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
• The data science process involves:
1) Understanding the problem
2) Preparing the data samples
3) Developing the model
4) Applying the model to a dataset to see how it may work in the real world
5) Deploying and maintaining the models
Step-1: Prior knowledge:
i) Prior knowledge helps to define what problem is being solved, how it fits in the business context, and what data is needed to solve it.
ii) The data science process starts with the need for analysis, a question, or a business objective.
iii) Without a well-defined statement of the problem, it is impossible to come up with the right dataset.
iv) The data science process is explained here using a hypothetical use case.
Step-2: Data Preparation:
i) Preparing a dataset that suits the task is the most time-consuming part of the process.
ii) This phase consists of three sub-phases: data cleaning removes false values from a data source and inconsistencies across data sources; data integration enriches data sources by combining information from multiple data sources; and data transformation ensures that the data is in a suitable format for use in your model. Data exploration is concerned with building a deeper understanding of your data.
• Sampling: It is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. Sampling reduces the amount of data that needs to be processed and speeds up the model build process.
• Model build process: In this process, it is necessary to segment the dataset into training and test samples.
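Sampling followed by a training/test split can be sketched with Python's standard random module. The function name, the sample size, and the 80/20 split below are illustrative assumptions, not values prescribed by the process.

```python
import random

def sample_and_split(records, sample_size, test_fraction, seed=0):
    """Draw a random sample of records, then split it into training and test parts."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    sample = rng.sample(records, sample_size)  # sampling without replacement
    n_test = round(len(sample) * test_fraction)
    test = sample[:n_test]
    train = sample[n_test:]
    return train, test

records = list(range(1000))  # stand-in for 1000 dataset rows
train, test = sample_and_split(records, sample_size=100, test_fraction=0.2)
```

Only 100 of the 1000 records are processed, and no record appears in both the training and the test part.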
Step-3: Modeling:
i) A model is an abstract representation of the data. This step creates a representative model inferred from the data.
ii) Training dataset: The dataset used to create the model, with known attributes and target, is called the training dataset.
iii) Test dataset or validation dataset: The validity of the created model also needs to be checked with another known dataset, called the test or validation dataset.
iv) Building the model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics.
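As one simple illustration of building a model on a training dataset and checking it on a test dataset, the sketch below fits a straight line by least squares. The choice of a linear model and the data points are assumptions made for the example; any other model type could take its place.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on the training data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_absolute_error(model, xs, ys):
    """Average absolute difference between predictions and known targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data that roughly follows y = 2x + 1
train_x, train_y = [1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 8.8, 11.0]
test_x, test_y = [6, 7], [13.1, 14.8]

model = fit_line(train_x, train_y)  # created from the training dataset
test_error = mean_absolute_error(model, test_x, test_y)  # checked on the test dataset
```

A small error on the test dataset suggests that the model generalizes beyond the records it was built from.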
Step-4: Application:
i) Deployment: Here the model becomes production ready.
ii) The model deployment stage has to deal with assessing model readiness, technical integration, response time, model maintenance, and assimilation.
iii) Evaluation: It is the part of the process where you test whether you have a good model, before deploying or presenting it.
Step-5: Knowledge:
i) The data science process starts with prior knowledge and ends with posterior knowledge.
Stages in a data science project:
Stage-1: Data Acquisition:
i) A data science project begins with identifying various data sources (e.g. logs from web servers, social media data, data from online repositories like census datasets, data streams from online sources via APIs, web scraping).
ii) Data acquisition involves acquiring data from all the identified internal and external sources that answer the business questions.
iii) The main job of a data scientist in this stage is to track where each data slice comes from and whether it is up to date. It is important to track this information during the entire lifecycle of a data science project.
Stage-2: Data Preparation:
i) Data acquired in the first stage is often not in a usable format for running the required analysis and might contain missing entries, inconsistencies, and semantic errors.
ii) Next, data scientists have to clean and reformat the data by manually editing it in a spreadsheet or by writing code. This step does not by itself produce any meaningful insights.
iii) Through regular data cleaning, data scientists can identify what faults exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results.
iv) After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into a data science tool.
v) Exploratory data analysis: It forms an integral part of this stage, as summarizing the clean data can help identify outliers, anomalies, and patterns.
vi) Through this step, data scientists come to know what they actually need to do with the data. This stage is time-consuming.
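Cleaning by writing code might look like the following minimal sketch, which uses only Python's standard library. The CSV contents, the column names, and the outlier threshold of 20 degrees are hypothetical.

```python
import csv
import io
import json
from statistics import median

# Hypothetical raw data with a missing entry and inconsistent city names
raw_csv = """city,temperature_c
Pune,31
pune ,29
Mumbai,
Mumbai,33
Nagpur,95
"""

def clean_rows(text):
    """Drop rows with missing values and normalize inconsistent city names."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        value = row["temperature_c"].strip()
        if not value:  # missing entry
            continue
        cleaned.append({"city": row["city"].strip().title(), "temperature_c": float(value)})
    return cleaned

rows = clean_rows(raw_csv)

# Simple exploratory check: flag values far from the median as possible outliers.
temperatures = [row["temperature_c"] for row in rows]
typical = median(temperatures)
outliers = [row for row in rows if abs(row["temperature_c"] - typical) > 20]

# Store the cleaned data in a format that other tools can load.
cleaned_json = json.dumps(rows)
```

The flagged row is only a candidate for review; deciding whether it is a real reading or an acquisition fault still requires knowledge of the data source.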
Stage-3: Hypothesis and Modelling:
i) This stage requires writing, running, and refining programs to analyse the data and derive meaningful insights from it. These programs may be written in Python, R, MATLAB, or Perl.
ii) Different machine learning techniques are applied to the data to identify the machine learning model that best fits the business needs. All the machine learning models are trained on the training dataset.
Stage-4: Evaluation and Interpretation:
i) Different problems call for different evaluation metrics. For example, when predicting daily stocks, RMSE (Root Mean Squared Error) should be considered for evaluation. For a model that classifies spam email messages, performance metrics like average accuracy, AUC, and log loss should be considered.
ii) Machine learning model performance should be measured and compared using validation and test sets to identify the best model based on model accuracy and overfitting.
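Three of the metrics named above can be written out directly in Python, as in this minimal sketch. The function names and sample values are illustrative, and the log loss shown is the binary form, which assumes predicted probabilities strictly between 0 and 1.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error for numeric predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(actual_labels, predicted_labels):
    """Fraction of labels predicted correctly."""
    correct = sum(1 for a, p in zip(actual_labels, predicted_labels) if a == p)
    return correct / len(actual_labels)

def log_loss(actual_labels, predicted_probabilities):
    """Average negative log-likelihood for binary labels (1 = spam, 0 = not spam).

    Probabilities must lie strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(actual_labels, predicted_probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual_labels)

# Hypothetical predictions
price_rmse = rmse([100.0, 102.0, 101.0], [101.0, 100.0, 101.0])
spam_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
spam_log_loss = log_loss([1, 0], [0.9, 0.2])
```

RMSE suits numeric targets such as prices, while accuracy and log loss suit classification; log loss additionally penalizes confident wrong probabilities.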
Stage-5: Deployment:
i) Machine learning models might have to be recoded before deployment, because data scientists may favour the Python programming language while the production environment supports Java.
ii) Models are first deployed in a pre-production or test environment before actually being deployed in production.
Stage-6: Operation/Maintenance:
i) This stage involves developing a plan for monitoring and maintaining the data science project in the long run.
Stage-7: Optimization:
i) It involves retraining the machine learning model in production whenever new data sources become available.
Thank you.

More Related Content

PPTX
Data analytics
PPTX
Data Mining: Application and trends in data mining
PPT
DATA WAREHOUSING AND DATA MINING
PPTX
Data science & data scientist
DOCX
Big data lecture notes
PDF
Predicting student performance using aggregated data sources
PDF
Exploratory data analysis data visualization
PDF
(2017/06)Practical points of deep learning for medical imaging
Data analytics
Data Mining: Application and trends in data mining
DATA WAREHOUSING AND DATA MINING
Data science & data scientist
Big data lecture notes
Predicting student performance using aggregated data sources
Exploratory data analysis data visualization
(2017/06)Practical points of deep learning for medical imaging

What's hot (20)

PPTX
Application of-image-segmentation-in-brain-tumor-detection
PPTX
Overfitting & Underfitting
PPT
Lecture 04 data resource management
PPTX
Data analytics vs. Data analysis
PPTX
Big data by Mithlesh sadh
PPTX
Presentation About Big Data (DBMS)
PDF
The Importance of Data Visualization
PPTX
Introduction to Data Analytics
PDF
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
PPT
Data Warehouse Basic Guide
PPTX
Data visualization-tools
DOC
Dbms Lecture Notes
PPTX
Big Data Analytics
PPTX
Exploratory data analysis with Python
PDF
Feature Engineering in Machine Learning
PDF
Performance Metrics for Machine Learning Algorithms
PPTX
Machine Learning and AI
PPTX
Data science life cycle
PPTX
5 v of big data
PDF
Data science presentation
Application of-image-segmentation-in-brain-tumor-detection
Overfitting & Underfitting
Lecture 04 data resource management
Data analytics vs. Data analysis
Big data by Mithlesh sadh
Presentation About Big Data (DBMS)
The Importance of Data Visualization
Introduction to Data Analytics
PREDICTION OF DIABETES MELLITUS USING MACHINE LEARNING TECHNIQUES
Data Warehouse Basic Guide
Data visualization-tools
Dbms Lecture Notes
Big Data Analytics
Exploratory data analysis with Python
Feature Engineering in Machine Learning
Performance Metrics for Machine Learning Algorithms
Machine Learning and AI
Data science life cycle
5 v of big data
Data science presentation
Ad

Similar to Introduction of Data Science and Data Analytics (20)

PPTX
Week-1-Introduction to Data Mining.pptx
PPTX
Chapter 2 - Intro to Data Sciences[2].pptx
PPSX
Introduction to Big Data Analytics.ppsx
PPTX
Chapter -2- Data science Emerging Tech.pptx
PPTX
Emerging Technology Chapter 2 Data Science
PPTX
2016 Chapter 2 - Intro. to Data Sciences.pptx
PDF
the study of data to extract meaningful insights for business
PPTX
introduction to data science
PPTX
Chapter 2- Data Science and big data.pptx
PPTX
Big data Analytics(BAD601) -module-1 ppt
PPTX
Big data visualization state of the art
PPTX
Introduction of big data unit 1
PPT
Database
PPT
Database
PDF
Data Science Introduction and Process in Data Science
PPTX
Chapter Two - Overview o g yuyjkgftdrrgty yufguif Data Science.pptx
PPTX
Data Science
DOCX
Core Concepts and Cutting Edge Technologies in Data Science
PPTX
Big data analytics(BAD601) module-1 ppt
PPTX
U - 2 Emerging.pptx
Week-1-Introduction to Data Mining.pptx
Chapter 2 - Intro to Data Sciences[2].pptx
Introduction to Big Data Analytics.ppsx
Chapter -2- Data science Emerging Tech.pptx
Emerging Technology Chapter 2 Data Science
2016 Chapter 2 - Intro. to Data Sciences.pptx
the study of data to extract meaningful insights for business
introduction to data science
Chapter 2- Data Science and big data.pptx
Big data Analytics(BAD601) -module-1 ppt
Big data visualization state of the art
Introduction of big data unit 1
Database
Database
Data Science Introduction and Process in Data Science
Chapter Two - Overview o g yuyjkgftdrrgty yufguif Data Science.pptx
Data Science
Core Concepts and Cutting Edge Technologies in Data Science
Big data analytics(BAD601) module-1 ppt
U - 2 Emerging.pptx
Ad

Recently uploaded (20)

PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
Database Infoormation System (DBIS).pptx
PPT
Quality review (1)_presentation of this 21
PDF
Fluorescence-microscope_Botany_detailed content
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PDF
Lecture1 pattern recognition............
PPTX
Computer network topology notes for revision
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPT
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
Major-Components-ofNKJNNKNKNKNKronment.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPT
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
PPTX
1_Introduction to advance data techniques.pptx
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Galatica Smart Energy Infrastructure Startup Pitch Deck
Database Infoormation System (DBIS).pptx
Quality review (1)_presentation of this 21
Fluorescence-microscope_Botany_detailed content
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Introduction-to-Cloud-ComputingFinal.pptx
Lecture1 pattern recognition............
Computer network topology notes for revision
Clinical guidelines as a resource for EBP(1).pdf
Chapter 3 METAL JOINING.pptnnnnnnnnnnnnn
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
05. PRACTICAL GUIDE TO MICROSOFT EXCEL.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Business Acumen Training GuidePresentation.pptx
Major-Components-ofNKJNNKNKNKNKronment.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Chapter 2 METAL FORMINGhhhhhhhjjjjmmmmmmmmm
1_Introduction to advance data techniques.pptx
Acceptance and paychological effects of mandatory extra coach I classes.pptx

Introduction of Data Science and Data Analytics

  • 2. What is Data Science?  Data science is the deep study of the massive amount of data which involves extracting meaningful insights from raw ,structured and unstructured data that is processed using scientific methods, different technologies, and algorithm.  It is multidisciplinary field that uses tools and techniques to manipulate the data so that you can find something new and meaningful.  Data science uses most powerful hardware ,programming system, and most efficient algorithm to solve the data related problems. It is future of artificial intelligence.
  • 5. Data science refer to emerging area of work concerned with collection, preparation, analysis, visualization, management, and preservation of large collection of information.
  • 6. Data science is all about:
  • 7. Examples:  I’m sure you have seen smart watches — or maybe you use one, too. These smart gadgets can measure your sleep quality, how much you walk, your heart rate, etc.
  • 8. Tesla is famous for using data science – e.g. deep learning – for their self-driving
  • 9. Need of the data science:
  • 10. Following are some main reasons for using data science technology:  With the help of data science technology, we can convert massive amount of raw and unstructured data into meaningful insights.  Data science technology is opting by various companies, whether it is big brand or start up. Google Amazon ,Netflix etc. which handle huge amount of data, are using data science algorithm for better consumers experience.  Data science is working for automating transportation such as creating self driving car, which is feature of transportation.  Data science can help in different prediction such as various survey ,election ,flight ticket confirmation, etc.
  • 11.  Data is the oil for today's world. With the right tools, technologies, algorithms, we can use data and convert it into a distinctive business advantage  Data Science can help you to detect fraud using advanced machine learning algorithms It helps you to prevent any significant monetary losses  Allows to build intelligence ability in machines.You can perform sentiment analysis to gauge customer brand loyalty  It enables you to take better and faster decisions  Helps you to recommend the right product to the right customer to enhance your business
  • 12. Basics of data:  Data:Data are characteristics or information, usually numeric, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.  There are 3 main category of data: 1)Structured data 2)Unstructured data 3)Semi structured data.
  • 15. Structured data:  Structured data usually resides in relational databases RDBMS. Fields store length- delineated data phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length like names are contained in records, making it a simple matter to search. Data may be human- or machine-generated as long as the data is created within an RDBMS structure. This format is eminently searchable both with human generated queries and via algorithms using type of data and field names, such as alphabetical or numeric, currency or date.  Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
  • 16. What Is Unstructured Data?  Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schema. It may be textual or non- textual, and human- or machine-generated. It may also be stored within a non-relational database like NoSQL.  Typical human-generated unstructured data includes: • Text files: Word processing, spreadsheets, presentations, email, logs. • Email: we sometimes refer to it as semi structured. However, its message field is unstructured and traditional analytical tools cannot parse it. • Social Media: Data from Facebook, Twitter, LinkedIn. • Website: YouTube, Instagram, photo sharing sites. • Mobile data: Text messages, locations. • Communications: Chat, phone recordings, collaboration software(like Microsoft Teams, Google Docs etc.). • Media: MP3, digital photos, audio and video files. • Business applications: MS Office documents, productivity applications.
  • 17.  Typical machine-generated unstructured data: Machine generated data is information that is automatically created by a computer, process, application, or other machine without human intervention. • Satellite imagery: Weather data, land forms, military movements. • Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data. • Digital surveillance: Surveillance photos and video. • Sensor data: Traffic, weather, oceanographic sensors.
  • 20. Semi-structured data : Semi-structured data maintains internal tags and markings that identify separate data elements, which enables information grouping and hierarchies. Email is a very common example of a semi-structured data type. Email’s native metadata enables classification and keyword searching without any additional tools. Sharing sensor data is a growing use case, as are Web-based data sharing and transport: electronic data interchange (EDI), many social media platforms, document markup languages, and NoSQL databases.
  • 21. Examples of Semi-structured Data • Markup language XML This is a semi-structured document language. XML is a set of document encoding rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web. • Open standard JSON (JavaScript Object Notation) JSON is another semi-structured data interchange format. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting data between web applications and servers. • NoSQL Semi-structured data is also an important element of many NoSQL (“not only SQL”) databases. NoSQL databases differ from relational databases because they do not separate the organization (schema) from the data. This makes NoSQL a better choice to store information that does not easily fit into the record and table format, such as text with varying lengths. It also allows for easier data exchange between databases. Some newer NoSQL databases like MongoDB and Couchbase also incorporate semi-structured documents by natively storing them in the JSON format.
  • 22. Differences between structured, semi-structured, and unstructured data:
  Properties | Structured data | Semi-structured data | Unstructured data
  Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
  Transaction management | Mature transaction management and various concurrency techniques | Transactions are adapted from the DBMS and are not mature | No transaction management and no concurrency
  Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole
  Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; there is no schema
  Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | More scalable
  Robustness | Very robust | New technology, not very widespread | -
  Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
  • 23. Basics of Data Science: Data science provides some understanding of the data; the term is also used to mean information extraction. Insight: It is gained by analysing data and information to understand what is going on in a particular situation. Data: It is a raw, unorganized set of information. Data → Information → Insights
  • 24. Need of Data Science: • Today, most data is unstructured or semi-structured. • Sources of this data include financial logs, text files, multimedia forms, sensors, and instruments. • Simple BI tools are not capable of processing this huge volume and variety of data. • This is why we require more complex and advanced analytical tools and algorithms for processing, analyzing, and drawing meaningful insights from it. • Data science is all about uncovering findings from data. • Example: Netflix.
  • 25. What is Data Science? • Turning raw data into insights to make better decisions. • Data science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. • It is the art and science of extracting actionable insights from raw data. • Data science is also known as data-driven science.
  • 26. Definition of data science by the famous Venn diagram (by Drew Conway).
  • 27. Basic areas in data science: • Mathematics and statistics • Computer programming • Domain knowledge. Data science can add value in the following ways: 1. It empowers management and officers to make better decisions. 2. It helps direct actions based on trends, which in turn helps define goals. 3. It helps staff adopt best practices and focus on issues that matter. 4. It helps identify opportunities and supports decision making with quantifiable data. 5. It helps identify and refine the target audience for a business.
  • 28. What is data science? Is it statistics or machine learning? • Statistics is a tool or method for data science, while data science is a wide domain in which statistical methods are an essential component. • Not every statistician is a data scientist, and not every data scientist is a statistician. • Machine learning can be defined as the practice of using algorithms to process data, learn from it, and then forecast future trends for that topic. • Data science uses machine learning as a tool to provide insights. • Data science is the multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems.
  • 29. Business Intelligence (BI) vs. Data Science:
  Sr. No. | Features | BI | Data Science
  1 | Data sources | Structured (usually SQL, often Data | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
  2 | Approach | Statistics and visualization | Statistics, machine learning, graph analysis, NLP
  3 | Focus | Past and present | Present and future
  4 | Tools | Pentaho, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R
  • 30. Machine Learning vs. Data Science:
  Sr. No. | Machine Learning | Data Science
  1 | A subset of AI that focuses on a narrow range of activities | Not exactly a subset of machine learning, but it uses machine learning to analyze data and make predictions about the future
  2 | Develops new (individual) models | Explores many models, builds and tunes hybrids
  3 | Proves mathematical properties of models | Understands empirical properties of models
  4 | Improves/validates on a few relatively clean, small datasets | Develops/uses tools that can handle massive datasets
  5 | Produces predictions | Produces insights
  6 | Is a part of data science | Is an all-encompassing term that includes aspects of machine learning for functionality
  • 31. Data Scientist: • A data scientist is a professional responsible for collecting, analysing, and interpreting large amounts of data to identify ways to help a business improve its operations. • A data scientist has sufficient knowledge of business needs, domain knowledge, analytical skills, and programming expertise to manage the end-to-end scientific method at each stage of working with big data. • Responsibilities of a data scientist: 1) Collecting large amounts of unruly data and transforming it into a more usable format. 2) Solving business-related problems using data-driven techniques. 3) Working with a variety of programming languages. 4) Having a solid grasp of statistics, including statistical tests and distributions. 5) Knowing the top analytical techniques, such as machine learning, deep learning, and text analytics. 6) Looking for order and patterns in data, as well as spotting trends.
  • 32. Categories of Data Scientist: • Data scientists are classified into 4 categories: 1) Data Developer: Developer, Engineer 2) Data Researcher: Researcher, Scientist, Statistician 3) Data Creative: Jack of all trades, Artist, Hacker 4) Data Businessperson: Leader, Businessperson, Entrepreneur
  • 33. Skills for a Data Scientist: • Programming skills: A data scientist should have command of a programming language such as R or Python and a database query language such as SQL. • Statistics: A data scientist should know statistical tests, distributions, and maximum likelihood estimators. • Machine learning: At large companies with huge amounts of data (e.g. Netflix, Google, Amazon), it may be essential to be familiar with ML methods such as k-nearest neighbors and random forests. • Multivariable calculus and linear algebra: These concepts are most important at companies where the product is defined by data. • Data visualization and communication: This skill is important for younger companies that are making data-driven decisions for the first time (e.g. using Tableau). It is important not just to be familiar with visualization tools but also to understand the principles behind visually encoding data.
  • 35. • The process of discovering useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process. • The data science process involves: 1) Understanding the problem 2) Preparing the data samples 3) Developing the model 4) Applying the model to a dataset to see how the model may work in the real world 5) Deploying and maintaining the models.
  • 36. Step-1: Prior knowledge: i) It helps to define what problem is being solved, how it fits in the business context, and what data is needed to solve the problem. ii) The data science process starts with the need for analysis, a question, or a business objective. iii) Without a well-defined statement of the problem, it is impossible to come up with the right dataset. iv) The data science process is explained using a hypothetical use case. Step-2: Data Preparation: i) Preparing a dataset that suits the task is the most time-consuming part of the process. ii) This phase consists of three sub-phases: data cleaning removes false values from a data source and inconsistencies across data sources; data integration enriches data sources by combining information from multiple data sources; and data transformation ensures that the data is in a suitable format for use in your model. Data exploration is concerned with building a deeper understanding of your data.
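The three data preparation sub-phases above can be sketched in a few lines of Python. This is only a minimal illustration under assumptions: the sales and customers records, the rule that a negative amount counts as a false value, and the choice of min-max scaling as the transformation are all invented for the example.

```python
# Hypothetical raw records from two sources; all names and values are invented.
sales = [
    {"customer_id": 1, "amount": "120.50"},
    {"customer_id": 2, "amount": ""},       # missing value
    {"customer_id": 3, "amount": "-40"},    # assumed invalid: negative amount
    {"customer_id": 4, "amount": "75"},
]
customers = {1: "Pune", 3: "Mumbai", 4: "Nagpur"}


def clean(rows):
    """Data cleaning: keep only rows whose amount is a non-negative number."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows where the amount is missing or not numeric
        if amount >= 0:
            cleaned.append({"customer_id": row["customer_id"], "amount": amount})
    return cleaned


def integrate(rows, city_by_customer):
    """Data integration: enrich each row with the city from a second source."""
    return [
        {**row, "city": city_by_customer.get(row["customer_id"], "unknown")}
        for row in rows
    ]


def transform(rows):
    """Data transformation: add the amount rescaled to the range [0, 1]."""
    amounts = [r["amount"] for r in rows]
    lo, hi = min(amounts), max(amounts)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all amounts are equal
    return [{**r, "amount_scaled": (r["amount"] - lo) / span} for r in rows]


prepared = transform(integrate(clean(sales), customers))
for row in prepared:
    print(row)
```

Each function covers one sub-phase, so the order of the calls mirrors the order in which the sub-phases are listed.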
  • 37. • Sampling: It is the process of selecting a subset of records as a representation of the original dataset for use in data analysis or modeling. Sampling reduces the amount of data that needs to be processed and speeds up the model build process. • Model build process: In this process, it is necessary to segment the dataset into training and test samples. • Step-3: Modeling: i) A model is an abstract representation of data. This step creates a representative model inferred from the data. ii) Training dataset: The dataset used to create the model, with known attributes and target, is called the training dataset. iii) Test dataset or validation dataset: The validity of the created model also needs to be checked with another known dataset, called the test or validation dataset. iv) Building the model is an iterative process that involves selecting the variables for the model, executing the model, and running model diagnostics.
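Sampling and the split into training and test samples can be sketched as follows. This is a minimal example using only Python's standard random module; the function names, the fixed seed, the 10% sample, and the 80/20 split are assumptions chosen for illustration.

```python
import random


def sample_records(records, fraction, seed=0):
    """Select a random subset of records without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and cut them into training and test samples."""
    rng = random.Random(seed)
    shuffled = list(records)  # copy so the caller's list is left unchanged
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: 1000 record ids standing in for real rows.
records = list(range(1000))
subset = sample_records(records, 0.1)
train, test = train_test_split(subset, test_fraction=0.2)
print(len(subset), len(train), len(test))
```

The model is then built on train only, and test is held back to check the model on data it has not seen.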
  • 38. • Step-4: Application: i) Deployment: Here the model becomes production ready. ii) The model deployment stage has to deal with assessing model readiness, technical integration, response time, model maintenance, and assimilation. iii) Evaluation: It is the part of the process where you test whether you have a good model or not, before deploying or presenting it. Step-5: Knowledge: i) The data science process starts with prior knowledge and ends with posterior knowledge.
  • 39. Stages in Data Science project:
  • 40. Stage-1: Data Acquisition: i) A data science project begins with identifying various data sources (e.g. logs from web servers, social media data, data from online repositories such as census datasets, data streamed from online sources via APIs, web scraping). ii) Data acquisition involves acquiring data from all the identified internal and external sources that answer the business questions. iii) The main job of the data scientist in this step is to track where each data slice comes from and whether the acquired data slice is up to date. It is important to track this information during the entire lifecycle of a data science project.
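One possible way to track where each data slice comes from and whether it is up to date is to keep a small provenance record next to the data. The sketch below is an assumption about how this could look, not a prescribed method: the DataSlice class, its fields, the example sources, and the seven-day freshness limit are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataSlice:
    """Provenance record for one acquired piece of data (hypothetical)."""
    name: str
    source: str            # where the slice came from
    acquired_at: datetime  # when it was last fetched


def is_up_to_date(data_slice, now, max_age):
    """A slice counts as up to date if it was fetched within max_age."""
    return now - data_slice.acquired_at <= max_age


def stale_slices(slices, now, max_age):
    """Return the names of slices that need to be acquired again."""
    return [s.name for s in slices if not is_up_to_date(s, now, max_age)]


# Invented example slices and dates.
slices = [
    DataSlice("web_logs", "web server logs", datetime(2024, 1, 10)),
    DataSlice("census", "online repository", datetime(2023, 6, 1)),
]
now = datetime(2024, 1, 12)
print(stale_slices(slices, now, max_age=timedelta(days=7)))
```

Keeping the source and acquisition time with each slice makes it possible to answer both questions at any later stage of the project.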
  • 41. Stage-2: Data Preparation: i) Data acquired in the first stage is often not in a usable format for running the required analysis and might contain missing entries, inconsistencies, and semantic errors. ii) Next, data scientists have to clean and reformat the data, either by manually editing it in a spreadsheet or by writing code. This step does not produce any meaningful insights. iii) Through regular data cleaning, data scientists can easily identify what faults exist in the data acquisition process, what assumptions they should make, and what models they can apply to produce analysis results. iv) After reformatting, the data can be converted to JSON, CSV, or any other format that makes it easy to load into one of the data science tools. v) Exploratory data analysis: It forms an integral part of this stage, as summarizing the clean data can help to identify outliers, anomalies, and patterns. vi) Through this step, data scientists come to know what they actually need to do with the data. This stage is time consuming.
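The conversion to CSV or JSON and a simple exploratory check for outliers might look like the sketch below. The visit counts are invented, and the 1.5 × IQR rule used to flag outliers is one common convention chosen here as an assumption; the slides do not name a specific method. statistics.quantiles requires Python 3.8 or later.

```python
import csv
import io
import json
import statistics

# Hypothetical cleaned records; values are invented for illustration.
rows = [
    {"day": 1, "visits": 102},
    {"day": 2, "visits": 98},
    {"day": 3, "visits": 105},
    {"day": 4, "visits": 99},
    {"day": 5, "visits": 101},
    {"day": 6, "visits": 400},
    {"day": 7, "visits": 97},
    {"day": 8, "visits": 103},
]


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def find_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


visits = [r["visits"] for r in rows]
print(to_csv(rows))
print("median visits:", statistics.median(visits))
print("possible outliers:", find_outliers(visits))
```

A flagged value is only a candidate for inspection; whether it is an error or a real event is a judgment the data scientist still has to make.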
  • 42. Stage-3: Hypothesis and Modelling: i) This stage requires writing, running, and refining programs to analyse the data and derive meaningful insights from it. These programs may be written in Python, R, MATLAB, or Perl. ii) Different machine learning techniques are applied to the data to identify the machine learning model that best fits the business needs. All the machine learning models are trained on the training dataset. Stage-4: Evaluation and Interpretation: i) Different kinds of problems call for different evaluation metrics. E.g. for a model that predicts daily stock prices, the RMSE (Root Mean Squared Error) has to be considered for evaluation. For a model that classifies spam email messages, performance metrics such as average accuracy, AUC, and log loss have to be considered. ii) Machine learning model performance should be measured and compared using validation and test sets to identify the best model based on model accuracy and overfitting.
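The evaluation metrics named above can be written out in plain Python to show what each one measures. This is a minimal sketch for small lists, assuming binary labels coded as 0 and 1; the function names are chosen for this example, and in practice a library implementation would normally be used.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error for numeric predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(labels, predicted):
    """Fraction of predicted class labels that match the true labels."""
    hits = sum(1 for a, b in zip(labels, predicted) if a == b)
    return hits / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Average negative log-likelihood of binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example values.
print(rmse([10.0, 12.0, 11.0], [11.0, 12.0, 9.0]))
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(log_loss([1, 0], [0.5, 0.5]))
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
```

RMSE applies to numeric targets such as prices, while accuracy, log loss, and AUC apply to classifiers such as a spam filter.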
  • 43. Stage-5: Deployment: i) Machine learning models might have to be recoded, because data scientists may favour the Python programming language while the production environment supports Java. ii) Models are first deployed in a pre-production or test environment before actually being deployed in production. Stage-6: Operation/Maintenance: i) This step involves developing a plan for monitoring and maintaining the data science project in the long run. Stage-7: Optimization: i) It involves retraining the machine learning model in production whenever a new data source comes in.