Introduction of Data Science and Data Analytics

Data Analytics (BCA Science: VI)
By Prof. Vrushali Solanke

What is Data Science?
• Data science is the in-depth study of massive amounts of data. It involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using
scientific methods, different technologies, and algorithms.
• It is a multidisciplinary field that uses tools and techniques to manipulate data so
that you can find something new and meaningful.
• Data science uses powerful hardware, programming systems, and efficient algorithms
to solve data-related problems. It is the future of artificial intelligence.

Data science refers to an emerging area of work concerned with the collection, preparation,
analysis, visualization, management, and preservation of large collections of information.

Examples:
• You have probably seen smart watches, or maybe you use one yourself. These smart
gadgets can measure your sleep quality, how much you walk, your heart rate, etc.

• Tesla is famous for using data science (e.g. deep learning) for its self-driving cars.

Following are some main reasons for using data science technology:
• With the help of data science technology, we can convert massive amounts of raw
and unstructured data into meaningful insights.
• Data science technology is being adopted by various companies, whether big brands
or start-ups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use
data science algorithms for a better customer experience.
• Data science is helping to automate transportation, for example by creating
self-driving cars, which are the future of transportation.
• Data science can help with different kinds of prediction, such as surveys, elections,
flight ticket confirmation, etc.

• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can
use data and convert it into a distinctive business advantage.
• Data science can help you detect fraud using advanced machine learning algorithms.
• It helps you prevent significant monetary losses.
• It allows you to build intelligent abilities into machines.
• You can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to make better and faster decisions.
• It helps you recommend the right product to the right customer to enhance your
business.

Basics of data:
• Data: Data are characteristics or information, usually numeric,
that are collected through observation. In a more technical
sense, data are a set of values of qualitative or quantitative
variables about one or more persons or objects, while a
datum (singular of data) is a single value of a single variable.
• There are 3 main categories of data:
1) Structured data
2) Unstructured data
3) Semi-structured data

Structured data:
• Structured data usually resides in relational databases (RDBMS). Fields store
length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even
text strings of variable length, like names, are contained in records, making it a simple
matter to search. Data may be human- or machine-generated as long as the data is created
within an RDBMS structure. This format is eminently searchable, both with human-generated
queries and via algorithms, using the type of data and field names, such as alphabetical or
numeric, currency or date.
• Common relational database applications with structured data include airline reservation
systems, inventory control, sales transactions, and ATM activity. Structured Query Language
(SQL) enables queries on this type of structured data within relational databases.
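
As a rough illustration of querying structured data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name `sales`, its columns, and the sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

# In-memory relational database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, zip TEXT, amount REAL, sold_on TEXT)"
)

# Every record follows the same schema: one typed value per named field.
rows = [
    ("Asha", "411001", 250.0, "2024-01-05"),
    ("Ravi", "411002", 120.5, "2024-01-06"),
    ("Asha", "411001", 80.0, "2024-01-09"),
]
conn.executemany(
    "INSERT INTO sales (customer, zip, amount, sold_on) VALUES (?, ?, ?, ?)", rows
)

# Because fields are named and typed, filtering and aggregation are simple.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```

The query works only because every row shares the same fields, which is the defining property of structured data.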

What Is Unstructured Data?
• Unstructured data is essentially everything else. Unstructured data has internal structure
but is not structured via pre-defined data models or schema. It may be textual or
non-textual, and human- or machine-generated. It may also be stored within a non-relational
(NoSQL) database.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: we sometimes refer to it as semi structured. However, its message field is
unstructured and traditional analytical tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, phone recordings, collaboration software (like Microsoft Teams,
Google Docs, etc.).
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data:
Machine-generated data is information that is automatically created by a
computer, process, application, or other machine without human
intervention.
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic
imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.

Semi-structured data:
Semi-structured data maintains internal tags and markings that
identify separate data elements, which enables information
grouping and hierarchies.
Email is a very common example of a semi-structured data
type. Email’s native metadata enables classification and
keyword searching without any additional tools.
Sharing sensor data is a growing use case, as are Web-based
data sharing and transport: electronic data interchange (EDI),
many social media platforms, document markup languages,
and NoSQL databases.

Examples of Semi-structured Data
• Markup language XML: This is a semi-structured document language. XML is a set of document encoding
rules that defines a human- and machine-readable format. Its value is that its tag-driven structure is highly
flexible, and coders can adapt it to universalize data structure, storage, and transport on the Web.
• Open standard JSON (JavaScript Object Notation): JSON is another semi-structured data interchange
format. Its structure consists of name/value pairs (or object, hash table, etc.) and an ordered value list (or
array, sequence, list). Since the structure is interchangeable among languages, JSON excels at transmitting
data between web applications and servers.
• NoSQL: Semi-structured data is also an important element of many NoSQL (“not only SQL”) databases.
NoSQL databases differ from relational databases because they do not separate the organization (schema)
from the data. This makes NoSQL a better choice to store information that does not easily fit into the record
and table format, such as text with varying lengths. It also allows for easier data exchange between
databases. Some newer NoSQL databases like MongoDB and Couchbase also incorporate semi-structured
documents by natively storing them in the JSON format.
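
To make the idea of tagged, hierarchical data more concrete, here is a minimal sketch that reads the same kind of record from JSON and from XML using only Python's standard library. The field names (`name`, `phones`, `address`, `city`) and values are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# JSON: name/value pairs (objects) and ordered lists (arrays) can be nested.
json_text = (
    '{"name": "Asha", "phones": ["555-0100", "555-0101"],'
    ' "address": {"city": "Pune"}}'
)
record = json.loads(json_text)
city = record["address"]["city"]
phone_count = len(record["phones"])

# XML: tags mark each data element and express the hierarchy.
xml_text = (
    "<customer><name>Asha</name>"
    "<address><city>Pune</city></address></customer>"
)
root = ET.fromstring(xml_text)
xml_city = root.find("address/city").text

print(city, phone_count, xml_city)
```

In both formats the structure travels with the data itself, so no fixed table schema has to be declared in advance.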

Differences between Structured, Semi-structured and Unstructured data:
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Mature transactions and various concurrency techniques | Transactions are adapted from DBMS, not mature | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Most flexible; there is no schema
Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | Most scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible

Basics of Data Science:
Data science provides some sort of understanding of the data, i.e. what is termed
information extraction.
Insight: It is gained by analysing data and information to understand what is
going on in a particular situation.
Data: It is a raw, unorganized set of information.
Data → Information → Insights

Need for Data Science:
• Today most data is unstructured or semi-structured.
• Sources of this data include financial logs, text files, multimedia forms,
sensors, and instruments.
• Simple BI tools are not capable of processing this huge volume and variety of data.
• This is why we require more complex and advanced analytical tools and algorithms
for processing, analyzing, and drawing meaningful insights from it.
• Data science is all about uncovering findings from data.
• Example: Netflix.

What is Data Science?
• Turning raw data into insights to make better decisions.
• Data science is a blend of various tools, algorithms, and machine
learning principles with the goal of discovering hidden patterns in raw
data.
• It is the art and science of extracting actionable insights from raw data.
• Data science is also known as data-driven science.

Definition of data science by the famous Venn diagram (by Drew Conway).

Basic areas in data science:
• Mathematics and statistics
• Computer programming
• Domain knowledge

Data science can add value in the following ways:
1. It empowers management and officers to make better decisions.
2. It helps to direct action based on trends, for defining goals.
3. It helps staff to adopt best practices and focus on issues that matter.
4. It helps to identify opportunities and supports decision making with quantifiable
data.
5. It helps in identifying and refining the target audience for a business.

What is Data Science? Is it Statistics or Machine Learning?
• Statistics is a tool or method for data science, while data science is a
wide domain in which statistical methods are an essential component.
• Not every statistician is a data scientist, and not every data scientist
is a statistician.
• Machine learning can be defined as the practice of using algorithms to take in
data, learn from it, and then forecast future trends for that topic.
• Data science uses machine learning as a tool to provide insights.
• Data science is the multidisciplinary blend of data inference,
algorithm development, and technology in order to solve analytically
complex problems.

Sr.No | Features | BI | Data Science
1. | Data Sources | Structured (usually SQL, often Data | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
2. | Approach | Statistics and visualization | Statistics, Machine Learning, Graph Analysis, NLP
3. | Focus | Past and present | Present and future
4. | Tools | Pentaho, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R

Sr.No | Machine Learning | Data Science
1. | A subset of AI that focuses on a narrow range of activities | Not exactly a subset of machine learning, but it uses machine learning to analyze data and make future predictions
2. | Develops new (individual) models | Explores many models, builds and tunes hybrids
3. | Proves mathematical properties of models | Understands empirical properties of models
4. | Improves/validates on a few relatively clean, small datasets | Develops/uses tools that can handle massive datasets
5. | Produces predictions | Produces insights
6. | It is a part of data science | An all-encompassing term that includes aspects of machine learning for functionality

Data Scientist:
• A data scientist is a professional responsible for collecting, analysing, and
interpreting large amounts of data to identify ways to help a business improve its
operations.
• A data scientist has sufficient understanding of business needs, domain
knowledge, analytical skills, and programming expertise to manage end-to-end
scientific methods at each stage of a big data project.
• Responsibilities of a data scientist:
1) Collecting large amounts of unruly data and transforming it into a more
usable format.
2) Solving business-related problems using data-driven techniques.
3) Working with a variety of programming languages.
4) Having a solid grasp of statistics, including statistical tests and distributions.
5) Staying on top of analytical techniques such as machine learning, deep
learning, and text analytics.
6) Looking for order and patterns in data, as well as spotting trends.

Categories of Data Scientist:
• Data scientists are classified into 4 categories:
1) Data Developer: Developer, Engineer
2) Data Researcher: Researcher, Scientist, Statistician
3) Data Creative: Jack of all trades, Artist, Hacker
4) Data Businessperson: Leader, Businessperson, Entrepreneur

Skills for a Data Scientist:
• Programming skills: A data scientist should have command of a programming
language like R or Python and a database query language like SQL.
• Statistics: A data scientist should have knowledge of statistical tests, distributions,
and maximum likelihood estimators.
• Machine Learning: At large companies with huge amounts of data (e.g. Netflix,
Google, Amazon, etc.), it may be essential to be familiar with ML methods like k-nearest
neighbors, random forests, etc.
• Multivariable calculus and Linear Algebra: At companies where the product is
defined by data, these concepts are most important.
• Data Visualization and Communication: This skill is important for
younger companies that are making data-driven decisions for the first time. It is
important not just to be familiar with visualization tools (e.g. Tableau) but also to
understand the principles behind visually encoding data.

• The process of discovering useful relationships and patterns in data is enabled by a set
of iterative activities collectively known as the Data Science Process.
• The data science process involves:
1) Understanding the problem
2) Preparing the data samples
3) Developing the model
4) Applying the model to a dataset to see how the model may work in the real
world
5) Deploying and maintaining the models

Step-1: Prior knowledge:
i) It helps to define what problem is being solved, how it fits in the business context,
and what data is needed in order to solve the problem.
ii) The data science process starts with the need for analysis, a question, or a business
objective.
iii) Without a well-defined statement of the problem, it is impossible to come up with
the right dataset.
iv) The data science process is explained using a hypothetical use case.

Step-2: Data Preparation:
i) Preparing the dataset to suit the task is the most time-consuming part of the process.
ii) This phase consists of three sub-phases: data cleaning removes false values
from a data source and inconsistencies across data sources; data integration enriches
data sources by combining information from multiple data sources; and data transformation
ensures that the data is in a suitable format for use in your model. Data exploration is
concerned with building a deeper understanding of your data.

• Sampling: It is the process of selecting a subset of records as a representation of the
original dataset for use in data analysis or modeling. Sampling reduces the amount of
data that needs to be processed and speeds up the model build process.
• Model build process: In this process, it is necessary to segment the dataset into training
and test samples.
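
A minimal sketch of sampling followed by a training/test split, using only Python's standard library. The dataset, the sample size of 200, the 80/20 split ratio, and the seed are all assumptions made for illustration.

```python
import random

# Fixed seed so the sketch is reproducible; the value itself is arbitrary.
rng = random.Random(42)

# Hypothetical original dataset of 1000 records.
dataset = [{"id": i, "x": i * 0.5} for i in range(1000)]

# Sampling: pick a random subset to stand in for the full dataset.
sample = rng.sample(dataset, 200)

# Segment the sample into training (80%) and test (20%) records.
shuffled = sample[:]
rng.shuffle(shuffled)
split = len(shuffled) * 8 // 10
train, test = shuffled[:split], shuffled[split:]

print(len(train), len(test))
```

Random selection is only one possible strategy; when classes are imbalanced, a stratified sample may represent the original data better.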

Step-3: Modeling:
i) A model is an abstract representation of data. This step creates a representative model
inferred from the data.
ii) Training dataset: The dataset used to create the model, with known attributes and
target, is called the training dataset.
iii) Test dataset or validation dataset: The validity of the created model also needs to
be checked with another known dataset, called the test or validation dataset.
iv) Building the model is an iterative process that involves selecting the variables for the
model, executing the model, and running model diagnostics.
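
The sketch below illustrates the training/test idea with one of the simplest possible models, a straight line fitted by ordinary least squares. The choice of model, the helper names `fit_line` and `mean_abs_error`, and the data points are assumptions for illustration only, not a method prescribed by these slides.

```python
def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_abs_error(points, slope, intercept):
    """Average absolute gap between known targets and model predictions."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Training dataset: known attribute (x) and known target (y).
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1)]
# Test dataset: also known, but never used while fitting.
test = [(6, 13.0), (7, 15.1)]

slope, intercept = fit_line(train)
test_error = mean_abs_error(test, slope, intercept)
print(slope, intercept, test_error)
```

If the error on the test data is much larger than on the training data, that is a diagnostic signal to revisit the variables or the model, which is what makes the process iterative.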

Step-4: Application:
i) Deployment: Here the model becomes production ready.
ii) The model deployment stage has to deal with assessing model
readiness, technical integration, response time, model maintenance,
and assimilation.
iii) Evaluation: It is the part of the process where you test to see whether you have a
good model or not, before deploying or presenting it.

Step-5: Knowledge:
i) The data science process starts with prior knowledge and ends with
posterior knowledge.

Stages in Data Science project:

Stage-1: Data Acquisition:
i) A data science project begins with identifying various data sources (e.g. logs
from web servers, social media data, data from online repositories like census datasets,
data streams from online sources via APIs, web scraping).
ii) Data acquisition involves acquiring data from all the identified internal and
external sources that answer the business questions.
iii) The main job of the data scientist in this step is to track where each data slice
comes from and whether the data slice acquired is up to date or not. It is important
to track this information during the entire lifecycle of a data science project.

Stage-2: Data Preparation:
i) Data acquired in the first step is often not in a usable format to run the required
analysis and might contain missing entries, inconsistencies, and semantic errors.
ii) Next, data scientists have to clean and reformat the data by manually
editing it in a spreadsheet or by writing code. This step does not produce any
meaningful insights.
iii) Through regular data cleaning, data scientists can easily identify what faults
exist in the data acquisition process, what assumptions they should make, and what
models they can apply to produce analysis results.
iv) After reformatting, the data can be converted to JSON, CSV, or any other format
that makes it easy to load into one of the data science tools.
v) Exploratory data analysis forms an integral part of this stage, as
summarization of the clean data can help to identify outliers, anomalies, and patterns.
vi) Through this step, data scientists come to know what they actually need
to do with this data. This stage is time consuming.
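
The following sketch walks through the steps above on a tiny, made-up dataset: cleaning missing and invalid entries, normalising inconsistent text, flagging a possible outlier, and exporting to CSV. The records, the field names, and the "more than 3 median absolute deviations from the median" outlier rule are illustrative assumptions.

```python
import csv
import io
import statistics

# Hypothetical raw records with typical acquisition problems.
raw_rows = [
    {"city": " Pune ", "temp_c": "31.5"},   # stray whitespace
    {"city": "pune", "temp_c": ""},         # missing entry
    {"city": "Mumbai", "temp_c": "30.0"},
    {"city": "Mumbai", "temp_c": "n/a"},    # not a number
    {"city": "Nagpur", "temp_c": "32.0"},
    {"city": "Nagpur", "temp_c": "95.0"},   # suspicious reading
]


def clean(row):
    """Return a normalised record, or None if the value cannot be parsed."""
    try:
        temp = float(row["temp_c"])
    except ValueError:
        return None
    return {"city": row["city"].strip().title(), "temp_c": temp}


cleaned = [c for c in map(clean, raw_rows) if c is not None]

# Exploratory summary: flag values far from the median (assumed rule).
temps = [row["temp_c"] for row in cleaned]
median = statistics.median(temps)
mad = statistics.median(abs(t - median) for t in temps)
outliers = [row for row in cleaned if abs(row["temp_c"] - median) > 3 * mad]

# Export the cleaned data as CSV text so other tools can load it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "temp_c"])
writer.writeheader()
writer.writerows(cleaned)
csv_text = buffer.getvalue()

print(len(cleaned), outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that usually needs domain knowledge, so the sketch only flags it rather than deleting it.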

Stage-3: Hypothesis and Modelling:
i) This stage requires writing, running, and refining programs to analyse the data and
derive meaningful insights from it. These programs may be written in Python, R,
MATLAB, or Perl.
ii) Different machine learning techniques are applied to the data to identify the machine
learning model that best fits the business needs. All the machine learning models
are trained with the training dataset.

Stage-4: Evaluation and interpretation:
i) Different kinds of problems call for different evaluation metrics. E.g.
for a model that predicts daily stock prices, the RMSE (Root Mean Squared Error) has to be
considered for evaluation. For a model that classifies spam email messages,
performance metrics like average accuracy, AUC, and log loss have to be considered.
ii) Machine learning model performance should be measured and compared
using validation and test sets to identify the best model based on model accuracy and
overfitting.
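
As a small worked example of the metrics named above, the sketch below computes RMSE for a numeric forecast and accuracy and log loss for a spam classifier. All numbers are invented, and the 0.5 decision threshold is an assumption.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error for numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(labels, predicted_labels):
    """Fraction of labels predicted correctly."""
    hits = sum(1 for y, p in zip(labels, predicted_labels) if y == p)
    return hits / len(labels)


def log_loss(labels, probabilities, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Regression-style task: hypothetical daily prices and forecasts.
prices = [100.0, 102.0, 101.0]
forecast = [101.0, 100.0, 101.0]
price_rmse = rmse(prices, forecast)

# Classification-style task: 1 = spam, 0 = not spam.
is_spam = [1, 0, 1, 0]
spam_prob = [0.9, 0.2, 0.4, 0.1]
spam_pred = [1 if p >= 0.5 else 0 for p in spam_prob]
spam_accuracy = accuracy(is_spam, spam_pred)
spam_log_loss = log_loss(is_spam, spam_prob)

print(price_rmse, spam_accuracy, spam_log_loss)
```

Accuracy only looks at the final label, while log loss also penalises confident wrong probabilities, which is one reason both are often reported together.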

Stage-5: Deployment:
i) Machine learning models might have to be recoded, because data scientists
may favour the Python programming language while the production environment supports Java.
ii) Models are first deployed in a pre-production or test environment before
actually being deployed in production.

Stage-6: Operation/Maintenance:
i) This step involves developing a plan for monitoring and maintaining the
data science project in the long run.

Stage-7: Optimization:
i) It involves retraining the machine learning model in production whenever
a new data source comes in.
