fundamentals_of_data_science_and_its_intro.pptx

Data Mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
• Other terms for data mining include knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.
• Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD.
• The data can be structured, semi-structured or unstructured, and can be
stored in various forms such as databases, data warehouses, and data
lakes.

Big Data is characterized by huge volume, high velocity, and extensive
variety of data.
There are three types of data: structured, semi-structured, and
unstructured.
Structured data:
• Structured data refers to data that is organized and
formatted in a specific way to make it easily readable and
understandable by both humans and machines.
• Structured data is typically found in databases and
spreadsheets, and is characterized by its organized nature.

Semi-structured data:
• Semi-structured data is not purely structured, but also not
completely unstructured.
• It contains some level of organization or structure, but does
not conform to a rigid schema or data model, and may
contain elements that are not easily categorized or classified.
Eg: XML documents, e-mails
Unstructured data:
• Unstructured data is data that is not organized in a
predefined manner.
Eg: text documents, images, video, etc.

• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions
or predictions.
• This involves exploring the data using various techniques such as
clustering, classification, regression analysis, association rule mining,
and anomaly detection.

1. Business & Marketing
•Customer Segmentation – Identifying groups of customers with similar
behaviors (purchasing behavior, interests, and preferences.) for targeted
marketing.
•Market Basket Analysis – Finding product associations (e.g., Amazon’s
"Customers who bought this also bought...").
•Churn Prediction – Predicting customers likely to leave a service (e.g., cancel
subscriptions, close accounts).
•Fraud Detection – identifies fraudulent activities in financial transactions,
insurance claims, and online activities.

2. Healthcare & Medicine
•Disease Prediction & Diagnosis – Using historical medical data to
predict diseases (e.g., cancer detection).
•Drug Discovery – Analysing drug interactions and predicting new drug
formulations.
•Personalized Medicine – Recommending treatments based on patient
data.

3. Finance & Banking
•Credit Scoring & Risk Assessment – Evaluating loan applications based
on past data.
•Algorithmic Trading – Predicting stock market trends using historical
data.
•Money Laundering Detection – Identifying suspicious financial activities.
4. Manufacturing & Industry
•Predictive Maintenance – Predicting machine failures before they occur.
•Supply Chain Optimization – Enhancing logistics and inventory
management.
•Quality Control – Detecting defects in manufacturing.

5. Education
•Student Performance Prediction – Identifying students at risk of failing.
•Personalized Learning – Recommending learning materials based on
student progress.
•Dropout Prediction – Understanding factors that lead to student dropouts.
6. Social Media & Web
•Sentiment Analysis – Understanding public opinions from social media.
•Recommendation Systems – Suggesting movies, music, or books (e.g.,
Netflix, Spotify).
•Fake News Detection – Identifying misinformation and fake news.

7. Cybersecurity
•Intrusion Detection – Detecting cyber attacks on networks.
•Malware Analysis – Identifying patterns in malicious software.
8. Agriculture
•Crop Yield Prediction – Forecasting agricultural output using weather
and soil data.
•Pest Detection – Identifying pests using image data.

What data can be mined?
• Data mining can be applied to structured, semi-structured, and
unstructured data across multiple domains.
• The choice of data type depends on the problem being solved, the
available tools, and the computational resources.

Transactional Data
It records the "who," "what," "when," and "where" of each transaction. Examples
include the products purchased, the customer, the date and time, total spent,
applied discounts, and payment method.

1. Structured Data (Traditional Databases)
•Relational Databases (RDBMS): Data stored in structured formats such as tables with
rows and columns (e.g., MySQL, PostgreSQL, Oracle).
•Data Warehouses: Integrated data from multiple sources optimized for analytics.
•Transactional Data: Data from online transaction processing (OLTP) systems such as
banking records, purchase transactions, and financial statements.
Example
•Customer purchase records in a retail store database.
•Banking transaction logs for fraud detection.

2. Semi-Structured Data
•XML and JSON Data: Data stored in hierarchical or nested formats.
•Logs and Event Data: Web server logs, application logs, or system monitoring data.
•Emails and Messages: Textual data with some structure (headers, timestamps).
Example
•Mining email logs to detect phishing attempts.
•Analyzing JSON-formatted IoT device data for predictive maintenance.

3. Unstructured Data
•Text Data: Documents, articles, social media posts, customer reviews.
•Multimedia Data: Images, audio, and video files.
•Sensor and IoT Data: Data collected from sensors, smart devices, and industrial
equipment.
Example
•Analyzing tweets to detect trending topics.
•Mining CCTV footage for facial recognition in security applications.

4. Spatial Data
•Geospatial Data: Maps, GPS data, satellite images, and geographic information
system (GIS) datasets.
•Location-Based Data: User location logs from mobile apps.
Example
•Identifying crime hotspots based on GPS data.
•Mining traffic patterns to optimize city road networks.
5. Time-Series Data
•Stock Market Data: Historical stock prices, trading volumes.
•Weather Data: Temperature, humidity.
•IoT Sensor Readings: Continuous data streams from smart meters.
Example
•Predicting stock price movements using historical trading data.
•A security camera capturing video footage.

6. Web and Social Media Data
•Clickstream Data: User navigation patterns on websites.
•Social Network Data: Relationships, interactions, and sentiment analysis.
•Search Engine Logs: Queries, user behavior, and recommendation insights.
Example
•Recommending products based on browsing history.
•Detecting fake news using social media analysis.

Types of data mining:
1. Descriptive Data Mining
2. Predictive Data Mining

1. Descriptive Data Mining
This type focuses on identifying patterns, trends, and relationships in historical data
without making predictions.
Techniques
•Association Rule Mining: Identifies relationships between items in large datasets (e.g.,
Market Basket Analysis).
•Clustering: Groups similar data points together (e.g., customer segmentation).
•Summarization: Provides concise representations of datasets (e.g., data aggregation in
reports).
Example
•Discovering that customers who buy bread often buy butter.
•Segmenting customers into different groups based on purchasing behavior.

2. Predictive Data Mining
This type focuses on making predictions based on past data using machine learning
and statistical methods.
Techniques
•Classification: Assigns labels to data points (e.g., spam vs. non-spam emails).
•Regression Analysis: Predicts continuous values (e.g., house prices, stock prices).
•Time-Series Analysis: Forecasts future trends based on historical data.
Example
•Predicting customer churn in a telecom company.
•Forecasting stock prices based on historical trends.

Functionalities of data mining
1. Classification (Predictive)
2. Clustering (Descriptive)
3. Association Rule Mining (Descriptive)
4. Prediction (Predictive)
5. Anomaly Detection (Predictive)
6. Regression Analysis (Predictive)
7. Summarization (Descriptive)

1. Classification
Classification is a supervised learning technique used to assign predefined
labels (categories) to data based on learned patterns.
How Classification Works:
1. Training Phase – A model learns from labeled training data.
2. Testing Phase – The model predicts labels for new, unseen data.
3. Evaluation – Accuracy, precision, recall, and F1-score measure
performance.
Common Classification Algorithms:
Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT),
Random Forest (RF)

Example Use Cases:
✅ Spam Detection – Classify emails as spam or not spam.
✅ Disease Diagnosis – Predict if a patient has a disease based on symptoms.
✅ Sentiment Analysis – Determine if a review is positive or negative.
Eg. "Very interesting movie" (positive),
"Disgusting product" (negative)
✅ Fraud Detection – Identify fraudulent transactions.
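
The training, testing, and evaluation phases described above can be sketched with a small k-Nearest Neighbors classifier. This is a minimal illustration only: the two numeric features per email, the sample points, and the function name knn_predict are assumptions made for this example, not data from the slides.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Training phase: labeled examples.
# Hypothetical features: (number of links, number of exclamation marks).
train = [
    ((0, 0), "not spam"), ((1, 0), "not spam"), ((0, 1), "not spam"),
    ((5, 4), "spam"), ((6, 5), "spam"), ((4, 6), "spam"),
]

# Testing phase: unseen examples with their true labels.
test = [((1, 1), "not spam"), ((5, 5), "spam"), ((0, 2), "not spam"), ((6, 6), "spam")]
predictions = [knn_predict(train, features) for features, _ in test]

# Evaluation: fraction of test examples labeled correctly (accuracy).
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

Precision, recall, and F1-score would be computed from the same predictions by counting true positives, false positives, and false negatives per class.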

2. Clustering
Clustering is an unsupervised learning technique used in data
mining to group similar data points together based on their
characteristics. It helps in discovering patterns and structures in
large datasets without predefined labels.
Clustering Techniques:
K-means, Hierarchical

Example Use Cases:
✅ Customer Segmentation – Grouping customers based on shopping behavior.
✅ Image Segmentation – Dividing an image into different regions.
Eg. Autonomous Vehicles
•Lane detection for self-driving cars
•Pedestrian and obstacle recognition
•Traffic sign and signal identification
✅ Anomaly Detection – Identifying outliers in financial transactions.
Eg. Intrusion detection in cybersecurity, Defect detection in manufacturing
✅ Document Clustering – Grouping similar articles or documents.
Eg News recommendation – Groups articles by topic (e.g., politics, sports,
technology).
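
As a rough sketch of how K-means groups points without labels, the code below alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its cluster. The sample points, the starting centroids, and the function name k_means are hypothetical; real implementations usually choose starting centroids randomly and stop when assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic K-means: assign points to the nearest centroid, then update centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average basket size).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With these inputs the three low-activity customers end up in one cluster and the three high-activity customers in the other.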

3. Association Rule Mining
Association Rule Mining is a data mining technique used to find relationships or patterns
between different items in a dataset. It is commonly applied in market basket analysis,
recommendation systems, and fraud detection.
Algorithms: Apriori, FP-Growth.
Example: Market Basket Analysis
•Rule: {Bread} → {Butter}
• Meaning: Customers who buy bread are likely to buy butter.
•Rule: {Milk, Cereal} → {Banana}
• Meaning: If someone buys milk and cereal, they often buy bananas.
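
A rule such as {Bread} → {Butter} is usually judged by its support, confidence, and lift. The sketch below computes these three measures directly from a handful of made-up transactions; it does not implement Apriori or FP-Growth, which are algorithms for finding candidate itemsets efficiently. The transaction list and function names are assumptions for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal", "banana"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall (above 1 suggests a positive association)."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"bread"}, {"butter"})
print(support(rule[0] | rule[1]), confidence(*rule), lift(*rule))
```

For this toy data the rule has support of about 0.6, confidence of about 0.75, and lift of about 1.25.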

Applications of Association Rule Mining:
Retail & Market Basket Analysis – Finding frequently purchased product
combinations.
E-commerce & Recommendations – Suggesting products based on previous
purchases.
Healthcare – Finding correlations between symptoms and diseases.
Eg. Diabetes → Hypertension (patients with diabetes are more likely to
have hypertension).
Obesity + Smoking → Increased Heart Disease Risk.

4. Anomaly Detection
Anomaly detection identifies unusual patterns in data that do not
conform to expected behavior. It is used in fraud detection,
cybersecurity, healthcare, and predictive maintenance.
Applications of Anomaly Detection:
✅ Fraud detection (Credit card transactions, insurance fraud)
✅ Cybersecurity (Intrusion detection, DDoS attack detection)
✅ Healthcare (Detecting anomalies in medical scans or ECG signals)
✅ Manufacturing (Predictive maintenance for machinery failures)
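
One simple way to flag values that do not conform to expected behavior is the z-score: how many standard deviations a value lies from the mean. The sketch below is only one of many possible approaches; the transaction amounts, the threshold of 2.0, and the function name are assumptions chosen for this example.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
anomalies = find_anomalies(amounts)
print(anomalies)
```

Because a very large outlier also inflates the mean and standard deviation, robust alternatives based on the median or interquartile range are often preferred on small samples.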

5. Prediction
Prediction involves using historical data to forecast future values.
Applications of Prediction Models:
✅ Stock Market Forecasting (Predicting stock prices)
✅ Sales Forecasting (Estimating future sales based on past data)
✅ Weather Prediction (Forecasting temperature & rainfall)
✅ Disease Prediction (Predicting heart disease risk)
✅ Energy Consumption Forecasting (Predicting electricity demand)
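
As a very small illustration of forecasting from historical data, the sketch below predicts the next value as the average of the most recent observations (a moving-average forecast). The sales figures, the window size, and the function name are hypothetical; practical forecasting models also account for trend and seasonality.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return sum(history[-window:]) / window

# Hypothetical monthly sales figures.
sales = [100, 110, 120, 130, 140]
forecast = moving_average_forecast(sales)
print(forecast)  # mean of 120, 130, 140
```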

6. Regression
Regression analysis is a predictive modeling technique used to analyze
relationships between a dependent variable (target) and one or more
independent variables (features). It is widely used in finance, economics,
healthcare, and machine learning.
Types of Regression Analysis
1. Linear Regression – Models the relationship between variables using a straight line.
   Example: House price prediction based on square footage.
2. Multiple Linear Regression – Uses multiple independent variables.
   Example: Salary prediction based on experience, education, and skills.
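
Simple linear regression can be sketched with the ordinary least-squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The square-footage and price values below are invented for illustration, as is the function name fit_line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: square footage vs. price (in thousands).
square_feet = [1000, 1500, 2000, 2500]
price = [200, 290, 410, 500]

slope, intercept = fit_line(square_feet, price)
predicted = slope * 1800 + intercept  # estimated price for an 1800 sq ft house
print(slope, intercept, predicted)
```

Multiple linear regression follows the same idea with several independent variables and is normally solved with a linear algebra library rather than by hand.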

7. Summarization (Descriptive)
Summarization in data mining refers to the process of extracting key information
from large datasets to create a compact, high-level representation. This helps in quick
decision-making, pattern recognition, and data exploration.
Applications of Summarization in Data Mining
✅ Business Intelligence – Summarizing sales trends, customer behavior.
✅ Healthcare – Summarizing patient records and medical reports.
✅ Text Mining – Extracting summaries from legal documents, news articles.
✅ Cybersecurity – Summarizing network activity logs to detect threats.
✅ Social Media Analytics – Summarizing trends from tweets, posts, and reviews.
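
For numeric data, summarization often means grouping records and reporting a few aggregate statistics per group. The sketch below does this for a small set of sales records; the regions, amounts, and function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

def summarize(records):
    """Group (region, amount) records and report count, total, and mean per region."""
    groups = defaultdict(list)
    for region, amount in records:
        groups[region].append(amount)
    return {
        region: {"count": len(amounts), "total": sum(amounts), "mean": mean(amounts)}
        for region, amounts in groups.items()
    }

# Hypothetical sales records: (region, sale amount).
sales = [("North", 120.0), ("South", 80.0), ("North", 100.0), ("South", 60.0), ("East", 200.0)]
summary = summarize(sales)
print(summary)
```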

Knowledge Discovery in Databases (KDD)
KDD (Knowledge Discovery in Databases) is the process of discovering
useful knowledge from large datasets.
It is a broad concept that includes data mining as one of its key steps.
The KDD process consists of several stages that transform raw data into
meaningful patterns or knowledge.

1. Data Integration
• Data integration is the process of combining data from different
sources and storing it in a common place called a data warehouse.

2. Data Cleaning
• This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates.
• Missing Data:
This situation arises when some attribute values are absent from the records.
• It can be handled in various ways. Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset is quite
large and multiple values are missing within a tuple.
• Fill the missing values:
Fill in missing values manually, with the attribute mean, or with the most
probable value.
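
The two strategies above, ignoring tuples with several missing values and filling the rest with the attribute mean, can be sketched as follows. The records, the column names, and the rule of dropping rows with more than one missing value are assumptions for this example.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"name": "A", "age": 25, "marks": 70.0},
    {"name": "B", "age": None, "marks": 80.0},
    {"name": "C", "age": None, "marks": None},
    {"name": "D", "age": 35, "marks": 90.0},
]

def drop_sparse(rows, max_missing=1):
    """Ignore tuples that have more than max_missing missing values."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

def fill_with_mean(rows, column):
    """Replace missing values in one column with the mean of the known values."""
    column_mean = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: column_mean if r[column] is None else r[column]} for r in rows]

kept = drop_sparse(rows)               # record C has two missing values and is dropped
cleaned = fill_with_mean(kept, "age")  # record B gets the mean of the known ages
print(cleaned)
```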

5. Data Mining:
• This is the core step of the KDD process, where various data
mining techniques are applied to discover patterns, associations,
trends, or anomalies in the data.
• Common data mining techniques include classification, clustering,
regression, association rule mining, and anomaly detection.
6. Pattern Evaluation:
• Once patterns are discovered, they need to be evaluated for their
usefulness and reliability.
• This step involves assessing the quality and significance of the
discovered patterns using metrics such as accuracy, support,
confidence, and lift.

7. Knowledge Presentation:
The final step involves presenting the discovered knowledge to
the users in a format that is understandable and actionable.
This may involve
• visualization techniques,
• reports, or
• interactive tools to help users interpret and utilize the
discovered knowledge.

Relation Between KDD and Data Mining:
•KDD is the overall process, whereas data mining is a step
within it.
•Data mining involves techniques like clustering,
classification, regression, and association rule mining.
•The success of KDD depends on proper data preparation,
cleaning, and result evaluation.

DM TECHNIQUES / functionalities:
1. Classification (Predictive)
2. Clustering (Descriptive)
3. Association Rule Mining (Descriptive)
4. Prediction (Predictive)
5. Anomaly Detection (Predictive)
6. Regression Analysis (Predictive)
7. Summarization (Descriptive)

Issues of Data mining
The major issues of Data mining
1. Mining Methodology and User Interaction issues
2. Performance Issues
3. Diverse Data Types Issues

Challenges:
1. Data Quality:
o Data may contain errors, omissions, and duplications, leading to inaccurate
outcomes.
o To address these issues, data cleaning and preprocessing techniques
are applied to improve data quality.
Sl.No  Name     Course     Marks
1      Asha     BCA        9.3
2      Chaitra  Chemistry  44.5
3      Asha     BCA        9.3
4      BCA      Deepthi    0.1
3      Deepthi             6.7
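
The sample table above contains several quality problems: a duplicated record, a repeated serial number, swapped name and course values, and a missing field. A simple rule-based check can surface them, as sketched below. Reading the empty field in the last row as a missing course and the list of valid courses are assumptions made for this example.

```python
# Rows transcribed from the sample table; None marks the field assumed to be missing.
records = [
    {"sl_no": 1, "name": "Asha", "course": "BCA", "marks": 9.3},
    {"sl_no": 2, "name": "Chaitra", "course": "Chemistry", "marks": 44.5},
    {"sl_no": 3, "name": "Asha", "course": "BCA", "marks": 9.3},
    {"sl_no": 4, "name": "BCA", "course": "Deepthi", "marks": 0.1},
    {"sl_no": 3, "name": "Deepthi", "course": None, "marks": 6.7},
]
KNOWN_COURSES = {"BCA", "Chemistry"}  # assumed list of valid course names

def find_issues(records):
    """Return (serial number, problem) pairs found by simple rule-based checks."""
    issues, seen_ids, seen_rows = [], set(), set()
    for r in records:
        key = (r["name"], r["course"], r["marks"])
        if key in seen_rows:
            issues.append((r["sl_no"], "duplicate record"))
        seen_rows.add(key)
        if r["sl_no"] in seen_ids:
            issues.append((r["sl_no"], "repeated serial number"))
        seen_ids.add(r["sl_no"])
        if r["course"] is None:
            issues.append((r["sl_no"], "missing course"))
        elif r["course"] not in KNOWN_COURSES:
            issues.append((r["sl_no"], "unknown course"))
    return issues

issues = find_issues(records)
for issue in issues:
    print(issue)
```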

2. Data Complexity:
o Vast amounts of data are generated from various sources like sensors, social media,
and the Internet of Things (IoT).
o Example: soil moisture sensor data.
o Advanced techniques such as clustering, classification, and association rule mining
help identify patterns and relationships in complex data.

3. Data Privacy and Security:
o As data collection grows, so does the risk of breaches and cyber-attacks.
o Data may contain personal, sensitive, or confidential information.
o Regulations like GDPR, CCPA, and HIPAA impose strict rules on data collection, use,
and sharing.
o HIPAA (Health Insurance Portability and Accountability Act) protects
sensitive patient health information from being disclosed
without the patient’s consent.
o Data anonymization (the process of removing personally identifiable information
from data sets) and encryption techniques (converting data to ciphertext)
safeguard privacy and security.

4. Scalability:
o Data mining algorithms must handle large datasets efficiently.
o As dataset size grows, computational resources and processing time increase.
o Ensuring scalability is essential for effective data mining operations.

DATA WAREHOUSE:
A data warehouse is a centralized system designed for storing,
managing, and analyzing large volumes of structured data from
multiple sources. It is optimized for querying and reporting.

1. What is KDD? Explain the steps involved in knowledge discovery.
2. How are data mining systems classified? Discuss.
3. Explain data mining functionalities in detail.
4. Describe the major issues in data mining.
5. Define data mining. On what kinds of data can data mining be performed?
6. Differentiate between KDD and data mining.
7. Differentiate between DBMS and data mining.
8. Explain data mining techniques in detail.
9. Describe the major challenges in data mining.
10. Write a note on applications of data mining.

DATA WAREHOUSING COMPONENTS
Architecture is the proper arrangement of the elements.
We build a data warehouse with software and hardware components.

1. Data Sources
•The origin of raw data, which can come from:
• Databases (e.g., MySQL)
• CSVs, XML, JSON
• External data sources (e.g., market data, third-party data)
2. ETL (Extract, Transform, Load) Process
•Extract: Pulling data from different sources.
•Transform: Cleaning, filtering, aggregating, and converting data into a
unified format.
•Load: Storing the processed data into the data warehouse.
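
The three ETL stages above can be sketched end to end with the Python standard library: extract rows from CSV text, transform them by trimming, normalizing, and dropping incomplete rows, then load them into a SQLite table standing in for the warehouse. The CSV content, column names, and the orders table are assumptions for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with inconsistent formatting and one missing amount.
RAW_CSV = """order_id,customer,amount
1, alice ,120.50
2,BOB,80
3,alice,
"""

def extract(text):
    """Extract: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim and normalize names, convert types, drop rows without an amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append((int(row["order_id"]), row["customer"].strip().title(), float(amount)))
    return cleaned

def load(conn, rows):
    """Load: store the processed rows in the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database used as a stand-in warehouse
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT order_id, customer, amount FROM orders ORDER BY order_id").fetchall()
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```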

3. Data Storage (Data Warehouse Database)
•A centralized repository optimized for analytical queries.
•Data is structured in schemas:
• Star Schema (simpler, faster queries)
• Snowflake Schema (normalized, less redundancy)
4. Metadata Repository
•Stores information about the data, such as:
• Data source details
• Table structures and relationships
• Data lineage (where data comes from and how it's transformed)
• User access permissions
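
To make the star schema listed under Data Storage more concrete, the sketch below builds one fact table surrounded by two dimension tables in SQLite and runs a typical analytical query that joins and aggregates them. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables describe the "who/what/when" of each fact.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Bread', 'Bakery'), (2, 'Butter', 'Dairy');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 5.0), (1, 2, 20.0);
""")

# Analytical query: total sales per category and month.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(rows)
```

A snowflake schema would further split the dimensions, for example moving category into its own table referenced by dim_product, which reduces redundancy at the cost of extra joins.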

5. Query Processing & Optimization
•Optimizes query execution for fast results.
•May involve OLAP (Online Analytical Processing) for multidimensional
analysis.
6. Data Marts (Optional)
•Subsets of the data warehouse tailored for specific business
functions (e.g., finance, sales, marketing).
•Improves performance by reducing the volume of data queried.

7. Business Intelligence (BI) & Reporting Tools
•Provides visualization, reporting, and dashboards for decision-
making.
•Common BI tools:
• Power BI, Tableau, Google Data Studio
8. Data Governance & Security
•Ensures data quality, consistency, and access control.
•Implements role-based access, encryption, and auditing.
