FUNDAMENTALS OF DATA
SCIENCE
fundamentals_of_data_science_and_its_intro.pptx
Data Mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
• Other terms for data mining include knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.
• Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD.
• The data can be structured, semi-structured or unstructured, and can be
stored in various forms such as databases, data warehouses, and data
lakes.
Big Data is characterized by huge volume, high velocity, and wide
variety of data.
There are 3 types: Structured data, Semi-structured data, and
Unstructured data.
Structured data:
• Structured data refers to data that is organized and
formatted in a specific way to make it easily readable and
understandable by both humans and machines.
• Structured data is typically found in databases and
spreadsheets, and is characterized by its organized nature.
Semi-structured data:
• Semi-structured data is neither purely structured nor
completely unstructured.
• It contains some level of organization or structure, but does
not conform to a rigid schema or data model, and may
contain elements that are not easily categorized or classified.
E.g.: XML documents, e-mails
Unstructured data:
• Unstructured data is data that is not organized in a
predefined manner.
E.g.: text documents, images, video, etc.
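The difference between the three categories can be sketched in a few lines of Python. This is only an illustration: the sample records and the helper name `has_fixed_schema` are invented here and do not come from the slides.

```python
import json

# Structured: every record has the same fixed fields, like rows in a table.
structured_rows = [
    {"id": 1, "name": "Asha", "course": "BCA"},
    {"id": 2, "name": "Chaitra", "course": "Chemistry"},
]

# Semi-structured: self-describing JSON whose fields may differ per record.
semi_structured = json.loads(
    '[{"id": 1, "tags": ["new"]}, {"id": 2, "email": {"subject": "Hi"}}]'
)

# Unstructured: free text with no predefined fields at all.
unstructured = "The product arrived late but works well."


def has_fixed_schema(records):
    """Return True when every record exposes exactly the same set of keys."""
    key_sets = {frozenset(record.keys()) for record in records}
    return len(key_sets) == 1


print(has_fixed_schema(structured_rows))   # same keys in every record
print(has_fixed_schema(semi_structured))   # keys vary between records
```

The free-text string has no keys to inspect at all, which is why unstructured data usually needs a feature-extraction step before mining.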
• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions
or predictions.
• This involves exploring the data using various techniques such as
clustering, classification, regression analysis, association rule mining,
and anomaly detection.
1. Business & Marketing
•Customer Segmentation – Identifying groups of customers with similar
behaviors (purchasing behavior, interests, and preferences) for targeted
marketing.
•Market Basket Analysis – Finding product associations (e.g., Amazon’s
"Customers who bought this also bought...").
•Churn Prediction – Predicting customers likely to leave a service (e.g., cancel
subscriptions, close accounts).
•Fraud Detection – Identifying fraudulent activities in financial transactions,
insurance claims, and online activities.
2. Healthcare & Medicine
•Disease Prediction & Diagnosis – Using historical medical data to
predict diseases (e.g., cancer detection).
•Drug Discovery – Analysing drug interactions and predicting new drug
formulations.
•Personalized Medicine – Recommending treatments based on patient
data.
3. Finance & Banking
•Credit Scoring & Risk Assessment – Evaluating loan applications based
on past data.
•Algorithmic Trading – Predicting stock market trends using historical
data.
•Money Laundering Detection – Identifying suspicious financial activities.
4. Manufacturing & Industry
•Predictive Maintenance – Predicting machine failures before they occur.
•Supply Chain Optimization – Enhancing logistics and inventory
management.
•Quality Control – Detecting defects in manufacturing.
5. Education
•Student Performance Prediction – Identifying students at risk of failing.
•Personalized Learning – Recommending learning materials based on
student progress.
•Dropout Prediction – Understanding factors that lead to student dropouts.
6. Social Media & Web
•Sentiment Analysis – Understanding public opinions from social media.
•Recommendation Systems – Suggesting movies, music, or books (e.g.,
Netflix, Spotify).
•Fake News Detection – Identifying misinformation and fake news.
7. Cybersecurity
•Intrusion Detection – Detecting cyber attacks on networks.
•Malware Analysis – Identifying patterns in malicious software.
8. Agriculture
•Crop Yield Prediction – Forecasting agricultural output using weather
and soil data.
•Pest Detection – Identifying pests using image data.
What data can be mined ?
• Data mining can be applied to structured, semi-structured, and
unstructured data across multiple domains.
• The choice of data type depends on the problem being solved, the
available tools, and the computational resources.
Transactional Data
Transactional data records the "who," "what," "when," and "where" of each transaction. Examples
include the products purchased, the customer, the date and time, total spent,
applied discounts, and payment method.
1. Structured Data (Traditional Databases)
•Relational Databases (RDBMS): Data stored in structured formats such as tables with
rows and columns (e.g., MySQL, PostgreSQL, Oracle).
•Data Warehouses: Integrated data from multiple sources optimized for analytics.
•Transactional Data: Data from online transaction processing (OLTP) systems such as
banking records, purchase transactions, and financial statements.
Example
•Customer purchase records in a retail store database.
•Banking transaction logs for fraud detection.
2. Semi-Structured Data
•XML and JSON Data: Data stored in hierarchical or nested formats.
•Logs and Event Data: Web server logs, application logs, or system monitoring data.
•Emails and Messages: Textual data with some structure (headers, timestamps).
Example
•Mining email logs to detect phishing attempts.
•Analyzing JSON-formatted IoT device data for predictive maintenance.
3. Unstructured Data
•Text Data: Documents, articles, social media posts, customer reviews.
•Multimedia Data: Images, audio, and video files.
•Sensor and IoT Data: Data collected from sensors, smart devices, and industrial
equipment.
Example
•Analyzing tweets to detect trending topics.
•Mining CCTV footage for facial recognition in security applications.
4. Spatial Data
•Geospatial Data: Maps, GPS data, satellite images, and geographic information
system (GIS) datasets.
•Location-Based Data: User location logs from mobile apps.
Example
•Identifying crime hotspots based on GPS data.
•Mining traffic patterns to optimize city road networks.
5. Time-Series Data
•Stock Market Data: Historical stock prices, trading volumes.
•Weather Data: Temperature, humidity.
•IoT Sensor Readings: Continuous data streams from smart meters.
Example
•Predicting stock price movements using historical trading data.
•Tracking activity over time in timestamped security camera footage.
6. Web and Social Media Data
•Clickstream Data: User navigation patterns on websites.
•Social Network Data: Relationships, interactions, and sentiment analysis.
•Search Engine Logs: Queries, user behavior, and recommendation insights.
Example
•Recommending products based on browsing history.
•Detecting fake news using social media analysis.
Types of data mining:
1. Descriptive Data Mining
2. Predictive Data Mining
1. Descriptive Data Mining
This type focuses on identifying patterns, trends, and relationships in historical data
without making predictions.
Techniques
•Association Rule Mining: Identifies relationships between items in large datasets (e.g.,
Market Basket Analysis).
•Clustering: Groups similar data points together (e.g., customer segmentation).
•Summarization: Provides concise representations of datasets (e.g., data aggregation in
reports).
Example
•Discovering that customers who buy bread often buy butter.
•Segmenting customers into different groups based on purchasing behavior.
2. Predictive Data Mining
This type focuses on making predictions based on past data using machine learning
and statistical methods.
Techniques
•Classification: Assigns labels to data points (e.g., spam vs. non-spam emails).
•Regression Analysis: Predicts continuous values (e.g., house prices, stock prices).
•Time-Series Analysis: Forecasts future trends based on historical data.
Example
•Predicting customer churn in a telecom company.
•Forecasting stock prices based on historical trends.
Functionalities of data mining
1. Classification (predictive)
2. Clustering (descriptive)
3. Association Rule Mining (descriptive)
4. Prediction (predictive)
5. Anomaly Detection (predictive)
6. Regression Analysis (predictive)
7. Summarization (descriptive)
1. Classification
Classification is a supervised learning technique used to assign predefined
labels (categories) to data based on learned patterns.
How Classification Works:
1. Training Phase – A model learns from labeled training data.
2. Testing Phase – The model predicts labels for new, unseen data.
3. Evaluation – Accuracy, precision, recall, and F1-score measure
performance.
Common Classification Algorithms:
Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF)
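The training, testing, and evaluation phases can be sketched with a tiny 1-nearest-neighbour classifier written in plain Python. The data is made up for illustration (one feature, the count of suspicious words in an email, with label 1 meaning spam), and the function names are hypothetical rather than taken from any library.

```python
def nearest_neighbor_predict(train_points, train_labels, x):
    """1-nearest-neighbour: return the label of the closest training point."""
    distances = [abs(p - x) for p in train_points]
    return train_labels[distances.index(min(distances))]


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Training phase: labelled examples (assumed toy data).
train_x = [0, 1, 2, 7, 8, 9]
train_y = [0, 0, 0, 1, 1, 1]

# Testing phase: predict labels for unseen examples.
test_x = [1, 3, 6, 8, 4]
test_y = [0, 0, 1, 1, 1]
predictions = [nearest_neighbor_predict(train_x, train_y, x) for x in test_x]

# Evaluation phase: compare predictions with the true labels.
metrics = evaluate(test_y, predictions)
print(predictions, metrics)
```

In this toy run one spam message is missed, so recall drops while precision stays perfect, which is why a single accuracy figure is rarely enough to judge a classifier.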
Example Use Cases:
✅ Spam Detection – Classify emails as spam or not spam.
✅ Disease Diagnosis – Predict if a patient has a disease based on symptoms.
✅ Sentiment Analysis – Determine if a review is positive or negative.
E.g., "Very interesting movie" (positive),
"Disgusting product" (negative)
✅ Fraud Detection – Identify fraudulent transactions.
2. Clustering
Clustering is an unsupervised learning technique used in data
mining to group similar data points together based on their
characteristics. It helps in discovering patterns and structures in
large datasets without predefined labels.
Clustering Techniques:
K-means, Hierarchical
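A minimal sketch of k-means on one-dimensional data follows. It assumes the starting centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic), and the monthly-spend values are invented for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical monthly spend for six customers, with k = 2.
centroids, clusters = k_means_1d([10, 12, 11, 50, 52, 49], [10, 50])
print(centroids, clusters)
```

No labels are supplied anywhere: the two groups (low and high spenders) emerge only from the distances between values.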
Example Use Cases:
✅ Customer Segmentation – Grouping customers based on shopping behavior.
✅ Image Segmentation – Dividing an image into different regions.
E.g., Autonomous vehicles:
•Lane detection for self-driving cars
•Pedestrian and obstacle recognition
•Traffic sign and signal identification
✅ Anomaly Detection – Identifying outliers in financial transactions.
E.g., intrusion detection in cybersecurity, defect detection in manufacturing.
✅ Document Clustering – Grouping similar articles or documents.
E.g., news recommendation, which groups articles by topic (politics, sports,
technology).
3. Association Rule Mining
Association Rule Mining is a data mining technique used to find relationships or patterns
between different items in a dataset. It is commonly applied in market basket analysis,
recommendation systems, and fraud detection.
Algorithms: Apriori, FP-Growth.
Example: Market Basket Analysis
•Rule: {Bread} → {Butter}
• Meaning: Customers who buy bread are likely to buy butter.
•Rule: {Milk, Cereal} → {Banana}
• Meaning: If someone buys milk and cereal, they often buy bananas.
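How strongly a rule such as {Bread} → {Butter} holds is usually measured with support and confidence. The sketch below computes both by brute force over a handful of invented baskets; it is not an implementation of Apriori or FP-Growth, which exist to avoid this exhaustive counting on large datasets.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal", "banana"},
    {"milk", "cereal"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the share that also have the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


bread_butter = confidence({"bread"}, {"butter"}, transactions)
cereal_banana = confidence({"milk", "cereal"}, {"banana"}, transactions)
print(round(bread_butter, 3), round(cereal_banana, 3))
```

With these baskets, two of the three bread purchases include butter, so the first rule has a confidence of about 0.67.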
Applications of Association Rule Mining:
Retail & Market Basket Analysis – Finding frequently purchased product
combinations.
E-commerce & Recommendations – Suggesting products based on previous
purchases.
Healthcare – Finding correlations between symptoms and diseases.
E.g., Diabetes → Hypertension (patients with diabetes are more likely to
have hypertension).
Obesity + Smoking → Increased Heart Disease Risk.
4. Anomaly Detection
Anomaly detection identifies unusual patterns in data that do not
conform to expected behavior. It is used in fraud detection,
cybersecurity, healthcare, and predictive maintenance.
Applications of Anomaly Detection:
✅ Fraud detection (Credit card transactions, insurance fraud)
✅ Cybersecurity (Intrusion detection, DDoS attack detection)
✅ Healthcare (Detecting anomalies in medical scans or ECG signals)
✅ Manufacturing (Predictive maintenance for machinery failures)
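One simple way to flag values that do not conform to expected behavior is a z-score test, sketched below. The slides do not name a specific method, so this choice, the threshold of 2.0, and the transaction amounts are all assumptions made for illustration.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = z_score_outliers(amounts)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples a robust alternative (for example, one based on the median) may behave better.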
5. Prediction
Prediction involves using historical data to forecast future values.
Applications of Prediction Models:
✅ Stock Market Forecasting (Predicting stock prices)
✅ Sales Forecasting (Estimating future sales based on past data)
✅ Weather Prediction (Forecasting temperature & rainfall)
✅ Disease Prediction (Predicting heart disease risk)
✅ Energy Consumption Forecasting (Predicting electricity demand)
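As a very small illustration of forecasting from historical values, the sketch below predicts the next figure as the average of the most recent observations. A moving average is only one of many possible approaches and is chosen here for brevity; the sales numbers and the window size are invented.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("not enough history for the chosen window")
    return sum(history[-window:]) / window


# Hypothetical monthly sales figures.
sales = [100, 110, 120, 130, 140]
next_month = moving_average_forecast(sales)
print(next_month)
```

Because it averages past values, this method lags behind a steady upward trend, which is one reason regression and dedicated time-series models are used in practice.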
6. Regression
Regression analysis is a predictive modeling technique used to analyze
relationships between a dependent variable (target) and one or more
independent variables (features). It is widely used in finance, economics,
healthcare, and machine learning.
Types of Regression Analysis
1. Linear Regression – Models the relationship between variables using a straight line.
   Example: House price prediction based on square footage.
2. Multiple Linear Regression – Uses multiple independent variables.
   Example: Salary prediction based on experience, education, and skills.
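Simple linear regression can be sketched with the ordinary least-squares formulas for slope and intercept. The house sizes and prices below are invented and lie exactly on a line so the result is easy to check by hand; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: square footage (feature) and price (target).
square_feet = [1000, 1500, 2000, 2500]
prices = [200000, 300000, 400000, 500000]

slope, intercept = fit_line(square_feet, prices)
predicted_price = slope * 1800 + intercept
print(slope, intercept, predicted_price)
```

Multiple linear regression follows the same idea with one coefficient per independent variable, and is normally solved with matrix methods rather than these closed-form sums.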
7. Summarization (Descriptive)
Summarization in data mining refers to the process of extracting key information
from large datasets to create a compact, high-level representation. This helps in quick
decision-making, pattern recognition, and data exploration.
Applications of Summarization in Data Mining
✅ Business Intelligence – Summarizing sales trends, customer behavior.
✅ Healthcare – Summarizing patient records and medical reports.
✅ Text Mining – Extracting summaries from legal documents, news articles.
✅ Cybersecurity – Summarizing network activity logs to detect threats.
✅ Social Media Analytics – Summarizing trends from tweets, posts, and reviews.
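A compact, high-level representation is often just a set of aggregates per group. The sketch below summarizes invented sales records by region; the field layout and the `summarize` helper are assumptions for illustration.

```python
from collections import defaultdict
import statistics

# Hypothetical raw records: (region, sale amount).
sales = [
    ("North", 120.0),
    ("South", 80.0),
    ("North", 100.0),
    ("South", 60.0),
    ("North", 140.0),
]


def summarize(records):
    """Reduce raw records to count, total and mean per region."""
    groups = defaultdict(list)
    for region, amount in records:
        groups[region].append(amount)
    return {
        region: {
            "count": len(amounts),
            "total": sum(amounts),
            "mean": statistics.mean(amounts),
        }
        for region, amounts in groups.items()
    }


summary = summarize(sales)
print(summary)
```

Five rows collapse into two summary lines here; on real data the same idea turns millions of transactions into a report a person can read.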
Knowledge Discovery in Databases (KDD)
KDD (Knowledge Discovery in Databases) is the process of discovering
useful knowledge from large datasets.
It is a broad concept that includes data mining as one of its key steps.
The KDD process consists of several stages that transform raw data into
meaningful patterns or knowledge.
1. Data Integration
• Data integration is the process of combining data from different
sources in a common place, such as a data warehouse.
2. Data Cleaning
• This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates.
• Missing Data:
This situation arises when some values are absent from the dataset.
• It can be handled in various ways. Some of them are:
• Ignore the tuples:
This approach is suitable only when the dataset is quite
large and multiple values are missing within a tuple.
• Fill the missing values:
Fill them in manually, with the attribute mean, or with the most
probable value.
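The two ways of handling missing data can be sketched as follows. The records are invented, `None` stands for a missing value, and the helper names are hypothetical.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"name": "Asha", "marks": 9.3},
    {"name": "Chaitra", "marks": None},
    {"name": "Deepthi", "marks": 6.7},
]


def drop_incomplete(rows):
    """Ignore the tuples: keep only rows with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_with_mean(rows, field):
    """Fill missing values of `field` with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    mean = sum(known) / len(known)
    return [
        {**r, field: mean if r[field] is None else r[field]}
        for r in rows
    ]


complete_rows = drop_incomplete(rows)
filled_rows = fill_with_mean(rows, "marks")
print(len(complete_rows), filled_rows[1])
```

Dropping rows loses a third of this tiny dataset, which shows why that option suits only large datasets, while mean filling keeps every row at the cost of an estimated value.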
5. Data Mining:
• This is the core step of the KDD process, where various data
mining techniques are applied to discover patterns, associations,
trends, or anomalies in the data.
• Common data mining techniques include classification, clustering,
regression, association rule mining, and anomaly detection.
6. Pattern Evaluation:
• Once patterns are discovered, they need to be evaluated for their
usefulness and reliability.
• This step involves assessing the quality and significance of the
discovered patterns using metrics such as accuracy, support,
confidence, and lift.
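For reference, support, confidence, and lift are commonly defined as below for a rule $A \Rightarrow B$ over $N$ transactions, where $\mathrm{count}(X)$ is the number of transactions containing every item of $X$. This is the usual textbook formulation rather than something stated in the slides.

```latex
\begin{align}
\mathrm{support}(X) &= \frac{\mathrm{count}(X)}{N} \\
\mathrm{confidence}(A \Rightarrow B) &= \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
\end{align}
```

A lift above 1 suggests that $A$ and $B$ occur together more often than they would if they were independent, a lift near 1 suggests no association, and a lift below 1 suggests that they tend to exclude each other.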
7. Knowledge Presentation:
The final step involves presenting the discovered knowledge to
the users in a format that is understandable and actionable.
This may involve
• visualization techniques,
• reports, or
• interactive tools to help users interpret and utilize the
discovered knowledge.
Relation Between KDD and Data Mining:
•KDD is the overall process, whereas data mining is a step
within it.
•Data mining involves techniques like clustering,
classification, regression, and association rule mining.
•The success of KDD depends on proper data preparation,
cleaning, and result evaluation.
DM TECHNIQUES / functionalities:
1. Classification (predictive)
2. Clustering (descriptive)
3. Association Rule Mining (descriptive)
4. Prediction (predictive)
5. Anomaly Detection (predictive)
6. Regression Analysis (predictive)
7. Summarization (descriptive)
Issues of Data mining
The major issues of Data mining
1. Mining Methodology and User Interaction issues
2. Performance Issues
3. Diverse Data Types Issues
Challenges:
1. Data Quality:
o Data may contain errors, omissions, and duplications, leading to inaccurate
outcomes.
o To address these issues, data cleaning and preprocessing techniques
are applied to improve data quality.
Sample records with data quality problems:
Sl.No  Name     Course     Marks
1      Asha     BCA        9.3
2      Chaitra  Chemistry  44.5
3      Asha     BCA        9.3
4      BCA      Deepthi    0.1
3      Deepthi             6.7
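Some of the problems in the table above can be detected mechanically. The sketch below transcribes the five rows and checks for duplicated content, repeated serial numbers, and missing fields. It assumes the last row is missing its Course value (recorded as `None`), and the helper names are hypothetical.

```python
from collections import Counter

# Rows transcribed from the table; None marks the assumed-missing course.
records = [
    (1, "Asha", "BCA", 9.3),
    (2, "Chaitra", "Chemistry", 44.5),
    (3, "Asha", "BCA", 9.3),
    (4, "BCA", "Deepthi", 0.1),
    (3, "Deepthi", None, 6.7),
]


def duplicate_rows(rows):
    """Indexes of rows whose content (ignoring Sl.No) was already seen."""
    seen, dupes = set(), []
    for i, row in enumerate(rows):
        content = row[1:]
        if content in seen:
            dupes.append(i)
        seen.add(content)
    return dupes


def repeated_serials(rows):
    """Serial numbers that are used by more than one row."""
    counts = Counter(row[0] for row in rows)
    return sorted(serial for serial, n in counts.items() if n > 1)


def rows_with_missing(rows):
    """Indexes of rows that contain at least one missing field."""
    return [i for i, row in enumerate(rows) if any(v is None for v in row)]


print(duplicate_rows(records), repeated_serials(records), rows_with_missing(records))
```

The swapped Name and Course values in the fourth row are not caught by these checks; detecting that kind of error needs domain knowledge, such as a list of valid course names.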
2. Data Complexity:
o Vast amounts of data are generated from various sources like sensors, social media,
and the Internet of Things (IoT).
o Example: soil moisture sensor data.
o Advanced techniques such as clustering, classification, and association rule mining
help identify patterns and relationships in complex data.
3. Data Privacy and Security:
o As data collection grows, so does the risk of breaches and cyber-attacks.
o Data may contain personal, sensitive, or confidential information.
o Regulations like GDPR, CCPA, and HIPAA impose strict rules on data collection, use,
and sharing.
o HIPAA (Health Insurance Portability and Accountability Act) protects
sensitive patient health information from being disclosed
without the patient’s consent.
o Data anonymization (the process of removing personally identifiable information
from data sets) and
o encryption techniques (converting data to ciphertext) safeguard privacy and security.
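The idea of removing personally identifiable information before mining can be sketched with a salted hash from Python's standard `hashlib` module. Strictly speaking this is pseudonymization, a weaker guarantee than full anonymization, and it is shown here only as an illustration. The record, the salt value, and the helper names are invented.

```python
import hashlib


def pseudonymize(value, salt):
    """Replace a value with a short, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def strip_identifiers(record, id_fields, salt):
    """Return a copy of the record with identifying fields pseudonymized."""
    return {
        key: pseudonymize(str(value), salt) if key in id_fields else value
        for key, value in record.items()
    }


# Hypothetical patient record; only the name is treated as identifying.
patient = {"name": "Asha", "age": 34, "diagnosis": "hypertension"}
safe_record = strip_identifiers(patient, {"name"}, salt="demo-salt")
print(safe_record)
```

The same name always maps to the same token, so records can still be linked for analysis. The remaining fields may still identify a person in combination, which is why real anonymization usually also generalizes or suppresses such attributes.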
4. Scalability:
o Data mining algorithms must handle large datasets efficiently.
o As dataset size grows, computational resources and processing time increase.
o Ensuring scalability is essential for effective data mining operations.
DATA WAREHOUSE:
A data warehouse is a centralized system designed for storing,
managing, and analyzing large volumes of structured data from
multiple sources. It is optimized for querying and reporting.
1. What is KDD? Explain the steps involved in knowledge discovery.
2. How are data mining systems classified? Discuss.
3. Explain data mining functionalities in detail.
4. Describe the major issues in data mining.
5. Define data mining. On what kinds of data can data mining be performed?
6. Differentiate between KDD and data mining.
7. Differentiate between DBMS and data mining.
8. Explain data mining techniques in detail.
9. Describe the major challenges in data mining.
10. Write a note on applications of data mining.
DATA WAREHOUSING COMPONENTS
Architecture is the proper arrangement of the elements.
We build a data warehouse with software and hardware components.
1. Data Sources
•The origin of raw data, which can come from:
• Databases (e.g., MySQL)
• CSVs, XML, JSON
• External data sources (e.g., market data, third-party data)
2. ETL (Extract, Transform, Load) Process
•Extract: Pulling data from different sources.
•Transform: Cleaning, filtering, aggregating, and converting data into a
unified format.
•Load: Storing the processed data into the data warehouse.
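The three ETL stages can be sketched end to end with Python's standard `csv` and `sqlite3` modules. The in-memory SQLite database stands in for a real warehouse, and the CSV content, table name, and cleaning rules are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent casing, stray spaces and a gap.
RAW_CSV = "customer,amount\nasha, 120.50\nCHAITRA,80\ndeepthi,\n"


def extract(text):
    """Extract: pull rows out of the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, convert amounts, and filter incomplete rows."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # filter out rows with a missing amount
        cleaned.append((row["customer"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: store the processed rows in the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


etl_conn = sqlite3.connect(":memory:")
clean_rows = transform(extract(RAW_CSV))
load(clean_rows, etl_conn)
total_sales = etl_conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(clean_rows, total_sales)
```

Keeping the three stages as separate functions mirrors the bullet list: each one can be tested or replaced (for example, a different source in `extract`) without touching the others.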
3. Data Storage (Data Warehouse Database)
•A centralized repository optimized for analytical queries.
•Data is structured in schemas:
• Star Schema (simpler, faster queries)
• Snowflake Schema (normalized, less redundancy)
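A star schema places one fact table in the middle and joins it directly to each dimension table. The sketch below builds a tiny one in an in-memory SQLite database; the table names, columns, and rows are invented for illustration.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
-- Dimension tables describe the "who/what/when" of each sale.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER
);
-- The fact table holds the measures plus a key into every dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Bread', 'Bakery'), (2, 'Butter', 'Dairy');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES (1, 1, 30.0), (2, 1, 45.0), (1, 2, 25.0);
""")

# Analytical query: total sales per product category (one join per dimension).
sales_by_category = warehouse.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(sales_by_category)
```

Here the category is stored directly in `dim_product`, which keeps the query to a single join. A snowflake schema would move categories into their own table, reducing redundancy at the cost of an extra join.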
4. Metadata Repository
•Stores information about the data, such as:
• Data source details
• Table structures and relationships
• Data lineage (where data comes from and how it's transformed)
• User access permissions
5. Query Processing & Optimization
•Optimizes query execution for fast results.
•May involve OLAP (Online Analytical Processing) for multidimensional
analysis.
6. Data Marts (Optional)
•Subsets of the data warehouse tailored for specific business
functions (e.g., finance, sales, marketing).
•Improves performance by reducing the volume of data queried.
7. Business Intelligence (BI) & Reporting Tools
•Provides visualization, reporting, and dashboards for decision-making.
•Common BI tools:
• Power BI, Tableau, Google Data Studio (now Looker Studio)
8. Data Governance & Security
•Ensures data quality, consistency, and access control.
•Implements role-based access, encryption, and auditing.

fundamentals_of_data_science_and_its_intro.pptx

  • 4. Data Mining • Data mining is the process of extracting knowledge or insights from large amounts of data using various statistical and computational techniques. • other terms - knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. • Manypeopletreatdataminingasasynonymforanotherpopularlyusedterm, Knowledge Discovery from Data, or KDD. • The data can be structured, semi-structured or unstructured, and can be stored in various forms such as databases, data warehouses, and data lakes.
  • 6. Big Data includes huge volume, high velocity, and extensible variety of data. There are 3 types: Structured data, Semi-structured data, and Unstructured data. Structured data: • Structured data refers to data that is organized and formatted in a specific way to make it easily readable and understandable by both humans and machines. • Structured data is typically found in databases and spreadsheets, and is characterized by its organized nature.
  • 7. Semi-structured data : Is a type of data that is not purely structured, but also not completely unstructured. It contains some level of organization or structure, but does not conform to a rigid schema or data model, and may contain elements that are not easily categorized or classified. Eg: XML document, E-mails Unstructured data – Unstructured data is a data which is not organized in a predefined manner . Eg: text document, images, video etc
  • 8. • The primary goal of data mining is to discover hidden patterns and relationships in the data that can be used to make informed decisions or predictions. • This involves exploring the data using various techniques such as clustering, classification, regression analysis, association rule mining, and anomaly detection.
  • 11. 1. Business & Marketing •Customer Segmentation – Identifying groups of customers with similar behaviors (purchasing behavior, interests, and preferences.) for targeted marketing. •Market Basket Analysis – Finding product associations (e.g., Amazon’s "Customers who bought this also bought..."). •Churn Prediction – Predicting customers likely to leave a service (e.g., cancel subscriptions, close accounts). •Fraud Detection – identifies fraudulent activities in financial transactions, insurance claims, and online activities.
  • 12. 2. Healthcare & Medicine •Disease Prediction & Diagnosis – Using historical medical data to predict diseases (e.g., cancer detection). •Drug Discovery – Analysing drug interactions and predicting new drug formulations. •Personalized Medicine – Recommending treatments based on patient data.
  • 13. 3. Finance & Banking •Credit Scoring & Risk Assessment – Evaluating loan applications based on past data. •Algorithmic Trading – Predicting stock market trends using historical data. •Money Laundering Detection – Identifying suspicious financial activities. 4. Manufacturing & Industry •Predictive Maintenance – Predicting machine failures before they occur. •Supply Chain Optimization – Enhancing logistics and inventory management. •Quality Control – Detecting defects in manufacturing.
  • 14. 5. Education •Student Performance Prediction – Identifying students at risk of failing. •Personalized Learning – Recommending learning materials based on student progress. •Dropout Prediction – Understanding factors that lead to student dropouts. 6. Social Media & Web •Sentiment Analysis – Understanding public opinions from social media. •Recommendation Systems – Suggesting movies, music, or books (e.g., Netflix, Spotify). •Fake News Detection – Identifying misinformation and fake news.
  • 15. 7. Cybersecurity •Intrusion Detection – Detecting cyber attacks on networks. •Malware Analysis – Identifying patterns in malicious software. 8. Agriculture •Crop Yield Prediction – Forecasting agricultural output using weather and soil data. •Pest Detection – Identifying pests using image data.
  • 16. What data can be mined ? • Data mining can be applied to structured, semi-structured, and unstructured data across multiple domains. • The choice of data type depends on the problem being solved, the available tools, and the computational resources.
  • 19. It records the "who," "what," "when," and "where" of each transaction. Examples include the products purchased, the customer, the date and time, total spent, applied discounts, and payment method. Transactional Data
  • 20. 1. Structured Data (Traditional Databases) •Relational Databases (RDBMS): Data stored in structured formats such as tables with rows and columns (e.g., MySQL, PostgreSQL, Oracle). •Data Warehouses: Integrated data from multiple sources optimized for analytics. •Transactional Data: Data from online transaction processing (OLTP) systems such as banking records, purchase transactions, and financial statements. Example •Customer purchase records in a retail store database. •Banking transaction logs for fraud detection.
  • 21. 2. Semi-Structured Data •XML and JSON Data: Data stored in hierarchical or nested formats. •Logs and Event Data: Web server logs, application logs, or system monitoring data. •Emails and Messages: Textual data with some structure (headers, timestamps). Example •Mining email logs to detect phishing attempts. •Analyzing JSON-formatted IoT device data for predictive maintenance. What data can be mined ?
  • 23. 3. Unstructured Data •Text Data: Documents, articles, social media posts, customer reviews. •Multimedia Data: Images, audio, and video files. •Sensor and IoT Data: Data collected from sensors, smart devices, and industrial equipment. Example •Analyzing tweets to detect trending topics. •Mining CCTV footage for facial recognition in security applications. What data can be mined ?
  • 24. 4. Spatial Data •Geospatial Data: Maps, GPS data, satellite images, and geographic information system (GIS) datasets. •Location-Based Data: User location logs from mobile apps. Example •Identifying crime hotspots based on GPS data. •Mining traffic patterns to optimize city road networks. 5. Time-Series Data •Stock Market Data: Historical stock prices, trading volumes. •Weather Data: Temperature, humidity. •IoT Sensor Readings: Continuous data streams from smart meters. Example •Predicting stock price movements using historical trading data. •A security camera capturing video footage.
  • 25. 6. Web and Social Media Data •Clickstream Data: User navigation patterns on websites. •Social Network Data: Relationships, interactions, and sentiment analysis. •Search Engine Logs: Queries, user behavior, and recommendation insights. Example •Recommending products based on browsing history. •Detecting fake news using social media analysis
  • 26. Types of data mining : 1. Descriptive Data Mining: 2. Predictive Data Mining:
  • 27. 1. Descriptive Data Mining This type focuses on identifying patterns, trends, and relationships in historical data without making predictions. Techniques •Association Rule Mining: Identifies relationships between items in large datasets (e.g., Market Basket Analysis). •Clustering: Groups similar data points together (e.g., customer segmentation). •Summarization: Provides concise representations of datasets (e.g., data aggregation in reports). Example •Discovering that customers who buy bread often buy butter. •Segmenting customers into different groups based on purchasing behavior.
  • 28. 2. Predictive Data Mining This type focuses on making predictions based on past data using machine learning and statistical methods. Techniques •Classification: Assigns labels to data points (e.g., spam vs. non-spam emails). •Regression Analysis: Predicts continuous values (e.g., house prices, stock prices). •Time-Series Analysis: Forecasts future trends based on historical data. Example •Predicting customer churn in a telecom company. •Forecasting stock prices based on historical trends.
  • 29. Functionalities of data mining 1. Classification (predictive ) 2. Clustering (Descriptive) 3. Association Rule mining (Descriptive) 4. Prediction (predictive) 5. Anomaly Detection (predictive) 6. Regression Analysis (predictive) 7. Summarization (descriptive)
  • 30. 1. Classification Classification is a supervised learning technique used to assign predefined labels (categories) to data based on learned patterns. How Classification Works: 1. Training Phase – A model learns from labeled training data. 2. Testing Phase – The model predicts labels for new, unseen data. 3. Evaluation – Accuracy, precision, recall, and F1-score measure performance. Common Classification Algorithms: Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF)
  • 31. Example Use Cases: ✅ Spam Detection – Classify emails as spam or not spam. ✅ Disease Diagnosis – Predict if a patient has a disease based on symptoms. ✅ Sentiment Analysis – Determine if a review is positive or negative. Eg. Very interesting movie, Disgusting product ✅ Fraud Detection – Identify fraudulent transactions.
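The evaluation step of classification can be sketched in a few lines. The example below is a minimal illustration, not part of the original slides: the labels are made-up spam/ham predictions, and the helper names `confusion_counts` and `evaluate` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for a binary classifier."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions for eight emails.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Here one spam email is missed and one legitimate email is flagged, so all four metrics come out at 0.75. On imbalanced data, such as fraud detection, accuracy alone can look good while recall is poor, which is why the four metrics are usually reported together.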
  • 32. 2. Clustering Clustering is an unsupervised learning technique used in data mining to group similar data points together based on their characteristics. It helps in discovering patterns and structures in large datasets without predefined labels. Clustering Techniques: K-means, Hierarchical
  • 33. Example Use Cases: ✅ Customer Segmentation – Grouping customers based on shopping behavior. ✅ Image Segmentation – Dividing an image into different regions. Eg. Autonomous Vehicles •Lane detection for self-driving cars •Pedestrian and obstacle recognition •Traffic sign and signal identification ✅ Anomaly Detection – Identifying outliers in financial transactions. Eg. Intrusion detection in cybersecurity, Defect detection in manufacturing ✅ Document Clustering – Grouping similar articles or documents. Eg News recommendation – Groups articles by topic (e.g., politics, sports, technology).
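As an illustration of K-means, the sketch below alternates the two standard steps (assign each point to its nearest centroid, then move each centroid to the mean of its points) until nothing changes. The customer data and the choice of starting centroids are assumptions made for the example; real implementations usually pick starting centroids randomly or with a scheme such as k-means++.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means on tuples of numbers, starting from given centroids."""
    centroids = [tuple(float(v) for v in c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend).
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]
centroids, clusters = kmeans(customers, [customers[0], customers[3]])
print(centroids)
print(clusters)
```

The two resulting groups correspond to occasional low spenders and frequent high spenders, which is the customer segmentation use case listed above. Because spend has a much larger range than visits, it dominates the distance; scaling the features first is a common refinement.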
  • 34. 3. Association Rule Mining Association Rule Mining is a data mining technique used to find relationships or patterns between different items in a dataset. It is commonly applied in market basket analysis, recommendation systems, and fraud detection. Algorithms: Apriori, FP-Growth. Example: Market Basket Analysis •Rule: {Bread} → {Butter} • Meaning: Customers who buy bread are likely to buy butter. •Rule: {Milk, Cereal} → {Banana} • Meaning: If someone buys milk and cereal, they often buy bananas.
  • 35. Applications of Association Rule Mining: Retail & Market Basket Analysis – Finding frequently purchased product combinations. E-commerce & Recommendations – Suggesting products based on previous purchases. Healthcare – Finding correlations between symptoms and diseases. Examples: Diabetes → Hypertension (patients with diabetes are more likely to have hypertension); Obesity + Smoking → Increased Heart Disease Risk.
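A rule such as {Bread} → {Butter} is usually judged by its support, confidence, and lift. The sketch below computes these three measures directly for one rule. It is not an implementation of Apriori or FP-Growth, which search for frequent itemsets efficiently; the transactions and function names are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    sup_rule = support(transactions, set(antecedent) | set(consequent))
    sup_a = support(transactions, antecedent)
    sup_c = support(transactions, consequent)
    if sup_a == 0 or sup_c == 0:
        raise ValueError("antecedent and consequent must both occur")
    confidence = sup_rule / sup_a   # estimate of P(consequent | antecedent)
    lift = confidence / sup_c       # above 1 suggests a positive association
    return sup_rule, confidence, lift


# Hypothetical market baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "cereal", "banana"},
    {"bread", "butter", "jam"},
]

sup, conf, lift = rule_metrics(transactions, {"bread"}, {"butter"})
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

In this toy data, bread and butter appear together in 3 of 5 baskets (support 0.60), butter appears in 3 of the 4 bread baskets (confidence 0.75), and the lift is 1.25, meaning butter is bought more often alongside bread than it is overall.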
  • 36. 4. Anomaly Detection Anomaly detection identifies unusual patterns in data that do not conform to expected behavior. It is used in fraud detection, cybersecurity, healthcare, and predictive maintenance. Applications of Anomaly Detection: ✅ Fraud detection (Credit card transactions, insurance fraud) ✅ Cybersecurity (Intrusion detection, DDoS attack detection) ✅ Healthcare (Detecting anomalies in medical scans or ECG signals) ✅ Manufacturing (Predictive maintenance for machinery failures)
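One simple way to flag values that "do not conform to expected behavior" is a z-score test: measure how many standard deviations each value lies from the mean. The slides do not prescribe a method, so the technique, the threshold of 2.5, and the transaction amounts below are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.5):
    """Return values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
print(find_anomalies(amounts))
```

Only the 500 transaction is flagged here. The mean and standard deviation are themselves pulled toward extreme values, so on small samples a lower threshold or a robust alternative (for example, one based on the median or interquartile range) may work better.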
  • 37. 5. Prediction Prediction involves using historical data to forecast future values. Applications of Prediction Models: ✅ Stock Market Forecasting (Predicting stock prices) ✅ Sales Forecasting (Estimating future sales based on past data) ✅ Weather Prediction (Forecasting temperature & rainfall) ✅ Disease Prediction (Predicting heart disease risk) ✅ Energy Consumption Forecasting (Predicting electricity demand)
  • 38. 6. Regression Regression analysis is a predictive modeling technique used to analyze relationships between a dependent variable (target) and one or more independent variables (features). It is widely used in finance, economics, healthcare, and machine learning. Types of Regression Analysis: 1. Linear Regression – Models the relationship between variables using a straight line. Example: House price prediction based on square footage. 2. Multiple Linear Regression – Uses multiple independent variables. Example: Salary prediction based on experience, education, and skills.
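Simple linear regression can be fitted in closed form with ordinary least squares: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below applies this to the house price example; the figures are invented and deliberately lie on a straight line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: square footage and price in thousands.
square_feet = [1000, 1500, 2000, 2500]
price = [150, 200, 250, 300]

slope, intercept = fit_line(square_feet, price)
predicted = slope * 1800 + intercept  # estimate for an 1800 sq ft house
print(round(slope, 3), round(intercept, 1), round(predicted, 1))
```

For this data the fitted line is roughly price = 0.1 × square feet + 50, giving an estimate of about 230 for an 1800 square foot house. Multiple linear regression follows the same idea with several features and is normally solved with a numerical library.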
  • 39. 7. Summarization (Descriptive) Summarization in data mining refers to the process of extracting key information from large datasets to create a compact, high-level representation. This helps in quick decision-making, pattern recognition, and data exploration Applications of Summarization in Data Mining ✅ Business Intelligence – Summarizing sales trends, customer behavior. ✅ Healthcare – Summarizing patient records and medical reports. ✅ Text Mining – Extracting summaries from legal documents, news articles. ✅ Cybersecurity – Summarizing network activity logs to detect threats. ✅ Social Media Analytics – Summarizing trends from tweets, posts, and reviews.
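For numeric data, summarization often amounts to grouping records and reporting a few aggregate figures per group. The following sketch summarizes sales by region; the records, field names, and the `summarize` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, key, value):
    """Group records by key and report count, total, mean, min and max."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        group: {
            "count": len(vals),
            "total": sum(vals),
            "mean": mean(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for group, vals in groups.items()
    }


# Hypothetical sales records.
sales = [
    {"region": "North", "amount": 100},
    {"region": "North", "amount": 300},
    {"region": "South", "amount": 50},
    {"region": "South", "amount": 150},
    {"region": "South", "amount": 100},
]

summary = summarize(sales, key="region", value="amount")
for region, stats in summary.items():
    print(region, stats)
```

Five raw rows are reduced to two summary rows, one per region, which is the kind of compact, high-level representation used in business intelligence reports.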
  • 40. Knowledge Discovery in Databases (KDD) KDD (Knowledge Discovery in Databases) is the process of discovering useful knowledge from large datasets. It is a broad concept that includes data mining as one of its key steps. The KDD process consists of several stages that transform raw data into meaningful patterns or knowledge.
  • 42. 1. Data Integration • Data integration is the process of combining data from different sources and storing it in a common place called a data warehouse.
  • 43. 2. Data Cleaning • This involves identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. • Missing Data: Some attribute values are absent from one or more records. • It can be handled in various ways. Some of them are: • Ignore the tuples: This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple. • Fill the missing values: Fill them in manually, or with the attribute mean or the most probable value.
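The cleaning strategies above can be combined in one small routine: drop exact duplicates, ignore tuples with several missing values, and fill the remaining gaps with the attribute mean. This is a minimal sketch under assumed data; the records, the `max_missing` cutoff, and the `clean` function are illustrative, and `None` stands for a missing value.

```python
from statistics import mean


def clean(rows, numeric_field, max_missing=1):
    """Deduplicate, drop sparse tuples, then mean-fill one numeric field."""
    # 1. Remove exact duplicates, keeping the first occurrence.
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # 2. Ignore tuples with more than max_missing missing values.
    kept = [r for r in unique
            if sum(v is None for v in r.values()) <= max_missing]
    # 3. Fill remaining gaps in numeric_field with the attribute mean.
    known = [r[numeric_field] for r in kept if r[numeric_field] is not None]
    if not known:
        return kept  # nothing to compute a mean from
    fill = mean(known)
    return [
        {**r, numeric_field: fill if r[numeric_field] is None else r[numeric_field]}
        for r in kept
    ]


# Hypothetical student records; None marks a missing value.
rows = [
    {"name": "Asha", "course": "BCA", "marks": 80},
    {"name": "Chaitra", "course": "Chemistry", "marks": None},
    {"name": "Asha", "course": "BCA", "marks": 80},      # duplicate
    {"name": "Deepthi", "course": None, "marks": None},  # two gaps
    {"name": "Esha", "course": "BCA", "marks": 60},
]

cleaned = clean(rows, numeric_field="marks")
for row in cleaned:
    print(row)
```

Three records remain, and the missing mark is replaced by 70, the mean of the known marks. Mean filling keeps the record but shrinks the variance of the attribute, so it is best suited to cases where only a small share of values is missing.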
  • 44. 5. Data Mining: • This is the core step of the KDD process, where various data mining techniques are applied to discover patterns, associations, trends, or anomalies in the data. • Common data mining techniques include classification, clustering, regression, association rule mining, and anomaly detection. 6. Pattern Evaluation: • Once patterns are discovered, they need to be evaluated for their usefulness and reliability. • This step involves assessing the quality and significance of the discovered patterns using metrics such as accuracy, support, confidence, and lift.
  • 45. 7. Knowledge Presentation: The final step involves presenting the discovered knowledge to the users in a format that is understandable and actionable. This may involve • visualization techniques, • reports, or • interactive tools to help users interpret and utilize the discovered knowledge.
  • 46. Relation Between KDD and Data Mining: •KDD is the overall process, whereas data mining is a step within it. •Data mining involves techniques like clustering, classification, regression, and association rule mining. •The success of KDD depends on proper data preparation, cleaning, and result evaluation.
  • 47. DM TECHNIQUES / functionalities: 1. Classification (predictive ) 2. Clustering (Descriptive) 3. Association Rule mining (Descriptive) 4. Prediction (predictive) 5. Anomaly Detection (predictive) 6. Regression Analysis (predictive) 7. Summarization (descriptive)
  • 48. Issues of Data mining The major issues of Data mining 1. Mining Methodology and User Interaction issues 2. Performance Issues 3. Diverse Data Types Issues
  • 49. Challenges: 1. Data Quality: o Data may contain errors, omissions, and duplications, leading to inaccurate outcomes. o To address these issues, data cleaning and preprocessing techniques are applied to improve data quality. Example (Sl.No | Name | Course | Marks): 1 | Asha | BCA | 9.3; 2 | Chaitra | Chemistry | 44.5; 3 | Asha | BCA | 9.3; 4 | BCA | Deepthi | 0.1; 3 | Deepthi | (blank) | 6.7
  • 50. 2. Data Complexity: o Vast amounts of data are generated from various sources like sensors, social media, and the Internet of Things (IoT). o Example: soil moisture sensor data. o Advanced techniques such as clustering, classification, and association rule mining help identify patterns and relationships in complex data.
  • 51. 3. Data Privacy and Security: o As data collection grows, so does the risk of breaches and cyber-attacks. o Data may contain personal, sensitive, or confidential information. o Regulations like GDPR, CCPA, and HIPAA impose strict rules on data collection, use, and sharing. o HIPAA (Health Insurance Portability and Accountability Act) protects sensitive patient health information from being disclosed without the patient’s consent. o Data anonymization (the process of removing personally identifiable information from data sets) and encryption techniques (converting data to ciphertext) safeguard privacy and security.
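Removing personally identifiable information can be sketched as dropping the direct identifiers from each record and replacing them with a keyed token, so that records for the same person can still be linked. Strictly speaking this is pseudonymization, a weaker guarantee than full anonymization, because fields such as age may still help re-identify someone. The record, field names, and key below are made up for illustration.

```python
import hashlib
import hmac


def pseudonymize(record, direct_identifiers, secret_key):
    """Drop identifying fields and add a keyed token in their place."""
    identity = "|".join(
        str(record[f]) for f in sorted(direct_identifiers) if f in record)
    # HMAC-SHA256 with a secret key: the token cannot be recomputed
    # by someone who knows the name but not the key.
    token = hmac.new(secret_key, identity.encode("utf-8"),
                     hashlib.sha256).hexdigest()[:16]
    safe = {k: v for k, v in record.items() if k not in direct_identifiers}
    safe["pid"] = token
    return safe


# Hypothetical patient record and demo key (never hard-code real keys).
patient = {"name": "Asha", "phone": "555-0100",
           "diagnosis": "hypertension", "age": 42}
key = b"demo-secret-key"

safe_record = pseudonymize(patient, {"name", "phone"}, key)
print(safe_record)
```

The output keeps the diagnosis and age for analysis but no longer contains the name or phone number. The same person always maps to the same `pid` under the same key, which is what allows mining across records without exposing identity.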
  • 53. 4. Scalability: o Data mining algorithms must handle large datasets efficiently. o As dataset size grows, computational resources and processing time increase. o Ensuring scalability is essential for effective data mining operations.
  • 54. DATA WAREHOUSE: A data warehouse is a centralized system designed for storing, managing, and analyzing large volumes of structured data from multiple sources. It is optimized for querying and reporting.
  • 55. 1. What is KDD? Explain the steps involved in knowledge discovery. 2. How are data mining systems classified? Discuss. 3. Explain data mining functionalities in detail. 4. Describe the major issues in data mining. 5. Define data mining. On what kinds of data can data mining be performed? 6. Differentiate between KDD and data mining. 7. Differentiate between a DBMS and data mining. 8. Explain data mining techniques in detail. 9. Describe the major challenges in data mining. 10. Write a note on applications of data mining.
  • 56. DATA WAREHOUSING COMPONENTS: Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components.
  • 57. 1. Data Sources •The origin of raw data, which can come from: • Databases (e.g., MySQL) • CSVs, XML, JSON • External data sources (e.g., market data, third-party data) 2. ETL (Extract, Transform, Load) Process •Extract: Pulling data from different sources. •Transform: Cleaning, filtering, aggregating, and converting data into a unified format. •Load: Storing the processed data into the data warehouse.
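The three ETL stages can be illustrated with Python's standard `csv` and `sqlite3` modules. In this sketch an in-memory SQLite database stands in for the warehouse, and the CSV content, table name, and cleaning rules are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source file: inconsistent casing and one missing amount.
RAW_CSV = """order_id,region,amount
1,north,100.50
2,South,
3,south,20
4,NORTH,79.50
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: filter incomplete rows and unify types and formats."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append((int(row["order_id"]),
                        row["region"].strip().title(),
                        float(row["amount"])))
    return cleaned


def load(conn, rows):
    """Load: store the processed rows in the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region").fetchall()
print(totals)
```

After the transform step, "north" and "NORTH" are stored as the same region, so the aggregate query returns one total per region rather than one per spelling. Keeping the three stages as separate functions makes it easy to swap the source or the target independently.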
  • 58. 3. Data Storage (Data Warehouse Database) •A centralized repository optimized for analytical queries. •Data is structured in schemas: • Star Schema (simpler, faster queries) • Snowflake Schema (normalized, less redundancy) 4. Metadata Repository •Stores information about the data, such as: • Data source details • Table structures and relationships • Data lineage (where data comes from and how it's transformed) • User access permissions
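A star schema places one central fact table of measurements next to dimension tables that describe them. The sketch below builds a tiny example in an in-memory SQLite database and runs a typical analytical query; all table names, column names, and figures are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: numeric measures plus keys into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    revenue    REAL);

INSERT INTO dim_product VALUES
    (1, 'Bread', 'Bakery'), (2, 'Butter', 'Dairy'), (3, 'Milk', 'Dairy');
INSERT INTO dim_date VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 10, 25.0), (2, 1, 4, 16.0), (3, 2, 6, 9.0), (1, 2, 8, 20.0);
""")

# Analytical query: revenue per product category and month.
revenue_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in revenue_by_category_month:
    print(row)
```

Each dimension is reached with a single join from the fact table, which is why star schema queries tend to be simple and fast. A snowflake schema would go one step further and move `category` into its own table referenced by `dim_product`, reducing redundancy at the cost of an extra join.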
  • 59. 5. Query Processing & Optimization •Optimizes query execution for fast results. •May involve OLAP (Online Analytical Processing) for multidimensional analysis. 6. Data Marts (Optional) •Subsets of the data warehouse tailored for specific business functions (e.g., finance, sales, marketing). •Improves performance by reducing the volume of data queried.
  • 60. 7. Business Intelligence (BI) & Reporting Tools •Provides visualization, reporting, and dashboards for decision- making. •Common BI tools: • Power BI, Tableau, Google Data Studio 8. Data Governance & Security •Ensures data quality, consistency, and access control. •Implements role-based access, encryption, and auditing.