Exploratory Data Analysis Fundamentals
Ashutosh Satapathy, Ph.D.
Asst. Prof., Department of CSE,
Siddhartha Academy of Higher Education
(Deemed to be University)
Vijayawada - 520007
1 / 199
Outline
Introduction to EDA
Data
From Data to Knowledge
Exploratory Data Analysis
Understanding Data Science
Data Science
Phases of Data Science
The Significance of EDA
Why is EDA Significant?
The Role of EDA
Steps in EDA
Example: EDA for a Fitness App
Making Sense of Data
Data Matters
Dataset
Data Storage
Numerical Data
Categorical Data
Measurement Scales
Data Analysis Approaches
Classical Data Analysis
Exploratory Data Analysis
Bayesian Data Analysis
Key Differences
Examples
Software Tools for EDA
Python
R Programming
Weka
KNIME
EDA using Python
NumPy
pandas
SciPy
Matplotlib
Summary
References
2 / 199
What is Data?
▶ A collection of discrete objects, numbers, words, events, facts,
measurements, observations, or descriptions.
▶ Generated by processes in various disciplines:
▶ Biology: Genetic sequences, protein structures
▶ Economics: Market trends, GDP data
▶ Engineering: Sensor readings, performance metrics
▶ Marketing: Customer preferences, sales data
▶ Example: A dataset of customer purchases in a retail store
includes product IDs, purchase dates, and amounts spent.
3 / 199
From Data to Knowledge
▶ Data: Raw facts and figures (e.g., sales numbers: $500,
$300, $700).
▶ Information: Processed data with context (e.g., average sales
per day: $500).
▶ Knowledge: Insights derived from information (e.g., sales
peak on weekends).
▶ Goal: Transform raw data into actionable knowledge.
▶ Example: Analyzing website traffic data to identify peak
visiting hours, leading to optimized ad schedules.
4 / 199
What is Exploratory Data Analysis (EDA)?
▶ A process to examine datasets and uncover:
▶ Patterns
▶ Anomalies
▶ Hypotheses
▶ Assumptions
▶ Uses statistical measures and visualizations.
▶ Performed before formal modeling or hypothesis testing.
▶ Example: Plotting sales data to spot seasonal trends or
outliers (e.g., a sudden spike in sales due to a promotion).
5 / 199
Why EDA?
▶ Helps statisticians understand data characteristics.
▶ Uncovers hidden insights before formal modeling.
▶ Guides hypothesis generation and data collection strategies.
▶ Prevents incorrect assumptions in modeling.
▶ Example:
1. In a medical study, EDA reveals missing values in patient
records, prompting data cleaning before analysis.
2. EDA on patient data reveals inconsistent heart rate readings,
prompting sensor recalibration.
6 / 199
Key Steps in EDA
1. Data Collection: Gather raw data (e.g., sensor readings from
a manufacturing plant).
2. Data Cleaning: Handle missing values, outliers (e.g.,
removing erroneous temperature readings).
3. Descriptive Statistics: Compute mean, median, variance
(e.g., average production rate).
4. Visualization: Create plots (histograms, scatter plots) to
identify trends.
5. Hypothesis Generation: Formulate questions based on
patterns (e.g., does production rate vary by shift?); see the
sketch below.
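▶ A minimal pandas sketch of these five steps on a made-up production log (column names and values are illustrative):
import pandas as pd

# 1-2. Collect and clean: a toy production log with one missing reading
df = pd.DataFrame({
    "shift": ["day", "day", "night", "night", "day"],
    "units": [120, 115, 98, None, 130],
})
df = df.dropna(subset=["units"])   # drop the erroneous reading

# 3. Descriptive statistics: mean, quartiles, spread
print(df["units"].describe())

# 4-5. Compare shifts to generate a hypothesis
print(df.groupby("shift")["units"].mean())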
7 / 199
Example: EDA in Retail Sales
▶ Dataset: Daily sales data for a clothing store over one year.
▶ Steps:
▶ Check for missing sales entries.
▶ Calculate average sales per month.
▶ Plot sales trends using a line graph.
▶ Identify outliers (e.g., Black Friday sales spike).
▶ Insight: Sales peak during holiday seasons, suggesting
increased inventory in November–December.
8 / 199
Tools for EDA
▶ Programming Languages: Python (pandas, matplotlib), R
(ggplot2).
▶ Software: Excel, Tableau, Power BI.
▶ Visualization Techniques: Histograms, box plots, scatter
plots, heatmaps.
▶ Example: Using Python to create a box plot of customer
spending to detect high-spending outliers.
9 / 199
Benefits of EDA
▶ Uncovers hidden patterns (e.g., customer churn trends).
▶ Detects data quality issues (e.g., duplicate entries).
▶ Informs better data collection strategies.
▶ Supports development of robust models.
▶ Example: EDA on weather data reveals inconsistent sensor
readings, leading to sensor recalibration.
10 / 199
What is Data Science?
▶ A cross-disciplinary field combining:
▶ Computer Science
▶ Statistics
▶ Mathematics
▶ Domain Knowledge
▶ Involves building models and extracting insights for business
intelligence.
▶ No Ph.D. required—practical skills are key.
▶ Example: Predicting customer churn using purchase history
and machine learning.
11 / 199
Phases of Data Science
▶ These phases are similar to Cross-Industry Standard Process
for Data Mining (CRISP-DM) framework:
1. Data Requirements
2. Data Collection
3. Data Processing
4. Data Cleaning
5. Exploratory Data Analysis (EDA)
6. Modeling and Algorithms
7. Data Product
8. Communication
▶ Each phase builds toward actionable insights.
12 / 199
Data Requirements
▶ Identify and categorize data needed for analysis:
▶ Numerical (e.g., heart rate)
▶ Categorical (e.g., patient gender)
▶ Define storage and dissemination formats.
▶ Example: For a dementia study, collect sleep patterns, heart
rate, and activity data from sensors to assess mental state.
13 / 199
Data Collection
▶ Gather data from various sources (sensors, databases, APIs).
▶ Ensure proper storage and transfer to IT systems.
▶ Example: Collecting customer feedback from surveys and
social media to analyze sentiment.
14 / 199
Data Processing
▶ Pre-curate data before analysis.
▶ Tasks: Exporting, structuring, and formatting data into tables.
▶ Example: Converting raw sensor data into a structured CSV
file with columns for time, value, and sensor ID.
15 / 199
Data Cleaning
▶ Address incompleteness, duplicates, errors, and missing values.
▶ Techniques: Record matching, outlier detection, filling missing
values.
▶ Example: Removing duplicate customer entries in a sales
dataset and imputing missing purchase amounts using the
median (see the sketch below).
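▶ A small sketch of both techniques on a hypothetical sales table (names and values are illustrative):
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 100.0, None, 250.0],
})
sales = sales.drop_duplicates()   # remove exact duplicate rows
# impute the missing amount with the column median
sales["amount"] = sales["amount"].fillna(sales["amount"].median())
print(sales)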
16 / 199
Exploratory Data Analysis (EDA)
▶ Core stage to uncover patterns, anomalies, and hypotheses.
▶ Uses descriptive statistics and visualizations.
▶ May involve data transformation techniques.
▶ Example: Plotting sales data to identify seasonal trends or
outliers (e.g., a spike during a holiday sale).
17 / 199
Modeling and Algorithms
▶ Build models to represent relationships between variables.
▶ Judd model: Data = Model + Error.
▶ Example: Linear regression model for pen purchases:
Total = UnitPrice × Quantity
where Total is the dependent variable, Quantity is the
independent variable, and UnitPrice is the coefficient the
model estimates (a fitting sketch follows).
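▶ A minimal sketch of recovering the coefficient with a NumPy least-squares fit; the data values are made up:
import numpy as np

quantity = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
unit_price = 2.5
total = unit_price * quantity + np.random.normal(0, 0.1, 5)  # Data = Model + Error

# Least-squares fit of Total = beta * Quantity (no intercept)
beta = np.linalg.lstsq(quantity.reshape(-1, 1), total, rcond=None)[0][0]
print(f"estimated unit price: {beta:.2f}")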
18 / 199
Data Product
▶ Software that uses data inputs to produce outputs and
feedback.
▶ Based on models from analysis.
▶ Example: A recommendation system suggesting products
based on user purchase history.
19 / 199
Communication
▶ Share results with stakeholders via visualizations (tables,
charts, diagrams).
▶ Drives business intelligence and decision-making.
▶ Example: A bar chart showing monthly sales trends to guide
inventory planning.
20 / 199
Why is EDA Significant?
▶ Data is collected in fields like science, economics, engineering,
and marketing.
▶ Large datasets are stored in electronic databases, making
manual analysis impossible.
▶ EDA is the first step in data mining to:
▶ Visualize data
▶ Understand patterns
▶ Create hypotheses
▶ Example: A store collects sales data to find which products
sell best during holidays.
21 / 199
The Role of EDA
▶ Reveals insights without assumptions (ground truth).
▶ Helps data scientists decide on models and hypotheses.
▶ Key components:
▶ Summarizing data (e.g., averages, totals)
▶ Statistical analysis (e.g., correlations)
▶ Visualization (e.g., graphs, charts)
▶ Example: Plotting student grades to spot trends, like higher
scores in math vs. history.
22 / 199
Steps in EDA
▶ EDA involves four key steps:
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of Results
▶ Each step builds toward clear, actionable insights.
23 / 199
Step 1: Problem Definition
▶ Define the business problem to guide analysis.
▶ Tasks:
▶ Set objectives (e.g., increase sales)
▶ List deliverables (e.g., a report)
▶ Assess data status and costs
▶ Example: A café wants to know which drinks sell best to plan
inventory.
24 / 199
Step 2: Data Preparation
▶ Prepare data for analysis by:
▶ Identifying data sources (e.g., sales records)
▶ Cleaning and transforming data
▶ Dividing data into chunks
▶ Example: Organizing student survey data into tables for
analysis of study habits.
25 / 199
Step 3: Data Analysis
▶ Analyze data using:
▶ Descriptive statistics (e.g., mean, median)
▶ Correlation analysis
▶ Predictive models
▶ Example: Calculating average test scores and finding if study
time correlates with grades (see the sketch below).
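▶ A sketch of that analysis with pandas; the survey numbers are invented:
import pandas as pd

survey = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10],
    "score": [55, 62, 70, 78, 85],
})
print(survey["score"].mean())                        # descriptive statistic
print(survey["study_hours"].corr(survey["score"]))   # Pearson correlation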
26 / 199
Step 4: Development and Representation
▶ Present results to stakeholders using:
▶ Graphs (e.g., histograms, scatter plots)
▶ Summary tables
▶ Maps or diagrams
▶ Goal: Make results clear for decision-making.
▶ Example: A bar chart showing top-selling café drinks for the
manager.
27 / 199
Example: EDA for a Fitness App
▶ Problem: Understand user activity patterns.
▶ Data Preparation: Collect step counts from fitness trackers.
▶ Data Analysis: Calculate average steps per day, find peak
activity times.
▶ Representation: Create a line graph showing daily steps over
a month.
▶ Insight: Users walk more on weekends, suggesting targeted
promotions (the pipeline is sketched below).
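▶ A compact sketch of this pipeline, assuming a hypothetical steps-per-day series with a weekend boost:
import pandas as pd
import matplotlib.pyplot as plt

days = pd.date_range("2024-01-01", periods=30, freq="D")
steps = pd.Series([6000 + 2500 * (d.dayofweek >= 5) for d in days], index=days)

print("average steps/day:", steps.mean())
steps.plot(title="Daily steps over a month")   # line graph
plt.ylabel("steps")
plt.show()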
28 / 199
Data Matters
▶ Data is everywhere: hospitals, universities, real estate, and
more.
▶ Understanding data types helps analyze it correctly.
▶ Example: Hospitals store patient data to track health trends,
like weight or age.
▶ Goal: Turn raw data into meaningful insights.
29 / 199
Dataset
▶ A collection of observations about an object.
▶ Each observation has variables (features) describing it.
▶ Example: A hospital dataset includes patient details like:
▶ Patient ID
▶ Name
▶ Address
▶ Date of Birth
▶ Email
▶ Gender
▶ Weight
30 / 199
Example: Hospital Patient Dataset
▶ Each row is an observation (a patient).
▶ Each column is a variable (e.g., Name, Weight).
▶ Example entry:
▶ PATIENT_ID: 002
▶ Name: Yoshmi Mukhiya
▶ Address: Mannsverk 61, 5094, Bergen
▶ DOB: 10.07.2018
▶ Email: yoshmimukhiya@gmail.com
▶ Gender: Female
▶ Weight: 10
31 / 199
How Data is Stored
▶ Stored in database management systems as tables/schemas.
▶ Each table has rows (observations) and columns (variables).
▶ Example: A hospital patient table:
Table 1: An example of a table for storing patient information
ID Name Address DOB Email Gender Weight
001 Suresh Mukhiya Mannsverk 61 30.12.1989 skmu@hvl.no Male 68
002 Yoshmi Mukhiya Mannsverk 61, 5094, Bergen 10.07.2018 yoshmimukhiya@gmail.com Female 10
003 Anju Mukhiya Mannsverk 61, 5094, Bergen 10.12.1997 anjumukhiya@gmail.com Female 24
004 Asha Gaire Butwal, Nepal 30.11.1990 aasha.gaire@gmail.com Female 23
005 Ola Nordmann Danmark, Sweden 12.12.1789 ola@gmail.com Male 75
32 / 199
Numerical Data
▶ Data involving measurements or quantities.
▶ Also called quantitative data in statistics.
▶ Examples:
▶ Age (e.g., 20 years)
▶ Height (e.g., 170 cm)
▶ Weight (e.g., 65 kg)
▶ Heart rate (e.g., 72 bpm)
▶ Number of family members (e.g., 4)
▶ Used in fields like medicine, sports, and research.
33 / 199
Types of Numerical Data
▶ Numerical data is divided into two types:
▶ Discrete Data: Countable, fixed values.
▶ Continuous Data: Infinite values within a range.
▶ Understanding these types helps in data analysis.
▶ Example: Number of teeth (discrete) vs. body temperature
(continuous).
34 / 199
Discrete Data
▶ Data that is countable with a finite set of values.
▶ Represented by a discrete variable.
▶ Examples:
▶ Number of heads in 200 coin flips (0 to 200).
▶ Number of goals scored in a match (e.g., 0, 1, 2, 3).
▶ Student rank in class (e.g., 1, 2, 3, 4).
▶ Number of cars in a parking lot (e.g., 25).
▶ Discrete data has distinct, separate values.
35 / 199
Example: Discrete Data
▶ Scenario: Counting students in a classroom.
▶ Variable: Number of students present.
▶ Values: 20, 21, 22, ..., 30 (finite and countable).
▶ Analysis: Calculate the average attendance over a week.
▶ Visual: Bar chart showing daily student counts.
36 / 199
Continuous Data
▶ Data with an infinite number of values within a range.
▶ Represented by a continuous variable.
▶ Examples:
▶ Temperature (e.g., 25.3°C, 25.31°C, 25.312°C).
▶ Weight (e.g., 65.2 kg, 65.25 kg).
▶ Height (e.g., 170.5 cm, 170.51 cm).
▶ Time to run 100 meters (e.g., 12.345 seconds).
▶ Continuous data can take any value in a range.
37 / 199
Example: Continuous Data
▶ Scenario: Measuring student heights in a class.
▶ Variable: Height.
▶ Values: Any number between 150 cm and 190 cm (e.g., 165.7
cm).
▶ Analysis: Find the average height of students.
▶ Visual: Histogram showing height distribution.
38 / 199
Discrete vs. Continuous
Table 2: Discrete data vs. continuous data
Discrete Data Continuous Data
Definition Countable, fixed values Infinite values in a range
Examples Number of students, rank Weight, temperature
Variable Type Discrete variable Continuous variable
Analysis Counts, frequencies Averages, ranges
▶ Example: Number of cars (discrete) vs. car speed
(continuous).
39 / 199
Example: Car Dataset
▶ Dataset: Cars with variables like:
▶ Number of seats (discrete: 2, 4, 5, 7).
▶ Weight (continuous: e.g., 1200.5 kg).
▶ Speed (continuous: e.g., 180.3 km/h).
▶ EDA Tasks (sketched below):
▶ Count cars by number of seats (discrete).
▶ Calculate average weight (continuous).
▶ Plot speed distribution (continuous).
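▶ A pandas sketch of these three tasks on made-up car data:
import pandas as pd
import matplotlib.pyplot as plt

cars = pd.DataFrame({
    "seats": [2, 4, 5, 5, 7, 4],
    "weight_kg": [1100.5, 1250.0, 1320.7, 1298.3, 1850.2, 1275.9],
    "speed_kmh": [220.0, 185.5, 178.2, 181.0, 160.4, 190.3],
})

print(cars["seats"].value_counts())   # counts per discrete value
print(cars["weight_kg"].mean())       # average of a continuous variable
cars["speed_kmh"].plot(kind="hist", title="Speed distribution")
plt.show()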
40 / 199
What is Categorical Data?
▶ Represents characteristics or qualities of an object.
▶ Also called qualitative data in statistics.
▶ Examples:
▶ Gender (Male, Female, Other)
▶ Marital Status (Married, Single, Divorced)
▶ Movie Genres (Comedy, Drama, Action)
▶ Blood Type (A, B, AB, O)
▶ Drug Types (Stimulants, Opioids, Cannabis)
▶ Used in fields like medicine, marketing, and social sciences.
41 / 199
Categorical Variables
▶ Variables that describe categorical data.
▶ Have a limited number of values (like enumerated types in
computer science).
▶ Two main types:
▶ Binary (Dichotomous): Exactly two values.
▶ Polytomous: More than two values.
▶ Example: Gender (Male, Female) vs. Movie Genres (Action,
Comedy, Drama, etc.).
42 / 199
Binary Categorical Variables
▶ Take exactly two values (dichotomous).
▶ Examples:
▶ Experiment Result: Success or Failure
▶ Attendance: Present or Absent
▶ Light Switch: On or Off
▶ Easy to analyze due to only two options.
▶ Example: Checking if a student passed (Yes/No) an exam.
43 / 199
Polytomous Categorical Variables
▶ Take more than two values.
▶ Examples:
▶ Marital Status: Married, Single, Divorced, Widowed, etc.
▶ Movie Genres: Action, Comedy, Drama, Horror, etc.
▶ Blood Type: A, B, AB, O
▶ Example: Surveying students’ favorite subjects (Math,
Science, History, Art).
44 / 199
Measurement Scales
▶ Four different types of measurement scales in statistics.
▶ These scales are used mostly in academic and research settings.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
45 / 199
What are Nominal Scales?
▶ Labels for categorical variables without quantitative value.
▶ Mutually exclusive and carry no numerical importance.
▶ Considered qualitative data in statistics.
▶ Examples:
▶ Gender: Male, Female, Non-binary, Other, Prefer not to answer
▶ Languages: English, Spanish, Hindi
▶ Biological Domains: Archaea, Bacteria, Eukarya
46 / 199
Characteristics of Nominal Scales
▶ No order or ranking among categories.
▶ No arithmetic and comparison operations (e.g., addition,
subtraction, multiplication, division, mean, greater than, less
than) possible.
▶ Numbers as labels have no numerical meaning (e.g., "1 =
Male" is just a label).
▶ Example: Labeling parts of speech (noun, verb, adjective) has
no numerical value.
47 / 199
Examples of Nominal Scales
▶ Common nominal variables:
▶ Gender: Male, Female, Non-binary, Other
▶ Country Languages: Norwegian, Japanese, Nepali
▶ Movie Genres: Comedy, Action, Drama
▶ Biological Domains: Archaea, Bacteria, Eukarya
▶ Parts of Speech: Noun, Pronoun, Adjective
▶ Example: Survey asking students’ favorite food: Pizza, Sushi,
Pasta.
48 / 199
Analyzing Nominal Data
▶ Key methods:
▶ Frequency: Count how often a label appears.
▶ Proportion: Frequency divided by total events.
▶ Percentage: Proportion multiplied by 100.
▶ Example: In a class of 50 students:
▶ Gender: 25 Male, 20 Female, 5 Non-binary
▶ Proportion: Female = 20/50 = 0.4
▶ Percentage: Female = 40% (computed in the sketch below)
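▶ The same computations with pandas, using the class data above:
import pandas as pd

gender = pd.Series(["Male"] * 25 + ["Female"] * 20 + ["Non-binary"] * 5)

print(gender.value_counts())                      # frequency
print(gender.value_counts(normalize=True))        # proportion
print(gender.value_counts(normalize=True) * 100)  # percentage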
49 / 199
Visualizing Nominal Data
▶ Use pie charts or bar charts for nominal data.
▶ Not suitable for histograms (used for numerical data).
▶ Example: Bar chart showing student preferences:
▶ Favorite Food: Pizza (20), Sushi (15), Pasta (10)
▶ Visuals make frequencies and proportions clear to
stakeholders.
50 / 199
Example: Student Survey Dataset
▶ Dataset: Survey of 100 students with nominal variables:
▶ Gender: Male, Female, Non-binary, Other
▶ Favorite Subject: Math, Science, History
▶ Analysis:
▶ Frequency: 50 Male, 40 Female, 10 Non-binary
▶ Proportion: Male = 50/100 = 0.5
▶ Visualization: Pie chart of Gender distribution
51 / 199
Nominal Scales in Real Life
▶ Used in surveys, classifications, and categorizations.
▶ Examples:
▶ Customer survey: Preferred brand (Nike, Adidas, Puma)
▶ Biology: Classifying species (Lion, Tiger, Leopard)
▶ Social media: Post category (Photo, Video, Text)
▶ Analysis: Count how many customers prefer each brand.
52 / 199
What are Ordinal Scales?
▶ Categorical data with a significant order of values.
▶ Tip: "Ordinal" sounds like "order" (1st, 2nd, 3rd, etc.).
▶ Differs from nominal scales (no order, e.g., Gender).
▶ Examples:
▶ Satisfaction: Low, Medium, High
▶ Education Level: High School, Bachelor’s, Master’s
▶ Race Position: 1st, 2nd, 3rd
53 / 199
Ordinal vs. Nominal Scales
▶ Nominal: No order (e.g., Blood Type: A, B, AB, O).
▶ Ordinal: Ordered categories (e.g., Satisfaction: Low,
Medium, High).
▶ Key Difference: Order matters in ordinal scales.
▶ Example:
▶ Nominal: Favorite Color (Red, Blue, Green).
▶ Ordinal: Class Rank (1st, 2nd, 3rd).
54 / 199
What is the Likert Scale?
▶ A type of ordinal scale with ordered response options.
▶ Used to measure opinions, attitudes, or feelings.
▶ Example Question: "WordPress is making content managers’
lives easier."
▶ Responses (5-point Likert Scale):
▶ Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree
▶ Visual: See next slide for diagram.
55 / 199
Likert Scale Example
Feelings
▶ Options: 1 - Very Unhappy, 2 - Unhappy, 3 - OK, 4 - Happy,
5 - Very Happy
▶ Order matters: 1 is less happy than 5.
▶ Example: A student rates their mood as "Happy" (4).
Satisfaction
▶ Options: 1 - Very Unsatisfied, 2 - Somewhat Unsatisfied, 3 -
Neutral, 4 - Somewhat Satisfied, 5 - Very Satisfied
▶ Example: A customer rates service as "Somewhat Satisfied"
(4).
56 / 199
Example: Student Survey Dataset
▶ Dataset: Survey of 50 students with ordinal variables:
▶ Effort Level: Low, Medium, High
▶ Course Difficulty: Easy, Moderate, Hard
▶ Analysis:
▶ Count: 20 Low, 20 Medium, 10 High Effort
▶ Median: Medium Effort
▶ Visualization: Bar chart of Effort Levels
57 / 199
Visualizing Ordinal Data
▶ Use bar charts to show order and frequency.
▶ Avoid pie charts if order is critical (use for nominal data
instead).
▶ Example: Bar chart of Likert responses (plotted in the sketch after this list):
▶ Strongly Agree: 10
▶ Agree: 15
▶ Neutral: 10
▶ Disagree: 5
▶ Strongly Disagree: 5
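▶ A matplotlib sketch of that bar chart, using the counts above:
import pandas as pd
import matplotlib.pyplot as plt

responses = pd.Series({"Strongly Agree": 10, "Agree": 15, "Neutral": 10,
                       "Disagree": 5, "Strongly Disagree": 5})
responses.plot(kind="bar", title="Likert responses")  # category order preserved
plt.ylabel("count")
plt.show()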
58 / 199
Real-World Example: Customer Feedback
▶ Dataset: 100 customer reviews with:
▶ Satisfaction: 1 - Very Unsatisfied, 2 - Unsatisfied, 3 - Neutral,
4 - Satisfied, 5 - Very Satisfied
▶ Analysis:
▶ Median: Neutral (3)
▶ Frequency: 30 Satisfied, 25 Neutral, etc.
▶ Action: Improve service based on low satisfaction feedback.
59 / 199
Why Learn Ordinal Scales?
▶ Order helps rank and compare data (e.g., satisfaction levels).
▶ Guides correct statistical measures (median, not mean).
▶ Essential for surveys and Likert scale analysis in EDA.
▶ Example: Ranking student performance (1st, 2nd, 3rd) for
awards.
60 / 199
What are Interval and Ratio Scales?
▶ Interval Scales: Order and exact differences between values
matter.
▶ Ratio Scales: Include order, exact differences, and a true
zero.
▶ Both extend beyond nominal and ordinal scales for advanced
analysis.
▶ Example: Temperature (interval) vs. Height (ratio).
61 / 199
Interval Scales
▶ Order and exact differences between values are significant.
▶ Used in statistics (e.g., mean, median, mode, standard
deviation).
▶ No true zero (e.g., 0°C doesn’t mean no temperature).
▶ Examples:
▶ Temperature in Celsius (°C)
▶ Location in Cartesian coordinates (x, y)
▶ Direction in degrees from magnetic north
▶ Example: Difference between 20°C and 30°C is the same as
30°C and 40°C.
62 / 199
Properties of Interval Scales
▶ Provides: Order, frequency, mode, median, mean.
▶ Can quantify differences between values.
▶ Can add or subtract values (e.g., 20°C - 10°C = 10°C).
▶ Cannot multiply/divide or use a true zero.
▶ Example: Average temperature of a week (e.g., mean =
25°C).
63 / 199
Example: Interval Scale in Action
▶ Dataset: Daily temperatures (°C) for a week:
▶ Mon: 20°C, Tue: 22°C, Wed: 25°C, Thu: 23°C, Fri: 21°C,
Sat: 24°C, Sun: 26°C
▶ Analysis:
▶ Mean: (20 + 22 + 25 + 23 + 21 + 24 + 26) / 7 = 23°C
▶ Difference: 25°C - 20°C = 5°C
▶ Visualization: Line graph of temperature trends (the computations are sketched below).
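▶ The slide's computations in NumPy:
import numpy as np

temps_c = np.array([20, 22, 25, 23, 21, 24, 26])
print(temps_c.mean())                 # 23.0 - mean is meaningful on an interval scale
print(temps_c.max() - temps_c.min())  # differences are meaningful
# Ratios are NOT meaningful here: 40°C is not "twice as hot" as 20°C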
64 / 199
Ratio Scales
▶ Include order, exact differences, and a true zero (e.g., 0 kg =
no mass).
▶ Enable advanced statistical analysis (e.g., mean, variance,
ratios).
▶ Examples:
▶ Mass (e.g., 50 kg)
▶ Length (e.g., 2 meters)
▶ Duration (e.g., 5 seconds)
▶ Volume (e.g., 10 liters)
▶ Example: Height of students (0 cm = no height is
meaningful).
65 / 199
Properties of Ratio Scales
▶ Provides: Order, frequency, mode, median, mean, differences.
▶ Can quantify differences, add/subtract, multiply/divide values.
▶ Has a true zero, allowing ratios (e.g., 10 kg is twice 5 kg).
▶ Example: Average weight of a class (e.g., mean = 60 kg).
66 / 199
Example: Ratio Scale in Action
▶ Dataset: Weights (kg) of 5 students:
▶ Student 1: 50 kg, Student 2: 55 kg, Student 3: 60 kg,
Student 4: 65 kg, Student 5: 70 kg
▶ Analysis:
▶ Mean: (50 + 55 + 60 + 65 + 70) / 5 = 60 kg
▶ Ratio: 70 kg is 1.4 times 50 kg
▶ Visualization: Bar chart of weights.
67 / 199
Comparison of All Scales
Table 3: A summary of the data types and scale measures
Provides:                                      Nominal Ordinal Interval Ratio
The "order" of values is known                         ✓       ✓        ✓
"Counts," aka "Frequency of Distribution"      ✓       ✓       ✓        ✓
Mode                                           ✓       ✓       ✓        ✓
Median                                                 ✓       ✓        ✓
Mean                                                           ✓        ✓
Can quantify the difference between values                     ✓        ✓
Can add or subtract values                                     ✓        ✓
Can multiply and divide values                                          ✓
Has a "true zero"                                                       ✓
▶ Example: Gender (nominal) vs. Temperature (interval) vs.
Weight (ratio).
68 / 199
Real-World Example: Weather Data
▶ Interval: Temperature (°C) over a month.
▶ Mean: 22°C, Difference: 25°C - 20°C = 5°C
▶ Ratio: Rainfall (mm) in the same month.
▶ Mean: 50 mm, Ratio: 100 mm is twice 50 mm
▶ Use: Plan irrigation based on rainfall ratios.
69 / 199
Why Learn Interval and Ratio Scales?
▶ Enable precise statistical analysis (e.g., mean, ratios).
▶ Critical for fields like science, engineering, and economics.
▶ Guide correct visualization and modeling choices.
▶ Example: Analyzing test scores (interval) or distances (ratio)
in a project.
70 / 199
Data Analysis Approaches
▶ Three key methods:
▶ Classical Data Analysis
▶ Exploratory Data Analysis (EDA)
▶ Bayesian Data Analysis
▶ Each has a unique process for handling data.
▶ Example: Analyzing student exam scores using different
methods.
71 / 199
Classical Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Model Development
▶ Data Analysis
▶ Results Communication
▶ Focus: Build a model first, then analyze data.
▶ Example: Predicting student grades with a pre-set linear
model.
72 / 199
Exploratory Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Data Analysis
▶ Model Development
▶ Results Communication
▶ Focus: Explore data (outliers, patterns) before modeling.
▶ No imposed models; emphasizes visualizations.
▶ Example: Plotting exam scores to spot trends before
modeling.
73 / 199
Bayesian Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Model Development
▶ Prior Distribution
▶ Data Analysis
▶ Results Communication
▶ Uses prior probability (belief before evidence).
▶ Example: Using past exam trends as prior to predict current
scores.
74 / 199
Key Differences
▶ Classical: Model first, then data analysis.
▶ EDA: Data exploration first, flexible modeling.
▶ Bayesian: Incorporates prior beliefs.
▶ Example: Classical fits a grade model directly; EDA checks
score distribution first.
75 / 199
Why Compare These Approaches?
▶ Choose the best method for your data and goals.
▶ EDA is great for initial exploration; Classical for structured
analysis.
▶ Bayesian adds prior knowledge for accuracy.
▶ Example: Use EDA to explore customer feedback, then
Bayesian for predictions.
76 / 199
Example: Student Exam Scores
▶ Classical: Define problem (predict grades), collect scores,
build a regression model, analyze, report.
▶ EDA: Collect scores, plot distribution (e.g., histogram),
identify outliers, then model.
▶ Bayesian: Use last year’s grade trends as prior, update with
new data, analyze.
▶ Outcome: Different insights based on approach.
77 / 199
Real-World Example: Sales Data
▶ Classical: Build a sales forecast model, then analyze monthly
data.
▶ EDA: Explore sales trends (e.g., bar chart), then model
seasonal patterns.
▶ Bayesian: Use last year’s sales as prior, refine with current
data.
▶ Goal: Optimize inventory based on insights.
78 / 199
Software Tools for EDA
▶ Facilitate data exploration, visualization, and analysis.
▶ Open-source tools are free and widely accessible.
▶ Help uncover patterns, outliers, and insights.
▶ Example: Analyzing student performance data to find trends.
79 / 199
EDA Open-Source Tools
▶ Popular tools for EDA include:
▶ Python
▶ R Programming Language
▶ Weka
▶ KNIME
▶ Each offers unique features for data analysis.
80 / 199
Python
▶ Open-source programming language.
▶ Widely used for data analysis, mining, and data science.
▶ Link: https://www.python.org/
▶ Features: Libraries like pandas, matplotlib for EDA.
▶ Example: Plotting a histogram of exam scores using
matplotlib.
81 / 199
R Programming
▶ Open-source language for statistical computation.
▶ Strong in graphical data analysis.
▶ Link: https://www.r-project.org
▶ Features: Packages like ggplot2 for visualizations.
▶ Example: Creating a bar chart of sales data with ggplot2.
82 / 199
Weka
▶ Open-source data mining package.
▶ Includes EDA tools and algorithms.
▶ Link: https://www.cs.waikato.ac.nz/ml/weka/
▶ Features: Data visualization and data preprocessing tools.
▶ Example: Detecting outliers in a dataset of customer
purchases.
83 / 199
KNIME
▶ Open-source tool for data analysis.
▶ Based on Eclipse platform.
▶ Link: https://www.knime.com/
▶ Features: Drag-and-drop interface for workflows.
▶ Example: Building a workflow to analyze social media
engagement.
84 / 199
EDA using Python
Python
▶ Programming basics (variables, strings, data types)
▶ Conditionals, functions, sequences, collections, iterations
▶ File handling, object-oriented programming
NumPy
▶ Create, copy, and divide arrays
▶ Perform operations on arrays
▶ Array selections, advanced indexing, multi-dimensional arrays
▶ Linear algebra and built-in functions
85 / 199
EDA using Python
pandas
▶ Create and understand DataFrame objects
▶ Subset and index data
▶ Arithmetic functions, mapping, index management
▶ Styling for visual analysis
Matplotlib
▶ Load linear datasets
▶ Adjust axes, grids, labels, titles, legends
▶ Save plots
SciPy
▶ Import the package
▶ Use statistical packages
▶ Perform descriptive statistics, inference, analysis
86 / 199
Virtual Environment
▶ Essential for isolating Python projects.
▶ Steps:
▶ Install: pip install virtualenv
▶ Create: virtualenv Local_Version_Directory -p Python_System_Directory
▶ Example: virtualenv myenv -p /usr/bin/python3
▶ Check: Activate and install packages (e.g., pandas).
87 / 199
Reading/Writing to Files
▶ Basic file handling is key for data input/output.
▶ Example Code:
# Reading a text file line by line
filename = "datamining.txt"
with open(filename, mode="r", encoding="utf-8") as file:
    for line in file:
        print(line, end="")
▶ Practice: Read a CSV file of exam scores (one possible solution is sketched below).
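▶ A standard-library sketch for the practice task; the file name and column names are hypothetical:
import csv

# exam_scores.csv is assumed to have a header row: name,score
with open("exam_scores.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["name"], row["score"])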
88 / 199
Error Handling
▶ Manage errors to ensure robust code.
▶ Example Code:
# Handle invalid grade inputs
try:
    val = int(input("Type a number between 47 and 100: "))
except ValueError:
    print("You must type a number between 47 and 100!")
else:
    if 47 <= val <= 100:
        print("You typed:", val)
    else:
        print("The value you typed is incorrect!")
89 / 199
Object-Oriented Concepts
▶ Use classes and objects for structured code.
▶ Example Code:
# Mental Health Diseases: Social Anxiety Disorder
class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease

    def getName(self):
        print("Mental Health Diseases:", self.type)

d1 = Disease('Social Anxiety Disorder')
d1.getName()
▶ Example: Create a class for student data (one possible sketch below).
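▶ A possible answer to that exercise; the fields are illustrative:
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

    def describe(self):
        print(f"{self.name} scored {self.grade}")

s1 = Student("Asha", 91)
s1.describe()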
90 / 199
NumPy
# Create a 1D array using NumPy
import numpy as np
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
Output: [ 1 8 27 64]
# Create a 2D array using NumPy
import numpy as np
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], 
[4, 8, 18, 32]])
print(my2DArray)
Output: [[ 1 2 3 4]
[ 2 4 9 16]
[ 4 8 18 32]]
91 / 199
NumPy
# Create and display a 3D array using NumPy
import numpy as np
my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
[[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3Darray)
Output: [[[ 1 2 3 4]
[ 5 6 7 8]]
[[ 1 2 3 4]
[ 9 10 11 12]]]
# Display the memory addresses
print(my1DArray.data, my2DArray.data, my3Darray.data)
Output: <memory at 0x7f8b1c0a3e80> <memory at 0x7f8b1c0a3f40>
<memory at 0x7f8b1c0a4040>
92 / 199
NumPy
# Display the shapes of 1D, 2D, and 3D NumPy arrays
print(my1DArray.shape, my2DArray.shape, my3Darray.shape)
Output: (4,) (3, 4) (2, 2, 4)
# Display the data types of 1D, 2D, and 3D NumPy arrays
print(my1DArray.dtype, my2DArray.dtype, my3Darray.dtype)
Output: int64 int64 int64
# Display the strides of 1D, 2D, and 3D NumPy arrays
'''The strides (32, 8) in my2DArray indicate that to move
from one row to the next, 32 bytes are skipped, and to move
from one column to the next, 8 bytes are skipped'''
print(my1DArray.strides, my2DArray.strides, my3Darray.strides)
Output: (8,) (32, 8) (64, 32, 8)
93 / 199
NumPy
# Create a 2D array filled with ones
import numpy as np
ones = np.ones((2,4))
print(ones)
Output: [[1. 1. 1. 1.]
[1. 1. 1. 1.]]
# Create a 3D array filled with zeros
import numpy as np
zeros = np.zeros((2,1,4), dtype=np.int16)
print(zeros)
Output: [[[0 0 0 0]]
[[0 0 0 0]]]
94 / 199
NumPy
# Create a 2D array filled with random values
import numpy as np
random_array = np.random.random((2,2))
print(random_array)
Output: array([[0.44768845, 0.96186535],
[0.99402423, 0.88612299]])
# Create a 2D array with uninitialized values
import numpy as np
emptyArray = np.empty((3,2))
print(emptyArray)
Output: [[9.86638798e-316 0.00000000e+000]
[6.87990479e-310 6.87990488e-310]
[6.87990477e-310 6.87990479e-310]]
95 / 199
NumPy
# Create a 2D array filled with a specific value
import numpy as np
fullArray = np.full((2,2), 7)
print(fullArray)
Output: [[7 7]
[7 7]]
# Create a 1D array with evenly spaced values
import numpy as np
evenSpacedArray = np.arange(10, 25, 5)
print(evenSpacedArray)
Output: [10 15 20]
96 / 199
NumPy
# Create a 1D array with evenly spaced values
import numpy as np
evenSpacedArray2 = np.linspace(0, 2, 9)
print(evenSpacedArray2)
Output: [0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]
''' Create a NumPy array and save it to a file and
load a NumPy array from a text file'''
import numpy as np
x = np.arange(0.0, 25.0, 1.0)
np.savetxt('data.out', x, delimiter=',')
z = np.loadtxt('data.out', unpack=True)
print(z)
Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.]
97 / 199
NumPy
# Load a NumPy array from a text file
import numpy as np
my_array2 = np.genfromtxt('data.out', skip_header=1, 
filling_values=-999)
print(my_array2)
Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.]
# Display the number of dimensions of 1D, 2D, and 3D arrays
print(my1DArray.ndim, my2DArray.ndim, my3Darray.ndim)
Output: 1 2 3
# Display the total number of elements in 1D, 2D, and 3D arrays
print(my1DArray.size, my2DArray.size, my3Darray.size)
Output: 4 12 16
98 / 199
NumPy
# Print information about memory layout
print(my1DArray.flags, my2DArray.flags, my3Darray.flags)
Output: C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED: True
WRITEBACKIFCOPY : False
99 / 199
NumPy
# Print the length of one array element in bytes
print(my1DArray.itemsize, my2DArray.itemsize, 
my3Darray.itemsize)
Output: 8 8 8
# Print the total consumed bytes by elements
print(my1DArray.nbytes, my2DArray.nbytes, my3Darray.nbytes)
Output: 32 96 128
# Sum along Numpy Axes
np_array_2d = np.arange(0, 6).reshape([2,3])
print(np_array_2d, np.sum(np_array_2d, axis = 0), 
np.sum(np_array_2d, axis = 1))
Output: [[0 1 2]
[3 4 5]] [3 5 7] [ 3 12]
100 / 199
NumPy
# Create a subset and slice an array using an index
x = np.array([10, 20, 30, 40, 50])
# Select items at index 0 and 1
print(x[0:2])
Output: [10 20]
# Select rows 0-1 with various column slices from a 2D array
y = np.array([[ 1, 2, 3, 4], [ 9, 10, 11 ,12]])
print("m =", y[0:2, 1])
print("n =", y[0:2, 0:2])
print("l =", y[0:2, 2:4])
Output: m = [ 2 10]
n = [[ 1 2]
[ 9 10]]
l = [[ 3 4]
[11 12]]
101 / 199
NumPy
# Specifying conditions
biggerThan2 = (y >= 2)
print(y[biggerThan2])
Output: [ 2 3 4 9 10 11 12]
# Basic operations (+, -, *, /, %)
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])
# Add two array
add = np.add(x, y)
print(add)
Output: [[ 2 6 12]
[ 4 6 2]]
102 / 199
NumPy
# Subtract two array
sub = np.subtract(x, y)
print(sub)
# Multiply two array
mul = np.multiply(x, y)
print(mul)
# Divide x, y
div = np.divide(x,y)
print(div)
# Calculated the remainder of x and y
rem = np.remainder(x, y)
print(rem)
Output: [[ 0 -2 -6]
[ 0 0 6]]
[[ 1 8 27]
[ 4 9 -8]]
[[ 1. 0.5 0.33333333]
[ 1. 1. -2. ]]
[[0 2 3]
[0 0 0]]
103 / 199
NumPy
# Broadcasting - operate on arrays of different shapes
# Rule 1: Two dimensions are compatible if they are equal
# Create an array of two dimension
A = np.ones((6, 8))
# Shape of A
print(A.shape)
print(A)
Output: (6, 8)
[[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]]
104 / 199
NumPy
# Create another array
B = np.random.random((6,8))
# Shape of B
print(B.shape)
print(B)
Output: (6, 8)
[[0.06148782 0.10690907 0.92578537 0.29907577 0.42786516 0.01944468
0.14473416 0.30382709]
[0.36209211 0.33220132 0.43412798 0.97707517 0.23210006 0.05892264
0.34311993 0.97168464]
[0.34048395 0.06280427 0.78917397 0.50310127 0.36555426 0.27233463
0.60115097 0.77911552]
[0.39724957 0.38369108 0.10517771 0.97519711 0.49966346 0.51715226
0.50031762 0.91470124]
[0.7647788 0.37106634 0.17694871 0.90837723 0.1932456 0.20634914
0.29533289 0.66564862]
[0.72985568 0.85682569 0.01275113 0.98932163 0.1776967 0.95006083
0.59139126 0.3131595 ]]
105 / 199
NumPy
# Sum of A and B; both arrays have the same shape.
print(A + B)
Output: [[1.06148782 1.10690907 1.92578537 1.29907577 1.42786516 1.01944468
1.14473416 1.30382709]
[1.36209211 1.33220132 1.43412798 1.97707517 1.23210006 1.05892264
1.34311993 1.97168464]
[1.34048395 1.06280427 1.78917397 1.50310127 1.36555426 1.27233463
1.60115097 1.77911552]
[1.39724957 1.38369108 1.10517771 1.97519711 1.49966346 1.51715226
1.50031762 1.91470124]
[1.7647788 1.37106634 1.17694871 1.90837723 1.1932456 1.20634914
1.29533289 1.66564862]
[1.72985568 1.85682569 1.01275113 1.98932163 1.1776967 1.95006083
1.59139126 1.3131595 ]]
106 / 199
NumPy
# Rule 2: Two dimensions are compatible when one of them is 1
# Initialize `x`
x = np.ones((3,4))
print(x)
print(x.shape)
Output: [[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
(3, 4)
# Initialize `y`
y = np.arange(4)
print(y)
print(y.shape)
Output: [0 1 2 3]
(4,)
107 / 199
NumPy
# Subtract `x` and `y`
print(x - y)
Output: [[ 1. 0. -1. -2.]
[ 1. 0. -1. -2.]
[ 1. 0. -1. -2.]]
'''Rule 3: Arrays can be broadcast together if they are
compatible in all dimensions.'''
x = np.ones((2,3))
print("x:", x)
Output: x: [[1. 1. 1.]
[1. 1. 1.]]
108 / 199
NumPy
# Initialize 'y'
y = np.random.random((2, 1, 3))
print("y:", y)
Output: y: [[[0.91087436 0.74716299 0.8804711 ]]
[[0.20148139 0.27853328 0.0647736 ]]]
# Sum of 'x' and 'y'
print("sum: ", x + y)
Output: sum: [[[1.91087436 1.74716299 1.8804711 ]
[1.91087436 1.74716299 1.8804711 ]]
[[1.20148139 1.27853328 1.0647736 ]
[1.20148139 1.27853328 1.0647736 ]]]
109 / 199
pandas
▶ Open-source Python library for data manipulation and analysis
▶ Created by Wes McKinney in 2008
▶ Widely used in data science, finance, and research
▶ GitHub: https://github.com/pandas-dev/pandas
▶ Key features:
▶ Data structures: Series and DataFrame
▶ Handling missing data
▶ Data filtering, grouping, and merging
110 / 199
Usage of pandas
▶ Simplifies data manipulation tasks
▶ Integrates with other Python libraries (NumPy, Matplotlib,
etc.)
▶ Handles large datasets efficiently
▶ Supports various data formats (CSV, Excel, SQL, etc.; see the sketch below)
▶ Enables quick data exploration and visualization.
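▶ A few of those I/O entry points; the file names and connection are hypothetical:
import pandas as pd

df = pd.read_csv("sales.csv")   # CSV
df.to_excel("sales.xlsx")       # Excel (requires openpyxl)
# df = pd.read_sql("SELECT * FROM sales", connection)  # SQL, given a DB-API connection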
111 / 199
pandas
# Setting up the pandas environment
!pip install --upgrade pandas
import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
Requirement already satisfied: ...
Pandas Version: 2.3.1
# Customizing display settings for data visibility
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
112 / 199
pandas
# Create a pandas Series
series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
Output: 0 2
1 3
2 7
3 11
4 13
5 17
6 19
7 23
dtype: int64
113 / 199
pandas
# Create a DataFrame from mixed series, arrays, and scalars
series_df = pd.DataFrame({
'A': range(1, 5), 'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety",
"Bipolar Disorder", "Eating Disorder"]),
'F': 'Mental health', 'G': 'is challenging'
})
display(series_df)
Output: (DataFrame with columns A-G rendered as a table)
114 / 199
pandas
# Create a dataframe for a dictionary
dict_df = [{'A': 'Apple', 'B': 'Ball'},
{'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)
Output: A B C
0 Apple Ball NaN
1 Aeroplane Bat Cat
# Create a dataframe from n-dimensional arrays
sdf = {'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark',
'Oppland', 'Buskerud'], 'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10,
14910.94], 'Administrative centre': ["Sarpsborg",
"Oslo", "City of Oslo", "Hamar", "Lillehammer", "Drammen"]}
print(pd.DataFrame(sdf))
115 / 199
pandas
Output: County ISO-Code Area Administrative centre
0 Østfold 1 4180.69 Sarpsborg
1 Hordaland 2 4917.94 Oslo
2 Oslo 3 454.07 City of Oslo
3 Hedmark 4 27397.76 Hamar
4 Oppland 5 25192.10 Lillehammer
5 Buskerud 6 14910.94 Drammen
# Table formats supported by tabulate:
# 'plain', 'simple', 'github', 'grid', 'fancy_grid', 'pipe',
# 'orgtbl', 'jira', 'presto', 'pretty', 'psql', 'rst',
# 'mediawiki', 'moinmoin', 'youtrack', 'html', 'latex',
# 'latex_raw', 'latex_booktabs', 'textile'
from tabulate import tabulate
# displaying the DataFrame
print(tabulate(sdf, headers = 'keys', tablefmt = 'psql'))
116 / 199
pandas
from tabulate import tabulate
# Displaying the DataFrame
print(tabulate(sdf, headers = 'keys', tablefmt = 'github'))
print(tabulate(sdf, headers = 'keys', tablefmt = 'grid'))
117 / 199
pandas
Output: (the county table rendered in github and grid formats)
118 / 199
pandas
# Load a dataset from an external source into a DataFrame
keys = ['age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation',
'relationship','ethnicity', 'gender', 'capital_gain',
'capital_loss', 'hours_per_week','country_of_origin','income']
url = ('http://archive.ics.uci.edu/ml/machine-learning-'
       'databases/adult/adult.data')
df = pd.read_csv(url, names=keys)
print(df.head())
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
119 / 199
pandas
# Retrieve the first 10 records
print(df.head(10))
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
5 37 Private 284582 ... United-States <=50K
6 49 Private 160187 ... Jamaica <=50K
7 52 Self-emp-not-inc 209642 ... United-States >50K
8 31 Private 45781 ... United-States >50K
9 42 Private 159449 ... United-States >50K
120 / 199
pandas
# Retrieve the last 5 records
print(df.tail())
Output:
age workclass fnlwgt ... country_of_origin income
32556 27 Private 257302 ... United-States <=50K
32557 40 Private 154374 ... United-States >50K
32558 58 Private 151910 ... United-States <=50K
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
121 / 199
pandas
# Retrieve the last 10 records
print(df.tail(10))
Output:
age workclass fnlwgt ... country_of_origin income
32551 32 Private 34066 ... United-States <=50K
32552 43 Private 84661 ... United-States <=50K
32553 32 Private 116138 ... Taiwan <=50K
32554 53 Private 321865 ... United-States >50K
32555 22 Private 310152 ... United-States <=50K
32556 27 Private 257302 ... United-States <=50K
32557 40 Private 154374 ... United-States >50K
32558 58 Private 151910 ... United-States <=50K
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
122 / 199
pandas
# Displays the rows, columns, data types, and memory
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education_num 32561 non-null int64
...
12 hours_per_week 32561 non-null int64
13 country_of_origin 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
123 / 199
pandas
# Selects a row
print(df.iloc[10])
Output:
age 37
workclass Private
fnlwgt 280464
education Some-college
education_num 10
marital_status Married-civ-spouse
occupation Exec-managerial
relationship Husband
ethnicity Black
gender Male
capital_gain 0
capital_loss 0
hours_per_week 80
country_of_origin United-States
income >50K
Name: 10, dtype: object
124 / 199
pandas
# Selects 10 rows
print(df.iloc[0:10])
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
5 37 Private 284582 ... United-States <=50K
6 49 Private 160187 ... Jamaica <=50K
7 52 Self-emp-not-inc 209642 ... United-States >50K
8 31 Private 45781 ... United-States >50K
9 42 Private 159449 ... United-States >50K
125 / 199
pandas
# Selects a range of rows
df.iloc[10:15]
Output:
age workclass fnlwgt ... country_of_origin income
10 37 Private 280464 ... United-States >50K
11 30 State-gov 141297 ... India >50K
12 23 Private 122272 ... United-States <=50K
13 32 Private 205019 ... United-States <=50K
14 40 Private 121772 ... ? >50K
126 / 199
pandas
# Selects the last 2 rows
print(df.iloc[-2:])
Output:
age workclass fnlwgt ... country_of_origin income
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
127 / 199
pandas
# Selects every other row, columns 3-4 (education, education_num)
df.iloc[::2, 3:5].head()
Output:
education education_num
0 Bachelors 13
2 HS-grad 9
4 Bachelors 13
6 9th 5
8 Masters 14
128 / 199
pandas
# Seeding random data from numpy
np.random.seed(24)
# Making the DataFrame
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4),
columns=list('BCDE'))], axis = 1)
# DataFrame without any styling
print(df)
Output:
A B C D E
0 1.0 1.329212 -0.770033 -0.316280 -0.990810
1 2.0 -1.070816 -1.438713 0.564417 0.295722
2 3.0 -1.626404 0.219565 0.678805 1.889273
3 4.0 0.961538 0.104011 -0.481165 0.850229
4 5.0 1.453425 1.057737 0.165562 0.515018
5 6.0 -1.336936 0.562861 1.392855 -0.063328
6 7.0 0.121668 1.207603 -0.002040 1.627796
7 8.0 0.354493 1.037528 -0.385684 0.519818
8 9.0 1.686583 -1.325963 1.428984 -2.089354
9 10.0 -0.129820 0.631523 -0.586538 0.290720
129 / 199
pandas
# Styling DataFrame using DataFrame.style property
df.style.set_properties(**{'background-color':
'black', 'color': 'green'})
Output: (table rendered with black background and green text)
130 / 199
pandas
# Replacing the locating value by NaN (Not a Number)
df.iloc[0, 3] = np.nan
df.iloc[2, 3] = np.nan
df.iloc[4, 2] = np.nan
df.iloc[7, 4] = np.nan
print(df)
Output:
A B C D E
0 1.0 1.329212 -0.770033 NaN -0.990810
1 2.0 -1.070816 -1.438713 0.564417 0.295722
2 3.0 -1.626404 0.219565 NaN 1.889273
3 4.0 0.961538 0.104011 -0.481165 0.850229
4 5.0 1.453425 NaN 0.165562 0.515018
5 6.0 -1.336936 0.562861 1.392855 -0.063328
6 7.0 0.121668 1.207603 -0.002040 1.627796
7 8.0 0.354493 1.037528 -0.385684 NaN
8 9.0 1.686583 -1.325963 1.428984 -2.089354
9 10.0 -0.129820 0.631523 -0.586538 0.290720
131 / 199
pandas
# Highlight the NaN values in DataFrame
df.style.highlight_null(color='red')
Output: (NaN cells highlighted in red)
132 / 199
pandas
# Highlight the Min values in each column
df.style.highlight_min(axis = 0)
Output: (minimum value in each column highlighted)
133 / 199
pandas
# Highlight the Max values in each column
df.style.highlight_max(axis = 0)
Output: (maximum value in each column highlighted)
134 / 199
pandas
# Highlight the Max values in each row
df.style.highlight_max(axis = 1)
Output: (maximum value in each row highlighted)
135 / 199
pandas
# Set text color of positive values in DataFrames
def color_positive_green(val):
    """Return the CSS property 'color: green' for
    positive values, 'color: black' otherwise."""
    color = 'green' if val > 0 else 'black'
    return 'color: %s' % color

df.style.applymap(color_positive_green)  # newer pandas: df.style.map
136 / 199
pandas
Output: (positive values rendered in green)
137 / 199
pandas
# Import seaborn library
import seaborn as sns
# Declaring the color palette from seaborn
cm = sns.light_palette("green", as_cmap=True)
# DataFrame with background gradient and precision
df.style.background_gradient(cmap=cm).format(precision = 2)
138 / 199
pandas
Output: (green background gradient, two-decimal precision)
139 / 199
pandas
# Checking Missing Values in a DataFrame
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df, "n")
print(df.isnull())
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False
140 / 199
pandas
# Filtering Data Based on Missing Values
'''Download employees dataset
https://media.geeksforgeeks.org/wp-content/uploads/employees.csv
'''
import pandas as pd
d = pd.read_csv("/content/employees.csv")
bool_series = pd.isnull(d["Gender"])
missing_gender_data = d[bool_series]
print(missing_gender_data.head())
Output:
First Name Gender ... Senior Management Team
0 Lois NaN ... True Legal
22 Joshua NaN ... True Client Services
27 Scott NaN ... True Legal
31 Joyce NaN ... True Product
41 Christine NaN ... True Business Development
141 / 199
pandas
# Checking for Non-Missing Values
import pandas as pd
import numpy as np
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df)
print(df.notnull())
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True
142 / 199
pandas
# Filtering Data with Non-Missing Values
import pandas as pd
d = pd.read_csv("/content/employees.csv")
nmg = pd.notnull(d["Gender"])
print(d[nmg].head())
Output:
First Name Gender ... Senior Management Team
0 Douglas Male ... True Marketing
1 Thomas Male ... True NaN
2 Maria Female ... True Finance
3 Jerry Male ... True Finance
4 Larry Male ... True Client Services
143 / 199
pandas
# Filling Missing Values in Pandas
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df, "n")
print(df.fillna(0))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
144 / 199
pandas
# Fill with Previous Value
print(df, "n")
print(df.fillna(method='pad'))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 90.0 56.0 80.0
3 95.0 56.0 98.0
145 / 199
pandas
# Fill with Next Value
print(df, "n")
print(df.fillna(method='bfill'))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 40.0
1 90.0 45.0 40.0
2 95.0 56.0 80.0
3 95.0 NaN 98.0
146 / 199
pandas
# Fill NaN Values with 'No Gender'
d = pd.read_csv("/content/employees.csv")
print(d[20:25])
d["Gender"].fillna('No Gender', inplace = True)
print(d[20:25])
Output:
First Name Gender ... Senior Management Team
20 Lois NaN ... True Legal
21 Matthew Male ... False Marketing
22 Joshua NaN ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
First Name Gender ... Senior Management Team
20 Lois No Gender ... True Legal
21 Matthew Male ... False Marketing
22 Joshua No Gender ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
147 / 199
pandas
# Replace all NaN values with -99.
d = pd.read_csv("/content/employees.csv")
print(d[20:25])
d = d.replace(to_replace = np.nan, value = -99)
print(d[20:25])
Output:
First Name Gender ... Senior Management Team
20 Lois NaN ... True Legal
21 Matthew Male ... False Marketing
22 Joshua NaN ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
First Name Gender ... Senior Management Team
20 Lois -99 ... True Legal
21 Matthew Male ... False Marketing
22 Joshua -99 ... True Client Services
23 -99 Male ... -99 -99
24 John Male ... False Client Services
148 / 199
pandas
# Fills missing values using interpolation techniques
'''
'linear' - Linear interpolation between adjacent non-missing values.
'polynomial' - Polynomial interpolation (Order 2 for quadratic).
'nearest' - Fills with the nearest non-missing value.
'zero' - Fills with the previous non-missing value
(piecewise constant).
'slinear' - Spline interpolation of order 1
(equivalent to linear in pandas).
'quadratic' - Polynomial interpolation of order 2.
'barycentric' - Barycentric interpolation for smooth approximations.
'''
149 / 199
pandas
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
"B": [None, 2, 54, 3, None],
"C": [20, 16, None, 3, 8],
"D": [14, 3, None, None, 6]})
print(df)
#linear forward interpolation
print(df.interpolate(method = 'linear', limit_direction ='forward'))
Output:
A B C D
0 12.0 NaN 20.0 14.0
1 4.0 2.0 16.0 3.0
2 5.0 54.0 NaN NaN
3 NaN 3.0 3.0 NaN
4 1.0 NaN 8.0 6.0
A B C D
0 12.0 NaN 20.0 14.0
1 4.0 2.0 16.0 3.0
2 5.0 54.0 9.5 4.0
3 3.0 3.0 3.0 5.0
4 1.0 3.0 8.0 6.0
150 / 199
pandas
# Polynomial interpolation with order 2
print(df.interpolate(method ='polynomial', order = 2))
Output:
A B C D
0 12.000000 NaN 20.0 14.0
1 4.000000 2.0 16.0 3.0
2 5.000000 54.0 8.0 -2.0
3 4.578947 3.0 3.0 -1.0
4 1.000000 NaN 8.0 6.0
151 / 199
pandas
# Drop rows where all values are missing
data = {'First Score': [100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, np.nan, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(data)
print(df, "\n")
print(df.dropna(how = 'all'))
Output:
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 NaN
1 NaN NaN NaN NaN
2 NaN 45.0 80.0 NaN
3 95.0 56.0 98.0 65.0
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 NaN
2 NaN 45.0 80.0 NaN
3 95.0 56.0 98.0 65.0
152 / 199
pandas
# Remove columns that contain at least one missing value
data = {'First Score': [100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, np.nan, 80, 98],
        'Fourth Score': [60, 67, 68, 65]}
df = pd.DataFrame(data)
print(df, "\n")
print(df.dropna(axis=1))
Output:
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 60
1 NaN NaN NaN 67
2 NaN 45.0 80.0 68
3 95.0 56.0 98.0 65
Fourth Score
0 60
1 67
2 68
3 65
153 / 199
pandas
# Drop rows with missing values
d = pd.read_csv("/content/employees.csv")
nd = d.dropna(axis=0, how='any')
print("Old data frame length:", len(d))
print("New data frame length:", len(nd))
print("Rows with at least one missing value:",
(len(d) - len(nd)))
Output:
Old data frame length: 1000
New data frame length: 764
Rows with at least one missing value: 236
154 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
'Age': [17, 14, 12, 52],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])
print(df, "nn", df1)
155 / 199
pandas
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
Name Age Address Qualification
4 Abhi 17 Nagpur Btech
5 Ayushi 14 Kanpur B.A
6 Dhiraj 12 Allahabad Bcom
7 Hitesh 52 Kannuaj B.hons
156 / 199
pandas
# Concatenating DataFrame
frames = [df, df1]
res1 = pd.concat(frames)
print(res1)
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
4 Abhi 17 Nagpur Btech
5 Ayushi 14 Kanpur B.A
6 Dhiraj 12 Allahabad Bcom
7 Hitesh 52 Kannuaj B.hons
157 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
'Mobile No': [97, 91, 58, 76]}
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age': [22, 32, 12, 52],
'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
print(df, "nn", df1)
158 / 199
pandas
Output:
Name Age Address Qualification Mobile No
0 Jai 27 Nagpur Msc 97
1 Princi 24 Kanpur MA 91
2 Gaurav 22 Allahabad MCA 58
3 Anuj 32 Kannuaj Phd 76
Name Age Address Qualification Salary
2 Gaurav 22 Allahabad MCA 1000
3 Anuj 32 Kannuaj Phd 2000
6 Dhiraj 12 Allahabad Bcom 3000
7 Hitesh 52 Kannuaj B.hons 4000
159 / 199
pandas
# Inner Join
res2 = pd.concat([df, df1], axis=1, join='inner')
print(res2)
Output:
Name Age Address Qualification Mobile No Name 
2 Gaurav 22 Allahabad MCA 58 Gaurav
3 Anuj 32 Kannuaj Phd 76 Anuj
Age Address Qualification Salary
2 22 Allahabad MCA 1000
3 32 Kannuaj Phd 2000
160 / 199
pandas
# Outer Join
res2 = pd.concat([df, df1], axis = 1, sort = False)
print(res2)
Output:
Name Age Address Qualification Mobile No Name
0 Jai 27.0 Nagpur Msc 97.0 NaN
1 Princi 24.0 Kanpur MA 91.0 NaN
2 Gaurav 22.0 Allahabad MCA 58.0 Gaurav
3 Anuj 32.0 Kannuaj Phd 76.0 Anuj
6 NaN NaN NaN NaN NaN Dhiraj
7 NaN NaN NaN NaN NaN Hitesh
Age Address Qualification Salary
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 22.0 Allahabad MCA 1000.0
3 32.0 Kannuaj Phd 2000.0
6 12.0 Allahabad Bcom 3000.0
7 52.0 Kannuaj B.hons 4000.0
161 / 199
pandas
# Concatenating DataFrames, ignoring indexes
res = pd.concat([df, df1], ignore_index=True)
print(res)
Output:
Name Age Address Qualification Mobile No Salary
0 Jai 27 Nagpur Msc 97.0 NaN
1 Princi 24 Kanpur MA 91.0 NaN
2 Gaurav 22 Allahabad MCA 58.0 NaN
3 Anuj 32 Kannuaj Phd 76.0 NaN
4 Gaurav 22 Allahabad MCA NaN 1000.0
5 Anuj 32 Kannuaj Phd NaN 2000.0
6 Dhiraj 12 Allahabad Bcom NaN 3000.0
7 Hitesh 52 Kannuaj B.hons NaN 4000.0
162 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
'Mobile No': [97, 91, 58, 76]}
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age': [22, 32, 12, 52],
'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])
print(df, "nn", df1)
163 / 199
pandas
Output:
Name Age Address Qualification Mobile No
0 Jai 27 Nagpur Msc 97
1 Princi 24 Kanpur MA 91
2 Gaurav 22 Allahabad MCA 58
3 Anuj 32 Kannuaj Phd 76
Name Age Address Qualification Salary
4 Gaurav 22 Allahabad MCA 1000
5 Anuj 32 Kannuaj Phd 2000
6 Dhiraj 12 Allahabad Bcom 3000
7 Hitesh 52 Kannuaj B.hons 4000
164 / 199
pandas
# Concatenating DataFrame with group keys
frames = [df, df1]
res = pd.concat(frames, keys=['x', 'y'])
print(res)
Output:
Name Age Address Qualification Mobile No Salary
x 0 Jai 27 Nagpur Msc 97.0 NaN
1 Princi 24 Kanpur MA 91.0 NaN
2 Gaurav 22 Allahabad MCA 58.0 NaN
3 Anuj 32 Kannuaj Phd 76.0 NaN
y 4 Gaurav 22 Allahabad MCA NaN 1000.0
5 Anuj 32 Kannuaj Phd NaN 2000.0
6 Dhiraj 12 Allahabad Bcom NaN 3000.0
7 Hitesh 52 Kannuaj B.hons NaN 4000.0
165 / 199
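The group keys become the outer level of a MultiIndex, so each original frame can be recovered with .loc; a short sketch:
# Select only the rows that came from the first frame
print(res.loc['x'])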
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data1,index=[0, 1, 2, 3])
s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary')
print(df,"nn", s1)
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
0 1000
1 2000
2 3000
3 4000
Name: Salary, dtype: int64
166 / 199
pandas
# Concatenating Mixed DataFrames and Series
res = pd.concat([df, s1], axis = 1)
print(res)
Output:
Name Age Address Qualification Salary
0 Jai 27 Nagpur Msc 1000
1 Princi 24 Kanpur MA 2000
2 Gaurav 22 Allahabad MCA 3000
3 Anuj 32 Kannuaj Phd 4000
167 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "n", df1)
Output:
key Name Age
0 K0 Jai 27
1 K1 Princi 24
2 K2 Gaurav 22
3 K3 Anuj 32
key Address Qualification
0 K0 Nagpur Btech
1 K1 Kanpur B.A
2 K2 Allahabad Bcom
3 K3 Kannuaj B.hons
168 / 199
pandas
# Merging DataFrames Using One Key
res = pd.merge(df, df1, on = 'key')
print(res)
Output:
key Name Age Address Qualification
0 K0 Jai 27 Nagpur Btech
1 K1 Princi 24 Kanpur B.A
2 K2 Gaurav 22 Allahabad Bcom
3 K3 Anuj 32 Kannuaj B.hons
169 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K1', 'K0', 'K1'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "nn", df1)
170 / 199
pandas
Output:
key key1 Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
key key1 Address Qualification
0 K0 K0 Nagpur Btech
1 K1 K0 Kanpur B.A
2 K2 K0 Allahabad Bcom
3 K3 K0 Kannuaj B.hons
171 / 199
pandas
# Merging DataFrames Using Multiple Keys
res1 = pd.merge(df, df1, on=['key', 'key1'])
print(res1)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K2 K0 Gaurav 22 Allahabad Bcom
172 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K1', 'K0', 'K1'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "nn", df1)
173 / 199
pandas
Output:
key key1 Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
key key1 Address Qualification
0 K0 K0 Nagpur Btech
1 K1 K0 Kanpur B.A
2 K2 K0 Allahabad Bcom
3 K3 K0 Kannuaj B.hons
174 / 199
pandas
# Left outer join
res = pd.merge(df, df1, how = 'left', on = ['key', 'key1'])
print(res)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K1 K1 Princi 24 NaN NaN
2 K2 K0 Gaurav 22 Allahabad Bcom
3 K3 K1 Anuj 32 NaN NaN
175 / 199
pandas
# Right outer join
res1 = pd.merge(df, df1, how = 'right', on = ['key', 'key1'])
print(res1)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27.0 Nagpur Btech
1 K1 K0 NaN NaN Kanpur B.A
2 K2 K0 Gaurav 22.0 Allahabad Bcom
3 K3 K0 NaN NaN Kannuaj B.hons
176 / 199
pandas
# Outer join
res2 = pd.merge(df, df1, how='outer', on=['key', 'key1'])
print(res2)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27.0 Nagpur Btech
1 K1 K0 NaN NaN Kanpur B.A
2 K1 K1 Princi 24.0 NaN NaN
3 K2 K0 Gaurav 22.0 Allahabad Bcom
4 K3 K0 NaN NaN Kannuaj B.hons
5 K3 K1 Anuj 32.0 NaN NaN
177 / 199
pandas
# Inner join
res3 = pd.merge(df, df1, how = 'inner', on = ['key', 'key1'])
print(res3)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K2 K0 Gaurav 22 Allahabad Bcom
178 / 199
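For exploratory work it is often useful to know which side each row came from; merge's indicator flag adds a _merge column for exactly this. A short sketch with the same df and df1:
# Tag each row as 'left_only', 'right_only', or 'both'
res4 = pd.merge(df, df1, how='outer', on=['key', 'key1'],
                indicator=True)
print(res4['_merge'].value_counts())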
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32]}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
print(df, "nn", df1)
Output:
Name Age
K0 Jai 27
K1 Princi 24
K2 Gaurav 22
K3 Anuj 32
Address Qualification
K0 Allahabad MCA
K2 Kannuaj Phd
K3 Allahabad Bcom
K4 Kannuaj B.hons
179 / 199
pandas
# Merge DataFrames based on row indexes
res = df.join(df1)
print(res)
Output:
Name Age Address Qualification
K0 Jai 27 Allahabad MCA
K1 Princi 24 NaN NaN
K2 Gaurav 22 Kannuaj Phd
K3 Anuj 32 Allahabad Bcom
180 / 199
pandas
# Merge DataFrames based on row indexes
res = df1.join(df)
print(res)
Output:
Address Qualification Name Age
K0 Allahabad MCA Jai 27.0
K2 Kannuaj Phd Gaurav 22.0
K3 Allahabad Bcom Anuj 32.0
K4 Kannuaj B.hons NaN NaN
181 / 199
pandas
# Outer Join
res1 = df.join(df1, how='outer')
print(res1)
Output:
Name Age Address Qualification
K0 Jai 27.0 Allahabad MCA
K1 Princi 24.0 NaN NaN
K2 Gaurav 22.0 Kannuaj Phd
K3 Anuj 32.0 Allahabad Bcom
K4 NaN NaN Kannuaj B.hons
182 / 199
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Key':['K0', 'K1', 'K2', 'K3']}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
print(df, "nn", df1)
Output:
Name Age Key
0 Jai 27 K0
1 Princi 24 K1
2 Gaurav 22 K2
3 Anuj 32 K3
Address Qualification
K0 Allahabad MCA
K2 Kannuaj Phd
K3 Allahabad Bcom
K4 Kannuaj B.hons
183 / 199
pandas
# Joining DataFrames Using "on" Argument
res2 = df.join(df1, on='Key')
print(res2)
Output:
Name Age Key Address Qualification
0 Jai 27 K0 Allahabad MCA
1 Princi 24 K1 NaN NaN
2 Gaurav 22 K2 Kannuaj Phd
3 Anuj 32 K3 Allahabad Bcom
184 / 199
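Passing on='Key' aligns df's Key column against df1's index. The same join can be written explicitly by promoting the column to the index first; a sketch (the only difference is that the result is then indexed by Key):
# Equivalent formulation via set_index
res3 = df.set_index('Key').join(df1)
print(res3)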
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav'],
'Age':[27, 24, 22]}
data2 = {'Address':['Allahabad', 'Kannuaj',
'Allahabad', 'Kanpur'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1, index=pd.Index(['K0', 'K1', 'K2'],
name='key'))
index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
('K2', 'Y2'), ('K2', 'Y3')],
names=['key', 'Y'])
df1 = pd.DataFrame(data2, index= index)
print(df, "nn", df1)
185 / 199
pandas
Output:
Name Age
key
K0 Jai 27
K1 Princi 24
K2 Gaurav 22
Address Qualification
key Y
K0 Y0 Allahabad MCA
K1 Y1 Kannuaj Phd
K2 Y2 Allahabad Bcom
Y3 Kanpur B.hons
186 / 199
pandas
# Joining DataFrames with Different Index Levels (Multi-Index)
result = df.join(df1, how='inner')
print(result)
Output:
Name Age Address Qualification
key Y
K0 Y0 Jai 27 Allahabad MCA
K1 Y1 Princi 24 Kannuaj Phd
K2 Y2 Gaurav 22 Allahabad Bcom
Y3 Gaurav 22 Kanpur B.hons
187 / 199
SciPy
▶ SciPy is an open-source Python library for scientific and
technical computing.
▶ Relies on NumPy, which provides efficient n-dimensional array
manipulation.
▶ Covers areas like optimization, integration, interpolation,
eigenvalue problems, and statistics.
▶ Essential for research, data analysis, and engineering projects.
188 / 199
Usage of SciPy
▶ Scientific Computing: Solves differential equations and
performs numerical integration.
▶ Statistics: Offers scipy.stats for hypothesis testing,
probability distributions, and more.
▶ Optimization: Includes tools for linear programming and
nonlinear optimization.
▶ Signal Processing: Provides functions for Fourier transforms
and filtering.
▶ Preparation: Install via pip install scipy and explore
documentation.
189 / 199
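A minimal optimization sketch with scipy.optimize follows; the quadratic objective is an illustrative assumption:
# Minimize a simple quadratic; the minimum lies at x = 2
from scipy import optimize
f = lambda x: (x - 2)**2
res = optimize.minimize_scalar(f)
print(f"Minimum at x = {res.x:.4f}")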
SciPy
from scipy import stats
data = [1.5, 2.3, 3.1, 4.2, 5.0]
mean = stats.tmean(data)
std_dev = stats.tstd(data)
print(f"Mean: {mean}, Standard Deviation: {std_dev}")
Output: Mean: 3.22, Standard Deviation: 1.4096098751072936
from scipy import integrate
import numpy as np
f = lambda x: x**2
result, error = integrate.quad(f, 0, 1)
print(f"Integral from 0 to 1: {result} (Error: {error})")
Output: Integral from 0 to 1: 0.33333333333333337
(Error: 3.700743415417189e-15)
190 / 199
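scipy.stats also supports the hypothesis testing mentioned earlier; a one-sample t-test sketch (the sample and the hypothesized mean of 3.0 are illustrative assumptions):
# One-sample t-test: does the sample mean differ from 3.0?
from scipy import stats
data = [1.5, 2.3, 3.1, 4.2, 5.0]
t_stat, p_value = stats.ttest_1samp(data, popmean=3.0)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")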
Matplotlib
▶ A comprehensive Python library for creating static, animated,
and interactive visualizations.
▶ Offers a wide range of customizable plots (line, bar, scatter,
etc.) and backends.
▶ Applications: Used in professional reporting, interactive
dashboards, web/GUI applications, and embedded views.
191 / 199
Usage of Matplotlib
▶ Reporting: Generate publication-quality figures for research articles.
▶ Interactive Tools: Create dynamic plots with widgets for
data exploration.
▶ Dashboards: Build complex visualizations for real-time data
monitoring.
▶ Web Integration: Embed plots in web applications using
backends like WebAgg.
▶ Preparation: Install via pip install matplotlib and
explore documentation.
192 / 199
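For the reporting use case above, a figure is written to disk with savefig; a minimal sketch (the filename, dpi, and sample data are illustrative):
# Save the current figure as a high-resolution PNG
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [2, 4, 8])
plt.savefig("figure.png", dpi=300, bbox_inches='tight')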
Matplotlib
# Plot sine, cosine and tangent waves
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='Sine', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='Cosine', color='red',
linestyle='--')
plt.plot(x, np.tan(x), label='Tangent', color='green',
linestyle='-.')
plt.title("Sine, Cosine and Tangent Waves")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.show()
193 / 199
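One caveat: tan(x) diverges near its asymptotes, so the tangent spikes compress the sine and cosine curves. A common workaround (not part of the original slide) is to clip the visible range before calling plt.show():
# Limit the y-axis so the tangent spikes do not dominate
plt.ylim(-2, 2)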
Matplotlib
[Figure: sine, cosine, and tangent curves produced by the preceding code]
194 / 199
Matplotlib
# Plot sample bar and line chart
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.bar(categories, values, color='green')
plt.plot(categories, values, color='red', linewidth=2)
plt.title("Sample Bar and Line Chart")
plt.ylabel("Values")
plt.show()
195 / 199
Matplotlib
[Figure: bar chart of categories A, B, C with an overlaid line, produced by the preceding code]
196 / 199
Summary
This lecture covers:
▶ The basics of Exploratory Data Analysis (EDA) and its
significance.
▶ Measurement scales, data types, and data analysis
methodologies.
▶ The steps involved in EDA: gathering data, cleaning it,
visualizing it, and developing hypotheses.
▶ The differences between Bayesian, exploratory, and classical
analysis techniques.
▶ Python libraries for EDA (NumPy, pandas, SciPy, and
Matplotlib) and related software tools.
197 / 199
References I
TEXTBOOK
[1] Mukhiya, S. K., & Ahmed, U. (2020). Hands-On Exploratory
Data Analysis with Python: Perform EDA techniques to
understand, summarize, and investigate your data. Packt
Publishing Ltd.
REFERENCE BOOKS
[1] Pearson, R. K. (2020). Exploratory Data Analysis Using R (1st
ed.). CRC Press.
[2] Datar, R., & Garg, H. (2019). Hands-on exploratory data
analysis with R: Become an expert in exploratory data analysis
using R packages. Packt Publishing Ltd.
198 / 199
References II
ONLINE RESOURCES
[1] Python Pool. (2021, June 14). Numpy Axis in Python with
detailed examples. Python Pool.
https://www.pythonpool.com/numpy-axis/
[2] GeeksforGeeks. (2025, July 28). Working with Missing Data in
Pandas. GeeksforGeeks.
https://www.geeksforgeeks.org/data-analysis/
working-with-missing-data-in-pandas/
[3] GeeksforGeeks. (2025, July 26). Python | Pandas merging,
joining and concatenating. GeeksforGeeks.
https://www.geeksforgeeks.org/python/
python-pandas-merging-joining-and-concatenating/
199 / 199

More Related Content

PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
PDF
Linked List Data Structures .
PDF
The Value of Business Intelligence .
PDF
Business Intelligence and Information Exploitation.pdf
PDF
Introduction to Data Structures .
PDF
Searching and Sorting Algorithms
PDF
Multidimensional Data
Visual Aids for Exploratory Data Analysis.pdf
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
Linked List Data Structures .
The Value of Business Intelligence .
Business Intelligence and Information Exploitation.pdf
Introduction to Data Structures .
Searching and Sorting Algorithms
Multidimensional Data

Recently uploaded (20)

PDF
III.4.1.2_The_Space_Environment.p pdffdf
PDF
Soil Improvement Techniques Note - Rabbi
PDF
PPT on Performance Review to get promotions
PPTX
Artificial Intelligence
PPTX
UNIT - 3 Total quality Management .pptx
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
PPTX
Information Storage and Retrieval Techniques Unit III
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PPT
Total quality management ppt for engineering students
PDF
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PDF
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
UNIT 4 Total Quality Management .pptx
III.4.1.2_The_Space_Environment.p pdffdf
Soil Improvement Techniques Note - Rabbi
PPT on Performance Review to get promotions
Artificial Intelligence
UNIT - 3 Total quality Management .pptx
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
Safety Seminar civil to be ensured for safe working.
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
Information Storage and Retrieval Techniques Unit III
Fundamentals of safety and accident prevention -final (1).pptx
Total quality management ppt for engineering students
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Categorization of Factors Affecting Classification Algorithms Selection
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
UNIT 4 Total Quality Management .pptx
Ad
Ad

Exploratory_Data_Analysis_Fundamentals.pdf

  • 1. Exploratory Data Analysis Fundamentals Ashutosh Satapathy, Ph.D. Asst. Prof., Department of CSE, Siddhartha Academy of Higher Education (Deemed to be University) Vijayawada - 520007 1 / 199
  • 2. Outline Introduction to EDA Data From Data to Knowledge Exploratory Data Analysis Understanding Data Science Data Science Phases of Data Science The Significance of EDA Why is EDA Significant? The Role of EDA Steps in EDA Example: EDA for a Fitness App Making Sense of Data Data Matters Dataset Data Storage Numerical Data Categorical Data Measurement Scales Data Analysis Approaches Classical Data Analysis Exploratory Data Analysis Bayesian Data Analysis Key Differences Examples Software Tools for EDA Python R Programming Weka KNIME EDA using Python NumPy pandas SciPy Matplotlib Summary References 2 / 199
  • 3. What is Data? ▶ A collection of discrete objects, numbers, words, events, facts, measurements, observations, or descriptions. ▶ Generated by processes in various disciplines: ▶ Biology: Genetic sequences, protein structures ▶ Economics: Market trends, GDP data ▶ Engineering: Sensor readings, performance metrics ▶ Marketing: Customer preferences, sales data ▶ Example: A dataset of customer purchases in a retail store includes product IDs, purchase dates, and amounts spent. 3 / 199
  • 4. From Data to Knowledge ▶ Data: Raw facts and figures (e.g., sales numbers: $500, $300, $700). ▶ Information: Processed data with context (e.g., average sales per day: $500). ▶ Knowledge: Insights derived from information (e.g., sales peak on weekends). ▶ Goal: Transform raw data into actionable knowledge. ▶ Example: Analyzing website traffic data to identify peak visiting hours, leading to optimized ad schedules. 4 / 199
  • 5. What is Exploratory Data Analysis (EDA)? ▶ A process to examine datasets and uncover: ▶ Patterns ▶ Anomalies ▶ Hypotheses ▶ Assumptions ▶ Uses statistical measures and visualizations. ▶ Performed before formal modeling or hypothesis testing. ▶ Example: Plotting sales data to spot seasonal trends or outliers (e.g., a sudden spike in sales due to a promotion). 5 / 199
  • 6. Why is EDA? ▶ Helps statisticians understand data characteristics. ▶ Uncovers hidden insights before formal modeling. ▶ Guides hypothesis generation and data collection strategies. ▶ Prevents incorrect assumptions in modeling. ▶ Example: 1. In a medical study, EDA reveals missing values in patient records, prompting data cleaning before analysis. 2. EDA on patient data reveals inconsistent heart rate readings, prompting sensor recalibration. 6 / 199
  • 7. Key Steps in EDA 1. Data Collection: Gather raw data (e.g., sensor readings from a manufacturing plant). 2. Data Cleaning: Handle missing values, outliers (e.g., removing erroneous temperature readings). 3. Descriptive Statistics: Compute mean, median, variance (e.g., average production rate). 4. Visualization: Create plots (histograms, scatter plots) to identify trends. 5. Hypothesis Generation: Formulate questions based on patterns (e.g., does production rate vary by shift?). 7 / 199
  • 8. Example: EDA in Retail Sales ▶ Dataset: Daily sales data for a clothing store over one year. ▶ Steps: ▶ Check for missing sales entries. ▶ Calculate average sales per month. ▶ Plot sales trends using a line graph. ▶ Identify outliers (e.g., Black Friday sales spike). ▶ Insight: Sales peak during holiday seasons, suggesting increased inventory in November–December. 8 / 199
  • 9. Tools for EDA ▶ Programming Languages: Python (pandas, matplotlib), R (ggplot2). ▶ Software: Excel, Tableau, Power BI. ▶ Visualization Techniques: Histograms, box plots, scatter plots, heatmaps. ▶ Example: Using Python to create a box plot of customer spending to detect high-spending outliers. 9 / 199
  • 10. Benefits of EDA ▶ Uncovers hidden patterns (e.g., customer churn trends). ▶ Detects data quality issues (e.g., duplicate entries). ▶ Informs better data collection strategies. ▶ Supports development of robust models. ▶ Example: EDA on weather data reveals inconsistent sensor readings, leading to sensor recalibration. 10 / 199
  • 11. What is Data Science? ▶ A cross-disciplinary field combining: ▶ Computer Science ▶ Statistics ▶ Mathematics ▶ Domain Knowledge ▶ Involves building models and extracting insights for business intelligence. ▶ No Ph.D. required—practical skills are key. ▶ Example: Predicting customer churn using purchase history and machine learning. 11 / 199
  • 12. Phases of Data Science ▶ These phases are similar to Cross-Industry Standard Process for Data Mining (CRISP-DM) framework: 1. Data Requirements 2. Data Collection 3. Data Processing 4. Data Cleaning 5. Exploratory Data Analysis (EDA) 6. Modeling and Algorithms 7. Data Product 8. Communication ▶ Each phase builds toward actionable insights. 12 / 199
  • 13. Data Requirements ▶ Identify and categorize data needed for analysis: ▶ Numerical (e.g., heart rate) ▶ Categorical (e.g., patient gender) ▶ Define storage and dissemination formats. ▶ Example: For a dementia study, collect sleep patterns, heart rate, and activity data from sensors to assess mental state. 13 / 199
  • 14. Data Collection ▶ Gather data from various sources (sensors, databases, APIs). ▶ Ensure proper storage and transfer to IT systems. ▶ Example: Collecting customer feedback from surveys and social media to analyze sentiment. 14 / 199
  • 15. Data Processing ▶ Pre-curate data before analysis. ▶ Tasks: Exporting, structuring, and formatting data into tables. ▶ Example: Converting raw sensor data into a structured CSV file with columns for time, value, and sensor ID. 15 / 199
  • 16. Data Cleaning ▶ Address incompleteness, duplicates, errors, and missing values. ▶ Techniques: Record matching, outlier detection, filling missing values. ▶ Example: Removing duplicate customer entries in a sales dataset and imputing missing purchase amounts using the median. 16 / 199
  • 17. Exploratory Data Analysis (EDA) ▶ Core stage to uncover patterns, anomalies, and hypotheses. ▶ Uses descriptive statistics and visualizations. ▶ May involve data transformation techniques. ▶ Example: Plotting sales data to identify seasonal trends or outliers (e.g., a spike during a holiday sale). 17 / 199
  • 18. Modeling and Algorithms ▶ Build models to represent relationships between variables. ▶ Judd model: Data = Model + Error. ▶ Example: Linear regression model for pen purchases: Total = UnitPrice × Quantity where Total is the dependent variable, and UnitPrice is independent. 18 / 199
  • 19. Data Product ▶ Software that uses data inputs to produce outputs and feedback. ▶ Based on models from analysis. ▶ Example: A recommendation system suggesting products based on user purchase history. 19 / 199
  • 20. Communication ▶ Share results with stakeholders via visualizations (tables, charts, diagrams). ▶ Drives business intelligence and decision-making. ▶ Example: A bar chart showing monthly sales trends to guide inventory planning. 20 / 199
  • 21. Why is EDA Significant? ▶ Data is collected in fields like science, economics, engineering, and marketing. ▶ Large datasets are stored in electronic databases, making manual analysis impossible. ▶ EDA is the first step in data mining to: ▶ Visualize data ▶ Understand patterns ▶ Create hypotheses ▶ Example: A store collects sales data to find which products sell best during holidays. 21 / 199
  • 22. The Role of EDA ▶ Reveals insights without assumptions (ground truth). ▶ Helps data scientists decide on models and hypotheses. ▶ Key components: ▶ Summarizing data (e.g., averages, totals) ▶ Statistical analysis (e.g., correlations) ▶ Visualization (e.g., graphs, charts) ▶ Example: Plotting student grades to spot trends, like higher scores in math vs. history. 22 / 199
  • 23. Steps in EDA ▶ EDA involves four key steps: 1. Problem Definition 2. Data Preparation 3. Data Analysis 4. Development and Representation of Results ▶ Each step builds toward clear, actionable insights. 23 / 199
  • 24. Step 1: Problem Definition ▶ Define the business problem to guide analysis. ▶ Tasks: ▶ Set objectives (e.g., increase sales) ▶ List deliverables (e.g., a report) ▶ Assess data status and costs ▶ Example: A café wants to know which drinks sell best to plan inventory. 24 / 199
  • 25. Step 2: Data Preparation ▶ Prepare data for analysis by: ▶ Identifying data sources (e.g., sales records) ▶ Cleaning and transforming data ▶ Dividing data into chunks ▶ Example: Organizing student survey data into tables for analysis of study habits. 25 / 199
  • 26. Step 3: Data Analysis ▶ Analyze data using: ▶ Descriptive statistics (e.g., mean, median) ▶ Correlation analysis ▶ Predictive models ▶ Example: Calculating average test scores and finding if study time correlates with grades. 26 / 199
  • 27. Step 4: Development and Representation ▶ Present results to stakeholders using: ▶ Graphs (e.g., histograms, scatter plots) ▶ Summary tables ▶ Maps or diagrams ▶ Goal: Make results clear for decision-making. ▶ Example: A bar chart showing top-selling café drinks for the manager. 27 / 199
  • 28. Example: EDA for a Fitness App ▶ Problem: Understand user activity patterns. ▶ Data Preparation: Collect step counts from fitness trackers. ▶ Data Analysis: Calculate average steps per day, find peak activity times. ▶ Representation: Create a line graph showing daily steps over a month. ▶ Insight: Users walk more on weekends, suggesting targeted promotions. 28 / 199
  • 29. Data Matters ▶ Data is everywhere: hospitals, universities, real estate, and more. ▶ Understanding data types helps analyze it correctly. ▶ Example: Hospitals store patient data to track health trends, like weight or age. ▶ Goal: Turn raw data into meaningful insights. 29 / 199
  • 30. Dataset ▶ A collection of observations about an object. ▶ Each observation has variables (features) describing it. ▶ Example: A hospital dataset includes patient details like: ▶ Patient ID ▶ Name ▶ Address ▶ Date of Birth ▶ Email ▶ Gender ▶ Weight 30 / 199
  • 31. Example: Hospital Patient Dataset ▶ Each row is an observation (a patient). ▶ Each column is a variable (e.g., Name, Weight). ▶ Example entry: ▶ PATIENT_ID: 002 ▶ Name: Yoshmi Mukhiya ▶ Address: Mannsverk 61, 5094, Bergen ▶ DOB: 10.07.2018 ▶ Email: yoshmimukhiya@gmail.com ▶ Gender: Female ▶ Weight: 10 31 / 199
  • 32. How Data is Stored ▶ Stored in database management systems as tables/schemas. ▶ Each table has rows (observations) and columns (variables). ▶ Example: A hospital patient table: Table 1: An example of a table for storing patient information ID Name Address DOB Email Gender Weight 001 Suresh Mukhiya Mannsverk 61 30.12.1989 skmu@hvl.no Male 68 002 Yoshmi Mukhiya Mannsverk 61, 5094, Bergen 10.07.2018 yoshmimukhiya@gmail.com Female 10 003 Anju Mukhiya Mannsverk 61, 5094, Bergen 10.12.1997 anjumukhiya@gmail.com Female 24 004 Asha Gaire Butwal, Nepal 30.11.1990 aasha.gaire@gmail.com Female 23 005 Ola Nordmann Danmark, Sweden 12.12.1789 ola@gmail.com Male 75 32 / 199
  • 33. Numerical Data ▶ Data involving measurements or quantities. ▶ Also called quantitative data in statistics. ▶ Examples: ▶ Age (e.g., 20 years) ▶ Height (e.g., 170 cm) ▶ Weight (e.g., 65 kg) ▶ Heart rate (e.g., 72 bpm) ▶ Number of family members (e.g., 4) ▶ Used in fields like medicine, sports, and research. 33 / 199
  • 34. Types of Numerical Data ▶ Numerical data is divided into two types: ▶ Discrete Data: Countable, fixed values. ▶ Continuous Data: Infinite values within a range. ▶ Understanding these types helps in data analysis. ▶ Example: Number of teeth (discrete) vs. body temperature (continuous). 34 / 199
  • 35. Discrete Data ▶ Data that is countable with a finite set of values. ▶ Represented by a discrete variable. ▶ Examples: ▶ Number of heads in 200 coin flips (0 to 200). ▶ Country (e.g., Nepal, India, Norway, Japan). ▶ Student rank in class (e.g., 1, 2, 3, 4). ▶ Number of cars in a parking lot (e.g., 25). ▶ Discrete data has distinct, separate values. 35 / 199
  • 36. Example: Discrete Data ▶ Scenario: Counting students in a classroom. ▶ Variable: Number of students present. ▶ Values: 20, 21, 22, ..., 30 (finite and countable). ▶ Analysis: Calculate the average attendance over a week. ▶ Visual: Bar chart showing daily student counts. 36 / 199
  • 37. Continuous Data ▶ Data with an infinite number of values within a range. ▶ Represented by a continuous variable. ▶ Examples: ▶ Temperature (e.g., 25.3°C, 25.31°C, 25.312°C). ▶ Weight (e.g., 65.2 kg, 65.25 kg). ▶ Height (e.g., 170.5 cm, 170.51 cm). ▶ Time to run 100 meters (e.g., 12.345 seconds). ▶ Continuous data can take any value in a range. 37 / 199
  • 38. Example: Continuous Data ▶ Scenario: Measuring student heights in a class. ▶ Variable: Height. ▶ Values: Any number between 150 cm and 190 cm (e.g., 165.7 cm). ▶ Analysis: Find the average height of students. ▶ Visual: Histogram showing height distribution. 38 / 199
  • 39. Discrete vs. Continuous Table 2: Discrete data vs. continuous data Discrete Data Continuous Data Definition Countable, fixed values Infinite values in a range Examples Number of students, rank Weight, temperature Variable Type Discrete variable Continuous variable Analysis Counts, frequencies Averages, ranges ▶ Example: Number of cars (discrete) vs. car speed (continuous). 39 / 199
  • 40. Example: Car Dataset ▶ Dataset: Cars with variables like: ▶ Number of seats (discrete: 2, 4, 5, 7). ▶ Weight (continuous: e.g., 1200.5 kg). ▶ Speed (continuous: e.g., 180.3 km/h). ▶ EDA Tasks: ▶ Count cars by number of seats (discrete). ▶ Calculate average weight (continuous). ▶ Plot speed distribution (continuous). 40 / 199
  • 41. What is Categorical Data? ▶ Represents characteristics or qualities of an object. ▶ Also called qualitative data in statistics. ▶ Examples: ▶ Gender (Male, Female, Other) ▶ Marital Status (Married, Single, Divorced) ▶ Movie Genres (Comedy, Drama, Action) ▶ Blood Type (A, B, AB, O) ▶ Drug Types (Stimulants, Opioids, Cannabis) ▶ Used in fields like medicine, marketing, and social sciences. 41 / 199
  • 42. Categorical Variables ▶ Variables that describe categorical data. ▶ Have a limited number of values (like enumerated types in computer science). ▶ Two main types: ▶ Binary (Dichotomous): Exactly two values. ▶ Polytomous: More than two values. ▶ Example: Gender (Male, Female) vs. Movie Genres (Action, Comedy, Drama, etc.). 42 / 199
  • 43. Binary Categorical Variables ▶ Take exactly two values (dichotomous). ▶ Examples: ▶ Experiment Result: Success or Failure ▶ Attendance: Present or Absent ▶ Light Switch: On or Off ▶ Easy to analyze due to only two options. ▶ Example: Checking if a student passed (Yes/No) an exam. 43 / 199
  • 44. Polytomous Categorical Variables ▶ Take more than two values. ▶ Examples: ▶ Marital Status: Married, Single, Divorced, Widowed, etc. ▶ Movie Genres: Action, Comedy, Drama, Horror, etc. ▶ Blood Type: A, B, AB, O ▶ Example: Surveying students’ favorite subjects (Math, Science, History, Art). 44 / 199
  • 45. Measurement Scales ▶ Four different types of measurement scales in statistics. ▶ These scales are used more in academic industries. 1. Nominal 2. Ordinal 3. Interval 4. Ratio 45 / 199
  • 46. What are Nominal Scales? ▶ Labels for categorical variables without quantitative value. ▶ Mutually exclusive and carry no numerical importance. ▶ Considered qualitative data in statistics. ▶ Examples: ▶ Gender: Male, Female, Non-binary, Other, Prefer not to answer ▶ Languages: English, Spanish, Hindi ▶ Biological Species: Archea, Bacteria, Eukarya 46 / 199
  • 47. Characteristics of Nominal Scales ▶ No order or ranking among categories. ▶ No arithmetic and comparison operations (e.g., addition, subtraction, multiplication, division, mean, greater than, less than) possible. ▶ Numbers as labels have no numerical meaning (e.g., "1 = Male" is just a label). ▶ Example: Labeling parts of speech (noun, verb, adjective) has no numerical value. 47 / 199
  • 48. Examples of Nominal Scales ▶ Common nominal variables: ▶ Gender: Male, Female, Non-binary, Other ▶ Country Languages: Norwegian, Japanese, Nepali ▶ Movie Genres: Comedy, Action, Drama ▶ Taxonomic Ranks: Archea, Bacteria, Eukarya ▶ Parts of Speech: Noun, Pronoun, Adjective ▶ Example: Survey asking students’ favorite food: Pizza, Sushi, Pasta. 48 / 199
  • 49. Analyzing Nominal Data ▶ Key methods: ▶ Frequency: Count how often a label appears. ▶ Proportion: Frequency divided by total events. ▶ Percentage: Proportion multiplied by 100. ▶ Example: In a class of 50 students: ▶ Gender: 25 Male, 20 Female, 5 Non-binary ▶ Proportion: Female = 20/50 = 0.4 ▶ Percentage: Female = 40% 49 / 199
  • 50. Visualizing Nominal Data ▶ Use pie charts or bar charts for nominal data. ▶ Not suitable for histograms (used for numerical data). ▶ Example: Bar chart showing student preferences: ▶ Favorite Food: Pizza (20), Sushi (15), Pasta (10) ▶ Visuals make frequencies and proportions clear to stakeholders. 50 / 199
  • 51. Example: Student Survey Dataset ▶ Dataset: Survey of 100 students with nominal variables: ▶ Gender: Male, Female, Non-binary, Other ▶ Favorite Subject: Math, Science, History ▶ Analysis: ▶ Frequency: 50 Male, 40 Female, 10 Transgender ▶ Proportion: Male = 50/100 = 0.5 ▶ Visualization: Pie chart of Gender distribution 51 / 199
  • 52. Nominal Scales in Real Life ▶ Used in surveys, classifications, and categorizations. ▶ Examples: ▶ Customer survey: Preferred brand (Nike, Adidas, Puma) ▶ Biology: Classifying species (Lion, Tiger, Leopard) ▶ Social media: Post category (Photo, Video, Text) ▶ Analysis: Count how many customers prefer each brand. 52 / 199
  • 53. What are Ordinal Scales? ▶ Categorical data with a significant order of values. ▶ Tip: "Ordinal" sounds like "order" (1st, 2nd, 3rd, etc.). ▶ Differs from nominal scales (no order, e.g., Gender). ▶ Examples: ▶ Satisfaction: Low, Medium, High ▶ Education Level: High School, Bachelor’s, Master’s ▶ Race Position: 1st, 2nd, 3rd 53 / 199
  • 54. Ordinal vs. Nominal Scales ▶ Nominal: No order (e.g., Blood Type: A, B, AB, O). ▶ Ordinal: Ordered categories (e.g., Satisfaction: Low, Medium, High). ▶ Key Difference: Order matters in ordinal scales. ▶ Example: ▶ Nominal: Favorite Color (Red, Blue, Green). ▶ Ordinal: Class Rank (1st, 2nd, 3rd). 54 / 199
  • 55. What is the Likert Scale? ▶ A type of ordinal scale with ordered response options. ▶ Used to measure opinions, attitudes, or feelings. ▶ Example Question: "WordPress is making content managers’ lives easier." ▶ Responses (5-point Likert Scale): ▶ Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree ▶ Visual: See next slide for diagram. 55 / 199
  • 56. Likert Scale Example Feelings ▶ Options: 1 - Very Unhappy, 2 - Unhappy, 3 - OK, 4 - Happy, 5 - Very Happy ▶ Order matters: 1 is less happy than 5. ▶ Example: A student rates their mood as "Happy" (4). Satisfaction ▶ Options: 1 - Very Unsatisfied, 2 - Somewhat Unsatisfied, 3 - Neutral, 4 - Somewhat Satisfied, 5 - Very Satisfied ▶ Example: A customer rates service as "Somewhat Satisfied" (4). 56 / 199
  • 57. Example: Student Survey Dataset ▶ Dataset: Survey of 50 students with ordinal variables: ▶ Effort Level: Low, Medium, High ▶ Course Difficulty: Easy, Moderate, Hard ▶ Analysis: ▶ Count: 20 Low, 20 Medium, 10 High Effort ▶ Median: Medium Effort ▶ Visualization: Bar chart of Effort Levels 57 / 199
  • 58. Visualizing Ordinal Data ▶ Use bar charts to show order and frequency. ▶ Avoid pie charts if order is critical (use for nominal data instead). ▶ Example: Bar chart of Likert responses: ▶ Strongly Agree: 10 ▶ Agree: 15 ▶ Neutral: 10 ▶ Disagree: 5 ▶ Strongly Disagree: 5 58 / 199
  • 59. Real-World Example: Customer Feedback ▶ Dataset: 100 customer reviews with: ▶ Satisfaction: 1 - Very Unsatisfied, 2 - Unsatisfied, 3 - Neutral, 4 - Satisfied, 5 - Very Satisfied ▶ Analysis: ▶ Median: Neutral (3) ▶ Frequency: 30 Satisfied, 25 Neutral, etc. ▶ Action: Improve service based on low satisfaction feedback. 59 / 199
  • 60. Why Learn Ordinal Scales? ▶ Order helps rank and compare data (e.g., satisfaction levels). ▶ Guides correct statistical measures (median, not mean). ▶ Essential for surveys and Likert scale analysis in EDA. ▶ Example: Ranking student performance (1st, 2nd, 3rd) for awards. 60 / 199
  • 61. What are Interval and Ratio Scales? ▶ Interval Scales: Order and exact differences between values matter. ▶ Ratio Scales: Include order, exact differences, and a true zero. ▶ Both extend beyond nominal and ordinal scales for advanced analysis. ▶ Example: Temperature (interval) vs. Height (ratio). 61 / 199
  • 62. Interval Scales ▶ Order and exact differences between values are significant. ▶ Used in statistics (e.g., mean, median, mode, standard deviation). ▶ No true zero (e.g., 0°C doesn’t mean no temperature). ▶ Examples: ▶ Temperature in Celsius (°C) ▶ Location in Cartesian coordinates (x, y) ▶ Direction in degrees from magnetic north ▶ Example: Difference between 20°C and 30°C is the same as 30°C and 40°C. 62 / 199
  • 63. Properties of Interval Scales ▶ Provides: Order, frequency, mode, median, mean. ▶ Can quantify differences between values. ▶ Can add or subtract values (e.g., 20°C - 10°C = 10°C). ▶ Cannot multiply/divide or use a true zero. ▶ Example: Average temperature of a week (e.g., mean = 25°C). 63 / 199
  • 64. Example: Interval Scale in Action ▶ Dataset: Daily temperatures (°C) for a week: ▶ Mon: 20°C, Tue: 22°C, Wed: 25°C, Thu: 23°C, Fri: 21°C, Sat: 24°C, Sun: 26°C ▶ Analysis: ▶ Mean: (20 + 22 + 25 + 23 + 21 + 24 + 26) / 7 = 23°C ▶ Difference: 25°C - 20°C = 5°C ▶ Visualization: Line graph of temperature trends. 64 / 199
  • 65. Ratio Scales ▶ Include order, exact differences, and a true zero (e.g., 0 kg = no mass). ▶ Enable advanced statistical analysis (e.g., mean, variance, ratios). ▶ Examples: ▶ Mass (e.g., 50 kg) ▶ Length (e.g., 2 meters) ▶ Duration (e.g., 5 seconds) ▶ Volume (e.g., 10 liters) ▶ Example: Height of students (0 cm = no height is meaningful). 65 / 199
  • 66. Properties of Ratio Scales ▶ Provides: Order, frequency, mode, median, mean, differences. ▶ Can quantify differences, add/subtract, multiply/divide values. ▶ Has a true zero, allowing ratios (e.g., 10 kg is twice 5 kg). ▶ Example: Average weight of a class (e.g., mean = 60 kg). 66 / 199
  • 67. Example: Ratio Scale in Action ▶ Dataset: Weights (kg) of 5 students: ▶ Student 1: 50 kg, Student 2: 55 kg, Student 3: 60 kg, Student 4: 65 kg, Student 5: 70 kg ▶ Analysis: ▶ Mean: (50 + 55 + 60 + 65 + 70) / 5 = 60 kg ▶ Ratio: 70 kg is 1.4 times 50 kg ▶ Visualization: Bar chart of weights. 67 / 199
  • 68. Comparison of All Scales Table 3: A summary of the data types and scale measures Provides: Nominal Ordinal Interval Ratio The "order" of values is known ✓ ✓ ✓ "Counts," aka "Frequency of Distribution" ✓ ✓ ✓ ✓ Mode ✓ ✓ ✓ ✓ Median ✓ ✓ ✓ Mean ✓ ✓ Can quantify the difference between each value ✓ ✓ Can add or subtract values ✓ ✓ Can multiple and divide values ✓ Has "true zero" ✓ ▶ Example: Gender (nominal) vs. Temperature (interval) vs. Weight (ratio). 68 / 199
  • 69. Real-World Example: Weather Data ▶ Interval: Temperature (°C) over a month. ▶ Mean: 22°C, Difference: 25°C - 20°C = 5°C ▶ Ratio: Rainfall (mm) in the same month. ▶ Mean: 50 mm, Ratio: 100 mm is twice 50 mm ▶ Use: Plan irrigation based on rainfall ratios. 69 / 199
  • 70. Why Learn Interval and Ratio Scales? ▶ Enable precise statistical analysis (e.g., mean, ratios). ▶ Critical for fields like science, engineering, and economics. ▶ Guide correct visualization and modeling choices. ▶ Example: Analyzing test scores (interval) or distances (ratio) in a project. 70 / 199
  • 71. Data Analysis Approaches ▶ Three key methods: ▶ Classical Data Analysis ▶ Exploratory Data Analysis (EDA) ▶ Bayesian Data Analysis ▶ Each has a unique process for handling data. ▶ Example: Analyzing student exam scores using different methods. 71 / 199
  • 72. Classical Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Model Development ▶ Data Analysis ▶ Results Communication ▶ Focus: Build a model first, then analyze data. ▶ Example: Predicting student grades with a pre-set linear model. 72 / 199
  • 73. Exploratory Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Data Analysis ▶ Model Development ▶ Results Communication ▶ Focus: Explore data (outliers, patterns) before modeling. ▶ No imposed models; emphasizes visualizations. ▶ Example: Plotting exam scores to spot trends before modeling. 73 / 199
  • 74. Bayesian Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Model Development ▶ Prior Distribution ▶ Data Analysis ▶ Results Communication ▶ Uses prior probability (belief before evidence). ▶ Example: Using past exam trends as prior to predict current scores. 74 / 199
  • 75. Key Differences ▶ Classical: Model first, then data analysis. ▶ EDA: Data exploration first, flexible modeling. ▶ Bayesian: Incorporates prior beliefs. ▶ Example: Classical fits a grade model directly; EDA checks score distribution first. 75 / 199
  • 76. Why Compare These Approaches? ▶ Choose the best method for your data and goals. ▶ EDA is great for initial exploration; Classical for structured analysis. ▶ Bayesian adds prior knowledge for accuracy. ▶ Example: Use EDA to explore customer feedback, then Bayesian for predictions. 76 / 199
  • 77. Example: Student Exam Scores ▶ Classical: Define problem (predict grades), collect scores, build a regression model, analyze, report. ▶ EDA: Collect scores, plot distribution (e.g., histogram), identify outliers, then model. ▶ Bayesian: Use last year’s grade trends as prior, update with new data, analyze. ▶ Outcome: Different insights based on approach. 77 / 199
  • 78. Real-World Example: Sales Data ▶ Classical: Build a sales forecast model, then analyze monthly data. ▶ EDA: Explore sales trends (e.g., bar chart), then model seasonal patterns. ▶ Bayesian: Use last year’s sales as prior, refine with current data. ▶ Goal: Optimize inventory based on insights. 78 / 199
  • 79. Software Tools for EDA? ▶ Facilitate data exploration, visualization, and analysis. ▶ Open-source tools are free and widely accessible. ▶ Help uncover patterns, outliers, and insights. ▶ Example: Analyzing student performance data to find trends. 79 / 199
  • 80. EDA Open-Source Tools ▶ Popular tools for EDA include: ▶ Python ▶ R Programming Language ▶ Weka ▶ KNIME ▶ Each offers unique features for data analysis. 80 / 199
  • 81. Python ▶ Open-source programming language. ▶ Widely used for data analysis, mining, and data science. ▶ Link: https://www.python.org/ ▶ Features: Libraries like pandas, matplotlib for EDA. ▶ Example: Plotting a histogram of exam scores using matplotlib. 81 / 199
  • 82. R Programming ▶ Open-source language for statistical computation. ▶ Strong in graphical data analysis. ▶ Link: https://www.r-project.org ▶ Features: Packages like ggplot2 for visualizations. ▶ Example: Creating a bar chart of sales data with ggplot2. 82 / 199
  • 83. Weka ▶ Open-source data mining package. ▶ Includes EDA tools and algorithms. ▶ Link: https://www.cs.waikato.ac.nz/ml/weka/ ▶ Features: Data visualization and data preprocessing tools. ▶ Example: Detecting outliers in a dataset of customer purchases. 83 / 199
  • 84. KNIME ▶ Open-source tool for data analysis. ▶ Based on Eclipse platform. ▶ Link: https://www.knime.com/ ▶ Features: Drag-and-drop interface for workflows. ▶ Example: Building a workflow to analyze social media engagement. 84 / 199
  • 85. EDA using Python Python ▶ Programming basics (variables, strings, data types) ▶ Conditionals, functions, sequences, collections, iterations ▶ File handling, object-oriented programming NumPy ▶ Create, copy, and divide arrays ▶ Perform operations on arrays ▶ Array selections, advanced indexing, multi-dimensional arrays ▶ Linear algebra and built-in functions 85 / 199
  • 86. EDA using Python pandas ▶ Create and understand DataFrame objects ▶ Subset and index data ▶ Arithmetic functions, mapping, index management ▶ Styling for visual analysis Matplotlib ▶ Load linear datasets ▶ Adjust axes, grids, labels, titles, legends ▶ Save plots SciPy ▶ Import the package ▶ Use statistical packages ▶ Perform descriptive statistics, inference, analysis 86 / 199
  • 87. Virtual Environment ▶ Essential for isolating Python projects. ▶ Steps: ▶ Install: pip install virtualenv ▶ Create: virtualenv Local_Version_Directory -p Python_System_Directory ▶ Example: virtualenv myenv -p /usr/bin/python3 ▶ Check: Activate and install packages (e.g., pandas). 87 / 199
  • 88. Reading/Writing to Files ▶ Basic file handling is key for data input/output. ▶ Example Code: # Reading/writing to files filename = "datamining.txt" file = open(filename, mode="r", encoding='utf-8') for line in file: lines = file.readlines() print(lines) file.close() ▶ Practice: Read a CSV file of exam scores. 88 / 199
  • 89. Error Handling ▶ Manage errors to ensure robust code. ▶ Example Code: # Handle invalid grade inputs try: val = int(input("Type a number between 47 and 100:")) except ValueError: print("You must type a number between 47 and 100!") else: if (val > 47) and (val <= 100): print("You typed: ", val) else: print("The value you typed is incorrect!") 89 / 199
  • 90. Object-Oriented Concepts ▶ Use classes and objects for structured code. ▶ Example Code: # Mental Health Diseases: Social Anxiety Disorder class Disease: def __init__(self, disease='Depression'): self.type = disease def getName(self): print("Mental Health Diseases: ", self.type) d1 = Disease('Social Anxiety Disorder') d1.getName() ▶ Example: Create a class for student data. 90 / 199
  • 91. NumPy # Create a 1D array using NumPy import numpy as np my1DArray = np.array([1, 8, 27, 64]) print(my1DArray) Output: [ 1 8 27 64] # Create a 2D array using NumPy import numpy as np my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]]) print(my2DArray) Output: [[ 1 2 3 4] [ 2 4 9 16] [ 4 8 18 32]] 91 / 199
  • 92. NumPy # Create and display a 3D array using NumPy import numpy as np my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [9, 10, 11, 12]]]) print(my3Darray) Output: [[[ 1 2 3 4] [ 5 6 7 8]] [[ 1 2 3 4] [ 9 10 11 12]]] # Display the memory addresses print(my1DArray.data, my2DArray.data, my3Darray.data) Output: <memory at 0x7f8b1c0a3e80> <memory at 0x7f8b1c0a3f40> <memory at 0x7f8b1c0a4040> 92 / 199
  • 93. NumPy # Display the shapes of 1D, 2D, and 3D NumPy arrays print(my1DArray.shape, my2DArray.shape, my3Darray.shape) Output: (4,) (3, 4) (2, 2, 4) # Display the data types of 1D, 2D, and 3D NumPy arrays print(my1DArray.dtype, my2DArray.dtype, my3Darray.dtype) Output: int64 int64 int64 # Display the strides of 1D, 2D, and 3D NumPy arrays '''The strides (32, 8) in my2DArray indicate that to move from one row to the next, 32 bytes are skipped, and to move from one column to the next, 8 bytes are skipped''' print(my1DArray.strides, my2DArray.strides, my3Darray.strides) Output: (8,) (32, 8) (64, 32, 8) 93 / 199
  • 94. NumPy # Create a 2D array filled with ones import numpy as np ones = np.ones((2,4)) print(ones) Output: [[1. 1. 1. 1.] [1. 1. 1. 1.] [1. 1. 1. 1.]] # Create a 3D array filled with zeros import numpy as np zeros = np.zeros((2,1,4), dtype=np.int16) print(zeros) Output: [[[0 0 0 0]] [[0 0 0 0]]] 94 / 199
  • 95. NumPy # Create a 2D array filled with random values import numpy as np random_array = np.random.random((2,2)) print(random_array) Output: array([[0.44768845, 0.96186535], [0.99402423, 0.88612299]]) # Create a 2D array with uninitialized values import numpy as np emptyArray = np.empty((3,2)) print(emptyArray) Output: [[9.86638798e-316 0.00000000e+000] [6.87990479e-310 6.87990488e-310] [6.87990477e-310 6.87990479e-310]] 95 / 199
  • 96. NumPy # Create a 2D array filled with a specific value import numpy as np fullArray = np.full((2,2), 7) print(fullArray) Output: [[7 7] [7 7]] # Create a 1D array with evenly spaced values import numpy as np evenSpacedArray = np.arange(10, 25, 5) print(evenSpacedArray) Output: [10 15 20] 96 / 199
  • 97. NumPy # Create a 1D array with evenly spaced values import numpy as np evenSpacedArray2 = np.linspace(0, 2, 9) print(evenSpacedArray2) Output: [0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ] ''' Create a NumPy array and save it to a file and load a NumPy array from a text file''' import numpy as np x = np.arange(0.0, 25.0, 1.0) np.savetxt('data.out', x, delimiter=',') z = np.loadtxt('data.out', unpack=True) print(z) Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.] 97 / 199
  • 98. NumPy # Load a NumPy array from a text file import numpy as np my_array2 = np.genfromtxt('data.out', skip_header=1, filling_values=-999) print(my_array2) Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.] # Display the number of dimensions of 1D, 2D, and 3D arrays print(my1DArray.ndim, my2DArray.ndim, my3Darray.ndim) Output: 1 2 3 # Display the total number of elements in 1D, 2D, and 3D arrays print(my1DArray.size, my2DArray.size, my3Darray.size) Output: 4 12 16 98 / 199
  • 99. NumPy # Print information about memory layout print(my1DArray.flags, my2DArray.flags, my3Darray.flags) Output: C_CONTIGUOUS : True F_CONTIGUOUS : True OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED: True WRITEBACKIFCOPY : False 99 / 199
  • 100. NumPy # Print the length of one array element in bytes print(my1DArray.itemsize, my2DArray.itemsize, my3Darray.itemsize) Output: 8 8 8 # Print the total consumed bytes by elements print(my1DArray.nbytes, my2DArray.nbytes, my3Darray.nbytes) Output: 32 96 128 # Sum along Numpy Axes np_array_2d = np.arange(0, 6).reshape([2,3]) print(np_array_2d, np.sum(np_array_2d, axis = 0), np.sum(np_array_2d, axis = 1)) Output: [[0 1 2] [3 4 5]] [3 5 7] [ 3 12] 100 / 199
  • 101. NumPy # Create a subset and slice an array using an index x = np.array([10, 20, 30, 40, 50]) # Select items at index 0 and 1 print(x[0:2]) Output: [10 20] # Select item at row 0 and 1 and column 1 from 2D array y = np.array([[ 1, 2, 3, 4], [ 9, 10, 11 ,12]]) print("m", y[0:2, 1]) print("n",y[0:2, 0:2]) print("l", y[0:2, 2:4]) Output: m = [ 2 10] n = [[ 1 2] [ 9 10]] l = [[ 3 4] [11 12]] 101 / 199
  • 102. NumPy # Specifying conditions biggerThan2 = (y >= 2) print(y[biggerThan2]) Output: [ 2 3 4 9 10 11 12] # Basic operations (+, -, *, /, %) x = np.array([[1, 2, 3], [2, 3, 4]]) y = np.array([[1, 4, 9], [2, 3, -2]]) # Add two array add = np.add(x, y) print(add) Output: [[ 2 6 12] [ 4 6 2]] 102 / 199
  • 103. NumPy # Subtract two array sub = np.subtract(x, y) print(sub) # Multiply two array mul = np.multiply(x, y) print(mul) # Divide x, y div = np.divide(x,y) print(div) # Calculated the remainder of x and y rem = np.remainder(x, y) print(rem) Output: [[ 0 -2 -6] [ 0 0 6]] [[ 1 8 27] [ 4 9 -8]] [[ 1. 0.5 0.33333333] [ 1. 1. -2. ]] [[0 2 3] [0 0 0]] 103 / 199
  • 104. NumPy # Boradcasting - Operate with arrays of different shapes # Rule 1: Two dimensions are operatable if they are equal # Create an array of two dimension A = np.ones((6, 8)) # Shape of A print(A.shape) print(A) Output: (6, 8) [[1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.]] 104 / 199
  • 105. NumPy # Create another array B = np.random.random((6,8)) # Shape of B print(B.shape) print(B) Output: (6, 8) [[0.06148782 0.10690907 0.92578537 0.29907577 0.42786516 0.01944468 0.14473416 0.30382709] [0.36209211 0.33220132 0.43412798 0.97707517 0.23210006 0.05892264 0.34311993 0.97168464] [0.34048395 0.06280427 0.78917397 0.50310127 0.36555426 0.27233463 0.60115097 0.77911552] [0.39724957 0.38369108 0.10517771 0.97519711 0.49966346 0.51715226 0.50031762 0.91470124] [0.7647788 0.37106634 0.17694871 0.90837723 0.1932456 0.20634914 0.29533289 0.66564862] [0.72985568 0.85682569 0.01275113 0.98932163 0.1776967 0.95006083 0.59139126 0.3131595 ]] 105 / 199
  • 106. NumPy # Sum of A and B, here the shape of both the matrix is same. print(A + B) Output: [[1.06148782 1.10690907 1.92578537 1.29907577 1.42786516 1.01944468 1.14473416 1.30382709] [1.36209211 1.33220132 1.43412798 1.97707517 1.23210006 1.05892264 1.34311993 1.97168464] [1.34048395 1.06280427 1.78917397 1.50310127 1.36555426 1.27233463 1.60115097 1.77911552] [1.39724957 1.38369108 1.10517771 1.97519711 1.49966346 1.51715226 1.50031762 1.91470124] [1.7647788 1.37106634 1.17694871 1.90837723 1.1932456 1.20634914 1.29533289 1.66564862] [1.72985568 1.85682569 1.01275113 1.98932163 1.1776967 1.95006083 1.59139126 1.3131595 ]] 106 / 199
  • 107. NumPy # Rule 2: Two dimensions are compatible when one of them is 1 # Initialize `x` x = np.ones((3,4)) print(x) print(x.shape) Output: [[1. 1. 1. 1.] [1. 1. 1. 1.] [1. 1. 1. 1.]] (3, 4) # Initialize `y` y = np.arange(4) print(y) print(y.shape) Output: [0 1 2 3] (4,) 107 / 199
  • 108. NumPy # Subtract `x` and `y` print(x - y) Output: [[ 1. 0. -1. -2.] [ 1. 0. -1. -2.] [ 1. 0. -1. -2.]] '''Rule 3: Arrays can be broadcast together if they are compatible in all dimensions.''' x = np.ones((2,3)) print("x:", x) Output: x: [[1. 1. 1.] [1. 1. 1.]] 108 / 199
  • 109. NumPy # Initialize 'y' y = np.random.random((2, 1, 3)) print("y:", y) Output: y: [[[0.91087436 0.74716299 0.8804711 ]] [[0.20148139 0.27853328 0.0647736 ]]] # Sum of 'x' and 'y' print("sum: ", x + y) Output: sum: [[[1.91087436 1.74716299 1.8804711 ] [1.91087436 1.74716299 1.8804711 ]] [[1.20148139 1.27853328 1.0647736 ] [1.20148139 1.27853328 1.0647736 ]]] 109 / 199
  • 110. pandas ▶ Open-source Python library for data manipulation and analysis ▶ Created by Wes McKinney in 2008 ▶ Widely used in data science, finance, and research ▶ GitHub: https://github.com/pandas-dev/pandas ▶ Key features: ▶ Data structures: Series and DataFrame ▶ Handling missing data ▶ Data filtering, grouping, and merging 110 / 199
  • 111. Usage of pandas ▶ Simplifies data manipulation tasks ▶ Integrates with other Python libraries (NumPy, Matplotlib, etc.) ▶ Handles large datasets efficiently ▶ Supports various data formats (CSV, Excel, SQL, etc.) ▶ Enables quick data exploration and visualization. 111 / 199
  • 112. pandas # Setting up Pandas environment import numpy as np import pandas as pd !pip install --upgrade pandas print("Pandas Version:", pd.__version__) Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Pandas Version: 2.3.1 # Customizing display dettings for data visibility pd.set_option('display.max_columns', 500) pd.set_option('display.max_rows', 500) 112 / 199
  • 113. pandas # Create a dataframe from a series series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23]) print(series) Output: 0 2 1 3 2 7 3 11 4 13 5 17 6 19 7 23 dtype: int64 113 / 199
  • 114. pandas # Create a dataframe from a series series_df = pd.DataFrame({ 'A': range(1, 5), 'B': pd.Timestamp('20190526'), 'C': pd.Series(5, index=list(range(4)), dtype='float64'), 'D': np.array([3] * 4, dtype='int64'), 'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder"]), 'F': 'Mental health', 'G': 'is challenging' }) display(series_df) Output: 114 / 199
  • 115. pandas # Create a dataframe for a dictionary dict_df = [{'A': 'Apple', 'B': 'Ball'}, {'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}] dict_df = pd.DataFrame(dict_df) print(dict_df) Output: A B C 0 Apple Ball NaN 1 Aeroplane Bat Cat # Create a dataframe from n-dimensional arrays sdf = {'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland', 'Buskerud'], 'ISO-Code':[1,2,3,4,5,6], 'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10, 14910.94], 'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo", "Hamar", "Lillehammer", "Drammen"]} print(pd.DataFrame(sdf)) 115 / 199
  • 116. pandas Output: County ISO-Code Area Administrative centre 0 Østfold 1 4180.69 Sarpsborg 1 Hordaland 2 4917.94 Oslo 2 Oslo 3 454.07 City of Oslo 3 Hedmark 4 27397.76 Hamar 4 Oppland 5 25192.10 Lillehammer 5 Buskerud 6 14910.94 Drammen # Different dataframe style 'plain', 'simple', 'github', 'grid', 'fancy_grid', 'pipe', 'orgtbl', 'jira', 'presto', 'pretty', 'psql', 'rst', 'mediawiki', 'moinmoin', 'youtrack', 'html', 'latex', 'latex_raw', 'latex_booktabs', 'textile' from tabulate import tabulate # displaying the DataFrame print(tabulate(sdf, headers = 'keys', tablefmt = 'psql')) 116 / 199
  • 117. pandas from tabulate import tabulate # Displaying the DataFrame print(tabulate(sdf, headers = 'keys', tablefmt = 'github')) print(tabulate(sdf, headers = 'keys', tablefmt = 'grid')) 117 / 199
  • 119. pandas # Load a dataset from an external source into a DataFrame keys = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship','ethnicity', 'gender', 'capital_gain', 'capital_loss', 'hours_per_week','country_of_origin','income'] df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=keys) print(df.head()) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 119 / 199
  • 120. pandas # Retrieve the first 10 records print(df.head(10)) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 5 37 Private 284582 ... United-States <=50K 6 49 Private 160187 ... Jamaica <=50K 7 52 Self-emp-not-inc 209642 ... United-States >50K 8 31 Private 45781 ... United-States >50K 9 42 Private 159449 ... United-States >50K 120 / 199
  • 121. pandas # Retrieve the last 5 records print(df.tail()) Output: age workclass fnlwgt ... country_of_origin income 32556 27 Private 257302 ... United-States <=50K 32557 40 Private 154374 ... United-States >50K 32558 58 Private 151910 ... United-States <=50K 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 121 / 199
  • 122. pandas # Retrieve the last 10 records print(df.tail(10)) Output: age workclass fnlwgt ... country_of_origin income 32551 32 Private 34066 ... United-States <=50K 32552 43 Private 84661 ... United-States <=50K 32553 32 Private 116138 ... Taiwan <=50K 32554 53 Private 321865 ... United-States >50K 32555 22 Private 310152 ... United-States <=50K 32556 27 Private 257302 ... United-States <=50K 32557 40 Private 154374 ... United-States >50K 32558 58 Private 151910 ... United-States <=50K 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 122 / 199
  • 123. pandas # Display the rows, columns, data types, and memory usage df.info() Output: <class 'pandas.core.frame.DataFrame'> RangeIndex: 32561 entries, 0 to 32560 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 32561 non-null int64 1 workclass 32561 non-null object 2 fnlwgt 32561 non-null int64 3 education 32561 non-null object 4 education_num 32561 non-null int64 ... 12 hours_per_week 32561 non-null int64 13 country_of_origin 32561 non-null object 14 income 32561 non-null object dtypes: int64(6), object(9) memory usage: 3.7+ MB 123 / 199
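A natural companion to info() during EDA is describe(), which summarizes each column; a minimal sketch on the same adult-census df:

# Count, mean, std, min, quartiles, and max for numeric columns
print(df.describe())

# Frequency summary for categorical columns such as workclass
print(df.describe(include='object'))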
  • 124. pandas # Selects a row print(df.iloc[10]) Output: age 37 workclass Private fnlwgt 280464 education Some-college education_num 10 marital_status Married-civ-spouse occupation Exec-managerial relationship Husband ethnicity Black gender Male capital_gain 0 capital_loss 0 hours_per_week 80 country_of_origin United-States income >50K Name: 10, dtype: object 124 / 199
  • 125. pandas # Selects the first 10 rows print(df.iloc[0:10]) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 5 37 Private 284582 ... United-States <=50K 6 49 Private 160187 ... Jamaica <=50K 7 52 Self-emp-not-inc 209642 ... United-States >50K 8 31 Private 45781 ... United-States >50K 9 42 Private 159449 ... United-States >50K 125 / 199
  • 126. pandas # Selects a range of rows df.iloc[10:15] Output: age workclass fnlwgt ... country_of_origin income 10 37 Private 280464 ... United-States >50K 11 30 State-gov 141297 ... India >50K 12 23 Private 122272 ... United-States <=50K 13 32 Private 205019 ... United-States <=50K 14 40 Private 121772 ... ? >50K 126 / 199
  • 127. pandas # Selects the last 2 rows print(df.iloc[-2:]) Output: age workclass fnlwgt ... country_of_origin income 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 127 / 199
  • 128. pandas # Selects every other row from the columns at positions 3 and 4 df.iloc[::2, 3:5].head() Output: education education_num 0 Bachelors 13 2 HS-grad 9 4 Bachelors 13 6 9th 5 8 Masters 14 128 / 199
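iloc is purely positional; its label-based counterpart loc uses index and column labels, with end-inclusive slices. A minimal sketch on the same df:

# Rows labelled 0 through 4 (inclusive) of two named columns
print(df.loc[0:4, ['education', 'income']])

# loc also combines naturally with boolean filters
print(df.loc[df['age'] > 60, ['age', 'hours_per_week', 'income']].head())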
  • 129. pandas # Seed NumPy's random number generator for reproducibility np.random.seed(24) # Build the DataFrame df = pd.DataFrame({'A': np.linspace(1, 10, 10)}) df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))], axis = 1) # DataFrame without any styling print(df) Output: A B C D E 0 1.0 1.329212 -0.770033 -0.316280 -0.990810 1 2.0 -1.070816 -1.438713 0.564417 0.295722 2 3.0 -1.626404 0.219565 0.678805 1.889273 3 4.0 0.961538 0.104011 -0.481165 0.850229 4 5.0 1.453425 1.057737 0.165562 0.515018 5 6.0 -1.336936 0.562861 1.392855 -0.063328 6 7.0 0.121668 1.207603 -0.002040 1.627796 7 8.0 0.354493 1.037528 -0.385684 0.519818 8 9.0 1.686583 -1.325963 1.428984 -2.089354 9 10.0 -0.129820 0.631523 -0.586538 0.290720 129 / 199
  • 130. pandas # Styling DataFrame using DataFrame.style property df.style.set_properties(**{'background-color': 'black', 'color': 'green'}) Output: 130 / 199
  • 131. pandas # Replace values at selected positions with NaN (Not a Number) df.iloc[0, 3] = np.nan df.iloc[2, 3] = np.nan df.iloc[4, 2] = np.nan df.iloc[7, 4] = np.nan print(df) Output: A B C D E 0 1.0 1.329212 -0.770033 NaN -0.990810 1 2.0 -1.070816 -1.438713 0.564417 0.295722 2 3.0 -1.626404 0.219565 NaN 1.889273 3 4.0 0.961538 0.104011 -0.481165 0.850229 4 5.0 1.453425 NaN 0.165562 0.515018 5 6.0 -1.336936 0.562861 1.392855 -0.063328 6 7.0 0.121668 1.207603 -0.002040 1.627796 7 8.0 0.354493 1.037528 -0.385684 NaN 8 9.0 1.686583 -1.325963 1.428984 -2.089354 9 10.0 -0.129820 0.631523 -0.586538 0.290720 131 / 199
  • 132. pandas # Highlight the NaN values in DataFrame df.style.highlight_null(color='red') Output: 132 / 199
  • 133. pandas # Highlight the Min values in each column df.style.highlight_min(axis = 0) Output: 133 / 199
  • 134. pandas # Highlight the Max values in each column df.style.highlight_max(axis = 0) Output: 134 / 199
  • 135. pandas # Highlight the Max values in each row df.style.highlight_max(axis = 1) Output: 135 / 199
  • 136. pandas # Set text color of positive values in DataFrames def color_positive_green(val): """ Takes a scalar and returns a string with the CSS property 'color: green' for positive values, 'color: black' otherwise. """ if val > 0: color = 'green' else: color = 'black' return 'color: %s' % color # Styler.applymap was renamed Styler.map in pandas 2.1; the old name still works with a warning df.style.applymap(color_positive_green) 136 / 199
  • 138. pandas # Import seaborn library import seaborn as sns # Declaring the color palette from seaborn cm = sns.light_palette("green", as_cmap=True) # DataFrame with background gradient and precision df.style.background_gradient(cmap=cm).format(precision = 2) 138 / 199
  • 140. pandas # Checking Missing Values in a DataFrame d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df, "\n") print(df.isnull()) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 False False True 1 False False False 2 True False False 3 False True False 140 / 199
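In practice the per-column count of missing values is often more useful than the full boolean mask; a minimal sketch on the same df:

# Number of NaNs in each column
print(df.isnull().sum())

# Total number of missing cells in the frame
print(df.isnull().sum().sum())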
  • 141. pandas # Filtering Data Based on Missing Values '''Download employees dataset https://media.geeksforgeeks.org/wp-content/uploads/employees.csv ''' import pandas as pd d = pd.read_csv("/content/employees.csv") bool_series = pd.isnull(d["Gender"]) missing_gender_data = d[bool_series] print(missing_gender_data.head()) Output: First Name Gender ... Senior Management Team 0 Lois NaN ... True Legal 22 Joshua NaN ... True Client Services 27 Scott NaN ... True Legal 31 Joyce NaN ... True Product 41 Christine NaN ... True Business Development 141 / 199
  • 142. pandas # Checking for Non-Missing Values import pandas as pd import numpy as np d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df) print(df.notnull()) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 True True False 1 True True True 2 False True True 3 True False True 142 / 199
  • 143. pandas # Filtering Data with Non-Missing Values import pandas as pd d = pd.read_csv("/content/employees.csv") nmg = pd.notnull(d["Gender"]) print(d[nmg].head()) Output: First Name Gender ... Senior Management Team 0 Douglas Male ... True Marketing 1 Thomas Male ... True NaN 2 Maria Female ... True Finance 3 Jerry Male ... True Finance 4 Larry Male ... True Client Services 143 / 199
  • 144. pandas # Filling Missing Values in Pandas d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df, "\n") print(df.fillna(0)) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 0.0 1 90.0 45.0 40.0 2 0.0 56.0 80.0 3 95.0 0.0 98.0 144 / 199
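Instead of a constant, numeric NaNs are commonly filled with a column statistic; a minimal sketch on the same all-numeric df:

# Fill each column's NaNs with that column's mean
print(df.fillna(df.mean()))

# The median is a more outlier-robust alternative
print(df.fillna(df.median()))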
  • 145. pandas # Fill with Previous Value (in pandas 2.x, df.ffill() is preferred over fillna(method='pad')) print(df, "\n") print(df.fillna(method='pad')) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 90.0 56.0 80.0 3 95.0 56.0 98.0 145 / 199
  • 146. pandas # Fill with Next Value (in pandas 2.x, df.bfill() is preferred over fillna(method='bfill')) print(df, "\n") print(df.fillna(method='bfill')) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 40.0 1 90.0 45.0 40.0 2 95.0 56.0 80.0 3 95.0 NaN 98.0 146 / 199
  • 147. pandas # Fill NaN Values with 'No Gender' (newer pandas prefers d["Gender"] = d["Gender"].fillna('No Gender') over inplace) d = pd.read_csv("/content/employees.csv") print(d[20:25]) d["Gender"].fillna('No Gender', inplace = True) print(d[20:25]) Output: First Name Gender ... Senior Management Team 20 Lois NaN ... True Legal 21 Matthew Male ... False Marketing 22 Joshua NaN ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services First Name Gender ... Senior Management Team 20 Lois No Gender ... True Legal 21 Matthew Male ... False Marketing 22 Joshua No Gender ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services 147 / 199
  • 148. pandas # Replace all NaN values with -99 value. d = pd.read_csv("/content/employees.csv") print(d[20:25]) d = d.replace(to_replace = np.nan, value = -99) print(d[20:25]) Output: First Name Gender ... Senior Management Team 20 Lois NaN ... True Legal 21 Matthew Male ... False Marketing 22 Joshua NaN ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services First Name Gender ... Senior Management Team 20 Lois -99 ... True Legal 21 Matthew Male ... False Marketing 22 Joshua -99 ... True Client Services 23 -99 Male ... -99 -99 24 John Male ... False Client Services 148 / 199
  • 149. pandas # Fills missing values using interpolation techniques ''' 'linear' - Linear interpolation between adjacent non-missing values. 'polynomial' - Polynomial interpolation (Order 2 for quadratic). 'nearest' - Fills with the nearest non-missing value. 'zero' - Fills with the previous non-missing value (piecewise constant). 'slinear' - Spline interpolation of order 1 (equivalent to linear in pandas). 'quadratic' - Polynomial interpolation of order 2. 'barycentric' - Barycentric interpolation for smooth approximations. ''' 149 / 199
  • 150. pandas df = pd.DataFrame({"A": [12, 4, 5, None, 1], "B": [None, 2, 54, 3, None], "C": [20, 16, None, 3, 8], "D": [14, 3, None, None, 6]}) print(df) #linear forward interpolation print(df.interpolate(method = 'linear', limit_direction ='forward')) Output: A B C D 0 12.0 NaN 20.0 14.0 1 4.0 2.0 16.0 3.0 2 5.0 54.0 NaN NaN 3 NaN 3.0 3.0 NaN 4 1.0 NaN 8.0 6.0 A B C D 0 12.0 NaN 20.0 14.0 1 4.0 2.0 16.0 3.0 2 5.0 54.0 9.5 4.0 3 3.0 3.0 3.0 5.0 4 1.0 3.0 8.0 6.0 150 / 199
  • 151. pandas # Polynomial interpolation with order 2 print(df.interpolate(method ='polynomial', order = 2)) Output: A B C D 0 12.000000 NaN 20.0 14.0 1 4.000000 2.0 16.0 3.0 2 5.000000 54.0 8.0 -2.0 3 4.578947 3.0 3.0 -1.0 4 1.000000 NaN 8.0 6.0 151 / 199
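Of the other methods listed above, 'nearest' is the easiest to picture: each NaN copies the value of its closest non-missing neighbour. A minimal sketch on the same df (pandas delegates this method to SciPy, so SciPy must be installed; NaNs at the edges may remain unfilled):

print(df.interpolate(method='nearest'))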
  • 152. pandas # Drop rows where all values are missing scores = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(scores) print(df, "\n") print(df.dropna(how = 'all')) Output: First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 NaN 1 NaN NaN NaN NaN 2 NaN 45.0 80.0 NaN 3 95.0 56.0 98.0 65.0 First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 NaN 2 NaN 45.0 80.0 NaN 3 95.0 56.0 98.0 65.0 152 / 199
  • 153. pandas # Remove columns that contain at least one missing value scores = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [60, 67, 68, 65]} df = pd.DataFrame(scores) print(df, "\n") print(df.dropna(axis=1)) Output: First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 60 1 NaN NaN NaN 67 2 NaN 45.0 80.0 68 3 95.0 56.0 98.0 65 Fourth Score 0 60 1 67 2 68 3 65 153 / 199
  • 154. pandas # Drop rows with missing values d = pd.read_csv("/content/employees.csv") nd = d.dropna(axis=0, how='any') print("Old data frame length:", len(d)) print("New data frame length:", len(nd)) print("Rows with at least one missing value:", (len(d) - len(nd))) Output: Old data frame length: 1000 New data frame length: 764 Rows with at least one missing value: 236 154 / 199
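dropna also accepts a thresh argument, keeping only rows with at least a given number of non-missing values; a minimal sketch on a small illustrative frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [2, 3, np.nan],
                   'C': [np.nan, np.nan, np.nan]})
# Keep rows holding at least two non-NaN values (only row 0 here)
print(df.dropna(thresh=2))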
  • 155. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd']} data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'], 'Age': [17, 14, 12, 52], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[4, 5, 6, 7]) print(df, "\n\n", df1) 155 / 199
  • 156. pandas Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd Name Age Address Qualification 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 156 / 199
  • 157. pandas # Concatenating DataFrame frames = [df, df1] res1 = pd.concat(frames) print(res1) Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 157 / 199
  • 158. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd'], 'Mobile No': [97, 91, 58, 76]} data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'], 'Age': [22, 32, 12, 52], 'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'], 'Salary': [1000, 2000, 3000, 4000]} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[2, 3, 6, 7]) print(df, "\n\n", df1) 158 / 199
  • 159. pandas Output: Name Age Address Qualification Mobile No 0 Jai 27 Nagpur Msc 97 1 Princi 24 Kanpur MA 91 2 Gaurav 22 Allahabad MCA 58 3 Anuj 32 Kannuaj Phd 76 Name Age Address Qualification Salary 2 Gaurav 22 Allahabad MCA 1000 3 Anuj 32 Kannuaj Phd 2000 6 Dhiraj 12 Allahabad Bcom 3000 7 Hitesh 52 Kannuaj B.hons 4000 159 / 199
  • 160. pandas # Inner Join res2 = pd.concat([df, df1], axis=1, join='inner') print(res2) Output: Name Age Address Qualification Mobile No Name 2 Gaurav 22 Allahabad MCA 58 Gaurav 3 Anuj 32 Kannuaj Phd 76 Anuj Age Address Qualification Salary 2 22 Allahabad MCA 1000 3 32 Kannuaj Phd 2000 160 / 199
  • 161. pandas # Outer Join res2 = pd.concat([df, df1], axis = 1, sort = False) print(res2) Output: Name Age Address Qualification Mobile No Name 0 Jai 27.0 Nagpur Msc 97.0 NaN 1 Princi 24.0 Kanpur MA 91.0 NaN 2 Gaurav 22.0 Allahabad MCA 58.0 Gaurav 3 Anuj 32.0 Kannuaj Phd 76.0 Anuj 6 NaN NaN NaN NaN NaN Dhiraj 7 NaN NaN NaN NaN NaN Hitesh Age Address Qualification Salary 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 22.0 Allahabad MCA 1000.0 3 32.0 Kannuaj Phd 2000.0 6 12.0 Allahabad Bcom 3000.0 7 52.0 Kannuaj B.hons 4000.0 161 / 199
  • 162. pandas # DataFrames by Ignoring Indexes res = pd.concat([df, df1], ignore_index=True) print(res) Output: Name Age Address Qualification Mobile No Salary 0 Jai 27 Nagpur Msc 97.0 NaN 1 Princi 24 Kanpur MA 91.0 NaN 2 Gaurav 22 Allahabad MCA 58.0 NaN 3 Anuj 32 Kannuaj Phd 76.0 NaN 4 Gaurav 22 Allahabad MCA NaN 1000.0 5 Anuj 32 Kannuaj Phd NaN 2000.0 6 Dhiraj 12 Allahabad Bcom NaN 3000.0 7 Hitesh 52 Kannuaj B.hons NaN 4000.0 162 / 199
  • 163. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd'], 'Mobile No': [97, 91, 58, 76]} data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'], 'Age': [22, 32, 12, 52], 'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'], 'Salary': [1000, 2000, 3000, 4000]} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[4, 5, 6, 7]) print(df, "\n\n", df1) 163 / 199
  • 164. pandas Output: Name Age Address Qualification Mobile No 0 Jai 27 Nagpur Msc 97 1 Princi 24 Kanpur MA 91 2 Gaurav 22 Allahabad MCA 58 3 Anuj 32 Kannuaj Phd 76 Name Age Address Qualification Salary 4 Gaurav 22 Allahabad MCA 1000 5 Anuj 32 Kannuaj Phd 2000 6 Dhiraj 12 Allahabad Bcom 3000 7 Hitesh 52 Kannuaj B.hons 4000 164 / 199
  • 165. pandas # Concatenating DataFrames with group keys (output shown for the df and df1 from the first concatenation example, before Mobile No and Salary were added) frames = [df, df1] res = pd.concat(frames, keys=['x', 'y']) print(res) Output: Name Age Address Qualification x 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd y 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 165 / 199
  • 166. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Msc', 'MA', 'MCA', 'Phd']} df = pd.DataFrame(data1,index=[0, 1, 2, 3]) s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary') print(df, "\n\n", s1) Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd 0 1000 1 2000 2 3000 3 4000 Name: Salary, dtype: int64 166 / 199
  • 167. pandas # Concatenating Mixed DataFrames and Series res = pd.concat([df, s1], axis = 1) print(res) Output: Name Age Address Qualification Salary 0 Jai 27 Nagpur Msc 1000 1 Princi 24 Kanpur MA 2000 2 Gaurav 22 Allahabad MCA 3000 3 Anuj 32 Kannuaj Phd 4000 167 / 199
  • 168. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n", df1) Output: key Name Age 0 K0 Jai 27 1 K1 Princi 24 2 K2 Gaurav 22 3 K3 Anuj 32 key Address Qualification 0 K0 Nagpur Btech 1 K1 Kanpur B.A 2 K2 Allahabad Bcom 3 K3 Kannuaj B.hons 168 / 199
  • 169. pandas # Merging DataFrames Using One Key res = pd.merge(df, df1, on = 'key') print(res) Output: key Name Age Address Qualification 0 K0 Jai 27 Nagpur Btech 1 K1 Princi 24 Kanpur B.A 2 K2 Gaurav 22 Allahabad Bcom 3 K3 Anuj 32 Kannuaj B.hons 169 / 199
  • 170. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K1', 'K0', 'K1'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K0', 'K0', 'K0'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n\n", df1) 170 / 199
  • 171. pandas Output: key key1 Name Age 0 K0 K0 Jai 27 1 K1 K1 Princi 24 2 K2 K0 Gaurav 22 3 K3 K1 Anuj 32 key key1 Address Qualification 0 K0 K0 Nagpur Btech 1 K1 K0 Kanpur B.A 2 K2 K0 Allahabad Bcom 3 K3 K0 Kannuaj B.hons 171 / 199
  • 172. pandas # Merging DataFrames Using Multiple Keys res1 = pd.merge(df, df1, on=['key', 'key1']) print(res1) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K2 K0 Gaurav 22 Allahabad Bcom 172 / 199
  • 173. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K1', 'K0', 'K1'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K0', 'K0', 'K0'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n\n", df1) 173 / 199
  • 174. pandas Output: key key1 Name Age 0 K0 K0 Jai 27 1 K1 K1 Princi 24 2 K2 K0 Gaurav 22 3 K3 K1 Anuj 32 key key1 Address Qualification 0 K0 K0 Nagpur Btech 1 K1 K0 Kanpur B.A 2 K2 K0 Allahabad Bcom 3 K3 K0 Kannuaj B.hons 174 / 199
  • 175. pandas # Left outer join res = pd.merge(df, df1, how = 'left', on = ['key', 'key1']) print(res) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K1 K1 Princi 24 NaN NaN 2 K2 K0 Gaurav 22 Allahabad Bcom 3 K3 K1 Anuj 32 NaN NaN 175 / 199
  • 176. pandas # Right outer join res1 = pd.merge(df, df1, how = 'right', on = ['key', 'key1']) print(res1) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27.0 Nagpur Btech 1 K1 K0 NaN NaN Kanpur B.A 2 K2 K0 Gaurav 22.0 Allahabad Bcom 3 K3 K0 NaN NaN Kannuaj B.hons 176 / 199
  • 177. pandas # Outer join res2 = pd.merge(df, df1, how='outer', on=['key', 'key1']) print(res2) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27.0 Nagpur Btech 1 K1 K0 NaN NaN Kanpur B.A 2 K1 K1 Princi 24.0 NaN NaN 3 K2 K0 Gaurav 22.0 Allahabad Bcom 4 K3 K0 NaN NaN Kannuaj B.hons 5 K3 K1 Anuj 32.0 NaN NaN 177 / 199
  • 178. pandas # Inner join res3 = pd.merge(df, df1, how = 'inner', on = ['key', 'key1']) print(res3) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K2 K0 Gaurav 22 Allahabad Bcom 178 / 199
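merge can also record where each row came from via its indicator argument; a minimal sketch on the same df and df1:

# '_merge' reports 'left_only', 'right_only', or 'both' per row
res4 = pd.merge(df, df1, how='outer', on=['key', 'key1'], indicator=True)
print(res4[['key', 'key1', 'Name', 'Address', '_merge']])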
  • 179. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32]} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3']) df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4']) print(df, "\n\n", df1) Output: Name Age K0 Jai 27 K1 Princi 24 K2 Gaurav 22 K3 Anuj 32 Address Qualification K0 Allahabad MCA K2 Kannuaj Phd K3 Allahabad Bcom K4 Kannuaj B.hons 179 / 199
  • 180. pandas # Merge DataFrames based on row indexes res = df.join(df1) print(res) Output: Name Age Address Qualification K0 Jai 27 Allahabad MCA K1 Princi 24 NaN NaN K2 Gaurav 22 Kannuaj Phd K3 Anuj 32 Allahabad Bcom 180 / 199
  • 181. pandas # Merge DataFrames based on row indexes res = df1.join(df) print(res) Output: Address Qualification Name Age K0 Allahabad MCA Jai 27.0 K2 Kannuaj Phd Gaurav 22.0 K3 Allahabad Bcom Anuj 32.0 K4 Kannuaj B.hons NaN NaN 181 / 199
  • 182. pandas # Outer Join res1 = df.join(df1, how='outer') print(res1) Output: Name Age Address Qualification K0 Jai 27.0 Allahabad MCA K1 Princi 24.0 NaN NaN K2 Gaurav 22.0 Kannuaj Phd K3 Anuj 32.0 Allahabad Bcom K4 NaN NaN Kannuaj B.hons 182 / 199
  • 183. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Key':['K0', 'K1', 'K2', 'K3']} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4']) print(df, "\n\n", df1) Output: Name Age Key 0 Jai 27 K0 1 Princi 24 K1 2 Gaurav 22 K2 3 Anuj 32 K3 Address Qualification K0 Allahabad MCA K2 Kannuaj Phd K3 Allahabad Bcom K4 Kannuaj B.hons 183 / 199
  • 184. pandas # Joining DataFrames Using "on" Argument res2 = df.join(df1, on ='Key') res2 Output: Name Age Key Address Qualification 0 Jai 27 K0 Allahabad MCA 1 Princi 24 K1 NaN NaN 2 Gaurav 22 K2 Kannuaj Phd 3 Anuj 32 K3 Allahabad Bcom 184 / 199
  • 185. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav'], 'Age':[27, 24, 22]} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kanpur'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1, index=pd.Index(['K0', 'K1', 'K2'], name='key')) index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'), ('K2', 'Y2'), ('K2', 'Y3')], names=['key', 'Y']) df1 = pd.DataFrame(data2, index= index) print(df, "\n\n", df1) 185 / 199
  • 186. pandas Output: Name Age key K0 Jai 27 K1 Princi 24 K2 Gaurav 22 Address Qualification key Y K0 Y0 Allahabad MCA K1 Y1 Kannuaj Phd K2 Y2 Allahabad Bcom Y3 Kanpur B.hons 186 / 199
  • 187. pandas # Joining DataFrames with Different Index Levels (Multi-Index) result = df.join(df1, how ='inner') print(result) Output: Name Age Address Qualification key Y K0 Y0 Jai 27 Allahabad MCA K1 Y1 Princi 24 Kannuaj Phd K2 Y2 Gaurav 22 Allahabad Bcom Y3 Gaurav 22 Kanpur B.hons 187 / 199
  • 188. SciPy ▶ SciPy is an open-source Python library for scientific and technical computing. ▶ Relies on NumPy, which provides efficient n-dimensional array manipulation. ▶ Covers areas like optimization, integration, interpolation, eigenvalue problems, and statistics. ▶ Essential for research, data analysis, and engineering projects. 188 / 199
  • 189. Usage of SciPy ▶ Scientific Computing: Solves differential equations and performs numerical integration. ▶ Statistics: Offers scipy.stats for hypothesis testing, probability distributions, and more. ▶ Optimization: Includes tools for linear programming and nonlinear optimization. ▶ Signal Processing: Provides functions for Fourier transforms and filtering. ▶ Preparation: Install via pip install scipy and explore documentation. 189 / 199
  • 190. SciPy from scipy import stats data = [1.5, 2.3, 3.1, 4.2, 5.0] mean = stats.tmean(data) std_dev = stats.tstd(data) print(f"Mean: {mean}, Standard Deviation: {std_dev}") Output: Mean: 3.22, Standard Deviation: 1.4096098751072936 from scipy import integrate import numpy as np f = lambda x: x**2 result, error = integrate.quad(f, 0, 1) print(f"Integral from 0 to 1: {result} (Error: {error})") Output: Integral from 0 to 1: 0.33333333333333337 (Error: 3.700743415417189e-15) 190 / 199
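As a minimal sketch of the optimization support mentioned earlier (the quadratic is chosen purely for illustration), scipy.optimize can locate a function's minimum numerically:

from scipy import optimize

# f(x) = (x - 3)^2 + 1 has its minimum at x = 3, where f(3) = 1
res = optimize.minimize_scalar(lambda x: (x - 3)**2 + 1)
print(f"Minimum at x = {res.x:.4f}, f(x) = {res.fun:.4f}")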
  • 191. Matplotlib ▶ A comprehensive Python library for creating static, animated, and interactive visualizations. ▶ Offers a wide range of customizable plots (line, bar, scatter, etc.) and backends. ▶ Applications: Used in professional reporting, interactive dashboards, web/GUI applications, and embedded views. 191 / 199
  • 192. Usage of Matplotlib ▶ Reporting: Generate quality figures for research articles. ▶ Interactive Tools: Create dynamic plots with widgets for data exploration. ▶ Dashboards: Build complex visualizations for real-time data monitoring. ▶ Web Integration: Embed plots in web applications using backends like WebAgg. ▶ Preparation: Install via pip install matplotlib and explore documentation. 192 / 199
  • 193. Matplotlib # Plot sine, cosine and tangent waves import matplotlib.pyplot as plt import numpy as np x = np.linspace(0, 10, 100) plt.plot(x, np.sin(x), label='Sine', color='blue', linewidth=2) plt.plot(x, np.cos(x), label='Cosine', color='red', linestyle='--') plt.plot(x, np.tan(x), label='Tangent', color='green', linestyle='-.') # tan(x) diverges near odd multiples of pi/2; clipping with plt.ylim(-3, 3) keeps the curves readable plt.title("Sine, Cosine and Tangent Waves") plt.xlabel("X-axis") plt.ylabel("Y-axis") plt.legend() plt.grid(True) plt.show() 193 / 199
  • 195. Matplotlib # Plot sample bar and line chart import matplotlib.pyplot as plt categories = ['A', 'B', 'C'] values = [10, 20, 15] plt.bar(categories, values, color='green') plt.plot(categories, values, color='red', linewidth=2) plt.title("Sample Bar and Line Chart") plt.ylabel("Values") plt.show() 195 / 199
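For EDA itself, a histogram is usually the first plot drawn to inspect a variable's distribution; a minimal sketch on a seeded synthetic sample:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # reproducible synthetic data
data = rng.normal(loc=50, scale=10, size=1000)

plt.hist(data, bins=30, color='steelblue', edgecolor='black')
plt.title("Histogram of a Synthetic Variable")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()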
  • 197. Summary This lecture: ▶ Presents the basics of Exploratory Data Analysis (EDA) and its significance. ▶ Describes measurement scales, data types, and data analysis methodologies. ▶ Highlights the steps involved in EDA, including gathering data, cleaning it, visualizing it, and developing hypotheses. ▶ Contrasts Bayesian, exploratory, and classical analysis techniques. ▶ Demonstrates EDA tools and Python libraries (NumPy, pandas, SciPy, and Matplotlib). 197 / 199
  • 198. References I TEXTBOOK [1] Mukhiya, S. K., & Ahmed, U. (2020). Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. Packt Publishing Ltd. REFERENCE BOOKS [1] Pearson, R. K. (2020). Exploratory Data Analysis Using R (1st ed.). CRC Press. [2] Datar, R., & Garg, H. (2019). Hands-on exploratory data analysis with R: Become an expert in exploratory data analysis using R packages. Packt Publishing Ltd. 198 / 199
  • 199. References II ONLINE RESOURCES [1] Python Pool. (2021, June 14). Numpy Axis in Python with detailed examples. Python Pool. https://www.pythonpool.com/numpy-axis/ [2] GeeksforGeeks. (2025, July 28). Working with Missing Data in Pandas. GeeksforGeeks. https://www.geeksforgeeks.org/data-analysis/working-with-missing-data-in-pandas/ [3] GeeksforGeeks. (2025, July 26). Python | Pandas merging, joining and concatenating. GeeksforGeeks. https://www.geeksforgeeks.org/python/python-pandas-merging-joining-and-concatenating/ 199 / 199