Exploratory Data Analysis Fundamentals
Ashutosh Satapathy, Ph.D.
Asst. Prof., Department of CSE,
Siddhartha Academy of Higher Education
(Deemed to be University)
Vijayawada - 520007
1 / 199
Outline
Introduction to EDA
Data
From Data to Knowledge
Exploratory Data Analysis
Understanding Data Science
Data Science
Phases of Data Science
The Significance of EDA
Why is EDA Significant?
The Role of EDA
Steps in EDA
Example: EDA for a Fitness App
Making Sense of Data
Data Matters
Dataset
Data Storage
Numerical Data
Categorical Data
Measurement Scales
Data Analysis Approaches
Classical Data Analysis
Exploratory Data Analysis
Bayesian Data Analysis
Key Differences
Examples
Software Tools for EDA
Python
R Programming
Weka
KNIME
EDA using Python
NumPy
pandas
SciPy
Matplotlib
Summary
References
2 / 199
What is Data?
▶ A collection of discrete objects, numbers, words, events, facts,
measurements, observations, or descriptions.
▶ Generated by processes in various disciplines:
▶ Biology: Genetic sequences, protein structures
▶ Economics: Market trends, GDP data
▶ Engineering: Sensor readings, performance metrics
▶ Marketing: Customer preferences, sales data
▶ Example: A dataset of customer purchases in a retail store
includes product IDs, purchase dates, and amounts spent.
3 / 199
From Data to Knowledge
▶ Data: Raw facts and figures (e.g., sales numbers: $500,
$300, $700).
▶ Information: Processed data with context (e.g., average sales
per day: $500).
▶ Knowledge: Insights derived from information (e.g., sales
peak on weekends).
▶ Goal: Transform raw data into actionable knowledge.
▶ Example: Analyzing website traffic data to identify peak
visiting hours, leading to optimized ad schedules.
4 / 199
What is Exploratory Data Analysis (EDA)?
▶ A process to examine datasets and uncover:
▶ Patterns
▶ Anomalies
▶ Hypotheses
▶ Assumptions
▶ Uses statistical measures and visualizations.
▶ Performed before formal modeling or hypothesis testing.
▶ Example: Plotting sales data to spot seasonal trends or
outliers (e.g., a sudden spike in sales due to a promotion).
5 / 199
Why EDA?
▶ Helps statisticians understand data characteristics.
▶ Uncovers hidden insights before formal modeling.
▶ Guides hypothesis generation and data collection strategies.
▶ Prevents incorrect assumptions in modeling.
▶ Example:
1. In a medical study, EDA reveals missing values in patient
records, prompting data cleaning before analysis.
2. EDA on patient data reveals inconsistent heart rate readings,
prompting sensor recalibration.
6 / 199
Key Steps in EDA
1. Data Collection: Gather raw data (e.g., sensor readings from
a manufacturing plant).
2. Data Cleaning: Handle missing values, outliers (e.g.,
removing erroneous temperature readings).
3. Descriptive Statistics: Compute mean, median, variance
(e.g., average production rate).
4. Visualization: Create plots (histograms, scatter plots) to
identify trends.
5. Hypothesis Generation: Formulate questions based on
patterns (e.g., does production rate vary by shift?); see the
sketch below.
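▶ A minimal pandas sketch of these five steps on a made-up production log (column names and values are illustrative):
import pandas as pd

# 1-2. Collect and clean: a toy production log with one missing reading
df = pd.DataFrame({
    "shift": ["day", "day", "night", "night", "day"],
    "units": [120, 115, 98, None, 130],
})
df = df.dropna(subset=["units"])   # drop the erroneous reading

# 3. Descriptive statistics: mean, quartiles, spread
print(df["units"].describe())

# 4-5. Compare shifts to generate a hypothesis
print(df.groupby("shift")["units"].mean())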
7 / 199
Example: EDA in Retail Sales
▶ Dataset: Daily sales data for a clothing store over one year.
▶ Steps:
▶ Check for missing sales entries.
▶ Calculate average sales per month.
▶ Plot sales trends using a line graph.
▶ Identify outliers (e.g., Black Friday sales spike).
▶ Insight: Sales peak during holiday seasons, suggesting
increased inventory in November–December.
8 / 199
Tools for EDA
▶ Programming Languages: Python (pandas, matplotlib), R
(ggplot2).
▶ Software: Excel, Tableau, Power BI.
▶ Visualization Techniques: Histograms, box plots, scatter
plots, heatmaps.
▶ Example: Using Python to create a box plot of customer
spending to detect high-spending outliers.
9 / 199
Benefits of EDA
▶ Uncovers hidden patterns (e.g., customer churn trends).
▶ Detects data quality issues (e.g., duplicate entries).
▶ Informs better data collection strategies.
▶ Supports development of robust models.
▶ Example: EDA on weather data reveals inconsistent sensor
readings, leading to sensor recalibration.
10 / 199
What is Data Science?
▶ A cross-disciplinary field combining:
▶ Computer Science
▶ Statistics
▶ Mathematics
▶ Domain Knowledge
▶ Involves building models and extracting insights for business
intelligence.
▶ No Ph.D. required—practical skills are key.
▶ Example: Predicting customer churn using purchase history
and machine learning.
11 / 199
Phases of Data Science
▶ These phases are similar to Cross-Industry Standard Process
for Data Mining (CRISP-DM) framework:
1. Data Requirements
2. Data Collection
3. Data Processing
4. Data Cleaning
5. Exploratory Data Analysis (EDA)
6. Modeling and Algorithms
7. Data Product
8. Communication
▶ Each phase builds toward actionable insights.
12 / 199
Data Requirements
▶ Identify and categorize data needed for analysis:
▶ Numerical (e.g., heart rate)
▶ Categorical (e.g., patient gender)
▶ Define storage and dissemination formats.
▶ Example: For a dementia study, collect sleep patterns, heart
rate, and activity data from sensors to assess mental state.
13 / 199
Data Collection
▶ Gather data from various sources (sensors, databases, APIs).
▶ Ensure proper storage and transfer to IT systems.
▶ Example: Collecting customer feedback from surveys and
social media to analyze sentiment.
14 / 199
Data Processing
▶ Pre-curate data before analysis.
▶ Tasks: Exporting, structuring, and formatting data into tables.
▶ Example: Converting raw sensor data into a structured CSV
file with columns for time, value, and sensor ID.
15 / 199
Data Cleaning
▶ Address incompleteness, duplicates, errors, and missing values.
▶ Techniques: Record matching, outlier detection, filling missing
values.
▶ Example: Removing duplicate customer entries in a sales
dataset and imputing missing purchase amounts using the
median (see the sketch below).
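▶ A small sketch of both techniques on a hypothetical sales table (names and values are illustrative):
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 100.0, None, 250.0],
})
sales = sales.drop_duplicates()   # remove exact duplicate rows
# impute the missing amount with the column median
sales["amount"] = sales["amount"].fillna(sales["amount"].median())
print(sales)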
16 / 199
Exploratory Data Analysis (EDA)
▶ Core stage to uncover patterns, anomalies, and hypotheses.
▶ Uses descriptive statistics and visualizations.
▶ May involve data transformation techniques.
▶ Example: Plotting sales data to identify seasonal trends or
outliers (e.g., a spike during a holiday sale).
17 / 199
Modeling and Algorithms
▶ Build models to represent relationships between variables.
▶ Judd model: Data = Model + Error.
▶ Example: Linear regression model for pen purchases:
Total = UnitPrice × Quantity
where Total is the dependent variable, Quantity is the
independent variable, and UnitPrice is the coefficient the
model estimates (a fitting sketch follows).
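▶ A minimal sketch of recovering the coefficient with a NumPy least-squares fit; the data values are made up:
import numpy as np

quantity = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
unit_price = 2.5
total = unit_price * quantity + np.random.normal(0, 0.1, 5)  # Data = Model + Error

# Least-squares fit of Total = beta * Quantity (no intercept)
beta = np.linalg.lstsq(quantity.reshape(-1, 1), total, rcond=None)[0][0]
print(f"estimated unit price: {beta:.2f}")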
18 / 199
Data Product
▶ Software that uses data inputs to produce outputs and
feedback.
▶ Based on models from analysis.
▶ Example: A recommendation system suggesting products
based on user purchase history.
19 / 199
Communication
▶ Share results with stakeholders via visualizations (tables,
charts, diagrams).
▶ Drives business intelligence and decision-making.
▶ Example: A bar chart showing monthly sales trends to guide
inventory planning.
20 / 199
Why is EDA Significant?
▶ Data is collected in fields like science, economics, engineering,
and marketing.
▶ Large datasets are stored in electronic databases, making
manual analysis impossible.
▶ EDA is the first step in data mining to:
▶ Visualize data
▶ Understand patterns
▶ Create hypotheses
▶ Example: A store collects sales data to find which products
sell best during holidays.
21 / 199
The Role of EDA
▶ Reveals insights without assumptions (ground truth).
▶ Helps data scientists decide on models and hypotheses.
▶ Key components:
▶ Summarizing data (e.g., averages, totals)
▶ Statistical analysis (e.g., correlations)
▶ Visualization (e.g., graphs, charts)
▶ Example: Plotting student grades to spot trends, like higher
scores in math vs. history.
22 / 199
Steps in EDA
▶ EDA involves four key steps:
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of Results
▶ Each step builds toward clear, actionable insights.
23 / 199
Step 1: Problem Definition
▶ Define the business problem to guide analysis.
▶ Tasks:
▶ Set objectives (e.g., increase sales)
▶ List deliverables (e.g., a report)
▶ Assess data status and costs
▶ Example: A café wants to know which drinks sell best to plan
inventory.
24 / 199
Step 2: Data Preparation
▶ Prepare data for analysis by:
▶ Identifying data sources (e.g., sales records)
▶ Cleaning and transforming data
▶ Dividing data into chunks
▶ Example: Organizing student survey data into tables for
analysis of study habits.
25 / 199
Step 3: Data Analysis
▶ Analyze data using:
▶ Descriptive statistics (e.g., mean, median)
▶ Correlation analysis
▶ Predictive models
▶ Example: Calculating average test scores and finding if study
time correlates with grades (see the sketch below).
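▶ A sketch of that analysis with pandas; the survey numbers are invented:
import pandas as pd

survey = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10],
    "score": [55, 62, 70, 78, 85],
})
print(survey["score"].mean())                        # descriptive statistic
print(survey["study_hours"].corr(survey["score"]))   # Pearson correlation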
26 / 199
Step 4: Development and Representation
▶ Present results to stakeholders using:
▶ Graphs (e.g., histograms, scatter plots)
▶ Summary tables
▶ Maps or diagrams
▶ Goal: Make results clear for decision-making.
▶ Example: A bar chart showing top-selling café drinks for the
manager.
27 / 199
Example: EDA for a Fitness App
▶ Problem: Understand user activity patterns.
▶ Data Preparation: Collect step counts from fitness trackers.
▶ Data Analysis: Calculate average steps per day, find peak
activity times.
▶ Representation: Create a line graph showing daily steps over
a month.
▶ Insight: Users walk more on weekends, suggesting targeted
promotions (the pipeline is sketched below).
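▶ A compact sketch of this pipeline, assuming a hypothetical steps-per-day series with a weekend boost:
import pandas as pd
import matplotlib.pyplot as plt

days = pd.date_range("2024-01-01", periods=30, freq="D")
steps = pd.Series([6000 + 2500 * (d.dayofweek >= 5) for d in days], index=days)

print("average steps/day:", steps.mean())
steps.plot(title="Daily steps over a month")   # line graph
plt.ylabel("steps")
plt.show()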
28 / 199
Data Matters
▶ Data is everywhere: hospitals, universities, real estate, and
more.
▶ Understanding data types helps analyze it correctly.
▶ Example: Hospitals store patient data to track health trends,
like weight or age.
▶ Goal: Turn raw data into meaningful insights.
29 / 199
Dataset
▶ A collection of observations about an object.
▶ Each observation has variables (features) describing it.
▶ Example: A hospital dataset includes patient details like:
▶ Patient ID
▶ Name
▶ Address
▶ Date of Birth
▶ Email
▶ Gender
▶ Weight
30 / 199
Example: Hospital Patient Dataset
▶ Each row is an observation (a patient).
▶ Each column is a variable (e.g., Name, Weight).
▶ Example entry:
▶ PATIENT_ID: 002
▶ Name: Yoshmi Mukhiya
▶ Address: Mannsverk 61, 5094, Bergen
▶ DOB: 10.07.2018
▶ Email: yoshmimukhiya@gmail.com
▶ Gender: Female
▶ Weight: 10
31 / 199
How Data is Stored
▶ Stored in database management systems as tables/schemas.
▶ Each table has rows (observations) and columns (variables).
▶ Example: A hospital patient table:
Table 1: An example of a table for storing patient information
ID Name Address DOB Email Gender Weight
001 Suresh Mukhiya Mannsverk 61 30.12.1989 skmu@hvl.no Male 68
002 Yoshmi Mukhiya Mannsverk 61, 5094, Bergen 10.07.2018 yoshmimukhiya@gmail.com Female 10
003 Anju Mukhiya Mannsverk 61, 5094, Bergen 10.12.1997 anjumukhiya@gmail.com Female 24
004 Asha Gaire Butwal, Nepal 30.11.1990 aasha.gaire@gmail.com Female 23
005 Ola Nordmann Danmark, Sweden 12.12.1789 ola@gmail.com Male 75
32 / 199
Numerical Data
▶ Data involving measurements or quantities.
▶ Also called quantitative data in statistics.
▶ Examples:
▶ Age (e.g., 20 years)
▶ Height (e.g., 170 cm)
▶ Weight (e.g., 65 kg)
▶ Heart rate (e.g., 72 bpm)
▶ Number of family members (e.g., 4)
▶ Used in fields like medicine, sports, and research.
33 / 199
Types of Numerical Data
▶ Numerical data is divided into two types:
▶ Discrete Data: Countable, fixed values.
▶ Continuous Data: Infinite values within a range.
▶ Understanding these types helps in data analysis.
▶ Example: Number of teeth (discrete) vs. body temperature
(continuous).
34 / 199
Discrete Data
▶ Data that is countable with a finite set of values.
▶ Represented by a discrete variable.
▶ Examples:
▶ Number of heads in 200 coin flips (0 to 200).
▶ Number of goals scored in a match (e.g., 0, 1, 2, 3).
▶ Student rank in class (e.g., 1, 2, 3, 4).
▶ Number of cars in a parking lot (e.g., 25).
▶ Discrete data has distinct, separate values.
35 / 199
Example: Discrete Data
▶ Scenario: Counting students in a classroom.
▶ Variable: Number of students present.
▶ Values: 20, 21, 22, ..., 30 (finite and countable).
▶ Analysis: Calculate the average attendance over a week.
▶ Visual: Bar chart showing daily student counts.
36 / 199
Continuous Data
▶ Data with an infinite number of values within a range.
▶ Represented by a continuous variable.
▶ Examples:
▶ Temperature (e.g., 25.3°C, 25.31°C, 25.312°C).
▶ Weight (e.g., 65.2 kg, 65.25 kg).
▶ Height (e.g., 170.5 cm, 170.51 cm).
▶ Time to run 100 meters (e.g., 12.345 seconds).
▶ Continuous data can take any value in a range.
37 / 199
Example: Continuous Data
▶ Scenario: Measuring student heights in a class.
▶ Variable: Height.
▶ Values: Any number between 150 cm and 190 cm (e.g., 165.7
cm).
▶ Analysis: Find the average height of students.
▶ Visual: Histogram showing height distribution.
38 / 199
Discrete vs. Continuous
Table 2: Discrete data vs. continuous data
Discrete Data Continuous Data
Definition Countable, fixed values Infinite values in a range
Examples Number of students, rank Weight, temperature
Variable Type Discrete variable Continuous variable
Analysis Counts, frequencies Averages, ranges
▶ Example: Number of cars (discrete) vs. car speed
(continuous).
39 / 199
Example: Car Dataset
▶ Dataset: Cars with variables like:
▶ Number of seats (discrete: 2, 4, 5, 7).
▶ Weight (continuous: e.g., 1200.5 kg).
▶ Speed (continuous: e.g., 180.3 km/h).
▶ EDA Tasks (sketched below):
▶ Count cars by number of seats (discrete).
▶ Calculate average weight (continuous).
▶ Plot speed distribution (continuous).
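▶ A pandas sketch of these three tasks on made-up car data:
import pandas as pd
import matplotlib.pyplot as plt

cars = pd.DataFrame({
    "seats": [2, 4, 5, 5, 7, 4],
    "weight_kg": [1100.5, 1250.0, 1320.7, 1298.3, 1850.2, 1275.9],
    "speed_kmh": [220.0, 185.5, 178.2, 181.0, 160.4, 190.3],
})

print(cars["seats"].value_counts())   # counts per discrete value
print(cars["weight_kg"].mean())       # average of a continuous variable
cars["speed_kmh"].plot(kind="hist", title="Speed distribution")
plt.show()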
40 / 199
What is Categorical Data?
▶ Represents characteristics or qualities of an object.
▶ Also called qualitative data in statistics.
▶ Examples:
▶ Gender (Male, Female, Other)
▶ Marital Status (Married, Single, Divorced)
▶ Movie Genres (Comedy, Drama, Action)
▶ Blood Type (A, B, AB, O)
▶ Drug Types (Stimulants, Opioids, Cannabis)
▶ Used in fields like medicine, marketing, and social sciences.
41 / 199
Categorical Variables
▶ Variables that describe categorical data.
▶ Have a limited number of values (like enumerated types in
computer science).
▶ Two main types:
▶ Binary (Dichotomous): Exactly two values.
▶ Polytomous: More than two values.
▶ Example: Gender (Male, Female) vs. Movie Genres (Action,
Comedy, Drama, etc.).
42 / 199
Binary Categorical Variables
▶ Take exactly two values (dichotomous).
▶ Examples:
▶ Experiment Result: Success or Failure
▶ Attendance: Present or Absent
▶ Light Switch: On or Off
▶ Easy to analyze due to only two options.
▶ Example: Checking if a student passed (Yes/No) an exam.
43 / 199
Polytomous Categorical Variables
▶ Take more than two values.
▶ Examples:
▶ Marital Status: Married, Single, Divorced, Widowed, etc.
▶ Movie Genres: Action, Comedy, Drama, Horror, etc.
▶ Blood Type: A, B, AB, O
▶ Example: Surveying students’ favorite subjects (Math,
Science, History, Art).
44 / 199
Measurement Scales
▶ Four different types of measurement scales in statistics.
▶ These scales are used mostly in academic and research settings.
1. Nominal
2. Ordinal
3. Interval
4. Ratio
45 / 199
What are Nominal Scales?
▶ Labels for categorical variables without quantitative value.
▶ Mutually exclusive and carry no numerical importance.
▶ Considered qualitative data in statistics.
▶ Examples:
▶ Gender: Male, Female, Non-binary, Other, Prefer not to answer
▶ Languages: English, Spanish, Hindi
▶ Biological Domains: Archaea, Bacteria, Eukarya
46 / 199
Characteristics of Nominal Scales
▶ No order or ranking among categories.
▶ No arithmetic and comparison operations (e.g., addition,
subtraction, multiplication, division, mean, greater than, less
than) possible.
▶ Numbers as labels have no numerical meaning (e.g., "1 =
Male" is just a label).
▶ Example: Labeling parts of speech (noun, verb, adjective) has
no numerical value.
47 / 199
Examples of Nominal Scales
▶ Common nominal variables:
▶ Gender: Male, Female, Non-binary, Other
▶ Country Languages: Norwegian, Japanese, Nepali
▶ Movie Genres: Comedy, Action, Drama
▶ Biological Domains: Archaea, Bacteria, Eukarya
▶ Parts of Speech: Noun, Pronoun, Adjective
▶ Example: Survey asking students’ favorite food: Pizza, Sushi,
Pasta.
48 / 199
Analyzing Nominal Data
▶ Key methods:
▶ Frequency: Count how often a label appears.
▶ Proportion: Frequency divided by total events.
▶ Percentage: Proportion multiplied by 100.
▶ Example: In a class of 50 students:
▶ Gender: 25 Male, 20 Female, 5 Non-binary
▶ Proportion: Female = 20/50 = 0.4
▶ Percentage: Female = 40% (computed in the sketch below)
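▶ The same computations with pandas, using the class data above:
import pandas as pd

gender = pd.Series(["Male"] * 25 + ["Female"] * 20 + ["Non-binary"] * 5)

print(gender.value_counts())                      # frequency
print(gender.value_counts(normalize=True))        # proportion
print(gender.value_counts(normalize=True) * 100)  # percentage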
49 / 199
Visualizing Nominal Data
▶ Use pie charts or bar charts for nominal data.
▶ Not suitable for histograms (used for numerical data).
▶ Example: Bar chart showing student preferences:
▶ Favorite Food: Pizza (20), Sushi (15), Pasta (10)
▶ Visuals make frequencies and proportions clear to
stakeholders.
50 / 199
Example: Student Survey Dataset
▶ Dataset: Survey of 100 students with nominal variables:
▶ Gender: Male, Female, Non-binary, Other
▶ Favorite Subject: Math, Science, History
▶ Analysis:
▶ Frequency: 50 Male, 40 Female, 10 Non-binary
▶ Proportion: Male = 50/100 = 0.5
▶ Visualization: Pie chart of Gender distribution
51 / 199
Nominal Scales in Real Life
▶ Used in surveys, classifications, and categorizations.
▶ Examples:
▶ Customer survey: Preferred brand (Nike, Adidas, Puma)
▶ Biology: Classifying species (Lion, Tiger, Leopard)
▶ Social media: Post category (Photo, Video, Text)
▶ Analysis: Count how many customers prefer each brand.
52 / 199
What are Ordinal Scales?
▶ Categorical data with a significant order of values.
▶ Tip: "Ordinal" sounds like "order" (1st, 2nd, 3rd, etc.).
▶ Differs from nominal scales (no order, e.g., Gender).
▶ Examples:
▶ Satisfaction: Low, Medium, High
▶ Education Level: High School, Bachelor’s, Master’s
▶ Race Position: 1st, 2nd, 3rd
53 / 199
Ordinal vs. Nominal Scales
▶ Nominal: No order (e.g., Blood Type: A, B, AB, O).
▶ Ordinal: Ordered categories (e.g., Satisfaction: Low,
Medium, High).
▶ Key Difference: Order matters in ordinal scales.
▶ Example:
▶ Nominal: Favorite Color (Red, Blue, Green).
▶ Ordinal: Class Rank (1st, 2nd, 3rd).
54 / 199
What is the Likert Scale?
▶ A type of ordinal scale with ordered response options.
▶ Used to measure opinions, attitudes, or feelings.
▶ Example Question: "WordPress is making content managers’
lives easier."
▶ Responses (5-point Likert Scale):
▶ Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree
▶ Visual: See next slide for diagram.
55 / 199
Likert Scale Example
Feelings
▶ Options: 1 - Very Unhappy, 2 - Unhappy, 3 - OK, 4 - Happy,
5 - Very Happy
▶ Order matters: 1 is less happy than 5.
▶ Example: A student rates their mood as "Happy" (4).
Satisfaction
▶ Options: 1 - Very Unsatisfied, 2 - Somewhat Unsatisfied, 3 -
Neutral, 4 - Somewhat Satisfied, 5 - Very Satisfied
▶ Example: A customer rates service as "Somewhat Satisfied"
(4).
56 / 199
Example: Student Survey Dataset
▶ Dataset: Survey of 50 students with ordinal variables:
▶ Effort Level: Low, Medium, High
▶ Course Difficulty: Easy, Moderate, Hard
▶ Analysis:
▶ Count: 20 Low, 20 Medium, 10 High Effort
▶ Median: Medium Effort
▶ Visualization: Bar chart of Effort Levels
57 / 199
Visualizing Ordinal Data
▶ Use bar charts to show order and frequency.
▶ Avoid pie charts if order is critical (use for nominal data
instead).
▶ Example: Bar chart of Likert responses (plotted in the sketch after this list):
▶ Strongly Agree: 10
▶ Agree: 15
▶ Neutral: 10
▶ Disagree: 5
▶ Strongly Disagree: 5
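▶ A matplotlib sketch of that bar chart, using the counts above:
import pandas as pd
import matplotlib.pyplot as plt

responses = pd.Series({"Strongly Agree": 10, "Agree": 15, "Neutral": 10,
                       "Disagree": 5, "Strongly Disagree": 5})
responses.plot(kind="bar", title="Likert responses")  # category order preserved
plt.ylabel("count")
plt.show()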
58 / 199
Real-World Example: Customer Feedback
▶ Dataset: 100 customer reviews with:
▶ Satisfaction: 1 - Very Unsatisfied, 2 - Unsatisfied, 3 - Neutral,
4 - Satisfied, 5 - Very Satisfied
▶ Analysis:
▶ Median: Neutral (3)
▶ Frequency: 30 Satisfied, 25 Neutral, etc.
▶ Action: Improve service based on low satisfaction feedback.
59 / 199
Why Learn Ordinal Scales?
▶ Order helps rank and compare data (e.g., satisfaction levels).
▶ Guides correct statistical measures (median, not mean).
▶ Essential for surveys and Likert scale analysis in EDA.
▶ Example: Ranking student performance (1st, 2nd, 3rd) for
awards.
60 / 199
What are Interval and Ratio Scales?
▶ Interval Scales: Order and exact differences between values
matter.
▶ Ratio Scales: Include order, exact differences, and a true
zero.
▶ Both extend beyond nominal and ordinal scales for advanced
analysis.
▶ Example: Temperature (interval) vs. Height (ratio).
61 / 199
Interval Scales
▶ Order and exact differences between values are significant.
▶ Used in statistics (e.g., mean, median, mode, standard
deviation).
▶ No true zero (e.g., 0°C doesn’t mean no temperature).
▶ Examples:
▶ Temperature in Celsius (°C)
▶ Location in Cartesian coordinates (x, y)
▶ Direction in degrees from magnetic north
▶ Example: Difference between 20°C and 30°C is the same as
30°C and 40°C.
62 / 199
Properties of Interval Scales
▶ Provides: Order, frequency, mode, median, mean.
▶ Can quantify differences between values.
▶ Can add or subtract values (e.g., 20°C - 10°C = 10°C).
▶ Cannot multiply/divide or use a true zero.
▶ Example: Average temperature of a week (e.g., mean =
25°C).
63 / 199
Example: Interval Scale in Action
▶ Dataset: Daily temperatures (°C) for a week:
▶ Mon: 20°C, Tue: 22°C, Wed: 25°C, Thu: 23°C, Fri: 21°C,
Sat: 24°C, Sun: 26°C
▶ Analysis:
▶ Mean: (20 + 22 + 25 + 23 + 21 + 24 + 26) / 7 = 23°C
▶ Difference: 25°C - 20°C = 5°C
▶ Visualization: Line graph of temperature trends (the computations are sketched below).
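▶ The slide's computations in NumPy:
import numpy as np

temps_c = np.array([20, 22, 25, 23, 21, 24, 26])
print(temps_c.mean())                 # 23.0 - mean is meaningful on an interval scale
print(temps_c.max() - temps_c.min())  # differences are meaningful
# Ratios are NOT meaningful here: 40°C is not "twice as hot" as 20°C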
64 / 199
Ratio Scales
▶ Include order, exact differences, and a true zero (e.g., 0 kg =
no mass).
▶ Enable advanced statistical analysis (e.g., mean, variance,
ratios).
▶ Examples:
▶ Mass (e.g., 50 kg)
▶ Length (e.g., 2 meters)
▶ Duration (e.g., 5 seconds)
▶ Volume (e.g., 10 liters)
▶ Example: Height of students (0 cm = no height is
meaningful).
65 / 199
Properties of Ratio Scales
▶ Provides: Order, frequency, mode, median, mean, differences.
▶ Can quantify differences, add/subtract, multiply/divide values.
▶ Has a true zero, allowing ratios (e.g., 10 kg is twice 5 kg).
▶ Example: Average weight of a class (e.g., mean = 60 kg).
66 / 199
Example: Ratio Scale in Action
▶ Dataset: Weights (kg) of 5 students:
▶ Student 1: 50 kg, Student 2: 55 kg, Student 3: 60 kg,
Student 4: 65 kg, Student 5: 70 kg
▶ Analysis:
▶ Mean: (50 + 55 + 60 + 65 + 70) / 5 = 60 kg
▶ Ratio: 70 kg is 1.4 times 50 kg
▶ Visualization: Bar chart of weights.
67 / 199
Comparison of All Scales
Table 3: A summary of the data types and scale measures
Provides:                                      Nominal Ordinal Interval Ratio
The "order" of values is known                         ✓       ✓        ✓
"Counts," aka "Frequency of Distribution"      ✓       ✓       ✓        ✓
Mode                                           ✓       ✓       ✓        ✓
Median                                                 ✓       ✓        ✓
Mean                                                           ✓        ✓
Can quantify the difference between values                     ✓        ✓
Can add or subtract values                                     ✓        ✓
Can multiply and divide values                                          ✓
Has a "true zero"                                                       ✓
▶ Example: Gender (nominal) vs. Temperature (interval) vs.
Weight (ratio).
68 / 199
Real-World Example: Weather Data
▶ Interval: Temperature (°C) over a month.
▶ Mean: 22°C, Difference: 25°C - 20°C = 5°C
▶ Ratio: Rainfall (mm) in the same month.
▶ Mean: 50 mm, Ratio: 100 mm is twice 50 mm
▶ Use: Plan irrigation based on rainfall ratios.
69 / 199
Why Learn Interval and Ratio Scales?
▶ Enable precise statistical analysis (e.g., mean, ratios).
▶ Critical for fields like science, engineering, and economics.
▶ Guide correct visualization and modeling choices.
▶ Example: Analyzing test scores (interval) or distances (ratio)
in a project.
70 / 199
Data Analysis Approaches
▶ Three key methods:
▶ Classical Data Analysis
▶ Exploratory Data Analysis (EDA)
▶ Bayesian Data Analysis
▶ Each has a unique process for handling data.
▶ Example: Analyzing student exam scores using different
methods.
71 / 199
Classical Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Model Development
▶ Data Analysis
▶ Results Communication
▶ Focus: Build a model first, then analyze data.
▶ Example: Predicting student grades with a pre-set linear
model.
72 / 199
Exploratory Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Data Analysis
▶ Model Development
▶ Results Communication
▶ Focus: Explore data (outliers, patterns) before modeling.
▶ No imposed models; emphasizes visualizations.
▶ Example: Plotting exam scores to spot trends before
modeling.
73 / 199
Bayesian Data Analysis
▶ Steps:
▶ Problem Definition
▶ Data Collection
▶ Model Development
▶ Prior Distribution
▶ Data Analysis
▶ Results Communication
▶ Uses prior probability (belief before evidence).
▶ Example: Using past exam trends as prior to predict current
scores.
74 / 199
Key Differences
▶ Classical: Model first, then data analysis.
▶ EDA: Data exploration first, flexible modeling.
▶ Bayesian: Incorporates prior beliefs.
▶ Example: Classical fits a grade model directly; EDA checks
score distribution first.
75 / 199
Why Compare These Approaches?
▶ Choose the best method for your data and goals.
▶ EDA is great for initial exploration; Classical for structured
analysis.
▶ Bayesian adds prior knowledge for accuracy.
▶ Example: Use EDA to explore customer feedback, then
Bayesian for predictions.
76 / 199
Example: Student Exam Scores
▶ Classical: Define problem (predict grades), collect scores,
build a regression model, analyze, report.
▶ EDA: Collect scores, plot distribution (e.g., histogram),
identify outliers, then model.
▶ Bayesian: Use last year’s grade trends as prior, update with
new data, analyze.
▶ Outcome: Different insights based on approach.
77 / 199
Real-World Example: Sales Data
▶ Classical: Build a sales forecast model, then analyze monthly
data.
▶ EDA: Explore sales trends (e.g., bar chart), then model
seasonal patterns.
▶ Bayesian: Use last year’s sales as prior, refine with current
data.
▶ Goal: Optimize inventory based on insights.
78 / 199
Software Tools for EDA
▶ Facilitate data exploration, visualization, and analysis.
▶ Open-source tools are free and widely accessible.
▶ Help uncover patterns, outliers, and insights.
▶ Example: Analyzing student performance data to find trends.
79 / 199
EDA Open-Source Tools
▶ Popular tools for EDA include:
▶ Python
▶ R Programming Language
▶ Weka
▶ KNIME
▶ Each offers unique features for data analysis.
80 / 199
Python
▶ Open-source programming language.
▶ Widely used for data analysis, mining, and data science.
▶ Link: https://www.python.org/
▶ Features: Libraries like pandas, matplotlib for EDA.
▶ Example: Plotting a histogram of exam scores using
matplotlib.
81 / 199
R Programming
▶ Open-source language for statistical computation.
▶ Strong in graphical data analysis.
▶ Link: https://www.r-project.org
▶ Features: Packages like ggplot2 for visualizations.
▶ Example: Creating a bar chart of sales data with ggplot2.
82 / 199
Weka
▶ Open-source data mining package.
▶ Includes EDA tools and algorithms.
▶ Link: https://www.cs.waikato.ac.nz/ml/weka/
▶ Features: Data visualization and data preprocessing tools.
▶ Example: Detecting outliers in a dataset of customer
purchases.
83 / 199
KNIME
▶ Open-source tool for data analysis.
▶ Based on Eclipse platform.
▶ Link: https://www.knime.com/
▶ Features: Drag-and-drop interface for workflows.
▶ Example: Building a workflow to analyze social media
engagement.
84 / 199
EDA using Python
Python
▶ Programming basics (variables, strings, data types)
▶ Conditionals, functions, sequences, collections, iterations
▶ File handling, object-oriented programming
NumPy
▶ Create, copy, and divide arrays
▶ Perform operations on arrays
▶ Array selections, advanced indexing, multi-dimensional arrays
▶ Linear algebra and built-in functions
85 / 199
EDA using Python
pandas
▶ Create and understand DataFrame objects
▶ Subset and index data
▶ Arithmetic functions, mapping, index management
▶ Styling for visual analysis
Matplotlib
▶ Load linear datasets
▶ Adjust axes, grids, labels, titles, legends
▶ Save plots
SciPy
▶ Import the package
▶ Use statistical packages
▶ Perform descriptive statistics, inference, analysis
86 / 199
Virtual Environment
▶ Essential for isolating Python projects.
▶ Steps:
▶ Install: pip install virtualenv
▶ Create: virtualenv Local_Version_Directory -p Python_System_Directory
▶ Example: virtualenv myenv -p /usr/bin/python3
▶ Check: Activate and install packages (e.g., pandas).
87 / 199
Reading/Writing to Files
▶ Basic file handling is key for data input/output.
▶ Example Code:
# Reading a text file line by line
filename = "datamining.txt"
with open(filename, mode="r", encoding="utf-8") as file:
    for line in file:
        print(line, end="")
▶ Practice: Read a CSV file of exam scores (one possible solution is sketched below).
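▶ A standard-library sketch for the practice task; the file name and column names are hypothetical:
import csv

# exam_scores.csv is assumed to have a header row: name,score
with open("exam_scores.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["name"], row["score"])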
88 / 199
Error Handling
▶ Manage errors to ensure robust code.
▶ Example Code:
# Handle invalid grade inputs
try:
    val = int(input("Type a number between 47 and 100: "))
except ValueError:
    print("You must type a number between 47 and 100!")
else:
    if 47 <= val <= 100:
        print("You typed:", val)
    else:
        print("The value you typed is incorrect!")
89 / 199
Object-Oriented Concepts
▶ Use classes and objects for structured code.
▶ Example Code:
# Mental Health Diseases: Social Anxiety Disorder
class Disease:
    def __init__(self, disease='Depression'):
        self.type = disease

    def getName(self):
        print("Mental Health Diseases:", self.type)

d1 = Disease('Social Anxiety Disorder')
d1.getName()
▶ Example: Create a class for student data (one possible sketch below).
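▶ A possible answer to that exercise; the fields are illustrative:
class Student:
    def __init__(self, name, grade):
        self.name = name
        self.grade = grade

    def describe(self):
        print(f"{self.name} scored {self.grade}")

s1 = Student("Asha", 91)
s1.describe()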
90 / 199
NumPy
# Create a 1D array using NumPy
import numpy as np
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
Output: [ 1 8 27 64]
# Create a 2D array using NumPy
import numpy as np
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], 
[4, 8, 18, 32]])
print(my2DArray)
Output: [[ 1 2 3 4]
[ 2 4 9 16]
[ 4 8 18 32]]
91 / 199
NumPy
# Create and display a 3D array using NumPy
import numpy as np
my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]],
[[1, 2, 3, 4], [9, 10, 11, 12]]])
print(my3Darray)
Output: [[[ 1 2 3 4]
[ 5 6 7 8]]
[[ 1 2 3 4]
[ 9 10 11 12]]]
# Display the memory addresses
print(my1DArray.data, my2DArray.data, my3Darray.data)
Output: <memory at 0x7f8b1c0a3e80> <memory at 0x7f8b1c0a3f40>
<memory at 0x7f8b1c0a4040>
92 / 199
NumPy
# Display the shapes of 1D, 2D, and 3D NumPy arrays
print(my1DArray.shape, my2DArray.shape, my3Darray.shape)
Output: (4,) (3, 4) (2, 2, 4)
# Display the data types of 1D, 2D, and 3D NumPy arrays
print(my1DArray.dtype, my2DArray.dtype, my3Darray.dtype)
Output: int64 int64 int64
# Display the strides of 1D, 2D, and 3D NumPy arrays
'''The strides (32, 8) in my2DArray indicate that to move
from one row to the next, 32 bytes are skipped, and to move
from one column to the next, 8 bytes are skipped'''
print(my1DArray.strides, my2DArray.strides, my3Darray.strides)
Output: (8,) (32, 8) (64, 32, 8)
93 / 199
NumPy
# Create a 2D array filled with ones
import numpy as np
ones = np.ones((2,4))
print(ones)
Output: [[1. 1. 1. 1.]
[1. 1. 1. 1.]]
# Create a 3D array filled with zeros
import numpy as np
zeros = np.zeros((2,1,4), dtype=np.int16)
print(zeros)
Output: [[[0 0 0 0]]
[[0 0 0 0]]]
94 / 199
NumPy
# Create a 2D array filled with random values
import numpy as np
random_array = np.random.random((2,2))
print(random_array)
Output: array([[0.44768845, 0.96186535],
[0.99402423, 0.88612299]])
# Create a 2D array with uninitialized values
import numpy as np
emptyArray = np.empty((3,2))
print(emptyArray)
Output: [[9.86638798e-316 0.00000000e+000]
[6.87990479e-310 6.87990488e-310]
[6.87990477e-310 6.87990479e-310]]
95 / 199
NumPy
# Create a 2D array filled with a specific value
import numpy as np
fullArray = np.full((2,2), 7)
print(fullArray)
Output: [[7 7]
[7 7]]
# Create a 1D array with evenly spaced values
import numpy as np
evenSpacedArray = np.arange(10, 25, 5)
print(evenSpacedArray)
Output: [10 15 20]
96 / 199
NumPy
# Create a 1D array with evenly spaced values
import numpy as np
evenSpacedArray2 = np.linspace(0, 2, 9)
print(evenSpacedArray2)
Output: [0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ]
''' Create a NumPy array and save it to a file and
load a NumPy array from a text file'''
import numpy as np
x = np.arange(0.0, 25.0, 1.0)
np.savetxt('data.out', x, delimiter=',')
z = np.loadtxt('data.out', unpack=True)
print(z)
Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.]
97 / 199
NumPy
# Load a NumPy array from a text file
import numpy as np
my_array2 = np.genfromtxt('data.out', skip_header=1, 
filling_values=-999)
print(my_array2)
Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.]
# Display the number of dimensions of 1D, 2D, and 3D arrays
print(my1DArray.ndim, my2DArray.ndim, my3Darray.ndim)
Output: 1 2 3
# Display the total number of elements in 1D, 2D, and 3D arrays
print(my1DArray.size, my2DArray.size, my3Darray.size)
Output: 4 12 16
98 / 199
NumPy
# Print information about memory layout
print(my1DArray.flags, my2DArray.flags, my3Darray.flags)
Output: C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED: True
WRITEBACKIFCOPY : False
99 / 199
NumPy
# Print the length of one array element in bytes
print(my1DArray.itemsize, my2DArray.itemsize, 
my3Darray.itemsize)
Output: 8 8 8
# Print the total consumed bytes by elements
print(my1DArray.nbytes, my2DArray.nbytes, my3Darray.nbytes)
Output: 32 96 128
# Sum along Numpy Axes
np_array_2d = np.arange(0, 6).reshape([2,3])
print(np_array_2d, np.sum(np_array_2d, axis = 0), 
np.sum(np_array_2d, axis = 1))
Output: [[0 1 2]
[3 4 5]] [3 5 7] [ 3 12]
100 / 199
NumPy
# Create a subset and slice an array using an index
x = np.array([10, 20, 30, 40, 50])
# Select items at index 0 and 1
print(x[0:2])
Output: [10 20]
# Select rows 0-1 with various column slices from a 2D array
y = np.array([[ 1, 2, 3, 4], [ 9, 10, 11 ,12]])
print("m =", y[0:2, 1])
print("n =", y[0:2, 0:2])
print("l =", y[0:2, 2:4])
Output: m = [ 2 10]
n = [[ 1 2]
[ 9 10]]
l = [[ 3 4]
[11 12]]
101 / 199
NumPy
# Specifying conditions
biggerThan2 = (y >= 2)
print(y[biggerThan2])
Output: [ 2 3 4 9 10 11 12]
# Basic operations (+, -, *, /, %)
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])
# Add two array
add = np.add(x, y)
print(add)
Output: [[ 2 6 12]
[ 4 6 2]]
102 / 199
NumPy
# Subtract two array
sub = np.subtract(x, y)
print(sub)
# Multiply two array
mul = np.multiply(x, y)
print(mul)
# Divide x, y
div = np.divide(x,y)
print(div)
# Calculated the remainder of x and y
rem = np.remainder(x, y)
print(rem)
Output: [[ 0 -2 -6]
[ 0 0 6]]
[[ 1 8 27]
[ 4 9 -8]]
[[ 1. 0.5 0.33333333]
[ 1. 1. -2. ]]
[[0 2 3]
[0 0 0]]
103 / 199
NumPy
# Broadcasting - operate on arrays of different shapes
# Rule 1: Two dimensions are compatible if they are equal
# Create an array of two dimension
A = np.ones((6, 8))
# Shape of A
print(A.shape)
print(A)
Output: (6, 8)
[[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1.]]
104 / 199
NumPy
# Create another array
B = np.random.random((6,8))
# Shape of B
print(B.shape)
print(B)
Output: (6, 8)
[[0.06148782 0.10690907 0.92578537 0.29907577 0.42786516 0.01944468
0.14473416 0.30382709]
[0.36209211 0.33220132 0.43412798 0.97707517 0.23210006 0.05892264
0.34311993 0.97168464]
[0.34048395 0.06280427 0.78917397 0.50310127 0.36555426 0.27233463
0.60115097 0.77911552]
[0.39724957 0.38369108 0.10517771 0.97519711 0.49966346 0.51715226
0.50031762 0.91470124]
[0.7647788 0.37106634 0.17694871 0.90837723 0.1932456 0.20634914
0.29533289 0.66564862]
[0.72985568 0.85682569 0.01275113 0.98932163 0.1776967 0.95006083
0.59139126 0.3131595 ]]
105 / 199
NumPy
# Sum of A and B; both arrays have the same shape.
print(A + B)
Output: [[1.06148782 1.10690907 1.92578537 1.29907577 1.42786516 1.01944468
1.14473416 1.30382709]
[1.36209211 1.33220132 1.43412798 1.97707517 1.23210006 1.05892264
1.34311993 1.97168464]
[1.34048395 1.06280427 1.78917397 1.50310127 1.36555426 1.27233463
1.60115097 1.77911552]
[1.39724957 1.38369108 1.10517771 1.97519711 1.49966346 1.51715226
1.50031762 1.91470124]
[1.7647788 1.37106634 1.17694871 1.90837723 1.1932456 1.20634914
1.29533289 1.66564862]
[1.72985568 1.85682569 1.01275113 1.98932163 1.1776967 1.95006083
1.59139126 1.3131595 ]]
106 / 199
NumPy
# Rule 2: Two dimensions are compatible when one of them is 1
# Initialize `x`
x = np.ones((3,4))
print(x)
print(x.shape)
Output: [[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]]
(3, 4)
# Initialize `y`
y = np.arange(4)
print(y)
print(y.shape)
Output: [0 1 2 3]
(4,)
107 / 199
NumPy
# Subtract `x` and `y`
print(x - y)
Output: [[ 1. 0. -1. -2.]
[ 1. 0. -1. -2.]
[ 1. 0. -1. -2.]]
'''Rule 3: Arrays can be broadcast together if they are
compatible in all dimensions.'''
x = np.ones((2,3))
print("x:", x)
Output: x: [[1. 1. 1.]
[1. 1. 1.]]
108 / 199
NumPy
# Initialize 'y'
y = np.random.random((2, 1, 3))
print("y:", y)
Output: y: [[[0.91087436 0.74716299 0.8804711 ]]
[[0.20148139 0.27853328 0.0647736 ]]]
# Sum of 'x' and 'y'
print("sum: ", x + y)
Output: sum: [[[1.91087436 1.74716299 1.8804711 ]
[1.91087436 1.74716299 1.8804711 ]]
[[1.20148139 1.27853328 1.0647736 ]
[1.20148139 1.27853328 1.0647736 ]]]
109 / 199
pandas
▶ Open-source Python library for data manipulation and analysis
▶ Created by Wes McKinney in 2008
▶ Widely used in data science, finance, and research
▶ GitHub: https://github.com/pandas-dev/pandas
▶ Key features:
▶ Data structures: Series and DataFrame
▶ Handling missing data
▶ Data filtering, grouping, and merging
110 / 199
Usage of pandas
▶ Simplifies data manipulation tasks
▶ Integrates with other Python libraries (NumPy, Matplotlib,
etc.)
▶ Handles large datasets efficiently
▶ Supports various data formats (CSV, Excel, SQL, etc.; see the sketch below)
▶ Enables quick data exploration and visualization.
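▶ A few of those I/O entry points; the file names and connection are hypothetical:
import pandas as pd

df = pd.read_csv("sales.csv")   # CSV
df.to_excel("sales.xlsx")       # Excel (requires openpyxl)
# df = pd.read_sql("SELECT * FROM sales", connection)  # SQL, given a DB-API connection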
111 / 199
pandas
# Setting up the pandas environment
!pip install --upgrade pandas
import numpy as np
import pandas as pd
print("Pandas Version:", pd.__version__)
Requirement already satisfied: ...
Pandas Version: 2.3.1
# Customizing display settings for data visibility
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
112 / 199
pandas
# Create a pandas Series
series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23])
print(series)
Output: 0 2
1 3
2 7
3 11
4 13
5 17
6 19
7 23
dtype: int64
113 / 199
pandas
# Create a DataFrame from mixed series, arrays, and scalars
series_df = pd.DataFrame({
'A': range(1, 5), 'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety",
"Bipolar Disorder", "Eating Disorder"]),
'F': 'Mental health', 'G': 'is challenging'
})
display(series_df)
Output: (DataFrame with columns A-G rendered as a table)
114 / 199
pandas
# Create a dataframe for a dictionary
dict_df = [{'A': 'Apple', 'B': 'Ball'},
{'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)
Output: A B C
0 Apple Ball NaN
1 Aeroplane Bat Cat
# Create a dataframe from n-dimensional arrays
sdf = {'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark',
'Oppland', 'Buskerud'], 'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10,
14910.94], 'Administrative centre': ["Sarpsborg",
"Oslo", "City of Oslo", "Hamar", "Lillehammer", "Drammen"]}
print(pd.DataFrame(sdf))
115 / 199
pandas
Output: County ISO-Code Area Administrative centre
0 Østfold 1 4180.69 Sarpsborg
1 Hordaland 2 4917.94 Oslo
2 Oslo 3 454.07 City of Oslo
3 Hedmark 4 27397.76 Hamar
4 Oppland 5 25192.10 Lillehammer
5 Buskerud 6 14910.94 Drammen
# Table formats supported by tabulate:
# 'plain', 'simple', 'github', 'grid', 'fancy_grid', 'pipe',
# 'orgtbl', 'jira', 'presto', 'pretty', 'psql', 'rst',
# 'mediawiki', 'moinmoin', 'youtrack', 'html', 'latex',
# 'latex_raw', 'latex_booktabs', 'textile'
from tabulate import tabulate
# displaying the DataFrame
print(tabulate(sdf, headers = 'keys', tablefmt = 'psql'))
116 / 199
pandas
from tabulate import tabulate
# Displaying the DataFrame
print(tabulate(sdf, headers = 'keys', tablefmt = 'github'))
print(tabulate(sdf, headers = 'keys', tablefmt = 'grid'))
117 / 199
pandas
Output: (the county table rendered in github and grid formats)
118 / 199
pandas
# Load a dataset from an external source into a DataFrame
keys = ['age', 'workclass', 'fnlwgt', 'education',
'education_num', 'marital_status', 'occupation',
'relationship','ethnicity', 'gender', 'capital_gain',
'capital_loss', 'hours_per_week','country_of_origin','income']
url = ('http://archive.ics.uci.edu/ml/machine-learning-'
       'databases/adult/adult.data')
df = pd.read_csv(url, names=keys)
print(df.head())
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
119 / 199
pandas
# Retrieve the first 10 records
print(df.head(10))
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
5 37 Private 284582 ... United-States <=50K
6 49 Private 160187 ... Jamaica <=50K
7 52 Self-emp-not-inc 209642 ... United-States >50K
8 31 Private 45781 ... United-States >50K
9 42 Private 159449 ... United-States >50K
120 / 199
pandas
# Retrieve the last 5 records
print(df.tail())
Output:
age workclass fnlwgt ... country_of_origin income
32556 27 Private 257302 ... United-States <=50K
32557 40 Private 154374 ... United-States >50K
32558 58 Private 151910 ... United-States <=50K
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
121 / 199
pandas
# Retrieve the last 10 records
print(df.tail(10))
Output:
age workclass fnlwgt ... country_of_origin income
32551 32 Private 34066 ... United-States <=50K
32552 43 Private 84661 ... United-States <=50K
32553 32 Private 116138 ... Taiwan <=50K
32554 53 Private 321865 ... United-States >50K
32555 22 Private 310152 ... United-States <=50K
32556 27 Private 257302 ... United-States <=50K
32557 40 Private 154374 ... United-States >50K
32558 58 Private 151910 ... United-States <=50K
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
122 / 199
pandas
# Displays the rows, columns, data types, and memory
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education_num 32561 non-null int64
...
12 hours_per_week 32561 non-null int64
13 country_of_origin 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
123 / 199
pandas
# Selects a row
print(df.iloc[10])
Output:
age 37
workclass Private
fnlwgt 280464
education Some-college
education_num 10
marital_status Married-civ-spouse
occupation Exec-managerial
relationship Husband
ethnicity Black
gender Male
capital_gain 0
capital_loss 0
hours_per_week 80
country_of_origin United-States
income >50K
Name: 10, dtype: object
124 / 199
pandas
# Selects 10 rows
print(df.iloc[0:10])
Output:
age workclass fnlwgt ... country_of_origin income
0 39 State-gov 77516 ... United-States <=50K
1 50 Self-emp-not-inc 83311 ... United-States <=50K
2 38 Private 215646 ... United-States <=50K
3 53 Private 234721 ... United-States <=50K
4 28 Private 338409 ... cuba <=50K
5 37 Private 284582 ... United-States <=50K
6 49 Private 160187 ... Jamaica <=50K
7 52 Self-emp-not-inc 209642 ... United-States >50K
8 31 Private 45781 ... United-States >50K
9 42 Private 159449 ... United-States >50K
125 / 199
pandas
# Selects a range of rows
df.iloc[10:15]
Output:
age workclass fnlwgt ... country_of_origin income
10 37 Private 280464 ... United-States >50K
11 30 State-gov 141297 ... India >50K
12 23 Private 122272 ... United-States <=50K
13 32 Private 205019 ... United-States <=50K
14 40 Private 121772 ... ? >50K
126 / 199
pandas
# Selects the last 2 rows
print(df.iloc[-2:])
Output:
age workclass fnlwgt ... country_of_origin income
32559 22 Private 201490 ... United-States <=50K
32560 52 Self-emp-inc 287927 ... United-States >50K
127 / 199
pandas
# Selects every other row, columns 3-4 (education, education_num)
df.iloc[::2, 3:5].head()
Output:
education education_num
0 Bachelors 13
2 HS-grad 9
4 Bachelors 13
6 9th 5
8 Masters 14
128 / 199
pandas
# Seeding random data from numpy
np.random.seed(24)
# Making the DataFrame
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4),
columns=list('BCDE'))], axis = 1)
# DataFrame without any styling
print(df)
Output:
A B C D E
0 1.0 1.329212 -0.770033 -0.316280 -0.990810
1 2.0 -1.070816 -1.438713 0.564417 0.295722
2 3.0 -1.626404 0.219565 0.678805 1.889273
3 4.0 0.961538 0.104011 -0.481165 0.850229
4 5.0 1.453425 1.057737 0.165562 0.515018
5 6.0 -1.336936 0.562861 1.392855 -0.063328
6 7.0 0.121668 1.207603 -0.002040 1.627796
7 8.0 0.354493 1.037528 -0.385684 0.519818
8 9.0 1.686583 -1.325963 1.428984 -2.089354
9 10.0 -0.129820 0.631523 -0.586538 0.290720
129 / 199
pandas
# Styling DataFrame using DataFrame.style property
df.style.set_properties(**{'background-color':
'black', 'color': 'green'})
Output: (table rendered with black background and green text)
130 / 199
pandas
# Replacing the locating value by NaN (Not a Number)
df.iloc[0, 3] = np.nan
df.iloc[2, 3] = np.nan
df.iloc[4, 2] = np.nan
df.iloc[7, 4] = np.nan
print(df)
Output:
A B C D E
0 1.0 1.329212 -0.770033 NaN -0.990810
1 2.0 -1.070816 -1.438713 0.564417 0.295722
2 3.0 -1.626404 0.219565 NaN 1.889273
3 4.0 0.961538 0.104011 -0.481165 0.850229
4 5.0 1.453425 NaN 0.165562 0.515018
5 6.0 -1.336936 0.562861 1.392855 -0.063328
6 7.0 0.121668 1.207603 -0.002040 1.627796
7 8.0 0.354493 1.037528 -0.385684 NaN
8 9.0 1.686583 -1.325963 1.428984 -2.089354
9 10.0 -0.129820 0.631523 -0.586538 0.290720
131 / 199
pandas
# Highlight the NaN values in DataFrame
df.style.highlight_null(color='red')
Output: (NaN cells highlighted in red)
132 / 199
pandas
# Highlight the Min values in each column
df.style.highlight_min(axis = 0)
Output: (minimum value in each column highlighted)
133 / 199
pandas
# Highlight the Max values in each column
df.style.highlight_max(axis = 0)
Output: (maximum value in each column highlighted)
134 / 199
pandas
# Highlight the Max values in each row
df.style.highlight_max(axis = 1)
Output: (maximum value in each row highlighted)
135 / 199
pandas
# Set text color of positive values in DataFrames
def color_positive_green(val):
    """Return the CSS property 'color: green' for
    positive values, 'color: black' otherwise."""
    color = 'green' if val > 0 else 'black'
    return 'color: %s' % color

df.style.applymap(color_positive_green)  # newer pandas: df.style.map
136 / 199
pandas
Output: (positive values rendered in green)
137 / 199
pandas
# Import seaborn library
import seaborn as sns
# Declaring the color palette from seaborn
cm = sns.light_palette("green", as_cmap=True)
# DataFrame with background gradient and precision
df.style.background_gradient(cmap=cm).format(precision = 2)
138 / 199
pandas
Output: (green background gradient, two-decimal precision)
139 / 199
pandas
# Checking Missing Values in a DataFrame
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df, "n")
print(df.isnull())
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False
140 / 199
pandas
# Filtering Data Based on Missing Values
'''Download employees dataset
https://media.geeksforgeeks.org/wp-content/uploads/employees.csv
'''
import pandas as pd
d = pd.read_csv("/content/employees.csv")
bool_series = pd.isnull(d["Gender"])
missing_gender_data = d[bool_series]
print(missing_gender_data.head())
Output:
First Name Gender ... Senior Management Team
0 Lois NaN ... True Legal
22 Joshua NaN ... True Client Services
27 Scott NaN ... True Legal
31 Joyce NaN ... True Product
41 Christine NaN ... True Business Development
141 / 199
pandas
# Checking for Non-Missing Values
import pandas as pd
import numpy as np
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df)
print(df.notnull())
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True
142 / 199
pandas
# Filtering Data with Non-Missing Values
import pandas as pd
d = pd.read_csv("/content/employees.csv")
nmg = pd.notnull(d["Gender"])
print(d[nmg].head())
Output:
First Name Gender ... Senior Management Team
0 Douglas Male ... True Marketing
1 Thomas Male ... True NaN
2 Maria Female ... True Finance
3 Jerry Male ... True Finance
4 Larry Male ... True Client Services
143 / 199
pandas
# Filling Missing Values in Pandas
d = {'First Score': [100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(d)
print(df, "n")
print(df.fillna(0))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0
144 / 199
pandas
# Fill with Previous Value
print(df, "n")
print(df.fillna(method='pad'))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 90.0 56.0 80.0
3 95.0 56.0 98.0
145 / 199
pandas
# Fill with Next Value
print(df, "n")
print(df.fillna(method='bfill'))
Output:
First Score Second Score Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0
First Score Second Score Third Score
0 100.0 30.0 40.0
1 90.0 45.0 40.0
2 95.0 56.0 80.0
3 95.0 NaN 98.0
146 / 199
pandas
# Fill NaN Values with 'No Gender'
d = pd.read_csv("/content/employees.csv")
print(d[20:25])
d["Gender"].fillna('No Gender', inplace = True)
print(d[20:25])
Output:
First Name Gender ... Senior Management Team
20 Lois NaN ... True Legal
21 Matthew Male ... False Marketing
22 Joshua NaN ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
First Name Gender ... Senior Management Team
20 Lois No Gender ... True Legal
21 Matthew Male ... False Marketing
22 Joshua No Gender ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
147 / 199
pandas
# Replace all NaN values with -99.
d = pd.read_csv("/content/employees.csv")
print(d[20:25])
d = d.replace(to_replace = np.nan, value = -99)
print(d[20:25])
Output:
First Name Gender ... Senior Management Team
20 Lois NaN ... True Legal
21 Matthew Male ... False Marketing
22 Joshua NaN ... True Client Services
23 NaN Male ... NaN NaN
24 John Male ... False Client Services
First Name Gender ... Senior Management Team
20 Lois -99 ... True Legal
21 Matthew Male ... False Marketing
22 Joshua -99 ... True Client Services
23 -99 Male ... -99 -99
24 John Male ... False Client Services
148 / 199
pandas
# Fills missing values using interpolation techniques
'''
'linear' - Linear interpolation between adjacent non-missing values.
'polynomial' - Polynomial interpolation (Order 2 for quadratic).
'nearest' - Fills with the nearest non-missing value.
'zero' - Fills with the previous non-missing value
(piecewise constant).
'slinear' - Spline interpolation of order 1
(equivalent to linear in pandas).
'quadratic' - Polynomial interpolation of order 2.
'barycentric' - Barycentric interpolation for smooth approximations.
'''
149 / 199
pandas
df = pd.DataFrame({"A": [12, 4, 5, None, 1],
"B": [None, 2, 54, 3, None],
"C": [20, 16, None, 3, 8],
"D": [14, 3, None, None, 6]})
print(df)
#linear forward interpolation
print(df.interpolate(method = 'linear', limit_direction ='forward'))
Output:
A B C D
0 12.0 NaN 20.0 14.0
1 4.0 2.0 16.0 3.0
2 5.0 54.0 NaN NaN
3 NaN 3.0 3.0 NaN
4 1.0 NaN 8.0 6.0
A B C D
0 12.0 NaN 20.0 14.0
1 4.0 2.0 16.0 3.0
2 5.0 54.0 9.5 4.0
3 3.0 3.0 3.0 5.0
4 1.0 3.0 8.0 6.0
150 / 199
pandas
# Polynomial interpolation with order 2
print(df.interpolate(method ='polynomial', order = 2))
Output:
A B C D
0 12.000000 NaN 20.0 14.0
1 4.000000 2.0 16.0 3.0
2 5.000000 54.0 8.0 -2.0
3 4.578947 3.0 3.0 -1.0
4 1.000000 NaN 8.0 6.0
151 / 199
pandas
# Drop rows where all values are missing
data = {'First Score': [100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, np.nan, 80, 98],
        'Fourth Score': [np.nan, np.nan, np.nan, 65]}
df = pd.DataFrame(data)
print(df, "\n")
print(df.dropna(how = 'all'))
Output:
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 NaN
1 NaN NaN NaN NaN
2 NaN 45.0 80.0 NaN
3 95.0 56.0 98.0 65.0
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 NaN
2 NaN 45.0 80.0 NaN
3 95.0 56.0 98.0 65.0
152 / 199
pandas
# Remove columns that contain at least one missing value
data = {'First Score': [100, np.nan, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score': [52, np.nan, 80, 98],
        'Fourth Score': [60, 67, 68, 65]}
df = pd.DataFrame(data)
print(df, "\n")
print(df.dropna(axis=1))
Output:
First Score Second Score Third Score Fourth Score
0 100.0 30.0 52.0 60
1 NaN NaN NaN 67
2 NaN 45.0 80.0 68
3 95.0 56.0 98.0 65
Fourth Score
0 60
1 67
2 68
3 65
153 / 199
pandas
# Drop rows with missing values
d = pd.read_csv("/content/employees.csv")
nd = d.dropna(axis=0, how='any')
print("Old data frame length:", len(d))
print("New data frame length:", len(nd))
print("Rows with at least one missing value:",
(len(d) - len(nd)))
Output:
Old data frame length: 1000
New data frame length: 764
Rows with at least one missing value: 236
154 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'],
'Age': [17, 14, 12, 52],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])
print(df, "nn", df1)
155 / 199
pandas
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
Name Age Address Qualification
4 Abhi 17 Nagpur Btech
5 Ayushi 14 Kanpur B.A
6 Dhiraj 12 Allahabad Bcom
7 Hitesh 52 Kannuaj B.hons
156 / 199
pandas
# Concatenating DataFrame
frames = [df, df1]
res1 = pd.concat(frames)
print(res1)
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
4 Abhi 17 Nagpur Btech
5 Ayushi 14 Kanpur B.A
6 Dhiraj 12 Allahabad Bcom
7 Hitesh 52 Kannuaj B.hons
157 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
'Mobile No': [97, 91, 58, 76]}
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age': [22, 32, 12, 52],
'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[2, 3, 6, 7])
print(df, "nn", df1)
158 / 199
pandas
Output:
Name Age Address Qualification Mobile No
0 Jai 27 Nagpur Msc 97
1 Princi 24 Kanpur MA 91
2 Gaurav 22 Allahabad MCA 58
3 Anuj 32 Kannuaj Phd 76
Name Age Address Qualification Salary
2 Gaurav 22 Allahabad MCA 1000
3 Anuj 32 Kannuaj Phd 2000
6 Dhiraj 12 Allahabad Bcom 3000
7 Hitesh 52 Kannuaj B.hons 4000
159 / 199
pandas
# Inner Join
res2 = pd.concat([df, df1], axis=1, join='inner')
print(res2)
Output:
Name Age Address Qualification Mobile No Name 
2 Gaurav 22 Allahabad MCA 58 Gaurav
3 Anuj 32 Kannuaj Phd 76 Anuj
Age Address Qualification Salary
2 22 Allahabad MCA 1000
3 32 Kannuaj Phd 2000
160 / 199
pandas
# Outer Join
res2 = pd.concat([df, df1], axis = 1, sort = False)
print(res2)
Output:
Name Age Address Qualification Mobile No Name
0 Jai 27.0 Nagpur Msc 97.0 NaN
1 Princi 24.0 Kanpur MA 91.0 NaN
2 Gaurav 22.0 Allahabad MCA 58.0 Gaurav
3 Anuj 32.0 Kannuaj Phd 76.0 Anuj
6 NaN NaN NaN NaN NaN Dhiraj
7 NaN NaN NaN NaN NaN Hitesh
Age Address Qualification Salary
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 22.0 Allahabad MCA 1000.0
3 32.0 Kannuaj Phd 2000.0
6 12.0 Allahabad Bcom 3000.0
7 52.0 Kannuaj B.hons 4000.0
161 / 199
pandas
# Concatenating DataFrames, ignoring indexes
res = pd.concat([df, df1], ignore_index=True)
print(res)
Output:
Name Age Address Qualification Mobile No Salary
0 Jai 27 Nagpur Msc 97.0 NaN
1 Princi 24 Kanpur MA 91.0 NaN
2 Gaurav 22 Allahabad MCA 58.0 NaN
3 Anuj 32 Kannuaj Phd 76.0 NaN
4 Gaurav 22 Allahabad MCA NaN 1000.0
5 Anuj 32 Kannuaj Phd NaN 2000.0
6 Dhiraj 12 Allahabad Bcom NaN 3000.0
7 Hitesh 52 Kannuaj B.hons NaN 4000.0
162 / 199
pandas
import pandas as pd
data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'],
'Mobile No': [97, 91, 58, 76]}
data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'],
'Age': [22, 32, 12, 52],
'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'],
'Salary': [1000, 2000, 3000, 4000]}
df = pd.DataFrame(data1, index=[0, 1, 2, 3])
df1 = pd.DataFrame(data2, index=[4, 5, 6, 7])
print(df, "nn", df1)
163 / 199
pandas
Output:
Name Age Address Qualification Mobile No
0 Jai 27 Nagpur Msc 97
1 Princi 24 Kanpur MA 91
2 Gaurav 22 Allahabad MCA 58
3 Anuj 32 Kannuaj Phd 76
Name Age Address Qualification Salary
4 Gaurav 22 Allahabad MCA 1000
5 Anuj 32 Kannuaj Phd 2000
6 Dhiraj 12 Allahabad Bcom 3000
7 Hitesh 52 Kannuaj B.hons 4000
164 / 199
pandas
# Concatenating DataFrame with group keys
frames = [df, df1]
res = pd.concat(frames, keys=['x', 'y'])
print(res)
Output:
Name Age Address Qualification Mobile No Salary
x 0 Jai 27 Nagpur Msc 97.0 NaN
1 Princi 24 Kanpur MA 91.0 NaN
2 Gaurav 22 Allahabad MCA 58.0 NaN
3 Anuj 32 Kannuaj Phd 76.0 NaN
y 4 Gaurav 22 Allahabad MCA NaN 1000.0
5 Anuj 32 Kannuaj Phd NaN 2000.0
6 Dhiraj 12 Allahabad Bcom NaN 3000.0
7 Hitesh 52 Kannuaj B.hons NaN 4000.0
165 / 199
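The group keys become the outer level of a MultiIndex, so each original frame can be recovered with .loc; a short sketch:
# Select only the rows that came from the first frame
print(res.loc['x'])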
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data1,index=[0, 1, 2, 3])
s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary')
print(df,"nn", s1)
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
0 1000
1 2000
2 3000
3 4000
Name: Salary, dtype: int64
166 / 199
pandas
# Concatenating Mixed DataFrames and Series
res = pd.concat([df, s1], axis = 1)
print(res)
Output:
Name Age Address Qualification Salary
0 Jai 27 Nagpur Msc 1000
1 Princi 24 Kanpur MA 2000
2 Gaurav 22 Allahabad MCA 3000
3 Anuj 32 Kannuaj Phd 4000
167 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "n", df1)
Output:
key Name Age
0 K0 Jai 27
1 K1 Princi 24
2 K2 Gaurav 22
3 K3 Anuj 32
key Address Qualification
0 K0 Nagpur Btech
1 K1 Kanpur B.A
2 K2 Allahabad Bcom
3 K3 Kannuaj B.hons
168 / 199
pandas
# Merging DataFrames Using One Key
res = pd.merge(df, df1, on = 'key')
print(res)
Output:
key Name Age Address Qualification
0 K0 Jai 27 Nagpur Btech
1 K1 Princi 24 Kanpur B.A
2 K2 Gaurav 22 Allahabad Bcom
3 K3 Anuj 32 Kannuaj B.hons
169 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K1', 'K0', 'K1'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "nn", df1)
170 / 199
pandas
Output:
key key1 Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
key key1 Address Qualification
0 K0 K0 Nagpur Btech
1 K1 K0 Kanpur B.A
2 K2 K0 Allahabad Bcom
3 K3 K0 Kannuaj B.hons
171 / 199
pandas
# Merging DataFrames Using Multiple Keys
res1 = pd.merge(df, df1, on=['key', 'key1'])
print(res1)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K2 K0 Gaurav 22 Allahabad Bcom
172 / 199
pandas
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K1', 'K0', 'K1'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'key1': ['K0', 'K0', 'K0', 'K0'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df, "nn", df1)
173 / 199
pandas
Output:
key key1 Name Age
0 K0 K0 Jai 27
1 K1 K1 Princi 24
2 K2 K0 Gaurav 22
3 K3 K1 Anuj 32
key key1 Address Qualification
0 K0 K0 Nagpur Btech
1 K1 K0 Kanpur B.A
2 K2 K0 Allahabad Bcom
3 K3 K0 Kannuaj B.hons
174 / 199
pandas
# Left outer join
res = pd.merge(df, df1, how = 'left', on = ['key', 'key1'])
print(res)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K1 K1 Princi 24 NaN NaN
2 K2 K0 Gaurav 22 Allahabad Bcom
3 K3 K1 Anuj 32 NaN NaN
175 / 199
pandas
# Right outer join
res1 = pd.merge(df, df1, how = 'right', on = ['key', 'key1'])
print(res1)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27.0 Nagpur Btech
1 K1 K0 NaN NaN Kanpur B.A
2 K2 K0 Gaurav 22.0 Allahabad Bcom
3 K3 K0 NaN NaN Kannuaj B.hons
176 / 199
pandas
# Outer join
res2 = pd.merge(df, df1, how='outer', on=['key', 'key1'])
print(res2)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27.0 Nagpur Btech
1 K1 K0 NaN NaN Kanpur B.A
2 K1 K1 Princi 24.0 NaN NaN
3 K2 K0 Gaurav 22.0 Allahabad Bcom
4 K3 K0 NaN NaN Kannuaj B.hons
5 K3 K1 Anuj 32.0 NaN NaN
177 / 199
pandas
# Inner join
res3 = pd.merge(df, df1, how = 'inner', on = ['key', 'key1'])
print(res3)
Output:
key key1 Name Age Address Qualification
0 K0 K0 Jai 27 Nagpur Btech
1 K2 K0 Gaurav 22 Allahabad Bcom
178 / 199
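For exploratory work it is often useful to know which side each row came from; merge's indicator flag adds a _merge column for exactly this. A short sketch with the same df and df1:
# Tag each row as 'left_only', 'right_only', or 'both'
res4 = pd.merge(df, df1, how='outer', on=['key', 'key1'],
                indicator=True)
print(res4['_merge'].value_counts())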
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32]}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
print(df, "nn", df1)
Output:
Name Age
K0 Jai 27
K1 Princi 24
K2 Gaurav 22
K3 Anuj 32
Address Qualification
K0 Allahabad MCA
K2 Kannuaj Phd
K3 Allahabad Bcom
K4 Kannuaj B.hons
179 / 199
pandas
# Merge DataFrames based on row indexes
res = df.join(df1)
print(res)
Output:
Name Age Address Qualification
K0 Jai 27 Allahabad MCA
K1 Princi 24 NaN NaN
K2 Gaurav 22 Kannuaj Phd
K3 Anuj 32 Allahabad Bcom
180 / 199
pandas
# Merge DataFrames based on row indexes
res = df1.join(df)
print(res)
Output:
Address Qualification Name Age
K0 Allahabad MCA Jai 27.0
K2 Kannuaj Phd Gaurav 22.0
K3 Allahabad Bcom Anuj 32.0
K4 Kannuaj B.hons NaN NaN
181 / 199
pandas
# Outer Join
res1 = df.join(df1, how='outer')
print(res1)
Output:
Name Age Address Qualification
K0 Jai 27.0 Allahabad MCA
K1 Princi 24.0 NaN NaN
K2 Gaurav 22.0 Kannuaj Phd
K3 Anuj 32.0 Allahabad Bcom
K4 NaN NaN Kannuaj B.hons
182 / 199
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],
'Key':['K0', 'K1', 'K2', 'K3']}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
print(df, "nn", df1)
Output:
Name Age Key
0 Jai 27 K0
1 Princi 24 K1
2 Gaurav 22 K2
3 Anuj 32 K3
Address Qualification
K0 Allahabad MCA
K2 Kannuaj Phd
K3 Allahabad Bcom
K4 Kannuaj B.hons
183 / 199
pandas
# Joining DataFrames Using "on" Argument
res2 = df.join(df1, on='Key')
print(res2)
Output:
Name Age Key Address Qualification
0 Jai 27 K0 Allahabad MCA
1 Princi 24 K1 NaN NaN
2 Gaurav 22 K2 Kannuaj Phd
3 Anuj 32 K3 Allahabad Bcom
184 / 199
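Passing on='Key' aligns df's Key column against df1's index. The same join can be written explicitly by promoting the column to the index first; a sketch (the only difference is that the result is then indexed by Key):
# Equivalent formulation via set_index
res3 = df.set_index('Key').join(df1)
print(res3)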
pandas
data1 = {'Name':['Jai', 'Princi', 'Gaurav'],
'Age':[27, 24, 22]}
data2 = {'Address':['Allahabad', 'Kannuaj',
'Allahabad', 'Kanpur'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1, index=pd.Index(['K0', 'K1', 'K2'],
name='key'))
index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'),
('K2', 'Y2'), ('K2', 'Y3')],
names=['key', 'Y'])
df1 = pd.DataFrame(data2, index= index)
print(df, "nn", df1)
185 / 199
pandas
Output:
Name Age
key
K0 Jai 27
K1 Princi 24
K2 Gaurav 22
Address Qualification
key Y
K0 Y0 Allahabad MCA
K1 Y1 Kannuaj Phd
K2 Y2 Allahabad Bcom
Y3 Kanpur B.hons
186 / 199
pandas
# Joining DataFrames with Different Index Levels (Multi-Index)
result = df.join(df1, how='inner')
print(result)
Output:
Name Age Address Qualification
key Y
K0 Y0 Jai 27 Allahabad MCA
K1 Y1 Princi 24 Kannuaj Phd
K2 Y2 Gaurav 22 Allahabad Bcom
Y3 Gaurav 22 Kanpur B.hons
187 / 199
SciPy
▶ SciPy is an open-source Python library for scientific and
technical computing.
▶ Relies on NumPy, which provides efficient n-dimensional array
manipulation.
▶ Covers areas like optimization, integration, interpolation,
eigenvalue problems, and statistics.
▶ Essential for research, data analysis, and engineering projects.
188 / 199
Usage of SciPy
▶ Scientific Computing: Solves differential equations and
performs numerical integration.
▶ Statistics: Offers scipy.stats for hypothesis testing,
probability distributions, and more.
▶ Optimization: Includes tools for linear programming and
nonlinear optimization.
▶ Signal Processing: Provides functions for Fourier transforms
and filtering.
▶ Preparation: Install via pip install scipy and explore
documentation.
189 / 199
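A minimal optimization sketch with scipy.optimize follows; the quadratic objective is an illustrative assumption:
# Minimize a simple quadratic; the minimum lies at x = 2
from scipy import optimize
f = lambda x: (x - 2)**2
res = optimize.minimize_scalar(f)
print(f"Minimum at x = {res.x:.4f}")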
SciPy
from scipy import stats
data = [1.5, 2.3, 3.1, 4.2, 5.0]
mean = stats.tmean(data)
std_dev = stats.tstd(data)
print(f"Mean: {mean}, Standard Deviation: {std_dev}")
Output: Mean: 3.22, Standard Deviation: 1.4096098751072936
from scipy import integrate
import numpy as np
f = lambda x: x**2
result, error = integrate.quad(f, 0, 1)
print(f"Integral from 0 to 1: {result} (Error: {error})")
Output: Integral from 0 to 1: 0.33333333333333337
(Error: 3.700743415417189e-15)
190 / 199
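scipy.stats also supports the hypothesis testing mentioned earlier; a one-sample t-test sketch (the sample and the hypothesized mean of 3.0 are illustrative assumptions):
# One-sample t-test: does the sample mean differ from 3.0?
from scipy import stats
data = [1.5, 2.3, 3.1, 4.2, 5.0]
t_stat, p_value = stats.ttest_1samp(data, popmean=3.0)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")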
Matplotlib
▶ A comprehensive Python library for creating static, animated,
and interactive visualizations.
▶ Offers a wide range of customizable plots (line, bar, scatter,
etc.) and backends.
▶ Applications: Used in professional reporting, interactive
dashboards, web/GUI applications, and embedded views.
191 / 199
Usage of Matplotlib
▶ Reporting: Generate publication-quality figures for research articles.
▶ Interactive Tools: Create dynamic plots with widgets for
data exploration.
▶ Dashboards: Build complex visualizations for real-time data
monitoring.
▶ Web Integration: Embed plots in web applications using
backends like WebAgg.
▶ Preparation: Install via pip install matplotlib and
explore documentation.
192 / 199
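For the reporting use case above, a figure is written to disk with savefig; a minimal sketch (the filename, dpi, and sample data are illustrative):
# Save the current figure as a high-resolution PNG
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [2, 4, 8])
plt.savefig("figure.png", dpi=300, bbox_inches='tight')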
Matplotlib
# Plot sine, cosine and tangent waves
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='Sine', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='Cosine', color='red',
linestyle='--')
plt.plot(x, np.tan(x), label='Tangent', color='green',
linestyle='-.')
plt.title("Sine, Cosine and Tangent Waves")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.legend()
plt.grid(True)
plt.show()
193 / 199
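One caveat: tan(x) diverges near its asymptotes, so the tangent spikes compress the sine and cosine curves. A common workaround (not part of the original slide) is to clip the visible range before calling plt.show():
# Limit the y-axis so the tangent spikes do not dominate
plt.ylim(-2, 2)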
Matplotlib
[Figure: sine, cosine, and tangent curves produced by the preceding code]
194 / 199
Matplotlib
# Plot sample bar and line chart
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C']
values = [10, 20, 15]
plt.bar(categories, values, color='green')
plt.plot(categories, values, color='red', linewidth=2)
plt.title("Sample Bar and Line Chart")
plt.ylabel("Values")
plt.show()
195 / 199
Matplotlib
[Figure: bar chart of categories A, B, C with an overlaid line, produced by the preceding code]
196 / 199
Summary
This lecture covers:
▶ The basics of Exploratory Data Analysis (EDA) and its
significance.
▶ Measurement scales, data types, and data analysis
methodologies.
▶ The steps involved in EDA: gathering data, cleaning it,
visualizing it, and developing hypotheses.
▶ The differences between Bayesian, exploratory, and classical
analysis techniques.
▶ Python libraries for EDA (NumPy, pandas, SciPy, and
Matplotlib) and related software tools.
197 / 199
References I
TEXTBOOK
[1] Mukhiya, S. K., & Ahmed, U. (2020). Hands-On Exploratory
Data Analysis with Python: Perform EDA techniques to
understand, summarize, and investigate your data. Packt
Publishing Ltd.
REFERENCE BOOKS
[1] Pearson, R. K. (2020). Exploratory Data Analysis Using R (1st
ed.). CRC Press.
[2] Datar, R., & Garg, H. (2019). Hands-on exploratory data
analysis with R: Become an expert in exploratory data analysis
using R packages. Packt Publishing Ltd.
198 / 199
References II
ONLINE RESOURCES
[1] Python Pool. (2021, June 14). Numpy Axis in Python with
detailed examples. Python Pool.
https://www.pythonpool.com/numpy-axis/
[2] GeeksforGeeks. (2025, July 28). Working with Missing Data in
Pandas. GeeksforGeeks.
https://www.geeksforgeeks.org/data-analysis/
working-with-missing-data-in-pandas/
[3] GeeksforGeeks. (2025, July 26). Python | Pandas merging,
joining and concatenating. GeeksforGeeks.
https://www.geeksforgeeks.org/python/
python-pandas-merging-joining-and-concatenating/
199 / 199

More Related Content

PDF
Visual Aids for Exploratory Data Analysis.pdf
PDF
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
PDF
Linked List Data Structures .
PDF
The Value of Business Intelligence .
PDF
Business Intelligence and Information Exploitation.pdf
PDF
Introduction to Data Structures .
PDF
Searching and Sorting Algorithms
PDF
Multidimensional Data
Visual Aids for Exploratory Data Analysis.pdf
20ES1152 Programming for Problem Solving Lab Manual VRSEC.pdf
Linked List Data Structures .
The Value of Business Intelligence .
Business Intelligence and Information Exploitation.pdf
Introduction to Data Structures .
Searching and Sorting Algorithms
Multidimensional Data

Recently uploaded (20)

PDF
III.4.1.2_The_Space_Environment.p pdffdf
PDF
Soil Improvement Techniques Note - Rabbi
PDF
PPT on Performance Review to get promotions
PPTX
Artificial Intelligence
PPTX
UNIT - 3 Total quality Management .pptx
PDF
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
PPTX
Information Storage and Retrieval Techniques Unit III
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PPT
Total quality management ppt for engineering students
PDF
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PDF
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
PDF
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Categorization of Factors Affecting Classification Algorithms Selection
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
UNIT 4 Total Quality Management .pptx
III.4.1.2_The_Space_Environment.p pdffdf
Soil Improvement Techniques Note - Rabbi
PPT on Performance Review to get promotions
Artificial Intelligence
UNIT - 3 Total quality Management .pptx
EXPLORING LEARNING ENGAGEMENT FACTORS INFLUENCING BEHAVIORAL, COGNITIVE, AND ...
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
Safety Seminar civil to be ensured for safe working.
null (2) bgfbg bfgb bfgb fbfg bfbgf b.pdf
Information Storage and Retrieval Techniques Unit III
Fundamentals of safety and accident prevention -final (1).pptx
Total quality management ppt for engineering students
COURSE DESCRIPTOR OF SURVEYING R24 SYLLABUS
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
SMART SIGNAL TIMING FOR URBAN INTERSECTIONS USING REAL-TIME VEHICLE DETECTI...
Integrating Fractal Dimension and Time Series Analysis for Optimized Hyperspe...
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Categorization of Factors Affecting Classification Algorithms Selection
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
UNIT 4 Total Quality Management .pptx
Ad
Ad

Exploratory_Data_Analysis_Fundamentals.pdf

  • 1. Exploratory Data Analysis Fundamentals Ashutosh Satapathy, Ph.D. Asst. Prof., Department of CSE, Siddhartha Academy of Higher Education (Deemed to be University) Vijayawada - 520007 1 / 199
  • 2. Outline Introduction to EDA Data From Data to Knowledge Exploratory Data Analysis Understanding Data Science Data Science Phases of Data Science The Significance of EDA Why is EDA Significant? The Role of EDA Steps in EDA Example: EDA for a Fitness App Making Sense of Data Data Matters Dataset Data Storage Numerical Data Categorical Data Measurement Scales Data Analysis Approaches Classical Data Analysis Exploratory Data Analysis Bayesian Data Analysis Key Differences Examples Software Tools for EDA Python R Programming Weka KNIME EDA using Python NumPy pandas SciPy Matplotlib Summary References 2 / 199
  • 3. What is Data? ▶ A collection of discrete objects, numbers, words, events, facts, measurements, observations, or descriptions. ▶ Generated by processes in various disciplines: ▶ Biology: Genetic sequences, protein structures ▶ Economics: Market trends, GDP data ▶ Engineering: Sensor readings, performance metrics ▶ Marketing: Customer preferences, sales data ▶ Example: A dataset of customer purchases in a retail store includes product IDs, purchase dates, and amounts spent. 3 / 199
  • 4. From Data to Knowledge ▶ Data: Raw facts and figures (e.g., sales numbers: $500, $300, $700). ▶ Information: Processed data with context (e.g., average sales per day: $500). ▶ Knowledge: Insights derived from information (e.g., sales peak on weekends). ▶ Goal: Transform raw data into actionable knowledge. ▶ Example: Analyzing website traffic data to identify peak visiting hours, leading to optimized ad schedules. 4 / 199
  • 5. What is Exploratory Data Analysis (EDA)? ▶ A process to examine datasets and uncover: ▶ Patterns ▶ Anomalies ▶ Hypotheses ▶ Assumptions ▶ Uses statistical measures and visualizations. ▶ Performed before formal modeling or hypothesis testing. ▶ Example: Plotting sales data to spot seasonal trends or outliers (e.g., a sudden spike in sales due to a promotion). 5 / 199
  • 6. Why is EDA? ▶ Helps statisticians understand data characteristics. ▶ Uncovers hidden insights before formal modeling. ▶ Guides hypothesis generation and data collection strategies. ▶ Prevents incorrect assumptions in modeling. ▶ Example: 1. In a medical study, EDA reveals missing values in patient records, prompting data cleaning before analysis. 2. EDA on patient data reveals inconsistent heart rate readings, prompting sensor recalibration. 6 / 199
  • 7. Key Steps in EDA 1. Data Collection: Gather raw data (e.g., sensor readings from a manufacturing plant). 2. Data Cleaning: Handle missing values, outliers (e.g., removing erroneous temperature readings). 3. Descriptive Statistics: Compute mean, median, variance (e.g., average production rate). 4. Visualization: Create plots (histograms, scatter plots) to identify trends. 5. Hypothesis Generation: Formulate questions based on patterns (e.g., does production rate vary by shift?). 7 / 199
  • 8. Example: EDA in Retail Sales ▶ Dataset: Daily sales data for a clothing store over one year. ▶ Steps: ▶ Check for missing sales entries. ▶ Calculate average sales per month. ▶ Plot sales trends using a line graph. ▶ Identify outliers (e.g., Black Friday sales spike). ▶ Insight: Sales peak during holiday seasons, suggesting increased inventory in November–December. 8 / 199
  • 9. Tools for EDA ▶ Programming Languages: Python (pandas, matplotlib), R (ggplot2). ▶ Software: Excel, Tableau, Power BI. ▶ Visualization Techniques: Histograms, box plots, scatter plots, heatmaps. ▶ Example: Using Python to create a box plot of customer spending to detect high-spending outliers. 9 / 199
  • 10. Benefits of EDA ▶ Uncovers hidden patterns (e.g., customer churn trends). ▶ Detects data quality issues (e.g., duplicate entries). ▶ Informs better data collection strategies. ▶ Supports development of robust models. ▶ Example: EDA on weather data reveals inconsistent sensor readings, leading to sensor recalibration. 10 / 199
  • 11. What is Data Science? ▶ A cross-disciplinary field combining: ▶ Computer Science ▶ Statistics ▶ Mathematics ▶ Domain Knowledge ▶ Involves building models and extracting insights for business intelligence. ▶ No Ph.D. required—practical skills are key. ▶ Example: Predicting customer churn using purchase history and machine learning. 11 / 199
  • 12. Phases of Data Science ▶ These phases are similar to Cross-Industry Standard Process for Data Mining (CRISP-DM) framework: 1. Data Requirements 2. Data Collection 3. Data Processing 4. Data Cleaning 5. Exploratory Data Analysis (EDA) 6. Modeling and Algorithms 7. Data Product 8. Communication ▶ Each phase builds toward actionable insights. 12 / 199
  • 13. Data Requirements ▶ Identify and categorize data needed for analysis: ▶ Numerical (e.g., heart rate) ▶ Categorical (e.g., patient gender) ▶ Define storage and dissemination formats. ▶ Example: For a dementia study, collect sleep patterns, heart rate, and activity data from sensors to assess mental state. 13 / 199
  • 14. Data Collection ▶ Gather data from various sources (sensors, databases, APIs). ▶ Ensure proper storage and transfer to IT systems. ▶ Example: Collecting customer feedback from surveys and social media to analyze sentiment. 14 / 199
  • 15. Data Processing ▶ Pre-curate data before analysis. ▶ Tasks: Exporting, structuring, and formatting data into tables. ▶ Example: Converting raw sensor data into a structured CSV file with columns for time, value, and sensor ID. 15 / 199
  • 16. Data Cleaning ▶ Address incompleteness, duplicates, errors, and missing values. ▶ Techniques: Record matching, outlier detection, filling missing values. ▶ Example: Removing duplicate customer entries in a sales dataset and imputing missing purchase amounts using the median. 16 / 199
  • 17. Exploratory Data Analysis (EDA) ▶ Core stage to uncover patterns, anomalies, and hypotheses. ▶ Uses descriptive statistics and visualizations. ▶ May involve data transformation techniques. ▶ Example: Plotting sales data to identify seasonal trends or outliers (e.g., a spike during a holiday sale). 17 / 199
  • 18. Modeling and Algorithms ▶ Build models to represent relationships between variables. ▶ Judd model: Data = Model + Error. ▶ Example: Linear regression model for pen purchases: Total = UnitPrice × Quantity where Total is the dependent variable, and UnitPrice is independent. 18 / 199
  • 19. Data Product ▶ Software that uses data inputs to produce outputs and feedback. ▶ Based on models from analysis. ▶ Example: A recommendation system suggesting products based on user purchase history. 19 / 199
  • 20. Communication ▶ Share results with stakeholders via visualizations (tables, charts, diagrams). ▶ Drives business intelligence and decision-making. ▶ Example: A bar chart showing monthly sales trends to guide inventory planning. 20 / 199
  • 21. Why is EDA Significant? ▶ Data is collected in fields like science, economics, engineering, and marketing. ▶ Large datasets are stored in electronic databases, making manual analysis impossible. ▶ EDA is the first step in data mining to: ▶ Visualize data ▶ Understand patterns ▶ Create hypotheses ▶ Example: A store collects sales data to find which products sell best during holidays. 21 / 199
  • 22. The Role of EDA ▶ Reveals insights without assumptions (ground truth). ▶ Helps data scientists decide on models and hypotheses. ▶ Key components: ▶ Summarizing data (e.g., averages, totals) ▶ Statistical analysis (e.g., correlations) ▶ Visualization (e.g., graphs, charts) ▶ Example: Plotting student grades to spot trends, like higher scores in math vs. history. 22 / 199
  • 23. Steps in EDA ▶ EDA involves four key steps: 1. Problem Definition 2. Data Preparation 3. Data Analysis 4. Development and Representation of Results ▶ Each step builds toward clear, actionable insights. 23 / 199
  • 24. Step 1: Problem Definition ▶ Define the business problem to guide analysis. ▶ Tasks: ▶ Set objectives (e.g., increase sales) ▶ List deliverables (e.g., a report) ▶ Assess data status and costs ▶ Example: A café wants to know which drinks sell best to plan inventory. 24 / 199
  • 25. Step 2: Data Preparation ▶ Prepare data for analysis by: ▶ Identifying data sources (e.g., sales records) ▶ Cleaning and transforming data ▶ Dividing data into chunks ▶ Example: Organizing student survey data into tables for analysis of study habits. 25 / 199
  • 26. Step 3: Data Analysis ▶ Analyze data using: ▶ Descriptive statistics (e.g., mean, median) ▶ Correlation analysis ▶ Predictive models ▶ Example: Calculating average test scores and finding if study time correlates with grades. 26 / 199
  • 27. Step 4: Development and Representation ▶ Present results to stakeholders using: ▶ Graphs (e.g., histograms, scatter plots) ▶ Summary tables ▶ Maps or diagrams ▶ Goal: Make results clear for decision-making. ▶ Example: A bar chart showing top-selling café drinks for the manager. 27 / 199
  • 28. Example: EDA for a Fitness App ▶ Problem: Understand user activity patterns. ▶ Data Preparation: Collect step counts from fitness trackers. ▶ Data Analysis: Calculate average steps per day, find peak activity times. ▶ Representation: Create a line graph showing daily steps over a month. ▶ Insight: Users walk more on weekends, suggesting targeted promotions. 28 / 199
  • 29. Data Matters ▶ Data is everywhere: hospitals, universities, real estate, and more. ▶ Understanding data types helps analyze it correctly. ▶ Example: Hospitals store patient data to track health trends, like weight or age. ▶ Goal: Turn raw data into meaningful insights. 29 / 199
  • 30. Dataset ▶ A collection of observations about an object. ▶ Each observation has variables (features) describing it. ▶ Example: A hospital dataset includes patient details like: ▶ Patient ID ▶ Name ▶ Address ▶ Date of Birth ▶ Email ▶ Gender ▶ Weight 30 / 199
  • 31. Example: Hospital Patient Dataset ▶ Each row is an observation (a patient). ▶ Each column is a variable (e.g., Name, Weight). ▶ Example entry: ▶ PATIENT_ID: 002 ▶ Name: Yoshmi Mukhiya ▶ Address: Mannsverk 61, 5094, Bergen ▶ DOB: 10.07.2018 ▶ Email: yoshmimukhiya@gmail.com ▶ Gender: Female ▶ Weight: 10 31 / 199
  • 32. How Data is Stored ▶ Stored in database management systems as tables/schemas. ▶ Each table has rows (observations) and columns (variables). ▶ Example: A hospital patient table: Table 1: An example of a table for storing patient information ID Name Address DOB Email Gender Weight 001 Suresh Mukhiya Mannsverk 61 30.12.1989 skmu@hvl.no Male 68 002 Yoshmi Mukhiya Mannsverk 61, 5094, Bergen 10.07.2018 yoshmimukhiya@gmail.com Female 10 003 Anju Mukhiya Mannsverk 61, 5094, Bergen 10.12.1997 anjumukhiya@gmail.com Female 24 004 Asha Gaire Butwal, Nepal 30.11.1990 aasha.gaire@gmail.com Female 23 005 Ola Nordmann Danmark, Sweden 12.12.1789 ola@gmail.com Male 75 32 / 199
  • 33. Numerical Data ▶ Data involving measurements or quantities. ▶ Also called quantitative data in statistics. ▶ Examples: ▶ Age (e.g., 20 years) ▶ Height (e.g., 170 cm) ▶ Weight (e.g., 65 kg) ▶ Heart rate (e.g., 72 bpm) ▶ Number of family members (e.g., 4) ▶ Used in fields like medicine, sports, and research. 33 / 199
  • 34. Types of Numerical Data ▶ Numerical data is divided into two types: ▶ Discrete Data: Countable, fixed values. ▶ Continuous Data: Infinite values within a range. ▶ Understanding these types helps in data analysis. ▶ Example: Number of teeth (discrete) vs. body temperature (continuous). 34 / 199
  • 35. Discrete Data ▶ Data that is countable with a finite set of values. ▶ Represented by a discrete variable. ▶ Examples: ▶ Number of heads in 200 coin flips (0 to 200). ▶ Country (e.g., Nepal, India, Norway, Japan). ▶ Student rank in class (e.g., 1, 2, 3, 4). ▶ Number of cars in a parking lot (e.g., 25). ▶ Discrete data has distinct, separate values. 35 / 199
  • 36. Example: Discrete Data ▶ Scenario: Counting students in a classroom. ▶ Variable: Number of students present. ▶ Values: 20, 21, 22, ..., 30 (finite and countable). ▶ Analysis: Calculate the average attendance over a week. ▶ Visual: Bar chart showing daily student counts. 36 / 199
  • 37. Continuous Data ▶ Data with an infinite number of values within a range. ▶ Represented by a continuous variable. ▶ Examples: ▶ Temperature (e.g., 25.3°C, 25.31°C, 25.312°C). ▶ Weight (e.g., 65.2 kg, 65.25 kg). ▶ Height (e.g., 170.5 cm, 170.51 cm). ▶ Time to run 100 meters (e.g., 12.345 seconds). ▶ Continuous data can take any value in a range. 37 / 199
  • 38. Example: Continuous Data ▶ Scenario: Measuring student heights in a class. ▶ Variable: Height. ▶ Values: Any number between 150 cm and 190 cm (e.g., 165.7 cm). ▶ Analysis: Find the average height of students. ▶ Visual: Histogram showing height distribution. 38 / 199
  • 39. Discrete vs. Continuous Table 2: Discrete data vs. continuous data Discrete Data Continuous Data Definition Countable, fixed values Infinite values in a range Examples Number of students, rank Weight, temperature Variable Type Discrete variable Continuous variable Analysis Counts, frequencies Averages, ranges ▶ Example: Number of cars (discrete) vs. car speed (continuous). 39 / 199
  • 40. Example: Car Dataset ▶ Dataset: Cars with variables like: ▶ Number of seats (discrete: 2, 4, 5, 7). ▶ Weight (continuous: e.g., 1200.5 kg). ▶ Speed (continuous: e.g., 180.3 km/h). ▶ EDA Tasks: ▶ Count cars by number of seats (discrete). ▶ Calculate average weight (continuous). ▶ Plot speed distribution (continuous). 40 / 199
  • 41. What is Categorical Data? ▶ Represents characteristics or qualities of an object. ▶ Also called qualitative data in statistics. ▶ Examples: ▶ Gender (Male, Female, Other) ▶ Marital Status (Married, Single, Divorced) ▶ Movie Genres (Comedy, Drama, Action) ▶ Blood Type (A, B, AB, O) ▶ Drug Types (Stimulants, Opioids, Cannabis) ▶ Used in fields like medicine, marketing, and social sciences. 41 / 199
  • 42. Categorical Variables ▶ Variables that describe categorical data. ▶ Have a limited number of values (like enumerated types in computer science). ▶ Two main types: ▶ Binary (Dichotomous): Exactly two values. ▶ Polytomous: More than two values. ▶ Example: Gender (Male, Female) vs. Movie Genres (Action, Comedy, Drama, etc.). 42 / 199
  • 43. Binary Categorical Variables ▶ Take exactly two values (dichotomous). ▶ Examples: ▶ Experiment Result: Success or Failure ▶ Attendance: Present or Absent ▶ Light Switch: On or Off ▶ Easy to analyze due to only two options. ▶ Example: Checking if a student passed (Yes/No) an exam. 43 / 199
  • 44. Polytomous Categorical Variables ▶ Take more than two values. ▶ Examples: ▶ Marital Status: Married, Single, Divorced, Widowed, etc. ▶ Movie Genres: Action, Comedy, Drama, Horror, etc. ▶ Blood Type: A, B, AB, O ▶ Example: Surveying students’ favorite subjects (Math, Science, History, Art). 44 / 199
  • 45. Measurement Scales ▶ Four different types of measurement scales in statistics. ▶ These scales are used more in academic industries. 1. Nominal 2. Ordinal 3. Interval 4. Ratio 45 / 199
  • 46. What are Nominal Scales? ▶ Labels for categorical variables without quantitative value. ▶ Mutually exclusive and carry no numerical importance. ▶ Considered qualitative data in statistics. ▶ Examples: ▶ Gender: Male, Female, Non-binary, Other, Prefer not to answer ▶ Languages: English, Spanish, Hindi ▶ Biological Species: Archea, Bacteria, Eukarya 46 / 199
  • 47. Characteristics of Nominal Scales ▶ No order or ranking among categories. ▶ No arithmetic and comparison operations (e.g., addition, subtraction, multiplication, division, mean, greater than, less than) possible. ▶ Numbers as labels have no numerical meaning (e.g., "1 = Male" is just a label). ▶ Example: Labeling parts of speech (noun, verb, adjective) has no numerical value. 47 / 199
  • 48. Examples of Nominal Scales ▶ Common nominal variables: ▶ Gender: Male, Female, Non-binary, Other ▶ Country Languages: Norwegian, Japanese, Nepali ▶ Movie Genres: Comedy, Action, Drama ▶ Taxonomic Ranks: Archea, Bacteria, Eukarya ▶ Parts of Speech: Noun, Pronoun, Adjective ▶ Example: Survey asking students’ favorite food: Pizza, Sushi, Pasta. 48 / 199
  • 49. Analyzing Nominal Data ▶ Key methods: ▶ Frequency: Count how often a label appears. ▶ Proportion: Frequency divided by total events. ▶ Percentage: Proportion multiplied by 100. ▶ Example: In a class of 50 students: ▶ Gender: 25 Male, 20 Female, 5 Non-binary ▶ Proportion: Female = 20/50 = 0.4 ▶ Percentage: Female = 40% 49 / 199
  • 50. Visualizing Nominal Data ▶ Use pie charts or bar charts for nominal data. ▶ Not suitable for histograms (used for numerical data). ▶ Example: Bar chart showing student preferences: ▶ Favorite Food: Pizza (20), Sushi (15), Pasta (10) ▶ Visuals make frequencies and proportions clear to stakeholders. 50 / 199
  • 51. Example: Student Survey Dataset ▶ Dataset: Survey of 100 students with nominal variables: ▶ Gender: Male, Female, Non-binary, Other ▶ Favorite Subject: Math, Science, History ▶ Analysis: ▶ Frequency: 50 Male, 40 Female, 10 Transgender ▶ Proportion: Male = 50/100 = 0.5 ▶ Visualization: Pie chart of Gender distribution 51 / 199
  • 52. Nominal Scales in Real Life ▶ Used in surveys, classifications, and categorizations. ▶ Examples: ▶ Customer survey: Preferred brand (Nike, Adidas, Puma) ▶ Biology: Classifying species (Lion, Tiger, Leopard) ▶ Social media: Post category (Photo, Video, Text) ▶ Analysis: Count how many customers prefer each brand. 52 / 199
  • 53. What are Ordinal Scales? ▶ Categorical data with a significant order of values. ▶ Tip: "Ordinal" sounds like "order" (1st, 2nd, 3rd, etc.). ▶ Differs from nominal scales (no order, e.g., Gender). ▶ Examples: ▶ Satisfaction: Low, Medium, High ▶ Education Level: High School, Bachelor’s, Master’s ▶ Race Position: 1st, 2nd, 3rd 53 / 199
  • 54. Ordinal vs. Nominal Scales ▶ Nominal: No order (e.g., Blood Type: A, B, AB, O). ▶ Ordinal: Ordered categories (e.g., Satisfaction: Low, Medium, High). ▶ Key Difference: Order matters in ordinal scales. ▶ Example: ▶ Nominal: Favorite Color (Red, Blue, Green). ▶ Ordinal: Class Rank (1st, 2nd, 3rd). 54 / 199
  • 55. What is the Likert Scale? ▶ A type of ordinal scale with ordered response options. ▶ Used to measure opinions, attitudes, or feelings. ▶ Example Question: "WordPress is making content managers’ lives easier." ▶ Responses (5-point Likert Scale): ▶ Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree ▶ Visual: See next slide for diagram. 55 / 199
  • 56. Likert Scale Example Feelings ▶ Options: 1 - Very Unhappy, 2 - Unhappy, 3 - OK, 4 - Happy, 5 - Very Happy ▶ Order matters: 1 is less happy than 5. ▶ Example: A student rates their mood as "Happy" (4). Satisfaction ▶ Options: 1 - Very Unsatisfied, 2 - Somewhat Unsatisfied, 3 - Neutral, 4 - Somewhat Satisfied, 5 - Very Satisfied ▶ Example: A customer rates service as "Somewhat Satisfied" (4). 56 / 199
  • 57. Example: Student Survey Dataset ▶ Dataset: Survey of 50 students with ordinal variables: ▶ Effort Level: Low, Medium, High ▶ Course Difficulty: Easy, Moderate, Hard ▶ Analysis: ▶ Count: 20 Low, 20 Medium, 10 High Effort ▶ Median: Medium Effort ▶ Visualization: Bar chart of Effort Levels 57 / 199
  • 58. Visualizing Ordinal Data ▶ Use bar charts to show order and frequency. ▶ Avoid pie charts if order is critical (use for nominal data instead). ▶ Example: Bar chart of Likert responses: ▶ Strongly Agree: 10 ▶ Agree: 15 ▶ Neutral: 10 ▶ Disagree: 5 ▶ Strongly Disagree: 5 58 / 199
  • 59. Real-World Example: Customer Feedback ▶ Dataset: 100 customer reviews with: ▶ Satisfaction: 1 - Very Unsatisfied, 2 - Unsatisfied, 3 - Neutral, 4 - Satisfied, 5 - Very Satisfied ▶ Analysis: ▶ Median: Neutral (3) ▶ Frequency: 30 Satisfied, 25 Neutral, etc. ▶ Action: Improve service based on low satisfaction feedback. 59 / 199
  • 60. Why Learn Ordinal Scales? ▶ Order helps rank and compare data (e.g., satisfaction levels). ▶ Guides correct statistical measures (median, not mean). ▶ Essential for surveys and Likert scale analysis in EDA. ▶ Example: Ranking student performance (1st, 2nd, 3rd) for awards. 60 / 199
  • 61. What are Interval and Ratio Scales? ▶ Interval Scales: Order and exact differences between values matter. ▶ Ratio Scales: Include order, exact differences, and a true zero. ▶ Both extend beyond nominal and ordinal scales for advanced analysis. ▶ Example: Temperature (interval) vs. Height (ratio). 61 / 199
  • 62. Interval Scales ▶ Order and exact differences between values are significant. ▶ Used in statistics (e.g., mean, median, mode, standard deviation). ▶ No true zero (e.g., 0°C doesn’t mean no temperature). ▶ Examples: ▶ Temperature in Celsius (°C) ▶ Location in Cartesian coordinates (x, y) ▶ Direction in degrees from magnetic north ▶ Example: Difference between 20°C and 30°C is the same as 30°C and 40°C. 62 / 199
  • 63. Properties of Interval Scales ▶ Provides: Order, frequency, mode, median, mean. ▶ Can quantify differences between values. ▶ Can add or subtract values (e.g., 20°C - 10°C = 10°C). ▶ Cannot multiply/divide or use a true zero. ▶ Example: Average temperature of a week (e.g., mean = 25°C). 63 / 199
  • 64. Example: Interval Scale in Action ▶ Dataset: Daily temperatures (°C) for a week: ▶ Mon: 20°C, Tue: 22°C, Wed: 25°C, Thu: 23°C, Fri: 21°C, Sat: 24°C, Sun: 26°C ▶ Analysis: ▶ Mean: (20 + 22 + 25 + 23 + 21 + 24 + 26) / 7 = 23°C ▶ Difference: 25°C - 20°C = 5°C ▶ Visualization: Line graph of temperature trends. 64 / 199
  • 65. Ratio Scales ▶ Include order, exact differences, and a true zero (e.g., 0 kg = no mass). ▶ Enable advanced statistical analysis (e.g., mean, variance, ratios). ▶ Examples: ▶ Mass (e.g., 50 kg) ▶ Length (e.g., 2 meters) ▶ Duration (e.g., 5 seconds) ▶ Volume (e.g., 10 liters) ▶ Example: Height of students (0 cm = no height is meaningful). 65 / 199
  • 66. Properties of Ratio Scales ▶ Provides: Order, frequency, mode, median, mean, differences. ▶ Can quantify differences, add/subtract, multiply/divide values. ▶ Has a true zero, allowing ratios (e.g., 10 kg is twice 5 kg). ▶ Example: Average weight of a class (e.g., mean = 60 kg). 66 / 199
  • 67. Example: Ratio Scale in Action ▶ Dataset: Weights (kg) of 5 students: ▶ Student 1: 50 kg, Student 2: 55 kg, Student 3: 60 kg, Student 4: 65 kg, Student 5: 70 kg ▶ Analysis: ▶ Mean: (50 + 55 + 60 + 65 + 70) / 5 = 60 kg ▶ Ratio: 70 kg is 1.4 times 50 kg ▶ Visualization: Bar chart of weights. 67 / 199
  • 68. Comparison of All Scales Table 3: A summary of the data types and scale measures Provides: Nominal Ordinal Interval Ratio The "order" of values is known ✓ ✓ ✓ "Counts," aka "Frequency of Distribution" ✓ ✓ ✓ ✓ Mode ✓ ✓ ✓ ✓ Median ✓ ✓ ✓ Mean ✓ ✓ Can quantify the difference between each value ✓ ✓ Can add or subtract values ✓ ✓ Can multiple and divide values ✓ Has "true zero" ✓ ▶ Example: Gender (nominal) vs. Temperature (interval) vs. Weight (ratio). 68 / 199
  • 69. Real-World Example: Weather Data ▶ Interval: Temperature (°C) over a month. ▶ Mean: 22°C, Difference: 25°C - 20°C = 5°C ▶ Ratio: Rainfall (mm) in the same month. ▶ Mean: 50 mm, Ratio: 100 mm is twice 50 mm ▶ Use: Plan irrigation based on rainfall ratios. 69 / 199
  • 70. Why Learn Interval and Ratio Scales? ▶ Enable precise statistical analysis (e.g., mean, ratios). ▶ Critical for fields like science, engineering, and economics. ▶ Guide correct visualization and modeling choices. ▶ Example: Analyzing test scores (interval) or distances (ratio) in a project. 70 / 199
  • 71. Data Analysis Approaches ▶ Three key methods: ▶ Classical Data Analysis ▶ Exploratory Data Analysis (EDA) ▶ Bayesian Data Analysis ▶ Each has a unique process for handling data. ▶ Example: Analyzing student exam scores using different methods. 71 / 199
  • 72. Classical Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Model Development ▶ Data Analysis ▶ Results Communication ▶ Focus: Build a model first, then analyze data. ▶ Example: Predicting student grades with a pre-set linear model. 72 / 199
  • 73. Exploratory Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Data Analysis ▶ Model Development ▶ Results Communication ▶ Focus: Explore data (outliers, patterns) before modeling. ▶ No imposed models; emphasizes visualizations. ▶ Example: Plotting exam scores to spot trends before modeling. 73 / 199
  • 74. Bayesian Data Analysis ▶ Steps: ▶ Problem Definition ▶ Data Collection ▶ Model Development ▶ Prior Distribution ▶ Data Analysis ▶ Results Communication ▶ Uses prior probability (belief before evidence). ▶ Example: Using past exam trends as prior to predict current scores. 74 / 199
  • 75. Key Differences ▶ Classical: Model first, then data analysis. ▶ EDA: Data exploration first, flexible modeling. ▶ Bayesian: Incorporates prior beliefs. ▶ Example: Classical fits a grade model directly; EDA checks score distribution first. 75 / 199
  • 76. Why Compare These Approaches? ▶ Choose the best method for your data and goals. ▶ EDA is great for initial exploration; Classical for structured analysis. ▶ Bayesian adds prior knowledge for accuracy. ▶ Example: Use EDA to explore customer feedback, then Bayesian for predictions. 76 / 199
  • 77. Example: Student Exam Scores ▶ Classical: Define problem (predict grades), collect scores, build a regression model, analyze, report. ▶ EDA: Collect scores, plot distribution (e.g., histogram), identify outliers, then model. ▶ Bayesian: Use last year’s grade trends as prior, update with new data, analyze. ▶ Outcome: Different insights based on approach. 77 / 199
  • 78. Real-World Example: Sales Data ▶ Classical: Build a sales forecast model, then analyze monthly data. ▶ EDA: Explore sales trends (e.g., bar chart), then model seasonal patterns. ▶ Bayesian: Use last year’s sales as prior, refine with current data. ▶ Goal: Optimize inventory based on insights. 78 / 199
  • 79. Software Tools for EDA? ▶ Facilitate data exploration, visualization, and analysis. ▶ Open-source tools are free and widely accessible. ▶ Help uncover patterns, outliers, and insights. ▶ Example: Analyzing student performance data to find trends. 79 / 199
  • 80. EDA Open-Source Tools ▶ Popular tools for EDA include: ▶ Python ▶ R Programming Language ▶ Weka ▶ KNIME ▶ Each offers unique features for data analysis. 80 / 199
  • 81. Python ▶ Open-source programming language. ▶ Widely used for data analysis, mining, and data science. ▶ Link: https://www.python.org/ ▶ Features: Libraries like pandas, matplotlib for EDA. ▶ Example: Plotting a histogram of exam scores using matplotlib. 81 / 199
  • 82. R Programming ▶ Open-source language for statistical computation. ▶ Strong in graphical data analysis. ▶ Link: https://www.r-project.org ▶ Features: Packages like ggplot2 for visualizations. ▶ Example: Creating a bar chart of sales data with ggplot2. 82 / 199
  • 83. Weka ▶ Open-source data mining package. ▶ Includes EDA tools and algorithms. ▶ Link: https://www.cs.waikato.ac.nz/ml/weka/ ▶ Features: Data visualization and data preprocessing tools. ▶ Example: Detecting outliers in a dataset of customer purchases. 83 / 199
  • 84. KNIME ▶ Open-source tool for data analysis. ▶ Based on Eclipse platform. ▶ Link: https://www.knime.com/ ▶ Features: Drag-and-drop interface for workflows. ▶ Example: Building a workflow to analyze social media engagement. 84 / 199
  • 85. EDA using Python Python ▶ Programming basics (variables, strings, data types) ▶ Conditionals, functions, sequences, collections, iterations ▶ File handling, object-oriented programming NumPy ▶ Create, copy, and divide arrays ▶ Perform operations on arrays ▶ Array selections, advanced indexing, multi-dimensional arrays ▶ Linear algebra and built-in functions 85 / 199
  • 86. EDA using Python pandas ▶ Create and understand DataFrame objects ▶ Subset and index data ▶ Arithmetic functions, mapping, index management ▶ Styling for visual analysis Matplotlib ▶ Load linear datasets ▶ Adjust axes, grids, labels, titles, legends ▶ Save plots SciPy ▶ Import the package ▶ Use statistical packages ▶ Perform descriptive statistics, inference, analysis 86 / 199
  • 87. Virtual Environment ▶ Essential for isolating Python projects. ▶ Steps: ▶ Install: pip install virtualenv ▶ Create: virtualenv Local_Version_Directory -p Python_System_Directory ▶ Example: virtualenv myenv -p /usr/bin/python3 ▶ Check: Activate and install packages (e.g., pandas). 87 / 199
  • 88. Reading/Writing to Files ▶ Basic file handling is key for data input/output. ▶ Example Code: # Reading/writing to files filename = "datamining.txt" file = open(filename, mode="r", encoding='utf-8') for line in file: lines = file.readlines() print(lines) file.close() ▶ Practice: Read a CSV file of exam scores. 88 / 199
  • 89. Error Handling ▶ Manage errors to ensure robust code. ▶ Example Code: # Handle invalid grade inputs try: val = int(input("Type a number between 47 and 100:")) except ValueError: print("You must type a number between 47 and 100!") else: if (val > 47) and (val <= 100): print("You typed: ", val) else: print("The value you typed is incorrect!") 89 / 199
  • 90. Object-Oriented Concepts ▶ Use classes and objects for structured code. ▶ Example Code: # Mental Health Diseases: Social Anxiety Disorder class Disease: def __init__(self, disease='Depression'): self.type = disease def getName(self): print("Mental Health Diseases: ", self.type) d1 = Disease('Social Anxiety Disorder') d1.getName() ▶ Example: Create a class for student data. 90 / 199
  • 91. NumPy # Create a 1D array using NumPy import numpy as np my1DArray = np.array([1, 8, 27, 64]) print(my1DArray) Output: [ 1 8 27 64] # Create a 2D array using NumPy import numpy as np my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]]) print(my2DArray) Output: [[ 1 2 3 4] [ 2 4 9 16] [ 4 8 18 32]] 91 / 199
  • 92. NumPy # Create and display a 3D array using NumPy import numpy as np my3Darray = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[1, 2, 3, 4], [9, 10, 11, 12]]]) print(my3Darray) Output: [[[ 1 2 3 4] [ 5 6 7 8]] [[ 1 2 3 4] [ 9 10 11 12]]] # Display the memory addresses print(my1DArray.data, my2DArray.data, my3Darray.data) Output: <memory at 0x7f8b1c0a3e80> <memory at 0x7f8b1c0a3f40> <memory at 0x7f8b1c0a4040> 92 / 199
  • 93. NumPy # Display the shapes of 1D, 2D, and 3D NumPy arrays print(my1DArray.shape, my2DArray.shape, my3Darray.shape) Output: (4,) (3, 4) (2, 2, 4) # Display the data types of 1D, 2D, and 3D NumPy arrays print(my1DArray.dtype, my2DArray.dtype, my3Darray.dtype) Output: int64 int64 int64 # Display the strides of 1D, 2D, and 3D NumPy arrays '''The strides (32, 8) in my2DArray indicate that to move from one row to the next, 32 bytes are skipped, and to move from one column to the next, 8 bytes are skipped''' print(my1DArray.strides, my2DArray.strides, my3Darray.strides) Output: (8,) (32, 8) (64, 32, 8) 93 / 199
  • 94. NumPy # Create a 2D array filled with ones import numpy as np ones = np.ones((2,4)) print(ones) Output: [[1. 1. 1. 1.] [1. 1. 1. 1.] [1. 1. 1. 1.]] # Create a 3D array filled with zeros import numpy as np zeros = np.zeros((2,1,4), dtype=np.int16) print(zeros) Output: [[[0 0 0 0]] [[0 0 0 0]]] 94 / 199
  • 95. NumPy # Create a 2D array filled with random values import numpy as np random_array = np.random.random((2,2)) print(random_array) Output: array([[0.44768845, 0.96186535], [0.99402423, 0.88612299]]) # Create a 2D array with uninitialized values import numpy as np emptyArray = np.empty((3,2)) print(emptyArray) Output: [[9.86638798e-316 0.00000000e+000] [6.87990479e-310 6.87990488e-310] [6.87990477e-310 6.87990479e-310]] 95 / 199
  • 96. NumPy # Create a 2D array filled with a specific value import numpy as np fullArray = np.full((2,2), 7) print(fullArray) Output: [[7 7] [7 7]] # Create a 1D array with evenly spaced values import numpy as np evenSpacedArray = np.arange(10, 25, 5) print(evenSpacedArray) Output: [10 15 20] 96 / 199
  • 97. NumPy # Create a 1D array with evenly spaced values import numpy as np evenSpacedArray2 = np.linspace(0, 2, 9) print(evenSpacedArray2) Output: [0. 0.25 0.5 0.75 1. 1.25 1.5 1.75 2. ] ''' Create a NumPy array and save it to a file and load a NumPy array from a text file''' import numpy as np x = np.arange(0.0, 25.0, 1.0) np.savetxt('data.out', x, delimiter=',') z = np.loadtxt('data.out', unpack=True) print(z) Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.] 97 / 199
  • 98. NumPy # Load a NumPy array from a text file import numpy as np my_array2 = np.genfromtxt('data.out', skip_header=1, filling_values=-999) print(my_array2) Output: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24.] # Display the number of dimensions of 1D, 2D, and 3D arrays print(my1DArray.ndim, my2DArray.ndim, my3Darray.ndim) Output: 1 2 3 # Display the total number of elements in 1D, 2D, and 3D arrays print(my1DArray.size, my2DArray.size, my3Darray.size) Output: 4 12 16 98 / 199
  • 99. NumPy # Print information about memory layout print(my1DArray.flags, my2DArray.flags, my3Darray.flags) Output: C_CONTIGUOUS : True F_CONTIGUOUS : True OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED: True WRITEBACKIFCOPY : False 99 / 199
  • 100. NumPy # Print the length of one array element in bytes print(my1DArray.itemsize, my2DArray.itemsize, my3Darray.itemsize) Output: 8 8 8 # Print the total consumed bytes by elements print(my1DArray.nbytes, my2DArray.nbytes, my3Darray.nbytes) Output: 32 96 128 # Sum along Numpy Axes np_array_2d = np.arange(0, 6).reshape([2,3]) print(np_array_2d, np.sum(np_array_2d, axis = 0), np.sum(np_array_2d, axis = 1)) Output: [[0 1 2] [3 4 5]] [3 5 7] [ 3 12] 100 / 199
  • 101. NumPy # Create a subset and slice an array using an index x = np.array([10, 20, 30, 40, 50]) # Select items at index 0 and 1 print(x[0:2]) Output: [10 20] # Select item at row 0 and 1 and column 1 from 2D array y = np.array([[ 1, 2, 3, 4], [ 9, 10, 11 ,12]]) print("m", y[0:2, 1]) print("n",y[0:2, 0:2]) print("l", y[0:2, 2:4]) Output: m = [ 2 10] n = [[ 1 2] [ 9 10]] l = [[ 3 4] [11 12]] 101 / 199
  • 102. NumPy # Specifying conditions biggerThan2 = (y >= 2) print(y[biggerThan2]) Output: [ 2 3 4 9 10 11 12] # Basic operations (+, -, *, /, %) x = np.array([[1, 2, 3], [2, 3, 4]]) y = np.array([[1, 4, 9], [2, 3, -2]]) # Add two array add = np.add(x, y) print(add) Output: [[ 2 6 12] [ 4 6 2]] 102 / 199
  • 103. NumPy # Subtract two array sub = np.subtract(x, y) print(sub) # Multiply two array mul = np.multiply(x, y) print(mul) # Divide x, y div = np.divide(x,y) print(div) # Calculated the remainder of x and y rem = np.remainder(x, y) print(rem) Output: [[ 0 -2 -6] [ 0 0 6]] [[ 1 8 27] [ 4 9 -8]] [[ 1. 0.5 0.33333333] [ 1. 1. -2. ]] [[0 2 3] [0 0 0]] 103 / 199
  • 104. NumPy # Boradcasting - Operate with arrays of different shapes # Rule 1: Two dimensions are operatable if they are equal # Create an array of two dimension A = np.ones((6, 8)) # Shape of A print(A.shape) print(A) Output: (6, 8) [[1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.] [1. 1. 1. 1. 1. 1. 1. 1.]] 104 / 199
  • 105. NumPy # Create another array B = np.random.random((6,8)) # Shape of B print(B.shape) print(B) Output: (6, 8) [[0.06148782 0.10690907 0.92578537 0.29907577 0.42786516 0.01944468 0.14473416 0.30382709] [0.36209211 0.33220132 0.43412798 0.97707517 0.23210006 0.05892264 0.34311993 0.97168464] [0.34048395 0.06280427 0.78917397 0.50310127 0.36555426 0.27233463 0.60115097 0.77911552] [0.39724957 0.38369108 0.10517771 0.97519711 0.49966346 0.51715226 0.50031762 0.91470124] [0.7647788 0.37106634 0.17694871 0.90837723 0.1932456 0.20634914 0.29533289 0.66564862] [0.72985568 0.85682569 0.01275113 0.98932163 0.1776967 0.95006083 0.59139126 0.3131595 ]] 105 / 199
  • 106. NumPy # Sum of A and B, here the shape of both the matrix is same. print(A + B) Output: [[1.06148782 1.10690907 1.92578537 1.29907577 1.42786516 1.01944468 1.14473416 1.30382709] [1.36209211 1.33220132 1.43412798 1.97707517 1.23210006 1.05892264 1.34311993 1.97168464] [1.34048395 1.06280427 1.78917397 1.50310127 1.36555426 1.27233463 1.60115097 1.77911552] [1.39724957 1.38369108 1.10517771 1.97519711 1.49966346 1.51715226 1.50031762 1.91470124] [1.7647788 1.37106634 1.17694871 1.90837723 1.1932456 1.20634914 1.29533289 1.66564862] [1.72985568 1.85682569 1.01275113 1.98932163 1.1776967 1.95006083 1.59139126 1.3131595 ]] 106 / 199
  • 107. NumPy # Rule 2: Two dimensions are compatible when one of them is 1 # Initialize `x` x = np.ones((3,4)) print(x) print(x.shape) Output: [[1. 1. 1. 1.] [1. 1. 1. 1.] [1. 1. 1. 1.]] (3, 4) # Initialize `y` y = np.arange(4) print(y) print(y.shape) Output: [0 1 2 3] (4,) 107 / 199
  • 108. NumPy # Subtract `x` and `y` print(x - y) Output: [[ 1. 0. -1. -2.] [ 1. 0. -1. -2.] [ 1. 0. -1. -2.]] '''Rule 3: Arrays can be broadcast together if they are compatible in all dimensions.''' x = np.ones((2,3)) print("x:", x) Output: x: [[1. 1. 1.] [1. 1. 1.]] 108 / 199
  • 109. NumPy # Initialize 'y' y = np.random.random((2, 1, 3)) print("y:", y) Output: y: [[[0.91087436 0.74716299 0.8804711 ]] [[0.20148139 0.27853328 0.0647736 ]]] # Sum of 'x' and 'y' print("sum: ", x + y) Output: sum: [[[1.91087436 1.74716299 1.8804711 ] [1.91087436 1.74716299 1.8804711 ]] [[1.20148139 1.27853328 1.0647736 ] [1.20148139 1.27853328 1.0647736 ]]] 109 / 199
  • 110. pandas ▶ Open-source Python library for data manipulation and analysis ▶ Created by Wes McKinney in 2008 ▶ Widely used in data science, finance, and research ▶ GitHub: https://github.com/pandas-dev/pandas ▶ Key features: ▶ Data structures: Series and DataFrame ▶ Handling missing data ▶ Data filtering, grouping, and merging 110 / 199
  • 111. Usage of pandas ▶ Simplifies data manipulation tasks ▶ Integrates with other Python libraries (NumPy, Matplotlib, etc.) ▶ Handles large datasets efficiently ▶ Supports various data formats (CSV, Excel, SQL, etc.) ▶ Enables quick data exploration and visualization. 111 / 199
  • 112. pandas # Setting up Pandas environment import numpy as np import pandas as pd !pip install --upgrade pandas print("Pandas Version:", pd.__version__) Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Requirement already satisfied: ... Pandas Version: 2.3.1 # Customizing display dettings for data visibility pd.set_option('display.max_columns', 500) pd.set_option('display.max_rows', 500) 112 / 199
  • 113. pandas # Create a dataframe from a series series = pd.Series([2, 3, 7, 11, 13, 17, 19, 23]) print(series) Output: 0 2 1 3 2 7 3 11 4 13 5 17 6 19 7 23 dtype: int64 113 / 199
  • 114. pandas # Create a dataframe from a series series_df = pd.DataFrame({ 'A': range(1, 5), 'B': pd.Timestamp('20190526'), 'C': pd.Series(5, index=list(range(4)), dtype='float64'), 'D': np.array([3] * 4, dtype='int64'), 'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder", "Eating Disorder"]), 'F': 'Mental health', 'G': 'is challenging' }) display(series_df) Output: 114 / 199
  • 115. pandas # Create a dataframe for a dictionary dict_df = [{'A': 'Apple', 'B': 'Ball'}, {'A': 'Aeroplane', 'B': 'Bat', 'C': 'Cat'}] dict_df = pd.DataFrame(dict_df) print(dict_df) Output: A B C 0 Apple Ball NaN 1 Aeroplane Bat Cat # Create a dataframe from n-dimensional arrays sdf = {'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland', 'Buskerud'], 'ISO-Code':[1,2,3,4,5,6], 'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10, 14910.94], 'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo", "Hamar", "Lillehammer", "Drammen"]} print(pd.DataFrame(sdf)) 115 / 199
  • 116. pandas Output: County ISO-Code Area Administrative centre 0 Østfold 1 4180.69 Sarpsborg 1 Hordaland 2 4917.94 Oslo 2 Oslo 3 454.07 City of Oslo 3 Hedmark 4 27397.76 Hamar 4 Oppland 5 25192.10 Lillehammer 5 Buskerud 6 14910.94 Drammen # Different dataframe style 'plain', 'simple', 'github', 'grid', 'fancy_grid', 'pipe', 'orgtbl', 'jira', 'presto', 'pretty', 'psql', 'rst', 'mediawiki', 'moinmoin', 'youtrack', 'html', 'latex', 'latex_raw', 'latex_booktabs', 'textile' from tabulate import tabulate # displaying the DataFrame print(tabulate(sdf, headers = 'keys', tablefmt = 'psql')) 116 / 199
  • 117. pandas from tabulate import tabulate # Displaying the DataFrame print(tabulate(sdf, headers = 'keys', tablefmt = 'github')) print(tabulate(sdf, headers = 'keys', tablefmt = 'grid')) 117 / 199
  • 119. pandas # Load a dataset from an external source into a DataFrame keys = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship','ethnicity', 'gender', 'capital_gain', 'capital_loss', 'hours_per_week','country_of_origin','income'] df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=keys) print(df.head()) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 119 / 199
  • 120. pandas # Retrieve the first 10 records print(df.head(10)) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 5 37 Private 284582 ... United-States <=50K 6 49 Private 160187 ... Jamaica <=50K 7 52 Self-emp-not-inc 209642 ... United-States >50K 8 31 Private 45781 ... United-States >50K 9 42 Private 159449 ... United-States >50K 120 / 199
  • 121. pandas # Retrieve the last 5 records print(df.tail()) Output: age workclass fnlwgt ... country_of_origin income 32556 27 Private 257302 ... United-States <=50K 32557 40 Private 154374 ... United-States >50K 32558 58 Private 151910 ... United-States <=50K 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 121 / 199
  • 122. pandas # Retrieve the last 10 records print(df.tail(10)) Output: age workclass fnlwgt ... country_of_origin income 32551 32 Private 34066 ... United-States <=50K 32552 43 Private 84661 ... United-States <=50K 32553 32 Private 116138 ... Taiwan <=50K 32554 53 Private 321865 ... United-States >50K 32555 22 Private 310152 ... United-States <=50K 32556 27 Private 257302 ... United-States <=50K 32557 40 Private 154374 ... United-States >50K 32558 58 Private 151910 ... United-States <=50K 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 122 / 199
  • 123. pandas # Display the rows, columns, data types, and memory usage df.info() Output: <class 'pandas.core.frame.DataFrame'> RangeIndex: 32561 entries, 0 to 32560 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 32561 non-null int64 1 workclass 32561 non-null object 2 fnlwgt 32561 non-null int64 3 education 32561 non-null object 4 education_num 32561 non-null int64 ... 12 hours_per_week 32561 non-null int64 13 country_of_origin 32561 non-null object 14 income 32561 non-null object dtypes: int64(6), object(9) memory usage: 3.7+ MB 123 / 199
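A natural companion to info() during EDA is describe(), which summarizes each column; a minimal sketch on the same adult-census df:

# Count, mean, std, min, quartiles, and max for numeric columns
print(df.describe())

# Frequency summary for categorical columns such as workclass
print(df.describe(include='object'))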
  • 124. pandas # Selects a row print(df.iloc[10]) Output: age 37 workclass Private fnlwgt 280464 education Some-college education_num 10 marital_status Married-civ-spouse occupation Exec-managerial relationship Husband ethnicity Black gender Male capital_gain 0 capital_loss 0 hours_per_week 80 country_of_origin United-States income >50K Name: 10, dtype: object 124 / 199
  • 125. pandas # Selects the first 10 rows print(df.iloc[0:10]) Output: age workclass fnlwgt ... country_of_origin income 0 39 State-gov 77516 ... United-States <=50K 1 50 Self-emp-not-inc 83311 ... United-States <=50K 2 38 Private 215646 ... United-States <=50K 3 53 Private 234721 ... United-States <=50K 4 28 Private 338409 ... Cuba <=50K 5 37 Private 284582 ... United-States <=50K 6 49 Private 160187 ... Jamaica <=50K 7 52 Self-emp-not-inc 209642 ... United-States >50K 8 31 Private 45781 ... United-States >50K 9 42 Private 159449 ... United-States >50K 125 / 199
  • 126. pandas # Selects a range of rows df.iloc[10:15] Output: age workclass fnlwgt ... country_of_origin income 10 37 Private 280464 ... United-States >50K 11 30 State-gov 141297 ... India >50K 12 23 Private 122272 ... United-States <=50K 13 32 Private 205019 ... United-States <=50K 14 40 Private 121772 ... ? >50K 126 / 199
  • 127. pandas # Selects the last 2 rows print(df.iloc[-2:]) Output: age workclass fnlwgt ... country_of_origin income 32559 22 Private 201490 ... United-States <=50K 32560 52 Self-emp-inc 287927 ... United-States >50K 127 / 199
  • 128. pandas # Selects every other row from the columns at positions 3 and 4 df.iloc[::2, 3:5].head() Output: education education_num 0 Bachelors 13 2 HS-grad 9 4 Bachelors 13 6 9th 5 8 Masters 14 128 / 199
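iloc is purely positional; its label-based counterpart loc uses index and column labels, with end-inclusive slices. A minimal sketch on the same df:

# Rows labelled 0 through 4 (inclusive) of two named columns
print(df.loc[0:4, ['education', 'income']])

# loc also combines naturally with boolean filters
print(df.loc[df['age'] > 60, ['age', 'hours_per_week', 'income']].head())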
  • 129. pandas # Seed NumPy's random number generator for reproducibility np.random.seed(24) # Build the DataFrame df = pd.DataFrame({'A': np.linspace(1, 10, 10)}) df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))], axis = 1) # DataFrame without any styling print(df) Output: A B C D E 0 1.0 1.329212 -0.770033 -0.316280 -0.990810 1 2.0 -1.070816 -1.438713 0.564417 0.295722 2 3.0 -1.626404 0.219565 0.678805 1.889273 3 4.0 0.961538 0.104011 -0.481165 0.850229 4 5.0 1.453425 1.057737 0.165562 0.515018 5 6.0 -1.336936 0.562861 1.392855 -0.063328 6 7.0 0.121668 1.207603 -0.002040 1.627796 7 8.0 0.354493 1.037528 -0.385684 0.519818 8 9.0 1.686583 -1.325963 1.428984 -2.089354 9 10.0 -0.129820 0.631523 -0.586538 0.290720 129 / 199
  • 130. pandas # Styling DataFrame using DataFrame.style property df.style.set_properties(**{'background-color': 'black', 'color': 'green'}) Output: 130 / 199
  • 131. pandas # Replace values at selected positions with NaN (Not a Number) df.iloc[0, 3] = np.nan df.iloc[2, 3] = np.nan df.iloc[4, 2] = np.nan df.iloc[7, 4] = np.nan print(df) Output: A B C D E 0 1.0 1.329212 -0.770033 NaN -0.990810 1 2.0 -1.070816 -1.438713 0.564417 0.295722 2 3.0 -1.626404 0.219565 NaN 1.889273 3 4.0 0.961538 0.104011 -0.481165 0.850229 4 5.0 1.453425 NaN 0.165562 0.515018 5 6.0 -1.336936 0.562861 1.392855 -0.063328 6 7.0 0.121668 1.207603 -0.002040 1.627796 7 8.0 0.354493 1.037528 -0.385684 NaN 8 9.0 1.686583 -1.325963 1.428984 -2.089354 9 10.0 -0.129820 0.631523 -0.586538 0.290720 131 / 199
  • 132. pandas # Highlight the NaN values in DataFrame df.style.highlight_null(color='red') Output: 132 / 199
  • 133. pandas # Highlight the Min values in each column df.style.highlight_min(axis = 0) Output: 133 / 199
  • 134. pandas # Highlight the Max values in each column df.style.highlight_max(axis = 0) Output: 134 / 199
  • 135. pandas # Highlight the Max values in each row df.style.highlight_max(axis = 1) Output: 135 / 199
  • 136. pandas # Set text color of positive values in DataFrames def color_positive_green(val): """ Takes a scalar and returns a string with the CSS property 'color: green' for positive values, 'color: black' otherwise. """ if val > 0: color = 'green' else: color = 'black' return 'color: %s' % color # Styler.applymap was renamed Styler.map in pandas 2.1; the old name still works with a warning df.style.applymap(color_positive_green) 136 / 199
  • 138. pandas # Import seaborn library import seaborn as sns # Declaring the color palette from seaborn cm = sns.light_palette("green", as_cmap=True) # DataFrame with background gradient and precision df.style.background_gradient(cmap=cm).format(precision = 2) 138 / 199
  • 140. pandas # Checking Missing Values in a DataFrame d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df, "\n") print(df.isnull()) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 False False True 1 False False False 2 True False False 3 False True False 140 / 199
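In practice the per-column count of missing values is often more useful than the full boolean mask; a minimal sketch on the same df:

# Number of NaNs in each column
print(df.isnull().sum())

# Total number of missing cells in the frame
print(df.isnull().sum().sum())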
  • 141. pandas # Filtering Data Based on Missing Values '''Download employees dataset https://media.geeksforgeeks.org/wp-content/uploads/employees.csv ''' import pandas as pd d = pd.read_csv("/content/employees.csv") bool_series = pd.isnull(d["Gender"]) missing_gender_data = d[bool_series] print(missing_gender_data.head()) Output: First Name Gender ... Senior Management Team 0 Lois NaN ... True Legal 22 Joshua NaN ... True Client Services 27 Scott NaN ... True Legal 31 Joyce NaN ... True Product 41 Christine NaN ... True Business Development 141 / 199
  • 142. pandas # Checking for Non-Missing Values import pandas as pd import numpy as np d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df) print(df.notnull()) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 True True False 1 True True True 2 False True True 3 True False True 142 / 199
  • 143. pandas # Filtering Data with Non-Missing Values import pandas as pd d = pd.read_csv("/content/employees.csv") nmg = pd.notnull(d["Gender"]) print(d[nmg].head()) Output: First Name Gender ... Senior Management Team 0 Douglas Male ... True Marketing 1 Thomas Male ... True NaN 2 Maria Female ... True Finance 3 Jerry Male ... True Finance 4 Larry Male ... True Client Services 143 / 199
  • 144. pandas # Filling Missing Values in Pandas d = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan], 'Third Score': [np.nan, 40, 80, 98]} df = pd.DataFrame(d) print(df, "\n") print(df.fillna(0)) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 0.0 1 90.0 45.0 40.0 2 0.0 56.0 80.0 3 95.0 0.0 98.0 144 / 199
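Instead of a constant, numeric NaNs are commonly filled with a column statistic; a minimal sketch on the same all-numeric df:

# Fill each column's NaNs with that column's mean
print(df.fillna(df.mean()))

# The median is a more outlier-robust alternative
print(df.fillna(df.median()))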
  • 145. pandas # Fill with Previous Value (in pandas 2.x, df.ffill() is preferred over fillna(method='pad')) print(df, "\n") print(df.fillna(method='pad')) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 90.0 56.0 80.0 3 95.0 56.0 98.0 145 / 199
  • 146. pandas # Fill with Next Value (in pandas 2.x, df.bfill() is preferred over fillna(method='bfill')) print(df, "\n") print(df.fillna(method='bfill')) Output: First Score Second Score Third Score 0 100.0 30.0 NaN 1 90.0 45.0 40.0 2 NaN 56.0 80.0 3 95.0 NaN 98.0 First Score Second Score Third Score 0 100.0 30.0 40.0 1 90.0 45.0 40.0 2 95.0 56.0 80.0 3 95.0 NaN 98.0 146 / 199
  • 147. pandas # Fill NaN Values with 'No Gender' (newer pandas prefers d["Gender"] = d["Gender"].fillna('No Gender') over inplace) d = pd.read_csv("/content/employees.csv") print(d[20:25]) d["Gender"].fillna('No Gender', inplace = True) print(d[20:25]) Output: First Name Gender ... Senior Management Team 20 Lois NaN ... True Legal 21 Matthew Male ... False Marketing 22 Joshua NaN ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services First Name Gender ... Senior Management Team 20 Lois No Gender ... True Legal 21 Matthew Male ... False Marketing 22 Joshua No Gender ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services 147 / 199
  • 148. pandas # Replace all NaN values with -99 value. d = pd.read_csv("/content/employees.csv") print(d[20:25]) d = d.replace(to_replace = np.nan, value = -99) print(d[20:25]) Output: First Name Gender ... Senior Management Team 20 Lois NaN ... True Legal 21 Matthew Male ... False Marketing 22 Joshua NaN ... True Client Services 23 NaN Male ... NaN NaN 24 John Male ... False Client Services First Name Gender ... Senior Management Team 20 Lois -99 ... True Legal 21 Matthew Male ... False Marketing 22 Joshua -99 ... True Client Services 23 -99 Male ... -99 -99 24 John Male ... False Client Services 148 / 199
  • 149. pandas # Fills missing values using interpolation techniques ''' 'linear' - Linear interpolation between adjacent non-missing values. 'polynomial' - Polynomial interpolation (Order 2 for quadratic). 'nearest' - Fills with the nearest non-missing value. 'zero' - Fills with the previous non-missing value (piecewise constant). 'slinear' - Spline interpolation of order 1 (equivalent to linear in pandas). 'quadratic' - Polynomial interpolation of order 2. 'barycentric' - Barycentric interpolation for smooth approximations. ''' 149 / 199
  • 150. pandas df = pd.DataFrame({"A": [12, 4, 5, None, 1], "B": [None, 2, 54, 3, None], "C": [20, 16, None, 3, 8], "D": [14, 3, None, None, 6]}) print(df) #linear forward interpolation print(df.interpolate(method = 'linear', limit_direction ='forward')) Output: A B C D 0 12.0 NaN 20.0 14.0 1 4.0 2.0 16.0 3.0 2 5.0 54.0 NaN NaN 3 NaN 3.0 3.0 NaN 4 1.0 NaN 8.0 6.0 A B C D 0 12.0 NaN 20.0 14.0 1 4.0 2.0 16.0 3.0 2 5.0 54.0 9.5 4.0 3 3.0 3.0 3.0 5.0 4 1.0 3.0 8.0 6.0 150 / 199
  • 151. pandas # Polynomial interpolation with order 2 print(df.interpolate(method ='polynomial', order = 2)) Output: A B C D 0 12.000000 NaN 20.0 14.0 1 4.000000 2.0 16.0 3.0 2 5.000000 54.0 8.0 -2.0 3 4.578947 3.0 3.0 -1.0 4 1.000000 NaN 8.0 6.0 151 / 199
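Of the other methods listed above, 'nearest' is the easiest to picture: each NaN copies the value of its closest non-missing neighbour. A minimal sketch on the same df (pandas delegates this method to SciPy, so SciPy must be installed; NaNs at the edges may remain unfilled):

print(df.interpolate(method='nearest'))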
  • 152. pandas # Drop rows where all values are missing scores = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [np.nan, np.nan, np.nan, 65]} df = pd.DataFrame(scores) print(df, "\n") print(df.dropna(how = 'all')) Output: First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 NaN 1 NaN NaN NaN NaN 2 NaN 45.0 80.0 NaN 3 95.0 56.0 98.0 65.0 First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 NaN 2 NaN 45.0 80.0 NaN 3 95.0 56.0 98.0 65.0 152 / 199
  • 153. pandas # Remove columns that contain at least one missing value scores = {'First Score': [100, np.nan, np.nan, 95], 'Second Score': [30, np.nan, 45, 56], 'Third Score': [52, np.nan, 80, 98], 'Fourth Score': [60, 67, 68, 65]} df = pd.DataFrame(scores) print(df, "\n") print(df.dropna(axis=1)) Output: First Score Second Score Third Score Fourth Score 0 100.0 30.0 52.0 60 1 NaN NaN NaN 67 2 NaN 45.0 80.0 68 3 95.0 56.0 98.0 65 Fourth Score 0 60 1 67 2 68 3 65 153 / 199
  • 154. pandas # Drop rows with missing values d = pd.read_csv("/content/employees.csv") nd = d.dropna(axis=0, how='any') print("Old data frame length:", len(d)) print("New data frame length:", len(nd)) print("Rows with at least one missing value:", (len(d) - len(nd))) Output: Old data frame length: 1000 New data frame length: 764 Rows with at least one missing value: 236 154 / 199
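dropna also accepts a thresh argument, keeping only rows with at least a given number of non-missing values; a minimal sketch on a small illustrative frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [2, 3, np.nan],
                   'C': [np.nan, np.nan, np.nan]})
# Keep rows holding at least two non-NaN values (only row 0 here)
print(df.dropna(thresh=2))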
  • 155. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd']} data2 = {'Name': ['Abhi', 'Ayushi', 'Dhiraj', 'Hitesh'], 'Age': [17, 14, 12, 52], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[4, 5, 6, 7]) print(df, "\n\n", df1) 155 / 199
  • 156. pandas Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd Name Age Address Qualification 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 156 / 199
  • 157. pandas # Concatenating DataFrame frames = [df, df1] res1 = pd.concat(frames) print(res1) Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 157 / 199
  • 158. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd'], 'Mobile No': [97, 91, 58, 76]} data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'], 'Age': [22, 32, 12, 52], 'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'], 'Salary': [1000, 2000, 3000, 4000]} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[2, 3, 6, 7]) print(df, "\n\n", df1) 158 / 199
  • 159. pandas Output: Name Age Address Qualification Mobile No 0 Jai 27 Nagpur Msc 97 1 Princi 24 Kanpur MA 91 2 Gaurav 22 Allahabad MCA 58 3 Anuj 32 Kannuaj Phd 76 Name Age Address Qualification Salary 2 Gaurav 22 Allahabad MCA 1000 3 Anuj 32 Kannuaj Phd 2000 6 Dhiraj 12 Allahabad Bcom 3000 7 Hitesh 52 Kannuaj B.hons 4000 159 / 199
  • 160. pandas # Inner Join res2 = pd.concat([df, df1], axis=1, join='inner') print(res2) Output: Name Age Address Qualification Mobile No Name 2 Gaurav 22 Allahabad MCA 58 Gaurav 3 Anuj 32 Kannuaj Phd 76 Anuj Age Address Qualification Salary 2 22 Allahabad MCA 1000 3 32 Kannuaj Phd 2000 160 / 199
  • 161. pandas # Outer Join res2 = pd.concat([df, df1], axis = 1, sort = False) print(res2) Output: Name Age Address Qualification Mobile No Name 0 Jai 27.0 Nagpur Msc 97.0 NaN 1 Princi 24.0 Kanpur MA 91.0 NaN 2 Gaurav 22.0 Allahabad MCA 58.0 Gaurav 3 Anuj 32.0 Kannuaj Phd 76.0 Anuj 6 NaN NaN NaN NaN NaN Dhiraj 7 NaN NaN NaN NaN NaN Hitesh Age Address Qualification Salary 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 22.0 Allahabad MCA 1000.0 3 32.0 Kannuaj Phd 2000.0 6 12.0 Allahabad Bcom 3000.0 7 52.0 Kannuaj B.hons 4000.0 161 / 199
  • 162. pandas # DataFrames by Ignoring Indexes res = pd.concat([df, df1], ignore_index=True) print(res) Output: Name Age Address Qualification Mobile No Salary 0 Jai 27 Nagpur Msc 97.0 NaN 1 Princi 24 Kanpur MA 91.0 NaN 2 Gaurav 22 Allahabad MCA 58.0 NaN 3 Anuj 32 Kannuaj Phd 76.0 NaN 4 Gaurav 22 Allahabad MCA NaN 1000.0 5 Anuj 32 Kannuaj Phd NaN 2000.0 6 Dhiraj 12 Allahabad Bcom NaN 3000.0 7 Hitesh 52 Kannuaj B.hons NaN 4000.0 162 / 199
  • 163. pandas import pandas as pd data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd'], 'Mobile No': [97, 91, 58, 76]} data2 = {'Name': ['Gaurav', 'Anuj', 'Dhiraj', 'Hitesh'], 'Age': [22, 32, 12, 52], 'Address': ['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification': ['MCA', 'Phd', 'Bcom', 'B.hons'], 'Salary': [1000, 2000, 3000, 4000]} df = pd.DataFrame(data1, index=[0, 1, 2, 3]) df1 = pd.DataFrame(data2, index=[4, 5, 6, 7]) print(df, "\n\n", df1) 163 / 199
  • 164. pandas Output: Name Age Address Qualification Mobile No 0 Jai 27 Nagpur Msc 97 1 Princi 24 Kanpur MA 91 2 Gaurav 22 Allahabad MCA 58 3 Anuj 32 Kannuaj Phd 76 Name Age Address Qualification Salary 4 Gaurav 22 Allahabad MCA 1000 5 Anuj 32 Kannuaj Phd 2000 6 Dhiraj 12 Allahabad Bcom 3000 7 Hitesh 52 Kannuaj B.hons 4000 164 / 199
  • 165. pandas # Concatenating DataFrames with group keys (output shown for the df and df1 from the first concatenation example, before Mobile No and Salary were added) frames = [df, df1] res = pd.concat(frames, keys=['x', 'y']) print(res) Output: Name Age Address Qualification x 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd y 4 Abhi 17 Nagpur Btech 5 Ayushi 14 Kanpur B.A 6 Dhiraj 12 Allahabad Bcom 7 Hitesh 52 Kannuaj B.hons 165 / 199
  • 166. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Msc', 'MA', 'MCA', 'Phd']} df = pd.DataFrame(data1,index=[0, 1, 2, 3]) s1 = pd.Series([1000, 2000, 3000, 4000], name='Salary') print(df, "\n\n", s1) Output: Name Age Address Qualification 0 Jai 27 Nagpur Msc 1 Princi 24 Kanpur MA 2 Gaurav 22 Allahabad MCA 3 Anuj 32 Kannuaj Phd 0 1000 1 2000 2 3000 3 4000 Name: Salary, dtype: int64 166 / 199
  • 167. pandas # Concatenating Mixed DataFrames and Series res = pd.concat([df, s1], axis = 1) print(res) Output: Name Age Address Qualification Salary 0 Jai 27 Nagpur Msc 1000 1 Princi 24 Kanpur MA 2000 2 Gaurav 22 Allahabad MCA 3000 3 Anuj 32 Kannuaj Phd 4000 167 / 199
  • 168. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n", df1) Output: key Name Age 0 K0 Jai 27 1 K1 Princi 24 2 K2 Gaurav 22 3 K3 Anuj 32 key Address Qualification 0 K0 Nagpur Btech 1 K1 Kanpur B.A 2 K2 Allahabad Bcom 3 K3 Kannuaj B.hons 168 / 199
  • 169. pandas # Merging DataFrames Using One Key res = pd.merge(df, df1, on = 'key') print(res) Output: key Name Age Address Qualification 0 K0 Jai 27 Nagpur Btech 1 K1 Princi 24 Kanpur B.A 2 K2 Gaurav 22 Allahabad Bcom 3 K3 Anuj 32 Kannuaj B.hons 169 / 199
  • 170. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K1', 'K0', 'K1'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K0', 'K0', 'K0'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n\n", df1) 170 / 199
  • 171. pandas Output: key key1 Name Age 0 K0 K0 Jai 27 1 K1 K1 Princi 24 2 K2 K0 Gaurav 22 3 K3 K1 Anuj 32 key key1 Address Qualification 0 K0 K0 Nagpur Btech 1 K1 K0 Kanpur B.A 2 K2 K0 Allahabad Bcom 3 K3 K0 Kannuaj B.hons 171 / 199
  • 172. pandas # Merging DataFrames Using Multiple Keys res1 = pd.merge(df, df1, on=['key', 'key1']) print(res1) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K2 K0 Gaurav 22 Allahabad Bcom 172 / 199
  • 173. pandas data1 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K1', 'K0', 'K1'], 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32],} data2 = {'key': ['K0', 'K1', 'K2', 'K3'], 'key1': ['K0', 'K0', 'K0', 'K0'], 'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'], 'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2) print(df, "\n\n", df1) 173 / 199
  • 174. pandas Output: key key1 Name Age 0 K0 K0 Jai 27 1 K1 K1 Princi 24 2 K2 K0 Gaurav 22 3 K3 K1 Anuj 32 key key1 Address Qualification 0 K0 K0 Nagpur Btech 1 K1 K0 Kanpur B.A 2 K2 K0 Allahabad Bcom 3 K3 K0 Kannuaj B.hons 174 / 199
  • 175. pandas # Left outer join res = pd.merge(df, df1, how = 'left', on = ['key', 'key1']) print(res) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K1 K1 Princi 24 NaN NaN 2 K2 K0 Gaurav 22 Allahabad Bcom 3 K3 K1 Anuj 32 NaN NaN 175 / 199
  • 176. pandas # Right outer join res1 = pd.merge(df, df1, how = 'right', on = ['key', 'key1']) print(res1) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27.0 Nagpur Btech 1 K1 K0 NaN NaN Kanpur B.A 2 K2 K0 Gaurav 22.0 Allahabad Bcom 3 K3 K0 NaN NaN Kannuaj B.hons 176 / 199
  • 177. pandas # Outer join res2 = pd.merge(df, df1, how='outer', on=['key', 'key1']) print(res2) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27.0 Nagpur Btech 1 K1 K0 NaN NaN Kanpur B.A 2 K1 K1 Princi 24.0 NaN NaN 3 K2 K0 Gaurav 22.0 Allahabad Bcom 4 K3 K0 NaN NaN Kannuaj B.hons 5 K3 K1 Anuj 32.0 NaN NaN 177 / 199
  • 178. pandas # Inner join res3 = pd.merge(df, df1, how = 'inner', on = ['key', 'key1']) print(res3) Output: key key1 Name Age Address Qualification 0 K0 K0 Jai 27 Nagpur Btech 1 K2 K0 Gaurav 22 Allahabad Bcom 178 / 199
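merge can also record where each row came from via its indicator argument; a minimal sketch on the same df and df1:

# '_merge' reports 'left_only', 'right_only', or 'both' per row
res4 = pd.merge(df, df1, how='outer', on=['key', 'key1'], indicator=True)
print(res4[['key', 'key1', 'Name', 'Address', '_merge']])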
  • 179. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32]} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3']) df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4']) print(df, "\n\n", df1) Output: Name Age K0 Jai 27 K1 Princi 24 K2 Gaurav 22 K3 Anuj 32 Address Qualification K0 Allahabad MCA K2 Kannuaj Phd K3 Allahabad Bcom K4 Kannuaj B.hons 179 / 199
  • 180. pandas # Merge DataFrames based on row indexes res = df.join(df1) print(res) Output: Name Age Address Qualification K0 Jai 27 Allahabad MCA K1 Princi 24 NaN NaN K2 Gaurav 22 Kannuaj Phd K3 Anuj 32 Allahabad Bcom 180 / 199
  • 181. pandas # Merge DataFrames based on row indexes res = df1.join(df) print(res) Output: Address Qualification Name Age K0 Allahabad MCA Jai 27.0 K2 Kannuaj Phd Gaurav 22.0 K3 Allahabad Bcom Anuj 32.0 K4 Kannuaj B.hons NaN NaN 181 / 199
  • 182. pandas # Outer Join res1 = df.join(df1, how='outer') print(res1) Output: Name Age Address Qualification K0 Jai 27.0 Allahabad MCA K1 Princi 24.0 NaN NaN K2 Gaurav 22.0 Kannuaj Phd K3 Anuj 32.0 Allahabad Bcom K4 NaN NaN Kannuaj B.hons 182 / 199
  • 183. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Key':['K0', 'K1', 'K2', 'K3']} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1) df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4']) print(df, "\n\n", df1) Output: Name Age Key 0 Jai 27 K0 1 Princi 24 K1 2 Gaurav 22 K2 3 Anuj 32 K3 Address Qualification K0 Allahabad MCA K2 Kannuaj Phd K3 Allahabad Bcom K4 Kannuaj B.hons 183 / 199
  • 184. pandas # Joining DataFrames Using "on" Argument res2 = df.join(df1, on ='Key') res2 Output: Name Age Key Address Qualification 0 Jai 27 K0 Allahabad MCA 1 Princi 24 K1 NaN NaN 2 Gaurav 22 K2 Kannuaj Phd 3 Anuj 32 K3 Allahabad Bcom 184 / 199
  • 185. pandas data1 = {'Name':['Jai', 'Princi', 'Gaurav'], 'Age':[27, 24, 22]} data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kanpur'], 'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']} df = pd.DataFrame(data1, index=pd.Index(['K0', 'K1', 'K2'], name='key')) index = pd.MultiIndex.from_tuples([('K0', 'Y0'), ('K1', 'Y1'), ('K2', 'Y2'), ('K2', 'Y3')], names=['key', 'Y']) df1 = pd.DataFrame(data2, index= index) print(df, "\n\n", df1) 185 / 199
  • 186. pandas Output: Name Age key K0 Jai 27 K1 Princi 24 K2 Gaurav 22 Address Qualification key Y K0 Y0 Allahabad MCA K1 Y1 Kannuaj Phd K2 Y2 Allahabad Bcom Y3 Kanpur B.hons 186 / 199
  • 187. pandas # Joining DataFrames with Different Index Levels (Multi-Index) result = df.join(df1, how ='inner') print(result) Output: Name Age Address Qualification key Y K0 Y0 Jai 27 Allahabad MCA K1 Y1 Princi 24 Kannuaj Phd K2 Y2 Gaurav 22 Allahabad Bcom Y3 Gaurav 22 Kanpur B.hons 187 / 199
  • 188. SciPy ▶ SciPy is an open-source Python library for scientific and technical computing. ▶ Relies on NumPy, which provides efficient n-dimensional array manipulation. ▶ Covers areas like optimization, integration, interpolation, eigenvalue problems, and statistics. ▶ Essential for research, data analysis, and engineering projects. 188 / 199
  • 189. Usage of SciPy ▶ Scientific Computing: Solves differential equations and performs numerical integration. ▶ Statistics: Offers scipy.stats for hypothesis testing, probability distributions, and more. ▶ Optimization: Includes tools for linear programming and nonlinear optimization. ▶ Signal Processing: Provides functions for Fourier transforms and filtering. ▶ Preparation: Install via pip install scipy and explore documentation. 189 / 199
  • 190. SciPy from scipy import stats data = [1.5, 2.3, 3.1, 4.2, 5.0] mean = stats.tmean(data) std_dev = stats.tstd(data) print(f"Mean: {mean}, Standard Deviation: {std_dev}") Output: Mean: 3.22, Standard Deviation: 1.4096098751072936 from scipy import integrate import numpy as np f = lambda x: x**2 result, error = integrate.quad(f, 0, 1) print(f"Integral from 0 to 1: {result} (Error: {error})") Output: Integral from 0 to 1: 0.33333333333333337 (Error: 3.700743415417189e-15) 190 / 199
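As a minimal sketch of the optimization support mentioned earlier (the quadratic is chosen purely for illustration), scipy.optimize can locate a function's minimum numerically:

from scipy import optimize

# f(x) = (x - 3)^2 + 1 has its minimum at x = 3, where f(3) = 1
res = optimize.minimize_scalar(lambda x: (x - 3)**2 + 1)
print(f"Minimum at x = {res.x:.4f}, f(x) = {res.fun:.4f}")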
  • 191. Matplotlib ▶ A comprehensive Python library for creating static, animated, and interactive visualizations. ▶ Offers a wide range of customizable plots (line, bar, scatter, etc.) and backends. ▶ Applications: Used in professional reporting, interactive dashboards, web/GUI applications, and embedded views. 191 / 199
  • 192. Usage of Matplotlib ▶ Reporting: Generate quality figures for research articles. ▶ Interactive Tools: Create dynamic plots with widgets for data exploration. ▶ Dashboards: Build complex visualizations for real-time data monitoring. ▶ Web Integration: Embed plots in web applications using backends like WebAgg. ▶ Preparation: Install via pip install matplotlib and explore documentation. 192 / 199
  • 193. Matplotlib # Plot sine, cosine and tangent waves import matplotlib.pyplot as plt import numpy as np x = np.linspace(0, 10, 100) plt.plot(x, np.sin(x), label='Sine', color='blue', linewidth=2) plt.plot(x, np.cos(x), label='Cosine', color='red', linestyle='--') plt.plot(x, np.tan(x), label='Tangent', color='green', linestyle='-.') # tan(x) diverges near odd multiples of pi/2; clipping with plt.ylim(-3, 3) keeps the curves readable plt.title("Sine, Cosine and Tangent Waves") plt.xlabel("X-axis") plt.ylabel("Y-axis") plt.legend() plt.grid(True) plt.show() 193 / 199
  • 195. Matplotlib # Plot sample bar and line chart import matplotlib.pyplot as plt categories = ['A', 'B', 'C'] values = [10, 20, 15] plt.bar(categories, values, color='green') plt.plot(categories, values, color='red', linewidth=2) plt.title("Sample Bar and Line Chart") plt.ylabel("Values") plt.show() 195 / 199
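For EDA itself, a histogram is usually the first plot drawn to inspect a variable's distribution; a minimal sketch on a seeded synthetic sample:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # reproducible synthetic data
data = rng.normal(loc=50, scale=10, size=1000)

plt.hist(data, bins=30, color='steelblue', edgecolor='black')
plt.title("Histogram of a Synthetic Variable")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()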
  • 197. Summary This lecture: ▶ Presents the basics of Exploratory Data Analysis (EDA) and its significance. ▶ Describes measurement scales, data types, and data analysis methodologies. ▶ Highlights the steps involved in EDA, including gathering data, cleaning it, visualizing it, and developing hypotheses. ▶ Contrasts Bayesian, exploratory, and classical analysis techniques. ▶ Demonstrates EDA tools and Python libraries (NumPy, pandas, SciPy, and Matplotlib). 197 / 199
  • 198. References I TEXTBOOK [1] Mukhiya, S. K., & Ahmed, U. (2020). Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. Packt Publishing Ltd. REFERENCE BOOKS [1] Pearson, R. K. (2020). Exploratory Data Analysis Using R (1st ed.). CRC Press. [2] Datar, R., & Garg, H. (2019). Hands-on exploratory data analysis with R: Become an expert in exploratory data analysis using R packages. Packt Publishing Ltd. 198 / 199
  • 199. References II ONLINE RESOURCES [1] Python Pool. (2021, June 14). Numpy Axis in Python with detailed examples. Python Pool. https://www.pythonpool.com/numpy-axis/ [2] GeeksforGeeks. (2025, July 28). Working with Missing Data in Pandas. GeeksforGeeks. https://www.geeksforgeeks.org/data-analysis/working-with-missing-data-in-pandas/ [3] GeeksforGeeks. (2025, July 26). Python | Pandas merging, joining and concatenating. GeeksforGeeks. https://www.geeksforgeeks.org/python/python-pandas-merging-joining-and-concatenating/ 199 / 199