Data Science and its Applications
VI SEM
Data Science.pptx00000000000000000000000
Data visualization
Matplotlib
• For simple bar charts, line charts, and scatterplots, Matplotlib works well.
• If you are interested in producing elaborate interactive visualizations for the Web, it is likely not the right choice.
• We will be using the matplotlib.pyplot module.
• When you import matplotlib.pyplot using the standard convention import matplotlib.pyplot as plt, you gain access to a wide range of functions that let you create various types of plots, customize them, and add annotations.
• Common plot types you can create with Matplotlib include line plots, scatter plots, bar plots, and histograms.
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Plotting the data
plt.plot(x, y)
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
# Display the plot
plt.show()
from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
import matplotlib.pyplot as plt
# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 20, 15, 25, 30]
# Creating a bar plot
plt.bar(categories, values)
# Adding labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Basic Bar Plot')
# Display the plot
plt.show()
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
# current Matplotlib centers each bar on its x-coordinate by default,
# so the list indices 0, 1, 2, ... can be used directly
xs = [i for i, _ in enumerate(movies)]
# plot bars centered at x-coordinates [xs] with heights [num_oscars]
plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")
# label x-axis with movie names at bar centers
plt.xticks(xs, movies)
plt.show()
In this list comprehension:
• enumerate(movies) loops through the list of movies, providing both the index i and the movie name _. Since we are only interested in the index, we use i and ignore _.
• Very old Matplotlib releases aligned bars by their left edge, which is why older versions of this example added 0.1 to each index (and 0.5 to each tick position) to center the bars. Current releases center bars by default, so no offset is needed.
• The indices are stored in the list xs, which is used as the x-coordinates for plotting the bars.
In this code:
• plt.xticks() sets the x-axis ticks. The first argument is the list of x-coordinates where the ticks should appear.
• The second argument is the list of tick labels, which in this case are the movie names.
• plt.show() is called to display the plot.
from collections import Counter
from matplotlib import pyplot as plt
grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
decile = lambda grade: grade // 10 * 10
histogram = Counter(decile(grade) for grade in grades)
plt.bar(list(histogram.keys()),    # bars are centered on each decile by default
        list(histogram.values()),  # give each bar its correct height
        8)                         # give each bar a width of 8
plt.axis([-5, 105, 0, 5])  # x-axis from -5 to 105,
                           # y-axis from 0 to 5
plt.xticks([10 * i for i in range(11)])  # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()
from collections import Counter
# Create a Counter from a list
my_list = ['a', 'b', 'c', 'a', 'b', 'a']
my_counter = Counter(my_list)
# Access counts
print(my_counter['a']) # Output: 3 (since 'a' appears 3 times)
# Access unique elements and their counts
print(my_counter.keys()) # Output: dict_keys(['a', 'b', 'c'])
print(my_counter.values()) # Output: dict_values([3, 2, 1])
# Arithmetic operations
other_list = ['a', 'b', 'c', 'a', 'a']
other_counter = Counter(other_list)
print(my_counter + other_counter) # Output: Counter({'a': 6, 'b': 3, 'c': 2})
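To connect Counter back to the grade histogram, the sketch below repeats the decile bucketing without any plotting, so the counts behind each bar can be inspected directly. The named function decile replaces the lambda purely for readability; the data is the grades list from the histogram example.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

def decile(grade):
    """Round a grade down to the nearest multiple of 10."""
    return grade // 10 * 10

# Count how many grades fall into each decile bucket
histogram = Counter(decile(grade) for grade in grades)

# Print the buckets in ascending order with their counts
for bucket in sorted(histogram):
    print(bucket, histogram[bucket])
```

Each key of histogram is a bar position and each value is that bar's height.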
Line Charts
As we saw already, we can make line charts using plt.plot().
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance') # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2') # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error') # blue dotted line
# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()
Scatterplots
A scatterplot is the right choice for visualizing the relationship between two paired sets of
data. For example, consider the relationship between the number of friends
your users have and the number of minutes they spend on the site every day:
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
Scatterplots
In the line for label, friend_count, minute_count in zip(labels, friends, minutes):
Python's zip() function is used to iterate over multiple lists (labels, friends, and
minutes) simultaneously.
labels contains the labels for each data point.
friends contains the number of friends for each data point.
minutes contains the minutes spent on the site for each data point.
By using zip(), we iterate over these lists together. In each iteration, label,
friend_count, and minute_count will correspond to the current elements from
labels, friends, and minutes lists, respectively.
plt.annotate(label, xy=(friend_count, minute_count), xytext=(5, -5),
textcoords='offset points') is then used to annotate the scatter plot with the label
for each point. This function places text at the specified coordinates
(xy=(friend_count, minute_count)) with a small offset (xytext=(5, -5)) from the
specified point.
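The zip() behavior described above can be checked without any plotting. This minimal sketch reuses the three lists from the scatterplot example; the name points is introduced here only for illustration.

```python
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

# zip pairs up the i-th element of every list into one tuple
points = list(zip(labels, friends, minutes))

# Unpack each tuple into three names, exactly as the annotate loop does
for label, friend_count, minute_count in points[:3]:
    print(label, friend_count, minute_count)
```

Each tuple holds one label together with the x and y values that plt.annotate() needs for that point.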
Bar Chart:
• Use bar charts to represent categorical data or data that can be divided into distinct groups.
• Best for comparing values between different categories or groups.
• Useful for showing discrete data points or data that doesn't have a natural order.
• Suitable for showing changes over time when time is divided into distinct intervals (e.g., months, years).
Examples of when to use bar charts:
• Comparing sales performance of different products.
• Showing population distribution by country.
• Displaying the frequency of occurrence of different categories.
Line Chart:
• Use line charts to visualize trends and patterns in continuous data over time.
• Best for showing changes and trends over a continuous scale (e.g., time, temperature, distance).
• Ideal for illustrating relationships between variables and identifying patterns such as growth, decline, or fluctuations.
• Also suitable for displaying multiple series of data on the same chart for comparison.
Examples of when to use line charts:
• Showing stock price fluctuations over time.
• Visualizing temperature changes throughout the year.
• Displaying trends in website traffic over months or years.
In summary, choose a bar chart when you want to compare discrete categories or groups, and opt for a line
chart when you need to visualize trends or patterns in continuous data over time.
Linear Algebra
• Vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.
• For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age). If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).
height_weight_age = [70, # inches,
170, # pounds,
40 ] # years
grades = [95, # exam1
80, # exam2
75, # exam3
62 ] # exam4
def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i
            for v_i, w_i in zip(v, w)]
Similarly, to subtract two vectors we just subtract corresponding elements:
def vector_subtract(v, w):
    """subtracts corresponding elements"""
    return [v_i - w_i
            for v_i, w_i in zip(v, w)]
def vector_sum(vectors):
    """sums all corresponding elements"""
    result = vectors[0]                       # start with the first vector
    for vector in vectors[1:]:                # then loop over the others
        result = vector_add(result, vector)   # and add them to the result
    return result
We’ll also need to be able to multiply a vector by a scalar, which we do simply by multiplying each element
of the vector by that number:
def scalar_multiply(c, v):
    """c is a number, v is a vector"""
    return [c * v_i for v_i in v]
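The following sketch exercises the vector helpers on small inputs. One detail is an addition rather than part of the original functions: zip() silently stops at the shorter input, so this version of vector_add raises ValueError when the lengths differ. Treat that check as an assumption about the desired behavior.

```python
def vector_add(v, w):
    """Add corresponding elements; refuse vectors of different lengths."""
    if len(v) != len(w):  # added check, not in the original helper
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_subtract(v, w):
    """Subtract corresponding elements."""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """c is a number, v is a vector."""
    return [c * v_i for v_i in v]

def vector_sum(vectors):
    """Sum all corresponding elements across a list of vectors."""
    result = vectors[0]
    for vector in vectors[1:]:
        result = vector_add(result, vector)
    return result

added = vector_add([1, 2], [2, 1])
subtracted = vector_subtract([5, 7], [1, 2])
scaled = scalar_multiply(3, [1, 2, 3])
total = vector_sum([[1, 2], [3, 4], [5, 6]])
print(added, subtracted, scaled, total)
```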
In Python, a tuple is a collection data type similar to a list, but with one key difference: tuples are immutable,
meaning once they are created, their elements cannot be changed or modified. Tuples are defined by
enclosing comma-separated values within parentheses ().
Here's a basic example of a tuple:
my_tuple = (1, 2, 'a', 'b', True)
Tuples can contain elements of different data types, including integers, floats, strings, booleans, and even
other tuples or data structures. You can access elements of a tuple using indexing, just like with lists:
print(my_tuple[0]) # Output: 1
print(my_tuple[2]) # Output: a
Tuples support many of the same operations as lists, such as slicing, concatenation, and repetition:
tuple1 = (1, 2, 3)
tuple2 = ('a', 'b', 'c')
# Slicing
print(tuple1[:2]) # Output: (1, 2)
# Concatenation
tuple3 = tuple1 + tuple2
print(tuple3) # Output: (1, 2, 3, 'a', 'b', 'c')
# Repetition
tuple4 = tuple2 * 2
print(tuple4) # Output: ('a', 'b', 'c', 'a', 'b', 'c')
However, because tuples are immutable, you cannot modify individual elements:
my_tuple[0] = 5 # Raises TypeError because tuples are immutable
Tuples are commonly used in Python for various purposes,
such as representing fixed collections of items, returning
multiple values from a function, and as keys in dictionaries
(since they are immutable).
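Two of the uses listed above can be sketched in a few lines: returning multiple values from a function, and using tuples as dictionary keys. The function min_and_max and the distances dictionary are hypothetical names chosen for illustration.

```python
def min_and_max(values):
    """Return two results at once by packing them into a tuple."""
    return min(values), max(values)

# Tuple unpacking assigns each element to its own name
lowest, highest = min_and_max([3, 1, 4, 1, 5])

# Tuples of hashable items are hashable, so they can be dictionary keys
distances = {("A", "B"): 5, ("B", "C"): 7}
ab = distances[("A", "B")]

# A list cannot be used as a key because it is mutable (unhashable)
try:
    {["A", "B"]: 5}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(lowest, highest, ab, list_key_allowed)
```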
Vector addition
If two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0],
whose second element is v[1] + w[1], and so on. (If they’re not the same length, then we’re not allowed to
add them.)
For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1] or [3, 3].
Notes
The %matplotlib inline command is used in Jupyter Notebooks or IPython environments to display
Matplotlib plots directly within the notebook. It ensures that plots are rendered inline, meaning they
appear directly below the code cell that generates them.
By using %matplotlib inline, you're setting up the notebook to show Matplotlib plots without the need for
additional commands like plt.show()
x = [10, 20, 30, 40]
y = [1, 4, 9, 16] # assumed sample values; y is not defined on the original slide
plt.plot(x,y,'r')
[<matplotlib.lines.Line2D at 0xa6e4dd2128>]
plt.xlabel('X axis title')
plt.ylabel('Y axis title')
plt.title("Curve plot")
plt.plot(x,y,'r',label='curve')
plt.legend()
<matplotlib.legend.Legend at 0xa6e3b4e
The plt.plot(x, y, 'r') command you've used plots the data in arrays x and y using red color ('r').
Here's what each part of the command does:
plt.plot(): This function is used to create a line plot.
x and y: These are the data arrays to be plotted along the x and y axes, respectively.
'r': This specifies the color of the line. In this case, 'r' stands for red. You can use different color
abbreviations ('b' for blue, 'g' for green, etc.) or full color names ('red', 'blue', 'green', etc.).
So, plt.plot(x, y, 'r') will create a line plot of y against x with a red color line
The np.arange() function in NumPy is used to create an array with evenly spaced values within a
specified interval. It is typically called as:
np.arange(start, stop, step)
start: The starting value of the sequence.
stop: The end value of the sequence, not included.
step: The step size between each pair of consecutive values. It defaults to 1 if not provided.
For example, np.arange(0, 10) will generate an array containing integers from 0 up to (but not
including) 10, with a default step size of 1. So, the resulting array will be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
np.linspace(0,10,5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
np.linspace(0,10,50)
array([ 0. , 0.20408163, 0.40816327, 0.6122449 ,
0.81632653, 1.02040816, 1.2244898 , 1.42857143,
1.63265306, 1.83673469, 2.04081633, 2.24489796,
2.44897959, 2.65306122, 2.85714286, 3.06122449,
3.26530612, 3.46938776, 3.67346939, 3.87755102,
4.08163265, 4.28571429, 4.48979592, 4.69387755,
4.89795918, 5.10204082, 5.30612245, 5.51020408,
5.71428571, 5.91836735, 6.12244898, 6.32653061,
6.53061224, 6.73469388, 6.93877551, 7.14285714,
7.34693878, 7.55102041, 7.75510204, 7.95918367,
8.16326531, 8.36734694, 8.57142857, 8.7755102 ,
8.97959184, 9.18367347, 9.3877551 , 9.59183673,
9.79591837, 10. ])
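To see where the np.linspace values come from, here is a pure-Python sketch of the same idea. It is an illustration only and not NumPy's actual implementation: the function linspace below is a hypothetical stand-in that returns a plain list. Unlike np.arange, the stop value is included and you choose the number of points rather than the step.

```python
def linspace(start, stop, num):
    """Return num evenly spaced floats from start to stop, inclusive."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)  # num points leave num - 1 gaps
    return [start + i * step for i in range(num)]

five = linspace(0, 10, 5)
fifty = linspace(0, 10, 50)
print(five)
print(len(fifty), fifty[1])
```

With 50 points there are 49 gaps, so the step is 10 / 49, which is the 0.20408163 seen in the output above.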
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
labels: This is a list containing three strings: 'a', 'b', and 'c'.
my_list: This is a list containing three integers: 10, 20, and 30.
arr: This is a NumPy array created using np.array(), containing the same integers as my_list
d: This is a dictionary where keys are strings ('a', 'b', 'c') and values are integers (10, 20, 30).
These data structures store similar information but in different ways and with different
functionalities. Lists are ordered collections, NumPy arrays are arrays of homogeneous data, and
dictionaries are mappings of keys to values
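The difference between positional and key-based access can be sketched with the list and dictionary alone. The NumPy array is left out so the snippet runs without extra libraries, and the names second_by_position, second_by_label, and rebuilt are illustrative.

```python
labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
d = {'a': 10, 'b': 20, 'c': 30}

second_by_position = my_list[1]  # lists are indexed by position
second_by_label = d['b']         # dictionaries are indexed by key

# Pairing labels with values rebuilds the same mapping as d
rebuilt = dict(zip(labels, my_list))
print(second_by_position, second_by_label, rebuilt == d)
```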
Pandas is a popular open-source Python library used for data manipulation and analysis. It provides easy-to-
use data structures and functions for working with structured data, such as tables or spreadsheet-like data,
making it a fundamental tool for data scientists, analysts, and researchers.
Here are some key features of Pandas:
DataFrame: The primary data structure in Pandas is the DataFrame, which is a two-dimensional labeled
data structure with columns of potentially different types. It resembles a spreadsheet or SQL table, and you
can think of it as a dictionary of Series objects, where each Series represents a column
import matplotlib.pyplot as plt
# Data for demonstration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
# Create a figure to hold 4 rows and 2 columns of subplots
plt.figure(figsize=(10, 10))
# Loop through each subplot position in the 4x2 grid
for i in range(1, 9):  # 1 to 8 for a 4x2 grid
    plt.subplot(4, 2, i)
    plt.plot(x, y)
    plt.title(f'Subplot {i}')  # Label each subplot
# Adjust layout to prevent overlap
plt.tight_layout()
# Display the plot
plt.show()
+-------------------+-------------------+
| Subplot 1         | Subplot 2         |
+-------------------+-------------------+
Statistics
Central Tendencies
We’ll usually want some notion of where our data is centered.
Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count:
def mean(x):
    return sum(x) / len(x)
mean(num_friends)
We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is
odd) or the average of the two middle-most values (if the number of data points is even).
from collections import Counter
import matplotlib.pyplot as plt
# Example list of friend counts
num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
friend_counts = Counter(num_friends)
xs = range(101) # Assuming the largest value is 100
ys = [friend_counts[x] for x in xs] # height is just the number of friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, max(ys) + 1]) # Setting the y-axis limit to one more than the maximum count
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()
sorted_values = sorted(num_friends)
sorted_values # [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
smallest_value = sorted_values[0]
smallest_value # 2
second_largest_value = sorted_values[-2]
second_largest_value # 7
Median
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
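A quick check of the median function on both branches may help. The sketch below repeats the definition so it runs on its own, uses the num_friends list from the histogram example for the odd case, and cross-checks against the standard library's statistics.median. The four-element list is made up for the even case.

```python
import statistics

def median(v):
    """Find the 'middle-most' value of v."""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_v[midpoint]  # odd: the single middle value
    lo, hi = midpoint - 1, midpoint
    return (sorted_v[lo] + sorted_v[hi]) / 2  # even: average of the two

num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

odd_case = median(num_friends)    # 15 values, so sorted index 7 is the middle
even_case = median([1, 2, 3, 4])  # averages the two middle values, 2 and 3
print(odd_case, even_case, statistics.median(num_friends))
```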
def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]
x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
p = 0.25 # Desired quantile (25th percentile)
result = quantile(x, p)
p_index = int(p * len(x)) calculates the index corresponding to the quantile p in the sorted dataset x:
• len(x) returns the number of elements in the dataset x, i.e., the length of the dataset.
• p is the quantile you want to calculate, represented as a value between 0 and 1.
• p × len(x) calculates the position in the sorted dataset corresponding to the desired quantile. Since p is a fraction between 0 and 1, multiplying it by the length of the dataset gives the position at which the quantile would fall if the data were sorted.
• int(p × len(x)) takes the integer part of the result. This ensures that the index is an integer value, as indices in Python must be integers.
For example, take x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6] and p = 0.25 (the 25th percentile). Then p × len(x) = 0.25 × 15 = 3.75. Taking the integer part of 3.75 gives 3, so the 25th percentile of the dataset x corresponds to the element at index 3 in the sorted dataset, which is the value 3.
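As a sketch of how the index arithmetic plays out for other values of p, the snippet below evaluates the same quantile function at the three quartiles. It also shows one edge case worth knowing about: p = 1.0 maps to index len(x), which is one past the end of the list.

```python
def quantile(x, p):
    """Return the value at index int(p * len(x)) of the sorted data."""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

# Indices are int(3.75) = 3, int(7.5) = 7, and int(11.25) = 11
quartiles = {p: quantile(x, p) for p in (0.25, 0.5, 0.75)}
print(quartiles)

# p = 1.0 gives index 15, but the last valid index is 14
try:
    quantile(x, 1.0)
    p_one_ok = True
except IndexError:
    p_one_ok = False
print(p_one_ok)
```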
Uncertainty
Randomness
Dependence and Independence
If we flip a fair coin twice, knowing whether the first flip is Heads gives us no
information about whether the second flip is Heads. These events are independent.
On the other hand, knowing whether the first flip is Heads certainly gives us
information about whether both flips are Tails.
(If the first flip is Heads, then definitely it’s not the case that both flips are Tails.)
These two events are dependent.
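The coin-flip claims can be checked by listing all four equally likely outcomes. The sketch below uses the standard product rule, under which two events E and F are independent exactly when P(E and F) = P(E) × P(F). That rule is not spelled out in the text above, and the helper prob and the event names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# All outcomes of two fair flips: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def both_tails(o):
    return o == ("T", "T")

# First flip Heads vs. second flip Heads
p_joint_ab = prob(lambda o: first_heads(o) and second_heads(o))
independent_ab = p_joint_ab == prob(first_heads) * prob(second_heads)

# First flip Heads vs. both flips Tails
p_joint_ac = prob(lambda o: first_heads(o) and both_tails(o))
independent_ac = p_joint_ac == prob(first_heads) * prob(both_tails)

print(independent_ab, independent_ac)
```

The first pair satisfies the product rule (1/4 = 1/2 × 1/2). The second does not, because the joint probability is 0 while the product is 1/8.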
import random
def random_kid():
    return random.choice(["boy", "girl"])
both_girls = 0
older_girl = 0
either_girl = 0
random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
print("P(both | older):", both_girls / older_girl)    # close to 1/2
print("P(both | either):", both_girls / either_girl)  # close to 1/3
We want to calculate two conditional probabilities for a family with two children:
• The probability that both children are girls given that the older child is a girl.
• The probability that both children are girls given that at least one of the children is a girl.
random_kid() function: Returns either "boy" or "girl" randomly with equal probability (0.5 each).
Counters:
• both_girls: Counts the number of times both children are girls.
• older_girl: Counts the number of times the older child is a girl.
• either_girl: Counts the number of times at least one child is a girl.
Simulation Loop: For 10,000 iterations, the code simulates the gender of two children (older and younger) and updates the counters based on the result.
Probabilities
• P(both | older): The probability that both children are girls given that the older child is a girl. It is estimated as both_girls / older_girl, the number of times both children are girls divided by the number of times the older child is a girl.
• P(both | either): The probability that both children are girls given that at least one of the children is a girl. It is estimated as both_girls / either_girl, the number of times both children are girls divided by the number of times at least one child is a girl.
Mathematical Explanation
• P(both | older): Given that the older child is a girl, the younger child can be either a girl or a boy with equal probability. Therefore, the probability that both are girls is 1/2.
• P(both | either): Of the 4 equally likely combinations (Girl-Girl, Girl-Boy, Boy-Girl, Boy-Boy), 3 have at least one girl: Girl-Girl, Girl-Boy, and Boy-Girl. Only one of those 3, Girl-Girl, has both girls. Therefore, the probability is 1/3.
Simulation Results
• P(both | older): The simulation result should be close to 0.5 (which is 1/2).
• P(both | either): The simulation result should be close to 1/3.
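The two exact values can also be obtained by enumerating the four equally likely families instead of simulating them. This is a small sketch of that counting argument; the variable names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Each family is a pair (older, younger); all four are equally likely
families = list(product(["boy", "girl"], repeat=2))

both = [f for f in families if f == ("girl", "girl")]
older_girl = [f for f in families if f[0] == "girl"]
either_girl = [f for f in families if "girl" in f]

# "Both girls" is contained in each conditioning event, so the
# conditional probability is a simple ratio of counts
p_both_given_older = Fraction(len(both), len(older_girl))
p_both_given_either = Fraction(len(both), len(either_girl))

print(p_both_given_older, p_both_given_either)
```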
Normal Distribution
Program 5
Code:
import re
import pandas as pd
import numpy as np
# Import the data into a DataFrame
books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(books_df.head())
# Find and drop the columns which are irrelevant for the book information
columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
                   'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=columns_to_drop, inplace=True)
# Change the Index of the DataFrame
books_df.set_index('Identifier', inplace=True)
# Tidy up fields in the data such as date of publication with the help of simple regular expression
def clean_date(date):
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return np.nan
books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)
# Combine str methods with NumPy to clean columns
# (na=False treats missing values as "no match" instead of propagating NaN)
books_df['Place of Publication'] = np.where(
    books_df['Place of Publication'].str.contains('London', na=False),
    'London',
    np.where(
        books_df['Place of Publication'].str.contains('Oxford', na=False),
        'Oxford',
        books_df['Place of Publication'].replace(
            r'^\s*$', 'Unknown', regex=True
        )
    )
)
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())
Function Definition: The function clean_date takes one parameter, date.
def clean_date(date):
Type Check: It first checks whether the input date is a string using the isinstance function.
if isinstance(date, str):
Regular Expression Search: If date is a string, the function uses the re.search method to search
for a pattern that matches four consecutive digits (which typically represent a year) in the string.
match = re.search(r'\d{4}', date)
re.search searches the input string for the first location where the regular expression pattern \d{4}
(which means any four digits) matches.
If such a pattern is found, re.search returns a match object; otherwise, it returns None.
Extracting the Year: If a match is found (i.e., the match object is not None), the function extracts the
matched string (the year) using the group method of the match object.
if match: return match.group()
Handling No Match: If date is not a string or if no four-digit number is found in the string, the function
returns np.nan (NumPy's representation of a missing value in data analysis).
return np.nan
Libraries
re: This library provides regular expression matching operations.
numpy as np: This library is typically used for numerical and array operations. np.nan is a special
floating-point value that represents 'Not a Number' and is used to denote missing values.
Example:
Input: "April 20, 1995"
Output: "1995"
Input: "The year is 2023"
Output: "2023"
Input: "No year here"
Output: np.nan
Input: 12345
Output: np.nan (since the input is not a string)
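The four examples above can be reproduced with the standard library alone. In this sketch math.nan stands in for np.nan (both are the floating-point "not a number" value) so that the snippet runs without NumPy. That substitution is an assumption made for portability and is not part of Program 5.

```python
import math
import re

def clean_date(date):
    """Return the first four-digit run in a string, else NaN."""
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return math.nan  # stand-in for np.nan

examples = ["April 20, 1995", "The year is 2023", "No year here", 12345]
results = [clean_date(e) for e in examples]
print(results)
```

Note that the year comes back as a string ("1995"), not as an integer.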
r'^\s*$': This is a regular expression pattern.
^: Asserts the position at the start of the string.
\s*: Matches zero or more whitespace characters (spaces, tabs, newlines).
$: Asserts the position at the end of the string.
Therefore, r'^\s*$' matches any string that contains only whitespace characters or is completely empty.
'Unknown': This is the replacement value. Any string that matches the regular expression pattern will be
replaced with the string 'Unknown'.
regex=True: This tells the replace method to interpret the first argument as a regular expression pattern.
Without regex=True, the method would treat the pattern as a plain string and attempt to find and replace the
literal text ^\s*$, which wouldn't match anything in most cases.
What does this code do?
The code replaces any entry in the 'Place of Publication' column that is empty or
contains only whitespace with the string 'Unknown'. By setting regex=True, it
ensures that the regular expression pattern is correctly used to identify these
entries.
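The effect of the whitespace pattern can be tried on plain strings with Python's re module. This is a minimal sketch of the same matching rule and not the pandas replace call itself; fill_blank and the sample places list are hypothetical.

```python
import re

# Matches strings that are empty or contain only whitespace
BLANK = re.compile(r'^\s*$')

def fill_blank(value, replacement='Unknown'):
    """Return replacement when value is empty or whitespace-only."""
    return replacement if BLANK.match(value) else value

places = ['London', '', '   ', '\t\n', 'Oxford ']
cleaned = [fill_blank(p) for p in places]
print(cleaned)
```

A value such as 'Oxford ' is left alone because it contains non-whitespace characters, even though it ends with a space.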

More Related Content

PDF
Chapter3_Visualizations2.pdf
PPTX
Data Visualization using Matplotlib to understand Graphs
PPTX
Python chart plotting using Matplotlib.pptx
PPTX
Unit3-v1-Plotting and Visualization.pptx
PDF
CE344L-200365-Lab2.pdf
DOCX
Introduction to r
PPTX
MatplotLib.pptx
PDF
Chapter 1: Linear Regression
Chapter3_Visualizations2.pdf
Data Visualization using Matplotlib to understand Graphs
Python chart plotting using Matplotlib.pptx
Unit3-v1-Plotting and Visualization.pptx
CE344L-200365-Lab2.pdf
Introduction to r
MatplotLib.pptx
Chapter 1: Linear Regression

Similar to Data Science.pptx00000000000000000000000 (20)

PPTX
matplotlib.pptxdsfdsfdsfdsdsfdsdfdsfsdf cvvf
PPTX
Basic Analysis using Python
PPTX
Matplot Lib Practicals artificial intelligence.pptx
PPT
R Programming Intro
DOCX
Data Manipulation with Numpy and Pandas in PythonStarting with N
DOCX
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
PPTX
COM1407: Arrays
PPTX
Matplotlib yayyyyyyyyyyyyyin Python.pptx
PPTX
R Programming.pptx
PDF
Python.pdf
PDF
23UCACC11 Python Programming (MTNC) (BCA)
PDF
Forecast stock prices python
PDF
Visual Aids for Exploratory Data Analysis.pdf
PPTX
Time Series.pptx
PPTX
CIV1900 Matlab - Plotting & Coursework
PPTX
matplotlib _
PPTX
PDF
De-Cluttering-ML | TechWeekends
PDF
R_CheatSheet.pdf
PDF
Py lecture5 python plots
matplotlib.pptxdsfdsfdsfdsdsfdsdfdsfsdf cvvf
Basic Analysis using Python
Matplot Lib Practicals artificial intelligence.pptx
R Programming Intro
Data Manipulation with Numpy and Pandas in PythonStarting with N
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
COM1407: Arrays
Matplotlib yayyyyyyyyyyyyyin Python.pptx
R Programming.pptx
Python.pdf
23UCACC11 Python Programming (MTNC) (BCA)
Forecast stock prices python
Visual Aids for Exploratory Data Analysis.pdf
Time Series.pptx
CIV1900 Matlab - Plotting & Coursework
matplotlib _
De-Cluttering-ML | TechWeekends
R_CheatSheet.pdf
Py lecture5 python plots
Ad

Recently uploaded (20)

PDF
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
PDF
Exploratory_Data_Analysis_Fundamentals.pdf
PDF
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
PPT
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
PDF
Categorization of Factors Affecting Classification Algorithms Selection
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PPTX
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
PPTX
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
PPTX
communication and presentation skills 01
PPTX
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
PDF
22EC502-MICROCONTROLLER AND INTERFACING-8051 MICROCONTROLLER.pdf
PDF
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
PPTX
Feature types and data preprocessing steps
PDF
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PPTX
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PPTX
Amdahl’s law is explained in the above power point presentations
PPTX
Current and future trends in Computer Vision.pptx
PDF
Soil Improvement Techniques Note - Rabbi
Improvement effect of pyrolyzed agro-food biochar on the properties of.pdf
Exploratory_Data_Analysis_Fundamentals.pdf
BIO-INSPIRED ARCHITECTURE FOR PARSIMONIOUS CONVERSATIONAL INTELLIGENCE : THE ...
INTRODUCTION -Data Warehousing and Mining-M.Tech- VTU.ppt
Categorization of Factors Affecting Classification Algorithms Selection
Fundamentals of safety and accident prevention -final (1).pptx
tack Data Structure with Array and Linked List Implementation, Push and Pop O...
CURRICULAM DESIGN engineering FOR CSE 2025.pptx
communication and presentation skills 01
ASME PCC-02 TRAINING -DESKTOP-NLE5HNP.pptx
22EC502-MICROCONTROLLER AND INTERFACING-8051 MICROCONTROLLER.pdf
A SYSTEMATIC REVIEW OF APPLICATIONS IN FRAUD DETECTION
Feature types and data preprocessing steps
Level 2 – IBM Data and AI Fundamentals (1)_v1.1.PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
Graph Data Structures with Types, Traversals, Connectivity, and Real-Life App...
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
Amdahl’s law is explained in the above power point presentations
Current and future trends in Computer Vision.pptx
Soil Improvement Techniques Note - Rabbi
Ad

Data Science.pptx00000000000000000000000

  • 4. Data visualization Matplotlib  For simple bar charts, line charts, and scatterplots, it works pretty well.  If you are interested in producing elaborate interactive visualizations for the Web it is likely not the right choice.  we will be using the matplotlib.pyplot module .  When you import matplotlib.pyplot using the standard convention import matplotlib.pyplot as plt,  you gain access to a wide range of functions and methods that allow you to create various types of plots, customize them, and add annotations. Some common types of plots you can create with Matplotlib include line plots, scatter plots, bar plots, histograms, and more
  • 5. import matplotlib.pyplot as plt # Sample data x = [1, 2, 3, 4, 5] y = [2, 4, 6, 8, 10] # Plotting the data plt.plot(x, y) # Adding labels and title plt.xlabel('X-axis’) plt.ylabel('Y-axis’) plt.title('Simple Line Plot')
  • 6. from matplotlib import pyplot as plt
      years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
      gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
      # create a line chart, years on x-axis, gdp on y-axis
      plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
      # add a title
      plt.title("Nominal GDP")
      # add a label to the y-axis
      plt.ylabel("Billions of $")
      plt.show()
  • 8. import matplotlib.pyplot as plt
      # Sample data
      categories = ['A', 'B', 'C', 'D', 'E']
      values = [10, 20, 15, 25, 30]
      # Creating a bar plot
      plt.bar(categories, values)
      # Adding labels and title
      plt.xlabel('Categories')
      plt.ylabel('Values')
      plt.title('Basic Bar Plot')
      # Display the plot
      plt.show()
  • 9. from matplotlib import pyplot as plt
      movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
      num_oscars = [5, 11, 3, 8, 10]
      # bars are by default width 0.8, so we'll add 0.1 to the left coordinates
      # so that each bar is centered
      xs = [i + 0.1 for i, _ in enumerate(movies)]
      # plot bars with left x-coordinates [xs], heights [num_oscars];
      # align='edge' makes x the left edge (Matplotlib 2.0+ centers bars on x by default)
      plt.bar(xs, num_oscars, align='edge')
      plt.ylabel("# of Academy Awards")
      plt.title("My Favorite Movies")
      # label x-axis with movie names at bar centers
      plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies)
      plt.show()
  • 10. In the list comprehension xs = [i + 0.1 for i, _ in enumerate(movies)]:
      • enumerate(movies) loops through the list of movies, yielding both the index i and the movie name. The name is bound to _ because only the index is needed.
      • Adding 0.1 to each index gives the left x-coordinate of each bar. With the default width of 0.8, a bar that runs from i + 0.1 to i + 0.9 is centered at i + 0.5.
      • The adjusted indices are stored in the list xs, which is used as the x-coordinates for plotting the bars.
      In the rest of the code:
      • plt.xticks() sets the x-axis ticks. The first argument is the list of x-coordinates where the ticks appear (here i + 0.5, the bar centers).
      • The second argument is the list of tick labels, which in this case are the movie names.
      • plt.show() is called to display the plot.
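To make the role of enumerate() concrete, here is a minimal sketch without any plotting. It uses a shortened movie list, and the names pairs and tick_positions are illustrative only, not from the slides.

```python
movies = ["Annie Hall", "Ben-Hur", "Casablanca"]

# enumerate yields (index, item) pairs
pairs = list(enumerate(movies))

# keep only the index (the name is discarded via _) and shift it by 0.1
xs = [i + 0.1 for i, _ in enumerate(movies)]

# tick positions sit at index + 0.5, the middle of each unit interval
tick_positions = [i + 0.5 for i, _ in enumerate(movies)]

print(pairs)
print(xs)
print(tick_positions)
```

The underscore is only a naming convention for "a value I will not use"; Python still assigns the movie name to it.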
  • 12. from collections import Counter
      from matplotlib import pyplot as plt
      grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]
      decile = lambda grade: grade // 10 * 10
      histogram = Counter(decile(grade) for grade in grades)
      plt.bar([x - 4 for x in histogram.keys()],  # shift each bar to the left by 4
              list(histogram.values()),           # give each bar its correct height
              8,                                  # give each bar a width of 8
              align='edge')                       # x is the left edge (Matplotlib 2.0+ centers by default)
      plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105, y-axis from 0 to 5
      plt.xticks([10 * i for i in range(11)])     # x-axis labels at 0, 10, ..., 100
      plt.xlabel("Decile")
      plt.ylabel("# of Students")
      plt.title("Distribution of Exam 1 Grades")
      plt.show()
  • 13. from collections import Counter
      # Create a Counter from a list
      my_list = ['a', 'b', 'c', 'a', 'b', 'a']
      my_counter = Counter(my_list)
      # Access counts
      print(my_counter['a'])       # Output: 3 (since 'a' appears 3 times)
      # Access unique elements and their counts
      print(my_counter.keys())     # Output: dict_keys(['a', 'b', 'c'])
      print(my_counter.values())   # Output: dict_values([3, 2, 1])
      # Arithmetic operations
      other_list = ['a', 'b', 'c', 'a', 'a']
      other_counter = Counter(other_list)
      print(my_counter + other_counter)  # Output: Counter({'a': 6, 'b': 3, 'c': 2})
  • 14. Line Charts
      As we saw already, we can make line charts using plt.plot().
      variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
      bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
      total_error = [x + y for x, y in zip(variance, bias_squared)]
      xs = [i for i, _ in enumerate(variance)]
      # we can make multiple calls to plt.plot
      # to show multiple series on the same chart
      plt.plot(xs, variance, 'g-', label='variance')       # green solid line
      plt.plot(xs, bias_squared, 'r-.', label='bias^2')    # red dot-dashed line
      plt.plot(xs, total_error, 'b:', label='total error') # blue dotted line
      # because we've assigned labels to each series
      # we can get a legend for free
      # loc=9 means "top center"
      plt.legend(loc=9)
      plt.xlabel("model complexity")
      plt.title("The Bias-Variance Tradeoff")
      plt.show()
  • 16. Scatterplots
      A scatterplot is the right choice for visualizing the relationship between two paired sets of data. For example, the relationship between the number of friends your users have and the number of minutes they spend on the site every day:
      friends = [ 70,  65,  72,  63,  71,  64,  60,  64,  67]
      minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
      labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
      plt.scatter(friends, minutes)
      # label each point
      for label, friend_count, minute_count in zip(labels, friends, minutes):
          plt.annotate(label,
                       xy=(friend_count, minute_count),  # put the label with its point
                       xytext=(5, -5),                   # but slightly offset
                       textcoords='offset points')
      plt.title("Daily Minutes vs. Number of Friends")
      plt.xlabel("# of friends")
      plt.ylabel("daily minutes spent on the site")
      plt.show()
  • 17. Scatterplots
      In the line
      for label, friend_count, minute_count in zip(labels, friends, minutes):
      Python's zip() function is used to iterate over multiple lists (labels, friends, and minutes) simultaneously.
      • labels contains the label for each data point.
      • friends contains the number of friends for each data point.
      • minutes contains the minutes spent on the site for each data point.
      By using zip(), we iterate over these lists together. In each iteration, label, friend_count, and minute_count correspond to the current elements of labels, friends, and minutes, respectively.
      plt.annotate(label, xy=(friend_count, minute_count), xytext=(5, -5), textcoords='offset points') then annotates the scatter plot with the label for each point. It places the text at the specified coordinates (xy=(friend_count, minute_count)) with a small offset (xytext=(5, -5), measured in points) from that position.
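The behavior of zip() described above can be checked without drawing anything. This is a small sketch using the first three values from each list on the slide; the names triples and uneven are illustrative.

```python
labels = ['a', 'b', 'c']
friends = [70, 65, 72]
minutes = [175, 170, 205]

# zip pairs up the i-th elements of every list into one tuple
triples = list(zip(labels, friends, minutes))

for label, friend_count, minute_count in triples:
    print(label, friend_count, minute_count)

# zip stops at the shortest input, silently dropping the extras
uneven = list(zip([1, 2, 3], ['x', 'y']))
print(uneven)
```

Because zip() truncates silently, it is worth confirming that the three lists have the same length before annotating points.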
  • 19. 1. Bar Chart:
      • Use bar charts to represent categorical data or data that can be divided into distinct groups.
      • Best for comparing values between different categories or groups.
      • Useful for showing discrete data points or data that doesn't have a natural order.
      • Suitable for showing changes over time when time is divided into distinct intervals (e.g., months, years).
      Examples of when to use bar charts:
      • Comparing sales performance of different products.
      • Showing population distribution by country.
      • Displaying the frequency of occurrence of different categories.
      2. Line Chart:
      • Use line charts to visualize trends and patterns in continuous data over time.
      • Best for showing changes and trends over a continuous scale (e.g., time, temperature, distance).
      • Ideal for illustrating relationships between variables and identifying patterns such as growth, decline, or fluctuations.
      • Also suitable for displaying multiple series of data on the same chart for comparison.
      Examples of when to use line charts:
      • Showing stock price fluctuations over time.
      • Visualizing temperature changes throughout the year.
      • Displaying trends in website traffic over months or years.
      In summary, choose a bar chart when you want to compare discrete categories or groups, and opt for a line chart when you need to visualize trends or patterns in continuous data over time.
  • 21. Vectors
      • Vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.
      • For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age). If you're teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).
      height_weight_age = [70,   # inches
                           170,  # pounds
                           40]   # years
      grades = [95,  # exam1
                80,  # exam2
                75,  # exam3
                62]  # exam4
  • 23. def vector_add(v, w):
          """adds corresponding elements"""
          return [v_i + w_i for v_i, w_i in zip(v, w)]
      Similarly, to subtract two vectors we just subtract corresponding elements:
      def vector_subtract(v, w):
          """subtracts corresponding elements"""
          return [v_i - w_i for v_i, w_i in zip(v, w)]
      def scalar_multiply(c, v):
          """c is a number, v is a vector"""
          return [c * v_i for v_i in v]
  • 24. def vector_sum(vectors):
          """sums all corresponding elements"""
          result = vectors[0]                     # start with the first vector
          for vector in vectors[1:]:              # then loop over the others
              result = vector_add(result, vector) # and add them to the result
          return result
      We'll also need to be able to multiply a vector by a scalar, which we do simply by multiplying each element of the vector by that number:
      def scalar_multiply(c, v):
          """c is a number, v is a vector"""
          return [c * v_i for v_i in v]
  • 25. In Python, a tuple is a collection data type similar to a list, but with one key difference: tuples are immutable, meaning that once they are created, their elements cannot be changed. Tuples are defined by enclosing comma-separated values within parentheses (). Here's a basic example of a tuple:
      my_tuple = (1, 2, 'a', 'b', True)
      Tuples can contain elements of different data types, including integers, floats, strings, booleans, and even other tuples or data structures. You can access elements of a tuple using indexing, just like with lists:
      print(my_tuple[0])  # Output: 1
      print(my_tuple[2])  # Output: a
  • 26. Tuples support many of the same operations as lists, such as slicing, concatenation, and repetition:
      tuple1 = (1, 2, 3)
      tuple2 = ('a', 'b', 'c')
      # Slicing
      print(tuple1[:2])  # Output: (1, 2)
      # Concatenation
      tuple3 = tuple1 + tuple2
      print(tuple3)      # Output: (1, 2, 3, 'a', 'b', 'c')
      # Repetition
      tuple4 = tuple2 * 2
      print(tuple4)      # Output: ('a', 'b', 'c', 'a', 'b', 'c')
  • 27. However, because tuples are immutable, you cannot modify individual elements:
      my_tuple[0] = 5  # Raises TypeError because tuples are immutable
      Tuples are commonly used in Python for representing fixed collections of items, returning multiple values from a function, and serving as keys in dictionaries (possible because they are immutable, provided their elements are hashable too).
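The three uses just listed can be seen in a short sketch. The helper min_max and the grid dictionary are hypothetical names introduced only for illustration.

```python
my_tuple = (1, 2, 'a', 'b', True)

# 1. Immutability: item assignment raises TypeError
try:
    my_tuple[0] = 5
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# 2. Returning multiple values: the function really returns one tuple,
#    which the caller unpacks into two names
def min_max(values):
    return min(values), max(values)

lowest, highest = min_max([3, 1, 4])
print(lowest, highest)

# 3. Tuples as dictionary keys (a list could not be used here)
grid = {(0, 0): "origin", (1, 2): "some point"}
print(grid[(0, 0)])
```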
  • 38. Vector addition
      If two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0], whose second element is v[1] + w[1], and so on. (If they're not the same length, then we're not allowed to add them.) For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1], or [3, 3].
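A zip-based vector_add does not enforce the same-length rule on its own, because zip() silently stops at the shorter input. The sketch below adds an explicit length check; the check and the ValueError are an assumption about how one might enforce the rule, not something the slides specify.

```python
def vector_add(v, w):
    """Add corresponding elements; refuse vectors of different lengths."""
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]

result = vector_add([1, 2], [2, 1])
print(result)

# A length mismatch is reported instead of being silently truncated
try:
    vector_add([1, 2], [1])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```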
  • 39. Notes
      The %matplotlib inline command is used in Jupyter Notebooks or IPython environments to display Matplotlib plots directly within the notebook. It ensures that plots are rendered inline, meaning they appear directly below the code cell that generates them. With %matplotlib inline, the notebook shows Matplotlib plots without the need for additional commands like plt.show().
  • 40. x = [10, 20, 30, 40]
      # y is not defined on this slide; it must already exist and have the same length as x
      plt.plot(x, y, 'r')
      [<matplotlib.lines.Line2D at 0xa6e4dd2128>]
      plt.xlabel('X axis title')
      plt.ylabel('Y axis title')
      plt.title("Curve plot")
      plt.plot(x, y, 'r', label='curve')
      plt.legend()
      <matplotlib.legend.Legend at 0xa6e3b4e
  • 41. The plt.plot(x, y, 'r') command you've used plots the data in arrays x and y using red color ('r'). Here's what each part of the command does: plt.plot(): This function is used to create a line plot. x and y: These are the data arrays to be plotted along the x and y axes, respectively. 'r': This specifies the color of the line. In this case, 'r' stands for red. You can use different color abbreviations ('b' for blue, 'g' for green, etc.) or full color names ('red', 'blue', 'green', etc.). So, plt.plot(x, y, 'r') will create a line plot of y against x with a red color line
  • 42. The np.arange() function in NumPy is used to create an array with evenly spaced values within a specified interval. Here's how it works:
      np.arange(start, stop, step)
      • start: The starting value of the sequence. It defaults to 0 if only stop is given.
      • stop: The end value of the sequence, not included.
      • step: The step size between each pair of consecutive values. It defaults to 1 if not provided.
      For example, np.arange(0, 10) will generate an array containing the integers from 0 up to (but not including) 10, with a default step size of 1. So, the resulting array will be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
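The effect of the step argument is easiest to see side by side. A minimal sketch, assuming NumPy is installed; the variable names are illustrative.

```python
import numpy as np

# Default step of 1: 0, 1, ..., 9 (stop is excluded)
default_step = np.arange(0, 10)

# Explicit step of 2: every second integer, still excluding stop
evens = np.arange(0, 10, 2)

# A single argument is treated as stop, with start defaulting to 0
only_stop = np.arange(5)

print(default_step.tolist())
print(evens.tolist())
print(only_stop.tolist())
```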
  • 43. np.linspace(0, 10, 5)
      array([ 0. ,  2.5,  5. ,  7.5, 10. ])
      np.linspace(0, 10, 50)
      array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
              1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
              2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
              3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
              4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
              5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
              6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
              7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
              8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
              9.18367347,  9.3877551 ,  9.59183673,
  • 44. labels = ['a', 'b', 'c']
      my_list = [10, 20, 30]
      arr = np.array([10, 20, 30])
      d = {'a': 10, 'b': 20, 'c': 30}
      • labels: a list containing three strings: 'a', 'b', and 'c'.
      • my_list: a list containing three integers: 10, 20, and 30.
      • arr: a NumPy array created using np.array(), containing the same integers as my_list.
      • d: a dictionary where the keys are strings ('a', 'b', 'c') and the values are integers (10, 20, 30).
      These data structures store similar information but in different ways and with different functionality. Lists are ordered collections, NumPy arrays are arrays of homogeneous data, and dictionaries are mappings of keys to values.
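One practical consequence of these differences shows up with the * operator. The sketch below reuses the values from the slide; doubled_list and doubled_arr are illustrative names, and NumPy is assumed to be installed.

```python
import numpy as np

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array(my_list)
d = dict(zip(labels, my_list))  # same mapping as the literal on the slide

# On a list, * means repetition
doubled_list = my_list * 2

# On a NumPy array, * is applied to every element
doubled_arr = arr * 2

# A dictionary is looked up by key rather than by position
value_for_b = d['b']

print(doubled_list)
print(doubled_arr.tolist())
print(value_for_b)
```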
  • 45. Pandas is a popular open-source Python library used for data manipulation and analysis. It provides easy-to-use data structures and functions for working with structured data, such as tables or spreadsheet-like data, making it a fundamental tool for data scientists, analysts, and researchers. A key feature of Pandas:
      • DataFrame: The primary data structure in Pandas is the DataFrame, a two-dimensional labeled data structure with columns of potentially different types. It resembles a spreadsheet or SQL table, and you can think of it as a dictionary of Series objects, where each Series represents a column.
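The "dictionary of Series" view can be illustrated with a tiny table. This is a sketch assuming pandas is installed; the column names label and value are made up for the example.

```python
import pandas as pd

# Build a DataFrame from a dictionary: each key becomes a column
df = pd.DataFrame({
    "label": ["a", "b", "c"],   # a column of strings
    "value": [10, 20, 30],      # a column of integers
})

# Selecting one column by name returns a Series
value_column = df["value"]

print(df)
print(type(value_column).__name__)
print(df.shape)
```

Each column keeps its own data type, which is what allows strings and numbers to sit in the same table.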
  • 46. import matplotlib.pyplot as plt
      # Data for demonstration
      x = [1, 2, 3, 4]
      y = [1, 4, 9, 16]
      # Create a figure to hold 4 rows and 2 columns of subplots
      plt.figure(figsize=(10, 10))
      # Loop through each subplot position in the 4x2 grid
      for i in range(1, 9):  # 1 to 8 for a 4x2 grid
          plt.subplot(4, 2, i)
          plt.plot(x, y)
          plt.title(f'Subplot {i}')  # Label each subplot
      # Adjust layout to prevent overlap
      plt.tight_layout()
      # Display the plot
      plt.show()
      [Layout diagram: a 4x2 grid of subplots numbered row by row, with Subplot 1 and Subplot 2 in the first row.]
  • 52. Central Tendencies
      We'll want some notion of where our data is centered. Most commonly we'll use the mean (or average), which is just the sum of the data divided by its count:
      def mean(x):
          return sum(x) / len(x)
      mean(num_friends)
      We'll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even).
  • 54. from collections import Counter
      import matplotlib.pyplot as plt
      # Example list of friend counts
      num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
      friend_counts = Counter(num_friends)
      xs = range(101)                        # Assuming the largest value is 100
      ys = [friend_counts[x] for x in xs]    # height is the number of people with x friends
      plt.bar(xs, ys)
      plt.axis([0, 101, 0, max(ys) + 1])     # y-axis limit is one more than the maximum count
      plt.title("Histogram of Friend Counts")
      plt.xlabel("# of friends")
      plt.ylabel("# of people")
      plt.show()
  • 59. sorted_values = sorted(num_friends)
      sorted_values
      [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
      smallest_value = sorted_values[0]
      smallest_value
      2
      second_largest_value = sorted_values[-2]
      second_largest_value
      7
  • 60. Median
      def median(v):
          """finds the 'middle-most' value of v"""
          n = len(v)
          sorted_v = sorted(v)
          midpoint = n // 2
          if n % 2 == 1:
              # if odd, return the middle value
              return sorted_v[midpoint]
          else:
              # if even, return the average of the middle values
              lo = midpoint - 1
              hi = midpoint
              return (sorted_v[lo] + sorted_v[hi]) / 2
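A quick check of both branches of median() follows. The function is repeated so the sketch runs on its own; the short test lists are made up, while num_friends is the list used on the earlier slides.

```python
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # odd count: the single middle value
        return sorted_v[midpoint]
    # even count: average of the two middle values
    return (sorted_v[midpoint - 1] + sorted_v[midpoint]) / 2

num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

odd_case = median([1, 3, 2])       # sorted: [1, 2, 3]
even_case = median([1, 2, 3, 4])   # middle values are 2 and 3
friends_median = median(num_friends)

print(odd_case, even_case, friends_median)
```

Unlike the mean, the median depends on the order of the values, which is why the function sorts a copy of the data first.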
  • 61. def quantile(x, p):
          p_index = int(p * len(x))
          return sorted(x)[p_index]
      x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
      p = 0.25  # Desired quantile (25th percentile)
      result = quantile(x, p)
      p_index = int(p * len(x)) calculates the index corresponding to the quantile p in the sorted dataset x:
      • len(x) returns the number of elements in the dataset x, i.e., the length of the dataset.
      • p is the quantile you want to calculate, represented as a value between 0 and 1.
      • p × len(x) calculates the position in the sorted dataset corresponding to the desired quantile. Since p is a fraction between 0 and 1, multiplying it by the length of the dataset gives the position at which the quantile falls once the data is sorted.
      • int(p × len(x)) takes the integer part of the result. This ensures that the index is an integer value, as list indices in Python must be integers.
  • 62. Suppose x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6] and you want to calculate the 25th percentile, p = 0.25. Then p × len(x) = 0.25 × 15 = 3.75. Taking the integer part of 3.75 gives 3, so the 25th percentile of the dataset x corresponds to the element at index 3 in the sorted dataset [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8], which is 3.
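The worked example can be confirmed by running the function. The sketch repeats quantile() so it is self-contained and adds two more values of p for comparison; the variable names q25, q50, and q90 are illustrative.

```python
def quantile(x, p):
    """Return the value at position int(p * len(x)) of the sorted data."""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

q25 = quantile(x, 0.25)  # index int(3.75) = 3
q50 = quantile(x, 0.50)  # index int(7.5) = 7
q90 = quantile(x, 0.90)  # index int(13.5) = 13

print(q25, q50, q90)
```

Note that this simple definition only works for p strictly below 1: with p = 1.0 the index equals len(x), which is out of range and raises IndexError.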
  • 79. Dependence and Independence If we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it’s not the case that both flips are Tails.) These two events are dependent.
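The two coin-flip claims can be verified by listing all four equally likely outcomes. The sketch below uses the standard criterion that two events E and F are independent exactly when P(E and F) = P(E) × P(F); the helper prob and the event names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All outcomes of two fair flips: HH, HT, TH, TT (each with probability 1/4)
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def both_tails(o):
    return o == ("T", "T")

# First flip Heads vs. second flip Heads: the product rule holds
p_joint_heads = prob(lambda o: first_heads(o) and second_heads(o))
heads_independent = p_joint_heads == prob(first_heads) * prob(second_heads)

# First flip Heads vs. both flips Tails: the product rule fails
p_joint_tails = prob(lambda o: first_heads(o) and both_tails(o))
tails_independent = p_joint_tails == prob(first_heads) * prob(both_tails)

print(heads_independent, tails_independent)
```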
  • 84. import random
      def random_kid():
          return random.choice(["boy", "girl"])
      both_girls = 0
      older_girl = 0
      either_girl = 0
      random.seed(0)
      for _ in range(10000):
          younger = random_kid()
          older = random_kid()
          if older == "girl":
              older_girl += 1
          if older == "girl" and younger == "girl":
              both_girls += 1
          if older == "girl" or younger == "girl":
              either_girl += 1
      print("P(both | older):", both_girls / older_girl)    # 0.514 ~ 1/2
      print("P(both | either):", both_girls / either_girl)  # close to 1/3
  • 85. We want to calculate two conditional probabilities for a family with two children:
      • The probability that both children are girls given that the older child is a girl.
      • The probability that both children are girls given that at least one of the children is a girl.
      random_kid() function: Returns either "boy" or "girl" randomly with equal probability (0.5 each).
      Counters:
      • both_girls: Counts the number of times both children are girls.
      • older_girl: Counts the number of times the older child is a girl.
      • either_girl: Counts the number of times at least one child is a girl.
      Simulation loop: For 10,000 iterations, the code simulates the gender of two children (older and younger) and updates the counters accordingly.
      Probabilities
      P(both | older): The probability that both children are girls given that the older child is a girl. It is estimated as both_girls / older_girl: the number of times both children are girls divided by the number of times the older child is a girl.
  • 86. P(both | either): The probability that both children are girls given that at least one of the children is a girl. It is estimated as both_girls / either_girl: the number of times both children are girls divided by the number of times at least one child is a girl.
      Mathematical explanation
      P(both | older): Given that the older child is a girl, the younger child can be either a girl or a boy with equal probability. Therefore, the probability that both are girls is 1/2.
      P(both | either): Of the 4 equally likely combinations (Girl-Girl, Girl-Boy, Boy-Girl, Boy-Boy), 3 have at least one girl: Girl-Girl, Girl-Boy, and Boy-Girl. Only one of those 3 is Girl-Girl, so the probability is 1/3.
      Simulation results
      • P(both | older): The simulation result should be close to 0.5 (which is 1/2).
      • P(both | either): The simulation result should be close to 1/3.
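The counting argument can be reproduced exactly, without randomness, by enumerating the four equally likely families. This is a sketch with illustrative variable names; each family is written as an (older, younger) pair.

```python
from fractions import Fraction
from itertools import product

# Every (older, younger) combination; all four are equally likely
families = list(product(["boy", "girl"], repeat=2))

both_girls = [f for f in families if f == ("girl", "girl")]
older_girl = [f for f in families if f[0] == "girl"]
either_girl = [f for f in families if "girl" in f]

# Conditional probability = favourable cases / cases satisfying the condition
# (valid here because "both girls" implies each condition)
p_both_given_older = Fraction(len(both_girls), len(older_girl))
p_both_given_either = Fraction(len(both_girls), len(either_girl))

print(p_both_given_older)
print(p_both_given_either)
```

The simulation estimates these same two fractions, so its printed values should land near 0.5 and 0.333.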
  • 97. Program 5 Code:
      import re
      import pandas as pd
      import numpy as np
      # Import the data into a DataFrame
      books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')
      # Display the first few rows of the DataFrame
      print("Original DataFrame:")
      print(books_df.head())
      # Find and drop the columns which are irrelevant for the book information
      columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                         'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
      books_df.drop(columns=columns_to_drop, inplace=True)
      # Change the index of the DataFrame
      books_df.set_index('Identifier', inplace=True)
      # Tidy up fields such as the date of publication with a simple regular expression
      def clean_date(date):
          if isinstance(date, str):
              match = re.search(r'\d{4}', date)  # first run of four digits
              if match:
                  return match.group()
          return np.nan  # not a string, or no four-digit year found
  • 98. books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)
      # Combine str methods with NumPy to clean columns
      # (na=False treats missing values as "does not contain")
      books_df['Place of Publication'] = np.where(
          books_df['Place of Publication'].str.contains('London', na=False),
          'London',
          np.where(
              books_df['Place of Publication'].str.contains('Oxford', na=False),
              'Oxford',
              books_df['Place of Publication'].replace(r'^\s*$', 'Unknown', regex=True)
          )
      )
      # Display the cleaned DataFrame
      print("\nCleaned DataFrame:")
      print(books_df.head())
  • 99. Function definition: The function clean_date takes one parameter, date.
      def clean_date(date):
      Type check: It first checks whether the input date is a string using the isinstance function.
      if isinstance(date, str):
      Regular expression search: If date is a string, the function uses re.search to look for four consecutive digits (which typically represent a year) in the string.
      match = re.search(r'\d{4}', date)
      re.search scans the input string for the first location where the regular expression pattern \d{4} (any four digits) matches. If such a location is found, re.search returns a match object; otherwise, it returns None.
  • 100. Extracting the year: If a match is found (i.e., the match object is not None), the function extracts the matched string (the year) using the group method of the match object.
      if match:
          return match.group()
      Handling no match: If date is not a string, or if no four-digit number is found in the string, the function returns np.nan, which represents a missing value in data analysis.
      return np.nan
      Libraries
      • re: Provides regular expression matching operations.
      • numpy as np: Typically used for numerical and array operations. np.nan is a special floating-point value that stands for 'Not a Number' and is used to denote missing values.
  • 101. Examples:
      • Input: "April 20, 1995"    Output: "1995"
      • Input: "The year is 2023"  Output: "2023"
      • Input: "No year here"      Output: np.nan
      • Input: 12345               Output: np.nan (since the input is not a string)
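These four cases can be run directly. The sketch below is a standalone version of clean_date that swaps np.nan for the standard library's math.nan (an assumption made so that it runs without NumPy; the two are the same floating-point value). The pattern is written \d{4}, with the backslash that makes d mean "digit".

```python
import math
import re

def clean_date(date):
    """Return the first four-digit run in a string, or NaN if there is none."""
    if isinstance(date, str):
        match = re.search(r"\d{4}", date)
        if match:
            return match.group()
    return math.nan  # stand-in for np.nan in this sketch

inputs = ["April 20, 1995", "The year is 2023", "No year here", 12345]
results = [clean_date(value) for value in inputs]

for value, cleaned in zip(inputs, results):
    print(repr(value), "->", repr(cleaned))
```

The extracted year is returned as a string, so a later step would still be needed to turn the column into numbers.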
  • 102. r'^\s*$': This is a regular expression pattern.
      • ^: Asserts the position at the start of the string.
      • \s*: Matches zero or more whitespace characters (spaces, tabs, newlines).
      • $: Asserts the position at the end of the string.
      Therefore, r'^\s*$' matches any string that contains only whitespace characters or is completely empty.
      'Unknown': This is the replacement value. Any string that matches the regular expression pattern is replaced with the string 'Unknown'.
      regex=True: This tells the replace method to interpret the first argument as a regular expression pattern. Without regex=True, the method would treat the pattern as a plain string and look for cells whose value is exactly the text ^\s*$, which would match nothing in most cases.
  • 103. What does this code do?
      The code replaces any entry in the 'Place of Publication' column that is empty or contains only whitespace with the string 'Unknown'. Setting regex=True ensures that the pattern is interpreted as a regular expression when identifying these entries.
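The whitespace pattern can be tried on plain strings with the standard re module, independently of pandas. In this sketch the sample values in places are made up, and the list comprehension mimics, rather than calls, the pandas replace step.

```python
import re

# Matches strings that are empty or consist only of whitespace
blank = re.compile(r"^\s*$")

places = ["London", "", "   ", "\t\n", "New York"]
cleaned = ["Unknown" if blank.match(p) else p for p in places]
print(cleaned)

# Without the backslash, "s*" means "zero or more letters s",
# so a whitespace-only string no longer matches
no_backslash = re.compile(r"^s*$")
matches_spaces = bool(no_backslash.match("   "))
matches_letters = bool(no_backslash.match("sss"))
print(matches_spaces, matches_letters)
```

Strings with text in them, including ones with inner spaces such as "New York", are left unchanged because ^ and $ force the whole string to be whitespace.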