Data Science.pptx00000000000000000000000

Data Science and its Applications
VI SEM

Data visualization
Matplotlib
• For simple bar charts, line charts, and scatterplots, Matplotlib works pretty well.
• If you are interested in producing elaborate interactive visualizations for the Web, it is likely not the right choice.
• We will be using the matplotlib.pyplot module.
• When you import matplotlib.pyplot using the standard convention import matplotlib.pyplot as plt, you gain access to a wide range of functions that let you create various types of plots, customize them, and add annotations. Common plot types include line plots, scatter plots, bar plots, and histograms.

import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Plotting the data
plt.plot(x, y)
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
# Display the plot
plt.show()

from matplotlib import pyplot as plt
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()

# Sample data
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 20, 15, 25, 30]
# Creating a bar plot
plt.bar(categories, values)
# Adding labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Basic Bar Plot')
# Display the plot
plt.show()

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]
# bars are by default width 0.8, so we'll add 0.1 to the left coordinates
# so that each bar is centered on i + 0.5
xs = [i + 0.1 for i, _ in enumerate(movies)]
# plot bars with left x-coordinates [xs], heights [num_oscars];
# align='edge' makes Matplotlib treat xs as left edges (by default it centers bars on xs)
plt.bar(xs, num_oscars, align='edge')
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")
# label x-axis with movie names at bar centers
plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies)
plt.show()

In this list comprehension:
• enumerate(movies) loops through the list of movies, providing both the index i and the movie name _. Since we are only interested in the index, we use i and ignore the name.
• Then 0.1 is added to each index so that each bar (of width 0.8) ends up centered when plotting.
• Finally, these adjusted indices are stored in the list xs, which is used as the x-coordinates for plotting the bars.
In this code:
• plt.xticks() is used to set the x-axis ticks. The first argument is the list of x-coordinates where you want the ticks to appear.
• The second argument is the list of tick labels, which in this case are the movie names.
• plt.show() is called to display the plot.
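As a quick check of the explanation above, the following sketch prints what enumerate() yields and the tick positions derived from it. It reuses the movies list from the slide; pairs, indices, and tick_positions are names introduced here for illustration only.

```python
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"]

# enumerate() yields (index, item) pairs
pairs = list(enumerate(movies))
print(pairs[:2])  # [(0, 'Annie Hall'), (1, 'Ben-Hur')]

# keep only the index; the underscore signals that the movie name is ignored
indices = [i for i, _ in enumerate(movies)]
print(indices)  # [0, 1, 2, 3, 4]

# tick positions sit at the bar centers: left edge (i + 0.1) plus half the bar width (0.4)
tick_positions = [i + 0.5 for i in indices]
print(tick_positions)  # [0.5, 1.5, 2.5, 3.5, 4.5]
```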

from collections import Counter
grades = [83,95,91,87,70,0,85,82,100,67,73,77,0]
decile = lambda grade: grade // 10 * 10
histogram = Counter(decile(grade) for grade in grades)
plt.bar([x - 4 for x in histogram.keys()],  # shift each bar to the left by 4
        list(histogram.values()),           # give each bar its correct height
        8,                                  # give each bar a width of 8
        align='edge')                       # treat the x values as left edges (the default centers bars)
plt.axis([-5, 105, 0, 5])                   # x-axis from -5 to 105,
                                            # y-axis from 0 to 5
plt.xticks([10 * i for i in range(11)]) # x-axis labels at 0, 10, ..., 100
plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam 1 Grades")
plt.show()

from collections import Counter
# Create a Counter from a list
my_list = ['a', 'b', 'c', 'a', 'b', 'a']
my_counter = Counter(my_list)
# Access counts
print(my_counter['a']) # Output: 3 (since 'a' appears 3 times)
# Access unique elements and their counts
print(my_counter.keys()) # Output: dict_keys(['a', 'b', 'c'])
print(my_counter.values()) # Output: dict_values([3, 2, 1])
# Arithmetic operations
other_list = ['a', 'b', 'c', 'a', 'a']
other_counter = Counter(other_list)
print(my_counter + other_counter) # Output: Counter({'a': 6, 'b': 3, 'c': 2})

Line Charts
As we saw already, we can make line charts using plt.plot().
variance = [1, 2, 4, 8, 16, 32, 64, 128, 256]
bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1]
total_error = [x + y for x, y in zip(variance, bias_squared)]
xs = [i for i, _ in enumerate(variance)]
# we can make multiple calls to plt.plot
# to show multiple series on the same chart
plt.plot(xs, variance, 'g-', label='variance') # green solid line
plt.plot(xs, bias_squared, 'r-.', label='bias^2') # red dot-dashed line
plt.plot(xs, total_error, 'b:', label='total error') # blue dotted line
# because we've assigned labels to each series
# we can get a legend for free
# loc=9 means "top center"
plt.legend(loc=9)
plt.xlabel("model complexity")
plt.title("The Bias-Variance Tradeoff")
plt.show()

Scatterplots
A scatterplot is the right choice for visualizing the relationship between two paired sets of
data. For example, consider the relationship between the number of friends
your users have and the number of minutes they spend on the site every day:
friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
plt.scatter(friends, minutes)
# label each point
for label, friend_count, minute_count in zip(labels, friends, minutes):
    plt.annotate(label,
                 xy=(friend_count, minute_count),  # put the label with its point
                 xytext=(5, -5),                   # but slightly offset
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()

Scatterplots
In the line for label, friend_count, minute_count in zip(labels, friends, minutes):
Python's zip() function is used to iterate over multiple lists (labels, friends, and
minutes) simultaneously.
labels contains the labels for each data point.
friends contains the number of friends for each data point.
minutes contains the minutes spent on the site for each data point.
By using zip(), we iterate over these lists together. In each iteration, label,
friend_count, and minute_count will correspond to the current elements from
labels, friends, and minutes lists, respectively.
plt.annotate(label, xy=(friend_count, minute_count), xytext=(5, -5),
textcoords='offset points') is then used to annotate the scatter plot with the label
for each point. This function places text at the specified coordinates
(xy=(friend_count, minute_count)) with a small offset (xytext=(5, -5)) from the
specified point.
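To make the behavior of zip() concrete, here is a minimal sketch using the first three entries of each list from the slide. The name triples is introduced here for illustration.

```python
labels = ['a', 'b', 'c']
friends = [70, 65, 72]
minutes = [175, 170, 205]

# zip() pairs up the elements that share a position in each list
triples = list(zip(labels, friends, minutes))
print(triples)  # [('a', 70, 175), ('b', 65, 170), ('c', 72, 205)]

# unpacking each triple in the loop header gives three named values per iteration
for label, friend_count, minute_count in zip(labels, friends, minutes):
    print(f"{label}: {friend_count} friends, {minute_count} minutes")
```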

Bar Chart:
•Use bar charts to represent categorical data or data that can be divided into distinct groups.
•Best for comparing values between different categories or groups.
•Useful for showing discrete data points or data that doesn't have a natural order.
•Suitable for showing changes over time when time is divided into distinct intervals (e.g., months, years).
Examples of when to use bar charts:
•Comparing sales performance of different products.
•Showing population distribution by country.
•Displaying the frequency of occurrence of different categories
Line Chart:
•Use line charts to visualize trends and patterns in continuous data over time.
•Best for showing changes and trends over a continuous scale (e.g., time, temperature, distance).
•Ideal for illustrating relationships between variables and identifying patterns such as growth, decline, or fluctuations.
•Also suitable for displaying multiple series of data on the same chart for comparison.
Examples of when to use line charts:
•Showing stock price fluctuations over time.
•Visualizing temperature changes throughout the year.
•Displaying trends in website traffic over months or years.
In summary, choose a bar chart when you want to compare discrete categories or groups, and opt for a line
chart when you need to visualize trends or patterns in continuous data over time.
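The guidance above can be sketched side by side. This is only an illustration under stated assumptions: the product and traffic numbers are made up, and the Agg backend is selected so the sketch also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs without a display
import matplotlib.pyplot as plt

# hypothetical data, made up for illustration
products = ["A", "B", "C"]
units_sold = [120, 90, 150]
months = [1, 2, 3, 4, 5, 6]
visits = [200, 220, 260, 250, 300, 340]

fig, (ax_bar, ax_line) = plt.subplots(1, 2, figsize=(8, 3))

# discrete categories: one bar per product
ax_bar.bar(products, units_sold)
ax_bar.set_title("Units sold by product")

# continuous trend over time: points joined by a line
ax_line.plot(months, visits, marker="o")
ax_line.set_title("Site visits by month")
ax_line.set_xlabel("month")

fig.tight_layout()
# in a notebook or interactive session, plt.show() would display the figure
```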

• Vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.
• For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age). If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).
height_weight_age = [70,   # inches,
                     170,  # pounds,
                     40]   # years
grades = [95,  # exam1
          80,  # exam2
          75,  # exam3
          62]  # exam4

def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i
            for v_i, w_i in zip(v, w)]
Similarly, to subtract two vectors we just subtract corresponding elements:
def vector_subtract(v, w):
    """subtracts corresponding elements"""
    return [v_i - w_i
            for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """sums all corresponding elements"""
    result = vectors[0]                      # start with the first vector
    for vector in vectors[1:]:               # then loop over the others
        result = vector_add(result, vector)  # and add them to the result
    return result
We’ll also need to be able to multiply a vector by a scalar, which we do simply by multiplying each element of the vector by that number:
def scalar_multiply(c, v):
    """c is a number, v is a vector"""
    return [c * v_i for v_i in v]
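A short usage sketch of these helpers follows. The functions are repeated so the block runs on its own, and the student_grades values are made-up numbers for illustration.

```python
def vector_add(v, w):
    """adds corresponding elements"""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def vector_sum(vectors):
    """sums all corresponding elements"""
    result = vectors[0]
    for vector in vectors[1:]:
        result = vector_add(result, vector)
    return result

def scalar_multiply(c, v):
    """c is a number, v is a vector"""
    return [c * v_i for v_i in v]

# three students' grades on four exams (made-up numbers)
student_grades = [[95, 80, 75, 62],
                  [85, 90, 65, 70],
                  [60, 70, 70, 78]]

# per-exam totals across all students
totals = vector_sum(student_grades)
print(totals)  # [240, 240, 210, 210]

doubled = scalar_multiply(2, [1, 2, 3])
print(doubled)  # [2, 4, 6]
```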

In Python, a tuple is a collection data type similar to a list, but with one key difference: tuples are immutable,
meaning once they are created, their elements cannot be changed or modified. Tuples are defined by
enclosing comma-separated values within parentheses ().
Here's a basic example of a tuple:
my_tuple = (1, 2, 'a', 'b', True)
Tuples can contain elements of different data types, including integers, floats, strings, booleans, and even
other tuples or data structures. You can access elements of a tuple using indexing, just like with lists:
print(my_tuple[0]) # Output: 1
print(my_tuple[2]) # Output: a

Tuples support many of the same operations as lists, such as slicing, concatenation, and repetition:
tuple1 = (1, 2, 3)
tuple2 = ('a', 'b', 'c')
# Slicing
print(tuple1[:2]) # Output: (1, 2)
# Concatenation
tuple3 = tuple1 + tuple2
print(tuple3) # Output: (1, 2, 3, 'a', 'b', 'c')
# Repetition
tuple4 = tuple2 * 2
print(tuple4) # Output: ('a', 'b', 'c', 'a', 'b', 'c')

However, because tuples are immutable, you cannot modify individual elements:
my_tuple[0] = 5 # Raises TypeError because tuples are immutable
Tuples are commonly used in Python for various purposes, such as representing fixed collections of items, returning multiple values from a function, and serving as keys in dictionaries (since they are immutable).
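The uses listed above can be sketched as follows. The function min_and_max and the grid dictionary are hypothetical names chosen for illustration.

```python
def min_and_max(values):
    """returns the smallest and largest value as a tuple"""
    return min(values), max(values)

# the returned tuple can be unpacked into separate names
lowest, highest = min_and_max([83, 95, 91, 87, 70])
print(lowest, highest)  # 70 95

# tuples can serve as dictionary keys
grid = {(0, 0): "origin", (1, 2): "point A"}
print(grid[(1, 2)])  # point A

# attempting to modify a tuple raises TypeError
point = (1, 2)
try:
    point[0] = 5
except TypeError as error:
    print("TypeError:", error)
```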

Vector addition
If two vectors v and w are the same length, their sum is just the vector whose first element is v[0] + w[0],
whose second element is v[1] + w[1], and so on. (If they’re not the same length, then we’re not allowed to
add them.)
For example, adding the vectors [1, 2] and [2, 1] results in [1 + 2, 2 + 1] or [3, 3].
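Note that zip() stops at the shorter of its inputs, so a zip-based addition does not by itself enforce the same-length rule. A minimal sketch of one way to enforce it is shown below; checked_vector_add is a hypothetical name, not a function from the slides.

```python
def checked_vector_add(v, w):
    """adds corresponding elements, refusing vectors of different lengths"""
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]

print(checked_vector_add([1, 2], [2, 1]))  # [3, 3]

# mismatched lengths are rejected instead of being silently truncated
try:
    checked_vector_add([1, 2], [1, 2, 3])
except ValueError as error:
    print("not allowed:", error)
```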

Notes
The %matplotlib inline command is used in Jupyter Notebooks or IPython environments to display
Matplotlib plots directly within the notebook. It ensures that plots are rendered inline, meaning they
appear directly below the code cell that generates them.
By using %matplotlib inline, you're setting up the notebook to show Matplotlib plots without the need for
additional commands like plt.show()

x = [10, 20, 30, 40]
y = [1, 4, 9, 16]  # assumed sample values; the slide does not show how y was defined
plt.plot(x, y, 'r')
plt.xlabel('X axis title')
plt.ylabel('Y axis title')
plt.title("Curve plot")
plt.plot(x, y, 'r', label='curve')
plt.legend()

The plt.plot(x, y, 'r') command you've used plots the data in arrays x and y using red color ('r').
Here's what each part of the command does:
plt.plot(): This function is used to create a line plot.
x and y: These are the data arrays to be plotted along the x and y axes, respectively.
'r': This specifies the color of the line. In this case, 'r' stands for red. You can use different color
abbreviations ('b' for blue, 'g' for green, etc.) or full color names ('red', 'blue', 'green', etc.).
So, plt.plot(x, y, 'r') will create a line plot of y against x with a red color line

The np.arange() function in NumPy is used to create an array with evenly spaced values within a
specified interval. Here's how it works:
np.arange(start, stop, step)
start: The starting value of the sequence.
stop: The end value of the sequence, not included.
step: The step size between each pair of consecutive values. It defaults to 1 if not provided.
For example, np.arange(0, 10) will generate an array containing integers from 0 up to (but not
including) 10, with a default step size of 1. So, the resulting array will be [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
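A few calls illustrate the parameters described above. This sketch assumes NumPy is installed; the variable names are chosen here for illustration.

```python
import numpy as np

# default step of 1; the stop value 10 is excluded
digits = np.arange(0, 10)
print(digits)  # [0 1 2 3 4 5 6 7 8 9]

# explicit step of 2
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]

# with a single argument, start defaults to 0
first_five = np.arange(5)
print(first_five)  # [0 1 2 3 4]
```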

np.linspace(0,10,5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
np.linspace(0,10,50)
array([ 0. , 0.20408163, 0.40816327, 0.6122449 ,
0.81632653, 1.02040816, 1.2244898 , 1.42857143,
1.63265306, 1.83673469, 2.04081633, 2.24489796,
2.44897959, 2.65306122, 2.85714286, 3.06122449,
3.26530612, 3.46938776, 3.67346939, 3.87755102,
4.08163265, 4.28571429, 4.48979592, 4.69387755,
4.89795918, 5.10204082, 5.30612245, 5.51020408,
5.71428571, 5.91836735, 6.12244898, 6.32653061,
6.53061224, 6.73469388, 6.93877551, 7.14285714,
7.34693878, 7.55102041, 7.75510204, 7.95918367,
8.16326531, 8.36734694, 8.57142857, 8.7755102 ,
8.97959184, 9.18367347, 9.3877551 , 9.59183673,

labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}
labels: This is a list containing three strings: 'a', 'b', and 'c'.
my_list: This is a list containing three integers: 10, 20, and 30.
arr: This is a NumPy array created using np.array(), containing the same integers as my_list
d: This is a dictionary where keys are strings ('a', 'b', 'c') and values are integers (10, 20, 30).
These data structures store similar information but in different ways and with different
functionalities. Lists are ordered collections, NumPy arrays are arrays of homogeneous data, and
dictionaries are mappings of keys to values

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides
easy-to-use data structures and functions for working with structured data, such as tables or spreadsheet-like data,
making it a fundamental tool for data scientists, analysts, and researchers.
Here are some key features of Pandas:
DataFrame: The primary data structure in Pandas is the DataFrame, which is a two-dimensional labeled
data structure with columns of potentially different types. It resembles a spreadsheet or SQL table, and you
can think of it as a dictionary of Series objects, where each Series represents a column
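The "dictionary of Series" view can be sketched with the small labels and my_list values used earlier in these slides. This assumes pandas is installed; the column names value and doubled are made up for illustration.

```python
import pandas as pd

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]

# a Series is a one-dimensional labeled array
series = pd.Series(data=my_list, index=labels)
print(series['b'])  # 20

# a DataFrame built from a dictionary of Series; each key becomes a column
df = pd.DataFrame({
    'value': series,
    'doubled': series * 2,
})
print(df.loc['c', 'doubled'])  # 60
print(list(df.columns))  # ['value', 'doubled']
```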

# Data for demonstration
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
# Create a figure with 4 rows and 2 columns of subplots
plt.figure(figsize=(10, 10))
# Loop through each subplot position in the 4x2 grid
for i in range(1, 9):  # 1 to 8 for a 4x2 grid
    plt.subplot(4, 2, i)
    plt.plot(x, y)
    plt.title(f'Subplot {i}')  # Label each subplot
# Adjust layout to prevent overlap
plt.tight_layout()
# Display the plot
plt.show()
Resulting layout: a 4x2 grid of subplots numbered left to right, top to bottom (Subplot 1 and Subplot 2 form the first row).

Central Tendencies
We’ll want some notion of where our data is centered.
We’ll use the mean (or average), which is just the sum of the data divided by its count:
def mean(x):
    return sum(x) / len(x)
mean(num_friends)
We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is
odd) or the average of the two middle-most values (if the number of data points is even).

from collections import Counter
# Example list of friend counts
num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
friend_counts = Counter(num_friends)
xs = range(101) # Assuming the largest value is 100
ys = [friend_counts[x] for x in xs] # height is the number of people with x friends
plt.bar(xs, ys)
plt.axis([0, 101, 0, max(ys) + 1]) # Setting the y-axis limit to one more than the maximum count
plt.title("Histogram of Friend Counts")
plt.xlabel("# of friends")
plt.ylabel("# of people")
plt.show()

sorted_values = sorted(num_friends)
sorted_values          # [2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8]
smallest_value = sorted_values[0]
smallest_value         # 2
second_largest_value = sorted_values[-2]
second_largest_value   # 7

Median
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2
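As a sanity check, the mean and median definitions can be run on the num_friends list from the slides and compared with Python's standard statistics module. The functions are repeated here so the sketch is self-contained.

```python
import statistics

def mean(x):
    return sum(x) / len(x)

def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        return sorted_v[midpoint]
    lo, hi = midpoint - 1, midpoint
    return (sorted_v[lo] + sorted_v[hi]) / 2

num_friends = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

print(mean(num_friends))     # 4.6
print(median(num_friends))   # 5   (odd count: the middle value)
print(median([1, 2, 3, 4]))  # 2.5 (even count: average of 2 and 3)

# cross-check against the standard library
print(median(num_friends) == statistics.median(num_friends))  # True
```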

def quantile(x, p):
    p_index = int(p * len(x))
    return sorted(x)[p_index]
x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]
p = 0.25 # Desired quantile (25th percentile)
result = quantile(x, p)
p_index = int(p * len(x)) calculates the index corresponding to the quantile p in the sorted dataset x.
len(x) returns the number of elements in the dataset x, i.e., the length of the dataset.
p is the quantile you want to calculate, represented as a value between 0 and 1.
p × len(x) calculates the position in the sorted dataset corresponding to the desired quantile. Since p is a fraction between 0 and 1, multiplying it by the length of the dataset gives the index at which the quantile would be if the data were sorted.
int(p × len(x)) takes the integer part of the result. This ensures that the index is an integer value, as indices in Python must be integers.

Suppose x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6] and you want to calculate the 25th percentile, so p = 0.25.
p × len(x) = 0.25 × 15 = 3.75.
Taking the integer part of 3.75 gives 3, so the 25th percentile of the dataset x corresponds to the element at index 3 in the sorted dataset, which is the value 3.
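The same arithmetic can be repeated for a few other values of p. This sketch reuses the quantile function and dataset from the slide; the extra p values are chosen here for illustration.

```python
def quantile(x, p):
    """returns the value at index int(p * len(x)) of the sorted data"""
    p_index = int(p * len(x))
    return sorted(x)[p_index]

x = [3, 5, 3, 8, 2, 5, 7, 5, 3, 6, 4, 3, 4, 5, 6]

# print p, the computed index, and the value found there
for p in [0.10, 0.25, 0.50, 0.90]:
    print(p, int(p * len(x)), quantile(x, p))
# 0.1 1 3
# 0.25 3 3
# 0.5 7 5
# 0.9 13 7
```

With this definition, p = 1 would give index len(x), which is out of range, so p is best kept strictly below 1.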

Dependence and Independence
If we flip a fair coin twice, knowing whether the first flip is Heads gives us no
information about whether the second flip is Heads. These events are independent.
On the other hand, knowing whether the first flip is Heads certainly gives us
information about whether both flips are Tails.
(If the first flip is Heads, then definitely it’s not the case that both flips are Tails.)
These two events are dependent.
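The two claims can be checked by enumerating the four equally likely outcomes of two flips. The sketch below uses the standard definition that two events are independent when the probability of both occurring equals the product of their individual probabilities; the helper names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# the four equally likely outcomes of two fair coin flips
outcomes = list(product("HT", repeat=2))

def probability(event):
    """fraction of outcomes for which event(outcome) is true"""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

first_heads = lambda o: o[0] == "H"
second_heads = lambda o: o[1] == "H"
both_tails = lambda o: o == ("T", "T")

# independent: P(first Heads and second Heads) equals the product of the two probabilities
p_joint = probability(lambda o: first_heads(o) and second_heads(o))
print(p_joint == probability(first_heads) * probability(second_heads))  # True

# dependent: P(first Heads and both Tails) is 0, but the product is 1/8
p_joint_tails = probability(lambda o: first_heads(o) and both_tails(o))
print(p_joint_tails, probability(first_heads) * probability(both_tails))  # 0 1/8
```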

import random
def random_kid():
    return random.choice(["boy", "girl"])
both_girls = 0
older_girl = 0
either_girl = 0
random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1
print("P(both | older):", both_girls / older_girl)    # close to 1/2
print("P(both | either):", both_girls / either_girl)  # close to 1/3

We want to calculate two conditional probabilities related to having girls in a family with two children:
• The probability that both children are girls given that the older child is a girl.
• The probability that both children are girls given that at least one of the children is a girl.
random_kid() function: Returns either "boy" or "girl" randomly with equal probability (0.5 each).
Counters:
both_girls: Counts the number of times both children are girls.
older_girl: Counts the number of times the older child is a girl.
either_girl: Counts the number of times at least one child is a girl.
Simulation Loop: For 10,000 iterations, the code simulates the gender of two children (older and younger). It updates the counters based on the genders of the children.
Probabilities
P(both | older): This is the probability that both children are girls given that the older child is a girl.
both_girls / older_girl: The number of times both children are girls divided by the number of times the older child is a girl.

P(both | either): This is the probability that both children are girls given that at least one of the children is a girl.
both_girls / either_girl: The number of times both children are girls divided by the number of times at least one child is a girl.
Mathematical Explanation
P(both | older): Given that the older child is a girl, the younger child can be either a girl or a boy with equal probability. Therefore, the probability that both are girls is 1/2.
P(both | either): This situation requires considering all possible combinations where at least one child is a girl: Girl-Girl, Girl-Boy, Boy-Girl.
Of the 4 equally likely combinations (Girl-Girl, Girl-Boy, Boy-Girl, Boy-Boy), 3 include at least one girl, and only one of those 3 is "Girl-Girl". Therefore, the probability is 1/3.
Simulation Results
P(both | older): The simulation result should be close to 0.5 (which is 1/2).
P(both | either): The simulation result should be close to 1/3.
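The exact values that the simulation approximates can be obtained by enumerating the four equally likely families. This is a small sketch of that counting argument; the variable names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# (older, younger) pairs, all four equally likely
families = list(product(["girl", "boy"], repeat=2))

both = [f for f in families if f == ("girl", "girl")]
older_girl = [f for f in families if f[0] == "girl"]
either_girl = [f for f in families if "girl" in f]

# conditional probability = favorable outcomes / outcomes satisfying the condition
p_both_given_older = Fraction(len(both), len(older_girl))
p_both_given_either = Fraction(len(both), len(either_girl))
print(p_both_given_older)   # 1/2
print(p_both_given_either)  # 1/3
```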

Program 5
Code:
import re
import pandas as pd
import numpy as np
# Import the data into a DataFrame
books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(books_df.head())
# Find and drop the columns which are irrelevant for the book information
columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
                   'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=columns_to_drop, inplace=True)
# Change the Index of the DataFrame
books_df.set_index('Identifier', inplace=True)
# Tidy up fields in the data such as date of publication with the help of simple regular expression
def clean_date(date):
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return np.nan
books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)
# Combine str methods with NumPy to clean columns
# (na=False treats missing values as non-matches)
books_df['Place of Publication'] = np.where(
    books_df['Place of Publication'].str.contains('London', na=False),
    'London',
    np.where(
        books_df['Place of Publication'].str.contains('Oxford', na=False),
        'Oxford',
        books_df['Place of Publication'].replace(
            r'^\s*$', 'Unknown', regex=True
        )
    )
)
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())

Function Definition: The function clean_date takes one parameter, date.
def clean_date(date):
Type Check: It first checks if the input date is a string using the isinstance function.
if isinstance(date, str):
Regular Expression Search: If date is indeed a string, the function uses the re.search method to search
for a pattern that matches four consecutive digits (which typically represent a year) in the string.
match = re.search(r'\d{4}', date)
re.search searches the input string for the first location where the regular expression pattern \d{4}
(which means any four digits) matches.
If such a pattern is found, re.search returns a match object; otherwise, it returns None.

Extracting the Year: If a match is found (i.e., the match object is not None), the function extracts the
matched string (the year) using the group method of the match object.
if match:
    return match.group()
Handling No Match: If date is not a string or if no four-digit number is found in the string, the function
returns np.nan (which represents a missing value in the context of data analysis, often using the NumPy
library).
return np.nan
Libraries
re: This library provides regular expression matching operations.
numpy as np: This library is typically used for numerical and array operations. np.nan is a special
floating-point value that represents 'Not a Number' and is used to denote missing values.

Example:
Input: "April 20, 1995"
Output: "1995"
Input: "The year is 2023"
Output: "2023"
Input: "No year here"
Output: np.nan
Input: 12345
Output: np.nan (since the input is not a string)
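The examples above can be reproduced with a standalone sketch. To keep it free of third-party dependencies, it returns float('nan') where the original returns np.nan; that substitution is an assumption made here, and the pattern is written with the digit class \d.

```python
import math
import re

def clean_date(date):
    """returns the first run of four digits in a string, or NaN"""
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return float('nan')  # stands in for np.nan in this sketch

examples = ["April 20, 1995", "The year is 2023", "No year here", 12345]
for value in examples:
    print(repr(value), '->', clean_date(value))
# 'April 20, 1995' -> 1995
# 'The year is 2023' -> 2023
# 'No year here' -> nan
# 12345 -> nan
```

Note that the string "12345" (as opposed to the integer) would return "1234", since the pattern accepts the first four digits it finds.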

r'^\s*$': This is a regular expression pattern.
^: Asserts the position at the start of the string.
\s*: Matches zero or more whitespace characters (spaces, tabs, newlines).
$: Asserts the position at the end of the string.
Therefore, r'^\s*$' matches any string that contains only whitespace characters or is completely empty.
'Unknown': This is the replacement value. Any string that matches the regular expression pattern will be
replaced with the string 'Unknown'.
regex=True: This tells the replace method to interpret the first argument as a regular expression pattern.
Without regex=True, the method would treat the pattern as a plain string and attempt to find and replace the
exact string r'^\s*$', which wouldn't match anything in most cases.

What does this code do
The code replaces any entry in the 'Place of Publication' column that is empty or
contains only whitespace with the string 'Unknown'. By setting regex=True, it
ensures that the regular expression pattern is correctly used to identify these
entries.
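The behavior of the pattern can be tried in isolation. This sketch uses Python's standard re module instead of the pandas replace method, and the places list is made up for illustration.

```python
import re

pattern = r'^\s*$'  # empty, or whitespace only

# hypothetical column values, including empty and whitespace-only entries
places = ["London", "", "   ", "\t\n", "Oxford "]

# strings that match the pattern are replaced; all others pass through unchanged
cleaned = [re.sub(pattern, 'Unknown', place) for place in places]
print(cleaned)  # ['London', 'Unknown', 'Unknown', 'Unknown', 'Oxford ']
```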
