UNIT_5_Data Wrangling.pptx

• Wrangling Data:
• If you’ve gone through the previous chapters, by this point you’ve dealt with all the basic data loading and
manipulation methods offered by Python. Now it’s time to start using some more complex instruments for data
wrangling (or munging) and for machine learning.
• The final step of most data science projects is to build a data tool able to automatically summarize, predict,
and recommend directly from your data.
• Before taking that final step, you still have to process your data by enforcing transformations that are even
more radical.
• That’s the data wrangling or data munging part, where sophisticated transformations are followed by visual
and statistical explorations, and then again by further transformations.
• In the following sections, you learn how to handle huge streams of text, explore the basic characteristics of a
dataset, optimize the speed of your experiments, compress data and create new synthetic features, generate
new groups and classifications, and detect unexpected or exceptional cases that may cause your project to go
wrong.
2

• Sometimes the best way to discover how to use something is to spend time playing with it. The more
complex a tool, the more important play becomes.
• Given the complex math tasks you perform using Scikit-learn, playing becomes especially important.
• The following sections use the idea of playing with Scikit-learn to help you discover important
concepts in using Scikit-learn to perform amazing feats of data science work.
3

• Understanding classes in Scikit-learn
• Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package
appropriately.
• Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms, error functions, and testing procedures.
• At its core, Scikit-learn features some base classes on which all the algorithms are built. Apart from
BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic
machine-learning functionalities:
• Classifying
• Regressing
• Grouping by clusters
• Transforming data
4

• Understanding classes in Scikit-learn
• Even though each base class has specific methods and attributes, the core functionalities for data processing and
machine learning are guaranteed by one or more series of methods and attributes called interfaces.
• The interfaces provide a uniform Application Programming Interface (API) to enforce similarity of methods and
attributes between all the different algorithms present in the package. There are four Scikit-learn object-based
interfaces:
1. estimator: For fitting parameters, learning them from data, according to the algorithm
2. predictor: For generating predictions from the fitted parameters
3. transformer: For transforming data, implementing the fitted parameters
4. model: For reporting goodness of fit or other score measures
5

Defining applications for data science
• Figuring out ways to use data science to obtain constructive results is important. For example, you can apply the
estimator interface to a
1. Classification problem: Guessing that a new observation is from a certain group
2. Regression problem: Guessing the value of a new observation
• It works with the method fit(X, y) where X is the bidimensional array of predictors (the set of observations to learn)
and y is the target outcome (another array, unidimensional).
• By applying fit, the information in X is related to y, so that, knowing some new information with the same
characteristics of X, it’s possible to guess y correctly.
• In the process, some parameters are estimated internally by the fit method. Using fit makes it possible to
distinguish between parameters, which are learned, and hyperparameters, which instead are fixed by you when
you instantiate the learner.
6

• Instantiation involves assigning a Scikit-learn class to a Python variable.
• In addition to hyperparameters, you can also fix other working parameters, such as requiring normalization or
setting a random seed to reproduce the same results for each call, given the same input data.
7

8
Here is an example with linear regression, a very basic and common machine learning algorithm. You upload some
data to use this example from the examples that Scikit-learn provides.
The Boston dataset, for instance, contains predictor variables that the example code can match against house
prices, which helps build a predictor that can calculate the value of a house given its characteristics.

9
• The output specifies that both arrays have the same number of rows and that X has 13 features.
• The shape method performs array analysis and reports the arrays’ dimensions.

• After importing the LinearRegression class, you can instantiate a variable called hypothesis and set a parameter
indicating the algorithm to standardize (that is, to set mean zero and unit standard deviation for all the variables, a
statistical operation for having all the variables at a similar level) before estimating the parameters to learn.
10

• After fitting, hypothesis holds the learned parameters, and you can visualize them using the coef_ method, which
is typical of all the linear models (where the model output is a summation of variables weighted by coefficients).
You can also call this fitting activity training (as in, “training a machine learning algorithm”).
• hypothesis is a way to describe a learning algorithm trained with data. The hypothesis defines a possible
representation of y given X that you test for validity. Therefore, it’s a hypothesis in both scientific and machine
learning language.
11

• Apart from the estimator class, the predictor and the model object classes are also important.
• The predictor class, which predicts the probability of a certain result, obtains the result of new observations using
the predict and predict_proba methods, as in this script:
12
Make sure that new observations have the same feature number and order as in the training x; otherwise, the
prediction will be incorrect.

• The class model provides information about the quality of the fit using the score method, as shown here:
• In this case, score returns the coefficient of determination R2 of the prediction. R2 is a measure ranging from 0 to
1, comparing our predictor to a simple mean. Higher values show that the predictor is working well.
• Different learning algorithms may use different scoring functions. Please consult the online documentation of each
algorithm or ask for help on the Python console:
13

• The transform class applies transformations derived from the fitting phase to other data arrays.
• LinearRegression doesn’t have a transform method, but most preprocessing algorithms do.
• For example, MinMaxScaler, from the Scikit-learn preprocessing module, can transform values in a specific range
of minimum and maximum values, learning the transformation formula from an example array.
14

• Scikit-learn provides you with most of the data structures and functionality you need to complete your data science
project.
• You can even find classes for the trickiest and most advanced problems.
• For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the
hashing trick.
• You discover how to work with text by using the bag of words model (as shown in the “Using the Bag of Words
Model and Beyond”) and weighting them with the Term Frequency times Inverse Document Frequency (TF-IDF)
transformation.
• All these powerful transformations can operate properly only if all your text is known and available in the memory
of your computer.
15

• A more serious data science challenge is to analyze online-generated text flows, such as from social networks or
large, online text repositories.
• This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When
working through such problems, knowing the hashing trick can give you quite a few advantages by helping you
 Handle large data matrices based on text on the fly
 Fix unexpected values or variables in your textual data
 Build scalable algorithms for large collections of documents
16

• Hash functions can transform any input into an output whose characteristics are predictable.
• Usually they return a value where the output is bound at a specific interval — whose extremities range from
negative to positive numbers or just span through positive numbers.
• You can imagine them as enforcing a standard on your data — no matter what values you provide, they always
return a specific data product.
• Their most useful hash function characteristic is that, given a certain input, they always provide the same numeric
output value. Consequently, they’re called deterministic functions.
• For example, input a word like dog and the hashing function always returns the same number.
• In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret
codes, however, you can’t convert the hashed code to its original value.
• In addition, in some rare cases, different words generate the same hashed result (also called a hash collision).
17

• There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files)
and SHA (used in cryptography) being the most popular.
• Python possesses a built-in hash function named hash that you can use to compare data objects before storing
them in dictionaries.
• For instance, you can test how Python hashes its name:
18
The command returns a large integer number:

• A Scikit-learn hash function can also return an index in a specific positive range.
• You can obtain something similar using a built-in hash by employing standard division and its remainder:
19

• When you ask for the remainder of the absolute number of the result from the hash function, you get a number
that never exceeds the value you used for the division.
• To see how this technique works, pretend that you want to transform a text string from the Internet into a numeric
vector (a feature vector) so that you can use it for starting a machine-learning project. A good strategy for
managing this data science task is to employ one-hot encoding, which produces a bag of words.
• Here are the steps for one-hot encoding a string (“Python for data science”) into a vector.
1. Assign an arbitrary number to each word, for instance, Python=0 for=1 data=2 science=3.
2. Initialize the vector, counting the number of unique words that you assigned a code in Step 1.
3. Use the codes assigned in Step 1 as indexes for populating the vector with values, assigning a 1
• where there is a coincidence with a word existing in the phrase.
20

• The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements.
• You have started the machine-learning process, telling the program to expect sequences of four text features,
when suddenly a new phrase arrives and you must convert the following text into a numeric vector as well:
“Python for machine learning”.
• Now you have two new words — “machine learning” — to work with. The following steps help you create the new
vectors:
1. Assign these new codes: machine=4 learning=5. This is called encoding.
2. Enlarge the previous vector to include the new words: [1,1,1,1,0,0].
3. Compute the vector for the new string: [1,1,0,0,1,1].
• One-hot encoding is quite optimal because it creates efficient and ordered feature vectors.
21

• The command returns a dictionary containing the words and their encodings:
22

• Unfortunately, one-hot encoding fails and becomes difficult to handle when your project experiences a lot of
variability with regard to its inputs.
• This is a common situation in data science projects working with text or other symbolic features where flow from
the Internet or other online environments can suddenly create or add to your initial data.
• Using hash functions is a smarter way to handle unpredictability in your inputs:
1. Define a range for the hash function outputs. All your feature vectors will use that range. The example uses a
range of values from 0 to 24.
2. Compute an index for each word in your string using the hash function.
3. Assign a unit value to vector’s positions according to word indexes.
23

• As before, your results may not precisely match those in the book because hashes may not match across
machines.
• The code now prints the second string encoded:
25

• When viewing the feature vectors, you should notice that:
• You don’t know where each word is located. When it’s important to be able to reverse the process of assigning
words to indexes, you must store the relationship between words and their hashed value separately (for example,
you can use a dictionary where the keys are the hashed values and the values are the words).
• For small values of the vector_size function parameter (for example, vector_size=10), many words overlap in the
same positions in the list representing the feature vector. To keep the overlap to a minimum, you must create
hash function boundaries that are greater than the number of elements you plan to index later.
• The feature vectors in this example are made mostly of zero entries, representing a waste of memory when
compared to the more memory-efficient one-hot-encoding.
• One of the ways in which you can solve this problem is to rely on sparse matrices, as described in the next
section. 26

• Sparse matrices are the answer when dealing with data that has few values, that is, when most of the matrix
values are zeroes.
• Sparse matrices store just the coordinates of the cells and their values, instead of storing the information for all
the cells in the matrix.
• When an application requests data from an empty cell, the sparse matrix will return a zero value after looking for
the coordinates and not finding them.
• Here’s an example vector:
• [1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0]
27

• The following code turns it into a sparse matrix.
28
Notice that the data representation is in coordinates (expressed in a tuple of row and column index) and
the cell value.

• As a data scientist, you don’t have to worry about programming your own version of the hashing trick unless you
would like some special implementation of the idea.
• Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data
matrix using the hashing trick.
• Here’s an example script that replicates the previous example:
29
Python reports the size of the sparse matrix and a count of the stored elements present in it:
<2x20

• As soon as new text arrives, CountVectorizer transforms the text based on the previous encoding schema where
the new words weren’t present; hence, the result is simply an empty vector of zeros.
• You can check this by transforming the sparse matrix into a normal, dense one using todense:
30

• Contrast the output from CountVectorizer with HashingVectorizer, which always provides a place for new words in
the data matrix:
31
At worst, a word settles in an already occupied position, causing two different words to be treated
as the same one by the algorithm (which won’t noticeably degrade the algorithm’s performances).
HashingVectorizer is the perfect function to use when your data can’t fit into memory and its
features aren’t fixed. In the other cases, consider using the more intuitive CountVectorizer.

• when testing your application code for performance (speed) characteristics, you can obtain analogous
information about memory usage.
• Keeping track of memory consumption could tell you about possible problems in the way data is processed or
transmitted to the learning algorithms.
• The memory_profiler package implements the required functionality. This package is not provided as a default
Python package and it requires installation.
• Use the following command to install the package directly from a cell of your Jupyter notebook,
32

• Use the following command for each Jupyter Notebook session you want to monitor:
33
• After performing these tasks, you can easily track how much memory a command consumes:
The reported peak memory and increment tell you about memory usage:
peak memory: 90.42 MiB, increment: 0.09 MiB

• Obtaining a complete overview of memory
consumption is possible by saving a notebook cell to
disk and then profiling it using the line magic %mprun
on an externally imported function.
• The line magic works only by operating with external
Python scripts.
• Profiling produces a detailed report, command by
command, as shown in the following example:
34
The resulting report details the memory usage from every line
in the function, pointing out the major increments.

• Most computers today are multicore (two or more processors in a single package), some with multiple physical
CPUs. One of the most important limitations of Python is that it uses a single core by default.
• Data science projects require quite a lot of computations. In particular, a part of the scientific aspect of data
science relies on repeated tests and experiments on different data matrices.
• Using more CPU cores accelerates a computation by a factor that almost matches the number of cores.
• For example, having four cores would mean working at best four times faster.
• You don’t receive a full fourfold increase because there is overhead when starting a parallel process — new
running Python instances have to be set up with the right in-memory information and launched; consequently, the
improvement will be less than potentially achievable but still significant.
• Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the
number of analyses completed, and for speeding up your operations both when setting up and when using your
data Products
35

• Performing multicore parallelism
• To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for
time-consuming operations, such as replicating models for validating results or for looking for the best
hyperparameters. In particular, Scikit-learn allows multiprocessing when
• Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing data
• Grid-searching: Systematically changing the hyperparameters of a machine-learning hypothesis and testing the
consequent results
• Multilabel prediction: Running an algorithm multiple times against multiple targets when there are many
different target outcomes to predict at the same time
• Ensemble machine-learning methods: Modeling a large host of classifiers, each one independent from the
other, such as when using RandomForest-based modeling
36

• Using Jupyter provides the advantage of using the %timeit magic
• command for timing execution. You start by loading a multiclass dataset, a complex machine learning algorithm
(the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores
from all the procedures.
• The most important thing to know is that the procedures become quite large because the SVC produces 10
models, which it repeats 10 times each using cross-validation, for a total of 100 models
37
As a result, you get the recorded average running time for a single core: 10.9 S

• After this test, you need to activate the multicore parallelism and time the results using the following commands:
• %timeit multi_core
38
As a result, you get the recorded average running time for a Multi core: 4.44 S

• Data science relies on complex algorithms for building predictions and spotting important signals in
data, and each algorithm presents different strong and weak points.
• In short, you select a range of algorithms, you have them run on the data, you optimize their
parameters as much as you can, and finally you decide which one will best help you build your data
product or generate insight into your problem.
• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple
summary statistics and graphic visualizations in order to gain a deeper understanding of data.
• EDA helps you become more effective in the subsequent data analysis and modeling.
40

• EDA was developed at Bell Labs by John Tukey, a mathematician and statistician who wanted to
promote more questions and actions on data based on the data itself (the exploratory motif) in
contrast to the dominant confirmatory approach of the time.
• EDA goes further than IDA (Initial Data Analysis). It’s moved by a different attitude: going beyond
basic assumptions. With
• EDA, You can
Describe of your data
Closely explore data distributions
Understand the relations between variables
Notice unusual or unexpected situations
Place the data into groups
Notice unexpected patterns within groups
Take note of group differences
41

• The first actions that you can take with the data are to produce some synthetic measures to help
figure out what is going in it.
• You acquire knowledge of measures such as maximum and minimum values, and you define which
intervals are the best place to start.
• During your exploration, you use a simple but useful dataset that is used in previous chapters, the
Fisher’s Iris dataset. You can load it from the Scikit-learn package by using the following code:
42

• Mean and median are the first measures to calculate for numeric variables when starting EDA.
• They can provide you with an estimate when the variables are centered and somehow symmetric.
• Using pandas, you can quickly compute both means and medians.
• Here is the command for getting the mean from the Iris DataFrame:
43
• When checking for central
tendency measures, you should:
1. Verify whether means are zero
2. Check whether they are different
from each other
3. Notice whether the median is
different from the mean

• As a next step, you should check the variance by using its square root, the standard deviation.
• The standard deviation is as informative as the variance, but comparing to the mean is easier
because it’s expressed in the same unit of measure.
• The variance is a good indicator of whether a mean is a suitable indicator of the variable distribution
because it tells you how the values of a variable distribute around the mean.
• The higher the variance, the farther you can expect some values to appear from the mean.
44

• In addition, you also check the range, which is the difference between the maximum and minimum
value for each quantitative variable, and it is quite informative about the difference in scale among
variables.
45
• Note the standard deviation and the range in relation to the mean and median.
• A standard deviation or range that’s too high with respect to the measures of centrality (mean and median) may
point to a possible problem, with extremely unusual values affecting the calculation or an unexpected distribution
of values around the mean.

• Because the median is the value in the central position of your distribution of values, you may need to
consider other notable positions.
• Apart from the minimum and maximum, the position at 25 percent of your values (the lower quartile)
and the position at 75 percent (the upper quartile) are useful for determining the data distribution, and
they are the basis of an illustrative graph called a boxplot.
46
• comparison that uses quartiles for rows and the different dataset variables as columns.
• So, the 25-percent quartile for sepal length (cm) is 5.1, which means that 25 percent of the dataset values for this
measure are less than 5.1.

• The last indicative measures of how the numeric variables used for these examples are structured are
skewness and kurtosis:
• Skewness defines the asymmetry of data with respect to the mean. If the skew is negative, the left tail
is too long and the mass of the observations are on the right side of the distribution. If it is positive, it is
exactly the opposite.
• Kurtosis shows whether the data distribution, especially the peak and the tails, are of the right shape.
If the kurtosis is above zero, the distribution has a marked peak. If it is below zero, the distribution is
too flat instead.
47

• When performing the skewness and kurtosis tests, you determine whether the p-value is less than or
equal 0.05.
• If so, you have to reject normality (your variable distributed as a Gaussian distribution), which implies
that you could obtain better results if you try to transform the variable into a normal one.
• The following code shows how to perform the required test:
48

• The test results tell you that the data is slightly skewed to the left, but not enough to make it unusable.
• The real problem is that the curve is much too flat to be bell shaped, so you should investigate the
matter further.
49

• The Iris dataset is made of four metric variables and a qualitative target outcome.
• Just as you use means and variance as descriptive measures for metric variables, so do frequencies
strictly relate to qualitative ones.
• Because the dataset is made up of metric measurements (width and lengths in centimeters), you must
render it qualitative by dividing it into bins according to specific intervals.
• The pandas package features two useful functions, cut and qcut, that can transform a metric variable
into a qualitative one:
• cut expects a series of edge values used to cut the measurements or an integer number of
groups used to cut the variables into equal-width bins.
• qcut expects a series of percentiles used to cut the variable.
50

• By matching different categorical frequency distributions, you can display the relationship between
qualitative variables.
• The pandas.crosstab function can match variables or groups of variables, helping to locate possible data
structures or relationships.
• In the following example, you check how the outcome variable is related to petal length and observe
how certain outcomes and petal binned classes never appear together.
• Figure shows the various iris types along the left side of the output, followed by the output as related to
petal length.
53

• The data is rich in information because it offers a perspective that goes beyond the single variable, presenting more
variables with their reciprocal variations.
• The way to use more of the data is to create a bivariate (seeing how couples of variables relate to each other)
exploration.
• This is also the basis for complex data analysis based on a multivariate (simultaneously considering all the existent
relations between variables) approach.
• If the univariate approach inspected a limited number of descriptive statistics, then matching different variables or
groups of variables increases the number of possibilities.
• Such exploration overloads the data scientist with different tests and bivariate analysis.
• Using visualization is a rapid way to limit test and analysis to only interesting traces and hints.
• Visualizations, using a few informative graphics, can convey the variety of statistical characteristics of the variables
and their reciprocal relationships with greater ease.
54

• Boxplots provide a way to represent distributions and their extreme ranges, signaling whether some observations
are too far from the core of the data — a problematic situation for some learning algorithms.
• The following code shows how to create a basic boxplot using the Iris dataset:
55
• In Figure, you see the structure of
each variable’s distribution at its core,
represented by the 25° and 75°
percentile (the sides of the box) and
the median (at the center of the box).
• The lines, the socalled whiskers,
represent 1.5 times the IQR from the
box sides (or by the distance to the
most extreme value, if within 1.5 times
the IQR).

• After you have spotted a possible group difference relative to a variable, a t-test (you use a t-test in situations in
which the sampled population has an exact normal distribution) or a one-way Analysis Of Variance (ANOVA) can
provide you with a statistical verification of the significance of the difference between the groups’ means.
56
• The t-test compares two groups at a time, and
it requires that you define whether the groups
have similar variance or not.
• You interpret the pvalue as the probability
that the calculated t statistic difference is just
due to chance.
• Usually, when it is below 0.05, you can
confirm that the groups’ means are
significantly different.

• You can simultaneously check more than two groups using the one-way ANOVA test. In this case, the pvalue has an
interpretation similar to the t-test:
57

• Parallel coordinates can help spot which groups in the outcome variable you could easily separate from the other.
• It is a truly multivariate plot, because at a glance it represents all your data at the same time.
• The following example shows how to use parallel coordinates.
58
• If the parallel lines of each group stream
together along the visualization in a
separate part of the graph far from other
groups, the group is easily separable.
• The visualization also provides the means
to assert the capability of certain features
to separate the groups.

• You usually render the information that boxplot and descriptive statistics provide into a curve or a histogram, which shows an
overview of the complete distribution of values.
• The output shown in Figure represents all the distributions in the dataset.
• Different variable scales and shapes are immediately visible, such as the fact that petals’ features display two peaks.
59

• Histograms present another, more detailed, view over distributions:
60

• In scatterplots, the two compared variables provide the coordinates for plotting the observations as points on a plane.
• The result is usually a cloud of points. When the cloud is elongated and resembles a line, you can deduce that the variables are
correlated.
• The following example demonstrates this principle:
61
• The scatterplot highlights different groups using
different colors.
• The elongated shape described by the points hints
at a strong correlation between the two observed
variables, and the division of the cloud into groups
suggests a possible separability of the groups.
• Because the number of variables isn’t too large, you
can also generate all the scatterplots automatically
• from the combination of the variables.
• This representation is a matrix of scatterplots.

• Just as the relationship between variables is graphically representable, it is also measurable by a statistical estimate.
• When working with numeric variables, the estimate is a correlation, and the Pearson’s correlation is the most famous.
• The Pearson’s correlation is the foundation for complex linear estimation models.
• When you work with categorical variables, the estimate is an association, and the chi-square statistic is the most frequently used
tool for measuring association between features.
• Using covariance and correlation
• Covariance is the first measure of the relationship of two variables.
• It determines whether both variables have a coincident behavior with respect to their mean. If the single values of two variables
are usually above or below their respective averages, the two variables have a positive association.
• It means that they tend to agree, and you can figure out the behavior of one of the two by looking at the other.
• In such a case, their covariance will be a positive number, and the higher the number, the higher the agreement. 62

• If, instead, one variable is usually above and the other variable usually below their respective averages,
the two variables are negatively associated.
• Even though the two disagree, it’s an interesting situation for making predictions, because by observing
the state of one of them, you can figure out the likely state of the other (albeit they’re opposite).
• In this case, their covariance will be a negative number.
• A third state is that the two variables don’t systematically agree or disagree with each other. In this case,
the covariance will tend to be zero, a sign that the variables don’t share much and have independent
behaviors.
63

UNIT_5_Data Wrangling.pptx

More Related Content

Similar to UNIT_5_Data Wrangling.pptx (20)

Recently uploaded (20)

UNIT_5_Data Wrangling.pptx