2. Data manipulation has the purpose of preparing the data so that they can be
more easily subjected to analysis.
The three phases of data manipulation are
● Data preparation
● Data transformation
● Data aggregation
3. DATA PREPARATION
Before you start manipulating data itself, it is necessary to prepare the data
and assemble them in the form of data structures such that they can be
manipulated later with the tools made available by the pandas library.
The different procedures for data preparation are listed below.
• Loading
• Assembling
  ▪ Merging
  ▪ Concatenating
  ▪ Combining
• Reshaping (pivoting)
• Removing
4. In the loading phase, there is also the part of preparation that concerns the
conversion from the many different formats into a data structure such as a
DataFrame.
But even after you have gotten the data, probably from different sources and
formats, and unified it into a DataFrame, you will need to perform further
preparation operations.
The data contained in pandas objects can be assembled together in
different ways:
❖ Merging—the pandas.merge() function connects the rows of two
DataFrames based on one or more keys.
❖ Concatenating—the pandas.concat() function concatenates the objects
along an axis.
❖ Combining—the pandas.DataFrame.combine_first() function is a
method that allows you to connect overlapping data in order to fill in
missing values in a data structure by taking data from another structure.
5. Merging
First, you have to import the pandas library and define two DataFrames that will
serve as examples for this section.
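A minimal sketch of such a setup (the column names and values here are illustrative assumptions):
>>> import pandas as pd
>>> frame1 = pd.DataFrame({'id': ['ball','pencil','pen','mug','ashtray'],
...                        'price': [12.33,11.44,33.21,13.23,33.62]})
>>> frame2 = pd.DataFrame({'id': ['pencil','pencil','ball','pen'],
...                        'brand': ['OMG','ABC','ABC','POD']})
>>> pd.merge(frame1, frame2)   # joins the two DataFrames on the common 'id' column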
6. The returned DataFrame consists of all rows that have an ID in common between the
two DataFrames.
7. In fact, in most cases you need to decide which column to base the merging
on. To do this, add the on option with the column name as the key for the
merging.
Now consider the case in which the two DataFrames have two columns with the
same name. If you launch a merge, pandas joins on all the common columns at
once, and you may not get any results.
8. So it is necessary to explicitly define the criterion of merging that pandas must
follow, specifying the name of the key column in the on option.
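A sketch, reusing the frames above and assuming that both now also carry a color column:
>>> frame1['color'] = ['white','blue','red','black','green']
>>> frame2['color'] = ['white','red','red','black']
>>> pd.merge(frame1, frame2)            # merges on both 'id' and 'color': no rows match here
>>> pd.merge(frame1, frame2, on='id')   # merges on 'id' only; 'color_x' and 'color_y' appear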
9. Often, however, the opposite problem arises, that is, to have two DataFrames in
which the key columns do not have the same name. To remedy this situation, you
have to use the left_on and right_on options that specify the key column for the
first and for the second DataFrame.
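A sketch, assuming a copy of frame2 whose key column carries the hypothetical name sid:
>>> frame2b = frame2.rename(columns={'id': 'sid'})
>>> pd.merge(frame1, frame2b, left_on='id', right_on='sid')   # keys paired across names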
10. By default, the merge() function performs an inner join; the keys in the result
are the intersection of the keys of the two DataFrames.
Other possible options are the left join, the right join, and the outer join.
The outer join produces the union of all keys, combining the effect of a left join
with a right join.
To select the type of join, you use the how option.
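For instance, with the frames above:
>>> pd.merge(frame1, frame2, on='id', how='outer')   # union of the keys
>>> pd.merge(frame1, frame2, on='id', how='left')    # all keys of frame1
>>> pd.merge(frame1, frame2, on='id', how='right')   # all keys of frame2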
11. To merge on multiple keys, you simply pass a list of column names to the on
option.
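For instance:
>>> pd.merge(frame1, frame2, on=['id','color'], how='outer')   # merge on two keys at once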
12. Merging on Index
Instead of considering the columns of a DataFrame as keys, the indexes can be
used as the keys for merging.
To decide which indexes to consider, set the left_index or right_index option to
True; you can also activate them both.
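A sketch, aligning the two frames above by their (default integer) indexes:
>>> pd.merge(frame1, frame2, left_index=True, right_index=True)   # rows paired by index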
13. Concatenating
Another type of data combination is referred to as concatenation. NumPy
provides a concatenate() function to do this kind of operation with arrays.
14. As regards the pandas library and its data structures like Series and DataFrame,
having labeled axes allows you to further generalize the concatenation of
arrays. The concat() function is provided by pandas for this kind of operation.
By default, the concat() function works on axis = 0, returning a Series as the
result. If you set axis = 1, then the result will be a DataFrame.
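A minimal sketch (the Series values and indexes are illustrative):
>>> import numpy as np
>>> ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
>>> ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
>>> pd.concat([ser1, ser2])           # default axis=0: one longer Series
>>> pd.concat([ser1, ser2], axis=1)   # a DataFrame, with NaN where indexes do not overlap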
15. From the result you can see that there is no overlap of data; what you have
just done is therefore an 'outer' join. This can be changed by setting the join
option to 'inner'.
A problem in this kind of operation is that the concatenated parts are not
identifiable in the result. Suppose, for example, that you want to create a
hierarchical index on the axis of concatenation. To do this, you have to use the
keys option, as in the sketch below.
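For instance, with the two Series above:
>>> pd.concat([ser1, ser2], keys=[1,2])            # a hierarchical index marks each part
>>> pd.concat([ser1, ser2], axis=1, join='inner')  # keeps only shared labels (empty here)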
16. In the case of combinations between Series along axis = 1, the keys become
the column headers of the DataFrame.
Combining
There is another situation in which there is a combination of data that cannot be
obtained either with merging or with concatenation.
Take the case in which you want two datasets to have indexes that overlap in
their entirety or at least partially.
One applicable function for Series is combine_first(), which performs this kind of
operation along with data alignment.
19. Instead, if you want a partial overlap, you can specify only the portion of the Series
you want to overlap.
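A minimal sketch of both the full and the partial case (values are illustrative):
>>> ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])
>>> ser2 = pd.Series(np.random.rand(4), index=[2,4,5,6])
>>> ser2.combine_first(ser1)           # ser2 values where present, ser1 fills the gaps
>>> ser1[:3].combine_first(ser2[:3])   # only the selected portions take part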
20. Pivoting
Arrangement of the values by row or by column is not always suited to your
goals. Sometimes you would like to rearrange the data carrying column values
on rows or vice versa.
Pivoting with Hierarchical Indexing
In the context of pivoting you have two basic operations:
❑ stacking: rotates or pivots the data structure converting columns to rows
❑ unstacking: converts rows into columns
21. From this hierarchically indexed series, you can reassemble the DataFrame
into a pivoted table by using the unstack() function.
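A sketch of the round trip, on a hypothetical 3x3 frame:
>>> frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
...                       index=['white','black','red'],
...                       columns=['ball','pen','pencil'])
>>> ser = frame1.stack()    # columns rotate into an inner level of a hierarchical index
>>> ser.unstack()           # the inner index level rotates back into columns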
22. Pivoting from “Long” to “Wide” Format
When entries are spread over multiple rows, with some fields often duplicated in
subsequent lines while the data remain in tabular format, we refer to this as the
long or stacked format.
23. This mode of data recording, however, has some disadvantages. One, for
example, is precisely the multiplicity and repetition of some fields.
Instead of the long format, there is another way to arrange the data in a table,
called wide.
This mode is easier to read, allows easy connection with other tables, and
occupies much less space. So in general it is a more efficient way of storing the
data.
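A minimal sketch of the long-to-wide conversion, assuming a hypothetical long-format frame:
>>> longframe = pd.DataFrame({'color': ['white','white','red','red'],
...                           'item': ['ball','pen','ball','pen'],
...                           'value': [1.2, 0.8, 2.1, 0.5]})
>>> longframe.pivot(index='color', columns='item', values='value')   # one row per color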
24. Removing
The last stage of data preparation is the removal of columns and rows.
To remove a column, you simply use the del command applied to the DataFrame
with the column name specified. To remove an unwanted row instead, you use
the drop() function with the label of the corresponding index as argument, as in
the sketch below.
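A sketch, on a hypothetical frame:
>>> frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
...                       index=['white','black','red'],
...                       columns=['ball','pen','pencil'])
>>> del frame1['ball']      # removes the 'ball' column in place
>>> frame1.drop('white')    # returns a copy without the 'white' row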
25. Data Transformation
After you arrange the form of the data and their placement within the data
structure, it is important to transform their values.
Removing Duplicates
Pandas provides a series of tools to analyze the duplicate data present in large
data structures.
The duplicated() function applied to a DataFrame can detect the rows that
appear to be duplicated. It returns a Series of Booleans where each element
corresponds to a row: True if the row duplicates an earlier one (i.e., only the
subsequent occurrences, not the first), and False if no duplicate appears among
the previous elements.
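A minimal sketch (the frame is a hypothetical example):
>>> dframe = pd.DataFrame({'color': ['white','white','red','red','white'],
...                        'value': [2,1,3,3,2]})
>>> dframe.duplicated()     # True only for rows repeating an earlier one (the last two here)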
26. Having a Boolean Series as the return value can be useful in many cases,
especially for filtering. In fact, if you want to know which are the duplicate rows,
just type the following:
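Reusing the hypothetical dframe above:
>>> dframe[dframe.duplicated()]   # Boolean filtering: only the duplicate rows remain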
27. drop_duplicates()
Generally, all duplicated rows are to be deleted from the DataFrame; to do that,
pandas provides the drop_duplicates() function, which returns the DataFrame
without duplicate rows.
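Continuing the sketch above:
>>> dframe.drop_duplicates()   # the same frame without the two duplicate rows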
Mapping
Mapping is nothing more than the creation of a list of matches between two
different values, with the ability to bind a value to a particular label or string.
To define a mapping, there is no better object than a dict:
map = {
   'label1' : 'value1',
   'label2' : 'value2',
   ...
}
28. There are three specific functions that use mapping:
❖ replace(): replaces values
❖ map(): creates a new column
❖ rename(): replaces the index values
Replacing Values via Mapping
Often in the data structure that you have assembled there are values that do not
meet your needs.
Define, as an example, a DataFrame containing various objects and colors,
including two colors that are not in English.
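A sketch of such a frame and its correction (items, colors, and prices are illustrative; 'rosso' and 'verde' are Italian for red and green):
>>> frame = pd.DataFrame({'item': ['ball','mug','pen','pencil','ashtray'],
...                       'color': ['white','rosso','verde','black','yellow'],
...                       'price': [5.56, 4.20, 1.30, 0.56, 2.75]})
>>> newcolors = {'rosso': 'red', 'verde': 'green'}   # the mapping of values to replace
>>> frame.replace(newcolors)                         # returns the frame with English colors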
30. Adding Values via Mapping
You can exploit mapping to add values in a column depending on the values
contained in another.
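For instance, reusing the frame above with a hypothetical price mapping:
>>> prices = {'ball': 5.56, 'mug': 4.20, 'pen': 1.30, 'pencil': 0.56, 'ashtray': 2.75}
>>> frame['new_price'] = frame['item'].map(prices)   # new column looked up from 'item'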
31. Rename the Indexes of the Axes
To replace the label indexes, pandas provides the rename() function, which
takes the mapping as argument, that is, a dict object.
32. As you can see, by default the indexes are renamed. If you want to rename
columns, you must use the columns option. This time, you assign the two
mappings explicitly to the index and columns options.
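A sketch with two explicit mappings (the new labels are arbitrary):
>>> reindex = {0: 'first', 1: 'second', 2: 'third', 3: 'fourth', 4: 'fifth'}
>>> recolumn = {'item': 'object', 'price': 'value'}
>>> frame.rename(index=reindex, columns=recolumn)   # both axes renamed via dicts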
33. Here too, for the simplest cases in which you have a single value to be replaced,
you can make the arguments passed to the function more explicit, avoiding having
to write and assign many variables.
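For instance:
>>> frame.rename(index={1: 'first'}, columns={'item': 'object'})   # single values inline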
34. Discretization and Binning
To carry out an analysis of the data, it is sometimes necessary to transform
continuous data into discrete categories.
For example, you may have a reading of an experimental value between 0 and
100. These data are collected in a list.
>>> results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
You know that the experimental values have a range from 0 to 100; therefore
you can uniformly divide this interval, for example, into four equal parts, i.e.,
bins.
The first contains the values between 0 and 25, the second between 26 and 50,
the third between 51 and 75, and the last between 76 and 100.
To do this binning with pandas, first you have to define an array containing the
bin separation values:
>>> bins = [0,25,50,75,100]
35. Then apply the special cut() function to the array of results, also passing the
bins.
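A sketch, using the results and bins defined above:
>>> cat = pd.cut(results, bins)   # each result is assigned to its interval, e.g. (25, 50]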
36. The object returned by the cut() function is a special object of Categorical type.
You can consider it as an array of strings indicating the name of the bin.
Internally it contains an array with the names of the different internal categories
(called levels in older pandas versions, categories today) and an array of codes
(formerly labels) containing a number for each element of results.
Finally, to know the occurrences for each bin, that is, how many results fall into
each category, you use the value_counts() function.
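For instance:
>>> pd.Series(cat).value_counts()   # how many results fall into each of the four bins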
37. In addition to cut(), pandas provides another method for binning: qcut(). This
function divides the sample directly into quantiles (quintiles, when you ask for five
bins). qcut() ensures that the number of occurrences in each bin is equal, but the
edges of each bin vary.
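A sketch on the same data:
>>> quintiles = pd.qcut(pd.Series(results), 5)   # five bins holding (roughly) equal counts
>>> quintiles.value_counts()                     # balanced counts; the edges vary instead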
38. Detecting and Filtering Outliers
❖ During data analysis, the need to detect the presence of abnormal values
within a data structure often arises. By way of example, create a DataFrame
with three columns of 1,000 completely random values:
>>> randframe = pd.DataFrame(np.random.randn(1000,3))
>>> randframe.describe()
0 1 2
count 1000.000000 1000.000000 1000.000000
mean 0.021609 -0.022926 -0.019577
std 1.045777 0.998493 1.056961
min -2.981600 -2.828229 -3.735046
25% -0.675005 -0.729834 -0.737677
50% 0.003857 -0.016940 -0.031886
75% 0.738968 0.619175 0.718702
max 3.104202 2.942778 3.458472
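One common criterion, as a sketch: consider as outliers the values more than three times the standard deviation of their column.
>>> randframe[(np.abs(randframe) > 3 * randframe.std()).any(axis=1)]   # rows with outliers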
39. Permutation
The operations of permutation (random reordering) of a Series or the rows of a
DataFrame are easy to do using the numpy.random.permutation() function.
Now create an array of five integers from 0 to 4 arranged in random
order with the permutation() function. This will be the new order in
which to arrange the rows of the DataFrame.
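A sketch, assuming a hypothetical 5x5 frame named nframe:
>>> nframe = pd.DataFrame(np.arange(25).reshape(5,5))
>>> new_order = np.random.permutation(5)   # e.g., array([2, 0, 4, 1, 3])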
40. Now apply it to the DataFrame on all rows, using the take() function.
>>> nframe.take(new_order)
Random Sampling
Sometimes, when you have a huge DataFrame, you may need to sample it
randomly, and the quickest way to do this is by using the np.random.randint()
function.
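For instance, drawing three row positions (repetitions are possible):
>>> sample = np.random.randint(0, len(nframe), size=3)
>>> nframe.take(sample)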
41. String Manipulation
Built-in Methods for Manipulation of Strings
split()
The split() function allows us to separate parts of a text, taking as a
reference point a separator, for example a comma.
As we can see in the first element, a string can be left with a space character at
the end. To overcome this frequent problem, you have to use the split() function
along with the strip() function, which trims the whitespace (including newlines).
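A sketch, on a hypothetical address string:
>>> text = '16 Bolton Avenue , Boston'
>>> text.split(',')
['16 Bolton Avenue ', ' Boston']
>>> [s.strip() for s in text.split(',')]   # trim the leftover whitespace
['16 Bolton Avenue', 'Boston']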
42. join()
If the parts to be concatenated are many more, a more practical approach is to
use the join() function, assigned to the separator character with which you want
to join the various strings.
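For instance, with a hypothetical list of parts:
>>> strings = ['A+', 'A', 'A-', 'B', 'BB', 'BBB', 'C+']
>>> ';'.join(strings)
'A+;A;A-;B;BB;BBB;C+'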
Another category of operations that can be performed on strings is the
search for pieces of text in them, i.e., substrings.
>>> 'Boston' in text
True
43. In the same area, you can know how many times a character or combination of
characters (substring) occurs within a text. The count() function provides you with
this number.
>>> text.count('e')
2
>>> text.count('Avenue')
1
Another operation that can be performed on strings is the replacement or
elimination of a substring (or a single character). In both cases you will use the
replace() function; if you replace a substring with an empty string, the operation
is equivalent to eliminating the substring from the text.
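Continuing with the same text:
>>> text.replace('Avenue', 'Street')
'16 Bolton Street , Boston'
>>> text.replace('1', '')   # replacing with an empty string eliminates the character
'6 Bolton Avenue , Boston'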
44. Regular Expressions
The regular expressions provide a very flexible way to search and match string
patterns within a text.
A single expression, generically called regex, is a string formed according to the
regular expression language.
There is a built-in Python module called re, which is responsible for the operation
of the regex.
So first of all, when you want to make use of regular expressions, you will need
to import the module.
45. The re module provides a set of functions that can be divided into three
different categories:
❑ pattern matching
❑ substitution
❑ splitting
The regex for expressing a sequence of one or more whitespace characters is
\s+.
When you call the re.split() function, the regular expression is first compiled,
and then the split() function is called on the text argument.
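A sketch (the sample string is an assumption):
>>> import re
>>> re.split(r'\s+', 'This is   an\nodd  text!')
['This', 'is', 'an', 'odd', 'text!']
>>> regex = re.compile(r'\s+')   # compiling once avoids recompiling at every call
>>> regex.split('This is   an\nodd  text!')
['This', 'is', 'an', 'odd', 'text!']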
46. As regards matching a regex pattern with the other substrings in the text, you
can use the findall() function. It returns a list of all the substrings in the text that
meet the requirements of the regex.
There are two other functions related to findall(): match() and search(). While
findall() returns all matches in a list, search() returns only the first match, and
match() looks for a match only at the beginning of the string.
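A sketch (the text and pattern are illustrative):
>>> address = 'This is my address: 16 Bolton Avenue, Boston'
>>> re.findall('[A,a]venue', address)   # every occurrence, capitalized or not
['Avenue']
>>> re.search('[A,a]venue', address)    # a match object for the first occurrence only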
47. Data Aggregation
The last stage of data manipulation is data aggregation. By data aggregation
you generally mean a transformation that produces a single value from an
array.
GroupBy
Now you will analyze in detail what the process of GroupBy is and how it
works. Its internal mechanism is generally referred to as a process called
SPLIT-APPLY-COMBINE.
This process is divided into three different phases:
• splitting: division of the dataset into groups
• applying: application of a function on each group
• combining: combination of all the results obtained from the different groups
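A minimal sketch of the three phases (the frame is a hypothetical example):
>>> frame = pd.DataFrame({'color': ['white','red','green','red','green'],
...                       'price': [5.56, 4.20, 1.30, 0.56, 2.75]})
>>> group = frame['price'].groupby(frame['color'])   # SPLIT by the color key
>>> group.mean()                                     # APPLY mean and COMBINE into a Series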
49. Group Iteration
The GroupBy object supports iteration, generating a sequence of 2-tuples
containing the name of the group together with its data portion.
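For instance, with the frame from the previous sketch:
>>> for name, portion in frame.groupby('color'):
...     print(name)
...     print(portion)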