Pandas in Depth: Data Manipulation
Data manipulation has the purpose of preparing the data so that it can be more easily subjected to analysis. The three phases of data manipulation are:
● Data preparation
● Data transformation
● Data aggregation
DATA PREPARATION
Before you start manipulating the data itself, you need to prepare it and assemble it into data structures that can later be manipulated with the tools made available by the pandas library. The different procedures for data preparation are listed below.
• loading
• assembling
▪ Merging
▪ Concatenating
▪ Combining
• reshaping (pivoting)
• removing
The loading phase also includes the part of the preparation concerned with converting many different formats into a data structure such as a DataFrame. But even after you have gotten the data, probably from different sources and formats, and unified it into a DataFrame, you will need to perform further preparation operations.
The data contained in pandas objects can be assembled in different ways:
❖ Merging—the pandas.merge() function connects the rows of two DataFrames based on one or more keys.
❖ Concatenating—the pandas.concat() function concatenates the objects along an axis.
❖ Combining—the pandas.DataFrame.combine_first() method allows you to connect overlapping data in order to fill in missing values in a data structure by taking data from another structure.
Merging
First, you have to import the pandas library and define two DataFrames that will serve as examples for this section.
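A minimal sketch of two such frames (names and values here are illustrative, in the spirit of the book's example):

import pandas as pd

frame1 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price': [12.33, 11.44, 33.21, 13.23, 33.62]})
frame2 = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                       'color': ['white', 'red', 'red', 'black']})

pd.merge(frame1, frame2)   # by default, joins on the common 'id' column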
The returned DataFrame consists of all rows that have an ID in common between the two DataFrames.
In most cases, you need to decide which column to base the merge on. To do this, add the on option with the column name as the key for the merge. If the two DataFrames have more than one column with the same name, launching the merge without specifying a key joins on all the common columns and may return no rows at all. It is therefore necessary to explicitly define the merge criterion that pandas must follow, specifying the name of the key column in the on option.
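A sketch of the ambiguity and its resolution, assuming both frames also carry a second common column 'brand' (the values are illustrative):

frame1['brand'] = ['OMG', 'ABC', 'ABC', 'POD', 'POD']
frame2['brand'] = ['OMG', 'POD', 'ABC', 'POD']

pd.merge(frame1, frame2)           # joins on both 'id' and 'brand': possibly no rows
pd.merge(frame1, frame2, on='id')  # joins on 'id' only; 'brand' splits into brand_x, brand_y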
Often, however, the opposite problem arises: the two DataFrames have key columns that do not have the same name. To remedy this, use the left_on and right_on options, which specify the key column for the first and the second DataFrame, respectively.
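A hypothetical example, renaming the key of the second frame to show the two options:

frame2b = frame2.rename(columns={'id': 'sid'})
pd.merge(frame1, frame2b, left_on='id', right_on='sid')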
By default, the merge() function performs an inner join: the keys in the result are the intersection of the keys of the two DataFrames.
Other possible options are the left join, the right join, and the outer join. The outer join produces the union of all keys, combining the effect of a left join with a right join. To select the type of join, use the how option.
To merge on multiple keys, simply pass a list of column names to the on option.
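A sketch of the how option and of a multiple-key merge, continuing with the frames above:

pd.merge(frame1, frame2, on='id', how='outer')   # union of the keys
pd.merge(frame1, frame2, on='id', how='left')    # keys taken from the left frame
pd.merge(frame1, frame2, on='id', how='right')   # keys taken from the right frame
pd.merge(frame1, frame2, on=['id', 'brand'], how='outer')   # list of keys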
Merging on Index
Instead of considering the columns of a DataFrame as keys, the indexes can be used as the criterion for merging. To decide which indexes to consider, set the left_index or right_index option to True; you can also activate both.
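A sketch, moving the 'id' column of the frames above into the index:

frame1i = frame1.set_index('id')
frame2i = frame2.set_index('id')
pd.merge(frame1i, frame2i, left_index=True, right_index=True)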
Concatenating
Another type of data combination is referred to as concatenation. NumPy
provides a concatenate() function to do this kind of operation with arrays.
As regards the pandas library and its data structures like Series and DataFrame,
the fact of having labeled axes allows you to further generalize the concatenation of
arrays. The concat() function is provided by pandas for this kind of operation.
By default, the concat() function works along axis=0, returning a Series when Series are concatenated. If you set axis=1, then the result will be a DataFrame.
From the result of concatenation along axis=1 you can see that there is no overlap of data; what you have just done is effectively an 'outer' join. This can be changed by setting the join option to 'inner'.
A problem with this kind of operation is that the concatenated parts are not identifiable in the result. Suppose, for example, you want to create a hierarchical index on the axis of concatenation; to do this, use the keys option.
In the case of combinations between Series along axis=1, the keys become the column headers of the DataFrame.
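A sketch of these behaviors on two Series with disjoint indexes (the values are illustrative):

import numpy as np

ser1 = pd.Series(np.random.rand(4), index=[1, 2, 3, 4])
ser2 = pd.Series(np.random.rand(4), index=[5, 6, 7, 8])

pd.concat([ser1, ser2])                         # one longer Series (axis=0)
pd.concat([ser1, ser2], axis=1)                 # a DataFrame; NaN where no overlap
pd.concat([ser1, ser2], axis=1, join='inner')   # only common labels (none here)
pd.concat([ser1, ser2], keys=[1, 2])            # hierarchical index on axis 0
pd.concat([ser1, ser2], axis=1, keys=[1, 2])    # keys become the column headers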
Combining
There is another situation in which the combination of data cannot be obtained with either merging or concatenation: the case in which the two datasets have indexes that overlap entirely or only partially. One function applicable to Series is combine_first(), which performs this kind of operation along with data alignment.
If instead you want a partial overlap, you can specify only the portion of the Series you want to overlap.
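A sketch with two partially overlapping Series (the values are illustrative):

ser3 = pd.Series(np.random.rand(5), index=[1, 2, 3, 4, 5])
ser4 = pd.Series(np.random.rand(4), index=[2, 4, 5, 6])

ser3.combine_first(ser4)           # ser3's values, ser4 fills the holes
ser4.combine_first(ser3)           # the other way around
ser3[:3].combine_first(ser4[:3])   # partial overlap: only a portion is combined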
Pivoting
The arrangement of the values by row or by column is not always suited to your goals. Sometimes you would like to rearrange the data, carrying column values onto rows or vice versa.
Pivoting with Hierarchical Indexing
In the context of pivoting you have two basic operations:
❑ stacking: rotates or pivots the data structure converting columns to rows
❑ unstacking: converts rows into columns
Applying stack() to a DataFrame produces a hierarchically indexed Series; from this Series, you can reassemble the DataFrame into a pivoted table with the unstack() function.
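A minimal sketch of the round trip (the labels are illustrative):

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['white', 'black', 'red'],
                     columns=['ball', 'pen', 'pencil'])

ser = frame.stack()   # hierarchically indexed Series (columns rotated into rows)
ser.unstack()         # back to the pivoted DataFrame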
Pivoting from “Long” to “Wide” Format
When entries are recorded across many rows, with key fields often duplicated in subsequent lines, the resulting tabular arrangement of the data is referred to as the long or stacked format.
This mode of data recording, however, has some disadvantages. One, for example, is precisely the multiplicity and repetition of some fields. Instead of the long format, there is another way to arrange the data in a table, called the wide format. This mode is easier to read, allows easy connection with other tables, and occupies much less space. So, in general, it is a more efficient way of storing the data.
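The slides do not show the call, but the conversion from long to wide can be sketched with pandas' pivot() function (the column names here are illustrative):

longframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red'],
                          'item': ['ball', 'pen', 'ball', 'pen'],
                          'value': [1.2, 3.4, 0.7, 2.1]})

# one row per color, one column per item
wideframe = longframe.pivot(index='color', columns='item', values='value')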
Removing
The last stage of data preparation is the removal of columns and rows.
To remove a column, simply use the del command applied to the DataFrame with the column name specified. To remove an unwanted row, use the drop() function with the label of the corresponding index as argument.
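A sketch of both removals (the frame is illustrative):

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['blue', 'white', 'red'],
                     columns=['ball', 'pen', 'pencil'])

del frame['ball']     # removes the column in place
frame.drop('white')   # returns a copy without the 'white' row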
Data Transformation
After you have arranged the form of the data and their placement within the data structure, it is important to transform their values.
Removing Duplicates
Pandas provides us with a series of tools to analyze the duplicate data present
in large data structures.
The duplicated() function applied to a DataFrame detects the rows that appear to be duplicated. It returns a Series of Booleans in which each element corresponds to a row: True if the row duplicates an earlier occurrence (only the later occurrences are marked, not the first), and False otherwise.
Having a Boolean Series as the return value can be useful in many cases, especially for filtering. In fact, if you want to know which rows are the duplicates, you can use that Series to filter the DataFrame, as in the sketch below.
drop_duplicates()
Generally, all duplicated rows are to be deleted from the DataFrame; to do that, pandas provides the drop_duplicates() function, which returns the DataFrame without duplicate rows.
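A sketch of both functions on a small frame with repeated rows (the values are illustrative):

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

dframe.duplicated()           # Boolean Series: True for the later occurrences
dframe[dframe.duplicated()]   # filter: show only the duplicate rows
dframe.drop_duplicates()      # the frame without the duplicate rows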
Mapping
Mapping is nothing more than the creation of a list of matches between two different values, with the ability to bind a value to a particular label or string. To define a mapping, there is no better object than a dict:

map = {
 'label1' : 'value1',
 'label2' : 'value2',
 ...
}
Three specific functions accept a mapping as an argument:
❖ replace(): replaces values
❖ map(): creates a new column
❖ rename(): replaces the index values
Replacing Values via Mapping
Often in the data structure that you have assembled there are values that do not
meet your needs.
Define, as an example, a DataFrame containing various objects and colors,
including two colors that are not in English.
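A sketch of such a frame and of the replacement (the Italian color names are illustrative):

frame = pd.DataFrame({'item': ['ball', 'mug', 'pen', 'pencil', 'ashtray'],
                      'color': ['white', 'rosso', 'verde', 'black', 'yellow']})

newcolors = {'rosso': 'red', 'verde': 'green'}
frame.replace(newcolors)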
Adding Values via Mapping
You can also exploit mapping to add values in a column depending on the values contained in another.
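For example, a hypothetical dict of prices keyed on the items of the frame above:

prices = {'ball': 5.56, 'mug': 4.20, 'pen': 1.30,
          'pencil': 0.56, 'ashtray': 2.75}

frame['price'] = frame['item'].map(prices)   # new column driven by 'item'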
Rename the Indexes of the Axes
To replace the label indexes, pandas provides the rename() function, which
takes the mapping as argument, that is, a dict object.
As you can see, by default the indexes are renamed. If you want to rename columns, you must use the columns option. This time, you assign the two mappings explicitly to the index and columns options.
Here too, in the simplest cases in which you have a single value to replace, you can pass the arguments to the function inline, avoiding having to write and assign many variables.
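A sketch of the three forms, with illustrative mappings:

reindex = {0: 'first', 1: 'second', 2: 'third', 3: 'fourth', 4: 'fifth'}
recolumn = {'item': 'object', 'price': 'value'}

frame.rename(reindex)                          # by default, renames the index
frame.rename(index=reindex, columns=recolumn)  # both axes explicitly
frame.rename(index={1: 'first'}, columns={'item': 'object'})  # single values inline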
Discretization and Binning
Sometimes, to carry out an analysis, it is necessary to transform continuous data into discrete categories.
For example, you may have a reading of an experimental value between 0 and
100. These data are collected in a list.
>>> results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]
You know that the experimental values have a range from 0 to 100; therefore
you can uniformly divide this interval, for example, into four equal parts, i.e.,
bins.
The first contains the values between 0 and 25, the second between 26 and 50,
the third between 51 and 75, and the last between 76 and 100.
To do this binning with pandas, first you have to define an array containing the
values of separation of bin:
>>> bins = [0,25,50,75,100]
Then you apply the special cut() function to the array of results, also passing the bins.
The object returned by the cut() function is a special object of Categorical type. You can consider it as an array of strings indicating the names of the bins. Internally, it contains an array with the names of the different categories and an array of codes, with as many elements as there are results, indicating which category each element falls into (in recent versions of pandas these are the categories and codes attributes).
Finally, to know the occurrences for each bin, that is, how many results fall into each category, you use the value_counts() function.
In addition to cut(), pandas provides another method for binning: qcut(). This function divides the sample directly into quantiles (for example, quintiles when you ask for five bins). qcut() ensures that the number of occurrences in each bin is equal, but the edges of the bins will vary.
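A sketch using the results and bins defined above (counting via a Series is one of several equivalent ways):

cat = pd.cut(results, bins)     # a Categorical: each result assigned to a bin
pd.Series(cat).value_counts()   # occurrences per bin
pd.qcut(results, 5)             # five bins with (roughly) equal counts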
Detecting and Filtering Outliers
During data analysis, the need to detect abnormal values within a data structure often arises. By way of example, create a DataFrame with three columns of 1,000 completely random values:
>>> randframe = pd.DataFrame(np.random.randn(1000,3))
>>> randframe.describe()
0 1 2
count 1000.000000 1000.000000 1000.000000
mean 0.021609 -0.022926 -0.019577
std 1.045777 0.998493 1.056961
min -2.981600 -2.828229 -3.735046
25% -0.675005 -0.729834 -0.737677
50% 0.003857 -0.016940 -0.031886
75% 0.738968 0.619175 0.718702
max 3.104202 2.942778 3.458472
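The slides stop at describe(); one common convention (an assumption here, not the only possible criterion) is to flag rows with a value more than three standard deviations from the mean:

# Boolean mask per column, reduced to one flag per row
randframe[(np.abs(randframe) > 3 * randframe.std()).any(axis=1)]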
Permutation
The operations of permutation (random reordering) of a Series or the rows of a
DataFrame are easy to do using the numpy.random.permutation() function.
Now create an array of five integers from 0 to 4, arranged in random order with the permutation() function. This will be the new order in which to arrange the rows of the DataFrame. Then apply it to the whole DataFrame with the take() function, as in the sketch below.
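A sketch of the whole operation (the frame is illustrative):

nframe = pd.DataFrame(np.arange(25).reshape(5, 5))

new_order = np.random.permutation(5)   # e.g. array([3, 0, 4, 1, 2])
nframe.take(new_order)                 # rows rearranged in the new order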
Random Sampling
Sometimes, when you have a huge DataFrame, you may need to sample it randomly, and the quickest way to do this is with the np.random.randint() function.
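A sketch; unlike permutation(), randint() can draw the same row more than once:

sample = np.random.randint(0, len(nframe), size=3)
nframe.take(sample)

nframe.sample(n=3)   # recent pandas versions also offer a dedicated method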
String Manipulation
Built-in Methods for Manipulation of Strings
split()
The split() function allows us to separate parts of a text, taking as a reference point a separator, for example a comma.
As you can see in the first element, you can end up with a string that has a space character at the end. To overcome this frequent problem, you have to use split() together with the strip() function, which trims the whitespace (including newlines).
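A sketch with an address string in the spirit of the book's example:

text = '16 Bolton Avenue , Boston'

text.split(',')                        # ['16 Bolton Avenue ', ' Boston']
[s.strip() for s in text.split(',')]   # ['16 Bolton Avenue', 'Boston']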
join()
If the parts to be concatenated are many, a more practical approach is to use the join() function, called on the separator character with which you want to join the various strings.
Another category of operations that can be performed on strings is searching for pieces of text in them, i.e., substrings.
>>> 'Boston' in text
True
In the same area, you can know how many times a character or combination of
characters (substring) occurs within a text. The count() function provides you with
this number.
>>> text.count('e')
2
>>> text.count('Avenue')
1
Another operation that can be performed on strings is the replacement or elimination of a substring (or a single character). In both cases you use the replace() function; if you replace a substring with an empty string, the operation is equivalent to eliminating the substring from the text.
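A sketch of join(), count(), and replace() (the token list is illustrative):

tokens = ['A+', 'A', 'A-', 'B', 'BB', 'BBB', 'C+']
';'.join(tokens)                   # 'A+;A;A-;B;BB;BBB;C+'

text.count('e')                    # how many times 'e' occurs in text
text.replace('Avenue', 'Street')   # substitution
text.replace('1', '')              # elimination: replace with an empty string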
Regular Expressions
The regular expressions provide a very flexible way to search and match string
patterns within a text.
A single expression, generically called regex, is a string formed according to the
regular expression language.
There is a built-in Python module called re, which is responsible for the operation
of the regex.
So first of all, when you want to make use of regular expressions, you will need
to import the module.
The re module provides a set of functions that can be divided into three
different categories:
❑ pattern matching
❑ substitution
❑ splitting
The regex for expressing a sequence of one or more whitespace characters is \s+.
When you call the re.split() function, the regular expression is first compiled, and then the split() function is called on the text argument.
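A sketch of splitting on one or more whitespace characters:

import re

text2 = 'This is   an odd  sentence'
re.split(r'\s+', text2)   # ['This', 'is', 'an', 'odd', 'sentence']

regex = re.compile(r'\s+')   # explicit compilation, reusable afterwards
regex.split(text2)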
To match a regex pattern against all the other substrings in the text, you can use the findall() function. It returns a list of all the substrings in the text that meet the requirements of the regex.
Two other functions are related to findall(): match() and search(). While findall() returns all matches in a list, search() returns only the first match, and match() checks for a match only at the beginning of the string.
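A sketch of the three functions on an illustrative string:

address = '16 Bolton Avenue, Boston'

re.findall(r'A\w+', address)   # ['Avenue']: all words starting with 'A'
re.search(r'A\w+', address)    # a match object for the first occurrence only
re.match(r'A\w+', address)     # None: match() tests only the start of the string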
Data Aggregation
The last stage of data manipulation is data aggregation. By data aggregation, we generally mean a transformation that produces a single value from an array.
GroupBy
Now you will analyze in detail what the GroupBy process is and how it works. Its internal mechanism is generally referred to as split-apply-combine, a process divided into three phases:
• splitting: division of the dataset into groups
• applying: application of a function on each group
• combining: combination of all the results obtained from the different groups
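A minimal sketch of the three phases (column names and values are illustrative):

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'white'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

group = frame['price1'].groupby(frame['color'])   # splitting
group.mean()                                      # applying and combining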
Group Iteration
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the name of each group together with its data portion.
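A sketch of the iteration, continuing with the frame above:

for name, group_data in frame.groupby('color'):
    print(name)         # the group key
    print(group_data)   # the portion of the frame belonging to that group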