Module – 5
The PANDAS
(Chapter-5)
Reading & Writing Data
I/O API Tools
The pandas library provides two main categories of functions for data analysis,
known collectively as the I/O API:
1. Readers
2. Writers
Readers Writers
read_csv to_csv
read_excel to_excel
read_hdf to_hdf
read_sql to_sql
read_json to_json
read_html to_html
read_stata to_stata
read_clipboard to_clipboard
read_pickle to_pickle
read_msgpack to_msgpack (experimental)
read_gbq to_gbq (experimental)
CSV and Textual Files
If the values in a row are separated by commas, you have the CSV
(comma-separated values) format.
Other forms of tabular data, separated by spaces or tabs, are
typically contained in text files of various types (generally with the
extension .txt).
pandas provides a set of functions specific to this type of file:
• read_csv
• read_table
• to_csv
Reading Data in CSV or Text Files
The most common operation for a person approaching data analysis is to read
the data contained in a CSV file, or at least in a text file.
myCSV_01.csv
white,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
Since this file is comma-delimited, you can use the read_csv() function to
read its content and convert it into a DataFrame object at the same time.
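A minimal sketch, assuming myCSV_01.csv is in the working directory:
>>> import pandas as pd
>>> pd.read_csv('myCSV_01.csv')
   white  red  blue  green animal
0      1    5     2      3    cat
1      2    7     8      5    dog
2      3    3     6      7  horse
3      2    2     8      3   duck
4      4    4     2      1  mouse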
CSV files contain tabulated data in which the values in each row are
separated by commas. But since CSV files are considered text files, you can
also use the read_table() function, specifying the delimiter explicitly.
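Since read_table() uses the tab character as its default delimiter, the comma
must be passed through the sep option:
>>> pd.read_table('myCSV_01.csv', sep=',')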
Notice that in this CSV file, the headers that identify the columns are in
the first row. But this is not always the case; it often happens that the tabulated
data begin directly on the first line.
myCSV_02.csv
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
In this case, you can have pandas assign default names to the columns by
setting the header option to None.
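For example, pandas will then label the columns 0 through 4:
>>> pd.read_csv('myCSV_02.csv', header=None)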
Alternatively, you can specify the column names directly by
assigning a list of labels to the names option.
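For example:
>>> pd.read_csv('myCSV_02.csv', names=['white','red','blue','green','animal'])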
In more complex cases, in which you want to create a DataFrame with a
hierarchical structure by reading a CSV file, you can extend the functionality of
the read_csv() function by adding the index_col option, assigning to it all the
columns to be converted into indexes (see the sketch after the file listing).
myCSV_03.csv
color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4
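A sketch that turns the first two columns into a hierarchical index:
>>> pd.read_csv('myCSV_03.csv', index_col=['color','status'])
              item1  item2  item3
color status
black up          3      4      6
      down        2      6      7
white up          5      5      5
      down        3      3      2
      left        1      2      1
red   up          2      2      2
      down        1      1      4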
Using RegExp for Parsing TXT Files
In other cases, the files from which to parse the data may not have
well-defined separators such as a comma or a semicolon.
In these cases, regular expressions come to our aid: you can specify a
regexp within the read_table() function using the sep option.
Usually you think of separators as special characters like commas,
spaces, or tabs, but in reality a separator could be made up of
alphanumeric characters or, for example, digits such as 0.
In this example, you need to extract the numeric part from a TXT file in which
sequences of digits and literal characters are completely fused together, as in
the sketch below.
Remember to set the header option to None whenever the column
headings are not present in the TXT file.
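A sketch, assuming a hypothetical file ch05_04.txt (the name and contents are
illustrative) whose lines look like 000END123AAA122; the regexp \D+ treats every
run of non-digit characters as a separator, and engine='python' is needed
because the C parser does not support regexp separators:
>>> pd.read_table('ch05_04.txt', sep=r'\D+', header=None, engine='python')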
Another fairly common task is to exclude certain lines from parsing. In fact, you
do not always want to include the headers or unnecessary comments contained within
a file.
With the skiprows option you can exclude any lines you want, by assigning
either a count or an array of the line numbers not to consider in parsing:
to exclude the first five lines, write skiprows = 5; to rule out only the
line at index 5, write skiprows = [5].
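A sketch, assuming a hypothetical file ch05_06.txt in which the lines at
indexes 0, 1, 3, and 6 are headers or comments to discard:
>>> pd.read_table('ch05_06.txt', sep=',', skiprows=[0,1,3,6])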
Reading TXT Files into Parts or Partially
When large files are processed, or when you are only interested in portions of
these files, you often need to read the file in portions (chunks).
So if, for example, you want to read only a portion of a file, you can explicitly
specify the range of lines to parse:
with the skiprows option you select the starting line n (skiprows = n), and
with nrows the number of lines to be read after it (nrows = i).
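For example, to skip the line at index 2 and read three lines of the remainder:
>>> pd.read_csv('myCSV_02.csv', skiprows=[2], nrows=3, header=None)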
If you want to write the data contained in a DataFrame to a CSV file, use
the to_csv() function, which accepts as an argument the name of the file to
generate.
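A minimal sketch (myCSV_out.csv is an illustrative name):
>>> frame = pd.read_csv('myCSV_01.csv')
>>> frame.to_csv('myCSV_out.csv')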
As you can see from the previous example, when you write a DataFrame to a
file, by default both the index and the column headers are written to the
file. This default behavior can be changed by setting the two options index
and header to False.
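For example:
>>> frame.to_csv('myCSV_out.csv', index=False, header=False)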
One thing to take into account when writing files is that NaN values
present in a data structure are shown as empty fields in the file.
You can replace these empty fields with a value of your choice using the na_rep
option of the to_csv() function.
Common values are NULL, 0, or NaN itself.
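For example, to write the string NULL in place of every empty field:
>>> frame.to_csv('myCSV_out.csv', na_rep='NULL')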
Reading and Writing HTML Files
With regard to the HTML format, pandas provides the corresponding pair of I/O
API functions:
• read_html()
• to_html()
Writing Data in HTML
Let's see how to convert a DataFrame into an HTML table.
The internal structure of the DataFrame is automatically converted into nested
<TH>, <TR>, and <TD> tags, retaining any internal hierarchies.
Example:
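A minimal sketch of the conversion:
>>> import numpy as np
>>> import pandas as pd
>>> frame = pd.DataFrame(np.arange(4).reshape(2,2))
>>> print(frame.to_html())   # prints the <table> ... </table> markup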
An HTML page can be written through the generation of a string.
❖ First of all, create a string that contains the code of the HTML page.
❖ Then write the contents of the html string directly to the file that will
be called myFrame.html, as sketched below.
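A sketch of these two steps, assuming frame is a DataFrame like the one above:
>>> s = ['<HTML>']
>>> s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
>>> s.append('<BODY>')
>>> s.append(frame.to_html())
>>> s.append('</BODY></HTML>')
>>> html = ''.join(s)
>>> with open('myFrame.html', 'w') as html_file:
...     html_file.write(html)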
❖ A new HTML file, myFrame.html, is now in your working directory.
Double-click it to open it directly in the browser; the HTML table appears in
the upper left of the page.
Reading Data from an HTML File
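The read_html() function parses every <table> it finds in a page and returns a
list of DataFrames (it relies on an HTML parser such as lxml being installed).
A sketch that reads back the file written above:
>>> web_frames = pd.read_html('myFrame.html')
>>> web_frames[0]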
Reading Data from XML
In the list of I/O API functions, there is no specific tool regarding the XML
(Extensible Markup Language) format.
In fact, although it is not listed, this format is very important, because a
great deal of structured data is available in XML format.
This presents no problem, since Python has many other libraries (besides
pandas) that manage the reading and writing of data in XML format.
One of these libraries is the lxml library, which stands out for its excellent
performance during the parsing of very large files.
Example:
• In this example you will take the data structure described in the XML file and
convert it directly into a DataFrame.
• To do so, the first thing to do is use the objectify submodule of the lxml
library, importing it in the following way.
>>> from lxml import objectify
>>> xml = objectify.parse('books.xml')
>>> xml
<lxml.etree._ElementTree object at 0x0000000009734E08>
You got an object tree, which is an internal data structure of the lxml module.
To navigate this tree structure and select it element by element, you must first
define the root. You can do this with the getroot() function.
>>> root = xml.getroot()
Now that the root of the structure has been defined, you can access the various
nodes of the tree, each corresponding to the tag contained within the original XML
file.
>>> root.Book.Author
'Ross, Mark'
>>> root.Book.PublishDate
'2014-22-01'
❖ In this way you access nodes individually, but you can access various elements
at the same time using getchildren(). With this function, you’ll get all the child
nodes of the reference element.
>>> root.getchildren()
[<Element Book at 0x9c66688>, <Element Book at 0x9c66e08>]
❖ With the tag attribute you get the name of the tag corresponding to the child
node.
>>> [child.tag for child in root.Book.getchildren()]
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
❖ With the text attribute you get the value contained between the corresponding
tags.
>>> [child.text for child in root.Book.getchildren()]
['Ross, Mark', 'XML Cookbook', 'Computer', '23.56', '2014-22-01']
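❖ Putting these pieces together, one possible sketch builds the DataFrame
announced above; the helper etree2df() is illustrative, not part of lxml:
>>> import pandas as pd
>>> def etree2df(root):
...     # one column per tag of the first Book, one row per Book element
...     column_names = [child.tag for child in root.getchildren()[0].getchildren()]
...     xml_frame = pd.DataFrame(columns=column_names)
...     for i, book in enumerate(root.getchildren()):
...         xml_frame.loc[i] = [child.text for child in book.getchildren()]
...     return xml_frame
>>> etree2df(root)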
Reading and Writing Data on Microsoft Excel Files
❖ pandas provides specific functions to read and write data in Excel files; the
I/O API functions for this purpose are:
• to_excel()
• read_excel()
❖ The read_excel() function is able to read both Excel 2003 (.xls) files and Excel
2007 (.xlsx) files.
Example:
First, open an Excel file and enter some data. Put data in both sheet1
and sheet2, then save the file as data.xls.
❖ To read the data contained within the XLS file and obtain its conversion into a
DataFrame, you only have to use the read_excel() function.
❖ As you can see, by default the returned DataFrame is composed of the data
tabulated in the first worksheet.
❖ If, however, you need to load the data in the second worksheet, specify
the name of the sheet or the number of the sheet (its index) as the second
argument, as sketched below.
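A sketch (reading .xls files requires the xlrd package to be installed):
>>> pd.read_excel('data.xls')            # first worksheet by default
>>> pd.read_excel('data.xls', 'Sheet2')  # select the worksheet by name
>>> pd.read_excel('data.xls', 1)         # or by its index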
❖ To convert a DataFrame into a spreadsheet on Excel, you have to write as follows.
>>> frame = pd.DataFrame(np.random.random((4,4)),
index = ['exp1','exp2','exp3','exp4'],
columns = ['Jan2015','Feb2015','Mar2015','Apr2015'])
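❖ Then write it out with to_excel(); the file name data2.xlsx is illustrative,
and writing .xlsx files requires an Excel writer such as openpyxl:
>>> frame.to_excel('data2.xlsx')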
❖ In the working directory you will find a new Excel file containing the data.
JSON Data
❖ JSON (JavaScript Object Notation) has become one of the most common
standard formats, especially for the transmission of data through the Web.
❖ So it is quite normal to have to deal with this data format when working with
data available on the Web.
Step-1:
❖ The first step is writing a DataFrame to a JSON file, which is done with the
to_json() function, passing the name of the file to generate as an argument.
❖ The converse is also possible, using read_json() with the name of the file
passed as an argument; see the sketch below.
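A sketch of the round trip; the frame contents and the file name frame.json
are illustrative:
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index=['white','black','red','blue'],
...                      columns=['up','down','right','left'])
>>> frame.to_json('frame.json')
>>> pd.read_json('frame.json')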
❖ Generally, however, JSON files do not have a tabular structure. Thus, you will
need to somehow convert the dict structure into tabular form.
❖ The pandas library provides a function, called json_normalize(), that is able to
convert a dict or a list into a table.
❖ First you have to import the function
>>> from pandas.io.json import json_normalize
❖ Then write a JSON file with a nested, non-tabular structure using any text
editor, and save it in the working directory as books.json.
❖ As you can see, the file structure is no longer tabular, but more complex, so the
approach with the read_json() function is no longer valid.
❖ You can still get the data in tabular form from this structure.
❖ First you have to load the contents of the JSON file and convert them into a string.
❖ Then you are ready to apply the json_normalize() function, as sketched below.
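A sketch, assuming books.json contains a list of records, each with scalar keys
such as writer and nationality plus a nested books list (this structure is an
assumption for illustration):
>>> import json
>>> with open('books.json', 'r') as f:
...     text = json.loads(f.read())
>>> json_normalize(text, 'books')                             # one row per nested book
>>> json_normalize(text, 'books', ['writer', 'nationality'])  # carry the parent keys along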
Pickle—Python Object Serialization
❖ The pickle module implements a powerful algorithm for serialization and
de-serialization of a data structure implemented in Python.
❖ Pickling is the process in which the hierarchy of an object is converted into a
stream of bytes.
❖ In Python, the pickling operation is carried out by the pickle module, but
there is also a module called cPickle, the result of an enormous amount
of work optimizing the pickle module (it is written in C).
❖ In many cases this module can even be 1,000 times faster than the
pickle module.
Serialize a Python Object with cPickle
❖ The data format used by the pickle module (or cPickle) is specific to Python.
❖ By default, an ASCII representation is used, so that the result is readable
from the human point of view.
❖ Thus, opening such a file with a text editor, you may be able to understand its contents.
❖ To use this module you must first import it (this applies to Python 2; in
Python 3, the optimized C implementation is built into the standard pickle module):
>>> import cPickle as pickle
❖ Then create an object sufficiently complex to have an internal data structure, for
example a dict object.
>>> data = { 'color': ['white','red'], 'value': [5, 7]}
❖ Now you will perform a serialization of the data object through the dumps()
function of the cPickle module.
>>> pickled_data = pickle.dumps(data)
❖ Now, to see how the dict object was serialized, you need to look at the
content of the pickled_data variable.
❖ Once the data are serialized, they can easily be written to a file, or sent
over a socket, pipe, etc.
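❖ The inverse operation, deserialization, is done with the loads() function:
>>> nframe = pickle.loads(pickled_data)
>>> nframe
{'color': ['white', 'red'], 'value': [5, 7]}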
Pickling with pandas
❖ When it comes to pickling (and unpickling) with the pandas library, everything
is much easier: there is no need to import the cPickle module into the Python
session, and the whole operation is performed implicitly.
❖ Also, the serialization format used by pandas is not completely ASCII.
>>> frame = pd.DataFrame(np.arange(16).reshape(4,4),
...                      index = ['up','down','left','right'])
>>> frame.to_pickle('frame.pkl')
❖ Now in your working directory there is a new file called frame.pkl containing all
the information about the frame DataFrame.
❖ To open a PKL file and read its contents, simply use the read_pickle() function:
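>>> pd.read_pickle('frame.pkl')
        0   1   2   3
up      0   1   2   3
down    4   5   6   7
left    8   9  10  11
right  12  13  14  15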