PANDAS
Not your ordinary fuzzy bear
Agenda:
6:45-6:50 Why do we need Pandas
6:50-7:00 General Information (Download, etc.)
7:00-7:15 Quick Tour and Use Cases using Python Notebooks
7:15-7:30 Q&A
Data Science has been around for much longer than it has been buzzworthy.
At its core, we simply apply the scientific method to data:
(data) scientists ask questions, do research, construct hypotheses,
test, analyze, interpret results and repeat.
Too much testing, too little observation and prep.
We think data looks clean and pretty.
But in reality it looks like, well… like…
But, “ETL is not as fun as
Statistics, Machine learning …”
Cue the PANDA!
Time to show you that it is
pandas is an open source, BSD-licensed library
providing high-performance, easy-to-use data
structures and data analysis tools for the Python
programming language.
You can install pandas from PyPI using pip. I recommend, however, that you try
the Anaconda distribution, provided for free by Continuum Analytics. It includes
pandas, NumPy, SciPy, IPython notebooks and the Spyder editor.
http://pandas.pydata.org/
http://docs.continuum.io/anaconda/install.html
Quick Tour using open Financial data set
1. Data cleansing, manipulation, merging, etc.
2. Web crawling and data preparation using a scrapy-pandas pipeline
A fast and efficient DataFrame object for data manipulation with integrated indexing;
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
Columns can be inserted and deleted from data structures for size mutability;
Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
High performance merging and joining of data sets;
Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data.
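A few of the features listed above can be sketched in a handful of lines. This is only an illustration on made-up data: the region names, amounts, managers and dates are all assumptions, not the data set used in the talk.

```python
import numpy as np
import pandas as pd

# Two small Series with partially overlapping labels (made-up data).
q1 = pd.Series({"east": 10.0, "west": 20.0, "north": 5.0})
q2 = pd.Series({"east": 1.0, "west": 2.0, "south": 7.0})

# Label-based alignment: values are matched by index label, and a label
# present on only one side yields NaN (missing) in the result.
total = q1 + q2

# Treat a missing side as zero instead of propagating NaN.
total_filled = q1.add(q2, fill_value=0)

# Split-apply-combine: split rows by region, sum each group, combine.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100, 150, 200, 50],
})
by_region = sales.groupby("region")["amount"].sum()

# Merging: attach a (hypothetical) lookup table to every sales row.
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ann", "Bob"]})
merged = sales.merge(managers, on="region", how="left")

# Time series: date range generation, a moving-window mean, and lagging.
ts = pd.Series(range(6), index=pd.date_range("2014-01-01", periods=6, freq="D"))
rolling_mean = ts.rolling(window=3).mean()
lagged = ts.shift(1)
```

Note that `total` has NaN for "north" and "south" because each appears in only one Series, while `total_filled` keeps those values.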
Use Case 1: Determine a metric for sales data for a company/client using a
common mapping (geography, age, etc.)
Data: The data set is spread across numerous files with different file names,
columns, data types, null values, and numeric and text data. How can you:
a. Combine all the files quickly
b. Identify and clean erroneous data
c. Create a mapping that links the data sets (all data links to location, age,
etc.)
Solution: PANDAS! With Excel and an RDBMS this usually takes
days or weeks. In pandas it took me a day, and it could probably be done in a
few hours.
Added benefit: The code can be reused for all input data. With an
RDBMS you would need to rewrite the SQL (or related queries) for each
particular database. *
* There is also pandasql, which runs SQL queries against pandas DataFrames;
pandas itself can also read from and write to SQL databases. Please see the
pandas docs.
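Steps a to c of the use case might look roughly like the sketch below. The file names, columns and zip-to-region table are invented for illustration, and in-memory strings stand in for files on disk; the actual demo data is not part of these slides.

```python
import io
import pandas as pd

# Hypothetical stand-ins for files on disk (names and contents are invented).
files = {
    "sales_jan.csv": "cust_id,zip,amount\n1,20001,100\n2,20002,abc\n",
    "sales_feb.csv": "cust_id,zip,amount\n1,20001,50\n3,,75\n",
}

# a. Combine all the files, remembering which file each row came from.
frames = []
for name, text in files.items():
    frame = pd.read_csv(io.StringIO(text), dtype=str)  # read everything as text first
    frame["source_file"] = name
    frames.append(frame)
sales = pd.concat(frames, ignore_index=True)

# b. Identify and clean erroneous data: non-numeric amounts become NaN,
#    then rows with a missing amount or zip are set aside.
sales["amount"] = pd.to_numeric(sales["amount"], errors="coerce")
bad_rows = sales[sales["amount"].isna() | sales["zip"].isna()]
clean = sales.dropna(subset=["amount", "zip"])

# c. Create a mapping that links the data to a common key (here: zip -> region).
zip_to_region = pd.DataFrame({
    "zip": ["20001", "20002"],
    "region": ["DC-North", "DC-South"],
})
mapped = clean.merge(zip_to_region, on="zip", how="left")
sales_by_region = mapped.groupby("region")["amount"].sum()
```

Keeping `bad_rows` rather than silently dropping them makes it possible to review what was rejected. With real files, `glob.glob` over a directory could replace the `files` dictionary.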
This is what I got:
Files that were neither here nor there
Files that were everywhere
You could not tell which was which
And the data quality was a B***
Demo:
Original files: a sales text file, a customer file, and an address book file.
None of the files had headers; they had to be inserted using a main data
definitions file.
Outcome: One file showing sales by zip, region, province and country,
mapped to specific column types for Apache Hive
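A minimal sketch of the demo workflow, under stated assumptions: the column names in `definitions`, the sample rows, and the target types (text for zip, double for amount) are all hypothetical, since the real data definitions file and Hive schema are not shown in the slides.

```python
import io
import pandas as pd

# Hypothetical data definitions: table name -> column names (invented).
definitions = {
    "sales": ["cust_id", "amount"],
    "customers": ["cust_id", "address_id"],
    "addresses": ["address_id", "zip", "region", "province", "country"],
}

# Headerless raw files, simulated here as in-memory strings.
raw = {
    "sales": "1,100.5\n2,20\n1,30\n",
    "customers": "1,10\n2,11\n",
    "addresses": "10,20001,Mid-Atlantic,DC,US\n11,M5V,Central,ON,CA\n",
}

# Insert the headers from the definitions while reading each file.
tables = {
    name: pd.read_csv(io.StringIO(text), header=None, names=definitions[name])
    for name, text in raw.items()
}

# Link sales -> customers -> addresses.
joined = (
    tables["sales"]
    .merge(tables["customers"], on="cust_id", how="left")
    .merge(tables["addresses"], on="address_id", how="left")
)

# One table of sales by zip, region, province and country.
geo_cols = ["zip", "region", "province", "country"]
outcome = joined.groupby(geo_cols, as_index=False)["amount"].sum()

# Cast to explicit types so they line up with the assumed Hive columns.
outcome = outcome.astype({"zip": str, "amount": "float64"})

# Headerless delimited text, a format a Hive external table can read.
csv_text = outcome.to_csv(index=False, header=False)
```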
Exciting Libraries
Geo Mapping: https://github.com/kjordahl/geopandas
Machine Learning: https://github.com/paulgb/sklearn-pandas
External Data Sources growing:
http://pandas.pydata.org/pandas-docs/stable/remote_data.html
What else
You can write Cython to maximize performance. It can be up to 10x faster
than pure Python: row-wise code in pure Python constructs a Series for each row,
whereas with Cython and NumPy you can pass ndarrays instead.
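The sketch below is plain Python, not Cython, so it does not demonstrate the speedup itself. It only illustrates the difference described above between a row-wise function, which receives a Series per row, and a function that takes whole ndarrays, which is the calling style a Cython function would use. The column names and the `revenue` helper are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.5, 2.0, 3.25], "qty": [2, 3, 4]})

# Row-wise: pandas builds a Series for every row and calls the lambda on it.
row_wise = df.apply(lambda row: row["price"] * row["qty"], axis=1)


def revenue(price: np.ndarray, qty: np.ndarray) -> np.ndarray:
    """Hypothetical array-based kernel; a Cython version would take the same inputs."""
    return price * qty


# Array-based: pass the underlying ndarrays once, no per-row Series objects.
array_based = pd.Series(
    revenue(df["price"].to_numpy(), df["qty"].to_numpy()),
    index=df.index,
)
```

Both approaches give the same values; only the number of intermediate objects differs.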
Development Roadmap:
(0.13) Improved SQL / relational database tools
Tools for working with data sets that do not fit into memory
(0.10) Better memory usage and performance when reading very large CSV files
Better statistical graphics using matplotlib
Integration with D3.js
Better support for integer NA values
Extend GroupBy functionality to regular ndarrays, record arrays
✔ numpy.datetime64 integration, scikits.timeseries codebase integration.
Substantially improved time series functionality.
✔ Improved PyTables (HDF5) integration
✔ NDFrame data structure for arbitrarily high-dimensional labeled data
✔ Better support for NumPy dtype hierarchy without sacrificing usability
✔ Add a Factor data type (in R parlance)

Dc python meetup


Editor's Notes

  • #6: Open the IPython notebooks; have code and all source data ready to go in the notebook viewer. Set up a notebook for this meeting specifically. Remember to leave the Merck name out of all data.
  • #8: Demo the zip code pre-processing along with the mapping: show code slices, then go to the Python notebook to show it in action. Zip pre-process; clean the transfer sales; merge transfer sales with general sales; show mappings from digits to text (an efficient way to change and manipulate data for visualization).
  • #10: Mention that this is one of the largest pharma companies in the world. Mention hierarchical indexing.
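The zip pre-processing, the digits-to-text mapping and the hierarchical indexing mentioned in these notes could be sketched as follows. The data, the `channel_code` column and its code-to-name mapping are assumptions made for illustration; the notebook used in the talk is not reproduced here.

```python
import pandas as pd

# Made-up sales rows; zips read as integers have lost their leading zeros.
sales = pd.DataFrame({
    "zip": [8540, 20001, 8540, 90210],
    "channel_code": [1, 2, 2, 1],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# Zip pre-processing: convert to text and restore leading zeros.
sales["zip"] = sales["zip"].astype(str).str.zfill(5)

# Mapping from digits to text (hypothetical codes), handy for visualization.
channel_names = {1: "retail", 2: "wholesale"}
sales["channel"] = sales["channel_code"].map(channel_names)

# Grouping by two keys yields a Series with a hierarchical (two-level) index.
by_zip_channel = sales.groupby(["zip", "channel"])["amount"].sum()
```

A single entry can then be looked up with a tuple, for example `by_zip_channel[("08540", "wholesale")]`.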