Basic of Python for
Data Analysis
Pramod Toraskar.
Why learn Python for data analysis?
Here are some reasons which go in favour of learning Python:
● Open Source – free to install
● Awesome online community
● Very easy to learn
● Can serve as a common language for data science and for building web-based analytics products.
Choosing a development environment
1. Terminal / shell based
2. IDLE (default environment)
3. IPython notebook (similar to R Markdown)
IPython environment (Jupyter):
http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/install.html
Recall Python libraries and Data Structures
Lists, strings, tuples, dictionaries.
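As a quick refresher on the built-in data structures named above, here is a minimal sketch. The variable names and values are illustrative assumptions and do not come from the slides.

```python
# Illustrative values only; none of these names come from the slides.

# List: ordered and mutable
incomes = [5849, 4583, 3000]
incomes.append(2583)

# String: immutable sequence of characters
name = "Loan_Status"
parts = name.split("_")

# Tuple: ordered and immutable, handy for fixed-size records
point = (3, 2)
rows, cols = point  # tuple unpacking

# Dictionary: maps keys to values
applicant = {"Gender": "Male", "Married": "Yes"}
applicant["Education"] = "Graduate"

print(incomes, parts, rows, cols, applicant)
```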
The following libraries are the ones you will need for most scientific computation and data
analysis:
● NumPy (Numerical Python). The most powerful feature of NumPy is its n-dimensional array. The library
also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities
and tools for integration with low-level languages such as Fortran, C and C++.
● SciPy (Scientific Python). SciPy is built on NumPy. It is one of the most useful libraries for a variety
of high-level science and engineering modules such as discrete Fourier transforms, linear algebra,
optimization and sparse matrices.
● Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps. You can
use the Pylab feature in the IPython notebook (ipython notebook --pylab=inline) to use these plotting
features inline. If you omit the inline option, Pylab turns the IPython environment into one that is
very similar to MATLAB. You can also use LaTeX commands to add math to your plots.
● Pandas for structured data operations and manipulations. It is extensively used for data munging and
preparation. Pandas was added relatively recently to the Python ecosystem and has been instrumental in
boosting Python's usage in the data science community.
● Scikit-learn for machine learning. Built on NumPy, SciPy and Matplotlib, this library contains many
efficient tools for machine learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction.
● Statsmodels (statistical modeling), Seaborn (statistical data visualization), Bokeh (creating interactive
plots, dashboards and data applications on modern web-browsers. It empowers the user to generate
elegant and concise graphics in the style of D3.js.)
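To make the NumPy description above more concrete, here is a small sketch of an n-dimensional array and two of the basic linear algebra operations. The matrix values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A 2-D array (matrix); the values are arbitrary and chosen for illustration.
a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(a.shape)  # dimensions of the array
print(a.T)      # transpose

# Basic linear algebra: matrix inverse and matrix product
inv = np.linalg.inv(a)
product = a @ inv  # should be close to the identity matrix
print(product)
```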
Key phases
The 3 key phases
01
Data Exploration:
Finding out more about the data we have
● numpy
● matplotlib
● Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the dataset into a DataFrame using pandas
df = pd.read_csv("/home/ptoraska/Downloads/Loan_Prediction/train.csv")
Data
Exploration
Once you have read the dataset, you can look at the first few rows
using the function head():
df.head(10)
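Beyond head(), a few other pandas calls are commonly used at this stage. The sketch below runs them on a small hypothetical stand-in for the loan dataset. The column names ApplicantIncome and LoanAmount and all the values are assumptions made for illustration, not taken from the actual file.

```python
import pandas as pd

# Hypothetical stand-in for the loan dataset; the values are invented for illustration.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [128.0, 66.0, None, 120.0],
})

top_rows = sample.head(2)     # first two rows
summary = sample.describe()   # count, mean, std, min, quartiles, max of numeric columns
gender_counts = sample["Gender"].value_counts()  # frequency of each category

print(top_rows)
print(summary)
print(gender_counts)
```

Note that describe() counts only non-missing values, so a count lower than the number of rows is an early hint that a column has gaps.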
The 3 key phases
02
Data Munging:
Cleaning the data and playing with it to make it better suit statistical
modeling.
1. There are missing values in some variables. We should
estimate those values wisely, depending on the number of
missing values and the expected importance of the variables.
2. While looking at the distributions, we saw that Applicant
Income and Loan Amount seemed to contain extreme values
at either end. Though they might make intuitive sense, they
should be treated appropriately.
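The slides do not say which treatment to use for extreme values. One common option, shown here purely as an assumption, is a log transform, which compresses large values so that a single extreme point has less influence. The column name and the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme value; invented for illustration.
loans = pd.DataFrame({"LoanAmount": [100.0, 120.0, 130.0, 700.0]})

# Log transform (an assumed choice, not the slides' stated method):
# large values are pulled much closer to the rest of the distribution.
loans["LoanAmount_log"] = np.log(loans["LoanAmount"])

# Compare how far the maximum sits from the median before and after.
raw_ratio = loans["LoanAmount"].max() / loans["LoanAmount"].median()
log_ratio = loans["LoanAmount_log"].max() / loans["LoanAmount_log"].median()
print(raw_ratio, log_ratio)
```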
Check missing
values in the
dataset
Let us look at missing values in all the variables because most of the models
don’t work with missing data and even if they do, imputing them helps more
often than not. So, let us check the number of nulls / NaNs in the dataset
df.isnull().sum()
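Once the missing values have been counted, they can be filled in. The sketch below is one simple approach under stated assumptions: the mean for a numeric column and the most frequent value for a categorical one. The data frame is a hypothetical stand-in, and the right strategy for the real dataset depends on the variable.

```python
import pandas as pd

# Hypothetical stand-in with one missing value per column.
sample = pd.DataFrame({
    "Gender": ["Male", None, "Male", "Female"],
    "LoanAmount": [128.0, 66.0, None, 120.0],
})

missing_before = sample.isnull().sum()  # nulls per column

# Numeric column: fill with the mean of the observed values.
sample["LoanAmount"] = sample["LoanAmount"].fillna(sample["LoanAmount"].mean())

# Categorical column: fill with the most frequent category (the mode).
sample["Gender"] = sample["Gender"].fillna(sample["Gender"].mode()[0])

missing_after = sample.isnull().sum()
print(missing_before)
print(missing_after)
```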
The 3 key phases
03
Predictive Modeling:
Running the actual algorithms and having fun
After we have made the data useful for modeling, we can build
a predictive model. Scikit-learn (sklearn) is the most commonly
used library in Python for this purpose.
Building a
Predictive
Model in Python
Because sklearn requires all inputs to be numeric, we should convert all our
categorical variables to numeric ones by encoding the categories.
This can be done using the following code:
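from sklearn.preprocessing import LabelEncoder

# Categorical columns to encode as integers
var_mod = ['Gender', 'Married', 'Dependents', 'Education',
           'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    # Refit the encoder on each column and replace it with integer codes
    df[i] = le.fit_transform(df[i])
df.dtypes  # check that the encoded columns are now numeric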
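To see what the encoder does to a single column, here is a small self-contained sketch on a hypothetical three-value series. LabelEncoder sorts the distinct labels and assigns each one its position in that sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values, invented for illustration.
genders = pd.Series(["Male", "Female", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

print(encoder.classes_)  # sorted labels; a label's index is its integer code
print(codes)

# The mapping can be reversed to recover the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Because the loop above reuses one encoder for every column, only the mapping of the last column remains available afterwards. Keeping one encoder per column would be needed if the codes have to be decoded later.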
Models
● Logistic Regression: a classification algorithm.
● Decision Tree: a type of supervised learning algorithm (having a
pre-defined target variable) that is mostly used in
classification problems.
● Random Forest: a versatile machine learning method capable of
performing both regression and classification tasks.
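As a rough sketch of how these three models can be fitted and compared with scikit-learn, the example below uses synthetic data from make_classification as a stand-in for the encoded loan dataset. The data, the train/test split and the hyperparameters are all assumptions for illustration and are not taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, standing in for the encoded loan dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters here are illustrative choices, not tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the training split
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

All three estimators share the same fit/score interface, which is what makes swapping models in and out this simple.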
Thank you.