Python for Data Science
Sankalp Gabbita
Graduate Student-Data Science and Business Analytics
UNC Charlotte
How is Data used?
• Analytics is "the extensive use of data, statistical and quantitative analysis, explanatory
and predictive models, and fact-based management to drive decisions and
actions." (Davenport and Harris 2007)
Data → Analytical Tools → Actionable Knowledge
Unicorn data scientist?
Collecting, Cleaning, Exploring, Transforming, Modelling, Evaluating, Inference
Agenda
• Anaconda and Spyder
• Review of NumPy and pandas: basic data munging
• Using Matplotlib to make visualizations
• Regression concepts
• Regression application (scikit-learn)
• Clustering concepts
• Clustering application (k-means clustering using scikit-learn)
Spyder - Scientific Python Development Environment
• Spyder is an interactive development environment for the Python
language with advanced editing, live testing, and a numerical
computing environment
• The Anaconda distribution that ships Spyder also bundles popular Python
libraries: NumPy for linear algebra, Matplotlib for interactive 2D/3D graphs,
pandas for dataset manipulation, and scikit-learn for machine learning
• Run code line by line
• Interact with and alter scripts
• Code directly in the console
• Spyder is available through Anaconda
• https://www.continuum.io/downloads
NumPy - Numerical Computing
• Array creation is similar to creating MATLAB array objects
• N-dimensional array objects
• Provides linear algebra, Fourier transform, and random number capabilities
• Capable of matrix operations, string operations, and binary operations
• Easy to install, and imported with a single line:
  import numpy as np
• The line above imports the numpy package under the alias np,
e.g. np.array([(2,3),(4,5)])
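A minimal sketch of the capabilities listed above, starting from the slide's own 2x2 array. The seed value and the variable names are illustrative choices, not part of the original deck.

```python
import numpy as np

# Build a 2x2 array from nested tuples, as in the slide's example.
a = np.array([(2, 3), (4, 5)])

shape = a.shape                  # dimensions of the N-dimensional array
transposed = a.T                 # matrix transpose
product = a @ a                  # matrix multiplication
determinant = np.linalg.det(a)   # linear algebra routine

# Random number capability; the seed (0) is an arbitrary choice
# that only makes the draw reproducible.
rng = np.random.default_rng(0)
noise = rng.normal(size=a.shape)

print(shape)
print(product)
```

The `@` operator performs matrix multiplication, whereas `a * a` would multiply element by element.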
Pandas - DataFrames
• Creates an efficient DataFrame object for data manipulation with integrated
indexing
• Takes input data in many formats: CSV, Excel, SQL databases
• Handles messy and missing data easily
• Slicing, dicing, and indexing of large datasets
• Very useful for cleaning data before applying any algorithm
• Can be imported with a single line:
  import pandas as pd
• E.g. pd.read_table('<file path on local machine>')
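The loading and cleaning steps above might look like the following sketch. The inline CSV text, the column names, and the choice to fill missing ages with the mean are all hypothetical stand-ins for a real file and a real cleaning policy.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for a file on the local machine.
raw = io.StringIO(
    "name,age,city\n"
    "Ann,34,Charlotte\n"
    "Bob,,Raleigh\n"
    "Cara,29,\n"
)

df = pd.read_csv(raw)

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# One possible cleaning policy (an assumption, not the deck's):
# fill missing ages with the column mean, drop rows with no city.
clean = df.copy()
clean["age"] = clean["age"].fillna(clean["age"].mean())
clean = clean.dropna(subset=["city"])

# Boolean indexing: keep only the rows where age exceeds 30.
over_30 = clean[clean["age"] > 30]
print(over_30)
```

With a real file, the `io.StringIO` object would be replaced by the file path.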
Matplotlib - Visualization
• Python 2D plotting library for generating publication-quality figures
• Generates plots, histograms, bar charts, scatterplots, etc.
• Uses NumPy ndarrays to plot graphs
• Full control of font styles, line properties, axes properties, etc.
• Easy to install, and imported with a single line:
  import matplotlib
• The pyplot module is used for simple plotting and provides a good interactive
interface when combined with IPython
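As a rough illustration of the pyplot interface described above, the sketch below draws a line plot and a histogram from a NumPy array. The data (a sine curve), the colors, and the in-memory buffer are illustrative assumptions; the `Agg` backend is selected only so the script runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data: one period of a sine curve stored in NumPy arrays.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line properties, axis labels, and font size are all configurable.
ax_line.plot(x, y, color="tab:blue", linewidth=2, label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("sin(x)")
ax_line.set_title("Line plot", fontsize=12)
ax_line.legend()

ax_hist.hist(y, bins=10, color="tab:orange")
ax_hist.set_title("Histogram of sin(x)")

fig.tight_layout()

# Save to an in-memory buffer; a file name such as "plot.png" also works.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```

In Spyder or IPython the figure would normally be shown inline or with `plt.show()` instead of being written to a buffer.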
Regression
One dependent variable: Y
Independent variables: X1, X2, X3, ...
Y = β0 + β1 X1 + β2 X2 + β3 X3 + ... + βk Xk + ε
• Estimate the β's in multiple regression using least squares
• The sizes of the coefficients are not good indicators of the importance of the
X variables, because they depend on the units in which each variable is measured
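A small sketch of least-squares estimation, using NumPy's `np.linalg.lstsq` on synthetic data. The true coefficients (1.0, 2.0, 0.5), the noise level, and the helper name `fit_ols` are assumptions made for illustration. The second fit rescales one predictor to show why coefficient size alone says little about importance.

```python
import numpy as np

# Synthetic data with assumed true coefficients b0=1.0, b1=2.0, b2=0.5.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)


def fit_ols(predictors, target):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    # A leading column of ones lets the first coefficient act as the intercept.
    design = np.column_stack([np.ones(len(target))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


beta = fit_ols([x1, x2], y)

# Same information, but x2 recorded in units 1000 times smaller
# (for example metres instead of kilometres).
beta_rescaled = fit_ols([x1, x2 * 1000], y)

print(beta)
print(beta_rescaled)
```

The coefficient on the rescaled predictor shrinks by a factor of 1000 even though the fitted relationship is unchanged, which is the point of the last bullet.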
Simple Linear Regression Model
Key Assumptions for Linear Regression
• Linearity: the dependent variable is a linear combination of the independent variables
• Homoscedasticity: constant variance in the errors
• Normality of the errors
• Independence of errors
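One informal way to look at these assumptions is to inspect the residuals of a fitted line. The sketch below uses synthetic data and three rough summary numbers; the data, the split into halves, and the choice of statistics are illustrative assumptions, not formal tests.

```python
import numpy as np

# Synthetic data that satisfies the assumptions by construction.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 300)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line; np.polyfit returns the highest-degree term first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# Homoscedasticity: residual spread should be similar for low and high x.
half = x.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

# Independence: neighbouring residuals should be roughly uncorrelated.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Normality: sample skewness of the residuals should be near zero.
centered = residuals - residuals.mean()
skewness = np.mean(centered ** 3) / residuals.std() ** 3

print(spread_ratio, lag1_corr, skewness)
```

A spread ratio far from 1, a large lag-1 correlation, or strong skewness would each suggest that the corresponding assumption deserves a closer look.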
Logistic Regression
Binary target: linear regression does not work because its predictions are
unbounded, while a probability must lie between 0 and 1
Key Assumptions for Logistic Regression
• Linearity: between the independent variables and the log odds
• Homoscedasticity: not required
• Normality: not required, though highly skewed independent variables can still be problematic
• Independence of errors: required
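The contrast between an unbounded linear output and a bounded probability can be sketched with the logistic (sigmoid) function. The coefficients `b0` and `b1` and the input values are hypothetical; the last line checks that the log odds are linear in x.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
b0, b1 = 0.5, 1.2  # hypothetical coefficients, chosen for illustration

# A linear model's output is unbounded: here it falls below 0 and above 1.
linear_output = b0 + b1 * x

# Logistic regression passes the same linear term through the sigmoid.
probability = sigmoid(linear_output)

# The log odds recover the linear term, so they are linear in x.
log_odds = np.log(probability / (1.0 - probability))

print(linear_output)
print(probability)
```
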
Clustering
• Cluster analysis is the generic name for a wide variety of procedures that can
be used to create a classification of entities/objects
• It has been referred to as Q analysis, typology construction, classification
analysis, unsupervised pattern recognition, and numerical taxonomy
• A deck of 52 cards can be grouped as:
  • 26 red and 26 black cards
  • 13 each of Spades, Hearts, Diamonds, and Clubs
  • 4 each of Aces, Kings, Queens, and Jacks
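The card example can be reproduced in a few lines: the same 52 objects fall into different groups depending on which attribute is used. The helper name `color` and the rank labels are illustrative.

```python
from collections import Counter
from itertools import product

suits = ["Spades", "Hearts", "Diamonds", "Clubs"]
ranks = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]

# Every (rank, suit) combination gives the full deck.
deck = list(product(ranks, suits))


def color(card):
    """Hearts and Diamonds are red; Spades and Clubs are black."""
    _, suit = card
    return "red" if suit in ("Hearts", "Diamonds") else "black"


# Three different groupings of the same objects.
by_color = Counter(color(card) for card in deck)
by_suit = Counter(suit for _, suit in deck)
by_rank = Counter(rank for rank, _ in deck)

print(by_color)
print(by_suit)
print(by_rank["Ace"], by_rank["King"])
```

None of the three groupings is the "correct" one; which is useful depends on the question being asked.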
A Geometrical view of an ideal pattern
[Figure: scatter plot with axes Importance of Price and Importance of Quality]
Reality
[Figure: scatter plot with axes Importance of Price and Importance of Quality]
How to group them?
[Figures: three scatter plots with axes Importance of Price and Importance of Quality]
Similarity and Distance
• To identify natural groups, we must first define a measure of similarity
(proximity) between objects/entities.
• Assume variables (axes in space) are numeric.
• Then, if two things are similar, they should be close to each other in the
space. That is, the distance between them should be small.
• But, if two things are dissimilar, they should be well separated from each
other in the space. That is, the distance between them should be large.
• A collection of similar things would therefore likely result in more
cohesive (homogeneous) groups than a collection of dissimilar things.
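The slides do not name a specific distance measure; the sketch below assumes Euclidean distance, one common choice for numeric variables. The coordinates of the three points are hypothetical.

```python
import numpy as np

# Hypothetical coordinates on two numeric dimensions.
points = {
    "A": (1.0, 1.0),
    "B": (1.5, 1.2),
    "C": (8.0, 9.0),
}


def euclidean(p, q):
    """Straight-line distance between two points in the same space."""
    return float(np.linalg.norm(np.asarray(p) - np.asarray(q)))


d_ab = euclidean(points["A"], points["B"])  # similar objects: small distance
d_ac = euclidean(points["A"], points["C"])  # dissimilar objects: large distance

print(d_ab, d_ac)
```

Because distances depend on the units of each axis, variables are often standardized before clustering.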
K-Means Clustering
[Figure: points A to K plotted on Dimension 1 and Dimension 2]
1. Select k cluster centers.
2. Assign cases to closest center.
3. Update cluster centers.
4. Re-assign cases.
5. Repeat steps 3 and 4 until convergence.
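The five steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the random choice of starting centers, the Euclidean distance, the stopping rule, and the sample data are all assumptions.

```python
import numpy as np


def k_means(data, k, max_iter=100, seed=0):
    """Minimal k-means following the five steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 1: select k cluster centers (here: k distinct data points).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each case to its closest center.
        distances = np.linalg.norm(
            data[:, None, :] - centers[None, :, :], axis=2
        )
        new_labels = distances.argmin(axis=1)
        # Step 5: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned cases.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return labels, centers


# Hypothetical data: two well-separated groups of three points each.
data = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])
labels, centers = k_means(data, k=2)
print(labels)
print(centers)
```

Which group receives label 0 depends on the starting centers, so only the grouping itself is meaningful. scikit-learn's `KMeans` class provides a more complete implementation of the same idea.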
[Figures: the scatter plot of points A to K on Dimension 1 and Dimension 2, repeated on three further slides]
Thank You

