Introduction to
PySpark
INTRODUCTION TO PYSPARK
Benjamin Schmidt
Data Engineer
INTRODUCTION TO PYSPARK
Meet your instructor
Almost a Decade of Data Experience with PySpark
Used PySpark for Machine Learning, ETL tasks, and much more
Enthusiastic teacher of new tools for all!
INTRODUCTION TO PYSPARK
What is PySpark?
Distributed data processing: Designed to handle large datasets across clusters
Supports various data formats including CSV, Parquet, and JSON
SQL integration allows querying data using both Python and SQL syntax (example below)
Optimized for speed at scale
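To make the SQL integration concrete, here is a minimal sketch that queries the same tiny, hypothetical DataFrame with SQL and with the Python API (the SparkSession setup is covered in detail later):
# Build a session and a small example DataFrame (hypothetical data)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SqlVsPython").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 41)], ["name", "age"])
# Query with SQL syntax...
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 35").show()
# ...or with the equivalent Python DataFrame syntax
df.filter(df["age"] > 35).select("name").show()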
INTRODUCTION TO PYSPARK
When would we use PySpark?
Big data analytics
Distributed data processing
Real-time data streaming
Machine learning on large datasets
ETL and ELT pipelines
Working with diverse data sources:
1. CSV
2. JSON
3. Parquet
4. Many more
INTRODUCTION TO PYSPARK
Spark cluster
Master Node
Manages the cluster, coordinates tasks,
and schedules jobs
Worker Nodes
Execute the tasks assigned by the master
Responsible for executing the actual
computations and storing data in memory
or on disk
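As a rough sketch of how this architecture appears in code, the master URL passed to the session builder decides which cluster the driver connects to; the host below is hypothetical, and local[*] simply simulates a cluster on your own machine:
from pyspark.sql import SparkSession
# Run locally, using all available cores as executors
spark = (SparkSession.builder
         .master("local[*]")
         .appName("LocalExample")
         .getOrCreate())
# To target a standalone cluster instead, point at its master node (hypothetical URL):
# spark = (SparkSession.builder
#          .master("spark://master-host:7077")
#          .appName("ClusterExample")
#          .getOrCreate())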
INTRODUCTION TO PYSPARK
SparkSession
SparkSessions allow you to access your Spark cluster and are critical for using PySpark.
# Import SparkSession
from pyspark.sql import SparkSession
# Initialize a SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
.builder sets up a session (an attribute, not a method call)
.getOrCreate() creates a new session or retrieves an existing one
.appName() names the application, which helps manage multiple sessions
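A quick follow-up sketch, assuming the spark session created above, showing that getOrCreate() reuses the active session rather than starting a new one:
# Calling the builder again returns the same session
same_spark = SparkSession.builder.appName("AnotherName").getOrCreate()
print(same_spark is spark)   # True: the existing session is reused
print(spark.version)         # the Spark version string (depends on your installation)
spark.stop()                 # release cluster resources when you are done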
INTRODUCTION TO PYSPARK
PySpark DataFrames
Similar to other DataFrames (e.g., pandas), but distributed and optimized for Spark
# Import and initialize a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApp").getOrCreate()
# Create a DataFrame
census_df = spark.read.csv("census.csv", inferSchema=True).toDF(
    "gender", "age", "zipcode", "salary_range_usd", "marriage_status")
# Show the DataFrame
census_df.show()
Let's practice!
INTRODUCTION TO PYSPARK
Introduction to
PySpark
DataFrames
INTRODUCTION TO PYSPARK
Benjamin Schmidt
Data Engineer
INTRODUCTION TO PYSPARK
About DataFrames
DataFrames: Tabular format (rows/columns)
Supports SQL-like operations
Comparable to a pandas DataFrame or a SQL table
Structured Data
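To make the comparison concrete, a minimal sketch that builds a small, hypothetical DataFrame from Python rows and converts it to pandas for comparison (requires pandas to be installed):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AboutDataFrames").getOrCreate()
# Rows plus column names -> a structured, tabular DataFrame
people_df = spark.createDataFrame([("Alice", 34), ("Bob", 41)], ["name", "age"])
people_df.show()
# Convert to a pandas DataFrame for comparison (collects the data to the driver)
pandas_df = people_df.toPandas()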
INTRODUCTION TO PYSPARK
Creating DataFrames from filestores
# Create a DataFrame from CSV
census_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)
INTRODUCTION TO PYSPARK
Printing the DataFrame
# Show the first 5 rows of the DataFrame
census_df.show(5)
+---+-------------+--------------+-----------------+------+
|age|education.num|marital.status|       occupation|income|
+---+-------------+--------------+-----------------+------+
| 90|            9|       Widowed|                ?| <=50K|
| 82|            9|       Widowed|  Exec-managerial| <=50K|
| 66|           10|       Widowed|                ?| <=50K|
| 54|            4|      Divorced|Machine-op-inspct| <=50K|
| 41|           10|     Separated|   Prof-specialty| <=50K|
+---+-------------+--------------+-----------------+------+
INTRODUCTION TO PYSPARK
Printing DataFrame Schema
# Show the schema
census_df.printSchema()
Output:
root
|-- age: integer (nullable = true)
|-- education.num: integer (nullable = true)
|-- marital.status: string (nullable = true)
|-- occupation: string (nullable = true)
|-- income: string (nullable = true)
INTRODUCTION TO PYSPARK
Basic analytics on PySpark DataFrames
# .count() returns the total number of rows in the DataFrame
row_count = census_df.count()
print(f'Number of rows: {row_count}')
# .groupBy() allows SQL-like aggregations
census_df.groupBy('gender').agg({'salary_usd': 'avg'}).show()
Other aggregate functions are:
sum()
min()
max()
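To apply several of these at once, a sketch using the pyspark.sql.functions module, assuming the census_df and salary_usd column from the example above:
from pyspark.sql import functions as F
# Multiple aggregations in a single .agg() call
census_df.groupBy('gender').agg(
    F.avg('salary_usd').alias('avg_salary'),
    F.min('salary_usd').alias('min_salary'),
    F.max('salary_usd').alias('max_salary'),
    F.sum('salary_usd').alias('total_salary')
).show()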
INTRODUCTION TO PYSPARK
Key functions for PySpark analytics
.select() : Selects specific columns from the DataFrame
.filter() : Filters rows based on specific conditions
.groupBy() : Groups rows based on one or more columns
.agg() : Applies aggregate functions to grouped data
INTRODUCTION TO PYSPARK
Key functions: an example
# Using filter and select, we can narrow down our DataFrame
filtered_census_df = census_df.filter(census_df['age'] > 50).select('age', 'occupation')
filtered_census_df.show()
Output
+---+-----------------+
|age|       occupation|
+---+-----------------+
| 90|                ?|
| 82|  Exec-managerial|
| 66|                ?|
| 54|Machine-op-inspct|
+---+-----------------+
Let's practice!
INTRODUCTION TO PYSPARK
More on Spark
DataFrames
INTRODUCTION TO PYSPARK
Benjamin Schmidt
Data Engineer
INTRODUCTION TO PYSPARK
Creating DataFrames from various data sources
CSV files: Common for structured, delimited data
Example: spark.read.csv("path/to/file.csv")
JSON files: Semi-structured, hierarchical data format
Example: spark.read.json("path/to/file.json")
Parquet files: Optimized for storage and querying, often used in data engineering
Example: spark.read.parquet("path/to/file.parquet")
INTRODUCTION TO PYSPARK
Schema inference and manual schema definition
Spark can infer schemas from data with inferSchema=True
Manually define a schema for better control, useful for fixed data structures (example below)
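A short sketch contrasting the two approaches, with a hypothetical file path and column names:
# Let Spark infer column types from the data (costs an extra pass over the file)
inferred_df = spark.read.csv('path/to/census.csv', header=True, inferSchema=True)
inferred_df.printSchema()
# Or define the schema up front as a DDL string for fixed structures
explicit_df = spark.read.csv('path/to/census.csv', header=True,
                             schema='age INT, occupation STRING, income STRING')
explicit_df.printSchema()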
INTRODUCTION TO PYSPARK
DataTypes in PySpark DataFrames
IntegerType : Whole numbers
E.g., 1 , 3478 , -1890456
LongType: Larger whole numbers
E.g., 8-byte signed numbers, 922334775806
FloatType and DoubleType: Floating-point numbers for decimal values
E.g., 3.14159
StringType: Used for text or string data
E.g., "This is an example of a string."
...
INTRODUCTION TO PYSPARK
DataTypes Syntax for PySpark DataFrames
# Import the necessary types as classes
from pyspark.sql.types import (StructType,
StructField, IntegerType,
StringType, ArrayType)
# Construct the schema
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("scores", ArrayType(IntegerType()), True)
])
# Set the schema
df = spark.createDataFrame(data, schema=schema)
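The snippet above assumes a data object already exists; it might look like this (hypothetical values), with the resulting DataFrame printed to confirm the schema:
# Rows of (id, name, scores) matching the schema above
data = [
    (1, "Alice", [85, 90, 78]),
    (2, "Bob", [72, 88, 91])
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df.show()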
INTRODUCTION TO PYSPARK
DataFrame operations - selection and filtering
Use .select() to choose specific columns
Use .filter() or .where() to filter rows based on conditions
Use .sort() to order by a collection of columns
# Select and show only the name and age columns
df.select("name", "age").show()
# Filter on age > 30
df.filter(df["age"] > 30).show()
# Use .where() to filter on a specific value
df.where(df["age"] == 30).show()
INTRODUCTION TO PYSPARK
Sorting and dropping missing values
Order data using .sort() or .orderBy()
Use na.drop() to remove rows with null values
# Sort using the age column
df.sort("age", ascending=False).show()
# Drop missing values
df.na.drop().show()
INTRODUCTION TO PYSPARK
Cheatsheet
spark.read.json() : Load data from JSON
spark.read.schema() : Define schemas explicitly
.na.drop() : Drop rows with missing values
.select() , .filter() , .sort() , .orderBy() : Basic data manipulation functions
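Putting the cheatsheet together, a small end-to-end sketch with a hypothetical JSON file and columns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CheatsheetDemo").getOrCreate()
# Define the schema explicitly, then load the JSON file
people_df = (spark.read
             .schema("name STRING, age INT")
             .json("path/to/people.json"))
(people_df
 .na.drop()                        # drop rows with missing values
 .filter(people_df["age"] > 30)    # keep rows matching a condition
 .select("name", "age")            # choose specific columns
 .sort("age", ascending=False)     # order the result
 .show())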
Let's practice!
INTRODUCTION TO PYSPARK