1. Loading and Manipulating Data with PySpark
Loading Data:
python
df = spark.read.load("/data/products.csv", format="csv", header=True)
• spark.read.load: Reads the data from a CSV file located at /data/products.csv.
• format="csv": Specifies that the data format is CSV.
• header=True: Indicates that the first row of the CSV file contains header information (it names the columns but does not set their types; see the schema note below).
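With header=True alone, every column is loaded as a string. A minimal sketch of supplying an explicit schema instead, assuming the product columns used in the queries below (ProductID, ProductName, Category, ListPrice):
python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# An explicit schema sets proper column types and avoids the extra
# file scan that inferSchema=True would trigger
product_schema = StructType([
    StructField("ProductID", IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Category", StringType()),
    StructField("ListPrice", FloatType())
])
df = spark.read.load("/data/products.csv", format="csv", header=True, schema=product_schema)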
Manipulating Data:
python
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
• select("ProductID", "Category"): Selects the ProductID and Category columns.
• groupBy("Category").count(): Groups the data by Category and counts the number of ProductIDs in each category.
Displaying Data:
python
display(counts_df)
• display: Displays the resulting DataFrame (display is a notebook helper in environments such as Databricks and Synapse; in plain PySpark, use counts_df.show() instead).
2. Creating a Table in the Metastore and Running SQL Queries
Creating a Temp View:
python
df.createOrReplaceTempView("products")
• createOrReplaceTempView: Creates or replaces a temporary view named products, which can then be queried with SQL for the remainder of the session.
Running SQL Queries:
python
bikes_df = spark.sql("""SELECT ProductID, ProductName, ListPrice
                        FROM products
                        WHERE Category IN ('Mountain Bikes', 'Road Bikes')""")
display(bikes_df)
• spark.sql: Runs a SQL query to select ProductID, ProductName, and ListPrice from products where Category is either 'Mountain Bikes' or 'Road Bikes'.
• display(bikes_df): Displays the resulting DataFrame.
3. Grouping and Counting Products with SQL
SQL Query:
sql
SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category
• SELECT Category, COUNT(ProductID) AS ProductCount: Selects the Category and counts the number of ProductIDs in each category, aliasing the count as ProductCount.
• GROUP BY Category: Groups the data by Category.
• ORDER BY Category: Orders the results by Category (the DataFrame API equivalent is sketched below).
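The same result can be produced with the DataFrame API instead of SQL; a minimal sketch, assuming df is the products DataFrame loaded earlier:
python
from pyspark.sql.functions import count

category_counts_df = (df.groupBy("Category")
                        .agg(count("ProductID").alias("ProductCount"))
                        .orderBy("Category"))
display(category_counts_df)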
4. Transforming and Saving Data with PySpark
Loading Data:
python
df = spark.read.load("/data/orders.csv", format="csv", header=True)
• Reads the data from a CSV file located at /data/orders.csv.
Adding a Year Column:
python
from pyspark.sql.functions import col, year

df = df.withColumn("Year", year(col("OrderDate")))
• Adds a new column Year derived from the OrderDate column (year and col come from pyspark.sql.functions).
Saving Transformed Data:
python
df.write.mode("overwrite").parquet("/data/orders.parquet")
• Saves the DataFrame in Parquet format at the specified path, overwriting any existing output (a read-back check is sketched below).
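To confirm the write, the Parquet output can be read straight back; a minimal sketch:
python
orders_df = spark.read.parquet("/data/orders.parquet")
orders_df.show(5)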
5. Writing Partitioned Data
python
df.write.partitionBy("Year").mode("overwrite").parquet("/data")
• Partitions the data by Year and saves it as Parquet files, creating one Year=<value> subfolder per distinct year under /data (see the read-side sketch below).
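Partitioning pays off at read time, because Spark can prune to just the folders a query needs. A minimal sketch, assuming a hypothetical Year=2021 partition exists:
python
# Filtering on the partition column lets Spark skip the other Year=... folders
orders_2021_df = spark.read.parquet("/data").filter("Year = 2021")
display(orders_2021_df)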
6. Creating and Saving a View
Creating a View:
python
df.createOrReplaceTempView("sales_orders")
• Creates or replaces a temporary view named sales_orders.
Transforming Data with SQL:
python
new_df = spark.sql("SELECT OrderNo, OrderDate, YEAR(OrderDate) AS Year FROM sales_orders")
• Runs a SQL query that selects OrderNo and OrderDate and derives a Year column from OrderDate.
Saving as an External Table:
python
new_df.write.partitionBy("Year").saveAsTable("transformed_orders",
                                             format="parquet",
                                             mode="overwrite",
                                             path="/transformed_orders")
• Saves the DataFrame as an external table named transformed_orders, partitioned by Year and stored as Parquet at /transformed_orders (supplying path is what makes the table external rather than managed). Once registered, it can be queried by name, as sketched below.
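A minimal sketch of querying the registered table, assuming a hypothetical order year of 2021 exists in the data:
python
display(spark.sql("SELECT * FROM transformed_orders WHERE Year = 2021"))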
7. Delta Lake Operations
Saving Data in Delta Format:
python
df = spark.read.load("/data/mydata.csv", format="csv", header=True)
delta_table_path = "/delta/mydata"
df.write.format("delta").save(delta_table_path)
• Reads data from a CSV file and writes it in Delta format at the specified path.
Updating Delta Table:
python
from delta.tables import *
from pyspark.sql.functions import *
deltaTable = DeltaTable.forPath(spark, delta_table_path)
deltaTable.update(
    condition = "Category == 'Accessories'",
    set = { "Price": "Price * 0.9" })
• Updates the Delta table in place where Category is 'Accessories', reducing the price by 10%.
Reading a Specific Version:
python
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
• Reads the Delta table as of a specific version (here version 0, the table's initial state); option("timestampAsOf", ...) works the same way for time-based queries. To see which versions exist, inspect the table history, as sketched below.
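A minimal sketch of listing a table's commit history with the Delta Lake API:
python
from delta.tables import DeltaTable

# Each commit (initial write, update, ...) appears as one row with its version number
deltaTable = DeltaTable.forPath(spark, delta_table_path)
display(deltaTable.history())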
8. Managed vs. External Tables
Managed Tables:
• Defined without a specific location; the files are created in the metastore's managed storage (for example, the Spark warehouse folder).
• Dropping the table deletes both the metadata and the data files.
External Tables:
• Defined with a specific file location; dropping the table removes only the metadata, not the data files.
A PySpark sketch of both variants follows below; a fuller comparison appears near the end of these notes.
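Assuming a hypothetical DataFrame df and storage path /external_data:
python
# Managed: no path given, so the files live in the metastore-managed warehouse folder
df.write.saveAsTable("managed_products")

# External: an explicit path keeps the data files outside the warehouse;
# DROP TABLE external_products would leave /external_data intact
df.write.option("path", "/external_data").saveAsTable("external_products")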
9. Stream Processing with Delta Lake
Reading a Delta Table Stream:
python
from pyspark.sql.types import *
from pyspark.sql.functions import *
stream_df = (spark.readStream.format("delta")
                 .option("ignoreChanges", "true")
                 .load("/delta/internetorders"))
• Reads streaming data from a Delta table; ignoreChanges tolerates upstream updates and deletes. Note that show() cannot be called on a streaming DataFrame directly; the stream must be started with writeStream before any rows can be observed.
Writing a Delta Table Stream:
python
# jsonSchema and inputPath are assumed to be defined earlier in the notebook
stream_df = (spark.readStream.schema(jsonSchema)
                 .option("maxFilesPerTrigger", 1)
                 .json(inputPath))
table_path = '/delta/devicetable'
checkpoint_path = '/delta/checkpoint'
delta_stream = (stream_df.writeStream.format("delta")
                    .option("checkpointLocation", checkpoint_path)
                    .start(table_path))
• Reads streaming data from JSON files (at most one new file per trigger) and writes it to a Delta table; the checkpoint location records progress so the stream can restart where it left off. The handle returned by start() is used to manage the stream, as sketched below.
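A minimal sketch of monitoring and stopping the stream through the StreamingQuery handle that start() returns:
python
print(delta_stream.status)  # e.g. whether the trigger is active or waiting for data
delta_stream.stop()         # gracefully stop the stream when finished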
10. Querying Delta Files with OPENROWSET
Using OPENROWSET to Query Delta Files:
sql
SELECT *
FROM OPENROWSET(
    BULK 'https://mystore.dfs.core.windows.net/files/delta/mytable/',
    FORMAT = 'DELTA'
) AS deltadata
• Uses OPENROWSET to query Delta files stored in Azure Data Lake Storage (this syntax is used from a serverless SQL pool, e.g. in Azure Synapse).
11. Querying a Delta Table
Simple Query:
sql
USE default;
SELECT * FROM MyDeltaTable;
• Switches to the default database and selects all rows from MyDeltaTable.
Managed Tables vs. External Tables
Let's delve into the differences between managed tables and external tables in a data management system:
Managed Tables
1. Definition: Managed tables, also known as internal tables, are tables for which the
data storage and lifecycle are fully managed by the database system.
2. Storage Location: The database stores the data files in a default location, typically
within the database's managed storage.
3. Creation: You don't need to specify the storage location while creating a managed
table; the system handles it.
4. Lifecycle Management: When you drop a managed table, both the table schema and
the underlying data are deleted.
5. Example:
sql
CREATE TABLE ManagedTable (
    id INT,
    name STRING
)
USING DELTA;
External Tables
1. Definition: External tables allow you to manage the data storage location separately
from the database system, providing more flexibility and control over the data.
2. Storage Location: You specify the exact location of the data files when creating the
table, which can reside in an external storage system like Azure Blob Storage,
Amazon S3, etc.
3. Creation: You need to provide the path to the data files while creating an external
table.
4. Lifecycle Management: Dropping an external table only removes the table schema
from the database, not the underlying data files. The data remains in the specified
storage location.
5. Example:
sql
CREATE TABLE ExternalTable (
    id INT,
    name STRING
)
USING DELTA
LOCATION '/path/to/data';
Key Differences
1. Storage Management:
   ◦ Managed Tables: The database manages the data files.
   ◦ External Tables: The user specifies and manages the data files.
2. Deletion Behavior:
   ◦ Managed Tables: Dropping the table deletes both schema and data.
   ◦ External Tables: Dropping the table deletes only the schema; the data remains intact.
3. Flexibility:
   ◦ Managed Tables: Easier to use, but less control over the data location.
   ◦ External Tables: More flexible in managing data storage, but requires specifying the location.
A quick way to check which kind an existing table is appears in the sketch below.
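A minimal sketch of checking whether a table is managed or external, using the transformed_orders table created earlier:
python
# The 'Type' row of the output reads MANAGED or EXTERNAL
display(spark.sql("DESCRIBE TABLE EXTENDED transformed_orders"))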
Use Cases
• Managed Tables: Suitable for scenarios where the database system should handle the entire data lifecycle, including storage, management, and deletion.
• External Tables: Ideal for scenarios where data needs to be shared across different systems, or when you want more control over data storage and retention policies.
I hope this clarifies the differences between managed and external tables for you! If you have
more specific questions or scenarios, feel free to ask.
Why Create a Table in the Metastore?
Creating a table in the metastore offers several advantages that can improve data
management, accessibility, and collaboration. Here are the key reasons:
1. Centralized Metadata Management
The metastore acts as a centralized repository for metadata, which includes information about
the tables, such as their schema, location, and properties. This allows for easy management
and access to metadata across different services and applications.
2. Simplified Data Access
Tables registered in the metastore can be easily queried using SQL or other query languages.
This simplifies data access for users and applications, as they don't need to know the
underlying storage details.
3. Schema Enforcement
The metastore enforces schema definitions, ensuring that data adheres to the specified
structure. This helps maintain data quality and consistency.
4. Data Discovery
With the metastore, users can quickly discover available datasets and their schemas, making
it easier to find and use the data they need.
5. Integration with Data Processing Engines
The metastore integrates seamlessly with various data processing engines, such as Apache
Spark, Hive, and Presto. This enables efficient data processing and querying.
6. Access Control and Security
The metastore can be used to manage access controls and permissions, ensuring that only
authorized users can access or modify the data. This enhances data security and compliance
with regulations.
7. Support for ACID Transactions
For systems like Delta Lake, registering tables in the metastore enables support for ACID
(Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and
reliability.
Example
Registering a view so it can be queried with SQL in Spark:
python
df.createOrReplaceTempView("products")
• createOrReplaceTempView: Registers a temporary, session-scoped view. Strictly speaking, a temp view lives in the session catalog rather than the persistent metastore; use saveAsTable (or CREATE TABLE) to register a table that survives across sessions.
Creating a managed table in the metastore using SQL:
sql
CREATE TABLE ManagedTable (
    id INT,
    name STRING
)
USING DELTA;
• CREATE TABLE: Registers a managed table in the metastore.
By creating tables in the metastore, you can leverage these benefits to enhance data
management and usability in your data ecosystem. If you have more specific questions or
scenarios, feel free to ask!