1. Loading and Manipulating Data with PySpark
Loading Data:
python
df = spark.read.load("/data/products.csv", format="csv", header=True)
• spark.read.load: Reads the data from a CSV file located at /data/products.csv.
• format="csv": Specifies that the data format is CSV.
• header=True: Indicates that the first row of the CSV file contains header information (it names the columns but does not set their types; see the schema note below).
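With header=True alone, every column is loaded as a string. A minimal sketch of supplying an explicit schema instead, assuming the product columns used in the queries below (ProductID, ProductName, Category, ListPrice):
python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# An explicit schema sets proper column types and avoids the extra
# file scan that inferSchema=True would trigger
product_schema = StructType([
    StructField("ProductID", IntegerType()),
    StructField("ProductName", StringType()),
    StructField("Category", StringType()),
    StructField("ListPrice", FloatType())
])
df = spark.read.load("/data/products.csv", format="csv", header=True, schema=product_schema)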
Manipulating Data:
python
counts_df = df.select("ProductID", "Category").groupBy("Category").count()
• select("ProductID", "Category"): Selects the ProductID and Category columns.
• groupBy("Category").count(): Groups the data by Category and counts the number of ProductIDs in each category.
Displaying Data:
python
display(counts_df)
• display: Displays the resulting DataFrame (display is a notebook helper in environments such as Databricks and Synapse; in plain PySpark, use counts_df.show() instead).
2. Creating a Table in the Metastore and Running SQL Queries
Creating a Temp View:
python
df.createOrReplaceTempView("products")
• createOrReplaceTempView: Creates or replaces a temporary view named products, which can then be queried with SQL for the remainder of the session.
Running SQL Queries:
python
bikes_df = spark.sql("""SELECT ProductID, ProductName, ListPrice
                        FROM products
                        WHERE Category IN ('Mountain Bikes', 'Road Bikes')""")
display(bikes_df)
• spark.sql: Runs a SQL query to select ProductID, ProductName, and ListPrice from products where Category is either 'Mountain Bikes' or 'Road Bikes'.
• display(bikes_df): Displays the resulting DataFrame.
3. Grouping and Counting Products with SQL
SQL Query:
sql
SELECT Category, COUNT(ProductID) AS ProductCount
FROM products
GROUP BY Category
ORDER BY Category
• SELECT Category, COUNT(ProductID) AS ProductCount: Selects the Category and counts the number of ProductIDs in each category, aliasing the count as ProductCount.
• GROUP BY Category: Groups the data by Category.
• ORDER BY Category: Orders the results by Category (the DataFrame API equivalent is sketched below).
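The same result can be produced with the DataFrame API instead of SQL; a minimal sketch, assuming df is the products DataFrame loaded earlier:
python
from pyspark.sql.functions import count

category_counts_df = (df.groupBy("Category")
                        .agg(count("ProductID").alias("ProductCount"))
                        .orderBy("Category"))
display(category_counts_df)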
4. Transforming and Saving Data with PySpark
Loading Data:
python
df = spark.read.load("/data/orders.csv", format="csv", header=True)
• Reads the data from a CSV file located at /data/orders.csv.
Adding a Year Column:
python
from pyspark.sql.functions import col, year

df = df.withColumn("Year", year(col("OrderDate")))
• Adds a new column Year derived from the OrderDate column (year and col come from pyspark.sql.functions).
Saving Transformed Data:
python
df.write.mode("overwrite").parquet("/data/orders.parquet")
• Saves the DataFrame in Parquet format at the specified path, overwriting any existing output (a read-back check is sketched below).
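To confirm the write, the Parquet output can be read straight back; a minimal sketch:
python
orders_df = spark.read.parquet("/data/orders.parquet")
orders_df.show(5)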
5. Writing Partitioned Data
python
df.write.partitionBy("Year").mode("overwrite").parquet("/data")
• Partitions the data by Year and saves it as Parquet files, creating one Year=<value> subfolder per distinct year under /data (see the read-side sketch below).
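Partitioning pays off at read time, because Spark can prune to just the folders a query needs. A minimal sketch, assuming a hypothetical Year=2021 partition exists:
python
# Filtering on the partition column lets Spark skip the other Year=... folders
orders_2021_df = spark.read.parquet("/data").filter("Year = 2021")
display(orders_2021_df)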
6. Creating and Saving a View
Creating a View:
python
df.createOrReplaceTempView("sales_orders")
• Creates or replaces a temporary view named sales_orders.
Transforming Data with SQL:
python
new_df = spark.sql("SELECT OrderNo, OrderDate, YEAR(OrderDate) AS Year FROM sales_orders")
• Runs a SQL query that selects OrderNo and OrderDate and derives a Year column from OrderDate.
Saving as an External Table:
python
new_df.write.partitionBy("Year").saveAsTable("transformed_orders",
                                             format="parquet",
                                             mode="overwrite",
                                             path="/transformed_orders")
• Saves the DataFrame as an external table named transformed_orders, partitioned by Year and stored as Parquet at /transformed_orders (supplying path is what makes the table external rather than managed). Once registered, it can be queried by name, as sketched below.
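A minimal sketch of querying the registered table, assuming a hypothetical order year of 2021 exists in the data:
python
display(spark.sql("SELECT * FROM transformed_orders WHERE Year = 2021"))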
7. Delta Lake Operations
Saving Data in Delta Format:
python
df = spark.read.load("/data/mydata.csv", format="csv", header=True)
delta_table_path = "/delta/mydata"
df.write.format("delta").save(delta_table_path)
• Reads data from a CSV file and writes it in Delta format at the specified path.
Updating Delta Table:
python
from delta.tables import *
from pyspark.sql.functions import *
deltaTable = DeltaTable.forPath(spark, delta_table_path)
deltaTable.update(
    condition = "Category == 'Accessories'",
    set = { "Price": "Price * 0.9" })
• Updates the Delta table in place where Category is 'Accessories', reducing the price by 10%.
Reading a Specific Version:
python
df = spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path)
• Reads the Delta table as of a specific version (here version 0, the table's initial state); option("timestampAsOf", ...) works the same way for time-based queries. To see which versions exist, inspect the table history, as sketched below.
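A minimal sketch of listing a table's commit history with the Delta Lake API:
python
from delta.tables import DeltaTable

# Each commit (initial write, update, ...) appears as one row with its version number
deltaTable = DeltaTable.forPath(spark, delta_table_path)
display(deltaTable.history())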
8. Managed vs. External Tables
Managed Tables:
• Defined without a specific location; the files are created in the metastore's managed storage (for example, the Spark warehouse folder).
• Dropping the table deletes both the metadata and the data files.
External Tables:
• Defined with a specific file location; dropping the table removes only the metadata, not the data files.
A PySpark sketch of both variants follows below; a fuller comparison appears near the end of these notes.
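Assuming a hypothetical DataFrame df and storage path /external_data:
python
# Managed: no path given, so the files live in the metastore-managed warehouse folder
df.write.saveAsTable("managed_products")

# External: an explicit path keeps the data files outside the warehouse;
# DROP TABLE external_products would leave /external_data intact
df.write.option("path", "/external_data").saveAsTable("external_products")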
9. Stream Processing with Delta Lake
Reading a Delta Table Stream:
python
from pyspark.sql.types import *
from pyspark.sql.functions import *
stream_df = (spark.readStream.format("delta")
                 .option("ignoreChanges", "true")
                 .load("/delta/internetorders"))
• Reads streaming data from a Delta table; ignoreChanges tolerates upstream updates and deletes. Note that show() cannot be called on a streaming DataFrame directly; the stream must be started with writeStream before any rows can be observed.
Writing a Delta Table Stream:
python
# jsonSchema and inputPath are assumed to be defined earlier in the notebook
stream_df = (spark.readStream.schema(jsonSchema)
                 .option("maxFilesPerTrigger", 1)
                 .json(inputPath))
table_path = '/delta/devicetable'
checkpoint_path = '/delta/checkpoint'
delta_stream = (stream_df.writeStream.format("delta")
                    .option("checkpointLocation", checkpoint_path)
                    .start(table_path))
• Reads streaming data from JSON files (at most one new file per trigger) and writes it to a Delta table; the checkpoint location records progress so the stream can restart where it left off. The handle returned by start() is used to manage the stream, as sketched below.
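A minimal sketch of monitoring and stopping the stream through the StreamingQuery handle that start() returns:
python
print(delta_stream.status)  # e.g. whether the trigger is active or waiting for data
delta_stream.stop()         # gracefully stop the stream when finished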
10. Querying Delta Files with OPENROWSET
Using OPENROWSET to Query Delta Files:
sql
SELECT *
FROM OPENROWSET(
    BULK 'https://mystore.dfs.core.windows.net/files/delta/mytable/',
    FORMAT = 'DELTA'
) AS deltadata
• Uses OPENROWSET to query Delta files stored in Azure Data Lake Storage (this syntax is used from a serverless SQL pool, e.g. in Azure Synapse).
11. Querying a Delta Table
Simple Query:
sql
USE default;
SELECT * FROM MyDeltaTable;
• Switches to the default database and selects all rows from MyDeltaTable.
Managed Tables vs. External Tables
Let's delve into the differences between managed tables and external tables in a data management system:
Managed Tables
1. Definition: Managed tables, also known as internal tables, are tables for which the
data storage and lifecycle are fully managed by the database system.
2. Storage Location: The database stores the data files in a default location, typically
within the database's managed storage.
3. Creation: You don't need to specify the storage location while creating a managed
table; the system handles it.
4. Lifecycle Management: When you drop a managed table, both the table schema and
the underlying data are deleted.
5. Example:
sql
CREATE TABLE ManagedTable (
    id INT,
    name STRING
)
USING DELTA;
External Tables
1. Definition: External tables allow you to manage the data storage location separately
from the database system, providing more flexibility and control over the data.
2. Storage Location: You specify the exact location of the data files when creating the
table, which can reside in an external storage system like Azure Blob Storage,
Amazon S3, etc.
3. Creation: You need to provide the path to the data files while creating an external
table.
4. Lifecycle Management: Dropping an external table only removes the table schema
from the database, not the underlying data files. The data remains in the specified
storage location.
5. Example:
sql
CREATE TABLE ExternalTable (
    id INT,
    name STRING
)
USING DELTA
LOCATION '/path/to/data';
Key Differences
1. Storage Management:
   ◦ Managed Tables: The database manages the data files.
   ◦ External Tables: The user specifies and manages the data files.
2. Deletion Behavior:
   ◦ Managed Tables: Dropping the table deletes both schema and data.
   ◦ External Tables: Dropping the table deletes only the schema; the data remains intact.
3. Flexibility:
   ◦ Managed Tables: Easier to use, but less control over the data location.
   ◦ External Tables: More flexible in managing data storage, but requires specifying the location.
A quick way to check which kind an existing table is appears in the sketch below.
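A minimal sketch of checking whether a table is managed or external, using the transformed_orders table created earlier:
python
# The 'Type' row of the output reads MANAGED or EXTERNAL
display(spark.sql("DESCRIBE TABLE EXTENDED transformed_orders"))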
Use Cases
• Managed Tables: Suitable for scenarios where the database system should handle the entire data lifecycle, including storage, management, and deletion.
• External Tables: Ideal for scenarios where data needs to be shared across different systems, or when you want more control over data storage and retention policies.
I hope this clarifies the differences between managed and external tables for you! If you have
more specific questions or scenarios, feel free to ask.
Why Create a Table in the Metastore?
Creating a table in the metastore offers several advantages that can improve data
management, accessibility, and collaboration. Here are the key reasons:
1. Centralized Metadata Management
The metastore acts as a centralized repository for metadata, which includes information about
the tables, such as their schema, location, and properties. This allows for easy management
and access to metadata across different services and applications.
2. Simplified Data Access
Tables registered in the metastore can be easily queried using SQL or other query languages.
This simplifies data access for users and applications, as they don't need to know the
underlying storage details.
3. Schema Enforcement
The metastore enforces schema definitions, ensuring that data adheres to the specified
structure. This helps maintain data quality and consistency.
4. Data Discovery
With the metastore, users can quickly discover available datasets and their schemas, making
it easier to find and use the data they need.
5. Integration with Data Processing Engines
The metastore integrates seamlessly with various data processing engines, such as Apache
Spark, Hive, and Presto. This enables efficient data processing and querying.
6. Access Control and Security
The metastore can be used to manage access controls and permissions, ensuring that only
authorized users can access or modify the data. This enhances data security and compliance
with regulations.
7. Support for ACID Transactions
For systems like Delta Lake, registering tables in the metastore enables support for ACID
(Atomicity, Consistency, Isolation, Durability) transactions, ensuring data integrity and
reliability.
Example
Registering a view so it can be queried with SQL in Spark:
python
df.createOrReplaceTempView("products")
• createOrReplaceTempView: Registers a temporary, session-scoped view. Strictly speaking, a temp view lives in the session catalog rather than the persistent metastore; use saveAsTable (or CREATE TABLE) to register a table that survives across sessions.
Creating a managed table in the metastore using SQL:
sql
CREATE TABLE ManagedTable (
    id INT,
    name STRING
)
USING DELTA;
• CREATE TABLE: Registers a managed table in the metastore.
By creating tables in the metastore, you can leverage these benefits to enhance data
management and usability in your data ecosystem. If you have more specific questions or
scenarios, feel free to ask!