Killer Scenarios with Data Lake in Azure with U-SQL
U-SQL extensibility
Extend U-SQL with C#/.NET
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
What are UDOs?
Custom operator extensions, scaled out by U-SQL
• User-Defined Extractors
• User-Defined Outputters
• User-Defined Processors
  - Take one row and produce one row
  - Pass-through versus transforming
• User-Defined Appliers
  - Take one row and produce 0 to n rows
  - Used with OUTER/CROSS APPLY
• User-Defined Combiners
  - Combine rowsets (like a user-defined join)
• User-Defined Reducers
  - Take n rows and produce m rows (normally m < n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
  - EXTRACT
  - OUTPUT
  - PROCESS
  - COMBINE
  - REDUCE
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
  - By position
  - By name
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor so that callers can pass in their own parameters
    public DriverExtractor(string row_delim = "\r\n", string col_delim = ",",
                           Encoding encoding = null)
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Convert the text of column i to the type declared in the target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        // ... (handling of the remaining column types is elided on the slide)
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract(IUnstructuredReader input,
                                              IUpdatableRow outputrow)
    {
        // Split the input stream into rows, then each row into columns
        foreach (var row in input.Split(_row_delim))
        {
            using (var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
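The split-and-convert step at the heart of the extractor can be tried outside U-SQL. The following is a minimal sketch in plain C#, not the slide's implementation: `DelimitedRowParser`, `ParseRow`, and the `Type[]` schema are hypothetical stand-ins for `OutputValueAtCol_I` and `outputrow.Schema`, and `Convert.ChangeType` is swapped in for the per-type branches that the slide elides.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; it does not use the U-SQL interfaces.
public static class DelimitedRowParser
{
    // Splits one text row on colDelim and converts each field to the type
    // declared for that position in schema (a stand-in for outputrow.Schema).
    public static object[] ParseRow(string line, string colDelim, Type[] schema)
    {
        var fields = line.Split(new[] { colDelim }, StringSplitOptions.None);
        if (fields.Length != schema.Length)
        {
            throw new FormatException(
                $"Expected {schema.Length} columns but found {fields.Length}.");
        }

        var values = new object[schema.Length];
        for (int i = 0; i < fields.Length; i++)
        {
            // Convert.ChangeType handles common primitives such as int, double and string.
            values[i] = Convert.ChangeType(fields[i], schema[i], CultureInfo.InvariantCulture);
        }
        return values;
    }
}
```

Calling `DelimitedRowParser.ParseRow("42,Ann,3.5", ",", new[] { typeof(int), typeof(string), typeof(double) })` yields a boxed `int`, `string`, and `double`. A real extractor would set these on the output row by position instead of returning an array.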
How to specify UDOs?
• Code behind
• C# Class Project for U-SQL
• Any .NET language usable
  - However, not first-class in tooling
  - Use U-SQL specific .NET DLLs
  - Compile DLL, upload to ADLS, register with script
Managing Assemblies
• Create assemblies
  - CREATE ASSEMBLY db.assembly FROM @path;
  - CREATE ASSEMBLY db.assembly FROM byte[];
  - Can also include additional resource files
• Reference assemblies
  - REFERENCE ASSEMBLY db.assembly;
  - Referencing .NET Framework assemblies
  - Always accessible system namespaces:
    - U-SQL specific (e.g., for SQL.MAP)
    - All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
  - Add all other .NET Framework assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerate assemblies
  - PowerShell command
  - U-SQL Studio Server Explorer
• Drop assemblies
  - DROP ASSEMBLY db.assembly;
• Visual Studio makes registration easy!
'USING' csharp_namespace
| Alias '=' csharp_namespace_or_class.
Examples:
DECLARE @input string = "somejsonfile.json";
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
@data0 =
EXTRACT IPAddresses string
FROM @input
USING new JsonExtractor("Devices[*]");
USING json =
[Microsoft.Analytics.Samples.Formats.Json.JsonExtractor];
@data1 =
EXTRACT IPAddresses string
FROM @input
USING new json("Devices[*]");
Overlapping Range Aggregation
https://blogs.msdn.microsoft.com/azuredatalake/2016/06/27/how-do-i-combine-overlapping-ranges-using-u-sql-introducing-u-sql-reducer-udos

Input ranges:
Start Time   End Time   User Name
5:00 AM      6:00 AM    ABC
5:00 AM      6:00 AM    XYZ
8:00 AM      9:00 AM    ABC
8:00 AM      10:00 AM   ABC
10:00 AM     2:00 PM    ABC
7:00 AM      11:00 AM   ABC
9:00 AM      11:00 AM   ABC
11:00 AM     11:30 AM   ABC
11:40 PM     11:59 PM   FOO
11:50 PM     0:40 AM    FOO

Combined ranges:
Start Time   End Time   User Name
5:00 AM      6:00 AM    ABC
5:00 AM      6:00 AM    XYZ
7:00 AM      2:00 PM    ABC
11:40 PM     0:40 AM    FOO
U-SQL:
@r = REDUCE @in
PRESORT begin
ON user
PRODUCE begin DateTime
, end DateTime
, user string
READONLY user
USING new ReduceSample.RangeReducer();
• Code behind:
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace ReduceSample
{
    [SqlUserDefinedReducer(IsRecursive = true)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            // Init aggregation values
            int i = 0;
            var begin = DateTime.MaxValue;
            var end = DateTime.MinValue;
            foreach (var row in input.Rows)
            {
                // ... (elided on the slide)
                begin = row.Get<DateTime>("begin");
                end = row.Get<DateTime>("end");
                // ... (elided on the slide)
                output.Set<DateTime>("begin", begin);
                output.Set<DateTime>("end", end);
                yield return output.AsReadOnly();
                // ... (elided on the slide)
            } // foreach
        } // Reduce
    } // class RangeReducer
} // namespace ReduceSample
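The slide elides the merge logic inside the reducer loop. As a rough illustration of the idea, here is a self-contained sketch in plain C# that combines overlapping or touching ranges per user. It is an assumption about how such a merge could be written, not the code behind the slide: `TimeRange` and `RangeMerger` are hypothetical names, and `GroupBy`/`OrderBy` stand in for the `ON user` and `PRESORT begin` clauses that U-SQL applies before calling the reducer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type for this sketch; the reducer on the slide works on U-SQL rows.
public record TimeRange(DateTime Begin, DateTime End, string User);

public static class RangeMerger
{
    // Returns one row per maximal run of overlapping (or touching) ranges per user.
    public static IEnumerable<TimeRange> Merge(IEnumerable<TimeRange> rows)
    {
        foreach (var group in rows.GroupBy(r => r.User))      // stands in for ON user
        {
            bool open = false;
            DateTime begin = DateTime.MaxValue;
            DateTime end = DateTime.MinValue;

            foreach (var row in group.OrderBy(r => r.Begin))  // stands in for PRESORT begin
            {
                if (!open)
                {
                    // First row of the group starts the first range.
                    begin = row.Begin;
                    end = row.End;
                    open = true;
                }
                else if (row.Begin <= end)
                {
                    // Overlaps the current range: extend its end if needed.
                    if (row.End > end) end = row.End;
                }
                else
                {
                    // Gap found: emit the finished range and start a new one.
                    yield return new TimeRange(begin, end, group.Key);
                    begin = row.Begin;
                    end = row.End;
                }
            }

            if (open) yield return new TimeRange(begin, end, group.Key);
        }
    }
}
```

Applied to the ABC rows from the example table, this produces two ranges, 5:00 AM to 6:00 AM and 7:00 AM to 2:00 PM. The sketch relies on sorted input, which is why the U-SQL statement asks for `PRESORT begin`.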
JSON Processing
How do I extract data from JSON documents?
https://github.com/Azure/usql/tree/master/Examples/DataFormats
• Architecture of Sample Format Assembly (diagram components: Microsoft.Analytics.Samples.Formats, Newtonsoft.Json, System.Xml)
• Single JSON document per file: use JsonExtractor
• Multiple JSON documents per file:
  - Do not allow CR/LF (row delimiter) in JSON
  - Use the built-in text extractor to extract
  - Use JsonTuple to schematize (with CROSS APPLY)
• Currently loads the full JSON document into memory
  - Better to use JSONReader processing if docs are large
@json =
    EXTRACT personid int,
            name string,
            addresses string
    FROM @input
    USING new Json.JsonExtractor("[*].person");

@person =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(addresses)["address"] AS address_array
    FROM @json;

@addresses =
    SELECT personid,
           name,
           Json.JsonFunctions.JsonTuple(address) AS address
    FROM @person
         CROSS APPLY
         EXPLODE (Json.JsonFunctions.JsonTuple(address_array).Values) AS A(address);

@result =
    SELECT personid,
           name,
           address["addressid"] AS addressid,
           address["street"] AS street,
           address["postcode"] AS postcode,
           address["city"] AS city
    FROM @addresses;
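To make the two steps in the script above more concrete (turning a JSON object into a name-to-value map, then exploding an array into one item per row), here is a small sketch in plain C#. It only approximates what `JsonTuple` and `EXPLODE` do in the sample; `JsonFlatten`, `Properties`, and `Elements` are hypothetical names, and it uses `System.Text.Json` rather than the Newtonsoft.Json library that the sample assembly depends on.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helpers for illustration; not part of Microsoft.Analytics.Samples.Formats.
public static class JsonFlatten
{
    // Maps each top-level property of a JSON object to its value as text.
    // Nested objects and arrays are kept as raw JSON so they can be processed again.
    public static Dictionary<string, string> Properties(string jsonObject)
    {
        using var doc = JsonDocument.Parse(jsonObject);
        var result = new Dictionary<string, string>();
        foreach (var prop in doc.RootElement.EnumerateObject())
        {
            result[prop.Name] = Text(prop.Value);
        }
        return result;
    }

    // Returns one text item per element of a JSON array (the "explode" step).
    public static List<string> Elements(string jsonArray)
    {
        using var doc = JsonDocument.Parse(jsonArray);
        var result = new List<string>();
        foreach (var item in doc.RootElement.EnumerateArray())
        {
            result.Add(Text(item));
        }
        return result;
    }

    // Strings are unquoted; every other kind of value keeps its raw JSON text.
    private static string Text(JsonElement e) =>
        e.ValueKind == JsonValueKind.String ? e.GetString() : e.GetRawText();
}
```

Chaining the helpers follows the shape of the script: `Properties` on a person document exposes `addresses`, `Elements` on the address array yields one JSON object per address, and `Properties` on each of those gives access to fields such as `city`.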
Image Processing
https://github.com/Azure/usql/tree/master/Examples/ImageApp

Copyright   Camera Make   Camera Model   Thumbnail
Michael     Canon         70D            (image)
Michael     Samsung       S7             (image)

• Image processing assembly
  - Uses System.Drawing
  - Exposes
    - Extractors
    - Outputter
    - Processor
    - User-defined functions
• Trade-offs
  - Column memory limits: Image Extractor vs Feature Extractor
  - Main memory pressures in vertex: UDFs vs Processor vs Extractor
R Processing
KMeans Centroids
U-SQL Processing with R: Architecture
• KMeansRReducer
• R to .NET interop (RDotNet.dll & RDotNet.NativeLib.dll)
• R Runtime (R-bin.zip)
• R Engine Manager Utility (RUtilities.dll)
Similar approaches can be used to deploy other runtimes: Python, JavaScript, JVM
No external access from UDOs
Future work:
• More generic samples
• More automatic experiences (no user wrappers/deploys)
What are UDOs?
Custom operator extensions written in .NET (C#)
Scaled out by U-SQL
UDO Tips and Warnings
• Tips when using UDOs:
  - READONLY clause to allow pushing predicates through UDOs
  - REQUIRED clause to allow column pruning through UDOs
  - PRESORT on REDUCE if you need global order
  - Hint cardinality if the optimizer chooses the wrong plan
• Warnings and better alternatives:
  - Use SELECT with UDFs instead of PROCESS
  - Use user-defined aggregators instead of REDUCE
  - Learn to use windowing functions (OVER expression)
• Good use cases for PROCESS/REDUCE/COMBINE:
  - The logic needs to dynamically access the input and/or output schema, e.g., create a JSON doc for the data in the row where the columns are not known a priori.
  - Your UDF-based solution creates too much memory pressure and you can write your code more memory-efficiently in a UDO
  - You need an ordered aggregator or need to produce more than 1 row per group
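The first use case, building a JSON document from a row whose columns are only known at run time, can be sketched in plain C#. This is an illustration under stated assumptions rather than a U-SQL processor: `RowToJson` is a hypothetical helper, the two lists stand in for the names and values that a UDO would read from the row's schema, and `System.Text.Json` is used for serialization.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration; a real processor would read names and values from the input row.
public static class RowToJson
{
    // Builds one JSON object from column names and values discovered at run time.
    public static string Serialize(IReadOnlyList<string> columnNames, IReadOnlyList<object> values)
    {
        if (columnNames.Count != values.Count)
        {
            throw new ArgumentException("Each column name needs exactly one value.");
        }

        var doc = new Dictionary<string, object>();
        for (int i = 0; i < columnNames.Count; i++)
        {
            doc[columnNames[i]] = values[i];
        }

        // Values are serialized according to their run-time type (number, string, ...).
        return JsonSerializer.Serialize(doc);
    }
}
```

For example, `RowToJson.Serialize(new[] { "id", "name" }, new object[] { 1, "Ann" })` returns a JSON object with a numeric `id` and a string `name`. Nothing in the helper depends on the column list being fixed at compile time, which is the property the bullet above is about.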
http://usql.io
http://blogs.msdn.microsoft.com/azuredatalake/
http://blogs.msdn.microsoft.com/mrys/
https://channel9.msdn.com/Search?term=U-SQL#ch9Search
http://aka.ms/usql_reference
https://azure.microsoft.com/en-us/documentation/services/data-lake-analytics/
https://msdn.microsoft.com/en-us/magazine/mt614251
http://aka.ms/adlfeedback
https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
http://stackoverflow.com/questions/tagged/u-sql


Editor's Notes

  • #5: Extensions require .NET assemblies to be registered with a database