Run Python, R and .NET code at Data Lake scale with U-SQL in Azure Data Lake
Michael Rys
Principal Program Manager Big Data Team, Microsoft
@MikeDoesBigData
usql@microsoft.com
Agenda
• Characteristics of Big Data Analytics Programming
• Scaling out existing code with U-SQL:
• Scaling out Cognitive Libraries
• Introduction to U-SQL’s Extensibility Framework
• Scaling out .NET with U-SQL:
• Custom Image processing
• Scaling out Python with U-SQL
• Scaling out R with U-SQL:
• Model generation, Model testing and scoring
Some sample use cases
Digital Crime Unit – Analyze complex attack patterns
to understand BotNets and to predict and mitigate
future attacks by analyzing log records with
complex custom algorithms
Image Processing – Large-scale image feature
extraction and classification using custom code
Shopping Recommendation – Complex pattern
analysis and prediction over shopping records
using proprietary algorithms
Characteristics of Big Data Analytics
• Requires processing of any type of data
• Allows use of custom algorithms
• Scales to any size and is efficient
Can you bring your own coding expertise and existing code and scale it out?
Status Quo: SQL for Big Data
• Declarativity does scaling and parallelization for you
• Extensibility is bolted on and not “native”
• Hard to work with anything other than structured data
• Difficult to extend with custom code: complex installations and frameworks
• Limited to one or two languages
Status Quo: Programming Languages for Big Data
• Extensibility through custom code is “native”
• Declarativity is bolted on and not “native”
• User often has to care about scale and performance
• SQL is second-class, embedded in strings, with only local optimizations
• Often no code reuse or sharing across queries
Why U-SQL?
• Declarativity and extensibility are equally native!
Get the benefits of both!
Scales out your custom imperative code (written in .NET, Python, R, and more to come) in a declarative SQL-based framework
[Diagram labels: R, Python and .NET inside the U-SQL Framework]
[Diagram labels: Declarative Framework with Extract, Process and Output steps; User Extensions with User Code attached to each step]
Scale Out Cognitive Library
https://github.com/Azure/usql/tree/master/Examples/ImageApp
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive
[Example image with tags: Car, Green, Parked, Outdoor, Racing]
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
REFERENCE ASSEMBLY ImageOcr;
@imgs =
EXTRACT FileName string, ImgData byte[]
FROM @"/images/{FileName}.jpg"
USING new Cognition.Vision.ImageExtractor();
// Extract the number of objects on each image and tag them
@objects =
PROCESS @imgs
PRODUCE FileName,
NumObjects int,
Tags SqlMap<string, float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
OUTPUT @objects
TO "/objects.tsv"
USING Outputters.Tsv();
Imaging
REFERENCE ASSEMBLY [TextSentiment];
REFERENCE ASSEMBLY [TextKeyPhrase];
@WarAndPeace =
EXTRACT No int,
Year string,
Book string, Chapter string,
Text string
FROM @"/usqlext/samples/cognition/war_and_peace.csv"
USING Extractors.Csv();
@sentiment =
PROCESS @WarAndPeace
PRODUCE No,
Year,
Book, Chapter,
Text,
Sentiment string,
Conf double
USING new Cognition.Text.SentimentAnalyzer(true);
OUTPUT @sentiment
TO "/sentiment.tsv"
USING Outputters.Tsv();
Text Analysis
U-SQL/Cognitive
Example
• Identify objects in images (tags)
• Identify faces and emotions in images
• Join datasets – find out which tags are associated with happiness
REFERENCE ASSEMBLY ImageCommon;
REFERENCE ASSEMBLY FaceSdk;
REFERENCE ASSEMBLY ImageEmotion;
REFERENCE ASSEMBLY ImageTagging;
@objects =
PROCESS MegaFaceView
PRODUCE FileName, NumObjects int, Tags SqlMap<string,float?>
READONLY FileName
USING new Cognition.Vision.ImageTagger();
@tags =
SELECT FileName, T.Tag
FROM @objects CROSS APPLY EXPLODE(Tags.Split) AS T(Tag, Conf)
WHERE Tag.Contains("dog") OR Tag.Contains("cat");
@emotion =
SELECT FileName, Details.Emotion
FROM MegaFaceView
CROSS APPLY new Cognition.Vision.EmotionApplier(imgCol:"image")
AS Details(NumFaces int, FaceIndex int,
RectX float, RectY float, Width float, Height float,
Emotion string, Confidence float);
@correlation =
SELECT T.FileName, Emotion, Tag
FROM @emotion AS E
INNER JOIN
@tags AS T
ON E.FileName == T.FileName;
[Diagram labels: Images; Objects and Emotions; filter; join; aggregate]
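The filter, join and aggregate steps of this example can be imitated on a single machine to see what the script computes. The following is only a minimal Python sketch: the sample rows, the `correlate` function and its `keywords` parameter are made up for illustration and are not part of the cognitive libraries or of U-SQL.

```python
from collections import Counter

# Made-up sample rows standing in for the tag and emotion rowsets.
tags = [
    ("img1.jpg", "dog"),
    ("img1.jpg", "grass"),
    ("img2.jpg", "cat"),
    ("img3.jpg", "hotdog"),
]
emotions = [
    ("img1.jpg", "happiness"),
    ("img2.jpg", "neutral"),
    ("img3.jpg", "happiness"),
]


def correlate(tags, emotions, keywords=("dog", "cat")):
    """Filter tags by keyword, join with emotions on file name, count pairs."""
    # filter: keep tags that contain any keyword (mirrors Tag.Contains(...))
    kept = [(f, t) for f, t in tags if any(k in t for k in keywords)]
    # join: inner join on the file name
    by_file = {}
    for f, e in emotions:
        by_file.setdefault(f, []).append(e)
    joined = [(t, e) for f, t in kept for e in by_file.get(f, [])]
    # aggregate: count how often each (tag, emotion) pair occurs
    return Counter(joined)


result = correlate(tags, emotions)
```

Note that a substring filter also keeps "hotdog", which is the same behavior as `Tag.Contains("dog")` in the script.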
U-SQL extensibility
Extend U-SQL with C#/.NET, Python, R etc.
Built-in operators, functions, aggregates
C# expressions (in SELECT expressions)
User-defined aggregates (UDAGGs)
User-defined functions (UDFs)
User-defined operators (UDOs)
What are UDOs?
• User-Defined Extractors
• Converts files into rowset
• User-Defined Outputters
• Converts rowset into files
• User-Defined Processors
• Take one row and produce one row
• Pass-through versus transforming
• User-Defined Appliers
• Take one row and produce 0 to n rows
• Used with OUTER/CROSS APPLY
• User-Defined Combiners
• Combines rowsets (like a user-defined join)
• User-Defined Reducers
• Take n rows and produce m rows (normally m<n)
• Scaled out with explicit U-SQL syntax that takes a UDO instance (created as part of the execution):
• EXTRACT
• OUTPUT
• CROSS APPLY
• PROCESS
• COMBINE
• REDUCE
Custom Operator Extensions in language of your choice
Scaled out by U-SQL
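The row cardinalities of the UDO kinds listed above are easier to compare side by side. The following Python sketch models each kind as a plain function over dictionaries; the function names and the sample rows are hypothetical and this is not the U-SQL UDO API, only an illustration of "one row in, one row out", "one row in, 0 to n rows out", "n rows in, m rows out" and a user-defined join.

```python
from itertools import groupby
from operator import itemgetter


def processor(rows):
    """Processor: one row in, one row out."""
    for row in rows:
        yield {**row, "name": row["name"].upper()}


def applier(row):
    """Applier: one row in, zero to n rows out."""
    for word in row["text"].split():
        yield {"id": row["id"], "word": word}


def reducer(rows, key):
    """Reducer: n rows per key group in, m rows out (here one per group)."""
    ordered = sorted(rows, key=itemgetter(key))
    for k, group in groupby(ordered, key=itemgetter(key)):
        yield {key: k, "count": sum(1 for _ in group)}


def combiner(left, right, key):
    """Combiner: joins two rowsets with custom logic (here a plain equi-join)."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


processed = list(processor([{"id": 1, "name": "a"}]))
applied = list(applier({"id": 1, "text": "hello big data"}))
reduced = list(reducer(applied, "id"))
combined = list(combiner(processed, reduced, "id"))
```

In U-SQL the framework decides how the rowsets are partitioned across vertices; here everything runs sequentially in one process.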
Scaling out C# with U-SQL
https://github.com/Azure/usql/tree/master/Examples/ImageApp
Copyright | Camera Make | Camera Model | Thumbnail
Michael | Canon | 70D | (image)
Michael | Samsung | S7 | (image)
How to specify .NET UDOs?
• .NET API provided to build UDOs
• Any .NET language usable
• However, only C# is first-class in tooling
• Use U-SQL specific .NET DLLs
• Deploying UDOs
• Compile DLL
• Upload DLL to ADLS
• Register with U-SQL script
• Visual Studio provides tool support
• UDOs can
• Invoke managed code
• Invoke native code deployed with UDO assemblies
• Invoke other language runtimes (e.g., Python, R)
• Be scaled out by the U-SQL execution framework
• UDOs cannot
• Communicate between different UDO invocations
• Call web services or reach outside the vertex boundary
How to specify UDOs?
• Code behind
• C#, Python, R
• C# Class Project for U-SQL
How to specify UDOs?
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

[SqlUserDefinedExtractor]
public class DriverExtractor : IExtractor
{
    private byte[] _row_delim;
    private string _col_delim;
    private Encoding _encoding;

    // Define a non-default constructor since I want to pass in my own parameters
    public DriverExtractor( string row_delim = "\r\n", string col_delim = ","
                          , Encoding encoding = null )
    {
        _encoding = encoding == null ? Encoding.UTF8 : encoding;
        _row_delim = _encoding.GetBytes(row_delim);
        _col_delim = col_delim;
    } // DriverExtractor

    // Converting text to target schema
    private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow)
    {
        var schema = outputrow.Schema;
        if (schema[i].Type == typeof(int))
        {
            var tmp = Convert.ToInt32(c);
            outputrow.Set(i, tmp);
        }
        ...
    } // OutputValueAtCol_I

    public override IEnumerable<IRow> Extract( IUnstructuredReader input
                                             , IUpdatableRow outputrow)
    {
        // Split the input into rows, then each row into columns
        foreach (var row in input.Split(_row_delim))
        {
            using(var s = new StreamReader(row, _encoding))
            {
                int i = 0;
                foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None))
                {
                    OutputValueAtCol_I(c, i++, outputrow);
                } // foreach
            } // using
            yield return outputrow.AsReadOnly();
        } // foreach
    } // Extract
} // class DriverExtractor
UDO model
• Marking UDOs
• Parameterizing UDOs
• UDO signature
• UDO-specific processing pattern
• Rowsets and their schemas in UDOs
• Setting results
• By position
• By name
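The processing pattern of the extractor above (split the input into rows, split each row into columns, convert each column to the type the schema asks for, emit the row) can be sketched outside of U-SQL. The following Python version is an illustration only: the `extract` function, the `schema` list and the sample bytes are hypothetical, and unlike the C# extractor it reads the whole input into memory instead of streaming it.

```python
import io


def extract(stream, schema, row_delim=b"\r\n", col_delim=",", encoding="utf-8"):
    """Yield one dict per row, converting each column to the schema's type."""
    data = stream.read()
    for raw_row in data.split(row_delim):
        if not raw_row:
            continue  # skip the empty chunk after a trailing delimiter
        cols = raw_row.decode(encoding).split(col_delim)
        # Set results by name, converting text to the target type
        yield {name: typ(value) for (name, typ), value in zip(schema, cols)}


# Hypothetical schema and input, standing in for the rowset schema and the file.
schema = [("id", int), ("driver", str), ("miles", float)]
source = io.BytesIO(b"1,Ann,12.5\r\n2,Bob,7\r\n")
rows = list(extract(source, schema))
```

Passing the delimiters and encoding as parameters mirrors the non-default constructor of the C# class.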
Managing Assemblies
• Create assemblies
• Reference assemblies
• Enumerate assemblies
• Drop assemblies
• Visual Studio makes registration easy!
• CREATE ASSEMBLY db.assembly FROM @path;
• CREATE ASSEMBLY db.assembly FROM byte[];
• Can also include additional resource files
• REFERENCE ASSEMBLY db.assembly;
• Referencing .NET Framework Assemblies
• Always accessible system namespaces:
• U-SQL specific (e.g., for SQL.MAP)
• All provided by System.dll, System.Core.dll, System.Data.dll, System.Runtime.Serialization.dll, mscorlib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq)
• Add all other .NET Framework Assemblies with:
REFERENCE SYSTEM ASSEMBLY [System.XML];
• Enumerating Assemblies
• PowerShell command
• U-SQL Studio Server Explorer and Azure Portal
• DROP ASSEMBLY db.assembly;
DEPLOY RESOURCE Syntax:
'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }.
Example:
DEPLOY RESOURCE "/config/configfile.xml", "package.zip";
Use Cases:
• Script-specific configuration files (not stored with the assembly)
• Script-specific models
• Any other file you want to access from user code on all vertices
Semantics:
• Files have to be in ADLS or WASB
• Files are deployed to the vertex and are accessible from any custom code
Limits:
• Single resource file limit is 400MB
• Overall limit for deployed resource files is 3GB
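The two limits above can be checked before a job is submitted. The following Python sketch assumes binary units (1 MB = 1024 * 1024 bytes), which the slide does not specify; the `check_resources` helper and the file sizes are hypothetical and not part of any Azure tooling.

```python
MB = 1024 * 1024  # assumption: binary megabytes
SINGLE_FILE_LIMIT = 400 * MB
TOTAL_LIMIT = 3 * 1024 * MB


def check_resources(sizes):
    """Return a list of problems for a mapping of resource path -> size in bytes."""
    problems = []
    for path, size in sizes.items():
        if size > SINGLE_FILE_LIMIT:
            problems.append(f"{path}: exceeds the 400MB single-file limit")
    if sum(sizes.values()) > TOTAL_LIMIT:
        problems.append("total size exceeds the 3GB overall limit")
    return problems


# Made-up sizes for the two files named in the DEPLOY RESOURCE example.
ok = check_resources({"/config/configfile.xml": 2 * MB, "package.zip": 350 * MB})
too_big = check_resources({"model.bin": 450 * MB})
```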
U-SQL Vertex Code (.NET)
[Diagram labels: U-SQL, Metadata Service, Compilation and Optimization. Compilation output (in job folder): C#, C++, Algebra, managed dll, native dll, additional non-dll files and deployed resources. Deployed to vertices: REFERENCE ASSEMBLY, ADLS, DEPLOY RESOURCE, system files (built-in runtimes, core DLLs, OS).]
Scale Out Python With U-SQL
Python
Author | Tweet
MikeDoesBigData | @AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL
AzureDataLake | What are your recommendations for #SQLKonferenz? @MikeDoesBigData

Author | Mentions | Topics
MikeDoesBigData | {@AzureDataLake} | {#SQLKonferenz, #USQL}
AzureDataLake | {@MikeDoesBigData} | {#SQLKonferenz}
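The transformation in the tweet example above, from raw tweets to mentions and topics, can be tried locally before scaling it out. This is a minimal sketch using only the standard library; the `mentions` and `topics` helpers and the regular expressions are assumptions chosen for illustration, not the code used in the talk.

```python
import re


def mentions(tweet):
    """Return all @-mentions in the tweet, in order of appearance."""
    return re.findall(r"@\w+", tweet)


def topics(tweet):
    """Return all #-hashtags in the tweet, in order of appearance."""
    return re.findall(r"#\w+", tweet)


tweets = [
    ("MikeDoesBigData",
     "@AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL"),
    ("AzureDataLake",
     "What are your recommendations for #SQLKonferenz? @MikeDoesBigData"),
]
summary = [(author, mentions(text), topics(text)) for author, text in tweets]
```

Matching on `\w+` drops trailing punctuation such as the colon after `@AzureDataLake`, which a plain whitespace split would keep.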
REFERENCE ASSEMBLY [ExtPython];
DECLARE @myScript = @"
def get_mentions(tweet):
    return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

def usqlml_main(df):
    del df['time']
    del df['author']
    df['mentions'] = df.tweet.apply(get_mentions)
    del df['tweet']
    return df
";
@t =
SELECT * FROM
(VALUES
("D1","T1","A1","@foo Hello World @bar"),
("D2","T2","A2","@baz Hello World @beer")
) AS D( date, time, author, tweet );
@m =
REDUCE @t ON date
PRODUCE date string, mentions string
USING new Extension.Python.Reducer(pyScript:@myScript);
Use U-SQL to create a massively distributed program.
Executing Python code across many nodes.
Using standard libraries such as numpy and pandas.
Documentation:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
Python
Extensions
U-SQL Vertex Code (Python)
[Diagram labels: U-SQL, Metadata Service, Compilation and Optimization. Compilation output (in job folder): C#, C++, Algebra, managed dll, native dll, additional Python libs and script. Deployed to vertices: REFERENCE ASSEMBLY ExtPython, ADLS, DEPLOY RESOURCE (Script.py, OtherLibs.zip), system files (built-in runtimes, core DLLs, OS), Python engine and libs.]
Python (and R) Extension Execution Paradigm
Python/R.Reducer (type mapping) Python/R.Reducer (type mapping)
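The execution paradigm can be pictured as follows: the rows are grouped by the REDUCE key, each group is handed to the script's main function, and the returned rows are concatenated. The sketch below imitates this in plain Python with dictionaries instead of data frames. The `reduce_on` and `script_main` names are hypothetical stand-ins (the real extension maps U-SQL types to a data frame and calls `usqlml_main`), and the groups run sequentially here rather than on separate vertices.

```python
from itertools import groupby
from operator import itemgetter


def get_mentions(tweet):
    """Same logic as in the slide's script: names of all @-mentions, joined by ';'."""
    return ";".join(w[1:] for w in tweet.split() if w[0] == "@")


def script_main(rows):
    """Stand-in for usqlml_main: receives one partition, returns output rows."""
    return [{"date": r["date"], "mentions": get_mentions(r["tweet"])} for r in rows]


def reduce_on(rows, key, func):
    """Group rows by key and call func once per group, collecting all results."""
    out = []
    ordered = sorted(rows, key=itemgetter(key))
    for _, group in groupby(ordered, key=itemgetter(key)):
        out.extend(func(list(group)))
    return out


# The same sample rows as in the U-SQL script's VALUES clause.
t = [
    {"date": "D1", "time": "T1", "author": "A1", "tweet": "@foo Hello World @bar"},
    {"date": "D2", "time": "T2", "author": "A2", "tweet": "@baz Hello World @beer"},
]
m = reduce_on(t, "date", script_main)
```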
Scale Out R With U-SQL
R running in U-SQL
Generate a linear model
SampleScript_LM_Iris.R
REFERENCE ASSEMBLY [ExtR];
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFileModelSummary string =
@"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
DECLARE @myRScript = @"
inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
# Do not return readonly columns, and make sure that the column names are the same in the U-SQL and R scripts.
outputToUSQL=data.frame(summary(lm.fit)$coefficients)
colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
outputToUSQL";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
@ExtendedData = SELECT 0 AS Par, * FROM @InputData;
@ModelCoefficients = REDUCE @ExtendedData ON Par
PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript,
rReturnType:"dataframe");
OUTPUT @ModelCoefficients TO @OutputFileModelSummary USING Outputters.Tsv();
R running in U-SQL
Use a previously generated model
REFERENCE ASSEMBLY master.ExtR;
DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; // Prediction Model
DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
DECLARE @PartitionCount int = 10;
// R script to run
DECLARE @myRScript = @"
load(""my_model_LM_Iris.rda"")
outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";
@InputData =
EXTRACT SepalLength double, SepalWidth double, PetalLength double,
PetalWidth double, Species string
FROM @IrisData
USING Extractors.Csv();
//Randomly partition the data to apply the model in parallel
@ExtendedData =
SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
FROM @InputData;
// Predict Species
@RScriptOutput =
REDUCE @ExtendedData ON Par
PRODUCE Par, fit double, lwr double, upr double
READONLY Par
USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe",
stringsAsFactors:false);
OUTPUT @RScriptOutput TO @OutputFilePredictions
USING Outputters.Csv(outputHeader:true);
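The scoring script above spreads the rows over random partitions so that each partition can be scored independently. The idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the `predict` function uses made-up coefficients rather than the model stored in my_model_LM_Iris.rda, and the partitions are processed one after another instead of in parallel.

```python
import random


def assign_partitions(rows, partition_count, seed=0):
    """Attach a random partition number in [0, partition_count) to every row."""
    rng = random.Random(seed)
    return [(rng.randrange(partition_count), row) for row in rows]


def predict(row):
    """Hypothetical linear model; the coefficients are invented for this sketch."""
    return 1.0 + 0.5 * row["PetalLength"]


def score(rows, partition_count):
    """Group rows by partition number and score each partition on its own."""
    partitions = {}
    for par, row in assign_partitions(rows, partition_count):
        partitions.setdefault(par, []).append(row)
    results = []
    for par, part_rows in sorted(partitions.items()):
        # Each partition only needs the model and its own rows.
        results.extend((par, predict(r)) for r in part_rows)
    return results


flowers = [{"PetalLength": x} for x in (1.4, 4.7, 6.0)]
scored = score(flowers, partition_count=10)
```

Random partitioning works here because scoring one row does not depend on any other row, so the choice of partition does not change the predictions.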
U-SQL Vertex Code (R)
[Diagram labels: U-SQL, Metadata Service, Compilation and Optimization. Compilation output (in job folder): C#, C++, Algebra, managed dll, native dll, additional R libs and script. Deployed to vertices: REFERENCE ASSEMBLY ExtR, ADLS, DEPLOY RESOURCE (Script.R, OtherLibs.zip), system files (built-in runtimes, core DLLs, OS), R engine and libs.]
Summary
Scaling Out your Code and Language with U-SQL
Bring your Code or Write your Custom Operator Extensions in
• .NET (C#, F#, etc.)
• Python
• R
• …
Scaled out by U-SQL
Additional Resources
• Blogs and community page:
• http://usql.io (U-SQL Github)
• http://blogs.msdn.microsoft.com/azuredatalake/
• http://blogs.msdn.microsoft.com/mrys/
• https://channel9.msdn.com/Search?term=U-SQL#ch9Search
• Documentation, presentations and articles:
• http://aka.ms/usql_reference
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/
• https://msdn.microsoft.com/en-us/magazine/mt614251
• https://msdn.microsoft.com/magazine/mt790200
• http://www.slideshare.net/MichaelRys
• Getting Started with R in U-SQL
• https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
• ADL forums and feedback
• https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake
• http://stackoverflow.com/questions/tagged/u-sql
• http://aka.ms/adlfeedback
Continue your education
at Microsoft Virtual
Academy online.
Thank you very much for your attention!
usql@microsoft.com
@MikeDoesBigData
http://aka.ms/azuredatalake

Bring your code to explore the Azure Data Lake: Execute your .NET/Python/R code at scale with U-SQL (SQLBits and SQLKonferenz 2018)

  • 1. Run Python, R and .NET code at Data Lake scale with U-SQL in Azure Data Lake Michael Rys Principal Program Manager Big Data Team, Microsoft @MikeDoesBigData usql@microsoft.com
  • 2. Agenda • Characteristics of Big Data Analytics Programming • Scaling out existing code with U-SQL: • Scaling out Cognitive Libraries • Introduction to U-SQL’s Extensibility Framework • Scaling out .NET with U-SQL: • Custom Image processing • Scaling out Python with U-SQL • Scaling out R with U-SQL: • Model generation, Model testing and scoring
  • 3. Some sample use cases Digital Crime Unit – Analyze complex attack patterns to understand BotNets and to predict and mitigate future attacks by analyzing log records with complex custom algorithms Image Processing – Large-scale image feature extraction and classification using custom code Shopping Recommendation – Complex pattern analysis and prediction over shopping records using proprietary algorithms Characteristics of Big Data Analytics • Requires processing of any type of data • Allow use of custom algorithms • Scale to any size and be efficient Bring your own coding expertise and existing code and scale it out?
  • 4. Status Quo: SQL for Big Data  Declarativity does scaling and parallelization for you  Extensibility is bolted on and not “native”  hard to work with anything other than structured data  difficult to extend with custom code: complex installations and frameworks  Limited to one or two languages
  • 5. Status Quo: Programming Languages for Big Data  Extensibility through custom code is “native”  Declarativity is bolted on and not “native”  User often has to care about scale and performance  SQL is 2nd class within string, only local optimizations  Often no code reuse/ sharing across queries
  • 6. Why U-SQL?  Declarativity and Extensibility are equally native! Get benefits of both! Scales out your custom imperative Code (written in .NET, Python, R, and more to come) in a declarative SQL-based framework R Python .NET U-SQL Framework
  • 7. Extract Process Output User CodeUser Code User Code User Code Declarative Framework User Extensions U-SQL Example Extract User Code User Code
  • 8. Scale Out Cognitive Library https://github.com/Azure/usql/tree/master/Examples/ImageApp https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-cognitive Car Green Parked Outdoor Racing
  • 9. REFERENCE ASSEMBLY ImageCommon; REFERENCE ASSEMBLY FaceSdk; REFERENCE ASSEMBLY ImageEmotion; REFERENCE ASSEMBLY ImageTagging; REFERENCE ASSEMBLY ImageOcr; @imgs = EXTRACT FileName string, ImgData byte[] FROM @"/images/{FileName}.jpg" USING new Cognition.Vision.ImageExtractor(); // Extract the number of objects on each image and tag them @objects = PROCESS @imgs PRODUCE FileName, NumObjects int, Tags SqlMap<string, float?> READONLY FileName USING new Cognition.Vision.ImageTagger(); OUTPUT @objects TO "/objects.tsv" USING Outputters.Tsv(); Imaging
  • 10. REFERENCE ASSEMBLY [TextSentiment]; REFERENCE ASSEMBLY [TextKeyPhrase]; @WarAndPeace = EXTRACT No int, Year string, Book string, Chapter string, Text string FROM @"/usqlext/samples/cognition/war_and_peace.csv" USING Extractors.Csv(); @sentiment = PROCESS @WarAndPeace PRODUCE No, Year, Book, Chapter, Text, Sentiment string, Conf double USING new Cognition.Text.SentimentAnalyzer(true); OUTPUT @sentinment TO "/sentiment.tsv" USING Outputters.Tsv(); Text Analysis
  • 11. U-SQL/Cognitive Example • Identify objects in images (tags) • Identify faces and emotions and images • Join datasets – find out which tags are associated with happiness REFERENCE ASSEMBLY ImageCommon; REFERENCE ASSEMBLY FaceSdk; REFERENCE ASSEMBLY ImageEmotion; REFERENCE ASSEMBLY ImageTagging; @objects = PROCESS MegaFaceView PRODUCE FileName, NumObjects int, Tags SqlMap<string,float?> READONLY FileName USING new Cognition.Vision.ImageTagger(); @tags = SELECT FileName, T.Tag FROM @objects CROSS APPLY EXPLODE(Tags.Split) AS T(Tag, Conf) WHERE Tag.Contains("dog") OR Tag.Contains("cat"); @emotion = SELECT ImageName, Details.Emotion FROM MegaFaceView CROSS APPLY new Cognition.Vision.EmotionApplier(imgCol:"image") AS Details(NumFaces int, FaceIndex int, RectX float, RectY float, Width float, Height float, Emotion string, Confidence float); @correlation = SELECT T.FileName, Emotion, Tag FROM @emotion AS E INNER JOIN @tags AS T ON E.FileName == T.FileName; Images Objects Emotions filter join aggregate
  • 12. U-SQL extensibility Extend U-SQL with C#/.NET, Python, R etc. Built-in operators, function, aggregates C# expressions (in SELECT expressions) User-defined aggregates (UDAGGs) User-defined functions (UDFs) User-defined operators (UDOs)
  • 13. What are UDOs? • User-Defined Extractors • Converts files into rowset • User-Defined Outputters • Converts rowset into files • User-Defined Processors • Take one row and produce one row • Pass-through versus transforming • User-Defined Appliers • Take one row and produce 0 to n rows • Used with OUTER/CROSS APPLY • User-Defined Combiners • Combines rowsets (like a user-defined join) • User-Defined Reducers • Take n rows and produce m rows (normally m<n) • Scaled out with explicit U-SQL Syntax that takes a UDO instance (created as part of the execution): • EXTRACT • OUTPUT • CROSS APPLY Custom Operator Extensions in language of your choice Scaled out by U-SQL • PROCESS • COMBINE • REDUCE
  • 14. Scaling out C# with U-SQL https://github.com/Azure/usql/tree/master/Examples/ImageApp Copyright Camera Make Camera Model Thumbnail Michael Canon 70D Michael Samsung S7
  • 15. How to specify .NET UDOs? • .Net API provided to build UDOs • Any .Net language usable • however only C# is first-class in tooling • Use U-SQL specific .Net DLLs • Deploying UDOs • Compile DLL • Upload DLL to ADLS • register with U-SQL script • VisualStudio provides tool support • UDOs can • Invoke managed code • Invoke native code deployed with UDO assemblies • Invoke other language runtimes (e.g., Python, R) • be scaled out by U-SQL execution framework • UDOs cannot • Communicate between different UDO invocations • Call Webservices or Reach outside the vertex boundary
  • 16. How to specify UDOs? • Code behind • C#, Python, R
  • 17. • C# Class Project for U-SQL How to specify UDOs?
  • 18. [SqlUserDefinedExtractor] public class DriverExtractor : IExtractor { private byte[] _row_delim; private string _col_delim; private Encoding _encoding; // Define a non-default constructor since I want to pass in my own parameters public DriverExtractor( string row_delim = "rn", string col_delim = ",“ , Encoding encoding = null ) { _encoding = encoding == null ? Encoding.UTF8 : encoding; _row_delim = _encoding.GetBytes(row_delim); _col_delim = col_delim; } // DriverExtractor // Converting text to target schema private void OutputValueAtCol_I(string c, int i, IUpdatableRow outputrow) { var schema = outputrow.Schema; if (schema[i].Type == typeof(int)) { var tmp = Convert.ToInt32(c); outputrow.Set(i, tmp); } ... } //SerializeCol public override IEnumerable<IRow> Extract( IUnstructuredReader input , IUpdatableRow outputrow) { foreach (var row in input.Split(_row_delim)) { using(var s = new StreamReader(row, _encoding)) { int i = 0; foreach (var c in s.ReadToEnd().Split(new[] { _col_delim }, StringSplitOptions.None)) { OutputValueAtCol_I(c, i++, outputrow); } // foreach } // using yield return outputrow.AsReadOnly(); } // foreach } // Extract } // class DriverExtractor UDO model • Marking UDOs • Parameterizing UDOs • UDO signature • UDO-specific processing pattern • Rowsets and their schemas in UDOs • Setting results • By position • By name
  • 19. Managing Assemblies • Create assemblies • Reference assemblies • Enumerate assemblies • Drop assemblies • VisualStudio makes registration easy! • CREATE ASSEMBLY db.assembly FROM @path; • CREATE ASSEMBLY db.assembly FROM byte[]; • Can also include additional resource files • REFERENCE ASSEMBLY db.assembly; • Referencing .Net Framework Assemblies • Always accessible system namespaces: • U-SQL specific (e.g., for SQL.MAP) • All provided by system.dll system.core.dll system.data.dll, System.Runtime.Serialization.dll, mscorelib.dll (e.g., System.Text, System.Text.RegularExpressions, System.Linq) • Add all other .Net Framework Assemblies with: REFERENCE SYSTEM ASSEMBLY [System.XML]; • Enumerating Assemblies • Powershell command • U-SQL Studio Server Explorer and Azure Portal • DROP ASSEMBLY db.assembly;
  • 20. DEPLOY RESOURCE Syntax: 'DEPLOY' 'RESOURCE' file_path_URI { ',' file_path_URI }. Example: DEPLOY RESOURCE "/config/configfile.xml", "package.zip"; Use cases: • Script-specific configuration files (not stored with the assembly) • Script-specific models • Any other file you want to access from user code on all vertices Semantics: • Files have to be in ADLS or WASB • Files are deployed to each vertex and are accessible from any custom code Limits: • Single resource file limit is 400MB • Overall limit for deployed resource files is 3GB
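The two limits on this slide are easy to check before submitting a job. The sketch below is a hypothetical pre-flight helper, not part of any Azure tooling: the function name `check_resources` is invented, the file sizes are made up, and it assumes that "MB" and "GB" mean binary units (2^20 and 2^30 bytes), which the slide does not specify.

```python
MB = 1024 * 1024          # assumption: binary megabytes
GB = 1024 * MB
SINGLE_FILE_LIMIT = 400 * MB   # per-file limit from the slide
TOTAL_LIMIT = 3 * GB           # overall limit from the slide


def check_resources(sizes_by_path):
    """Return a list of human-readable problems; an empty list means the set fits."""
    problems = []
    for path, size in sizes_by_path.items():
        if size > SINGLE_FILE_LIMIT:
            problems.append(f"{path}: exceeds the 400MB single-file limit")
    if sum(sizes_by_path.values()) > TOTAL_LIMIT:
        problems.append("total size exceeds the 3GB overall limit")
    return problems


# Hypothetical sizes for the two files named in the slide's example.
problems = check_resources({"/config/configfile.xml": 10 * MB, "package.zip": 500 * MB})
print(problems)
```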
  • 21. U-SQL Vertex Code (.NET) [Diagram; labels: Compilation and Optimization; U-SQL Metadata Service; compilation output (in job folder): C#, C++, algebra, managed DLL, native DLL; REFERENCE ASSEMBLY; ADLS; DEPLOY RESOURCE; additional non-DLL files and deployed resources; system files (built-in runtimes, core DLLs, OS); deployed to vertices.]
  • 22. Scale Out Python With U-SQL
      Input:
        Author | Tweet
        MikeDoesBigData | @AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL
        AzureDataLake | What are your recommendations for #SQLKonferenz? @MikeDoesBigData
      Output (after the Python step):
        Author | Mentions | Topics
        MikeDoesBigData | {@AzureDataLake} | {#SQLKonferenz, #USQL}
        AzureDataLake | {@MikeDoesBigData} | {#SQLKonferenz}
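The transformation shown on this slide (tweets in, mentions and topics out) can be sketched in plain Python. This is an illustrative version only: the helper name `tokens_with_prefix` and the choice to strip trailing punctuation such as `:` and `?` are assumptions, made so that the sample tweets yield the same sets the slide shows.

```python
def tokens_with_prefix(tweet, prefix):
    """Return the distinct words starting with prefix, in order of appearance."""
    found = []
    for word in tweet.split():
        token = word.rstrip(":,.?!")  # assumption: trailing punctuation is not part of the tag
        if token.startswith(prefix) and len(token) > 1 and token not in found:
            found.append(token)
    return found


tweets = [
    ("MikeDoesBigData", "@AzureDataLake: Come and see the #SQLKonferenz sessions on #USQL"),
    ("AzureDataLake", "What are your recommendations for #SQLKonferenz? @MikeDoesBigData"),
]

# One output row per tweet: author, mentions, topics.
table = [(author, tokens_with_prefix(t, "@"), tokens_with_prefix(t, "#")) for author, t in tweets]
for row in table:
    print(row)
```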
  • 23. Python Extensions • Use U-SQL to create a massively distributed program. • Executing Python code across many nodes. • Using standard libraries such as numpy and pandas. • Documentation: https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions
      REFERENCE ASSEMBLY [ExtPython];

      DECLARE @myScript = @"
      def get_mentions(tweet):
          return ';'.join( ( w[1:] for w in tweet.split() if w[0]=='@' ) )

      def usqlml_main(df):
          del df['time']
          del df['author']
          df['mentions'] = df.tweet.apply(get_mentions)
          del df['tweet']
          return df
      ";

      @t =
          SELECT * FROM
              (VALUES
                  ("D1","T1","A1","@foo Hello World @bar"),
                  ("D2","T2","A2","@baz Hello World @beer")
              ) AS D( date, time, author, tweet );

      @m =
          REDUCE @t ON date
          PRODUCE date string, mentions string
          USING new Extension.Python.Reducer(pyScript:@myScript);
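To see what the REDUCE ... ON date step computes for the two sample rows, the grouping can be imitated locally without U-SQL or pandas. The sketch below is a simplification under stated assumptions: `reduce_on` and `mentions_reducer` are hypothetical names, rows are plain dictionaries instead of a data frame, and groups are processed in sorted key order only to make the output deterministic. `get_mentions` is the helper from the slide.

```python
from collections import defaultdict


def get_mentions(tweet):
    # Same logic as the slide's helper: keep words starting with '@', drop the '@'.
    return ';'.join(w[1:] for w in tweet.split() if w[0] == '@')


def reduce_on(rows, key, reducer):
    """Group rows by key and call reducer once per group, like REDUCE ... ON key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    out = []
    for k in sorted(groups):  # groups are independent, so any order would do
        out.extend(reducer(groups[k]))
    return out


def mentions_reducer(group):
    """Stand-in for usqlml_main: keep date, replace the tweet by its mentions."""
    return [{"date": r["date"], "mentions": get_mentions(r["tweet"])} for r in group]


rows = [
    {"date": "D1", "time": "T1", "author": "A1", "tweet": "@foo Hello World @bar"},
    {"date": "D2", "time": "T2", "author": "A2", "tweet": "@baz Hello World @beer"},
]
result = reduce_on(rows, "date", mentions_reducer)
print(result)
```

Because each group is handled by a separate call with no shared state, the calls could run on different nodes, which is the property the U-SQL reducer relies on to scale out.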
  • 24. U-SQL Vertex Code (Python) [Diagram; labels: Compilation and Optimization; U-SQL Metadata Service; compilation output (in job folder): C#, C++, algebra, managed DLL, native DLL; REFERENCE ASSEMBLY ExtPython; ADLS; DEPLOY RESOURCE Script.py, OtherLibs.zip; additional Python libs and script; Python engine and libs; system files (built-in runtimes, core DLLs, OS); deployed to vertices.]
  • 25. Python (and R) Extension Execution Paradigm [Diagram; labels: two instances of Python/R.Reducer (type mapping).]
  • 26. Scale Out R With U-SQL
  • 27. R running in U-SQL: Generate a linear model (SampleScript_LM_Iris.R)
      REFERENCE ASSEMBLY [ExtR];

      DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
      DECLARE @OutputFileModelSummary string = @"/my/R/Output/LMModelSummaryCoefficientsIrisFromRCommand.txt";
      DECLARE @myRScript = @"
      inputFromUSQL$Species = as.factor(inputFromUSQL$Species)
      lm.fit=lm(unclass(Species)~.-Par, data=inputFromUSQL)
      #do not return readonly columns and make sure that the column names are the same in usql and r scripts,
      outputToUSQL=data.frame(summary(lm.fit)$coefficients)
      colnames(outputToUSQL) <- c(""Estimate"", ""StdError"", ""tValue"", ""Pr"")
      outputToUSQL";

      @InputData =
          EXTRACT SepalLength double, SepalWidth double, PetalLength double, PetalWidth double, Species string
          FROM @IrisData
          USING Extractors.Csv();

      @ExtendedData = SELECT 0 AS Par, * FROM @InputData;

      @ModelCoefficients =
          REDUCE @ExtendedData ON Par
          PRODUCE Par, Estimate double, StdError double, tValue double, Pr double
          READONLY Par
          USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe");

      OUTPUT @ModelCoefficients TO @OutputFileModelSummary USING Outputters.Tsv();
  • 28. R running in U-SQL: Use a previously generated model
      REFERENCE ASSEMBLY master.ExtR;

      DEPLOY RESOURCE @"/usqlext/samples/R/my_model_LM_Iris.rda"; // Prediction Model

      DECLARE @IrisData string = @"/usqlext/samples/R/iris.csv";
      DECLARE @OutputFilePredictions string = @"/Output/LMPredictionsIris.csv";
      DECLARE @PartitionCount int = 10;

      // R script to run
      DECLARE @myRScript = @"
      load(""my_model_LM_Iris.rda"")
      outputToUSQL=data.frame(predict(lm.fit, inputFromUSQL, interval=""confidence""))";

      @InputData =
          EXTRACT SepalLength double, SepalWidth double, PetalLength double, PetalWidth double, Species string
          FROM @IrisData
          USING Extractors.Csv();

      //Randomly partition the data to apply the model in parallel
      @ExtendedData =
          SELECT Extension.R.RandomNumberGenerator.GetRandomNumber(@PartitionCount) AS Par, *
          FROM @InputData;

      // Predict Species
      @RScriptOutput =
          REDUCE @ExtendedData ON Par
          PRODUCE Par, fit double, lwr double, upr double
          READONLY Par
          USING new Extension.R.Reducer(command:@myRScript, rReturnType:"dataframe", stringsAsFactors:false);

      OUTPUT @RScriptOutput TO @OutputFilePredictions USING Outputters.Csv(outputHeader:true);
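The scoring script above scales out by giving every row a random partition number and reducing on it, so each partition can be scored independently. The Python sketch below imitates only that partitioning idea. Everything in it is illustrative: `assign_partitions` is a hypothetical helper, `score` is a made-up stand-in for `predict(lm.fit, ...)`, and the assumption that partition numbers fall in the range 0 to `partition_count - 1` is not taken from the slide.

```python
import random


def assign_partitions(rows, partition_count, seed=None):
    """Give each row a random partition number and bucket the rows by it."""
    rng = random.Random(seed)
    partitions = {}
    for row in rows:
        par = rng.randrange(partition_count)  # assumption: values in [0, partition_count)
        partitions.setdefault(par, []).append(row)
    return partitions


def score(row):
    """Hypothetical per-row model; a real job would call the deployed model here."""
    return 2.0 * row + 1.0


rows = list(range(100))
partitions = assign_partitions(rows, partition_count=10, seed=42)

# Score each partition on its own, then combine the results.
predictions = sorted(score(r) for part in partitions.values() for r in part)
print(len(partitions), len(predictions))
```

Since the model is applied row by row, the combined predictions do not depend on how rows were assigned to partitions; the partition count only controls how much parallelism is available.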
  • 29. U-SQL Vertex Code (R) [Diagram; labels: Compilation and Optimization; U-SQL Metadata Service; compilation output (in job folder): C#, C++, algebra, managed DLL, native DLL; REFERENCE ASSEMBLY ExtR; ADLS; DEPLOY RESOURCE Script.R, OtherLibs.zip; additional R libs and script; R engine and libs; system files (built-in runtimes, core DLLs, OS); deployed to vertices.]
  • 31. Scaling Out your Code and Language with U-SQL • Bring your code or write your custom operator extensions in: • .NET (C#, F#, etc.) • Python • R • … • Scaled out by U-SQL
  • 32. Additional Resources • Blogs and community page: • http://usql.io (U-SQL GitHub) • http://blogs.msdn.microsoft.com/azuredatalake/ • http://blogs.msdn.microsoft.com/mrys/ • https://channel9.msdn.com/Search?term=U-SQL#ch9Search • Documentation, presentations and articles: • http://aka.ms/usql_reference • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide • https://docs.microsoft.com/en-us/azure/data-lake-analytics/ • https://msdn.microsoft.com/en-us/magazine/mt614251 • https://msdn.microsoft.com/magazine/mt790200 • http://www.slideshare.net/MichaelRys • Getting Started with R in U-SQL • https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-python-extensions • ADL forums and feedback • https://social.msdn.microsoft.com/Forums/azure/en-US/home?forum=AzureDataLake • http://stackoverflow.com/questions/tagged/u-sql • http://aka.ms/adlfeedback Continue your education at Microsoft Virtual Academy online.
  • 33. Thank you for your attention! • usql@microsoft.com • @MikeDoesBigData • http://aka.ms/azuredatalake

Editor's Notes

  • #4: Add velocity?
  • #5: Hard to operate on unstructured data: even Hive requires metadata to be created to operate on unstructured data. Adding custom Java functions, aggregators and SerDes involves many steps, often requires access to the server’s head node, and differs based on the type of operation. Requires many tools and steps. Some examples: Hive UDAgg: (1) Code and compile .java into .jar: extend the AbstractGenericUDAFResolver class (does type checking, argument checking and overloading) and extend the GenericUDAFEvaluator class (implements logic in 8 methods). (2) Deploy: deploy the jar into the class path on the server, edit FunctionRegistry.java to register it as built-in, and update the content of show functions with ant. Hive UDF (as of v0.13): (1) Code. (2) Load the JAR into the head node or at a URI. (3) CREATE FUNCTION USING JAR to register and load the jar into the classpath for every function (instead of registering the jar and just using the functions).
  • #6: Spark supports custom “inputters and outputters” for defining custom RDDs. No UDAGGs. Simple integration of UDFs, but only for the duration of the program; no reuse or sharing. Cloud Dataflow? The user has to care about scale and perf. Spark UDAgg: not yet supported (SPARK-3947). Spark UDF: (1) Write an inline function: def westernState(state: String) = Seq("CA", "OR", "WA", "AK").contains(state) (2) For SQL usage, register the table: customerTable.registerTempTable("customerTable") (3) Register each UDF: sqlContext.udf.register("westernState", westernState _) (4) Call it: val westernStates = sqlContext.sql("SELECT * FROM customerTable WHERE westernState(state)")
  • #7: Makes it easy for you by unifying: declarative and imperative; unstructured and structured data processing; local and remote queries. Increase productivity and agility from Day 1 and at Day 100 for YOU!
  • #8: ADL uses U-SQL to create a distributed, parallel job using simple declarative statements and provides discrete points for attaching user code
  • #9: U-SQL is built on top of existing frameworks and languages
  • #14: Extensions require .NET assemblies to be registered with a database