11. Data Manipulation
SELECT
FROM
WHERE
HAVING
GROUP BY
CASE WHEN THEN
ELSE
INNER/LEFT OUTER
JOIN
UNION
CROSS/OUTER APPLY
CAST INTO
ORDER BY ASC, DESC
Scaling Extensions
WITH
PARTITION BY
OVER
Date and Time
DateName
DatePart Day, Month, Year
DateDiff
DateTimeFromParts
DateAdd
Windowing Extensions
TumblingWindow
HoppingWindow
SlidingWindow
Aggregation
SUM
COUNT
AVG
MIN
MAX
STDEV
STDEVP
VAR
VARP
TopOne
String
Len
Concat
CharIndex
Substring
Lower, Upper
PatIndex
Temporal
Lag
IsFirst
Last
CollectTop
Mathematical
ABS
CEILING
EXP
FLOOR
POWER
SIGN
SQUARE
SQRT
Geospatial (preview)
CreatePoint
CreatePolygon
CreateLineString
ST_DISTANCE
ST_WITHIN
ST_OVERLAPS
ST_INTERSECTS
Declarative SQL-like language to describe transformations
Filters (“Where”)
Projections (“Select”)
Time-window and property-based aggregates (“Group By”)
Time-shifted joins (specifying time bounds within which the joining events must occur)
and all combinations thereof
12. 1,915 lines of code with Apache Storm
@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  protected String fileName = "com/datatorrent/demos/wordcount/samplefile.txt";
  private Locality locality = null;

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    locality = Locality.CONTAINER_LOCAL;

    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    input.setFileName(fileName);

    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data).setLocality(locality);

    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}
3 lines of SQL in Azure Stream Analytics
SELECT Avg(Purchase), Score, Count(*)
FROM GameDataStream
GROUP BY TumblingWindow(minute, 5), Score
21. There are two distinct types of Inputs
• Data Streams:
• IoT Hub
• Event Hub
• Azure Blob storage
• Reference data:
• Azure Blob storage
Data Outputs Supported
• Azure Data Lake Store
• SQL Database
• Blob storage
• Event Hub
• Power BI
• Table Storage
• Service Bus Queues
• Service Bus Topics
• Azure Cosmos DB
• Azure Functions
27. Every event that flows through the system has a timestamp
ASA supports:
Arrival Time - Event timestamps based on arrival time (input adapter clock, e.g., Event Hubs)
SELECT * FROM EntryStream
App Time - Event timestamps based on a timestamp field in the actual event tuple
SELECT * FROM EntryStream TIMESTAMP BY EntryTime
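The difference between the two timestamp policies can be sketched in Python. This is only an illustrative model, not how ASA is implemented; the event dictionary layout, the `arrival_time` key, and the `event_timestamp` helper are assumptions made for this example.

```python
from datetime import datetime


def event_timestamp(event, timestamp_by=None):
    """Return the timestamp that windowing would use for an event.

    Without `timestamp_by`, the arrival time recorded by the input adapter
    is used (arrival time). With `timestamp_by`, the named payload field is
    parsed and used instead (application time).
    """
    if timestamp_by is None:
        return event["arrival_time"]
    return datetime.fromisoformat(event["payload"][timestamp_by])


# Hypothetical event: it entered the toll booth 4 seconds before it arrived.
event = {
    "arrival_time": datetime(2019, 1, 1, 12, 0, 5),
    "payload": {"TollId": 1, "EntryTime": "2019-01-01T12:00:01"},
}

arrival = event_timestamp(event)                        # like: FROM EntryStream
app = event_timestamp(event, timestamp_by="EntryTime")  # like: TIMESTAMP BY EntryTime
```

With application time the event is placed 4 seconds earlier on the timeline, which can move it into a different window than arrival time would.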
28. Output at the end of each window
Windows are fixed length
Used in a GROUP BY clause
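[Figure: events on a time axis (t1 to t6) fall into three fixed-length windows (Window 1, Window 2, Window 3); an aggregate function (Sum) is applied per window and emits one output event per window (e.g., 18 and 14)]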
29. SELECT TollId, Count(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 10)
[Figure: A 10-second Tumbling Window; time axis in seconds (0 to 30)]
Every 10 seconds give me the count
of vehicles entering each toll booth
over the last 10 seconds
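The grouping performed by the tumbling-window query can be approximated with a short Python sketch. It assumes events are `(time_in_seconds, toll_id)` tuples and that each window covers `(end - size, end]`; the function name and sample data are hypothetical and only model the semantics.

```python
from collections import Counter


def tumbling_counts(events, size):
    """Count events per (window_end, toll_id) for non-overlapping windows.

    Each event belongs to exactly one window: the one ending at the
    smallest multiple of `size` that is >= the event time.
    """
    counts = Counter()
    for t, toll_id in events:
        window_end = -(-t // size) * size  # integer ceiling to a multiple of size
        counts[(window_end, toll_id)] += 1
    return dict(counts)


# Hypothetical toll booth entries: (seconds since start, toll booth id)
events = [(1, "A"), (4, "A"), (7, "B"), (10, "A"), (12, "B"), (19, "B")]
result = tumbling_counts(events, 10)
```

Here the window ending at 10 s holds three entries for booth A and one for booth B, and the window ending at 20 s holds two entries for booth B.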
30. Every 5 seconds give me the
count of vehicles entering each
toll booth over the last 10
seconds
[Figure: A 10-second Hopping Window with a 5-second hop; time axis in seconds (0 to 30)]
SELECT TollId, Count(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 10, 5)
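Because hopping windows overlap, one event can be counted in several windows. The sketch below models that in Python under the same assumptions as before: events are `(time_in_seconds, toll_id)` tuples, window ends fall on multiples of the hop, and each window covers `(end - size, end]`. The names and data are hypothetical.

```python
from collections import Counter


def hopping_counts(events, size, hop):
    """Count events per (window_end, toll_id) for overlapping windows.

    A window of length `size` ends every `hop` seconds, so each event is
    counted in every window whose range (end - size, end] contains it.
    """
    counts = Counter()
    for t, toll_id in events:
        end = -(-t // hop) * hop  # first window end at or after the event
        while end - size < t:     # the event is still inside this window
            counts[(end, toll_id)] += 1
            end += hop
    return dict(counts)


# Hypothetical toll booth entries: (seconds since start, toll booth id)
events = [(1, "A"), (4, "A"), (7, "A"), (12, "A")]
result = hopping_counts(events, size=10, hop=5)
```

With a 10-second window and a 5-second hop, every event lands in two windows, so the total of all counts is twice the number of events.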
31. SELECT TollId, Count(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, SlidingWindow(second, 20)
HAVING Count(*) > 10
[Figure: A 20-second Sliding Window; time axis in seconds (0 to 50), with event entry and exit points marked]
Find all toll booths that have
served more than 10 vehicles in
the last 20 seconds
An output is generated whenever an event
enters or leaves the window
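A sliding window is only re-evaluated when its content changes. The following Python sketch models that behaviour, assuming events are `(time_in_seconds, toll_id)` tuples and the window at time `now` covers `(now - size, now]`. The function name is hypothetical, and the threshold of 2 stands in for the 10 used in the query to keep the sample data small.

```python
from collections import defaultdict


def sliding_alerts(events, size, threshold):
    """Return (time, toll_id, count) whenever a booth's count exceeds threshold.

    The count is re-evaluated only at change points: when an event
    enters the window (t) or leaves it (t + size).
    """
    by_booth = defaultdict(list)
    for t, toll_id in events:
        by_booth[toll_id].append(t)

    alerts = []
    for toll_id, times in by_booth.items():
        change_points = sorted(set(times) | {t + size for t in times})
        for now in change_points:
            count = sum(1 for t in times if now - size < t <= now)
            if count > threshold:  # plays the role of HAVING Count(*) > n
                alerts.append((now, toll_id, count))
    return sorted(alerts)


# Hypothetical toll booth entries: (seconds since start, toll booth id)
events = [(1, "A"), (5, "A"), (9, "A"), (30, "A"), (3, "B")]
result = sliding_alerts(events, size=20, threshold=2)
```

Only booth A at second 9 qualifies: that is the one change point where more than two entries fall inside the preceding 20 seconds.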
33. Perform real-time scoring on streaming data
Anomaly Detection and Sentiment Analysis are common use cases
Function calls from the query
Azure ML can publish web endpoints for operationalized ML models
Azure Stream Analytics binds custom function names to such web endpoints
SELECT text, sentiment(text) AS score
FROM myStream
in public preview
40. Scenarios where you might find JavaScript user-defined functions useful:
• Parsing and manipulating strings with regular expression functions, for example,
Regexp_Replace() and Regexp_Extract()
• Decoding and encoding data, for example, binary-to-hex conversion
• Performing mathematical computations with JavaScript Math functions
• Performing array operations like sort, join, find, and fill
Here are some things that you cannot do with a JavaScript user-defined function in Stream
Analytics:
• Call out to external REST endpoints, for example, performing reverse IP lookup or pulling
reference data from an external source
• Perform custom event format serialization or deserialization on inputs/outputs
• Create custom aggregates
53. Feature | Status | Remarks
SQL Parallel Write | GA | Worldwide rollout in 1 week
Blob output partitioning by custom date-time | Public preview | Worldwide rollout in 1 week
C# UDF on IoT Edge | Public preview | Available now
Live testing in Visual Studio | Public preview | Available now
User defined custom repartition count | Public preview | Available now
New built-in ML models for anomaly detection (Edge and Cloud) | Private preview | Access granted upon sign-up
Custom de-serializers on IoT Edge | Private preview | Access granted upon sign-up
MSI Authentication for egress to ADLS Gen1 | Private preview | Access granted upon sign-up
54. Supports inline learning and real-time scoring
Easily invoked with simple function calls within query language
Types of Anomalies Detected:
Spikes
Dips
Slow positive trend
Slow negative trend
Bi-Level change
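To make "spikes" and "dips" concrete, here is a deliberately simple Python sketch that flags values far from the mean of the preceding few points. It is a plain z-score check substituted for illustration, not the built-in ML models the slide refers to; the function name, history length, and threshold are all assumptions.

```python
from statistics import mean, pstdev


def spikes_and_dips(values, history=5, z_threshold=3.0):
    """Flag points that deviate strongly from the preceding `history` values.

    Returns a list of (index, "spike" | "dip"). A point is flagged when its
    z-score against the previous window exceeds `z_threshold` in magnitude.
    """
    flagged = []
    for i in range(history, len(values)):
        window = values[i - history:i]
        mu = mean(window)
        sigma = pstdev(window)
        if sigma == 0:
            continue  # flat history: z-score undefined, skip in this sketch
        z = (values[i] - mu) / sigma
        if z > z_threshold:
            flagged.append((i, "spike"))
        elif z < -z_threshold:
            flagged.append((i, "dip"))
    return flagged


# Hypothetical sensor readings with one spike (50) and one dip (1)
readings = [10, 11, 10, 9, 10, 50, 10, 11, 10, 9, 10, 1]
result = spikes_and_dips(readings)
```

A check like this learns only from the last few points, so an earlier spike inflates the spread and can mask anomalies that follow closely; slow trends and bi-level changes need a different approach.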
55. • Faster iterative testing
• Show results in real time
• View Job metrics
• Time policies support
Public Preview
56. Partition egress to Blob storage by
1) Any input field
2) Custom date and time formats
Gain finer-grained control over data written to Blob storage for dashboarding and reporting
Better alignment with Hive conventions for blob output to be consumed by HDInsight and Azure Databricks.
57. To achieve fully parallel topologies, ASA will transition SQL writes from serial to parallel operations for SQL DB and SQL Data Warehouse
4x-5x improvement in write throughput
Allows for batch size customization to achieve higher throughput
For example, this feature enabled our customer building a connected car scenario to scale up from 150K events/min to 500K events/min
58. MSI (Managed Service Identity) based authentication will enable egress to Azure Data Lake Storage.
Key benefits over existing AAD (Azure Active Directory) based authentication:
• Job deployment automation (through PowerShell, etc.)
• Long-running production jobs
• Consistency with other services
59. Enables better performance tuning
Key Scenarios
• When upstream partition count can’t be changed
• Partitioned processing is needed to scale out to larger processing load
• Fixed number of output partitions
SELECT *
INTO [output]
FROM [input]
PARTITION BY DeviceID INTO 10
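The effect of repartitioning by key into a fixed number of partitions can be sketched in Python. The hashing scheme here (CRC32 modulo the partition count) is an assumption chosen for determinism, and the helper names are hypothetical; the slides do not describe how ASA actually assigns keys to partitions.

```python
import zlib


def partition_for(device_id, partition_count=10):
    """Map a key to a partition index in [0, partition_count).

    The same key always maps to the same partition, so all events for
    one device are processed together.
    """
    return zlib.crc32(device_id.encode("utf-8")) % partition_count


def repartition(events, partition_count=10):
    """Distribute events into `partition_count` lists keyed by DeviceID."""
    partitions = [[] for _ in range(partition_count)]
    for event in events:
        index = partition_for(event["DeviceID"], partition_count)
        partitions[index].append(event)
    return partitions


# Hypothetical input: 100 events spread across 25 devices
events = [{"DeviceID": f"device-{i % 25}", "value": i} for i in range(100)]
partitions = repartition(events, partition_count=10)
```

The partition count is fixed at 10 regardless of how many partitions the upstream input has, which is the scenario the slide describes.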