Data-Intensive Computing on Windows HPC Server with the DryadLINQ Framework
John Vert
Architect
Microsoft Corporation
SVR17

Moving Parts
- Windows HPC Server 2008 – cluster management, job scheduling
- Dryad – distributed execution engine, failure recovery, distribution, scalability across very large partitioned datasets
- LINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data model
- PLINQ – multi-core parallelism across LINQ queries
- DryadLINQ – brings LINQ ease of programming to Dryad

Software Stack
[Figure: layered software stack, top to bottom]
- Image Processing, Machine Learning, Graph Analysis, Data Mining, …
- .NET Applications
- DryadLINQ
- Dryad
- HPC Job Scheduler
- Windows HPC Server 2008 (box repeated four times)

Dryad
- Provides a general, flexible distributed execution layer
  - Dataflow graph as the computation model
  - Can be modified by runtime optimizations
- Higher language layer supplies graph, vertex code, serialization code, hints for data locality
- Automatically handles distributed execution
  - Distributes code, routes data
  - Schedules processes on machines near data
  - Masks failures in cluster and network

A Dryad Job
[Figure: directed acyclic graph (DAG) with labels Inputs, Processing vertices, Channels (file, fifo, pipe), Outputs]

2-D Piping
- Unix Pipes: 1-D
    grep | sed | sort | awk | perl
- Dryad: 2-D
    grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50

LINQ
Language Integrated Query
- Declarative extensions to C# and VB.NET for iterating over collections
  - In memory
  - Via data providers
- SQL-like
- Broadly adoptable by developers
  - Easy to use
  - Reduces written code
  - Predictable results
  - Scalable experience
  - Deep tooling support

PLINQ
Parallel Language Integrated Query
- Value proposition: enable LINQ developers to take advantage of parallel hardware, with basic understanding of data parallelism.
- Declarative data parallelism (focus on the “what”, not the “how”)
- Alternative to LINQ-to-Objects
  - Same set of query operators + some extras
  - Default is IEnumerable<T> based
- Preview in Parallel Extensions to .NET Framework 3.5 CTP
- Shipping in .NET Framework 4.0 Beta 2

DryadLINQ
LINQ to clusters
- Declarative programming style of LINQ for clusters
- Automatic parallelization
  - Parallel query plan exploits multi-node parallelism
  - PLINQ underneath exploits multi-core parallelism
- Integration with VS and .NET
  - Type safety, automatic serialization
- Query plan optimizations
  - Static optimization rules to optimize locality
  - Dynamic run-time optimizations

DryadLINQ: From LINQ to Dryad
- Automatic query plan generation
- Distributed query execution by Dryad
[Figure: LINQ query -> Query plan -> Dryad; the plan reads logs, then applies where, then select]

    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

A Simple LINQ Query

    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;

A Simple PLINQ Query

    IEnumerable<BabyInfo> babies = ...;
    var results = from baby in babies.AsParallel()
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;

A Simple DryadLINQ Query

    PartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>("BabyInfo.pt");
    var results = from baby in babies
                  where baby.Name == queryName &&
                        baby.State == queryState &&
                        baby.Year >= yearStart &&
                        baby.Year <= yearEnd
                  orderby baby.Year ascending
                  select baby;

PartitionedTable<T>
- Core data structure for DryadLINQ
  - Scale-out, partitioned container for .NET objects
- Derives from IQueryable<T>, IEnumerable<T>
- ToPartitionedTable() extension methods
- DryadLINQ operators consume and produce PartitionedTable<T>
- DryadLINQ generates code to serialize/deserialize your .NET objects
- Underlying storage can be partitioned file, partitioned SQL table, cluster filesystem

Partitioned File
File-based container for PartitionedTable<T> metadata

    XC\output\520a0fcf\Part
    20
    0,1855000,HPCMETAHN01
    1,1630000,HPCA1CN13
    2,1707500,HPCA1CN12
    3,1828820,HPCA1CN22
    4,1802140,HPCA1CN07
    5,1741000,HPCA1CN08
    6,1733980,HPCA1CN11
    7,1762620,HPCA1CN06
    8,1861300,HPCA1CN14
    9,1807460,HPCA1CN17
    10,1807560,HPCA1CN23
    11,1768120,HPCA1CN20
    12,1847220,HPCA1CN03
    13,1729160,HPCA1CN16
    14,1767500,HPCA1CN05
    15,1781520,HPCA1CN04
    16,1728480,HPCA1CN09
    17,1802580,HPCA1CN18
    18,1862380,HPCA1CN10
    19,1762540,HPCA1CN21

Each metadata line corresponds to one partition file:

    \\HPCMETAHN01\XC\output\520a0fcf\Part.00000000
    \\HPCA1CN13\XC\output\520a0fcf\Part.00000001
    \\HPCA1CN12\XC\output\520a0fcf\Part.00000002
    \\HPCA1CN22\XC\output\520a0fcf\Part.00000003
    \\HPCA1CN07\XC\output\520a0fcf\Part.00000004
    \\HPCA1CN08\XC\output\520a0fcf\Part.00000005
    \\HPCA1CN11\XC\output\520a0fcf\Part.00000006
    \\HPCA1CN06\XC\output\520a0fcf\Part.00000007
    \\HPCA1CN14\XC\output\520a0fcf\Part.00000008
    \\HPCA1CN17\XC\output\520a0fcf\Part.00000009
    \\HPCA1CN23\XC\output\520a0fcf\Part.00000010
    \\HPCA1CN20\XC\output\520a0fcf\Part.00000011
    \\HPCA1CN03\XC\output\520a0fcf\Part.00000012
    \\HPCA1CN16\XC\output\520a0fcf\Part.00000013
    \\HPCA1CN05\XC\output\520a0fcf\Part.00000014
    \\HPCA1CN04\XC\output\520a0fcf\Part.00000015
    \\HPCA1CN09\XC\output\520a0fcf\Part.00000016
    \\HPCA1CN18\XC\output\520a0fcf\Part.00000017
    \\HPCA1CN10\XC\output\520a0fcf\Part.00000018
    \\HPCA1CN21\XC\output\520a0fcf\Part.00000019

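The metadata on this slide appears to consist of a base path, a partition count, and one "index,number,host" line per partition, with each partition stored at \\host\<base path>.<8-digit index>. The following is a minimal sketch of a reader for that layout, assuming this reading is correct. PartitionInfo and PartitionedFileSketch are hypothetical names, not DryadLINQ APIs, and the meaning of the middle column is not stated on the slide, so it is kept as an opaque value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper types for illustration; not part of DryadLINQ.
public sealed class PartitionInfo
{
    public int Index;
    public long Value;     // middle column; its meaning is not stated on the slide
    public string Host;
    public string UncPath;
}

public static class PartitionedFileSketch
{
    public static List<PartitionInfo> Parse(string metadata)
    {
        string[] lines = metadata
            .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(l => l.Trim())
            .Where(l => l.Length > 0)
            .ToArray();

        string basePath = lines[0];        // e.g. XC\output\520a0fcf\Part
        int count = int.Parse(lines[1]);   // number of partitions

        var result = new List<PartitionInfo>();
        foreach (string line in lines.Skip(2).Take(count))
        {
            string[] fields = line.Split(',');
            int index = int.Parse(fields[0]);
            result.Add(new PartitionInfo
            {
                Index = index,
                Value = long.Parse(fields[1]),
                Host = fields[2],
                // Assumed naming rule: \\host\basePath.<index padded to 8 digits>
                UncPath = @"\\" + fields[2] + @"\" + basePath + "." + index.ToString("D8")
            });
        }
        return result;
    }
}
```

Applied to the metadata above, this yields one entry per partition whose UncPath matches the paths listed on the slide.
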
A typical data-intensive query

    var logs = PartitionedTable.Get<string>("weblogs.pt");

    // Go through logs and keep only lines that are not comments.
    // Parse each line into a new LogEntry object.
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);

    // Go through logentries and keep only entries that are accesses by jvert.
    var user =
        from access in logentries
        where access.user.EndsWith(@"\jvert")
        select access;

    // Group jvert accesses according to what page they correspond to.
    // For each page, count the occurrences.
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("jvert", pages.Key, pages.Count());

    // Sort the pages jvert has accessed according to access frequency.
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;

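To try the shape of this query without a cluster, the same operators can be run with LINQ to Objects over an in-memory list. This is only a sketch: the LogEntry and UserPageCount classes below are hypothetical stand-ins, and the "<user> <page>" line format is an assumption, since the slide does not show the real log format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the slide's types.
public sealed class LogEntry
{
    public readonly string user;
    public readonly string page;

    public LogEntry(string line)
    {
        // Assumed format for this sketch only: "<user> <page>"
        string[] fields = line.Split(' ');
        user = fields[0];
        page = fields[1];
    }
}

public sealed class UserPageCount
{
    public readonly string user;
    public readonly string page;
    public readonly int count;

    public UserPageCount(string user, string page, int count)
    {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

public static class LogQuerySketch
{
    // Same four query stages as the slide, over IEnumerable<string> instead of a PartitionedTable.
    public static List<UserPageCount> HtmAccesses(IEnumerable<string> logs)
    {
        var logentries = from line in logs
                         where !line.StartsWith("#")
                         select new LogEntry(line);

        var user = from access in logentries
                   where access.user.EndsWith(@"\jvert")
                   select access;

        var accesses = from access in user
                       group access by access.page into pages
                       select new UserPageCount("jvert", pages.Key, pages.Count());

        var htmAccesses = from access in accesses
                          where access.page.EndsWith(".htm")
                          orderby access.count descending
                          select access;

        return htmAccesses.ToList();
    }
}
```

The query text is unchanged from the slide; only the data source differs, which is the point the deck makes about moving between LINQ, PLINQ and DryadLINQ.
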
Dryad Parallel DAG execution
[Figure: the query from "A typical data-intensive query" shown next to its Dryad DAG; the stages are labeled logs, logentries, user, accesses, htmAccesses, output]

Query plan generation
- Separation of query from its execution context
  - Add all the loaded assemblies as resources
  - Eliminate references to local variables by partially evaluating all the expressions in the query
  - Distribute objects used by the query
  - Detect impure queries when possible
- Automatic code generation
  - Object serialization code for Dryad channels
  - Managed code for Dryad vertices
- Static query plan optimizations
  - Pipelining: composing multiple operators into one vertex
  - Minimize unnecessary data repartitions
  - Other standard DB optimizations

DryadLINQ query plan

    Query 0 Output: file://\\hpcmetahn01\XC\output\b7e651a4-38b7-490c-8399-f63eaba7f29a.pt
    DryadLinq0.dll was built successfully.

    Input:
        [PartitionedTable: file://weblogs.pt]

    Super__1:
        Where(line => !(line.StartsWith(_)))
        Select(line => new logdemo.LogEntry(line))
        Where(access => access.user.EndsWith(_))
        DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))
        DryadHashPartition(e => e.Key,e => e.Key)

    Super__12:
        DryadMerge()
        DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum()))
        Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))

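The plan above splits the group-and-count into two stages: Super__1 counts pages inside each input partition and hash-partitions the partial counts by key, and Super__12 merges them and sums per key. A minimal in-memory sketch of that idea follows, with partitions simulated as plain sequences. All names here are hypothetical, and StableHash is an assumption introduced only to route keys deterministically; it is not DryadLINQ's partitioning function.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoPhaseCountSketch
{
    // Stage 1 (in the spirit of Super__1): count pages locally inside one input partition.
    public static IEnumerable<KeyValuePair<string, int>> LocalCounts(IEnumerable<string> pages)
    {
        return pages.GroupBy(p => p)
                    .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
    }

    // Stage 2 (in the spirit of Super__12): merge partial counts and sum them per key.
    public static Dictionary<string, int> MergeCounts(IEnumerable<KeyValuePair<string, int>> partials)
    {
        return partials.GroupBy(e => e.Key, e => e.Value)
                       .ToDictionary(g => g.Key, g => g.Sum());
    }

    // Illustrative non-negative hash so that a key always maps to the same reducer.
    static int StableHash(string s)
    {
        return s.Aggregate(17, (h, c) => unchecked(h * 31 + c)) & 0x7fffffff;
    }

    public static Dictionary<string, int> Run(IEnumerable<IEnumerable<string>> partitions, int reducers)
    {
        // Hash partition the partial results by key: each key lands on exactly one reducer.
        var routed = partitions.SelectMany(p => LocalCounts(p))
                               .GroupBy(e => StableHash(e.Key) % reducers);

        // Each reducer sums independently; keys never overlap between reducers.
        return routed.SelectMany(bucket => MergeCounts(bucket))
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Because counting is associative, summing the partial counts gives the same totals as a single global count while moving far less data between stages.
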
XML representation
Generated by DryadLINQ and passed to Dryad

    <Query>
      <DryadLinqVersion>1.0.1401.0</DryadLinqVersion>
      <ClusterName>hpcmetahn01</ClusterName>
      ...
      <!-- List of files to be shipped to the cluster -->
      <Resources>
        <Resource>wrappernativeinfo.dll</Resource>
        <Resource>DryadLinq0.dll</Resource>
        <Resource>System.Threading.dll</Resource>
        <Resource>logdemo.exe</Resource>
        <Resource>LinqToDryad.dll</Resource>
      </Resources>
      <!-- Vertex definitions -->
      <QueryPlan>
        <Vertex>
          <UniqueId>0</UniqueId>
          <Type>InputTable</Type>
          <Name>weblogs.pt</Name>
          ...
        </Vertex>
        <Vertex>
          <UniqueId>1</UniqueId>
          <Type>Super</Type>
          <Name>Super__1</Name>
          ...
          <Children>
            <Child>
              <UniqueId>0</UniqueId>
            </Child>
          </Children>
        </Vertex>
        ...
      </QueryPlan>
    </Query>

DryadLINQ generated code
Compiled at runtime, assembly passed to Dryad to implement vertices

    public sealed class DryadLinq__Vertex
    {
        public static int Super__1(string args)
        {
            < . . . >
            DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
            var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
            var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
            var source__4 = DryadLinqVertex.DryadWhere(dreader__3, line => (!(line.StartsWith(@"#"))), true);
            var source__5 = DryadLinqVertex.DryadSelect(source__4, line => new logdemo.LogEntry(line), true);
            var source__6 = DryadLinqVertex.DryadWhere(source__5, access => access.user.EndsWith(@"\jvert"), true);
            var source__7 = DryadLinqVertex.DryadGroupBy(source__6, access => access.page,
                (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(k__0, pages.Count<logdemo.LogEntry>()),
                null, true, true, false);
            DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
            DryadLinqLog.Add("Vertex Super__1 completed at {0}", DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
            return 0;
        }

        public static int Super__12(string args)
        {
            < . . . >
        }
    }

DryadLINQ query operators
- Almost all the useful LINQ operators
  - Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
- Operators introduced by DryadLINQ
  - HashPartition, RangePartition, Merge, Fork
- Dryad Apply
  - Operates on sequences rather than items

MapReduce in DryadLINQ

    MapReduce(source,       // sequence of Ts
              mapper,       // T -> Ms
              keySelector,  // M -> K
              reducer)      // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;      // sequence of Rs
    }

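The slide leaves the types implicit. One way to give the three steps concrete signatures is shown below using LINQ to Objects; this is a sketch over IEnumerable<T>, not the DryadLINQ version, and the WordCount usage is an illustrative example that is not on the slide.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // SelectMany (map), GroupBy (shuffle by key), SelectMany (reduce), as on the slide.
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        IEnumerable<T> source,                          // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                 // T -> Ms
        Func<M, K> keySelector,                         // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)  // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        var result = group.SelectMany(reducer);
        return result;                                  // sequence of Rs
    }

    // Illustrative usage: count how often each word occurs across all lines.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return MapReduce<string, string, string, KeyValuePair<string, int>>(
                lines,
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```
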
K-means in DryadLINQ

    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers) {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers) {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x,y) => x + y) / group.Count());
    }

    var vectors = PartitionedTable.Get<Vector>("vectors.pt");
    IQueryable<Vector> centers = vectors.Take(100);
    for (int i = 0; i < 10; i++) {
        centers = Step(vectors, centers);
    }
    centers.ToPartitionedTable<Vector>("centers.pt");

    public class Vector {
        public double[] entries;
        [Associative]
        public static Vector operator +(Vector v1, Vector v2) { … }
        public static Vector operator -(Vector v1, Vector v2) { … }
        public double Norm2() {…}
    }

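The slide elides the bodies of the Vector operators. The sketch below fills them in so that one k-means step can be run in memory with LINQ to Objects. The operator bodies, the division operator and the Euclidean Norm2 are assumptions made for illustration, not the presenter's implementation, and the centers are materialized as a list so that grouping by center object is stable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative vector type; operator bodies are assumptions, the slide elides them.
public sealed class Vector
{
    public readonly double[] entries;

    public Vector(params double[] entries) { this.entries = entries; }

    public static Vector operator +(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x + y).ToArray());

    public static Vector operator -(Vector a, Vector b) =>
        new Vector(a.entries.Zip(b.entries, (x, y) => x - y).ToArray());

    public static Vector operator /(Vector a, double s) =>
        new Vector(a.entries.Select(x => x / s).ToArray());

    // Euclidean length (assumed meaning of Norm2).
    public double Norm2() => Math.Sqrt(entries.Sum(x => x * x));
}

public static class KMeansSketch
{
    // Pick the center closest to v by folding over all candidate centers.
    public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
    {
        return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
    }

    // One iteration: assign each point to its nearest center, then average each group.
    public static List<Vector> Step(IEnumerable<Vector> vectors, List<Vector> centers)
    {
        return vectors.GroupBy(point => NearestCenter(point, centers))
                      .Select(group => group.Aggregate((x, y) => x + y) / group.Count())
                      .ToList();
    }
}
```

Grouping works here because NearestCenter returns one of the existing center objects, so points assigned to the same center share the same key by reference.
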
Putting it all together
It’s LINQ all the way down
- Major League Baseball dataset
  - Pitch-by-pitch data for every MLB game since 2007
  - 47,909 pitch XML files (one for each pitcher appearance)
  - 6,127 player XML files (one for each player)
- Hash partition the input data files to distribute the work
- LINQ to XML to shred the data
- DryadLINQ to analyze dataset

Load the dataset and partition
Define Pitch and Player classes

    void StagePitchData(string[] fileList, string PartitionedFile)
    {
        // partition the list of filenames across
        // 20 nodes of the cluster
        var pitches = fileList.ToPartitionedTable("filelist")
                              .HashPartition((x) => (x), 20)
                              .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                              .SelectMany((a) => a.Elements("pitch")
                                  .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                           (string)a.Attribute("batter"),
                                                           p)));
        pitches.ToPartitionedTable(PartitionedFile);
    }

    void StagePlayerData(string[] fileList, string PartitionedFile)
    {
        var players = fileList.Select((p) => new Player(XElement.Load(p)));
        players.ToPartitionedTable(PartitionedFile);
    }

Analyze dataset with LINQ

    IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
    {
        return pitches.OrderByDescending((p) => p.StartSpeed)
                      .Take(count);
    }

Supports LINQ Joins

    IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                           IQueryable<Player> players,
                                           int count)
    {
        return pitches.OrderByDescending((p) => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            (o) => o.Pitcher,
                            (i) => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct();
    }

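The join can be exercised locally with LINQ to Objects. In this sketch the Pitch and Player classes are reduced to the fields the query touches; their real definitions are not shown in the deck, so these shapes are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-ins with only the members used by the query (assumed shapes).
public sealed class Pitch
{
    public string Pitcher;
    public double StartSpeed;
}

public sealed class Player
{
    public string Id;
    public string FirstName;
    public string LastName;
}

public static class JoinSketch
{
    // Take the fastest pitches, join each to its pitcher by id, and return distinct names.
    public static List<string> FindFastestPitchers(
        IEnumerable<Pitch> pitches, IEnumerable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,                          // outer key: pitcher id on the pitch
                            i => i.Id,                               // inner key: player id
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct()
                      .ToList();
    }
}
```

A pitcher who threw several of the fastest pitches appears once in the result because of the final Distinct().
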
DryadLINQ on HPC Server
- DryadLINQ program runs on client workstation
  - Develop, debug, run locally
- When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
- HPC Server allocates resources for the job and schedules the single task. This task is the Dryad Job Manager (JM)
- The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
- When the job completes, the client program picks up the output result and continues.

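The point that nothing runs until ToPartitionedTable() is called has a small-scale analogue in ordinary LINQ, where building a query does no work until it is enumerated. The sketch below illustrates that deferred execution with LINQ to Objects only; it does not submit a job or use any DryadLINQ API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredSketch
{
    // Returns { evaluations before enumeration, evaluations after, sum of results }.
    public static int[] Run()
    {
        int evaluated = 0;

        // Building the query only records what to do; the lambda has not run yet.
        var query = Enumerable.Range(1, 5).Select(x => { evaluated++; return x * x; });
        int before = evaluated;

        // Enumerating the query (here via ToList) is what triggers the work.
        List<int> results = query.ToList();
        int after = evaluated;

        return new[] { before, after, results.Sum() };
    }
}
```
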
Examples of DryadLINQ Applications
- Data mining
  - Analysis of service logs for network security
  - Analysis of Windows Watson/SQM data
  - Cluster monitoring and performance analysis
- Graph analysis
  - Accelerated Page-Rank computation
  - Road network shortest-path preprocessing
- Image processing
  - Image indexing
  - Decision tree training
  - Epitome computation
- Simulation
  - Light flow simulations for next-generation display research
  - Monte-Carlo simulations for mobile data
- eScience
  - Machine learning platform for health solutions
  - Astrophysics simulation

Ongoing Work
- Advanced query optimizations
  - Combination of static analysis and annotations
  - Sampling execution of the query plan
  - Dynamic query optimization
- Incremental computation
- Real-time event processing
- Global scheduling
  - Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
- Scale-out partitioned storage
  - Pluggable storage providers
- DryadLINQ on Azure
- Better debugging, performance analysis, visualization, etc.

Additional Resources
- Dryad and DryadLINQ
  - http://connect.microsoft.com/DryadLINQ
  - DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
- PLINQ
  - Available in Parallel Extensions to .NET Framework 3.5 CTP
  - Available in .NET Framework 4.0 Beta 2
  - http://msdn.microsoft.com/en-us/concurrency/default.aspx
  - http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
- Windows HPC Server 2008
  - http://www.microsoft.com/hpc
  - Download it, try it, we want your feedback!

Questions?
YOUR FEEDBACK IS IMPORTANT TO US!
Please fill out session evaluation forms online at MicrosoftPDC.com

Learn More On Channel 9
- Expand your PDC experience through Channel 9.
- Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses.
- channel9.msdn.com/learn
Built by Developers for Developers….
  • 1. Data-Intensive Computing on Windows HPC Server with the DryadLINQ FrameworkJohn VertArchitectMicrosoft CorporationSVR17
  • 2. Moving PartsWindows HPC Server 2008 – cluster management, job schedulingDryad – distributed execution engine, failure recovery, distribution, scalability across very large partitioned datasetsLINQ – .NET extensions for declarative query, easy expression of data parallelism, unified data modelPLINQ – multi-core parallelism across LINQ queries.DryadLINQ – Bring LINQ ease of programming to Dryad
  • 3. Software Stack…ImageProcessingMachineLearningGraphAnalysisDataMining.NET ApplicationsDryadLINQDryadHPC Job SchedulerWindows HPC Server 2008Windows HPC Server 2008Windows HPC Server 2008Windows HPC Server 2008
  • 4. DryadProvides a general, flexible distributed execution layerDataflow graph as the computation modelCan be modified by runtime optimizationsHigher language layer supplies graph, vertex code, serialization code, hints for data localityAutomatically handles distributed executionDistributes code, routes dataSchedules processes on machines near dataMasks failures in cluster and network
  • 5. A Dryad JobDirected acyclic graph (DAG)OutputsProcessingverticesChannels(file, fifo, pipe)Inputs
  • 6. 2-D PipingUnix Pipes: 1-Dgrep | sed | sort | awk | perlDryad: 2-D grep1000 | sed500 | sort1000 | awk500 | perl506
  • 7. LINQLanguage Integrated QueryDeclarative extensions to C# and VB.NET for iterating over collectionsIn memoryVia data providersSQL-LikeBroadly adoptable by developersEasy to useReduces written codePredictable resultsScalable experienceDeep tooling support
  • 8. PLINQ Parallel Language Integrated QueryValue Proposition:Enable LINQ developers to take advantage of parallel hardware—with basic understanding of data parallelism.Declarative data parallelism (focus on the “what” not the “how”)Alternative to LINQ-to-ObjectsSame set of query operators + some extrasDefault is IEnumerable<T> basedPreview in Parallel Extensions to .NET Framework 3.5 CTPShipping in .NET Framework 4.0 Beta 2
  • 9. DryadLINQLINQ to clustersDeclarative programming style of LINQ for clustersAutomatic parallelizationParallel query plan exploits multi-node parallelismPLINQ underneath exploits multi-core parallelismIntegration with VS and .NETType safety, automatic serializationQuery plan optimizationsStatic optimization rules to optimize localityDynamic run-time optimizations
  • 10. DryadLINQ: From LINQ to DryadAutomatic query plan generationDistributed query execution by DryadLINQ queryQuery planDryadvarlogentries =from line in logswhere !line.StartsWith("#")select new LogEntry(line);logswhereselect
  • 11. A Simple LINQ QueryIEnumerable<BabyInfo> babies = ...; varresults = from baby in babieswhere baby.Name == queryName &&baby.State == queryState &&baby.Year >= yearStart && baby.Year <= yearEndorderbybaby.Yearascendingselect baby;
  • 12. A Simple PLINQ QueryIEnumerable<BabyInfo> babies = ...; varresults = from baby in babies.AsParallel()where baby.Name == queryName &&baby.State == queryState &&baby.Year >= yearStart && baby.Year <= yearEndorderbybaby.Yearascendingselect baby;
  • 13. A Simple DryadLINQQueryPartitionedTable<BabyInfo> babies = PartitionedTable.Get<BabyInfo>(“BabyInfo.pt”);varresults = from baby in babies where baby.Name == queryName &&baby.State == queryState &&baby.Year >= yearStart && baby.Year <= yearEndorderbybaby.Yearascendingselect baby;
  • 14. PartitionedTable<T>Core data structure for DryadLINQScale-out, partitioned container for .NET objectsDerives from IQueryable<T>, IEnumerable<T>ToPartitionedTable() extension methodsDryadLINQ operators consume and produce PartitionedTable<T>DryadLINQ generates code to serialize/deserialize your .NET objectsUnderlying storage can be partitioned file, partitioned SQL table, cluster filesystem
  • 15. Partitioned FileFile-based container for PartitionedTable<T> metadataXC\output\520a0fcf\Part200,1855000,HPCMETAHN011,1630000,HPCA1CN132,1707500,HPCA1CN123,1828820,HPCA1CN224,1802140,HPCA1CN075,1741000,HPCA1CN086,1733980,HPCA1CN117,1762620,HPCA1CN068,1861300,HPCA1CN149,1807460,HPCA1CN1710,1807560,HPCA1CN2311,1768120,HPCA1CN2012,1847220,HPCA1CN0313,1729160,HPCA1CN1614,1767500,HPCA1CN0515,1781520,HPCA1CN0416,1728480,HPCA1CN0917,1802580,HPCA1CN1818,1862380,HPCA1CN1019,1762540,HPCA1CN21\\HPCMETAHN01\XC\output\520a0fcf\Part.00000000
  • 16. PartitionedFileFile-based container for PartitionedTable<T> metadataXC\output\520a0fcf\Part200,1855000,HPCMETAHN011,1630000,HPCA1CN132,1707500,HPCA1CN123,1828820,HPCA1CN224,1802140,HPCA1CN075,1741000,HPCA1CN086,1733980,HPCA1CN117,1762620,HPCA1CN068,1861300,HPCA1CN149,1807460,HPCA1CN1710,1807560,HPCA1CN2311,1768120,HPCA1CN2012,1847220,HPCA1CN0313,1729160,HPCA1CN1614,1767500,HPCA1CN0515,1781520,HPCA1CN0416,1728480,HPCA1CN0917,1802580,HPCA1CN1818,1862380,HPCA1CN1019,1762540,HPCA1CN21\\HPCMETAHN01\XC\output\520a0fcf\Part.00000000\\HPCA1CN13\XC\output\520a0fcf\Part.00000001\\HPCA1CN12\XC\output\520a0fcf\Part.00000002\\HPCA1CN22\XC\output\520a0fcf\Part.00000003\\HPCA1CN07\XC\output\520a0fcf\Part.00000004\\HPCA1CN08\XC\output\520a0fcf\Part.00000005\\HPCA1CN11\XC\output\520a0fcf\Part.00000006\\HPCA1CN06\XC\output\520a0fcf\Part.00000007\\HPCA1CN14\XC\output\520a0fcf\Part.00000008\\HPCA1CN17\XC\output\520a0fcf\Part.00000009\\HPCA1CN23\XC\output\520a0fcf\Part.00000010\\HPCA1CN20\XC\output\520a0fcf\Part.00000011\\HPCA1CN03\XC\output\520a0fcf\Part.00000012\\HPCA1CN16\XC\output\520a0fcf\Part.00000013\\HPCA1CN05\XC\output\520a0fcf\Part.00000014\\HPCA1CN04\XC\output\520a0fcf\Part.00000015\\HPCA1CN09\XC\output\520a0fcf\Part.00000016\\HPCA1CN18\XC\output\520a0fcf\Part.00000017\\HPCA1CN10\XC\output\520a0fcf\Part.00000018\\HPCA1CN21\XC\output\520a0fcf\Part.00000019
  • 17. A typical data-intensive queryvar logs = PartitionedTable.Get<string>(“weblogs.pt”);varlogentries = from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\jvert") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount(“jvert", pages.Key, pages.Count());varhtmAccesses = from access in accesses where access.page.EndsWith(".htm")orderbyaccess.count descending select access; Go through logs and keep only lines that are not comments. Parse each line into a new LogEntryobject.Go through logentries and keep only entries that are accesses by jvert.Group jvertaccesses according to what page they correspond to. For each page, count the occurrences.Sort the pages jverthas accessed according to access frequency.
  • 18. Dryad Parallel DAG executionlogslogentriesvarlogentries =from line in logs where !line.StartsWith("#") select new LogEntry(line);var user = from access in logentries where access.user.EndsWith(@"\jvert") select access;var accesses = from access in user group access by access.page into pages select new UserPageCount(“jvert", pages.Key, pages.Count());varhtmAccesses = from access in accesses where access.page.EndsWith(".htm")orderbyaccess.count descending select access; useraccesseshtmAccessesoutput
  • 19. Query plan generationSeparation of query from its execution contextAdd all the loaded assemblies as resourcesEliminate references to local variables by partially evaluating all the expressions in the queryDistribute objects used by the queryDetect impure queries when possibleAutomatic code generationObject serialization code for Dryad channelsManaged code for Dryad VerticesStatic query plan optimizationsPipelining: composing multiple operators into one vertexMinimize unnecessary data repartitionsOther standard DB optimizations
  • 20. DryadLINQ query planQuery 0 Output: file://\\hpcmetahn01\XC\output\b7e651a4-38b7-490c-8399-f63eaba7f29a.ptDryadLinq0.dll was built successfully.Input: [PartitionedTable: file://weblogs.pt]Super__1: Where(line => !(line.StartsWith(_))) Select(line => new logdemo.LogEntry(line)) Where(access => access.user.EndsWith(_))DryadGroupBy(access => access.page,(k__0, pages) => new LinqToDryad.Pair<String,Int32>(k__0, pages.Count()))DryadHashPartition(e => e.Key,e => e.Key)Super__12:DryadMerge()DryadGroupBy(e => e.Key,e => e.Value,(k__0, g__1) => new LinqToDryad.Pair<String,Int32>(k__0, g__1.Sum())) Select(pages => new logdemo.UserPageCount(_, pages.Key, pages.Count()))
  • 21. XML representationGenerated by DryadLINQ and passed to Dryad<Query> <DryadLinqVersion>1.0.1401.0</DryadLinqVersion> <ClusterName>hpcmetahn01</ClusterName> ... <Resources> <Resource>wrappernativeinfo.dll</Resource> <Resource>DryadLinq0.dll</Resource> <Resource>System.Threading.dll</Resource> <Resource>logdemo.exe</Resource> <Resource>LinqToDryad.dll</Resource> </Resources> <QueryPlan> <Vertex> <UniqueId>0</UniqueId> <Type>InputTable</Type> <Name>weblogs.pt</Name> ... </Vertex><Vertex><UniqueId>1</UniqueId> <Type>Super</Type> <Name>Super__1</Name> ...<Children><Child> <UniqueId>0</UniqueId> </Child></Children></Vertex> ... </QueryPlan><Query>List of files to be shipped to the clusterVertex definitions
  • 22. DryadLINQ generated code
    Compiled at runtime, assembly passed to Dryad to implement vertices
        public sealed class DryadLinq__Vertex
        {
            public static int Super__1(string args)
            {
                < . . . >
                DryadVertexEnv denv = new DryadVertexEnv(args, dvertexparam);
                var dwriter__2 = denv.MakeWriter(DryadLinq__Extension.FactoryType__0);
                var dreader__3 = denv.MakeReader(DryadLinq__Extension.FactoryString);
                var source__4 = DryadLinqVertex.DryadWhere(dreader__3,
                    line => (!(line.StartsWith(@"#"))), true);
                var source__5 = DryadLinqVertex.DryadSelect(source__4,
                    line => new logdemo.LogEntry(line), true);
                var source__6 = DryadLinqVertex.DryadWhere(source__5,
                    access => access.user.EndsWith(@"\jvert"), true);
                var source__7 = DryadLinqVertex.DryadGroupBy(source__6,
                    access => access.page,
                    (k__0, pages) => new LinqToDryad.Pair<System.String,System.Int32>(
                        k__0, pages.Count<logdemo.LogEntry>()),
                    null, true, true, false);
                DryadLinqVertex.DryadHashPartition(source__7, e => e.Key, null, dwriter__2);
                DryadLinqLog.Add("Vertex Super__1 completed at {0}",
                    DateTime.Now.ToString("MM/dd/yyyy HH:mm:ss.fff"));
                return 0;
            }

            public static int Super__12(string args)
            {
                < . . . >
            }
        }
  • 23. DryadLINQ query operators
    - Almost all the useful LINQ operators
      - Where, Select, SelectMany, OrderBy, GroupBy, Join, GroupJoin, Distinct, Concat, Union, Intersect, Except, Count, Contains, Sum, Min, Max, Average, Any, All, Skip, Take, Aggregate
    - Operators introduced by DryadLINQ
      - HashPartition, RangePartition, Merge, Fork
    - Dryad Apply
      - Operates on sequences rather than items
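To make the idea behind HashPartition concrete, here is a minimal in-memory sketch using plain LINQ to Objects. It is not the DryadLINQ operator: the class name `HashPartitionSketch`, the method signature, and the list-per-partition return type are assumptions made for illustration. The property it shows is the one that matters for a later GroupBy or Join: items with equal keys always land in the same partition.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; this is not the DryadLINQ HashPartition API.
public static class HashPartitionSketch
{
    // Splits 'source' into 'partitionCount' buckets by the hash of each item's key.
    public static List<T>[] HashPartition<T, K>(
        IEnumerable<T> source, Func<T, K> keySelector, int partitionCount)
    {
        var partitions = new List<T>[partitionCount];
        for (int i = 0; i < partitionCount; i++)
            partitions[i] = new List<T>();

        foreach (var item in source)
        {
            // Mask the sign bit so the modulo result is a valid index.
            int hash = EqualityComparer<K>.Default.GetHashCode(keySelector(item)) & 0x7FFFFFFF;
            partitions[hash % partitionCount].Add(item);
        }
        return partitions;
    }

    public static void Main()
    {
        // Integer keys hash to their own value, so key k goes to partition k % 3.
        var parts = HashPartition(Enumerable.Range(0, 10), x => x, 3);
        for (int i = 0; i < parts.Length; i++)
            Console.WriteLine("partition " + i + ": " + string.Join(",", parts[i]));
    }
}
```

In a cluster setting each bucket would be a separate partition of a table on a different node; here the buckets are just in-memory lists.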
  • 24. MapReduce in DryadLINQ
        MapReduce(source,       // sequence of Ts
                  mapper,       // T -> Ms
                  keySelector,  // M -> K
                  reducer)      // (K, Ms) -> Rs
        {
            var map = source.SelectMany(mapper);
            var group = map.GroupBy(keySelector);
            var result = group.SelectMany(reducer);
            return result;      // sequence of Rs
        }
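The slide's MapReduce is written as untyped pseudocode. A possible typed version over in-memory sequences, using only standard LINQ to Objects, is sketched below with a word count as the usage example. The class name `MapReduceSketch`, the generic parameter names, and the `WordCount` helper are assumptions for illustration; DryadLINQ itself would operate on `IQueryable<T>` over partitioned tables rather than `IEnumerable<T>`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory sketch of the slide's MapReduce pattern.
public static class MapReduceSketch
{
    public static IEnumerable<R> MapReduce<T, M, K, R>(
        IEnumerable<T> source,                          // sequence of Ts
        Func<T, IEnumerable<M>> mapper,                 // T -> Ms
        Func<M, K> keySelector,                         // M -> K
        Func<IGrouping<K, M>, IEnumerable<R>> reducer)  // (K, Ms) -> Rs
    {
        var map = source.SelectMany(mapper);
        var group = map.GroupBy(keySelector);
        return group.SelectMany(reducer);               // sequence of Rs
    }

    // Word count: map each line to its words, group by word, reduce to (word, count).
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return MapReduce<string, string, string, KeyValuePair<string, int>>(
                lines,
                line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
                word => word,
                g => new[] { new KeyValuePair<string, int>(g.Key, g.Count()) })
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        var counts = WordCount(new[] { "a b a", "b c" });
        foreach (var kv in counts.OrderBy(kv => kv.Key))
            Console.WriteLine(kv.Key + ": " + kv.Value);
    }
}
```

The three LINQ calls in `MapReduce` are the same three steps as on the slide; only the type annotations and the surrounding class are added.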
  • 25. K-means in DryadLINQ
        public static Vector NearestCenter(Vector v, IEnumerable<Vector> centers)
        {
            return centers.Aggregate((r, c) => (r - v).Norm2() < (c - v).Norm2() ? r : c);
        }

        public static IQueryable<Vector> Step(IQueryable<Vector> vectors, IQueryable<Vector> centers)
        {
            return vectors.GroupBy(point => NearestCenter(point, centers))
                          .Select(group => group.Aggregate((x, y) => x + y) / group.Count());
        }

        var vectors = PartitionedTable.Get<Vector>("vectors.pt");
        IQueryable<Vector> centers = vectors.Take(100);
        for (int i = 0; i < 10; i++)
        {
            centers = Step(vectors, centers);
        }
        centers.ToPartitionedTable<Vector>("centers.pt");

        public class Vector
        {
            public double[] entries;
            [Associative]
            public static Vector operator +(Vector v1, Vector v2) { … }
            public static Vector operator -(Vector v1, Vector v2) { … }
            public double Norm2() { … }
        }
  • 26. Putting it all together
    - It's LINQ all the way down
    - Major League Baseball dataset
      - Pitch-by-pitch data for every MLB game since 2007
      - 47,909 pitch XML files (one for each pitcher appearance)
      - 6,127 player XML files (one for each player)
    - Hash partition the input data files to distribute the work
    - LINQ to XML to shred the data
    - DryadLINQ to analyze dataset
  • 27. Load the dataset and partition
    Define Pitch and Player classes
        void StagePitchData(string[] fileList, string PartitionedFile)
        {
            // partition the list of filenames across
            // 20 nodes of the cluster
            var pitches = fileList.ToPartitionedTable("filelist")
                                  .HashPartition((x) => (x), 20)
                                  .SelectMany((f) => XElement.Load(f).Elements("atbat"))
                                  .SelectMany((a) => a.Elements("pitch")
                                      .Select((p) => new Pitch((string)a.Attribute("pitcher"),
                                                               (string)a.Attribute("batter"), p)));
            pitches.ToPartitionedTable(PartitionedFile);
        }

        void StagePlayerData(string[] fileList, string PartitionedFile)
        {
            var players = fileList.Select((p) => new Player(XElement.Load(p)));
            players.ToPartitionedTable(PartitionedFile);
        }
  • 28. Analyze dataset with LINQ
        IQueryable<Pitch> FindFastest(IQueryable<Pitch> pitches, int count)
        {
            return pitches.OrderByDescending((p) => p.StartSpeed)
                          .Take(count);
        }
  • 29. Supports LINQ Joins
        IQueryable<string> FindFastestPitchers(IQueryable<Pitch> pitches,
                                               IQueryable<Player> players,
                                               int count)
        {
            return pitches.OrderByDescending((p) => p.StartSpeed)
                          .Take(count)
                          .Join(players,
                                (o) => o.Pitcher,
                                (i) => i.Id,
                                (o, i) => i.FirstName + " " + i.LastName)
                          .Distinct();
        }
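The join query can be tried locally on in-memory data before running it on a cluster. The sketch below runs the same operator chain with LINQ to Objects. The `Pitch` and `Player` classes here are minimal stand-ins with only the members the query touches (the deck does not show the real class definitions), and the sample data is invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal stand-in types; the real Pitch and Player classes are not shown in the deck.
public class Pitch { public string Pitcher; public double StartSpeed; }
public class Player { public string Id; public string FirstName; public string LastName; }

public static class JoinSketch
{
    // Same operator chain as the slide, over IEnumerable instead of IQueryable.
    public static List<string> FindFastestPitchers(
        IEnumerable<Pitch> pitches, IEnumerable<Player> players, int count)
    {
        return pitches.OrderByDescending(p => p.StartSpeed)
                      .Take(count)
                      .Join(players,
                            o => o.Pitcher,
                            i => i.Id,
                            (o, i) => i.FirstName + " " + i.LastName)
                      .Distinct()
                      .ToList();
    }

    public static void Main()
    {
        // Invented sample data.
        var pitches = new[]
        {
            new Pitch { Pitcher = "p1", StartSpeed = 101 },
            new Pitch { Pitcher = "p2", StartSpeed = 99 },
            new Pitch { Pitcher = "p1", StartSpeed = 100 },
            new Pitch { Pitcher = "p3", StartSpeed = 90 },
        };
        var players = new[]
        {
            new Player { Id = "p1", FirstName = "Ann", LastName = "Lee" },
            new Player { Id = "p2", FirstName = "Bob", LastName = "Ray" },
            new Player { Id = "p3", FirstName = "Cy", LastName = "Fox" },
        };
        foreach (var name in FindFastestPitchers(pitches, players, 3))
            Console.WriteLine(name);
    }
}
```

With `count` set to 3, the three fastest pitches belong to two pitchers, so `Distinct()` collapses the result to two names.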
  • 30. DryadLINQ on HPC Server
    - DryadLINQ program runs on client workstation
      - Develop, debug, run locally
    - When ToPartitionedTable() is called, the query expression is materialized (codegen, query plan, optimization) and a job is submitted to HPC Server
    - HPC Server allocates resources for the job and schedules a single task. This task is the Dryad Job Manager (JM)
    - The JM then schedules additional tasks to execute the vertices of the DryadLINQ query
    - When the job completes, the client program picks up the output result and continues
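The point that nothing runs until ToPartitionedTable() is called mirrors the deferred execution of ordinary LINQ queries. The sketch below shows that behavior with LINQ to Objects only: building the query does no work, and the predicate runs only when the query is enumerated. It is an analogy under that assumption, not a model of job submission; the class name and the `Evaluations` counter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of LINQ deferred execution (LINQ to Objects, no cluster involved).
public static class DeferredExecutionSketch
{
    // Counts how many times the filter predicate has actually run.
    public static int Evaluations = 0;

    // Builds a query; no element is examined here.
    public static IEnumerable<int> BuildQuery(IEnumerable<int> source)
    {
        return source.Where(x => { Evaluations++; return x % 2 == 0; });
    }

    public static void Main()
    {
        var query = BuildQuery(new[] { 1, 2, 3, 4 });
        Console.WriteLine("after building: " + Evaluations);

        // Enumerating the query (here via ToList) is what triggers the work,
        // much as ToPartitionedTable() is described as the trigger in DryadLINQ.
        var evens = query.ToList();
        Console.WriteLine("after ToList: " + Evaluations + ", results: " + evens.Count);
    }
}
```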
  • 31. Examples of DryadLINQ Applications
    - Data mining
      - Analysis of service logs for network security
      - Analysis of Windows Watson/SQM data
      - Cluster monitoring and performance analysis
    - Graph analysis
      - Accelerated Page-Rank computation
      - Road network shortest-path preprocessing
    - Image processing
      - Image indexing
      - Decision tree training
      - Epitome computation
    - Simulation
      - Light flow simulations for next-generation display research
      - Monte-Carlo simulations for mobile data
    - eScience
      - Machine learning platform for health solutions
      - Astrophysics simulation
  • 32. Ongoing Work
    - Advanced query optimizations
      - Combination of static analysis and annotations
      - Sampling execution of the query plan
      - Dynamic query optimization
    - Incremental computation
    - Real-time event processing
    - Global scheduling
      - Dynamically allocate cluster resources between multiple concurrent DryadLINQ applications
    - Scale-out partitioned storage
      - Pluggable storage providers
      - DryadLINQ on Azure
    - Better debugging, performance analysis, visualization, etc.
  • 33. Additional Resources
    - Dryad and DryadLINQ
      - http://connect.microsoft.com/DryadLINQ
      - DryadLINQ source, Dryad binaries, documentation, samples, blog, discussion group, etc.
    - PLINQ
      - Available in Parallel Extensions to .NET Framework 3.5 CTP
      - Available in .NET Framework 4.0 Beta 2
      - http://msdn.microsoft.com/en-us/concurrency/default.aspx
      - http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
    - Windows HPC Server 2008
      - http://www.microsoft.com/hpc
      - Download it, try it, we want your feedback!
  • 35. Your feedback is important to us!
    Please fill out session evaluation forms online at MicrosoftPDC.com