Sector: An Open Source Cloud for Data Intensive Computing. Robert Grossman, University of Illinois at Chicago and Open Data Group. October 20, 2009.
Part 1. Sector. http://sector.sourceforge.net
Sector Overview. Sector is the fastest open source large data cloud, as measured by MalStone and Terasort. Sector is easy to program: it supports UDFs, MapReduce, and Python over streams. Sector is secure: a HIPAA compliant Sector cloud is being set up. Sector is reliable: Sector v1.24 has a backup master node server.
About Sector. Yunhong Gu from the Laboratory for Advanced Computing at the University of Illinois at Chicago is the lead developer of Sector. Sector is open source (BSD License) and available from sector.sourceforge.net. The current version is 1.24a.
Target Configurations. Sector is designed to run on racks of commodity computers. A typical rack configuration today (October 2009) is a rack of 32 quad-core 1U computers, where each computer has 4 x 1 TB disks and a 1 Gbps connection to a top-of-rack switch. Sometimes these are called Raywulf clusters.
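For scale, the totals implied by that configuration (simple arithmetic, not stated on the slide): 32 x 4 = 128 cores per rack, 32 x 4 TB = 128 TB of raw disk per rack, and 32 x 1 Gbps = 32 Gbps of aggregate bandwidth into the top-of-rack switch.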
Google’s Large Data Cloud (Google’s stack, top to bottom): Applications; Compute Services: Google’s MapReduce; Data Services: Google’s BigTable; Storage Services: Google File System (GFS).
Hadoop’s Large Data Cloud (Hadoop’s stack, top to bottom): Applications; Compute Services: Hadoop’s MapReduce; Data and Storage Services: Hadoop Distributed File System (HDFS).
Sector’s Large Data Cloud (Sector’s stack, top to bottom): Applications; Compute Services: Sphere’s UDFs; Data and Storage Services: Sector’s Distributed File System (SDFS); Routing and Transport Services: UDP-based Data Transport Protocol (UDT).
Comparing Sector and Hadoop.
Terasort: Sector vs. Hadoop Performance. Sector/Sphere 1.24a and Hadoop 0.20.1, with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
MalStone (OCC-Developed Benchmark). Sector/Sphere 1.20 and Hadoop 0.18.3, with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. The data consisted of 20 nodes with 500 million 100-byte records per node.
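In total (simple arithmetic from the figures above): 20 nodes x 500 million records per node = 10 billion records, and 10 billion x 100 bytes is roughly 1 TB of data.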
How Do You Program A Data Center?
Idea 1: Support UDFs Over the Data Center. Think of MapReduce as Map acting on (text) records, with a fixed Shuffle and Sort, followed by Reduce acting on (text) records. We generalize this framework as follows: support a sequence of User Defined Functions (UDFs) acting on segments (chunks) of files. MapReduce is one special case, consisting of a user-defined Map, a system-defined shuffle and sort, and a user-defined Reduce. In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc. A sketch of a UDF in this style follows below.
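The following is a minimal, hypothetical C++ sketch of this idea; the types and the function signature are invented for this writeup and are not the actual Sphere API. It shows a UDF that consumes one segment (a chunk of a file) and emits bucketed output records, with a word-count style Map as the special case.

// Hypothetical sketch, not the actual Sphere API: a user defined function
// (UDF) that is handed one segment (chunk) of a file and appends output
// records. A MapReduce "map" is one special case of such a UDF.
#include <cstddef>
#include <string>
#include <vector>

// A segment is a contiguous run of bytes cut from one file.
struct Segment {
    std::string file;    // file the segment was cut from
    std::size_t offset;  // byte offset of the segment within that file
    std::string data;    // the bytes of the segment
};

// An output record tagged with a bucket id, so the framework can decide
// where to send it (back to the client, to local disk, or to another node).
struct Record {
    int bucket;
    std::string value;
};

// The UDF signature used in this sketch: one segment in, records appended.
using UDF = void (*)(const Segment& in, std::vector<Record>& out);

// Example UDF: a word-count style Map that emits "<word>\t1" records,
// bucketed by a simple hash of the word's first character.
void map_words(const Segment& in, std::vector<Record>& out) {
    std::size_t start = 0;
    while (start < in.data.size()) {
        std::size_t end = in.data.find_first_of(" \t\n", start);
        if (end == std::string::npos) end = in.data.size();
        if (end > start) {
            std::string word = in.data.substr(start, end - start);
            int bucket = static_cast<unsigned char>(word[0]) % 16;
            out.push_back(Record{bucket, word + "\t1"});
        }
        start = end + 1;
    }
}

Chained together, a sequence of such UDFs recovers MapReduce when the middle step is the system-defined shuffle and sort.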
Applying a UDF using Sector/Sphere (workflow diagram): (1) the application, via the Sphere client, splits the data in the input stream into segments; (2) Sphere locates and schedules Sphere Processing Engines (SPEs) to process the segments; (3) the results are collected into the output stream.
Sector Programming Model. A Sector dataset consists of one or more physical files. Sphere applies User Defined Functions over streams of data consisting of data segments. Data segments can be individual data records, collections of data records, or files. Examples of UDFs: a Map function, a Reduce function, a split function for CART, etc. The outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node, as in the routing sketch below.
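A minimal, hypothetical sketch of that last routing step, with invented names (this is not the real Sector/Sphere API): every output record carries a bucket id, and the bucket id is mapped onto one of the three destinations listed above.

// Hypothetical sketch (not the real Sector/Sphere API): routing the outputs
// of a UDF. Each output record carries a bucket id; the bucket id is mapped
// onto a destination, which is how a MapReduce-style shuffle falls out of
// the more general UDF model.
#include <cstdio>
#include <string>
#include <vector>

struct Record {
    int bucket;
    std::string value;
};

enum class Destination { ReturnToClient, WriteLocal, ShuffleToNode };

struct Route {
    Destination where;
    int node;  // only meaningful for ShuffleToNode
};

// Map a bucket id onto a destination. In a real system this table would be
// built by the scheduler; here it is just bucket modulo the node count.
Route route_for(int bucket, int num_nodes) {
    if (bucket < 0) return {Destination::ReturnToClient, -1};  // untagged output goes back to the client
    if (num_nodes <= 1) return {Destination::WriteLocal, 0};   // nothing to shuffle to
    return {Destination::ShuffleToNode, bucket % num_nodes};
}

void dispatch(const std::vector<Record>& records, int num_nodes) {
    for (const Record& r : records) {
        Route route = route_for(r.bucket, num_nodes);
        switch (route.where) {
        case Destination::ReturnToClient:
            std::printf("return to client: %s\n", r.value.c_str());
            break;
        case Destination::WriteLocal:
            std::printf("write locally: %s\n", r.value.c_str());
            break;
        case Destination::ShuffleToNode:
            std::printf("shuffle to node %d: %s\n", route.node, r.value.c_str());
            break;
        }
    }
}

In this picture, a MapReduce-style shuffle is simply the ShuffleToNode case applied to every record.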
How Do You Move Data in a Cloud and Between Clouds? Option 1: use TCP and close your eyes. Option 2: ?????
Idea 2: Sector is Built on Top of UDT. UDT is a specialized network transport protocol. UDT can take advantage of wide area, high performance 10 Gbps networks. Sector is a wide area distributed file system built over UDT. Sector is layered over the native file system (versus being a block-based file system).
UDT Has Been Downloaded 25,000+ Times. Users include Sterling Commerce, Movie2Me, Globus, Power Folder, and Nifty TV. http://udt.sourceforge.net
Alternatives to TCP: Decreasing Increases AIMD Protocols. [Figure: the increase of the packet sending rate, alpha(x), and the decrease factor, plotted against the current sending rate x, for UDT, Scalable TCP, HighSpeed TCP, and AIMD (TCP NewReno).]
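For reference, the generic AIMD-style rate control behind the figure can be written as follows (a standard formulation, not taken from the slide): on each RTT without loss the sending rate grows by an increase function, $x \leftarrow x + \alpha(x)$, and on a loss event it is cut by a decrease factor, $x \leftarrow (1 - \beta)\,x$. In TCP NewReno the increase $\alpha(x)$ is a constant; UDT uses a decreasing-increases rule in which $\alpha(x)$ shrinks as the rate x grows, which is what lets a single flow fill very fast wide area links without large oscillations.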
UDT Makes Wide Area Clouds Possible. Using UDT, Sector can take advantage of wide area, high performance networks (10+ Gbps), at 10 Gbps per application.
What About Security?
Idea 3: Add Security From the Start. A security server maintains information about users and slaves. User access control is by password and client IP address, and there is file level access control. Messages are encrypted over SSL, and a certificate is used for authentication. Sector is HIPAA capable. (Architecture diagram: the client connects to the master over SSL, the master consults the security server over SSL for AAA, and data moves between the client and the slaves.) A toy sketch of the two access-control checks follows below.
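A toy illustration of those two checks, with structures and names invented for this writeup (not Sector's actual security server data model): user access control by password and client IP address, and file level access control.

// Toy illustration only: invented structures, not Sector's actual security
// server data model. It shows the two checks described on the slide:
// (1) user access control by password and client IP address, and
// (2) file level access control.
#include <map>
#include <set>
#include <string>
#include <utility>

struct UserEntry {
    std::string password_hash;            // never store plaintext passwords
    std::set<std::string> allowed_ips;    // client IPs this user may connect from
    std::set<std::string> readable_paths; // file level ACL: paths the user may read
};

class SecurityServer {
public:
    void add_user(const std::string& name, UserEntry entry) {
        users_[name] = std::move(entry);
    }

    // Check 1: user access control (password + client IP).
    bool authenticate(const std::string& user, const std::string& password_hash,
                      const std::string& client_ip) const {
        auto it = users_.find(user);
        if (it == users_.end()) return false;
        const UserEntry& u = it->second;
        return u.password_hash == password_hash && u.allowed_ips.count(client_ip) > 0;
    }

    // Check 2: file level access control.
    bool may_read(const std::string& user, const std::string& path) const {
        auto it = users_.find(user);
        return it != users_.end() && it->second.readable_paths.count(path) > 0;
    }

private:
    std::map<std::string, UserEntry> users_;
};

In the deployed system these checks sit behind the SSL connections in the diagram; the sketch only shows the decision logic, not the transport.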
For More Information About Sector. Yunhong Gu and Robert L. Grossman, Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data, Philosophical Transactions of the Royal Society A, Volume 367, Number 1897, pages 2429-2445, 2009. http://arxiv.org/abs/0809.1181 and http://rsta.royalsocietypublishing.org/content/367/1897/2429

For Related Information. Related information can be found at blog.rgrossman.com and www.rgrossman.com.