Big Data and NoSQL for Database and BI Pros

Big Data-BI Fusion:
Microsoft HDInsight & MS BI
Level: Intermediate
March 28, 2013
Andrew Brust
CEO and Founder
Blue Badge Insights

Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair VSLive! and 18 years as a speaker
• Founder, MS BI and Big Data User Group of NYC
– http://www.msbigdatanyc.com
• Co-moderator, NYC .NET Developers Group
– http://www.nycdotnetdev.com
• “Redmond Review” columnist for
Visual Studio Magazine and Redmond Developer
News
• brustblog.com, Twitter: @andrewbrust

Andrew’s New Blog (bit.ly/bigondata)

What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data,
sensors, web logs, social media, etc.
• Parallel processing often involved
– Hadoop is emblematic, but other technologies are Big
Data too
• Processing of data sets too large for
transactional databases
– Analyzing interactions, rather than transactions
– The three V’s: Volume, Velocity, Variety
• Big Data tech sometimes imposed on
small data problems

The Hadoop Stack
MapReduce, HDFS
Database
RDBMS Import/Export
Query: HiveQL and Pig Latin
Machine Learning/Data Mining
Log file integration

What’s MapReduce?
• Divide and conquer approach to “Big”
data processing
• Partition the data and send to mappers
(nodes in cluster)
• Mappers pre-process into key-value pairs,
then all output for (a) given key(s) goes to
a reducer
• Reducer performs aggregations; one
output per key, with value
• Map and Reduce code natively written as
Java functions
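
The flow described above can be sketched in a few lines of Python. This is a single-process toy, not Hadoop code: the function names (mapper, shuffle, reducer, map_reduce) and the sample partitions are made up for illustration, and a real cluster would run the mappers and reducers on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def mapper(partition):
    """Turn one partition of text lines into (word, 1) key-value pairs."""
    for line in partition:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key so that each key ends up at exactly one reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values for one key into a single output pair."""
    return key, sum(values)


def map_reduce(partitions):
    """Run every partition through a mapper, then reduce per key."""
    mapped = chain.from_iterable(mapper(p) for p in partitions)
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


# Two partitions stand in for data sent to two mapper nodes (sample data).
partitions = [["big data big"], ["data tools"]]
counts = map_reduce(partitions)
print(counts)
```

Each key appears once in the result, with its aggregated value, which matches the "one output per key" rule in the last reducer bullet.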

MapReduce, in a Diagram
[Diagram: Input splits feed six mappers; mapper output is grouped by key (K1, K2, K3) and routed to three reducers, each of which writes its own Output.]

HDFS
• File system whose data gets distributed
over commodity disks on commodity
servers
• Data is replicated
• If one box goes down, no data lost
– “Shared Nothing”
– Except the name node
• BUT: Immutable
– Files can only be written to once
– So updates require drop + re-write (slow)
– You can append though
– Like a DVD/CD-ROM

HBase
• A Wide-Column Store, NoSQL database
• Modeled after Google BigTable
• HBase tables are HDFS files
– Therefore, Hadoop-compatible
• Hadoop often used with HBase
– But you can use either without the other
• HDInsight (more on next slide) does not
(yet) include HBase

Microsoft HDInsight
• Developed with Hortonworks and
incorporates Hortonworks Data Platform
(HDP) for Windows
• Windows Azure HDInsight and Microsoft
HDInsight Server
– Single node preview runs on Windows client
• Includes ODBC Driver for Hive
– And Excel add-in that uses it
• JavaScript MapReduce framework
• Microsoft contributes it all back to the open source
Apache project

Azure HDInsight Provisioning
• HDInsight preview now public, so…
• Go to Windows Azure portal
• Sign up for the public preview
• Select HDInsight from left navbar
• Click “+ NEW” button @ lower-left
• Specify cluster name, number of nodes, admin
password, storage account
– Credentials used for browser login, RDP and ODBC
– During preview, you will be billed 50% of Azure compute rates
for nodes in cluster. Will be 100% at GA.
• Click “CREATE HDINSIGHT CLUSTER”
• Wait for provisioning to complete
• Navigate to http://clustername.azurehdinsight.net

Submitting, Running and
Monitoring Jobs
• Upload a JAR
• Use Streaming
– Use other languages (i.e. other than Java) to write
MapReduce code
– Python is a popular option
– Any executable works, even C# console apps
– On HDInsight, JavaScript works too
– Still uses a JAR file: streaming.jar
• Run at command line (passing JAR name
and params) or use GUI
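
As a rough illustration of the Streaming option, here is a word count written the way a Streaming job expects it: the mapper emits tab-separated key/value lines, and the reducer receives those lines sorted by key. The function names and the sample input are hypothetical, and the sort step below only simulates locally what Hadoop does between the map and reduce phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(step, stream=sys.stdin, out=sys.stdout):
    """In a real job, each script would call this on stdin/stdout."""
    for result in step(stream):
        out.write(result + "\n")


# Local simulation: sorting stands in for Hadoop's shuffle-and-sort phase.
mapped = sorted(map_lines(["big data big", "data tools"]))
result = list(reduce_lines(mapped))
print(result)
```

In an actual job, the mapper and reducer would be two separate scripts handed to streaming.jar, each reading standard input and writing standard output.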

Amenities for
Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#: HadoopJob, MapperBase, ReducerBase

The “Data-Refinery” Idea
• Use Hadoop to “on-board” unstructured
data, then extract manageable subsets
• Load the subsets into conventional DW/BI
servers and use familiar analytics tools to
examine them
• This is the current rationalization of
Hadoop + BI tools’ coexistence
• Will it stay this way?

Hive
• Used by most BI products which connect
to Hadoop
• Provides a SQL-like abstraction over
Hadoop
– Officially HiveQL, or HQL
• Works on own tables, but also on HBase
• Query generates MapReduce job, output of
which becomes result set
• Microsoft has Hive ODBC driver
– Connects Excel, Reporting Services, PowerPivot,
Analysis Services Tabular Mode (only)

HDInsight Data Sources
• Files in HDFS
• Azure Blob Storage (Azure HDInsight only)
– Use asv:// URLs (“Azure Storage Vault”)
• Hive tables
• HBase?

Just-in-time Schema
• When looking at unstructured data,
schema is imposed at query time
• Schema is context specific
– If scanning a book, are the values words, lines, or
pages?
– Are notes a single field, or is each word a value?
– Are date and time two fields or one?
– Are street, city, state, zip separate or one value?
– Pig and Hive let you determine this at query time
– So does the Map function in MapReduce code
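
To make the idea concrete, the sketch below reads the same raw line under two different schemas, chosen at read time. The record layout, field names and sample data are invented for illustration; Pig, Hive or a Map function would make the equivalent choice in their own syntax.

```python
from datetime import datetime

# One raw, pipe-delimited record (made-up sample data).
RAW = "2013-03-28 09:15:00|12 Main St, Springfield, IL, 62701|late delivery"


def read_coarse(line):
    """Schema A: one timestamp, one address value, one notes field."""
    when, address, notes = line.split("|")
    return {
        "timestamp": datetime.strptime(when, "%Y-%m-%d %H:%M:%S"),
        "address": address,
        "notes": notes,
    }


def read_fine(line):
    """Schema B: date and time split, address in parts, notes as words."""
    when, address, notes = line.split("|")
    date, time = when.split(" ")
    street, city, state, zip_code = [p.strip() for p in address.split(",")]
    return {
        "date": date,
        "time": time,
        "street": street,
        "city": city,
        "state": state,
        "zip": zip_code,
        "note_words": notes.split(),
    }


coarse = read_coarse(RAW)
fine = read_fine(RAW)
print(coarse["address"])
print(fine["city"], fine["note_words"])
```

Nothing about the stored data changes between the two readers; only the query-time interpretation does.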

How Does MS BI Fit In?
• Excel, PowerPivot: can query via Hive
ODBC driver
• Analysis Services (SSAS) Tabular Mode
– Also compatible with Hive ODBC Driver
– Multidimensional mode is not
• Power View
– Works against PowerPivot and SSAS Tabular
• RDBMS + Parallel Data Warehouse (PDW)
– Sqoop connectors
– Columnstore Indexes (Enterprise Edition and PDW only)
• PDW: PolyBase

Excel, PowerPivot
• Excel and PowerPivot use the BI Semantic
Model (BISM), which can query Hadoop via
Hive and its ODBC driver
• Excel also features “Data Explorer”
(currently in Beta) which can query HDFS
directly and insert the results into a BISM
repository
• Excel BISM accommodates millions of
rows through compression. Not petabyte
scale, but sufficient to store and analyze
output of Hadoop queries.

PowerPivot, SSAS Tabular
• SQL Server Analysis Services Tabular
mode is the enterprise server
implementation of BISM
• Features partitioning and role-based
security
• Can store billions of rows. So even better
for Hadoop output analysis.
• Excel-based BISM repositories can be
upsized to SSAS Tabular

Querying Hadoop from
Microsoft BI

Sqoop
• Name is short for “SQL to Hadoop”
• Essentially a technology for moving data
between data warehouses and Hadoop
• Command line utility; allows specification
of source/target HDFS file and relational
server, database and table
• Sqoop connectors available for SQL
Server and PDW
• Sqoop generates MapReduce job to
extract data from, or insert data into, HDFS
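
As a sketch of what "specification of source/target" looks like in practice, the helper below assembles a Sqoop import command line for a SQL Server table. The helper name, server, database, table, user and HDFS path are all hypothetical, and the flags shown (--connect, --username, -P, --table, --target-dir, --num-mappers) should be checked against the Sqoop documentation for the version and connector in use.

```python
import shlex


def sqoop_import_args(server, database, table, target_dir, username, mappers=4):
    """Build the argument list for importing one table into HDFS."""
    # Standard SQL Server JDBC URL form; port 1433 is assumed here.
    connect = f"jdbc:sqlserver://{server}:1433;databaseName={database}"
    return [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "-P",                          # prompt for the password at run time
        "--table", table,              # source relational table
        "--target-dir", target_dir,    # destination directory in HDFS
        "--num-mappers", str(mappers), # parallelism of the MapReduce job
    ]


# Hypothetical names throughout; nothing is executed here.
args = sqoop_import_args("dwserver", "Sales", "Orders", "/data/orders", "etl_user")
command = " ".join(shlex.quote(a) for a in args)
print(command)
```

The printed command is only a string; running it would require a Hadoop cluster with Sqoop and the SQL Server connector installed.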

PDW, PolyBase
• SQL Server Parallel Data Warehouse
(PDW) is a Massively Parallel Processing
(MPP) data warehouse appliance version
of SQL Server
• MPP manages a grid of relational database
servers for divide-and-conquer processing
of large data sets.
• PDW v2 includes “PolyBase,” a
component which allows PDW to query
data in Hadoop directly.
– Bypasses MapReduce; addresses data nodes directly
and orchestrates parallelism itself

PolyBase Versus Hive, Sqoop
• Hive and Sqoop generate MapReduce
jobs, and work in batch mode
• PolyBase addresses HDFS data itself
• This is true SQL over Hadoop.
• Competitors:
– Cloudera Impala
– Teradata Aster SQL-H
– EMC/Greenplum Pivotal HD
– Hadapt

Usability Impact
• PowerPivot makes analysis much easier,
self-service
• Power View is great for discovery and
visualization; also self-service
• Combine with the Hive ODBC driver and
suddenly Hadoop is accessible to
business users
• Caveats
– Someone has to write the HiveQL
– Can query Big Data, but the result set must be much smaller

Resources
• Big On Data blog
– http://www.zdnet.com/blog/big-data
• Apache Hadoop home page
– http://hadoop.apache.org/
• Hive & Pig home pages
– http://hive.apache.org/
– http://pig.apache.org/
• Hadoop on Azure home page
– https://www.hadooponazure.com/
• SQL Server 2012 Big Data
– http://bit.ly/sql2012bigdata

Thank You!
• Email
• andrew.brust@bluebadgeinsights.com
• Blog:
• http://www.zdnet.com/blog/big-data
• Twitter
• @andrewbrust on twitter
