Building an Analytical Platform
I was recently asked to build an analytical platform for a project. But what is an analytical platform? The client, a retailer, described it as a database where it could store data and as a front end where it could do statistical work. This work would range from simple means and standard deviations through to more complex predictive analytics that could be used, for example, to analyze the past performance of a customer to assess the likelihood that the customer will exhibit a future behavior. Or it might involve using models to classify customers into groups and ultimately to bring the two processes together into an area known as decision models. The customer had also come up with an innovative way to resource the [...] work placements to master’s degree students studying statistics at the local university and arranged for them to work with the customer insight team to describe and develop the advanced models. All the customer needed was a platform to work with.

From a systems architecture and development perspective, we could describe the requirements in three relatively simple statements:
1. Build a database with a very simple data model that could be easily loaded, that was capable of supporting high-performance queries, and that did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
2. Create a Web-based interface that would allow users to securely log on, to write statistical programs that could use the database as a source of data, and to output reports and graphics as well as to populate other tables (for example, target lists) as a result of statistical models.
3. Provide a way to automate the running of the statistical data models, once developed, so that they can be run without engaging the statistical development resources.

Of course, time was of the essence and costs had to be as low as possible, but we’ve come to expect that with [...]

Step 1: The database
Our chosen solution for the database was an SAP® Sybase® IQ database, a technology our client was already familiar with. SAP Sybase IQ is a column-store database. This means that instead of storing all the data in its rows, as many other databases do, the data is organized on disk by the columns. For example, if a [...] have the text of each country (for example, “United Kingdom”) stored many times. In a column-store database the text is stored only once and given a unique ID. This is repeated for each column and therefore the “row” of data consists of a list of IDs linked to the data held for each column.

[...] reporting and analytical databases. [...] used. In our example, “United Kingdom” would occupy 14 bytes, while the ID might occupy only 1 byte, reducing the storage for that one value in that one column by a ratio of 14:1, and this [...] the data. Furthermore, because there is less data on the disk, the time taken to read the data from disk and to process [...] which massively speeds up the queries too. Finally, each column is already indexed, which again helps the overall query speed.

An incidental but equally useful consequence of using a column-store database such as SAP Sybase IQ is that there is no advantage in creating a star schema as a data model. Instead, holding all the data in one large wide table is [...]ing each column with a key means that the underlying storage of data is a star schema. Creating a star schema in a column-store database rather than a large single table would mean incurring unnecessary additional join and processing overhead.

As a result of choosing SAP Sybase IQ’s column-store database we are able to have a data model that consists of a number of simple single-table data sets (one table for each [...] that is quick to load and to query.

It should be noted that this type of [...] online transaction processing (OLTP) applications because of the cost of doing small inserts and updates. However, this is not relevant for this particular [...]

The solution can be deployed only on a Linux platform. We use Linux for three reasons. First, RStudio Server Edition is not yet available for Microsoft Windows. Second, precompiled packages for all elements of the solution on Linux reduce [...] environments are normally cheaper than Windows environments due to the cost of the operating system license. We chose CentOS because it is a Red Hat derivative that is free.

One additional advantage of this solution for some organizations is the ability to deploy it in the cloud. Since the solu[...]ered, and since all querying is done via a Web interface, it is possible to use any
colocation or cloud-based hosting provider. Colocation or cloud deploy[...] systems management overhead, and access for both data delivery and data access. The system requires SSH access for management; FTP, SFTP, or SCP for [...] port open. The RStudio server uses the server login accounts for security but can also be tied to existing LDAP infrastructure.

Step 2: Statistical tools and Web interface
There are a number of statistical tools in the market. Most are very expensive, prohibitively so in this case, and the associated skills are hard to come by and expensive. However, since 1993 an open-source programming language called R (www.r-project.org) for statistical computing and graphics has been under development. It is now widely used among statisticians for developing statistical software and data analysis, is used by many universities, and is predicted to become the most widely used statistical package by 2015. The R project provides a command line and graphical interface as well as a large open-source library of useful routines (http://cran.r-project.org) and it is available as packaged software for most platforms including Linux.

In addition, a second open-source project called RStudio (http://rstudio.org) provides a single integrated development environment for R and can be deployed on a local machine or as a Web-based service using the server’s security model. In this case, we implemented the server edition in order to make the entire environment Web based.

So in two simple steps (download and install R, followed by download and install RStudio) we can install a full Web-based statistical environment. Note [...] packages may be required depending on your environment, but these are well documented on the source Web sites and in general automatically download if you are using a tool such as yum.

The next step was to get access to the data held in our SAP Sybase IQ server. This also proved to be very straightforward. There is an SAP Sybase white paper [...] describes the process that can be simply stated as:
1. Install the R JDBC package
2. Set up a JDBC connection
3. Establish your connection
4. Query the table

We now have an R object that contains data sourced from SAP Sybase IQ that we can work with. And what is amazing is that it took me less than half a day to build the platform from scratch.

[Figure: Analytical Platform Server running on CentOS. Components shown: (S)FTP/SCP file delivery; ETL engine (read file, write to database); SAP Sybase IQ; R and RStudio Server Edition, with R/JDBC connections to SAP Sybase IQ; any network-connected computer with a browser accessing RStudio Server Edition. ©2012 Data Management & Warehousing]

At this point data has to be loaded and the statisticians can get to work. Obviously this is more time consuming than the build, and over the days and weeks the analysts created their models and produced the results.

For this exercise we used our in-house extract, transform, and load (ETL) tool to create a repeatable data extraction and load process, but it would have been possible to use any of a wide range of tools that are available for this process.

Step 3: Automatically running the statistical models
Eventually a number of models for analyzing the data had been created and we were ready to move into a production environment. We automated the load of the data into the agreed single-table structure and also wanted to run the data models.
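
One way to picture automating the model runs is a small driver that calls each R program non-interactively, so that a scheduler or an ETL job can start it without a statistician at the keyboard. The following Python sketch is illustrative only and is not the process used on the project: the script names are hypothetical, the function names are made up for this example, and it assumes the Rscript command-line front end that ships with R is on the path.

```python
import subprocess
from typing import Any, Callable, Iterable, List


def build_rscript_command(script: str, args: Iterable[str] = ()) -> List[str]:
    """Return the argument list for running one R script non-interactively."""
    # Rscript is R's command-line front end; --vanilla skips saved
    # workspaces and profile files so that every run starts clean.
    return ["Rscript", "--vanilla", script, *args]


def run_models(scripts: Iterable[str],
               runner: Callable[..., Any] = subprocess.run) -> List[str]:
    """Run each model script in order; return the scripts that failed."""
    failed = []
    for script in scripts:
        result = runner(build_rscript_command(script),
                        capture_output=True, text=True)
        # A non-zero exit code means the R program stopped with an error.
        if result.returncode != 0:
            failed.append(script)
    return failed


# Hypothetical model scripts; each would open its own database connection.
models = ["models/churn_score.R", "models/segment_customers.R"]

# Dry run: show the commands that run_models(models) would execute.
for model in models:
    print(" ".join(build_rscript_command(model)))
```

Calling run_models(models) would execute the scripts for real and return the names of any that exited with an error, which the calling job could then log or retry.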
SAP Sybase IQ has the functionality [...] These C++ programs “talk” to a process known as Rserve, which in turn executes the R program and returns the results to SAP Sybase IQ. This allows R functions to be embedded directly into SAP Sybase IQ SQL commands. While setting this up requires a little more programming experience, it does mean that all processing can be done within SAP Sybase IQ.

Conversely, it is possible to run R from the command line and call the program that in turn uses the RJDBC connection to read and write data to the database.

Having a choice of methods is very helpful as it means that it can be integrated with the ETL environment in the most appropriate way. If the ETL tool [...] function (UDF) route is the most attractive. However, if the ETL tool supports host callouts (as ours does) then running R programs from a command line callout is quicker than developing the UDF.

Conclusions
Business intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.

Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment.

By combining the high performance, low footprint of SAP Sybase IQ with the open-source R and RStudio statistical packages, it is possible to quickly deploy an analytical platform in the cloud for which there are readily available skills.

This infrastructure can be used both for rapid prototyping on analytical models and for running completed models on new data sets to deliver greater insight into the data.

About the author
David Walker has been involved with business intelligence and data warehousing for over [...] Data Management & Warehousing (http://datamgmt.com) in 1995.

David and his team have worked around the world on projects designed to deliver [...] converting data into information and by [...] exploit that information.

David’s project work has given him experience in a wide variety of industries including [...]facturing, transportation, and public sector as well as a broad and deep knowledge of business intelligence and data warehousing technologies.
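
To make the storage arithmetic from Step 1 concrete, the following Python sketch dictionary-encodes a single column, storing each distinct text once and replacing every row value with a small ID. It is a simplified illustration and not a description of how SAP Sybase IQ is implemented; the sample data, the function names, and the fixed 1-byte ID width are all assumptions.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[Dict[str, int], List[int]]:
    """Store each distinct value once and replace row values with IDs."""
    dictionary: Dict[str, int] = {}   # distinct text -> ID
    ids: List[int] = []               # one ID per row, in row order
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids


def raw_bytes(values: List[str]) -> int:
    """Bytes needed if the text is repeated on every row."""
    return sum(len(v.encode("utf-8")) for v in values)


def encoded_bytes(dictionary: Dict[str, int], ids: List[int],
                  id_width: int = 1) -> int:
    """Bytes for the dictionary text plus one fixed-width ID per row.

    id_width=1 is an assumption that only holds for up to 256 distinct values.
    """
    text = sum(len(v.encode("utf-8")) for v in dictionary)
    return text + id_width * len(ids)


# Illustrative column: 1,000 rows of one country and 500 of another.
countries = ["United Kingdom"] * 1000 + ["France"] * 500
dictionary, ids = dictionary_encode(countries)

print("raw bytes:    ", raw_bytes(countries))
print("encoded bytes:", encoded_bytes(dictionary, ids))
```

With this sample data the raw column needs 17,000 bytes and the encoded form 1,520 bytes. The saving over a whole column therefore falls somewhat short of the 14:1 figure for a single “United Kingdom” value, because shorter values gain less and the dictionary itself has to be stored.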




SAP White Paper – Building an Analytical Platform
www.sap.com/contactsap




12/08 ©2012 SAP AG. All rights reserved.
SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign,
SAP BusinessObjects Explorer, StreamWork, SAP HANA, and
other SAP products and services mentioned herein as well as
their respective logos are trademarks or registered trademarks
of SAP AG in Germany and other countries.

Business Objects and the Business Objects logo, BusinessObjects,
Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other
Business Objects products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of Business
Objects Software Ltd. Business Objects is an SAP company.

Sybase and Adaptive Server, iAnywhere, Sybase 365, SQL Anywhere, and
other Sybase products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of Sybase Inc.
Sybase is an SAP company.

Crossgate, m@gic EDDY, B2B 360°, and B2B 360° Services are registered
trademarks of Crossgate AG in Germany and other countries. Crossgate
is an SAP company.

All other product and service names mentioned are the trademarks of
their respective companies. Data contained in this document serves [...]

These materials are subject to change without notice. These materials [...]
for informational purposes only, without representation or warranty of
any kind, and SAP Group shall not be liable for errors or omissions with
respect to the materials. The only warranties for SAP Group products and
services are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein should
be construed as constituting an additional warranty.

Building an analytical platform

  • 3. I was recently asked to build an analytical platform for a project. But what is an analytical platform? The client, a retailer, described it as a database where it could store data and as a front end where it could do statistical work. This work would range from simple means and standard deviations through to more complex predictive analytics that could be used, for example, to analyze past performance of a customer to assess the likelihood that the customer will exhibit a future behavior. Or it might involve using models to classify customers into groups and ultimately to bring the two processes together into an area known as decision models. The customer had also come up with an innovative way to resource the [...] work placements to master’s degree students studying statistics at the local university and arranged for them to work with the customer insight team to describe and develop the advanced models. All the customer needed was a platform to work with.

From a systems architecture and development perspective, we could describe the requirements in three relatively simple statements:

1. Build a database with a very simple data model that could be easily loaded, that was capable of supporting high-performance queries, and that did not consume a massive amount of disk space. It would also ideally be capable of being placed in the cloud.
2. Create a Web-based interface that would allow users to securely log on, to write statistical programs that could use the database as a source of data, and to output reports and graphics as well as to populate other tables (for example, target lists) as a result of statistical models.
3. Provide a way to automate the running of the statistical data models, once developed, so that they can be run without engaging the statistical development resources.

Of course, time was of the essence and costs had to be as low as possible, but we’ve come to expect that with [...]

Step 1: The database

Our chosen solution for the database was an SAP® Sybase® IQ database, a technology our client was already familiar with. SAP Sybase IQ is a column-store database. This means that instead of storing all the data in its rows, as many other databases do, the data is organized on disk by the columns. For example, if a [...] have the text of each country (for example, “United Kingdom”) stored many times. In a column-store database the text is stored only once and given a unique ID. This is repeated for each column and therefore the “row” of data consists of a list of IDs linked to the data held for each column.

[...] reporting and analytical databases. [...] used. In our example, “United Kingdom” would occupy 14 bytes, while the ID might occupy only 1 byte, reducing the storage for that one value in that one column by a ratio of 14:1, and this [...] the data. Furthermore, because there is less data on the disk, the time taken to read the data from disk and to process [...] which massively speeds up the queries too. Finally, each column is already indexed, which again helps the overall query speed.

An incidental but equally useful consequence of using a column-store database such as SAP Sybase IQ is that there is no advantage in creating a star schema as a data model. Instead, holding all the data in one large wide table is [...]ing each column with a key means that the underlying storage of data is a star schema. Creating a star schema in a column-store database rather than a large single table would mean incurring unnecessary additional join and processing overhead.

As a result of choosing SAP Sybase IQ’s column-store database we are able to have a data model that consists of a number of simple single table data sets (one table for each [...]) that is quick to load and to query.

It should be noted that this type of [...] online transaction processing (OLTP) applications because of the cost of doing small inserts and updates. However, this is not relevant for this particular [...]

The solution can be deployed only on a Linux platform. We use Linux for three reasons. First, RStudio Server Edition is not yet available for Microsoft Windows. Second, precompiled packages for all elements of the solution on Linux reduce [...] environments are normally cheaper than Windows environments due to the cost of the operating system license. We chose CentOS because it is a Red Hat derivative that is free.

One additional advantage of this solution for some organizations is the ability to deploy it in the cloud. Since the solution [...]ered, and since all querying is done via a Web interface, it is possible to use any

SAP White Paper – Building an Analytical Platform
  • 4. colocation or cloud-based hosting provider. Colocation or cloud deployment [...] systems management overhead, and access for both data delivery and data access. The system requires SSH access for management; FTP, SFTP, or SCP for [...] port open. The RStudio server uses the server login accounts for security but can also be tied to existing LDAP infrastructure.

Step 2: Statistical tools and Web interface

There are a number of statistical tools in the market. Most are very expensive, prohibitively so in this case, and the associated skills are hard to come by and expensive. However, since 1993 an open-source programming language called R (www.r-project.org) for statistical computing and graphics has been under development. It is now widely used among statisticians for developing statistical software and data analysis, is used by many universities, and is predicted to become the most widely used statistical package by 2015. The R project provides a command line and graphical interface as well as a large open-source library of useful routines (http://cran.r-project.org) and it is available as packaged software for most platforms including Linux.

In addition, a second open-source project called RStudio (http://rstudio.org) provides a single integrated development environment for R and can be deployed on a local machine or as a Web-based service using the server’s security model. In this case, we implemented the server edition in order to make the entire environment Web based.

So in two simple steps (download and install R, followed by download and install RStudio) we can install a full Web-based statistical environment. Note [...] packages may be required depending on your environment, but these are well documented on the source Web sites and in general automatically download if you are using a tool such as yum.

The next step was to get access to the data held in our SAP Sybase IQ server. This proved to also be very straightforward. There is a SAP Sybase white paper [...] describes the process that can be simply stated as:

1. Install the R JDBC package
2. Set up a JDBC connection
3. Establish your connection
4. Query the table

We now have an R object that contains data sourced from SAP Sybase IQ that we can work with. And what is amazing is that it took me less than half a day to build the platform from scratch.

At this point data has to be loaded and the statisticians can get to work. Obviously this is more time consuming than the build, and over the days and weeks the analysts created their models and produced the results.

For this exercise we used our in-house extract, transform, and load (ETL) tool to create a repeatable data extraction and load process, but it would have been possible to use any of a wide range of tools that are available for this process.

Step 3: Automatically running the statistical models

Eventually a number of models for analyzing the data had been created and we were ready to move into a production environment. We automated the load of the data into the agreed single-table structure and wanted to also run the data models.

[Figure: Analytical Platform Server (CentOS) running RStudio Server Edition and R, with R/JDBC connections to SAP Sybase IQ. Files arrive by (S)FTP/SCP file delivery; the ETL engine reads each file and writes to the database. Users connect from any network-connected computer with a browser accessing RStudio Server Edition. ©2012 Data Management & Warehousing]
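The storage argument made under Step 1 (text stored once per column and replaced by a small ID) can be illustrated with a toy example. The following Python sketch is only an illustration of the general idea of dictionary encoding; it is not SAP Sybase IQ's actual storage format, the function names are hypothetical, and the 1-byte ID size is an assumption taken from the "United Kingdom" example in the text.

```python
# Toy dictionary encoding of one column. Illustrative only: this is not
# how SAP Sybase IQ stores data internally, just the idea described in Step 1.

def dictionary_encode(values):
    """Store each distinct value once and replace every occurrence with an integer ID."""
    dictionary = {}  # distinct value -> ID
    ids = []         # one ID per row, in row order
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Rebuild the original column from the dictionary and the list of IDs."""
    by_id = {identifier: value for value, identifier in dictionary.items()}
    return [by_id[identifier] for identifier in ids]


# A hypothetical "country" column with repeated values.
countries = ["United Kingdom", "France", "United Kingdom", "United Kingdom", "France"]
dictionary, ids = dictionary_encode(countries)

ID_SIZE_BYTES = 1  # assumption: each ID fits in one byte, as in the text's example

# Bytes needed when every row holds the full text.
raw_bytes = sum(len(c.encode("utf-8")) for c in countries)

# Bytes needed when each distinct text is stored once plus one ID per row.
encoded_bytes = sum(len(v.encode("utf-8")) for v in dictionary) + len(ids) * ID_SIZE_BYTES

print(raw_bytes, encoded_bytes)
```

The saving grows with the number of repeated values: each additional "United Kingdom" row costs one byte of ID under this assumption rather than 14 bytes of text.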
  • 5. SAP Sybase IQ has the functionality [...] These C++ programs “talk” to a process known as Rserve, which in turn executes the R program and returns the results to SAP Sybase IQ. This allows R functions to be embedded directly into SAP Sybase IQ SQL commands. While setting this up requires a little more programming experience, it does mean that all processing can be done within SAP Sybase IQ.

Conversely, it is possible to run R from the command line and call the program that in turn uses the RJDBC connection to read and write data to the database.

Having a choice of methods is very helpful as it means that it can be integrated with the ETL environment in the most appropriate way. If the ETL tool [...] user-defined function (UDF) route is the most attractive. However, if the ETL tool supports host callouts (as ours does) then running R programs from a command line callout is quicker than developing the UDF.

Conclusions

Business intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data.

Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment.

By combining the high performance, low footprint of SAP Sybase IQ with the open-source R and RStudio statistical packages, it is possible to quickly deploy an analytical platform in the cloud for which there are readily available skills.

This infrastructure can be used both for rapid prototyping on analytical models and for running completed models on new data sets to deliver greater insight into the data.

About the Author

David Walker has been involved with business intelligence and data warehousing for over [...] Data Management & Warehousing (http://datamgmt.com) in 1995.

David and his team have worked around the world on projects designed to deliver [...] converting data into information and by [...] exploit that information.

David’s project work has given him experience in a wide variety of industries including [...] manufacturing, transportation, and public sector as well as a broad and deep knowledge of business intelligence and data warehousing technologies.
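The command-line callout route described under Step 3 could be driven from an ETL job along the following lines. This is a minimal sketch in Python, assuming that R's standard `Rscript` front end is on the path of the analytical platform server; the script name `score_customers.R`, its argument, and the helper function names are hypothetical and do not come from the original project.

```python
import subprocess


def build_rscript_command(script_path, *args, rscript="Rscript"):
    """Build the command line for running an R program non-interactively.

    --vanilla asks R not to save or restore a workspace and to skip
    user and site profile files, which suits an unattended batch run.
    """
    return [rscript, "--vanilla", script_path, *[str(a) for a in args]]


def run_model(script_path, *args):
    """Run the R program and hand exit code and output back to the caller.

    The R program itself is assumed to open its own RJDBC connection to
    read from and write to the database, as described in the text.
    """
    command = build_rscript_command(script_path, *args)
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return completed.returncode, completed.stdout, completed.stderr


# Hypothetical callout for one model and one load date; nothing is executed here.
command = build_rscript_command("score_customers.R", "2012-08-01")
print(" ".join(command))
```

An ETL tool that supports host callouts would issue the printed command directly; a wrapper such as `run_model` is only useful where the exit code and output need to be inspected before the next step of the load runs.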
  • 6. www.sap.com/contactsap
©2012 SAP AG. All rights reserved.