Introduction to Augustus OVERVIEW Open Data Group September 17, 2009
Website and Community Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). It is written in Python and is freely available. http://augustus.googlecode.com
Getting Augustus Releases can be downloaded from the website under the Download tab. Current releases are also listed in the Featured sidebar on the main page. Augustus can also be checked out directly from source control; we use Subversion, and project members can be granted commit access.
Source All of the source files are viewable online with markup and revision history. The raw version of each file is also available. http://augustus.googlecode.com/source/browse
Documentation and Community WIKI The wiki is intended for people who want to install Augustus for use and possibly develop new features. FORUM The forum is open for any general discussion regarding Augustus.
Using Augustus Model Development Use Cycle Work Flow
Development and Use Cycle The typical model development and use cycle with Augustus is as follows: Identify suitable data with which to construct a new model. Provide a model schema which prescribes the requirements for the model. Run the Augustus producer to obtain a new model. Run the Augustus consumer on new data to effect scoring.
Development and Use Cycle (diagram): 1. Data Inputs; 2. Model Schema
Running Augustus (diagram): 3. Obtain new model with Producer; 4. Score with Consumer
Work Flows Augustus is typically used to construct models and score data with models. Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.
Components Pre-processing Producers Consumers Post-Processing
Producers and Consumers The Producers and Consumers require configuration with XML-formatted files. Supplying the schema, configuration, and training data to the Producer yields a completely specified model. The Consumers provide some configurability of the output, but post-processing can be used to render the output according to the user's needs.
Post Processing Augustus can accommodate a post-processing step. While not necessary, this is often useful to: Re-normalize the scoring results or perform an additional transformation. Supplement the results with global meta-data such as timestamps. Format the results. Select certain interesting values from the results. Restructure the data for use with other applications.
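As a concrete illustration of the first two items above, the sketch below re-normalizes scores and stamps each result with a timestamp. The record layout and function name are hypothetical, not part of the Augustus API.

```python
import datetime

def postprocess(results, max_score):
    # Hypothetical record layout: each result is a dict with a "score" key.
    stamped = []
    now = datetime.datetime.utcnow().isoformat()
    for rec in results:
        out = dict(rec)
        out["score"] = rec["score"] / max_score  # re-normalize into [0, 1]
        out["timestamp"] = now                   # global meta-data
        stamped.append(out)
    return stamped

processed = postprocess([{"score": 0.25}, {"score": 0.5}], max_score=0.5)
print(processed)
```

The same shape of loop would cover the remaining items as well: formatting, selecting interesting values, or restructuring records for other applications.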
Segments Segments are covered elsewhere, but Augustus supports them, and segmentation can be described at the Producer level. Augustus was originally written to an Open Data draft RFC for segmented models; Augustus 0.3.x conforms to that RFC. PMML 4 formalized the specification for segments and deviates somewhat from the RFC; Augustus 0.4.x conforms to this standard. Augustus 0.3.x and 0.4.x both support segments; they differ in how they handle them.
Result of Scoring
Case Study: Auto Auto is an example distributed with Augustus, found in the examples directory. It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records. The examples use two types of data files. The data consists of records with three entries: Date, Color, and Automaker. The Weighted examples have an additional 'weight' column, named Count. The Count field records the number of occurrences of identical tuples in the non-weighted data and collapses them into one record.
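The relationship between the two data file types can be sketched in a few lines: identical (Date, Color, Automaker) tuples in the non-weighted data collapse into one weighted record. This is an illustrative reconstruction with made-up rows, not code from the example.

```python
from collections import Counter

# Non-weighted data: one record per observation (made-up rows).
records = [
    ("2009-09-01", "red",  "Toyota"),
    ("2009-09-01", "red",  "Toyota"),
    ("2009-09-02", "blue", "Ford"),
]

# Collapse identical tuples into one record, appending a Count column.
weighted = [row + (count,) for row, count in sorted(Counter(records).items())]
print(weighted)  # [('2009-09-01', 'red', 'Toyota', 2), ('2009-09-02', 'blue', 'Ford', 1)]
```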
Work Flow Overview
Auto: Weighted Batch Using the Baseline for Training:
$ cd WeightedBatch
`-- scripts
    |-- consume.py
    |-- postprocess.py
    `-- produce.py
http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch
Input for the Producer The Producer takes the training data set. In the code, we have declared how we want to test the data:
import augustus.modellib.baseline.producer.Producer as Producer
def makeConfigs(inFile, outFile, inPMML, outPMML):
    # open data file
    inf = uni.UniTable().fromfile(inFile)
    # start the configuration file
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")
Input for the Producer, Continued
    # use a discrete distribution model for test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # create the segmentation declarations for the two fields at this level
    ''' Taken out for the example; other Use Cases will focus on Segments
    segmentation = ET.SubElement(test, "segmentation")
    makeSegment(inf, segmentation, "Color")
    '''
    # output the configuration file
    tree = ET.ElementTree(root)
    tree.write(outFile)
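Pieced together, the configuration-writing fragments above amount to something like the following self-contained sketch. It assumes xml.etree.ElementTree as ET and a <model> root element matching the training XML shown later; the file names are placeholders, and the real produce.py does more (PMML skeleton, segmentation, timing).

```python
import xml.etree.ElementTree as ET

def makeConfigs(inFile, outFile, inPMML, outPMML):
    # start the configuration file: <model> points at input/output PMML
    root = ET.Element("model")
    root.set("input", inPMML)
    root.set("output", outPMML)
    # declare how we want to test the data
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")
    # use a discrete distribution model for the test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # output the configuration file
    ET.ElementTree(root).write(outFile)

makeConfigs("../data/wtraining.nab", "wtraining.nab.xml",
            "../producer/wtraining.nab.pmml", "../consumer/wtraining.nab.pmml")

# Round-trip check: read back what was written.
cfg = ET.parse("wtraining.nab.xml").getroot()
print(cfg.tag, cfg.find("test").get("threshold"))
```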
Running the Producer (Training)
$ cd scripts
$ python2.5 produce.py -f wtraining.nab -t20
(0.000 secs)  Beginning timing
(0.000 secs)  Creating configuration file
(0.001 secs)  Creating input PMML file
(0.001 secs)  Starting producer
(0.000 secs)  Inputting configurations
(0.001 secs)  Inputting model
(0.008 secs)  Collecting stats for baseline distribution
(0.011 secs)  Events 20.067% processed
(0.009 secs)  Events 40.134% processed
(0.009 secs)  Events 60.201% processed
(0.009 secs)  Events 80.268% processed
(0.009 secs)  Events 100.000% processed
(0.000 secs)  Making test distributions from statistics
(0.002 secs)  Outputting PMML
(0.062 secs)  Lifetime of timer
Model generated by the Producer
<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>
Model generated by the Producer (Cont) The structure is determined by code in producer.py:
def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")
It then goes on for each Data and Mining Field:
    dataField = ET.SubElement(dataDict, "DataField")
    dataField.set("name", "Automaker")
    dataField.set("optype", "categorical")
    dataField.set("dataType", "string")
    . . .
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    miningField = ET.SubElement(miningSchema, "MiningField")
    miningField.set("name", "Automaker")
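A minimal self-contained version of that PMML-building code might look like the sketch below. It folds the per-field repetition into a loop and again assumes xml.etree.ElementTree; it is an illustration, not the actual producer.py.

```python
import xml.etree.ElementTree as ET

def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")
    baselineModel = ET.SubElement(root, "BaselineModel")
    baselineModel.set("functionName", "baseline")
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    # one DataField / MiningField pair per input field
    fields = [("Automaker", "categorical", "string"),
              ("Color", "categorical", "string"),
              ("Count", "continuous", "float")]
    for name, optype, dataType in fields:
        dataField = ET.SubElement(dataDict, "DataField")
        dataField.set("name", name)
        dataField.set("optype", optype)
        dataField.set("dataType", dataType)
        miningField = ET.SubElement(miningSchema, "MiningField")
        miningField.set("name", name)
    ET.ElementTree(root).write(outFile)

makePMML("wtraining.nab.pmml")
pmml = ET.parse("wtraining.nab.pmml").getroot()
print(pmml.get("version"), len(pmml.find("DataDictionary")))
```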
Producer Output The training step used the code in producer.py to generate a model and get expected results. Training generated the following files:
.
|-- consumer
|   `-- wtraining.nab.pmml   MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
`-- producer
    |-- wtraining.nab.pmml   BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
    `-- wtraining.nab.xml    MODEL FILE USED FOR TRAINING
Training XML This provides: the model with expected values from Training that is used when we score, the test distribution, and the baseline data and how it is to be handled.
$ cat producer/wtraining.nab.xml
<model input="../producer/wtraining.nab.pmml" output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold" threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab" type="UniTable" />
  </test>
</model>
Unitable Unitable is used to hold the data that is read in. It encapsulates the data in a way that allows us to manipulate it efficiently. It can be thought of, in part, as a data structure holding a spreadsheet of data, with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure. More to follow.
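The spreadsheet analogy can be made concrete with a toy column-oriented table: a mapping from column names to equal-length columns. UniTable itself keeps its columns as numpy arrays so whole-column operations are fast; the dict-of-lists below only illustrates the idea and is not the UniTable API.

```python
# Toy column-oriented table: one column per field, rows aligned by index.
table = {
    "Date":      ["2009-09-01", "2009-09-01", "2009-09-02"],
    "Color":     ["red", "red", "blue"],
    "Automaker": ["Toyota", "Toyota", "Ford"],
    "Count":     [2.0, 1.0, 3.0],
}

# A whole-column ("vector") operation: total weight over all records.
total = sum(table["Count"])

# A derived column defined in terms of an existing column.
table["Fraction"] = [c / total for c in table["Count"]]
print(total, table["Fraction"])
```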
Running the Consumer
$ cd scripts
$ python2.5 consume.py -b wtraining.nab -f wscoring.nab
Ready to score
.
|-- consumer
|   |-- wscoring.nab.wtraining.nab.xml
|   `-- wtraining.nab.pmml
|-- postprocess
|   `-- wscoring.nab.wtraining.nab.xml
`-- producer
    |-- wtraining.nab.pmml
    `-- wtraining.nab.xml
This example generates a report in the postprocess directory.
Consumer (Scoring) output
$ cat consumer/wscoring.nab.wtraining.nab.xml
<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>
Scoring Report
$ cat postprocess/wscoring.nab.wtraining.nab.xml
<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>
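Downstream tools usually pull the interesting values out of this report with a few lines of XML parsing. A sketch, assuming the report layout shown above:

```python
import xml.etree.ElementTree as ET

report = """<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>"""

# Extract (score, alert) pairs from each scored event.
root = ET.fromstring(report)
events = [(float(e.findtext("score")), e.findtext("alert") == "True")
          for e in root.findall("event")]
print(events)  # [(0.471458430077, True)]
```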
Unitable The Unitable is one of the main components of the Augustus system. Data read into Augustus is stored in a Unitable. The result is a very fast, efficient object for data shaping, model building, and scoring, in both batch and real-time contexts. It is designed to hold data in a way that allows it to be acted upon by numpy, and it takes advantage of new features and improvements that the scientific Python community puts into numpy. Unitable can be used outside of the Augustus scoring flow; a standalone example is on the wiki.
Key Features of Unitable A file format that matches the native machine memory storage of the data, allowing for memory-mapped access with no parsing or sequential reading. Fast vector operations using any number of data columns. Support for demand-driven, rule-based calculations: derived columns are defined in terms of operations on other columns, including other derived columns, and are made available when referenced.
Key Features of Unitable (cont) Can handle huge real-time data rates by automatically switching to vector mode when behind, and to scalar mode when keeping up with individual input events. Calculations can be invoked in scalar or vector mode transparently: one set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
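One way to picture "one set of rules, two modes" is a rule written once over columns and applied either to a whole array (vector mode) or to a single value (scalar mode); numpy broadcasting makes the two cases transparent. A toy sketch, not the Unitable mechanism itself (the fraction rule is made up):

```python
import numpy as np

def fraction(count, total):
    # One rule definition; numpy applies it to a whole column
    # (vector mode) or to a single event (scalar mode) transparently.
    return np.asarray(count) / total

batch  = fraction([2.0, 1.0, 3.0], 6.0)  # vector mode over a batch
single = fraction(3.0, 6.0)              # scalar mode for one event
print(batch, single)
```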
For more information Open Data Group 400 Lathrop Avenue River Forest IL 60305 708-488-8660 [email_address] http://code.google.com/p/augustus/