Introduction to Augustus OVERVIEW Open Data Group September 17, 2009
Website and Community Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). It is written in Python and is freely available. http://augustus.googlecode.com
Getting Augustus Releases can be downloaded from the website under the Download tab. Current releases are also listed in the Featured sidebar on the main page. Augustus can also be checked out directly from source control; we use Subversion, and project members can be granted commit access.
Source All of the source files are viewable online with markup and revision history. The raw version of each file is also available. http://augustus.googlecode.com/source/browse
Documentation and Community WIKI The wiki is intended for people who want to install Augustus for use and possibly develop new features. FORUM The forum is open for any general discussion regarding Augustus.
Using Augustus Model Development Use Cycle Work Flow
Development and Use Cycle The typical model development and use cycle with Augustus is as follows: Identify suitable data with which to construct a new model. Provide a model schema which prescribes the requirements for the model. Run the Augustus producer to obtain a new model. Run the Augustus consumer on new data to effect scoring.
Development and Use Cycle (diagram): 1. Data Inputs; 2. Model Schema
Running Augustus (diagram): 3. Obtain new model with Producer; 4. Score with Consumer
Work Flows Augustus is typically used to construct models and score data with models. Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.
Components Pre-processing Producers Consumers Post-Processing
Producers and Consumers The Producers and Consumers require configuration with XML-formatted files. Supplying the schema, configuration, and training data to the Producer yields a completely specified model. The Consumers provide some configurability of the output, but post-processing can be used to render the output according to the user's needs.
Post Processing Augustus can accommodate a post-processing step. While not necessary, this is often useful to: Re-normalize the scoring results or perform an additional transformation. Supplement the results with global meta-data such as timestamps. Format the results. Select certain interesting values from the results. Restructure the data for use with other applications.
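As a concrete illustration of the first two items above, the sketch below re-normalizes scores and stamps each result with a timestamp. The record layout and function name are hypothetical, not part of the Augustus API.

```python
import datetime

def postprocess(results, max_score):
    # Hypothetical record layout: each result is a dict with a "score" key.
    stamped = []
    now = datetime.datetime.utcnow().isoformat()
    for rec in results:
        out = dict(rec)
        out["score"] = rec["score"] / max_score  # re-normalize into [0, 1]
        out["timestamp"] = now                   # global meta-data
        stamped.append(out)
    return stamped

processed = postprocess([{"score": 0.25}, {"score": 0.5}], max_score=0.5)
print(processed)
```

The same shape of loop would cover the remaining items as well: formatting, selecting interesting values, or restructuring records for other applications.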
Segments Segments are covered elsewhere, but Augustus supports them, and segmentation can be described at the Producer level. Augustus was originally written to an Open Data draft RFC for segmented models; Augustus 0.3.x conforms to that RFC. PMML 4 formalized the specification for segments and deviates somewhat from the RFC; Augustus 0.4.x conforms to this standard. Augustus 0.3.x and 0.4.x both support segments; they differ in how they handle them.
Result of Scoring
Case Study: Auto Auto is an example distributed with Augustus, found in the examples directory. It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records. The examples use two types of data files. The data consists of records with three entries: Date, Color, and Automaker. The Weighted examples have an additional 'weight' column, named Count. The Count field records the number of occurrences of identical tuples in the non-weighted data and collapses them into one record.
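The relationship between the two data file types can be sketched in a few lines: identical (Date, Color, Automaker) tuples in the non-weighted data collapse into one weighted record. This is an illustrative reconstruction with made-up rows, not code from the example.

```python
from collections import Counter

# Non-weighted data: one record per observation (made-up rows).
records = [
    ("2009-09-01", "red",  "Toyota"),
    ("2009-09-01", "red",  "Toyota"),
    ("2009-09-02", "blue", "Ford"),
]

# Collapse identical tuples into one record, appending a Count column.
weighted = [row + (count,) for row, count in sorted(Counter(records).items())]
print(weighted)  # [('2009-09-01', 'red', 'Toyota', 2), ('2009-09-02', 'blue', 'Ford', 1)]
```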
Work Flow Overview
Auto: Weighted Batch Using the Baseline for Training:
$ cd WeightedBatch
`-- scripts
    |-- consume.py
    |-- postprocess.py
    `-- produce.py
http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch
Input for the Producer The Producer takes the training data set. In the code, we have declared how we want to test the data:
import augustus.modellib.baseline.producer.Producer as Producer
def makeConfigs(inFile, outFile, inPMML, outPMML):
    # open data file
    inf = uni.UniTable().fromfile(inFile)
    # start the configuration file
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")
Input for the Producer, Continued
    # use a discrete distribution model for test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # create the segmentation declarations for the two fields at this level
    ''' Taken out for the example; other Use Cases will focus on Segments
    segmentation = ET.SubElement(test, "segmentation")
    makeSegment(inf, segmentation, "Color")
    '''
    # output the configuration file
    tree = ET.ElementTree(root)
    tree.write(outFile)
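Pieced together, the configuration-writing fragments above amount to something like the following self-contained sketch. It assumes xml.etree.ElementTree as ET and a <model> root element matching the training XML shown later; the file names are placeholders, and the real produce.py does more (PMML skeleton, segmentation, timing).

```python
import xml.etree.ElementTree as ET

def makeConfigs(inFile, outFile, inPMML, outPMML):
    # start the configuration file: <model> points at input/output PMML
    root = ET.Element("model")
    root.set("input", inPMML)
    root.set("output", outPMML)
    # declare how we want to test the data
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")
    # use a discrete distribution model for the test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # output the configuration file
    ET.ElementTree(root).write(outFile)

makeConfigs("../data/wtraining.nab", "wtraining.nab.xml",
            "../producer/wtraining.nab.pmml", "../consumer/wtraining.nab.pmml")

# Round-trip check: read back what was written.
cfg = ET.parse("wtraining.nab.xml").getroot()
print(cfg.tag, cfg.find("test").get("threshold"))
```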
Running the Producer (Training)
$ cd scripts
$ python2.5 produce.py -f wtraining.nab -t20
(0.000 secs)  Beginning timing
(0.000 secs)  Creating configuration file
(0.001 secs)  Creating input PMML file
(0.001 secs)  Starting producer
(0.000 secs)  Inputting configurations
(0.001 secs)  Inputting model
(0.008 secs)  Collecting stats for baseline distribution
(0.011 secs)  Events 20.067% processed
(0.009 secs)  Events 40.134% processed
(0.009 secs)  Events 60.201% processed
(0.009 secs)  Events 80.268% processed
(0.009 secs)  Events 100.000% processed
(0.000 secs)  Making test distributions from statistics
(0.002 secs)  Outputting PMML
(0.062 secs)  Lifetime of timer
Model generated by the Producer
<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>
Model generated by the Producer (Cont) The structure is determined by code in producer.py:
def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")
It then goes on for each Data and Mining Field:
    dataField = ET.SubElement(dataDict, "DataField")
    dataField.set("name", "Automaker")
    dataField.set("optype", "categorical")
    dataField.set("dataType", "string")
    . . .
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    miningField = ET.SubElement(miningSchema, "MiningField")
    miningField.set("name", "Automaker")
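A minimal self-contained version of that PMML-building code might look like the sketch below. It folds the per-field repetition into a loop and again assumes xml.etree.ElementTree; it is an illustration, not the actual producer.py.

```python
import xml.etree.ElementTree as ET

def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")
    baselineModel = ET.SubElement(root, "BaselineModel")
    baselineModel.set("functionName", "baseline")
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    # one DataField / MiningField pair per input field
    fields = [("Automaker", "categorical", "string"),
              ("Color", "categorical", "string"),
              ("Count", "continuous", "float")]
    for name, optype, dataType in fields:
        dataField = ET.SubElement(dataDict, "DataField")
        dataField.set("name", name)
        dataField.set("optype", optype)
        dataField.set("dataType", dataType)
        miningField = ET.SubElement(miningSchema, "MiningField")
        miningField.set("name", name)
    ET.ElementTree(root).write(outFile)

makePMML("wtraining.nab.pmml")
pmml = ET.parse("wtraining.nab.pmml").getroot()
print(pmml.get("version"), len(pmml.find("DataDictionary")))
```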
Producer Output The training step used the code in producer.py to generate a model and get expected results. Training generated the following files:
.
|-- consumer
|   `-- wtraining.nab.pmml   MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
`-- producer
    |-- wtraining.nab.pmml   BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
    `-- wtraining.nab.xml    MODEL FILE USED FOR TRAINING
Training XML This provides: the model with expected values from Training that is used when we score, the test distribution, and the baseline data and how it is to be handled.
$ cat producer/wtraining.nab.xml
<model input="../producer/wtraining.nab.pmml" output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold" threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab" type="UniTable" />
  </test>
</model>
Unitable Unitable is used to hold the data that is read in. It encapsulates the data in a way that allows us to manipulate it efficiently. It can be thought of, in part, as a data structure holding a spreadsheet of data, with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure. More to follow.
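The spreadsheet analogy can be made concrete with a toy column-oriented table: a mapping from column names to equal-length columns. UniTable itself keeps its columns as numpy arrays so whole-column operations are fast; the dict-of-lists below only illustrates the idea and is not the UniTable API.

```python
# Toy column-oriented table: one column per field, rows aligned by index.
table = {
    "Date":      ["2009-09-01", "2009-09-01", "2009-09-02"],
    "Color":     ["red", "red", "blue"],
    "Automaker": ["Toyota", "Toyota", "Ford"],
    "Count":     [2.0, 1.0, 3.0],
}

# A whole-column ("vector") operation: total weight over all records.
total = sum(table["Count"])

# A derived column defined in terms of an existing column.
table["Fraction"] = [c / total for c in table["Count"]]
print(total, table["Fraction"])
```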
Running the Consumer
$ cd scripts
$ python2.5 consume.py -b wtraining.nab -f wscoring.nab
Ready to score
.
|-- consumer
|   |-- wscoring.nab.wtraining.nab.xml
|   `-- wtraining.nab.pmml
|-- postprocess
|   `-- wscoring.nab.wtraining.nab.xml
`-- producer
    |-- wtraining.nab.pmml
    `-- wtraining.nab.xml
This example generates a report in the postprocess directory.
Consumer (Scoring) output
$ cat consumer/wscoring.nab.wtraining.nab.xml
<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>
Scoring Report
$ cat postprocess/wscoring.nab.wtraining.nab.xml
<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>
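Downstream tools usually pull the interesting values out of this report with a few lines of XML parsing. A sketch, assuming the report layout shown above:

```python
import xml.etree.ElementTree as ET

report = """<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>"""

# Extract (score, alert) pairs from each scored event.
root = ET.fromstring(report)
events = [(float(e.findtext("score")), e.findtext("alert") == "True")
          for e in root.findall("event")]
print(events)  # [(0.471458430077, True)]
```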
Unitable The Unitable is one of the main components of the Augustus system. Data read into Augustus is stored in a Unitable. The result is a very fast, efficient object for data shaping, model building, and scoring, in both batch and real-time contexts. It is designed to hold data in a way that allows it to be acted upon by numpy, and it takes advantage of new features and improvements that the scientific Python community puts into numpy. Unitable can be used outside of the Augustus scoring flow; a standalone example is on the wiki.
Key Features of Unitable A file format that matches the native machine memory storage of the data, allowing for memory-mapped access with no parsing or sequential reading. Fast vector operations using any number of data columns. Support for demand-driven, rule-based calculations: derived columns are defined in terms of operations on other columns, including other derived columns, and are made available when referenced.
Key Features of Unitable (cont) Can handle huge real-time data rates by automatically switching to vector mode when behind, and to scalar mode when keeping up with individual input events. Calculations can be invoked in scalar or vector mode transparently: one set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
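One way to picture "one set of rules, two modes" is a rule written once over columns and applied either to a whole array (vector mode) or to a single value (scalar mode); numpy broadcasting makes the two cases transparent. A toy sketch, not the Unitable mechanism itself (the fraction rule is made up):

```python
import numpy as np

def fraction(count, total):
    # One rule definition; numpy applies it to a whole column
    # (vector mode) or to a single event (scalar mode) transparently.
    return np.asarray(count) / total

batch  = fraction([2.0, 1.0, 3.0], 6.0)  # vector mode over a batch
single = fraction(3.0, 6.0)              # scalar mode for one event
print(batch, single)
```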
For more information Open Data Group 400 Lathrop Avenue River Forest IL 60305 708-488-8660 [email_address] http://code.google.com/p/augustus/