STAT Requirement Analysis

Requirement Analysis: The STAT Project, Milestone 1 Report

QUESTIONS
To design a framework, how many variations do we need to protect against? How many functionalities do we need to provide to support all of these variations?

Variations for importing datasets (File sources)

Variations for importing datasets (File formats)

Variations for importing datasets (Schemas)
Even if we only consider datasets in XML, each dataset may have its own schema.

Simplified approach
Observation: for the sake of comparison, researchers usually work with a few well-known datasets (e.g., Reuters, RCV-1).
One approach: a high-level Reader class per dataset:
- ReutersReader
- RCV1Reader
Once written, a reader can be shared by the community.
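The reader-per-dataset idea could be sketched roughly as follows. This is only an illustration: the abstract Reader base class, the read(String) signature, and the toy "title|body" line format are assumptions made for the example, not the STAT API and not the real Reuters file format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-ins for the RawDocument / RawCorpus concepts.
class RawDocument {
    final String title;
    final String body;

    RawDocument(String title, String body) {
        this.title = title;
        this.body = body;
    }
}

class RawCorpus {
    private final List<RawDocument> rawDocuments = new ArrayList<>();

    void addDocument(RawDocument doc) { rawDocuments.add(doc); }
    RawDocument getDocument(int index) { return rawDocuments.get(index); }
    int size() { return rawDocuments.size(); }
}

// High-level reader: one subclass per well-known dataset (assumed design).
abstract class Reader {
    abstract RawCorpus read(String source);
}

// Assumption: a toy format with one "title|body" document per line.
class ReutersReader extends Reader {
    @Override
    RawCorpus read(String source) {
        RawCorpus corpus = new RawCorpus();
        for (String line : source.split("\n")) {
            String[] parts = line.split("\\|", 2);
            if (parts.length == 2) {
                // Lines without a separator are skipped rather than guessed at.
                corpus.addDocument(new RawDocument(parts[0].trim(), parts[1].trim()));
            }
        }
        return corpus;
    }
}
```

A client only depends on Reader, so swapping in an RCV1Reader later would not change the calling code.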

Able to persist and read back memory objects
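One possible way to meet this requirement is standard Java serialization, sketched below. The SavedModel class and the Persistence helper are hypothetical names for the example; the deck does not say which persistence mechanism STAT uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical in-memory object that we want to persist.
class SavedModel implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final double[] weights;

    SavedModel(String name, double[] weights) {
        this.name = name;
        this.weights = weights;
    }
}

class Persistence {
    // Serializes an object to bytes; a file stream would work the same way.
    static byte[] persist(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads an object back from bytes produced by persist().
    static Object readBack(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```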

Able to visualize memory objects

STAT (brief) Domain Model
Note: Connector labels are omitted for brevity. Some connections are not drawn because of space limitations.

STAT framework sample code (conceptual)

Domain Concept: RawCorpus
A collection of RawDocument objects, supporting collection operations:
- Add a new RawDocument element
- Remove an existing RawDocument element
- Access elements in the collection
- …

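Domain Concept: RawCorpus
import java.util.List;

// A collection of RawDocument objects; concrete subclasses supply the behavior.
abstract class RawCorpus {
    List<RawDocument> rawDocuments;

    abstract RawDocument getDocument(int index);
    abstract void setDocument(int index, RawDocument doc);
    abstract void removeDocument(int index);
}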

Domain Concept: RawDocument
An object with one or more string fields, serving as an unprocessed, in-memory representation of a document unit:
- Like a Java bean, with getters and setters
- All fields must be of type String, even for numbers

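Domain Concept: RawDocument
// Base type: carries no fields of its own.
abstract class RawDocument {
    public RawDocument() {}
}

// Example subclass: every field is a String, even the numeric ones.
class MyRawDocument extends RawDocument {
    String title;
    String author;
    String body;
    String date;
    String numOfClicks;
    String topicType;
    …
}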

Domain Concept: Processor
An object that processes a RawCorpus and produces a Corpus.
- Linguistic: Tokenizer, Stemmer, StopRemover, PosTagger, …
- Machine learning: Feature-specific, document-specific
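A linguistic processing chain along these lines might look like the sketch below. The TokenStep interface, the whitespace Tokenizer, and the use of plain string and token lists in place of RawCorpus and Corpus are all simplifying assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical step interface: each linguistic processor rewrites a token list.
interface TokenStep {
    List<String> apply(List<String> tokens);
}

// Lower-cases the text and splits it on whitespace.
class Tokenizer {
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}

// Drops tokens that appear in a fixed stop-word set.
class StopRemover implements TokenStep {
    private final Set<String> stopWords;

    StopRemover(String... words) {
        this.stopWords = new HashSet<>(Arrays.asList(words));
    }

    @Override
    public List<String> apply(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}

// Simplification: raw document bodies in, one token list per document out.
class Processor {
    private final Tokenizer tokenizer = new Tokenizer();
    private final List<TokenStep> steps;

    Processor(List<TokenStep> steps) {
        this.steps = steps;
    }

    List<List<String>> process(List<String> rawBodies) {
        List<List<String>> processed = new ArrayList<>();
        for (String body : rawBodies) {
            List<String> tokens = tokenizer.tokenize(body);
            for (TokenStep step : steps) {
                tokens = step.apply(tokens);
            }
            processed.add(tokens);
        }
        return processed;
    }
}
```

Further steps such as a Stemmer or PosTagger would be added as extra TokenStep implementations without changing Processor.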

Domain Concept: Corpus
An object representing a collection of Document objects for use by the machine learning side of the framework. It provides the commonly used notion of splits (e.g., train, test).
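The notion of splits could be modeled as named groups of documents, as in this sketch. The Document fields, the string split names, and the method names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal Document: a label plus its processed tokens.
class Document {
    final String label;
    final List<String> tokens;

    Document(String label, List<String> tokens) {
        this.label = label;
        this.tokens = tokens;
    }
}

class Corpus {
    // Split name (e.g. "train", "test") mapped to the documents in that split.
    private final Map<String, List<Document>> splits = new LinkedHashMap<>();

    void add(String split, Document doc) {
        splits.computeIfAbsent(split, k -> new ArrayList<>()).add(doc);
    }

    // Returns a read-only view; an unknown split yields an empty list.
    List<Document> getSplit(String split) {
        List<Document> docs = splits.get(split);
        return docs == null
                ? Collections.<Document>emptyList()
                : Collections.unmodifiableList(docs);
    }

    // Total number of documents across all splits.
    int size() {
        int total = 0;
        for (List<Document> docs : splits.values()) {
            total += docs.size();
        }
        return total;
    }
}
```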

Domain Concept: Trainer
A representation of a machine learning algorithm, which can learn from a Corpus and produce a Model.

Domain Concept: Model
The object that a machine learning algorithm (i.e., a Trainer) creates to store the parameters "learned" from the data (i.e., a Corpus).

Domain Concept: Classifier
An object that maps Documents to target values (label, number, probability). It takes a Corpus and a Model as inputs and, according to the Model, produces a Prediction associated with the Corpus.

Domain Concept: Prediction
A collection of target values (label, number, probability) associated with a Corpus, i.e., with a collection of Documents.

Domain Concept: Evaluator
An object that compares a Prediction against its associated Corpus and generates an Evaluation.

Domain Concept: Evaluation
A summarized representation of the evaluation result produced by an Evaluator.
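How Trainer, Model, Classifier, Prediction, Evaluator and Evaluation fit together can be sketched end to end. Everything below is an assumption made for illustration: the Corpus is reduced to a list of gold labels, the "algorithm" is a trivial majority-class baseline, and the Evaluation holds only accuracy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Simplification: a Corpus is just the gold label of each Document.
class Corpus {
    final List<String> labels;
    Corpus(List<String> labels) { this.labels = labels; }
}

// The "learned" parameter here is a single label.
class Model {
    final String majorityLabel;
    Model(String majorityLabel) { this.majorityLabel = majorityLabel; }
}

// Majority-class baseline: learns the most frequent label in the Corpus.
class Trainer {
    Model train(Corpus corpus) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (String label : corpus.labels) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                best = label;
                bestCount = count;
            }
        }
        return new Model(best);
    }
}

// One target value per Document of the associated Corpus.
class Prediction {
    final List<String> labels;
    Prediction(List<String> labels) { this.labels = labels; }
}

class Classifier {
    Prediction classify(Corpus corpus, Model model) {
        List<String> predicted = new ArrayList<>();
        for (int i = 0; i < corpus.labels.size(); i++) {
            predicted.add(model.majorityLabel);
        }
        return new Prediction(predicted);
    }
}

// Summarized result: fraction of correctly predicted labels.
class Evaluation {
    final double accuracy;
    Evaluation(double accuracy) { this.accuracy = accuracy; }
}

class Evaluator {
    Evaluation evaluate(Prediction prediction, Corpus corpus) {
        int total = corpus.labels.size();
        int correct = 0;
        for (int i = 0; i < total; i++) {
            if (Objects.equals(prediction.labels.get(i), corpus.labels.get(i))) {
                correct++;
            }
        }
        return new Evaluation(total == 0 ? 0.0 : (double) correct / total);
    }
}
```

Each stage consumes only the output type of the previous one, so a different learning algorithm would replace Trainer and Model without touching Evaluator.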

STAT (brief) Domain Model
Note: Connector labels are omitted for brevity. Some connections are not drawn because of space limitations.
[Diagram entities: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Writer, Vocabulary]

STAT Domain Model
Note: Labels above the lines are omitted for brevity.
[Diagram entities: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Writer]

STAT Domain Model
Note: Labels above the lines are omitted for brevity.
[Diagram entities: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Document, RawDocument]
