S.P.A.C.E.
& COWS
& SOFT. ENG.
TIM MENZIES
WVU
DEC 2011
THE COW
DOCTRINE

•  Seek the fence
   where the grass is
   greener on the
   other side.
       •  Learn from
          there
       •  Test on here


•  Don’t rely on trite
   definitions of
   “there” and “here”
       •  Cluster to find
          “here” and
          “there”

12/1/2011




                            2
THE AGE OF “PREDICTION” IS OVER

 OLDE WORLDE
 Porter & Selby, 1990
     •  Evaluating Techniques for Generating
        Metric-Based Classification Trees, JSS
     •  Empirically Guided Software Development
        Using Metric-Based Classification Trees,
        IEEE Software
     •  Learning from Examples: Generation and
        Evaluation of Decision Trees for Software
        Resource Analysis, IEEE TSE
 In 2011, Hall et al. (TSE, pre-print)
      •  reported 100s of similar studies
      •  L learners on D data sets in an M*N cross-val
 The times, they are a-changing:
 harder now to publish D*L*M*N

 NEW WORLD
 Time to lift our game
     •  No more: D*L*M*N
 Time to look at the bigger picture
     •  Topics at COW not studied, not publishable, previously:
            •  data quality
            •  user studies
            •  local learning
            •  conclusion instability
 What is your next paper?
     •  Hopefully not D*L*M*N

REALIZING AI IN SE
(RAISE’12)

            An ICSE’12 workshop submission
                •    Organizers: Rachel Harrison, Daniel
                     Rodriguez, Me

            AI in SE research
                •    Too much focus on low-hanging fruit;
                •    SE only exploring a small fraction of AI
                     technologies.

            Goal:
                •    database of sample problems that both SE
                     and AI researchers can explore, together

            Success criteria
                •    ICSE’13: meet to report papers written by
                     teams of authors from the SE & AI communities

ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications


ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications


Q1: WHY SO MUCH SE + DATA MINING?
A: INFORMATION EXPLOSION

http://CIA.vc
    •  Monitors 10K projects
    •  one commit every 17 secs

SourceForge.Net:
    •  hosts over 300K projects,

Github.com:
    •  2.9M GIT repositories

Mozilla Firefox projects :
    •  700K reports

Q1: WHY SO MUCH SE + DATA MINING?
A: WELCOME TO DATA-DRIVEN SE

 Olde worlde: large “applications” (e.g. MsOffice)
     •  slow to change, user-community locked in

 New world: cloud-based apps
     •  “applications” now 100s of services
            •  offered by different vendors
     •  The user zeitgeist can dump you and move on
            •  Thanks for nothing, Simon Cowell
      •  This changes the release planning problem
            •  What to release next…
            •  … that most attracts and retains market share

 Must mine your population
     •  To keep your population
ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications


Q2: WHY RESEARCH SE + DATA MINING?
A: NEED TO BETTER UNDERSTAND TOOLS
Q: What causes the variance in our results?
   •  Who does the data mining?
   •  What data is mined?
   •  How the data is mined (the algorithms)?
   •  Etc




Q2: WHY RESEARCH SE + DATA MINING?
A: NEED TO BETTER UNDERSTAND TOOLS
Q: What causes the variance in our results?
   •  Who does the data mining?
   •  What data is mined?
   •  How the data is mined (the algorithms)?
   •  Etc

Conclusions depend on who does the looking.
   •  Reduce the gap between user skills and tool capabilities
   •  Inductive Engineering: Zimmermann, Bird, Menzies (MALETS’11)
            •  Reflections on active projects
            •  Documenting the analysis patterns




Inductive Engineering:

Understanding user goals to inductively generate the models that most matter to the user.




Q2: WHY RESEARCH SE + DATA MINING?
A: NEED TO UNDERSTAND INDUSTRY
You are a university educator designing graduate classes for
prospective industrial inductive engineers
   •  Q: what do you teach them?

You are an industrial practitioner hiring consultants for an in-house
inductive engineering team
   •  Q: what skills do you advertise for?

You are a professional accreditation body asked to certify a graduate
program in “analytics”
   •  Q: what material should be covered?




Q2: WHY RESEARCH SE + DATA MINING?
A: BECAUSE WE FORGET TOO MUCH
Basili
   •  Story of how folks misread NASA SEL data
   •  Required researchers to visit for a week
            •  before they could use SEL data

But now, the SEL is no more:
      •  that data is lost

The only data is the stuff we can touch via its
collectors?
   •  That’s not how physics, biology, maths,
      chemistry, the rest of science does it.
   •  Need some lessons that survive after the
      institutions pass




It’s not as if we can embalm those
researchers and keep them with us forever




      Unless you are from University College
PROMISE
PROJECT
1) Conference,
2) Repository to store data from the
conference: promisedata.org/data
Steering committee:
    •  Founders: me, Jelber Sayyad
    •  Former: Gary Boetticher, Tom Ostrand,
       Guenther Ruhe,
    •  Current: Ayse Bener, me, Burak Turhan,
       Stefan Wagner, Ye Yang, Du Zhang
Open issues
    •  Conclusion instability
    •  Privacy: share without revealing;
            •  E.g. Peters & me ICSE’12
    •  Data quality issues:
            •  see talks at EASE’11 and COW’11
See also SIR (U. Nebraska) and ISBSG




ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




Q3: BUT IS DATA MINING RELEVANT
TO INDUSTRY?

   A: Which bit of industry?

   Different sectors of (say)
   Microsoft need different
   kinds of solutions

   As an educator and
   researcher, I ask “what
   can I do to make me and
   my students readier for
   the next business group
   I meet?”
                                  Microsoft research,   Other studios,
                                Redmond, Building 99    many other projects




Q3: BUT IS IT RELEVANT TO INDUSTRY?
 A: YES, MUCH RECENT INTEREST
 Business intelligence
 Predictive analytics
 NC State: Masters in Analytics (MSA)

 MSA Class                   2011       2010       2009      2008
 Graduates                   39         39         35        23
 % multiple job offers
 by graduation               97         91         90        91
 Range of salary offers      70K-140K   65K-150K   60K-115K  65K-135K

 POSITIONS OFFERED TO MSA GRADUATES:
 Credit Risk Analyst
 Data Mining Analyst
 E-Commerce Business Analyst
 Fraud Analyst
 Informatics Analyst
 Marketing Database Analyst
 Risk Analyst
 Display Ads Optimization
 Senior Decision Science Analyst
 Senior Health Outcomes Analyst
 Life Sciences Consultant
 Senior Scientist
 Forecasting and Analytics
 Sales Analytics
 Pricing and Analytics
 Strategy and Analytics
 Quantitative Analytics
 Director, Web Analytics
 Analytic Infrastructure
 Chief, Quantitative Methods Section




ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




The Problem of
Conclusion Instability
Learning from software projects
  •  only viable inside industrial development organizations?
  •  e.g. Basili at SEL
  •  e.g. Briand at Simula
  •  e.g. Mockus at Avaya
  •  e.g. Nachi at Microsoft
  •  e.g. Ostrand/Weyuker at AT&T

Conclusion instability is a repeated observation.
  •  What works here, may not work there
  •  Shull & Menzies, in “Making Software”, 2010
  •  Shepperd & Menzies: special issue, ESE, on conclusion instability

So we can’t take on conclusions from one site verbatim
  •  Need sanity checks + certification envelopes + anomaly detectors
  •  check if “their” conclusions work “here”

Even “one” site has many projects.
  •  Can one project use another’s conclusions?
  •  Finding local lessons in a cost-effective manner!
ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




GLOBALISM:
BIGGER SAMPLE IS BETTER

E.g. examples from 2 sources about 2 application types
                    Source          Gui apps     Web apps
               Green Software Inc   gui1, gui2   web1, web2,
                  Blue Sky Ltd      gui3, gui4   web3, web4

To learn lessons relevant to “gui1”
    •  Use all of {gui2, web1, web2} + {gui3, gui4, web3, web4}




GLOBALISM
& RESEARCHERS
            R. Glass, Facts and Fallacies of Software
            Engineering. Addison-Wesley, 2002.



            C. Jones, Estimating Software Costs, 2nd
            Edition. McGraw-Hill, 2007.



            B. Boehm, E. Horowitz, R. Madachy, D.
            Reifer, B. K. Clark, B. Steece, A. W.
            Brown, S. Chulani, and C. Abts, Software
            Cost Estimation with Cocomo II. Prentice
            Hall, 2000.



            A. Endres, D. Rombach, A Handbook
            of Software and Systems Engineering:
            Empirical Observations, Laws and
            Theories. Addison Wesley, 2003.

               •  50 laws: “the nuggets that must be captured
                  to improve future performance” [p3]
GLOBALISM
     & INDUSTRIAL ENGINEERS




Mind maps of
developers

Brazil (top)
from
Passos et al.
2011

USA (bottom)




                 See also, Jorgensen, TSE, 2009
(NOT) GLOBALISM
& DEFECT PREDICTION




(NOT) GLOBALISM
& EFFORT ESTIMATION
Effort = a · loc^x · y
    •  learned using Boehm’s
       methods
    •  20 * 66% samples of NASA93
    •  COCOMO attributes
    •  Linear regression (log
       pre-processor)
    •  Sort the coefficients
       found for each member
       of x, y
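A model of that shape can be fit by ordinary linear regression after a log transform. A minimal sketch under stated assumptions: synthetic noise-free data stands in for NASA93, and only the single `loc` term is fit (no COCOMO effort multipliers); `fit_loglinear` is an illustrative name.

```python
import math
import random

def fit_loglinear(locs, efforts):
    """Fit effort = a * loc^x by simple linear regression on logs:
    log(effort) = log(a) + x * log(loc)."""
    xs = [math.log(l) for l in locs]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # closed-form least-squares slope and intercept
    x = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / \
        sum((xi - mx) ** 2 for xi in xs)
    a = math.exp(my - x * mx)
    return a, x

random.seed(1)
locs = [random.uniform(10, 1000) for _ in range(50)]
efforts = [2.94 * (l ** 1.1) for l in locs]   # noise-free COCOMO-like data
a, x = fit_loglinear(locs, efforts)
print(round(a, 2), round(x, 2))               # recovers a≈2.94, x≈1.1
```

With real project data the recovered coefficients vary with the sample drawn, which is exactly the instability the slide's sorted-coefficient plot illustrates.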




CONCLUSION (ON GLOBALISM)




ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




LOCALISM:
SAMPLE ONLY FROM SAME CONTEXT

E.g. examples from 2 sources about 2 application types
                      Source          Gui apps     Web apps
                 Green Software Inc   gui1, gui2   web1, web2,
                    Blue Sky Ltd      gui3, gui4   web3, web4

To learn lessons relevant to “gui1”
    •  Restrict to just the GUI apps {gui2, gui3, gui4}
    •  Restrict to just this company {gui2, web1, web2}

Er… hang on
    •  How to find the right local context?




DELPHI LOCALIZATION
Ask an expert to find the right local
context
    •  Are we sure they’re right?
    •  Posnett et al. 2011:
            •  What is right level for
               learning?
            •  Files or packages?
            •  Methods or classes?
            •  Changes from study to
               study

And even if they are “right”:
    •  should we use those contexts?
    •  E.g. need at least 10 examples
       to learn a defect model
       (Valerdi’s rule, IEEE Trans,
       2009)
    •  17/147 = 11% of this data




CLUSTERING TO FIND “LOCAL”
TEAK: estimates from “k”
nearest-neighbors
    •  “k” auto-selected per test case
    •  Pre-processor to cluster data,
       remove worrisome regions
    •  IEEE TSE, Jan’11
       T = Tim
       E = Ekrem Kocaguneli
        A = Ayse Bener
        K= Jacky Keung
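In that spirit, a toy analogy-based estimator: an assumption-laden sketch, not the published TEAK algorithm. It auto-selects k per test case by preferring the neighborhood with the least effort variance, echoing TEAK's pruning of worrisome (high-variance) regions; `teak_estimate` is an illustrative name.

```python
def teak_estimate(train, test_features, max_k=5):
    """Toy analogy-based estimator in the spirit of TEAK: rank
    training rows by distance to the test case, pick the k (2..max_k)
    whose neighborhood has the least effort variance, and return
    that neighborhood's mean effort."""
    def dist(feats):
        return sum((a - b) ** 2 for a, b in zip(feats, test_features)) ** 0.5

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    ranked = sorted(train, key=lambda row: dist(row[0]))
    best = None
    for k in range(2, min(max_k, len(ranked)) + 1):
        efforts = [effort for _, effort in ranked[:k]]
        v = variance(efforts)
        if best is None or v < best[0]:
            best = (v, efforts)
    return sum(best[1]) / len(best[1])

# rows are (features, effort); the tight low-variance neighborhood wins
train = [((1, 1), 10), ((1, 2), 11), ((9, 9), 90)]
print(teak_estimate(train, (1, 1)))   # → 10.5
```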




ESEM’11
    •  Train within one delphi localization
    •  Or train on all and see what it picks
    •  Results #1: usually, cross as good as within




Results #2: 20 times, estimate for x in S_i.
TEAK picked across as often as it picked within




CONCLUSION (ON LOCALIZATION)
Delphi localizations
    •  Can restrict sample size
    •  Don’t know how to check if your delphi
       localizations are “right”
    •  How to learn delphi localizations for new
       domains?
    •  Not essential to inference


Auto-learned localizations
(learned via nearest neighbor methods)
    •  Works just as well as delphi
    •  Can select data from many sources
    •  Can be auto-generated for new domains
    •  Can hunt out relevant samples from data
       from multiple sources




ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




CLUSTERING + LEARNING


Turhan, Me, Bener, ESE journal ’09
    •  Nearest neighbor, defect prediction
            •  Combine data from other sources
            •  Prune to just the 10 nearest examples to each test instance
            •  Naïve Bayes on the pruned set
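The pruning step can be sketched as follows (a sketch of the relevancy-filter idea only; the Naïve Bayes step is omitted, and `relevancy_filter` is an illustrative name):

```python
def relevancy_filter(train, tests, k=10):
    """For each test instance keep its k nearest training rows
    (Euclidean distance on features); train on the union of what
    was kept, so distant, irrelevant projects are pruned away."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    keep = set()
    for t in tests:
        ranked = sorted(range(len(train)), key=lambda i: dist(train[i], t))
        keep.update(ranked[:k])
    return [train[i] for i in sorted(keep)]

# the far-away rows (10.0, 11.0) are pruned from the training set
train = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
tests = [(0.5,)]
print(relevancy_filter(train, tests, k=2))   # → [(0.0,), (1.0,)]
```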




Turhan et al. (2009)                   Me et al, ASE, 2011

Not scalable                           Near linear time processing
No generalization to report to users   Use rule learning




CLUSTERING + LEARNING
ON SE DATA
Cuadrado, Gallego, Rodriguez, Sicilia, Rubio, Crespo.
Journal of Computer Science and Technology (May ’07)
    •  EM clustering on 4 delphi localizations
             •  case tool = yes, no
             •  methodology used = yes, no
    •  Regression models, learned per
       cluster, do better than global

But why train on your own clusters?
    •    If your neighbors get better results…
    •    … train on neighbors…
    •    … test on local
    •    Training data similar to test
    •    No need for N*M-way cross val




MUST DO BETTER

 Turhan et al. (2009)                   Me et al, ASE, 2011

 Not scalable                           Near linear time processing
 No generalization to report to users   Use rule learning




Cuadrado et al. (2007)                  Me et al, ASE, 2011

Only one data set                       Need more experiments
Just effort estimation                  Why not effort and defect?
Delphi and automatic localizations?     Seek fully automated procedure
Returns regression models               Our users want actions, not trends. Navigators, not maps
Clusters on natural dimensions          What about synthesized dimensions?
Train and test on local clusters        Why not train on superior neighbors (the envy principle)
Tested via cross-val                    Train on neighbor, test on self. No 10*10-way cross val




S.P.A.C.E. = SPLIT, PRUNE

SPLIT: quadtree generation
Pick any point W; find X furthest from W;
find Y furthest from X.
XY is like PCA’s first component; found in
O(2N) time, not O(N²) time.
Each point has distances a, b to X, Y (and c = dist(X, Y)):
x = (a² + c² − b²) / 2c ;  y = √(a² − x²)
Recurse on four quadrants formed
from median(x), median(y)

PRUNE: FORM CLUSTERS
Combine quadtree leaves
with similar densities
Score each cluster by median
score of class variable
Find envious neighbors (C1, C2)
  •  score(C2) better than score(C1)
Train on C2, test on C1




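The SPLIT projection above can be sketched as follows (a minimal sketch: `fastmap_xy` and `euclid` are illustrative names, and the quadtree recursion on median(x), median(y) is left out):

```python
import math

def fastmap_xy(points, dist):
    """One FastMap-style projection, as on the slide: pick any W,
    take X furthest from W, then Y furthest from X; drop every
    point onto line XY via the cosine rule."""
    W = points[0]
    X = max(points, key=lambda p: dist(W, p))
    Y = max(points, key=lambda p: dist(X, p))
    c = dist(X, Y)
    out = []
    for p in points:
        a, b = dist(p, X), dist(p, Y)
        x = (a * a + c * c - b * b) / (2 * c)      # position along XY
        y = math.sqrt(max(a * a - x * x, 0.0))     # distance off the line
        out.append((x, y))
    return out

euclid = lambda p, q: math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))
pts = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(fastmap_xy(pts, euclid))
# collinear points stay on the line: all y values are 0.0
```

Two distance calls per point gives the O(2N) cost the slide mentions; splitting at median(x), median(y) then recurses into four quadrants.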
WHY SPLIT, PRUNE?
Unlike Turhan’09:
LogLinear clustering time:
i.e. fast and scales


Turhan et al. (2009)                    Me et al, ASE, 2011
Not scalable                            Near linear time processing
No generalization to report to users    Use rule learning

Cuadrado et al. (2007)                  Me et al, ASE, 2011
Only one data set                       Need more experiments
Just effort estimation                  Why not effort and defect?
Delphi & automatic localizations?       Seek fully automated procedure
Returns regression models               Our users want actions, not trends. Navigators, not maps
Clusters on natural dimensions          What about synthesized dimensions?
Train and test on local clusters        Why not train on superior neighbors (the envy principle)
Tested via cross-val                    Train on neighbor, test on self. No 10*10-way cross val
S.P.A.C.E. =
S.P. + ADD CONTRAST ENVY (A.C.E.)
Contrast set learning (WHICH)

Fuzzy beam search
First Stack = one rule for each discretized range of each attribute
Repeat. Make next stack as follows:
  •  Score stack entries by lift (ability to select better examples)
  •  Sort stack entries by score
  •  Next stack = old stack
        •  plus combinations of randomly selected pairs of existing rules
        •  Selection biased towards high scoring rules
Halt when top of stack’s score stabilizes
Return top of stack




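The beam search above can be sketched as follows (a toy simplification, not Menzies' WHICH implementation: scoring uses the fraction of "good" rows a rule selects as a stand-in for lift, combination always involves the top rule as a crude bias toward high scorers, and a fixed round count stands in for the stabilization test):

```python
import random

def which(rows, labels, rounds=10, seed=1):
    """Toy WHICH-style fuzzy beam search.  A rule is a frozenset of
    (attribute, value) pairs; its score is the fraction of "good"
    rows among the rows it selects."""
    rnd = random.Random(seed)

    def selects(rule, row):
        return all(row[a] == v for a, v in rule)

    def score(rule):
        picked = [g for row, g in zip(rows, labels) if selects(rule, row)]
        return sum(picked) / len(picked) if picked else 0.0

    # initial stack: one rule per attribute/value pair seen in the data
    stack = sorted({frozenset([(a, row[a])])
                    for row in rows for a in range(len(row))}, key=score)
    for _ in range(rounds):
        # combine the current best rule with randomly chosen others
        pairs = [stack[-1] | rnd.choice(stack) for _ in range(len(stack))]
        stack = sorted(set(stack) | set(pairs), key=score)
    return stack[-1], score(stack[-1])

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1, 1, 0, 0]          # attribute 0 == 0 marks the good rows
rule, s = which(rows, labels)
print(s)                       # best rule scores 1.0
```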
WHY ADD CONTRAST ENVY?

    Search criteria is adjustable
       •  See Menzies et al, ASE journal, 2010
    Early termination

Turhan et al. (2009)                    Me et al, ASE, 2011

Not scalable                            Near linear time processing
No generalization to report to users    Use rule learning

Cuadrado et al. (2007)                  Me et al, ASE, 2011

Only one data set                       Need more experiments
Just effort estimation                  Why not effort and defect?
Delphi & automatic localizations?       Seek fully automated procedure
Returns regression models               Our users want actions, not trends. Navigators, not maps
Clusters on natural dimensions          What about synthesized dimensions?
Train and test on local clusters        Why not train on superior neighbors (the envy principle)
Tested via cross-val                    Train on neighbor, test on self. No 10*10-way cross val
DATA FROM
HTTP://PROMISEDATA.ORG/DATA
Find (25,50,75,100)th percentiles of class values
    •  in examples of test set selected by global or local
Express those percentiles as ratios of max values in all.
Effort reduction = { NasaCoc, China } : COCOMO or function points
Defect reduction = { lucene, xalan, jedit, synapse, etc. } : CK metrics (OO)




When the same learner was applied globally or locally
    •  Local did better than global
    •  Death to globalism
    •  As with Cuadrado ’07: local better than global (but for multiple
       effort and defect data sets and no delphi localizations)
EVALUATION

Turhan et al. (2009)                    Me et al, ASE, 2011

Not scalable                            Near linear time processing
No generalization to report to users    Use rule learning

Cuadrado et al. (2007)                  Me et al, ASE, 2011

Only one data set                       Need more experiments
Just effort estimation                  Why not effort and defect?
Delphi & automatic localizations?       Seek fully automated procedure
Returns regression models               Our users want actions, not trends. Navigators, not maps
Clusters on natural dimensions          What about synthesized dimensions?
Train and test on local clusters        Why not train on superior neighbors (the envy principle)
Tested via cross-val                    Train on neighbor, test on self. No 10*10-way cross val
ROADMAP

Some comments on the state of the art
   •  Why so much SE + data mining?
   •  Why research SE + data mining
   •  But is data mining relevant to industry
   •  The problem of conclusion instability
Learning local
       •    Globalism: learn from all data
       •    Localism: learn from local samples
       •    Learning locality with clustering (S.P.A.C.E.)
       •    Implications




IMPLICATIONS:
GLOBALISM
Simon says, no




IMPLICATIONS:
DELPHI LOCALISM
Simon says, no




IMPLICATIONS:
CLUSTER-BASED LOCALISM
Simon says, yes




IMPLICATIONS:
CONCLUSION INSTABILITY
From this work
    •  Misguided to try and tame conclusion instability
    •  Inherent in the data




    •  Don’t tame it, use it
           •  Build lots of local models




IMPLICATIONS:
OUTLIER REMOVAL
Remove odd training items
Examples:
    •  Keung & Kitchenham, IEEE TSE, 2008: effort estimation
    •  Kim et al., ICSE’11, defect prediction
            •  case-based reasoning
            •  prune neighboring rows containing too many contradictory conclusions.
    •  Yoon & Bae, IST journal, 2010, defect prediction
            •  association rule learning methods to find frequent item sets.
            •  Remove rows with too few frequent items.
            •  Prunes 20% to 30% of rows.

Assumes a general pattern,
muddled by some outliers

But my work says
“it’s all outliers”.




IMPLICATIONS:
STRATIFIED CROSS-VALIDATION
Best to test on hold-out data
    •  That is similar to what will be
       seen in the future
    •  E.g. stratified cross validation

This work: “similar” is not a
simple matter
    •  select cross-val bins via
       clustering
            •  Train on neighboring cluster
            •  Test on local cluster

Why learn from yourself?
    •  If the grass is greener on the
       other side of the fence

    •  Learn from your better neighbors




IMPLICATIONS:
STRUCTURED LITERATURE REVIEWS




            ?

IMPLICATIONS:
SBSE-1 (A.K.A. LEAP, THEN LOOK)




   When faced with a new problem
   •  Jump off a cliff with roller skates and see where you stop.
   That is:
   •  Define objective function and use it to guide a search engine.




IMPLICATIONS:
SBSE-2 (LOOK BEFORE YOU LEAP)

 •  Split
        •    data on independent variables

 •  Prune
        •    leaf quadrants using dependent variables

 •  Contrast.
        •    Sort data in each cluster
        •    Contrast intra-cluster data between good
             and bad examples

 •  Add Envy:
        •    For each cluster C1…
        •    Find C2; i.e. the neighboring cluster
             you most envy
        •    Apply C2’s rules to C1
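The envy step above can be sketched as follows (hypothetical cluster dicts and helper names; it returns (train-on, test-on) pairs rather than running a learner):

```python
def envy(clusters, dist, score):
    """For each cluster C1, find the nearest other cluster C2 with a
    better score; train on C2, test on C1.  A cluster with nothing
    to envy trains on itself."""
    pairs = []
    for c1 in clusters:
        better = [c for c in clusters
                  if c is not c1 and score(c) > score(c1)]
        if better:
            c2 = min(better, key=lambda c: dist(c1, c))
            pairs.append((c2["name"], c1["name"]))   # (train_on, test_on)
        else:
            pairs.append((c1["name"], c1["name"]))   # nothing to envy
    return pairs

clusters = [
    {"name": "A", "centroid": (0, 0), "defects": 10},
    {"name": "B", "centroid": (1, 0), "defects": 5},
    {"name": "C", "centroid": (9, 9), "defects": 1},
]
dist = lambda c1, c2: sum((a - b) ** 2
                          for a, b in zip(c1["centroid"], c2["centroid"]))
score = lambda c: -c["defects"]          # fewer defects = better
print(envy(clusters, dist, score))
# → [('B', 'A'), ('C', 'B'), ('C', 'C')]
```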




THE COW
DOCTRINE

•  Seek the fence
   where the grass is
   greener on the
   other side.
       •  Learn from
          there
       •  Test on here


•  Don’t rely on trite
   definitions of
   “there” and “here”
       •  Cluster to find
          “here” and
          “there”





S.P.A.C.E. Exploration for Software Engineering

  • 1. S.P.A.C.E. & COWS & SOFT. ENG. TIM MENZIES WVU DEC 2011
  • 2. THE COW DOCTRINE •  Seek the fence where the grass is greener on the other side. •  Learn from there •  Test on here •  Don’t rely on trite definitions of “there” and “here” •  Cluster to find “here” and “there” 12/1/2011 2
  • 3. THE AGE OF “PREDICTION” IS OVER. OLDE WORLDE: Porter & Selby, 1990 •  Evaluating Techniques for Generating Metric-Based Classification Trees, JSS •  Empirically Guided Software Development Using Metric-Based Classification Trees, IEEE Software •  Learning from Examples: Generation and Evaluation of Decision Trees for Software Resource Analysis, IEEE TSE. In 2011, Hall et al. (TSE, pre-print) reported 100s of similar studies: L learners on D data sets in an M*N cross-val. The times, they are a-changing: harder now to publish D*L*M*N. NEW WORLD: Time to lift our game. No more D*L*M*N; time to look at the bigger picture. Topics at COW not studied, not publishable, previously: •  data quality •  user studies •  local learning •  conclusion instability. What is your next paper? •  Hopefully not D*L*M*N
  • 4. REALIZING AI IN SE (RAISE’12) An ICSE’12 workshop submission •  Organizers: Rachel Harrison, Daniel Rodriguez, Me. AI in SE research: •  Too much focus on low-hanging fruit; •  SE only exploring a small fraction of AI technologies. Goal: •  a database of sample problems that both SE and AI researchers can explore, together. Success criteria: •  ICSE’13: meet to report papers written by teams of authors from the SE & AI communities
  • 5. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 12/1/2011 5
  • 6. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 12/1/2011 6
  • 7. Q1: WHY SO MUCH SE + DATA MINING? A: INFORMATION EXPLOSION http://CIA.vc •  Monitors 10K projects •  one commit every 17 secs. SourceForge.Net: •  hosts over 300K projects. Github.com: •  2.9M GIT repositories. Mozilla Firefox projects: •  700K reports
  • 8. Q1: WHY SO MUCH SE + DATA MINING? A: WELCOME TO DATA-DRIVEN SE Olde worlde: large “applications” (e.g. MsOffice) •  slow to change, user-community locked in. New world: cloud-based apps •  “applications” now 100s of services •  offered by different vendors •  The user zeitgeist can dump you and move on •  Thanks for nothing, Simon Cowell •  This changes the release planning problem •  What to release next… •  … that most attracts and retains market share. Must mine your population •  To keep your population
  • 9. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 12/1/2011 9
  • 10. Q2: WHY RESEARCH SE + DATA MINING? A: NEED TO BETTER UNDERSTAND TOOLS Q: What causes the variance in our results? •  Who does the data mining? •  What data is mined? •  How the data is mined (the algorithms)? •  Etc 10 12/1/2011
  • 11. Q2: WHY RESEARCH SE + DATA MINING? A: NEED TO BETTER UNDERSTAND TOOLS Q: What causes the variance in our results? •  Who does the data mining? •  What data is mined? •  How the data is mined (the algorithms)? •  Etc Conclusions depend on who does the looking? •  Reduce the skills gap between user skills and tool capabilities •  Inductive Engineering: Zimmermann, Bird, Menzies (MALETS’11) •  Reflections on active projects •  Documenting the analysis patterns 11 12/1/2011
  • 12. Inductive Engineering: Understanding user goals to inductively generate the models that most matter to the user. 12 12/1/2011
  • 13. Q2: WHY RESEARCH SE + DATA MINING? A: NEED TO UNDERSTAND INDUSTRY You are a university educator designing graduate classes for prospective industrial inductive engineers •  Q: what do you teach them? You are an industrial practitioner hiring consultants for an in-house inductive engineering team •  Q: what skills do you advertise for? You are a professional accreditation body asked to certify a graduate program in “analytics” •  Q: what material should be covered?
  • 14. Q2: WHY RESEARCH SE + DATA MINING? A: BECAUSE WE FORGET TOO MUCH Basili •  Story of how folks misread NASA SEL data •  Required researchers to visit for a week •  before they could use SEL data But now, the SEL is no more: •  that data is lost The only data is the stuff we can touch via its collectors? •  That’s not how physics, biology, maths, chemistry, the rest of science does it. •  Need some lessons that survive after the institutions pass 14 12/1/2011
  • 15. Its not as if we can embalm those researchers, keep them with us forever Unless you are from University College
  • 16. PROMISE PROJECT 1) Conference, 2) Repository to store data from the conference: promisedata.org/data. Steering committee: •  Founders: me, Jelber Sayyad •  Former: Gary Boetticher, Tom Ostrand, Günther Ruhe •  Current: Ayse Bener, me, Burak Turhan, Stefan Wagner, Ye Yang, Du Zhang. Open issues: •  Conclusion instability •  Privacy: share, without revealing; •  E.g. Peters & me, ICSE’12 •  Data quality issues: •  see talks at EASE’11 and COW’11. See also SIR (U. Nebraska) and ISBSG
  • 17. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 17 12/1/2011
  • 18. Q3: BUT IS DATA MINING RELEVANT TO INDUSTRY? A: Which bit of industry? Different sectors of (say) Microsoft need different kinds of solutions As an educator and researchers, I ask “what can I do to make me and my students readier for the next business group I meet?” Microsoft research, Other studios, Redmond, Building 99 many other projects 18 12/1/2011
  • 19. Q3: BUT IS IT RELEVANT TO INDUSTRY? A: YES, MUCH RECENT INTEREST (NC State: Masters in Analytics). Positions offered to MSA graduates: Credit Risk Analyst; Business Intelligence; Data Mining Analyst; E-Commerce Business Analyst; Predictive Analytics; Fraud Analyst; Informatics Analyst; Marketing Database Analyst; Risk Analyst; Display Ads Optimization; Senior Decision Science Analyst; Senior Health Outcomes Analyst; Life Sciences Consultant; Senior Scientist, Forecasting and Analytics; Sales Analytics; Pricing and Analytics; Strategy and Analytics; Quantitative Analytics; Director, Web Analytics; Analytic Infrastructure; Chief, Quantitative Methods Section.
    MSA Class:                             2011      2010      2009      2008
    graduates:                             39        39        35        23
    % multiple job offers by graduation:   97        91        90        91
    Range of salary offers:                70K-140K  65K-150K  60K-115K  65K-135K
  • 20. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 20 12/1/2011
  • 21. The Problem of Conclusion Instability. Learning from software projects: only viable inside one industrial development organization? •  e.g. Basili at SEL •  e.g. Briand at Simula •  e.g. Mockus at Avaya •  e.g. Nachi at Microsoft •  e.g. Ostrand/Weyuker at AT&T. Conclusion instability is a repeated observation: •  What works here, may not work there •  Shull & Menzies, in “Making Software”, 2010 •  Shepperd & Menzies: special issue, ESE, conclusion instability. So we can’t take on conclusions verbatim? •  Need sanity checks + certification envelopes + anomaly detectors •  check if “their” conclusions work “here”. Even “one” site has many projects •  Can one project use another’s conclusions? •  Finding local lessons in a cost-effective manner!
  • 22. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 22 12/1/2011
  • 23. GLOBALISM: BIGGER SAMPLE IS BETTER E.g. examples from 2 sources about 2 application types. Green Software Inc has gui1, gui2 (gui apps) and web1, web2 (web apps); Blue Sky Ltd has gui3, gui4 and web3, web4. To learn lessons relevant to “gui1” •  Use all of {gui2, web1, web2} + {gui3, gui4, web3, web4}
  • 24. GLOBALISM & RESEARCHERS R. Glass, Facts and Fallacies of Software Engineering. Addison-Wesley, 2002. C. Jones, Estimating Software Costs, 2nd Edition. McGraw-Hill, 2007. B. Boehm, E. Horowitz, R. Madachy, D. Reifer, B. K. Clark, B. Steece, A. W. Brown, S. Chulani, and C. Abts, Software Cost Estimation with Cocomo II. Prentice Hall, 2000. A. Endres, D. Rombach, A Handbook of Software and Systems Engineering: Empirical Observations, Laws and Theories. Addison Wesley, 2003. •  50 laws: •  “the nuggets that must be captured to improve future performance” [p3]
  • 25. GLOBALISM & INDUSTRIAL ENGINEERS Mind maps of developers: Brazil (top), USA (bottom), from Passos et al. 2011. See also Jorgensen, TSE, 2009
  • 26. (NOT) GLOBALISM & DEFECT PREDICTION 26 12/1/2011
  • 27. (NOT) GLOBALISM & EFFORT ESTIMATION Effort = a · LOC^x · y •  learned using Boehm’s methods •  20*66% of NASA93 •  COCOMO attributes •  Linear regression (log pre-processor) •  Sort the co-efficients found for each member of x, y
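The regression step above can be sketched as follows. This is a minimal illustration, not the study's actual code: after a log transform, Effort = a · LOC^x · y becomes linear, so ordinary least squares recovers a and x (the multiplier term y is folded into the noise here, and the synthetic project data is mine).

```python
import math
import random

def fit_effort_model(projects):
    """Fit log(effort) = log(a) + x*log(loc) by least squares.
    Returns (a, x); residual effort multipliers act as noise."""
    xs = [math.log(p["loc"]) for p in projects]
    ys = [math.log(p["effort"]) for p in projects]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / \
            sum((u - mx) ** 2 for u in xs)
    intercept = my - slope * mx
    return math.exp(intercept), slope

random.seed(1)
# synthetic projects obeying effort = 2.9 * loc^1.1, plus noise
data = [{"loc": loc, "effort": 2.9 * loc ** 1.1 * random.uniform(0.9, 1.1)}
        for loc in [10, 20, 50, 100, 200, 400]]
a, x = fit_effort_model(data)
```

Running the same fit on different 66% samples of the data (as the slide describes for NASA93) shows how unstable the recovered coefficients can be.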
  • 29. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 29 12/1/2011
  • 30. LOCALISM: SAMPLE ONLY FROM SAME CONTEXT E.g. examples from 2 sources about 2 application types. Green Software Inc has gui1, gui2 (gui apps) and web1, web2 (web apps); Blue Sky Ltd has gui3, gui4 and web3, web4. To learn lessons relevant to “gui1” •  Restrict to just the gui tools {gui2, gui3, gui4} •  Restrict to just this company {gui2, web1, web2}. Er… hang on •  How to find the right local context?
  • 31. DELPHI LOCALIZATION Ask an expert to find the right local context •  Are we sure they’re right? •  Posnett et al. 2011: •  What is the right level for learning? •  Files or packages? •  Methods or classes? •  Changes from study to study. And even if they are “right”: •  should we use those contexts? •  E.g. need at least 10 examples to learn a defect model (Valerdi’s rule, IEEE Trans, 2009) •  17/147 = 11% of this data
  • 32. CLUSTERING TO FIND “LOCAL” TEAK (T = Tim, E = Ekrem Kocaguneli, A = Ayse Bener, K = Jacky Keung): estimates from “k” nearest-neighbors •  “k” auto-selected per test case •  Pre-processor to cluster data, remove worrisome regions •  IEEE TSE, Jan’11. ESEM’11: •  Train within one delphi localization •  Or train on all and see what it picks •  Results #1: usually, cross as good as within
  • 33. Results #2: 20 times, estimate for x in S_i. TEAK picked across as picked within 33 12/1/2011
  • 34. CONCLUSION (ON LOCALIZATION) Delphi localizations •  Can restrict sample size •  Don’t know how to check if your delphi localizations are “right” •  How to learn delphi localizations for new domains? •  Not essential to inference Auto-learned localizations (learned via nearest neighbor methods) •  Works just as well as delphi •  Can select data from many sources •  Can be auto-generated for new domains •  Can hunt out relevant samples from data from multiple sources 34 12/1/2011
  • 35. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 35 12/1/2011
  • 36. CLUSTERING + LEARNING Turhan, Me, Bener, ESE journal ’09 •  Nearest neighbor, defect prediction •  Combine data from other sources •  Prune to just the 10 nearest examples to each test instance •  Naïve Bayes on the pruned set.
    Turhan et al. (2009): Not scalable → Me et al, ASE, 2011: Near linear time processing
    No generalization to report to users → Use rule learning
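The Turhan ’09 relevancy filter described above can be sketched like this. It shows only the pruning step (the published method then runs Naïve Bayes on the pruned set); the distance function and the toy cross-company data are my assumptions.

```python
def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relevancy_filter(train, test_row, k=10):
    """Turhan'09-style filter: keep only the k training examples
    nearest to the test instance, so cross-company data is pruned
    to a local neighborhood before any learner runs."""
    return sorted(train, key=lambda row: euclid(row[0], test_row))[:k]

# toy cross-company data: (features, defective?)
train = [((i, i % 3), i > 10) for i in range(20)]
test_row = (2.0, 1.0)
local = relevancy_filter(train, test_row, k=10)
```

A learner trained on `local` sees only data similar to the instance being predicted, which is the whole point of the approach.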
  • 37. CLUSTERING + LEARNING ON SE DATA Cuadrado, Gallego, Rodriguez, Sicilia, Rubio, Crespo. Journal of Computer Science and Technology (May ’07) •  EM onto 4 Delphi localizations •  case tool = yes, no •  methodology used = yes, no •  Regression models, learned per cluster, do better than global. But why train on your own clusters? •  If your neighbors get better results… •  … train on neighbors… •  … test on local •  Training data similar to test •  No need for N*M-way cross val
  • 38. MUST DO BETTER
    Turhan et al. (2009): Not scalable → Me et al, ASE, 2011: Near linear time processing
    No generalization to report to users → Use rule learning
    Cuadrado et al. (2007): Only one data set → Me et al, ASE, 2011: Need more experiments
    Just effort estimation → Why not effort and defect?
    Delphi and automatic localizations? → Seek fully automated procedure
    Returns regression models → Our users want actions, not trends. Navigators, not maps
    Clusters on natural dimensions → What about synthesized dimensions?
    Train and test on local clusters → Why not train on superior neighbors (the envy principle)
    Tested via cross-val → Train on neighbor, test on self. No 10*10-way cross val
  • 39. S.P.A.C.E = SPLIT, PRUNE. SPLIT: quadtree generation •  Pick any point W; find X furthest from W; find Y furthest from X •  XY is like PCA’s first component; found in O(2N) time, not O(N2) time •  All points have distances a, b to (X, Y); with c = dist(X, Y): x = (a² + c² − b²)/2c; y = sqrt(a² − x²) •  Recurse on four quadrants formed from median(x), median(y). PRUNE: FORM CLUSTERS •  Combine quadtree leaves with similar densities •  Score each cluster by median score of class variable •  Find envious neighbors (C1, C2) •  score(C2) better than score(C1) •  Train on C2, test on C1
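One SPLIT step above can be sketched in Python. This is a minimal sketch of the FastMap-style projection the slide describes (two poles found with O(2N) distance calls, cosine-rule projection, quadrants at the medians); the function names and the random 2-D demo data are mine.

```python
import random

def dist(p, q):
    """Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def space_split(points):
    """One SPLIT step: pick W, find pole X furthest from W, pole Y
    furthest from X, project every point onto the XY axis via the
    cosine rule, then split into four quadrants at the medians."""
    w = points[0]
    x_pole = max(points, key=lambda p: dist(p, w))       # furthest from W
    y_pole = max(points, key=lambda p: dist(p, x_pole))  # furthest from X
    c = dist(x_pole, y_pole) or 1e-12
    def project(p):
        a, b = dist(p, x_pole), dist(p, y_pole)
        x = (a * a + c * c - b * b) / (2 * c)            # cosine rule
        y = max(a * a - x * x, 0.0) ** 0.5
        return x, y
    xy = [project(p) for p in points]
    mx = sorted(u for u, _ in xy)[len(xy) // 2]
    my = sorted(v for _, v in xy)[len(xy) // 2]
    quads = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for p, (u, v) in zip(points, xy):
        quads[(u >= mx, v >= my)].append(p)
    return quads

random.seed(0)
points = [(random.random(), random.random()) for _ in range(40)]
quads = space_split(points)
```

Recursing on each non-trivial quadrant yields the quadtree; PRUNE then merges leaves of similar density into clusters.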
  • 40. WHY SPLIT, PRUNE? Unlike Turhan’09: LogLinear clustering time: i.e. fast and scales. (S.P. = addressed by Split, Prune)
    S.P.  Turhan et al. (2009): Not scalable → Me et al, ASE, 2011: Near linear time processing
          No generalization to report to users → Use rule learning
    S.P.  Cuadrado et al. (2007): Only one data set → Me et al, ASE, 2011: Need more experiments
          Just effort estimation → Why not effort and defect?
          Delphi & automatic localizations? → Seek fully automated procedure
          Returns regression models → Our users want actions, not trends. Navigators, not maps
          Clusters on natural dimensions → What about synthesized dimensions?
          Train and test on local clusters → Why not train on superior neighbors (the envy principle)
          Tested via cross-val → Train on neighbor, test on self. No 10*10-way cross val
  • 41. S.P.A.C.E = S.P. + ADD CONTRAST ENVY (A.C.E.) Contrast set learning (WHICH): fuzzy beam search •  First stack = one rule for each discretized range of each attribute •  Repeat. Make next stack as follows: •  Score stack entries by lift (ability to select better examples) •  Sort stack entries by score •  Next stack = old stack •  plus combinations of randomly selected pairs of existing rules •  Selection biased towards high scoring rules •  Halt when top of stack’s score stabilizes •  Return top of stack
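The WHICH-style search above can be sketched as follows. This is a simplified sketch, not the published implementation: the stack width (20), combinations per round (10), and the bias of picking pairs from the top half are my assumptions, and `lift` is a caller-supplied scoring function standing in for the real lift measure.

```python
import random

def which(ranges, lift, rounds=20, seed=1):
    """WHICH-style fuzzy beam search: start with one rule per
    attribute range, repeatedly sort rules by lift, add unions of
    randomly chosen high-ranked pairs, and halt (early termination)
    when the top score stops improving."""
    rnd = random.Random(seed)
    stack = [frozenset([r]) for r in ranges]
    best = None
    for _ in range(rounds):
        stack = sorted(set(stack), key=lift, reverse=True)[:20]
        top = lift(stack[0])
        if best is not None and top <= best:
            break                       # top of stack has stabilized
        best = top
        half = max(2, len(stack) // 2)  # bias picks toward high scorers
        for _ in range(10):
            a, b = rnd.sample(stack[:half], 2)
            stack.append(a | b)         # combine two existing rules
    return sorted(set(stack), key=lift, reverse=True)[0]

# toy lift: reward ranges "a" and "c", penalize rule size
def toy_lift(rule):
    return sum(1 for r in rule if r in ("a", "c")) - 0.1 * len(rule)

best_rule = which(["a", "b", "c", "d"], toy_lift)
```

Because the score is just a function handed to the search, the criteria are easy to adjust, which is the point made on the next slide.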
  • 42. WHY ADD CONTRAST ENVY? Search criteria are adjustable •  See Menzies et al, ASE journal 2010 •  Early termination.
    S.P. A.C.E.  Turhan et al. (2009): Not scalable → Me et al, ASE, 2011: Near linear time processing
                 No generalization to report to users → Use rule learning
    S.P. A.C.E.  Cuadrado et al. (2007): Only one data set → Me et al, ASE, 2011: Need more experiments
                 Just effort estimation → Why not effort and defect?
                 Delphi & automatic localizations? → Seek fully automated procedure
                 Returns regression models → Our users want actions, not trends. Navigators, not maps
                 Clusters on natural dimensions → What about synthesized dimensions?
                 Train and test on local clusters → Why not train on superior neighbors (the envy principle)
                 Tested via cross-val → Train on neighbor, test on self. No 10*10-way cross val
  • 43. DATA FROM HTTP://PROMISEDATA.ORG/DATA Find (25,50,75,100)th percentiles of class values •  in examples of test set selected by global or local. Express those percentiles as ratios of max values in all. Effort reduction = { NasaCoc, China }: COCOMO or function points. Defect reduction = { lucene, xalan, jedit, synapse, etc }: CK metrics (OO). When the same learner was applied globally or locally •  Local did better than global •  Death to generalism. As with Cuadrado ’07: local better than global (but for multiple effort and defect data sets and no delphi-localizations)
  • 44. EVALUATION
    S.P. A.C.E. COW  Turhan et al. (2009): Not scalable → Me et al, ASE, 2011: Near linear time processing
                     No generalization to report to users → Use rule learning
    S.P. A.C.E. COW  Cuadrado et al. (2007): Only one data set → Me et al, ASE, 2011: Need more experiments
                     Just effort estimation → Why not effort and defect?
                     Delphi & automatic localizations? → Seek fully automated procedure
                     Returns regression models → Our users want actions, not trends. Navigators, not maps
                     Clusters on natural dimensions → What about synthesized dimensions?
                     Train and test on local clusters → Why not train on superior neighbors (the envy principle)
                     Tested via cross-val → Train on neighbor, test on self. No 10*10-way cross val
  • 45. ROADMAP Some comments on the state of the art •  Why so much SE + data mining? •  Why research SE + data mining •  But is data mining relevant to industry •  The problem of conclusion instability Learning local •  Globalism: learn from all data •  Localism: learn from local samples •  Learning locality with clustering (S.P.A.C.E.) •  Implications 45 12/1/2011
  • 49. IMPLICATIONS: CONCLUSION INSTABILITY From this work •  Misguided to try and tame conclusion instability •  Inherent in the data •  Don’t tame it, use it •  Build lots of local models
  • 50. IMPLICATIONS: OUTLIER REMOVAL Remove odd training items. Examples: •  Keung & Kitchenham, IEEE TSE, 2008: effort estimation •  Kim et al., ICSE’11, defect prediction •  case-based reasoning •  prune neighboring rows containing too many contradictory conclusions •  Yoon & Bae, IST journal, 2010, defect prediction •  association rule learning methods to find frequent item sets •  Remove rows with too few frequent items •  Prunes 20% to 30% of rows. All assume a general pattern, muddled by some outliers. But my work says “it’s all outliers”.
  • 51. IMPLICATIONS: STRATIFIED CROSS-VALIDATION Best to test on hold-out data •  That is similar to what will be seen in the future •  E.g. stratified cross validation This work: “similar” is not a simple matter •  select cross-val bins via clustering •  Train on neighboring cluster •  Test on local cluster Why learn from yourself? •  If the grass is greener on the other side of the fence •  Learn from your better neighbors 51 12/1/2011
  • 53. IMPLICATIONS: SBSE-1 (A.K.A. LEAP, THEN LOOK) When faced with a new problem •  Jump off a cliff with roller skates and see where you stop. That is: •  Define objective function and use it to guide a search engine. 53 12/1/2011
  • 54. IMPLICATIONS: SBSE-2 (LOOK BEFORE YOU LEAP) •  Split •  data on independent variables •  Prune •  leaf quadrants using dependent variables •  Contrast. •  Sort data in each cluster •  Contrast intra-cluster data between good and bad examples •  Add Envy: •  For each cluster C1… •  Find C2; i.e. the neighboring clustering you most envy •  Apply C2’s rules to C1 54 12/1/2011
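The Add Envy step above can be sketched like this. A minimal sketch under my own assumptions: clusters are represented as dicts with a centroid and a median class score, lower scores are better (as for effort or defects), and "neighboring" means nearest centroid.

```python
def envied_neighbor(clusters):
    """For each cluster C1, find the nearest cluster C2 with a better
    (lower) median class score; C2 is the neighbor C1 envies, so C2's
    rules get applied to C1. A cluster with no better neighbor keeps
    its own lessons."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    plan = {}
    for i, c1 in enumerate(clusters):
        better = [(d(c1["centroid"], c2["centroid"]), j)
                  for j, c2 in enumerate(clusters)
                  if j != i and c2["score"] < c1["score"]]
        plan[i] = min(better)[1] if better else i
    return plan

# toy clusters: (centroid, median effort/defect score)
clusters = [
    {"centroid": (0, 0), "score": 10},
    {"centroid": (1, 0), "score": 5},
    {"centroid": (5, 5), "score": 1},
]
plan = envied_neighbor(clusters)
```

The returned plan says, for each local cluster, which neighbor to learn from before testing on the local data.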
  • 55. THE COW DOCTRINE •  Seek the fence where the grass is greener on the other side. •  Learn from there •  Test on here •  Don’t rely on trite definitions of “there” and “here” •  Cluster to find “here” and “there” 55 12/1/2011