EXPANDING IDENTIFIERS TO
                NORMALIZE SOURCE
                 CODE VOCABULARY
                            PRESENTED BY DAWN LAWRIE
                           LOYOLA UNIVERSITY MARYLAND


                        IN COLLABORATION WITH DAVE BINKLEY




Friday, October 7, 11
VOCABULARY MISMATCH


                        DIFFERENT VOCABULARY IN SOURCE CODE AND OTHER
                        SOFTWARE ARTIFACTS

                        EXAMPLE

                          REQUIREMENT - “FEATURE LOCATION”

                          SOURCE CODE - “FEATURELOCATION”

                            OR WORSE     “FLOC”




PURPOSE OF NORMALIZE



                        COPE WITH VOCABULARY MISMATCH

                         SOURCE CODE

                         OTHER SOFTWARE DOCUMENTS




EXAMPLE PROBLEMS



                        CONSIDER IDENTIFIERS

                         FEATURELOCATION -> FEATURE LOCATION        SPLITTING PROBLEM

                         FLOC -> F LOC                              SPLITTING PROBLEM

                         FLOC -> FEATURE LOCATION                   SPLITTING AND
                                                                    EXPANSION PROBLEM




WHY NORMALIZE?



                        MANY SE PROBLEMS CAN BE ADDRESSED USING
                        INFORMATION RETRIEVAL (IR) TECHNIQUES

                        UN-NORMALIZED CODE LEADS TO AN UNDERESTIMATE
                        OF THE IMPORTANCE OF CRUCIAL WORDS
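A minimal sketch of the underestimate, assuming a simple term-count indexer. The code fragment, its normalized form, and the helper name `term_counts` are hypothetical and chosen only for illustration:

```python
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Count lowercase alphabetic tokens, as a very simple IR indexer might."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical code fragment and the same fragment after normalization.
raw = "floc = featurelocation(floc)"
normalized = "feature location = feature location(feature location)"

print(term_counts(raw)["feature"])         # 0
print(term_counts(normalized)["feature"])  # 3
```

In the raw fragment the word "feature" never appears as a term, so a query for it sees no evidence that the code is about features.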




NORMALIZE PROBLEM STATEMENT




                        FIND THE BEST EXPANSION OVER ALL POSSIBLE SPLITS




                            FLOC    ->    FEATURE LOCATION


NORMALIZE ALGORITHM



                        TERMINOLOGY

                         HARD-WORD - WHITEHOUSE_LAWN    (2 HARD WORDS)

                         SOFT-WORD - WHITE-HOUSE_LAWN   (3 SOFT WORDS)
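The distinction can be sketched as follows, assuming hard words are delimited by explicit markers (underscores, camelCase) and soft words are dictionary words found inside a hard word. The greedy segmentation and the tiny dictionary are illustrative assumptions, not the GenTest algorithm:

```python
import re


def hard_words(identifier: str) -> list[str]:
    """Split on explicit markers: underscores and camelCase boundaries."""
    words = []
    for part in re.split(r"_+", identifier):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def soft_words(hard_word: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-prefix segmentation; if it fails, keep the hard word whole."""
    out, i = [], 0
    while i < len(hard_word):
        for j in range(len(hard_word), i, -1):
            if hard_word[i:j] in dictionary:
                out.append(hard_word[i:j])
                i = j
                break
        else:  # no dictionary word starts at position i
            return [hard_word]
    return out


DICTIONARY = {"white", "house", "lawn"}  # hypothetical dictionary

hard = hard_words("whitehouse_lawn")
soft = [s for h in hard for s in soft_words(h, DICTIONARY)]
print(hard)  # ['whitehouse', 'lawn']
print(soft)  # ['white', 'house', 'lawn']
```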




NORMALIZE ALGORITHM


                        STRLEN    ->    STRING LENGTH




MACHINE TRANSLATION
                             APPROACH


                        EL     PAPA      VISITA     LA     IGLESIA

                        THE    FATHER    VISITS     THE    CHURCH
                               POTATO    VISITOR
                               POPE      HIT

                        STRONG COHESION




NORMALIZE ALGORITHM

       STRLEN

       S-TRLEN      ST-RLEN      STR-LEN      STRL-EN      STRLE-N

       S-T-RLEN     S-TR-LEN     S-TRL-EN     S-TRLE-N     ST-R-LEN
       ST-RL-EN     ST-RLE-N     STR-L-EN     STR-LE-N     STRL-E-N

       S-T-R-LEN    S-T-RL-EN    S-T-RLE-N    S-TR-L-EN    S-TR-LE-N
       S-TRL-E-N    ST-R-L-EN    ST-R-LE-N    ST-RL-E-N    STR-L-E-N

       S-T-R-L-EN   S-T-R-LE-N   S-T-RL-E-N   S-TR-L-E-N   ST-R-L-E-N

       S-T-R-L-E-N


       ST-RLEN      E(ST) = {SET, STOP, STRING}
                    E(RLEN) = {RIFLEMEN}        WILDCARD EXPANSION: R*L*E*N*

       STR-LEN      E(STR) = {STEER, STRING}
                    E(LEN) = {LENDER, LENGTH}
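The enumeration of splits and the wildcard expansion can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the dictionary holds only the words shown on the slides, and the wildcard `R*L*E*N*` is read as "these letters, in order, with anything in between":

```python
import re
from itertools import combinations


def all_splits(word: str) -> list[list[str]]:
    """Every way to cut `word` at its len(word) - 1 internal positions."""
    n = len(word)
    splits = []
    for k in range(n):  # k = number of cut points, 0 .. n-1
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            splits.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return splits


def wildcard_expansions(soft_word: str, dictionary: list[str]) -> list[str]:
    """Dictionary words containing the soft word's letters in order (s*t*r*)."""
    pattern = re.compile(".*".join(map(re.escape, soft_word)) + ".*")
    return [w for w in dictionary if pattern.fullmatch(w)]


# Hypothetical dictionary, limited to the words that appear on the slides.
DICTIONARY = ["lender", "length", "riflemen", "set", "steer", "stop", "string"]

print(len(all_splits("strlen")))                # 32
print(wildcard_expansions("rlen", DICTIONARY))  # ['riflemen']
print(wildcard_expansions("str", DICTIONARY))   # ['steer', 'string']
print(wildcard_expansions("len", DICTIONARY))   # ['lender', 'length']
```

A six-letter identifier has five internal cut positions, hence 2^5 = 32 candidate splits.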
NORMALIZE ALGORITHM PART I
             STR

                   (STRING, LENDER)                  (STEER, LENDER)
                 + (STRING, LENGTH)         VS     + (STEER, LENGTH)
                 = COHESION A                      = COHESION B

                        1. FIND COHESION BY SUMMING LOG OF
                             PROBABILITIES OF WORD PAIRS
                        2. SELECT EXPANSION THAT MAXIMIZES
                             COHESION

                        STR -> STRING
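The two steps can be sketched as follows. The co-occurrence probabilities and the smoothing floor are invented for illustration only; they are not values from the dataset used by the tool, and the function names are hypothetical:

```python
import math

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset({"string", "length"}): 1e-3,
    frozenset({"string", "lender"}): 1e-7,
    frozenset({"steer", "length"}): 1e-6,
    frozenset({"steer", "lender"}): 1e-7,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together


def cohesion(word: str, others: list[str]) -> float:
    """Step 1: sum the log probabilities of `word` paired with each other word."""
    return sum(
        math.log(PAIR_PROB.get(frozenset({word, other}), FLOOR))
        for other in others
    )


def best_expansion(candidates: list[str], others: list[str]) -> str:
    """Step 2: select the candidate expansion that maximizes cohesion."""
    return max(candidates, key=lambda c: cohesion(c, others))


# E(STR) = {STEER, STRING}, scored against E(LEN) = {LENDER, LENGTH}.
print(best_expansion(["steer", "string"], ["lender", "length"]))  # string
```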
NORMALIZE ALGORITHM PART II

                          STR-LEN            VS          ST-RLEN
                        STRING LENGTH                  STOP RIFLEMEN

                    1. FIND COHESION OVER EXPANSIONS
                    2. SELECT EXPANSION OF THE SPLIT
                        THAT MAXIMIZES COHESION

                        STRLEN -> STRING LENGTH
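A sketch of choosing among splits, under the same caveats: the probabilities are invented, the names are hypothetical, and raw log sums are compared directly, which is only fair here because both splits have two soft words:

```python
import math
from itertools import combinations, product

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset({"string", "length"}): 1e-3,
    frozenset({"stop", "riflemen"}): 1e-8,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together

# Candidate expansions per soft word, as listed on the slides.
EXPANSIONS = {
    "str": ["steer", "string"],
    "len": ["lender", "length"],
    "st": ["set", "stop", "string"],
    "rlen": ["riflemen"],
}


def cohesion(words: tuple[str, ...]) -> float:
    """Sum of log co-occurrence probabilities over all word pairs."""
    return sum(
        math.log(PAIR_PROB.get(frozenset(pair), FLOOR))
        for pair in combinations(words, 2)
    )


def best_for_split(split: list[str]) -> tuple[str, ...]:
    """Most cohesive combination of expansions for one split."""
    return max(product(*(EXPANSIONS[s] for s in split)), key=cohesion)


def normalize(splits: list[list[str]]) -> tuple[str, ...]:
    """Across all splits, keep the expansion with the highest cohesion."""
    return max((best_for_split(s) for s in splits), key=cohesion)


print(best_for_split(["st", "rlen"]))                 # ('stop', 'riflemen')
print(normalize([["str", "len"], ["st", "rlen"]]))    # ('string', 'length')
```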

ADDING CONTEXT

             DIR             E(DIR) = {DIRECTION, DIRECTORY}
                             CONTEXT = {FORWARD, BACKWARD}



                        FIND COHESION WITH CONTEXT WORDS IN ADDITION TO
                        EXPANSIONS OF OTHER SOFT WORDS

                        USED IN BOTH PART I AND PART II
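A sketch of how context words could enter the cohesion score. The probabilities are invented for illustration and `cohesion_with_context` is a hypothetical name:

```python
import math

# Hypothetical co-occurrence probabilities (illustrative values only).
PAIR_PROB = {
    frozenset({"direction", "forward"}): 1e-4,
    frozenset({"direction", "backward"}): 1e-4,
    frozenset({"directory", "forward"}): 1e-7,
    frozenset({"directory", "backward"}): 1e-7,
}
FLOOR = 1e-9  # assumed probability for pairs never seen together


def cohesion_with_context(
    candidate: str, other_expansions: list[str], context: list[str]
) -> float:
    """Score a candidate against other soft-word expansions and context words."""
    return sum(
        math.log(PAIR_PROB.get(frozenset({candidate, word}), FLOOR))
        for word in [*other_expansions, *context]
    )


candidates = ["direction", "directory"]  # E(DIR)
context = ["forward", "backward"]        # words found near the identifier

# DIR is a single soft word here, so there are no other expansions to score.
best = max(candidates, key=lambda c: cohesion_with_context(c, [], context))
print(best)  # direction
```

With a different context, for example file-related words, the same scoring would be expected to favor DIRECTORY instead.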




NORMALIZE IMPLEMENTATION




                        USES GenTest TO SPLIT IDENTIFIERS

                          RETURNS MULTIPLE SPLITS

                        GOOGLE 5-GRAM DATASET




EVALUATION

                    Program         LOC        SLOC       Unique Ids

                    which-2.20      3,670      2,293      487

                    a2ps-4.14       62,347     38,436     4,393


                    Program         Selected Ids   Hard Words   Soft Words

                    which-2.20      487            903          1,214

                    a2ps-4.14       211            459          618




EVALUATION

                        THREE GROUPS OF IDENTIFIERS

                          STANDARD LIBRARY CALLS

                          NAMES FROM STANDARD HEADER FILES / KEYWORDS

                          DOMAIN NAMES


                         Program         Filtered Ids   Reported Ids

                         which-2.20      152            335

                         a2ps-4.14       46             166

EXAMPLE EXPANSIONS

                        id          Top 10 Expansion    Top Expansion

                        nextchar    next_character      next_character
                        indfound    index_found_need    index_found
                        optarg      option_are_g        optarg
                        itemno      i_them_not          itemno




RESEARCH QUESTIONS



                        WHAT IS THE OVERALL ACCURACY OF NORMALIZE?

                        DOES THE VOCABULARY USED HAVE A SIGNIFICANT
                        IMPACT ON THE EXPANSION’S ACCURACY?

                        CAN THE EXPANDER INFORM THE SPLITTER?

                        CAN THE SPLITTER INFORM THE EXPANDER?




ACCURACY ON DOMAIN IDS




SOURCE OF EXPANSION WORDS



                        SOURCE CODE

                        INTERNAL DOCUMENTATION

                        MANUAL




BEST VOCABULARY SOURCE?




FUTURE WORK


                        EXPLORING DIFFERENT SOURCES OF CO-OCCURRENCE
                        DATA

                        EXPLORING DIFFERENT WAYS OF CALCULATING
                        PROBABILITIES

                        EXAMINING NORMALIZATION IN CONTEXT OF AN
                        INFORMATION RETRIEVAL TASK




SUMMARY



                        IDENTIFIERS ARE WRITTEN DIFFERENTLY THAN OTHER
                        SOFTWARE DOCUMENTS

                          DEGRADES PERFORMANCE OF IR TECHNIQUES

                        NORMALIZE CURRENTLY EXPANDS ABOUT HALF OF
                        SOFT WORDS CORRECTLY




QUESTIONS?


                         Need an identifier split?
                        GenTest Splitter available at
                            splitit.cs.loyola.edu




Natural Language Analysis - Expanding Identifiers to Normalize Source Code Vocabulary
