FooCodeChu
Services for software analysis, malware
detection, and vulnerability research

Silvio Cesare <silvio.cesare@gmail.com>
Who am I and why this talk?
• Ph.D. Student at Deakin University

• Book Author

• This talk covers some of my publicly accessible
  Ph.D. research.
Introduction
• Research on software analysis, similarity, and
  classification
 ▫   Malware detection and attribution
 ▫   Incident response
 ▫   Plagiarism detection
 ▫   Software theft detection
 ▫   Vulnerability research

• Three academic research tools free to use on my
  website.
Outline
• Simseer

• Clonewise

• Bugwise

• Future Work and Conclusion
Software similarity and visualisation
Motivation
• Many applications of software similarity
  ▫ Malware detection
  ▫ Plagiarism detection
  ▫ Software theft detection

• Traditional string signatures are ineffective

• Modern fingerprints are effective but in many cases
  inefficient
Program Representation
    lea     0x4(%esp),%ecx
    and     $0xfffffff0,%esp
    pushl   -0x4(%ecx)
    push    %ebp
    mov     %esp,%ebp
    push    %ecx
    sub     $0x24,%esp
    call    4011b0 <___main>
    movl    $0x0,-0x8(%ebp)
    jmp     40115f <_main+0x2f>

    movl    $0x4020a0,(%esp)
    call    4011b8 <_puts>
    addl    $0x1,-0x8(%ebp)

    cmpl    $0x9,-0x8(%ebp)
    jle     40114f <_main+0x1f>

    add     $0x24,%esp
    pop     %ecx
    pop     %ebp
    lea     -0x4(%ecx),%esp
    ret

[Figure: the disassembly above split into basic blocks, alongside a graph of procedures Proc_0 to Proc_4]
Simseer Program Fingerprint
• Set of control flow graphs
• Many procedures
Decompilation of a Control Flow Graph
proc(){
L_0:
  while (v1 || v2) {
L_1:
    if (v3) {
L_2:
    } else {
L_4:
    }
L_5:
  }
L_7:
  return;
}

Resulting string: W|IEH}R

[Figure: control flow graph with nodes L_0 to L_7 and edges labelled "true"]
Q-Grams
• Input is decompiled strings

• Extract all possible fixed-size substrings (q-grams)

• Train 500 dominant q-grams
Example: W|IEH}R → W|IE, |IEH, IEH}, EH}R
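A minimal Python sketch of the extraction step, using the example string from this slide. The function name `qgrams` and the choice q = 4 are illustrative assumptions, not taken from Simseer's code.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return every contiguous substring of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


# The decompiled control flow graph string from the previous slide.
print(qgrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

A string shorter than q yields no q-grams at all.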
Program Similarity
• 500 q-grams make a “feature vector”

• Similarity using vector distance
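One way to picture the feature vector and the distance, as a hedged Python sketch. The slides do not say which vector distance is used, so Euclidean distance is an assumption here, and the four-entry `vocabulary` stands in for the 500 trained q-grams.

```python
import math
from collections import Counter


def feature_vector(s, vocabulary, q=4):
    """Count how often each vocabulary q-gram occurs in s."""
    counts = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    return [counts[g] for g in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Stand-in vocabulary; the real system trains 500 dominant q-grams.
vocabulary = ["W|IE", "|IEH", "IEH}", "EH}R"]
a = feature_vector("W|IEH}R", vocabulary)   # [1, 1, 1, 1]
b = feature_vector("W|IEH}", vocabulary)    # [1, 1, 1, 0]
print(euclidean(a, b))                      # 1.0
```

A smaller distance means the two programs share more of the dominant q-grams.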
Software similarity search
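[Figure: similarity search. A query p is compared with stored programs q and r using distance(p,q); queries close to known malware are labelled malicious, others benign.]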
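The search can be sketched in Python as a nearest-neighbour lookup with a cut-off. The linear scan, the `threshold` parameter, and the family labels are assumptions made for illustration; the slides do not describe the index structure that is actually used.

```python
import math


def nearest(query, database):
    """Return (label, distance) for the stored vector closest to query."""
    return min(
        ((label, math.dist(query, vec)) for label, vec in database),
        key=lambda pair: pair[1],
    )


def classify(query, malware_db, threshold):
    """Label a query malicious if it lies within threshold of known malware."""
    label, distance = nearest(query, malware_db)
    if distance <= threshold:
        return ("malicious", label)
    return ("benign", None)


# Hypothetical database of feature vectors for two malware families.
malware_db = [("family_a", [1, 1, 1, 1]), ("family_b", [0, 5, 0, 0])]
print(classify([1, 1, 1, 0], malware_db, threshold=1.5))
print(classify([9, 9, 9, 9], malware_db, threshold=1.5))
```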
Future Work
• Give access to more classes of program
  “fingerprints”
 ▫ Call graphs
 ▫ Opcodes
 ▫ Different similarity measures
Simseer summary
• Simseer is effective

• Efficient

• Web service is free for public use
Detecting package clones and inferring security problems
Motivation
• Developers may “embed” or “clone” software
  from 3rd party sources
 ▫ Maintaining an internal copy of a library
 ▫ Forking a library

• Clonewise detects if two packages share code
• And if one package is entirely embedded in
  another.
[Figure: Firefox Vulnerabilities and libpng Vulnerabilities]
Feature Extraction – Shared package clone detection
                1.    N_Filenames_A
                2.    N_Filenames_Source_A
                3.    N_Filenames_B
                4.    N_Filenames_Source_B
                5.    N_Common_Filenames
                6.    N_Common_Similar_Filenames
                7.    N_Common_FilenameHashes
                8.    N_Common_FilenameHash80
                9.    N_Common_ExactFilenameHash
                10.   N_Score_of_Common_Filename
                11.   N_Score_of_Common_Similar_Filename
                12.   N_Score_of_Common_FilenameHash
                13.   N_Score_of_Common_FilenameHash80
                14.   N_Score_of_Common_ExactFilenameHash80
                15.   N_Data_Common_Filenames
                16.   N_Data_Common_Similar_Filenames
                17.   N_Data_Common_FilenameHashes
                18.   N_Data_Common_FilenameHash80
                19.   N_Data_Common_ExactFilenameHash
                20.   N_Data_Score_of_Common_Filename
                21.   N_Data_Score_of_Common_Similar_Filename
                22.   N_Data_Score_of_Common_FilenameHash
                23.   N_Data_Score_of_Common_FilenameHash80
                24.   N_Data_Score_of_Common_ExactFilenameHash80
                25.   N_Common_ExactHash
                26.   N_Common_DataExactHash
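A hedged Python sketch of how the first few features might be computed from two packages' file lists. The feature meanings are inferred from their names only, and `SOURCE_EXTENSIONS` is an assumed definition of a "source" file; neither is confirmed by the slides.

```python
import os

# Assumption: which extensions count as source files.
SOURCE_EXTENSIONS = (".c", ".h", ".cc", ".cpp")


def shared_clone_features(files_a, files_b):
    """Compute features 1 to 5 for a pair of packages A and B."""
    names_a = {os.path.basename(p) for p in files_a}
    names_b = {os.path.basename(p) for p in files_b}
    return {
        "N_Filenames_A": len(files_a),
        "N_Filenames_Source_A": sum(p.endswith(SOURCE_EXTENSIONS) for p in files_a),
        "N_Filenames_B": len(files_b),
        "N_Filenames_Source_B": sum(p.endswith(SOURCE_EXTENSIONS) for p in files_b),
        "N_Common_Filenames": len(names_a & names_b),
    }


# Hypothetical file lists for a library and a package that embeds it.
pkg_a = ["src/png.c", "src/png.h", "README"]
pkg_b = ["lib/libpng/png.c", "lib/libpng/png.h", "main.c", "Makefile"]
print(shared_clone_features(pkg_a, pkg_b))
```

The resulting dictionary, flattened in a fixed key order, is one point in the feature space the classifier works on.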
Classification
• Consider feature vectors as n-dimensional
  points in space.

• Linear classifiers

• Non-linear classifiers

• Decision trees
[Figure: feature-vector points separated into Class A and Class B]
Feature Extraction – Embedded clone detection
               1.  N_Filenames_A
               2.  N_Filenames_Source_A
               3.  N_Filenames_B
               4.  N_Filenames_Source_B
               5.  Percent_Match_In_A
               6.  Percent_Data_Match_In_A
               7.  Percent_Match_In_B
               8.  Percent_Data_Match_In_B
               9.  Percent_Score_In_A
               10. Percent_Data_Score_In_A
               11. Percent_Score_In_B
               12. Percent_Data_Score_In_B
               13. A_Has_Lib_In_Name
               14. B_Has_Lib_In_Name
               15. A_To_B_Ratio
               16. A_To_B_Data_Ratio
               17. N_Dependents_A
               18. N_Dependents_B
Detecting copyright violations

1. Identify embedded package clones.
2. Extract license information of each package.
3. For each GPL licensed embedded package
   clone:
  ▫ Verify that the package it is embedded in is
     not licensing it under a permissive license.
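The three steps above could be sketched in Python as follows. The package names, the licence identifiers, and the `PERMISSIVE` set are illustrative assumptions; real licence metadata is messier than a single string per package.

```python
# Assumption: a small set of licence identifiers treated as permissive.
PERMISSIVE = {"MIT", "BSD-3-Clause", "Apache-2.0"}


def license_conflicts(embedded_clones, licenses):
    """Return (outer, inner) pairs where GPL code sits in a permissive package.

    embedded_clones: iterable of (outer_package, embedded_package) pairs.
    licenses: mapping from package name to a licence identifier.
    """
    conflicts = []
    for outer, inner in embedded_clones:
        inner_is_gpl = licenses.get(inner, "").startswith("GPL")
        if inner_is_gpl and licenses.get(outer) in PERMISSIVE:
            conflicts.append((outer, inner))
    return conflicts


# Hypothetical input: step 1 found these clones, step 2 these licences.
clones = [("app-one", "libgpl"), ("app-two", "libgpl"), ("app-one", "libmit")]
licenses = {"app-one": "MIT", "app-two": "GPL-3.0", "libgpl": "GPL-2.0", "libmit": "MIT"}
print(license_conflicts(clones, licenses))
```

Each reported pair is only a candidate for manual review, not a confirmed violation.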
Automated Vulnerability Inference
1. Take CVE, match CPE name to Debian package.

2. Parse CVE summary and extract vuln filename.

3. Find clones of package with similar filename.

4. Trim dynamically linked clones.

5. Is the vulnerability-affected clone already being tracked?
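Steps 2 to 5 can be sketched in Python as below. The CVE summary text and all package names are invented for illustration, "similar filename" is simplified to an exact basename match, and step 1 (mapping the CPE name to a Debian package) is assumed to have been done already.

```python
import os
import re


def extract_filenames(summary):
    """Step 2: pull C/C++ source filenames out of a CVE summary."""
    return re.findall(r"[\w\-/]+\.(?:cpp|cc|c|h)\b", summary)


def untracked_clones(vuln_files, clone_files, dynamically_linked, tracked):
    """Steps 3 to 5: clones containing a vulnerable file, minus known cases.

    clone_files: mapping from package name to the files it embeds.
    dynamically_linked: packages that link the library instead of copying it.
    tracked: packages already recorded as affected.
    """
    hits = []
    for pkg, files in clone_files.items():
        if pkg in dynamically_linked or pkg in tracked:
            continue
        if {os.path.basename(f) for f in files} & set(vuln_files):
            hits.append(pkg)
    return sorted(hits)


# Invented summary in the style of a CVE entry.
summary = "Buffer overflow in foo_parse.c in libfoo before 1.2 allows remote attackers to crash the application."
vuln_files = extract_filenames(summary)
clone_files = {
    "app-one": ["third_party/foo/foo_parse.c"],
    "app-two": ["src/main.c"],
    "app-three": ["vendor/foo_parse.c"],
    "app-four": ["ext/foo_parse.c"],
}
print(untracked_clones(vuln_files, clone_files, {"app-three"}, {"app-four"}))
```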
Package clone detection use-case
Finding Vulnerabilities
Shared package clone evaluation

 Classifier             TP/FN     FP/TN       TP Rate   FP Rate
 Naïve Bayes            439/322   484/56296   57.69%    0.85%
 Multilayer Perceptron  204/557   48/56732    26.81%    0.08%
 C4.5                   523/238   86/56694    68.73%    0.15%
 Random Forest          533/228   60/56720    70.04%    0.11%
 Random Forest (0.8)    446/315   15/56765    58.61%    0.03%
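The rate columns follow directly from the count columns: TP Rate = TP / (TP + FN) and FP Rate = FP / (FP + TN). A short Python check using the Random Forest row (the function name `rates` is just for illustration):

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate)."""
    return tp / (tp + fn), fp / (fp + tn)


# Random Forest row: TP/FN = 533/228, FP/TN = 60/56720.
tpr, fpr = rates(533, 228, 60, 56720)
print(f"{tpr:.2%} {fpr:.2%}")  # 70.04% 0.11%
```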
Embedded clone detection evaluation

 Classifier             TP/FN     FP/TN       TP Rate   FP Rate
 Naïve Bayes            718/43    6341/2808   94.35%    69.31%
 Multilayer Perceptron  328/433   108/9041    43.10%    1.18%
 C4.5                   572/189   69/9080     75.16%    0.75%
 Random Forest          554/207   68/9081     72.80%    0.74%
 Asymmetric Bagging     699/62    615/8534    91.86%    6.72%
Automatic detection of suspicious clones

    PACKAGE           EMBEDDED PACKAGE
    freevo            feedparser
    hedgewars         freetype
    ia32-libs         *
    libtk-img         tiff
    likewise-open     curl
    luatex            poppler
    planet-venus      feedparser
    syslinux          libpng
    vnc4              freetype
    vtk               tiff
Future Work
• Binary-level clone detection

• Integrate into Linux distributions

• Use by Linux security teams
Clonewise summary
• Practical clone detection in Linux

• Improves on manual-only tracking

• Has found bugs

• Debian wants to integrate it into its infrastructure

• Open source project

• Web service to perform clone detection
Detecting bugs in binaries using decompilation and data flow analysis
Motivation
• Detecting bugs in binaries is useful
 ▫   Black-box penetration testing
 ▫   External audits and compliance
 ▫   Quality assurance of 3rd party software
 ▫   Verification of compilation and linkage
Wire – A formal language for binary analysis
• x86 is complex and big

• Wire is a low level RISC assembly style language

• Translated from x86

• Formally defined operational semantics



                  The LOAD instruction implements a memory read.
Stack Pointer Inference
• Proposed in HexRays decompiler -
  http://www.hexblog.com/?p=42

• Estimate the Stack Pointer (SP) in and out of each basic block
  ▫ By tracking and estimating SP modifications using linear
    inequalities
• Solve the inequalities.

[Figure: picture from HexRays blog]
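A deliberately simplified Python sketch of the idea. It propagates a constant SP offset through the control flow graph instead of setting up and solving linear inequalities, so it only handles blocks whose SP changes are known constants. The instruction encoding (`push`, `pop`, `sub_sp`, `add_sp`) and the block names are assumptions made for this example.

```python
def block_delta(instrs):
    """Net SP change of one basic block (32-bit x86, 4-byte stack slots)."""
    delta = 0
    for op, arg in instrs:
        if op == "push":
            delta -= 4
        elif op == "pop":
            delta += 4
        elif op == "sub_sp":
            delta -= arg
        elif op == "add_sp":
            delta += arg
    return delta


def propagate_sp(blocks, succ, entry):
    """Assign SP-in and SP-out offsets, relative to SP at function entry."""
    sp_in, sp_out = {entry: 0}, {}
    work = [entry]
    while work:
        b = work.pop()
        sp_out[b] = sp_in[b] + block_delta(blocks[b])
        for s in succ.get(b, []):
            if s not in sp_in:  # first value reaching a block wins
                sp_in[s] = sp_out[b]
                work.append(s)
    return sp_in, sp_out


# Hypothetical function: prologue, a loop body, and an epilogue.
blocks = {
    "entry": [("push", None), ("sub_sp", 0x24)],
    "loop": [],
    "exit": [("add_sp", 0x24), ("pop", None)],
}
succ = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
sp_in, sp_out = propagate_sp(blocks, succ, "entry")
print(sp_in, sp_out)
```

The inequality formulation is needed precisely where this sketch gives up, for example when a call's effect on SP is unknown.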
Decompilation - Local Variable Recovery
  • Based on stack pointer inference
  • Find memory accesses at offsets from the stack pointer
  • Replace them with native Wire registers

Before:
Imark     ($0x80483f5, , )
AddImm32  (%esp(4), $0x1c, %temp_memreg(12c))
LoadMem32 (%temp_memreg(12c), , %temp_op1d(66))
Imark     ($0x80483f9, , )
StoreMem32(%temp_op1d(66), , %esp(4))
Imark     ($0x80483fc, , )
SubImm32  (%esp(4), $0x4, %esp(4))
LoadImm32 ($0x80483fc, , %temp_op1d(66))
StoreMem32(%temp_op1d(66), , %esp(4))
Lcall     (, , $0x80482f0)

After:
Imark     ($0x80483f5, , )
Imark     ($0x80483f9, , )
Imark     ($0x80483fc, , )
Free      (%local_28(186bc), , )
Data Flow Analysis - Reaching Definitions
• A reaching definition is a definition of a variable
  that reaches a program point without being
  redefined.
[Figure: control flow graph]
  Entry block:     X=1; Y=3
  Branch X>2:      X=2; Print(X)
  Branch X<=2:     Print(X)
  Join block:      Print(X)
  At the join, Y=3, X=1, and X=2 are reaching definitions
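The example on this slide can be reproduced with a small iterative solver in Python. This is a textbook formulation, not Bugwise's implementation; the block names and the assumption that X=2 sits on the X>2 branch are read off the figure layout.

```python
def reaching_definitions(blocks, preds):
    """Return, per block, the set of (variable, definition) pairs reaching its entry.

    blocks: name -> ordered list of (variable, definition_label) assignments.
    preds:  name -> list of predecessor block names.
    """
    in_sets = {b: set() for b in blocks}
    out_sets = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, stmts in blocks.items():
            in_b = set().union(*(out_sets[p] for p in preds.get(b, [])))
            out_b = set(in_b)
            for var, label in stmts:
                # A new definition kills every earlier definition of var.
                out_b = {d for d in out_b if d[0] != var}
                out_b.add((var, label))
            if in_b != in_sets[b] or out_b != out_sets[b]:
                in_sets[b], out_sets[b] = in_b, out_b
                changed = True
    return in_sets


blocks = {
    "entry": [("X", "X=1"), ("Y", "Y=3")],
    "then": [("X", "X=2")],   # taken when X>2
    "else": [],               # taken when X<=2
    "join": [],               # final Print(X)
}
preds = {"then": ["entry"], "else": ["entry"], "join": ["then", "else"]}
print(sorted(reaching_definitions(blocks, preds)["join"]))
```

At the join block the solver reports X=1, X=2, and Y=3, matching the annotation in the figure.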
More data flow problems
• Upward Exposed Uses
  ▫ All uses of a definition

• Live Variables
  ▫ A variable is live if it will be subsequently read
    without being redefined.

• Reaching Copies
  ▫ The reach of a copy statement

• etc.
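Live variables are computed the same way as reaching definitions, only backwards. A minimal Python sketch of the standard formulation follows; the statement numbering and the four-line example program are assumptions made for illustration.

```python
def live_variables(stmts, succ):
    """Return, per statement, the set of variables live just after it.

    stmts: id -> (defined_vars, used_vars).
    succ:  id -> list of successor statement ids.
    """
    live_in = {s: set() for s in stmts}
    live_out = {s: set() for s in stmts}
    changed = True
    while changed:
        changed = False
        for s, (defs, uses) in stmts.items():
            out_s = set().union(*(live_in[t] for t in succ.get(s, [])))
            in_s = set(uses) | (out_s - set(defs))
            if out_s != live_out[s] or in_s != live_in[s]:
                live_out[s], live_in[s] = out_s, in_s
                changed = True
    return live_out


# 1: x = 1    2: y = x + 1    3: x = 5    4: print(y)
stmts = {
    1: ({"x"}, set()),
    2: ({"y"}, {"x"}),
    3: ({"x"}, set()),
    4: (set(), {"y"}),
}
succ = {1: [2], 2: [3], 3: [4]}
print(live_variables(stmts, succ))
```

After statement 3, x is not live because nothing reads the value 5 before the program ends.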
getenv() bugs
• Detect unsafe applications of getenv()
• Example: strcpy(buf, getenv("HOME"))
• For each getenv()
 ▫ If return value is live
 ▫ And it's the reaching definition to the 2nd
   argument to strcpy()
 ▫ Then warn

• P.S. 2001 wants its bugs back.
Use-after-free Detection
• For each free(ptr)
 ▫ If ptr live
 ▫ Then warn

void f(int x)
{
       int *p = malloc(10);
       dowork(p);
       free(p);
       if (x)
              p[0] = 1;
}
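The rule "ptr is live after free(ptr)" can be sketched in Python as a forward search from each free() for a read of the pointer that is not preceded by a redefinition. The statement encoding of f() below is a hand-written assumption; Bugwise itself works on Wire code recovered from the binary.

```python
def use_after_free(stmts, succ, frees):
    """Return (free_id, use_id) pairs where a freed pointer is read again.

    stmts: id -> (defined_vars, used_vars).
    succ:  id -> list of successor statement ids.
    frees: id -> name of the pointer freed by that statement.
    """
    warnings = []
    for f, ptr in frees.items():
        seen, stack = set(), list(succ.get(f, []))
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            defs, uses = stmts[s]
            if ptr in uses:
                warnings.append((f, s))   # pointer is live after the free
                continue
            if ptr in defs:
                continue                  # redefined: this path is safe
            stack.extend(succ.get(s, []))
    return warnings


# f() from the slide: 1 malloc, 2 dowork(p), 3 free(p), 4 if (x), 5 p[0] = 1, 6 return
stmts = {
    1: ({"p"}, set()),
    2: (set(), {"p"}),
    3: (set(), {"p"}),
    4: (set(), {"x"}),
    5: (set(), {"p"}),   # a store through p reads the pointer p
    6: (set(), set()),
}
succ = {1: [2], 2: [3], 3: [4], 4: [5, 6], 5: [6]}
print(use_after_free(stmts, succ, frees={3: "p"}))  # [(3, 5)]
```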
Double Free Detection
• For each free(ptr)
 ▫ If an upward exposed use of ptr's definition is
   free(ptr)
 ▫ Then warn

void f(int x)
{
        int *p = malloc(10);
        dowork(p);
        free(p);
        if (x)
               free(p);
}

• 2001 calls again
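A hedged Python sketch of the double free rule: starting at each free(ptr), follow the control flow and warn if another free of the same pointer is reached before the pointer is redefined. The `(operation, variable)` encoding of f() is an assumption for this example only.

```python
def double_frees(stmts, succ):
    """Return (first_free_id, second_free_id) pairs.

    stmts: id -> (op, var), with op one of "def", "use", "free", "other".
    succ:  id -> list of successor statement ids.
    """
    warnings = []
    for s, (op, var) in stmts.items():
        if op != "free":
            continue
        seen, stack = set(), list(succ.get(s, []))
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.add(t)
            op_t, var_t = stmts[t]
            if var_t == var and op_t == "def":
                continue                    # pointer reassigned: path is safe
            if var_t == var and op_t == "free":
                warnings.append((s, t))     # same definition freed twice
                continue
            stack.extend(succ.get(t, []))
    return warnings


# f() from the slide: 1 malloc, 2 dowork(p), 3 free(p), 4 if (x), 5 free(p), 6 return
stmts = {
    1: ("def", "p"),
    2: ("use", "p"),
    3: ("free", "p"),
    4: ("use", "x"),
    5: ("free", "p"),
    6: ("other", None),
}
succ = {1: [2], 2: [3], 3: [4], 4: [5, 6], 5: [6]}
print(double_frees(stmts, succ))  # [(3, 5)]
```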
getenv() bugs
•   Scanned entire Debian 7 unstable repository
•   ~123,000 ELF binaries
•   85 bug reports
•   47 packages

Affected packages:
    4digits                    ptop
    acedb-other-belvu          recordmydesktop
    acedb-other-dotter         rlplot
    bvi                        sapphire
    comgt                      sc
    csmash                     scm
    elvis-tiny                 sgrep
    fvwm                       slurm-llnl-slurmdbd
    garmin-ant-downloader      statserial
    gcin                       stopmotion
    gexec                      supertransball2
    gmorgan                    theorur
    gopher                     twpsk
    gsoko                      udo
    gstm                       vnc4server
    hime                       wily
    le-dico-de-rene-cougnenc   wmpinboard
    libreoffice-dev            wmppp.app
    libxgks-dev                xboing
    lie                        xemacs21-bin
    lpe                        xjdic
    mp3rename                  xmotd
    mpich-mpd-bin
    open-cobol
    procmail
getenv() bugs over time – sorted by binary size
• Linear or power growth?
getenv() bug statistics
• Probability (P) of a binary being vulnerable: 0.00067

• P. of a package being vulnerable: 0.00255

   Conditional probability of A given that B has occurred:

   $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

• P. of a package having a 2nd vulnerability given that one
  binary in the package is vulnerable: 0.52380
Double free in SGID games “xonix”
  memset(score_rec[i].login, 0, 11);
  strncpy(score_rec[i].login, pw->pw_name, 10);
  memset(score_rec[i].full, 0, 65);
  strncpy(score_rec[i].full, fullname, 64);
  score_rec[i].tstamp = time(NULL);

  free(fullname);   /* first free */

  if((high = freopen(PATH_HIGHSCORE, "w",high)) == NULL) {
      fprintf(stderr, "xonix: cannot reopen high score file\n");

      free(fullname);   /* second free of the same pointer: double free */
      gameover_pending = 0;
      return;
  }
Future Work
• Core
 ▫   Summary-based interprocedural analysis
 ▫   Context sensitive interprocedural analysis
 ▫   Pointer analysis
 ▫   Improved decompilation
• More bug classes
Bugwise summary
• Practical tool to find simple bugs

• Based on strong theory

• Extensible

• Much work to do in the future

• Web service free to use
Future Work
• Make more of my research public

• Provide better backend infrastructure

• Get people to use the services!
Conclusion
• All of the tools in this talk are for public use

• http://www.FooCodeChu.com

  ▫ Wiki on software similarity and classification

  ▫ Preprint of my book available

• Buy my book from Springer
