Improving static code
review using AST-based
code analysis
Christophe Alladoum
@_hugsy_
hugsy
Who am I ?
➔ Christophe Alladoum
➔ IOActive pirate
➔ blah blah blah
What is this about ?
➔ I read a LOT of code
◆ mostly for fun (sometimes for work)
● just to know how it works
● occasionally to find bugs
◆ most of the time, C code
● sometimes C++
● occasionally higher level stuff: PHP (lol), Java,
Python, ...
What is this about ?
➔ C code is tricky & not trivial
● many standards (ANSI C - C89, C99, C11, etc..)
● many bad coding practices
● MANY subtleties in the language
➔ Ergo, many places for flaws
● logic errors
● programming errors
● lack of restriction in code (buffers, integers)
Existing automated tools
● Many open-source & licensed ($$$) tools use regexes to
find weak patterns
● Insufficient approach :
○ Example using latest flawfinder :
○ Basically as clever as running `grep`
(which is one of the best vuln finders, btw)
Ok, thanks !
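To make the "as clever as grep" point concrete, here is a minimal sketch of a regex-based scanner. It is not flawfinder's actual implementation; the pattern, the sample C snippet and the name `grep_like_scan` are all made up for illustration.

```python
import re

# Naive pattern (an assumption for this sketch): a "risky" function name
# followed by an opening parenthesis, wherever it appears in the text.
RISKY_CALL = re.compile(r"\b(strcpy|sprintf|gets)\s*\(")

# Hypothetical C source: only line 4 contains a real call.
C_SOURCE = '''\
/* never use strcpy() here */
const char *msg = "call gets() at your own risk";
void copy(char *dst, const char *src) {
    strcpy(dst, src);
}
'''


def grep_like_scan(source):
    """Return (line_number, function_name) for every textual match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in RISKY_CALL.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits


for lineno, name in grep_like_scan(C_SOURCE):
    print(f"line {lineno}: possible risky call to {name}()")
```

The scanner reports three hits, but the first is inside a comment and the second inside a string literal. Telling them apart from the real call on line 4 requires actually parsing the code, which is what the rest of the talk is about.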
Existing automated tools
○ and (too) many times, there are “strange” results
○ Usually a very *bad* idea to just paste output from
those tools in a (serious) code review report
*PLUS* splint fails to
see vulnerable calls
A smarter approach
➔ C-based projects are ultimately meant
to be compiled & linked
◆ Compilers are the best code reviewers !!
● Code is parsed and transformed into another format
● Code is validated
● Some additional checks for programming errors are even provided
by default (type checks, unused vars, invalid
format strings, uninitialized values, etc.)
Quick reminder on compilers
● Compiler, noun : set of programs that transforms source code written in a
programming language into another computer language (Wikipedia).
■ Examples : GCC, as, Python (CPython compiles source code to bytecode), etc.
● Abstract representation of compiler behavior:
LLVM Specifics
● What makes LLVM so special ?
○ LLVM (originally "Low-Level Virtual Machine") : 13-year-old project
○ Many different projects around this architecture
○ LLVM structure *truly* isolates each part
(lexing/optimizing/generating)
○ Totally Plug-and-Play
● you can easily write a lexer for generating Python .pyc files ...
● … or you can use the optimizer API to help runtime bug detection (heard of Google's
AddressSanitizer module ?) …
● … or you can use an existing parser (for instance GCC’s) and bind it to the rest
of the LLVM architecture (llvm-gcc)
→ really cool features ! Go
hack it !!
LLVM Specifics
● Clang
○ Default C/C++/Obj-C compiler front end for the LLVM architecture
○ Parser takes .c, .cpp, .m files as input and generates an
Intermediate Representation (IR) of the code
→ this is achieved thanks to an Abstract Syntax Tree (AST)
created when “reading” each source file
○ An API is provided to interact with the generated AST
→ in native C++
→ or higher-level languages, like Python
■ Clang already parses the code for us, so why not use
that to analyse code in a smart way (and ultimately find
vulnerabilities) ?
Clang Python API
● Relatively easy to use...
○ … but not documented thoroughly enough (only automatically generated documentation)
→ pydoc works fairly well on it
○ Many blog posts on the topic (but sometimes outdated)
○ Fairly intuitive namespace
Basic example : outputs
Demo
● clang-draw-ast.py is a 70-line Python script that will parse a C source
file and display (PNG format) the corresponding AST.
(This is the expected result if live demo fails)
Let’s have a look...
The magic inside
The indexing API is exposed by the `clang.cindex` package.
● Index
○ top-level object which manages some global library state.
● TranslationUnit
○ High-level object encapsulating the AST for a single translation unit
(parsed on the fly)
● SourceRange, SourceLocation, and File
○ Objects representing information about the input source.
Clang internals voodoo
The routines in this group provide the
ability to create and destroy translation
units from files, either by parsing the
contents of the files or by reading in a
serialized representation of a
translation unit.
● Once the index is created, its parse() method
returns a TranslationUnit object
○ The most important object
● Its Cursor object is used to iterate through all nodes
○ kind : the kind of the current node
○ displayname : display name of the referenced entity
○ location : the source location (the starting
character)
○ get_children() : returns an iterator over the children of
this cursor
○ get_arguments() : returns an iterator over the arguments
of this cursor
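The objects listed above can be combined into a very small AST dumper. This is only a sketch, not the clang-draw-ast.py script from the demo: `walk` and `dump_ast` are hypothetical names, and `dump_ast` assumes the `clang` Python bindings and a matching libclang are installed.

```python
def walk(cursor, depth=0):
    """Yield (depth, kind_name, display_name, line) for a cursor and its descendants."""
    yield depth, cursor.kind.name, cursor.displayname, cursor.location.line
    for child in cursor.get_children():
        yield from walk(child, depth + 1)


def dump_ast(path, clang_args=()):
    """Parse `path` with libclang and print an indented view of its AST."""
    # Imported lazily so that walk() stays usable without libclang installed.
    from clang.cindex import Index

    index = Index.create()                          # top-level library state
    tu = index.parse(path, args=list(clang_args))   # returns a TranslationUnit
    for depth, kind, name, line in walk(tu.cursor):
        print(f"{'  ' * depth}{kind} {name!r} (line {line})")
```

Calling `dump_ast("demo.c")` on a file of your own should print one line per AST node, indented by depth. `walk` only relies on the `kind`, `displayname`, `location` and `get_children()` members described above, so it can also be exercised with stand-in objects.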
Clang internals voodoo
Now we can better understand the previous script
Easy, right ?
Pros / Cons
Pros
● simple and intuitive Python bindings
● full control over all the code being audited
● parsing and browsing are fast
● can be extended with LLVM extra modules
Cons
● built on Python ctypes : bindings might not work as well for other high-level
languages (Ruby, Java, etc.)
Limitations ?
● Under active development : the API keeps improving and the docs are becoming more
complete
Introducing CodeBro!
● Built as a Proof-of-Concept around this idea
○ Meaning : you can use it but don’t rely on it
● Underlying idea : create a web-based tool that acts as an interface between
the AST and the code reviewer
○ The code reviewer can smartly analyse/navigate through code and
optionally add modules to detect basic (or advanced)
vulnerabilities
CodeBro!
● 100% Open-Source
○ Beer-Ware License
● 100% full Python
● (Hopefully) Easily installable (pip)
● Django (compat. 1.5+) based application
○ combines many cool Python based technologies
■ PyDot
■ PyCharm
■ Pygments
■ etc.
○ Keeps things simple
■ 1 project to audit = 1 specific database (default : SQLite)
CodeBro!
● Uses Clang parsing module to dynamically
interact with code
○ Cross-referencing feature similar to IDA Pro
■ only between functions (caller/callee)
○ call graph generation : visual understanding of code
■ graph generated as SVG → can be explored in a browser
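A caller/callee cross-reference of this kind can be sketched by walking the AST and remembering which FUNCTION_DECL each CALL_EXPR sits under. This is not CodeBro's actual code: `collect_calls` and `to_dot` are hypothetical helpers, and the sketch treats prototypes and definitions alike.

```python
from collections import defaultdict


def collect_calls(cursor, graph=None, current=None):
    """Fill graph[caller] with the set of callee names found under `cursor`."""
    if graph is None:
        graph = defaultdict(set)
    kind = cursor.kind.name
    if kind == "FUNCTION_DECL":
        current = cursor.spelling
        graph[current]  # register the function even if it calls nothing
    elif kind == "CALL_EXPR" and current is not None and cursor.spelling:
        graph[current].add(cursor.spelling)
    for child in cursor.get_children():
        collect_calls(child, graph, current)
    return graph


def to_dot(graph):
    """Render the caller -> callee mapping as Graphviz DOT text."""
    lines = ["digraph calls {"]
    for caller in sorted(graph):
        lines.append(f'  "{caller}";')
        for callee in sorted(graph[caller]):
            lines.append(f'  "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)
```

With a real translation unit, `collect_calls(tu.cursor)` would be the entry point. The DOT text can then be rendered to SVG with Graphviz (for example `dot -Tsvg`) to get a browsable graph.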
CodeBro!
● “Analysis” module
○ reports all default diagnostics provided by Clang
○ provides a “Plugin” API
■ some modules implemented
■ … some more to come
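Reporting Clang's default diagnostics mostly means iterating over the diagnostics attached to a translation unit. The sketch below is an illustration rather than CodeBro's module: `summarize_diagnostics` is a hypothetical helper, and the severity table mirrors libclang's numeric levels (0 = ignored up to 4 = fatal).

```python
# libclang severity levels, from least to most severe.
SEVERITY_NAMES = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}


def summarize_diagnostics(diagnostics, min_severity=2):
    """Format diagnostics at or above `min_severity` as 'file:line: level: message'."""
    report = []
    for diag in diagnostics:
        if diag.severity < min_severity:
            continue
        label = SEVERITY_NAMES.get(diag.severity, "unknown")
        loc = diag.location
        # Some diagnostics are not tied to a file (e.g. command-line problems).
        filename = loc.file.name if loc.file is not None else "<unknown>"
        report.append(f"{filename}:{loc.line}: {label}: {diag.spelling}")
    return report
```

Given a parsed `TranslationUnit` named `tu`, `summarize_diagnostics(tu.diagnostics)` would list its warnings and errors.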
CodeBro!
● Extensible through plugins
○ can use AST and/or already existing references
○ Examples :
■ detecting dead code
● find all functions never called (i.e. no down Xref to it)
■ improving format string flaw detection
● “count” the number of args for known functions (printf, sprintf,
etc.) and parse the arguments
● detect functions that wrap format string functions (based on earlier
calls)
■ (to a limited extent)
detecting use-after-free like this →
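Two of these plugin ideas are easy to sketch on their own. The code below is an illustration under stated assumptions, not CodeBro's plugin API: `never_called` expects a hypothetical `{caller: set_of_callees}` mapping and treats `main` as an entry point, and `expected_printf_args` uses a simplified regex for printf conversions.

```python
import re

# Simplified printf conversion: flags, width, precision, length modifier,
# conversion letter. "%%" is matched too so it can be skipped explicitly.
_SPEC = re.compile(
    r"%(?:%|[-+ #0]*(\*|\d+)?(?:\.(\*|\d+))?(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspn])"
)


def expected_printf_args(fmt):
    """Count the arguments a printf-style format string consumes."""
    count = 0
    for match in _SPEC.finditer(fmt):
        if match.group(0) == "%%":
            continue  # literal percent sign, consumes nothing
        count += 1
        # A '*' width or precision consumes one extra int argument each.
        count += sum(1 for group in match.groups() if group == "*")
    return count


def never_called(graph, entry_points=("main",)):
    """Return functions that appear as callers but are never anyone's callee."""
    called = set().union(*graph.values())
    return sorted(set(graph) - called - set(entry_points))
```

A format string check would then compare `expected_printf_args(fmt)` with the number of arguments found after the format string in the CALL_EXPR, and flag any mismatch for manual review.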
Demo time
(More screenshots if demo still fails)
Code project listing
Code browsing - unparsed
then parsed
Call graph generation : SVG generation (href linking)
← Functions listing
Future enhancements
● Still a work in progress
● Fix bugs
● Index all components of source files (instead of just CALL_EXPR and
FUNCTION_DECL)
● Improve search engine
● Add macro parsing
● Integrate more source code input vectors (Git - as soon as there is a decent
Python Git bindings package)
● Improve C++ and Objective-C analysis
● Add moar modulez !!
The end
QUESTIONS ?
Links :
● https://github.com/hugsy/codebro
● https://twitter.com/_hugsy_
● http://eli.thegreenplace.net/2011/07/03/parsing-c-in-python-with-clang
● http://llvm.org/devmtg/2010-11/Gregor-libclang.pdf
● https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer

Ruxmon.2013-08.-.CodeBro!
