Finding Diversity in Remote Code Injection Exploits
Justin Ma, John Dunagan, Helen J. Wang, Stefan Savage, Geoffrey M. Voelker
University of California, San Diego; Microsoft Research
Internet Measurement Conference 2006

Outline
- Introduction
- Background and Related Work
- Methodology
- Exploit Diversity
- Discussion and Conclusion

Introduction
- Internet users are increasingly victimized by an online criminal enterprise that spans denial-of-service extortion, identity theft, piracy, and unsolicited bulk email.
- At the core of these activities is malware: software used to remotely compromise and harness the resources of millions of hosts.
- There is little research describing the malware ecosystem itself:
  - How does one piece of malware relate to another?
  - What pressures drive its structural and functional evolution?
- This paper focuses on how to identify and measure the diversity among remote code injection exploits.

Introduction (cont'd)
- Typically, a host is compromised via a software vulnerability (e.g. a buffer overflow) that allows network-based input to be "injected" into a running program and executed.
- Subsequently, the exploit payload may:
  - Download additional software
  - Reconfigure the OS to evade detection, etc.
- This paper focuses on the exploit and its initial payload, the so-called shellcode.
- Shellcodes:
  - Are the first code executed on a newly compromised machine
  - Are typically small, simple, hand-coded machine programs
  - Are well suited to automated analysis
- Goals:
  - Understand how much variation exists among the shellcodes for an exploit
  - Measure shellcode diversity to better understand how malware is created
  - Infer the paternity of different samples
  - Construct a shellcode phylogeny

Background
- Remote code injection attacks are a combination of vulnerability, exploit, and shellcode.
- The vulnerability is the particular software structure that allows data provided over the network to subvert and redirect execution control flow.
  - Example: an unchecked buffer lets the attacker overwrite the return address of the calling stack frame.
- An exploit is a particular formulation of an attack against a vulnerability.
- The shellcode is the payload carried by the exploit; it is the first code to execute.

Stack Buffer Overflow
Figure: Simple example of a remote stack-based buffer overflow. The shaded regions represent the shellcode of the exploit as sent over network packets, then as injected into the vulnerable buffer of the target host. The return address has been overwritten with injected data, thereby redirecting the execution flow to the shellcode residing in the vulnerable buffer.

Background (cont'd)
- Shellcodes are frequently limited by:
  - The size of the buffer being processed
  - The need for the buffer to contain "NOP sleds", long regions of consecutive "do nothing" instructions
- Shellcodes can be quite sophisticated in their construction:
  - Pseudo-random NOP sleds
  - Polymorphic payloads that are encrypted (and potentially compressed) in transit and decrypted only just before execution
  - Some polymorphic shellcode generators also create random decryptors

Background (cont'd)
- Early attempts to defeat polymorphism:
  - X-ray analysis: heuristically decode polymorphic code, using a portion of a known, decoded instance to recover the encryption key
  - Generic decryption: emulate execution while the shellcode decrypts itself, typically using a heuristic to guess when this process terminates
- Once a shellcode has been decoded, comparing it to other shellcodes is another key problem. Approaches:
  - Model each shellcode as a binary string and use traditional lexical distance measures
  - Use structural distance measures that capture variation in the control flow and values at the instruction, basic block, or function level [11]

[11] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic Worm Detection Using Structural Information of Executables. In Proceedings of Symposium on Recent Advances in Intrusion Detection (RAID), Seattle, WA, Sept. 2005.

Methodology: Exploit Collection
- The primary means of collecting exploits is examining network traces of traffic sent to and from active responders.
- Active responders are hosts that respond to unsolicited probes (exploit attempts).
- Emulating end-host behavior allows us to collect more session data.
  - In particular, completing the infection handshake suffices to cause the attacker to transmit the shellcode.
  - For example, the ISystemActivator and RemoteActivation exploits require active responders in order to capture the RPC Bind and Request.

Methodology: Extracting Shellcodes
- We use Shield [29] to extract the shellcode for each exploit session from the traces.
- However, not all of the collected data corresponds to executable code:
  - Execution starts at an offset within the vulnerable buffer
  - The buffer may contain random padding

[29] H. Wang, C. Guo, D. Simon, and A. Zugenmaier. Shield: Vulnerability-Driven Network Filters for Preventing Known Vulnerability Exploits. In Proceedings of the ACM SIGCOMM Conference, Portland, Oregon, Sept. 2004.

Methodology: Exploit Emulation
- Decoding the exploits is often necessary to reveal most of the actual executable code.
- The easiest way to deal with the variety of decoding routines is to use binary emulation.
- We implement the emulator using Intel's Pin [13] on Linux.
  - Given an encoded shellcode, we first declare it as a statically allocated buffer in C source code that treats the buffer as a function.
  - To overcome any issues with non-executable prefixes, we iteratively retry failed emulations at subsequent offsets.
  - As Pin successfully emulates the binary, we mark the executed instruction bytes for later analysis.

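The offset-retry step can be sketched as follows. This is only an illustration under stated assumptions: `try_emulate` is a hypothetical callback standing in for the Pin-based emulator (it is not a Pin API), and `toy_emulate` is a made-up stand-in that treats 0x90 as a NOP and 0xC3 as a terminating return so that the sketch runs on its own.

```python
from typing import Callable, Optional, Set, Tuple

# Hypothetical emulator interface: returns the indices of executed bytes,
# or None if emulation starting at `offset` fails.
Emulator = Callable[[bytes, int], Optional[Set[int]]]

def find_executed_bytes(payload: bytes,
                        try_emulate: Emulator) -> Optional[Tuple[int, Set[int]]]:
    """Retry emulation at successive offsets until one run succeeds."""
    for offset in range(len(payload)):
        executed = try_emulate(payload, offset)
        if executed:
            return offset, executed
    return None  # no offset produced a successful run

def toy_emulate(buf: bytes, offset: int) -> Optional[Set[int]]:
    """Toy stand-in (assumption): 0x90 = NOP, 0xC3 = RET, anything else faults."""
    executed: Set[int] = set()
    i = offset
    while i < len(buf):
        if buf[i] == 0x90:
            executed.add(i)
            i += 1
        elif buf[i] == 0xC3:
            executed.add(i)
            return executed
        else:
            return None
    return None  # ran off the end of the buffer without returning

payload = b"\x00\x01\x90\x90\xc3\xff"  # two non-executable prefix bytes
result = find_executed_bytes(payload, toy_emulate)
```

Here the first two offsets fail and the run starting at offset 2 succeeds, so the marked bytes exclude both the prefix and the trailing padding.
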
Methodology: Clustering
- Agglomerative clustering, a form of hierarchical clustering:
  - Begins with each unique shellcode belonging to its own cluster
  - Builds up a hierarchy of similarity among exploit samples by iteratively merging the closest pair of clusters at each step
- Distance between clusters: the distance between the furthest samples in the two respective clusters

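A minimal sketch of the clustering loop described above, assuming an arbitrary pairwise distance function `dist` is supplied by the caller. The function names are illustrative, and the quadratic pair search is chosen for clarity rather than speed. Taking the distance between the furthest samples of two clusters is commonly called complete linkage.

```python
def cluster_distance(a, b, dist):
    """Distance between the furthest pair of samples in clusters a and b."""
    return max(dist(x, y) for x in a for y in b)

def agglomerate(samples, dist):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Start with each unique sample in its own cluster (order preserved).
    clusters = [(s,) for s in dict.fromkeys(samples)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j], dist), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Toy example: integers stand in for shellcodes, |x - y| for the distance.
merges = agglomerate([0, 1, 10, 11, 1], lambda x, y: abs(x - y))
```

The merge history is what a dendrogram draws: each entry joins two subtrees at the height given by its distance.
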
Methodology: Clustering (cont'd)
Distance metrics:
- Exedit distance
- Edit distance
  - Does not distinguish code from data
  - Random padding generates further noise
- Structural distance
  - Based on the control flow graph (CFG) [11]
  - Does not capture subtle variation between related exploits because entire basic blocks are summarized

[11] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic Worm Detection Using Structural Information of Executables. In Proceedings of Symposium on Recent Advances in Intrusion Detection (RAID), Seattle, WA, Sept. 2005.

Methodology: Clustering (cont'd)
Exedit distance metric:
- Edit distance over the executed parts of the shellcode
- Distinguishes code from data
- Maintains instruction-level details
- Uses a canonical string for each shellcode

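One way to picture exedit distance is as an ordinary edit distance applied only to the bytes that were marked as executed. The sketch below assumes the executed bytes are given as a set of indices and builds the canonical string by concatenating them in address order; that construction, and the function names, are assumptions for illustration. The slides report distances as percentages, but the normalization is not specified here, so the sketch returns a raw edit count.

```python
def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance between two byte strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def executed_string(payload: bytes, executed: set) -> bytes:
    """Keep only executed bytes, in address order (assumed canonical form)."""
    return bytes(payload[i] for i in sorted(executed))

def exedit_distance(p1: bytes, ex1: set, p2: bytes, ex2: set) -> int:
    return edit_distance(executed_string(p1, ex1), executed_string(p2, ex2))

# Two toy payloads whose code differs by one byte but whose padding differs a lot.
p1, ex1 = b"\x90\x90\xc3AAAA", {0, 1, 2}
p2, ex2 = b"\x90\xc3ZZZZZZ", {0, 1}
d_exedit = exedit_distance(p1, ex1, p2, ex2)
d_edit = edit_distance(p1, p2)
```

In this toy case the padding dominates the plain edit distance, while the exedit distance reflects only the one-byte difference in executed code.
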
Exploit Diversity
- Four well-known vulnerabilities:
  - SQL Name Resolution (Slammer)
  - LSASS (Sasser)
  - MS RPC ISystemActivator (Blaster)
  - MS RPC RemoteActivation (Blaster)
- Using the methodology from Section 3, we:
  - Cluster the shellcodes according to their variability and thus identify shellcode families
  - Provide a detailed characterization of each family, to convey both the structure of shellcode families and the subtle functional variations among them
  - Show the prevalence of each shellcode family in the trace
- Trace 1: exploit attempts captured on a residential DSL network for 2 days (2005 9/6)
  - Fully patched Windows XP SP2
  - 29 IP addresses
  - Responds to incoming requests
- Trace 2: from a honeyfarm at the Lawrence Berkeley National Laboratory

SQL Name Resolution
- The Slammer worm (Jan. 2003)
- The outlier: its payload was likely corrupted on the network before being captured. Its last 91 bytes consisted of:
  - 22 unidentified bytes
  - A 20-byte IP header
  - A UDP header
  - The first 41 bytes of the Slammer exploit
- No exploit diversity

LSASS (Local Security Authority Subsystem Service)
- The original Sasser worm: Apr. 2004
- A handful of variants were responsible for a large number of occurrences

LSASS (cont'd)
Figure: dendrograms of the LSASS shellcodes under the exedit, edit, and structural distance metrics. Annotations: "Not fundamental to the code"; "Ignores subtle differences between shellcodes".

LSASS (cont'd)
Inter-family analysis (manual analysis):
- The differences in variations are 2–20 bytes and correspond to phone-home/connect-back IP addresses, hostnames, and ports encoded in the payload.
- LSASS-1
  - The main body of the exploit followed immediately after the decoding loop
  - The main body and data section were XOR-ed one byte at a time with key 0x99
- LSASS-0
  - An unencoded main body followed by an encoded data section (byte-wise XOR with key 0xff)
  - Contained embedded URL strings that belonged to previously classified malware
- LSASS-2, 3, 4
  - Share the same encoding scheme and roughly the same flow of execution

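The byte-wise XOR encoding mentioned above is its own inverse, which is why a short decoding loop at the front of the shellcode suffices. A minimal sketch, where the helper name `xor_bytes` and the sample plaintext are illustrative and the key 0x99 is the one reported for LSASS-1:

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)

plain = b"hello"                     # placeholder for a shellcode body
encoded = xor_bytes(plain, 0x99)     # what would travel over the network
decoded = xor_bytes(encoded, 0x99)   # what the decoding loop recovers
```

Because the key is a single byte, two samples that share a main body but use different keys differ in every encoded byte, yet become identical once decoded.
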
LSASS (cont’d) - Prevalence
ISystemActivator
- The Blaster worm (Aug 2003) originally exploited RemoteActivation.
- Is the observed variation the result of polymorphism?
- The results indicate that exploits within a family are similar, but that ISys families differ more substantially from each other than the LSASS exploit families do.

ISystemActivator
- We confirmed that there were six different code bases.
- There was no code polymorphism.
  - The differences were due to variations in data constants, such as encodings of phone-home addresses and hostnames, as well as names of executables.
- ISys-0 used a 4-byte, non-overlapping XOR to encode its payload, whereas all other exploits used a byte-by-byte XOR.
- ISys-4 had the largest payload length, and its flow of execution was the most complicated.
- The moderate exedit distance within the ISys-1 family (9%) reflects some differing instructions; the exploits are otherwise very similar.
- ISys-5 exploits had a characteristic execution flow that performed consecutive jumps over two text sections.
  - Text: "tftp.exe -i <address> get <executable name>"
  - Variation in <address> and <executable name> accounted for 6.5% distance.

ISystemActivator
Figure: annotated shellcode listing. Labels: 4-byte decoding (encoding) key; kernel-address loading function (kernel base loader); function-finding block (function finder).

ISystemActivator
Figure: ISys-4, which had the largest payload length and the most complicated flow of execution.

ISystemActivator
Figure: ISys-5, which performed consecutive jumps over two text sections. Text: "tftp.exe -i <address> get <executable name>"; variation in <address> and <executable name> accounted for 6.5% distance.

ISystemActivator
Figure: ISys-1, with different instructions in parts but otherwise very similar.

ISystemActivator
- The "Bind" version required the newly infected host to bind on a socket and wait for a connection attempt from the infecting host.
- The "Connect-back" version required the newly infected host to connect back to the infecting host.
- Interestingly, the number of iterations in ISys-3's loop overshoots the exploit payload. Thus, it seems that either ISys-2 was a refinement of ISys-3, or ISys-3 was a poor imitation of ISys-2.

ISystemActivator
RemoteActivation Unlike the other exploits, RemoteActivation exploits exhibited a high amount of exploit diversity per host
RemoteActivation (cont'd)
- Exedit distance is very small.
- Families: 0 is the "Bind" version; 1 is the "Connect-back" version.
- The byte-wise encoding scheme covered only the main bodies of the exploits, but different exploits used different keys. Manual inspection confirmed that variable encoding of the exploit's main body contributed to the jump in average intra-family distance.
- Manual inspection also showed that the last third (roughly 300 bytes) of the payload contained randomly generated characters.
- Changing keys and random filler characters are commonly described techniques for polymorphism, and the RemoteActivation exploits had both of these features.

Diversity Across Vulnerabilities The trace is a full-payload 4.5-day trace from a Windows honeyfarm running at the Lawrence Berkeley National Laboratory starting on April 19, 2006. Hosts in this honeyfarm served as active responders to incoming requests
Diversity Across Vulnerabilities (cont’d) Dendrogram for the LBL trace exploits using exedit distance. The 1st set of hash marks just below 0% represent ISystemActivator, the 2nd represent LSASS, the 3rd represent PNP, and the 4th represent RemoteActivation.
Diversity Across Vulnerabilities (cont’d) Multi-vector family
Discussion: Polymorphism
- We generated a small set of signatures that exhaustively covered all exploits we observed for each vulnerability in the DSL residential trace.
  - Each signature was a contiguous sequence of 100 bytes.
  - For each individual vulnerability except LSASS, one signature sufficed to cover the set of exploits. LSASS required two: one covered 1645 of 1769 exploits, and the other covered the rest.
  - Manual investigation of these signatures showed that they primarily focused on the portions of the shellcode that were mostly (but not entirely) NOPs.
- We then tested the signatures against a 5-GB trace on our internal network for false positives. None of the signatures yielded false positives in the internal trace.
- The polymorphism was therefore not effective for evasion. Possible other purposes:
  - Functional variation
  - Increasing the difficulty of reverse engineering

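The coverage and false-positive checks described above amount to substring matching. The following is a rough sketch only: the function names are illustrative, the payloads and signatures are made-up toy values (4 bytes rather than the 100-byte signatures used in the study), and it says nothing about how the signatures themselves were generated.

```python
from typing import List, Optional, Tuple

def matching_signature(payload: bytes, signatures: List[bytes]) -> Optional[int]:
    """Index of the first signature found as a contiguous substring, else None."""
    for idx, sig in enumerate(signatures):
        if sig in payload:
            return idx
    return None

def coverage(payloads: List[bytes], signatures: List[bytes]) -> Tuple[List[int], int]:
    """Count payloads matched per signature, plus payloads matched by none."""
    counts = [0] * len(signatures)
    uncovered = 0
    for p in payloads:
        idx = matching_signature(p, signatures)
        if idx is None:
            uncovered += 1
        else:
            counts[idx] += 1
    return counts, uncovered

# Toy data (assumptions): three "exploits", two signatures, one benign payload.
exploits = [b"xxABCDyy", b"ABCDzz", b"qqWXYZ"]
signatures = [b"ABCD", b"WXYZ"]
exploit_counts, exploit_misses = coverage(exploits, signatures)
benign_counts, benign_misses = coverage([b"hello world"], signatures)
```

Exhaustive coverage corresponds to zero misses on the exploit set, and the absence of false positives corresponds to all-zero counts on the benign set.
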
Conclusion
- This paper presents a methodology for constructing the phylogeny of remote code injection exploits and evaluates this methodology on network traces taken from several vantage points.
- The methodology is robust to the observed polymorphism.
- The techniques reveal non-trivial code sharing among different exploit families, and the resulting phylogenies accurately capture the subtle variations among exploits within each family.
- Analyzing both the emergence of polymorphism and the phylogeny of remote code injection exploits is important.

Finding Diversity In Remote Code Injection Exploits

  • 1. Finding Diversity in Remote Code Injection Exploits Justin Ma , John Dunagan , Helen J. Wang , Stefan Savage , Geoffrey M. Voelker University of California, San Diego Microsoft Research Internet Measurement Conference 2006
  • 2. Outline Introduction Background and Related Work Methodology Exploit Diversity Discussion and Conclusion
  • 3. Introduction Internet users are increasingly victimized by online criminal enterprise that spans denial-of-service extortion, identity theft, piracy and unsolicited bulk email At the core of these activities is malware Software used to remotely compromise and harness the resources of millions of hosts There is little research describing the malware ecosystem itself How does one piece of malware relate to another? What pressures drive its structural and functional evolution? This paper focuses on how to identify and measure the diversity among remote code injection exploits
  • 4. Introduction (cont’d) Typically, a host is compromised via a software vulnerability (e.g. buffer overflow ) that allows network-based input to be “injected” into a running program and executed. Subsequently, the exploit payload may Download additional software Reconfigure the OS to evade detection, etc. This paper focuses on the exploit and its initial payload -the so-called shellcodes. Shellcodes : First executed on a newly compromised machine Are typically small, simple, hand-coded machine programs Are well-suited to automated analysis Understand how much variation exists among the shellcodes for an exploit Measure shellcode diversity better understand how malware is created. Infer the paternity of different samples Construct a shellcode phylogeny
  • 5. Outline Introduction Background and Related Work Methodology Exploit Diversity Discussion and Conclusion
  • 6. Background Remote code injection attacks are a combination of vulnerability, exploit and shellcode The vulnerability is the particular software structure that allows data provided over the network to subvert and redirect execution control flow An unchecked buffer Overwrite the return address of the calling stack frame An exploit is a particular formulation of a attack against a vulnerability The shellcode is the payload carried by the exploit—it is the first code to execute
  • 7. Stack Buffer Overflow Simple example of a remote stack-based buffer overflow. The shaded regions represent the shellcode of the exploit as sent over network packets, then as injected into the vulnerable buffer of the target host. The return address has been overwritten with injected data, thereby redirecting the execution flow to the shellcode residing in the vulnerable buffer .
  • 8. Background (cont’d) Shellcodes: Are frequently limited by the size of the buffer being processed The need for the buffer to contain “NOP sleds” or long regions of consecutive “do nothing” instructions Can be quite sophisticated in their construction : The creation of pseudo-random NOP sleds Polymorphic payloads that are encrypted (and potentially compressed) in transit and only decrypted just before execution Some polymorphic shellcode generators also create random decryptors
  • 9. Background (cont’d) Early attempts to defeat polymorphic : X-ray analysis Heuristically decode polymorphic codes based on a portion of known, decoded instance to recover the encryption key Generic decryption Emulate execution while the shellcode decrypts itself Typically using a heuristic to guess when this process terminates Having decoded a malware shellcode, comparing it to other shellcode is another key problem. Approaches: Model each shellcode as a binary string and use traditional lexical distance measures Use structural distance measures that capture variation in the control flow and values at instruction, basic block, or function levels [11] [11] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic Worm Detection Using Structural Information of Executables. In Proceedings of Symposium on Recent Advances in Intrusion Detection (RAID) , Seattle, WA, Sept.2005.
  • 10. Outline Introduction Background and Related Work Methodology Exploit Diversity Discussion and Conclusion
  • 11. Methodology-Exploit Collection Primary means of collecting exploits is by examining network traces of traffic sent to and from active responders . Active responders: hosts that respond to unsolicited probes (exploit attempts) Emulating end-host behavior allows us to collect more session data In particular, completing the infection handshake will suffice to cause the attack to transmit the shellcode For example: ISystemActivator and RemoteActivation exploit Require active responders to capture RPC Bind and Request
  • 12. Methodology-Extracting Shellcodes Use Shield[29] to extract the shellcode for each exploit session from the traces However, not all of the collected data corresponds to executable code. Execution starts at an offset within the vulnerable buffer The buffer may contain random padding [29] H. Wang, C. Guo, D. Simon, and A. Zugenmaier. Shield: Vulnerability-Driven Network Filters for Preventing Known Vulnerability Exploits. In Proceedings of the ACM SIGCOMM Conference , Portland, Oregon, Sept. 2004.
  • 13. Methodology-Exploit Emulation Decoding the exploits is often necessary to reveal most of the actual executable code The easiest way to deal with the variety of decoding routines is to use binary emulation We implement the emulator using Intel’s Pin[13] on Linux Given an encoded shellcode, we first declare it as a statically allocated buffer in C source code that treats the buffer as a function By Iteratively retrying failed emulations at subsequent offsets To overcome any issues with non-executable prefixes As Pin successfully emulates the binary, we mark the executed instruction bytes for later analysis
  • 14. Methodology-Clustering Agglomerative clustering A form of hierarchical clustering Begins with each unique shellcode belonging to it own cluster. Performs merging on the closest ( distance ) pair of clusters Builds up a hierarchy of similarity among exploit samples by iteratively merging the closest pair of clusters at each step distance between clusters: the distance between the furthest samples in the two respective clusters
  • 15. Methodology-Clustering (cont’d) Distance Metrics Exedit Distance Edit Distance Does not distinguish code from data Random padding generates further noise Structural Distance Control flow graph (CFG)[11] Do not capture subtle variation between related exploits because entire basic blocks are summarized [11]C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic Worm Detection Using Structural Information of Executables. In Proceedings of Symposium on Recent Advances in Intrusion Detection (RAID), Seattle, WA, Sept. 2005
  • 16. Methodology-Clustering (cont’d) Exedit distance metric Edit distance over executed parts of shellcode Distinguishes code from data Maintains instruction-level details Canonical string for shellcode
  • 17. Outline Introduction Background and Related Work Methodology Exploit Diversity Discussion and Conclusion
  • 18. Exploit Diversity Four well-known vulnerabilities SQL Name Resolution (Slammer) LSASS (Sasser) MS RPC ISystemActivator (Blaster) MS RPC RemoteActivation (Blaster) Use methodology from Section 3 to cluster the shellcodes according to their variability and thus identify shellcode families provide a detailed characterization of each family to both convey the structure of shellcode families as well as the subtle functional variations among them show the prevalence of each shellcode family in the trace The trace 1 Capture exploit attempts on a residential DSL network for 2 days (2005 9/6) Fully patched Windows XP SP2 29 IP addresses Respond to incoming requests The Trace 2 From a honeyfarm at the Lawrence Berkeley National Laboratory
  • 19. SQL Name Resolution The Slammer worm (Jan. 2003) The outlier Its payload was likely corrupted one the network before being captured. The last 91 bytes: Un identified 22 bytes 20-byte IP header a UDP header The first 41 bytes of the Slammer exploit No exploit diversity
  • 20. LSASS (Local Security Authority Subsystem Service) The original Sasser worm: Apr. 2004 A handful of variants were responsible for a large number of occurrences
  • 21. LSASS (cont’d) Exedit Edit structural Not fundamental to the code Ignores subtle differences between shellcodes
  • 22. LSASS (cont’d) Inter-family analysis (Manual analysis) The differences in variations are 2–20 bytes, and correspond to phone-home/connect-back IP addresses, hostnames, and ports encoded in the payload. LSASS-1 The main body of the exploit followed immediately after the decoding loop The main body and data session were XOR-ed one byte at a time with key 0x99 LSASS-0 An unencoded main body followed by an encoded data session (byte-wise XOR with key 0xff) There are embedded URL strings belonged to previously classified malware LSASS-2,3,4 Share the same encoding scheme and roughly the same flow of execution
  • 23. LSASS (cont’d) - Prevalence
  • 24. ISystemActivator The Blaster worm (Aug. 2003) originally exploited RemoteActivation. The result of polymorphism? The results indicate that exploits within a family are similar, but that ISys families differ more substantially from each other than the LSASS exploit families do.
  • 25. ISystemActivator We confirmed that there were six different code bases. There was no code polymorphism. The differences were due to variations in data constants, such as encodings of phone-home addresses and hostnames, as well as names of executables. ISys-0 used a 4-byte, non-overlapping XOR to encode its payload, whereas all other exploits used a byte-by-byte XOR. ISys-4 had the largest payload length, and its flow of execution was the most complicated. The moderate exedit distance within the ISys-1 family (9%) reflects some differing instructions in otherwise very similar code. ISys-5 exploits had a characteristic execution flow that performed consecutive jumps over two text sections, “tftp.exe -i <address> get <executable name>”; the <address> and <executable name> fields accounted for 6.5% distance.
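The 4-byte, non-overlapping XOR attributed to ISys-0 differs from the byte-by-byte scheme in that the key spans four bytes and is applied to consecutive 4-byte blocks. The sketch below is one possible reading; `xor_dwords` is a hypothetical name, and the handling of a trailing partial block is an assumption, since the slides do not describe it.

```python
def xor_dwords(data: bytes, key: bytes) -> bytes:
    """XOR data with a 4-byte key over consecutive non-overlapping blocks."""
    if len(key) != 4:
        raise ValueError("key must be exactly 4 bytes")
    out = bytearray()
    for start in range(0, len(data), 4):
        block = data[start:start + 4]
        # Assumption: a short final block uses only the leading key bytes.
        out.extend(b ^ k for b, k in zip(block, key))
    return bytes(out)


# Hypothetical key and payload, chosen only to make the pattern visible.
key = b"\x01\x02\x03\x04"
encoded = xor_dwords(b"\x00" * 8, key)
```

Encoding a run of zero bytes exposes the key repeated every four bytes, which is one way such a scheme can be recognized in a captured payload.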
  • 26. ISystemActivator Figure labels: 4-byte decoding key; kernel-address loading function; function-finding block; 4-byte encoding key; kernel base loader; function finder.
  • 27. ISystemActivator ISys-4: largest payload length and the most complicated flow of execution.
  • 28. ISystemActivator ISys-5: performed consecutive jumps over two text sections, “tftp.exe -i <address> get <executable name>”; the <address> and <executable name> fields accounted for 6.5% distance.
  • 29. ISystemActivator ISys-1: different instructions in parts, otherwise very similar.
  • 30. ISystemActivator The “bind” version required the newly infected host to bind to a socket and wait for a connection attempt from the infecting host. The “connect-back” version required the newly infected host to connect back to the infecting host. Interestingly, the number of iterations in ISys-3’s loop overshoots the exploit payload. Thus, it seems that either ISys-2 was a refinement of ISys-3, or that ISys-3 was a poor imitation of ISys-2.
  • 32. RemoteActivation Unlike the other exploits, RemoteActivation exploits exhibited a high degree of exploit diversity per host.
  • 33. RemoteActivation (cont’d) Exedit distance is very small. Families: 0 is the “bind” version, 1 is the “connect-back” version. The byte-wise encoding scheme covered only the main bodies of the exploits, but different exploits used different keys. Manual inspection confirmed that variable encoding of the exploit’s main body contributed to the jump in average intra-family distance, and that the last third (roughly 300 bytes) of the payload contained randomly generated characters. Changing keys and random filler characters are commonly described techniques for polymorphism, and the RemoteActivation exploits had both of these features.
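To see why a changing encoding key inflates a byte-level distance even when the underlying code is identical, consider the sketch below. It uses a plain Levenshtein edit distance normalized by the longer length, which is an assumption for illustration and is not the exedit distance defined in the methodology section; the body and keys are hypothetical.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def edit_distance(a: bytes, b: bytes) -> int:
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def normalized_distance(a: bytes, b: bytes) -> float:
    """Edit distance divided by the longer input length (0.0 if both empty)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical main body encoded under two different single-byte keys.
body = bytes(range(32))
variant_a = xor_bytes(body, 0x11)
variant_b = xor_bytes(body, 0x22)

raw = normalized_distance(variant_a, variant_b)
after_decoding = normalized_distance(xor_bytes(variant_a, 0x11),
                                     xor_bytes(variant_b, 0x22))
```

Here `after_decoding` is zero because both variants decode to the same body, while `raw` is nonzero purely because of the key change.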
  • 34. Diversity Across Vulnerabilities The trace is a full-payload 4.5-day trace from a Windows honeyfarm running at the Lawrence Berkeley National Laboratory starting on April 19, 2006. Hosts in this honeyfarm served as active responders to incoming requests
  • 35. Diversity Across Vulnerabilities (cont’d) Dendrogram for the LBL trace exploits using exedit distance. The 1st set of hash marks just below 0% represent ISystemActivator, the 2nd represent LSASS, the 3rd represent PNP, and the 4th represent RemoteActivation.
  • 36. Diversity Across Vulnerabilities (cont’d) Multi-vector family
  • 37. Discussion - Polymorphism We generated a small set of signatures that exhaustively covered all exploits we observed for each vulnerability in the DSL residential trace. Each signature was a contiguous sequence of 100 bytes. For each individual vulnerability except LSASS, one signature sufficed to cover the set of exploits. LSASS required two: one covered 1645/1769 exploits, and the other covered the rest. Manual investigation of these signatures showed that they primarily focused on the portions of the shellcode that were mostly (but not entirely) NOPs. We then tested the signatures against a 5-GB trace on our internal network for false positives. None of the signatures yielded false positives in the internal trace. The polymorphism was therefore not effective for evasion. What was it for? Functional variation? Increasing the difficulty of reverse engineering?
  • 38. Conclusion This paper presents a methodology for constructing the phylogeny of remote code injection exploits and evaluates it on network traces taken from several vantage points. The methodology is robust to the observed polymorphism. The techniques reveal non-trivial code sharing among different exploit families, and the resulting phylogenies accurately capture the subtle variations among exploits within each family. Analyzing both the emergence of polymorphism and the phylogeny of remote code injection exploits is important.
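The coverage and false-positive checks described above can be sketched as plain substring matching. How the signatures were actually generated is not stated in the slides, so this sketch only evaluates given signatures; the function names and the tiny sample data are hypothetical, and the signatures here are far shorter than the 100 bytes used in the study.

```python
def coverage(signatures, payloads):
    """Count payloads matched by each signature and collect unmatched ones."""
    counts = {sig: 0 for sig in signatures}
    uncovered = []
    for payload in payloads:
        matched = False
        for sig in signatures:
            if sig in payload:  # contiguous byte-sequence match
                counts[sig] += 1
                matched = True
        if not matched:
            uncovered.append(payload)
    return counts, uncovered


def false_positives(signatures, benign_payloads):
    """Return benign payloads that contain any signature."""
    return [p for p in benign_payloads
            if any(sig in p for sig in signatures)]


# Hypothetical data: two short signatures and three exploit payloads.
sigs = [b"\x90\x90\xeb", b"\x90\x41\x90"]
exploits = [b"AA\x90\x90\xebBB", b"\x90\x90\xebCC", b"DD\x90\x41\x90"]
benign = [b"GET / HTTP/1.1\r\n", b"\x00\x01\x02"]

counts, uncovered = coverage(sigs, exploits)
fps = false_positives(sigs, benign)
```

With this sample data the first signature matches two payloads, the second matches the remaining one, and no benign payload is flagged, mirroring the shape of the LSASS result where two signatures together covered every observed exploit.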