Copyright 2011 Trend Micro Inc. 1
Bytewise approximate matching,
searching and clustering
Liwei Ren, Ph.D
Ray Cheng, Ph.D
Trend Micro Inc.
DFRWS USA 2015, August 2015, Philadelphia, PA
Agenda
• Background
• Six Matching Problems and Bytewise Relevance
• Current Work: A Framework of Theory, Algorithms, and
Technologies
• Future Work
Classification 8/17/2015 2
Background
• Similarity digesting schemes:
– Problem: Given two binary strings s1 and s2, measure their similarity.
• Compute a hash of each string that preserves the similarity of the strings.
• Measure similarity by comparing the two hash values.
– Examples: TLSH, ssdeep, sdhash
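The idea of comparing hashes instead of raw strings can be illustrated with a toy digest. The sketch below is not TLSH, ssdeep, or sdhash; it assumes a simple bottom-k sketch of n-gram hashes, and the names `digest` and `sim` and the parameters `n` and `keep` are hypothetical.

```python
import hashlib


def digest(data: bytes, n: int = 4, keep: int = 64) -> frozenset:
    """Toy similarity-preserving digest: the `keep` smallest n-gram hashes."""
    grams = {data[i:i + n] for i in range(len(data) - n + 1)}
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(g, digest_size=8).digest(), "big")
        for g in grams
    )
    return frozenset(hashes[:keep])


def sim(h1: frozenset, h2: frozenset) -> float:
    """Jaccard similarity of two digests, a value in [0, 1]."""
    if not h1 and not h2:
        return 1.0
    return len(h1 & h2) / len(h1 | h2)
```

Strings that share most of their n-grams produce digests with a large overlap, so `sim(digest(s1), digest(s2))` stays high after a small edit and drops toward 0 for unrelated inputs.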
Background
• NIST Special Publication 800-168 (NIST.SP.800-168) introduces the
concept of bytewise approximate matching:
– The NIST document lists four cases to describe this concept:
• Object similarity detection: identify related artifacts, e.g. different versions
of a document.
• Cross Correlation: identify artifacts sharing a common object.
• Embedded Object Detection: identify a given object inside an artifact.
• Fragment Detection: identify the presence of traces/fragments of a known
artifact.
• Dr. Liwei Ren's talk at DFRWS EU 2015:
– A Theoretic Framework for Evaluating Similarity Digesting Tools
– Uses a mathematical model to describe binary similarity.
Six Matching Problems and Bytewise Relevance
• The NIST document does not cover all bytewise approximate
matching cases.
• We generalized the NIST cases to six cases:
Six Matching Problems and Bytewise Relevance
• Continued:
Classification of NIST approximate
matching cases
• Similarity Detection: identify related artifacts.
– AM1 (approximate match)
• Cross Correlation: identify artifacts sharing a
common object.
– EM3 (exact match cross-sharing)
• Embedded Object Detection: identify a given
object inside an artifact.
– EM2 (exact match containment)
• Fragment Detection: identify the presence of
traces/fragments of a known artifact.
– EM2 (one or more exact-match containments)
Six Matching Problems and Bytewise Relevance
• Definition 1: Given two strings R[1..n] and T[1..m], if at least one of
the six cases holds, we say R and T are bytewise relevant.
– We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.
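Definition 1 can be read as a logical OR over per-case predicates. The sketch below covers only the three exact cases and rests on assumed definitions: identicalness as equality, containment as one string being a substring of the other, and cross-sharing as a common substring of at least `k` bytes. The function names and the threshold `k` are hypothetical, and the approximate cases AM1 to AM3 are left out.

```python
def is_identical(r: bytes, t: bytes) -> bool:
    """EM1 (assumed definition): the two strings are equal."""
    return r == t


def is_contained(r: bytes, t: bytes) -> bool:
    """EM2 (assumed definition): one non-empty string occurs inside the other."""
    return bool(r) and bool(t) and (r in t or t in r)


def shares_block(r: bytes, t: bytes, k: int = 8) -> bool:
    """EM3 (assumed definition): r and t share a common substring of k bytes."""
    if len(r) < k or len(t) < k:
        return False
    grams = {r[i:i + k] for i in range(len(r) - k + 1)}
    return any(t[j:j + k] in grams for j in range(len(t) - k + 1))


def br(r: bytes, t: bytes, k: int = 8) -> int:
    """Return 1 if any of the exact cases holds, otherwise 0."""
    return int(is_identical(r, t) or is_contained(r, t) or shares_block(r, t, k))
```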
A Framework of Theory, Algorithms and Technologies
• Define three fundamental problems using Bytewise
Relevance:
– Matching: Given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
– Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B
such that BR(o, b) = 1.
– Clustering: Given a bag B of objects, partition B into groups {G1,
G2,…,Gm} based on BR.
• S = an object space,
• O = an object in the object space S,
• BR = the bytewise relevance relationship for objects in S.
A Framework of Theory, Algorithms and Technologies
• Our bytewise relevance framework :
Matching
• The six matching problems, EM1 to AM3:
– Identicalness EM1: the solution is trivial.
– Containment EM2: the solution is the Rabin-Karp algorithm.
– Cross-sharing EM3:
• We established a theory for this interesting problem: how to measure cross-sharing.
• We developed an algorithmic solution with theoretical analysis.
– Similarity AM1:
• TLSH, ssdeep and sdhash
• Dr. Ren's talk at DFRWS EU 2015 described eight approaches to
solving this problem.
– We designed a novel similarity digesting scheme, TSFP.
– Approximate containment AM2: two heuristic algorithms
– Approximate cross-sharing AM3: one heuristic algorithm
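For the containment case EM2, the slide names the Rabin-Karp algorithm. A minimal textbook-style sketch follows; the function name `rabin_karp` and the `base` and `mod` defaults are illustrative choices, not values taken from the talk.

```python
def rabin_karp(pattern: bytes, text: bytes,
               base: int = 257, mod: int = (1 << 61) - 1) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the byte that leaves the window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        w_hash = (w_hash * base + text[i]) % mod
    for i in range(n - m + 1):
        # Compare bytes only when the rolling hashes agree (guards collisions).
        if w_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Slide the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - text[i] * high) * base + text[i + m]) % mod
    return -1
```

Containment in either direction can then be tested as `rabin_karp(R, T) >= 0 or rabin_karp(T, R) >= 0`.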
Searching
• For the relationship BR, the searching problem is:
– B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.
Searching
• How do we solve the searching problem?
– Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can
this scale to millions or billions of strings?
– Candidate selection approach: a two-step approach
• STEP 1: quickly select a few candidates {s1, s2,…,sm}.
• STEP 2: evaluate each BR(T, sk).
– How to select good candidates?
• String fingerprinting: generate fingerprints from each string in B.
• Indexing process: index the fingerprints along with the string IDs to create
an index database, FP-DB.
• Searching process: given T, generate fingerprints {FP1, FP2,…,FPq} and
use them to look up possible candidates in FP-DB.
– NOTE:
• This is similar to a keyword-based search engine where the keywords are
the fingerprints.
• The fingerprinting procedure is effectively a special tokenization method.
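The two-step candidate selection can be sketched as an inverted index from fingerprints to string IDs. This is a minimal illustration under assumptions: every n-byte window is used as a fingerprint, which is a stand-in for whatever fingerprinting the authors use, and the class `FingerprintIndex` and its methods are hypothetical names. The `br` argument is any caller-supplied predicate.

```python
from collections import Counter, defaultdict


def fingerprints(data: bytes, n: int = 8) -> set:
    """Hypothetical fingerprinting: every n-byte window is one fingerprint."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class FingerprintIndex:
    """In-memory stand-in for FP-DB: fingerprint -> IDs of strings containing it."""

    def __init__(self, n: int = 8):
        self.n = n
        self.postings = defaultdict(set)
        self.strings = {}

    def add(self, string_id, data: bytes) -> None:
        self.strings[string_id] = data
        for fp in fingerprints(data, self.n):
            self.postings[fp].add(string_id)

    def candidates(self, query: bytes, limit: int = 10) -> list:
        """STEP 1: rank stored strings by the number of shared fingerprints."""
        counts = Counter()
        for fp in fingerprints(query, self.n):
            for string_id in self.postings.get(fp, ()):
                counts[string_id] += 1
        return [string_id for string_id, _ in counts.most_common(limit)]

    def search(self, query: bytes, br, limit: int = 10) -> list:
        """STEP 2: evaluate BR only on the selected candidates."""
        return [sid for sid in self.candidates(query, limit)
                if br(query, self.strings[sid])]
```

Only strings that share at least one fingerprint with the query reach the expensive BR evaluation, which is what lets the search avoid a full scan of B.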
Future Work: Clustering Problem
• For the relationship BR, there is a clustering problem:
– B is a bag of strings; partition B into groups of strings based on BR.
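The slides leave the grouping rule open. One possible reading, shown here purely as an assumption, treats BR as the edges of a graph and takes connected components as the groups, using a small union-find. The function name `cluster` is hypothetical, and the pairwise loop is quadratic in the size of B, so it only illustrates the problem rather than a scalable solution.

```python
def cluster(bag: list, br) -> list:
    """Partition bag into connected components of the BR graph (assumed rule)."""
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if br(bag[i], bag[j]):
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, item in enumerate(bag):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())
```

Because BR need not be transitive, two strings can land in the same group through a chain of relevant pairs even when BR is 0 for them directly.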
Future Work: Library and tools
• Analyze algorithms and measure performance.
– Verify they can scale.
• For bytewise approximate matching, searching and clustering, provide:
– Library of functions
– API
– Tools
Application examples of Approximate
Matching, Searching, Clustering
• E-Discovery
– Comparing near-duplicate documents
– Grouping near-duplicate documents
• Digital forensic analysis
– Identifying similar objects or files
• Malware analysis
– Identifying similar malware or mutated malware
• Anti-plagiarism
– Detection of copyright violations
• Source code governance
• Spam filtering
• Data Loss Prevention
Q&A
• Thank you.
• Any questions?
• Email:
– liwei_ren@trendmicro.com
– ray_cheng@trendmicro.com
Application Example
• A search problem in a DLP (Data Loss Prevention) system:
– Problem: S = {d1, d2,…, dn} is a collection of confidential documents.
Given any document T and 0 < δ ≤ 1, find a document d ∊ S such that
RLV(d,T) ≥ δ.
• RLV is a function that measures the relevance of two documents.
• Challenges: how to construct RLV and choose δ? How to make the search scalable?
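The threshold search can be made concrete with a stand-in relevance function. The sketch below assumes RLV(d, T) is the fraction of d's k-byte windows that also occur in T, which is only one possible construction; `rlv`, `find_relevant`, and the parameter `k` are hypothetical names. It scans every document, so it shows the problem statement rather than a scalable search.

```python
def kgrams(data: bytes, k: int) -> set:
    """All distinct k-byte windows of data."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def rlv(d: bytes, t: bytes, k: int = 5) -> float:
    """Assumed relevance: fraction of d's k-grams that also occur in t."""
    d_grams = kgrams(d, k)
    if not d_grams:
        return 0.0
    return len(d_grams & kgrams(t, k)) / len(d_grams)


def find_relevant(docs: dict, t: bytes, delta: float, k: int = 5) -> list:
    """Names of all confidential documents d with RLV(d, t) >= delta."""
    if not 0 < delta <= 1:
        raise ValueError("delta must satisfy 0 < delta <= 1")
    return [name for name, d in docs.items() if rlv(d, t, k) >= delta]
```

With this construction, a higher δ demands that more of the confidential document appear in T before it is reported.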
Application Example
• A clustering problem in e-Discovery:
– Data are identified as potentially relevant by attorneys
– De-duplication technology.
– Problem: partition the data set S into groups based on textual relevance.
Background
• Similarity digesting schemes:
– A family of similarity-preserving hashing techniques and tools
– Problem: Given two binary strings s1 and s2, measure their similarity
as s = SIM(H(s1), H(s2)).
• H is a hash function that preserves string similarity.
• SIM is a second function that measures the similarity of two hash values.
– Examples: TLSH, ssdeep, sdhash
– Challenge: how to evaluate their relative pros and cons?
Six Matching Problems and Bytewise Relevance
• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If
problem X is a special case of problem Y, we denote this as X ↪ Y.
• We have the following relationships:
[Diagram: EM1, EM2, EM3 in one row and AM1, AM2, AM3 in a second row, connected by seven ↪ relations]
