An Approach for Automatic and
Large Scale Image Forensics
Thamme Gowda*#, Kyle Hundman#, and Chris Mattmann*#
* Computer Science Department, University of Southern California, Los Angeles, CA, USA
# Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
Presenter: Paul Ramirez #
OVERVIEW
• Abstract
• Motivation
• Data
• Image Recognition
• Inception Net
• Integration
• Evaluation
• Conclusion
ABSTRACT
• Applications of deep learning-based image recognition in the DARPA
Memex program
• Integration of TensorFlow with Apache Tika for automatic image
forensics
• Evaluation of model performance on a weapons dataset
MOTIVATION
DARPA Memex:
• Monitor online weapons sales in the United States
• Goal 1: Retrieve ads and relevant multimedia such as images and videos
• Goal 2: Forensics
• Classify illegal weapons
• Sale trends
• Goal 3: Discoverable / Searchable
DATA COLLECTION
• Used web crawlers specialized for retrieving data
• Crawlers that can log in to websites and run JavaScript in pages
• Crawlers that can work with the Onion protocol
• Examples: Apache Nutch, Sparkler, Scrapy, … by various teams
• Large repository of web pages and multimedia documents
• 1.4 M images from weapons domain
IMAGE RECOGNITION TASK
• Image Recognition: Detect real-world entities in digital images
• ImageNet dataset:
• Large visual dataset of annotated images
• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Annual competition organized by Stanford and Princeton Universities
• Challenge: How accurately can your model identify 1000 classes?
• Held from 2010 to now
• Since 2012, deep ConvNets have dominated the competition
• Go-to place to see state-of-the-art models for image recognition
INCEPTION NET
• Developed by Google Research Team
• Szegedy et al., 2015 - originally GoogLeNet, winner of ILSVRC 2014
• Code-named Inception; multiple versions: V1, V2, V3, V4, ...
• Google open-sourced TensorFlow along with an Inception-V3 model
trained on the ImageNet dataset
• Inception-V3 is optimized to run with less memory and fewer CPU cycles
(e.g., on Android devices)
• We have used Inception-V3 for our forensics
SOFTWARE STACK
• Apache Tika - universal parser for over a thousand file types
• Primarily written in Java; available for free under the Apache License
• Metadata analysis
• Semantic analysis - detect names of people, locations, etc. in text
• And more - OCR in images
• One of the key technologies for content analysis in DARPA Memex
• Has been useful for others too - heard of the Panama Papers?
• TensorFlow
• Written in C++ with Python bindings; available for free under the Apache License
• Developed by Google and now one of the most popular deep learning frameworks
INTEGRATION METHODS
• Challenge:
• Make use of C++/Python code from a Java Client
• Techniques
• Command Line Invocation (CLI)
• Java Native Interface (JNI)
• gRPC Remote Procedure Call (gRPC)
• REpresentational State Transfer (REST) API
• REST API integration worked best of the four
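
The REST approach can be sketched as a long-running model service plus a thin HTTP client. The sketch below is in Python for brevity (in the setup described on this slide, the client side would be Java code in Tika). The `/classify` route, the `topk` parameter, the JSON shape, and `stub_classify` are all hypothetical stand-ins, and the stub returns fixed scores instead of calling TensorFlow.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit


def stub_classify(image_bytes, topk):
    """Hypothetical placeholder for the model call; returns fixed (label, score) pairs."""
    labels = [("rifle", 0.71), ("revolver", 0.12), ("assault rifle", 0.08),
              ("holster", 0.05), ("bow", 0.02)]
    return labels[:topk]


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumed route: POST /classify?topk=N with the raw image bytes as the body.
        url = urlsplit(self.path)
        if url.path != "/classify":
            self.send_error(404)
            return
        topk = int(parse_qs(url.query).get("topk", ["5"])[0])
        length = int(self.headers.get("Content-Length", 0))
        image_bytes = self.rfile.read(length)
        results = stub_classify(image_bytes, topk)
        body = json.dumps(
            {"labels": [{"label": name, "score": score} for name, score in results]}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), ClassifyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def classify_remote(host, port, image_bytes, topk=5):
    """Client side: send image bytes over HTTP and return the parsed label list."""
    conn = HTTPConnection(host, port, timeout=10)
    try:
        conn.request("POST", f"/classify?topk={topk}", body=image_bytes,
                     headers={"Content-Type": "application/octet-stream"})
        response = conn.getresponse()
        return json.loads(response.read())["labels"]
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_server()
    port = server.server_address[1]
    print(classify_remote("127.0.0.1", port, b"not-a-real-image", topk=3))
    server.shutdown()
```

One plausible advantage of this layout, not stated in the slides, is that the service loads the model once and then serves many requests, whereas a command-line invocation would pay the start-up cost for every image.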
RESULTS
• Labeled the 1.4 million images in the Memex Weapons Dataset
• 1000 target classes in the training data (ImageNet)
HANDLING THE SCALE
• 1.4 million images in the dataset
• REST integration took 36 hours on 32 CPU cores; no GPUs were used
• TensorFlow automatically parallelized the load on all CPU cores in a
single node
• Wiki: https://wiki.apache.org/tika/TikaAndVision
• Recent work: a Hadoop/Spark-distributable framework powered
by Deeplearning4j
• https://wiki.apache.org/tika/TikaAndVisionDL4J
• https://github.com/thammegowda/tika-dl4j-spark-imgrec
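
The figures on this slide imply a rough throughput, worked out in the short sketch below. It assumes the 36 hours are wall-clock time and that all 32 cores were busy throughout, neither of which the slides state explicitly; the function names are illustrative.

```python
def images_per_second(num_images, hours):
    """Average throughput over the whole run."""
    return num_images / (hours * 3600)


def core_seconds_per_image(num_images, hours, cores):
    """Average CPU budget per image, assuming every core was busy the whole time."""
    return hours * 3600 * cores / num_images


if __name__ == "__main__":
    rate = images_per_second(1_400_000, 36)
    cost = core_seconds_per_image(1_400_000, 36, 32)
    # prints "10.8 images/s, 2.96 core-seconds per image"
    print(f"{rate:.1f} images/s, {cost:.2f} core-seconds per image")
```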
RIFLES
REVOLVERS
EVALUATION
Our evaluation dataset:
• Consists of gun images
• Law enforcement officers manually
labelled them
Observations:
• Some rifles were mislabeled based on surrounding objects - small size
• The top-5 measure is a reasonable metric
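
For reference, the top-5 measure counts a prediction as correct when the true label appears anywhere among the model's five highest-ranked labels. The sketch below shows one minimal way to compute it; the function name and the sample labels are illustrative and are not taken from the evaluation dataset.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=5):
    """Fraction of images whose true label is among the first k predicted labels.

    ranked_predictions: one list of labels per image, best guess first.
    true_labels: one ground-truth label per image.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need exactly one prediction list per true label")
    if not true_labels:
        return 0.0
    hits = sum(
        1 for predicted, truth in zip(ranked_predictions, true_labels)
        if truth in predicted[:k]
    )
    return hits / len(true_labels)


if __name__ == "__main__":
    # Made-up predictions for three images.
    predictions = [
        ["rifle", "assault rifle", "bow"],
        ["holster", "revolver", "rifle"],
        ["tripod", "bow", "broom"],
    ]
    truths = ["rifle", "revolver", "rifle"]
    # prints "top-1: 0.33, top-5: 0.67"
    print(f"top-1: {top_k_accuracy(predictions, truths, k=1):.2f}, "
          f"top-5: {top_k_accuracy(predictions, truths, k=5):.2f}")
```

In this made-up sample the second image is a miss under top-1 but a hit under top-5, because the correct label is ranked second.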
CONCLUSION
• We have made image recognition easy for Apache Tika users
• Our tests showed that the Inception-V3 model was successful in detecting
weapon images
• Image labels helped to build a better web page classifier for Memex
ACKNOWLEDGEMENT:
This effort was supported in part by JPL, managed by the California Institute of Technology on behalf of NASA, and
additionally in part by the DARPA Memex/XDATA/D3M programs. NSF award numbers ICER-1639753,
PLR-1348450 and PLR-144562 funded a portion of the work.
THANKS
Full Paper: https://memex.jpl.nasa.gov/MFSEC17.pdf

Large Scale Image Forensics using Tika and Tensorflow [ICMR MFSec 2017]
