Open Source Software for Geospatial
Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist
Background




                                                                 About Me:
                                                                        Data Scientist
                                                                        Natural Language Processing
                                                                        Unstructured Text  Information


                                                                 Berico Technologies:
                                                                        Veteran-owned Small Business
                                                                        Big Data Analytics in the Cloud
                                                                        Defense & Intel Community



All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.   2
The Problem: geotagging unstructured text




     Growing demand for
     geospatial analytics

     Most of human knowledge
     remains “trapped” in text

     Existing solutions are
     expensive and don’t scale

     Need an open source solution



The Solution: an open source geoparser



                                                                     1. Data Ingestion
                                                                                  Input: unstructured text
                                                                     2. Entity Extraction
                                                                                  Named entity recognition
                                                                                  Find location names in text
                                                                     3. Entity Resolution
                                                                                  Match against a gazetteer
                                                                                  “The Springfield Problem”
                                                                     4. Data Enrichment
                                                                                  Output: structured geo data
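The four stages above can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not CLAVIN's implementation or its Java API: the `GAZETTEER` dictionary, the function names, and the capitalized-word extraction are all hypothetical stand-ins (a real system would use a trained named entity recognizer and a full gazetteer).

```python
import re

# Hypothetical mini-gazetteer: place name -> (latitude, longitude, country code).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "Boston": (42.36, -71.06, "US"),
    "Toronto": (43.70, -79.42, "CA"),
    "London": (51.51, -0.13, "GB"),
}


def extract_candidates(text):
    """Stage 2 (entity extraction), crudely approximated as capitalized words."""
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def resolve(name):
    """Stage 3 (entity resolution): exact lookup against the gazetteer."""
    return GAZETTEER.get(name)


def geoparse(text):
    """Stages 1-4: plain text in, structured geo records out."""
    records = []
    for name in extract_candidates(text):
        match = resolve(name)
        if match is not None:
            lat, lon, country = match
            records.append({"name": name, "lat": lat, "lon": lon, "country": country})
    return records


result = geoparse("I traveled to London and Toronto last summer.")
```

Here `result` holds one record each for London and Toronto. Exact lookup cannot tell apart places that share a name, which is the resolution problem the later slides address.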



Data Ingestion: unstructured text




                                                                                                              photo: Flickr user NS Newsflash


Entity Extraction: named entity recognition




Entity Resolution: match against a gazetteer




Data Enrichment: structured geo data




“The Springfield Problem”




Dealing with Ambiguity


     Intelligent Context-based Heuristics
            First: rank by population
            Next: look for other locations mentioned in the same document
                   “Springfield” + “Chicago” = Illinois
                   “Springfield” + “Boston” = Massachusetts
            Soon: calculate distance based on lat/lons
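A minimal sketch of the first two heuristics, assuming a hypothetical candidate table (`CANDIDATES`) with rough, illustrative population figures. Population supplies the default ranking, and a region shared with an unambiguous place in the same document overrides it. This is one plausible reading of the slide, not CLAVIN's actual scoring.

```python
# Hypothetical candidate records; populations are rough figures for illustration.
CANDIDATES = {
    "Springfield": [
        {"name": "Springfield", "admin1": "Missouri", "population": 160000},
        {"name": "Springfield", "admin1": "Massachusetts", "population": 153000},
        {"name": "Springfield", "admin1": "Illinois", "population": 116000},
    ],
    "Chicago": [{"name": "Chicago", "admin1": "Illinois", "population": 2700000}],
    "Boston": [{"name": "Boston", "admin1": "Massachusetts", "population": 620000}],
}


def resolve_all(names):
    """Resolve each name, preferring candidates that share a region with
    another place in the same document, then the larger population."""
    # Regions suggested by unambiguous names (exactly one candidate).
    context = {
        CANDIDATES[n][0]["admin1"]
        for n in names
        if len(CANDIDATES.get(n, [])) == 1
    }
    resolved = {}
    for n in names:
        options = CANDIDATES.get(n, [])
        if not options:
            continue
        # Tuples compare element by element: context match first, then population.
        resolved[n] = max(
            options,
            key=lambda c: (c["admin1"] in context, c["population"]),
        )
    return resolved


with_chicago = resolve_all(["Springfield", "Chicago"])["Springfield"]["admin1"]
with_boston = resolve_all(["Springfield", "Boston"])["Springfield"]["admin1"]
```

With this table, `with_chicago` is "Illinois" and `with_boston` is "Massachusetts". With no other place in the document, the most populous candidate wins.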


     Resolve alternate names to same geospatial entity
            “Ivory Coast” = “Côte d’Ivoire”


     Use fuzzy matching to capture misspelled place names
            Including both phonetic spelling & typographical errors
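Alternate-name resolution and typo-tolerant matching can be sketched with Python's standard `difflib` module. The `ALTERNATE_NAMES` table and the `canonical_place` function are hypothetical names for this example. Only edit-similarity matching is shown; the phonetic matching the slide mentions would need an additional step, such as comparing sound-based keys.

```python
import difflib

# Hypothetical alternate-name table: every known spelling maps to one canonical entry.
ALTERNATE_NAMES = {
    "ivory coast": "Côte d’Ivoire",
    "côte d’ivoire": "Côte d’Ivoire",
    "cote d'ivoire": "Côte d’Ivoire",
    "massachusetts": "Massachusetts",
    "illinois": "Illinois",
}


def canonical_place(name, cutoff=0.8):
    """Return the canonical place for a possibly misspelled name, or None."""
    key = name.strip().lower()
    if key in ALTERNATE_NAMES:
        return ALTERNATE_NAMES[key]
    # Fall back to similarity matching to catch typographical errors.
    close = difflib.get_close_matches(key, list(ALTERNATE_NAMES), n=1, cutoff=cutoff)
    return ALTERNATE_NAMES[close[0]] if close else None


alias = canonical_place("Ivory Coast")
typo = canonical_place("Massachusets")
```

The `cutoff` value is a tuning assumption: lowering it catches more misspellings at the cost of more false matches.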

CLAVIN: an open source geoparser




System Architecture




Live Demonstration




Live Demonstration




                                              What can I do
                                              with this data?




Map Visualizations




Hierarchical Geospatial Search




                                                     Virginia

                                                               Reston           Arlington
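The Virginia example suggests a search that also returns documents mentioning anything inside the requested region. A minimal sketch, assuming a hypothetical child-to-parent table (`PARENT`) and a hypothetical per-document index of resolved places (`DOC_PLACES`); the "United States" level and the document ids are invented for the example.

```python
# Hypothetical containment hierarchy: child place -> parent place.
PARENT = {
    "Reston": "Virginia",
    "Arlington": "Virginia",
    "Virginia": "United States",
}

# Hypothetical index: document id -> places resolved in that document.
DOC_PLACES = {
    "doc1": ["Reston"],
    "doc2": ["Arlington"],
    "doc3": ["Virginia"],
    "doc4": ["Chicago"],
}


def ancestors(place):
    """Yield the place itself, then each enclosing region."""
    while place is not None:
        yield place
        place = PARENT.get(place)


def search(region):
    """Return ids of documents mentioning the region or anything inside it."""
    return sorted(
        doc_id
        for doc_id, places in DOC_PLACES.items()
        if any(region in ancestors(p) for p in places)
    )


virginia_docs = search("Virginia")
```

Searching for "Virginia" returns the documents that mention Reston or Arlington as well as Virginia itself, while searching for "Reston" returns only the first.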




Geospatial Bounding Box Search
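Once place names carry coordinates, a bounding box query reduces to two range checks. The sketch below is an assumption about how such a filter might look, not the slide's implementation: the `BoundingBox` type, the box corners, and the approximate coordinates are all chosen for the example, and boxes crossing the antimeridian are not handled.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (no antimeridian handling)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Hypothetical resolved places; coordinates are approximate.
PLACES = [
    ("Reston", 38.96, -77.35),
    ("Arlington", 38.88, -77.10),
    ("Chicago", 41.88, -87.63),
]

# A rough box around northern Virginia, chosen for the example.
box = BoundingBox(south=38.5, west=-78.0, north=39.5, east=-76.5)
inside = [name for name, lat, lon in PLACES if box.contains(lat, lon)]
```

For a large collection, a linear scan like this would normally be replaced by a spatial index, but the containment test stays the same.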




Geospatial Analytics on Unstructured Text




Performance Metrics & Features


     CLAVIN: Cartographic Location And Vicinity INdexer

     Accurate: 0.75 F-measure
     Fast: 100 locations per sec per CPU
     Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster
     Smart: natural language processing, context-based heuristics, & fuzzy matching
     Easy to use: simple Java-based API
     Open source: Apache License
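For readers unfamiliar with the accuracy metric: F-measure combines precision P and recall R into a single score. The slide does not say which variant was used or how the evaluation set was built. Assuming the common balanced form (F1), and assuming precision and recall are counted over resolved locations, the definition is:

```latex
F_1 = \frac{2 P R}{P + R},
\qquad
P = \frac{\text{correctly resolved locations}}{\text{locations returned}},
\qquad
R = \frac{\text{correctly resolved locations}}{\text{locations in the reference annotations}}
```

Under that assumption, a score of 0.75 would result if precision and recall were both 0.75, although other combinations give the same value and the slide does not report them separately.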
clavin.bericotechnologies.com
                                                                                    Charlie Greenbacker
                                                                                          @greenbacker
                                 meetup.com/DC-NLP
                                 @DCNLP



Editor's Notes

  • #3: “Berico specializes in building open source software to support analytic missions, and implementing them through our services.” “We help our customers optimize the use of open source solutions for Cloud environments to replace the functionality of traditionally license-based projects.” “All of our products are built to run on and optimize cloud technologies – specifically HBase or Accumulo. We are the first authorized Cloudera partner in the federal sector.” “CLAVIN is one of 7 open source products that we’ve built and implemented with customers in the DoD and IC. We’ve chosen CLAVIN as an example to walk through today to illustrate how Berico’s open source products deliver great, market-leading functionality with no licensing constraints, and at a fraction of the cost of proprietary tools in the market.” (an infinite fraction – it’s free)
  • #11: Paris, France > Paris, Texas
  • #14: The interactive live demo will be run offline from the presenter’s laptop. The CLAVIN demo interface accepts plain text as input, and returns a list of geospatial entities (with lat/lons, etc.) corresponding to the place names extracted and resolved from the text, along with a visualization plotting these locations on a map. The example text used in the demo may include the following:
      • the sample text file built into the CLAVIN demo interface
      • “Grover Cleveland was the 22nd president of the United States. He never went to Cuba.” (shows that CLAVIN knows “Grover Cleveland” is not a city in Ohio)
      • “I was born in Boston and grew up in Springfield.” (produces a map of Massachusetts)
      • “I was born in Chicago and grew up in Springfield.” (produces a map of Illinois)
      • “I traveled to London and Oxford last summer.” (produces a map of England)
      • “I traveled to London and Toronto last summer.” (produces a map of Ontario)
      • a random news article from CNN.com (or a similar source)
      • any example text provided by the audience
  • #20: geotag 1M documents containing 5.7M place names in under 1 hour on a 9-node Hadoop cluster, vs. the prohibitively expensive enterprise licenses of competing solutions like MetaCarta
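The figures in the last note can be turned into per-second rates with some quick arithmetic. The variable names below are made up for this sketch. Because the note says "under 1 hour", taking a full hour makes each rate a lower bound.

```python
# Figures quoted in the speaker note.
place_names = 5_700_000
documents = 1_000_000
nodes = 9
seconds = 60 * 60  # "under 1 hour", so the rates below are lower bounds

names_per_doc = place_names / documents
cluster_rate = place_names / seconds  # place names per second, whole cluster
node_rate = cluster_rate / nodes      # place names per second, per node
docs_per_sec = documents / seconds

print(f"{names_per_doc:.1f} place names per document")
print(f"{cluster_rate:.0f} place names/sec across the cluster")
print(f"{node_rate:.0f} place names/sec per node")
print(f"{docs_per_sec:.0f} documents/sec")
```

This works out to about 1,583 place names per second across the cluster, or roughly 176 per node. At the 100 locations per second per CPU quoted on the performance slide, that is a little under two CPUs' worth of work per node, although the source does not state how many cores each node had.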