datHere Monthly Webinar
Episode 1
Oct 2024
Data Infrastructure Engineering
Helping Open Data
since 2011
● Deployed ~100 CKAN portals in the US
● Helped fund & develop several CKAN
improvements & extensions
○ Open by default
○ “Enlightened self-interest”
● A few dozen migrations from other portals
(old CKAN sites & proprietary platforms)
● Helped firms integrate CKAN into their
solution stack
● Delivered Training
● Attended & Presented at data conferences
around the world
The Problem with Data Portals - PUBLIC (FINAL).pdf
A Data Portal is the tip of
a Data Governance Iceberg
TLDR
[Diagram labels: Data Portal; Opening Data Inside; Data-Driven Culture; compiling/curating Metadata; Central Source of Metadata; Your Enterprise Data; External Data; Internal Data Portal]
“…the Civic Analytics Network (CAN) offers the
following eight guidelines that, if followed, would
advance the capabilities of government data portals
across the board and help deliver upon the promise
of a transparent government.”
An Open Letter to
the Open Data
Community
Civic Analytics Network
Mar 2017
1. Improve accessibility and usability to
engage a wider audience
2. Move away from a single-dataset-
centric view
3. Treat geospatial data as a first-class
datatype
4. Improve management & usability of
metadata
5. Decrease the cost & work required to
publish data
6. Introduce revision history
7. Improve management of large
datasets
8. Set clear, transparent pricing based
on memory, not number of datasets
An Open Letter to
the Open Data
Community
ONE YEAR LATER
Civic Analytics Network
June 2018
● Acknowledged responses from
Vendors
● Called out several CAN open
data projects, experiments &
accomplishments across the
country
● Called for continued
engagement
“CAN’s call for open communication, shared
learning, and partnership remains open to vendors
and civic technologists alike and we look forward to
continuing our work to help grow and expand the
open data community and practices.”
Data is
Infrastructure
CKAN Association’s Response
Sep 2018
● Detailed response to all eight
guidelines
● Examples from across the
entire CKAN ecosystem
around the world
● Called out CKAN’s extensibility
and its catalog of third-party
extensions
● Confirmed that CKAN
service providers do not
practice “nickel-and-diming”
and we all lived happily ever after…
NOT!
The Problem with
Data Portals
what Sami & Joel learned the
hard way since 2011
1. Data Quality - or the Lack of It
2. It’s not FAIR!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
4. Raw Data, not Answers
5. User Experience is King!
6. A Data Portal is just the tip of a
Data Governance Iceberg
7. You need to “Open Data Inside”
8. Practical Data Wrangling required
9. Best-of-Breed is the Way
10. You need to “Humanize the Data”
Inside & Out
The Problem with Data Portals
1. Data Quality - or the Lack of It
a. Data ALWAYS needs to be
“massaged”
i. To remove PII
ii. To remove other sensitive data
iii. Join/Enrich with other data
iv. Fat-finger mistakes
b. Excel is the Duct Tape of Data
c. …and the bane of Open Data!
d. and PDFs!?!
i. Painful Document Format
ii. Practically Data Free
iii. Persistent Data Fortress
2. It’s not FAIR!
a. Findable
b. Accessible
c. Interoperable
d. Reusable
e. …but DCAT 3 is here!!!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
a. the “Metadata Tip of the Iceberg”
b. The “public” part of your
Data Management Initiative
c. You need to “Open (as a verb) Data
Inside” (see 7)
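Item 1a above lists removing PII as a routine "massaging" step. A minimal sketch of what a first-pass screen might look like, assuming plain regular expressions for emails and US-style SSNs and phone numbers. The patterns, the `scan_for_pii` name and the sample data are illustrative assumptions, not taken from any tool mentioned in this deck, and real screening needs far broader rules.

```python
import csv
import io
import re

# Illustrative patterns only (assumptions): real PII screening needs many more rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_for_pii(csv_text):
    """Return (row_number, column, pii_kind) for every cell matching a pattern."""
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if value and pattern.search(value):
                    hits.append((row_number, column, kind))
    return hits


# Made-up sample data for the sketch.
SAMPLE = (
    "name,contact,notes\n"
    "Ann,ann@example.org,called 212-555-0100\n"
    "Bob,n/a,SSN 123-45-6789 on file\n"
)

if __name__ == "__main__":
    for hit in scan_for_pii(SAMPLE):
        print(hit)
```

Reporting the row and column of each hit, rather than silently deleting values, leaves the decision to redact with the data owner.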
The Problem with Data Portals
4. Raw Data, not Answers
a. Mostly Raw Data
b. Lack of High-Quality Metadata
i. Low Resolution
metadata about data files, not the
data inside the files
ii. Primitive Data Dictionary
1. No Summary Statistics
2. No Frequency Tables
3. No Links to Related Data
iii. Metadata has to be manually
compiled
c. It’s still mainly Keyword Search
d. No Natural Language search
No Answering People Interface
5. User Experience is NOT King!
a. Current Data Publisher UX does not incentivize timely
updates, exacerbating Data/Metadata Quality issues
b. Current Discoverability UX - for users to search &
explore the Catalog, is dated
c. Make it easy so that Data Publishers WANT TO update
the Data/Metadata
6. A Data Portal is the tip of a Data Governance Iceberg
a. The right DMS should enable your
Data Governance Strategy
b. It should be Data Infrastructure You Can Build On
(DIY-CBO)
c. And as such, it NEEDS to be
standards-based, if not an open-source platform
d. Platform = A mature & robust API
e. Something that can integrate and interoperate
with your existing tooling, systems & data sources
f. The portal is fed by Opening Data Inside (see 7)
The Problem with Data Portals
7. You first need to “Open Data Inside”
a. To promote a Data-Driven Culture
b. Culture = Process over Time
c. Culture eats Strategy for Breakfast
d. You need to make it Useful, Usable & Used
for internal folks first…
e. “Opening Data Inside” makes it easier for
them to do their day-to-day work (see 5c), and
f. High Quality Open Data naturally follows…
8. Practical Data Wrangling Required
a. On the Desktop w/o specialized skills
b. “Excel”-like, GUI anyone can use
c. It needs to be fast so folks can do “what-if”,
iterative data-wrangling
d. Desktop Data Wrangling deployable as a
production data pipeline
9. Best-of-Breed is the Way
a. DMS Core Competency
Metadata Catalog w/ a mature, robust API
b. No lock-ins! Interoperate! (see 2c)
c. Do not reinvent the wheel. Focus on 9a.
d. “Not Invented Here” not welcome here
e. Don’t try to build a ___ wanna-be, use ___
(fill in the blanks: Tableau, Power BI, etc.)
f. Prefer open source when possible
(e.g. Apache Superset instead of Tableau)
10. You need to “Humanize the Data” Inside & Out!
a. Incentivize Data Owners to share their Data and curate
the Metadata in the DMS, as doing so makes their
day-to-day work easier
b. Answering People Interface (API) (see 4d)
c. Connect with other Humans!
Other communities, vendors, users, instances, data
owners, standards bodies, etc. in the Ecosystem
d. Data-driven Storytelling
e. Cultivate a Data-driven Culture
Humanize the Data - The Product is a Civic Data Ecosystem
Pathways to Enable Open-Source Ecosystems
● NSF initiative that “aims to harness the power
of open-source development for the creation of
new technology solutions to problems of
national and societal importance.”
● Phase I “discovery grant” awarded in 2023 to
University of Pittsburgh & datHere
● Phase II “implementation grant” awarded in
August 2024!
● Currently spinning up…
● https://civicdataecosystems.org
● “The Product is the Ecosystem” blogpost
Building a Data-Driven Culture
From the TOP DOWN
From the BOTTOM UP
CULTURE = Process Over Time
DATA MANAGEMENT STRATEGY
DIRECTIVES
INCENTIVES
Humanizing the Data
● a Data-Driven Culture takes Time
● is a top-down, bottom-up initiative
● “Opening the Data Inside”
○ Creates a Virtuous Cycle
balancing Directives with Incentives
○ Helps Internal Staff with their
day-to-day data needs so they WANT to
open data (as a verb)
○ Opening Data inside includes
internal data that is not meant for
public use
○ High Quality Open Data (as a noun)
is a natural by-product
Humanizing the Data is
Pragmatic Data Governance
Culture Eats Strategy for Breakfast
[Diagram labels: Data Portal; Opening Data Inside; Data-Driven Culture; compiling/curating Metadata; Central Source of Metadata; Your Enterprise Data; External Data; Internal Data Portal]
“Our” Solution
We needed a
“Data Wrangler”
● Works with a universal data format
● Cross-platform
● Fast, blazing Fast!
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate
into mission-critical data pipelines
qsv/qsv pro
Origin Story
It all started with a failed pilot
with a Hedge Fund to build an
Internal Data Portal in 2020
● datHere - new startup during COVID
● Data Portals! Anybody? Anybody?
● Nice! A Hedge Fund wants to try CKAN!
● An Internal Data Catalog Pilot -
populated with latest metadata from
vast data holdings, updated daily
● Central source of Truth for Metadata
● And we have to auto-infer the metadata
● Traditional metadata inferencing
pipeline (csvkit, pandas, numpy) was
too slow
● Forked xsv to start qsv…
qsv “Data Wrangler” Goals
● Works with a universal data format
CSV, Excel, JSON, JSONL, PostgreSQL, SQLite, Parquet,
Data Package, AVRO & recognizes 130 file formats
● Cross-platform
Linux, macOS & Windows
● Fast! Blazing Fast!!!
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate into
mission-critical data pipelines
How fast is Blazing fast? (v0.137.0)
For a 1-million-row sample of NYC’s 311 data (41 columns, 520 MB):
● 19 “streaming” summary statistics in 0.233 secs
● 18 more stats (total 37) & infer dates (19 formats recognized) in 1.305 secs
● Frequency table in 1.045 secs
● Count rows in 0.009 secs
● Validate against RFC 4180 CSV standard in 0.523 secs
● Validate against a JSON Schema in 3.094 secs
● Run a simple SQL query in 1.053 secs, a SQL aggregation in 1.058 secs & a
very inefficient SQL aggregation in 0.928 secs
● Reverse geocode WGS84 coordinates against GeoNames in 3.782 secs
● And more…
https://qsv.dathere.com/benchmarks
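The "streaming" statistics in the first bullet can be read as statistics computed in a single pass with constant memory. The deck does not describe qsv's internals, so the sketch below swaps in a standard technique, Welford's online update, purely to illustrate the idea. The `StreamingStats` class, the `column_stats` helper and the sample data are hypothetical names and values introduced for this example.

```python
import csv
import io
import math


class StreamingStats:
    """One-pass count/min/max/mean/stddev using Welford's update (illustrative)."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


def column_stats(csv_text, column):
    """Stream one numeric column of a CSV through StreamingStats."""
    stats = StreamingStats()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            stats.update(float(row[column]))
        except ValueError:
            continue  # skip empty or non-numeric cells
    return stats


# Made-up sample data; the last row has an empty value on purpose.
SAMPLE = "id,value\n1,2\n2,4\n3,4\n4,4\n5,5\n6,5\n7,7\n8,9\n9,\n"

if __name__ == "__main__":
    s = column_stats(SAMPLE, "value")
    print(s.count, s.minimum, s.maximum, s.mean, s.stddev)
```

Because each row is folded into a handful of running values and then discarded, memory use stays flat no matter how many rows the file has.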
comprehensive summary stats in 1.305 seconds!
answered in 0.928 seconds!
How is it so
Fast?
by standing on the
Shoulders of Giants &
The Ecosystem
● Rust
● Mem-mapped, Multi-threaded, Multi-I/O
● Advanced CPU features
● High performance libraries
● Performance architecture
○ Indexed access
○ Various caching techniques
○ Performance oriented memory allocator
● Built on a solid foundation (xsv)
● Polars Dataframes Engine
● Vibrant Rust & Polars Ecosystems
Why the
Obsessive
Need for Speed?
What does it unlock?
● Big Data is getting Bigger
● Embedding into other Systems (DP+)
● Quicker Data Investigations
● Enables new Data Wrangling Workflows
○ “Automagical Metadata”
Preemptive, near-real time
metadata inferencing
○ Compile Extended Data Dictionaries
○ Interactive, Iterative Data-Wrangling
○ Leverage AI
use RAG techniques to infer additional
extended metadata (describeGPT)
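As a rough illustration of what an extended data dictionary entry could contain, the sketch below infers a type, a null count, a cardinality and a small frequency table for each column of a CSV. It uses only the Python standard library. The `data_dictionary` and `infer_type` names, the output layout and the sample data are assumptions made for this example, not the format produced by qsv or Datapusher+.

```python
import csv
import io
from collections import Counter


def infer_type(values):
    """Guess the narrowest type that fits every non-empty value."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def data_dictionary(csv_text, top_n=3):
    """Build a per-column summary: type, nulls, cardinality, top values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0]) if rows else []
    dictionary = {}
    for column in columns:
        cells = [row[column] for row in rows]
        non_empty = [c for c in cells if c != ""]
        dictionary[column] = {
            "type": infer_type(non_empty),
            "nulls": len(cells) - len(non_empty),
            "cardinality": len(set(non_empty)),
            "top_values": Counter(non_empty).most_common(top_n),
        }
    return dictionary


# Made-up sample data for the sketch.
SAMPLE = (
    "borough,complaints,resolved_hours\n"
    "BROOKLYN,12,3.5\n"
    "QUEENS,7,\n"
    "BROOKLYN,5,10.0\n"
    "BRONX,9,2.25\n"
)

if __name__ == "__main__":
    for name, entry in data_dictionary(SAMPLE).items():
        print(name, entry)
```

This version loads all rows into memory for brevity; a tool aiming at near-real-time inferencing on large files would presumably stream instead.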
Datapusher+
Embedded use case
● Next-gen CKAN Data Ingestion
● Guaranteed Data Type inferences
● Data Validation / Metadata Inferencing
○ Dedupe
○ PII screening
○ As context for AI - “describeGPT”
○ Extended Data Dictionary
○ Pre-calculate metadata
(spatial extent, date range for
time-series data, etc.)
○ Pre-populate DCAT 3 recommended
metadata fields
○ Data Enrichment
https://ckan.org/events/ckan-datapusher-plus-automagical-metadata
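To make "pre-calculate metadata" and "pre-populate DCAT 3 recommended metadata fields" concrete, here is a hedged sketch that derives a bounding box and a date range from a CSV and maps them onto DCAT-style properties (`dct:spatial` with `dcat:bbox`, `dct:temporal` with `dcat:startDate` and `dcat:endDate`). The function names, column names and sample data are assumptions, the JSON-LD context is omitted, and this is not a description of how Datapusher+ does it.

```python
import csv
import io
import json


def infer_extent_metadata(csv_text, lat_col, lon_col, date_col):
    """Derive a bounding box and a date range from the rows of a CSV."""
    lats, lons, dates = [], [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lats.append(float(row[lat_col]))
        lons.append(float(row[lon_col]))
        dates.append(row[date_col])  # assumes ISO 8601 dates, which sort as strings
    return {
        "bbox": [min(lons), min(lats), max(lons), max(lats)],  # west, south, east, north
        "start": min(dates),
        "end": max(dates),
    }


def to_dcat_fragment(title, extent):
    """Map the inferred values onto DCAT / Dublin Core style properties."""
    west, south, east, north = extent["bbox"]
    wkt = (
        f"POLYGON(({west} {south},{east} {south},"
        f"{east} {north},{west} {north},{west} {south}))"
    )
    return {
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:spatial": {"@type": "dct:Location", "dcat:bbox": wkt},
        "dct:temporal": {
            "@type": "dct:PeriodOfTime",
            "dcat:startDate": extent["start"],
            "dcat:endDate": extent["end"],
        },
    }


# Made-up sample data for the sketch.
SAMPLE = (
    "lat,lon,created\n"
    "40.7,-74.0,2024-01-05\n"
    "40.6,-73.9,2024-03-01\n"
    "40.8,-73.95,2023-12-31\n"
)

if __name__ == "__main__":
    extent = infer_extent_metadata(SAMPLE, "lat", "lon", "created")
    print(json.dumps(to_dcat_fragment("Sample service requests", extent), indent=2))
```

Computing these values at ingestion time means the publisher does not have to type them in by hand, which is the point of the "automagical" framing above.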
Data that is
Useful,
Usable &
Used
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
We have a solution for this with DP+ & qsv
But what about actually Using the Data
to gain Actionable Insight,
to drive Evidence-based Decisions?
qsv pro
Cross-Platform Desktop
Data-Wrangling & Query tool
for the Rest of Us
● OpenRefine + Excel + qsv + CKAN +
recipes + High Value Curated Data =
qsv pro
● Familiar spreadsheet interface
● No need to know complex Command
Line Interface (CLI) commands
● FAST! Blazing Fast!
● Interactive Data Wrangling
● Recipes! (desktop ETL)
● Integration with datHere’s upcoming
cloud-based services
○ High Value Data Feeds
○ Data Enrichment
○ Data Normalization
○ Geocoding
● Natural Language Interface
https://qsvpro.dathere.com
● For a Data Analyst Audience
● You don’t need to be a Developer
● Use ready-made Recipes for common
tasks (e.g. Scan for PII, geocode,
deduplicate records, etc.)
● Create/modify/combine Recipes using
either Luau or Python
● Share your Recipes on the
datHere Recipe Catalog
● Pre-process security-sensitive data
on your desktop without uploading it
first
● Enrich your data with datHere’s
ever-expanding corpus of
High Value Data like the Census,
Bureau of Labor Statistics, etc.
● Use the “Answering People Interface”
on your data or on data from other CKAN portals
● Upload to your CKAN or to datHere’s
Data Catalog to share your data with
the world!
Cross-platform Desktop
Data Wrangling & Query tool
for the Rest of Us
Analyzed 50k rows,
compiling stats and
frequency tables instantly!
Ever-expanding Data-Wrangling
Recipe Library
Directly upload to any CKAN
running v2.9 and above!
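For readers wondering what an upload to CKAN involves at the API level, the sketch below builds (without sending) a request to CKAN's Action API `package_create` endpoint, with the API token passed in the `Authorization` header. The portal URL, token, dataset values and the `build_package_create_request` helper are hypothetical, and this is not a description of how qsv pro performs its upload.

```python
import json
import urllib.request


def build_package_create_request(site_url, api_token, name, title, notes=""):
    """Build (but do not send) a CKAN Action API request that creates a dataset."""
    payload = {"name": name, "title": title, "notes": notes}
    return urllib.request.Request(
        url=site_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_package_create_request(
        "https://ckan.example.org",  # hypothetical portal URL
        "YOUR-API-TOKEN",            # placeholder, not a real token
        name="nyc-311-sample",
        title="NYC 311 sample",
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Depending on how a portal is configured, additional fields such as an owning organization may be required before the dataset is accepted.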
Ran SQL query in 1139ms!
Natural language query, along
with summary stats, frequency &
metadata sent to preferred LLM…
… an LLM we prompt to create a
SQL query based on the Natural
Language query & the context we
provided
Reproducible, hallucination-free
answers
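The flow described above (question plus stats and metadata go to an LLM, SQL comes back, the SQL is run against the data) can be sketched as follows. The LLM call is replaced by a stub, `fake_llm`, that returns a canned query. The table, columns, prompt layout and all function names are assumptions for illustration, not qsv pro's actual prompt or API.

```python
import sqlite3


def build_prompt(question, table, columns, stats):
    """Assemble the context an LLM would receive: schema plus summary stats, no raw rows."""
    lines = [f"Table: {table}", "Columns: " + ", ".join(columns), "Summary stats:"]
    lines += [f"  {name}: {value}" for name, value in stats.items()]
    lines.append(f"Question: {question}")
    lines.append("Reply with a single SQL query.")
    return "\n".join(lines)


def fake_llm(prompt):
    """Stand-in for a real LLM call (hypothetical); always returns the same query."""
    return (
        "SELECT borough, COUNT(*) AS n FROM complaints "
        "GROUP BY borough ORDER BY n DESC, borough LIMIT 1"
    )


def demo_connection():
    """In-memory SQLite database with made-up sample rows."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE complaints (borough TEXT, complaint_type TEXT)")
    connection.executemany(
        "INSERT INTO complaints VALUES (?, ?)",
        [("BROOKLYN", "Noise"), ("QUEENS", "Heat"), ("BROOKLYN", "Heat"), ("BRONX", "Noise")],
    )
    return connection


def answer(question, connection):
    prompt = build_prompt(
        question,
        table="complaints",
        columns=["borough", "complaint_type"],
        stats={"row_count": 4, "borough cardinality": 3},
    )
    sql = fake_llm(prompt)
    # The numbers come from executing the SQL, not from text the model wrote,
    # so rerunning the same query on the same data gives the same result.
    return sql, connection.execute(sql).fetchall()


if __name__ == "__main__":
    sql, rows = answer("Which borough has the most complaints?", demo_connection())
    print(sql)
    print(rows)
```

Keeping the generated SQL alongside the result lets a reviewer inspect and rerun the query, which is one way to read the "reproducible" claim; the model can still write a query that does not match the question, so showing the SQL matters.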
Click to see GIF animation of Excel calling qsv pro API to get summary stats
https://github.com/jqnatividad/qsv/discussions/2221#discussioncomment-11008064
DMS Framework
more than an Open Data Portal application, a
Data Management System Framework
you can build on
● Built around CKAN
● Certified CKAN Extensions
● Bundled with other
Best-of-Breed open source tooling
● Integrated Data Enrichment
● Build DMS applications like
○ Water Data Hubs
○ Open Data Portals
○ Internal Data Exchange
○ Data Library
○ Enterprise Data Catalog
○ and more…
DEMO
Q&A
https://dathere.com/product-demo-request/
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
https://datHere.com
Data Infrastructure Engineering
