datHere Monthly Webinar
Episode 1
Oct 2024
Data Infrastructure Engineering
Helping Open Data
since 2011
● Deployed ~100 CKAN portals in the US
● Helped fund & develop several CKAN
improvements & extensions
○ Open by default
○ “Enlightened self-interest”
● Few dozen migrations from other portals
(old CKAN sites & proprietary platforms)
● Helped firms integrate CKAN into their
solution stack
● Delivered Training
● Attended & Presented at data conferences
around the world
The Problem with Data Portals: A Data Portal is just the tip of a Data Governance Iceberg
A Data Portal is the tip of
a Data Governance Iceberg
TLDR
[Diagram labels: Data Portal; Opening Data Inside; Data-Driven Culture; compiling/curating Metadata; Central Source of Metadata; Internal Data Portal; YOUR ENTERPRISE DATA; EXTERNAL DATA]
“…the Civic Analytics Network (CAN) offers the
following eight guidelines that, if followed, would
advance the capabilities of government data portals
across the board and help deliver upon the promise
of a transparent government.”
An Open Letter to
the Open Data
Community
Civic Analytics Network
Mar 2017
1. Improve accessibility and usability to
engage a wider audience
2. Move away from a single dataset
centric view
3. Treat geospatial data as a first class
datatype
4. Improve management & usability of
metadata
5. Decrease the cost & work required to
publish data
6. Introduce revision history
7. Improve management of large
datasets
8. Set clear transparent pricing based
on memory, not number of datasets
An Open Letter to
the Open Data
Community
ONE YEAR LATER
Civic Analytics Network
June 2018
● Acknowledged responses from
Vendors
● Called out several CAN open
data projects, experiments &
accomplishments across the
country
● Called for continued
engagement
“CAN’s call for open communication, shared
learning, and partnership remains open to vendors
and civic technologists alike and we look forward to
continuing our work to help grow and expand the
open data community and practices.”
Data is
Infrastructure
CKAN Association’s Response
Sep 2018
● Detailed response to all eight
guidelines
● Examples from across the
entire CKAN ecosystem
around the world
● Called out CKAN’s extensibility
with its catalog of third-party
extensions
● Confirmed that no CKAN
service provider practices
“nickel-and-diming”
and we all lived happily ever after…
NOT!
The Problem with
Data Portals
what Sami & Joel learned the
hard way since 2011
1. Data Quality - or the Lack of It
2. It’s not FAIR!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
4. Raw Data, not Answers
5. User Experience is King!
6. A Data Portal is just the tip of a
Data Governance Iceberg
7. You need to “Open Data Inside”
8. Practical Data Wrangling required
9. Best-of-Breed is the Way
10. You need to “Humanize the Data”
Inside & Out
The Problem with Data Portals
1. Data Quality - or the Lack of It
a. Data ALWAYS needs to be
“massaged”
i. To remove PII
ii. To remove other sensitive data
iii. To join/enrich with other data
iv. To fix fat-finger mistakes
b. Excel is the Duct Tape of Data
c. …and the bane of Open Data!
d. and PDFs!?!
i. Painful Document Format
ii. Practically Data Free
iii. Persistent Data Fortress
2. It’s not FAIR!
a. Findable
b. Accessible
c. Interoperable
d. Reusable
e. …but DCAT 3 is here!!!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
a. the “Metadata Tip of the Iceberg”
b. The “public” part of your
Data Management Initiative
c. You need to “Open (as a verb) Data
Inside” (see 7)
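As a rough illustration of what standards-based, FAIR-friendly metadata can look like, the sketch below builds a minimal DCAT dataset record as JSON-LD. The property names (`dcat:Dataset`, `dct:title`, `dcat:keyword`, `dcat:distribution`, `dcat:downloadURL`, `dcat:mediaType`) come from the W3C DCAT vocabulary; the function name, the sample values and the choice of fields are assumptions made for this example, not a complete DCAT 3 profile.

```python
import json


def dcat_dataset(title, description, keywords, download_url, media_type_uri):
    """Build a minimal DCAT dataset description as a JSON-LD dictionary."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:keyword": list(keywords),
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcat:downloadURL": {"@id": download_url},
                "dcat:mediaType": {"@id": media_type_uri},
            }
        ],
    }


# Hypothetical sample values, for illustration only.
record = dcat_dataset(
    title="311 Service Requests (sample)",
    description="One million row sample of 311 service requests.",
    keywords=["311", "service requests"],
    download_url="https://data.example.org/311-sample.csv",
    media_type_uri="https://www.iana.org/assignments/media-types/text/csv",
)
print(json.dumps(record, indent=2))
```

A record like this is machine-readable (Findable, Interoperable) because every field maps to a shared vocabulary term rather than a portal-specific key.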
The Problem with Data Portals
4. Raw Data, not Answers
a. Mostly Raw Data
b. Lack of High-Quality Metadata
i. Low Resolution
metadata about data files, not the
data inside the files
ii. Primitive Data Dictionary
1. No Summary Statistics
2. No Frequency Tables
3. No Links to Related Data
iii. Metadata has to be manually
compiled
c. It’s still mainly Keyword Search
d. No Natural Language search
No Answering People Interface
5. User Experience is NOT King!
a. Current Data Publisher UX does not incentivize timely
updates, exacerbating Data/Metadata Quality issues
b. Current Discoverability UX, for users to search &
explore the Catalog, is dated
c. Make it easy so that Data Publishers WANT TO update
the Data/Metadata
6. A Data Portal is the tip of a Data Governance Iceberg
a. The right DMS should enable your
Data Governance Strategy
b. It should be Data Infrastructure You Can Build On
(DIY-CBO)
c. And as such, it NEEDS to be
standards-based, if not an open-source platform
d. Platform = A mature & robust API
e. Something that can integrate and interoperate
with your existing tooling, systems & data sources
f. The portal is fed by Opening Data Inside (see 7)
The Problem with Data Portals
7. You first need to “Open Data Inside”
a. To promote a Data-Driven Culture
b. Culture = Process over Time
c. Culture eats Strategy for Breakfast
d. You need to make Data Useful, Usable &
Used for internal folks first…
e. “Opening Data Inside” makes it easier for
them to do their day-to-day work (see 5c), and
f. High Quality Open Data naturally follows…
8. Practical Data Wrangling Required
a. On the Desktop w/o specialized skills
b. “Excel”-like, GUI anyone can use
c. It needs to be fast so folks can do “what-if”,
iterative data-wrangling
d. Desktop Data Wrangling deployable as a
production data pipeline
9. Best-of-Breed is the Way
a. DMS Core Competency
Metadata Catalog w/ a mature, robust API
b. No lock-ins! Interoperate! (see 2c)
c. Do not reinvent the wheel. Focus on 9a.
d. “Not Invented Here” not welcome here
e. Don’t try to build a ___ wanna-be, use ___
(fill in the blanks: Tableau, Power BI, etc.)
f. Prefer open source when possible
(e.g. Apache Superset instead of Tableau)
10. You need to “Humanize the Data” Inside & Out!
a. Incentivize Data Owners to share their Data and curate
the Metadata in the DMS, as doing so makes their
day-to-day work easier
b. Answering People Interface (API) (see 4d)
c. Connect with other Humans!
Other communities, vendors, users, instances, data
owners, standards bodies, etc. in the Ecosystem
d. Data-driven Storytelling
e. Cultivate a Data-driven Culture
Humanize the Data - The Product is a Civic Data Ecosystem
Pathways to Enable Open-Source Ecosystems
● NSF initiative that “aims to harness the power
of open-source development for the creation of
new technology solutions to problems of
national and societal importance.”
● Phase I “discovery grant” awarded in 2023 to
University of Pittsburgh & datHere
● Phase II “implementation grant” awarded in
August 2024!
● Currently spinning up…
● https://civicdataecosystems.org
● “The Product is the Ecosystem” blogpost
Building a Data-Driven Culture
From the TOP DOWN
From the BOTTOM UP
CULTURE = Process Over Time
DATA GOVERNANCE STRATEGY
DIRECTIVES
INCENTIVES
Humanizing the Data
● a Data-Driven Culture takes Time
● is a top-down, bottom-up initiative
● “Opening the Data Inside”
○ Creates a Virtuous Cycle
balancing Directives with Incentives
○ Help Internal Staff with their
day-to-day data needs so they WANT to
open data (as a verb)
○ Opening Data inside includes
internal data that is not meant for
public use
○ High Quality Open Data (as a noun)
is a natural by-product
Humanizing the Data is
Pragmatic Data Governance
Culture Eats Strategy for Breakfast
[Diagram labels: Data Portal; Opening Data Inside; Data-Driven Culture; compiling/curating Metadata; Central Source of Metadata; Internal Data Portal; YOUR ENTERPRISE DATA; EXTERNAL DATA]
“Our” Solution
We needed a
“Data Wrangler”
● Works with a universal data format
● Cross-platform
● Fast, blazing Fast!
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate
into mission-critical data pipelines
qsv/qsv pro
Origin Story
It all started with a failed pilot
with a Hedge Fund to build an
Internal Data Portal in 2020
● datHere - new startup during COVID
● Data Portals! Anybody? Anybody?
● Nice! A Hedge Fund wants to try CKAN!
● An Internal Data Catalog Pilot -
populated with latest metadata from
vast data holdings, updated daily
● Central source of Truth for Metadata
● And we have to auto-infer the metadata
● Traditional metadata inferencing
pipeline (csvkit, pandas, numpy) was
too slow
● Forked xsv to start qsv…
qsv “Data Wrangler” Goals
● Works with a universal data format
● Cross-platform
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate into
mission-critical data pipelines
CSV, Excel, JSON, JSONL,
PostgreSQL, SQLite, Parquet,
Data Package, AVRO &
recognizes 130 file formats
Linux, macOS & Windows
Fast! Blazing Fast!!!
How fast is Blazing fast? (v0.137.0)
For a 1 million row sample of NYC’s 311 data (41 columns, 520 MB):
● 19 “streaming” summary statistics in 0.233 secs
● 18 more stats (total 37) & infer dates (19 formats recognized) in 1.305 secs
● Frequency table in 1.045 secs
● Count rows in 0.009 secs
● Validate against RFC 4180 CSV standard in 0.523 secs
● Validate against a JSON Schema in 3.094 secs
● Run a simple SQL query in 1.053 secs, a SQL aggregation in 1.058 secs & a
very inefficient SQL aggregation in 0.928 secs
● Reverse geocode WGS84 coordinates against GeoNames in 3.782 secs
● And more…
https://qsv.dathere.com/benchmarks
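To make the idea of “streaming” summary statistics concrete, here is a minimal sketch that computes count, min, max, mean and standard deviation in a single pass, without holding the column in memory. It uses Welford's online algorithm, which is a common technique for this and is an assumption here: the slides do not say how qsv computes its statistics. The function name and the sample data are hypothetical.

```python
import csv
import io
import math


def streaming_stats(rows, column):
    """Single-pass count, min, max, mean and sample standard deviation."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    lo = math.inf
    hi = -math.inf
    for row in rows:
        try:
            x = float(row.get(column, ""))
        except (TypeError, ValueError):
            continue  # skip blank and non-numeric cells
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        lo = min(lo, x)
        hi = max(hi, x)
    stddev = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return {"count": count, "min": lo, "max": hi, "mean": mean, "stddev": stddev}


# Hypothetical sample data standing in for a large CSV file.
SAMPLE = "id,response_hours\n1,2\n2,4\n3,\n4,6\n"
stats = streaming_stats(csv.DictReader(io.StringIO(SAMPLE)), "response_hours")
print(stats)
```

Statistics such as median or cardinality need more than one running value per column, which is one plausible reason the slide separates the first 19 statistics from the remaining 18.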
comprehensive summary stats in 1.305 seconds!
answered in 0.928 seconds!
How is it so
Fast?
by standing on the
Shoulders of Giants &
The Ecosystem
● Rust
● Mem-mapped, Multi-threaded, Multi-I/O
● Advanced CPU features
● High performance libraries
● Performance architecture
○ Indexed access
○ Various caching techniques
○ Performance oriented memory allocator
● Built on a solid foundation (xsv)
● Polars Dataframes Engine
● Vibrant Rust & Polars Ecosystems
Why the
Obsessive
Need for Speed?
What does it unlock?
● Big Data is getting Bigger
● Embedding into other Systems (DP+)
● Quicker Data Investigations
● Enables new Data Wrangling Workflows
○ “Automagical Metadata”
Preemptive, near-real-time
metadata inferencing
○ Compile Extended Data Dictionaries
○ Interactive, Iterative Data-Wrangling
○ Leverage AI
use RAG techniques to infer additional
extended metadata (describeGPT)
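An extended data dictionary, in the sense used above, describes the data inside a file and not just the file itself. The sketch below compiles a very small one with Python's standard library: per column, it records the row count, the number of empty cells, the cardinality and a frequency table of the top values. The function name, the output keys and the sample data are assumptions for illustration; this is not qsv's output format.

```python
import csv
import io
from collections import Counter


def data_dictionary(csv_text, top_n=3):
    """Return one entry per column: row count, empties, cardinality, top values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counters = {name: Counter() for name in (reader.fieldnames or [])}
    empties = Counter()
    total = 0
    for row in reader:
        total += 1
        for name in counters:
            value = (row.get(name) or "").strip()
            if value:
                counters[name][value] += 1
            else:
                empties[name] += 1
    return {
        name: {
            "rows": total,
            "empty": empties[name],
            "cardinality": len(freq),
            "top_values": freq.most_common(top_n),  # the frequency table
        }
        for name, freq in counters.items()
    }


# Hypothetical sample data, for illustration only.
SAMPLE = "agency,complaint\nNYPD,Noise\nDOT,Pothole\nNYPD,Noise\nNYPD,\n"
dictionary = data_dictionary(SAMPLE)
for column, entry in dictionary.items():
    print(column, entry)
```

Because the entries are derived from the data, they can be regenerated on every update instead of being compiled by hand.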
Datapusher+
Embedded use case
● Next-gen CKAN Data Ingestion
● Guaranteed Data Type inferences
● Data Validation / Metadata Inferencing
○ Dedupe
○ PII screening
○ As context for AI - “describeGPT”
○ Extended Data Dictionary
○ Pre-calculate metadata
(spatial extent, date range for
time-series data, etc.)
○ Pre-populate DCAT 3
recommended metadata fields
○ Data Enrichment
https://ckan.org/events/ckan-datapusher-plus-automagical-metadata
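The sketch below illustrates two of the ideas listed above: inferring a column's data type by checking every value rather than a sample, and pre-calculating a date range for date columns. Reading “guaranteed” as “based on a full scan” is an assumption, as are the function names, the type labels and the sample data; Datapusher+ itself delegates this work to qsv, whose logic is not shown in the slides. For simplicity the sketch collects each column in memory and only recognizes ISO dates (YYYY-MM-DD).

```python
import csv
import io
from datetime import date


def infer_type(values):
    """Return the narrowest type that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    parsers = (("integer", int), ("float", float), ("date", date.fromisoformat))
    for name, parse in parsers:
        try:
            for v in non_empty:
                parse(v)
            return name  # every value parsed, so the type holds for the column
        except ValueError:
            continue
    return "string"


def infer_metadata(csv_text):
    """Infer a type per column and a date range for date columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append((row.get(name) or "").strip())
    metadata = {}
    for name, values in columns.items():
        entry = {"type": infer_type(values)}
        if entry["type"] == "date":
            dates = [date.fromisoformat(v) for v in values if v]
            entry["date_range"] = (min(dates).isoformat(), max(dates).isoformat())
        metadata[name] = entry
    return metadata


# Hypothetical sample data, for illustration only.
SAMPLE = "id,amount,created\n1,10.5,2024-01-03\n2,7,2024-01-01\n3,,2024-02-10\n"
metadata = infer_metadata(SAMPLE)
print(metadata)
```

Fields such as the date range could then be used to pre-populate temporal coverage in the catalog record instead of asking the publisher to type it in.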
Data that is
Useful,
Usable &
Used
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
We have a solution for this with DP+ & qsv
But what about actually Using the Data
to gain Actionable Insight,
to drive Evidence-based Decisions?
qsv pro
Cross-Platform Desktop
Data-Wrangling & Query tool
for the Rest of Us
● OpenRefine + Excel + qsv + CKAN +
recipes + High Value Curated Data =
qsv pro
● Familiar spreadsheet interface
● No need to know complex Command
Line Interface (CLI) commands
● FAST! Blazing Fast!
● Interactive Data Wrangling
● Recipes! (desktop ETL)
● Integration with datHere’s upcoming
cloud-based services
○ High Value Data Feeds
○ Data Enrichment
○ Data Normalization
○ Geocoding
● Natural Language Interface
https://qsvpro.dathere.com
● For a Data Analyst Audience
● You don’t need to be a Developer
● Use ready-made Recipes for common
tasks (e.g. Scan for PII, geocode,
deduplicate records, etc.)
● Create/modify/combine Recipes using
either Luau or Python
● Share your Recipes on the
datHere Recipe Catalog
● Pre-process security-sensitive data
on your desktop without uploading it
first
● Enrich your data with datHere’s
ever-expanding corpus of
High Value Data like the Census,
Bureau of Labor Statistics, etc.
● Use the “Answering People Interface”
on your data or on data from other CKAN portals
● Upload to your CKAN or to datHere’s
Data Catalog to share your data with
the world!
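To give a feel for what a “Scan for PII” step might do on the desktop, before anything is uploaded, here is a minimal Python sketch that flags columns containing values that look like email addresses, US Social Security numbers or US phone numbers. The patterns, the function name and the sample data are assumptions for illustration; this is not the qsv pro Recipe API, and real PII screening needs far broader rules than three regular expressions.

```python
import csv
import io
import re

# Deliberately simple, illustrative patterns; not an exhaustive PII rule set.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scan_for_pii(csv_text):
    """Return {column: {pattern_name: hit_count}} for cells matching a pattern."""
    findings = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, cell in row.items():
            if not isinstance(cell, str) or not cell:
                continue  # skip empty cells and malformed rows
            for label, pattern in PII_PATTERNS.items():
                if pattern.search(cell):
                    hits = findings.setdefault(column, {})
                    hits[label] = hits.get(label, 0) + 1
    return findings


# Hypothetical sample data, for illustration only.
SAMPLE = (
    "name,contact,notes\n"
    "Ana,ana@example.org,called 212-555-0100\n"
    "Bo,,SSN 123-45-6789 on file\n"
)
findings = scan_for_pii(SAMPLE)
print(findings)
```

The report is per column, so a publisher can decide whether to drop, mask or keep each flagged column before the data leaves the desktop.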
Cross-platform Desktop
Data Wrangling & Query tool
for the Rest of Us
Analyzed 50k rows,
compiling stats and
frequency tables instantly!
Ever-expanding Data-Wrangling
Recipe Library
Directly upload to any CKAN
running v2.9 and above!
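Uploading to CKAN ultimately goes through CKAN's Action API. The sketch below builds, but does not send, a `package_create` request using only the Python standard library. The endpoint path and the `Authorization` header follow CKAN's documented API; the portal URL, the token, the dataset values and the function name are hypothetical, and the slides do not say that qsv pro issues this exact call.

```python
import json
import urllib.request


def build_package_create(site_url, api_token, name, title, notes=""):
    """Build (but do not send) a request that would create a CKAN dataset."""
    payload = {"name": name, "title": title, "notes": notes}
    return urllib.request.Request(
        url=site_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


request = build_package_create(
    "https://ckan.example.org",   # hypothetical portal URL
    "REPLACE-WITH-API-TOKEN",     # hypothetical API token
    name="311-sample",
    title="311 Service Requests (sample)",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Many portals also require an `owner_org` field on new datasets, depending on how the CKAN instance is configured.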
Ran SQL query in 1139ms!
Natural language query, along
with summary stats, frequency &
metadata sent to preferred LLM…
… an LLM we prompt to create a
SQL query based on the Natural
Language query & the context we
provided
Reproducible, hallucination-free
answers
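The flow described above can be sketched as follows: the natural-language question and some context go to an LLM, the LLM returns a SQL string, and the answer comes from running that SQL against the data rather than from the model's own prose. Everything named here is an assumption for illustration: `fake_llm` is a stand-in that returns a canned query instead of calling a model, SQLite stands in for whatever engine qsv pro uses, and the table and sample data are hypothetical.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Load CSV text into a SQLite table with TEXT columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


def fake_llm(question, context):
    """Stand-in for an LLM call; a real one would send question and context."""
    return "SELECT agency, COUNT(*) AS n FROM requests GROUP BY agency ORDER BY n DESC"


def answer(conn, question, context):
    """Ask the (stubbed) LLM for SQL, then compute the answer from the data."""
    sql = fake_llm(question, context)
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are executed")
    return sql, conn.execute(sql).fetchall()


# Hypothetical sample data, for illustration only.
SAMPLE = "agency,complaint\nNYPD,Noise\nDOT,Pothole\nNYPD,Parking\n"
conn = sqlite3.connect(":memory:")
load_csv(conn, "requests", SAMPLE)
context = "table requests(agency TEXT, complaint TEXT); 3 rows"
sql, rows = answer(conn, "Which agency gets the most requests?", context)
print(sql)
print(rows)
```

Keeping the generated SQL alongside the result is what makes the answer reproducible: anyone can rerun the same query on the same data and get the same rows.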
Click to see GIF animation of Excel calling qsv pro API to get summary stats
https://github.com/jqnatividad/qsv/discussions/2221#discussioncomment-11008064
DMS Framework
more than an Open Data Portal application, a
Data Management System Framework
you can build on
● Built around CKAN
● Certified CKAN Extensions
● Bundled with other
Best-of-Breed open source tooling
● Integrated Data Enrichment
● Build DMS applications like
○ Water Data Hubs
○ Open Data Portals
○ Internal Data Exchange
○ Data Library
○ Enterprise Data Catalog
○ and more…
DEMO
Q&A
https://dathere.com/product-demo-request/
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
https://datHere.com
Data Infrastructure Engineering
