The Problem with Data Portals - PUBLIC (FINAL).pdf

datHere Monthly Webinar
Episode 1
Oct 2024
Data Infrastructure Engineering

Helping Open Data
since 2011
● Deployed ~100 CKAN portals in the US
● Helped fund & develop several CKAN
improvements & extensions
○ Open by default
○ “Enlightened self-interest”
● Few dozen migrations from other portals
(old CKAN sites & proprietary platforms)
● Helped firms integrate CKAN into their
solution stack
● Delivered Training
● Attended & Presented at data conferences
around the world

A Data Portal is the tip of
a Data Governance Iceberg
TLDR

[Iceberg diagram. Labels: Data Portal (the tip); Opening Data Inside; Internal Data Portal; Data-Driven Culture; compiling/curating Metadata; Central Source of Metadata; YOUR ENTERPRISE DATA; EXTERNAL DATA.]

“…the Civic Analytics Network (CAN) offers the
following eight guidelines that, if followed, would
advance the capabilities of government data portals
across the board and help deliver upon the promise
of a transparent government.”

An Open Letter to
the Open Data
Community
Civic Analytics Network
Mar 2017
1. Improve accessibility and usability to
engage a wider audience
2. Move away from a single dataset
centric view
3. Treat geospatial data as a first class
datatype
4. Improve management & usability of
metadata
5. Decrease the cost & work required to
publish data
6. Introduce revision history
7. Improve management of large
datasets
8. Set clear transparent pricing based
on memory, not number of datasets

An Open Letter to
the Open Data
Community
ONE YEAR LATER
Civic Analytics Network
June 2018
● Acknowledged responses from
Vendors
● Called out several CAN open
data projects, experiments &
accomplishments across the
country
● Called for continued
engagement

“CAN’s call for open communication, shared
learning, and partnership remains open to vendors
and civic technologists alike and we look forward to
continuing our work to help grow and expand the
open data community and practices.”

Data is
Infrastructure
CKAN Association’s Response
Sep 2018
● Detailed response to all eight
guidelines
● Examples from across the
entire CKAN ecosystem
around the world
● Called out CKAN’s extensibility
and its catalog of third-party
extensions
● Confirmed that no CKAN
service provider practices
“nickel-and-diming”

and we all lived happily ever after…

The Problem with
Data Portals
what Sami & Joel learned the
hard way since 2011
1. Data Quality - or the Lack of It
2. It’s not FAIR!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
4. Raw Data, not Answers
5. User Experience is King!
6. A Data Portal is just the tip of a
Data Governance Iceberg
7. You need to “Open Data Inside”
8. Practical Data Wrangling required
9. Best-of-Breed is the Way
10. You need to “Humanize the Data”
Inside & Out

The Problem with Data Portals
1. Data Quality - or the Lack of It
a. Data ALWAYS needs to be
“massaged”
i. To remove PII
ii. To remove other sensitive data
iii. Join/Enrich with other data
iv. Fat-finger mistakes
b. Excel is the Duct Tape of Data
c. …and the bane of Open Data!
d. and PDFs!?!
i. Painful Document Format
ii. Practically Data Free
iii. Persistent Data Fortress
2. It’s not FAIR!
a. Findable
b. Accessible
c. Interoperable
d. Reusable
e. …but DCAT 3 is here!!!
3. Open Data is just one “application”
of a Data Mgmt System (DMS)
a. the “Metadata Tip of the Iceberg”
b. The “public” part of your
Data Management Initiative
c. You need to “Open (as a verb) Data
Inside” (see 7)

4. Raw Data, not Answers
a. Mostly Raw Data
b. Lack of High-Quality Metadata
i. Low Resolution
metadata about data files, not the
data inside the files
ii. Primitive Data Dictionary
1. No Summary Statistics
2. No Frequency Tables
3. No Links to Related Data
iii. Metadata has to be manually
compiled
c. It’s still mainly Keyword Search
d. No Natural Language search
No Answering People Interface
5. User Experience is NOT King!
a. Current Data Publisher UX does not incentivize timely
updates, exacerbating Data/Metadata Quality issues
b. Current Discoverability UX, which users rely on to search &
explore the Catalog, is dated
c. Make it easy so that Data Publishers WANT TO update
the Data/Metadata
6. A Data Portal is the tip of a Data Governance Iceberg
a. The right DMS should enable your
Data Governance Strategy
b. It should be Data Infrastructure You Can Build On
(DIY-CBO)
c. And as such, it NEEDS to be
standards-based, if not an open-source platform
d. Platform = A mature & robust API
e. Something that can integrate and interoperate
with your existing tooling, systems & data sources
f. The portal is fed by Opening Data Inside (see 7)

7. You first need to “Open Data Inside”
a. To promote a Data-Driven Culture
b. Culture = Process over Time
c. Culture eats Strategy for Breakfast
d. You need to make it Useful, Usable & Used
for internal folks first…
e. “Opening Data Inside” makes it easier for
them to do their day-to-day work (see 5c), and
f. High Quality Open Data naturally follows…
8. Practical Data Wrangling Required
a. On the Desktop w/o specialized skills
b. “Excel”-like, GUI anyone can use
c. It needs to be fast so folks can do “what-if”,
iterative data-wrangling
d. Desktop Data Wrangling deployable as a
production data pipeline
9. Best-of-Breed is the Way
a. DMS Core Competency
Metadata Catalog w/ a mature, robust API
b. No lock-ins! Interoperate! (see 2c)
c. Do not reinvent the wheel. Focus on 9a.
d. “Not Invented Here” not welcome here
e. Don’t try to build a ___ wanna-be, use ___
(fill in the blanks: Tableau, Power BI, etc.)
f. Prefer open source when possible
(e.g. Apache Superset instead of Tableau)
10. You need to “Humanize the Data” Inside & Out!
a. Incentivize Data Owners to share their Data and curate
the Metadata in the DMS, as doing so makes their
day-to-day work easier
b. Answering People Interface (API) (see 4d)
c. Connect with other Humans!
Other communities, vendors, users, instances, data
owners, standards bodies, etc. in the Ecosystem
d. Data-driven Storytelling
e. Cultivate a Data-driven Culture

Humanize the Data - The Product is a Civic Data Ecosystem
Pathways to Enable Open-Source Ecosystems
● NSF initiative that “aims to harness the power
of open-source development for the creation of
new technology solutions to problems of
national and societal importance.”
● Phase I “discovery grant” awarded in 2023 to
University of Pittsburgh & datHere
● Phase II “implementation grant” awarded in
August 2024!
● Currently spinning up…
● https://civicdataecosystems.org
● “The Product is the Ecosystem” blogpost

Building a Data-Driven Culture
From the TOP DOWN
From the BOTTOM UP
CULTURE = Process Over Time
DATA MANAGEMENT STRATEGY
DIRECTIVES
INCENTIVES

Humanizing the Data
● a Data-Driven Culture takes Time
● is a top-down, bottom-up initiative
● “Opening the Data Inside”
○ Creates a Virtuous Cycle
balancing Directives with Incentives
○ help Internal Staff with their
day-to-day data needs so they WANT to
open data (as a verb)
○ Opening Data inside includes
internal data that is not meant for
public use
○ High Quality Open Data (as a noun)
is a natural by-product
Humanizing the Data is
Pragmatic Data Governance
Culture Eats Strategy for Breakfast

We needed a
“Data Wrangler”
● Works with a universal data format
● Cross-platform
● Fast, blazing Fast!
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate
into mission-critical data pipelines
Data
You

qsv/qsv pro
Origin Story
It all started with a failed pilot
with a Hedge Fund to build an
Internal Data Portal in 2020
● datHere - new startup during COVID
● Data Portals! Anybody? Anybody?
● Nice! A Hedge Fund wants to try CKAN!
● An Internal Data Catalog Pilot -
populated with latest metadata from
vast data holdings, updated daily
● Central source of Truth for Metadata
● And we have to auto-infer the metadata
● Traditional metadata inferencing
pipeline (csvkit, pandas, numpy) was
too slow
● Forked xsv to start qsv…

qsv “Data Wrangler” Goals
● Works with a universal data format:
CSV, Excel, JSON, JSONL, PostgreSQL, SQLite, Parquet,
Data Package, Avro & recognizes 130 file formats
● Cross-platform: Linux, macOS & Windows
● Fast! Blazing Fast!!!
● Open Source
● Easy to Learn
● Easy to Use for initial investigations
● But powerful enough to integrate into
mission-critical data pipelines

How fast is Blazing fast? (v0.137.0)
For a 1 million row sample of NYC’s 311 data (41 columns, 520 MB):
● 19 “streaming” summary statistics in 0.233 secs
● 18 more stats (total 37) & infer dates (19 formats recognized) in 1.305 secs
● Frequency table in 1.045 secs
● Count rows in 0.009 secs
● Validate against RFC 4180 CSV standard in 0.523 secs
● Validate against a JSON Schema in 3.094 secs
● Run a simple SQL query in 1.053 secs, a SQL aggregation in 1.058 secs & a
very inefficient SQL aggregation in 0.928 secs
● Reverse geocode WGS84 coordinates against GeoNames in 3.782 secs
● And more…
https://qsv.dathere.com/benchmarks
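To make "streaming" concrete, here is a minimal sketch of one-pass summary statistics. It is an illustration only and not qsv's implementation: it uses Welford's online algorithm from the Python standard library, and the sample data, the column name `response_hours`, and the names `StreamingStats` and `stream_column_stats` are all hypothetical. The point is that each row is folded into a few running totals and then discarded, so memory use does not grow with file size.

```python
import csv
import io
import math


class StreamingStats:
    """Running count, min, max, mean and standard deviation for one column."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: adjust the mean and squared deviations in O(1).
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


def stream_column_stats(lines, column):
    """Read CSV rows one at a time and fold `column` into StreamingStats."""
    stats = StreamingStats()
    for row in csv.DictReader(lines):
        cell = row[column].strip()
        if cell:  # skip empty cells instead of failing on them
            stats.update(float(cell))
    return stats


# Hypothetical sample standing in for a large CSV file.
SAMPLE = "id,response_hours\n1,2\n2,4\n3,4\n4,4\n5,5\n6,5\n7,7\n8,9\n"
result = stream_column_stats(io.StringIO(SAMPLE), "response_hours")
print(result.count, result.minimum, result.maximum, result.mean, round(result.stddev, 3))
```

Statistics such as the median or cardinality cannot be computed this way because they need the values to be retained or sorted, which may be one reason a benchmark would report streaming and non-streaming statistics separately.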

comprehensive summary stats in 1.305 seconds!

How is it so
Fast?
by standing on the
Shoulders of Giants &
The Ecosystem
● Rust
● Mem-mapped, Multi-threaded, Multi-I/O
● Advanced CPU features
● High performance libraries
● Performance architecture
○ Indexed access
○ Various caching techniques
○ Performance oriented memory allocator
● Built on a solid foundation (xsv)
● Polars Dataframes Engine
● Vibrant Rust & Polars Ecosystems

Why the
Obsessive
Need for Speed?
What does it unlock?
● Big Data is getting Bigger
● Embedding into other Systems (DP+)
● Quicker Data Investigations
● Enables new Data Wrangling Workflows
○ “Automagical Metadata”
Preemptive, near-real time
metadata inferencing
○ Compile Extended Data Dictionaries
○ Interactive, Iterative Data-Wrangling
○ Leverage AI
use RAG techniques to infer additional
extended metadata (describeGPT)
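As a rough illustration of what an "Extended Data Dictionary" could contain, the sketch below infers a type, a null count, the cardinality and a small frequency table for every column of a CSV. It is not how qsv or DP+ does it: it uses only the Python standard library, loads the whole file into memory, assumes a rectangular CSV, and the sample data and the names `infer_type` and `data_dictionary` are hypothetical.

```python
import csv
import io
from collections import Counter


def infer_type(values):
    """Return the narrowest of Integer, Float or String that fits every value."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if not values:
        return "NULL"  # column has no non-empty cells
    if all_parse(int):
        return "Integer"
    if all_parse(float):
        return "Float"
    return "String"


def data_dictionary(lines, top_n=3):
    """Build per-column metadata from the data itself, not from the file wrapper."""
    reader = csv.DictReader(lines)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            columns[name].append((cell or "").strip())

    dictionary = {}
    for name, cells in columns.items():
        non_empty = [cell for cell in cells if cell]
        dictionary[name] = {
            "type": infer_type(non_empty),
            "null_count": len(cells) - len(non_empty),
            "cardinality": len(set(non_empty)),
            "top_values": Counter(non_empty).most_common(top_n),  # frequency table
        }
    return dictionary


# Hypothetical sample rows.
SAMPLE = (
    "agency,complaint_type,open_days\n"
    "NYPD,Noise,3\n"
    "DOT,Pothole,12\n"
    "NYPD,Noise,\n"
    "DSNY,Missed Collection,7\n"
)
dictionary = data_dictionary(io.StringIO(SAMPLE))
for column, entry in dictionary.items():
    print(column, entry)
```

Because everything here is derived from the data, it can be regenerated on every update, which is the idea behind metadata that does not have to be compiled by hand.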

Datapusher+
Embedded use case
● Next-gen CKAN Data Ingestion
● Guaranteed Data Type inferences
● Data Validation / Metadata Inferencing
○ Dedupe
○ PII screening
○ As context for AI - “describeGPT”
○ Extended Data Dictionary
○ Pre-calculate metadata
(spatial extent, date range for
time-series data, etc.)
○ Pre-populate DCAT 3 recommended
metadata fields
○ Data Enrichment
https://ckan.org/events/ckan-datapusher-plus-automagical-metadata
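The ingestion-time steps listed above (dedupe, PII screening, pre-calculated spatial extent and date range) can be sketched in a few lines. This is a simplified illustration, not Datapusher+ code: the regular expressions catch only obvious email and SSN-like patterns, dates are assumed to be ISO 8601, and the column names, the sample rows and the function `ingest_metadata` are hypothetical.

```python
import csv
import io
import re
from datetime import date

# Deliberately naive patterns; real PII screening needs far more than this.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def ingest_metadata(lines, lat_col, lon_col, date_col):
    """Drop exact duplicate rows, flag PII-looking columns, pre-calculate extents."""
    seen = set()
    unique_rows = []
    pii_columns = set()
    total = 0
    for row in csv.DictReader(lines):
        total += 1
        key = tuple(row.values())
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        unique_rows.append(row)
        for name, cell in row.items():
            if EMAIL.search(cell) or SSN_LIKE.search(cell):
                pii_columns.add(name)

    lats = [float(r[lat_col]) for r in unique_rows]
    lons = [float(r[lon_col]) for r in unique_rows]
    dates = [date.fromisoformat(r[date_col]) for r in unique_rows]
    return {
        "row_count": len(unique_rows),
        "duplicates_dropped": total - len(unique_rows),
        "pii_columns": sorted(pii_columns),
        # Bounding box in west, south, east, north order.
        "spatial_extent": [min(lons), min(lats), max(lons), max(lats)],
        "date_range": [min(dates).isoformat(), max(dates).isoformat()],
    }


# Hypothetical sample rows.
SAMPLE = (
    "created,lat,lon,contact\n"
    "2024-01-05,40.71,-74.00,jane@example.com\n"
    "2024-03-10,40.85,-73.90,\n"
    "2024-01-05,40.71,-74.00,jane@example.com\n"
    "2024-02-01,40.60,-74.10,call 555-0100\n"
)
metadata = ingest_metadata(io.StringIO(SAMPLE), "lat", "lon", "created")
print(metadata)
```

Values like the bounding box and the date range are the kind of pre-calculated metadata that could then pre-populate catalog fields instead of being typed in by a publisher.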

Data that is
Useful,
Usable &
Used
Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
We have a solution for this with DP+ & qsv
But what about actually Using the Data
to gain Actionable Insight,
to drive Evidence-based Decisions?

qsv pro
Cross-Platform Desktop Data-Wrangling
& Query tool
for the Rest of Us
● OpenRefine + Excel + qsv + CKAN +
recipes + High Value Curated Data =
qsv pro
● Familiar spreadsheet interface
● No need to know complex Command
Line Interface (CLI) commands
● FAST! Blazing Fast!
● Interactive Data Wrangling
● Recipes! (desktop ETL)
● Integration with datHere’s upcoming
cloud-based services
○ High Value Data Feeds
○ Data Enrichment
○ Data Normalization
○ Geocoding
● Natural Language Interface
https://qsvpro.dathere.com

● For a Data Analyst Audience
● You don’t need to be a Developer
● Use ready-made Recipes for common
tasks (e.g. Scan for PII, geocode,
deduplicate records, etc.)
● Create/modify/combine Recipes using
either Luau or Python
● Share your Recipes on the
datHere Recipe Catalog
● Pre-process security-sensitive data
on your desktop without uploading it
first
● Enrich your data with datHere’s
ever-expanding corpus of
High Value Data like the Census,
Bureau of Labor Statistics, etc.
● Use the “Answering People Interface”
on your data or on data from other CKAN portals
● Upload to your CKAN or to datHere’s
Data Catalog to share your data with
the world!
Cross-platform Desktop
Data Wrangling & Query tool
for the Rest of Us
Analyzed 50k rows,
compiling stats and
frequency tables instantly!
Ever-expanding Data-Wrangling
Recipe Library
Directly upload to any CKAN
running v2.9 and above!
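For readers unfamiliar with what an upload to CKAN involves, the sketch below builds a request to CKAN's Action API `package_create` endpoint, which registers a new dataset. It is not a description of how qsv pro performs its upload. The site URL, token, dataset name and organization are hypothetical placeholders, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_package_create_request(site_url, api_token, name, title, owner_org, notes=""):
    """Build (but do not send) a POST to CKAN's package_create action."""
    payload = {
        "name": name,            # URL-safe dataset identifier
        "title": title,          # human-readable title
        "owner_org": owner_org,  # organization that will own the dataset
        "notes": notes,          # free-text description
    }
    return urllib.request.Request(
        url=site_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_package_create_request(
    "https://ckan.example.org/",
    "HYPOTHETICAL-API-TOKEN",
    "nyc-311-sample",
    "NYC 311 sample",
    "hypothetical-org",
)
print(request.get_method(), request.full_url)
# To create the dataset for real: urllib.request.urlopen(request)
```

Attaching the data file itself is a separate step through the `resource_create` action, which takes a multipart upload and is omitted here to keep the sketch short.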

Ran SQL query in 1139ms!
Natural language query, along
with summary stats, frequency &
metadata sent to preferred LLM…
… an LLM we prompt to create a
SQL query based on the Natural
Language query & the context we
provided
Reproducible, hallucination-free
answers
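The flow described above can be sketched as follows: send the question plus context derived from the data to an LLM, ask it for SQL only, then compute the answer by running that SQL against the data. This is an assumption-laden illustration, not qsv pro's implementation. The LLM is replaced by a hypothetical stub (`stub_llm`) that returns a fixed query, SQLite stands in for the query engine, and the table and data are made up.

```python
import sqlite3


def summarize(connection, table):
    """Collect per-column context (name, declared type, distinct count) from the data."""
    summaries = []
    for row in connection.execute(f'PRAGMA table_info("{table}")'):
        name, col_type = row[1], row[2]
        distinct = connection.execute(
            f'SELECT COUNT(DISTINCT "{name}") FROM "{table}"'
        ).fetchone()[0]
        summaries.append(f"{name} ({col_type}, {distinct} distinct values)")
    return summaries


def build_prompt(question, table, summaries):
    """Combine the natural-language question with metadata the LLM can ground on."""
    return (
        f"Table: {table}\nColumns:\n- " + "\n- ".join(summaries)
        + f"\nWrite one SQL SELECT statement that answers: {question}"
    )


def stub_llm(prompt):
    # Hypothetical stand-in for a call to a real LLM; returns a canned query.
    return "SELECT agency, COUNT(*) AS n FROM complaints GROUP BY agency ORDER BY n DESC LIMIT 1"


def answer_question(question, connection, table, llm):
    """The LLM only writes SQL; the answer itself comes from executing it."""
    prompt = build_prompt(question, table, summarize(connection, table))
    sql = llm(prompt)
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("refusing to run a non-SELECT statement")
    return {"prompt": prompt, "sql": sql, "rows": connection.execute(sql).fetchall()}


# Hypothetical in-memory data.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE complaints (agency TEXT, complaint_type TEXT)")
connection.executemany(
    "INSERT INTO complaints VALUES (?, ?)",
    [("NYPD", "Noise"), ("NYPD", "Noise"), ("DOT", "Pothole"),
     ("NYPD", "Parking"), ("DSNY", "Missed Collection")],
)
outcome = answer_question("Which agency has the most complaints?", connection, "complaints", stub_llm)
print(outcome["sql"])
print(outcome["rows"])
```

Since the rows come from the query engine rather than from generated text, the same SQL returns the same result every time and can be inspected or re-run. The generated SQL can still misread the question, so showing the query alongside the answer is a sensible safeguard.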

Click to see GIF animation of Excel calling qsv pro API to get summary stats
https://github.com/jqnatividad/qsv/discussions/2221#discussioncomment-11008064

DMS Framework
more than an Open Data Portal application, a
Data Management System Framework
you can build on
● Built around CKAN
● Certified CKAN Extensions
● Bundled with other
Best-of-Breed open source tooling
● Integrated Data Enrichment
● Build DMS applications like
○ Water Data Hubs
○ Open Data Portals
○ Internal Data Exchange
○ Data Library
○ Enterprise Data Catalog
○ and more…

DEMO
Q&A
https://dathere.com/product-demo-request/

Standards-based, best-of-breed, open source solutions to make your Data Useful, Usable & Used
https://datHere.com
Data Infrastructure Engineering
