The LinkedGov extension for Google Refine




                      @danpaulsmith
What is LinkedGov?
A community project aiming to make public data more usable

  Cleaning
  Improving access
  Enriching
  Linking
                                     @danpaulsmith
Data flow
  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components (data is stored as machine-readable data)
    -> Cleaning tasks
    -> Question site
    -> data.linkedgov.org




                                               @danpaulsmith
What is Google Refine?

 “A power tool for working with messy data”
              “cleaning it up”,
              “transforming it”,
               “extending it”,
              “and linking it”


                                       @danpaulsmith
@danpaulsmith
Spreadsheet software

  Spreadsheet software      Google Refine

  Single-cell editing       Bulk-editing
  Create & input data       Use & transform existing data
  Document-based            Data-based
                            Allows extensions to be installed




                                                    @danpaulsmith
Transposition, multi-valued cells,
  clustering, faceting, filtering




                               @danpaulsmith
What does the LinkedGov extension do?




   Image courtesy of http://download.chip.eu
                                               @danpaulsmith
Typing wizards




Date & time   Measurements   Geolocations   Addresses




                                            @danpaulsmith
Other wizards




Columns to rows   Rows to columns   Blank values   Codes and symbols




                                                        @danpaulsmith
@danpaulsmith
@danpaulsmith
Cleaning




           @danpaulsmith
Enriching




            @danpaulsmith
What a machine understands before (CSV, TSV, Excel)

      Column   Column   Column   Column   Column   Column   Column
Row   number   word     number   word     date     number   number
Row   number   word     number   word     date     number   number
Row   number   word     number   word     date     number   number
Row   number   word     number   word     date     number   number
Row   number   word     number   word     date     number   number




                                                          @danpaulsmith
What a machine understands after (machine-readable format)

           Temp      Name     Gas/hour   Postcode   Date   Water/hour   Height
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres
Building   Celsius   string   kWh        Postcode   date   m3           metres




                                                              @danpaulsmith
The power of linking


  Linkable values:
    Latitude & longitude
    Postcodes
    Dates
    Measurements

  Datasets:
    NHS geo data
    GP Surgery address data
    NHS events data
    GP Surgery energy use data



                                                    @danpaulsmith
Data flow
  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components (data exists as linked data)
    -> Cleaning tasks
    -> Question site
    -> data.linkedgov.org




                                                @danpaulsmith
Cleaning tasks




                 @danpaulsmith
Data flow
  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components (data exists as linked data)
    -> Cleaning tasks
    -> Question site
    -> data.linkedgov.org




                                                @danpaulsmith
Question site




   @danpaulsmith
Data flow
  Import existing data (CSV, Excel, XML)
    -> LinkedGov database & core components (data exists as linked data)
    -> Cleaning tasks
    -> Question site
    -> data.linkedgov.org




                                                @danpaulsmith
data.linkedgov.org




                     @danpaulsmith
Feedback & questions



  http://linkedgov.org - Website

  http://wiki.linkedgov.org - Wiki

  @LinkedGov - Twitter

   #linkedgov – IRC (Freenode.net)




                                     @danpaulsmith

Editor's Notes

  • #2: Me. Recent graduate. I have been building interfaces and visualisations for the last two years on government projects themed on transparency, big data, open data and linked machine-readable data. This is a presentation on an interface I’ve been building for LinkedGov recently.
  • #3: When you’re looking for public data, it can be quite hard to find (you need to create accounts, you arrive at broken download links, searches fail due to a lack of metadata). Once you’ve found the data, it can be in the wrong format (so you then begin the time-consuming process of converting that data into a format you can work with). Then, once you’ve started working with the data, you can find it to be mysterious and lacking in explanation. So! LinkedGov makes life easier by: 1. Cleaning data (spelling mistakes, formats…). 2. Improving access (format of choice, APIs, high-quality metadata). 3. Enriching data (labels and descriptions for the data at a fine-grained level, using online vocabularies to describe what the data contains). 4. Linking datasets to each other.
  • #4: The purple block here is Google Refine, with which data is imported. The imported data is then cleaned and enriched by the LinkedGov extension. The final step of the import process is to store the data in LinkedGov’s database in a machine-understandable format. With the data stored, we can then do a few things: 1. Create “cleaning tasks” for the community that help fix errors in the data. 2. Power a “question site” that lets non-technical users form queries against datasets. 3. Power a technical search site aimed at developers that helps them find the data they want.
  • #5: Free. Open source. Runs in the web browser.
  • #6: This is what Refine looks like. A little bit like spreadsheet software: you have columns and rows. Though you don’t have any toolbars allowing you to edit the style, insert charts, generate reports… That’s because…
  • #7: Refine has some key differences from spreadsheet software. Spreadsheet software focuses on single-cell editing and inputting of data; Refine focuses on editing hundreds of rows and columns at the same time. Spreadsheet software is largely for creating and capturing data; Refine is for reshaping and transforming existing data. Spreadsheet software is very document-based, allowing you to style the data, use multiple pages or insert media; Refine is data-based, only allowing you to alter the structure and values of the data. Refine also allows people to build extensions for it!
  • #8: However, cleaning and transforming data *is* complicated, and a non-technical person will get confused. Google Refine is designed for programmers and frequent data-wranglers… It would be useful if the people who create or own the data were able to clean the data themselves (they, after all, should know the most about it).
  • #9: Hides the technical stuff! Instead, asks the user questions about their data… Creates clean, formatted, machine-readable data.
  • #10: So what are we asking the user? We ask them, “Can you spot any of these things in your data?” Why do we ask these things? These four types of data are a good starting ground for linking datasets, as they are common across most datasets. If multiple datasets cover the same time span, you can compare them to see if there’s anything that connects. If multiple datasets contain the same measurements (e.g. kilowatt-hours), it’s a good starting point to see if any of them relate. If multiple datasets contain latitude and longitude values, you can gather and compare data spatially and begin to plot things on maps, which everybody seems to love. If multiple datasets contain postcodes, and any of them match, you automatically have a number of different types of information for each postcode. These questions come in the form of “wizards”, which lead the user through a small number of tasks: select a column, specify how the data is currently formatted, then press “Done”!
  • #11: There are also a few other wizards. The “columns to rows” and “rows to columns” wizards help the user reshape their data in a way that helps us store it. These are currently the most problematic wizards with regard to the wording and to conveying the benefit of, or reason behind, asking the user to do this. The “blank values” wizard blanks out any values in the data that represent “NULL” values. Every dataset does this its own way: I’ve come across dashes, full stops and words like “missing” or “none”. The “codes and symbols” wizard asks the user to replace any codes or symbols with what they actually mean. For example, in some NHS data, a column was filled with lots of A’s, C’s, D’s and P’s; after googling about, I found out that they actually meant Active, Closed, Dormant and Proposed. Having their actual meaning present in the data is obviously a lot more helpful to people trying to use the data.
  • #12: So, this is what Refine looks like before the extension has been installed… and after the LinkedGov extension is installed. The main addition to the interface being a new panel called the “Typing” panel – which houses the wizards. So, I’ll just walk you through a couple of wizards… Imagine I have some dates in my data and I click on the Date & Time wizard…
  • #13: The wizard appears and it asks me to select any columns that contain dates… So I select two columns “open date” and “close date” by clicking on their headers…
  • #14: We ask the user to specify each date part for each column, as the values could be in any combination: year-month-day, year-month, day-month, month-day… You can see the column contains a day, month and year, but in a mixture of formats. You have words, dashes and slashes as separators, which the user doesn’t have to worry about. They then press “Finish” and the magic happens. The values are all formatted properly using the ISO standard; they are also linked to an online definition and breakdown of that specific date, and finally stored as machine-readable linked data.
  • #15: This is the measurements wizard. I select the “Avg. Temp” column. It then asks me to search for a measurement type by typing into a text box, which searches an online database of measurements. I click “Finish” after I’ve found the right measurement, “Celsius”, and then the measurements are stored using their online definition, which comes bundled with Wikipedia-like information such as alternative names, a description or related measurements (e.g. centimeters, meters, kilometers). So not only is the measurement being stored as an actual measurement, but because we’re using an online database to define it, it comes bundled with a lot of other relevant and potentially useful information for the end user.
  • #16: Here’s an example of what a machine understands about the data before and after using our extension. After saving a file in spreadsheet software, a machine, at best, only understands that the data is a bunch of columns and rows, containing numbers, words and dates. The ability for machines to understand the data is the magic that powers the question site, the dataset directory and makes linking datasets together a breeze.
  • #17: After using the wizards, machines are able to understand a little bit more about the data. Now that machines have a more in-depth understanding of what the data actually means, the guesswork and inaccuracy are removed when searching and querying the data.
  • #18: An example of how datasets can link… The red dataset contains latitude/longitudes. The blue dataset contains postcodes and latitude/longitudes. The green dataset contains postcodes and dates. And the orange dataset contains dates and measurements… All four datasets can be linked together by those linkable values. When you’re able to start linking datasets together like this, NEW information is created from a NEWLY acquired sense of UNDERSTANDING of those datasets.
  • #19: So that’s what the LinkedGov extension is and does. I’ll briefly finish off with what happens to the machine-readable data. Cleaning tasks can now be created for the community – asking them to use their expertise and judgement to correct problematic data. For example, a column may contain cryptic codes that represent types of NHS walk-in-clinics. So a task may be to decode one of these values and replace it with what it actually means.
  • #20: Here’s a screenshot of an example task. It’s asking the user to try to fix a value that contains two dashes instead of a decimal point. The user has the options to say “Yes I can fix this”, “Refer this to an expert”, “It’s actually fine” etc.
  • #21: The question site
  • #22: The question site is aimed at non-technical users. It allows them to form queries to retrieve data, without requiring any knowledge of query languages. They form the question in a human-readable way, using a mixture of selectable question fragments together with free text input. An example: Give me ALL … GP SURGERIES … in … LONDON…
  • #23: And finally, the data site.
  • #24: The data site is targeted at the developer community and is powered by the enriching parts of the data, such as: their metadata; what types of data are actually in the datasets (postcodes, dates, measurements); and what they could potentially link to…
  • #25: So that’s where we are so far. Feedback & questions?
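
Note #11 describes the “blank values” and “codes and symbols” wizards. A minimal Python sketch of what those two steps might do to a single column is shown here. It is not the extension's own code: the function names and the list of null markers are assumptions, and only the A/C/D/P meanings come from the NHS example in the note.

```python
# Hypothetical placeholder strings. Note #11 mentions dashes, full stops
# and words like "missing" or "none"; a real dataset may use others.
NULL_MARKERS = {"-", ".", "missing", "none"}

# Code meanings taken from the NHS example in note #11.
STATUS_CODES = {"A": "Active", "C": "Closed", "D": "Dormant", "P": "Proposed"}


def blank_out(value):
    """Return None when the cell only holds a placeholder for a missing value."""
    stripped = value.strip()
    if stripped == "" or stripped.lower() in NULL_MARKERS:
        return None
    return stripped


def expand_code(value, codes=STATUS_CODES):
    """Replace a known code with its meaning; leave anything else untouched."""
    if value is None:
        return None
    return codes.get(value, value)


def clean_column(cells):
    """Apply both steps, in order, to every cell in a column."""
    return [expand_code(blank_out(cell)) for cell in cells]


print(clean_column(["A", "-", "P", "missing", " C ", "X"]))
# prints ['Active', None, 'Proposed', None, 'Closed', 'X']
```

Unknown codes such as "X" are passed through unchanged, so a person can still review them later, for example as a community cleaning task.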
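
Note #14 says the user tells the Date & Time wizard which date parts a column holds, and the values are then rewritten to the ISO standard regardless of separators. The sketch below illustrates that idea under some assumptions: “the ISO standard” is taken to mean ISO 8601, only full day/month/year dates with four-digit years are handled, and `to_iso` and its `order` parameter are hypothetical names, not part of the extension.

```python
import re
from datetime import date

# Three-letter English month prefixes mapped to month numbers.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def to_iso(value, order=("day", "month", "year")):
    """Convert one date string to ISO 8601, given the order of its parts."""
    # Treat any run of non-alphanumeric characters (spaces, dashes,
    # slashes...) as a separator, so the user need not care which is used.
    tokens = [t for t in re.split(r"[^0-9A-Za-z]+", value.strip()) if t]
    if len(tokens) != len(order):
        raise ValueError(f"expected {len(order)} date parts in {value!r}")
    parts = {}
    for name, token in zip(order, tokens):
        if name == "month" and not token.isdigit():
            parts[name] = MONTHS[token[:3].lower()]  # "March" -> 3
        else:
            parts[name] = int(token)
    return date(parts["year"], parts["month"], parts["day"]).isoformat()


for raw in ["3 March 2011", "03/04/2011", "25-Dec-2011"]:
    print(raw, "->", to_iso(raw))
print(to_iso("2011-12-25", order=("year", "month", "day")))
```

Asking the user for the part order, as the wizard does, avoids guessing whether "03/04/2011" means 3 April or 4 March.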
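
Note #18 explains that datasets can be linked wherever they share a linkable value such as a postcode or a date. As a rough illustration, the sketch below joins two lists of records on a shared key. The records, field names and the `link_on` helper are made up for the example, and LinkedGov stores linked data rather than Python dictionaries.

```python
from collections import defaultdict


def link_on(key, left, right):
    """Join two lists of records wherever they share the same value for `key`."""
    # Index the right-hand dataset by the linkable value.
    index = defaultdict(list)
    for record in right:
        if record.get(key) is not None:
            index[record[key]].append(record)
    # Merge every left-hand record with each matching right-hand record.
    linked = []
    for record in left:
        for match in index.get(record.get(key), []):
            linked.append({**record, **match})
    return linked


# Made-up records standing in for an address dataset and an events dataset.
addresses = [
    {"surgery": "Surgery A", "postcode": "AB1 2CD"},
    {"surgery": "Surgery B", "postcode": "ZZ9 9ZZ"},
]
events = [
    {"postcode": "AB1 2CD", "date": "2011-03-03", "event": "Flu clinic"},
]

print(link_on("postcode", addresses, events))
```

The result of one join can be fed into another, for example linking on "date" next, which is how four datasets with overlapping value types can end up chained together.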
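
Note #22 describes the question site: users pick question fragments and add free text (“Give me ALL … GP SURGERIES … in … LONDON”) instead of writing a query language. The sketch below shows one way such fragments could be turned into a query. It assumes the data is queried with SPARQL, which the notes do not state, and the `ex:` vocabulary, the `ENTITY_TYPES` table and `build_query` are all hypothetical.

```python
# Hypothetical mapping from a selectable fragment to a class in a made-up
# vocabulary; the real site's vocabularies are not given in the notes.
ENTITY_TYPES = {"GP surgeries": "ex:GPSurgery"}


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_query(quantifier, entity, place):
    """Turn (quantifier, entity fragment, free-text place) into a query string."""
    if quantifier != "all":
        raise ValueError("this sketch only supports the 'all' fragment")
    rdf_class = ENTITY_TYPES[entity]
    return (
        "PREFIX ex: <http://example.org/vocab#>\n"
        "SELECT ?item WHERE {\n"
        f"  ?item a {rdf_class} ;\n"
        f'        ex:locatedIn "{escape_literal(place)}" .\n'
        "}"
    )


print(build_query("all", "GP surgeries", "London"))
```

Only the free-text part is escaped, because the selectable fragments come from a fixed table and never contain user-typed characters.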