INTRODUCTION TO OPENREFINE
CENTRAL PA OPEN SOURCE CONFERENCE
OCTOBER 17, 2015
Heather Myers / @privatestorm
https://www.linkedin.com/in/heathercmyers
ABOUT ME
Web administrator in the government and cultural heritage sectors.
Currently working at the Pennsylvania Historical and Museum Commission.
OPENREFINE
"a powerful tool for working with messy data: cleaning it;
transforming it from one format into another; and extending it
with web services and external data."
GETTING STARTED
Choose dataset
Decide what you want to accomplish with data
Install OpenRefine
Run OpenRefine
http://openrefine.org/download.html
http://127.0.0.1:3333
ABOUT THE DATASET
Pennsylvania Heritage magazine subject index.
Index of 12,000+ magazine terms for issues dated 1975–2002.
http://bit.ly/1Udha8D
DATA TO DO LIST
Create lists of terms for specific issues
Extract list of terms
IMPORT ALL DATA
File upload
Web download
Copy from clipboard
Google data import
CONFIGURE PARSING OPTIONS
Choose options
Update preview
Name project
Create project »
OPENREFINE PROJECT
APPLY A TEXT FILTER
Click down arrow on column
Choose text filter from menu
APPLY A TEXT FILTER
Type text
Choose additional options
TEXT FILTER OPTIONS
Add multiple filters
Reorder filters
TEXT FILTER OPTIONS
Case sensitive
Regular expressions
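
For readers who want to see what the two options change, here is a minimal Python sketch of the same idea outside OpenRefine. The `text_filter` function and the sample terms are hypothetical, and this is not OpenRefine's own implementation; it only illustrates plain, case-sensitive, and regular-expression matching.

```python
import re

def text_filter(rows, column, query, case_sensitive=False, regex=False):
    """Keep rows whose value in `column` matches `query`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # Without the regex option the query is treated as literal text.
    pattern = re.compile(query if regex else re.escape(query), flags)
    return [row for row in rows if pattern.search(row[column])]

# Hypothetical sample terms, not taken from the real index.
rows = [
    {"term": "Anthracite coal"},
    {"term": "Coal mining"},
    {"term": "Railroads"},
]

plain = text_filter(rows, "term", "coal")                         # ignores case
sensitive = text_filter(rows, "term", "Coal", case_sensitive=True)
starts_with = text_filter(rows, "term", r"^coal", regex=True)     # anchored match
```

Applying several filters in a row narrows the result each time, which is roughly what adding multiple filters does in the interface.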
EXPORT DATA
Click export button
Choose export type
Save file
DATA TO DO LIST
Create lists of terms for specific issues
Extract list of terms
IMPORT SELECTION OF DATA
Choose CSV / TSV / separator-based files
Use semicolon as custom separator
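
As a rough illustration of what a custom separator does, the sketch below parses semicolon-separated text with Python's standard `csv` module. The sample lines are invented for the example and may not match the layout of the real index.

```python
import csv
import io

# Invented sample data; the real index may be laid out differently.
raw = "Anthracite coal; Fall 1980; 12\nRailroads; Spring 1991; 4\n"

# delimiter=";" plays the role of the custom separator.
reader = csv.reader(io.StringIO(raw), delimiter=";")
rows = [row for row in reader]

for row in rows:
    print(row)
```

Note that the space after each semicolon stays attached to the following value, which is the kind of leftover that the trim whitespace step later in this deck cleans up.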
DATA SEPARATED INTO COLUMNS
SPLIT INTO SEVERAL COLUMNS
Click down arrow on column
Choose edit column
Choose split into several columns
SPLIT INTO SEVERAL COLUMNS
Split by separator or field length
Choose after splitting options
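
The two splitting modes can be sketched in Python as follows. Both helper functions and the sample values are hypothetical, and the `max_columns` limit is an assumption about how a column cap might behave rather than a description of OpenRefine's internals.

```python
def split_by_separator(value, separator, max_columns=None):
    """Split one cell on a separator, optionally capping the column count."""
    if max_columns is None:
        return value.split(separator)
    # A cap of N columns allows at most N - 1 splits.
    return value.split(separator, max_columns - 1)

def split_by_lengths(value, lengths):
    """Split one cell into fixed-width fields of the given lengths."""
    parts, start = [], 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    return parts

by_sep = split_by_separator("Fall 1980, p. 12", ", ")
capped = split_by_separator("a,b,c", ",", max_columns=2)
by_len = split_by_lengths("1980Fall", [4, 4])
```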
EDIT SINGLE CELL
Hover over cell & click edit
Update data type
Update text
Click apply
TRANSFORM TEXT
Click down arrow on column
Choose edit cells
Choose transform
TRANSFORM TEXT
Type expression
Choose options
Choose OK
TERMS SPLIT INTO COLUMNS
MOVE COLUMN
Click down arrow on column
Choose edit column
Choose move column left
TRIM WHITESPACE
Click down arrow
Choose edit cells, common transformations
Choose trim leading & trailing whitespace
JOIN CELLS
Custom text transform
GREL expressions plus space in between
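
In GREL a join of this kind is usually written along the lines of `cells["Term 1"].value + " " + cells["Term 2"].value`, where the column names are hypothetical. The Python sketch below shows the same idea, trimming each cell first and skipping blanks; `join_cells` and the sample row are assumptions made for illustration.

```python
def join_cells(row, columns, separator=" "):
    """Trim each cell, skip blanks, and join the rest with `separator`."""
    trimmed = [(row.get(name) or "").strip() for name in columns]
    return separator.join(part for part in trimmed if part)

# Hypothetical row with stray whitespace and an empty cell.
row = {"Term 1": " Anthracite ", "Term 2": "coal", "Term 3": None}
joined = join_cells(row, ["Term 1", "Term 2", "Term 3"])
print(joined)
```

Skipping empty cells avoids trailing or doubled spaces when some of the split columns are blank.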
HIDE COLUMNS
Click down arrow
Choose view
Choose collapse all other columns
SORT VALUES
Click down arrow
Choose sort
Choose additional options
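
A small Python sketch of text sorting with a couple of such options is shown below. The `sort_values` helper is hypothetical, and placing blank cells last is an assumption chosen for the example, not a statement about OpenRefine's defaults.

```python
def sort_values(values, case_sensitive=False, reverse=False):
    """Sort text values, keeping blank cells at the end."""
    filled = [v for v in values if v and v.strip()]
    blanks = [v for v in values if not (v and v.strip())]
    # casefold() makes the comparison ignore letter case.
    key = None if case_sensitive else str.casefold
    return sorted(filled, key=key, reverse=reverse) + blanks

sorted_terms = sort_values(["railroads", "", "Anthracite coal", "Coal mining"])
mixed_case = sort_values(["apple", "Banana"], case_sensitive=True)
```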
UNDO / REDO
Choose undo/redo tab in left column
Click previous step to undo
FILTER UNDO / REDO
Type keywords in filter box
EXTRACT OPERATION HISTORY
Choose extract button
Choose steps to save in left column
Copy JSON in right column
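
The extracted history is a JSON list with one object per step. The sketch below uses an abbreviated, illustrative history (the field names are modeled on typical exports and should be treated as assumptions; a real export carries more fields) and shows how steps could be selected by keyword before pasting them into another project.

```python
import json

# Abbreviated, illustrative history; a real export contains more fields.
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Term using expression value.trim()",
    "columnName": "Term",
    "expression": "value.trim()"
  },
  {
    "op": "core/column-move",
    "description": "Move column Term to position 0",
    "columnName": "Term",
    "index": 0
  }
]
"""

def select_steps(history, keyword):
    """Keep only the steps whose description mentions `keyword`."""
    keyword = keyword.lower()
    return [step for step in history
            if keyword in step.get("description", "").lower()]

history = json.loads(history_json)
selected = select_steps(history, "trim")
to_paste = json.dumps(selected, indent=2)  # text to paste when applying
print(to_paste)
```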
APPLY OPERATION HISTORY
Open or create new project
Click apply button in undo/redo
Paste JSON
Click perform operations
EXPORT DATA
Click export button
Choose export type
Save file
LEARN MORE
Website: OpenRefine
http://openrefine.org/
Book: Using OpenRefine
http://bit.ly/1QC0oNS
Course: Big Data University
http://bit.ly/1QC1sl1
THE END
