Advances in Data Science
Fall 2016
TUTORIAL
INTRODUCTION
FEATURES
INSTALLATION
DEMO
COMPARISON
WHAT IS …
??
• Formerly known as Google Refine
OpenRefine is a power tool for working with messy
data, primarily for
• detecting and fixing inconsistencies
• transforming data from one structure or format to
another
• extending it with web services and external data
• connecting names within your data to name
registries (databases)
Use OpenRefine when you need something ...
• more powerful than a spreadsheet
• more interactive and visual than scripting
• more provisional / exploratory / experimental /
playful than a database
• Import data in various formats (e.g. TSV, CSV, Excel (.xls, .xlsx), XML, RDF as XML, JSON)
• Explore datasets in a matter of seconds
• Apply basic and advanced cell transformations
• Deal with cells that contain multiple values
• Create instantaneous links between datasets
• Filter and partition your data easily with regular expressions
• Use named-entity extraction on full-text fields to automatically identify topics
• Perform advanced data operations with the General Refine Expression Language
IMPORTANT FEATURES:
The LendingClub data contains complete loan
data for all loans issued through the time
period stated, including the current loan
status (Current, Late, Fully Paid, etc.) and
latest payment information
LENDING CLUB
LOAN STATS DATA
Our aim is to perform exploratory analysis on given financial data
• Getting the data
• Looking at the data
• Cleansing
• Transforming
• Creating visualizations
STEPS
1 – Getting started with OpenRefine
2 – Analyzing and Fixing Data
3 – Advanced Data Operations
4 – Linking Datasets
5 – Regular Expressions and GREL
TUTORIAL
• Requirements
• Java JRE installed
• Download
• OpenRefine is a desktop application. Here’s the link: Google OpenRefine
• Unlike most other desktop applications, it runs as a small web server on
your own computer
• You point your web browser at that web server in order to use Refine. So,
think of Refine as a personal and private web application
HOW TO INSTALL
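Because OpenRefine runs as a local web server, one way to confirm that it has started is to request its start page. The sketch below assumes the default address http://127.0.0.1:3333/ (adjust it if your installation is configured differently). The helper name is_refine_running is hypothetical and is not part of OpenRefine.

```python
import urllib.error
import urllib.request

# Assumed default address of a locally running OpenRefine instance.
DEFAULT_URL = "http://127.0.0.1:3333/"


def is_refine_running(url=DEFAULT_URL, timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Nothing listening, connection refused, timeout, or malformed URL.
        return False


if __name__ == "__main__":
    print("OpenRefine reachable:", is_refine_running())
```

If this prints False, check that the Command window or shell running OpenRefine is still open.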
• Install:
• Once you have downloaded the .zip file, uncompress it into a folder wherever you want
(such as in C:\Google-Refine).
• Run:
• Run the .exe file in that folder. A Command window opens in which OpenRefine runs.
By default, this window shows monospace text on a black background.
• Shut down:
• When you need to shut down OpenRefine, switch to that Command window, and press
Ctrl-C. Wait until there's a message that says the shutdown is complete. That window might
close automatically, or you can close it yourself. If you get asked, "Terminate all batch
processes? Y/N", just press Y.
INSTALLATION: WINDOWS
• Install:
• Once you have downloaded the .dmg file, open it, and drag the OpenRefine icon into
the Applications folder icon (just like you would normally install Mac applications).
• Run:
• To launch OpenRefine, go to the Applications folder and double click the OpenRefine
app. You'll see the OpenRefine app appear in your dock.
• Shut down:
• You can switch to the OpenRefine app (clicking on its icon in the dock) and invoke its
Quit command.
• If you use Yosemite you will need to install Java for OS X 2014-001 first.
INSTALLATION: MAC
• Install / Run: Once you have downloaded the tar.gz file, open a shell and
type
• tar xzf google-refine.tar.gz
• cd google-refine
• ./refine
• This will start OpenRefine and open your browser to its starting page.
• Shut down: Press Ctrl-C in the shell.
INSTALLATION: LINUX
RUN OPENREFINE
• To increase memory: refine.bat /m 4096m
IMPORT DATA
EXPLORING DATA
MANIPULATING COLUMNS
USING THE PROJECT HISTORY
EXPORTING A PROJECT
ANALYZING AND FIXING DATA
WORKING ON THE DATA
• sorting data
• faceting data
• detecting duplicates
• applying a text filter
• using simple cell transformations
• removing matching rows
• splitting data across columns
• adding derived columns
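To make two of the operations above concrete, here is a minimal Python sketch of what a text facet and a duplicate check compute. It is not OpenRefine code: the sample rows, the column names "id" and "loan_status", and both helper functions are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample rows; the column names are assumptions.
rows = [
    {"id": "1", "loan_status": "Current"},
    {"id": "2", "loan_status": "Fully Paid"},
    {"id": "3", "loan_status": "Current"},
    {"id": "2", "loan_status": "Fully Paid"},  # repeated id
    {"id": "4", "loan_status": "Late"},
]


def text_facet(rows, column):
    """Count how often each distinct value appears in `column`."""
    return Counter(row[column] for row in rows)


def duplicate_values(rows, column):
    """Return the values of `column` that occur more than once, sorted."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


facet = text_facet(rows, "loan_status")
dupes = duplicate_values(rows, "id")
print(facet.most_common())
print(dupes)
```

In OpenRefine itself, the same counts appear in the facet panel, and clicking a value filters the rows to that value.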
SPECIAL FEATURE
• Regular Expressions and GREL
• Expressions can also be written in Python (Jython) or Clojure
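As an illustration of the kind of regular-expression cleanup such expressions perform, the standalone Python sketch below turns formatted strings into numbers. The sample values (" 13.56% ", " 36 months") and the function names are hypothetical. Inside OpenRefine the same logic would operate on the cell's value rather than on a function argument.

```python
import re

# Matches an optional sign, digits, and an optional decimal part.
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_percent(value):
    """Return the first number in a string such as ' 13.56% ' as a float, or None."""
    if value is None:
        return None
    match = NUMBER.search(value)
    return float(match.group()) if match else None


def parse_months(value):
    """Return the first integer in a string such as ' 36 months', or None."""
    match = re.search(r"\d+", value or "")
    return int(match.group()) if match else None


print(parse_percent(" 13.56% "))
print(parse_months(" 36 months"))
```

Returning None for cells without a number keeps bad values visible, so they can be isolated with a facet instead of being silently converted.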
ADDING A RECONCILIATION SERVICE AND
RECONCILING WITH LINKED DATA
ADVANCED DATA OPERATIONS
• handling multi-valued cells
• alternating between rows and records mode
• clustering similar cells
• transforming cell values
• adding derived columns
• transposing rows and columns
• installing extensions
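Clustering similar cells can be pictured as grouping values that share a normalized key. The sketch below is a simplified key-collision approach in plain Python, written under the assumption that trimming, lowercasing, stripping accents and punctuation, and sorting unique tokens is enough to match variants. It is not OpenRefine's implementation, and the sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key for `value`."""
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique whitespace-separated tokens so word order is ignored.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


sample = ["Wells Fargo", "wells fargo ", "Fargo, Wells", "Chase", "Café Rio", "cafe rio"]
clusters = cluster(sample)
print(clusters)
```

Each resulting group is a candidate for merging into one canonical spelling; the decision to merge still belongs to the person reviewing the data.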
• Documentation:
• https://github.com/OpenRefine/OpenRefine/wiki
• YouTube Tutorial:
• https://www.youtube.com/playlist?list=PL737054C67FCC0741
REFERENCES:


OpenRefine Class Tutorial
