Winter 2015: Session #2
Programming on the Whiteboard
February 19, 2015
(Paige Morgan)
Last week...
• The work of creating usable data
• Forms that this data might take:
• markup language
• Spreadsheets (MySQL & relational
DBs)
• Graph databases (RDF/Linked Open
Data)
This week:
• Caveat Curator (challenges of working
with data)
• Programming on the Whiteboard, i.e.,
conceptualizing the specific steps that
you need to take to accomplish your
goals
Goals/Takeaways
• A better understanding of the
workflow for dealing with data
• Greater ability to talk about what
you’re trying to do
Why this focus on data?
• Understanding your data, and your
intended actions, is a key skill for
developing any digital project (big or
small).
• You may have one big project – but
your data may support several
small/intermediary projects.
Caveat Curator
Programming
languages (and
digital apps) are like
human languages in
that they both have
phrases, patterns,
and rules.
Programming
languages are unlike
human languages in
that they aren’t for
communicating with
people.
They are also unlike
human languages in that
every programming
utterance does
something, i.e., causes an
action to occur.
You can get used to
patterns – even
unfamiliar ones.
The shift is in getting
used to thinking in
terms of every single
action.
Today’s subject matter
includes actions that you’ll
need to think about before
you work with...
Image: Josh Lee, @wtrsld, via Twitter, January
2014.
Even when you’re just
experimenting, you need to
prep your data.
You may know your
dataset in detail already,
from your research -- but
your computer is
concerned with different
levels of detail.
Becoming aware of those
levels of detail is not only
helpful for your project
ideas...
...it’s also a useful skill for
working with programming
languages.
(where a stray /> or ; can break your program/website)
Data only works if your
computer can read it.
But my data is just text!
(Doesn’t that make things easy?)
(Remember, your computer
is fairly stupid).
Formatted
text is often
full of text
your
computer
can’t parse
correctly.
The┘re┘sÜlt ís that yoÜr te┘xt
might come┘ oÜt looking
like┘this
whe┘n yoÜ ope┘n it in a
programming e┘nvironme┘nt.
So you need
to convert it to
plain text.
(without any of the fancy
details encoded in MS Word
fonts.)
(This is key if you work with
newspapers, older printed
texts, or archival material.)
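A minimal sketch of what "converting to plain text" can involve, assuming the trouble comes from typographic characters (curly quotes, dashes, ligatures) that a word processor inserted. The mapping table and the function name `to_plain_text` are illustrative assumptions, not part of the workshop materials.

```python
import unicodedata

# Hypothetical mapping of common word-processor characters to plain ASCII.
WORD_PROCESSOR_CHARS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}


def to_plain_text(text: str) -> str:
    """Return text with typographic characters replaced by plain equivalents."""
    # NFKC folds compatibility forms (such as the "fi" ligature) into ordinary letters.
    text = unicodedata.normalize("NFKC", text)
    for fancy, plain in WORD_PROCESSOR_CHARS.items():
        text = text.replace(fancy, plain)
    return text


sample = "\u201cIt\u2019s just text\u201d \u2013 or is it\u2026"
print(to_plain_text(sample))
```

A real project would probably extend the table after inspecting which characters actually appear in its own files.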
Maybe you want to work
with sailing data and ports of
call:
The ship you’re interested in
leaves the Ivory Coast for
St. Helena...
DMDS Winter Workshop 2 Slides
But when you create your
map, you get this:
The latitude/longitude
coordinate is the significant
datum.
The city name is just the
human-readable
component.
Each datum needs to be
unique.
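One way to picture the point about unique data, assuming the map went wrong because a place name matched more than one location (the slides do not say exactly what the map showed). The coordinates below are approximate and illustrative, and the `Port` type and lookup functions are hypothetical.

```python
from typing import NamedTuple


class Port(NamedTuple):
    name: str         # human-readable label; may be shared by several places
    latitude: float   # decimal degrees, north positive
    longitude: float  # decimal degrees, east positive


# Approximate, illustrative coordinates (assumptions, not from the slides).
PORTS = [
    Port("St. Helena", -15.97, -5.71),   # island in the South Atlantic
    Port("St. Helena", 38.51, -122.47),  # town in California
]


def find_by_name(name):
    """Name lookup: can return several places, so it is ambiguous."""
    return [p for p in PORTS if p.name == name]


def find_by_coordinate(latitude, longitude):
    """Coordinate lookup: each (latitude, longitude) pair identifies one place."""
    return [p for p in PORTS if (p.latitude, p.longitude) == (latitude, longitude)]


print(len(find_by_name("St. Helena")))         # two candidates share the name
print(len(find_by_coordinate(-15.97, -5.71)))  # the coordinate picks out one
```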
Figuring out what sort
of unique configuration
will work best involves
at least some
experimentation.
To experiment effectively,
you’ll want to keep careful
records.
If you develop categories
of information, you’ll want
to keep a record of what
each category means,
and what its limits are.
Cleaning and structuring
your data is a foundation
issue that changes,
depending on the available
format of your data.
What if your data is
crowdsourced?
You can require a particular
format for submissions
You can even put
programmatic limits on the
formats available for
submission
But in the end, you’re
probably still going to need
to scrub and/or format.
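A small sketch of a "programmatic limit" on submissions, assuming a hypothetical rule that dates must arrive as YYYY-MM-DD. The rule, the function names, and the sample submissions are all invented for illustration. Note that a light scrub is still applied before the check.

```python
import re
from datetime import datetime

# Hypothetical rule: dates must be submitted as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_date(value: str) -> bool:
    """Accept only well-formed, real calendar dates in YYYY-MM-DD form."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        # Right shape, but not a real date (for example February 30).
        return False
    return True


def scrub(value: str) -> str:
    """Light cleanup that is still needed even with a required format."""
    return value.strip()


submissions = ["2015-02-19", " 2015-02-19 ", "19/02/2015", "2015-02-30"]
for raw in submissions:
    print(repr(raw), is_valid_date(scrub(raw)))
```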
This is true even for data
from supposedly
reputable sources, like
government or media
organizations.
Example: Doctor Who
Villains dataset
http://tinyurl.com/doctorwhovillains
This step is no fun!
But it’s absolutely
necessary.
If you are thinking about
your data, and the tasks that
you need to accomplish,
then it’s easier to determine
what sort of language or
platform your project needs.
There are countless tutorials,
online courses, etc., for almost
any programming language or
platform.
(You can also ask for a Sherman
Centre consultation to figure out
what you need to learn.)
Learning how to work with
any tool can be a slow
process, especially at first.
However, knowing what
tasks you’re working
towards makes it easier to
understand the purpose of
the introductory lessons.
It’s also easier to think about
how the first rules you learn
for any language or platform
might affect your goals.
Pseudocode
• Used by programmers to break down a
complex task into manageable steps
• Easily adaptable for use by non-programmers
Pseudocode
Example
(Visible Prices)
• Computer has a file that contains prices from different
texts.
• Computer must know that each price amount is
connected with an object, and with a bibliographical
record.
• Users can input a price amount, and computer will
retrieve all objects that match the price, and display
them to the user, along with bibliographical information.
• (More complex): Computer is able to retrieve prices
linked with certain categories (clothing, food, etc.)
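The pseudocode above could be turned into something like the following. This is only a sketch: the record layout, the choice to store prices as integer pence, and every sample row are assumptions made for illustration, not the Visible Prices implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceRecord:
    amount_pence: int  # price stored as one integer unit (an assumption)
    item: str          # the object the price is connected with
    source: str        # bibliographical record for the text
    category: str      # e.g. "clothing", "food"


# Invented sample rows, purely for illustration.
RECORDS = [
    PriceRecord(12, "loaf of bread", "Hypothetical Novel A (1800)", "food"),
    PriceRecord(12, "pair of gloves", "Hypothetical Novel B (1810)", "clothing"),
    PriceRecord(240, "greatcoat", "Hypothetical Novel B (1810)", "clothing"),
]


def find_by_price(amount_pence):
    """Retrieve every record whose price matches the user's input."""
    return [r for r in RECORDS if r.amount_pence == amount_pence]


def find_by_category(category):
    """(More complex) retrieve every record linked with a category."""
    return [r for r in RECORDS if r.category == category]


# Display matching objects along with their bibliographical information.
for record in find_by_price(12):
    print(f"{record.item}: {record.source}")
```

Keeping the price, the object, and the source in a single record is what lets the computer "know" they are connected.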
(And now for a couple of
examples of projects in
process…)
Social Work and Social
Change
(Tina Wilson)
Social work + social change
• Recent history of academic social work in Canada; 1960s
onward
• Interested in the ways in which academic social work has
attempted to advance justice-oriented social change projects,
and how political, cultural, and theoretical shifts have
influenced this type of disciplinary imagination and work
• Related to disciplinary boundaries and methods and
orthodoxies, and the social role of universities
MARC Record, front end
MARC Record, back end
<?xml version="1.0"?>
<record xmlns="http://www.loc.gov/MARC21/slim"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
<leader>00000cam a000001</leader>
<controlfield tag="001">468966</controlfield>
<controlfield tag="008">710913s1968vaub000 0 eng c</controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a">a68007753 </subfield>
</datafield>
<datafield tag="040" ind1=" " ind2=" ">
<subfield code="a">Virginia. Univ. Libr.</subfield>
<subfield code="b">eng</subfield>
<subfield code="c">DLC</subfield>
<subfield code="d">OCLCQ</subfield>
<subfield code="d">CLU</subfield>
<subfield code="d">OCLCO</subfield>
<subfield code="d">OCLCF</subfield>
<subfield code="d">OCLCQ</subfield>
</datafield>
<datafield tag="043" ind1=" " ind2=" ">
<subfield code="a">n-us-va</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2="4">
<subfield code="a">HV98.V8</subfield>
<subfield code="b">C46</subfield>
</datafield>
<datafield tag="082" ind1=" " ind2=" ">
<subfield code="a">361/.9/755</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Cepuran, Joseph.</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Public assistance and child welfare:</subfield>
<subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
</datafield>
<datafield tag="260" ind1=" " ind2=" ">
<subfield code="a">[Charlottesville]</subfield>
<subfield code="b">Institute of Government, University of Virginia,</subfield>
<subfield code="c">1968.</subfield>
</datafield>
<datafield tag="300" ind1=" " ind2=" ">
<subfield code="a">vii, 120 pages</subfield>
<subfield code="c">28 cm</subfield>
</datafield>
<datafield tag="336" ind1=" " ind2=" ">
<subfield code="a">text</subfield>
<subfield code="b">txt</subfield>
<subfield code="2">rdacontent</subfield>
</datafield>
<datafield tag="337" ind1=" " ind2=" ">
<subfield code="a">unmediated</subfield>
<subfield code="b">n</subfield>
<subfield code="2">rdamedia</subfield>
</datafield>
<datafield tag="338" ind1=" " ind2=" ">
<subfield code="a">volume</subfield>
<subfield code="b">nc</subfield>
<subfield code="2">rdacarrier</subfield>
</datafield>
<datafield tag="504" ind1=" " ind2=" ">
<subfield code="a">Includes bibliographical references.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="0">
<subfield code="a">Public welfare</subfield>
<subfield code="z">Virginia.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="0">
<subfield code="a">Child welfare</subfield>
<subfield code="x">Government policy</subfield>
<subfield code="z">Virginia.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="7">
<subfield code="a">Child welfare</subfield>
<subfield code="x">Government policy.</subfield>
<subfield code="2">fast</subfield>
<subfield code="0">(OCoLC)fst00854729</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="7">
<subfield code="a">Public welfare.</subfield>
<subfield code="2">fast</subfield>
<subfield code="0">(OCoLC)fst01083250</subfield>
</datafield>
<datafield tag="651" ind1=" " ind2="7">
<subfield code="a">Virginia.</subfield>
<subfield code="2">fast</subfield>
<subfield code="0">(OCoLC)fst01204597</subfield>
</datafield>
<datafield tag="710" ind1="2" ind2=" ">
<subfield code="a">University of Virginia.</subfield>
<subfield code="b">Institute of Government.</subfield>
</datafield>
</record>
MARC 21 Format
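To show how a computer might read a record like this, here is a hedged sketch using Python's standard `xml.etree.ElementTree`. It embeds an abbreviated copy of the record (only fields 245 and 650) and uses the standard MARCXML namespace. The helper name `subfields` and the " -- " separator are illustrative choices.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace, mapped to a prefix for use in searches.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Abbreviated copy of the record: only fields 245 (title) and 650 (subject).
RECORD_XML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Public assistance and child welfare:</subfield>
    <subfield code="b">the Virginia pattern, 1646 to 1964.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Public welfare</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">Child welfare</subfield>
    <subfield code="x">Government policy</subfield>
    <subfield code="z">Virginia.</subfield>
  </datafield>
</record>"""


def subfields(record, tag):
    """Return one list of subfield texts per datafield with the given tag."""
    fields = record.findall(f"marc:datafield[@tag='{tag}']", MARC_NS)
    return [
        [sf.text for sf in field.findall("marc:subfield", MARC_NS)]
        for field in fields
    ]


record = ET.fromstring(RECORD_XML)
title = " ".join(subfields(record, "245")[0])
subjects = [" -- ".join(parts) for parts in subfields(record, "650")]
print(title)
print(subjects)
```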
Things to count
• Social problems: child abuse, unemployment, inequality
• Concepts: mental hygiene, non-voluntary clients, culture of
poverty, consciousness raising, privilege
• Sub-populations: immigrants, unwed mothers, the oppressed
• Institutions: work houses, shelters, detention, the non-profit
industrial complex
• Interventions: motivational interviewing, case management,
urban planning, life skills education, community organizing
• Types of social work: case work, radical social work,
community development, clinical social work
• SW books in Canadian Libraries
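Counting could start as simply as the sketch below, which tallies how many titles mention each term. The three terms come from the list above. The titles are invented, and simple substring matching is an assumption that a real project would likely refine.

```python
from collections import Counter

# Terms from the "Things to count" list; the titles are invented examples.
TERMS = ["case work", "community development", "unemployment"]
TITLES = [
    "Hypothetical title: case work and unemployment",
    "Hypothetical title: community development in practice",
    "Hypothetical title: unemployment and social policy",
]


def count_terms(texts, terms):
    """Count how many texts mention each term (case-insensitive substring match)."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in terms:
            if term in lowered:
                counts[term] += 1
    return counts


counts = count_terms(TITLES, TERMS)
for term in TERMS:
    print(term, counts[term])
```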
Project: People, Persons and
Individuals: Is the DSM
Dehumanizing?
(Mackenzie Salt)
Objective
To analyze the diagnostic chapters of five
volumes of the DSM to determine whether
the referring expressions used therein are
dehumanizing and if so, determine if the
usages have changed over time.
Problems
Traditional discourse analysis is done by
hand and can be very time consuming.
Volumes of the DSM range from 494 pages
to 991 pages.
Solutions
• Digital Corpus Analysis
– Computer Software (R)
• Faster
• More Efficient
• Can handle large amounts of data at once
• Data has to be prepared before it is ready
to be used for digital analysis.
Preparation of Data
• Physical data must be converted to digital
medium
• Steps
– Permission
– Scanning to PDF
– OCR PDF
– Convert OCRed PDF to Plain Text
– Clean Plain Text
Permission
• Digital e-book copies of DSM are not
available for any of the versions
• American Psychiatric Association holds
copyright and is VERY protective
Scanning DSMs
• Physical copies of the DSMs need to be
scanned into a digital format (in this case
PDF)
• PDFs need to be converted to a text
format that a computer can read, edit, and
work with
OCR PDFs
Clean Plain Text Files
• Once you have OCRed plain text files,
you need to make sure they are accurate
– Computers are only as good as their input
• If the data input is messy, the analysis will be
messy
• Made files consisting of only the chapters
for analysis
• Checked for and fixed any remaining
OCR/Scanning errors
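Part of the cleaning step can be automated. The sketch below assumes two typical OCR problems (words hyphenated across line breaks, and hard line breaks inside paragraphs). It is not the procedure used in the project, and the sample sentence is invented.

```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply a few mechanical fixes to OCR output."""
    # Rejoin words hyphenated across a line break: "diag-\nnosis" -> "diagnosis".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces, but keep blank-line paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "The diag-\nnosis  is made\nwhen criteria are met.\n\nNext paragraph."
print(clean_ocr_text(sample))
```

The hyphen rule will also merge genuinely hyphenated compounds that happen to break at a line end, so a manual check of the output is still needed.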
Now To The Project
• Come up with a list of referring
expressions based on a visual scan
through the DSMs
• Use R to narrow down the list to only the
most frequent
– Narrows 10K+ unique words to a handful
• Use R to pull out all sentences with the
terms in question
– Narrows down ~19K sentences to 655 for
individual
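The project did these two steps in R. The sketch below shows the same idea in Python and is not the project's code. The sample text is invented (not taken from the DSM), and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased word forms in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sentences_containing(text, term):
    """Return the sentences in which the term appears as a whole word."""
    # Naive split on ., ! or ? followed by whitespace (an assumption;
    # abbreviations such as "e.g." would need more care).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Invented sample text, not taken from the DSM.
corpus = (
    "The individual reports the symptom. A person may recover. "
    "Each individual is assessed. Patients differ."
)
candidates = ["individual", "person", "patient"]

# Step 1: rank the candidate referring expressions by frequency.
freq = word_frequencies(corpus)
ranked = sorted(candidates, key=lambda term: freq[term], reverse=True)
print(ranked)

# Step 2: pull out every sentence that uses one of the terms.
print(sentences_containing(corpus, "individual"))
```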
Benefits of Digital Analysis
• This project still used some manual
analysis
• Using digital technologies and corpora
sped things up considerably
• Made it easier to break down large corpus
to manageable parts
• Now have a corpus on which to do other
projects in the future: prep-work already done
Working with Digital Corpora with R
• Pros
– Free and Cross-Platform
– Powerful, Efficient, Fast
– Capable of working with VERY large datasets
– Subsequent projects can be much faster as
code can be saved and built on or recycled
• Cons
– Code-based command-line style interface (?)
– GIGO
– Depending on project, input data may need
substantial preparation
Summary
• Overall, 75% of the time spent doing this
project was prepping the data
• Project took only 3-4 months to do, part-time
• Corpus analyzed totalled 1.08 million words
from the DSM-III through the DSM-5
• Future projects based on this corpus will be
much faster to do as well
• Digital technologies made this project
feasible
• Project was much faster than if done by hand
It is likely that your data will
have a longer life span than
any specific project you
create.
In many instances, it may be
more useful to focus on data
curation as much as on a
single project.
Key DS Values
• Adaptive
• Sustainable/resource-aware
• Collaborative
• Social
Key skill
• Thinking flexibly about your data
(and potential project)
• Are there portions of your dataset that
could be extracted for use in a
particular tool?
• How can you adjust your data in order
to show it to people (and be better able
to talk/write/present about your
research interests)?
And now, it’s your turn...
For this activity, I
recommend that you pair up,
or form small groups to work
together.
Group Activity
• What do you need to do with your
data? (share, aggregate, combine…?)
• What units might that data exist in?
• What categories do you need to
create?
• What connections need to exist
between the units and the categories?
Next steps
• What’s the smallest version of your
dataset possible? (useful for testing out
tools)
• Possible tools to examine (as ways of
presenting your data)
• Omeka (http://www.omeka.net)
• Scalar (http://scalar.usc.edu)
• Simile (http://www.simile-widgets.org)
• Google Fusion Tables
(https://support.google.com/fusiontables/answer/2571232)
SCDS support for data wrangling
• Consultations
http://www.tinyurl.com/scds-consult
• Colloquium slots (opportunities to talk
through your project plans for a
supportive audience)
• Graduate fellowships (workspace and
greater access to SCDS staff
expertise)
Spring Workshops!
• Project Ideation and Development;
Choosing Tools for Every Part of Your
Project
• April 9th and 16th, 2015 (pre-registration
available soon)
