DMDS Winter Workshop 2 Slides

Winter 2015: Session #2
Programming on the Whiteboard
February 19, 2015
(Paige Morgan)

Last week...
• The work of creating usable data
• Forms that this data might take:
• markup language
• Spreadsheets (MySQL & relational
DBs)
• Graph databases (RDF/Linked Open
Data

This week:
• Caveat Curator (challenges of working
with data)
• Programming on the Whiteboard, i.e.,
conceptualizing the specific steps that
you need to take to accomplish your
goals

Goals/Takeaways
• A better understanding of the
workflow for dealing with data
• Greater ability to talk about what
you’re trying to do

Why this focus on data?
• Understanding your data, and your
intended actions, is a key skill for
developing any digital project (big or
small).
• You may have one big project – but
your data may support several
small/intermediary projects.

Programming
languages (and
digital apps) are like
human languages in
that they both have
phrases, patterns,
and rules.

Programming
languages are unlike
human languages in
that they aren’t for
communicating with
people.

They are also unlike
human languages in that
every programming
utterance does
something, i.e., causes an
action to occur.

You can get used to
patterns – even
unfamiliar ones.

The shift is in getting
used to thinking in
terms of every single
action.

Today’s subject matter
includes actions that you’ll
need to think about before
you work with...

Image: Josh Lee, @wtrsld, via Twitter, January
2014.

Even when you’re just
experimenting, you need to
prep your data.

You may know your
dataset in detail already,
from your research -- but
your computer is
concerned with different
levels of detail.

Becoming aware of those
levels of detail is not only
helpful for your project
ideas...

...it’s also a useful skill for
working with programming
languages.
(where a stray /> or ; can break your program/website)

Data only works if your
computer can read it.

But my data is just text!
(Doesn’t that make things easy?)

(Remember, your computer
is fairly stupid).

Formatted
text is often
full of text
your
computer
can’t parse
correctly.

The┘re┘sÜlt ís that yoÜr te┘xt
might come┘ oÜt looking
like┘this
whe┘n yoÜ ope┘n it in a
programming e┘nvironme┘nt.

So you need
to convert it to
plain text.
(without any of the fancy
details encoded in MS Word
fonts.)

(This is key if you work with
newspapers, older printed
texts, or archival material.)

Maybe you want to work
with sailing data and ports of
call:

The ship you’re interested in
leaves the Ivory Coast for
St. Helena...

But when you create your
map, you get this:

The latitude/longitude
coordinate is the significant
datum.

The city name is just the
human-readable
component.

Each datum needs to be
unique.

Figuring out what sort
of unique configuration
will work best involves
at least some
experimentation.

To experiment effectively,
you’ll want to keep careful
records.

If you develop categories
of information, you’ll want
to keep a record of what
each category means,
and what its limits are.

Cleaning and structuring
your data is a foundation
issue that changes,
depending on the available
format of your data.

What if your data is
crowdsourced?

You can require a particular
format for submissions

You can even put
programmatic limits on the
formats available for
submission

But in the end, you’re
probably still going to need
to scrub and/or format.

This is true even for data
from supposedly
reputable sources, like
government or media
organizations.

Example: Doctor Who
Villains dataset
http://tinyurl.com/doctorwhov
illains

But it’s absolutely
necessary.

If you are thinking about
your data, and the tasks that
you need to accomplish,
then it’s easier to determine
what sort of language or
platform your project needs.

There are countless tutorials,
online courses, etc., for almost
any programming language or
platform.
(You can also ask for a Sherman
Centre consultation to figure out
what you need to learn.)

Learning how to work with
any tool can be a slow
process, especially at first.

However, knowing what
tasks you’re working
towards makes it easier to
understand the purpose of
the introductory lessons.

It’s also easy to think about
how the first rules you learn
for any language or platform
might affect your goals.

Pseudocode
• Used by programmers to break down a
complex task into manageable steps
• Easily adaptable for use by non-
programmers

Pseudocode
Example
(Visible Prices)• Computer has a file that contains prices from different
texts.
• Computer must know that each price amount is
connected with an object, and with a bibliographical
record.
• Users can input a price amount, and computer will
retrieve all objects that match the price, and display
them to the user, along with bibliographical information.
• (More complex): Computer is able to retrieve prices
linked with certain categories (clothing, food, etc.)

(And now for a couple of
examples of projects in
process…)

Social Work and Social
Change
(Tina Wilson)

Social work + social change
• Recent history of academic social work in Canada; 1960s
onward
• Interested in the ways in which academic social work has
attempted to advance justice-oriented social change projects,
and how political, cultural, and theoretical shifts have
influenced this type of disciplinary imagination and work
• Related to disciplinary boundaries and methods and
orthodoxies, and the social role of universities

MARC Record, back end
<?xml version="1.0"?>
<record xmlns="http://www.loc.gov/MARC21/slim"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21sli
m.xsd">
<leader>00000cam a000001</leader>
<controlfield tag="001">468966</controlfield>
<controlfield tag="008">710913s1968vaub000 0 eng
c</controlfield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a">a68007753 </subfield>
</datafield>
<subfield code="a">Virginia. Univ. Libr.</subfield>
<subfield code="b">eng</subfield>
<subfield code="c">DLC</subfield>
<subfield code="d">OCLCQ</subfield>
<subfield code="d">CLU</subfield>
<subfield code="d">OCLCO</subfield>
<subfield code="d">OCLCF</subfield>
<subfield code="d">OCLCQ</subfield>
</datafield>
<subfield code="a">n-us-va</subfield>
</datafield>
<datafield tag="050" ind1="0" ind2="4">
<subfield code="a">HV98.V8</subfield>
<subfield code="b">C46</subfield>
</datafield>
<subfield code="a">361/.9/755</subfield>
</datafield>
<datafield tag="100" ind1="1" ind2=" ">
<subfield code="a">Cepuran, Joseph.</subfield>
</datafield>
<datafield tag="245" ind1="1" ind2="0">
<subfield code="a">Public
assistance and child
welfare:</subfield>
<subfield code="b">the Virginia pattern, 1646 to
1964.</subfield>
</datafield>
<subfield code="a">[Charlottesville]</subfield>
<subfield code="b">Institute of Government, University of
Virginia,</subfield>
<subfield code="c">1968.</subfield>
</datafield>
<subfield code="a">vii, 120 pages</subfield>
<subfield code="c">28 cm</subfield>
</datafield>
<subfield code="a">text</subfield>
<subfield code="b">txt</subfield>
<subfield code="2">rdacontent</subfield>
</datafield>
<subfield code="a">unmediated</subfield>
<subfield code="b">n</subfield>
<subfield code="2">rdamedia</subfield>
</datafield>
<subfield code="a">volume</subfield>
<subfield code="b">nc</subfield>
<subfield code="2">rdacarrier</subfield>
</datafield>
<subfield code="a">Includes bibliographical
references.</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2="0">
<subfield code="a">Public
welfare</subfield>
<subfield code="z">Virginia.</subfield>
</datafield>
<subfield code="a">Child welfare</subfield>
<subfield code="x">Government policy</subfield>
<subfield code="z">Virginia.</subfield>
</datafield>
<subfield code="a">Child welfare</subfield>
<subfield code="x">Government policy.</subfield>
<subfield code="2">fast</subfield>
<subfield code="0">(OCoLC)fst00854729</subfield>
</datafield>
<subfield code="a">Public welfare.</subfield>
</datafield>
<subfield code="a">Virginia.</subfield>
</datafield>
<datafield tag="710" ind1="2" ind2=" ">
<subfield code="a">University of Virginia.</subfield>
<subfield code="b">Institute of Government.</subfield>
</datafield>
</record>
MARC 21 Format

Things to count
• Social problems: child abuse, unemployment, inequality
• Concepts: mental hygiene, non-voluntary clients, culture of
poverty, consciousness raising, privilege
• Sub-populations: immigrants, unwed mothers, the oppressed
• Institutions: work houses, shelters, detention, the non-profit
industrial complex
• Interventions: motivational interviewing, case management,
urban planning, life skills education, community organizing
• Types of social work: case work, radical social work,
community development, clinical social work
• SW books in Canadian Libraries

Project: People, Persons and
Individuals: Is the DSM
Dehumanizing?
(Mackenzie Salt)

Objective
To analyze the diagnostic chapters of five
volumes of the DSM to determine whether
the referring expressions used therein are
dehumanizing and if so, determine if the
usages have changed over time.

Problems
Traditional discourse analysis is done by
hand and can be very time consuming.
Volumes of the DSM range from 494 pages
to 991 pages.

Solutions
• Digital Corpus Analysis
– Computer Software (R)
• Faster
• More Efficient
• Can handle large amounts of data at once
• Data has to be prepared before it is ready
to be used for digital analysis.

Preparation of Data
• Physical data must be converted to digital
medium
• Steps
– Permission
– Scanning to PDF
– OCR PDF
– Convert OCRed PDF to Plain Text
– Clean Plain Text

Permission
• Digital e-book copies of DSM are not
available for any of the versions
• American Psychiatric Association holds
copyright and is VERY protective

Scanning DSMs
• Physical copies of the DSMs need to be
scanned into a digital format (in this case
PDF)
• PDFs need to be converted to a text
format that a computer can read, edit, and
work with
OCR PDFs

Clean Plain Text Files
• Once you have OCRed plain text files,
you need to make sure they are accurate
– Computers are only as good as their input
• If the data input is messy, the analysis will be
messy
• Made files consisting of only the chapters
for analysis
• Checked for and fixed any remaining
OCR/Scanning errors

Now To The Project
• Come up with a list of referring
expressions based on a visual scan
through the DSMs
• Use R to narrow down the list to only the
most frequent
– Narrows 10K+ unique words to a handful
• Use R to pull out all sentences with the
terms in question
– Narrows down ~19K sentences to 655 for
individual

Benefits of Digital Analysis
• This project still used some manual
analysis
• Using digital technologies and corpora
sped things up considerably
• Made it easier to break down large corpus
to manageable parts
• Now have a corpus on which to do other
projects in the future: prep-work already

Working with Digital Corpora with R
• Pros
– Free and Cross-Platform
– Powerful, Efficient, Fast
– Capable of working with VERY large datasets
– Subsequent projects can be much faster as
code can be saved and built on or recycled
• Cons
– Code-based command-line style interface (?)
– GIGO
– Depending on project, input data may need
substantial preparation

Summary
• Overall, 75% of the time spent doing this
project was prepping the data
• Project took only 3-4 months to do, part-time
• Corpus analyzed totalled 1.08 million words
from the DSM-III through the DSM-5
• Future projects based on this corpus will be
much faster to do as well
• Digital technologies made this project
feasible
• Project was much faster than if done by hand

It is likely that your data will
have a longer life span than
any specific project you
create.

In many instances, it may be
more useful to focus on the
data curation as much as a
single project.

Key DS Values
• Adaptive
• Sustainable/resource-aware
• Collaborative
• Social

Key skill• Thinking flexibly about your data
(and potential project)
• Are there portions of your dataset that
could be extracted for use in a
particular tool?
• How can you adjust your data in order
to show it to people (and be more able
to talk/write/present about your
research interests?)

For this activity, I
recommend that you pair up,
or form small groups to work
together.

Group Activity• What do you need to do with your
data? (share, aggregate, combine…?)
• What units might that data exist in?
• What categories do you need to
create?
• What connections need to exist
between the units and the categories?

Next steps
• What’s the smallest version of your
dataset possible? (useful for testing out
tools)
• Possible tools to examine (as ways of
presenting your data)
• Omeka (http://www.omeka.net)
• Scalar (http://scalar.usc.edu)
• Simile (http://www.simile-widgets.org)
• Google Fusion Tables
(https://support.google.com/fusiontables/answer/2571232)

SCDS support for data wrangling
• Consultations
http://www.tinyurl.com/scds-consult
• Colloquium slots (opportunities to talk
through your project plans for a
supportive audience)
• Graduate fellowships (workspace and
greater access to SCDS staff
expertise)

Spring Workshops!
• Project Ideation and Development;
Choosing Tools for Every Part of Your
Project
• April 9th and 16th, 2015 (pre-
registration available soon)

DMDS Winter Workshop 2 Slides

More Related Content

Similar to DMDS Winter Workshop 2 Slides (20)

More from Paige Morgan (12)

Recently uploaded (20)

DMDS Winter Workshop 2 Slides