What is Big Data? Methods What have others done? What can we do? The schedule
Big Data and Automated Content Analysis
Week 1 – Wednesday
»Introduction«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
1 April 2014
Big Data and Automated Content Analysis Damian Trilling
Today
1 What is Big Data?
Definitions
Are we doing Big Data research?
2 Methods
Which techniques?
Which tools?
3 What have others done?
Online news sharing
Partisan asymmetries
4 What can we do?
Considerations regarding feasibility
Examples from last year
5 The schedule
Next meetings
Definitions
What is Big Data?
What would you call Big Data?
A simple technical definition could be:
Everything that needs so much computational power and/or
storage that you cannot do it on a regular computer.
Vis, 2013
• “commercial” definition (Gartner): “‘Big data’ is high-volume,
-velocity and -variety information assets that demand
cost-effective, innovative forms of information processing for
enhanced insight and decision making”
• boyd & Crawford definition:
1 Technology: maximizing computation power and algorithmic
accuracy to gather, analyze, link, and compare large data sets.
2 Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims.
3 Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generate
insights that were previously impossible, with the aura of truth,
objectivity, and accuracy.
Implications & criticism
boyd & Crawford, 2012
1 Big Data changes the definition of knowledge
2 Claims to objectivity and accuracy are misleading
3 Bigger data are not always better data
4 Taken out of context, Big Data loses its meaning
5 Just because it is accessible does not make it ethical
6 Limited access to Big Data creates new digital divides
APIs, researchers and tools make Big Data
Vis, 2013
Inevitable influences of:
• APIs
• filtering, search strings, . . .
• changing services over time
• organizations that provide the data
Are we doing Big Data research?
Are we doing Big Data research in this course?
Depends on the definition
• Not if we take a definition that only focuses on computing
power and the amount of data
• But: We are using the same techniques. And they scale well.
• Oh, and about that high-performance computing in the cloud:
We actually do have access to that, so if someone has a really
great idea. . .
Methods
Which techniques?
What we will learn in the coming weeks
1. How to collect data
APIs, scrapers and crawlers, feeds, databases, . . .
Storage in different file formats
2. How to analyze data
Sentiment analysis, automated content analysis, regular
expressions, natural language processing, cluster analysis, machine
learning, network analysis
Which tools?
The methods
The tools we use for this
Some existing tools, but mostly we will write our own tools (or
modify existing tools) using the programming language Python
(and make use of the huge number of Python modules others
have already written).
But aren’t there existing tools? Sometimes. But look at this
example:
Why program your own tool?
If the task had been done with a (commercial) tool, we could
only research what the tool allows us to do (⇒ our discussion from
a few minutes ago).
Luckily, the problem is easily solved
The task was done with a self-written Python program. We change
the line
lengte_list.append(len(row[textcolumn]))
to
lengte_list.append(len(row[textcolumn].split()))
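What the change does can be seen in a minimal sketch. Only the names lengte_list, row and textcolumn come from the slide; the sample rows, the column position and the use of the csv module are assumptions made for illustration. The first version counts characters, the second counts words.

```python
import csv
import io

# Hypothetical sample data standing in for the real input file.
sample = io.StringIO('id,text\n1,"Big Data is made"\n2,"APIs shape what we see"\n')

textcolumn = 1  # assumed position of the text column

char_lengths = []  # original line: length of each text in characters
lengte_list = []   # changed line: length of each text in words

reader = csv.reader(sample)
next(reader)  # skip the header row
for row in reader:
    char_lengths.append(len(row[textcolumn]))
    # split() without arguments splits on any run of whitespace
    lengte_list.append(len(row[textcolumn].split()))

print(char_lengths)
print(lengte_list)
```

With a closed tool, switching from one measure to the other would only be possible if the tool happened to offer it.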
Moreover, the tools we use can limit the range of
questions that might be imagined, simply because they
do not fit the affordances of the tool. Not many
researchers themselves have the ability or access to
other researchers who can build the required tools in
line with any preferred enquiry. This then
introduces serious limitations in terms of the scope
of research that can be done. Vis, 2013
More considerations
Assuming that science should be transparent and reproducible, we
should use tools that are
• platform-independent
• free (as in beer and as in speech, gratis and libre)
• which implies: open source
This ensures that (a) our research can be reproduced by anyone,
and (b) there is no black box that no one can look inside.
Why program your own tool?
[...] these [commercial] tools are often unsuitable
for academic purposes because of their cost, along
with the problematic ‘black box’ nature of many of
these tools. Vis, 2013
[...] we should resist the temptation to let the
opportunities and constraints of an application or
platform determine the research question [...] Mahrt &
Scharkow, 2013, p. 30
What have others done?
Online news sharing
Castillo, El-Haddad, Pfeffer, & Stempeck, 2013
The question
“We describe the interplay between website visitation patterns and
social media reactions to news content. [. . . ] We also show that
social media reactions can help predict future visitation patterns
early and accurately.” (p.1)
The method
• data set 1: log files provided by Al Jazeera
• data set 2: Facebook and Twitter API
• analysis: link 1 and 2 and estimate the relationships
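The linking step can be sketched roughly as follows. This is not the analysis from the paper: the article IDs and numbers are invented, and a plain Pearson correlation stands in for whatever models the authors estimated.

```python
from math import sqrt

# Invented numbers per article; not data from Castillo et al.
visits = {"a1": 1200, "a2": 300, "a3": 950, "a4": 100}  # data set 1: log files
reactions = {"a1": 80, "a2": 15, "a3": 60, "a5": 40}    # data set 2: shares and tweets

# Link: keep only the articles that occur in both data sets.
linked = sorted(set(visits) & set(reactions))
x = [reactions[a] for a in linked]
y = [visits[a] for a in linked]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(x, y)
print(linked, round(r, 3))
```

Note that articles missing from either source (a4 and a5 here) silently drop out of the analysis, which is one more way in which the data are shaped by how they are collected.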
The issues
• You need that guy at Al Jazeera
• You need the infrastructure to cope with the data
• Very much tailored to one outlet
Partisan asymmetries
Conover, Gonçalves, Flammini, & Menczer, 2012
The question
How does the Twitter behavior differ between right-wing and
left-wing users?
The method
Starting with two hashtags (one used by progressives, one used by
conservatives), 55 co-occurring hashtags were identified.
Identification of follower networks, retweet networks, mention
networks within tweets with these hashtags.
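The idea of finding co-occurring hashtags from two seeds can be sketched as below. The tweets are made up, and counting raw co-occurrences is a simplifying assumption; the selection criterion used in the paper may well differ.

```python
import re
from collections import Counter

# Made-up tweets for illustration; not data from Conover et al.
tweets = [
    "Lower taxes now #tcot #gop",
    "Healthcare for all #p2 #hcr",
    "Town hall tonight #tcot #gop #teaparty",
    "Vote early #p2 #dems",
    "Nice weather today #sunny",
]

seeds = {"#tcot", "#p2"}  # the two seed hashtags
cooccurring = Counter()

for tweet in tweets:
    tags = set(re.findall(r"#\w+", tweet.lower()))
    if tags & seeds:                      # the tweet contains at least one seed
        cooccurring.update(tags - seeds)  # count the other hashtags in it

print(cooccurring.most_common())
```

Tweets without a seed hashtag (the last one here) never enter the sample, which already hints at the oversampling issue raised on the next slide.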
The issues
• The two seeds #tcot and #p2 oversample (extremely) partisan
content, inevitably leading to the structure in Figure 3.
• But maybe no problem, as the rest of the paper aims to
compare these groups.
• Not an empirical problem, but still: What do we learn exactly
from this study apart from the – interesting – case?
Considerations regarding feasibility
To what extent could we conduct
such studies?
What can we do?
Cool research, sure, but what can we do?
• Dependency on third parties:
scraping < API < server-side implementation
• Restrictions (e.g., Twitter: sprinkler, garden hose, fire hose)
⇒ Vis, 2013: Data are made!
• We can’t just trust the numbers. Some tasks require human
coders – or a qualitative approach, at least as a pre-study.
pros, cons, and feasibility
• APIs (Twitter, Facebook, . . . )
• server-side implementations (⇒ Al Jazeera-example)
• scraping
• client-side log files
• "traditional" methods (surveys etc.)
What helps us answer our questions?
1 Draft a RQ.
2 Think of different ways to approach it
3 Think of different data sources.
4 Think of different analyses.
And then:
1 Check what the technical possibilities are. (e.g., is there an
API? How can we get the data?)
2 Re-evaluate all steps.
Examples from last year
Some examples from last year’s course
Questions
• Are (popular) apps for children in the Apple App Store framed
in terms of education or entertainment?
• How can house prices on funda.nl be predicted?
• Where are those who edit Wikipedia-entries about companies
geographically located, related to the company’s HQ?
• How do embassies [or top universities] communicate on
Twitter?
• How is the tone in newspaper election coverage related to the
tone on Twitter?
The schedule
Each week
In general: A lecture (Monday) and a lab session (Wednesday).
Each week one method.
Examinations
A mid-term take-home exam in week 5 and an individual research
project on which you work during the whole course.
Next meetings
Week 2
No meeting on Monday (Easter)
Wednesday, 8 April: Lab Session
Getting to know Python better.
Do not forget to finish the exercises in chapters 2 and 3 in advance.
Week 3: Data harvesting and storage
Monday, 13–4
A conceptual overview of APIs, scrapers, crawlers, RSS-feeds,
databases, and different file formats
Wednesday, 15–4
Writing some first data collection scripts
Big Data and Automated Content Analysis Damian Trilling

More Related Content

PDF
Introduction to Data Science and Large-scale Machine Learning
PPTX
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
PDF
From Big Data to AI
PDF
Clare Corthell: Learning Data Science Online
Introduction to Data Science and Large-scale Machine Learning
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
From Big Data to AI
Clare Corthell: Learning Data Science Online

What's hot (15)

PDF
Claudia Gold: Learning Data Science Online
PDF
Big data and AI presentation slides
PDF
How to become a Data Scientist?
PPSX
Bigdata presentation
PPTX
Lessons Learned The Hard Way: 32+ Data Science Interviews
PPTX
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
PPTX
Machine Learning for Non-technical People
PDF
Be a Data Scientist in 8 steps!
PPTX
Lecture #01
PDF
Data_Scientist_Position_Description
PPTX
How To Become a Data Scientist in Iran Marketplace
PPTX
Big Data Courses in Pune
PPTX
Machine Learning for Non-Technical People - Turing Fest 2019
PPTX
Course Information for March 25th Batch
PDF
Quantitative Ethics - Governance and ethics of AI decisions
Claudia Gold: Learning Data Science Online
Big data and AI presentation slides
How to become a Data Scientist?
Bigdata presentation
Lessons Learned The Hard Way: 32+ Data Science Interviews
Keynote - An overview on Big Data & Data Science - Dr Gregory Piatetsky-Shapiro
Machine Learning for Non-technical People
Be a Data Scientist in 8 steps!
Lecture #01
Data_Scientist_Position_Description
How To Become a Data Scientist in Iran Marketplace
Big Data Courses in Pune
Machine Learning for Non-Technical People - Turing Fest 2019
Course Information for March 25th Batch
Quantitative Ethics - Governance and ethics of AI decisions
Ad

Similar to BD-ACA week1b (20)

PDF
Career in Data Science (July 2017, DTLA)
PDF
Thinkful DC - Intro to Data Science
PPTX
Big data analytics
PDF
Getting Started in Data Science
PDF
How to succeed at data without even trying!
PDF
A Review of Big data for Social Policy Decision Making
PDF
Big Data, Big Opportunities
PDF
Tableau @ Facebook - Summer 2014
PDF
Big Data Intoduction & Hadoop ArchitectureModule1.pdf
PDF
Getting started in data science (4:3)
PDF
Getting started in data science (4:3)
PDF
Getting started in Data Science (April 2017, Los Angeles)
PDF
Thinkful - Intro to Data Science - Washington DC
PPTX
Big Data and HR - Talk @SwissHR Congress
PDF
Startds9.19.17sd
PDF
Getstarteddssd12717sd
PDF
Deck 92-146 (3)
PDF
2017 06-14-getting started with data science
PDF
Intro to Data Science
Career in Data Science (July 2017, DTLA)
Thinkful DC - Intro to Data Science
Big data analytics
Getting Started in Data Science
How to succeed at data without even trying!
A Review of Big data for Social Policy Decision Making
Big Data, Big Opportunities
Tableau @ Facebook - Summer 2014
Big Data Intoduction & Hadoop ArchitectureModule1.pdf
Getting started in data science (4:3)
Getting started in data science (4:3)
Getting started in Data Science (April 2017, Los Angeles)
Thinkful - Intro to Data Science - Washington DC
Big Data and HR - Talk @SwissHR Congress
Startds9.19.17sd
Getstarteddssd12717sd
Deck 92-146 (3)
2017 06-14-getting started with data science
Intro to Data Science
Ad

More from Department of Communication Science, University of Amsterdam (20)

PDF
Media diets in an age of apps and social media: Dealing with a third layer of...
PDF
Conceptualizing and measuring news exposure as network of users and news items
PDF
Data Science: Case "Political Communication 2/2"
PDF
Data Science: Case "Political Communication 1/2"
Media diets in an age of apps and social media: Dealing with a third layer of...
Conceptualizing and measuring news exposure as network of users and news items
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 1/2"

Recently uploaded (20)

PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Classroom Observation Tools for Teachers
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PPTX
Lesson notes of climatology university.
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Trump Administration's workforce development strategy
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PPTX
Pharma ospi slides which help in ospi learning
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
PDF
Yogi Goddess Pres Conference Studio Updates
PDF
RMMM.pdf make it easy to upload and study
PPTX
GDM (1) (1).pptx small presentation for students
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Supply Chain Operations Speaking Notes -ICLT Program
Classroom Observation Tools for Teachers
Chinmaya Tiranga quiz Grand Finale.pdf
Lesson notes of climatology university.
Orientation - ARALprogram of Deped to the Parents.pptx
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
01-Introduction-to-Information-Management.pdf
Final Presentation General Medicine 03-08-2024.pptx
Pharmacology of Heart Failure /Pharmacotherapy of CHF
human mycosis Human fungal infections are called human mycosis..pptx
Trump Administration's workforce development strategy
202450812 BayCHI UCSC-SV 20250812 v17.pptx
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Pharma ospi slides which help in ospi learning
Microbial disease of the cardiovascular and lymphatic systems
OBE - B.A.(HON'S) IN INTERIOR ARCHITECTURE -Ar.MOHIUDDIN.pdf
Yogi Goddess Pres Conference Studio Updates
RMMM.pdf make it easy to upload and study
GDM (1) (1).pptx small presentation for students

BD-ACA week1b

  • 1. What is Big Data? Methods What have others done? What can we do? The schedule Big Data and Automated Content Analysis Week 1 – Wednesday »Introduction« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 1 April 2014 Big Data and Automated Content Analysis Damian Trilling
  • 2. What is Big Data? Methods What have others done? What can we do? The schedule Today 1 What is Big Data? Definitions Are we doing Big Data research? 2 Methods Which techniques? Which tools? 3 What have others done? Online news sharing Partisan asymmetries 4 What can we do? Considerations regarding feasibility Examples from last year 5 The schedule Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 3. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Big Data and Automated Content Analysis Damian Trilling
  • 5. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? What would you call Big Data? Big Data and Automated Content Analysis Damian Trilling
  • 6. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Big Data and Automated Content Analysis Damian Trilling
  • 7. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? A simple technical definition could be: Everything that needs so much computational power and/or storage that you cannot do it on a regular computer. Big Data and Automated Content Analysis Damian Trilling
  • 8. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Big Data and Automated Content Analysis Damian Trilling
  • 9. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Vis, 2013 • “commercial” definition (Gartner): “’Big data’ is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” Big Data and Automated Content Analysis Damian Trilling
  • 10. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Vis, 2013 • boyd & Crawford definition: 1 Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. 2 Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. 3 Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy. Big Data and Automated Content Analysis Damian Trilling
  • 11. What is Big Data? Methods What have others done? What can we do? The schedule Definitions What is Big Data? Vis, 2013 • “commercial” definition (Gartner): “’Big data’ is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making” • boyd & Crawford definition: 1 Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. 2 Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. 3 Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy. Big Data and Automated Content Analysis Damian Trilling
  • 12. What is Big Data? Methods What have others done? What can we do? The schedule Definitions Implications & critizism Big Data and Automated Content Analysis Damian Trilling
  • 13. What is Big Data? Methods What have others done? What can we do? The schedule Definitions Implications & critizism boyd & Crawford, 2012 1 Big Data changes the definition of knowledge 2 Claims to objectivity and accuracy are misleading 3 Bigger data are not always better data 4 Taken out of context, Big Data loses its meaning 5 Just because it is accessible does not make it ethical 6 Limited access to Big Data creates new digital divides Big Data and Automated Content Analysis Damian Trilling
  • 14. What is Big Data? Methods What have others done? What can we do? The schedule Definitions APIs, researchers and tools make Big Data Big Data and Automated Content Analysis Damian Trilling
  • 15. What is Big Data? Methods What have others done? What can we do? The schedule Definitions APIs, researchers and tools make Big Data Vis, 2013 Inevitable influences of: • APIs • filtering, search strings, . . . • changing services over time • organizations that provide the data Big Data and Automated Content Analysis Damian Trilling
  • 16. What is Big Data? Methods What have others done? What can we do? The schedule Are we doing Big Data research? Are we doing Big Data research in this course? Big Data and Automated Content Analysis Damian Trilling
  • 17. What is Big Data? Methods What have others done? What can we do? The schedule Are we doing Big Data research? Are we doing Big Data research in this course? Depends on the definition • Not if we take a definition that only focuses on computing power and the amount of data Big Data and Automated Content Analysis Damian Trilling
  • 18. What is Big Data? Methods What have others done? What can we do? The schedule Are we doing Big Data research? Are we doing Big Data research in this course? Depends on the definition • Not if we take a definition that only focuses on computing power and the amount of data • But: We are using the same techniques. And they scale well. Big Data and Automated Content Analysis Damian Trilling
  • 19. What is Big Data? Methods What have others done? What can we do? The schedule Are we doing Big Data research? Are we doing Big Data research in this course? Depends on the definition • Not if we take a definition that only focuses on computing power and the amount of data • But: We are using the same techniques. And they scale well. • Oh, and about that high-performance computing in the cloud: We actually do have access to that, so if someone has a really great idea. . . Big Data and Automated Content Analysis Damian Trilling
  • 20. What is Big Data? Methods What have others done? What can we do? The schedule Methods Big Data and Automated Content Analysis Damian Trilling
  • 23. Which techniques? What we will learn in the coming weeks 1. How to collect data: APIs, scrapers and crawlers, feeds, databases, ... Storage in different file formats 2. How to analyze data: Sentiment analysis, automated content analysis, regular expressions, natural language processing, cluster analysis, machine learning, network analysis
  • 26. Which tools? The methods The tools we use for this: Some existing tools, but mostly we will write our own tools (or modify existing tools) using the programming language Python (and make use of the huge number of Python modules others have already written). But aren’t there existing tools? Sometimes. But look at this example:
  • 29. Which tools? Why program your own tool? If the task had been done with a (commercial) tool, we could only research what the tool allows us to do (⇒ our discussion from some minutes ago). Luckily, the problem is easily solved: The task was done with a self-written Python program. We change the line lengte_list.append(len(row[textcolumn])) to lengte_list.append(len(row[textcolumn].split())), so that the length of each text is counted in words instead of characters.
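The one-line change on this slide can be tried out with a minimal sketch. Only `lengte_list`, `row` and `textcolumn` appear on the slide; the sample data, the `text_lengths` helper and its `unit` parameter are assumptions made for illustration, since the original program is not shown.

```python
import csv
import io

# Hypothetical sample data; the input file of the original program is not shown.
SAMPLE = "id,text\n1,Big Data is made\n2,We write our own tools\n"

# Assumed: index of the column that holds the text.
textcolumn = 1


def text_lengths(rows, textcolumn, unit="words"):
    """Return the length of each text, in characters or in words."""
    lengte_list = []
    for row in rows:
        if unit == "characters":
            # Line before the change: len() of a string counts characters.
            lengte_list.append(len(row[textcolumn]))
        else:
            # Line after the change: split() cuts on whitespace,
            # so len() now counts words.
            lengte_list.append(len(row[textcolumn].split()))
    return lengte_list


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row
rows = list(reader)

char_lengths = text_lengths(rows, textcolumn, unit="characters")
word_lengths = text_lengths(rows, textcolumn, unit="words")
print(char_lengths)
print(word_lengths)
```

With a closed tool, the unit of measurement would be whatever the vendor chose; in a self-written script it is a single expression that can be inspected and changed.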
  • 30. Which tools? Why program your own tool? “Moreover, the tools we use can limit the range of questions that might be imagined, simply because they do not fit the affordances of the tool. Not many researchers themselves have the ability or access to other researchers who can build the required tools in line with any preferred enquiry. This then introduces serious limitations in terms of the scope of research that can be done.” (Vis, 2013)
  • 33. Which tools? More considerations Assuming that science should be transparent and reproducible, we should use tools that are • platform-independent • free (as in beer and as in speech, gratis and libre) • which implies: open source. This ensures that (a) our research can be reproduced by anyone, and (b) there is no black box that no one can look inside.
  • 34. Which tools? Why program your own tool? “[...] these [commercial] tools are often unsuitable for academic purposes because of their cost, along with the problematic ‘black box’ nature of many of these tools.” (Vis, 2013) “[...] we should resist the temptation to let the opportunities and constraints of an application or platform determine the research question [...]” (Mahrt & Scharkow, 2013, p. 30)
  • 35. What have others done?
  • 36. Online news sharing (Castillo, El-Haddad, Pfeffer, & Stempeck, 2013)
  • 37. Online news sharing The question “We describe the interplay between website visitation patterns and social media reactions to news content. [...] We also show that social media reactions can help predict future visitation patterns early and accurately.” (p. 1) (Castillo, El-Haddad, Pfeffer, & Stempeck, 2013)
  • 38. Online news sharing The method • data set 1: log files provided by Al Jazeera • data set 2: Facebook and Twitter API • analysis: link 1 and 2 and estimate the relationships (Castillo, El-Haddad, Pfeffer, & Stempeck, 2013)
  • 40. Online news sharing The issues • You need that guy at Al Jazeera • You need the infrastructure to cope with the data • Very much tailored to one outlet (Castillo, El-Haddad, Pfeffer, & Stempeck, 2013)
  • 41. Partisan asymmetries (Conover, Gonçalves, Flammini, & Menczer, 2012)
  • 42. Partisan asymmetries The question How does the Twitter behavior differ between right-wing and left-wing users? (Conover, Gonçalves, Flammini, & Menczer, 2012)
  • 43. Partisan asymmetries The method Starting with two hashtags (one used by progressives, one used by conservatives), 55 co-occurring hashtags were identified. Follower networks, retweet networks, and mention networks were then identified within the tweets containing these hashtags. (Conover, Gonçalves, Flammini, & Menczer, 2012)
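The first step of this method, finding hashtags that co-occur with the two seeds, could look roughly like the sketch below. It is not the procedure from Conover et al.: the slide does not state their selection criterion, so plain counting is used here as an assumption, and the tweets, the regular expression and the function name `cooccurring_hashtags` are invented for illustration.

```python
import re
from collections import Counter

# Invented example tweets; not data from the study.
TWEETS = [
    "Lower taxes now #tcot #gop",
    "Healthcare for all #p2 #hcr",
    "Debate tonight #tcot #p2 #politics",
    "Nice weather today #sunny",
    "Vote! #p2 #hcr #politics",
]

# The two seed hashtags named on the slides.
SEEDS = {"#tcot", "#p2"}

# Simplifying assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def cooccurring_hashtags(tweets, seeds):
    """Count non-seed hashtags in tweets that contain at least one seed."""
    counts = Counter()
    for tweet in tweets:
        tags = {tag.lower() for tag in HASHTAG.findall(tweet)}
        if tags & seeds:  # the tweet mentions at least one seed hashtag
            counts.update(tags - seeds)
    return counts


counts = cooccurring_hashtags(TWEETS, SEEDS)
for tag, n in counts.most_common():
    print(tag, n)
```

Because only tweets that already contain a seed are considered, the choice of seeds shapes everything that is found afterwards.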
  • 45. Partisan asymmetries The issues • The two seeds #tcot and #p2 oversample (extremely) partisan content, inevitably leading to the structure in Figure 3. • But this may not be a problem, as the rest of the paper aims to compare these groups. • Not an empirical problem, but still: What exactly do we learn from this study apart from the (interesting) case? (Conover, Gonçalves, Flammini, & Menczer, 2012)
  • 46. Considerations regarding feasibility To what extent could we conduct such studies?
  • 47. Considerations regarding feasibility What can we do? Cool research, sure, but what can we do? • Dependency on third parties: scraping < API < server-side implementation • Restrictions (e.g., Twitter: sprinkler, garden hose, fire hose) ⇒ Vis, 2013: Data are made! • We can’t just trust the numbers. Some tasks require human coders, or a qualitative approach, at least as a pre-study.
  • 48. Considerations regarding feasibility What can we do? Pros, cons, and feasibility • APIs (Twitter, Facebook, ...) • server-side implementations (⇒ Al Jazeera example) • scraping • client-side log files • "traditional" methods (surveys etc.)
  • 50. Considerations regarding feasibility What can we do? What helps us answer our questions? 1 Draft an RQ. 2 Think of different ways to approach it. 3 Think of different data sources. 4 Think of different analyses. And then: 1 Check what the technical possibilities are (e.g., is there an API? How can we get the data?). 2 Re-evaluate all steps.
  • 51. Examples from last year Some examples from last year’s course Questions • Are (popular) apps for children in the Apple App Store framed in terms of education or entertainment? • How can house prices on funda.nl be predicted? • Where are those who edit Wikipedia entries about companies geographically located, related to the company’s HQ? • How do embassies [or top universities] communicate on Twitter? • How is the tone in newspaper election coverage related to the tone on Twitter?
  • 52. The schedule
  • 53. The schedule Each week In general: A lecture (Monday) and a lab session (Wednesday). Each week one method. Examinations A mid-term take-home exam in week 5 and an individual research project on which you work during the whole course.
  • 54. Next meetings
  • 55. Next meetings Week 2 No meeting on Monday (Easter) Wednesday, 8 April: Lab Session Getting to know Python better. Do not forget to finish the exercises in chapters 2 and 3 in advance.
  • 56. Next meetings Week 3: Data harvesting and storage Monday, 13–4 A conceptual overview of APIs, scrapers, crawlers, RSS feeds, databases, and different file formats Wednesday, 15–4 Writing some first data collection scripts