Big Data and Automated Content Analysis
Week 3 – Monday
Data harvesting and storage
Anne Kroon & Damian Trilling
a.c.kroon@uva.nl
@annekroon
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
19 February 2018
Today
1 Last week’s exercise
Step by step
Concluding remarks
2 Data harvesting and storage
APIs
RSS feeds
Scraping and crawling
Parsing text files
3 Storing data
CSV tables
JSON and XML
4 Next meetings
Last week’s exercise
Discussing the code
Step by step
Reading a JSON file into a dict, looping over the dict
Task 1: Print all titles of all videos
import json

with open("/home/damian/pornexercise/xhamster.json") as fi:
    data = json.load(fi)

for k, v in data.items():
    print(v["title"])
NB: You have to know (e.g., by reading the documentation of the dataset)
that the key is called title
NB: data is in fact a dict of dicts, such that each value v is another dict.
For each of these dicts, we retrieve the value that corresponds to the key title
What to do if you do not know the structure of the dataset?
Inspecting your data: use the functions type() and len() and/or
the dictionary method .keys()
len(data)
type(data)
data.keys()
len() returns the number of items of an object; type() returns the type of the object; .keys() returns all available keys in the dictionary
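For instance, a quick inspection session might look like this (the output values are illustrative, assuming data is the dict of dicts from the exercise):

type(data)             # <class 'dict'>
len(data)              # e.g., 5000 -- the number of videos
data.keys()            # the video identifiers
type(data["somekey"])  # <class 'dict'> -- each value is itself a dict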
Inspecting your data: use the module pprint
from pprint import pprint

pprint(data)
For the sake of completeness: .items() returns key-value pairs, which is why we assign two variables in the for statement.
These alternatives would also work:
for v in data.values():
    print(v["title"])
for k in data:   # or: for k in data.keys():
    print(data[k]["title"])
Do you see (dis-)advantages?
Working with a subset of the data
What to do if you want to work with a smaller subset of the data?
Taking a random sample of 10 items in a dict:
import random
mydict_short = dict(random.sample(mydict.items(), 10))

(but see the version caveat below)
Taking the first 10 elements in a list:
mylist_short = mylist[:10]
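Version caveat for the dict sampling above: in recent Python versions (3.11 and later), random.sample() requires a sequence and no longer accepts set-like views such as dict.items(), so convert to a list first:

import random

# dict.items() is a view, not a sequence; newer Python needs an explicit list
mydict_short = dict(random.sample(list(mydict.items()), 10))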
Initializing variables, merging two lists, using a counter
Task 2: Average tags per video and most frequently used tags
from collections import Counter

alltags = []
i = 0
for k, v in data.items():
    i += 1
    alltags += v["channels"]   # or: alltags.extend(v["channels"])

print(len(alltags), "tags are describing", i, "different videos")
print("Thus, we have an average of", len(alltags)/i, "tags per video")

c = Counter(alltags)
print(c.most_common(100))
(there are other, more efficient ways of doing this)
Nesting blocks, using a defaultdict to count, error handling
Task 3: What porn category is most frequently commented on?
1 from collections import defaultdict
2
3 commentspercat = defaultdict(int)
4 for k, v in data.items():
5     for tag in v["channels"]:
6         try:
7             commentspercat[tag] += int(v["nb_comments"])
8         except:
9             pass
10 print(commentspercat)
11 # if you want to print in a fancy way, you can do it like this:
12 for tag in sorted(commentspercat, key=commentspercat.get, reverse=True):
13     print(tag, "\t", commentspercat[tag])
A defaultdict is a normal dict, with the difference that the type of each value is pre-defined, and it does not raise an error if you look up a non-existing key.
NB: In line 7, we assume the value to be an int, but the dataset sometimes contains the string "NA" instead of a string representing an int. That's why we need the try/except construction.
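To illustrate that behavior, a minimal sketch:

from collections import defaultdict

counts = defaultdict(int)
counts["some-tag"] += 1      # no KeyError: a missing key starts at int(), i.e. 0
print(counts["some-tag"])    # 1
print(counts["other-tag"])   # 0 (the key is created on first lookup)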
Adding elements to a list, sum() and len()
Task 4: Average length of descriptions
length = []
for k, v in data.items():
    length.append(len(v["description"]))

print("Average length", sum(length)/len(length))
Merging vs appending
Merging:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l1 = l1 + l2
print(l1)
gives [1,2,3,4,5,6]
Appending:
l1 = [1, 2, 3]
l2 = [4, 5, 6]
l1.append(l2)
print(l1)
gives [1,2,3,[4,5,6]]
l2 is seen as one element to append to l1
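If you want the merged result while still modifying the list in place, .extend() does that:

l1 = [1, 2, 3]
l2 = [4, 5, 6]
l1.extend(l2)   # merges element-wise, in place, unlike append()
print(l1)       # [1, 2, 3, 4, 5, 6]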
Tokenizing with .split()
Task 5: Most frequently used words
from collections import Counter

allwords = []
for k, v in data.items():
    allwords += v["description"].split()
c2 = Counter(allwords)
print(c2.most_common(100))
.split() changes a string into a list of words.
"This is cool".split()
results in
["This", "is", "cool"]
Concluding remarks
Make sure you fully understand the code!
Re-read the corresponding chapters
PLAY AROUND!!!
Data harvesting and storage
An overview of APIs, scrapers, crawlers, RSS feeds, and different file formats
Collecting data:
APIs
Querying an API
# contact the Twitter API
auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
twitter = Twitter(auth=auth)

# get all info about the user 'username'
tweepinfo = twitter.users.show(screen_name=username)

# save the user's bio statement to the variable bio
bio = tweepinfo["description"]

# save the user's location to the variable location
location = tweepinfo["location"]
(abbreviated Python example of how to query the Twitter REST API)
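For context, a minimal self-contained version of the sketch above, assuming the third-party twitter package (which provides the Twitter and OAuth classes used here) and placeholder credentials:

from twitter import Twitter, OAuth   # third-party package: pip install twitter

# placeholder credentials -- substitute the keys of your own Twitter app
access_key, access_secret = "ACCESS_KEY", "ACCESS_SECRET"
consumer_key, consumer_secret = "CONSUMER_KEY", "CONSUMER_SECRET"

auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
twitter = Twitter(auth=auth)

tweepinfo = twitter.users.show(screen_name="damian0604")
print(tweepinfo["description"], tweepinfo["location"])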
Who offers APIs?
The usual suspects: Twitter, Facebook, Google – but also Reddit, YouTube, ...
If you ever leave your bag on a bus in Chicago
... but do have Python on your laptop, watch this:
https://www.youtube.com/watch?v=RrPZza_vZ3w
That guy queries the Chicago bus company's API to calculate when exactly the vehicle with his bag will next arrive at the bus stop in front of his office.
(Yes, he tried calling the help desk first, but they didn't know. He got his bag back.)
APIs
Pro
• Structured data
• Easy to process automatically
• Can be directly embedded in your script
Con
• Often limitations (requests per minute, sampling, ...)
• You have to trust the provider to deliver the right content (⇒ Morstatter et al., 2013)
• Some APIs won’t allow you to go back in time!
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. International AAAI Conference on Weblogs and Social Media.
So we have learned that we can access an API directly.
But what if we have to do so 24/7?
Collecting tweets with a tool running on a server that queries the API 24/7.
Data harvesting and storage
It queries the API and stores the result: it continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database.
You tell it once which data to collect – and wait some months.
Retrieving the data for analysis
You could access the MySQL database directly. Generally, raw data needs to be cleaned and parsed before you can start analyzing and visualizing it.
Collecting data:
RSS feeds
RSS feeds
What’s that?
• A structured (XML) format in which, for example, news sites and blogs offer their content
• You get only the new news items (and that’s great!)
• Title, teaser (or full text), date and time, link
http://www.nu.nl/rss
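A minimal sketch of reading such a feed in Python, assuming the third-party feedparser package (field availability depends on the feed):

import feedparser   # third-party package: pip install feedparser

feed = feedparser.parse("http://www.nu.nl/rss")
for entry in feed.entries:
    # title and link are standard; published is there if the feed provides a date
    print(entry.title, entry.link, entry.published)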
Parsing RSS feeds
from lxml.html import fromstring
import logging

logger = logging.getLogger(__name__)

# htmlsource is assumed to contain the downloaded page source
tree = fromstring(htmlsource)

# parsing article teaser
try:
    teaser = tree.xpath('//*[@class="item-excerpt"]//text()')[0]
except:
    logger.debug("Could not parse article teaser.")
    teaser = ""

# parsing article text
try:
    text = " ".join(tree.xpath('//*[@class="block-wrapper"]/div[@class="block-content"]/p//text()')).strip()
except:
    text = ""
    logger.warning("Could not parse article text")
(abbreviated Python example of how to parse the RSS feed of NU.nl)
RSS feeds
Pro
• One protocol for all services
• Easy to use
Con
• Full text often not included; you have to download the linked page separately (⇒ problems associated with scraping)
• You can't go back in time! But we have archived a lot of RSS feeds
Collecting data:
Scraping and crawling
Scraping and crawling
If you have no chance of getting already structured data via one of the approaches above:
• Download web pages, try to identify the structure yourself
• You have to parse the data
• Can get very complicated (depending on the specific task), especially if the structure of the web pages changes
Further reading:
http://scrapy.org
https://github.com/anthonydb/python-get-started/blob/master/5-web-scraping.py
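To make this concrete, a minimal scraping sketch, assuming the requests and lxml packages; the URL and the XPath expression are hypothetical placeholders you would adapt after inspecting the page source:

import requests
from lxml.html import fromstring

response = requests.get("http://www.example.com/some-article")
tree = fromstring(response.text)

# hypothetical XPath -- find the real structure in the page source
paragraphs = tree.xpath('//div[@class="article-body"]/p//text()')
text = " ".join(paragraphs).strip()
print(text)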
Collecting data:
Parsing text files
For messy input data or for semi-structured data
Guiding question: Can we identify some kind of pattern?
Examples
• Lewis, Zamith, & Hermida (2013) had a corrupt CSV-file
• LexisNexis gives you a chunk of text (rather than, e.g., a structured JSON or XML object)
But in both cases, as long as you can find any pattern or structure in it, you can try to write a Python script to parse the data.

Lewis, S. C., Zamith, R., & Hermida, A. (2013). Content analysis in an era of Big Data: A hybrid approach to computational and manual methods. Journal of Broadcasting & Electronic Media, 57(1), 34–52. doi:10.1080/08838151.2012.761702
import re

tekst = {}
section = {}
length = {}
...
...
with open(bestandsnaam) as f:
    for line in f:
        line = line.replace("\r", "")
        if line == "\n":
            continue
        matchObj = re.match(r"\s+(\d+) of (\d+) DOCUMENTS", line)
        if matchObj:
            artikelnr = int(matchObj.group(1))
            tekst[artikelnr] = ""
            continue
        if line.startswith("SECTION"):
            section[artikelnr] = line.replace("SECTION: ", "").rstrip("\n")
        elif line.startswith("LENGTH"):
            length[artikelnr] = line.replace("LENGTH: ", "").rstrip("\n")
        ...
        ...
        ...

        else:
            tekst[artikelnr] = tekst[artikelnr] + line

(abbreviated Python example of how to parse LexisNexis output)
Storing data:
CSV tables
CSV-files
Always a good choice
• All programs can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No limits regarding the size
• But: several dialects (e.g., , vs. ; as delimiter)
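A minimal sketch of writing and reading such files with Python's built-in csv module (the filename and the semicolon delimiter are assumptions to adapt):

import csv

# writing a semicolon-delimited file (one of the common dialects)
with open("tweets.csv", mode="w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["text", "from_user", "created_at"])
    writer.writerow([":-) #Lectrr", "henklbr", "Sun Dec 01 09:57:00 +0000 2013"])

# reading it back, specifying the same delimiter
with open("tweets.csv") as f:
    reader = csv.reader(f, delimiter=";")
    for row in reader:
        print(row)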
A CSV-file with tweets
text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
Storing data:
JSON and XML
JSON and XML
Great if we have a nested data structure
• Items within feeds
• Personal data within authors within books
• Tweets within followers within users
A JSON object containing GoogleBooks data
{'totalItems': 574, 'items': [{'kind': 'books#volume', 'volumeInfo': {'publisher': '"O\'Reilly Media, Inc."', 'description': u'Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz\u2019s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It\u2019s an ideal way to begin, whether you\u2019re new to programming or a professional developer versed in other languages. Complete with quizzes, exercises, and helpful illustrations, this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3 \u2014 the
...
...
'kind': 'books#volumes'}
An XML object containing an RSS feed
...
<item>
  <title>Agema doet aangifte tegen Samsom en Spekman</title>
  <link>http://www.nu.nl/politiek/3743441/agema-doet-aangifte-samsom-en-spekman.html</link>
  <guid>http://www.nu.nl/politiek/3743441/index.html</guid>
  <description>PVV-Kamerlid Fleur Agema gaat vrijdag aangifte doen tegen PvdA-leider Diederik Samsom en PvdA-voorzitter Hans Spekman wegens uitspraken die zij hebben gedaan over Marokkanen.</description>
  <pubDate>Thu, 03 Apr 2014 21:58:48 +0200</pubDate>
  <category>Algemeen</category>
  <enclosure url="http://bin.snmmd.nl/m/m1mxwpka6nn2_sqr256.jpg" type="image/jpeg" />
  <copyrightPhoto>nu.nl</copyrightPhoto>
</item>
...
It’s the same as our “dict of dicts”/“dict of lists”/... data model!
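To see the correspondence, a minimal sketch that stores a nested dict as JSON and reads it back (the structure mimics the exercise data):

import json

# a nested structure, just like our "dict of dicts" data model
data = {"video1": {"title": "Some title", "channels": ["tag1", "tag2"]}}

with open("mydata.json", mode="w") as fo:
    json.dump(data, fo)

with open("mydata.json") as fi:
    data2 = json.load(fi)

print(data2["video1"]["channels"])   # ['tag1', 'tag2']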
Next meetings
Wednesday, 21–3
Writing some first data collection scripts