ENTER 2015 Research Track Slide Number 1
A Method for Analysing Large-Scale UGC Data for Tourism:
Application to the Case of Catalonia
Estela Mariné-Roig and Salvador Anton Clavé
Research Group on Territorial Analysis and Tourism Studies (GRATET)
University Rovira i Virgili, Catalonia, Spain
estela.marine@aegern.udl.cat
salvador.anton@urv.cat
http://www.urv.cat/en_index.html
ENTER 2015 Research Track Slide Number 2
Introduction and aim
• UGC data are a good source of information for DMOs, stakeholders and tourists.
• Travel blogs and online travel reviews (OTRs) convey travellers’ first-hand experiences.
• They have mostly been analysed with content analysis and narrative analysis (Banyai & Glover, 2012) in the areas of service quality, destination image and reputation, UGC, experiences and behaviour, and mobility patterns (Lu & Stepchenkova, 2014).
• Such UGC data have grown exponentially in recent years, and handling them is now considered to require Big Data technologies.
• However, in most studies of UGC data the collection is done “by hand” (Lu & Stepchenkova, 2014) and is usually non-random, which makes it very time-consuming and non-representative.
This article proposes a method for the semi-automatic downloading, arranging, cleaning, debugging, and analysing of large-scale travel blog and OTR data.
ENTER 2015 Research Track Slide Number 3
Web mining background
• Web mining uses data mining techniques to find useful information in, or extract knowledge from, the hyperlink structure and content of webpages (Liu, 2011).
• To automate the extraction process, a Web crawler programme is needed first, capable of traversing the hyperlink structure and downloading the linked webpages.
• There is abundant literature on data mining related to tourism, and some on massive downloads.
ENTER 2015 Research Track Slide Number 4
Methodology
• Abburu and Babu (2013) propose a framework for web data extraction and analysis based on three basic steps: finding the URLs of webpages, extracting information from those webpages, and analysing the data.
• This system architecture is divided into three modules:
  o web crawling
  o information extraction
  o mining
• This research adds cleaning and debugging phases to eliminate the noise present in the webpages, so that the content analysis phase starts from quality information in the original HTML format. The resulting webpages contain only what the user wrote.
The methodology is applied to the case of Catalonia to analyse about 85,000 travel diaries created between 2004 and 2013.
ENTER 2015 Research Track Slide Number 5
Destination selected for the case study (Catalonia)
Attributes:
• Thousand-year history
• Mediterranean destination
• 580 km of coastline
• Own culture and language (Catalan)
• Rich historical and natural heritage
• Third European region by overnight stays
• Foreign tourists in 2013: 15,631,500
• Nine regional tourism brands:

Tourist brand         Abbr.
Barcelona             Barna
Costa Barcelona       cBarc
Costa Brava           cBrav
Costa Daurada         cDaur
Paisatges Barcelona   pBarc
Pirineus              Pyren
Terres de l’Ebre      tEbre
Terres de Lleida      tLlei
Val d’Aran            vAran
(unclassified)        unCla
ENTER 2015 Research Track Slide Number 6
Selection of the most suitable websites hosting UGC data
Weighted formula applied (Marine-Roig, 2014a): TBRH = 1*B(V) + 1*B(P) + 2*B(S)
o Borda count (B): method that ranks options in order of preference
Webometrics:
o Visibility (V):
• Indexed pages in search engines (Google.com, Bing.com)
• Link-based ranks (Google PageRank PR, Yandex topical citation index CY)
o Popularity (P): visit-based ranks (Alexa.com, Compete.com, Quantcast.com)
o Size (S): number of UGC entries related to the case study
Websites hosting UGC data selected:
o 1st: TripAdvisor.com (TA), hosts online travel reviews (OTRs)
o 2nd: VirtualTourist.com (VT), hosts travel blogs, travelogues and OTRs
o 3rd: TravelBlog.org (TB), hosts travel blogs
o 4th: TravelPod.com (TP), hosts travel blogs and a few OTRs
ENTER 2015 Research Track Slide Number 7
Webometrics of the top four websites hosting travel diaries
Group             Indicator       TA           TB        TP        VT
Indexed pages     Google.com      18,600,000   478,000   759,000   1,120,000
                  Bing.com        23,800,000   320,000   448,000   415,000
Link-based rank   Google PR       8            6         6         7
                  Yandex CY       1,600        110       350       375
Visit-based rank  Compete.com     51           38,742    11,824    2,500
                  Quantcast.com   127          36,067    9,279     2,065
                  Alexa.com       182          21,123    21,324    4,156
Size              Entries         72,874       2,988     2,116     7,791
TBRH              Rank            1            3         4         2
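The weighted formula can be illustrated with a minimal Python sketch fed with the webometric figures from this slide. The slides do not spell out how the Borda counts are aggregated, so several details here are assumptions: each indicator awards n-1 points to the best site and 0 to the worst, ties share points, indicator points are summed within each group and the group totals are ranked again, and lower values are better for visit-based ranks. The function names are illustrative.

```python
# Sketch of TBRH = 1*B(V) + 1*B(P) + 2*B(S) using the figures from the slide.
# Assumptions (not confirmed by the source): Borda points run from n-1 (best)
# to 0 (worst), ties share points, and each group is re-ranked after summing.

SITES = ["TA", "TB", "TP", "VT"]

def borda(values, higher_is_better=True):
    """Borda points per site: one point per site beaten, half a point per tie."""
    points = {}
    for site, v in values.items():
        others = [w for s, w in values.items() if s != site]
        beaten = sum((w < v) if higher_is_better else (w > v) for w in others)
        tied = sum(w == v for w in others)
        points[site] = beaten + 0.5 * tied
    return points

def group_borda(indicators, higher_is_better=True):
    """Sum Borda points over a group's indicators, then rank the sums again."""
    totals = dict.fromkeys(SITES, 0.0)
    for row in indicators:
        row_points = borda(dict(zip(SITES, row)), higher_is_better)
        for site, pts in row_points.items():
            totals[site] += pts
    return borda(totals)

visibility = [
    (18_600_000, 478_000, 759_000, 1_120_000),  # Google.com indexed pages
    (23_800_000, 320_000, 448_000, 415_000),    # Bing.com indexed pages
    (8, 6, 6, 7),                               # Google PR
    (1_600, 110, 350, 375),                     # Yandex CY
]
popularity = [
    (51, 38_742, 11_824, 2_500),    # Compete.com rank (lower is better)
    (127, 36_067, 9_279, 2_065),    # Quantcast.com rank
    (182, 21_123, 21_324, 4_156),   # Alexa.com rank
]
size = [(72_874, 2_988, 2_116, 7_791)]  # UGC entries about Catalonia

b_v = group_borda(visibility)
b_p = group_borda(popularity, higher_is_better=False)
b_s = group_borda(size)
tbrh = {s: 1 * b_v[s] + 1 * b_p[s] + 2 * b_s[s] for s in SITES}
print(tbrh)
```

Under these assumptions TA and VT come first and second, as on the slide, while TB and TP tie. The published ranking places TB third, so the original method must differ from this sketch in some detail, for example in how ties are broken.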
ENTER 2015 Research Track Slide Number 8
Gathering process on websites
[Figure: Simplified flow diagram of the downloading process]
Filters:
o Level (0, 1, ... no level limit)
Inclusive / exclusive
o URL
• Protocol (HTTP, FTP, ...)
• Server
• Domain
• Directories (folders)
• Filename
• File type (html, jpg, ...)
o Content. Search
• for all keywords
• for exact word sequence
• inside HTML tags
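The level filter can be pictured as a depth-limited traversal of the hyperlink structure. The following Python sketch is only an illustration: it walks a small in-memory link graph instead of downloading pages, and the page names and the `crawl` function are hypothetical.

```python
from collections import deque

def crawl(start_url, links, max_level=None):
    """Return the pages reachable from start_url within max_level link hops.

    max_level=0 keeps only the initial page, 1 adds the pages it links to,
    and None means no level limit.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, level = queue.popleft()
        visited.append(url)
        if max_level is not None and level >= max_level:
            continue  # do not follow links beyond the level limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return visited

# Hypothetical link graph standing in for real webpages.
links = {
    "index.htm": ["blog1.htm", "blog2.htm"],
    "blog1.htm": ["photo1.htm"],
}
print(crawl("index.htm", links, max_level=1))
```

A breadth-first queue is used so that each page is assigned the smallest level at which it can be reached.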
ENTER 2015 Research Track Slide Number 9
UGC data arrangement
Structure of folders and files:
root\website\brand\destination\date_lang_[isFrom]_pageName_[theme].htm
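As a rough illustration of this naming scheme, the Python sketch below assembles a diary path from its components. It assumes Windows-style folder separators and treats the bracketed parts (isFrom, theme) as optional; the function name and the sample values are hypothetical, and the slides do not define what isFrom and theme contain.

```python
from pathlib import PureWindowsPath

def diary_path(root, website, brand, destination, date, lang, page_name,
               is_from=None, theme=None):
    """Build root/website/brand/destination/date_lang_[isFrom]_pageName_[theme].htm."""
    parts = [date, lang]
    if is_from:                 # optional component, shown in brackets on the slide
        parts.append(is_from)
    parts.append(page_name)
    if theme:                   # optional component, shown in brackets on the slide
        parts.append(theme)
    return PureWindowsPath(root, website, brand, destination,
                           "_".join(parts) + ".htm")

# Hypothetical diary: a TravelBlog entry about Barcelona written in English.
path = diary_path("ugc", "TB", "Barna", "Barcelona", "20130815", "en", "blog-123")
print(path)

# Because date and language lead the file name, files can be sorted
# chronologically and filtered by language with ordinary file utilities.
language = path.name.split("_")[1]
```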
ENTER 2015 Research Track Slide Number 10
UGC data cleaning
Aims:
The cleaning and debugging phases are essential to obtain quality information, limited to the web content as written and posted by the diary author, and to overcome the most significant errors.
Example page size: before cleaning 52 KB, after cleaning 2 KB (both without pictures).
Sample of removed HTML elements:
• <meta ... />
• <form ... </form>
• <iframe ... </iframe>
• <div id="header">... </div>
• <!-- [comment] -->
• <div id="comment">... </div>
• <div id="footer">... </div>
• <script type ... </script>
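A cleaning pass of this kind could be sketched with Python's standard `html.parser` module. This is not the authors' tool, only an assumed minimal implementation that drops the sample elements listed above and leaves the remaining HTML, including entities, as it was. The class and function names are illustrative.

```python
from html.parser import HTMLParser

DROP_TAGS = {"script", "form", "iframe"}        # removed together with their content
DROP_DIV_IDS = {"header", "comment", "footer"}  # <div id="..."> blocks to remove

class DiaryCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep HTML entities untouched
        self.out = []
        self.skip_tag = None  # tag name that opened the region being dropped
        self.depth = 0        # nesting depth of that tag inside the region

    def handle_starttag(self, tag, attrs):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth += 1
        elif tag in DROP_TAGS or (tag == "div" and dict(attrs).get("id") in DROP_DIV_IDS):
            self.skip_tag, self.depth = tag, 1
        elif tag != "meta":
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_tag and tag != "meta" and tag not in DROP_TAGS:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_tag:
            if tag == self.skip_tag:
                self.depth -= 1
                if self.depth == 0:
                    self.skip_tag = None
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_tag:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_tag:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_tag:
            self.out.append(f"&#{name};")

    # handle_comment is not overridden, so <!-- comments --> are dropped.

def clean(html_text):
    parser = DiaryCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Counting the nesting depth of the tag that opened a dropped region keeps nested `<div>` elements inside a header or footer from ending the region too early.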
ENTER 2015 Research Track Slide Number 11
UGC data debugging (encoding and common mistakes)
Encoding:
Gaudí: UTF-8 (GaudÃ--), HTML number (Gaud&#237;), HTML name (Gaud&iacute;)

Mistakes:
Correct noun   Misspellings
Barcelona      Bathelona, Barcellona, Barthelonaaaa, Bar-tha-lona, Bar-the-lona ...
Gaudí          Gaudi, Gaudì, Gaüdi, Gaudie, Gaudii, Goudi, Goudí, Guadi, Gualdi ...
Parc Güell     Parc Guël, Güel, Guéll, Guelle; Park Gueil, Guel, Güelle, Guelli; Güelle ...

ISO 8859-15 (ASCII Latin-1 extended characters):
    _0  _1  _2  _3  _4  _5  _6  _7  _8  _9  _A  _B  _C  _D  _E  _F
C_  À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï
    192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
D_  Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß
    208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
E_  à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï
    224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239
F_  ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ
    240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

HTML entities:
Char  Number  Name
À     &#192;  &Agrave;
Á     &#193;  &Aacute;
Â     &#194;  &Acirc;
Ã     &#195;  &Atilde;
Ä     &#196;  &Auml;
Å     &#197;  &Aring;

UTF-8 encoding:
Char  HEX    Symb
À     c3 80  Ã€
Á     c3 81  Ã?
Â     c3 82  Ã‚
Ã     c3 83  Ãƒ
Ä     c3 84  Ã„
Å     c3 85  Ã…
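The debugging step can be sketched in Python as three small repairs: decoding HTML entities, undoing UTF-8 text that was displayed as single-byte characters, and mapping known misspellings to the correct noun. This is an assumed minimal approach, not the authors' programme. The variant lists are copied from the slide, while the function names and the choice of Windows-1252 and Latin-1 as the misread encodings are assumptions.

```python
import html
import re

def fix_mojibake(text):
    """Undo UTF-8 bytes that were displayed as Windows-1252 or Latin-1 characters."""
    for encoding in ("cp1252", "latin-1"):
        try:
            return text.encode(encoding).decode("utf-8")
        except UnicodeError:
            continue  # not this kind of damage, try the next encoding
    return text       # text was already fine (or damaged in another way)

# Misspellings taken from the slide; the lookup table itself is illustrative.
VARIANTS = {
    "Gaudí": ["Gaudi", "Gaudì", "Gaüdi", "Gaudie", "Gaudii",
              "Goudi", "Goudí", "Guadi", "Gualdi"],
    "Barcelona": ["Bathelona", "Barcellona", "Bar-tha-lona", "Bar-the-lona"],
}
PATTERNS = [
    (re.compile(r"\b(?:%s)\b" % "|".join(
        re.escape(v) for v in sorted(variants, key=len, reverse=True))), correct)
    for correct, variants in VARIANTS.items()
]

def normalise(text):
    text = html.unescape(text)   # Gaud&#237; and Gaud&iacute; become Gaudí
    text = fix_mojibake(text)    # GaudÃ + soft hyphen becomes Gaudí
    for pattern, correct in PATTERNS:
        text = pattern.sub(correct, text)
    return text

print(normalise("Goudi designed Park Guell in Barcellona"))
```

The mojibake repair works on the whole string at once, so a text that mixes correctly encoded and damaged characters is returned unchanged; a production version would need to handle such cases piece by piece.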
ENTER 2015 Research Track Slide Number 12
Results: Trends
        2004   2005   2006   2007   2008   2009   2010   2011    2012    2013
TA        40     38     81    117    204    608  1,421  5,933  28,387  36,045
TB        22    139    254    427    662    415    328    362     231     148
TP        29    100    236    276    258    226    238    218     189     346
VT     1,474  1,498  1,023  1,031    762    413    398    635     306     251
Barna  1,177  1,374  1,191  1,309  1,367  1,295  1,742  5,828  24,211  30,875
cBarc     34     42     53     70     79     34     63    115     325     560
cBrav    201    204    163    238    191    134    177    332   1,448   1,707
cDaur     61     46     82    117    134    121    288    698   2,599   2,498
pBarc     57     45     38     37     45     20     35     89     412     927
Pyren     10     20     12     25      8     10     22     14      62     149
tLlei      6      1      1      3      5      5     11     19      16      16
tEbre      4      3      0      1      1      5      1      2       3      10
vAran      1      0      7      0      0      3      2      3       9      11
unCla     14     40     47     51     56     35     44     48      28      37
Trends in web hosting and Catalan brands
Monthly distribution of travel blogs and OTRs (TA, TB, TP, & VT)
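The yearly figures for the four websites can be cross-checked with a few lines of Python. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
YEARS = list(range(2004, 2014))
ENTRIES = {
    "TA": [40, 38, 81, 117, 204, 608, 1421, 5933, 28387, 36045],
    "TB": [22, 139, 254, 427, 662, 415, 328, 362, 231, 148],
    "TP": [29, 100, 236, 276, 258, 226, 238, 218, 189, 346],
    "VT": [1474, 1498, 1023, 1031, 762, 413, 398, 635, 306, 251],
}

# Total entries per website over the whole period.
site_totals = {site: sum(counts) for site, counts in ENTRIES.items()}

# Total entries per year across the four websites, and TripAdvisor's share.
year_totals = [sum(column) for column in zip(*ENTRIES.values())]
ta_share = [ta / total for ta, total in zip(ENTRIES["TA"], year_totals)]

for year, total, share in zip(YEARS, year_totals, ta_share):
    print(f"{year}: {total:>6} entries, TA share {share:.1%}")
```

The per-site totals equal the Size row of the webometrics table (72,874, 2,988, 2,116 and 7,791), and the overall total of 85,769 is consistent with the "about 85,000" diaries mentioned in the methodology.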
ENTER 2015 Research Track Slide Number 13
Results: Top keywords
Rank  Keyword          Count    Site-wide density  Average weight  Remark
1     barcelona        197,723  3.77 %             56.26           Capital of Catalonia
2     great            51,525   0.98 %             23.73           Good feeling
3     tour             49,221   0.94 %             18.08
4     sagrada familia  38,341   0.73 %             60.75           Gaudi’s masterpiece
5     gaudi            33,187   0.63 %             19.66           Architect A. Gaudi
6     city             28,155   0.54 %             11.70
7     place            26,597   0.51 %             15.73
8     good             26,098   0.50 %             15.02           Good feeling
9     visit            25,973   0.49 %             14.86
10    amazing          25,242   0.48 %             24.18           Good feeling
11    park             24,962   0.47 %             28.38
12    basilica         23,618   0.45 %             81.68           Religious building
13    park guell       23,367   0.44 %             62.06           Gaudi’s work
14    beautiful        23,322   0.44 %             23.01           Good feeling
15    way              22,996   0.44 %             15.02
• Site Content Analyzer (SCA) was applied to the dataset.
ENTER 2015 Research Track Slide Number 14
Top keywords: Barcelona, Gaudi and two of Gaudi’s works
Barcelona: Guell Park / Mosaic Dragon
Basilica of La Sagrada Familia (Passion façade) / Antoni Gaudi
ENTER 2015 Research Track Slide Number 15
Conclusions
• The proposed methodology facilitates the massive gathering of UGC data from the most suitable sources for a specific case study.
• The hierarchical territorial structure of folders and the articulation of the individual diaries’ file names enable multiple classifications using utilities to order and manipulate the files.
• This structure also makes it possible to focus the analysis on a specific place, language or subject.
• The cleaning and debugging phases are essential to obtain quality information, limited to what has been written by the diary author.
• The HTML dataset is prepared for any offline content analysis in future work, and most phases of this method are useful for the content analysis of other web data sources.
ENTER 2015 Research Track Slide Number 16
Thank you for your attention!
estela.marine@aegern.udl.cat
salvador.anton@urv.cat

Editor's Notes

  • #6: Catalonia is not an Anglophone region, so problems related to character encoding beyond ASCII 127 should be considered, specifically those related to the accent marks in the names of destinations and tourist attractions.
  • #7, #8: The first thing after selecting the methodology is to ... The problem is that blogs come from diverse sources and websites do not have homogeneous structures, which makes it impossible to automate the process of downloading, classification and refinement, as intended in this study. Criterion for selecting websites: at least 100 entries about Catalonia.
  • #9: HTTrack Website Copier (www.httrack.com). The first step in downloading data is to navigate the selected websites manually to identify the initial pages, that is, those containing hyperlinks that lead to the individual blog and OTR pages, and to save their complete URLs.
    1. A Level 0 filter downloads only the page indicated by the initial URL; a Level 1 filter downloads that page and all the resources directly linked to it; and so on.
    2. The file type filter makes it possible to download, for example, only HTML files; the remaining files (multimedia, PDF, etc.) will only be displayed if an Internet connection is available. This system is ideal for analysing the textual content of diaries while saving space on the local disk.
    3. The URL filters can act on any part of the URL (protocol, server, domain, subdirectories or folders, filename and file type).
    4. The content filter is the least efficient, because the page has to be downloaded to assess whether or not it contains the key character string, whereas with URL filters only the pages of interest are downloaded (Figure 2).
    For example, in the case of TB, it is sufficient to place an inclusive folder filter, /Catalonia/, with no level limit, because the server stores the files in a hierarchical territorial structure of folders. Conversely, in the case of TA, all the files of interest contain the word Catalonia; those with hyperlinks that lead to OTRs start with Attraction, and the OTR files themselves start with ShowUserReview. A couple of inclusive filename filters are therefore enough: Attraction*Catalonia and ShowUserReview*Catalonia. To understand the importance of the filters in this case, bear in mind that TA had reached more than 170,000,000 reviews and opinions, and that all its webpages are linked at different levels by hyperlinks.
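The inclusive folder and filename filters described for TB and TA can be emulated in Python with shell-style wildcards. This sketch does not use HTTrack's own filter syntax; the `keep_url` function, the trailing wildcard added to each pattern, and the sample URLs are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlparse

def keep_url(url, folder_filters=(), filename_filters=()):
    """Inclusive filter: keep a URL if any folder or filename pattern matches."""
    path = urlparse(url).path
    if any(folder in path for folder in folder_filters):
        return True
    return any(fnmatchcase(basename(path), pattern) for pattern in filename_filters)

# TB: one inclusive folder filter. TA: two inclusive filename filters.
TB_FOLDERS = ["/Catalonia/"]
TA_FILENAMES = ["Attraction*Catalonia*", "ShowUserReview*Catalonia*"]

# Hypothetical URLs, used only to exercise the filters.
tb_url = "http://www.travelblog.org/Europe/Spain/Catalonia/Barcelona/blog-1.html"
ta_url = "http://www.tripadvisor.com/ShowUserReviews-g187497-r1-Barcelona_Catalonia.html"
other_url = "http://www.tripadvisor.com/Hotel_Review-g1-Paris.html"

print(keep_url(tb_url, folder_filters=TB_FOLDERS))
print(keep_url(ta_url, filename_filters=TA_FILENAMES))
print(keep_url(other_url, filename_filters=TA_FILENAMES))
```

Because these filters inspect only the URL, a page can be accepted or rejected before it is downloaded, which is the efficiency advantage over content filters noted above.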
  • #10: Geography (CSV file):
    TAcode;VTcode;Destination;Brand
    g187496;c1;Catalonia;unCla
    g187497;430de;Barcelona;Barna
    g494960;402fe;Lloret-de-Mar;cBrav
    Language detection: plain text; HTML As Text.
    Once the CSV files are ready, a batch programme (Marine-Roig, 2013: Annex A3) is run for each website. It goes through all the files, extracts internal data such as the date of the diary and the name of the destination, eliminates entries without narrative content (more than 70,000 OTRs in the case of TA), changes the date format to yyyymmdd, creates new territorial directories, and moves each diary to its destination folder under its articulated name to facilitate future classifications. Finally, the two-character ISO 639-1 language codes are inserted into the file names, after the date (Figure 3).
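Two steps of this arrangement, looking up a destination's brand in the geography file and rewriting a diary date as yyyymmdd, can be sketched in Python. This is not the batch programme cited in the note. The semicolon-separated rows come from the note, while the function names and the day/month/year input format are assumptions.

```python
import csv
import io
from datetime import datetime

GEOGRAPHY_CSV = """TAcode;VTcode;Destination;Brand
g187496;c1;Catalonia;unCla
g187497;430de;Barcelona;Barna
g494960;402fe;Lloret-de-Mar;cBrav
"""

def load_geography(text, key="TAcode"):
    """Map a website's location code to its (brand, destination) pair."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row[key]: (row["Brand"], row["Destination"]) for row in reader}

def to_yyyymmdd(date_text, fmt="%d/%m/%Y"):
    """Rewrite a diary date as yyyymmdd; the input format is an assumption."""
    return datetime.strptime(date_text, fmt).strftime("%Y%m%d")

geography = load_geography(GEOGRAPHY_CSV)
brand, destination = geography["g187497"]
print(brand, destination, to_yyyymmdd("15/08/2013"))
```

Passing `key="VTcode"` builds the equivalent lookup for VirtualTourist codes from the same file.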
  • #11: Eliminating noise. The original HTML format should be preserved in order to weight keywords and key phrases according to their potential impact. Site Content Analyzer: frequency, site-wide density, weight.
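Frequency and site-wide density can be approximated with a short Python sketch. It is not Site Content Analyzer's algorithm: it assumes density is a keyword's count divided by the total number of words, counts phrases of up to two words, and leaves out the HTML-based weight, whose formula is not given here. The function name and sample sentences are hypothetical.

```python
import re
from collections import Counter

def keyword_stats(texts, max_words=2):
    """Return {keyword: (count, density %)} for words and short phrases.

    Density is the count relative to the total number of words in all texts.
    """
    counts = Counter()
    total_words = 0
    for text in texts:
        words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
        total_words += len(words)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return {k: (c, 100 * c / total_words) for k, c in counts.items()}

# Hypothetical diary snippets.
stats = keyword_stats([
    "Sagrada Familia is amazing",
    "Amazing tour of the Sagrada Familia",
])
print(stats["sagrada familia"], stats["amazing"])
```

If density is indeed count divided by total words, the figures on the keyword slide imply a corpus of roughly 5.2 million counted words (197,723 / 0.0377).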