Design and Multilingual Users
on Twitter and Wikipedia
Scott A. Hale
scott.hale@oii.ox.ac.uk
http://www.scotthale.net/
Oxford Internet Institute
University of Oxford
17 June 2014
Scott A. Hale Design and Multilingual Users
Importance of design
Content is diverse across languages
“multilingualism...[is] the norm for most of the world’s societies” (Birner,
2005), with over half of Europe and over a fifth of the US multilingual
(Erard, 2012); yet, many platforms are designed only with monolingual users
in mind.
In a survey in Uzbekistan, Internet users reported accessing content in
foreign languages even while simultaneously reporting poor foreign
language skills (Wei & Kolko, 2005)
Users often contribute local content/knowledge (Hecht & Gergle,
2010a)
Large diversity in information between languages (Hecht & Gergle,
2010b)
Can lead to self-focus bias (Hecht & Gergle, 2009)
Motivations
Language clustering vs. small-worlds
Users thought to cluster by language in most online platforms (Barnett
& Choi, 1995; Hale, 2012a, 2012b; Herring et al., 2007; Nordenstreng
& Varis, 1974; Takhteyev et al., 2011; Wilkinson & Thelwall, 2012)
Many online platforms thought to exhibit the ‘small-world’ phenomenon
of small path lengths between users (despite high clustering)
Role of multilingual users
⇒ If users cluster by language and platforms are small-worlds, there must
be brokers bridging different language groups (spanning structural
holes)
Multilingual users are possible bridge users. Only one study has
investigated this: an ego-network-level study of the Twitter
following–follower network structure (Eleta & Golbeck, 2012).
No multiplatform study and no large-scale study exists
Outline
What are the roles of multilinguals and platform design in shaping the
spread of information in social media?
Twitter and Wikipedia at a global level
1 Language will have a strong role in structuring the platform
2 Users engaging with content in multiple languages (multilingual users)
serve as bridges between different clusters/editions
3 Users primarily writing in less-represented languages will be more likely
to cross language boundaries than users writing in highly-represented
languages
4 When users cross languages they will cross to larger languages (e.g.
English) and thus at a language level English will form more bridges
than other languages
Data
Twitter
Twitter mention and retweet network
18 days of the ‘spritzer’ 1% sample stream from June 2011
7,341,271 nodes and 8,545,693 directed, weighted edges
Wikipedia
Edits from the top 46 language editions
8 July to 9 August 2013
3.5 million non-minor edits by 55,568 registered users
Global Connectivity and Multilinguals in the Twitter Network (2014).
http://www.scotthale.net/pubs/?chi2014
Multilinguals and Wikipedia Editing (2014).
http://www.scotthale.net/pubs/?websci2014
Twitter: Data cleaning
Language classification
Clean text of tweets for language detection (remove URLs, usernames, emoticons)
Use Chromium Compact Language Detection kit for language detection (Graham et al., 2013)
Remove users with fewer than 2 tweets or 20% of the user’s tweets in one language
Remove users with fewer than four tweets total
Bots and spam users
Remove users with no mentions (indegree=0)
Select only the largest weakly-connected component (88% of nodes)
End result
916,836 nodes (users) and 2,652,618 directed edges (mentions/retweets)
Each user is assigned a most-used language and the fraction [0-1] of their tweets written in that language
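The cleaning and per-user summary steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the regular expressions, and the very small emoticon pattern are all hypothetical, and the language detector itself is left out (the input to `summarize_user` is a list of already-detected language codes, one per tweet). The per-language threshold (2 tweets or 20%) is not implemented here.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Hypothetical, deliberately small emoticon pattern; real data needs a fuller list.
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def clean_tweet(text):
    """Strip URLs, @usernames, and simple emoticons before language detection."""
    for pattern in (URL_RE, MENTION_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


def summarize_user(languages, min_tweets=4):
    """Return (most_used_language, share of tweets in it), or None if too few tweets.

    `languages` holds one detected language code per tweet for a single user.
    """
    if len(languages) < min_tweets:
        return None
    counts = Counter(languages)
    top_language, top_count = counts.most_common(1)[0]
    return top_language, top_count / len(languages)


print(clean_tweet("@bob check http://t.co/abc :) hola amigo"))
print(summarize_user(["en", "en", "ja", "en"]))
```

A user whose share is below 1.0 used more than one language in the sample, which is how a monolingual/multilingual split could be derived from this summary.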
Wikipedia: Data cleaning
Non-minor edits by registered, human users to articles
Only edits to main (article) namespace
Removed articles flagged as being created by ‘bots’
Removed anonymous users
Removed undeclared bots and users with only one edit session in the
month
Require at least four edits and at least 2 edits to one edition
Matching users and articles across languages
Look for common usernames across language editions
Check usernames are indeed linked global accounts
WikiData dump to match articles across languages
55,568 users (excluding Simple English edition) with a total of 3,518,955
edits.
User counts
Twitter
Language                  User count
English (en)              375,474
Japanese (ja)             137,263
Portuguese (pt)           133,501
Malay/Indonesian (ms)     106,223
Spanish (es)               70,246
Dutch (nl)                 31,035
Korean (ko)                16,123
Thai (th)                   8,629
Arabic (ar)                 7,679
French (fr)                 5,769
Filipino/Tagalog (fil)      5,393
Wikipedia
Language      User count
English       22,412
German         4,920
French         3,430
Russian        3,330
Spanish        3,299
Japanese       3,164
Italian        2,202
Chinese        1,975
Portuguese     1,220
Polish         1,011
Dutch          1,007
Twitter: Multilinguals vs Monolinguals
On Twitter, 11% of users (~103,000) were observed to use more than one
language and designated as multilingual users.
Multilingual vs. monolingual users: Comparison of tweet count, out-degree, and
in-degree.
Wikipedia: Multilinguals vs Monolinguals
On Wikipedia, 15.4% of users
(8,544) edited more than one
language edition and were
designated as multilingual users.
Density plot compares the
number of edits made by
monolingual and multilingual
Wikipedia users. Size of edits
does not differ significantly.
Only 2.6% of edits are from users
writing in their non-primary
languages on Wikipedia.
Twitter: Language and structure
Label propagation algorithm (Raghavan, Albert, & Kumara, 2007) found
20,253 communities.
Histograms of the size of communities (left) and the number of languages within
each community (right). Modularity score of 0.81 for this community structure.
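To make the community-detection step concrete, here is a small pure-Python sketch of asynchronous label propagation in the spirit of Raghavan et al. (2007), together with a modularity score. It is an illustration only: it treats the network as undirected and unweighted (the Twitter network in the talk is directed and weighted), it assumes a simple edge list without duplicates, and the function names are made up for this example.

```python
import random
from collections import Counter, defaultdict


def label_propagation(edges, seed=0, max_sweeps=100):
    """Asynchronous label propagation on an undirected view of `edges`."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:
            neighbors[u].add(v)
            neighbors[v].add(u)
    labels = {node: node for node in neighbors}  # every node starts alone
    nodes = sorted(neighbors)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Keep the current label on ties; otherwise adopt a most common one.
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for node, label in labels.items():
        communities[label].add(node)
    return list(communities.values())


def modularity(edges, communities):
    """Newman modularity for an undirected, unweighted simple graph."""
    member = {node: i for i, group in enumerate(communities) for node in group}
    m = len(edges)
    internal, degree = Counter(), Counter()
    for u, v in edges:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            internal[member[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


triangles = [("a", "b"), ("b", "c"), ("a", "c"),
             ("x", "y"), ("y", "z"), ("x", "z")]
found = label_propagation(triangles)
print(found, modularity(triangles, found))
```

Label propagation is randomized, so on real networks different seeds can give different partitions; the two disjoint triangles here are chosen because the outcome does not depend on the seed.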
Twitter: Language and structure
Scatter plot of community size and
the percentage of users in the
community most often using the most
prevalent language.
Language and structure
Most-used language   % users in most-used language   Number of languages   Number of nodes
Malay (ms)           78.3                            41                    123,616
English (en)         99.3                            39                    114,826
Portuguese (pt)      94.3                            40                    101,987
Japanese (ja)        99.6                            19                     83,785
English (en)         75.7                            44                     80,387
English (en)         55.1                            42                     37,688
Dutch (nl)           90.6                            23                     20,634
Table: Clusters with over 10,000 nodes found through the label propagation
algorithm. Collectively 61% of all users are in one of these clusters.
Twitter: Do multilinguals bridge clusters?
Size of the largest, weakly-connected component (left), total number of components
(center), and average size of the components (right) created by removing all
multilingual users, an equivalent number of monolingual users randomly, an
equivalent number of all users randomly, and removing all multilingual users from a
network with the same degree distribution but with edges randomly shuffled. Box
plots show values from 100 realizations. Mean values are indicated with +.
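The removal experiment described in this caption can be sketched as follows: delete a set of users, then measure the weakly connected components that remain, and compare against deleting an equally sized random set. The code is a hedged illustration with hypothetical function names; it covers the removal and random-removal conditions only, not the degree-preserving edge shuffle.

```python
import random


def weak_components(edges, removed=frozenset()):
    """Weakly connected components after dropping `removed` nodes (direction ignored)."""
    nodes = {n for edge in edges for n in edge} - set(removed)
    neighbors = {n: set() for n in nodes}
    for source, target in edges:
        if source in nodes and target in nodes:
            neighbors[source].add(target)
            neighbors[target].add(source)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return components


def removal_summary(edges, removed):
    """Largest component size, component count, and mean component size."""
    sizes = [len(c) for c in weak_components(edges, removed)]
    return {
        "largest": max(sizes, default=0),
        "count": len(sizes),
        "mean_size": sum(sizes) / len(sizes) if sizes else 0.0,
    }


def random_baseline(edges, candidates, k, seed=0):
    """Remove k randomly chosen candidate nodes as a comparison condition."""
    rng = random.Random(seed)
    return removal_summary(edges, set(rng.sample(sorted(candidates), k)))


# Toy network: "m" stands in for a multilingual user linking two groups.
toy = [("a", "b"), ("b", "m"), ("m", "x"), ("x", "y")]
print(removal_summary(toy, {"m"}))
print(random_baseline(toy, ["a", "b", "x", "y"], k=1))
```

Repeating `random_baseline` with different seeds gives a distribution of values, comparable in spirit to the 100 realizations shown in the box plots.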
Wikipedia: Do multilinguals bridge editions?
Do multilinguals edit similar articles across languages?
Many users edited none of the same articles in their primary languages as in
their non-primary languages, but many others always edited the same articles in
their primary languages.
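One plausible way to quantify this overlap, sketched below, is the share of concepts a user edited in a non-primary edition that the same user also edited in the primary edition, with pages matched across editions through Wikidata item IDs. The exact measure used in the study is not stated on the slide, so this definition, the function names, and the item IDs (`Q_A`, `Q_B`, `Q_C`) are all assumptions for illustration.

```python
def to_items(edits, sitelinks):
    """Map (edition, title) edits to Wikidata item IDs, skipping unmatched pages."""
    return {sitelinks[edit] for edit in edits if edit in sitelinks}


def overlap_share(primary_items, other_items):
    """Share of items edited in a non-primary edition also edited in the primary one."""
    other = set(other_items)
    if not other:
        return None  # the user made no matched non-primary edits
    return len(other & set(primary_items)) / len(other)


# Made-up sitelink table: (edition, page title) -> item ID.
sitelinks = {
    ("ja", "東京都"): "Q_A",
    ("en", "Tokyo"): "Q_A",
    ("en", "Oxford"): "Q_B",
    ("ja", "京都市"): "Q_C",
}
primary = to_items([("ja", "東京都"), ("ja", "京都市")], sitelinks)
other = to_items([("en", "Tokyo"), ("en", "Oxford")], sitelinks)
print(overlap_share(primary, other))
```

Under this definition a value of 0 means the user edited entirely different articles across editions, and a value of 1 means every non-primary edit touched an article the user also edited in the primary edition.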
Variations by language
Twitter Wikipedia
Number of users in each language compared to the percentage of these users
classified as multilingual.
Twitter: Cross-language connections
[Figure: network graph of languages. Nodes: ar, de, en, es, fil, fr, gl, it, ja, ko, ms, nl, pt, th]
Mentions and retweets across
languages
Nodes represent most-used
language
Directed, weighted edges show
the log of the number of users
primarily using one language who
mention / retweet users in
another language
Only edges with weights over
1.96 standard deviations above
the mean are shown
Colors indicate communities
found by the infomap community
detection algorithm
N.B. This differs from the published paper where edges were normalized by the expected number of connections between language
pairs if tweets were directed at users randomly without regard to language.
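The edge filter described above (log-scaled weights, keeping only edges more than 1.96 standard deviations above the mean) can be sketched like this. The choice of natural logarithm and of the population standard deviation are assumptions, since the slide does not specify either, and the function name and counts are invented for the example.

```python
import math
import statistics


def significant_edges(user_counts, z=1.96):
    """Keep language pairs whose log user count exceeds mean + z * std.

    `user_counts` maps (source_language, target_language) to the number of
    users primarily using the source language who mention or retweet users
    of the target language.
    """
    weights = {pair: math.log(n) for pair, n in user_counts.items() if n > 0}
    values = list(weights.values())
    if not values:
        return {}
    # Assumption: population standard deviation over all observed pairs.
    threshold = statistics.mean(values) + z * statistics.pstdev(values)
    return {pair: w for pair, w in weights.items() if w > threshold}


# Invented counts: nine weak pairs and one strong pair.
counts = {("l%d" % i, "en"): 1 for i in range(9)}
counts[("ja", "en")] = 1000
print(significant_edges(counts))
```

Because the threshold is relative to all observed pairs, a filter of this kind keeps only a small set of unusually strong connections, which keeps the plotted graph readable.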
Wikipedia: Language crossings
[Figure: network graph of language editions. Nodes: ar, bg, ca, cs, da, de, en, es, fa, fi, fr, he, hu, id, it, ja, ko, nl, no, pl, pt, ro, ru, sv, tr, uk, zh]
Co-editing network graph
Nodes represent language
editions
Directed, weighted edges show
the log of the number of users
primarily editing one language
edition who edited another
edition
Only edges with weights over
1.96 standard deviations above
the mean are shown
Colors indicate communities
found by the infomap community
detection algorithm
Wikipedia: Language crossings (English removed)
[Figure: network graph of language editions with English removed. Nodes: ca, cs, de, es, fr, it, ja, nl, pl, pt, ru, sv, uk, zh]
Co-editing network graph
Nodes represent language
editions
Directed, weighted edges show
the log of the number of users
primarily editing one language
edition who edited another
edition
Only edges with weights over
1.96 standard deviations above
the mean are shown
Colors indicate communities
found by the infomap community
detection algorithm
Summary and Implications
Multilingualism correlated with
activity on both platforms
Design for multilingual users
Allow users to have multiple
preferred languages when
personalizing search results,
friend recommendations, etc.
Structured by language
Language has a strong role
structuring both platforms
Multilingual users in position to
bridge clusters/editions, but
mixed evidence on actual role
Multilingual user percentage ∝
1/self-focus bias
Important per language variations
Users in less-represented languages
more likely to cross language
boundaries on Wikipedia, but no
correlation on Twitter.
Platform differences?
Consistent findings of English
and Japanese as outliers
Larger languages form bridges
Especially English, but other
geolinguistic patterns are also
evident
Global connectivity results
through the combination of
multilinguals across many
language pairs
I would like to thank Eric T. Meyer, Taha Yasseri, Jonathan Bright, and Mike Thelwall
who provided helpful comments on various aspects of this research.
Barnett, G. A., & Choi, Y. (1995). Physical Distance and Language as
Determinants of the International Telecommunications Network.
International Political Science Review, 16(3), 249–265. Available from
http://ips.sagepub.com/content/16/3/249.abstract
Birner, B. (2005). Bilingualism (Tech. Rep.). Washington, DC, USA:
Linguistic Society of America. Available from
http://www.linguisticsociety.org/files/Bilingual.pdf
Eleta, I., & Golbeck, J. (2012). Bridging Languages in Social Networks:
How Multilingual Users of Twitter Connect Language Communities.
Proceedings of the American Society for Information Science and
Technology, 49(1), 1–4. Available from
http://dx.doi.org/10.1002/meet.14504901327
Erard, M. (2012, January). Are we Really Monolingual? Available from
http://www.nytimes.com/2012/01/15/opinion/sunday/
are-we-really-monolingual.html
Graham, M., Hale, S. A., & Gaffney, D. (2013). Where in the world are
you? Geolocation and language identification in Twitter. Professional
Geographer.
Hale, S. A. (2012a). Impact of platform design on cross-language
information exchange. In Proceedings of the 2012 ACM annual
conference on human factors in computing systems extended abstracts
(pp. 1363–1368). New York, NY, USA: ACM. Available from
http://doi.acm.org/10.1145/2212776.2212456
Hale, S. A. (2012b). Net Increase? Cross-Lingual Linking in the
Blogosphere. Journal of Computer-Mediated Communication, 17(2),
135–151. Available from http://onlinelibrary.wiley.com/doi/
10.1111/j.1083-6101.2011.01568.x/full
Hale, S. A. (2014a). Global Connectivity and Multilinguals in the Twitter
Network. In Proceedings of the SIGCHI conference on human factors in
computing systems (pp. 833–842). New York, NY, USA: ACM.
Available from http://doi.acm.org/10.1145/2556288.2557203
Hale, S. A. (2014b). Multilinguals and Wikipedia Editing. In Proceedings of
the 6th annual ACM web science conference. New York, NY, USA:
ACM. Available from http://arxiv.org/abs/1312.0976
Hecht, B., & Gergle, D. (2009). Measuring self-focus bias in
community-maintained knowledge repositories. In Proceedings of the
fourth international conference on communities and technologies (pp.
11–20). New York, NY, USA: ACM. Available from
http://doi.acm.org/10.1145/1556460.1556463
Hecht, B., & Gergle, D. (2010a). On the “localness” of user-generated
content. In Proceedings of the 2010 ACM conference on computer
supported cooperative work (pp. 229–232). New York, NY, USA:
ACM. Available from
http://doi.acm.org/10.1145/1718918.1718962
Hecht, B., & Gergle, D. (2010b). The Tower of Babel meets Web 2.0:
User-generated content and its applications in a multilingual context.
In Proceedings of the 28th international conference on human factors
in computing systems (pp. 291–300). New York, NY, USA: ACM.
Available from http://doi.acm.org/10.1145/1753326.1753370
Herring, S. C., Paolillo, J. C., Ramos-Vielba, I., Kouper, I., Wright, E.,
Stoerger, S., et al. (2007). Language Networks on LiveJournal. In
Proceedings of the 40th annual Hawaii international conference on
system sciences. Washington, DC, USA: IEEE Computer Society.
Available from http://dx.doi.org/10.1109/HICSS.2007.320
Nordenstreng, K., & Varis, T. (1974). Television traffic: A one-way street?
A survey and analysis of the international flow of television programme
material. Reports and Papers on Mass Communication (70).
Raghavan, U. N., Albert, R., & Kumara, S. (2007, September). Near linear
time algorithm to detect community structures in large-scale networks.
Phys. Rev. E, 76(3), 036106. Available from
http://link.aps.org/doi/10.1103/PhysRevE.76.036106
Takhteyev, Y., Gruzd, A., & Wellman, B. (2011). Geography of Twitter
networks. Social Networks, 1–26. Available from
http://www.sciencedirect.com/science/article/pii/
S0378873311000359#FCANote
Wei, C. Y., & Kolko, B. E. (2005). Resistance to globalization: Language
and Internet diffusion patterns in Uzbekistan. New Review of
Hypermedia and Multimedia, 11(2), 205–220.
Wilkinson, D., & Thelwall, M. (2012). Trending Twitter topics in English:
An international comparison. Journal of the American Society for
Information Science and Technology, 63(8), 1631–1646. Available
from http://dx.doi.org/10.1002/asi.22713
Zuckerman, E. (2008). Meet the bridgebloggers. Public Choice, 134(1),
47–65.
Zuckerman, E. (2013). Rewire: Digital Cosmopolitans in the Age of
Connection. London: W. W. Norton & Company.
Scott A. Hale Design and Multilingual Users

More Related Content

PDF
Gender Gap in Collaborative Platforms: Language and emotions in Wikipedia Dis...
PDF
Global connectivity and multilinguals in the Twitter network (slides)
PDF
Multilinguals and Wikipedia Editing
PPS
LRC XIII Localisation Conference - Using community feedback to improve social...
PDF
Multilingual user interface for website using resource
PDF
Multilingual user interface for website using resource files
PDF
CASL Report1
PDF
Promoting the Use of Basque via Language Technology
Gender Gap in Collaborative Platforms: Language and emotions in Wikipedia Dis...
Global connectivity and multilinguals in the Twitter network (slides)
Multilinguals and Wikipedia Editing
LRC XIII Localisation Conference - Using community feedback to improve social...
Multilingual user interface for website using resource
Multilingual user interface for website using resource files
CASL Report1
Promoting the Use of Basque via Language Technology

Similar to Design and Multilingual Users on Twitter and Wikipedia (20)

PPT
Creating Technical Documents In English For Global Audiences
PPT
5Cs and Web 2.0: Enhancing Foreign Language Teaching with Web 2.0 Technologies
PDF
Tackling the Problem of Multilingualism in Voice Assistants
PPT
Glis Localization Internationalization 05 20071030
PPT
Cyflwyniad Bloc
PPT
Greek Evaluation
PDF
IRJET - Analysis on Code-Mixed Data for Movie Reviews
PPS
Using community feedback to improve social networking terminology in Microsof...
PDF
Exploring Language Communities on Github
PPS
DAISY Consortium Open Source Projects
PDF
A Survey Of Current Datasets For Code-Switching Research
PPT
FLOSSCom Workshop Greece
PDF
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
PDF
Language, Twitter and Academic Conferences
PPT
2010 06-u maryland-crowd_sourcing-workshop-v2010-06-16-10h44
PDF
Enrope language policy, linguistic housekeeping, definitions and implementation
PDF
An Open Online Dictionary for Endangered Uralic Languages.pdf
ODP
eLanguage.net: Shifting the paradigm in Linguistics
PPT
Technology
PPT
Technology
Creating Technical Documents In English For Global Audiences
5Cs and Web 2.0: Enhancing Foreign Language Teaching with Web 2.0 Technologies
Tackling the Problem of Multilingualism in Voice Assistants
Glis Localization Internationalization 05 20071030
Cyflwyniad Bloc
Greek Evaluation
IRJET - Analysis on Code-Mixed Data for Movie Reviews
Using community feedback to improve social networking terminology in Microsof...
Exploring Language Communities on Github
DAISY Consortium Open Source Projects
A Survey Of Current Datasets For Code-Switching Research
FLOSSCom Workshop Greece
AUTOMATIC DETECTION AND LANGUAGE IDENTIFICATION OF MULTILINGUAL DOCUMENTS
Language, Twitter and Academic Conferences
2010 06-u maryland-crowd_sourcing-workshop-v2010-06-16-10h44
Enrope language policy, linguistic housekeeping, definitions and implementation
An Open Online Dictionary for Endangered Uralic Languages.pdf
eLanguage.net: Shifting the paradigm in Linguistics
Technology
Technology
Ad

More from Scott A. Hale (10)

PDF
Researching Misinformation
PDF
Big Tech & Disinformation: What are the main threats and how can journalists ...
PDF
No Master Algorithm: Human-machine intelligence and the real-world needs of f...
PDF
Foreign-language Reviews: Help or Hindrance? (Slides)
PDF
How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...
PDF
Interactive Visualizations for teaching, research, and dissemination
PDF
Oxford Digital Humanities Summer School
PDF
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
PDF
Ancient History of the UK Web
PDF
ECPR 2011 Leaders and Followers Experiment
Researching Misinformation
Big Tech & Disinformation: What are the main threats and how can journalists ...
No Master Algorithm: Human-machine intelligence and the real-world needs of f...
Foreign-language Reviews: Help or Hindrance? (Slides)
How much is said in a microblog? A multilingual inquiry based on Weibo and Tw...
Interactive Visualizations for teaching, research, and dissemination
Oxford Digital Humanities Summer School
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Ancient History of the UK Web
ECPR 2011 Leaders and Followers Experiment
Ad

Recently uploaded (20)

PPTX
Database Infoormation System (DBIS).pptx
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
SAP 2 completion done . PRESENTATION.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PDF
[EN] Industrial Machine Downtime Prediction
PDF
.pdf is not working space design for the following data for the following dat...
PPT
ISS -ESG Data flows What is ESG and HowHow
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PDF
Introduction to Data Science and Data Analysis
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
annual-report-2024-2025 original latest.
PPTX
modul_python (1).pptx for professional and student
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PDF
Lecture1 pattern recognition............
PDF
Business Analytics and business intelligence.pdf
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Computer network topology notes for revision
Database Infoormation System (DBIS).pptx
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
SAP 2 completion done . PRESENTATION.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
[EN] Industrial Machine Downtime Prediction
.pdf is not working space design for the following data for the following dat...
ISS -ESG Data flows What is ESG and HowHow
Reliability_Chapter_ presentation 1221.5784
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
Introduction to Data Science and Data Analysis
STUDY DESIGN details- Lt Col Maksud (21).pptx
annual-report-2024-2025 original latest.
modul_python (1).pptx for professional and student
STERILIZATION AND DISINFECTION-1.ppthhhbx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Lecture1 pattern recognition............
Business Analytics and business intelligence.pdf
climate analysis of Dhaka ,Banglades.pptx
Computer network topology notes for revision

Design and Multilingual Users on Twitter and Wikipedia

  • 1. Design and Multilingual Users on Twitter and Wikipedia Scott A. Hale scott.hale@oii.ox.ac.uk http://www.scotthale.net/ Oxford Internet Institute University of Oxford 17 June 2014 Scott A. Hale Design and Multilingual Users
  • 2. Importance of design Scott A. Hale Design and Multilingual Users
  • 3. Importance of design Scott A. Hale Design and Multilingual Users
  • 4. Content is diverse across languages “multilingualism...[is] the norm for most of the world’s societies” (Birner, 2005), with over half of Europe and over a fifth of the US multilingual (Erard, 2012); yet, many platforms are designed only with monolingual users in mind. In a Uzbekistan survey, Internet users reported accessing content in foreign languages even while simultaneously reporting poor foreign language skills (Wei & Kolko, 2005) Scott A. Hale Design and Multilingual Users
  • 5. Content is diverse across languages “multilingualism...[is] the norm for most of the world’s societies” (Birner, 2005), with over half of Europe and over a fifth of the US multilingual (Erard, 2012); yet, many platforms are designed only with monolingual users in mind. In a Uzbekistan survey, Internet users reported accessing content in foreign languages even while simultaneously reporting poor foreign language skills (Wei & Kolko, 2005) Users often contribute local content/knowledge (Hecht & Gergle, 2010a) Large diversity in information between languages (Hecht & Gergle, 2010b) Can lead to self-focus bias (Hecht & Gergle, 2009) Scott A. Hale Design and Multilingual Users
  • 6. Motivations Language clustering vs. small-worlds Users thought to cluster by language in most online platforms (Barnett & Choi, 1995; Hale, 2012a, 2012b; Herring et al., 2007; Nordenstreng & Varis, 1974; Takhteyev, Gruzd, & Wellman, 2011; Wilkinson & Thelwall, 2012) Many online platforms thought to exhibit the ‘small-world’ phenomenon of small path lengths between users (despite high clustering) Scott A. Hale Design and Multilingual Users
  • 7. Motivations Language clustering vs. small-worlds Users thought to cluster by language in most online platforms (Barnett & Choi, 1995; Hale, 2012a, 2012b; Herring et al., 2007; Nordenstreng & Varis, 1974; Takhteyev et al., 2011; Wilkinson & Thelwall, 2012) Many online platforms thought to exhibit the ‘small-world’ phenomenon of small path lengths between users (despite high clustering) Role of multilingual users ⇒ If users cluster by language and platforms are small-worlds, there must be brokers bridging different language groups (spanning structural holes) Multilingual users are possible bridge users. Only one study investigating this: Ego-net level study on Twitter following–follower network structure (Eleta & Golbeck, 2012). No study multiplatform study, no study at large-scale level Scott A. Hale Design and Multilingual Users
  • 8. Outline What are the roles of multilinguals and platform design in shaping the spread of information in social media? Twitter and Wikipedia at a global level 1 Language will have strong role in structuring the platform 2 Users engaging with content in multiple languages (multilingual users) serve as bridges between different clusters/editions 3 Users primarily writing in less-represented languages will be more likely to cross-language boundaries than users writing in highly-represented languages 4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other other languages Scott A. Hale Design and Multilingual Users
  • 9. Data Twitter Twitter mentions, retweet network 18 days of ‘spritzer’ 1% sample stream from June 2011 7,341,271 nodes. 8,545,693 directed, weighted edges Wikipedia Edits from top 46 language editions 8 July to 9 August 2013 3.5 million non-minor edits by 55,568 registered users Global Connectivity and Multilinguals in the Twitter Network (2014). http://www.scotthale.net/pubs/?chi2014 Multilinguals and Wikipedia Editing (2014). http://www.scotthale.net/pubs/?websci2014 Scott A. Hale Design and Multilingual Users
  • 10. Twitter: Data cleaning Language classification Clean text of tweets for language detection (remove urls, usernames, emoticons) Use Chromium Compact Language Detection kit for language detection (Graham, Hale, & Gaffney, 2013) Scott A. Hale Design and Multilingual Users
  • 11. Twitter: Data cleaning Language classification Clean text of tweets for language detection (remove urls, usernames, emoticons) Use Chromium Compact Language Detection kit for language detection (Graham et al., 2013) Remove users with less than 2 tweets or 20% of the user’s tweets in one language Remove users with less than four tweets total Scott A. Hale Design and Multilingual Users
  • 12. Twitter: Data cleaning Language classification Clean text of tweets for language detection (remove urls, usernames, emoticons) Use Chromium Compact Language Detection kit for language detection (Graham et al., 2013) Remove users with less than 2 tweets or 20% of the user’s tweets in one language Remove users with less than four tweets total Bots and spam users Remove users with no mentions (indegree=0) Select only the largest weakly-connected component (88% of nodes) Scott A. Hale Design and Multilingual Users
  • 13. Twitter: Data cleaning Language classification Clean text of tweets for language detection (remove urls, usernames, emoticons) Use Chromium Compact Language Detection kit for language detection (Graham et al., 2013) Remove users with less than 2 tweets or 20% of the user’s tweets in one language Remove users with less than four tweets total Bots and spam users Remove users with no mentions (indegree=0) Select only the largest weakly-connected component (88% of nodes) End result 916,836 nodes (users) and 2,652,618 directed edges (mentions/retweets) Each user assigned most used language and frequency [0-1] that the most used language is used Scott A. Hale Design and Multilingual Users
  • 14. Wikipedia: Data cleaning. Non-minor edits by registered, human users to articles: only edits to the main (article) namespace; removed articles flagged as being created by ‘bots’; removed anonymous users; removed undeclared bots and users with only one edit session in the month; required at least four edits and at least 2 edits to one edition. Matching users and articles across languages: look for common usernames across language editions; check that usernames are indeed linked global accounts; use the Wikidata dump to match articles across languages. Result: 55,568 users (excluding the Simple English edition) with a total of 3,518,955 edits.
  • 15. User counts
      Twitter
        Language                 User count
        English (en)             375,474
        Japanese (ja)            137,263
        Portuguese (pt)          133,501
        Malay/Indonesian (ms)    106,223
        Spanish (es)             70,246
        Dutch (nl)               31,035
        Korean (ko)              16,123
        Thai (th)                8,629
        Arabic (ar)              7,679
        French (fr)              5,769
        Filipino/Tagalog (fil)   5,393
      Wikipedia
        Language                 User count
        English                  22,412
        German                   4,920
        French                   3,430
        Russian                  3,330
        Spanish                  3,299
        Japanese                 3,164
        Italian                  2,202
        Chinese                  1,975
        Portuguese               1,220
        Polish                   1,011
        Dutch                    1,007
  • 16. Twitter: Multilinguals vs Monolinguals. On Twitter, 11% of users (~103,000) were observed to use more than one language and were designated as multilingual users. Figure: multilingual vs. monolingual users, compared by tweet count, out-degree, and in-degree.
  • 18. Wikipedia: Multilinguals vs Monolinguals. On Wikipedia, 15.4% of users (8,544) edited more than one language edition and were designated as multilingual users. Figure: density plot comparing the number of edits made by monolingual and multilingual Wikipedia users. Size of edits does not differ significantly. Only 2.6% of edits are from users writing in their non-primary languages on Wikipedia.
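The two percentages on this slide could be computed along the following lines. This is only a sketch under stated assumptions: the function name `summarize_editors` and the input format (one `(username, edition)` pair per edit) are hypothetical, and a user's primary edition is assumed to be the edition they edited most often, which the slides do not define explicitly.

```python
from collections import Counter, defaultdict


def summarize_editors(edits):
    """Summarize multilingual editing from (username, edition) pairs.

    Assumption: a user's primary edition is the one with the most edits;
    every other edit by that user counts as a non-primary edit.
    """
    per_user = defaultdict(Counter)
    for user, edition in edits:
        per_user[user][edition] += 1
    if not per_user:
        raise ValueError("no edits supplied")

    multilingual_users = 0
    non_primary_edits = 0
    total_edits = 0
    for counts in per_user.values():
        n_edits = sum(counts.values())
        total_edits += n_edits
        non_primary_edits += n_edits - max(counts.values())
        if len(counts) > 1:  # edited more than one language edition
            multilingual_users += 1

    return {
        "multilingual_user_share": multilingual_users / len(per_user),
        "non_primary_edit_share": non_primary_edits / total_edits,
    }
```

For example, with one user making three edits to `en` and one to `de`, and a second user making four edits to `ja`, half of the users are multilingual and one edit in eight is to a non-primary edition.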
  • 19. Twitter: Language and structure. The label propagation algorithm (Raghavan, Albert, & Kumara, 2007) found 20,253 communities. Figure: histograms of the size of communities (left) and the number of languages within each community (right). Modularity score of 0.81 for this community structure.
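To make the modularity score concrete, here is a small sketch of how modularity can be computed for a given partition. It uses the standard Newman formulation for an undirected, unweighted graph, which is an assumption: the slides do not say which variant produced the 0.81 figure, and the Twitter network described earlier is directed and weighted. The function name `modularity` and its inputs are illustrative.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where L_c is the number of edges inside c, d_c the total degree of
    nodes in c, and m the total number of edges.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal = defaultdict(int)  # L_c
    degree = defaultdict(int)    # d_c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

Two triangles joined by a single edge, split into one community per triangle, give Q = 5/14 under this definition; putting every node in a single community gives Q = 0.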
  • 20. Twitter: Language and structure. Figure: scatter plot of community size against the percentage of the community's users whose most-used language is the community's most prevalent language.
  • 21. Language and structure
      Table: Clusters with over 10,000 nodes found through the label propagation algorithm. Collectively, 61% of all users are in one of these clusters.
        Most-used language   % users in most-used language   Number of languages   Number of nodes
        Malay (ms)           78.3                            41                    123,616
        English (en)         99.3                            39                    114,826
        Portuguese (pt)      94.3                            40                    101,987
        Japanese (ja)        99.6                            19                    83,785
        English (en)         75.7                            44                    80,387
        English (en)         55.1                            42                    37,688
        Dutch (nl)           90.6                            23                    20,634
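The columns of this table can be derived per community from each member's assigned language. The helper below is a hypothetical sketch (the name `community_profile` and the input format are not from the source); it assumes every user carries exactly one most-used language code.

```python
from collections import Counter


def community_profile(user_langs):
    """Profile one community from its members' most-used languages.

    Returns (most-used language, % of users in that language,
    number of distinct languages, number of nodes).
    """
    if not user_langs:
        raise ValueError("community has no users")
    counts = Counter(user_langs)
    top_lang, top_n = counts.most_common(1)[0]
    return top_lang, 100.0 * top_n / len(user_langs), len(counts), len(user_langs)
```

A community of three English users and one Japanese user would be reported as English, 75.0%, 2 languages, 4 nodes.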
  • 22. Twitter: Do multilinguals bridge clusters? Figure: size of the largest weakly connected component (left), total number of components (center), and average component size (right) after (a) removing all multilingual users, (b) randomly removing an equivalent number of monolingual users, (c) randomly removing an equivalent number of users of any kind, and (d) removing all multilingual users from a network with the same degree distribution but randomly shuffled edges. Box plots show values from 100 realizations; mean values are marked with +.
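The node-removal comparison can be sketched as below. This is a simplified illustration, not the study's code: it covers only two of the four conditions (removing all multilingual users, and removing an equal number of randomly chosen monolingual users), runs a single realization rather than 100, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def weak_component_sizes(nodes, edges):
    """Sizes of weakly connected components among `nodes`, largest first.

    Directed edges are treated as undirected; edges touching a node that
    is not in `nodes` are ignored.
    """
    nodes = set(nodes)
    adjacency = defaultdict(set)
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def summarize(sizes):
    """Largest component, number of components, and mean component size."""
    if not sizes:
        return {"largest": 0, "count": 0, "mean": 0.0}
    return {"largest": sizes[0], "count": len(sizes), "mean": sum(sizes) / len(sizes)}


def removal_experiment(nodes, edges, multilingual, seed=0):
    """Compare removing multilingual users with removing random monolinguals."""
    rng = random.Random(seed)
    nodes = set(nodes)
    multilingual = set(multilingual) & nodes
    monolingual = sorted(nodes - multilingual)
    k = min(len(multilingual), len(monolingual))
    random_mono = set(rng.sample(monolingual, k))
    return {
        "multilingual_removed": summarize(
            weak_component_sizes(nodes - multilingual, edges)),
        "random_monolingual_removed": summarize(
            weak_component_sizes(nodes - random_mono, edges)),
    }
```

Repeating the random condition with different seeds and collecting the three summary values would give the kind of distribution that the box plots display.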
  • 23. Wikipedia: Do multilinguals bridge editions? Do multilinguals edit similar articles across languages? A large number of users did not edit any of the same articles in their primary languages, but a large number of users also always edited the same articles in their primary languages.
  • 25. Variations by language. Figure (panels: Twitter, Wikipedia): number of users in each language compared with the percentage of those users classified as multilingual.
  • 26. Twitter: Cross-language connections. Figure: network of mentions and retweets across languages (node labels: ar, de, en, es, fil, fr, gl, it, ja, ko, ms, nl, pt, th). Nodes represent most-used language. Directed, weighted edges show the log of the number of users primarily using one language who mention or retweet users in another language. Only edges with weights more than 1.96 standard deviations above the mean are shown. Colors indicate communities found by the Infomap community detection algorithm. N.B. This differs from the published paper, where edges were normalized by the expected number of connections between language pairs if tweets were directed at users randomly without regard to language.
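The edge filter described on this slide (log-transformed counts, keeping only edges more than 1.96 standard deviations above the mean) can be sketched as follows. The function name `significant_edges` and the input dictionary are hypothetical. Two details are assumptions because the slides do not specify them: the natural logarithm is used, and the standard deviation is the population form. The log base does not change which edges pass, since it rescales the weights, the mean, and the standard deviation by the same factor.

```python
import math
import statistics


def significant_edges(pair_counts, z=1.96):
    """Keep edges whose log weight exceeds mean + z * standard deviation.

    pair_counts: dict mapping (source_language, target_language) to the
    number of users primarily using the source language who mention or
    retweet users in the target language. Pairs with a zero count are
    skipped because their logarithm is undefined.
    """
    weights = {pair: math.log(n) for pair, n in pair_counts.items() if n > 0}
    values = list(weights.values())
    cutoff = statistics.fmean(values) + z * statistics.pstdev(values)
    return {pair: w for pair, w in weights.items() if w > cutoff}
```

With one pair at 1,000 users and five pairs at 10 users each, only the first pair survives the filter.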
  • 27. Wikipedia: Language crossings. Figure: co-editing network graph (node labels: ar, bg, ca, cs, da, de, en, es, fa, fi, fr, he, hu, id, it, ja, ko, nl, no, pl, pt, ro, ru, sv, tr, uk, zh). Nodes represent language editions. Directed, weighted edges show the log of the number of users primarily editing one language edition who edited another edition. Only edges with weights more than 1.96 standard deviations above the mean are shown. Colors indicate communities found by the Infomap community detection algorithm.
  • 28. Wikipedia: Language crossings (English removed). Figure: co-editing network graph (node labels: ca, cs, de, es, fr, it, ja, nl, pl, pt, ru, sv, uk, zh). Nodes represent language editions. Directed, weighted edges show the log of the number of users primarily editing one language edition who edited another edition. Only edges with weights more than 1.96 standard deviations above the mean are shown. Colors indicate communities found by the Infomap community detection algorithm.
  • 32. Summary and Implications. Multilingualism is correlated with activity on both platforms. Design for multilingual users: allow users to have multiple preferred languages when personalizing search results, friend recommendations, etc. Structured by language: language has a strong role in structuring both platforms; multilingual users are in a position to bridge clusters/editions, but there is mixed evidence on their actual role; multilingual user percentage ∝ 1/self-focus bias. Important per-language variations: users in less-represented languages are more likely to cross language boundaries on Wikipedia, but there is no correlation on Twitter (platform differences?); consistent findings of English and Japanese as outliers. Larger languages form bridges: especially English, but other geolinguistic patterns are evident; global connectivity results from the combination of multilinguals across many language pairs.
  • 33. Design and Multilingual Users on Twitter and Wikipedia. Scott A. Hale, scott.hale@oii.ox.ac.uk, http://www.scotthale.net/ Oxford Internet Institute, University of Oxford. 17 June 2014. I would like to thank Eric T. Meyer, Taha Yasseri, Jonathan Bright, and Mike Thelwall, who provided helpful comments on various aspects of this research.
  • 34. Barnett, G. A., & Choi, Y. (1995). Physical Distance and Language as Determinants of the International Telecommunications Network. International Political Science Review, 16(3), 249–265. Available from http://ips.sagepub.com/content/16/3/249.abstract
      Birner, B. (2005). Bilingualism (Tech. Rep.). Washington, DC, USA: Linguistic Society of America. Available from http://www.linguisticsociety.org/files/Bilingual.pdf
      Eleta, I., & Golbeck, J. (2012). Bridging Languages in Social Networks: How Multilingual Users of Twitter Connect Language Communities. Proceedings of the American Society for Information Science and Technology, 49(1), 1–4. Available from http://dx.doi.org/10.1002/meet.14504901327
      Erard, M. (2012, January). Are we Really Monolingual? Available from http://www.nytimes.com/2012/01/15/opinion/sunday/are-we-really-monolingual.html
  • 35. Graham, M., Hale, S. A., & Gaffney, D. (2013). Where in the world are you? Geolocation and language identification in Twitter. Professional Geographer.
      Hale, S. A. (2012a). Impact of platform design on cross-language information exchange. In Proceedings of the 2012 ACM annual conference on human factors in computing systems extended abstracts (pp. 1363–1368). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/2212776.2212456
      Hale, S. A. (2012b). Net Increase? Cross-Lingual Linking in the Blogosphere. Journal of Computer-Mediated Communication, 17(2), 135–151. Available from http://onlinelibrary.wiley.com/doi/10.1111/j.1083-6101.2011.01568.x/full
      Hale, S. A. (2014a). Global Connectivity and Multilinguals in the Twitter Network. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 833–842). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/2556288.2557203
  • 36. Hale, S. A. (2014b). Multilinguals and Wikipedia Editing. In Proceedings of the 6th annual ACM web science conference. New York, NY, USA: ACM. Available from http://arxiv.org/abs/1312.0976
      Hecht, B., & Gergle, D. (2009). Measuring self-focus bias in community-maintained knowledge repositories. In Proceedings of the fourth international conference on communities and technologies (pp. 11–20). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1556460.1556463
      Hecht, B., & Gergle, D. (2010a). On the “localness” of user-generated content. In Proceedings of the 2010 ACM conference on computer supported cooperative work (pp. 229–232). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1718918.1718962
      Hecht, B., & Gergle, D. (2010b). The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the 28th international conference on human factors in computing systems (pp. 291–300). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1753326.1753370
  • 37. Herring, S. C., Paolillo, J. C., Ramos-Vielba, I., Kouper, I., Wright, E., Stoerger, S., et al. (2007). Language Networks on LiveJournal. In Proceedings of the 40th annual Hawaii international conference on system sciences. Washington, DC, USA: IEEE Computer Society. Available from http://dx.doi.org/10.1109/HICSS.2007.320
      Nordenstreng, K., & Varis, T. (1974). Television traffic: A one-way street? A survey and analysis of the international flow of television programme material. Reports and Papers on Mass Communication (70).
      Raghavan, U. N., Albert, R., & Kumara, S. (2007, September). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76(3), 036106. Available from http://link.aps.org/doi/10.1103/PhysRevE.76.036106
      Takhteyev, Y., Gruzd, A., & Wellman, B. (2011). Geography of Twitter networks. Social Networks, 1–26. Available from http://www.sciencedirect.com/science/article/pii/S0378873311000359#FCANote
  • 38. Wei, C. Y., & Kolko, B. E. (2005). Resistance to globalization: Language and Internet diffusion patterns in Uzbekistan. New Review of Hypermedia and Multimedia, 11(2), 205–220.
      Wilkinson, D., & Thelwall, M. (2012). Trending Twitter topics in English: An international comparison. Journal of the American Society for Information Science and Technology, 63(8), 1631–1646. Available from http://dx.doi.org/10.1002/asi.22713
      Zuckerman, E. (2008). Meet the bridgebloggers. Public Choice, 134(1), 47–65.
      Zuckerman, E. (2013). Rewire: Digital Cosmopolitans in the Age of Connection. London: W. W. Norton & Company.