News construction from microblogging
posts using open data
Francisco Berrizbeitia
Universidad Simón Bolívar
Caracas, Venezuela
fberrizbeitia@gmail.com
June, 2014
Abstract
Information access can be limited in situations where traditional media outlets cannot
cover events due to geographical limitations or censorship; examples include civil unrest,
war and natural disasters. In these situations citizen journalism replaces or
complements traditional media in documenting events. Microblogging services
such as Twitter have become very useful in these scenarios due to their mobile nature and
multimedia capabilities.
In this research we propose a method to create searchable, semantically annotated news
articles from tweets in an automated way using the cloud of linked open data.
Keywords
Semantic web, news, microblogging, Twitter, automatic document generation, data
journalism, citizen journalism.
1 Introduction
Citizen journalism has become a very common practice with the arrival of smartphones
and microblogging services such as Twitter. Due to the multimedia capabilities of these devices
and the mobile nature of social networks, people all over the world are documenting all sorts
of events and publishing them on the Web in real time. This type of journalism is particularly
important in situations where traditional media cannot cover events, such as natural
disasters, war and civil unrest, or because of government or self-imposed censorship.
Citizen journalism is protected by article 19 of the Universal Declaration of Human Rights
(United Nations):
“Everyone has the right to freedom of opinion and expression; this right includes
freedom to hold opinions without interference and to seek, receive and impart
information and ideas through any media and regardless of frontiers.”
This protection has had tremendous implications in the recent past, in situations where the
only available information was found on social networks and in international media outlets
with very limited coverage. We believe it is of great importance to develop technology that
allows the creation of "fair" documents from all the contributions made by users during
such events. The hope is that the documents created automatically by this technology will be
closer to what really happened and will help guarantee impartiality.
As a first step in this research, we construct a news article from a single 140-character
message using the open data cloud.
In the rest of this report we first describe our overall approach to the problem,
then describe the system we developed for the task, and finally discuss the results.
2 Related Work
Information extraction from Twitter and other microblogging platforms has been done in
the past. (David Laniado, 2010) explored the semantic value of hashtags as identifiers for the
semantic web. (Shinavier, 2010) proposed the possibility of creating a real-time semantic web
using structured microblogging messages. (Ritter, 2012) used natural language processing and
information extraction techniques over a corpus of tweets to extract machine-readable
information.
Sentiment analysis has also been a topic of research, such as the work of (Alexander Pak, 2010),
which proposes a machine learning method to classify tweets as positive, negative or
neutral.
3 Description
The main objective is to obtain, from the Open Data Cloud, the semantically meaningful
concepts expressed in a micropost, and then create a document that extends the original
text with the retrieved concepts. If we succeed in this task, we end up with a news article
in which the answers to the questions who, what, where, when and why (Wikipedia, 2014)
are derived from the micropost and extended with the linked open data cloud.
Figure 1. Overall view of the process
Figure 1 shows the overall news-creation process. Since this is our first approach to
the problem, we decided to limit the sources of information to Twitter as the only microblog
input and DBpedia as our source of semantically annotated information.
The system was implemented as a web application written in PHP. In the next sections we
describe each part of the system.
3.1 Information gathering and text preparation
The first task consists of gathering the information posted by a user of the social network; we
collect not only the published text, but also the media, when available, and information about
the author. All of this information is obtained through the public API provided by Twitter. As
shown in figure 2, the only input the system needs is the tweet ID.
Figure 2. Input screen of the system
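The gathering step can be sketched as follows. This is a minimal illustration in Python (the system itself is written in PHP), assuming the Twitter REST API v1.1 `statuses/show` endpoint and its JSON field names (`text`, `user.screen_name`, `entities.media`); authentication handling is omitted.

```python
# Sketch of the gathering step: resolve a tweet ID to its text, author and
# media. Endpoint and field names follow the Twitter REST API v1.1
# "statuses/show" call; OAuth signing is left out for brevity.

def build_status_url(tweet_id: str) -> str:
    """Build the request URL for a single tweet, asking for media entities."""
    base = "https://api.twitter.com/1.1/statuses/show.json"
    return f"{base}?id={tweet_id}&include_entities=true"

def extract_fields(status: dict) -> dict:
    """Keep only the fields the system uses: text, author and media URLs."""
    media = status.get("entities", {}).get("media", [])
    return {
        "text": status["text"],
        "author": status["user"]["screen_name"],
        "media_urls": [m["media_url"] for m in media],
    }
```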
After the text is retrieved it must be "denoised" before any further processing. At this point all
stop words are removed, as well as links and Twitter-specific words such as RT or FF. The
hashtag character (#) is removed, leaving the remaining word intact.
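As an illustration, the denoising step might look like the following Python sketch (the system itself is written in PHP). The stop-word list here is a hypothetical stand-in, since the paper does not specify the list used.

```python
import re

# Hypothetical stop-word list; the paper does not say which one the system uses.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "on", "at"}
TWITTER_WORDS = {"rt", "ff"}  # Twitter-specific tokens removed by the system

def denoise(text: str) -> list[str]:
    """Remove links, stop words and Twitter jargon; keep hashtag words."""
    text = re.sub(r"https?://\S+", "", text)   # strip links
    text = text.replace("#", "")               # drop the # character, keep the word
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in TWITTER_WORDS]
```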
3.2 Candidate selection
Before querying the DBpedia endpoint, we first run a local analysis using a copy of the
WordNet database. Each word is analyzed and a matrix of word senses is
created. Following a set of rules, we build a list of possible two-word and one-word candidates
that may be relevant concepts, places or persons. By doing this we reduce the number of
queries we need to make to the endpoint.
Since Wikipedia and DBpedia are tightly related, we decided to first query
Wikipedia through its API to obtain the URL of the Wikipedia page of which the candidate is the
main topic.
At the end of this process we have a list of candidates with known Wikipedia
pages.
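The candidate-generation step above can be sketched like this in Python (the system itself is written in PHP). The exact rule set and the WordNet sense matrix are not detailed in the paper, so the simple windowing below is only illustrative; the Wikipedia lookup uses the standard MediaWiki `action=query` API.

```python
# Illustrative candidate generation: two-word candidates first (more specific),
# then one-word candidates, followed by a Wikipedia API lookup URL that checks
# whether a page exists for a candidate.

def candidates(tokens: list[str]) -> list[str]:
    """Sliding window of 2-word candidates, then the 1-word candidates."""
    pairs = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return pairs + tokens

def wikipedia_query_url(candidate: str) -> str:
    """MediaWiki API query that resolves a candidate to a page title."""
    title = candidate.replace(" ", "_")
    return ("https://en.wikipedia.org/w/api.php"
            f"?action=query&titles={title}&redirects=1&format=json")
```

Two-word candidates are tried first because a phrase like "world cup" is more discriminating than either word on its own.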
3.3 Semantically annotated information retrieval from the Open Data
Cloud
The next step is to query DBpedia's SPARQL endpoint to retrieve the semantically annotated
information related to the tweet topics detected in the previous step. Once the information is
received from the endpoint, it is combined with the author information from Twitter in a
Turtle file, in order to make it available via a SPARQL endpoint. We used a subset of the rNews
ontology (International Press Telecommunications Council, 2011), shown in Figure 3.
Figure 3. Subset of the rNews Ontology used for the project
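A minimal sketch of this retrieval step follows, in Python (the system itself is written in PHP). The choice of properties (`dbo:abstract`, `rdf:type`) is illustrative only; the paper does not list the exact query sent to the endpoint.

```python
# Build a SPARQL query for a resource resolved in the candidate-selection
# step, against DBpedia's public endpoint. Property selection is illustrative.

def dbpedia_query(resource: str, lang: str = "en") -> str:
    """SPARQL query for the abstract and types of one DBpedia resource."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?abstract ?type WHERE {{
  <http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract ;
                                           rdf:type ?type .
  FILTER (lang(?abstract) = "{lang}")
}}"""
```

The results of such queries, together with the tweet author's details, are what the system serializes into the Turtle file.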
4 Results
To test the approach and the system, we selected 90 tweets directly from Twitter search on
three subjects: the Brazilian riots during the 2014 World Cup, Barack Obama, and Venezuela.
The collection process consisted of searching through the API and collecting, for each of the
selected topics, the first 30 messages with an associated picture.
After the sample was obtained, we manually tagged each tweet. This was done twice, by
different people, to minimize human error. After the sample was manually tagged, we ran the
automated process for each tweet and saved the results for each case. The results can be seen
in Figure 4. We expected to find 415 terms across all tweets and found 433; of those, 317 were
an exact match to what was expected from the manual process, 63 resulted in information
that is not wrong but adds no real value, and 53 were wrong concepts. This gives a precision of
76.36%, i.e. the proportion of expected terms that were automatically detected by the
method, and an error rate of 12.24%.
Figure 4. Result of the test cases
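The figures above can be recomputed from the raw counts as a quick sanity check (precision here means the share of manually expected terms that the method detected, as the text defines it):

```python
# Sanity-checking the reported figures: 415 expected terms, 433 retrieved,
# of which 317 exact matches, 63 neutral (correct but uninformative), 53 wrong.

expected, found = 415, 433
exact, neutral, wrong = 317, 63, 53

assert exact + neutral + wrong == found   # the three outcomes cover all retrieved terms

precision = exact / expected   # share of expected terms that were detected
error_rate = wrong / found     # share of retrieved terms that were wrong concepts

print(f"{precision:.2%}")   # 76.39%, close to the reported 76.36%
print(f"{error_rate:.2%}")  # 12.24%, matching the reported error rate
```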
Analyzing the errors, we noticed that the automatically retrieved concepts carried a wrong
meaning for the context. For example, in the context of the Brazilian riots, the concept "fire"
was resolved as in "a burning fire" instead of "to fire a gun". Similar cases can be found in the
other topics tested.
The terms that were not detected by the automated method were candidates with known
Wikipedia pages that had no corresponding entry in DBpedia.
5 Future work
Encouraged by these results, we plan to further develop the method and include
automated context detection as a way to maximize precision. A possible approach to this
problem is described in (Esther Villar Rodríguez, 2012) and (Nebhi, 2012).
We would also like to extend the system so that it not only detects, retrieves and saves
the information of a single message, but can create a complete documentation of an event
over an extended period of time, based on several microblogging platforms and media outlets,
both independent and corporate. The end result we hope to reach is a fully searchable,
semantically annotated news stream that will serve as a neutral and centralized endpoint for
data journalism.
6 Conclusions
In this research we proposed a method to automatically create a news article from a tweet
using the cloud of linked open data. To do so, we implemented a web system that
takes a tweet ID as input and generates a semantically annotated news article based on a
subset of the rNews ontology. To test our approach we collected 90 tweets on three
subjects: the Brazilian riots during the 2014 World Cup, Barack Obama, and Venezuela. The
messages were tagged manually and then compared with the automatically found annotations.
Our method was able to capture 76.36% of the manually detected terms, with an error rate of
12.24%, due mostly to disambiguation problems.
These results encourage us to further develop the method and the system: first to solve the
disambiguation problems, and then to pursue a more ambitious approach that will allow us to
create a semantically annotated news stream based not only on tweets but also on other
microblogging services, independent blogs and corporate media outlets, and that can serve as
a centralized semantic endpoint for data journalism.
7 References
Alexander Pak, P. P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
Valletta, Malta: Proceedings of the Seventh International Conference on Language
Resources and Evaluation.
David Laniado, P. M. (2010). Making Sense of Twitter. Shanghai, China: ISWC 2010.
Esther Villar Rodríguez, A. I. (2012). Using Linked Open Data Sources for Entity Disambiguation.
Rome: CLEF Initiative.
International Press Telecommunications Council. (2011, October 7). rNews. Retrieved June 21,
2014, from IPTC site for developers: http://dev.iptc.org/rNews
Nebhi, K. (2012). Ontology-Based Information Extraction from Twitter. (pp. 17-22). Mumbai:
Proceedings of the Workshop on Information Extraction and Entity Analytics on Social
Media Data.
Ritter, A. (2012). Extracting Knowledge from Twitter and The Web. Doctoral thesis. University
of Washington.
Shinavier, J. (2010). Real-time #SemanticWeb in <= 140 Characters. WWW2010. Raleigh, North
Carolina.
United Nations. (n.d.). The Universal Declaration of Human Rights. Retrieved June 22, 2014,
from http://www.un.org/en/documents/udhr/index.shtml
Wikipedia. (2014, June 11). Five Ws. Retrieved June 20, 2014, from wikipedia.org:
http://en.wikipedia.org/wiki/Five_Ws
