Hands on Text Analytics
with Orange
Ajda Pretnar
ajda.pretnar@fri.uni-lj.si
University of Ljubljana, Slovenia

Niko Colnerič
niko.colneric@fri.uni-lj.si
University of Ljubljana, Slovenia

Lan Žagar
lan.zagar@fri.uni-lj.si
University of Ljubljana, Slovenia
	
Orange for Text Analytics
In recent years, the digital humanities community has been introduced to many powerful tools for text analysis, but few of them combine data mining and machine learning algorithms with a simple yet capable user interface. For flexible and creative analysis, researchers need a tool that focuses on intuition, visualization and interactivity.
This workshop will introduce participants to Orange, a visual programming environment for data mining, suitable for both beginners and experts. Particular emphasis will be placed on its Text add-on, which offers components for text mining, visualization and deep-learning-based embedding.
This is a hands-on workshop in which participants will actively construct analytical workflows and go through case studies with the help of the instructors. They will learn how to manage and preprocess textual data, how to use machine learning, data projection and visualization techniques to expose hidden patterns, and how to evaluate the resulting models. At the end of the workshop, participants will know how to use visual programming to seamlessly construct powerful data analysis workflows that can be applied to a wide range of challenges in digital humanities.
Structure of the Workshop
Part 1: Visual programming, workflows, data
input and preprocessing
First, we will show the basics of Orange: how to load, inspect and visualize data. Participants will be introduced to several options for data import, from the standard Corpus to Twitter, Guardian and Text Import. Once the corpus is loaded, we will preprocess it and display the result in a word cloud. Particular emphasis will be placed on custom preprocessing techniques and how to apply them successfully to the corpus. The results of each technique will be observed in an interactive word cloud and in concordances.
	
Figure 1: Preprocessing results displayed in a word cloud
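To make the preprocessing step concrete outside the visual environment, the following is a minimal sketch in plain Python. It does not use Orange's own API. The sample documents, the stop-word list and the function names are hypothetical and chosen only for illustration. It lowercases the text, tokenizes it, removes stop words and counts the remaining tokens, which is the kind of frequency information a word cloud displays.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
DOCUMENTS = [
    "The workshop introduces Orange, a tool for visual programming.",
    "Participants preprocess the corpus and inspect the corpus in a word cloud.",
]
STOP_WORDS = {"the", "a", "and", "for", "in"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


def word_frequencies(documents):
    """Count token occurrences over all documents (word cloud weights)."""
    counts = Counter()
    for document in documents:
        counts.update(preprocess(document))
    return counts


frequencies = word_frequencies(DOCUMENTS)
```

Changing the stop-word list or the tokenization pattern and recomputing `frequencies` mirrors, in a simplified way, how each preprocessing choice changes what the word cloud shows.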
Part 2: Machine learning and deep-learning-
based embedding for predictive analysis
Next, we will use Twitter data to construct an author prediction pipeline and test several classifiers. We will fetch author timelines from Twitter and inspect the retrieved corpus. This time we will introduce a pre-trained tweet tokenizer and pass the preprocessed corpus through a bag of words. We will discuss bag-of-words parameters and how best to prepare the data for further analysis. The results of using different parameters will be observed in a data table to understand the underlying data structures. For comparison, we will use deep-learning-based embedding to derive vector representations of tweets and in this way enable machine learning on them.
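As a rough illustration of the data structure behind a bag of words, the sketch below builds a term-count table from already tokenized tweets in plain Python. The tweets, the function names and the `binary` switch are assumptions made for this example and do not reproduce the parameters of Orange's Bag of Words widget.

```python
from collections import Counter

# Hypothetical pre-tokenized tweets, for illustration only.
TWEETS = [
    ["data", "mining", "is", "fun"],
    ["text", "mining", "with", "orange"],
    ["fun", "fun", "fun"],
]


def build_vocabulary(token_lists):
    """Return the sorted list of distinct tokens; it fixes the column order."""
    return sorted({token for tokens in token_lists for token in tokens})


def bag_of_words(token_lists, binary=False):
    """Turn each token list into a row of term counts (or 0/1 if binary)."""
    vocabulary = build_vocabulary(token_lists)
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        row = [counts[term] for term in vocabulary]
        if binary:
            row = [1 if value > 0 else 0 for value in row]
        rows.append(row)
    return vocabulary, rows


vocabulary, count_rows = bag_of_words(TWEETS)
_, binary_rows = bag_of_words(TWEETS, binary=True)
```

Each row corresponds to one tweet and each column to one vocabulary term. Comparing `count_rows` with `binary_rows` shows how a single parameter changes the table that a classifier later receives.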
We will explain how machine learning can be used in text mining and introduce a number of techniques for predictive analysis. We will use cross-validation to test the models built on the bag of words and compare classification scores for each algorithm. We will discuss the quality of the constructed models and which scores are usually most suitable for assessing it. Additionally, we will inspect misclassified tweets in a confusion matrix and then further in Corpus Viewer, to leverage the possibilities of close(r) reading.
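The evaluation procedure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the feature rows and author labels are made up, a 1-nearest-neighbour rule stands in for whichever classifiers are used in the workshop, and folds are assigned by index rather than by shuffling or stratification.

```python
from collections import Counter

# Hypothetical bag-of-words rows and author labels, for illustration only.
ROWS = [[3, 0], [4, 1], [2, 3], [0, 3], [1, 5], [0, 4]]
AUTHORS = ["A", "A", "A", "B", "B", "B"]


def nearest_neighbour(train_rows, train_labels, row):
    """Predict the label of the closest training row (squared Euclidean)."""
    def distance(other):
        return sum((x - y) ** 2 for x, y in zip(row, other))
    best = min(range(len(train_rows)), key=lambda i: distance(train_rows[i]))
    return train_labels[best]


def cross_validate(rows, labels, k=3):
    """Return (index, actual, predicted) triples collected over k folds."""
    results = []
    for fold in range(k):
        test_idx = [i for i in range(len(rows)) if i % k == fold]
        train_idx = [i for i in range(len(rows)) if i % k != fold]
        train_rows = [rows[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        for i in test_idx:
            predicted = nearest_neighbour(train_rows, train_labels, rows[i])
            results.append((i, labels[i], predicted))
    return results


results = cross_validate(ROWS, AUTHORS)
# Confusion matrix as a mapping (actual, predicted) -> number of tweets.
confusion = Counter((actual, predicted) for _, actual, predicted in results)
accuracy = sum(1 for _, a, p in results if a == p) / len(results)
# Indices of misclassified tweets, i.e. the candidates for close reading.
misclassified = [i for i, a, p in results if a != p]
```

The off-diagonal entries of `confusion` point to the tweets listed in `misclassified`, which is the step where one would return to the original texts.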
Part 3: Data clustering, sentiment analysis,
image and geo analytics
In the third part, we will work on geomapping and image analytics. We will transform textual and visual data into feature vectors and plot these data onto a world map to discover interesting relations.
We will discuss how to acquire geolocated data from Twitter and why this is useful. Next, we will use geotagged Twitter data and apply a pre-trained sentiment analysis model to determine the sentiment orientation of each tweet. We will map the sentiment-tagged tweets and explore how to use sentiment together with geomapping.
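One simple way to combine sentiment with location is to score each tweet and average the scores per map cell. The sketch below assumes a tiny hand-made lexicon in place of the pre-trained model mentioned above. The lexicon, the tweets, their coordinates and the one-degree grid are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sentiment lexicon and geotagged tweets, for illustration only.
LEXICON = {"love": 1, "great": 1, "hate": -1, "awful": -1}
TWEETS = [
    {"text": "I love this great city", "lat": 46.05, "lon": 14.51},
    {"text": "awful traffic today", "lat": 46.06, "lon": 14.50},
    {"text": "great coffee", "lat": 48.21, "lon": 16.37},
]


def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon values of the words in the text (0 means neutral)."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


def mean_sentiment_by_cell(tweets, cell_size=1.0):
    """Average the sentiment of tweets that fall into the same map cell."""
    cells = defaultdict(list)
    for tweet in tweets:
        cell = (int(tweet["lat"] // cell_size), int(tweet["lon"] // cell_size))
        cells[cell].append(sentiment_score(tweet["text"]))
    return {cell: sum(scores) / len(scores) for cell, scores in cells.items()}


cell_sentiment = mean_sentiment_by_cell(TWEETS)
```

Each key of `cell_sentiment` identifies a grid cell by its integer latitude and longitude, so the values could be used to colour regions on a map.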
Finally, participants will be introduced to image analytics for humanities research. We will explain why and how to transform raw images into multidimensional vectors and how to work with the new data. We will cluster Instagram images into groups and explore how to map image-containing tweets on a world map. Do images correspond to geolocation? We will see.
	
	
Figure 2: Images from social media are embedded with
ImageNet embedding, clustered with Hierarchical Clustering
and displayed on a map by their geolocation.
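Once images are represented as vectors, grouping them is an ordinary clustering problem. The following plain-Python sketch shows agglomerative (hierarchical) clustering on made-up two-dimensional vectors. Real image embeddings have many more dimensions, and single linkage is assumed here only for simplicity, since the abstract does not state which linkage is used.

```python
# Hypothetical 2-D "embeddings"; real image embeddings are much larger.
EMBEDDINGS = {
    "beach_1.jpg": (0.0, 0.1),
    "beach_2.jpg": (0.2, 0.0),
    "city_1.jpg": (5.0, 5.1),
    "city_2.jpg": (5.2, 4.9),
    "city_3.jpg": (4.9, 5.0),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def single_linkage(cluster_a, cluster_b, vectors):
    """Cluster distance = distance of the closest pair of members."""
    return min(squared_distance(vectors[a], vectors[b])
               for a in cluster_a for b in cluster_b)


def hierarchical_clustering(vectors, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: single_linkage(
            clusters[p[0]], clusters[p[1]], vectors))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j is always greater than i
    return [sorted(cluster) for cluster in clusters]


groups = hierarchical_clustering(EMBEDDINGS, n_clusters=2)
```

If each image also carried coordinates, the resulting group label could be used to colour its point on a map, which is one way to explore whether image content corresponds to geolocation.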
