October 2013

Machine Learning for Language Technology

Lecture 7:
Learning from Massive Datasets
Marina Santini, Uppsala University
Department of Linguistics and Philology
2. Outline
- Watch the pitfalls
- Learning from massive datasets
  - Data Mining
  - Text Mining – Text Analytics
  - Web Mining
  - Big Data
- Programming Languages and Frameworks for Big Data
- Big Textual Data & Commercial Applications
- Events, MeetUps, Coursera

3. Practical Machine Learning
4. Data Mining
Data mining is the extraction of implicit, previously unknown and potentially useful information from data (Witten and Frank, 2005).

5. Watch out!
Machine Learning is not just about:
1. Finding data and blindly applying learning algorithms to it
2. Blindly comparing machine learning methods
Always consider:
1. Model complexity
2. Representativeness of the training data distribution
3. Reliability of the class labels
Remember: Practitioners' expertise counts!
6. Massive Datasets
Space and time.
Three ways to make learning feasible (the old way):
- Small subset
- Parallelization
- Data chunks (see the sketch below)
The new way:
- Develop new algorithms with lower computational complexity
- Increase background knowledge
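To make the "data chunks" idea concrete, here is a minimal sketch of chunk-by-chunk (incremental) learning. It assumes scikit-learn, which the slides do not prescribe, and the corpus file and label format are hypothetical.

```python
# Minimal sketch: learning from a massive text corpus in chunks (assumes scikit-learn;
# the file "massive_corpus.tsv" and its label<TAB>text format are hypothetical).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)  # stateless: no global pass needed
classifier = SGDClassifier()                # supports incremental learning via partial_fit
classes = [0, 1]                            # all labels must be declared up front for partial_fit

def read_chunks(path, chunk_size=10_000):
    """Yield (texts, labels) chunks from a tab-separated file: label<TAB>text per line."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(int(label))
            texts.append(text)
            if len(texts) == chunk_size:
                yield texts, labels
                texts, labels = [], []
    if texts:
        yield texts, labels

for texts, labels in read_chunks("massive_corpus.tsv"):
    X = vectorizer.transform(texts)                       # each chunk is vectorized independently
    classifier.partial_fit(X, labels, classes=classes)    # the model is updated, the chunk is discarded
```

The design point is that neither the vectorizer nor the learner ever needs the full dataset in memory, which is exactly what the "small subset / data chunks" strategies aim at.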
7. Domain Knowledge
- Metadata
  - Semantic relation
  - Causal relation
  - Functional dependencies

8. Text Mining
- Actionable information
- Comprehensible information
- Problems
- Text Analytics

9. Definition: Text Analytics
A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.
10. Set of NLP (Natural Language Processing) techniques
Common components of a text analytic package are:
- Tokenization
- Morphological Analysis
- Syntactic Analysis
- Named Entity Recognition
- Sentiment Analysis
- Automatic Summarization
- Etc. (a minimal pipeline sketch follows below)
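As a rough illustration of what such a package does, the sketch below runs tokenization, part-of-speech tagging and named entity recognition with NLTK (one of the open-source frameworks listed later in this lecture). The example sentence is made up, and the exact resources to download depend on your NLTK version.

```python
# Minimal sketch of a text-analytics pipeline using NLTK.
# Requires one-off downloads, e.g.:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("maxent_ne_chunker"); nltk.download("words")
import nltk

text = "Merrill Lynch estimates that 85 percent of business information is unstructured."

tokens = nltk.word_tokenize(text)      # tokenization
tagged = nltk.pos_tag(tokens)          # morphological / part-of-speech analysis
entities = nltk.ne_chunk(tagged)       # named entity recognition over the tagged tokens

print(tagged[:5])    # e.g. [('Merrill', 'NNP'), ('Lynch', 'NNP'), ...]
print(entities)      # a tree in which spans such as 'Merrill Lynch' are grouped as entities
```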
11. NLP at Coursera (www.coursera.org)

12. NLP is pervasive
Ex: spell-checkers
- Google Search
- Google Mail
- Facebook
- Office Word
- […]

13. NLP is pervasive
Ex: Named Entity Recognition
- Opinion mining
- Brand trends
- Conversation clouds on web magazines and online newspapers…

14. Sentiment Analysis
15. Text Analytics Products and Frameworks
Commercial Products:
- Attensity
- Clarabridge
- Temis
- Lexalytics
- Texify
- SAS
- SPSS
- IBM Cognos
- etc.
Open Source Frameworks:
- GATE
- NLTK
- UIMA
- openNLP
- etc.
16. However… (I)
NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP fields.
17. Ex: Syntactic Parser
- Connexor
What about parsing a tweet? (A tokenization sketch for this tweet follows below.)
"My son, Ky/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair" (Twitter Tutorial 1: How to Tweet Well)
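Parsers trained on edited text often stumble on tweets; a first practical step is a tokenizer built for that register. The sketch below runs NLTK's TweetTokenizer on the tweet quoted above, purely as an illustration of pre-processing, not of full syntactic parsing.

```python
# Minimal sketch: tokenizing the tweet quoted on this slide with a tweet-aware tokenizer (NLTK).
from nltk.tokenize import TweetTokenizer

tweet = ("My son, Ky/o, asked me for the first time today how my DAY was . . . "
         "I about melted. Told him that I had pizza for lunch. Response? No fair")

tokenizer = TweetTokenizer(preserve_case=True, reduce_len=True)
print(tokenizer.tokenize(tweet))
# TweetTokenizer is designed for noisy user-generated text (emoticons, elongated words, etc.),
# but deciding whether 'fair' means 'just' or 'funfair' still needs context no tokenizer provides.
```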
18. Why NLP and Text Analytics for Text Mining?
Why is it important to know that a word is a noun, a verb or the name of a brand?
Broadly speaking (think about these as features for a classification problem!):
- Nouns and verbs (a.k.a. content words): nouns are important for topic detection; verbs are important if you want to identify actions or intentions.
- Adjectives = sentiment identification.
- Function words (a.k.a. stop words) are important for authorship attribution, plagiarism detection, etc.
- etc. (a feature-extraction sketch follows below)
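A hedged sketch of how these observations become classifier features: count the share of content words, adjectives and function words per document. It uses NLTK's tagger and stopword list; the example sentence is invented, and a real system would feed such vectors into any standard learner.

```python
# Minimal sketch: turning part-of-speech information into features for classification.
# Assumes NLTK resources are installed (punkt, averaged_perceptron_tagger, stopwords).
import nltk
from nltk.corpus import stopwords

FUNCTION_WORDS = set(stopwords.words("english"))

def pos_features(document):
    """Return simple ratio features: nouns/verbs for topic and intent detection,
    adjectives for sentiment, function words for authorship-style tasks."""
    tokens = nltk.word_tokenize(document.lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    n = max(len(tokens), 1)
    return {
        "noun_ratio": sum(t.startswith("NN") for t in tags) / n,
        "verb_ratio": sum(t.startswith("VB") for t in tags) / n,
        "adj_ratio":  sum(t.startswith("JJ") for t in tags) / n,
        "function_word_ratio": sum(tok in FUNCTION_WORDS for tok in tokens) / n,
    }

print(pos_features("The new phone is absolutely wonderful, but the battery dies quickly."))
```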
19. However… (II)
At present, the main pitfall of many NLP applications is that they are not flexible enough to:
- Completely disambiguate language
- Identify how language is used in different types of documents (a.k.a. genres). For instance, language is used differently in tweets than in emails, and the language of emails differs from the language of academic papers, etc.
Often, tweaking NLP tools for different types of text, or solving language ambiguity in an ad-hoc manner, is time-consuming, difficult and unrewarding…
20. What for?
- Text summarization
- Document clustering
- Authorship attribution
- Automatic metadata extraction
- Entity extraction
- Information extraction
- Information discovery
- ACTIONABLE INTELLIGENCE
21. Actionable Textual Intelligence
Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Intelligence
Actionable Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! gives the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!
22. Big Data
BIG DATA [Wikipedia]:
Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.
Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.
23. Big Unstructured TEXTUAL Data
Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice.
"Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages." [DM Review Magazine, February 2003 Issue]
ECONOMIC LOSS!
24. Simple search is not enough…
Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms.
"A search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies." [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization", from "Managing Unstructured Data in the Organization", Prentice Hall 2008, pp. 1–13]
(A query-expansion sketch follows below.)
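One inexpensive way to move beyond single-term search is to expand the query with lexical knowledge. The sketch below uses WordNet through NLTK to collect hyponyms of "felony". It is only an illustration of query expansion, the exact terms returned depend on the WordNet version, and Inmon & Nesavich do not prescribe this method.

```python
# Minimal sketch: expanding the query term 'felony' with hyponyms from WordNet (via NLTK),
# so that documents mentioning specific felonies can also be retrieved.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def expand_with_hyponyms(term):
    """Collect the term plus lemma names of all (transitive) hyponyms of its noun senses."""
    expanded = {term}
    for synset in wn.synsets(term, pos=wn.NOUN):
        for hyponym in synset.closure(lambda s: s.hyponyms()):
            expanded.update(lemma.replace("_", " ") for lemma in hyponym.lemma_names())
    return sorted(expanded)

query_terms = expand_with_hyponyms("felony")
print(query_terms)   # may include more specific crimes, depending on the WordNet version
```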
25. Programming languages and frameworks for big data

26. R (http://www.r-project.org/)
R is a statistical programming language. It is a free software programming language and a software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years. (Wikipedia)
28. MeetUps: R in Stockholm

29. Can R help out?
Can R help overcome NLP shortcomings and open a new direction in order to extract useful information from Big TEXTUAL Data?
30. Existing literature for linguists
- Stefan Th. Gries (2013). Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New edition.
- Stefan Th. Gries (2009). Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website).
- Harald R. Baayen (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge.
- ….
31. Companion website by Stefan Th. Gries
BNC = British National Corpus (PoS tagged)

32. BNC
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.
The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

33. R & the BNC: Excerpt from Google Books
34. What about Big Textual Data?
- Non-standardized language
- Non-standard texts
- Electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.
35. Non-distributed systems
Open Source:
- R
- Scala (also distributed systems)
- Rapid Miner
- Weka
- …
Commercial:
- SPSS
- SAS
- MatLab
- …
Note: The name Scala is a portmanteau of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.
36. From The Economist: The Big Data scenario

37. Commercial applications for Big Textual Data
- Recorded Future  web intelligence (anticipating emerging threats, future trends, anticipating competitors' actions, etc.)
- Gavagai  large-scale textual analysis (prediction and future trends)
38. Thanks to Staffan Truffe' for the following slides

39. Size

40. In a few pictures…

41. Metrics, structure and time

42. Metric

43. Structure

44. Time

45. Facts

46. Pipeline

47. Multi-Language

48. Text Analytics

49. Predictions
50. Gavagai
- Jussi Karlgren (PhD in Stylistics in Information Retrieval)
- Magnus Sahlgren (PhD thesis in distributional semantics)
- Fredrick Olsson (PhD thesis in Active Learning)
(co-workers at SICS)
The indeterminacy of translation is a thesis propounded by 20th-century American analytic philosopher W. V. Quine. Quine uses the example of the word "gavagai" uttered by a native speaker of the unknown language Arunta upon seeing a rabbit. A speaker of English could do what seems natural and translate this as "Lo, a rabbit." But other translations would be compatible with all the evidence he has: "Lo, food"; "Let's go hunting"; "There will be a storm tonight" (these natives may be superstitious)… (Wikipedia)
51. Ethersource presented
Thanks to F. Olsson for the following slides

52. Associations

53. Language is flux

54. Learning from use

55. Scope

56. Architecture

57. Web vs printed world

58. Noise…

59. Multi-linguality

60. SICS
Watch the videos!

61. Big Data MeetUp, Stockholm

62. BIG DATA communities
63. Future Directions in Machine Learning for Language Technology
- Deluge of data
- Little linguistic analysis in the realm of big-data real-world platforms and applications
- Top-down systems cannot efficiently deal with the irregularity and unpredictability of big textual data
- Data-driven systems can make it. However, …we know that computers are not at ease with the natural languages used by humans, unless they learn how to learn the linguistic structure underlying natural language from data…
64. For a data-driven approach…
- Annotated datasets, which are needed for completely supervised machine learning, are costly, time-consuming and require specialist expertise.
- Is complete supervision even thinkable when we talk about tera-, peta- or yottabytes? How big should the training set then be?
- Alternative solutions:
  - Semi-supervised methods (combination of labelled and unlabelled data; see the sketch after this slide)
  - Weakly supervised methods (human-constructed rules are typically used to guide the unsupervised learner)
  - Unsupervised learning results still cannot compete with supervised learning in many tasks…
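To make the semi-supervised idea concrete, here is a minimal self-training sketch: a classifier trained on a few labelled documents repeatedly labels the unlabelled pool and keeps only its most confident guesses. The slides name no specific algorithm or library, so the use of scikit-learn/scipy and the tiny invented corpus are assumptions of this sketch.

```python
# Minimal self-training sketch: combine a few labelled documents with unlabelled ones.
# Library (scikit-learn/scipy) and data are illustrative assumptions, not from the lecture.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["great phone , love it", "terrible battery , awful", "wonderful screen", "broken on arrival"]
labels = np.array([1, 0, 1, 0])                           # 1 = positive, 0 = negative
unlabelled = ["love the camera", "awful customer support", "the screen is great", "battery died , terrible"]

vectorizer = CountVectorizer()
X_all = vectorizer.fit_transform(labelled + unlabelled)   # one shared vocabulary
X_l, X_u = X_all[: len(labelled)], X_all[len(labelled):]

model = LogisticRegression()
for _ in range(3):                                        # a few self-training rounds
    model.fit(X_l, labels)
    if X_u.shape[0] == 0:
        break
    probs = model.predict_proba(X_u)
    confident = np.where(probs.max(axis=1) > 0.8)[0]      # keep only confident pseudo-labels
    if confident.size == 0:
        break
    X_l = vstack([X_l, X_u[confident]])
    labels = np.concatenate([labels, probs[confident].argmax(axis=1)])
    keep = np.setdiff1d(np.arange(X_u.shape[0]), confident)
    X_u = X_u[keep]

print(model.predict(vectorizer.transform(["the battery is awful"])))   # likely [0] with this toy data
```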
65. A new way to explore: Incomplete Supervision
Relies on partially labelled data:
"Human experts — or possibly a crowd of laymen — annotate text with some linguistic structure related to the structure that one wants to predict. This data is then used for partially supervised learning with a statistical model that exploits the annotated structure to infer the linguistic structure of interest." (p. 4)
66. Example
"…it is possible to construct accurate and robust part-of-speech taggers for a wide range of languages, by combining (1) manually annotated resources in English, or some other language for which such resources are already available, with (2) a crowd-sourced target-language specific lexicon, which lists the potential parts of speech that each word may take in some context, at least for a subset of the words. Both (1) and (2) only provide partial information for the part-of-speech tagging task. However, taken together they turn out to provide substantially more information than either taken alone." (p. 46)
Oscar Täckström, "Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision", PhD Thesis, Uppsala University, 2013 (http://soda.swedish-ict.se/5513/)
(A toy sketch of a lexicon-constrained tagger follows below.)
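The following toy sketch only illustrates the general idea of a type-level, crowd-sourced tag lexicon constraining a tagger's choices; it is not Täckström's model, and the lexicon, tag priors and example sentence are invented.

```python
# Toy sketch of incomplete supervision: a crowd-sourced lexicon lists the tags each word
# MAY take; coarse tag statistics (e.g. estimated from a resource-rich language) pick
# among the allowed tags. Purely illustrative; not the model described in the thesis.

# (1) Partial information: hypothetical coarse tag priors.
tag_prior = {"NOUN": 0.30, "VERB": 0.20, "DET": 0.15, "ADP": 0.13, "ADJ": 0.12, "PRON": 0.10}

# (2) Partial information: hypothetical crowd-sourced target-language lexicon of permitted tags.
lexicon = {
    "den":    {"DET", "PRON"},
    "stora":  {"ADJ"},
    "hunden": {"NOUN"},
    "sover":  {"VERB"},
}

def tag(sentence):
    """Pick, for each word, the highest-prior tag among those the lexicon allows."""
    tagged = []
    for word in sentence.split():
        allowed = lexicon.get(word, set(tag_prior))          # unknown words: any tag allowed
        best = max(allowed, key=lambda t: tag_prior.get(t, 0.0))
        tagged.append((word, best))
    return tagged

print(tag("den stora hunden sover"))
# [('den', 'DET'), ('stora', 'ADJ'), ('hunden', 'NOUN'), ('sover', 'VERB')]
```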
67. Conclusions
- This course is an introduction to "Machine Learning for Language Technology".
- You get a flavour of the problems we come across when devising models for enabling machines to analyse and make sense of natural human language.
- The next big big big step is to bring as much linguistic awareness as possible into big data.

68. Reading
Witten and Frank (2005), Ch. 8

69. Thanks for your attention!


Editor's Notes

  • #10: We need tools to analyse this huge amount of textual data and extract the information we need.
  • #13: Orthographic check: is something written correctly or not? Vital for searching.
  • #14: What is a named entity? Names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, …
  • #15: If you try with longer texts or with another genre, results are not reliable.
  • #16: Business intelligence (BI) is the ability of an organization to collect, maintain, and organize data. This produces large amounts of information that can help develop new opportunities. Identifying these opportunities, and implementing an effective strategy, can provide a competitive market advantage and long-term stability. BI technologies provide historical, current and predictive views of business operations. Customer Experience Management (CEM) is the practice of actively listening to the Voice of the Customer through a variety of listening posts, analyzing customer feedback to create a basis for acting on better business decisions and then measuring the impact of those decisions to drive even greater operational performance and customer loyalty. Through this process, a company strategically organizes itself to manage a customer's entire experience with its product, service or company. Companies invest in CEM to improve customer retention.
  • #18: A tweet: "My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair." Language is highly ambiguous. Fair = reasonable and acceptable // treating everyone equally. Fair = a form of outdoor entertainment, at which there are large machines to ride on and games in which you can win prizes // an event at which people or businesses show and sell their products. Play fair: to do something in a fair and honest way.
  • #22: Information discovery is too vague.
  • #23: A problem of size + a problem of diverse data = heterogeneous data. Radio-frequency identification (RFID).
  • #24: Much effort has been allocated to improving big native numeric data: balance sheets, income reports, financial and business reports, etc. Merrill Lynch – financial management and advisory (www.ml.com): Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services. E-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations, etc. are different genres, i.e. different types of text. For example, emails and white papers are both textual genres but they differ a lot from each other. They might deal with the same topic, but in a completely different way. So the type of information related to the same topic can vary according to genre.
  • #25: Felony = any grave crime, such as murder, rape, or burglary…
  • #31: Professor of Linguistics, Department of Linguistics, University of California, Santa Barbara.
  • #32: N-grams; average sentence and word length; indexing; split infinitives.
  • #33: Stockholm-Umeå Corpus (Joakim).
  • #34: Descriptive statistics, analytical statistics, multifactorial methods. Token/type ratio: the type-token ratio (TTR) is a measure of vocabulary variation within a written text or a person's speech. The type-token ratios of two real-world examples are calculated and interpreted. The type-token ratio is shown to be a helpful measure of lexical variety within a text. It can be used to monitor changes in children and adults with vocabulary difficulties. Tokens are the number of words; several of these tokens are repeated. For example, the token "again" occurs two times, the token "are" occurs three times, and the token "and" occurs five times. Of the total of 87 tokens in this text there are 62 so-called types. The relationship between the number of types and the number of tokens is known as the type-token ratio (TTR). For Text 1 above we can now calculate this as follows: Type-Token Ratio = (number of types / number of tokens) * 100 = (62/87) * 100 = 71.3%. The more types there are in comparison to the number of tokens, the more varied the vocabulary, i.e. there is greater lexical variety. (A small code sketch of this calculation follows these notes.) http://www.speech-therapy-information-and-resources.com/type-token-ratio.html
  • #37: http://youtu.be/qqfeUUjAIyQ
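A minimal sketch of the type-token ratio calculation described in note #34 above; the example sentence is invented, and real studies normalise for text length before comparing TTRs.

```python
# Minimal sketch of the type-token ratio (TTR) described in the editor's note above.
def type_token_ratio(text):
    tokens = text.lower().split()             # naive whitespace tokenization
    types = set(tokens)
    return 100.0 * len(types) / len(tokens)   # percentage, as in the note: types/tokens * 100

sample = "the dog chased the cat and the cat chased the dog"
print(round(type_token_ratio(sample), 1))     # 45.5: 5 types over 11 tokens
```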