Search in Research, Let’s Make it More Complex!
Collaboratively Looking Under the Hood and Its Consequences
Marijn Koolen
Humanities Cluster - Royal Netherlands Academy of Arts and Sciences
CLARIAH Media Studies Summer School
Netherlands Institute for Sound and Vision, 3 July 2018
Overview
1. Search in Research
a. Search as part of research process
b. Search vs. other access methods
2. Search, Retrieval and Ranking
a. Retrieval Systems, Ranking Algorithms and Relevance Models
3. Searching in Digital Collections
a. Understanding (digital) collections and their construction
b. Tool analysis through experimentation
4. Search Strategies and Corpus Building
a. Systematic searching
b. Search strategies and sampling
1. Search in Research
Research Process
● Research Phases
○ Exploration, gathering, analysis, synthesis, presentation
○ Extremely non-linear (affordance of digital realm)
● Search happens throughout research process
○ Search phases: pre-focus, focus, post-focus
○ Use different types of collections and search engines
■ General purpose search engines,
■ Domain- and collection-specific (e.g. GLAMs: galleries, libraries, archives, museums),
■ Personal/private (offline) collections
○ Search strategies:
■ Ad hoc or systematic: berrypicking (Bates 1989), keyword harvesting (Burke 2011), …
■ Important for data and tool criticism
Search Engine as Mediator
● For many online materials access is limited to search interface
○ Browsing is guided by available structure
■ Drill down via facets
■ Navigate via metadata fields (if enabled)
○ Without (relevant) structure, direct search is the only practical alternative
● Searching as exploration
○ How does search engine provide overview?
■ How big is collection?
■ How is collection structure communicated?
■ What (meta)data is available?
■ How are search characteristics explained?
■ How are search results summarised?
Browsing vs. Keyword Searching
● Browsing takes you past materials you did not set out to find:
○ Navigating your way to relevance
○ Impresses on you what else there is (see also Putnam 2016)
● Keyword search tends to focus on relevance
○ Pushes back related/nearby materials
○ Collection structure can be exposed through facets to restore overview
● Search and research methodology
○ Impact of digital keyword search needs to be reflected in methodology
○ How do you account for search process in scholarly communication?
■ Method of citation is based on analogue browse/search in archives and libraries
■ Pre-focus to focus: switch between ad hoc and systematic?
■ Non-linearity: exploration never stops, assumptions constantly challenged
Keyword Search and “Confronting the Digital”
'To take a single example of this disconnect between research process and representation, many of us use and cite eighteenth and nineteenth-century newspapers as simple hard-copy references without mention of how we navigated to the specific article, page and issue. In doing so, we actively misrepresent the limitations within which we are working.' (Hitchcock 2013, 12)
'This is not only about being explicit about our use of keyword searching - it is about moving beyond a traditional form of scholarship to data modelling and to what Franco Moretti calls “distant reading”.' (Hitchcock 2013, 19)
Information Search and Seeking
● Search takes place in context
○ Part of information seeking, and of overall information behaviour (Wilson)
○ As information behaviour changes (phases), so do seeking and search behaviour
● Reflection-in-action
○ When and where are choice points?
○ How do search actions relate to strategy and information need?
Digital Tool Criticism
Search and Accountability
● What should scholars account for?
○ Aspects of sources, tools and process
● Digital source criticism
○ How to evaluate digital sources (Fickers 2012)
○ Who made the digital source, when, why, what for, and how?
● Digital tool criticism
○ How to evaluate impact of digital tools (Koolen et al. 2018)
○ Reflection-in-action, experimentation
● Data Scopes
○ How to communicate research process to others (Hoekstra & Koolen 2018)
○ Discuss process of selection, modelling, normalization, linking, classification
2. Search, Retrieval and Ranking
Anatomy of Retrieval Process
Retrieval - Matching and Similarity
● Matching based on user query
○ Query: free text, controlled facet, example (doc, AV or text)
○ Matching docs returned in certain order (non-matching are not retrieved)
■ How does search engine perform matching (esp. for free text and example)?
■ Potentially many objects match query: does order matter?
● Similarity
○ Degree of matching: some match better than others (notion of similarity)
■ Retrieve most similar documents first (ranking)
○ Similar how? Does interface explain?
● Retrieval and ranking
○ Retrieval: which matching documents are returned to the user as results?
○ Ranking: in which order are the results returned?
Retrieval, Ranking and Relevance
● Retrieval results form a set
○ Can be ordered or unordered (e.g. SQL or SPARQL query)
■ Even unordered sets need to be presented to the user in some order
○ Criteria for ordering: alphabetic, size, recency, popularity (views, likes, citations, links)
■ Ordering re-organizes materials, temporarily disrupts “original” organization
■ Provides different view on materials
● Many systems perform relevance ranking
○ Relevant to whom, or to what?
■ Query: document similarity scores
■ User: e.g. search history, preferences
■ Situation: user, location, time, device, query, work context (page views, annotations)
■ Other aspects: quality, diversity, controversy, polarity, exploration/exploitation, ...
Algorithmic Interpretation of Relevance
● How does an algorithm understand the notion of relevance?
○ Statistical interpretation:
■ Generally: frequent words carry less signal, look for unexpected stuff
■ Many ways of scoring signal
○ TF-IDF:
■ Term Frequency in document (relevance of term in document)
■ Inverse of Document Frequency in collection (commonness of term across docs)
○ Probabilistic Language Model (PLM):
■ Probability of picking term from document as bag of words (relevance of term in doc)
■ Probability of picking term from collection as bag of words (commonness of term)
○ Many other relevance models, e.g. BM25, DFR, SDM, …
■ Different interpretations of relevance, hence different rankings
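To make the difference concrete, here is a minimal self-contained sketch (toy documents and invented counts, not the Media Suite's actual scoring) that ranks three documents with TF-IDF and with a Jelinek-Mercer smoothed language model:

```python
# Toy illustration of two relevance models; documents and counts are invented.
import math
from collections import Counter

docs = {
    "d1": "voetbal uitslagen eredivisie voetbal".split(),
    "d2": "voetbal".split(),
    "d3": "nieuws politiek economie cultuur sport".split(),
}
N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))   # document frequency
coll = Counter(t for words in docs.values() for t in words)      # collection frequency
coll_size = sum(coll.values())

def tfidf(query, words):
    """Sum of term frequency times inverse document frequency."""
    tf = Counter(words)
    return sum(tf[t] * math.log(N / df[t]) for t in query if df[t])

def plm(query, words, lam=0.5):
    """Log-probability of the query under a smoothed document language model."""
    tf = Counter(words)
    score = 0.0
    for t in query:
        p = lam * tf[t] / len(words) + (1 - lam) * coll[t] / coll_size
        if p == 0:
            return float("-inf")  # query term absent from the whole collection
        score += math.log(p)
    return score

query = ["voetbal"]
for name, words in docs.items():
    print(name, round(tfidf(query, words), 3), round(plm(query, words), 3))
```

TF-IDF ranks the longer d1 first (two occurrences of the query term), while the length-normalized language model prefers the one-word d2: same query, two interpretations of relevance, two different rankings. This also previews the document-length issue on the next slide.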
Ranking Issues
● Document length
○ TF-IDF doesn’t model document length, favours longer documents
○ PLM explicitly normalizes on document length, favours shorter documents
○ Upshot: Delpher API returns short documents first for short queries
● Document priors: are all documents equal or not?
○ Can use document prior probability (independent of query)
○ Can favour documents that are more popular, recent, authoritative, …
○ Can favour documents that are more appropriate for situation (location, time of day, …)
● Problem: how do you know how search engine scores relevance?
○ How much should you know about it?
○ Many GLAM search engines have relatively straightforward relevance models, no doc priors
○ Google uses many hundreds of features for document, query, user and situation
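A minimal sketch of how a query-independent prior can be folded into a log-scale relevance score; the popularity measure (view counts) and the add-one smoothing are invented for illustration:

```python
import math

def score_with_prior(log_relevance, views, total_views):
    # Hypothetical popularity prior: more-viewed documents get a boost,
    # independent of the query; add-one smoothing avoids log(0).
    log_prior = math.log((views + 1) / (total_views + 1))
    return log_relevance + log_prior
```

Given two documents with equal query relevance, the more-viewed one now ranks first.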
Relevance in Metadata Records
● Relevance ranking of metadata records
○ Metadata records are peculiar textual representations
■ Minimal amount of text, low redundancy
■ Majority of terms occur only once
○ Which part of TF-IDF contributes more to the score of a metadata record? (see the sketch after this list)
○ Which fields are useful/used for matching?
● NISV collection
○ Search engine indexes metadata records
■ Some records have lengthy itemized descriptions, others do not
■ Some have transcripts, others do not
○ Consequences for retrieving? And for ranking?
■ How does search engine handle this?
■ How does search engine communicate this?
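As for the TF-IDF question above: when almost every term in a record occurs exactly once, the TF factor is constant across terms and IDF dominates, so a record's score is roughly the sum of the IDF values of the query terms it contains. A toy illustration with invented document frequencies (not NISV figures):

```python
import math

N = 1_000_000                                # hypothetical number of records
df = {"aanslag": 800, "amsterdam": 45_000}   # invented document frequencies
for term, d in df.items():
    print(term, round(math.log(N / d), 2))   # the rarer term contributes far more
```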
Retrieving and Ranking Audiovisual Materials
● Hard to match keywords against AV signal directly
○ Option: use text representation for AV document
■ E.g. metadata, description, script, speech transcript, ...
○ Option: use AV representation of query
■ E.g. example document or user recording
■ Use audio or visual similarity (again, similar how?)
Opaqueness of Interfaces and Experimentation
● Experiment to understand search functionalities
○ How can you find out if multiple search terms are treated with Boolean AND or OR operators? (see the probe sketch after this list)
○ How can you find out if terms are stemmed/normalized?
● Phrase search:
○ What happens when you use quotation marks to group terms into a phrase?
○ How do the results compare to those using no quotation marks?
● Proximity search:
○ Can you specify that terms should be near each other?
● Fuzzy search: wildcard and edit distance searches
○ Controlling lexical variation vs. uncontrolled wildcard search
○ voetbal+voetballen vs. voetbal* (matches voetbalvereniging, voetbalveld, ...)
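Taking the first question as an example, AND/OR behaviour can often be inferred from result counts alone. The sketch below assumes a hypothetical result_count(query) helper (e.g. you reading the hit count off the result page) and uses set-theoretic bounds: an AND result is a subset of each single-term result set, an OR result a superset of both:

```python
def probe_boolean(result_count):
    """Infer whether a two-term query behaves like AND or OR from hit counts."""
    a = result_count("voetbal")
    b = result_count("wielrennen")
    ab = result_count("voetbal wielrennen")
    if ab <= min(a, b):
        return "consistent with AND (intersection)"
    if ab >= max(a, b):
        return "consistent with OR (union)"
    return "neither bound holds: stemming or fuzzy matching may interfere"
```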
Exercise
● Experiment with the Search and Compare tools of the CLARIAH Media Suite
○ Find out if stopwords are removed
○ Find out if words are stemmed/normalized
○ Find out how multi-word queries are interpreted, i.e. as AND or OR
○ Find out how standard search operators work
■ Boolean AND, OR and NOT
■ Quotation marks for phrases
3. Searching in Digital Collections
Nature of Digital Collections
● Collections of GLAMs are often built up over decades
○ Based on aims and selection criteria
■ Rarely "complete", dependent on availability of materials
○ Digital access via digitization, or digital archiving (born-digital)
■ Some things are lost in this process (e.g. context, quality, …)
● Heterogeneity: mix of object/source types (sub-collections)
○ Different modalities, different ways of accessing and presenting
■ Text vs. Image vs. AV vs. 3D (or 4D)
Nature of Metadata
● Digital access via metadata
○ Metadata: data about the object/source
○ Types: formal, structural, technical, administrative, aboutness
○ Metadata fields allow selection and search via specific fields
■ Title, description, creator, creation date, genre, …
○ Allows (seemingly) uniform access to heterogeneous collections
■ But, different materials have different aspects to describe
■ Edition is relevant for books and films, not so much for paintings
● Metadata creation process
○ Often done with limited time, information and system flexibility
○ Inherently subjective, especially content analysis
● Size matters
○ Requirements change as size of collection grows (also depends on expectations)
Archival Structure and NISV Audiovisual Collection
● Hierarchical organization
○ 4 levels
■ Series: De Wereld Draait Door
■ Season: De Wereld Draait Door 2016
■ Program: De Wereld Draait Door 21-06-2016
■ Segment: De Wereld Draait Door 21-06-2016 (item within that broadcast)
○ Each level has a metadata record (with overlap in fields, e.g. title)
● Follows archival standard
○ Describe aspect at highest relevant level
○ Don’t repeat at lower levels unless it deviates (e.g. main titles)
○ Fonds: aggregation of documents from same origin
Digital Source and Data Criticism
● Power of the archive
○ Problem of perspective (from archive-as-source to archive-as-subject, Stoler 2002)
● History of the archive
○ Collections created over decades often go through changes in
■ selection criteria, cataloguers (human or algorithm),
■ cataloguing budgets, policies, rules, practice and vocabularies,
■ software (migrations and updates), hardware,
■ institutional mission, societal attitudes, …
○ Most of these aspects remain undocumented or partially documented
● Consequences
○ Almost inherently incomplete, inconsistent and sometimes necessarily incorrect
○ After many years, it's hard to retrace what happened
■ and how it affects access, selection and analysis
Metadata in theory vs. metadata in practice (slide image, source: Jaap Kamps)
Combined Collections
● Several portals combine (heterogeneous) collections
○ Examples:
■ Europeana, European Newspapers, EUscreen, Nederlab, Delpher, Online Archive of California, …
○ Worldwide aggregated collections:
■ ArchiveGrid (1000+ archives): over 5M finding aids
■ WorldCat (72,000 libraries): 400M records, 2.6B assets, 100M persons
● Huge challenge for source criticism as well as search
○ Collections vary in size, provenance, selection criteria, metadata policies, interpretation and richness
○ Heterogeneous metadata schemas have been mapped to single schema
■ Causes problems for interpretation
■ E.g. what does creator mean for paintings, films, tv series, letters, advertisements, ...?
Assessing Metadata Quality
● Questions
○ What are pitfalls in relying on metadata?
○ How can we evaluate metadata quality?
○ What are relevant aspects to consider?
● Collection inspection
○ In the CLARIAH Media Suite we created a tool for inspecting metadata
■ Esp. useful for complex collections like NISV audiovisual collection
■ Somewhat ad hoc, please feel encouraged to give feedback!
○ Please go to the Media Suite and open the Collection Inspector tool
■ Click on “select field to analyse” and let the interface load the data on completeness (this will take a while)
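In spirit, field completeness is a simple ratio. A rough sketch of the computation on records represented as dicts (structure assumed for illustration, not the Media Suite's actual format):

```python
def completeness(records, field):
    """Fraction of records in which `field` has a non-empty value."""
    filled = sum(1 for r in records if r.get(field) not in (None, "", []))
    return filled / len(records)

records = [{"genre": "nieuws"}, {"genre": ""}, {"title": "Journaal"}]
print(f"genre: {completeness(records, 'genre'):.0%}")  # -> genre: 33%
```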
Assessing Timelines and Other Visualizations
● Timeline visualizations give view of temporal spread
○ Very difficult to interpret properly
● Issues with absolute frequencies:
○ Collection materials not evenly distributed
○ Need to compare query-specific distribution to the collection distribution
● Issues with relative frequencies:
○ Incompleteness not evenly distributed (use collection inspector)
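A minimal sketch of the absolute-vs-relative point, with invented yearly counts: normalize the query's hits by the collection's own yearly totals before reading a trend off a timeline.

```python
hits_per_year = {1970: 120, 1980: 300, 1990: 900}            # invented query hits
coll_per_year = {1970: 10_000, 1980: 50_000, 1990: 300_000}  # invented collection sizes

for year, hits in sorted(hits_per_year.items()):
    print(year, hits, f"{hits / coll_per_year[year]:.2%}")
# Absolute counts triple each decade, yet the relative frequency halves
# (1.20% -> 0.60% -> 0.30%): the apparent growth is collection growth.
```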
Retrievability and Metadata Characteristics
● Different types of metadata fields
○ Controlled vocabulary: e.g. broadcast channel (radio or tv)
○ Number: number of episodes/seasons/segments
○ Time/date: program length, recording date
○ Free keyword/keyphrase: title, person name (tend to be non-unique)
○ Free text: description, summary, transcript, … (tend to be unique)
● Different types allow different forms of retrieval and ranking
○ Long text fields have more terms, with higher frequencies
■ Some types of programs have longer descriptions/transcripts
■ These match more queries, so higher chance of being retrieved
■ Impact of long text fields on ranking depends on relevance model!
○ Repeated values allow aggregation, navigation
Metadata and Search Facets
● Some search interfaces offer facets to narrow down search results
○ E.g. broadcaster and genre in the CLARIAH Media Suite
○ Facets provide overview, afford focusing through selection
● How do facets work?
○ Based on metadata fields: rich schema has rich options for facets
○ Types of metadata fields: controlled vocab, number, date, keyword/phrase, free text
■ Facets work for fields with a limited range of values, so not for free text fields
○ Long tails in facets: typically, few high frequency, many low frequency values
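Conceptually, a facet is just a value count over one metadata field. A toy sketch with invented genre values, which also shows the long tail:

```python
from collections import Counter

genres = ["nieuws", "nieuws", "nieuws", "sport", "sport",
          "documentaire", "quiz", "cabaret"]   # invented field values
for value, count in Counter(genres).most_common():
    print(value, count)
# nieuws 3, sport 2, then a tail of values occurring only once
```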
Exercise
● Experiment with the Collection Inspector of the CLARIAH Media Suite
○ Try out the collection inspector:
■ Scroll through the list of fields to get an idea of what is available
■ Look at completeness of fields for e.g. “genre”, “keywords” and “awards”
■ Which metadata fields are relatively complete?
■ At which archival levels are they most complete?
● Explore which fields are available and which fields make good facets
○ Explore facet distributions in entire collection and for specific queries
4. Search Strategies and Corpus Building
Searching for Corpus Building
● Importance of selection criteria
○ Do you have to hand pick each document?
○ Or can you select sets based on matching criteria?
○ Is representativeness important? If so, representativeness of what?
○ Or completeness? Why?
● Exploiting facets and dates
○ Filtering: align facets/dates with research focus
○ Sampling: compare across facets
■ Which facet types can you use?
○ Sampling strategies
■ Sample per facet/year (e.g. X items per facet/year)
■ Within facets, select random or not
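A sketch of facet/year-stratified sampling along these lines, assuming records as dicts with the facet value under `key` (record structure invented for illustration):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_stratum=5, randomize=True, seed=42):
    """Take up to `per_stratum` items for each distinct value of `key`."""
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for items in strata.values():
        if randomize:
            rng.shuffle(items)
        sample.extend(items[:per_stratum])
    return sample
```

For example, stratified_sample(records, key="year", per_stratum=10) implements "X = 10 items per year"; setting randomize=False instead takes the first X in whatever order the system returns them, which is itself a choice worth documenting.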
Tracking Context in Corpus Building
● Why were certain documents selected?
○ How were they selected?
○ What strategy was used?
○ Documenting helps with understanding and remembering choices
● Do research goals and questions change during collection?
○ Interacting with sources during search updates knowledge structures (Vakkari 2016)
○ Updates tend to be small and incremental, hence barely noticeable
○ Explicit reflection-in-action can bring these to the surface (Koolen et al. 2018)
○ Adding annotations can also provide context
Systematic Searching
● Systematic (comprehensive) search has two factors (Yakel 2010):
○ Search strategy (user)
○ Search functionalities (system)
○ Functionalities shape/affect strategy
● Step 1: systematic search for relevant collections online
○ Different collections/sites offer different search functionalities and levels of detail
○ Explicitly address what consequences this has for your strategy and research goals
● Step 2:
○ Explore individual collections using one or more strategies
○ "Researchers need to be flexible and creative to accommodate the vagaries of cataloging
practices." (Yakel 2010, p. 110)
○ Footnote and reference chasing: references often give an "information scent", suggesting
other collections and items to explore.
Search Strategies
● Web search strategies defined by Drabenstott (2001)
○ Discussed in archive context by Yakel (2010)
● Five strategies
○ Synonym generation
○ Chaining
○ Name collection
○ Pearl growing
○ Successive segmentation
● Somewhat related to information seeking patterns by Ellis (1989)
○ Starting, chaining, browsing, differentiating, monitoring, extracting
Drabenstott’s Strategies (1/2)
● Synonym generation: 1) search with relevant term, 2) close read results to
identify related terms (wordclouds, facets), 3) search via related terms for
synonyms.
● Chaining: follow references/citations (explicit or implicit), identify relevant
subset and use explicit structure to explore connected/related subset
● Name collection: search with keywords, identify relevant names, search with
names, identify related names and keywords, repeat. Similar to keyword
harvesting (Burke 2011).
Drabenstott’s Strategies (2/2)
● Pearl growing: start small and focused with specific search terms, slowly
expand out with additional terms to broader topics/themes
● Successive segmentation: opposite of pearl growing; start broad and
increasingly zoom in and focus; e.g. make queries increasingly specific by
adding (ANDing) keywords, replace broad terms with lower frequency terms,
or select facets
Search Strategies and Research Phases
● Research phase
○ Exploration <-> search phase pre-focus
i. Ad hoc, no need yet for systematic search
ii. Mostly pearl growing and/or successive segmentation to determine focus
○ Analysis <-> search phase focus
i. Switch to systematic, determine strategy
ii. Use chaining, name collection, synonym generation (for coverage/representation,
boundaries)
● But reality resists:
○ (Re)search process is very non-linear
○ Boundary between exploration and analysis is not always clear
○ Late discoveries can prompt or force new directions, ...
When To Stop
● Often switch from exploration to “sorta” systematic search
○ But hard to remember and explain what and how you searched
○ Moreover, difficult to determine when to stop
○ Explicit strategy allows for stopping criteria
● Stopping criteria
○ Check whole set/sample, all available facets, ...
○ Diminishing returns: you increasingly encounter already-seen items, newly relevant material becomes rare
○ When stopping, make explicit (at least for yourself) when and why you stopped
● Meta-strategy:
○ Change strategy/tactics
○ E.g. successive segmentation -> harvest keywords -> switch segment -> harvest keywords, ...
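One way to make the diminishing-returns criterion explicit is to track, per batch of results, the share of items you have not seen before and stop once it falls below a threshold. A minimal bookkeeping sketch (the threshold is invented, not a recommendation):

```python
def should_stop(new_batch, seen, threshold=0.05):
    """True when fewer than `threshold` of this batch's items are new."""
    unseen = [doc_id for doc_id in new_batch if doc_id not in seen]
    seen.update(new_batch)
    return len(unseen) / len(new_batch) < threshold
```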
Wrap Up
● Search in research
○ How to incorporate these processes in research methodology
● Large, heterogeneous collections introduce issues for research
○ Assessing incompleteness of materials
○ Assessing incompleteness, incorrectness and inconsistency of metadata
● Looking under the hood
○ Evaluating information access functionalities (search and browse)
○ Selecting an appropriate search strategy for research goals
○ Determining success/failure of searches
○ Understanding search for corpus building
References
Burke, T. 2011. How I Talk About Searching, Discovery and Research in Courses. Blog post, May 9, 2011.
Drabenstott, K.M. 2001. Web Search Strategy Development. Online, 25(4), pp. 18-25.
Fickers, A. 2012. Towards a New Digital Historicism? Doing History in the Age of Abundance. VIEW Journal of European Television History and Culture, 1(1). http://orbilu.uni.lu/bitstream/10993/7615/1/4-4-1-PB.pdf
Hitchcock, T. 2013. Confronting the Digital - Or How Academic History Writing Lost the Plot. Cultural and Social History, 10(1), pp. 9-23. https://doi.org/10.2752/147800413X13515292098070
Hoekstra, R., M. Koolen. 2018. Data Scopes for Digital History Research. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 51(2).
Koolen, M., J. van Gorp, J. van Ossenbruggen. 2018. Lessons Learned from a Digital Tool Criticism Workshop. Digital Humanities in the Benelux 2018 Conference.
Putnam, L. 2016. The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast. American Historical Review, 121(2), pp. 377-402.
Vakkari, P. 2016. Searching as Learning: A systematization based on literature. Journal of Information Science, 42(1), pp. 7-18.
Yakel, E. 2010. Searching and seeking in the deep web: Primary sources on the internet. In: Working in the Archives: Practical Research Methods for Rhetoric and Composition, pp. 102-118.