Data Science and What It
Means to Library and
Information Science
Jian Qin
School of Information Studies
Syracuse University
iSpeaker Series at Sungkyunkwan University
Seoul, Korea, December 8, 2015
Agenda
• What is data science?
• What is a data scientist?
• What areas of library work can benefit from data
science?
12/8/2015 iSpeaker Series at Sungkyunkwan University, Seoul, Korea
What is data science?
“An emerging area of work
concerned with the collection,
presentation, analysis,
visualization, management, and
preservation of large collections
of information.”
Stanton, J. (2012). Introduction to Data Science.
http://ischool.syr.edu/media/documents/2012/3/DataScienceBook1_1.pdf
The whole lifecycle of data from collection to analysis
to preservation
“We’re increasingly
finding data in the wild,
and data scientists are
involved with gathering
data, massaging it into a
tractable form, making it
tell its story, and
presenting that story to
others.”
Loukides, M. (2011). What is data science? Sebastopol, CA:
O’Reilly.
What is data science?
Gathering and massaging data to tell its story
A systematic enterprise that builds and
organizes knowledge in the form of
testable explanations and predictions.
The study of the generalizable extraction of knowledge
from data, which involves data and statistics or the
systematic study of the organization, properties, and
analysis of data and its role in inference, including our
confidence in the inference.
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12): 64-73.
Why is data science different from
statistics and other existing disciplines?
• Raw material, the “data” part of data science, is
increasingly heterogeneous and unstructured and often
emanating from networks with complex relationships
between the entities.
• Analysis of data requires integration, interpretation, and
sense making that is increasingly derived through tools
from computer science, linguistics, econometrics,
sociology, and other disciplines.
• Data are increasingly generated by computers and for
computer consumption; that is, computers increasingly
do background work for each other and make decisions
automatically.
Dhar, V. (2013). Data science and prediction. Communications of the ACM,
56(12): 64-73, p. 64.
Main fields in data science
What is a data scientist?
• Math skills: statistics and linear algebra
• Computing skills: programming and infrastructure design
• Able to communicate: ability to create narratives around
their work
• Ask the right questions: involves domain knowledge and
expertise, coupled with a keen ability to see the problem,
see the available data, and match up the two.
Analysis of data problems: Story 1
• Domain: Global migration studies
• What’s involved: migrants, refugees, detention centers, refugee
camps, asylums, …
• Data types: interview audio recordings, photos, articles, clippings,
written notes, …
• Analysis software: Atlas.ti, SPSS
• Bottleneck problem:
• difficulty in finding the data by person, interview, and related artifacts and in
transforming the data into analysis software
We’ve got a problem
Researcher: How to use Atlas.ti?
Data scientist: What data do you have?
Data scientist: How do you collect them?
Data scientist: What do you do with the data?
Analysis of data problems: Story 2
• Domain: Thermochronology and tectonics
• Data types: Excel data files (lots of them), spectrum and microscopic images,
annotations
• Analysis: modeling by combining data from multiple data files with specialized
software
• Bottleneck problem:
• manually matching/merging/filtering data is extremely cumbersome and the problem is
compounded by the difficulty finding the right data files
What is involved: workflows in a
research lifecycle
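The matching/merging/filtering bottleneck in Story 2 can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the research group's actual workflow: the column names (`sample_id`, `age_ma`, `uranium_ppm`), the in-memory CSV text standing in for exported Excel sheets, and the 10.0 threshold are all hypothetical.

```python
import csv
import io

# Hypothetical CSV exports standing in for two of the many Excel data files.
AGES_CSV = """sample_id,age_ma
S1,12.5
S2,48.0
S3,7.2
"""
CHEMISTRY_CSV = """sample_id,uranium_ppm
S1,310
S3,95
S4,120
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_on_key(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Keep only samples present in both files.
            merged.append({**row, **match})
    return merged


def filter_younger_than(rows, max_age_ma):
    """Keep rows whose (assumed) age column is below the threshold."""
    return [r for r in rows if float(r["age_ma"]) < max_age_ma]


merged = merge_on_key(read_rows(AGES_CSV), read_rows(CHEMISTRY_CSV), "sample_id")
young = filter_younger_than(merged, 10.0)
print([r["sample_id"] for r in merged], young)
```

Once the join key is explicit, the same two functions can be applied to any number of files, which is the step that is cumbersome to do by hand.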
Analysis of data problems: Story 3
• Domain: collaboration networks in a data repository
• What’s involved: metadata describing DNA sequences
• Data types: semi-structured data in plain text format
• Analysis: identify entities and relationships, build the
data into a database for querying and extraction
• Bottleneck problems:
• Extremely large data sets with multiple entities, which makes
manual processing impossible
• Disambiguation of author names and correctly linking between
entities
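A rough sketch of the Story 3 pipeline (parse semi-structured text, identify entities, load them into a database, and apply a first-pass author-name key) might look like the following. The record layout, the `ACCESSION`/`AUTHORS` tags, the sample names, and the table schema are assumptions made for illustration; the key used here (surname plus first initial) is a deliberately naive stand-in for real disambiguation and will merge distinct people who share a surname and initial.

```python
import re
import sqlite3

# Hypothetical semi-structured records; the field layout is an assumption.
RAW = """\
ACCESSION  AB000001
AUTHORS    Smith,J.A., Lee,K. and Park,S.
//
ACCESSION  AB000002
AUTHORS    Smith,J., Park,S.
//
"""


def parse_records(text):
    """Split on '//' and extract the accession and author list per record."""
    records = []
    for chunk in text.split("//"):
        fields = {}
        for line in chunk.strip().splitlines():
            tag, _, value = line.partition(" ")
            fields[tag] = value.strip()
        if "ACCESSION" in fields:
            authors = re.split(r",\s+|\s+and\s+", fields.get("AUTHORS", ""))
            records.append((fields["ACCESSION"], [a for a in authors if a]))
    return records


def author_key(name):
    """Naive disambiguation key: surname plus first initial, lowercased."""
    surname, _, initials = name.partition(",")
    return f"{surname.strip().lower()},{initials.strip()[:1].lower()}"


def load(records):
    """Store one (record, author) link per row in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE authorship (accession TEXT, author_key TEXT)")
    for accession, authors in records:
        db.executemany(
            "INSERT INTO authorship VALUES (?, ?)",
            [(accession, author_key(a)) for a in authors],
        )
    return db


db = load(parse_records(RAW))
rows = db.execute(
    "SELECT author_key, COUNT(*) FROM authorship "
    "GROUP BY author_key ORDER BY author_key"
).fetchall()
print(rows)
```

Here "Smith,J.A." and "Smith,J." collapse to one key, which shows both why a key is needed and why a naive one is risky at scale.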
Analysis of data problems
Analysis of domain data
Requirement analysis
Workflow analysis
Data modeling
Data transformation needs analysis
Data provenance needs analysis
Analysis of data problems is an
analysis of domain data,
requirements, and workflows
that will lead to the
development of solutions.
Skills required to perform
analysis of domain data problems
Requirement analysis
Workflow analysis
Data modeling
Data transformation needs analysis
Data provenance needs analysis
Interview skills, analysis and generalization skills
Ability to capture components and sequences in workflows
Ability to translate domain analysis into data models
Ability to envision the data model within the larger system architecture
Example 1: modeling research data for
gravitational wave research
1. Understand research lifecycle
2. Workflows: steps and relationships
3. Data flows: what goes in and out at
which step
4. Entities and attributes, relationships
5. Researcher’s practice and habits in
documenting and managing data
Example 2: asking the right question in
mining metadata
Metadata describing
datasets is big data that can
be used to study:
• Collaboration networks
• Scholarly communication patterns
• Research frontiers and trends
• Knowledge transfer
• Research impact assessment
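As one illustration of the first item, a collaboration network can be derived from dataset metadata by counting how often two contributors appear on the same record. The sketch below is a minimal example under assumptions: the record structure, field names, and author names are invented, and real metadata would need name disambiguation first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dataset metadata: each record lists its contributors.
records = [
    {"dataset": "D1", "authors": ["Kim", "Lee", "Park"]},
    {"dataset": "D2", "authors": ["Kim", "Lee"]},
    {"dataset": "D3", "authors": ["Park", "Choi"]},
]


def coauthor_edges(records):
    """Count how often each unordered pair of authors shares a dataset."""
    edges = Counter()
    for record in records:
        for pair in combinations(sorted(set(record["authors"])), 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Number of distinct collaborators per author."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {author: len(n) for author, n in neighbors.items()}


edges = coauthor_edges(records)
degrees = degree(edges)
print(edges.most_common(1), degrees)
```

The edge weights and degree counts are the raw material for the questions listed above, for example who collaborates repeatedly and who bridges otherwise separate groups.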
What areas of library work can
benefit from data science?
Data services and data-driven services
Library
Data services that support research, learning, and policy making (external)
Data-driven services that support library planning, management, and evaluation (internal)
Data literacy training
Data discovery
Data consulting
Data mining
Data collection
Data integration
Data-driven organization
• Consumer internet companies:
• Google, Amazon, Facebook, LinkedIn
• Brick-and-mortar companies:
• Walmart, UPS, FedEx, GE
• “A data-driven organization
acquires, processes, and
leverages data in a timely fashion
to create efficiencies, iterate on
and develop new products, and
navigate the competitive
landscape...”
Is your library
(company, research
center, etc.) a data-driven
organization?
Patil, D.J. & Mason, H. (2015). Data Driven: Creating a Data
Culture. Sebastopol, CA: O’Reilly Media, p. 6.
Data curation
“the active and ongoing management of
data through its life cycle of interest and
usefulness to scholarship, science, and
education. Data curation activities enable
data discovery and retrieval, maintain its
quality, add value, and provide for reuse
over time, and this new field includes
authentication, archiving, management,
preservation, retrieval, and representation.”
–UIUC GSLIS
Data collection
• Build data collections through
• Institutional repositories
• Community repositories
• Developing tools for researchers to submit,
manage, preserve, and discover data
• Develop data collections
• Specialized
• Analysis-ready
• Reusable
• Actionable
• For library service planning, decision
making, and evaluation
• To support policy making, research, and
learning
Data discovery
• Complex data landscape:
• International, national, regional
• Disciplinary, community
• Open access vs. closed access
• Data sources for various purposes:
• Utility data sources: open, reusable
• Census data: open, but need additional
processing/meshing to reach the analysis-ready
state
• Government data: open, reusable, but require
additional processing
• Disciplinary research data: access varies,
require special knowledge to access and use
Data involving human
subjects are under
strict control by law
and are often subject to
additional compliance
requirements.
Data consulting
• Search, locate, and verify data for
particular research purposes
• Plan, design, and implement data
curation and/or data analysis
projects
• Provide training and consulting for
statistical methods and tools
Data mining
• Using internal data:
• Users, uses, expenses, collections, staff
• Goal: improve efficiencies and service
quality
• Using external data:
• Trends and indicators in scholarly
communication, technology, economy, and
culture
• Goal: adjust current services and plan for
new services
Data integration
Data integration is the combination of technical
and business processes used to combine data
from disparate sources into meaningful and
valuable information.
--IBM, http://www.ibm.com/analytics/us/en/technology/data-integration/
A process of understanding, cleansing,
monitoring, transforming, and delivering data,
which offers opportunities to develop data
products as an infrastructure for research,
learning, policymaking, and decision making.
A home buyer’s information integration
What houses for sale under $250K have at least 2 bathrooms, 2
bedrooms, a nearby school ranking in the upper third, in a
neighborhood with below-average crime rate and diverse population?
Information integration
• Realtor
• School rankings
• Crime rate
• Demographics
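The home buyer's question can be expressed as a single query once the four sources are integrated. The sketch below is one possible reading, with every detail assumed for illustration: the table and column names, the sample rows, treating "upper third" as a rank percentile of at least 66.7, and representing "diverse population" with a hypothetical `diversity_index` threshold of 0.5.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical tables standing in for the realtor, school-ranking,
# crime-rate, and demographic sources.
db.executescript("""
CREATE TABLE listings (id INTEGER, price INTEGER, beds INTEGER, baths INTEGER,
                       school TEXT, neighborhood TEXT);
CREATE TABLE schools (name TEXT, rank_percentile REAL);
CREATE TABLE neighborhoods (name TEXT, crime_rate REAL, diversity_index REAL);

INSERT INTO listings VALUES
  (1, 240000, 3, 2, 'Oak', 'North'),
  (2, 230000, 2, 2, 'Elm', 'South'),
  (3, 260000, 4, 3, 'Oak', 'North'),
  (4, 200000, 2, 1, 'Oak', 'North');
INSERT INTO schools VALUES ('Oak', 80.0), ('Elm', 40.0);
INSERT INTO neighborhoods VALUES ('North', 2.0, 0.7), ('South', 6.0, 0.6);
""")

QUERY = """
SELECT l.id
FROM listings AS l
JOIN schools AS s ON s.name = l.school
JOIN neighborhoods AS n ON n.name = l.neighborhood
WHERE l.price < 250000
  AND l.beds >= 2 AND l.baths >= 2
  AND s.rank_percentile >= 66.7
  AND n.crime_rate < (SELECT AVG(crime_rate) FROM neighborhoods)
  AND n.diversity_index >= 0.5
ORDER BY l.id
"""
matches = [row[0] for row in db.execute(QUERY)]
print(matches)
```

The query itself is short; the integration effort lies in getting the sources to share keys (school and neighborhood names here) so that the joins are possible at all.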
Research data integration
Diabetes data and trends, country-level estimates:
http://apps.nccd.cdc.gov/DDT_STRS2/NationalDiabetesPrevalenceEstimates.aspx?mode=PHY
Diabetes Data & Trends home page:
http://apps.nccd.cdc.gov/ddtstrs/default.aspx
Summary
• Data science is not a new discipline but rather a new way of
utilizing data, methods, and tools to ask the right questions in
solving problems.
• Practicing data science requires strong skills in math,
computing, interpersonal communication, and asking the right
questions.
• Libraries are in a strategic position to practice data science.
Leveraging this position relies on
• vision
• courage to take risks
• knowledge of data science and related topics
• careful planning
• collaboration
Thank you!
Questions?
