Leabharlann UCD
An Coláiste Ollscoile, Baile
Átha Cliath,
Belfield, Baile Átha Cliath 4,
Eire
UCD Library
University College Dublin,
Belfield, Dublin 4, Ireland
Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie
#iCanHazRobot?
Improved robot detection for IR usage statistics
Open Repositories 2016
Dublin, 14 June
Overview and take-home points
• Usage stats are important
– (go to the Usage Stats panel on Thursday,
16/Jun/2016: 11:00am - 12:30pm)
• Robot filtration is a problem, especially in
repositories
• Robot detection has an exponential effect on
usage stats’ accuracy in repositories
• 2-3 ways to improve DSpace and EPrints’ usage
stats by 20% or more will be demonstrated
Experimental study
• Simple random sample of 2 years of UCD
repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Applied DSpace, EPrints robot detection
algorithms to the dataset
– This is an EXPERIMENT, simulating algorithms on a
DSpace repository’s usage data and Apache logs
– The data is real, live data, and the algorithms were
very easy to simulate
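As a rough illustration of what a sample of this size supports, the sketch below draws a simple random sample of download IDs and puts a normal-approximation confidence interval around an observed proportion. The function names, the seed and the 95% level are assumptions made for illustration; the 96.20% figure on the slide may have been derived differently.

```python
import math
import random
from statistics import NormalDist


def draw_sample(population_size, n, seed=None):
    """Simple random sample (without replacement) of download row numbers."""
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)


def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation interval for a proportion observed in n samples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Sizes taken from the slide: n=341 drawn from N=3.3 million downloads.
ids = draw_sample(3_300_000, 341, seed=1)
low, high = proportion_interval(0.85, 341)
print(f"{len(ids)} downloads sampled; interval for an 85% share: {low:.3f} to {high:.3f}")
```

Each sampled download would then be checked by hand and labelled robot or human, which is the step the slide describes as manual.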
First finding
85% of unfiltered
repository downloads
come from robots
• A 2013 IRUS-UK white paper covering 20 IRs confirms this:
it also found that 85% of downloads came from robots
Catching more robots improves stats
(But how much depends on the number of robots)
[Chart: accuracy of download stats (inverse precision, y-axis, 0 to 1) plotted against recall of robots (x-axis, 0 to 1). Catching more robots gives better stats. One curve per robot share of traffic:]
• Typical website, 15% robot traffic
• OA journal, 40% robot
• OA repositories, 85% robot
• Internet Archive, 91% robot
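The shape of these curves can be reproduced from the definitions of recall, precision and robot share alone. The sketch below is a minimal derivation under that assumption, not the author's code; the function name and the sample values are illustrative. Worked through by hand, it is algebraically equivalent (for recall below 1) to the formula given in the speaker note for this slide, read as a single fraction.

```python
def inverse_precision_from_rates(recall, precision, robot_share):
    """Share of the reported downloads that are human.

    recall      -- proportion of robot downloads that are detected (R)
    precision   -- proportion of excluded downloads that really are robots (P)
    robot_share -- proportion of all downloads made by robots (T)
    """
    caught = robot_share * recall              # robots correctly excluded
    if caught == 0:
        excluded = 0.0                         # nothing detected, nothing excluded
    else:
        excluded = caught / precision          # all excluded downloads
    humans_excluded = excluded - caught        # humans wrongly counted as robots
    reported = 1.0 - excluded                  # what ends up in the stats
    if reported <= 0:
        raise ValueError("every download was excluded; no stats to assess")
    humans_reported = (1.0 - robot_share) - humans_excluded
    return humans_reported / reported


# Perfect precision assumed for simplicity; robot shares are those on the slide.
for share in (0.15, 0.40, 0.85, 0.91):
    row = [round(inverse_precision_from_rates(r, 1.0, share), 2)
           for r in (0.0, 0.5, 0.9, 0.99)]
    print(f"robot share {share:.2f}: {row}")
```

With a high robot share the accuracy stays low until recall is close to 1, which is why the last few robots caught matter most in a repository.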
Robot detection techniques used
Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on
• Rate of requests ✓3
• User agent string ✓ ✓ ✓
• robots.txt access ✓
• Volume of requests ✓2 ✓3
• List of known robot IP addresses ✓ ✓
• Reverse DNS name lookup ✓1
• Trap file ✓
• User agents per IP address
• Width of traversal in the URL space ✓3
1 Only implemented nominally or experimentally
2 Via the repeat download or ‘double-click’ filter
3 Data available as a configurable report for manual decision making
Measurements used in robot detection
• All measurements are a number between 0 and 1
• Recall: proportion of robots detected
– I can haz robot?
• Precision: proportion of robot detections that are
true positives
– Proportion of discounted downloads that are
actually made by robots (sometimes humans are
counted as robots)
• Accuracy of download stats measured as inverse
precision:
– Proportion of reported downloads that are
actually made by humans
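The measurements above, together with the inverse recall and overall accuracy used later in the deck, can be written directly in terms of the four outcomes of checking a detector against hand-labelled downloads. This is a minimal sketch; the class name, field names and the example counts are made up for illustration and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # robot downloads excluded as robots
    fp: int  # human downloads wrongly excluded as robots
    fn: int  # robot downloads missed, so they stay in the stats
    tn: int  # human downloads kept in the stats


def recall(c):
    """Proportion of robot downloads that were detected."""
    return c.tp / (c.tp + c.fn)


def precision(c):
    """Proportion of excluded downloads that really were robots."""
    return c.tp / (c.tp + c.fp)


def inverse_recall(c):
    """Proportion of human downloads that remain in the stats."""
    return c.tn / (c.tn + c.fp)


def inverse_precision(c):
    """Proportion of reported downloads that really were human."""
    return c.tn / (c.tn + c.fn)


def overall_accuracy(c):
    """(Tp + Tn) / n."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


# Hypothetical counts for 100 hand-checked downloads.
example = Counts(tp=80, fp=2, fn=5, tn=13)
print(recall(example), precision(example), inverse_precision(example))
```

In this made-up example the detector catches most robots and rarely misclassifies humans, yet more than a quarter of the reported downloads are still robots, because robots so heavily outnumber humans.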
How they perform, out-of-the-box
[Bar chart: Robot detection in OA IR systems. Groups: DSpace, EPrints, Minho, Minho with monthly manual checking, no robot detection. Series: recall, precision, negative precision (accuracy of download stats). Scale 0 to 1.]
Room for improvement?
1. Ability to manually check for outliers
• At UCD, once a month, we check:
– Daily downloads for the last 2-4 months
– Top 10 most downloaded items
– Top 20 downloading IP addresses for the last 2-4
months
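The three monthly checks listed above amount to simple counts over the download records. The sketch below shows one way to produce them, assuming each record is a (date, IP address, item identifier) tuple; the record layout, function names and sample data are hypothetical, and deciding what counts as an outlier remains a manual judgement.

```python
from collections import Counter
from datetime import date


def daily_downloads(records):
    """Downloads per day, for spotting sudden spikes."""
    return Counter(day for day, _ip, _item in records)


def top_items(records, n=10):
    """The n most downloaded items."""
    return Counter(item for _day, _ip, item in records).most_common(n)


def top_ips(records, n=20):
    """The n IP addresses with the most downloads."""
    return Counter(ip for _day, ip, _item in records).most_common(n)


# Made-up records using documentation-range IP addresses.
records = [
    (date(2016, 5, 1), "192.0.2.10", "item-a"),
    (date(2016, 5, 1), "192.0.2.10", "item-a"),
    (date(2016, 5, 1), "192.0.2.10", "item-b"),
    (date(2016, 5, 2), "198.51.100.7", "item-a"),
]
print(daily_downloads(records))
print(top_items(records, n=10))
print(top_ips(records, n=20))
```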
[Bar chart: Robots caught (Recall) for DSpace, EPrints and Minho. Series: out-of-the-box; with manual checking (outlier exclusion). Scale 0 to 1.]
[Bar chart: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with manual checking (outlier exclusion). Scale 0 to 1.]
2. Recalibrate the EPrints repeat-download (double-click) filter
[Bar chart: Effect of double-click filter on EPrints’ robot detection and stats. Scale 0 to 1. Measures:]
• Recall (robots)
• Precision (accuracy of excluded downloads)
• Inverse recall (legitimate downloads accounted for in stats)
• Inverse precision (accuracy of reported download stats)
• Overall accuracy, (Tp + Tn) / n
[Series: without double-click filter; with double-click filter (out-of-the-box); with recalibrated double-click filter*]
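The speaker notes describe the recalibration as allowing one IP address up to 10 downloads of one item in 24 hours, instead of 1. The sketch below illustrates that rule; it is not the EPrints implementation. It assumes events arrive sorted by time and that only the downloads above the limit are dropped, and the function and parameter names are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def filter_repeat_downloads(events, max_downloads=1, window=timedelta(hours=24)):
    """Return the events that stay in the stats.

    events        -- (timestamp, ip, item_id) tuples, sorted by timestamp
    max_downloads -- downloads of one item allowed per IP within the window
                     (1 mimics the default described in the notes, 10 the
                     recalibrated filter)
    """
    recent = defaultdict(deque)  # (ip, item) -> times of counted downloads
    counted = []
    for timestamp, ip, item in events:
        times = recent[(ip, item)]
        # Forget counted downloads that have fallen out of the window.
        while times and timestamp - times[0] >= window:
            times.popleft()
        if len(times) < max_downloads:
            times.append(timestamp)
            counted.append((timestamp, ip, item))
    return counted


# Made-up burst: one IP downloads the same item once an hour for 12 hours.
start = datetime(2016, 6, 14)
burst = [(start + timedelta(hours=h), "192.0.2.10", "item-a") for h in range(12)]
print(len(filter_repeat_downloads(burst)))
print(len(filter_repeat_downloads(burst, max_downloads=10)))
```

Raising the limit keeps more repeat downloads in the stats, which is the trade-off between inverse recall and precision that the chart compares.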
3. Port Minho’s robot detection code (a
log parser) onto DSpace or EPrints
• 1 Java class
• Input is Apache Combined Log Format
• Output is a database update (robot = true field)
– Similar to EPrints' $is_robot variable in Robots.pm,
– Could be modified to update the DSpace 'isBot'
field in the SOLR usage events document
• Requires 2 database tables to store learned
agents and IPs
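To make the shape of such a log parser concrete, the sketch below reads Apache Combined Log Format lines and marks each hit as robot or not. It is written in Python for illustration and is not Minho's Java class or its detection rules: the user-agent pattern, the robots.txt rule and the in-memory set standing in for the table of learned IPs are all assumptions.

```python
import re
from datetime import datetime

# Apache Combined Log Format: host ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)
# Illustrative pattern only; real robot lists are far longer.
ROBOT_AGENT = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    ip, _ident, _user, time_text, request, status, _size, _referer, agent = match.groups()
    parts = request.split()
    return {
        "ip": ip,
        "time": datetime.strptime(time_text, "%d/%b/%Y:%H:%M:%S %z"),
        "path": parts[1] if len(parts) >= 2 else "",
        "status": int(status),
        "agent": agent,
    }


def flag_robots(lines):
    """Add a boolean 'robot' field to every parsed hit."""
    hits = [hit for hit in map(parse_line, lines) if hit is not None]
    # "Learn" robot IPs: any IP that fetched robots.txt or sent a robot-like agent.
    learned_ips = {
        hit["ip"] for hit in hits
        if hit["path"] == "/robots.txt" or ROBOT_AGENT.search(hit["agent"])
    }
    for hit in hits:
        hit["robot"] = hit["ip"] in learned_ips
    return hits


# Made-up log lines with documentation-range IP addresses.
sample = [
    '192.0.2.10 - - [14/Jun/2016:10:00:00 +0100] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleCrawler/1.0"',
    '192.0.2.10 - - [14/Jun/2016:10:00:05 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [14/Jun/2016:10:01:00 +0100] "GET /bitstream/1/paper.pdf HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
for hit in flag_robots(sample):
    print(hit["ip"], hit["path"], hit["robot"])
```

In a port, the `robot` value would be written back to the repository's own store, for example the fields named in the bullets above, rather than printed.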
[Bar chart: Robots caught (Recall) for DSpace, EPrints and Minho. Series: out-of-the-box; with Minho log parser. Scale 0 to 1.]
[Bar chart: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho and without robot detection. Series: out-of-the-box; with Minho log parser. Scale 0 to 1.]
4. Combine two or more techniques
[Bar chart: Robots caught (Recall) for DSpace, EPrints and Minho. Scale 0 to 1. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*
4. Combine two or more techniques
[Bar chart: Accuracy of reported download stats (Inverse precision) for DSpace, EPrints, Minho and without robot detection. Scale 0 to 1. Series:]
• Out-of-the-box
• With manual checking (outlier exclusion)
• With recalibrated double-click filter*
• With Minho log parser
• With Minho and outliers
• Minho, outliers, and recalibrated double-click*
Thank you!







Editor's Notes

  • #3: Good news: DSpace and EPrints do robot filtration out-of-the-box. Bad news: the stats are still quite inaccurate. More good news: improving robot recall has an exponential effect on usage stats accuracy.
    Usage stats are primarily download counts. They are used heavily in marketing the repository, and they provide a measure of ROI both to those who have uploaded items (investment of time/effort) and to those who fund the repository. More downloads = more UCD visibility, which is one measure of our ROI.
  • #4: Experiment: simple random sample of 2 years of download data (n=341, N=3.3 million for 96.20% certainty), manually checked to determine if robot or human.
    System: DSpace 1.8.2 with U. Minho DSpace Statistics Add-on v. 4; Apache Tomcat behind Apache HTTP Server; logs in Apache Combined Log Format. Minho registers every download in the PostgreSQL database.
    Results to be published in the July 2016 issue of Library Hi Tech (Greene 2016).
    This dataset is used to experimentally test different detection techniques, used alone and in combination.
    Weaknesses: the data is taken from a DSpace/Minho system (its own SEO, its own way of being crawled, etc.). 'In vitro': except for the original system (DSpace/Minho + monthly manual outlier checking), the robot detection techniques are simulated. Hence, EXPERIMENTAL.
    Strengths: 'in vivo': the data is real data from a production OA IR system. Simulating the various detection techniques was very easy to do, so the result is probably a very accurate picture of how each system would have treated this dataset.
  • #5: See:
    INFORMATION POWER LTD. 2013. IRUS download data: identifying unusual usage [Online]. Available: http://www.irus.mimas.ac.uk/news/IRUS_download_data_Final_report.pdf [Accessed 2015-12-11]. Confirms the 85% figure.
    DORAN, D. & GOKHALE, S. S. 2011. Web robot detection techniques: overview and limitations. Data Mining and Knowledge Discovery, 22, 183-210. Hypothesizes why the figure is so high in OA (p. 191).
  • #6: Sources for the curves:
    Typical website (15% robot traffic): precision = 0.8727, mean of four studies; robots:total sessions = 0.1516, mean of four studies.
    OA journal (40% robot): HUNTINGTON, P., NICHOLAS, D. & JAMALI, H. R. 2008. Web robot detection in the scholarly information environment. Journal of Information Science, 34, 726-741.
    OA repositories (85% robot): Greene 2016 and Information Power 2013 (see above).
    Internet Archive (91% robot): ALNOAMANY, Y., WEIGLE, M. C. & NELSON, M. L. 2013. Access patterns for robots and humans in web archives. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 339-348.
    The reverse is also true: if robots go uncaught (e.g. detection deteriorates over time as robots improve their capabilities), the accuracy of the stats diminishes.
    Formula (Greene 2016):
    $$P_{inv} = \frac{TR(R - PR - 1) + 2TPR - P(T + R - 1)}{R(TR - P - T) + P}$$
    where R = recall (robot detection), P = precision (robot detection), P_inv = inverse precision (human stats), T = ratio of robots to total.
  • #7: Greene 2016
  • #9: Minho with monthly manual checking is the original data as measured in vivo. For Minho alone, the manually detected outliers have been removed. The DSpace and EPrints results were generated by applying their native algorithms to the data.
  • #11: Outliers: c.f. LAMOTHE, A. R. 2014. The importance of identifying and accommodating e-resource usage data for the presence of outliers. Information Technology and Libraries, 33, 31-44.
  • #17: *Recalibrated double-click filter: a single IP address downloading a single item more than 10 times in 24 hours is excluded. By default, the filter excludes a single IP address downloading a single item more than once in 24 hours. The timeout length can be configured, but the number of downloads allowed within the period currently cannot be increased.
    See also: JOINT, N., FIELD, A. & GREGSON, M. 2011. Please change the way IRstats works [Online]. Available: http://www.eprints.org/tech.php/15695.html [Accessed November 28 2015].
    The drop in inverse recall (loss of legitimate downloads) supports the concern raised in this email discussion. However, if the recalibration were to be implemented, the improvement to robot precision means that the increase in legitimate downloads is offset by the decrease in illegitimate ones, so inverse precision is not affected a great deal. Overall accuracy improves notably, however.