Web Scraping and Healthcare
Presented By:
Avanish Kumar Giri
BMSCE, Bangalore
Contents
• Introduction
• Research paper-1: Big data analytics in healthcare: promise and potential
• Research paper-2: Big data analytics in healthcare
• Research paper-3: Improving Healthcare Using Big Data Analytics
• Research paper-4: Web Scraping: Data Extraction from websites
• Research paper-5: A Dive into Web Scraper World
• Research paper-6: Automated scraping of structured data records from
health discussion forums using semantic analysis
• Comparison of all the research papers
• Proposed Framework
• Conclusion
• References and Bibliography
Introduction
• The internet is a massive storehouse of health-related information.
• But much of it is not freely available because of:
▫ Individual privacy issues
▫ Data leaks
▫ Website restrictions
• So we can target health discussion forums, which are easily accessible, to
scrape the data and analyze it for various purposes.
• Web scraping, also known as web extraction or harvesting, is a technique to
extract data from the World Wide Web (WWW) and save it to a file system or
database for later retrieval or analysis.
Research Paper-1
• Big data analytics in healthcare: promise and potential
▫ The healthcare industry historically has generated large amounts of data,
driven by record keeping, compliance, regulatory requirements, and
patient care.
▫ While most data is stored in hard copy form, the current trend is toward
rapid digitization of these large amounts of data.
▫ Big data for U.S. healthcare will soon reach the zettabyte (10^21 bytes)
scale.[1]
Research Paper-2
• Big data analytics in healthcare
▫ Image processing
▪ Computed tomography (CT),
▪ Magnetic resonance imaging (MRI),
▪ X-ray, molecular imaging, ultrasound, etc.
▫ Signal processing
▫ Data analytics in disease detection
▫ Data Analytics in medical diagnosis
Research Paper-3
• Improving Healthcare Using Big Data Analytics
▫ 50 petabytes of data in the healthcare realm, predicted to grow to
25,000 petabytes by 2020, as reported in an infographic from Oracle.[2]
▫ Data analytics in public health research
▪ With the rapid expansion of public health information, we can use data
analytics techniques to crawl and filter varied types of public health
data.
▪ Hospital information system (HIS), which includes the electronic
medical record system (EMRS)
▪ Laboratory information system (LIS)
▪ Radiology information system (RIS)
▪ Clinical decision support system (CDSS), etc.
Research Paper-4
• Web Scraping: Data Extraction from websites
▫ Web scraping, also known as web extraction or harvesting, is a technique
to extract data from the World Wide Web (WWW) and save it to a file
system or database for later retrieval or analysis.
▫ Scraping is cited as one of the sources for big data collection. Its
definition also introduces a related term: web crawler.
Fig 1: Web crawler vs. web scraping
Source: https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping
Research Paper-4 (contd.)
• Available Frameworks
• Scrapy
▫ It is one of the advanced web scraping frameworks available.
▫ The framework is written in Python.
• FMiner
▫ It combines visual configuration with scripting features.
Research Paper-5
• A Dive into Web Scraper World
▫ Is web scraping legal?
▪ This question has no settled answer.
▪ Opinions differ widely on the legal and illegal aspects of scraping
the web.
▫ Crawling policies
▪ Selection policy: states which pages to download
▪ Re-visit policy: states when to check pages for changes
▫ Designing a custom scraper
▪ A web scraper is broadly composed of two parts:
▪ Web crawler for crawling links
▪ Data extractor for the crawled links
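The two-part design above (a crawler that collects links and an extractor that pulls data from a page) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the design from the paper. The class and function names, the sample page, and the example.org host are all hypothetical. Keeping only same-host links is one possible selection policy, chosen here as an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextParser(HTMLParser):
    """Collects href values (for the crawler) and visible text (for the extractor)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every anchor tag that has an href attribute.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep non-empty text nodes only.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl_links(base_url, html):
    """Crawler part: return absolute links that stay on the same host."""
    parser = LinkAndTextParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    absolute = [urljoin(base_url, href) for href in parser.links]
    # Assumed selection policy: follow same-host links only.
    return [url for url in absolute if urlparse(url).netloc == host]


def extract_text(html):
    """Extractor part: return the visible text of the page as one string."""
    parser = LinkAndTextParser()
    parser.feed(html)
    return " ".join(parser.text_parts)


# Hypothetical page content used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <a href="/forums/discuss/first-thread">First thread</a>
  <a href="https://other.example.net/ad">Sponsored</a>
  <p>I have had stomach pain for a week.</p>
</body></html>
"""

links = crawl_links("https://forum.example.org/forums/", SAMPLE_PAGE)
text = extract_text(SAMPLE_PAGE)
print(links)
print(text)
```

A re-visit policy could be layered on top of this by storing when each link was last fetched and only re-crawling it after a chosen interval.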
Research Paper-6
• Automated scraping of structured data records from
health discussion forums using semantic analysis
▫ In the context of healthcare, web scraping is gradually gaining a foothold.
▫ Several factors have led to the use of web scraping in healthcare:
▪ Healthcare data is too complex to be analysed by traditional techniques.
▪ Web scraping along with data extraction can improve decision
making.
What is a health discussion forum
Source: https://patient.info/forums/discuss/browse/abdominal-disorders-3321
General Framework for data extraction[6]
Fig 2: General Framework of data extraction
Source: https://www.sciencedirect.com/science/article/pii/S2352914817302253
Comparison
Paper | Process | Advantages | Disadvantages | Future work
1 | Big data concepts in healthcare | Improved record keeping and patient care | Large amount of data | Analyzing size of the data
2 | Image and signal processing in big data | Easy processing of reports | Domain knowledge is important to process such data | Many other types of reports can be processed
3 | Data analytics in public health research | Easy processing of reports | Research knowledge required | Easy definition of reports
4 | Web crawler and available frameworks | Frameworks are easily available | Illegal activities could take place | Limitations in frameworks
5 | Legal aspects and crawling policies | No proper legal definition | Depends on the website | Defining clear legal structure
6 | Scraping web discussion forums | Easy to get data from such forums | Website policies | Less frequent requests
Proposed Framework
• Fetching data from health discussion forums.
• Use of BeautifulSoup for parsing (Python Library)
• Store data in JSON Format
• Process data for decision making
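As an illustration of the last step, "process data for decision making", the sketch below loads stored JSON records and counts how many posts mention each term of interest. It is only one simple possibility, not the processing method proposed in the slides. The record fields (`title`, `body`), the sample posts, and the term list are all hypothetical.

```python
import json
from collections import Counter

# Hypothetical records, in the shape a scraper might have stored as JSON.
POSTS_JSON = """
[
  {"title": "Panic attacks at night",
   "body": "My anxiety gets worse and I have insomnia."},
  {"title": "Health anxiety advice",
   "body": "Any tips for nausea before appointments?"},
  {"title": "New here",
   "body": "Just saying hello."}
]
"""

# Hypothetical terms a healthcare organization might want to track.
TERMS = ("anxiety", "insomnia", "nausea")


def count_term_mentions(posts, terms):
    """Count the number of posts whose title or body mentions each term."""
    counts = Counter()
    for post in posts:
        text = f"{post['title']} {post['body']}".lower()
        for term in terms:
            if term in text:
                counts[term] += 1
    return counts


posts = json.loads(POSTS_JSON)
counts = count_term_mentions(posts, TERMS)
for term, n in counts.most_common():
    print(f"{term}: mentioned in {n} post(s)")
```

Simple substring matching like this will miss misspellings and synonyms, so a real analysis would likely need more careful text processing.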
Fetching the data
• Involves finding the endpoint: the URL or URLs
• Sending HTTP requests to the server
• Using the requests library:
▫ import requests
▫ data = requests.get('https://patient.info/forums/discuss/browse/anxiety-disorders-70')
▫ html = data.content
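Since website policies and less frequent requests come up in the comparison of the papers, the fetch step can be wrapped in a robots.txt check and a pause between requests. The sketch below shows one way to do that with the standard library. The function names, the robots.txt content, the example.org URLs, and the delay value are assumptions made for illustration. The fetch function is passed in as a parameter so the example runs without network access.

```python
import time
from urllib import robotparser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return only the URLs that the given robots.txt text permits."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=2.0):
    """Fetch each URL in turn, sleeping between consecutive requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # spread requests out over time
        pages[url] = fetch(url)
    return pages


# Hypothetical robots.txt content and candidate URLs.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""
candidates = [
    "https://forum.example.org/forums/discuss/thread-1",
    "https://forum.example.org/private/members",
]

permitted = allowed_urls(ROBOTS_TXT, candidates)

# A stand-in fetch function; with the requests library this could be
# lambda url: requests.get(url, timeout=10).content
pages = fetch_politely(permitted, fetch=lambda url: b"<html></html>", delay_seconds=0)
print(permitted)
```

Passing the fetch function as an argument also makes it easy to swap in a different HTTP client or a cached copy of the pages during testing.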
Use BeautifulSoup for parsing
• Provides simple methods to:
▫ Search
▫ Navigate
▫ Select
• Export the data
▫ Database (relational or non-relational)
▫ CSV
▫ JSON
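The search, navigate, and select methods and the JSON export can be sketched together as below. The example assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend. The HTML snippet, its class names, and the record fields are hypothetical and do not reflect the markup of any real forum.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical forum listing markup, used only for illustration.
SAMPLE_HTML = """
<ul id="threads">
  <li class="post">
    <a class="title" href="/forums/discuss/thread-1">Panic attacks at night</a>
    <span class="replies">12</span>
  </li>
  <li class="post">
    <a class="title" href="/forums/discuss/thread-2">Health anxiety advice</a>
    <span class="replies">3</span>
  </li>
</ul>
"""


def parse_threads(html):
    """Turn the listing markup into a list of plain dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("li", class_="post"):   # search by tag and class
        link = item.find("a", class_="title")
        replies = item.select_one("span.replies")      # select with a CSS selector
        records.append({
            "list_id": item.parent.get("id"),          # navigate to the parent tag
            "title": link.get_text(strip=True),
            "url": link["href"],
            "replies": int(replies.get_text(strip=True)),
        })
    return records


records = parse_threads(SAMPLE_HTML)
json_text = json.dumps(records, indent=2)  # export as JSON text
print(json_text)
```

The same list of dictionaries could be written to CSV with `csv.DictWriter` or inserted into a database, which matches the other export options listed above.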
Impact of web scraping on healthcare
• Healthcare is not a sector that depends entirely on person-to-person
interactions.
• In a system where most options are data-centric, healthcare web scraping
can affect lives, educate people, and generate awareness. Because people no
longer rely only on doctors and pharmacists for health information,
healthcare web scraping can improve lives by providing balanced solutions.
• With data extraction and web scraping methods, healthcare organizations
may reduce fraud attempts, doctors can discover effective cures and best
practices, and patients can have better and more affordable healthcare
services.
Conclusion
• From the points discussed across the research papers, we can conclude
that the use of web scraping in healthcare can improve the traditional
process of decision making.
• Targeting web discussion forums to detect risks can be helpful for any
healthcare organization as well as for individuals.
• A scraper opens up another way of retrieving information without the use
of an API, and it is mostly accessed anonymously.
• But people who scrape should make sure that they are not breaking any law
that could make them liable for an offence.
References
1. https://link.springer.com/journal/13755
2. https://www.3idatascraping.com/web-scraping-for-healthcare-companies.php
3. https://www.elsevier.com/locate/imu
4. https://patient.info/forums/discuss/browse/abdominal-disorders-3321
5. https://www.quora.com/What-are-the-biggest-differences-between-web-crawling-and-web-scraping
6. https://www.sciencedirect.com/science/article/pii/S2352914817302253
Bibliography
• Research Paper 1:
▫ Title: Big data analytics in healthcare: promise and potential
▫ Authors: Wullianallur Raghupathi and Viju Raghupathi
▫ https://link.springer.com/journal/13755
▫ Pages: 1-10
• Research Paper 2:
▫ Title: Big Data Analytics in Healthcare
▫ Authors: Ashwin Belle, Raghuram Thiagarajan, S.M. Reza Soroushmehr and
Kayvan Najarian
▫ Biomedicine and Biotechnology, January 2015
▫ Pages: 1-38
• Research Paper 3:
▫ Title: Improving Healthcare Using Big Data Analytics
▫ Author: Revanth Sonnati
▫ Academia
▫ Pages: 1-5
Bibliography
• Research Paper 4:
▫ Title: Web Scraping: Data Extraction from websites
▫ Author: Vojtech Draxl
▫ ScienceDirect
▫ Pages: 1-38
• Research Paper 5:
▫ Title: A Dive into Web Scraper World
▫ Authors: Deepak Kumar Mahto, Lisha Singh
▫ 2016 International Conference on Computing for Sustainable Global
Development
▫ Pages: 1-5
• Research Paper 6:
▫ Title: Automated scraping of structured data records from health discussion
forums using semantic analysis
▫ Authors: Umamageswari Baskaran, Kalpana Ramanujam
▫ www.elsevier.com/locate/imu
▫ ScienceDirect
▫ Pages: 1-10
Thank You
