JSOUP
Overview
What is Jsoup
Parsing with Url
Parsing with File
Modify Data
Prevent cross site scripting
JSOUP
jsoup is a Java library for working with real-world HTML. It provides a convenient API for
extracting and manipulating data. With it you can:
● scrape and parse HTML from a URL, file, or string
● find and extract data, using DOM traversal or CSS selectors
● manipulate the HTML elements, attributes, and text
● clean user-submitted content against a safe white-list, to prevent XSS attacks
● output tidy HTML
Parse a document from a URL
The connect(String url) method creates a new Connection, and get() fetches and parses an HTML page. If
an error occurs while fetching the URL, get() throws an IOException, which you should handle
appropriately.
Document document = Jsoup.connect("https://grails.org/").get()
String title = document.title()
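The IOException handling mentioned above could look roughly like the following sketch. The class name FetchTitle and the helper fetchTitle are hypothetical, and the 3-second timeout is an arbitrary choice; the jsoup jar is assumed to be on the classpath.

```java
import java.io.IOException;
import java.util.Optional;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchTitle {

    // Hypothetical helper: returns the page title, or empty if the fetch fails.
    static Optional<String> fetchTitle(String url) {
        try {
            Document document = Jsoup.connect(url)
                    .timeout(3000) // give up after 3 seconds (arbitrary value)
                    .get();
            return Optional.of(document.title());
        } catch (IOException e) {
            // Unknown hosts, timeouts, and HTTP error statuses all surface as IOException.
            System.err.println("Could not fetch " + url + ": " + e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("https://grails.org/").orElse("(no title)"));
    }
}
```

Returning an Optional keeps the failure case explicit for the caller; rethrowing or logging and retrying would be equally reasonable depending on the application.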
Parse a document from a URL (continued)
The Connection interface is designed for method chaining to build specific requests:
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
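Each chained call only configures the pending request; nothing is sent until get(), post(), or execute() is called. As a small sketch of that idea, the configured request can be inspected without any network traffic. The class name RequestBuilder is hypothetical, and method(Connection.Method.POST) is used here in place of post() so that the request stays unsent.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestBuilder {

    // Builds the same request as the slide, but does not send it.
    static Connection buildRequest() {
        return Jsoup.connect("http://example.com")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(3000)
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        // request() exposes the settings accumulated by the chained calls.
        Connection.Request request = buildRequest().request();
        System.out.println(request.method());
        System.out.println(request.header("User-Agent"));
        System.out.println(request.cookie("auth"));
        System.out.println(request.timeout());
    }
}
```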
Parse a document from a string
You have HTML in a Java String, and you want to parse that HTML to get at its contents, or to make
sure it's well formed, or to modify it. The String may have come from user input, a file, or from the
web.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
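Once the string is parsed, the resulting Document can be queried like any other. A minimal, self-contained sketch (the class name ParseString is hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseString {

    // Parses the HTML string from the slide into a Document.
    static Document parse() {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(doc.title());        // text of the <title> element
        System.out.println(doc.body().text());  // combined text of the <body>
    }
}
```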
Load a document from a file
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Element content = document.getElementById("content")
Elements paragraphs = document.getElementsByTag("p")
Elements greenElements = document.getElementsByClass("green")
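To try the file-based lookup without depending on a particular path, the sketch below writes a small sample page to a temporary file first. The class name ParseFile, the sample markup, and the temp-file approach are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseFile {

    // Writes a small sample page to a temp file so the example is self-contained.
    static File sampleFile() throws IOException {
        String html = "<html><body><div id=\"content\">Main</div>"
                + "<p class=\"green\">One</p><p>Two</p></body></html>";
        Path path = Files.createTempFile("jsoup-sample", ".html");
        Files.write(path, html.getBytes(StandardCharsets.UTF_8));
        path.toFile().deleteOnExit();
        return path.toFile();
    }

    // Jsoup.parse(File, String) throws IOException if the file cannot be read.
    static Document load() {
        try {
            return Jsoup.parse(sampleFile(), "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document document = load();
        Element content = document.getElementById("content");  // one element, or null if absent
        Elements paragraphs = document.getElementsByTag("p");   // every <p> element
        Elements green = document.getElementsByClass("green");  // every element with class "green"
        System.out.println(content.text());
        System.out.println(paragraphs.size());
        System.out.println(green.size());
    }
}
```

Note that getElementById returns a single Element, while the tag and class lookups return an Elements collection.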
Use DOM methods to navigate a document
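You have an HTML document that you want to extract data from. The select() method finds elements with a CSS selector, and each match can then be queried for its text and attributes. The snippet uses Groovy's each closure to iterate over the matches.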
File file = new File("/home/shipra/Downloads/Jsoup.html")
Document document = Jsoup.parse(file, "UTF-8")
Elements elements = document.select(".nav-sections li")
elements.each { element ->
String text = element.select("a").text()
String attr = element.select("a").attr("href")
}
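In plain Java the same traversal can be written with an enhanced for loop, since Elements is iterable. The sketch below collects each link's text and href into a map; the class name NavLinks and the sample markup are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NavLinks {

    // Hypothetical sample markup standing in for the parsed file.
    static final String SAMPLE =
            "<ul class=\"nav-sections\">"
            + "<li><a href=\"/docs\">Docs</a></li>"
            + "<li><a href=\"/blog\">Blog</a></li>"
            + "</ul>";

    // Maps link text to href for every <li> under .nav-sections.
    static Map<String, String> extractLinks(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element item : document.select(".nav-sections li")) {
            Element anchor = item.select("a").first(); // null when the <li> has no link
            if (anchor != null) {
                links.put(anchor.text(), anchor.attr("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        extractLinks(SAMPLE).forEach((text, href) -> System.out.println(text + " -> " + href));
    }
}
```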
Modify Data
Use the attribute setter methods Element.attr(String key, String value), and Elements.attr(String key,
String value).
If you need to modify the class attribute of an element, use the Element.addClass(String className)
and Element.removeClass(String className) methods.
The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow"
attribute to every a element inside a div with the class comments:
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
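A runnable version of these two calls, applied to a small made-up page, might look like this. The class name ModifyAttributes and the sample markup are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyAttributes {

    // Hypothetical sample markup with a masthead and a comments block.
    static final String SAMPLE =
            "<div class=\"masthead\">Site</div>"
            + "<div class=\"comments\"><a href=\"/a\">A</a><a href=\"/b\">B</a></div>";

    static Document modify(String html) {
        Document doc = Jsoup.parse(html);
        // Bulk setter: applies to every matched <a> element.
        doc.select("div.comments a").attr("rel", "nofollow");
        // Setters return the same Elements, so calls can be chained.
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(modify(SAMPLE).body().html());
    }
}
```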
Setting the HTML content of an element
Element div = document.select("div").first();
div.html("<p>paragraph</p>");
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
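The effect of the three calls is easiest to see on a concrete element: html() replaces the existing children, prepend() inserts before them, and append() adds after them. A minimal sketch (the class name SetHtml and the starting markup are hypothetical):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SetHtml {

    static Element rebuild() {
        Document document = Jsoup.parse("<div>old content</div>");
        Element div = document.select("div").first();
        div.html("<p>paragraph</p>");  // replaces "old content"
        div.prepend("<p>First</p>");   // becomes the first child
        div.append("<p>Last</p>");     // becomes the last child
        return div;
    }

    public static void main(String[] args) {
        // The div now holds three <p> children: First, paragraph, Last.
        for (Element child : rebuild().children()) {
            System.out.println(child.text());
        }
    }
}
```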
Sanitize untrusted HTML (to prevent XSS)
A Whitelist defines which elements and attributes are allowed through the cleaner; everything else is discarded. (From jsoup 1.14.1 onward the class is named Safelist.)
String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
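As a sketch of the cleaning step on a newer jsoup release, the example below uses Safelist, the name under which this class is available from jsoup 1.14.1 onward; that version is an assumption, and on older releases Whitelist.basic() plays the same role. The class name Sanitize is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class Sanitize {

    // Keeps only the tags and attributes permitted by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        // The onclick handler is not on the safelist, so it is dropped;
        // the link itself and its href survive.
        System.out.println(sanitize(unsafe));
    }
}
```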
Tidy HTML
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of
whether the HTML is well-formed or not. It handles:
● unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
● implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
● reliably creating the document structure (html containing a head and body, and only
appropriate elements within the head)
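The unclosed-tag case and the guaranteed document structure can be checked with a few lines. This is only a sketch covering those two points; the class name TidyParse is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TidyParse {

    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        // Two unclosed <p> tags and no html, head, or body elements.
        Document doc = parse("<p>Lorem <p>Ipsum");

        Elements paragraphs = doc.select("p");
        System.out.println(paragraphs.size());        // number of separate paragraphs
        System.out.println(paragraphs.get(0).text());
        System.out.println(paragraphs.get(1).text());

        // head() and body() are created even though the input had neither.
        System.out.println(doc.head() != null && doc.body() != null);

        // The serialized document, with the structure filled in.
        System.out.println(doc.outerHtml());
    }
}
```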
Demo Reference
https://github.com/NexThoughts/JSOUP.git