Java Program to Extract Content from a HTML document

Last Updated : 19 Nov, 2022

HTML is the core of the web: every page you see on the internet is HTML, whether it is written by hand or generated dynamically by JavaScript, JSP, PHP, ASP, or any other web technology. A browser parses HTML and renders it for you. Sometimes, however, a program has to read an HTML document itself, for example to find particular elements, tags, or attributes, or to check whether a particular element exists. In Java, we can extract the content of an HTML document and then parse it.

Approaches:

1. Using FileReader
2. Using URL.openStream()

Approach 1: Using FileReader

FileReader is a class in the java.io package that reads character data from a file, whatever the file's extension. The HTML lines are appended to a StringBuilder as follows:

1. Open the file from the source folder with FileReader (wrapped in a BufferedReader so it can be read line by line).
2. Append each line to the StringBuilder.
3. When there is no content left in the HTML document, close the open file with br.close() (a try-with-resources statement makes this call automatically).
4. Print out the resulting String.
Implementation (Java):

    // Java Program to Extract Content from a HTML document

    // Importing the input/output classes that are used
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class GFG {

        // Main driver method
        public static void main(String[] args)
        {
            // StringBuilder that accumulates the lines of the HTML file
            StringBuilder html = new StringBuilder();

            // Reading the HTML file from a local directory.
            // try-with-resources calls br.close() automatically,
            // even if an exception is thrown while reading
            try (BufferedReader br = new BufferedReader(new FileReader(
                     "C:\\Users\\rohit\\OneDrive\\Desktop\\article.html"))) {

                String val;

                // readLine() returns null once the end of the file is reached
                while ((val = br.readLine()) != null) {
                    // readLine() strips the line terminator, so add it back
                    html.append(val).append('\n');
                }

                // Finally, converting the builder into a String
                String result = html.toString();
                System.out.println(result);
            }

            // Catch block for a missing file or a failed read
            catch (IOException ex) {
                System.out.println(ex.getMessage());
            }
        }
    }

Output: The contents of article.html are printed to the console.

Approach 2: Using URL.openStream()

The url.openStream() method opens a connection to the resource that the URL points to and returns an InputStream for reading it. For an http:// URL this means the following:

1. A new TCP connection is initiated to the server named in the URL.
2. An HTTP GET request is sent over that connection, and the server sends back an HTTP response containing the requested content. If everything is OK, the response status is 200.
3. The content arrives as bytes. openStream() hands those bytes to the program as an InputStream, and InputStreamReader converts them into characters:

    BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));

4. A loop then reads the content line by line and prints it, since the goal here is to show it on the console:

    while ((val = br.readLine()) != null) // condition
    {
        System.out.println(val); // executed while the condition is true
    }

The implementation below uses a file:/// URL, so no network connection or HTTP request is involved; the stream reads the local file directly. The reading code is the same for an http:// URL.

Implementation (Java):

    // Java Program to Extract Content from a HTML document

    // Importing the input/output classes that are used
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Importing the java URL class
    import java.net.URL;

    public class GFG {

        // Main driver method
        public static void main(String[] args)
        {
            // Try block to check exceptions
            try {
                // Constructing the URL; here it points to a local .html file
                URL url = new URL(
                    "file:///C:/Users/rohit/OneDrive/Desktop/article.html");

                // Reading the HTML content from the .html file.
                // try-with-resources closes the stream automatically
                try (BufferedReader br = new BufferedReader(
                         new InputStreamReader(url.openStream()))) {

                    String val;

                    // readLine() returns null at the end of the stream
                    while ((val = br.readLine()) != null) {
                        System.out.println(val);
                    }
                }
            }

            // Catch block for a malformed URL, a missing file, or a failed read
            catch (IOException ex) {
                System.out.println(ex.getMessage());
            }
        }
    }

Output: The contents of article.html are printed to the console, line by line.

Article by rohit2sahu.
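The description of Approach 2 says the content is fetched when the connection is OK (status 200), but openStream() itself does not expose the status code. Below is a minimal sketch of how that check could look with HttpURLConnection. The class name HttpFetchSketch, the helper methods fetch and fetchFromLocalServer, the /article.html path, and the throwaway local server (the JDK's built-in com.sun.net.httpserver.HttpServer) are assumptions made for illustration, so that the example runs without internet access. They are not part of the original programs.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this sketch
public class HttpFetchSketch {

    // Reads the body at an http:// URL, but only if the server answers 200 (OK)
    static String fetch(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status: " + status);
            }
            StringBuilder html = new StringBuilder();
            // InputStreamReader turns the response bytes into characters
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8))) {
                String val;
                while ((val = br.readLine()) != null) {
                    html.append(val).append('\n');
                }
            }
            return html.toString();
        } finally {
            conn.disconnect();
        }
    }

    // Assumption: serves one small page from a temporary server on localhost,
    // fetches it with fetch(), then shuts the server down again
    static String fetchFromLocalServer() {
        byte[] page = "<html><body><h1>Hello</h1></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick any free port
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/article.html", exchange -> {
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(page);
                }
            });
            server.start();
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/article.html").toURL();
            return fetch(url);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(fetchFromLocalServer());
    }
}
```

For a real page, fetch can be called with any http:// or https:// URL. The explicit status check turns an error response such as 404 into an exception instead of returning the error page's content.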
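The introduction mentions checking whether a particular element exists once the HTML has been extracted. As a rough sketch of that follow-up step, the example below runs two simple regular-expression checks on an HTML string such as the one built in Approach 1. The class name HtmlTagCheck and the helpers containsTag and extractTitle are hypothetical names chosen for this illustration. Regular expressions are not a real HTML parser and will miss many valid documents, so a dedicated parsing library such as jsoup is likely a better fit for anything beyond quick checks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch
public class HtmlTagCheck {

    // Returns true if the text contains an opening tag with the given name,
    // e.g. <p>, <p class="x"> or <br/>
    static boolean containsTag(String html, String tagName) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(\\s[^>]*)?/?>",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(html).find();
    }

    // Returns the trimmed text between <title> and </title>, or null if absent
    static String extractTitle(String html) {
        Matcher m = Pattern.compile(
                "<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Stand-in for the String produced by either approach above
        String html = "<html><head><title> Sample Article </title></head>"
                + "<body><p>Hello</p></body></html>";

        System.out.println(containsTag(html, "p"));     // prints true
        System.out.println(containsTag(html, "table")); // prints false
        System.out.println(extractTitle(html));         // prints Sample Article
    }
}
```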