Java-web-scraping

A quick guide, with code examples, on how to use Java for web scraping

While some people prefer Python, Java is another popular option for web scraping. Here is a step-by-step guide on how to do it.

Before you begin, ensure that you have the following set up on your computer so that the environment is optimal for web scraping:

Java 11 – Newer versions are available, but this long-term support release is still widely used by developers.

Maven – A build automation tool that also handles dependency management.

IntelliJ IDEA – An integrated development environment (IDE) for developing software written in Java.

HtmlUnit – A headless browser for Java that simulates browser activity (e.g. form submission).

You can check installations with these commands:

java -version
mvn -v

Alternative Solution

Bright Data's Web Scraping API offers a fully automated solution for data collection. Skip the complexities of setting up and maintaining your scrapers—simply define your target site, desired dataset, and output format. Whether you need structured data in real-time or scheduled deliveries, Bright Data's robust tools ensure accuracy, scalability, and ease of use. Perfect for professionals who value efficiency and reliability in their data operations.

Now, let's continue with our Java scraper.

Step One: Inspect your target page

Go to the target site you want to collect data from, right-click anywhere on the page, and select ‘Inspect Element’ to open the ‘Developer Console’, which gives you access to the web page's HTML.

Step Two: Begin scraping the HTML

Open IntelliJ IDEA and create a new Maven project.

Maven projects have a pom.xml file. Navigate to the pom.xml file, and first set up the JDK version for your project:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
</properties>

And then add the HtmlUnit dependency to the pom.xml file as follows:

<dependencies>
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.63.0</version>
    </dependency>
</dependencies>

Now everything is set up to begin writing the first Java class. Start by creating a new Java source file in the project's src/main/java directory.

Next, create a main method, which serves as the application's entry point:

public static void main(String[] args) throws IOException {
    // The scraping code from the following steps goes here
}

To send HTTP requests with HtmlUnit, first add the following imports at the top of the source file:

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import java.io.IOException;
import java.util.List;

Now create a WebClient by setting the options as follows:

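private static WebClient createWebClient() {
    // Simulate a Chrome browser
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    // Do not abort on JavaScript errors (only relevant if JavaScript is enabled)
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    // Skip CSS and JavaScript processing; only the static HTML is needed here
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(false);
    return webClient;
}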
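HtmlUnit simulates a full browser, which helps with forms and sessions. For pages whose HTML can be retrieved with a single GET request, a lighter option is the `java.net.http.HttpClient` that ships with Java 11. It is swapped in here purely for comparison and is not used elsewhere in this guide. The sketch below is a minimal illustration under that assumption; the class name `PlainHttpFetch`, the User-Agent string, and the `https://example.com` URL are placeholders, not part of the original project.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class PlainHttpFetch {

    // Builds a GET request with a timeout and a browser-like User-Agent header.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)")
                .GET()
                .build();
    }

    // Sends the request and returns the response body (the raw HTML) as text.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Placeholder URL; call fetch(...) instead to actually send the request.
        HttpRequest request = buildRequest("https://example.com");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Unlike HtmlUnit, this approach returns only the response body as a string, so parsing and XPath queries would have to be handled separately.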

Step Three: Extract/parse the data from the HTML

Now let’s extract the price data we are interested in, using **HtmlUnit**'s built-in methods. Here is what that looks like for the **product price**:

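WebClient webClient = createWebClient();

try {
    String link = "https://www.ebay.com/itm/332852436920?epid=108867251&hash=item4d7f8d1fb8:g:cvYAAOSwOIlb0NGY";
    HtmlPage page = webClient.getPage(link);

    System.out.println(page.getTitleText());

    // Select the price element by its id; the id is specific to this page layout
    String xpath = "//*[@id=\"mm-saleDscPrc\"]";
    HtmlSpan priceSpan = page.getFirstByXPath(xpath);

    if (priceSpan == null) {
        // getFirstByXPath returns null when nothing matches the expression
        System.err.println("Price element not found for XPath: " + xpath);
    } else {
        String price = priceSpan.asNormalizedText();
        System.out.println(price);

        // writeCsvFile is defined in Step Four
        CsvWriter.writeCsvFile(link, price);
    }
} catch (FailingHttpStatusCodeException | IOException e) {
    e.printStackTrace();
} finally {
    // Always release the resources held by the simulated browser
    webClient.close();
}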

To get the XPath of the desired element, use the Developer Console: right-click the selected element and click “Copy XPath”. This copies an XPath expression that identifies the selected element.
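To see how an expression such as `//*[@id="mm-saleDscPrc"]` picks out an element, here is a minimal sketch that uses only the JDK's built-in `javax.xml.xpath` API rather than HtmlUnit. The markup, the price value, and the class name `XPathDemo` are made up for illustration. Real pages are rarely well-formed XML, which is one reason the guide relies on HtmlUnit to parse them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathDemo {

    // Evaluates an XPath expression against a well-formed XML/XHTML string and
    // returns the text of the first matching node (an empty string if none match).
    static String selectText(String xml, String expression) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, document).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified markup; not the real product page.
        String html = "<html><body><div>"
                + "<span id=\"mm-saleDscPrc\">US $24.99</span>"
                + "</div></body></html>";
        System.out.println(selectText(html, "//*[@id=\"mm-saleDscPrc\"]"));
    }
}
```

The expression reads as "any element, anywhere in the document, whose id attribute equals mm-saleDscPrc", so it keeps working if the element moves but breaks if the site renames the id.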

Web pages contain links, text, graphics, and tables. If you select the XPath of a table, you can export its contents to CSV and then run further calculations and analysis in programs such as Microsoft Excel. In the next step, we will export the scraped data as a CSV file.
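As a rough illustration of the table case, the sketch below selects a table by XPath and turns each row into one comma-separated line. It again uses the JDK's `javax.xml.xpath` API on a small, well-formed snippet; the class name `TableToCsv`, the table id `specs`, and the sample data are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableToCsv {

    // Turns every <tr> of the first table matched by tableXPath into one CSV line.
    // Cell values are joined as-is; values containing commas would need quoting.
    static List<String> toCsvLines(String xhtml, String tableXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(
                    "(" + tableXPath + ")[1]//tr", doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Header (th) and data (td) cells are treated the same way.
                NodeList cells = (NodeList) xpath.evaluate(
                        "th|td", rows.item(i), XPathConstants.NODESET);
                List<String> values = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    values.add(cells.item(j).getTextContent().trim());
                }
                lines.add(String.join(",", values));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample table.
        String xhtml = "<html><body><table id=\"specs\">"
                + "<tr><th>Item</th><th>Price</th></tr>"
                + "<tr><td>Desk lamp</td><td>24.99</td></tr>"
                + "</table></body></html>";
        toCsvLines(xhtml, "//table[@id='specs']").forEach(System.out::println);
    }
}
```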

Step Four: Exporting the data

Now that the data has been parsed, we can export it to CSV for further analysis. CSV is a convenient choice because it can be opened directly in Microsoft Excel and most other spreadsheet tools. Here is a method that accomplishes this:

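// Requires: import java.io.File; import java.io.FileWriter; import java.io.IOException;
public static void writeCsvFile(String link, String price) throws IOException {
    // Rows are appended, so write the header only when the file is new or empty
    File csvFile = new File("export.csv");
    boolean needsHeader = !csvFile.exists() || csvFile.length() == 0;

    // try-with-resources closes the writer even if a write fails
    try (FileWriter writer = new FileWriter(csvFile, true)) {
        if (needsHeader) {
            writer.write("link, price\n");
        }
        // End each row with a newline so the next append starts on its own line
        writer.write(link + ", " + price + "\n");
    }
}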
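One thing to watch for when writing rows by hand: a scraped value may itself contain a comma (a price such as "US $1,299.00", for example), which would split it across two columns. Below is a minimal sketch of a quoting helper that follows the convention described in RFC 4180. The class name `CsvFields` and its methods are hypothetical and not part of the original project.

```java
class CsvFields {

    // Quotes a field when it contains a comma, a double quote, or a line break,
    // and doubles any embedded double quotes.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields into a single CSV row (without a trailing newline).
    static String toRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(fields[i]));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Placeholder link and price used only for illustration.
        System.out.println(toRow("https://example.com/item/1", "US $1,299.00"));
        // prints: https://example.com/item/1,"US $1,299.00"
    }
}
```

With such a helper, the data row could be written as `toRow(link, price)` followed by a newline, so each value stays in its own column.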

Conclusion

Although Java can help professionals in various fields extract the data they need, building and maintaining a web scraper can be quite time-consuming. To fully automate your data collection operations, you can use a tool like Bright Data's Web Scraping API. All you need to do is choose the target site and output dataset, and then select your desired schedule, file format, and delivery method.
