Web Scraping with Jsoup only functioning half the time

I’ve been playing around with the Java Jsoup library lately in an attempt to get a better understanding of web scraping (pulling data off a website). But it would seem that the code I managed to put together only functions part of the time. Is the issue with my code, or is it possible that certain sites have measures to stop web scraping?
Here is the class that does all the ‘magic’:
import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;

public class HTMLParser {

    private Document d;
    private String url;
    private String content;

    public HTMLParser(String url) {
        this.url = url;
        connect();
        parse();
        display();
    }

    // Fetch and parse the page at url; any IOException is silently swallowed here.
    private void connect() {
        try {
            d = Jsoup.connect(url).get();
        } catch (IOException e) {}
    }

    // Extract the text of the page body (d is still null if connect() failed).
    private void parse() {
        content = d.body().text();
    }

    // Print the extracted text.
    private void display() {
        System.out.println(content);
    }
}
…………………………

You might also have a problem if the site loads its data dynamically, which is common in this age of AJAX: Jsoup only parses the HTML the server sends back and does not run any JavaScript. Does Jsoup ignore robots.txt, or can you make it do so?
Ideally you need to render the page, and THEN scrape it.
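To see why rendering matters, here is a minimal sketch that parses an in-memory page with Jsoup. The page, the class name StaticHtmlDemo and the helper textOf are all made up for illustration; the only assumption about the real site is that it fills in some element from a script, the way an AJAX-driven page would.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticHtmlDemo {

    // Hypothetical page: the <div> is empty in the HTML the server sends,
    // and the script would only fill it in once a browser executes it.
    static final String HTML =
            "<html><body>"
          + "<h1>Prices</h1>"
          + "<div id=\"prices\"></div>"
          + "<script>document.getElementById('prices').textContent = 'EUR 42';</script>"
          + "</body></html>";

    // Returns the text Jsoup finds inside the element with the given id.
    static String textOf(String html, String id) {
        Document d = Jsoup.parse(html);
        return d.getElementById(id).text();
    }

    public static void main(String[] args) {
        Document d = Jsoup.parse(HTML);
        // Same call as in the question's parse() method.
        System.out.println("body text: " + d.body().text());
        // The script never ran, so the div has no text to scrape.
        System.out.println("prices div empty? " + textOf(HTML, "prices").isEmpty());
    }
}
```

If a scraper comes back with a page skeleton but none of the data you can see in a browser, this is a likely cause, and a fetch-and-parse library cannot fix it without something that actually executes the scripts.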
This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp. There’s certainly an API, which might allow you to look into the web page’s structure.
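On the robots.txt question: to my knowledge Jsoup does not consult robots.txt at all, so honouring it would be up to the scraper. Below is a rough sketch of such a check using only the JDK. The class RobotsCheck and its methods are hypothetical, and the parsing is deliberately simplified: it only looks at Disallow prefixes under "User-agent: *" and ignores Allow lines, wildcards and groups that list several user agents.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Collects the Disallow prefixes that appear under "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equalsIgnoreCase("Disallow") && inStarGroup && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    // True if no collected Disallow prefix matches the start of the path.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Sample robots.txt content, made up for this example.
        String robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: otherbot\nDisallow: /\n";
        System.out.println(isAllowed(robots, "/index.html"));
        System.out.println(isAllowed(robots, "/private/data.html"));
    }
}
```

In a real scraper the robots.txt text would be downloaded from the site's root before any page is requested, and a complete parser would be a better choice than this prefix check.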
