Java web scraper: identifying useful links

arahant_neo · October 5, 2017, 12:34pm

Hi, I am using Java for web scraping. At the moment I go through the HTML structure of each page I want to scrape and selectively pick out the Elements that contain necessary/useful links. But this is very primitive and inefficient: I cannot inspect the HTML structure of every webpage I want to scrape and change my code every time.
What I need is to extract only the useful and/or necessary links, not every link. Is there any algorithm to identify such useful links?

E.g., pages have links for sharing on FB, by email, etc. I don’t need to extract any of those, only the useful links pointing to actual web documents.
For example, on the Oracle Java documentation site, I want to extract only the links relating to the Java API, and nothing more.

So, I’m looking for an algorithm to identify such links.

Thank you.