Java Web Scraper
A complete web scraping application demonstrating:
- CLI Interface
- MVC Architecture
- Command Pattern
- Strategy Pattern
- Custom Exception Hierarchy
Features
- 3 different scraping strategies:
  - news_scraper: scrapes quotes from http://quotes.toscrape.com
  - books_scraper: scrapes books from https://books.toscrape.com
  - tech_news_scraper: scrapes news from https://www.bbc.com/news
- Saves data to JSON files
- Command-line interface
- Extensible architecture
Building
cd java-scraper
mvn clean package
Usage
List available scrapers:
mvn exec:java -Dexec.mainClass="com.scraper.Main" -Dexec.args="list"
Scrape using a specific strategy:
mvn exec:java -Dexec.mainClass="com.scraper.Main" -Dexec.args="scrape news_scraper"
Scrape all:
mvn exec:java -Dexec.mainClass="com.scraper.Main" -Dexec.args="scrape all"
Custom output directory:
mvn exec:java -Dexec.mainClass="com.scraper.Main" -Dexec.args="scrape news_scraper --output my_data"
Using the built JAR:
java -jar target/java-scraper-1.0-SNAPSHOT.jar list
java -jar target/java-scraper-1.0-SNAPSHOT.jar scrape news_scraper
Architecture
MVC
- Model: ScrapedItem, ScrapedData
- View: ConsoleView
- Controller: ScraperController
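To show how the three MVC roles could fit together, here is a minimal sketch. Only the class names come from this project; the fields (title, url), the method names (render, show, scrape, run) and the canned data are assumptions made for illustration, and the real classes may look different.

```java
import java.util.ArrayList;
import java.util.List;

// Model: one scraped record. The title/url fields are assumed for illustration.
class ScrapedItem {
    final String title;
    final String url;

    ScrapedItem(String title, String url) {
        this.title = title;
        this.url = url;
    }
}

// Model: all items collected from one source.
class ScrapedData {
    final String source;
    final List<ScrapedItem> items = new ArrayList<>();

    ScrapedData(String source) {
        this.source = source;
    }
}

// View: formats results for the terminal; it never fetches or stores data.
class ConsoleView {
    String render(ScrapedData data) {
        StringBuilder out = new StringBuilder();
        out.append(data.source).append(": ").append(data.items.size()).append(" item(s)\n");
        for (ScrapedItem item : data.items) {
            out.append("  ").append(item.title).append(" <").append(item.url).append(">\n");
        }
        return out.toString();
    }

    void show(ScrapedData data) {
        System.out.print(render(data));
    }
}

// Controller: builds the model and hands it to the view.
class ScraperController {
    private final ConsoleView view;

    ScraperController(ConsoleView view) {
        this.view = view;
    }

    // Stand-in for a real scrape: returns canned data instead of hitting the network.
    ScrapedData scrape(String source) {
        ScrapedData data = new ScrapedData(source);
        data.items.add(new ScrapedItem("Example title", "https://example.com/1"));
        return data;
    }

    void run(String source) {
        view.show(scrape(source));
    }
}

class MvcDemo {
    public static void main(String[] args) {
        new ScraperController(new ConsoleView()).run("books_scraper");
    }
}
```

Keeping render separate from show means the formatting can be checked without capturing console output.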
Command Pattern
- Command interface
- ScrapeCommand
- ListCommand
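A rough sketch of how the list and scrape subcommands might be dispatched through a shared Command interface. The execute(List<String>) signature, the CommandDispatcher class and the returned strings are hypothetical; only the names Command, ScrapeCommand and ListCommand appear in this project.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assumed shape of the Command interface; the real signature may differ.
interface Command {
    String execute(List<String> args);
}

// Lists the registered scraper names, one per line.
class ListCommand implements Command {
    private final List<String> scraperNames;

    ListCommand(List<String> scraperNames) {
        this.scraperNames = scraperNames;
    }

    @Override
    public String execute(List<String> args) {
        return String.join("\n", scraperNames);
    }
}

// Placeholder scrape command: reports what it would scrape instead of doing it.
class ScrapeCommand implements Command {
    @Override
    public String execute(List<String> args) {
        if (args.isEmpty()) {
            return "usage: scrape <name|all>";
        }
        return "scraping " + args.get(0);
    }
}

// Hypothetical dispatcher: maps the first CLI token to a Command object.
class CommandDispatcher {
    private final Map<String, Command> commands = new LinkedHashMap<>();

    void register(String name, Command command) {
        commands.put(name, command);
    }

    String dispatch(String[] argv) {
        if (argv.length == 0 || !commands.containsKey(argv[0])) {
            return "unknown command";
        }
        // Everything after the command name is passed through as arguments.
        List<String> rest = Arrays.asList(argv).subList(1, argv.length);
        return commands.get(argv[0]).execute(rest);
    }
}

class CommandDemo {
    public static void main(String[] args) {
        CommandDispatcher dispatcher = new CommandDispatcher();
        dispatcher.register("list", new ListCommand(
                Arrays.asList("news_scraper", "books_scraper", "tech_news_scraper")));
        dispatcher.register("scrape", new ScrapeCommand());
        System.out.println(dispatcher.dispatch(new String[] {"scrape", "news_scraper"}));
    }
}
```

With this layout, adding a new subcommand is a matter of writing one class and registering it; the dispatcher itself does not change.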
Strategy Pattern
- ScraperStrategy interface
- NewsScraperStrategy
- BooksScraperStrategy
- TechNewsScraperStrategy
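The sketch below illustrates how a name such as news_scraper, or the special value all, could be resolved to one or more strategies. ScraperStrategy is the only name taken from this project; its methods (name, url, scrape), the StubStrategy stand-in and the StrategyRegistry class are assumptions, and the stub returns a fixed string rather than fetching anything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assumed shape of the strategy interface; the real one may expose other methods.
interface ScraperStrategy {
    String name();
    String url();
    List<String> scrape();
}

// Hypothetical stand-in for NewsScraperStrategy and friends: no network access.
class StubStrategy implements ScraperStrategy {
    private final String name;
    private final String url;

    StubStrategy(String name, String url) {
        this.name = name;
        this.url = url;
    }

    @Override public String name() { return name; }
    @Override public String url() { return url; }

    @Override
    public List<String> scrape() {
        return Collections.singletonList("item from " + url);
    }
}

// Hypothetical registry: looks strategies up by the name used on the command line.
class StrategyRegistry {
    private final Map<String, ScraperStrategy> strategies = new LinkedHashMap<>();

    void register(ScraperStrategy strategy) {
        strategies.put(strategy.name(), strategy);
    }

    // "all" selects every registered strategy in registration order.
    List<ScraperStrategy> resolve(String selector) {
        if ("all".equals(selector)) {
            return new ArrayList<>(strategies.values());
        }
        ScraperStrategy strategy = strategies.get(selector);
        if (strategy == null) {
            throw new IllegalArgumentException("Unknown scraper: " + selector);
        }
        return Collections.singletonList(strategy);
    }
}

class StrategyDemo {
    public static void main(String[] args) {
        StrategyRegistry registry = new StrategyRegistry();
        registry.register(new StubStrategy("news_scraper", "http://quotes.toscrape.com"));
        registry.register(new StubStrategy("books_scraper", "https://books.toscrape.com"));
        registry.register(new StubStrategy("tech_news_scraper", "https://www.bbc.com/news"));
        for (ScraperStrategy s : registry.resolve("all")) {
            System.out.println(s.name() + " -> " + s.scrape());
        }
    }
}
```

The caller only depends on the interface, so a new site can be supported by adding one strategy class and one register call.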
Exception Hierarchy
- ScraperException (base)
  - NetworkException
  - ParseException
  - StorageException
  - StrategyException
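One way such a hierarchy might be declared and used is sketched here. The class names match the list above, but whether the real exceptions are checked or unchecked, and which constructors they offer, is not stated in this README; the sketch assumes checked exceptions with a (message, cause) constructor. ExceptionDemo and its methods are hypothetical.

```java
// Base type: catching this handles every scraper-specific failure.
class ScraperException extends Exception {
    ScraperException(String message, Throwable cause) {
        super(message, cause);
    }
}

class NetworkException extends ScraperException {
    NetworkException(String message, Throwable cause) { super(message, cause); }
}

// Note: unrelated to java.text.ParseException, which is not imported here.
class ParseException extends ScraperException {
    ParseException(String message, Throwable cause) { super(message, cause); }
}

class StorageException extends ScraperException {
    StorageException(String message, Throwable cause) { super(message, cause); }
}

class StrategyException extends ScraperException {
    StrategyException(String message, Throwable cause) { super(message, cause); }
}

class ExceptionDemo {
    // Turns any scraper failure into a short, user-facing message.
    static String describe(ScraperException e) {
        if (e instanceof NetworkException) return "network: " + e.getMessage();
        if (e instanceof ParseException) return "parse: " + e.getMessage();
        if (e instanceof StorageException) return "storage: " + e.getMessage();
        if (e instanceof StrategyException) return "strategy: " + e.getMessage();
        return "scraper: " + e.getMessage();
    }

    // A single catch on the base type covers all four subclasses.
    static String run(boolean failWithNetworkError) {
        try {
            if (failWithNetworkError) {
                throw new NetworkException("timeout", null);
            }
            return "ok";
        } catch (ScraperException e) {
            return describe(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true));
    }
}
```

A shared base class lets the CLI report any scraper failure in one place, while callers that care about a specific cause can still catch the subclass.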
Requirements
- Java 11 or higher
- Maven