University News Crawler

A Java homework project that crawls news from the following university sites:

  • https://news.hnu.edu.cn/
  • https://news.csu.edu.cn/
  • https://news.hunnu.edu.cn/

The code demonstrates the required architecture:

  • Interactive command-line interface (CLI)
  • MVC: model, view, controller
  • Command pattern: command package
  • Strategy pattern: strategy package, one strategy per target website
  • Custom exception hierarchy: exception package
  • File persistence: JSON or CSV output
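
As a rough illustration of how the strategy package and the exception package could fit together, here is a minimal sketch. All class and method names (SiteStrategy, StrategyRegistry, CrawlerException, UnknownSiteException) are hypothetical and not taken from the project source, the site keys "csu" and "hunnu" are assumed by analogy with "hnu", and the exceptions are unchecked only to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical root of the custom exception hierarchy (unchecked for brevity).
class CrawlerException extends RuntimeException {
    CrawlerException(String message) { super(message); }
}

// Hypothetical subtype raised when a site key has no registered strategy.
class UnknownSiteException extends CrawlerException {
    UnknownSiteException(String siteKey) { super("Unknown site: " + siteKey); }
}

// Strategy interface: one implementation per target website.
interface SiteStrategy {
    String siteKey();
    String baseUrl();
}

class HnuStrategy implements SiteStrategy {
    public String siteKey() { return "hnu"; }
    public String baseUrl() { return "https://news.hnu.edu.cn/"; }
}

class CsuStrategy implements SiteStrategy {
    public String siteKey() { return "csu"; } // assumed key
    public String baseUrl() { return "https://news.csu.edu.cn/"; }
}

class HunnuStrategy implements SiteStrategy {
    public String siteKey() { return "hunnu"; } // assumed key
    public String baseUrl() { return "https://news.hunnu.edu.cn/"; }
}

// Maps a --site value to the strategies that should run.
class StrategyRegistry {
    private final Map<String, SiteStrategy> strategies = new LinkedHashMap<>();

    StrategyRegistry(List<SiteStrategy> all) {
        for (SiteStrategy strategy : all) {
            strategies.put(strategy.siteKey(), strategy);
        }
    }

    // "all" selects every registered strategy; any other key selects exactly one.
    List<SiteStrategy> resolve(String siteKey) {
        if ("all".equals(siteKey)) {
            return List.copyOf(strategies.values());
        }
        SiteStrategy strategy = strategies.get(siteKey);
        if (strategy == null) {
            throw new UnknownSiteException(siteKey);
        }
        return List.of(strategy);
    }
}
```

With this shape, supporting another website would mean adding one more SiteStrategy implementation and registering it, without touching the controller or the commands.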

Run

mvn test
mvn exec:java -Dexec.args="crawl --site all --limit 5 --format json --out data/news.json"

To start the interactive CLI instead:

mvn exec:java

Useful commands:

help
sites
crawl --site all --limit 10 --format json --out data/news.json
crawl --site hnu --limit 5 --format csv --out data/hnu.csv
exit
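
The commands above could be dispatched along the following lines. This is only a sketch of the Command pattern under assumptions: the names Command, CrawlCommand, OptionParser, and CommandDispatcher are hypothetical, the default option values are guesses, and the crawl command here only echoes its parsed options instead of crawling.

```java
import java.util.HashMap;
import java.util.Map;

// Command pattern: every CLI verb is an object with a single execute method.
interface Command {
    String execute(Map<String, String> options);
}

// Hypothetical crawl command; it only reports the options it would use.
class CrawlCommand implements Command {
    public String execute(Map<String, String> options) {
        // Default values are assumptions, not taken from the project.
        String site = options.getOrDefault("site", "all");
        int limit = Integer.parseInt(options.getOrDefault("limit", "10"));
        String format = options.getOrDefault("format", "json");
        String out = options.getOrDefault("out", "data/news.json");
        return "crawl site=" + site + " limit=" + limit
                + " format=" + format + " out=" + out;
    }
}

// Turns "--name value" pairs into a map; tokens[0] is the command name.
class OptionParser {
    static Map<String, String> parse(String[] tokens) {
        Map<String, String> options = new HashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            String token = tokens[i];
            if (!token.startsWith("--") || i + 1 >= tokens.length) {
                throw new IllegalArgumentException("Expected --name value, got: " + token);
            }
            options.put(token.substring(2), tokens[++i]);
        }
        return options;
    }
}

// Looks up the command by its first token and runs it.
class CommandDispatcher {
    private final Map<String, Command> commands = new HashMap<>();

    void register(String name, Command command) {
        commands.put(name, command);
    }

    String dispatch(String line) {
        String[] tokens = line.trim().split("\\s+");
        Command command = commands.get(tokens[0]);
        if (command == null) {
            return "Unknown command: " + tokens[0];
        }
        return command.execute(OptionParser.parse(tokens));
    }
}
```

Splitting on whitespace keeps the sketch short, but it means an --out path containing spaces would not be parsed correctly.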

Output Fields

Each crawled news item includes:

  • school
  • site key
  • title
  • url
  • publish time
  • source
  • author
  • summary
  • content preview
  • crawled time
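
One way to model these fields and write them as a CSV row is sketched below. The record name NewsItem, the component names, and the CSV column names are assumptions made for illustration and may differ from the project's actual model class. The example uses a Java record, so it assumes Java 16 or later, and it treats every field as a plain string.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model holding the ten output fields listed above.
record NewsItem(String school, String siteKey, String title, String url,
                String publishTime, String source, String author,
                String summary, String contentPreview, String crawledTime) {

    // Assumed column names; the real CSV header may differ.
    static final String CSV_HEADER =
            "school,site_key,title,url,publish_time,source,author,"
            + "summary,content_preview,crawled_time";

    // Joins all fields in header order, escaping each one.
    String toCsvRow() {
        return Stream.of(school, siteKey, title, url, publishTime, source,
                        author, summary, contentPreview, crawledTime)
                .map(NewsItem::escape)
                .collect(Collectors.joining(","));
    }

    // Quotes a field that contains a comma, quote, or line break, and
    // doubles embedded quotes; a missing value becomes an empty field.
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }
}
```

Escaping matters here because news titles and summaries often contain commas and quotation marks, which would otherwise shift columns in the CSV output.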