University News Crawler
Java homework project for crawling:
- https://news.hnu.edu.cn/
- https://news.csu.edu.cn/
- https://news.hunnu.edu.cn/
The code demonstrates the required architecture:
- CLI: interactive command line
- MVC: `model`, `view`, `controller`
- Command pattern: `command` package
- Strategy pattern: `strategy` package, one strategy per target website
- Custom exception hierarchy: `exception` package
- File persistence: JSON or CSV output
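As a rough illustration of how the strategy and exception pieces could fit together, here is a minimal sketch. It is not the project's actual code: every class and method name below (`SiteStrategy`, `StrategyRegistry`, `CrawlerException`, `UnknownSiteException`, and so on) is hypothetical, the site keys `csu` and `hunnu` are guessed from the host names (only `hnu` and `all` appear in the commands in this README), and the exceptions are unchecked purely to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demo entry point: prints each registered site key with its base URL.
class StrategyDemo {
    static StrategyRegistry defaultRegistry() {
        return new StrategyRegistry()
                .register(new HnuStrategy())
                .register(new CsuStrategy())
                .register(new HunnuStrategy());
    }

    public static void main(String[] args) {
        for (SiteStrategy strategy : defaultRegistry().resolve("all")) {
            System.out.println(strategy.siteKey() + " -> " + strategy.baseUrl());
        }
    }
}

// Hypothetical root of the custom exception hierarchy (unchecked for brevity).
class CrawlerException extends RuntimeException {
    CrawlerException(String message) { super(message); }
}

// Hypothetical subtype: raised when no strategy matches a site key.
class UnknownSiteException extends CrawlerException {
    UnknownSiteException(String siteKey) { super("Unknown site: " + siteKey); }
}

// Strategy interface: one implementation per target website.
// A real strategy would also carry that site's fetching and parsing rules.
interface SiteStrategy {
    String siteKey();
    String baseUrl();
}

class HnuStrategy implements SiteStrategy {
    public String siteKey() { return "hnu"; }
    public String baseUrl() { return "https://news.hnu.edu.cn/"; }
}

class CsuStrategy implements SiteStrategy {
    public String siteKey() { return "csu"; }      // assumed key
    public String baseUrl() { return "https://news.csu.edu.cn/"; }
}

class HunnuStrategy implements SiteStrategy {
    public String siteKey() { return "hunnu"; }    // assumed key
    public String baseUrl() { return "https://news.hunnu.edu.cn/"; }
}

// Maps site keys to strategies; "all" expands to every registered strategy.
class StrategyRegistry {
    private final Map<String, SiteStrategy> strategies = new LinkedHashMap<>();

    StrategyRegistry register(SiteStrategy strategy) {
        strategies.put(strategy.siteKey(), strategy);
        return this;
    }

    List<SiteStrategy> resolve(String siteKey) {
        if ("all".equals(siteKey)) {
            return new ArrayList<>(strategies.values());
        }
        SiteStrategy strategy = strategies.get(siteKey);
        if (strategy == null) {
            throw new UnknownSiteException(siteKey);
        }
        return List.of(strategy);
    }
}
```

With this shape, supporting another website would mean adding one more `SiteStrategy` implementation and registering it, without touching the code that resolves `--site` values.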
Run
mvn test
mvn exec:java -Dexec.args="crawl --site all --limit 5 --format json --out data/news.json"
Interactive CLI:
mvn exec:java
Useful commands:
help
sites
crawl --site all --limit 10 --format json --out data/news.json
crawl --site hnu --limit 5 --format csv --out data/hnu.csv
exit
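The commands above all follow a `name --option value ...` shape. The sketch below shows one way such a line could be parsed and dispatched in the spirit of the Command pattern. It is an illustration only: `ParsedCommand`, `CommandParser`, `Command`, and `CommandDispatcher` are hypothetical names, the `crawl` handler is a placeholder that echoes its options instead of crawling, and the real project may parse arguments differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Demo entry point: dispatches one of the README's example commands.
class CliDemo {
    static CommandDispatcher defaultDispatcher() {
        CommandDispatcher dispatcher = new CommandDispatcher();
        // Placeholder handler: reports the parsed options instead of crawling.
        dispatcher.register("crawl", input -> "would crawl with " + input.options);
        return dispatcher;
    }

    public static void main(String[] args) {
        String line = "crawl --site hnu --limit 5 --format csv --out data/hnu.csv";
        System.out.println(defaultDispatcher().dispatch(line));
    }
}

// Hypothetical parsed form of one CLI line: a command name plus its options.
final class ParsedCommand {
    final String name;
    final Map<String, String> options;

    ParsedCommand(String name, Map<String, String> options) {
        this.name = name;
        this.options = options;
    }
}

// Splits a line on whitespace and reads the rest as "--key value" pairs.
final class CommandParser {
    static ParsedCommand parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens[0].isEmpty()) {
            throw new IllegalArgumentException("Empty command");
        }
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i += 2) {
            String flag = tokens[i];
            if (!flag.startsWith("--") || i + 1 >= tokens.length) {
                throw new IllegalArgumentException("Expected --name value near: " + flag);
            }
            options.put(flag.substring(2), tokens[i + 1]);
        }
        return new ParsedCommand(tokens[0], options);
    }
}

// Command pattern: each CLI verb is an object with a single execute method.
interface Command {
    String execute(ParsedCommand input);
}

// Looks up the command object by name and runs it.
class CommandDispatcher {
    private final Map<String, Command> commands = new LinkedHashMap<>();

    void register(String name, Command command) {
        commands.put(name, command);
    }

    String dispatch(String line) {
        ParsedCommand parsed = CommandParser.parse(line);
        Command command = commands.get(parsed.name);
        if (command == null) {
            return "Unknown command: " + parsed.name;
        }
        return command.execute(parsed);
    }
}
```

This simple parser does not handle quoted values containing spaces, which would matter if an output path had a space in it.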
Output Fields
Each crawled news item includes:
- school
- site key
- title
- url
- publish time
- source
- author
- summary
- content preview
- crawled time
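To make the field list concrete, the following sketch models one news item and writes it as a CSV row. The names are assumptions rather than the project's real model: `NewsItem`, `CsvFormat`, the Java field names, and the snake_case header labels are all hypothetical, every field is treated as a plain string, and the `record` syntax assumes Java 16 or later. Quoting follows the usual CSV convention of wrapping fields that contain commas, quotes, or line breaks and doubling embedded quotes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Demo entry point: prints a header line and one row built from placeholder values.
class CsvDemo {
    public static void main(String[] args) {
        NewsItem item = new NewsItem(
                "Example School", "example", "Title, with a comma",
                "https://example.org/news/1", "2024-01-01", "Example Source",
                "Example Author", "Short summary", "Preview text", "2024-01-02");
        System.out.println(CsvFormat.row(NewsItem.CSV_HEADER));
        System.out.println(CsvFormat.row(item.values()));
    }
}

// Hypothetical model with one component per output field listed above.
record NewsItem(String school, String siteKey, String title, String url,
                String publishTime, String source, String author,
                String summary, String contentPreview, String crawledTime) {

    // Assumed column labels; the real CSV header may differ.
    static final List<String> CSV_HEADER = List.of(
            "school", "site_key", "title", "url", "publish_time",
            "source", "author", "summary", "content_preview", "crawled_time");

    // Field values in header order; Arrays.asList tolerates null entries.
    List<String> values() {
        return Arrays.asList(school, siteKey, title, url, publishTime,
                source, author, summary, contentPreview, crawledTime);
    }
}

final class CsvFormat {
    // Quotes a field when it contains a comma, quote, or line break.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins escaped fields with commas to form one CSV line.
    static String row(List<String> fields) {
        return fields.stream().map(CsvFormat::escape).collect(Collectors.joining(","));
    }
}
```

Escaping matters here because titles and summaries scraped from news pages can easily contain commas or quotation marks.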