4 changed files with 460 additions and 0 deletions
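@@ -0,0 +1,358 @@

# Java Crawler Framework

A Java crawler framework built on an MVC architecture. It supports polymorphic extension, so crawlers for new websites are easy to add.

## Features

- **MVC architecture**: clear layering with well-separated responsibilities
- **Polymorphic extension**: implement a new crawler by extending `BaseCrawler`
- **Command-line interface**: interactive commands
- **Automatic detection**: selects a suitable crawler based on the URL
- **Date extraction**: extracts the publish date from the URL
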
## Supported Sites

| Site | Domain | Crawler |
|------|--------|---------|
| Hunan University official site | `*.hnu.edu.cn` | HunanUniversityCrawler |
| Hunan University News | `news.hnu.edu.cn` | HunanUniversityNewsCrawler |
| China Weather | `*.weather.com.cn` | ChinaWeatherCrawler |
| Mount & Blade Chinese site | `*.mountblade.com.cn` | MountBladeCrawler |

## Quick Start

### Build

```bash
# The ** glob needs bash with globstar enabled: shopt -s globstar
javac -d target/classes src/main/java/com/crawler/**/*.java
```

### Run

```bash
java -cp target/classes com.crawler.Main
```

### Command-Line Usage

```
========================================
Java Crawler Framework
========================================

========================================
Java Crawler Framework - Command-Line Mode
========================================
Enter 'help' to see available commands
========================================
> help
Available commands:
---------
help : Show all available commands
list : Show the history of commands used
crawl : Run a crawler; enter a URL and the crawler is selected automatically
exit : Exit the program

> crawl
Enter the URL to crawl: https://www.mountblade.com.cn
Using crawler: MountBladeCrawler
...
```

## Project Structure

```
src/main/java/com/crawler/
├── Main.java                         # Entry point
├── model/
│   ├── CrawlerData.java              # Crawled data model (title, link, source, publish date)
│   └── CrawlerConfig.java            # Crawler configuration (timeout, User-Agent)
├── view/
│   └── CrawlerView.java              # View layer (displays results)
├── controller/
│   └── CrawlerController.java        # Crawler controller
├── crawler/
│   ├── Crawler.java                  # Crawler interface
│   ├── BaseCrawler.java              # Abstract base class for crawlers
│   ├── CrawlerFactory.java           # Crawler factory (automatic crawler selection)
│   └── impl/
│       ├── ExampleCrawler.java       # Generic crawler
│       ├── TestCrawler.java          # Test crawler
│       ├── HunanUniversityCrawler.java
│       ├── HunanUniversityNewsCrawler.java
│       ├── ChinaWeatherCrawler.java
│       └── MountBladeCrawler.java
└── command/
    ├── Command.java                  # Command interface
    ├── BaseCommand.java              # Abstract base class for commands
    ├── CommandHistory.java           # Command history
    ├── HelpCommand.java              # help command
    ├── ListCommand.java              # History (list) command
    ├── CrawlCommand.java             # crawl command
    ├── ExitCommand.java              # exit command
    └── CommandController.java        # Command controller
```

## Adding a New Crawler

Extend `BaseCrawler` and override two methods, `getCrawlerName` and `parseHtml`:

```java
package com.crawler.crawler.impl;

import com.crawler.crawler.BaseCrawler;
import com.crawler.model.CrawlerData;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyWebsiteCrawler extends BaseCrawler {
    private static final String BASE_URL = "https://www.mywebsite.com";

    @Override
    public String getCrawlerName() {
        return "MyWebsiteCrawler";
    }

    @Override
    protected List<CrawlerData> parseHtml(String html) {
        List<CrawlerData> results = new ArrayList<>();

        // Parse the HTML with a regular expression: group 1 is the href, group 2 the link text
        Pattern pattern = Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            CrawlerData data = new CrawlerData();
            data.setTitle(matcher.group(2));
            data.setUrl(normalizeUrl(matcher.group(1)));
            data.setSource(getCrawlerName());
            data.setPublishDate(extractDateFromUrl(matcher.group(1)));
            results.add(data);
        }

        return results;
    }

    // Turns a site-relative path such as "/news/1.html" into an absolute URL
    private String normalizeUrl(String url) {
        if (url.startsWith("/")) {
            return BASE_URL + url;
        }
        return url;
    }

    // Returns the first "/yyyy-MM-dd/" path segment, or null if the URL has none
    private String extractDateFromUrl(String url) {
        Pattern datePattern = Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})/");
        Matcher matcher = datePattern.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

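The `normalizeUrl` helper in the example above only handles hrefs that begin with `/`. As a hedged alternative, `java.net.URI.resolve` also handles page-relative links. The class name `UrlNormalizer` and the base URL below are assumptions made for illustration; they are not part of the framework.

```java
import java.net.URI;

// Hypothetical helper, not part of the framework: resolves hrefs against a base URL.
final class UrlNormalizer {
    private final URI base;

    UrlNormalizer(String baseUrl) {
        // The base should carry a path (at least a trailing "/") so relative hrefs resolve cleanly.
        this.base = URI.create(baseUrl);
    }

    String normalize(String href) {
        // Absolute hrefs come back unchanged; relative ones are resolved against the base.
        return base.resolve(href.trim()).toString();
    }
}
```

A crawler could hold one `UrlNormalizer` per site and call `normalize` in place of the string concatenation shown earlier.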
Then add a matching rule in `CrawlerFactory.java`:

```java
crawlerPatterns.put("MyWebsiteCrawler",
    Pattern.compile(".*mywebsite\\.com.*", Pattern.CASE_INSENSITIVE));
```

And add a case to the `createCrawlerByName` method:

```java
case "MyWebsiteCrawler":
    return new MyWebsiteCrawler();
```

## Architecture

### MVC Pattern

- **Model**: `CrawlerData` (data model), `CrawlerConfig` (configuration)
- **View**: `CrawlerView` (result display)
- **Controller**: `CrawlerController` (crawler control), `CommandController` (command control)

### Polymorphic Design

- The `Crawler` interface defines the standard methods
- `BaseCrawler` provides the shared HTTP request logic
- Each crawler implementation extends `BaseCrawler` and overrides `parseHtml`

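The three roles listed above can be sketched as a small template-method hierarchy. This is a simplified stand-in, not the framework's code: the method signatures, the `fetchHtml` hook, the `TitleCrawler` class, and the use of `List<String>` instead of `List<CrawlerData>` are all assumptions chosen to keep the example self-contained and runnable offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-ins for the framework types; signatures here are assumptions.
interface Crawler {
    String getCrawlerName();
    List<String> crawl();
}

abstract class BaseCrawler implements Crawler {
    // Template method: fetch once, then delegate parsing to the subclass.
    @Override
    public final List<String> crawl() {
        return parseHtml(fetchHtml());
    }

    // The real base class would perform an HTTP request here (assumed hook).
    protected abstract String fetchHtml();

    protected abstract List<String> parseHtml(String html);
}

// Hypothetical implementation that reads <h2> titles from a fixed HTML string.
class TitleCrawler extends BaseCrawler {
    private static final Pattern TITLE = Pattern.compile("<h2>([^<]+)</h2>");
    private final String html;

    TitleCrawler(String html) {
        this.html = html;
    }

    @Override
    public String getCrawlerName() {
        return "TitleCrawler";
    }

    @Override
    protected String fetchHtml() {
        return html;
    }

    @Override
    protected List<String> parseHtml(String html) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(html);
        while (matcher.find()) {
            titles.add(matcher.group(1));
        }
        return titles;
    }
}
```

Callers only depend on `Crawler`, so a new site needs a new subclass and nothing else in the calling code changes.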
### Factory Pattern

`CrawlerFactory` selects a suitable crawler implementation automatically by matching the URL against registered patterns.

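A minimal sketch of pattern-based selection follows. It assumes an insertion-ordered map, so that a more specific rule such as `news.hnu.edu.cn` can be registered before the broader `hnu.edu.cn` rule; the class `CrawlerSelector`, its `select` method, and the exact regular expressions are hypothetical and may not match the real `CrawlerFactory`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch of URL-based crawler selection; not the framework's actual CrawlerFactory.
final class CrawlerSelector {
    // LinkedHashMap keeps insertion order, so more specific patterns can be registered first.
    private final Map<String, Pattern> crawlerPatterns = new LinkedHashMap<>();

    CrawlerSelector() {
        register("HunanUniversityNewsCrawler", ".*news\\.hnu\\.edu\\.cn.*");
        register("HunanUniversityCrawler", ".*hnu\\.edu\\.cn.*");
        register("ChinaWeatherCrawler", ".*weather\\.com\\.cn.*");
        register("MountBladeCrawler", ".*mountblade\\.com\\.cn.*");
    }

    private void register(String name, String regex) {
        crawlerPatterns.put(name, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    // Returns the name of the first crawler whose pattern matches the whole URL, if any.
    Optional<String> select(String url) {
        return crawlerPatterns.entrySet().stream()
                .filter(entry -> entry.getValue().matcher(url).matches())
                .map(Map.Entry::getKey)
                .findFirst();
    }
}
```

Returning an `Optional` leaves the caller to decide what happens when no pattern matches, for example falling back to a generic crawler or raising `UnsupportedCrawlerException`.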
## Configuration

`CrawlerConfig` supports the following settings:

- `timeout`: HTTP request timeout (default: 30000 ms)
- `userAgent`: User-Agent header (default: mimics the Chrome browser)

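As a rough illustration of how these two settings could be applied with the JDK's built-in `java.net.http` client, consider the sketch below. `ConfigSketch` and its methods are hypothetical stand-ins for `CrawlerConfig`; only the field names `timeout` and `userAgent` come from this README.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical stand-in for CrawlerConfig; only the two field names come from the README.
final class ConfigSketch {
    final int timeout;       // milliseconds
    final String userAgent;

    ConfigSketch(int timeout, String userAgent) {
        this.timeout = timeout;
        this.userAgent = userAgent;
    }

    // Applies the timeout at the client level (connection establishment).
    HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(timeout))
                .build();
    }

    // Applies the timeout and the User-Agent header at the request level.
    HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(timeout))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }
}
```

Whether the framework applies the timeout to the connection, the request, or both is not stated here; the sketch simply shows both options.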
## Commands

| Command | Description |
|---------|-------------|
| `help` | Show all available commands |
| `list` | Show the history of commands used |
| `crawl` | Run a crawler: enter the target URL; results can be saved after crawling |
| `cache` | Cache operations: save/load/list/delete |
| `exit` | Exit the program |

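The command table maps naturally onto a name-to-command lookup with a history list. The following is a minimal sketch under that assumption; `CommandDispatcher` and the single-method `Command` interface are simplified, hypothetical versions of the framework's `CommandController` and `Command`, which may look different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in; the framework's real Command interface may declare more methods.
interface Command {
    String execute();
}

// Hypothetical dispatcher: looks up a command by name and records every input.
final class CommandDispatcher {
    private final Map<String, Command> commands = new LinkedHashMap<>();
    private final List<String> history = new ArrayList<>();

    void register(String name, Command command) {
        commands.put(name, command);
    }

    // Records the normalized input, then runs the matching command or reports an unknown one.
    String dispatch(String input) {
        String name = input.trim().toLowerCase();
        history.add(name);
        Command command = commands.get(name);
        return command == null ? "Unknown command: " + name : command.execute();
    }

    // Read-only snapshot of the inputs seen so far, in order.
    List<String> getHistory() {
        return List.copyOf(history);
    }
}
```

With this shape, adding a command such as `cache` is one `register` call, and `list` can be implemented by printing `getHistory()`.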
### cache Sub-operations

| Sub-operation | Description |
|---------------|-------------|
| `save` | Save the current crawl data to a data file |
| `load` | Load data from a data file |
| `list` | List all files in the `data/` directory |
| `delete` | Delete a specific data file or all files |

### Data Directory

The program automatically creates a `data/` directory for saved crawl data files.

### Saving After a Crawl

When a `crawl` command finishes, the program asks whether to save the results:

```
Crawl finished, 10 records retrieved
========================================

Save the crawl results? (y/n): y
Enter the save path (default: data/crawler_data.json):
Data saved to: data/crawler_data.json
```

### Example: Deleting Cache Files

```
> cache
Enter a cache operation (save/load/list/delete): delete
========================================
Files available for deletion:
========================================
[1] crawler_data.json (1024 bytes)
[2] mountblade_data.json (2048 bytes)
[all] Delete all files
========================================
Enter the number of the file to delete, or 'all': 1
Delete 'crawler_data.json'? (y/n): y
Deleted: crawler_data.json
```

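The numbered listing in the session above could be produced with `java.nio.file`. The sketch below is an assumption about one way to do it; `DataDirLister` is a hypothetical name and the framework's cache command may be implemented differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for the listing step; not the framework's actual cache command.
final class DataDirLister {
    // Returns entries such as "[1] crawler_data.json (1024 bytes)", sorted by file name.
    static List<String> list(Path dataDir) throws IOException {
        Files.createDirectories(dataDir); // mirrors the automatic creation of data/
        List<Path> files;
        try (Stream<Path> stream = Files.list(dataDir)) {
            files = stream.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            Path file = files.get(i);
            lines.add("[" + (i + 1) + "] " + file.getFileName() + " (" + Files.size(file) + " bytes)");
        }
        return lines;
    }
}
```

The stream from `Files.list` is closed with try-with-resources because it holds an open directory handle.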
## Sample Output

```
[12]
Title: Mount & Blade II: War Sails v1.2.4 and base game v1.4.4 beta patch notes
Link: https://www.mountblade.com.cn/news/Bannerlord/2026-05-13/3175.html
Source: MountBladeCrawler
Publish date: 2026-05-13
----------------------------------------
```

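The publish date in the sample comes from the `/2026-05-13/` segment of the link. As a sketch of that extraction with added validation, the helper below parses the segment into a `LocalDate`. `UrlDateExtractor` is a hypothetical name, and returning `Optional<LocalDate>` is a design choice made for this example; the README's own code returns the date as a `String`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the README example returns the date as a String instead.
final class UrlDateExtractor {
    // The lookahead keeps the trailing "/" unconsumed so adjacent segments are still examined.
    private static final Pattern DATE_SEGMENT = Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})(?=/)");

    // Returns the first yyyy-MM-dd path segment that is also a valid calendar date.
    static Optional<LocalDate> extract(String url) {
        Matcher matcher = DATE_SEGMENT.matcher(url);
        while (matcher.find()) {
            try {
                return Optional.of(LocalDate.parse(matcher.group(1)));
            } catch (DateTimeParseException ignored) {
                // A segment like "2026-13-40" matches the regex but is not a real date; keep looking.
            }
        }
        return Optional.empty();
    }
}
```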
## Exception Handling

The project uses a layered exception hierarchy that separates checked from unchecked exceptions:

### Exception Categories

| Category | Description | Examples |
|----------|-------------|----------|
| **Checked** | Recoverable; callers are forced to handle them | `HttpRequestException`, `TimeoutException`, `HtmlParseException`, `DataExtractException` |
| **Unchecked** | Programming errors; not recoverable | `InvalidUrlException`, `UnsupportedCrawlerException` |

### Exception Hierarchy

```
CrawlerException (framework root exception, checked)
├── NetworkException (parent of network exceptions)
│   ├── HttpRequestException (HTTP request failed)
│   └── TimeoutException (connection timed out)
└── ParseException (parent of parsing exceptions)
    ├── HtmlParseException (HTML parsing failed)
    └── DataExtractException (data extraction failed)

ConfigurationException (parent of configuration exceptions, unchecked)
├── InvalidUrlException (invalid URL)
└── UnsupportedCrawlerException (unsupported crawler)
```

### Exception Handling Example

```java
try {
    List<CrawlerData> data = crawler.crawl();
    view.showData(data);
} catch (HttpRequestException e) {
    // Most specific exceptions first; the CrawlerException root is caught last
    view.showErrorMessage("HTTP request failed: " + e.getStatusCode());
} catch (TimeoutException e) {
    view.showErrorMessage("Connection timed out, please try again later");
} catch (HtmlParseException e) {
    view.showErrorMessage("HTML parsing failed: " + e.getSourceUrl());
} catch (CrawlerException e) {
    view.showErrorMessage("Crawler execution failed: " + e.getMessage());
}
```

For the full exception design, see [EXCEPTIONS.md](EXCEPTIONS.md).

## Data Serialization

The project provides a Jackson-based JSON serialization utility class for saving crawled data to a file and reading it back.

### Usage Example

```java
import com.crawler.util.JsonSerializer;
import com.crawler.model.CrawlerData;
import java.util.List;

List<CrawlerData> dataList = crawler.crawl();

// Write the crawled data to a JSON file
JsonSerializer.serializeToFile(dataList, "output/crawler_data.json");

// Read the same file back into a list
List<CrawlerData> loadedData = JsonSerializer.deserializeFromFile("output/crawler_data.json");
```

### JsonSerializer Methods

| Method | Description |
|--------|-------------|
| `serializeToFile(List<CrawlerData>, String)` | Serialize a data list to the given file |
| `deserializeFromFile(String)` | Deserialize a data list from a file |
| `toJsonString(List<CrawlerData>)` | Convert a data list to a JSON string |
| `toJsonString(CrawlerData)` | Convert a single record to a JSON string |
| `fromJsonString(String)` | Deserialize a data list from a JSON string |
| `fromJsonStringToSingle(String)` | Deserialize a single record from a JSON string |

### Output Format Example

```json
[
  {
    "title": "News title",
    "content": "News content",
    "url": "https://example.com/news/1",
    "source": "ExampleCrawler",
    "publishDate": "2026-05-21"
  }
]
```

## Tech Stack

- Java 21+
- Java HttpClient (built-in HTTP client)
- Jackson (JSON serialization)
- Regular expressions (HTML parsing)

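## Notes

1. Follow the target site's robots.txt rules
2. Do not send requests too frequently; avoid putting load on the target server
3. Some sites have anti-crawling measures and may require additional request headers
4. Consider obtaining permission from the site before crawling it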
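
One way to act on note 2 is to enforce a minimum interval between requests. The sketch below is an assumption, not a framework feature: `RequestThrottle`, its `reserve` method, and the interval value are hypothetical, and the caller passes the current time in so the logic is easy to test.

```java
// Hypothetical politeness helper; the interval is an assumed value, not a framework setting.
final class RequestThrottle {
    private final long minIntervalMillis;
    private long lastRequestMillis = -1; // -1 means no request has been made yet

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns how long to wait before the next request and records when it will be sent.
    long reserve(long nowMillis) {
        long wait = 0;
        if (lastRequestMillis >= 0) {
            wait = Math.max(0, lastRequestMillis + minIntervalMillis - nowMillis);
        }
        lastRequestMillis = nowMillis + wait;
        return wait;
    }
}
```

A crawler could call `Thread.sleep(throttle.reserve(System.currentTimeMillis()))` before each HTTP request.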
@@ -0,0 +1,79 @@

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.crawler</groupId>
    <artifactId>crawler-framework</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <name>crawler-framework</name>
    <description>Java MVC Crawler Framework</description>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.15.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.datatype</groupId>
            <artifactId>jackson-datatype-jsr310</artifactId>
            <version>2.15.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>11</source>
                    <target>11</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.crawler.Main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Binary file not shown.
@@ -0,0 +1,23 @@

<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Test Page</h1>
    <div class="news-list">
        <article>
            <h2><a href="https://example.com/news1">News title 1 - This is the first test news item</a></h2>
            <p>This is the summary of the first news item...</p>
        </article>
        <article>
            <h2><a href="https://example.com/news2">News title 2 - This is the second test news item</a></h2>
            <p>This is the summary of the second news item...</p>
        </article>
        <article>
            <h2><a href="https://example.com/news3">News title 3 - This is the third test news item</a></h2>
            <p>This is the summary of the third news item...</p>
        </article>
    </div>
</body>
</html>