# Java Crawler Framework

A Java web crawler framework built on an MVC architecture. Crawlers are polymorphic, so support for a new website can be added with little effort.

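## Features

- **MVC architecture**: clear layering with well-separated responsibilities
- **Polymorphic extension**: add a new crawler by extending `BaseCrawler`
- **Command-line interface**: interactive commands
- **Automatic detection**: picks the matching crawler based on the URL
- **Date extraction**: extracts the publication date from the URL
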
## Supported Websites

| Website | Domain | Crawler |
|---------|--------|---------|
| Hunan University official site | `*.hnu.edu.cn` | HunanUniversityCrawler |
| Hunan University News | `news.hnu.edu.cn` | HunanUniversityNewsCrawler |
| China Weather | `*.weather.com.cn` | ChinaWeatherCrawler |
| Mount & Blade Chinese site | `*.mountblade.com.cn` | MountBladeCrawler |

## Quick Start

### Build

```bash
# ** only recurses in shells with globstar-style globbing, so collect the sources with find
javac -d target/classes $(find src/main/java -name "*.java")
```

The JSON serialization utilities depend on Jackson, so its JARs also need to be on the classpath (`-cp`) when compiling and running.

### Run

```bash
java -cp target/classes com.crawler.Main
```

### Command-Line Usage

```
========================================
Java Crawler Framework
========================================

========================================
Java Crawler Framework - Command-Line Mode
========================================
Type 'help' to list the available commands
========================================
> help
Available commands:
---------
help : show all available commands
list : show the history of commands used
crawl : run a crawler; enter a URL and the crawler is selected automatically
exit : quit the program

> crawl
Enter the URL to crawl: https://www.mountblade.com.cn
Using crawler: MountBladeCrawler
...
```

## Project Structure

```
src/main/java/com/crawler/
├── Main.java                    # Entry point
├── model/
│   ├── CrawlerData.java         # Crawled data model (title, link, source, publication date)
│   └── CrawlerConfig.java       # Crawler configuration (timeout, User-Agent)
├── view/
│   └── CrawlerView.java         # View layer (displays results)
├── controller/
│   └── CrawlerController.java   # Crawler controller
├── crawler/
│   ├── Crawler.java             # Crawler interface
│   ├── BaseCrawler.java         # Abstract base crawler
│   ├── CrawlerFactory.java      # Crawler factory (automatic crawler selection)
│   └── impl/
│       ├── ExampleCrawler.java  # Generic crawler
│       ├── TestCrawler.java     # Test crawler
│       ├── HunanUniversityCrawler.java
│       ├── HunanUniversityNewsCrawler.java
│       ├── ChinaWeatherCrawler.java
│       └── MountBladeCrawler.java
└── command/
    ├── Command.java             # Command interface
    ├── BaseCommand.java         # Abstract base command
    ├── CommandHistory.java      # Command history
    ├── HelpCommand.java         # Help command
    ├── ListCommand.java         # History command
    ├── CrawlCommand.java        # Crawl command
    ├── ExitCommand.java         # Exit command
    └── CommandController.java   # Command controller
```

## Adding a New Crawler

Extend `BaseCrawler` and override two methods, `getCrawlerName` and `parseHtml`:

```java
package com.crawler.crawler.impl;

import com.crawler.crawler.BaseCrawler;
import com.crawler.model.CrawlerData;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyWebsiteCrawler extends BaseCrawler {
    private static final String BASE_URL = "https://www.mywebsite.com";

    @Override
    public String getCrawlerName() {
        return "MyWebsiteCrawler";
    }

    @Override
    protected List<CrawlerData> parseHtml(String html) {
        List<CrawlerData> results = new ArrayList<>();

        // Parse the HTML with a regular expression: group 1 is the href, group 2 the link text
        Pattern pattern = Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            CrawlerData data = new CrawlerData();
            data.setTitle(matcher.group(2));
            data.setUrl(normalizeUrl(matcher.group(1)));
            data.setSource(getCrawlerName());
            data.setPublishDate(extractDateFromUrl(matcher.group(1)));
            results.add(data);
        }

        return results;
    }

    // Turn a site-relative path into an absolute URL; other URLs pass through unchanged
    private String normalizeUrl(String url) {
        if (url.startsWith("/")) {
            return BASE_URL + url;
        }
        return url;
    }

    // Return the yyyy-MM-dd path segment of the URL, or null if there is none
    private String extractDateFromUrl(String url) {
        Pattern datePattern = Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})/");
        Matcher matcher = datePattern.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Then register a detection rule in `CrawlerFactory.java`:

```java
crawlerPatterns.put("MyWebsiteCrawler",
    Pattern.compile(".*mywebsite\\.com.*", Pattern.CASE_INSENSITIVE));
```

Finally, add a case to the `createCrawlerByName` method:

```java
case "MyWebsiteCrawler":
    return new MyWebsiteCrawler();
```

## Architecture

### MVC Pattern

- **Model**: `CrawlerData` (data model), `CrawlerConfig` (configuration)
- **View**: `CrawlerView` (result display)
- **Controller**: `CrawlerController` (crawler control), `CommandController` (command control)

### Polymorphic Design

- The `Crawler` interface defines the standard methods
- `BaseCrawler` provides the shared HTTP request logic
- Each concrete crawler extends `BaseCrawler` and overrides `parseHtml`

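As a rough illustration of how an interface, a base class, and a concrete crawler can fit together in this way, here is a minimal sketch. It is not the project's source: the `Item` record stands in for `CrawlerData`, and the `*Sketch` class names, the `targetUrl` field, and the `fetchHtml` method are assumptions made for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for CrawlerData, reduced to two fields.
record Item(String title, String url) {}

// The interface fixes what every crawler exposes.
interface CrawlerSketch {
    String getCrawlerName();
    List<Item> crawl() throws IOException, InterruptedException;
}

// The base class owns the shared HTTP step; subclasses only supply the parsing.
abstract class BaseCrawlerSketch implements CrawlerSketch {
    private final String targetUrl; // assumption: each crawler is given one start URL

    protected BaseCrawlerSketch(String targetUrl) {
        this.targetUrl = targetUrl;
    }

    @Override
    public List<Item> crawl() throws IOException, InterruptedException {
        return parseHtml(fetchHtml(targetUrl));
    }

    // Shared download logic, using the JDK's built-in HttpClient.
    protected String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(30_000))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(30_000))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    protected abstract List<Item> parseHtml(String html);
}

// One concrete crawler: only the site-specific parts are written here.
class LinkCrawlerSketch extends BaseCrawlerSketch {
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");

    LinkCrawlerSketch(String targetUrl) {
        super(targetUrl);
    }

    @Override
    public String getCrawlerName() {
        return "LinkCrawlerSketch";
    }

    @Override
    protected List<Item> parseHtml(String html) {
        List<Item> items = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            items.add(new Item(m.group(2), m.group(1)));
        }
        return items;
    }
}
```

Calling code only needs the interface type, so `crawl()` can be invoked on any crawler without knowing which site it targets. Keeping the download in one overridable method also makes it possible to test the parsing with canned HTML.
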
### Factory Pattern

`CrawlerFactory` automatically selects the matching crawler implementation based on URL patterns.

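One way such URL-based selection could work is sketched below. This is an illustration under stated assumptions, not the project's `CrawlerFactory`: the class name `CrawlerFactorySketch`, the method `selectCrawlerName`, the exact regular expressions, and the fallback to `ExampleCrawler` are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of URL-based crawler selection; the rules and the fallback are assumptions.
class CrawlerFactorySketch {
    // LinkedHashMap keeps insertion order, so more specific rules can be listed first.
    private static final Map<String, Pattern> CRAWLER_PATTERNS = new LinkedHashMap<>();

    static {
        // news.hnu.edu.cn also matches the general hnu.edu.cn rule, so it is registered first.
        register("HunanUniversityNewsCrawler", ".*news\\.hnu\\.edu\\.cn.*");
        register("HunanUniversityCrawler", ".*hnu\\.edu\\.cn.*");
        register("ChinaWeatherCrawler", ".*weather\\.com\\.cn.*");
        register("MountBladeCrawler", ".*mountblade\\.com\\.cn.*");
    }

    private static void register(String name, String regex) {
        CRAWLER_PATTERNS.put(name, Pattern.compile(regex, Pattern.CASE_INSENSITIVE));
    }

    // Returns the name of the first crawler whose pattern matches the whole URL.
    static String selectCrawlerName(String url) {
        for (Map.Entry<String, Pattern> entry : CRAWLER_PATTERNS.entrySet()) {
            if (entry.getValue().matcher(url).matches()) {
                return entry.getKey();
            }
        }
        return "ExampleCrawler"; // assumed fallback: the generic crawler
    }
}
```

Rule order matters whenever two patterns overlap. An unordered map would make the choice between the two Hunan University crawlers depend on iteration order, which is why the sketch uses a `LinkedHashMap`.
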
## Configuration

`CrawlerConfig` supports the following settings:

- `timeout`: HTTP request timeout (default: 30000 ms)
- `userAgent`: User-Agent header (defaults to a string that mimics the Chrome browser)

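To make the two settings concrete, the sketch below shows a small configuration holder and how its values could be applied to a request built with the JDK `HttpClient` API. The class name `CrawlerConfigSketch`, the `toRequest` helper, and the validation rules are assumptions, and the User-Agent value is only a placeholder Chrome-like string, since the project's real default is not shown here.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch of a configuration holder with the two documented settings.
class CrawlerConfigSketch {
    static final int DEFAULT_TIMEOUT_MS = 30_000; // default documented above
    // Placeholder Chrome-like value; the project's actual default string is an unknown.
    static final String DEFAULT_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    private int timeout = DEFAULT_TIMEOUT_MS;
    private String userAgent = DEFAULT_USER_AGENT;

    int getTimeout() {
        return timeout;
    }

    String getUserAgent() {
        return userAgent;
    }

    void setTimeout(int timeoutMs) {
        if (timeoutMs <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeout = timeoutMs;
    }

    void setUserAgent(String userAgent) {
        if (userAgent == null || userAgent.isBlank()) {
            throw new IllegalArgumentException("userAgent must not be blank");
        }
        this.userAgent = userAgent;
    }

    // Hypothetical helper: applies both settings to a GET request for the given URL.
    HttpRequest toRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(timeout))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }
}
```
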
## Commands

| Command | Description |
|---------|-------------|
| `help` | Show all available commands |
| `list` | Show the history of commands used |
| `crawl` | Run a crawler: enter the target URL, then optionally save the results |
| `cache` | Cache operations: save/load/list/delete |
| `exit` | Quit the program |

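A command set like this is commonly implemented as objects looked up by name. The following sketch shows that idea only; `CommandSketch`, `SimpleCommand`, and `CommandControllerSketch` are hypothetical names and do not reproduce the project's `Command`, `BaseCommand`, or `CommandController` classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Each command is an object with a name, a description, and an action.
interface CommandSketch {
    String name();
    String description();
    String execute();
}

// Convenience implementation: the record accessors satisfy name() and description().
record SimpleCommand(String name, String description, Supplier<String> action)
        implements CommandSketch {
    @Override
    public String execute() {
        return action.get();
    }
}

// Looks commands up by name and records every input line.
class CommandControllerSketch {
    private final Map<String, CommandSketch> commands = new LinkedHashMap<>();
    private final List<String> history = new ArrayList<>();

    void register(CommandSketch command) {
        commands.put(command.name(), command);
    }

    // Normalizes the input, stores it in the history, and runs the matching command.
    String dispatch(String input) {
        String name = input.trim().toLowerCase(Locale.ROOT);
        history.add(name);
        CommandSketch command = commands.get(name);
        if (command == null) {
            return "Unknown command: " + name + " (type 'help')";
        }
        return command.execute();
    }

    // Read-only snapshot of the inputs seen so far.
    List<String> history() {
        return List.copyOf(history);
    }

    // Builds the help text from whatever is registered, in registration order.
    String helpText() {
        StringBuilder sb = new StringBuilder();
        for (CommandSketch c : commands.values()) {
            sb.append(c.name()).append(" : ").append(c.description()).append('\n');
        }
        return sb.toString();
    }
}
```

With this shape, adding a command means registering one more object; the dispatch loop and the help text do not need to change.
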
### cache Subcommands

| Subcommand | Description |
|------------|-------------|
| `save` | Save the current crawl data to a data file |
| `load` | Load data from a data file |
| `list` | List all files in the `data/` directory |
| `delete` | Delete a specified data file, or all files |

### Data Directory

The program automatically creates a `data/` directory to hold the saved crawl data files.

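Creating the directory on demand can be done with the standard `java.nio.file` API, as in the minimal sketch below. The class and method names (`DataDirSketch`, `ensureDataDir`) are hypothetical, and how the project itself creates the directory is not shown in this README.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class DataDirSketch {
    // Creates <base>/data if it is missing and returns its path.
    static Path ensureDataDir(Path base) {
        Path dir = base.resolve("data");
        try {
            // createDirectories does nothing if the directory already exists.
            return Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create " + dir, e);
        }
    }
}
```

Because `Files.createDirectories` tolerates an existing directory, the helper can be called before every save without a separate existence check.
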
### Saving After a Crawl

When a `crawl` command finishes, the program asks whether to save the results:

```
Crawl finished, 10 items retrieved
========================================

Save the crawl results? (y/n): y
Enter the save path (default: data/crawler_data.json):
Data saved to: data/crawler_data.json
```

### Example: Deleting Cache Files

```
> cache
Enter a cache operation (save/load/list/delete): delete
========================================
Files available for deletion:
========================================
[1] crawler_data.json (1024 bytes)
[2] mountblade_data.json (2048 bytes)
[all] Delete all files
========================================
Enter the number of the file to delete, or 'all': 1
Delete 'crawler_data.json'? (y/n): y
Deleted: crawler_data.json
```

## Sample Output

```
[12]
Title: 骑砍2《战帆》v1.2.4与本体v1.4.4测试版更新日志
Link: https://www.mountblade.com.cn/news/Bannerlord/2026-05-13/3175.html
Source: MountBladeCrawler
Published: 2026-05-13
----------------------------------------
```

## Exception Handling

The project uses a layered exception hierarchy that separates checked from unchecked exceptions:

### Exception Categories

| Category | Description | Examples |
|----------|-------------|----------|
| **Checked** | Recoverable; the caller is forced to handle them | `HttpRequestException`, `TimeoutException`, `HtmlParseException`, `DataExtractException` |
| **Unchecked** | Programming errors; not recoverable | `InvalidUrlException`, `UnsupportedCrawlerException` |

### Exception Hierarchy

```
CrawlerException (root exception of the crawler framework, checked)
├── NetworkException (parent of network exceptions)
│   ├── HttpRequestException (HTTP request failed)
│   └── TimeoutException (connection timed out)
└── ParseException (parent of parsing exceptions)
    ├── HtmlParseException (HTML parsing failed)
    └── DataExtractException (data extraction failed)

ConfigurationException (parent of configuration exceptions, unchecked)
├── InvalidUrlException (invalid URL)
└── UnsupportedCrawlerException (unsupported crawler)
```

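In Java, a hierarchy like this maps onto class declarations in which the checked root extends `Exception` and the unchecked root extends `RuntimeException`. The sketch below uses the class names from the tree, but the constructors and fields are assumptions; only the `getStatusCode` and `getSourceUrl` accessors are taken from this README's own handling example.

```java
// Checked root: callers must catch it or declare it.
class CrawlerException extends Exception {
    CrawlerException(String message) {
        super(message);
    }
}

class NetworkException extends CrawlerException {
    NetworkException(String message) {
        super(message);
    }
}

class HttpRequestException extends NetworkException {
    private final int statusCode; // assumed field backing getStatusCode()

    HttpRequestException(String message, int statusCode) {
        super(message);
        this.statusCode = statusCode;
    }

    int getStatusCode() {
        return statusCode;
    }
}

class TimeoutException extends NetworkException {
    TimeoutException(String message) {
        super(message);
    }
}

class ParseException extends CrawlerException {
    ParseException(String message) {
        super(message);
    }
}

class HtmlParseException extends ParseException {
    private final String sourceUrl; // assumed field backing getSourceUrl()

    HtmlParseException(String message, String sourceUrl) {
        super(message);
        this.sourceUrl = sourceUrl;
    }

    String getSourceUrl() {
        return sourceUrl;
    }
}

class DataExtractException extends ParseException {
    DataExtractException(String message) {
        super(message);
    }
}

// Unchecked root: signals a programming or setup error, so no catch is forced.
class ConfigurationException extends RuntimeException {
    ConfigurationException(String message) {
        super(message);
    }
}

class InvalidUrlException extends ConfigurationException {
    InvalidUrlException(String message) {
        super(message);
    }
}

class UnsupportedCrawlerException extends ConfigurationException {
    UnsupportedCrawlerException(String message) {
        super(message);
    }
}
```

Because every network and parsing failure shares the `CrawlerException` root, a caller can catch specific subclasses first and fall back to the root for anything else.
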
### Exception Handling Example

```java
try {
    List<CrawlerData> data = crawler.crawl();
    view.showData(data);
} catch (HttpRequestException e) {
    view.showErrorMessage("HTTP request failed: " + e.getStatusCode());
} catch (TimeoutException e) {
    view.showErrorMessage("Connection timed out, please try again later");
} catch (HtmlParseException e) {
    view.showErrorMessage("HTML parsing failed: " + e.getSourceUrl());
} catch (CrawlerException e) {
    // Catch-all for the remaining checked crawler exceptions; must come after its subclasses
    view.showErrorMessage("Crawler failed: " + e.getMessage());
}
```

For the complete exception design, see `EXCEPTIONS.md`.

## Data Serialization

The project includes a Jackson-based JSON utility class that saves crawled data to a file and reads it back.

### Usage

```java
import com.crawler.util.JsonSerializer;
import com.crawler.model.CrawlerData;
import java.util.List;

List<CrawlerData> dataList = crawler.crawl();

// Write the crawled items to a JSON file
JsonSerializer.serializeToFile(dataList, "output/crawler_data.json");

// Read them back from the same file
List<CrawlerData> loadedData = JsonSerializer.deserializeFromFile("output/crawler_data.json");
```

### JsonSerializer Methods

| Method | Description |
|--------|-------------|
| `serializeToFile(List<CrawlerData>, String)` | Serialize a list of items to the given file |
| `deserializeFromFile(String)` | Deserialize a list of items from a file |
| `toJsonString(List<CrawlerData>)` | Convert a list of items to a JSON string |
| `toJsonString(CrawlerData)` | Convert a single item to a JSON string |
| `fromJsonString(String)` | Deserialize a list of items from a JSON string |
| `fromJsonStringToSingle(String)` | Deserialize a single item from a JSON string |

### Output Format Example

```json
[
  {
    "title": "News title",
    "content": "News content",
    "url": "https://example.com/news/1",
    "source": "ExampleCrawler",
    "publishDate": "2026-05-21"
  }
]
```

## Tech Stack

- Java 21+
- Java HttpClient (built-in HTTP client)
- Jackson (JSON serialization)
- Regular expressions (HTML parsing)

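## Notes

1. Follow the target website's robots.txt rules.
2. Avoid sending requests too frequently, so the target server is not put under load.
3. Some websites use anti-crawling measures, so additional request headers may be needed.
4. It is advisable to obtain the website's permission before crawling it.
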
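One simple way to avoid sending requests too frequently is to enforce a minimum gap between consecutive requests. The sketch below illustrates that idea; the `PoliteThrottle` class is hypothetical and not part of the framework described here, and a suitable interval depends on the target site.

```java
// Enforces a minimum gap between consecutive requests.
class PoliteThrottle {
    private final long minIntervalMillis;
    private boolean hasRequested = false;
    private long lastRequestAt = 0L;

    PoliteThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must not be negative");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    // How long a caller should still wait, given the current time in milliseconds.
    long waitMillis(long nowMillis) {
        if (!hasRequested) {
            return 0;
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minIntervalMillis - elapsed);
    }

    // Records that a request was sent at the given time.
    void markRequest(long nowMillis) {
        hasRequested = true;
        lastRequestAt = nowMillis;
    }

    // Blocks until the gap has passed, then records the request time.
    void acquire() throws InterruptedException {
        long wait = waitMillis(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

A crawler would call `acquire()` immediately before each HTTP request. The time-based methods take the clock value as a parameter so the waiting logic can be checked without actually sleeping.
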