4 changed files with 460 additions and 0 deletions
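@ -0,0 +1,358 @@
# Java Crawler Framework

A Java crawler framework built on an MVC architecture. New site crawlers are added through polymorphic extension.

## Features

- **MVC architecture**: clear layering with well-separated responsibilities
- **Polymorphic extension**: implement a new crawler by extending `BaseCrawler`
- **Command-line interface**: interactive command operation
- **Automatic selection**: picks a suitable crawler based on the URL
- **Date extraction**: extracts the publish date from the URL
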
## Supported Sites

| Site | Domain | Crawler |
|------|--------|---------|
| Hunan University official site | `*.hnu.edu.cn` | HunanUniversityCrawler |
| Hunan University News | `news.hnu.edu.cn` | HunanUniversityNewsCrawler |
| China Weather | `*.weather.com.cn` | ChinaWeatherCrawler |
| Mount & Blade Chinese site | `*.mountblade.com.cn` | MountBladeCrawler |

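## Quick Start

### Build the project

The `**` pattern below relies on bash's `globstar` option; without it, `**` matches only one directory level. The serialization utilities depend on Jackson, so its jars must also be on the classpath when compiling this way.

```bash
shopt -s globstar  # bash only: let ** match nested packages
javac -d target/classes src/main/java/com/crawler/**/*.java
```

### Run the program

```bash
java -cp target/classes com.crawler.Main
```
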
### Command-line usage

```
========================================
Java Crawler Framework
========================================

========================================
Java Crawler Framework - Command-Line Mode
========================================
Type 'help' to see the available commands
========================================
> help
Available commands:
---------
help  : show all available commands
list  : show the history of commands used
crawl : run a crawler; enter a URL and the crawler is selected automatically
exit  : quit the program

> crawl
Enter the URL to crawl: https://www.mountblade.com.cn
Using crawler: MountBladeCrawler
...
```

## Project Structure

```
src/main/java/com/crawler/
├── Main.java                    # Entry point
├── model/
│   ├── CrawlerData.java         # Crawled data model (title, link, source, publish date)
│   └── CrawlerConfig.java       # Crawler configuration (timeout, User-Agent)
├── view/
│   └── CrawlerView.java         # View layer (result display)
├── controller/
│   └── CrawlerController.java   # Crawler controller
├── crawler/
│   ├── Crawler.java             # Crawler interface
│   ├── BaseCrawler.java         # Abstract base crawler
│   ├── CrawlerFactory.java      # Crawler factory (automatic crawler selection)
│   └── impl/
│       ├── ExampleCrawler.java  # Generic crawler
│       ├── TestCrawler.java     # Test crawler
│       ├── HunanUniversityCrawler.java
│       ├── HunanUniversityNewsCrawler.java
│       ├── ChinaWeatherCrawler.java
│       └── MountBladeCrawler.java
└── command/
    ├── Command.java             # Command interface
    ├── BaseCommand.java         # Abstract base command
    ├── CommandHistory.java      # Command history
    ├── HelpCommand.java         # help command
    ├── ListCommand.java         # History command
    ├── CrawlCommand.java        # crawl command
    ├── ExitCommand.java         # exit command
    └── CommandController.java   # Command controller
```

## Adding a New Crawler

Extend `BaseCrawler` and override two methods, `getCrawlerName` and `parseHtml`:

```java
package com.crawler.crawler.impl;

import com.crawler.crawler.BaseCrawler;
import com.crawler.model.CrawlerData;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyWebsiteCrawler extends BaseCrawler {
    private static final String BASE_URL = "https://www.mywebsite.com";

    @Override
    public String getCrawlerName() {
        return "MyWebsiteCrawler";
    }

    @Override
    protected List<CrawlerData> parseHtml(String html) {
        List<CrawlerData> results = new ArrayList<>();

        // Parse the HTML with a regular expression:
        // group 1 is the href value, group 2 is the link text.
        Pattern pattern = Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");
        Matcher matcher = pattern.matcher(html);

        while (matcher.find()) {
            CrawlerData data = new CrawlerData();
            data.setTitle(matcher.group(2));
            data.setUrl(normalizeUrl(matcher.group(1)));
            data.setSource(getCrawlerName());
            data.setPublishDate(extractDateFromUrl(matcher.group(1)));
            results.add(data);
        }

        return results;
    }

    // Turn site-relative links ("/path") into absolute URLs.
    private String normalizeUrl(String url) {
        if (url.startsWith("/")) {
            return BASE_URL + url;
        }
        return url;
    }

    // Return the yyyy-MM-dd path segment of the URL, or null if there is none.
    private String extractDateFromUrl(String url) {
        Pattern datePattern = Pattern.compile("/(\\d{4}-\\d{2}-\\d{2})/");
        Matcher matcher = datePattern.matcher(url);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

Then add a matching rule in `CrawlerFactory.java`:

```java
crawlerPatterns.put("MyWebsiteCrawler",
    Pattern.compile(".*mywebsite\\.com.*", Pattern.CASE_INSENSITIVE));
```

And add a case to the `createCrawlerByName` method:

```java
case "MyWebsiteCrawler":
    return new MyWebsiteCrawler();
```

## Architecture

### MVC pattern

- **Model**: `CrawlerData` (data model), `CrawlerConfig` (configuration)
- **View**: `CrawlerView` (result display)
- **Controller**: `CrawlerController` (crawler control), `CommandController` (command control)

### Polymorphic design

- The `Crawler` interface defines the standard methods
- `BaseCrawler` provides the shared HTTP request logic
- Each crawler implementation extends `BaseCrawler` and overrides `parseHtml`

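The three roles above can be sketched as a small template-method arrangement. This is a simplified illustration, not the framework's actual code: the `*Sketch` names are hypothetical, results are plain strings instead of `CrawlerData`, and `fetchHtml()` returns canned HTML so the example runs offline, whereas the real base class is described as issuing an HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Concrete crawler: only the site-specific parsing lives here.
class LinkTitleCrawlerSketch extends BaseCrawlerSketch {
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^\"]+)\">([^<]+)</a>");

    @Override
    public String getCrawlerName() {
        return "LinkTitleCrawlerSketch";
    }

    @Override
    protected List<String> parseHtml(String html) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            titles.add(matcher.group(2)); // group 2 is the link text
        }
        return titles;
    }

    public static void main(String[] args) {
        // Callers depend on the interface, not on the concrete class.
        CrawlerSketch crawler = new LinkTitleCrawlerSketch();
        System.out.println(crawler.getCrawlerName() + ": " + crawler.crawl());
    }
}

// Hypothetical, simplified counterpart of the Crawler interface.
interface CrawlerSketch {
    String getCrawlerName();

    List<String> crawl();
}

// Hypothetical counterpart of BaseCrawler: crawl() is a fixed template,
// and subclasses supply only the parsing step.
abstract class BaseCrawlerSketch implements CrawlerSketch {
    @Override
    public final List<String> crawl() {
        return parseHtml(fetchHtml());
    }

    // Assumption: canned HTML stands in for the real HTTP request.
    protected String fetchHtml() {
        return "<a href=\"/news/1\">News 1</a><a href=\"/news/2\">News 2</a>";
    }

    protected abstract List<String> parseHtml(String html);
}
```

Making `crawl()` final keeps the fetch-then-parse sequence in one place, so a new crawler cannot accidentally skip the shared request handling.
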
### Factory pattern

`CrawlerFactory` selects a suitable crawler implementation automatically by matching the URL against registered patterns.

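A minimal sketch of pattern-based selection follows. It is an assumption about how such a lookup could work, not the contents of `CrawlerFactory`: the class name, the method `selectCrawlerName`, and the host-anchored regular expressions are all hypothetical. It also illustrates one detail implied by the supported-sites table: `news.hnu.edu.cn` also falls under `*.hnu.edu.cn`, so the more specific rule has to be checked first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical stand-in for URL-based crawler selection.
class CrawlerSelectorSketch {
    // LinkedHashMap keeps insertion order, so specific rules registered
    // first win over broader rules registered later.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        register("HunanUniversityNewsCrawler", "news\\.hnu\\.edu\\.cn");
        register("HunanUniversityCrawler", "([^/]+\\.)?hnu\\.edu\\.cn");
        register("ChinaWeatherCrawler", "([^/]+\\.)?weather\\.com\\.cn");
        register("MountBladeCrawler", "([^/]+\\.)?mountblade\\.com\\.cn");
    }

    // Anchor the rule to the host part so that look-alike domains do not match.
    private static void register(String crawlerName, String hostRegex) {
        PATTERNS.put(crawlerName, Pattern.compile(
                "^https?://" + hostRegex + "([/:?#].*)?$",
                Pattern.CASE_INSENSITIVE));
    }

    // Returns the name of the first crawler whose pattern matches the URL.
    static Optional<String> selectCrawlerName(String url) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(url).matches()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://news.hnu.edu.cn/info/1.htm",
            "https://www.hnu.edu.cn",
            "https://www.mountblade.com.cn",
            "https://example.com"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + selectCrawlerName(url).orElse("(no crawler)"));
        }
    }
}
```

Returning an `Optional` leaves the decision about unsupported URLs to the caller, which could fall back to a generic crawler or raise an error.
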
## Configuration

`CrawlerConfig` supports the following settings:

- `timeout`: HTTP request timeout (default 30000 ms)
- `userAgent`: the User-Agent header (defaults to a value that imitates the Chrome browser)

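As an illustration of where these two settings would take effect, the sketch below applies them to the JDK's built-in `java.net.http` client. The class and method names are hypothetical, and the User-Agent string is only an example of a Chrome-like value, not necessarily the framework's default.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper showing how timeout and userAgent could be applied.
class ConfiguredRequestSketch {
    static final int DEFAULT_TIMEOUT_MS = 30_000;
    // Example Chrome-like value; an assumption, not the project's actual default.
    static final String EXAMPLE_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // The connect timeout bounds how long establishing the connection may take.
    static HttpClient newClient(int timeoutMs) {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(timeoutMs))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    // The request timeout bounds the wait for the response to this request.
    static HttpRequest newRequest(String url, int timeoutMs, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(timeoutMs))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Build the request without sending it, then inspect what was configured.
        HttpRequest request =
                newRequest("https://example.com/", DEFAULT_TIMEOUT_MS, EXAMPLE_USER_AGENT);
        System.out.println(request.method() + " " + request.uri());
        System.out.println("timeout: " + request.timeout().orElseThrow());
        System.out.println("User-Agent: "
                + request.headers().firstValue("User-Agent").orElse("(none)"));
    }
}
```
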
## Commands

| Command | Description |
|---------|-------------|
| `help` | Show all available commands |
| `list` | Show the history of commands used |
| `crawl` | Run a crawler; enter the target URL, and optionally save the results afterwards |
| `cache` | Cache operations: save/load/list/delete |
| `exit` | Quit the program |

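The command table suggests a name-to-handler lookup with a recorded history. The sketch below shows that idea under stated assumptions: it uses `Supplier<String>` in place of the project's `Command` interface, and the class name, method names, and messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical dispatcher: maps a command name to an action and records history.
class CommandDispatcherSketch {
    private final Map<String, Supplier<String>> commands = new LinkedHashMap<>();
    private final List<String> history = new ArrayList<>();

    void register(String name, Supplier<String> action) {
        commands.put(name, action);
    }

    // Normalizes the input, records it, and runs the matching action if any.
    String dispatch(String input) {
        String name = input.trim().toLowerCase(Locale.ROOT);
        history.add(name);
        Supplier<String> action = commands.get(name);
        return action == null
                ? "Unknown command: " + name + " (type 'help')"
                : action.get();
    }

    List<String> history() {
        return Collections.unmodifiableList(history);
    }

    public static void main(String[] args) {
        CommandDispatcherSketch dispatcher = new CommandDispatcherSketch();
        dispatcher.register("help", () -> "help, list, exit");
        dispatcher.register("list", () -> "history: " + dispatcher.history());
        dispatcher.register("exit", () -> "bye");

        System.out.println(dispatcher.dispatch("help"));
        System.out.println(dispatcher.dispatch("  LIST "));
        System.out.println(dispatcher.dispatch("foo"));
    }
}
```
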
### cache subcommands

| Subcommand | Description |
|------------|-------------|
| `save` | Save the current crawl data to a data file |
| `load` | Read data from a data file |
| `list` | List all files in the `data/` directory |
| `delete` | Delete a specified data file, or all files |

### Data directory

The program automatically creates a `data/` directory for the saved crawl data files.

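The file handling behind `save`, `list`, and `delete` can be approximated with `java.nio.file`. This is a sketch under assumptions rather than the project's cache code: the class and method names are hypothetical, content is written as a plain string, and `main` uses a directory under the system temp folder instead of `data/` to avoid touching the working directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for a data directory of saved crawl results.
class DataDirSketch {
    // Creates the directory on demand, then writes (or overwrites) the file.
    static Path save(Path dataDir, String fileName, String content) {
        try {
            Files.createDirectories(dataDir);
            return Files.write(dataDir.resolve(fileName),
                    content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists regular files by name, sorted; a missing directory yields an empty list.
    static List<String> list(Path dataDir) {
        if (!Files.isDirectory(dataDir)) {
            return List.of();
        }
        try (Stream<Path> files = Files.list(dataDir)) {
            return files.filter(Files::isRegularFile)
                    .map(path -> path.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true only if a file was actually removed.
    static boolean delete(Path dataDir, String fileName) {
        try {
            return Files.deleteIfExists(dataDir.resolve(fileName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Path.of(System.getProperty("java.io.tmpdir"), "crawler-sketch-data");
        save(dir, "crawler_data.json", "[]");
        System.out.println("files: " + list(dir));
        System.out.println("deleted: " + delete(dir, "crawler_data.json"));
    }
}
```
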
### Save prompt after crawling

When a `crawl` command finishes, the program asks whether to save the results:

```
Crawl finished, 10 items retrieved
========================================

Save the crawl results? (y/n): y
Enter save path (default: data/crawler_data.json):
Data saved to: data/crawler_data.json
```

### Example: deleting cache files

```
> cache
Enter cache operation (save/load/list/delete): delete
========================================
Files available for deletion:
========================================
  [1] crawler_data.json (1024 bytes)
  [2] mountblade_data.json (2048 bytes)
  [all] Delete all files
========================================
Enter the number of the file to delete, or 'all': 1
Delete 'crawler_data.json'? (y/n): y
Deleted: crawler_data.json
```

## Sample Output

```
[12]
Title: Mount & Blade II "War Sails" v1.2.4 and base game v1.4.4 beta patch notes
Link: https://www.mountblade.com.cn/news/Bannerlord/2026-05-13/3175.html
Source: MountBladeCrawler
Publish date: 2026-05-13
----------------------------------------
```

## Exception Handling

The project uses a layered exception hierarchy that separates checked from unchecked exceptions:

### Exception categories

| Category | Description | Examples |
|----------|-------------|----------|
| **Checked** | Recoverable errors; the caller is forced to handle them | `HttpRequestException`, `TimeoutException`, `HtmlParseException`, `DataExtractException` |
| **Unchecked** | Programming errors; not recoverable | `InvalidUrlException`, `UnsupportedCrawlerException` |

### Exception hierarchy

```
CrawlerException (root of the framework's exceptions - checked)
├── NetworkException (parent of network errors)
│   ├── HttpRequestException (HTTP request failed)
│   └── TimeoutException (connection timed out)
└── ParseException (parent of parsing errors)
    ├── HtmlParseException (HTML parsing failed)
    └── DataExtractException (data extraction failed)

ConfigurationException (parent of configuration errors - unchecked)
├── InvalidUrlException (invalid URL)
└── UnsupportedCrawlerException (unsupported crawler)
```

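One way to express this tree in Java is to root the checked branch at `Exception` and the unchecked branch at `RuntimeException`. The declarations below are a sketch: the class names and the `getStatusCode`/`getSourceUrl` accessors come from this document, while the constructors, the `classify` and `simulateFetch` helpers, and the enclosing class are assumptions made for illustration.

```java
// Hypothetical demo class; classify() shows how the tree groups failures.
class ExceptionHierarchySketch {
    static String classify(Throwable t) {
        if (t instanceof NetworkException) return "network";
        if (t instanceof ParseException) return "parse";
        if (t instanceof CrawlerException) return "crawler";
        if (t instanceof ConfigurationException) return "configuration";
        return "other";
    }

    // Stand-in for a fetch step that fails on non-200 status codes.
    static void simulateFetch(int statusCode) throws CrawlerException {
        if (statusCode != 200) {
            throw new HttpRequestException("HTTP request failed", statusCode);
        }
    }

    public static void main(String[] args) {
        try {
            simulateFetch(503);
        } catch (HttpRequestException e) {
            System.out.println(classify(e) + " error, status " + e.getStatusCode());
        } catch (CrawlerException e) {
            System.out.println(classify(e) + " error: " + e.getMessage());
        }
    }
}

// Checked branch: callers must catch or declare these.
class CrawlerException extends Exception {
    CrawlerException(String message) { super(message); }
}

class NetworkException extends CrawlerException {
    NetworkException(String message) { super(message); }
}

class HttpRequestException extends NetworkException {
    private final int statusCode;

    HttpRequestException(String message, int statusCode) {
        super(message);
        this.statusCode = statusCode;
    }

    int getStatusCode() { return statusCode; }
}

class TimeoutException extends NetworkException {
    TimeoutException(String message) { super(message); }
}

class ParseException extends CrawlerException {
    ParseException(String message) { super(message); }
}

class HtmlParseException extends ParseException {
    private final String sourceUrl;

    HtmlParseException(String message, String sourceUrl) {
        super(message);
        this.sourceUrl = sourceUrl;
    }

    String getSourceUrl() { return sourceUrl; }
}

class DataExtractException extends ParseException {
    DataExtractException(String message) { super(message); }
}

// Unchecked branch: signals programming or configuration mistakes.
class ConfigurationException extends RuntimeException {
    ConfigurationException(String message) { super(message); }
}

class InvalidUrlException extends ConfigurationException {
    InvalidUrlException(String message) { super(message); }
}

class UnsupportedCrawlerException extends ConfigurationException {
    UnsupportedCrawlerException(String message) { super(message); }
}
```
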
### Exception handling example

```java
try {
    List<CrawlerData> data = crawler.crawl();
    view.showData(data);
} catch (HttpRequestException e) {
    // Most specific handlers come first; subclasses must precede their parents.
    view.showErrorMessage("HTTP request failed: " + e.getStatusCode());
} catch (TimeoutException e) {
    view.showErrorMessage("Connection timed out, please try again later");
} catch (HtmlParseException e) {
    view.showErrorMessage("HTML parsing failed: " + e.getSourceUrl());
} catch (CrawlerException e) {
    // Fallback for any other checked crawler exception.
    view.showErrorMessage("Crawler execution failed: " + e.getMessage());
}
```

For the full exception design document, see [EXCEPTIONS.md](file:///C:/Users/黄志楷/Documents/ocix/学校相关/jwork/w12/EXCEPTIONS.md)

## Data Serialization

The project provides a Jackson-based JSON serialization utility class for saving crawled data to a file and reading it back.

### Usage example

```java
import com.crawler.util.JsonSerializer;
import com.crawler.model.CrawlerData;
import java.util.List;

List<CrawlerData> dataList = crawler.crawl();

// Write the crawled items to a JSON file
JsonSerializer.serializeToFile(dataList, "output/crawler_data.json");

// Read them back from the same file
List<CrawlerData> loadedData = JsonSerializer.deserializeFromFile("output/crawler_data.json");
```

### JsonSerializer methods

| Method | Description |
|--------|-------------|
| `serializeToFile(List<CrawlerData>, String)` | Serialize a data list to the given file |
| `deserializeFromFile(String)` | Deserialize a data list from a file |
| `toJsonString(List<CrawlerData>)` | Convert a data list to a JSON string |
| `toJsonString(CrawlerData)` | Convert a single item to a JSON string |
| `fromJsonString(String)` | Deserialize a data list from a JSON string |
| `fromJsonStringToSingle(String)` | Deserialize a single item from a JSON string |

### Output format example

```json
[
  {
    "title": "News title",
    "content": "News content",
    "url": "https://example.com/news/1",
    "source": "ExampleCrawler",
    "publishDate": "2026-05-21"
  }
]
```

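## Tech Stack

- Java 21+
- Java HttpClient (built-in HTTP client)
- Jackson (JSON serialization)
- Regular expressions (HTML parsing)

## Notes

1. Follow the target site's robots.txt rules
2. Do not send requests too frequently; avoid putting load on the target server
3. Some sites have anti-crawling measures and may require additional request headers
4. Obtaining the site's permission before crawling is recommended
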
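For the second note, a fixed minimum interval between requests is one simple way to limit request frequency. The sketch below is an illustration only: the class name, its methods, and the 200 ms interval are hypothetical and not part of the framework.

```java
// Hypothetical throttle that enforces a minimum gap between requests.
class PoliteThrottleSketch {
    private final long minIntervalMs;
    private long lastRequestAtMs;
    private boolean hasRequested = false;

    PoliteThrottleSketch(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    // Pure helper: how long to wait so that at least minIntervalMs has elapsed.
    static long remainingDelayMs(long lastRequestAtMs, long nowMs, long minIntervalMs) {
        return Math.max(0L, minIntervalMs - (nowMs - lastRequestAtMs));
    }

    // Blocks until the next request is allowed, then records the request time.
    synchronized void awaitTurn() {
        if (hasRequested) {
            long waitMs = remainingDelayMs(
                    lastRequestAtMs, System.currentTimeMillis(), minIntervalMs);
            if (waitMs > 0) {
                try {
                    Thread.sleep(waitMs);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and proceed without further waiting.
                    Thread.currentThread().interrupt();
                }
            }
        }
        hasRequested = true;
        lastRequestAtMs = System.currentTimeMillis();
    }

    public static void main(String[] args) {
        PoliteThrottleSketch throttle = new PoliteThrottleSketch(200);
        for (int i = 1; i <= 3; i++) {
            throttle.awaitTurn();
            // A real crawler would send its HTTP request here.
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```
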
@ -0,0 +1,79 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.crawler</groupId>
    <artifactId>crawler-framework</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <name>crawler-framework</name>
    <description>Java MVC Crawler Framework</description>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13.2</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.15.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.datatype</groupId>
            <artifactId>jackson-datatype-jsr310</artifactId>
            <version>2.15.2</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>11</source>
                    <target>11</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.crawler.Main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Binary file not shown.
@ -0,0 +1,23 @@
<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Test Page</h1>
    <div class="news-list">
        <article>
            <h2><a href="https://example.com/news1">News Title 1 - This is the first test news item</a></h2>
            <p>This is the summary of the first news item...</p>
        </article>
        <article>
            <h2><a href="https://example.com/news2">News Title 2 - This is the second test news item</a></h2>
            <p>This is the summary of the second news item...</p>
        </article>
        <article>
            <h2><a href="https://example.com/news3">News Title 3 - This is the third test news item</a></h2>
            <p>This is the summary of the third news item...</p>
        </article>
    </div>
</body>
</html>