31 changed files with 2394 additions and 0 deletions
@ -0,0 +1,4 @@ |
|||||
|
*.jar |
||||
|
*.jar |
||||
|
*.class |
||||
|
*.log |
||||
@ -0,0 +1,492 @@ |
|||||
|
---
id: "24"
title: w10-design-patterns
slug: w10-design-patterns
status: draft
view_count: 0
created_at: 2026-05-07T12:00:00+08:00
updated_at: 2026-05-07T14:00:00.000000000+08:00
---

# Advanced Programming · Week 10

### Design Patterns: Flexibility and Extensibility

### Strategy Pattern + Factory + Repository in Practice

---

### 📌 This Week's Roadmap

- W9 recap: what the skeleton achieved and where it is fragile
- Strategy pattern: a "plug standard" for parsers
- Parser factory: the magic of automatic matching
- Repository: guarding data access
- Putting the architecture together: the full call chain
- Code walkthrough + hands-on tasks
- Architecture retrospective + W11 preview

---

## 1️⃣ W9 Recap: What the Skeleton Achieved and Where It Is Fragile

### We built a nice-looking house

- ✅ Clear MVC layering
- ✅ Command pattern: **new commands require zero Controller changes**
- ✅ All output goes through `ConsoleView`
- ✅ Standard project package layout

---

### But problems followed

```java
// Where should the parsing logic inside CrawlCommand live?
if (url.contains("blog.example.com")) {
    // parse the blog...
} else if (url.contains("news.example.com")) {
    // parse the news site...
} else {
    view.printError("Unsupported website!");
}
```

> 😫 Every newly supported website means one more `else if`

---

### And another piece of "naked" data

```java
List<Article> articles = new ArrayList<>();
// Any Command can do this:
articles.clear();
articles.add(null);
articles.remove(0);
```

> 🚨 The data has no protection at all, and verbal agreements cannot be relied on

---

### This week's tasks

1. **Make parsing logic pluggable** → Strategy pattern + Factory
2. **Put a guard on data access** → Repository pattern

> W9 built the skeleton; W10 adds the armor

---

## 2️⃣ Strategy Pattern: A "Plug Standard" for Parsers

### Why does a wall socket accept any appliance?

- The **three-pin socket** is the standard interface
- TVs, computers, and phone chargers all implement that interface
- The socket does not care which appliance is plugged in

---

### The crawler world works the same way

- `CrawlStrategy` = the socket interface
- `BlogStrategy`, `NewsStrategy` = concrete appliances
- `CrawlCommand` = the person using the appliance
- `StrategyFactory` = the socket panel

---

### The interface is the contract

```java
public interface CrawlStrategy {
    List<Article> parse(String url, Document doc);
    boolean supports(String url);
}
```

- `supports()`: can I handle this URL?
- `parse()`: how do I parse it?
- **Any website that wants to be crawled signs this contract!**

---

### Strategy vs. hard-coding

| Aspect | if-else pile | Strategy pattern |
|------|-------------|----------|
| Adding a website | Edit the Command | Create a new strategy class |
| Changing parsing | Hunt through the `else if` branches | Edit only the matching class |
| Testing | Start the whole crawler | Test the strategy on its own |
| Open/closed principle | ❌ Open to modification | ✅ Open for extension, closed for modification |
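
The "test the strategy on its own" row can be made concrete with a small sketch. To stay runnable with the JDK alone, it assumes a simplified, hypothetical `TextStrategy` interface that receives raw HTML text instead of a Jsoup `Document`; the `<h2 class="post-title">` markup and the regex are illustrative stand-ins, not the project's real parsing code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, dependency-free stand-in for CrawlStrategy: it takes raw HTML
// text instead of a Jsoup Document so the sketch runs with the JDK alone.
interface TextStrategy {
    boolean supports(String url);
    List<String> parseTitles(String html);
}

// Stand-in for BlogStrategy: collects the text of <h2 class="post-title"> elements.
final class BlogTextStrategy implements TextStrategy {
    private static final Pattern TITLE =
            Pattern.compile("<h2 class=\"post-title\">(.*?)</h2>");

    @Override
    public boolean supports(String url) {
        return url.contains("blog.example.com");
    }

    @Override
    public List<String> parseTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(html);
        while (m.find()) {
            titles.add(m.group(1).trim());
        }
        return titles;
    }
}

final class StrategyIsolationDemo {
    public static void main(String[] args) {
        // No controller, no view, no network: only the strategy under test.
        TextStrategy strategy = new BlogTextStrategy();
        String html = "<h2 class=\"post-title\">Hello</h2>"
                + "<h2 class=\"post-title\"> World </h2>";
        System.out.println(strategy.supports("https://blog.example.com/a")); // true
        System.out.println(strategy.parseTitles(html)); // [Hello, World]
    }
}
```

With the real `CrawlStrategy`, the same idea would likely mean building a `Document` from a fixed HTML string and asserting on the returned articles, without starting the crawler.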

---

### A concrete strategy example

```java
public class BlogStrategy implements CrawlStrategy {
    @Override
    public boolean supports(String url) {
        return url.contains("blog.example.com");
    }

    @Override
    public List<Article> parse(String url, Document doc) {
        List<Article> articles = new ArrayList<>();
        // One Article per element matching the .post-title selector.
        for (Element e : doc.select(".post-title")) {
            articles.add(new Article(e.text(), url, ""));
        }
        return articles;
    }
}
```

> ✨ One new website, one independent class; each class looks after its own doorstep

---

## 3️⃣ Parser Factory: The Magic of Automatic Matching

### Who picks the strategy?

- If `CrawlCommand` itself loops over every strategy → the Strategy pattern was for nothing
- We need a black box: **drop in a URL, get back a suitable parser**

---

### Enter the factory

```java
public class StrategyFactory {
    private final List<CrawlStrategy> strategies = new ArrayList<>();

    public StrategyFactory() {
        strategies.add(new BlogStrategy());
        strategies.add(new NewsStrategy());
    }

    // Returns the first registered strategy that supports the URL, or null if none does.
    public CrawlStrategy getStrategy(String url) {
        for (CrawlStrategy s : strategies) {
            if (s.supports(url)) return s;
        }
        return null;
    }
}
```

> 🔧 Adding a website only takes: a new strategy class + one registration line in the factory

---

### A win for the open/closed principle

- ✅ `CrawlCommand` does not change at all
- ✅ Add an `XxxStrategy` and one registration line
- ✅ Every strategy is invoked in exactly the same way

> This is **"open for extension, closed for modification"**
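
The one registration line still edits the factory class. A variation worth considering is to pass the strategies in through the constructor, so the factory itself never changes when a site is added. The sketch below is an assumption-laden illustration: `UrlStrategy`, `FragmentStrategy`, and `InjectedStrategyFactory` are hypothetical names, and the interface is reduced to what the lookup needs.

```java
import java.util.List;

// Hypothetical stand-in for CrawlStrategy, reduced to what the lookup needs.
interface UrlStrategy {
    boolean supports(String url);
    String name();
}

// Illustrative strategy that matches on a URL fragment, as the slides do.
final class FragmentStrategy implements UrlStrategy {
    private final String name;
    private final String fragment;

    FragmentStrategy(String name, String fragment) {
        this.name = name;
        this.fragment = fragment;
    }

    @Override
    public boolean supports(String url) {
        return url.contains(fragment);
    }

    @Override
    public String name() {
        return name;
    }
}

// The factory receives its strategies instead of constructing them itself.
final class InjectedStrategyFactory {
    private final List<UrlStrategy> strategies;

    InjectedStrategyFactory(List<UrlStrategy> strategies) {
        this.strategies = List.copyOf(strategies); // immutable snapshot
    }

    // Same contract as the slide's getStrategy: first match, or null.
    UrlStrategy getStrategy(String url) {
        for (UrlStrategy s : strategies) {
            if (s.supports(url)) return s;
        }
        return null;
    }
}

final class InjectedFactoryDemo {
    public static void main(String[] args) {
        // Registration now lives at the composition root (for example, Main).
        InjectedStrategyFactory factory = new InjectedStrategyFactory(List.of(
                new FragmentStrategy("blog", "blog.example.com"),
                new FragmentStrategy("news", "news.example.com")));
        System.out.println(factory.getStrategy("https://news.example.com/a").name()); // news
    }
}
```

The trade-off is that the wiring moves to whoever constructs the factory; the slide's version keeps everything in one place at the cost of editing the factory for each new site.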

---

### The refactored CrawlCommand

```java
public void execute(String[] args, ArticleRepository repository) {
    String url = args[1];
    CrawlStrategy strategy = strategyFactory.getStrategy(url);
    if (strategy == null) {
        view.printError("No strategy for: " + url);
        return;
    }
    Document doc;
    try {
        doc = Jsoup.connect(url).get(); // get() throws the checked IOException
    } catch (IOException e) {
        view.printError("Failed to fetch: " + url);
        return;
    }
    List<Article> parsed = strategy.parse(url, doc);
    for (Article a : parsed) {
        repository.add(a);
    }
    view.printSuccess("Crawled " + parsed.size() + " articles.");
}
```

> 🧠 CrawlCommand now only **"orchestrates"**; it does no parsing

---

## 4️⃣ Repository: Guarding Data Access

### The problem with a shared List

```java
articles.clear();    // wipe everything
articles.add(null);  // slip in a null
articles.remove(0);  // delete at will
```

> Order that is maintained only by convention will eventually be broken

---

### Put a security door on the data

```java
public class ArticleRepository {
    private final List<Article> articles = new ArrayList<>();

    public void add(Article article) {
        if (article == null) throw new IllegalArgumentException(...);
        articles.add(article);
    }

    public List<Article> getAll() {
        return Collections.unmodifiableList(articles);
    }

    public int size() { return articles.size(); }

    public void clear() { articles.clear(); }
}
```

---

### Three lines of defense

| Mechanism | Effect |
|------|------|
| **`add` rejects null** | The rule lives in code, not in a verbal agreement |
| **`getAll` returns an unmodifiable view** | Any write through the view throws `UnsupportedOperationException` immediately |
| **Access only through the repository** | Encapsulates the internal structure and exposes only safe methods |
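
The second line of defense is easy to observe directly. The sketch below uses a hypothetical `TitleStore` holding plain strings (an assumption made to keep it self-contained) and shows two properties of `Collections.unmodifiableList`: writes through the returned list are rejected, and the list is a live view, so later additions made through the store remain visible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical miniature of ArticleRepository that stores plain strings.
final class TitleStore {
    private final List<String> titles = new ArrayList<>();

    void add(String title) {
        if (title == null) {
            throw new IllegalArgumentException("title cannot be null");
        }
        titles.add(title);
    }

    // Read-only view over the private list; callers cannot mutate it.
    List<String> getAll() {
        return Collections.unmodifiableList(titles);
    }
}

final class ReadOnlyViewDemo {
    public static void main(String[] args) {
        TitleStore store = new TitleStore();
        store.add("first");
        List<String> view = store.getAll();

        try {
            view.add("hack"); // any write through the view fails
        } catch (UnsupportedOperationException e) {
            System.out.println("write rejected");
        }

        store.add("second"); // writes through the store still work
        System.out.println(view.size()); // 2: the view reflects the backing list
    }
}
```

If callers should receive a frozen snapshot instead of a live view, `List.copyOf(titles)` is one alternative, at the cost of a copy per call.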

---

### Every Command signature changes

```java
// W9
public void execute(String[] args, List<Article> articles);

// W10
public void execute(String[] args, ArticleRepository repository);
```

> Semantic shift: from "here is the data, do whatever you like" → "here is a safe access channel"

---

## 5️⃣ Putting the Architecture Together

### The full journey of one `crawl` command

```
User types "crawl https://blog.example.com"
    ↓
ConsoleView reads the input line
    ↓
Controller routes → CrawlCommand
    ↓
StrategyFactory.getStrategy(url) → BlogStrategy
    ↓
Jsoup fetches → Document
    ↓
BlogStrategy.parse(url, doc) → List<Article>
    ↓
Repository.add() stores
    ↓
ConsoleView prints the success message
```

---

### Architecture overview

![alt text](image.png)

```mermaid
flowchart TD
    User(["👤 User input<br/>crawl https://blog.example.com"]) --> View

    subgraph View["🎨 View layer (ConsoleView)"]
        ReadLine["readLine()"]
        Display["display() / printSuccess()"]
    end

    ReadLine --> Controller

    subgraph Controller["🧭 Controller layer"]
        Router["CrawlerController<br/>Map-based routing"]
    end

    Router --> Command

    subgraph Command["⚡ Command layer"]
        CrawlCmd["CrawlCommand<br/>(orchestrator)"]
    end

    CrawlCmd --> Factory

    subgraph Strategy["🧩 Strategy layer"]
        Factory["StrategyFactory<br/>(automatic matching)"]
        StrategyI["<<interface>> CrawlStrategy"]
        BlogS["BlogStrategy"]
        NewsS["NewsStrategy"]
        Factory --> StrategyI --> BlogS
        StrategyI --> NewsS
    end

    BlogS --> Repository

    subgraph Repository["🔐 Repository layer"]
        Repo["ArticleRepository<br/>(add / getAll)"]
        RepoList["List<Article> (private)"]
        Repo --> RepoList
    end

    RepoList --> Model

    subgraph Model["📦 Model layer"]
        Article["Article"]
    end

    CrawlCmd --> Display
    Repository --> Display
```

> 🗺️ Every layer has a clear responsibility, and every extension only adds code instead of modifying it

---

## 6️⃣ Code Walkthrough (Step-by-Step Upgrade)

### Change list for upgrading from W9 to W10

1. Create the `strategy/` package → `CrawlStrategy` interface
2. Implement `BlogStrategy` and `NewsStrategy`
3. Implement `StrategyFactory`
4. Create the `repository/` package → `ArticleRepository`
5. Change the `Command` interface signature
6. Rewrite `CrawlCommand`
7. Adjust every other `Command`
8. Adjust `Controller` and `App.java`

---

### Key code demos

- How to use `Collections.unmodifiableList()`
- The lookup loop in `StrategyFactory.getStrategy()`
- `CrawlCommand`: from "hard-coded parsing" to "orchestration and assembly"

```java
// One example change
for (Article a : parsed) {
    repository.add(a); // before: articles.add(a);
}
```

---

### Spot the flaw

- `StrategyFactory` returns `null` when no strategy matches
- `CrawlCommand` checks for `null` and reports an error
- Is there a more elegant way to avoid the `null` check?

> 🔍 After class, explore this with AI: a prelude to the "Null Object pattern"
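
One possible answer to the question above, short of a full Null Object, is to return `java.util.Optional` so the "no match" case is part of the method's type. The sketch below is only an illustration: `SiteParser`, `FragmentParser`, and `OptionalParserFactory` are hypothetical names standing in for the project's strategy types.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for CrawlStrategy.
interface SiteParser {
    boolean supports(String url);
    String name();
}

final class FragmentParser implements SiteParser {
    private final String name;
    private final String fragment;

    FragmentParser(String name, String fragment) {
        this.name = name;
        this.fragment = fragment;
    }

    @Override
    public boolean supports(String url) {
        return url.contains(fragment);
    }

    @Override
    public String name() {
        return name;
    }
}

final class OptionalParserFactory {
    private final List<SiteParser> parsers;

    OptionalParserFactory(List<SiteParser> parsers) {
        this.parsers = List.copyOf(parsers);
    }

    // An empty Optional replaces the null return of the slide's getStrategy.
    Optional<SiteParser> find(String url) {
        return parsers.stream().filter(p -> p.supports(url)).findFirst();
    }
}

final class OptionalLookupDemo {
    // The caller handles both outcomes in one expression, with no null check.
    static String describe(OptionalParserFactory factory, String url) {
        return factory.find(url)
                .map(p -> "Using " + p.name())
                .orElse("No strategy for: " + url);
    }

    public static void main(String[] args) {
        OptionalParserFactory factory = new OptionalParserFactory(
                List.of(new FragmentParser("blog", "blog.example.com")));
        System.out.println(describe(factory, "https://blog.example.com/p")); // Using blog
        System.out.println(describe(factory, "https://other.example.org/"));
    }
}
```

A Null Object would go one step further and return a do-nothing strategy, so the caller would not branch at all; which of the two reads better depends on whether "no match" should be reported to the user.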

---

## 7️⃣ Architecture Retrospective + Next Week's Preview

### Weak points of the current architecture

- ❌ Exception handling is a single catch-all
- ❌ No retry mechanism
- ❌ No control over network timeouts
- ❌ Logs go only to the terminal

---

### W11 goal: robustness engineering

- ✅ **Custom exception hierarchy**: turn "something went wrong" into specific business exceptions
- ✅ **Production-grade logging**: record who did what, and when
- ✅ **Defensive programming + retry mechanism**: network jitter is no longer fatal
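
As a preview of the retry goal, here is a minimal sketch of a fixed-delay retry helper built on `Callable`. The class name `RetrySketch`, the three-parameter signature, and the fixed delay are all assumptions made for illustration; they are not the project's `RetryUtils`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a task a fixed number of times with a fixed delay.
final class RetrySketch {
    static <T> T executeWithRetry(Callable<T> task, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and maybe try again
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last; // every attempt failed: surface the last error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky fetch: fails twice, then succeeds.
        String result = executeWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated network jitter");
            }
            return "ok";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```

A production version would probably retry only on network-type exceptions and grow the delay between attempts; both are left out here to keep the idea visible.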

> W9 built the skeleton → W10 adds the armor → W11 makes it able to take a beating

---

## 8️⃣ Hands-on Tasks (In Class)

### Required

1. Upgrade the W9 project to W10
2. Implement at least 2 `CrawlStrategy` classes (mocked data is fine)
3. Implement `StrategyFactory` and `ArticleRepository`
4. Test the full `crawl` → `list` flow

### Acceptance criteria

- [ ] Adding a strategy takes only a new class + one registration line; no other existing code changes
- [ ] `getAll()` returns an unmodifiable view
- [ ] `CrawlCommand` contains no site-specific parsing
- [ ] Every Command uses the Repository
- [ ] No code manipulates the shared `List<Article>` directly

---

## 9️⃣ Homework

### Required

1. Complete `ArticleRepository`: add `addAll` and defend against null
2. **★ AnalyzeCommand**: reuse strategy parsing without storing anything, and print statistics
3. **AI architecture audit**: send your class signatures to an AI and have it check strategy decoupling and encapsulation
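
For the first homework item, one reasonable shape for `addAll` is to validate the whole batch before touching the stored list, so a bad element cannot leave the repository half-updated. The sketch below assumes a hypothetical `SafeTitleRepository` over plain strings; a real version would take `Article` objects and could reuse the checks from `add`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical miniature repository used to illustrate an all-or-nothing addAll.
final class SafeTitleRepository {
    private final List<String> items = new ArrayList<>();

    void addAll(Collection<String> batch) {
        if (batch == null) {
            throw new IllegalArgumentException("batch cannot be null");
        }
        // Validate everything first; an explicit loop avoids contains(null),
        // which some immutable collections reject with a NullPointerException.
        for (String item : batch) {
            if (item == null) {
                throw new IllegalArgumentException("batch cannot contain null");
            }
        }
        items.addAll(batch); // reached only when every element is valid
    }

    List<String> getAll() {
        return Collections.unmodifiableList(items);
    }

    int size() {
        return items.size();
    }
}

final class AddAllDemo {
    public static void main(String[] args) {
        SafeTitleRepository repo = new SafeTitleRepository();
        try {
            repo.addAll(Arrays.asList("a", null, "c"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected, size = " + repo.size()); // rejected, size = 0
        }
        repo.addAll(Arrays.asList("a", "b"));
        System.out.println(repo.getAll()); // [a, b]
    }
}
```

Whether a bad element should reject the whole batch or only be skipped is a design decision; skipping with a logged warning is an equally defensible reading of the assignment.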

### Optional

- Regex-based strategy matching, a default strategy, strategy priorities
- Think about it: what happens when two strategies both `supports` the same URL?
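
The three optional ideas combine naturally: match with a regex, order rules by priority, and fall back to a default when nothing matches. The sketch below is one hypothetical arrangement (`PrioritizedRule` and `PriorityResolver` are invented names, and strategies are represented by their names only) that also gives one answer to the overlap question: the highest priority wins, and ties keep registration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule: a strategy name, a URL regex, and a priority (higher wins).
final class PrioritizedRule {
    final String strategyName;
    final Pattern urlPattern;
    final int priority;

    PrioritizedRule(String strategyName, String urlRegex, int priority) {
        this.strategyName = strategyName;
        this.urlPattern = Pattern.compile(urlRegex);
        this.priority = priority;
    }

    boolean supports(String url) {
        return urlPattern.matcher(url).find();
    }
}

final class PriorityResolver {
    private final List<PrioritizedRule> rules = new ArrayList<>();
    private final String defaultStrategyName;

    PriorityResolver(String defaultStrategyName) {
        this.defaultStrategyName = defaultStrategyName;
    }

    void register(PrioritizedRule rule) {
        rules.add(rule);
        // List.sort is stable, so equal priorities keep registration order.
        rules.sort(Comparator.comparingInt((PrioritizedRule r) -> r.priority).reversed());
    }

    // First matching rule in priority order, otherwise the default strategy.
    String resolve(String url) {
        for (PrioritizedRule rule : rules) {
            if (rule.supports(url)) {
                return rule.strategyName;
            }
        }
        return defaultStrategyName;
    }
}

final class PriorityDemo {
    public static void main(String[] args) {
        PriorityResolver resolver = new PriorityResolver("default");
        resolver.register(new PrioritizedRule("generic-example", "example\\.com", 1));
        resolver.register(new PrioritizedRule("blog", "^https?://blog\\.example\\.com/", 10));

        System.out.println(resolver.resolve("https://blog.example.com/post")); // blog
        System.out.println(resolver.resolve("https://news.example.com/x"));    // generic-example
        System.out.println(resolver.resolve("https://other.test/"));           // default
    }
}
```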

---

## 🤖 Leveling Up with AI

### Architecture auditor (required)

- Draw the class dependency diagram
- Send it to an AI: "Check how well the open/closed principle is met, how complete the Repository encapsulation is, and whether there are circular dependencies"

### Going deeper

- Skipping the factory and storing strategies directly in a `Map<String, CrawlStrategy>` vs. using `StrategyFactory`: what is the difference?
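
To make the comparison tangible, here is a sketch of the map-based variant, keyed by host name. It is an illustration under assumptions: `HostStrategyRegistry` is a hypothetical class, and strategies are represented by their names to keep the code self-contained. It uses `java.net.URI` to extract the host.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical map-based lookup: exact host name → strategy name.
final class HostStrategyRegistry {
    private final Map<String, String> strategyNameByHost = new HashMap<>();

    void register(String host, String strategyName) {
        strategyNameByHost.put(host.toLowerCase(Locale.ROOT), strategyName);
    }

    Optional<String> lookup(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return Optional.empty(); // e.g. a relative URL without a host
            }
            return Optional.ofNullable(
                    strategyNameByHost.get(host.toLowerCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // malformed URL
        }
    }
}

final class HostLookupDemo {
    public static void main(String[] args) {
        HostStrategyRegistry registry = new HostStrategyRegistry();
        registry.register("blog.example.com", "BlogStrategy");

        System.out.println(registry.lookup("https://blog.example.com/post/1")); // Optional[BlogStrategy]
        // A substring check such as url.contains("blog.example.com") would match this URL:
        System.out.println(registry.lookup("https://evil.test/?q=blog.example.com")); // Optional.empty
    }
}
```

The map gives a constant-time lookup and stricter matching, but only for exact keys. The list of `supports()` predicates in `StrategyFactory` can express path rules, regexes, or fallbacks, at the cost of a linear scan and an ordering question.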

---

## 📚 Summary

- ✅ Strategy pattern: pluggable algorithms, painless to add a website
- ✅ Factory: automatic matching, the URL → strategy magic
- ✅ Repository: a data guard; rules move from verbal agreement to code enforcement
- ✅ Architecture: from "separated" to "cleanly closed", open for extension and closed for modification

### W11 preview

Custom exception hierarchy + logging + retry mechanism

> 🚀 Let the crawler we build stand up to the real world

---

## Thank you!

**Stay picky about clean engineering. See you next week!**

---

# Centered Title

## Centered Subtitle

### Centered Content

---
@ -0,0 +1,62 @@ |
|||||
|
<project xmlns="http://maven.apache.org/POM/4.0.0" |
||||
|
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
||||
|
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd"> |
||||
|
<modelVersion>4.0.0</modelVersion> |
||||
|
<groupId>com.example</groupId> |
||||
|
<artifactId>datacollect-cli</artifactId> |
||||
|
<version>0.1.0</version> |
||||
|
<properties> |
||||
|
<maven.compiler.source>11</maven.compiler.source> |
||||
|
<maven.compiler.target>11</maven.compiler.target> |
||||
|
</properties> |
||||
|
<dependencies> |
||||
|
<dependency> |
||||
|
<groupId>org.jsoup</groupId> |
||||
|
<artifactId>jsoup</artifactId> |
||||
|
<version>1.17.2</version> |
||||
|
</dependency> |
||||
|
<dependency> |
||||
|
<groupId>org.slf4j</groupId> |
||||
|
<artifactId>slf4j-api</artifactId> |
||||
|
<version>2.0.9</version> |
||||
|
</dependency> |
||||
|
<dependency> |
||||
|
<groupId>ch.qos.logback</groupId> |
||||
|
<artifactId>logback-classic</artifactId> |
||||
|
<version>1.4.14</version> |
||||
|
</dependency> |
||||
|
</dependencies> |
||||
|
<build> |
||||
|
<plugins> |
||||
|
<plugin> |
||||
|
<groupId>org.apache.maven.plugins</groupId> |
||||
|
<artifactId>maven-compiler-plugin</artifactId> |
||||
|
<version>3.8.1</version> |
||||
|
</plugin> |
||||
|
<plugin> |
||||
|
<groupId>org.apache.maven.plugins</groupId> |
||||
|
<artifactId>maven-assembly-plugin</artifactId> |
||||
|
<version>3.3.0</version> |
||||
|
<configuration> |
||||
|
<archive> |
||||
|
<manifest> |
||||
|
<mainClass>com.example.datacollect.Main</mainClass> |
||||
|
</manifest> |
||||
|
</archive> |
||||
|
<descriptorRefs> |
||||
|
<descriptorRef>jar-with-dependencies</descriptorRef> |
||||
|
</descriptorRefs> |
||||
|
</configuration> |
||||
|
<executions> |
||||
|
<execution> |
||||
|
<id>make-assembly</id> |
||||
|
<phase>package</phase> |
||||
|
<goals> |
||||
|
<goal>single</goal> |
||||
|
</goals> |
||||
|
</execution> |
||||
|
</executions> |
||||
|
</plugin> |
||||
|
</plugins> |
||||
|
</build> |
||||
|
</project> |
||||
@ -0,0 +1,41 @@ |
|||||
|
package com.example.datacollect; |
||||
|
|
||||
|
import com.example.datacollect.controller.CrawlerController; |
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.strategy.StrategyFactory; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
/*
 * - Adds a logger field
 * - Logs application startup
 * - Adds a global exception handler
 */
||||
|
public class Main { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(Main.class); |
||||
|
|
||||
|
public static void main(String[] args) { |
||||
|
try { |
||||
|
logger.info("Starting CLI Crawler application"); |
||||
|
|
||||
|
ConsoleView view = new ConsoleView(); |
||||
|
ArticleRepository repository = new ArticleRepository(); |
||||
|
StrategyFactory strategyFactory = new StrategyFactory(); |
||||
|
CrawlerController controller = new CrawlerController(view, repository, strategyFactory); |
||||
|
|
||||
|
view.printSuccess("Welcome to CLI Crawler (w10_3)! Type help for commands."); |
||||
|
logger.info("Application initialized successfully"); |
||||
|
|
||||
|
while (true) { |
||||
|
try { |
||||
|
controller.handle(view.readLine()); |
||||
|
} catch (Exception e) { |
||||
|
view.printError("Error: " + e.getMessage()); |
||||
|
logger.error("Error in main loop: {}", e.getMessage(), e); |
||||
|
} |
||||
|
} |
||||
|
} catch (Exception e) { |
||||
|
logger.error("Fatal error in application: {}", e.getMessage(), e); |
||||
|
System.err.println("Fatal error: " + e.getMessage()); |
||||
|
System.exit(1); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,103 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.exception.NetworkException; |
||||
|
import com.example.datacollect.exception.ParseException; |
||||
|
import com.example.datacollect.model.Article; |
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.strategy.CrawlStrategy; |
||||
|
import com.example.datacollect.strategy.StrategyFactory; |
||||
|
import com.example.datacollect.util.RetryUtils; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.jsoup.Jsoup; |
||||
|
import org.jsoup.nodes.Document; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
|
||||
|
import java.io.IOException; |
||||
|
import java.util.List; |
||||
|
import java.util.concurrent.Callable; |
||||
|
|
||||
|
public class AnalyzeCommand implements Command { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(AnalyzeCommand.class); |
||||
|
private final ConsoleView view; |
||||
|
private final StrategyFactory strategyFactory; |
||||
|
|
||||
|
public AnalyzeCommand(ConsoleView view, StrategyFactory strategyFactory) { |
||||
|
this.view = view; |
||||
|
this.strategyFactory = strategyFactory; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String getName() { |
||||
|
return "analyze"; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public void execute(String[] args, ArticleRepository repository) { |
||||
|
if (args.length < 2) { |
||||
|
view.printError("Usage: analyze <url>"); |
||||
|
logger.warn("Invalid command: missing URL argument"); |
||||
|
return; |
||||
|
} |
||||
|
String url = args[1]; |
||||
|
logger.info("Analyze command executed for URL: {}", url); |
||||
|
|
||||
|
try { |
||||
|
CrawlStrategy strategy = strategyFactory.getStrategy(url); |
||||
|
if (strategy == null) { |
||||
|
view.printError("No strategy found for: " + url); |
||||
|
logger.error("No strategy found for URL: {}", url); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
Callable<Document> fetchTask = () -> { |
||||
|
logger.debug("Fetching document from: {}", url); |
||||
|
try { |
||||
|
return Jsoup.connect(url) |
||||
|
.userAgent("Mozilla/5.0") |
||||
|
.timeout(5000) |
||||
|
.get(); |
||||
|
} catch (IOException e) { |
||||
|
throw new NetworkException("Failed to connect to " + url + ": " + e.getMessage(), e); |
||||
|
} |
||||
|
}; |
||||
|
|
||||
|
Document doc = RetryUtils.executeWithRetry(fetchTask); |
||||
|
logger.info("Successfully fetched document from: {}", url); |
||||
|
|
||||
|
List<Article> articles = strategy.parse(url, doc); |
||||
|
logger.info("Parsed {} articles for analysis", articles.size()); |
||||
|
|
||||
|
int total = articles.size(); |
||||
|
int totalTitleLen = 0; |
||||
|
int totalContentLen = 0; |
||||
|
|
||||
|
for (Article a : articles) { |
||||
|
totalTitleLen += a.getTitle() == null ? 0 : a.getTitle().length(); |
||||
|
totalContentLen += a.getContent() == null ? 0 : a.getContent().length(); |
||||
|
} |
||||
|
|
||||
|
            view.printInfo("===== Analysis Summary =====");
            view.printInfo("Total articles: " + total);
            view.printInfo("Total title length: " + totalTitleLen);
            view.printInfo("Total content length: " + totalContentLen);
            if (total > 0) {
                // Integer division: averages are truncated to whole characters.
                view.printInfo("Average title length: " + (totalTitleLen / total));
                view.printInfo("Average content length: " + (totalContentLen / total));
            }
            view.printInfo("======================");
            view.printSuccess("Analysis complete (data not saved)");
||||
|
|
||||
|
logger.info("Analysis completed: {} articles analyzed", total); |
||||
|
} catch (NetworkException e) { |
||||
|
view.printError("Network error: " + e.getMessage()); |
||||
|
logger.error("Network error while analyzing {}: {}", url, e.getMessage(), e); |
||||
|
} catch (ParseException e) { |
||||
|
view.printError("Parse error: " + e.getMessage()); |
||||
|
logger.error("Parse error while analyzing {}: {}", url, e.getMessage(), e); |
||||
|
} catch (Exception e) { |
||||
|
            view.printError("Analysis failed: " + e.getMessage());
||||
|
logger.error("Unexpected error while analyzing {}: {}", url, e.getMessage(), e); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,8 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
|
||||
|
public interface Command { |
||||
|
String getName(); |
||||
|
void execute(String[] args, ArticleRepository repository); |
||||
|
} |
||||
@ -0,0 +1,87 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.exception.NetworkException; |
||||
|
import com.example.datacollect.exception.ParseException; |
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.strategy.CrawlStrategy; |
||||
|
import com.example.datacollect.strategy.StrategyFactory; |
||||
|
import com.example.datacollect.util.RetryUtils; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.jsoup.Jsoup; |
||||
|
import org.jsoup.nodes.Document; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
|
||||
|
import java.io.IOException; |
||||
|
import java.util.concurrent.Callable; |
||||
|
|
||||
|
public class CrawlCommand implements Command { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(CrawlCommand.class); |
||||
|
private final ConsoleView view; |
||||
|
private final StrategyFactory strategyFactory; |
||||
|
|
||||
|
public CrawlCommand(ConsoleView view, StrategyFactory strategyFactory) { |
||||
|
this.view = view; |
||||
|
this.strategyFactory = strategyFactory; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String getName() { |
||||
|
return "crawl"; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public void execute(String[] args, ArticleRepository repository) { |
||||
|
if (args.length < 2) { |
||||
|
view.printError("Usage: crawl <url>"); |
||||
|
logger.warn("Invalid command: missing URL argument"); |
||||
|
return; |
||||
|
} |
||||
|
String url = args[1]; |
||||
|
logger.info("Crawl started for: {}", url); |
||||
|
|
||||
|
CrawlStrategy strategy = strategyFactory.getStrategy(url); |
||||
|
if (strategy == null) { |
||||
|
view.printError("No strategy found for: " + url); |
||||
|
logger.error("No strategy found for URL: {}", url); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
try { |
||||
|
view.printInfo("Crawling: " + url); |
||||
|
|
||||
|
Callable<Document> fetchTask = () -> { |
||||
|
logger.debug("Fetching document from: {}", url); |
||||
|
try { |
||||
|
return Jsoup.connect(url) |
||||
|
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36") |
||||
|
.timeout(10000) |
||||
|
.get(); |
||||
|
} catch (IOException e) { |
||||
|
throw new NetworkException("Failed to connect to " + url + ": " + e.getMessage(), e); |
||||
|
} |
||||
|
}; |
||||
|
|
||||
|
Document doc = RetryUtils.executeWithRetry(fetchTask); |
||||
|
logger.info("Successfully fetched document from: {}", url); |
||||
|
|
||||
|
var articles = strategy.parse(url, doc); |
||||
|
logger.info("Parsed {} articles", articles.size()); |
||||
|
|
||||
|
repository.addAll(articles); |
||||
|
logger.info("Successfully added {} articles to repository", articles.size()); |
||||
|
|
||||
|
view.printSuccess("Crawled " + articles.size() + " articles."); |
||||
|
logger.info("Successfully crawled {} articles from {}", articles.size(), url); |
||||
|
} catch (NetworkException e) { |
||||
|
view.printError("Network error: " + e.getMessage()); |
||||
|
logger.error("Network error while crawling {}: {}", url, e.getMessage(), e); |
||||
|
} catch (ParseException e) { |
||||
|
view.printError("Parse error: " + e.getMessage()); |
||||
|
logger.error("Parse error while crawling {}: {}", url, e.getMessage(), e); |
||||
|
} catch (Exception e) { |
||||
|
view.printError("Failed to crawl: " + e.getMessage()); |
||||
|
logger.error("Unexpected error while crawling {}: {}", url, e.getMessage(), e); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,27 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
|
||||
|
public class ExitCommand implements Command { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(ExitCommand.class); |
||||
|
private final ConsoleView view; |
||||
|
|
||||
|
public ExitCommand(ConsoleView view) { |
||||
|
this.view = view; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String getName() { |
||||
|
return "exit"; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public void execute(String[] args, ArticleRepository repository) { |
||||
|
logger.info("Exit command executed, shutting down"); |
||||
|
view.printSuccess("Bye!"); |
||||
|
        System.exit(0); // terminate the program
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,26 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
|
||||
|
public class HelpCommand implements Command { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(HelpCommand.class); |
||||
|
private final ConsoleView view; |
||||
|
|
||||
|
public HelpCommand(ConsoleView view) { |
||||
|
this.view = view; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String getName() { |
||||
|
return "help"; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public void execute(String[] args, ArticleRepository repository) { |
||||
|
logger.info("Help command executed"); |
||||
|
        view.printInfo("Commands: crawl <url>, analyze <url>, list, help, exit");
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,26 @@ |
|||||
|
package com.example.datacollect.command; |
||||
|
|
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
|
||||
|
public class ListCommand implements Command { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(ListCommand.class); |
||||
|
private final ConsoleView view; |
||||
|
|
||||
|
public ListCommand(ConsoleView view) { |
||||
|
this.view = view; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String getName() { |
||||
|
return "list"; |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public void execute(String[] args, ArticleRepository repository) { |
||||
|
logger.info("List command executed, showing {} articles", repository.size()); |
||||
|
view.display(repository.getAll()); |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,64 @@ |
|||||
|
package com.example.datacollect.controller; |
||||
|
|
||||
|
import com.example.datacollect.command.AnalyzeCommand; |
||||
|
import com.example.datacollect.command.Command; |
||||
|
import com.example.datacollect.command.CrawlCommand; |
||||
|
import com.example.datacollect.command.ExitCommand; |
||||
|
import com.example.datacollect.command.HelpCommand; |
||||
|
import com.example.datacollect.command.ListCommand; |
||||
|
import com.example.datacollect.repository.ArticleRepository; |
||||
|
import com.example.datacollect.strategy.StrategyFactory; |
||||
|
import com.example.datacollect.view.ConsoleView; |
||||
|
import org.slf4j.Logger; |
||||
|
import org.slf4j.LoggerFactory; |
||||
|
import java.util.HashMap; |
||||
|
import java.util.Map; |
||||
|
|
||||
|
public class CrawlerController { |
||||
|
private static final Logger logger = LoggerFactory.getLogger(CrawlerController.class); |
||||
|
private final Map<String, Command> commands = new HashMap<>(); |
||||
|
private final ConsoleView view; |
||||
|
private final ArticleRepository repository; |
||||
|
|
||||
|
public CrawlerController(ConsoleView view, ArticleRepository repository, StrategyFactory strategyFactory) { |
||||
|
this.view = view; |
||||
|
this.repository = repository; |
||||
|
register(new HelpCommand(view)); |
||||
|
register(new ListCommand(view)); |
||||
|
register(new CrawlCommand(view, strategyFactory)); |
||||
|
register(new ExitCommand(view)); |
||||
|
register(new AnalyzeCommand(view, strategyFactory)); |
||||
|
logger.info("CrawlerController initialized with {} commands", commands.size()); |
||||
|
} |
||||
|
|
||||
|
private void register(Command command) { |
||||
|
commands.put(command.getName(), command); |
||||
|
logger.debug("Registered command: {}", command.getName()); |
||||
|
} |
||||
|
|
||||
|
    /** Parses one line of user input and dispatches it to the matching command. */
    public void handle(String input) {
        String text = input == null ? "" : input.trim(); // treat null as empty input
        if (text.isEmpty()) {
            return;
        }

        String[] args = text.split("\\s+"); // split into whitespace-separated arguments
        String cmdName = args[0].toLowerCase(); // command names are matched in lower case

        logger.debug("Processing command: {}", cmdName);

        Command command = commands.get(cmdName); // look up the registered command
        if (command == null) {
            view.printError("Unknown command: " + cmdName);
            logger.warn("Unknown command attempted: {}", cmdName);
            return;
        }

        try {
            command.execute(args, repository); // run the command
        } catch (Exception e) {
            view.printError("Command execution failed: " + e.getMessage());
            logger.error("Error executing command {}: {}", cmdName, e.getMessage(), e);
        }
    }
||||
|
} |
||||
@ -0,0 +1,10 @@ |
|||||
|
package com.example.datacollect.exception; |
||||
|
|
||||
|
public class CrawlerException extends Exception { |
||||
|
public CrawlerException(String message) { |
||||
|
super(message); |
||||
|
} |
||||
|
public CrawlerException(String message, Throwable cause) { |
||||
|
super(message, cause); |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,10 @@ |
|||||
|
package com.example.datacollect.exception; |
||||
|
|
||||
|
public class NetworkException extends CrawlerException { |
||||
|
public NetworkException(String message) { |
||||
|
super(message); |
||||
|
} |
||||
|
public NetworkException(String message, Throwable cause) { |
||||
|
super(message, cause); |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,10 @@ |
|||||
|
package com.example.datacollect.exception; |
||||
|
|
||||
|
public class ParseException extends CrawlerException { |
||||
|
public ParseException(String message) { |
||||
|
super(message); |
||||
|
} |
||||
|
public ParseException(String message, Throwable cause) { |
||||
|
super(message, cause); |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,72 @@ |
|||||
|
package com.example.datacollect.model; |
||||
|
/*
 * Article model class.
 * - Adds field validation
 * - Adds a toString() method (already present)
 * - Consider adding equals() and hashCode()
 */
||||
|
public class Article { |
||||
|
private String title; |
||||
|
private String url; |
||||
|
private String content; |
||||
|
|
||||
|
public Article(String title, String url, String content) { |
||||
|
setTitle(title); |
||||
|
setUrl(url); |
||||
|
setContent(content); |
||||
|
} |
||||
|
|
||||
|
public String getTitle() { |
||||
|
return title; |
||||
|
} |
||||
|
|
||||
|
public void setTitle(String title) { |
||||
|
if (title == null) { |
||||
|
throw new IllegalArgumentException("Title cannot be null"); |
||||
|
} |
||||
|
if (title.trim().isEmpty()) { |
||||
|
throw new IllegalArgumentException("Title cannot be empty"); |
||||
|
} |
||||
|
if (title.length() > 500) { |
||||
|
throw new IllegalArgumentException("Title cannot exceed 500 characters"); |
||||
|
} |
||||
|
this.title = title.trim(); |
||||
|
} |
||||
|
|
||||
|
public String getUrl() { |
||||
|
return url; |
||||
|
} |
||||
|
|
||||
|
    public void setUrl(String url) {
        if (url == null) {
            throw new IllegalArgumentException("URL cannot be null");
        }
        // Trim first so the scheme check sees the same value that gets stored.
        String trimmed = url.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("URL cannot be empty");
        }
        if (!trimmed.startsWith("http://") && !trimmed.startsWith("https://")) {
            throw new IllegalArgumentException("URL must start with http:// or https://");
        }
        this.url = trimmed;
    }
||||
|
|
||||
|
public String getContent() { |
||||
|
return content; |
||||
|
} |
||||
|
|
||||
|
public void setContent(String content) { |
||||
|
if (content == null) { |
||||
|
this.content = ""; |
||||
|
} else if (content.length() > 10000) { |
||||
|
            this.content = content.substring(0, 10000); // truncate content to 10,000 characters
||||
|
} else { |
||||
|
this.content = content; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
@Override |
||||
|
public String toString() { |
||||
|
return "Article{" |
||||
|
+ "title='" + title + '\'' |
||||
|
+ ", url='" + url + '\'' |
||||
|
+ '}'; |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,118 @@
package com.example.datacollect.repository;

import com.example.datacollect.model.Article;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
/* Article repository
 - Adds a logger member
 - Strengthens the defensive checks in add()
 - Strengthens the defensive checks in addAll()
 - Adds null checks, duplicate checks, and length validation
 - Logs operations */
public class ArticleRepository {
    private static final Logger logger = LoggerFactory.getLogger(ArticleRepository.class);
    private static final int MAX_TITLE_LENGTH = 500; /* maximum title length */
    private static final int MAX_CONTENT_LENGTH = 10000; /* maximum content length */

    private final List<Article> articles = new ArrayList<>();
    private final Set<String> urlSet = new HashSet<>();

    public void add(Article article) {
        if (article == null) {
            logger.error("Attempted to add null article");
            throw new IllegalArgumentException("Article cannot be null");
        }

        String title = article.getTitle();
        String url = article.getUrl();
        String content = article.getContent();

        if (title == null || title.trim().isEmpty()) {
            logger.warn("Attempted to add article with empty title");
            throw new IllegalArgumentException("Article title cannot be null or empty");
        }

        if (url == null || url.trim().isEmpty()) {
            logger.warn("Attempted to add article with empty URL");
            throw new IllegalArgumentException("Article URL cannot be null or empty");
        }

        if (title.length() > MAX_TITLE_LENGTH) {
            logger.warn("Article title too long: {} characters (max: {})", title.length(), MAX_TITLE_LENGTH);
            throw new IllegalArgumentException("Article title exceeds maximum length of " + MAX_TITLE_LENGTH);
        }

        if (content != null && content.length() > MAX_CONTENT_LENGTH) {
            logger.warn("Article content too long: {} characters (max: {})", content.length(), MAX_CONTENT_LENGTH);
            content = content.substring(0, MAX_CONTENT_LENGTH);
        }

        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            logger.warn("Invalid URL format: {}", url);
            throw new IllegalArgumentException("Article URL must start with http:// or https://");
        }

        if (urlSet.contains(url)) {
            logger.warn("Duplicate article URL detected: {}", url);
            return; /* skip the duplicate article */
        }

        Article validatedArticle = new Article(title.trim(), url.trim(), content != null ? content.trim() : ""); /* build the validated article */
        articles.add(validatedArticle); /* add the article to the list */
        urlSet.add(url); /* add the URL to the set */
        logger.debug("Added article: {}", title); /* log the addition */
    }

    public void addAll(List<Article> articleList) {
        if (articleList == null) {
            logger.error("Attempted to add null article list");
            throw new IllegalArgumentException("Article list cannot be null");
        }

        int successCount = 0; /* number of articles actually added */
        int skipCount = 0; /* number of invalid or duplicate articles skipped */

        for (Article article : articleList) {
            if (article != null) {
                try {
                    int sizeBefore = articles.size();
                    add(article);
                    if (articles.size() > sizeBefore) {
                        successCount++;
                    } else {
                        skipCount++; /* duplicate URL: add() returned without storing anything */
                    }
                } catch (IllegalArgumentException e) {
                    logger.warn("Skipped invalid article: {}", e.getMessage());
                    skipCount++;
                }
            } else {
                logger.warn("Skipped null article in list");
                skipCount++;
            }
        }

        logger.info("Added {} articles, skipped {} invalid or duplicate articles", successCount, skipCount);
    }

    public List<Article> getAll() {
        logger.debug("Retrieving all articles, total: {}", articles.size());
        return Collections.unmodifiableList(articles); /* return an unmodifiable view of the list */
    }

    public int size() {
        return articles.size(); /* return the number of articles */
    }

    public void clear() {
        int count = articles.size(); /* remember the current number of articles */
        articles.clear();
        urlSet.clear();
        logger.info("Cleared repository, removed {} articles", count);
    }
}
@ -0,0 +1,25 @@
package com.example.datacollect.strategy;

import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class BlogStrategy implements CrawlStrategy {
    @Override
    public boolean supports(String url) {
        return url.contains("blog.example.com");
    }

    @Override
    public List<Article> parse(String url, Document doc) {
        List<Article> articles = new ArrayList<>();
        Elements titles = doc.select(".post-title");
        for (Element e : titles) {
            articles.add(new Article(e.text(), url, ""));
        }
        return articles;
    }
}
@ -0,0 +1,11 @@
package com.example.datacollect.strategy;

import com.example.datacollect.exception.ParseException;
import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import java.util.List;

public interface CrawlStrategy {
    List<Article> parse(String url, Document doc) throws ParseException;
    boolean supports(String url);
}
@ -0,0 +1,77 @@
package com.example.datacollect.strategy;

import com.example.datacollect.exception.ParseException;
import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.List;

/* HNU News strategy
 - Adds a logger member
 - Adds exception handling
 - Applies defensive programming */
public class HnuNewsStrategy implements CrawlStrategy {
    private static final Logger logger = LoggerFactory.getLogger(HnuNewsStrategy.class);

    @Override
    public boolean supports(String url) {
        return url.contains("news.hnu.edu.cn"); /* supports the HNU News site */
    }

    @Override
    public List<Article> parse(String url, Document doc) throws ParseException {
        logger.info("Starting to parse HNU News: {}", url);
        List<Article> articles = new ArrayList<>(); /* holds the parsed articles */

        try {
            Elements listItems = doc.select("ul.list11 li"); /* select the article list items */
            logger.debug("Found {} list items", listItems.size()); /* log how many list items were found */

            for (Element li : listItems) {
                try {
                    Element link = li.selectFirst("a"); /* select the link inside the list item */
                    if (link == null) {
                        logger.warn("No link found in list item"); /* log the missing link */
                        continue;
                    }

                    String articleUrl = link.attr("href"); /* read the link's href attribute */
                    if (!articleUrl.startsWith("http")) {
                        articleUrl = "https://news.hnu.edu.cn/" + articleUrl.replace("..", "").replaceFirst("^/+", ""); /* complete the relative path with exactly one slash after the host */
                    }

                    String title = ""; /* article title */
                    Element titleEl = link.selectFirst("h4.l2.h4s2"); /* select the title element */
                    if (titleEl != null) {
                        title = titleEl.text().trim(); /* extract the title text and trim surrounding whitespace */
                    }

                    String content = ""; /* article content */
                    Element contentEl = link.selectFirst("p.l3.ps3"); /* select the content element */
                    if (contentEl != null) {
                        content = contentEl.text().trim(); /* extract the content text and trim surrounding whitespace */
                    }

                    if (!title.isEmpty()) {
                        Article article = new Article(title, articleUrl, content); /* create the article object */
                        articles.add(article); /* add the article to the list */
                    } else {
                        logger.warn("Empty title found, skipping article");
                    }
                } catch (Exception e) {
                    logger.error("Error parsing individual article: {}", e.getMessage());
                }
            }

            logger.info("Successfully parsed {} articles from HNU News", articles.size());
            return articles;
        } catch (Exception e) {
            logger.error("Failed to parse HNU News page: {}", e.getMessage(), e);
            throw new ParseException("Failed to parse HNU News: " + e.getMessage(), e);
        }
    }
}
@ -0,0 +1,25 @@
package com.example.datacollect.strategy;

import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class NewsStrategy implements CrawlStrategy {
    @Override
    public boolean supports(String url) {
        return url.contains("news.example.com");
    }

    @Override
    public List<Article> parse(String url, Document doc) {
        List<Article> articles = new ArrayList<>();
        Elements items = doc.select(".article-headline");
        for (Element e : items) {
            articles.add(new Article(e.text(), url, ""));
        }
        return articles;
    }
}
@ -0,0 +1,83 @@
package com.example.datacollect.strategy;

import com.example.datacollect.exception.ParseException;
import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.List;
/* Strategy class for People's Daily Online (people.com.cn) */
public class PeopleStrategy implements CrawlStrategy {
    private static final Logger logger = LoggerFactory.getLogger(PeopleStrategy.class);

    @Override
    public boolean supports(String url) {
        return url.contains("people.com.cn"); /* check whether the URL contains people.com.cn */
    }

    @Override
    public List<Article> parse(String url, Document doc) throws ParseException {
        logger.info("Starting to parse People's Daily News: {}", url);
        List<Article> articles = new ArrayList<>(); /* initialize the article list */

        try {
            Elements newsItems = doc.select("div.w1000, div.news-item, li.list_item"); /* select the news containers */
            logger.debug("Found {} news containers", newsItems.size());

            if (newsItems.isEmpty()) {
                newsItems = doc.select("a[href*='/n1/']"); /* fall back to an alternative selector */
                logger.debug("Trying alternative selector, found {} items", newsItems.size());
            }

            for (Element item : newsItems) {
                try {
                    Element link = item.selectFirst("a"); /* select the link element */
                    if (link == null) {
                        link = item.tagName().equals("a") ? item : null; /* check whether the item itself is a link */
                    }

                    if (link == null) {
                        logger.warn("No link found in news item");
                        continue;
                    }

                    String articleUrl = link.attr("href"); /* read the link URL */
                    if (!articleUrl.startsWith("http")) { /* not an absolute URL */
                        if (articleUrl.startsWith("/")) {
                            articleUrl = "https://www.people.com.cn" + articleUrl;
                        } else {
                            articleUrl = "https://www.people.com.cn/" + articleUrl;
                        }
                    }

                    String title = link.text().trim(); /* read the title text */

                    String content = ""; /* initialize the content text */
                    Element contentEl = item.selectFirst("p, div.ed, div.summary"); /* select the content element */
                    if (contentEl != null) {
                        content = contentEl.text().trim(); /* read the content text */
                    }

                    if (!title.isEmpty() && title.length() > 5) {
                        Article article = new Article(title, articleUrl, content); /* create the article object */
                        articles.add(article); /* add the article to the list */
                        logger.debug("Parsed article: {}", title); /* log the parsed article */
                    } else {
                        logger.warn("Invalid title found, skipping article"); /* log the invalid title */
                    }
                } catch (Exception e) {
                    logger.error("Error parsing individual article: {}", e.getMessage());
                }
            }

            logger.info("Successfully parsed {} articles from People's Daily News", articles.size());
            return articles;
        } catch (Exception e) {
            logger.error("Failed to parse People's Daily News page: {}", e.getMessage(), e);
            throw new ParseException("Failed to parse People's Daily News: " + e.getMessage(), e);
        }
    }
}
@ -0,0 +1,36 @@
package com.example.datacollect.strategy;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.List;

public class StrategyFactory {
    private static final Logger logger = LoggerFactory.getLogger(StrategyFactory.class);
    private final List<CrawlStrategy> strategies = new ArrayList<>();

    public StrategyFactory() {
        strategies.add(new HnuNewsStrategy());
        strategies.add(new YouthStrategy());
        strategies.add(new PeopleStrategy());
        strategies.add(new BlogStrategy());
        strategies.add(new NewsStrategy());
        logger.info("Initialized StrategyFactory with {} strategies", strategies.size());
    }

    public CrawlStrategy getStrategy(String url) {
        for (CrawlStrategy s : strategies) {
            if (s.supports(url)) {
                logger.debug("Found strategy {} for URL: {}", s.getClass().getSimpleName(), url);
                return s;
            }
        }
        logger.warn("No strategy found for URL: {}", url);
        return null;
    }

    public void register(CrawlStrategy strategy) {
        strategies.add(strategy);
        logger.info("Registered new strategy: {}", strategy.getClass().getSimpleName());
    }
}
@ -0,0 +1,87 @@
package com.example.datacollect.strategy;

import com.example.datacollect.exception.ParseException;
import com.example.datacollect.model.Article;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.ArrayList;
import java.util.List;
/* Parsing strategy for Youth.cn news */
public class YouthStrategy implements CrawlStrategy {
    private static final Logger logger = LoggerFactory.getLogger(YouthStrategy.class);

    @Override
    public boolean supports(String url) {
        return url.contains("youth.cn"); /* check whether the URL contains the Youth.cn domain */
    }

    @Override
    public List<Article> parse(String url, Document doc) throws ParseException {
        logger.info("Starting to parse Youth News: {}", url);
        List<Article> articles = new ArrayList<>();

        try {
            Elements newsItems = doc.select("div.news-item, div.article-item, li.news-list-item"); /* select the news item elements */
            logger.debug("Found {} news items", newsItems.size());

            if (newsItems.isEmpty()) {
                newsItems = doc.select("a[href*='/n1/']"); /* fall back to an alternative selector */
                logger.debug("Trying alternative selector, found {} items", newsItems.size());
            }

            for (Element item : newsItems) {
                try {
                    Element link = item.selectFirst("a"); /* select the link element */
                    if (link == null) {
                        link = item.tagName().equals("a") ? item : null; /* check whether the item itself is a link */
                    }

                    if (link == null) {
                        logger.warn("No link found in news item");
                        continue;
                    }

                    String articleUrl = link.attr("href"); /* read the link URL */

                    if (!articleUrl.startsWith("http")) { /* not an absolute URL */
                        if (articleUrl.startsWith("/")) {
                            articleUrl = "https://www.youth.cn" + articleUrl;
                        } else {
                            articleUrl = "https://www.youth.cn/" + articleUrl;
                        }
                    }

                    String title = link.text().trim(); /* read the link text */
                    if (title.isEmpty()) { /* skip items with an empty title */
                        continue;
                    }

                    String content = ""; /* initialize the content as an empty string */
                    Element contentEl = item.selectFirst("p.summary, p.desc, div.brief"); /* select the summary element */
                    if (contentEl != null) {
                        content = contentEl.text().trim(); /* read the summary text */
                    }

                    if (!title.isEmpty() && title.length() > 5) {
                        Article article = new Article(title, articleUrl, content);
                        articles.add(article);
                        logger.debug("Parsed article: {}", title);
                    } else {
                        logger.warn("Invalid title found, skipping article");
                    }
                } catch (Exception e) {
                    logger.error("Error parsing individual article: {}", e.getMessage());
                }
            }

            logger.info("Successfully parsed {} articles from Youth News", articles.size());
            return articles;
        } catch (Exception e) {
            logger.error("Failed to parse Youth News page: {}", e.getMessage(), e);
            throw new ParseException("Failed to parse Youth News: " + e.getMessage(), e);
        }
    }
}
@ -0,0 +1,49 @@
package com.example.datacollect.util;

import com.example.datacollect.exception.NetworkException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.concurrent.Callable;

public class RetryUtils {
    private static final Logger logger = LoggerFactory.getLogger(RetryUtils.class);

    private static final int DEFAULT_MAX_RETRIES = 3;
    private static final long DEFAULT_RETRY_DELAY_MS = 1000;

    public static <T> T executeWithRetry(Callable<T> task) throws Exception {
        return executeWithRetry(task, DEFAULT_MAX_RETRIES, DEFAULT_RETRY_DELAY_MS);
    }

    public static <T> T executeWithRetry(Callable<T> task, int maxRetries, long retryDelayMs) throws Exception {
        Exception lastException = null;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (attempt > 0) {
                    logger.info("Retry attempt {}/{} for task", attempt, maxRetries);
                    Thread.sleep(retryDelayMs);
                }

                return task.call();
            } catch (Exception e) {
                lastException = e;

                if (e instanceof NetworkException) {
                    logger.warn("Network error on attempt {}: {}", attempt, e.getMessage());

                    if (attempt < maxRetries) {
                        logger.info("Will retry in {} ms...", retryDelayMs);
                        continue;
                    }
                } else {
                    logger.error("Non-retryable error: {}", e.getMessage());
                    throw e;
                }
            }
        }

        logger.error("All {} retry attempts failed", maxRetries + 1);
        throw lastException;
    }
}
@ -0,0 +1,46 @@
package com.example.datacollect.view;

import com.example.datacollect.model.Article;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.List;
import java.util.Scanner;

public class ConsoleView {
    private static final Logger logger = LoggerFactory.getLogger(ConsoleView.class);
    private static final String ANSI_RESET = "\u001B[0m";
    private static final String ANSI_GREEN = "\u001B[32m";
    private static final String ANSI_RED = "\u001B[31m";
    private static final String ANSI_BLUE = "\u001B[34m";

    private final Scanner scanner = new Scanner(System.in);

    public String readLine() {
        System.out.print("> ");
        String input = scanner.nextLine();
        return input; /* return the user's input */
    }

    public void printSuccess(String msg) {
        System.out.println(ANSI_GREEN + msg + ANSI_RESET);
    }

    public void printError(String msg) {
        System.out.println(ANSI_RED + msg + ANSI_RESET);
    }

    public void printInfo(String msg) {
        System.out.println(ANSI_BLUE + msg + ANSI_RESET);
    }

    public void display(List<Article> articles) {
        if (articles.isEmpty()) {
            printInfo("No articles yet. Run crawl first.");
            return;
        }
        for (int i = 0; i < articles.size(); i++) {
            Article a = articles.get(i);
            System.out.println((i + 1) + ". " + a.getTitle() + " | " + a.getUrl());
        }
    }
}
@ -0,0 +1,24 @@
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/crawler.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>logs/crawler.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="FILE" />
    </root>
</configuration>
@ -0,0 +1,24 @@
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/crawler.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>logs/crawler.%d{yyyy-MM-dd}.log</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="FILE" />
    </root>
</configuration>
@ -0,0 +1,3 @@
artifactId=datacollect-cli
groupId=com.example
version=0.1.0
@ -0,0 +1,22 @@
com\example\datacollect\command\ListCommand.class
com\example\datacollect\strategy\PeopleStrategy.class
com\example\datacollect\command\CrawlCommand.class
com\example\datacollect\strategy\BlogStrategy.class
com\example\datacollect\repository\ArticleRepository.class
com\example\datacollect\Main.class
com\example\datacollect\view\ConsoleView.class
com\example\datacollect\command\ExitCommand.class
com\example\datacollect\command\HelpCommand.class
com\example\datacollect\util\RetryUtils.class
com\example\datacollect\strategy\NewsStrategy.class
com\example\datacollect\command\Command.class
com\example\datacollect\controller\CrawlerController.class
com\example\datacollect\exception\CrawlerException.class
com\example\datacollect\exception\NetworkException.class
com\example\datacollect\command\AnalyzeCommand.class
com\example\datacollect\strategy\StrategyFactory.class
com\example\datacollect\strategy\HnuNewsStrategy.class
com\example\datacollect\strategy\YouthStrategy.class
com\example\datacollect\exception\ParseException.class
com\example\datacollect\strategy\CrawlStrategy.class
com\example\datacollect\model\Article.class
@ -0,0 +1,22 @@
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\NewsStrategy.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\controller\CrawlerController.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\repository\ArticleRepository.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\HnuNewsStrategy.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\ExitCommand.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\Command.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\Main.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\CrawlCommand.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\exception\NetworkException.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\StrategyFactory.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\BlogStrategy.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\util\RetryUtils.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\HelpCommand.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\exception\CrawlerException.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\exception\ParseException.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\model\Article.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\view\ConsoleView.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\AnalyzeCommand.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\YouthStrategy.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\command\ListCommand.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\CrawlStrategy.java
C:\Users\27687\Desktop\java-cli\src\main\java\com\example\datacollect\strategy\PeopleStrategy.java
@ -0,0 +1,705 @@
# Lesson plan: Advanced Programming, Week 10. Design patterns: flexibility and extensibility

| Item | Details |
| --- | --- |
| **Course** | Advanced Programming |
| **Week** | Week 10 |
| **Topic** | Design patterns: flexibility and extensibility |
| **Class hours** | 2 class hours (90 minutes) |
| **Audience** | Students who have completed the Week 9 CLI + MVC architecture unit and know the basics of the Command pattern |
| **Environment** | JDK 17+, IntelliJ IDEA, Maven |
| **Previously** | W9 built the CLI skeleton (MVC layering + Command routing) but left two hazards: parsing logic coupled into the Command, and a shared List\<Article\> reference with no protection |

---

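## Note on the teaching adjustment: why does W10 put "armor" on the "skeleton"?

> **W9 result**: an extensible command-line skeleton → **W9 pain point**: the parser and the data store are still unprotected

| Dimension | W9 state | W10 goal |
|------|--------|---------|
| **Architecture** | Clear MVC layering | MVC + Strategy pattern + repository layer |
| **Command extension** | New commands need no Controller changes | New parsers need no changes to any existing code |
| **Data safety** | List\<Article\> writable by everyone | Encapsulated in a Repository that exposes only safe operations |
| **Parsing logic** | Hard-coded inside CrawlCommand | Strategy pattern, matched automatically by URL |
| **Code size** | ~8 classes | ~12 classes, each smaller and more focused |

**Rationale**:
1. In W9 students already experienced the benefit of the Command pattern: **extensibility through polymorphism**
2. The Strategy pattern is another hands-on application of polymorphism and **a deeper use of interface abstraction**
3. The repository layer puts encapsulation, a core OOP principle, into practice and fills the gap left by W9
4. The parser factory shows students the power of **"automatic matching"**: supporting a new website only takes one new class

**Deeper educational value**:
> W9 taught students how to split code apart; W10 teaches them how to bring the separated pieces back together gracefully: **the interface is the contract, the factory does the automatic matching, and the repository guards the data**. These three phrases are the essence of this week.

---
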
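## 1. Teaching objectives

| Dimension | Description |
|----------|----------|
| **Knowledge** | Understand the definition of the Strategy pattern and its basis in polymorphism; master two variants of the Factory pattern (Factory Method / Simple Factory) and when each applies; understand how the Repository pattern encapsulates data access. |
| **Engineering practice** | Use the Strategy pattern in the crawler project to encapsulate the parsing logic for different websites; implement a parser factory that matches a parsing strategy automatically from the URL; replace the bare List with the Repository pattern to provide a safe data-access interface. |
| **Change of mindset** | Move from hard-wired logic to pluggable strategies; move from manipulating collections directly to reading and writing through a repository; understand the open-closed principle: open for extension, closed for modification. |
| **Tool use** | Use AI to review whether the Strategy pattern implementation is truly decoupled; have AI act as a "website structure analyst" to help write concrete parsing strategies; use AI to suggest a safe interface for the Repository. |

---
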
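## 2. Key points and difficulties

| Item | Content | How to tackle it |
|------|------|----------|
| **Key points** | The polymorphic nature of the Strategy pattern, the automatic-matching mechanism of the parser factory, and the Repository's encapsulation of data access | Start from the question "what has to change to add a new website?" to show how the Strategy pattern satisfies the open-closed principle; "attack" the unprotected List to show why a Repository is needed |
| **Difficulties** | Grasping the abstract idea that "the interface is the contract"; implementing reflection-based or Map-based registration in the Factory pattern; how the repository layer and the Strategy pattern work together | Use the socket-and-appliance analogy for interface standards; demonstrate live the evolution from hard-coding → factory → reflection; use a sequence diagram to show the full "User → Command → Strategy → Repository" call chain |

---
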
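## 3. Teaching process design (90 minutes)

| Stage | Time | Content | Teacher and student activity | AI collaboration point |
| --- | --- | --- | --- | --- |
| **1. W9 review and pain points** | 8' | Review the W9 result (the CLI skeleton) and expose two hazards: (1) parsing logic hard-coded in CrawlCommand; (2) List\<Article\> readable and writable by everyone | **Teacher demo**: show the W9 code and use "incident scenarios" to prompt thinking | None |
| **2. Strategy pattern: standardizing the parser "plug"** | 18' | Definition of the Strategy pattern, interface design, polymorphic dispatch, comparison with the Command pattern | **Analogy**: sockets and appliances; **Teacher demo**: evolving from if-else to the Strategy pattern | Have AI generate a "Strategy pattern vs switch-case" comparison |
| **3. Parser factory: the magic of automatic matching** | 14' | Two forms of the Factory pattern (simple factory → Map-registration factory); implementing the parser factory | **Teacher demo**: first check the host with if-else, then upgrade to a Map-registration factory | Have AI explain how the Factory and Strategy patterns work together |
| **4. Repository pattern: arming data access** | 12' | Definition of the Repository, interface design, impact of replacing List\<Article\> | **Teacher demo**: replace the List with a Repository in the existing code and show what changes | Students use AI to audit the Repository interface for "minimal completeness" |
| **5. Connecting the overall architecture** | 8' | One sequence diagram that links User → CLI → Controller → Command → Strategy → Repository → Model | **Interaction**: students draw the call chain on the whiteboard | None |
| **6. Putting it into code** | 20' | Implement the CrawlStrategy interface + two strategies + the parser factory + ArticleRepository | **Teacher demo**: write the code step by step, deliberately planting the exception handling for a "strategy match failure" | Afterwards, use AI to check the Strategy pattern implementation |
| **7. Architecture reflection and W11 preview** | 5' | What hazards remain in the current architecture? (inconsistent exception handling, missing logging) → preview of W11 robustness engineering | **Interaction**: what happens if the parser factory finds no matching strategy? | None |
| **8. Practice task** | 5' | Implement the Strategy pattern and the repository layer to complete this week's code upgrade | Students code in class while the teacher circulates | None |
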
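
Stage 7 asks what happens when the parser factory finds no matching strategy. The project's `StrategyFactory.getStrategy` returns `null` in that case, so a caller that forgets the check fails with a `NullPointerException`. The following is a minimal sketch of one alternative, not the project's code: `SketchStrategy`, `SketchFactory` and `FactoryMissDemo` are hypothetical names, and the strategy is reduced to a `supports` check so the example runs without Jsoup. Returning an `Optional` makes the "no match" case part of the method signature, and the caller decides in one place what to report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified stand-in for the project's CrawlStrategy.
interface SketchStrategy {
    boolean supports(String url);
    String name();
}

// Hypothetical factory: the first registered strategy that supports the URL wins.
class SketchFactory {
    private final List<SketchStrategy> strategies = new ArrayList<>();

    void register(SketchStrategy strategy) {
        strategies.add(strategy);
    }

    // Optional.empty() signals "no match" instead of a bare null.
    Optional<SketchStrategy> find(String url) {
        if (url == null) {
            return Optional.empty();
        }
        return strategies.stream().filter(s -> s.supports(url)).findFirst();
    }
}

class FactoryMissDemo {
    // Builds a strategy that matches any URL containing the given host.
    static SketchStrategy hostStrategy(String host) {
        return new SketchStrategy() {
            @Override public boolean supports(String url) { return url.contains(host); }
            @Override public String name() { return host; }
        };
    }

    // The caller handles the miss explicitly rather than dereferencing null.
    static String describe(SketchFactory factory, String url) {
        return factory.find(url)
                .map(s -> "matched " + s.name())
                .orElse("Unsupported website!");
    }

    public static void main(String[] args) {
        SketchFactory factory = new SketchFactory();
        factory.register(hostStrategy("blog.example.com"));
        factory.register(hostStrategy("news.example.com"));
        System.out.println(describe(factory, "https://blog.example.com/post/1"));
        System.out.println(describe(factory, "https://unknown.example.org/"));
    }
}
```
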
---

## 4. Core teaching script

### 4.1 W9 review and pain points (8 minutes)

**Teacher narration**:
> "Last class we built a nice skeleton: CLI + MVC + the Command pattern. Let's give ourselves some credit first: adding a command only takes one new class, with zero changes to the Controller. But consider one question:"

**Project the W9 CrawlCommand stub on screen**:
```java
public class CrawlCommand implements Command {
    // ...
    public void execute(String[] args, List<Article> articles) {
        if (args.length < 2) {
            view.printError("Usage: crawl <url>");
            return;
        }
        view.printInfo("Stub: Would crawl " + args[1]);
    }
}
```

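**Guiding questions**:
1. "This stub is due to be filled in next week. Suppose we implement real crawling now: where does the code go?"
2. "If I need to support two websites, say a tech blog and a news site, whose HTML structures are completely different, what will this `execute` method turn into?"

**Show the "nightmare version" of CrawlCommand**: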
```java
public void execute(String[] args, List<Article> articles) {
    String url = args[1];
    // fifty lines of if-else hell...
    if (url.contains("blog.example.com")) {
        // parse the tech blog's HTML
        Document doc = Jsoup.connect(url).get();
        Elements titles = doc.select(".post-title");
        for (Element e : titles) {
            articles.add(new Article(e.text(), url, ""));
        }
    } else if (url.contains("news.example.com")) {
        // parse the news site's HTML
        Document doc = Jsoup.connect(url).get();
        Elements items = doc.select(".article-headline");
        for (Element e : items) {
            articles.add(new Article(e.text(), url, ""));
        }
    } else {
        view.printError("Unsupported website!");
    }
}
```

**痛点提炼**: |
||||
|
> "看到了吗?每支持一个新网站,就要在这里加一个`else if`。这就是W1我们痛批的'牵一发而动全身',只不过这次灾难地点从`main`搬到了`CrawlCommand`。" |
||||
|
> |
||||
|
> "更重要的是,我们上节课辛辛苦苦实现了Command模式,难道解析逻辑又要回到if-else地狱吗?**这就是W10要解决的第一个问题:怎么让解析逻辑也可插拔?**" |
||||
|
|
||||
|
**第二个隐患——共享状态的回顾**: |
||||
|
> "还有一件事,我们上节课结束前提到的:`List<Article> articles`在所有Command之间共享。任何一个Command都可以往里面塞东西、删东西、甚至清空。这是W10要解决的第二个问题:**怎么给数据装上'防盗门'?**" |
||||
|
|
||||
|
--- |
||||
|
|
||||
|
### 4.2 Strategy Pattern: a "Plug Standard" for Parsers (18 minutes)

#### 4.2.1 Starting from an Analogy

**Teacher script**:
> "Let's start with an everyday scene. There is a three-pin socket on your wall at home. You can plug in a TV, a computer, or a phone charger: any appliance that meets the standard works. The socket doesn't care what the appliance is; it only recognizes the interface standard."

**Analogy mapping**:

| Everyday scene | Code counterpart |
|----------|----------|
| Three-pin socket | `CrawlStrategy` interface |
| TV / computer / charger | Concrete parsing strategies (BlogStrategy/NewsStrategy) |
| Electric current | Input: URL + Document; output: List\<Article\> |
| You (the user) | CrawlCommand |
| Socket panel | Parser factory |

> "The core idea of the strategy pattern is: **define an interface for an algorithm so that concrete implementations can be swapped for one another without affecting the client that uses the algorithm.**"

#### 4.2.2 Defining the Strategy Pattern

```java
// src/main/java/com/crawler/strategy/CrawlStrategy.java
package com.crawler.strategy;

import com.crawler.model.Article;
import org.jsoup.nodes.Document;
import java.util.List;

public interface CrawlStrategy {
    /**
     * Parses a list of articles from an already fetched Document.
     * @param url the original request URL (used to populate each Article)
     * @param doc the Document produced by Jsoup
     * @return the parsed list of articles
     */
    List<Article> parse(String url, Document doc);

    /**
     * Decides whether this strategy handles the given URL.
     * @param url the URL to check
     * @return true if this strategy can handle the URL
     */
    boolean supports(String url);
}
```

**Teacher script**:
> "Notice the strategy interface has two methods. `parse` does the work; `supports` answers 'can I do this job?' What is this? **It's a contract!** Any site that wants to be supported by our crawler must sign it: tell me whether you are my client (`supports`), and how to parse you (`parse`)."

#### 4.2.3 Concrete Strategy Examples

```java
// BlogStrategy.java - parsing strategy for the tech blog
public class BlogStrategy implements CrawlStrategy {
    @Override
    public boolean supports(String url) {
        return url.contains("blog.example.com");
    }

    @Override
    public List<Article> parse(String url, Document doc) {
        List<Article> articles = new ArrayList<>();
        Elements titles = doc.select(".post-title");
        for (Element e : titles) {
            articles.add(new Article(e.text(), url, ""));
        }
        return articles;
    }
}

// NewsStrategy.java - parsing strategy for the news site
public class NewsStrategy implements CrawlStrategy {
    @Override
    public boolean supports(String url) {
        return url.contains("news.example.com");
    }

    @Override
    public List<Article> parse(String url, Document doc) {
        List<Article> articles = new ArrayList<>();
        Elements items = doc.select(".article-headline");
        for (Element e : items) {
            articles.add(new Article(e.text(), url, ""));
        }
        return articles;
    }
}
```

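One caveat about the `supports()` implementations above: `url.contains("blog.example.com")` also matches URLs that merely mention that host, for example in a query string. A stricter variant compares the parsed host instead. The sketch below is an illustration under assumptions, not part of the course code: the helper name `HostMatcher` is hypothetical, and it relies only on `java.net.URI` from the JDK. A strategy could delegate its `supports()` to such a helper.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not in the course code): decides by host name rather than by substring.
final class HostMatcher {
    private HostMatcher() {}

    /** Returns true only if the URL parses and its host equals expectedHost (case-insensitive). */
    static boolean hostEquals(String url, String expectedHost) {
        if (url == null) {
            return false;
        }
        try {
            // getHost() is null for relative references such as "blog.example.com"
            String host = new URI(url.trim()).getHost();
            return host != null && host.equalsIgnoreCase(expectedHost);
        } catch (URISyntaxException e) {
            return false; // unparseable input is simply "not supported"
        }
    }

    public static void main(String[] args) {
        String tricky = "https://evil.example.org/?ref=blog.example.com";
        // Substring check versus host check on the same input
        System.out.println("contains:   " + tricky.contains("blog.example.com"));
        System.out.println("hostEquals: " + hostEquals(tricky, "blog.example.com"));
        System.out.println("real blog:  "
                + hostEquals("https://blog.example.com/post/1", "blog.example.com"));
    }
}
```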
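**Comparison: strategy pattern vs hard-coded if-else**

| Dimension | if-else spaghetti | Strategy pattern |
|------|-------------|----------|
| Adding a site | Edit CrawlCommand and add an else if | Write one new class implementing CrawlStrategy |
| Changing parsing logic | Hunt through CrawlCommand for the right else if | Edit only the corresponding strategy class |
| Testing | Must start the whole crawler | Unit-test each Strategy on its own |
| Follows the open/closed principle? | ❌ Open for modification | ✅ Open for extension, closed for modification |

**Comparison with the Command pattern (to deepen understanding)**:
> "Last class, with the Command pattern, we defined one class per command; this class, with the strategy pattern, we define one class per site's parsing algorithm. **At heart it is the same OOP idea: replace conditional branches with polymorphism.** The only difference is that Command's interface method is `execute()` and Strategy's is `parse()`."
>
> "Write this down: **interfaces are the tool for eliminating if-else, and polymorphism is the soul of interfaces.**"

---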
### 4.3 Parser Factory: the Magic of Automatic Matching (14 minutes)

#### 4.3.1 Raising the Problem

**Teacher script**:
> "Now we have a strategy for site A and a strategy for site B. So who picks the strategy? Who walks through all the strategies to find one whose `supports` returns true?"
>
> "If we put that logic in CrawlCommand, the strategy pattern is wasted: CrawlCommand would still have to 'know' which strategies exist. What we want is a black box: **drop a URL in, and a suitable parser pops out.**"

#### 4.3.2 Implementing the Parser Factory

```java
// src/main/java/com/crawler/strategy/StrategyFactory.java
package com.crawler.strategy;

import java.util.ArrayList;
import java.util.List;

public class StrategyFactory {
    private final List<CrawlStrategy> strategies = new ArrayList<>();

    // Register strategies: a new site only needs one more line here
    public StrategyFactory() {
        strategies.add(new BlogStrategy());
        strategies.add(new NewsStrategy());
        // To add a site later: strategies.add(new XxxStrategy());
    }

    /**
     * Automatically matches a parsing strategy for the URL.
     * @param url the target URL
     * @return the matching strategy, or null if none matches
     */
    public CrawlStrategy getStrategy(String url) {
        for (CrawlStrategy s : strategies) {
            if (s.supports(url)) {
                return s;
            }
        }
        return null; // no matching strategy found
    }
}
```

**Teacher script**:
> "This factory class is simple enough: one List holds all the strategies, and one method iterates to find a match. But simple doesn't mean weak."
>
> "**Key point**: to support a new site, you only need to:"

1. Write an `XxxStrategy` that implements `CrawlStrategy`
2. Add one line, `strategies.add(new XxxStrategy())`, to the factory constructor

> "CrawlCommand doesn't change by a single line. This is a win for the open/closed principle."

#### 4.3.3 From a Simple Factory to More Advanced Registration (stretch thinking)

**Teacher script**:
> "Some of you may ask: we still have to add a line to the factory constructor; can we get to truly zero changes? Of course: use reflection or SPI."

**Concept demo (implementation not required)**:
```java
// Advanced idea: scan a given package for all CrawlStrategy implementations
// and register them automatically via reflection, so that "adding a class is enough"
// This is one of the core ideas behind the Spring framework
```

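For reference, the JDK's built-in SPI mechanism, `java.util.ServiceLoader`, can approximate this kind of automatic registration without any framework. The following is a minimal sketch rather than the course's required implementation: the class name `SpiStrategyFactory` and the one-method `CrawlStrategy` stand-in are assumptions made to keep the example self-contained. Providers are discovered only if each implementation class is listed in a `META-INF/services/` file named after the interface's fully qualified name; with no such file on the classpath, the factory is simply empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// Simplified stand-in for the course's CrawlStrategy (parse() omitted for brevity).
interface CrawlStrategy {
    boolean supports(String url);
}

// Hypothetical factory: discovers strategies through ServiceLoader
// instead of hard-coded strategies.add(new XxxStrategy()) lines.
final class SpiStrategyFactory {
    private final List<CrawlStrategy> strategies = new ArrayList<>();

    SpiStrategyFactory() {
        // Instantiates every implementation named in the provider-configuration file
        // META-INF/services/<fully qualified name of CrawlStrategy>.
        for (CrawlStrategy s : ServiceLoader.load(CrawlStrategy.class)) {
            strategies.add(s);
        }
    }

    int registeredCount() {
        return strategies.size();
    }

    // Optional avoids returning null when nothing matches.
    Optional<CrawlStrategy> getStrategy(String url) {
        return strategies.stream().filter(s -> s.supports(url)).findFirst();
    }
}

class SpiDemo {
    public static void main(String[] args) {
        SpiStrategyFactory factory = new SpiStrategyFactory();
        System.out.println("Discovered strategies: " + factory.registeredCount());
        System.out.println("Match found: "
                + factory.getStrategy("https://blog.example.com/post").isPresent());
    }
}
```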

> "We don't require you to master this technique yet, but I want you to know: every `new XxxStrategy()` you write today may one day evolve into framework-level auto-wiring. **The habits of thought you build now determine how far you can go later.**"

#### 4.3.4 The Refactored CrawlCommand

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;
import java.util.List;

public class CrawlCommand implements Command {
    private ConsoleView view;
    private StrategyFactory strategyFactory;
    private ArticleRepository repository; // Note: this is a Repository now!

    public CrawlCommand(ConsoleView v, StrategyFactory f, ArticleRepository r) {
        this.view = v;
        this.strategyFactory = f;
        this.repository = r;
    }

    public String getName() { return "crawl"; }

    // W9 signature for now; 4.4.3 changes the second parameter to ArticleRepository
    public void execute(String[] args, List<Article> articles) {
        if (args.length < 2) {
            view.printError("Usage: crawl <url>");
            return;
        }
        String url = args[1];

        // 1. The factory picks the strategy automatically
        CrawlStrategy strategy = strategyFactory.getStrategy(url);
        if (strategy == null) {
            view.printError("No strategy found for: " + url);
            return;
        }

        // 2. Fetch the page
        view.printInfo("Crawling: " + url);
        try {
            Document doc = Jsoup.connect(url).get();
            List<Article> parsed = strategy.parse(url, doc);

            // 3. Store through the repository (instead of touching the List directly)
            for (Article a : parsed) {
                repository.add(a);
            }
            view.printSuccess("Crawled " + parsed.size() + " articles.");
        } catch (IOException e) {
            view.printError("Failed to crawl: " + e.getMessage());
        }
    }
}
```

**Teacher script**:
> "Look at what CrawlCommand is responsible for now: take the URL → hand it to the factory to pick a strategy → run the parse → hand the results to the repository for storage. **What is it doing itself? Dispatching!** This is the Controller's 'dispatching mindset' from last class, now extended into the Command."

---
### 4.4 Repository Pattern: Arming Data Access (12 minutes)

#### 4.4.1 Revisiting the Problem

**Teacher script**:
> "Back to the question from the end of last class: `List<Article>` is shared across all Commands. Any Command can do things like these:"

```java
articles.clear();      // wipe every article
articles.add(null);    // insert a null
articles.remove(0);    // delete at will
```

> "If a new colleague takes over and doesn't know the unwritten rule 'don't touch this List', one `articles.clear()` later your `list` command suddenly shows nothing. **Order maintained by convention will be broken sooner or later. We need concrete 'rules': constraints enforced in code.**"

#### 4.4.2 Defining ArticleRepository

```java
// src/main/java/com/crawler/repository/ArticleRepository.java
package com.crawler.repository;

import com.crawler.model.Article;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ArticleRepository {
    private final List<Article> articles = new ArrayList<>();

    /**
     * Adds one article. Note: null is rejected; this is a rule enforced in code, not a verbal agreement.
     */
    public void add(Article article) {
        if (article == null) {
            throw new IllegalArgumentException("Article cannot be null");
        }
        articles.add(article);
    }

    /**
     * Returns a read-only view of all articles.
     * Callers cannot modify the internal data through the returned list.
     */
    public List<Article> getAll() {
        return Collections.unmodifiableList(articles);
    }

    /**
     * Returns the number of articles.
     */
    public int size() {
        return articles.size();
    }

    /**
     * Clears all articles (admin only; access control comes in the next installment).
     */
    public void clear() {
        articles.clear();
    }
}
```

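The read-only view returned by `getAll()` above can be illustrated with a small stand-alone sketch. It uses a `List<String>` in place of `List<Article>` and a hypothetical class name, `ReadOnlyViewDemo`; both are assumptions made for brevity. It shows two behaviors: mutating calls on the view throw `UnsupportedOperationException`, while the view still reflects later changes to the backing list, because it is a wrapper and not a copy. A snapshot that does not track changes would need something like `List.copyOf` (Java 10+).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; Strings stand in for Article objects.
final class ReadOnlyViewDemo {

    /** Returns true if add() on the given list is rejected with UnsupportedOperationException. */
    static boolean rejectsAdd(List<String> list) {
        try {
            list.add("intruder");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    /** Returns the size seen through an unmodifiable view after the backing list grows. */
    static int viewSizeAfterBackingAdd() {
        List<String> backing = new ArrayList<>();
        List<String> view = Collections.unmodifiableList(backing);
        backing.add("first article"); // change made through the owner, not through the view
        return view.size();           // the view is a live wrapper around 'backing'
    }

    /** Returns the size of a List.copyOf snapshot after the backing list grows. */
    static int snapshotSizeAfterBackingAdd() {
        List<String> backing = new ArrayList<>();
        List<String> snapshot = List.copyOf(backing); // independent immutable copy
        backing.add("first article");
        return snapshot.size();
    }

    public static void main(String[] args) {
        List<String> view = Collections.unmodifiableList(new ArrayList<String>());
        System.out.println("view rejects add: " + rejectsAdd(view));
        System.out.println("view size after backing add: " + viewSizeAfterBackingAdd());
        System.out.println("snapshot size after backing add: " + snapshotSizeAfterBackingAdd());
    }
}
```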
**Teacher script**:
> "Three key design points:"
>
> - **`add()` rejects null**: the rule is written in code, not in an email
> - **`getAll()` returns an unmodifiable view**: with `Collections.unmodifiableList()`, a caller that tries add/remove gets **an exception immediately**, not a 'silent bug'
> - **ClearCommand needs to wipe the data? It calls `repository.clear()`** instead of touching the List directly
>
> "This is lesson one of object orientation: encapsulation. Hide the data and expose only safe methods. Moving from 'operating on the collection directly' to 'going through a repository' is a watershed in a programmer's maturity."

#### 4.4.3 Architectural Changes after Introducing the Repository

**Adjusting the `execute` method of the Command interface**:

```java
// Before (W9)
public interface Command {
    String getName();
    void execute(String[] args, List<Article> articles);
}

// After (W10)
public interface Command {
    String getName();
    void execute(String[] args, ArticleRepository repository);
}
```

**Teacher script**:
> "The change is small: swap `List<Article>` for `ArticleRepository`. But the meaning is completely different: before it was 'here's the data, do whatever you like'; now it is 'here's a safe access channel'."

**All Commands are adjusted accordingly**:

```java
// ListCommand.java - after the change
public class ListCommand implements Command {
    private ConsoleView view;
    public ListCommand(ConsoleView v) { this.view = v; }
    public String getName() { return "list"; }
    public void execute(String[] args, ArticleRepository repository) {
        view.display(repository.getAll()); // fetch data through the repository
    }
}

// ClearCommand.java (new example)
public class ClearCommand implements Command {
    private ConsoleView view;
    public ClearCommand(ConsoleView v) { this.view = v; }
    public String getName() { return "clear"; }
    public void execute(String[] args, ArticleRepository repository) {
        repository.clear();
        view.printSuccess("All articles cleared.");
    }
}
```

**Adjusting the Controller and main**:

```java
// App.java - after the change
public class App {
    public static void main(String[] args) {
        ConsoleView view = new ConsoleView();
        ArticleRepository repository = new ArticleRepository(); // replaces List<Article>
        StrategyFactory factory = new StrategyFactory();        // new

        CrawlerController controller = new CrawlerController(view, repository, factory);

        view.printSuccess("Welcome to CLI Crawler v2.0!");
        view.printInfo("Type 'help' for commands.");

        while (true) {
            controller.handle(view.readLine());
        }
    }
}
```

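`CrawlerController` itself is not shown in this section. As a rough illustration of how its `handle` method might route a line to a command and pass along the shared repository, here is a self-contained sketch. Everything in it is an assumption for illustration: the `Map`-based command registry, the whitespace tokenization, the `String`-based `ArticleRepository`, and the `List<String>` that stands in for `ConsoleView` so the output can be inspected. The `StrategyFactory` parameter is left out because no command in the sketch needs it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in (assumption): articles are plain Strings here.
final class ArticleRepository {
    private final List<String> articles = new ArrayList<>();

    void add(String article) {
        if (article == null) throw new IllegalArgumentException("Article cannot be null");
        articles.add(article);
    }
    List<String> getAll() { return Collections.unmodifiableList(articles); }
    void clear() { articles.clear(); }
}

interface Command {
    String getName();
    void execute(String[] args, ArticleRepository repository);
}

// Hypothetical controller: routes the first token of a line to a registered Command.
final class CrawlerController {
    private final Map<String, Command> commands = new HashMap<>();
    private final List<String> out;              // stand-in for ConsoleView
    private final ArticleRepository repository;

    CrawlerController(List<String> out, ArticleRepository repository) {
        this.out = out;
        this.repository = repository;
    }

    void register(Command c) { commands.put(c.getName(), c); }

    void handle(String line) {
        if (line == null || line.trim().isEmpty()) return; // ignore empty input
        String[] args = line.trim().split("\\s+");
        Command c = commands.get(args[0]);
        if (c == null) {
            out.add("Unknown command: " + args[0]);
            return;
        }
        c.execute(args, repository); // every command receives the same guarded repository
    }
}

class ControllerDemo {
    /** Feeds the given lines to a controller with "list" and "clear" registered; returns the output. */
    static List<String> run(String... lines) {
        List<String> out = new ArrayList<>();
        ArticleRepository repo = new ArticleRepository();
        repo.add("Hello Strategy");
        CrawlerController controller = new CrawlerController(out, repo);
        controller.register(new Command() {
            public String getName() { return "list"; }
            public void execute(String[] args, ArticleRepository r) { out.addAll(r.getAll()); }
        });
        controller.register(new Command() {
            public String getName() { return "clear"; }
            public void execute(String[] args, ArticleRepository r) {
                r.clear();
                out.add("All articles cleared.");
            }
        });
        for (String line : lines) controller.handle(line);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("list", "clear", "list", "bogus"));
    }
}
```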
---

### 4.5 Tying the Architecture Together (8 minutes)

**Teacher script**:
> "Now let's connect all the parts and trace the full path of a single `crawl https://blog.example.com` command."

**Sequence diagram (narrated while drawing on the whiteboard)**:
```
User types "crawl https://blog.example.com"
    │
    ▼
ConsoleView.readLine()
    │
    ▼
CrawlerController.handle("crawl https://blog.example.com")
    │  Map lookup "crawl" → CrawlCommand
    ▼
CrawlCommand.execute(args, repository)
    │
    ├─► StrategyFactory.getStrategy(url)
    │       │  iterate over List<CrawlStrategy>
    │       │  BlogStrategy.supports(url) → true!
    │       ▼
    │   returns BlogStrategy
    │
    ├─► Jsoup.connect(url).get() → Document
    │
    ├─► BlogStrategy.parse(url, doc) → List<Article>
    │
    └─► for each article: repository.add(article)
            │
            ▼
        ArticleRepository.articles.add(article)

Finally: ConsoleView.printSuccess("Crawled N articles.")
```

**Teacher script**:
> "Seven calls, each with a clear responsibility: the View handles input and output, the Controller routes, the Command dispatches, the Factory matches, the Strategy parses, and the Repository stores. **No class does two people's jobs, and no class is unsure what its job is.**"
>
> "That's what engineering means: not writing code fast, but writing code right."

---
### 4.6 Code Walkthrough (20 minutes)

**Teacher preparation**: before class, prepare a "W9 → W10 upgrade" change list and demo the key changes live.

**Change list**:
1. Create the `strategy/` package and the `CrawlStrategy` interface
2. Create `strategy/BlogStrategy.java`
3. Create `strategy/NewsStrategy.java`
4. Create `strategy/StrategyFactory.java`
5. Create the `repository/` package and `ArticleRepository.java`
6. Change the `execute` signature of the `Command` interface
7. Modify `CrawlCommand` to use `StrategyFactory` and `ArticleRepository`
8. Modify all remaining `Command` implementations
9. Modify the `CrawlerController` constructor
10. Modify `App.java`

**Key steps for the teacher demo** (focus here):
- `Collections.unmodifiableList()` in `ArticleRepository`
- The iterate-and-match logic in `StrategyFactory`
- The dispatching structure of the rewritten `CrawlCommand`

**A deliberately planted "spot the flaw" point**:
> "In `StrategyFactory.getStrategy()`, I return `null` when no strategy matches, and then `CrawlCommand` checks for null. You could call this a 'prelude to the Null Object pattern': if I don't want the Command to check for null, how should I change the factory? Take this question and explore it with AI."

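For the teacher's reference, one possible answer to this question is sketched below. It is only one option among several (returning `Optional<CrawlStrategy>` is another). The names `UnsupportedSiteStrategy`, `EchoBlogStrategy`, and `NullSafeStrategyFactory` are hypothetical, and the simplified `parse`, which takes an HTML string and returns titles instead of working with a Jsoup `Document` and `Article` objects, is an assumption made to keep the example self-contained. The factory falls back to a do-nothing strategy, so callers never receive `null`. The trade-off is that the Command can no longer tell "unsupported site" from "supported site with zero articles" unless it inspects the returned strategy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the course interface (String html instead of a Jsoup Document).
interface CrawlStrategy {
    boolean supports(String url);
    List<String> parse(String url, String html);
}

// Null Object: accepts any URL and parses nothing.
final class UnsupportedSiteStrategy implements CrawlStrategy {
    @Override public boolean supports(String url) { return true; }
    @Override public List<String> parse(String url, String html) {
        return Collections.emptyList();
    }
}

// Hypothetical demo strategy: "parses" by returning the trimmed input as a single title.
final class EchoBlogStrategy implements CrawlStrategy {
    @Override public boolean supports(String url) { return url.contains("blog.example.com"); }
    @Override public List<String> parse(String url, String html) {
        return Collections.singletonList(html.trim());
    }
}

final class NullSafeStrategyFactory {
    private final List<CrawlStrategy> strategies = new ArrayList<>();
    private final CrawlStrategy fallback = new UnsupportedSiteStrategy();

    NullSafeStrategyFactory() {
        strategies.add(new EchoBlogStrategy());
    }

    /** Never returns null: falls back to the Null Object when nothing matches. */
    CrawlStrategy getStrategy(String url) {
        for (CrawlStrategy s : strategies) {
            if (s.supports(url)) {
                return s;
            }
        }
        return fallback;
    }
}

class NullObjectDemo {
    public static void main(String[] args) {
        NullSafeStrategyFactory factory = new NullSafeStrategyFactory();
        // No null check is needed before calling parse().
        System.out.println(factory.getStrategy("https://blog.example.com/a").parse("u", " Hello "));
        System.out.println(factory.getStrategy("https://unknown.example.org").parse("u", "x"));
    }
}
```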
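---

### 4.7 Architecture Reflection and W11 Preview (5 minutes)

**Teacher script**:
> "Our architecture is now much sturdier than in W9: parsing logic is pluggable and data access is guarded. But some gaps remain."

**Point them out one by one**:
1. **Exception handling**: `CrawlCommand` relies on a single, undifferentiated `catch (IOException e)`. What if parsing throws some other exception?
2. **Network timeouts**: if the target site hasn't responded after 3 seconds, will the current code keep waiting?
3. **No logging**: every success or failure message goes only to the terminal. If the program runs overnight and you want to know the next day how much it crawled, you can't.
4. **Retry**: a single failure is reported as an error straight away. Should there be a chance to retry?

**W11 preview**:
> "Next week we will do three things: **a custom exception hierarchy**, **an engineering-grade logging framework**, and **defensive programming with retries**. W9 built the skeleton, W10 added the armor, and W11 will make the system **able to take a beating from the real world**."

---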
### 4.8 Practice Task (5 minutes)

**Requirements**:
1. Starting from the W9 code, complete the W10 upgrade
2. Implement at least two `CrawlStrategy` classes (they may be simulated; real crawling is not required)
3. Implement `StrategyFactory` and `ArticleRepository`
4. Make sure every Command accesses data through the Repository
5. Run and test the full flow

**Acceptance criteria**:
- [x] Adding a strategy only takes a new file plus one registration line in the factory; no other code changes
- [x] `getAll()` in `ArticleRepository` returns an unmodifiable view
- [x] `CrawlCommand` contains no site-specific parsing logic
- [x] `StrategyFactory` automatically matches the correct strategy for a URL
- [x] The `execute` signature of every Command has been updated to take `ArticleRepository`
- [x] Nothing outside `ArticleRepository` operates on the stored `List<Article>` directly

---
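## 5. Homework

### 5.1 Required Tasks

1. **Complete ArticleRepository**: add a bulk `addAll(List<Article>)` method, taking care to guard against null
2. **★ AnalyzeCommand (capstone assignment)**:
   - Implement the `analyze <url>` command
   - Internally use `StrategyFactory` to match a strategy
   - After the strategy parses the articles, **do not store them in the Repository**; instead compute statistics:
     - Total number of articles
     - Average title length
     - Top 5 ranked by some rule
   - Output the results only; do not store them
   - **Hint**: this is strategy reuse: the same parsing strategy can serve `crawl` (store into the repository) as well as `analyze` (analysis only)

3. **AI architecture audit**: send the class diagram of the complete code (or a list of class names and method signatures) to the AI with this instruction:
   > "As a Java architecture auditor, please check: ① whether the strategy pattern implementation is properly decoupled (does CrawlCommand still contain site-specific logic); ② whether the Repository truly encapsulates data access (is there any place that bypasses the Repository and operates on the List directly); ③ whether the factory's matching logic has performance hazards. Please give concrete improvement suggestions."

### 5.2 Optional Tasks

1. **Regex strategy matching**: change the check in `supports()` from `url.contains()` to a regular expression, so that one strategy can match a whole class of URLs
2. **Default strategy (DefaultStrategy)**: when no strategy matches, provide a generic "title extraction" logic
3. **Strategy priority**: add a `priority` field to each strategy and have the factory match by priority (rather than by registration order)
4. **Think and answer (200 characters)**:
   > "In the strategy pattern, the `supports()` methods of two strategies might both return true. Which one should be chosen? How does the iteration order in `StrategyFactory` affect the result? What solution would you propose?"

### 5.3 Discussion Questions

1. **What is the difference between a Repository and a List?** If the Repository just wraps a List, why use it at all?
2. **Evolving the strategy factory**: if the number of sites grows to 100, is registering them one by one still appropriate? What solutions come to mind?
3. **What does `Collections.unmodifiableList()` return?** Is it really "unmodifiable"? If the original List is modified, what happens to the unmodifiable view?

---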
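## 6. AI Collaboration Upgrade

### Architecture Auditor Task (required)

**Steps for students**:
1. Draw the project's current class dependency graph (by hand or with a tool)
2. Send the class names and dependencies to the AI
3. Enter the instruction:
   > "As a Java architecture auditor, please review this crawler project's architecture. Focus on: ① whether the strategy pattern truly achieves the open/closed principle (does adding a site really only require adding a class); ② whether the Repository encapsulation is complete (is there any path that bypasses the Repository); ③ whether there are circular dependencies. Please point out each problem and suggest improvements."

**Expected AI output**:
- Points out whether any "change one place, affect many" coupling remains
- Judges whether the Repository API design is complete
- Assesses how well the overall architecture meets the open/closed principle

### Advanced AI Exploration (optional)

> "Suppose I have a CrawlStrategy interface and 10 implementations. Instead of the factory pattern, I just keep them in a `Map<String, CrawlStrategy>` keyed by strategy name. How is this fundamentally different from the StrategyFactory design? What are the pros and cons of each?"

---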
## 7. Teaching Reflections and Adjustment Log

| Date | Item | Adjustment |
|------|------|----------|
| 2026-05-01 | First draft | Built on the W9 skeleton; introduces strategy pattern + factory + Repository |
| 2026-05-07 | Structural refinement | Reordered the strategy and factory explanations; strategy first, then factory, flows more naturally |

---
## Appendix 1: W9 → W10 Change Comparison

| Change | W9 code | W10 code |
|--------|--------|---------|
| Data storage | `List<Article> articles` | `ArticleRepository repository` |
| Command interface | `execute(String[], List<Article>)` | `execute(String[], ArticleRepository)` |
| Location of parsing logic | Inside `CrawlCommand` | In each `CrawlStrategy` implementation |
| URL matching | None (hard-coded) | `StrategyFactory.getStrategy(url)` |
| Adding data | `articles.add(article)` | `repository.add(article)` |
| Reading data | Iterate over `articles` directly | `repository.getAll()` |

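## Appendix 2: FAQ Quick Reference

| Question | Answer |
|------|------|
| How do the strategy and Command patterns differ? | Command encapsulates an "action" (what to do); Strategy encapsulates an "algorithm" (how to do it). In the crawler: crawl is the command (action), and how to parse is the strategy (algorithm). |
| Does a factory have to be called Factory? | No. But the name Factory signals a "creates objects" responsibility and follows the naming convention for the pattern. |
| What is `Collections.unmodifiableList()` for? | It returns a read-only view; calling add/remove and similar methods throws `UnsupportedOperationException`. |
| How do Repository and DAO differ? | In our context they can be treated as synonyms. Strictly speaking, Repository is a domain-driven design concept that leans toward "collection semantics", while DAO leans toward database operations. |
| What if a strategy's `supports()` returns true but parsing fails? | That is a bug in the strategy implementation, and the strategy should be fixed. The Factory is not responsible for validating strategies. |
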
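## Appendix 3: Teaching Logic

| Order | Content | Design rationale |
|------|------|----------|
| 1 | W9 review + exposing pain points | Bridges from last week; known problems lead into new material |
| 2 | Strategy pattern | Solves the coupling of parsing logic; deepens understanding of polymorphism |
| 3 | Parser factory | Solves strategy selection; introduces the factory pattern |
| 4 | Repository pattern | Solves data safety; puts encapsulation into practice |
| 5 | Architecture tie-together | Unifies all the parts into a complete mental model |
| 6 | Code walkthrough | Validates through practice; from "I understand" to "I can do it" |
| 7 | Architecture reflection + preview | Exposes new problems; sets up W11 robustness engineering |

---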
## Version Notes

- **v1 (this version)**: first written following the W9 lesson-plan template; includes the complete introduction of the strategy, factory, and Repository patterns