diff --git a/project/202506050325-袁锐睿-期末实验报告.docx b/project/202506050325-袁锐睿-期末实验报告.docx
new file mode 100644
index 0000000..6763405
Binary files /dev/null and b/project/202506050325-袁锐睿-期末实验报告.docx differ
diff --git a/project/crawler.log b/project/crawler.log
new file mode 100644
index 0000000..60498c6
--- /dev/null
+++ b/project/crawler.log
@@ -0,0 +1 @@
+[2026-05-21 14:54:56] [CRAWLER_004] 未知网站: unknown,可选值: stats, cas, gov, all
diff --git a/project/crawler_20260527.log b/project/crawler_20260527.log
new file mode 100644
index 0000000..813adc0
--- /dev/null
+++ b/project/crawler_20260527.log
@@ -0,0 +1,100 @@
+[2026-05-27 14:58:22] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 14:58:22] [INFO] [com.crawler.site.GovNewsCrawler] ========== Start crawling: 中国政府网 ==========
+[2026-05-27 14:58:22] [INFO] [com.crawler.site.GovNewsCrawler] Total pages to crawl: 1
+[2026-05-27 14:58:22] [DEBUG] [com.crawler.site.GovNewsCrawler] Preparing to crawl page 1: https://www.gov.cn/
+[2026-05-27 14:58:22] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 1573ms before request
+[2026-05-27 14:58:24] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15
+[2026-05-27 14:58:24] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.gov.cn/
+[2026-05-27 14:58:24] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.gov.cn/ status: 200 duration: 404ms
+[2026-05-27 14:58:24] [INFO] [com.crawler.site.GovNewsCrawler] Page 1 completed, got 3 items
+[2026-05-27 14:58:24] [INFO] [com.crawler.site.GovNewsCrawler] Saving 3 items
+[2026-05-27 14:58:24] [INFO] [com.crawler.site.GovNewsCrawler] ========== Crawling completed: 中国政府网, duration: 2081ms ==========
+[2026-05-27 15:00:00] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:00:00] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:00:00] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:00:00] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:00:00] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 1244ms before request
+[2026-05-27 15:00:01] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:00:01] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:00:02] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 840ms
+[2026-05-27 15:00:02] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:00:02] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:00:02] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 2265ms ==========
+[2026-05-27 15:00:02] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:00:02] [INFO] [com.crawler.site.CasResearchCrawler] ========== Start crawling: 中科院-科研动态 ==========
+[2026-05-27 15:00:02] [INFO] [com.crawler.site.CasResearchCrawler] Total pages to crawl: 1
+[2026-05-27 15:00:02] [DEBUG] [com.crawler.site.CasResearchCrawler] Preparing to crawl page 1: https://www.cas.cn/
+[2026-05-27 15:00:02] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 2944ms before request
+[2026-05-27 15:00:05] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:00:05] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.cas.cn/
+[2026-05-27 15:00:05] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.cas.cn/ status: 200 duration: 242ms
+[2026-05-27 15:00:05] [DEBUG] [com.crawler.http.JsoupHttpClient] Cookie updated
+[2026-05-27 15:00:06] [INFO] [com.crawler.site.CasResearchCrawler] Page 1 completed, got 14 items
+[2026-05-27 15:00:06] [INFO] [com.crawler.site.CasResearchCrawler] Saving 14 items
+[2026-05-27 15:00:06] [INFO] [com.crawler.site.CasResearchCrawler] ========== Crawling completed: 中科院-科研动态, duration: 3254ms ==========
+[2026-05-27 15:00:06] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:00:06] [INFO] [com.crawler.site.GovNewsCrawler] ========== Start crawling: 中国政府网 ==========
+[2026-05-27 15:00:06] [INFO] [com.crawler.site.GovNewsCrawler] Total pages to crawl: 1
+[2026-05-27 15:00:06] [DEBUG] [com.crawler.site.GovNewsCrawler] Preparing to crawl page 1: https://www.gov.cn/
+[2026-05-27 15:00:06] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 1484ms before request
+[2026-05-27 15:00:07] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15
+[2026-05-27 15:00:07] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.gov.cn/
+[2026-05-27 15:00:07] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.gov.cn/ status: 200 duration: 236ms
+[2026-05-27 15:00:07] [INFO] [com.crawler.site.GovNewsCrawler] Page 1 completed, got 3 items
+[2026-05-27 15:00:07] [INFO] [com.crawler.site.GovNewsCrawler] Saving 3 items
+[2026-05-27 15:00:07] [INFO] [com.crawler.site.GovNewsCrawler] ========== Crawling completed: 中国政府网, duration: 1740ms ==========
+[2026-05-27 15:02:06] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:02:06] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:02:06] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:02:06] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:02:06] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 2483ms before request
+[2026-05-27 15:02:08] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:02:08] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:02:09] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 642ms
+[2026-05-27 15:02:09] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:02:09] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:02:09] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 3231ms ==========
+[2026-05-27 15:08:01] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:08:01] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:08:01] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:08:01] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:08:01] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 1781ms before request
+[2026-05-27 15:08:03] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:08:03] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:08:03] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 507ms
+[2026-05-27 15:08:03] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:08:03] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:08:03] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 2507ms ==========
+[2026-05-27 15:39:23] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:39:23] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:39:23] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:39:23] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:39:23] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 1015ms before request
+[2026-05-27 15:39:24] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:39:24] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:39:24] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 647ms
+[2026-05-27 15:39:24] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:39:24] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:39:24] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 1795ms ==========
+[2026-05-27 15:40:20] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:40:20] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:40:20] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:40:20] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:40:20] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 2219ms before request
+[2026-05-27 15:40:22] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
+[2026-05-27 15:40:22] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:40:23] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 600ms
+[2026-05-27 15:40:23] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:40:23] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:40:23] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 2922ms ==========
+[2026-05-27 15:42:23] [INFO] [com.crawler.http.JsoupHttpClient] JsoupHttpClient initialized, timeout: 15000ms
+[2026-05-27 15:42:23] [INFO] [com.crawler.site.StatsGovCrawler] ========== Start crawling: 国家统计局-新闻发布 ==========
+[2026-05-27 15:42:23] [INFO] [com.crawler.site.StatsGovCrawler] Total pages to crawl: 1
+[2026-05-27 15:42:23] [DEBUG] [com.crawler.site.StatsGovCrawler] Preparing to crawl page 1: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:42:23] [DEBUG] [com.crawler.http.JsoupHttpClient] Waiting 2380ms before request
+[2026-05-27 15:42:25] [DEBUG] [com.crawler.http.JsoupHttpClient] Using User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
+[2026-05-27 15:42:25] [INFO] [com.crawler.http.JsoupHttpClient] Starting request: https://www.stats.gov.cn/sj/sjjd/
+[2026-05-27 15:42:26] [INFO] [com.crawler.http.JsoupHttpClient] Request completed: https://www.stats.gov.cn/sj/sjjd/ status: 200 duration: 644ms
+[2026-05-27 15:42:26] [INFO] [com.crawler.site.StatsGovCrawler] Page 1 completed, got 30 items
+[2026-05-27 15:42:26] [INFO] [com.crawler.site.StatsGovCrawler] Saving 30 items
+[2026-05-27 15:42:26] [INFO] [com.crawler.site.StatsGovCrawler] ========== Crawling completed: 国家统计局-新闻发布, duration: 3130ms ==========
diff --git a/project/data.zip b/project/data.zip
new file mode 100644
index 0000000..c351c32
Binary files /dev/null and b/project/data.zip differ
diff --git a/project/logs.zip b/project/logs.zip
new file mode 100644
index 0000000..453ad9d
Binary files /dev/null and b/project/logs.zip differ
diff --git a/project/src.zip b/project/src.zip
new file mode 100644
index 0000000..3e84a51
Binary files /dev/null and b/project/src.zip differ