Advanced: Scraping Trending News with a Crawler and Generating Summaries

超级小识

Published 2025-08-04 08:00:00

License: CC 4.0 BY-SA

Tags: web crawler, python, programming languages, artificial intelligence, performance optimization

Article link: https://blog.csdn.net/2302_77626561/article/details/149781176

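Understanding the Basics of Web Crawlers

A web crawler is an automated program that collects data from the internet. It requests target sites over HTTP/HTTPS, downloads the page content, and extracts structured data from it. Unlike a person browsing pages by hand, a crawler can do this in bulk and can follow a site's links automatically according to preset rules.

A crawler works in the following steps:

- Seed URL management: start from one or more initial URLs
- Page download: fetch each page's raw HTML with an HTTP request
- Content parsing: extract the target data with XPath, CSS selectors, or regular expressions
- Link extraction: discover new URLs and add them to the crawl queue
- Data storage: save the extracted data to a database or the file system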
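
The five steps above can be sketched as a small breadth-first loop. This is a minimal offline illustration rather than a production crawler: the `PAGES` dictionary stands in for the network, and `fetch`, `extract_links`, and `crawl` are hypothetical names chosen for this example.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Stand-in for the web: URL -> HTML (hypothetical pages for illustration)
PAGES = {
    "https://example.com/": '<h1>Home</h1><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<h1>Story A</h1><a href="/b">B</a>',
    "https://example.com/b": '<h1>Story B</h1><a href="/">Home</a>',
}

def fetch(url):
    """Page download step; a real crawler would issue an HTTP request here."""
    return PAGES.get(url, "")

def extract_links(base_url, html):
    """Link extraction step: resolve every href against the page URL."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)      # seed URL management
    seen = set(seed_urls)
    results = {}                  # data storage (in memory here)
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        match = re.search(r"<h1>(.*?)</h1>", html)   # content parsing
        results[url] = match.group(1) if match else None
        for link in extract_links(url, html):
            if link not in seen:  # skip URLs already queued or visited
                seen.add(link)
                queue.append(link)
    return results

pages = crawl(["https://example.com/"])
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other; `max_pages` is a simple safety cap.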

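In practice, a crawler has to deal with several technical challenges:

- Anti-crawling measures (such as CAPTCHAs and IP rate limits)
- Dynamic page content (rendered by JavaScript)
- Changes in page structure
- Data cleaning and deduplication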
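
Of these challenges, deduplication is the easiest to show in a few lines. The sketch below assumes each article is a dictionary with a `content` field and treats two articles as duplicates when their text matches after lowercasing and whitespace normalization; `fingerprint` and `deduplicate` are hypothetical helper names.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(articles):
    """Keep the first article for each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        key = fingerprint(article["content"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique

articles = [
    {"title": "A", "content": "Markets rose  today."},
    {"title": "A (repost)", "content": "markets rose today. "},
    {"title": "B", "content": "Rain is expected tomorrow."},
]
unique_articles = deduplicate(articles)
```

Exact-match hashing only catches verbatim reposts; near-duplicate detection would need a fuzzier comparison.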

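For news summarization, the crawler workflow is more specialized:

- Fetch page content from news sites or RSS feeds
- Locate the article body by parsing the DOM, discarding noise such as ads and navigation
- Extract metadata such as the title, publication time, and author
- Pass the cleaned text to the natural language processing module
- The NLP module applies a text summarization algorithm (such as TextRank, or a deep learning model such as BERT) to identify key entities and important sentences
- Produce a concise summary that contains the core facts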
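
The noise-removal step in this workflow can be illustrated with the standard library's `html.parser`. The sketch below is one possible approach, not the article's method: `NOISE_TAGS` is an assumed list of tags to skip, and `BodyTextExtractor` and `SAMPLE_HTML` are hypothetical names used only here.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise (an assumed list; tune per site)
NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}

class BodyTextExtractor(HTMLParser):
    """Collect visible text while skipping anything nested inside noise tags."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # > 0 while inside at least one noise tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)

SAMPLE_HTML = (
    '<nav><a href="/">Home</a></nav>'
    "<h1>Rates rise</h1><p>First paragraph.</p>"
    "<script>var x = 1;</script><aside>Ad</aside>"
    "<p>Second paragraph.</p>"
)

extractor = BodyTextExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
body_text = " ".join(extractor.chunks)
```

Real news pages usually need site-specific selectors as well, since ads are not always wrapped in a recognizable tag.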

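This combination of techniques is used in many settings, for example:

- Content collection for news aggregation platforms
- Public opinion monitoring in finance
- Literature surveys in academic research
- Competitive intelligence analysis for businesses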

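Note that in real projects developers should follow the site's robots.txt rules, limit the crawl rate, and respect the site's terms of service, so that the crawler does not overload the target server.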
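
Checking robots.txt can be done with the standard library's `urllib.robotparser`. The sketch below parses a sample file from a string so that it runs offline; the rules in `ROBOTS_TXT` and the crawler name in `USER_AGENT` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical rules for illustration)
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "news-summary-bot"  # hypothetical crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "https://example.com/news/latest")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = parser.crawl_delay(USER_AGENT)  # seconds requested by the site, or None
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the file instead of parsing a string.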

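Technical Implementation: Scraping Trending News

Python is a common choice for writing crawlers because of its rich library support. The requests library sends HTTP requests and fetches page content; the BeautifulSoup library parses the HTML and extracts the data you need. The snippet below shows how to scrape news titles and article bodies: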

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace it with the real news page
url = "https://example-news-site.com/latest"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail fast on HTTP errors
soup = BeautifulSoup(response.text, 'html.parser')
# The tag and class names depend on the target site's markup
titles = soup.find_all('h2', class_='news-title')
contents = soup.find_all('div', class_='article-content')

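Data Cleaning and Preprocessing

Raw scraped data often contains irrelevant content (such as ads and script code) and needs to be cleaned with regular expressions or string operations. For example, the function below strips leftover HTML tags and collapses runs of whitespace: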

import re

def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)  # strip leftover HTML tags
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    return text.strip()

# `contents` is the list of elements found in the previous snippet
cleaned_content = [clean_text(content.get_text()) for content in contents]
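
Scraped text often also carries invisible or non-standard characters, such as non-breaking spaces, zero-width spaces, and full-width digits. The sketch below shows one way to handle them with the standard library's `unicodedata`; `normalize_text` is a hypothetical helper, and NFKC normalization is an assumption that may be too aggressive for some content.

```python
import re
import unicodedata

def normalize_text(text):
    # NFKC folds compatibility characters: NBSP -> space, full-width digits -> ASCII
    text = unicodedata.normalize("NFKC", text)
    # Drop control (Cc) and format (Cf) characters, keeping newlines and tabs
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse whitespace the same way clean_text does
    return re.sub(r"\s+", " ", text).strip()

sample = "Breaking\u00a0news\u200b:  \uff12\uff10\uff12\uff15\n"
normalized = normalize_text(sample)
```

This step would run after tag stripping, on the plain text of each article.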

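Methods for Generating News Summaries

Summaries can be extractive or abstractive. Extractive methods select important sentences from the original text (for example with TF-IDF or TextRank); abstractive methods rewrite the content with a model (such as GPT). Below is a simple extractive implementation that splits sentences with nltk and scores them with scikit-learn's TfidfVectorizer: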

import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def generate_summary(text, num_sentences=3):
    sentences = sent_tokenize(text)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights (flattened to 1-D)
    sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()
    # Keep the top-scoring sentences, restored to their original order
    top_indices = sorted(sentence_scores.argsort()[-num_sentences:])
    return ' '.join(sentences[i] for i in top_indices)
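
For comparison, TextRank ranks sentences by how strongly they overlap with the other sentences instead of by TF-IDF weight. The sketch below is a simplified pure-Python variant written for illustration: the regex sentence splitter, the set-based overlap similarity, and the fixed iteration count are all assumptions, and `split_sentences`, `similarity`, and `textrank_summary` are hypothetical names.

```python
import math
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def similarity(words_a, words_b):
    """Shared-word count, normalized by the log of each sentence's vocabulary size."""
    set_a, set_b = set(words_a), set(words_b)
    if len(set_a) < 2 or len(set_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(set_a & set_b) / (math.log(len(set_a)) + math.log(len(set_b)))

def textrank_summary(text, num_sentences=2, damping=0.85, iterations=30):
    sentences = split_sentences(text)
    if len(sentences) <= num_sentences:
        return text
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):  # PageRank-style power iteration
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in top)

text = (
    "The central bank raised interest rates today. "
    "Analysts expected the bank to raise rates. "
    "The weather was sunny in the capital. "
    "Higher interest rates may slow bank lending."
)
summary = textrank_summary(text, num_sentences=2)
```

Sentences that share vocabulary with many others accumulate score, while an off-topic sentence stays near the bottom of the ranking.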

Complete Source Code

The code below combines scraping, cleaning, and summary generation, and writes the results as JSON:

import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import json

def clean_text(text):
    text = re.sub(r'<[^>]+>', '', text)  # strip leftover HTML tags
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    return text.strip()

def generate_summary(text, num_sentences=3):
    sentences = sent_tokenize(text)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    # Score each sentence by the sum of its TF-IDF weights (flattened to 1-D)
    sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()
    # Keep the top-scoring sentences, restored to their original order
    top_indices = sorted(sentence_scores.argsort()[-num_sentences:])
    return ' '.join(sentences[i] for i in top_indices)

def scrape_news(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail fast on HTTP errors
    soup = BeautifulSoup(response.text, 'html.parser')
    articles = []
    for item in soup.select('.news-item'):
        title_elem = item.select_one('.title')
        content_elem = item.select_one('.content')
        if not title_elem or not content_elem:
            continue  # skip items missing a title or body
        title = clean_text(title_elem.get_text())
        content = clean_text(content_elem.get_text())
        summary = generate_summary(content)
        articles.append({'title': title, 'content': content, 'summary': summary})
    return articles

if __name__ == '__main__':
    # Placeholder URL; replace it with the real news page
    news_url = "https://example-news-site.com/latest"
    news_data = scrape_news(news_url)
    # Write UTF-8 and keep non-ASCII text readable in the output file
    with open('news_summaries.json', 'w', encoding='utf-8') as f:
        json.dump(news_data, f, ensure_ascii=False, indent=2)

Extensions and Optimization Suggestions

Handling anti-crawling measures: add a User-Agent header and a delay between requests:

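import time

headers = {'User-Agent': 'Mozilla/5.0'}  # pass as requests.get(url, headers=headers)
time.sleep(2)  # pause between requests to avoid hammering the server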
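
A fixed `time.sleep` adds the full delay even when parsing already took longer than the interval. One alternative is to track the time of the last request and sleep only for the remainder. The `Throttle` class below is a hypothetical helper sketched for this purpose; its `clock` and `sleep` parameters exist so that it can be exercised without real waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now

# Intended use (not executed here):
#   throttle = Throttle(2.0)
#   for url in urls:
#       throttle.wait()
#       response = requests.get(url, headers=headers, timeout=10)
```

Calling `wait()` before each request keeps the crawl rate bounded regardless of how long the processing between requests takes.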

Persistent storage: store results in a database (such as SQLite):

import sqlite3
conn = sqlite3.connect('news.db')
cursor = conn.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS news (title TEXT, summary TEXT)')
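
Re-running a scraper usually re-fetches articles that are already stored. The sketch below shows one way to avoid duplicate rows, assuming that titles identify articles: it adds a UNIQUE constraint on `title` (not part of the schema above) and uses `INSERT OR IGNORE`. `save_articles` is a hypothetical helper, and the in-memory database is only for demonstration.

```python
import sqlite3

def save_articles(conn, articles):
    """Insert articles, silently skipping titles that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (title TEXT UNIQUE, summary TEXT)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO news (title, summary) VALUES (?, ?)",
            [(a["title"], a["summary"]) for a in articles],
        )

conn = sqlite3.connect(":memory:")  # use a file path such as 'news.db' to persist
save_articles(conn, [{"title": "Rates rise", "summary": "The bank raised rates."}])
save_articles(conn, [
    {"title": "Rates rise", "summary": "Duplicate."},
    {"title": "Storm warning", "summary": "Heavy rain expected."},
])
rows = conn.execute("SELECT title, summary FROM news ORDER BY title").fetchall()
```

The second call stores only the new article; the row for the repeated title keeps its original summary.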

Visualization: generate a word cloud with the wordcloud library and display it with matplotlib:

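import matplotlib.pyplot as plt
from wordcloud import WordCloud
# all_texts: list of cleaned article strings collected by the scraper
wordcloud = WordCloud().generate(' '.join(all_texts))
plt.imshow(wordcloud)
plt.show()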
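
To control which words appear in the cloud, the frequencies can be computed first and then handed to `WordCloud().generate_from_frequencies(...)`. The sketch below covers only the counting step with the standard library; `STOP_WORDS` is a tiny sample list, `word_frequencies` is a hypothetical helper, and the tokenizer assumes English text.

```python
import re
from collections import Counter

# Tiny sample stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}

def word_frequencies(texts, top_n=50):
    """Count words across all texts, ignoring stop words and single letters."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 1)
    return dict(counts.most_common(top_n))

all_texts = ["The bank raised rates.", "Rates rise as the bank tightens policy."]
frequencies = word_frequencies(all_texts)
```

The resulting dictionary maps each word to its count, which is the input format `generate_from_frequencies` expects.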

Appendix: Complete Project Source

# news_scraper_with_summary.py
import requests
from bs4 import BeautifulSoup
import re
import time
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import json
import sqlite3

# Configuration
HEADERS = {'User-Agent': 'Mozilla/5.0'}
SUMMARY_SENTENCES = 2

def clean_text(text):
    """Strip HTML tags and extra whitespace from text."""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def generate_summary(text, num_sentences=SUMMARY_SENTENCES):
    """Extractive summary based on TF-IDF sentence scores."""
    sentences = sent_tokenize(text)
    if len(sentences) <= num_sentences:
        return text

    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        tfidf_matrix = vectorizer.fit_transform(sentences)
    except ValueError:
        # Raised when no vocabulary is left, e.g. the text is only stop words
        return sentences[0]
    # Flatten the (n, 1) row sums into a 1-D array of sentence scores
    sentence_scores = np.asarray(tfidf_matrix.sum(axis=1)).ravel()
    top_indices = sentence_scores.argsort()[-num_sentences:]
    return ' '.join(sentences[i] for i in sorted(top_indices))

def save_to_db(data):
    """Store the results in a SQLite database."""
    conn = sqlite3.connect('news.db')
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS news
                     (id INTEGER PRIMARY KEY AUTOINCREMENT,
                      title TEXT,
                      summary TEXT,
                      timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)''')
    
    for item in data:
        cursor.execute("INSERT INTO news (title, summary) VALUES (?, ?)",
                      (item['title'], item['summary']))
    conn.commit()
    conn.close()

def scrape_news(base_url, page_count=1):
    """Main crawler function: fetch each page and summarize its articles."""
    all_articles = []
    for page in range(1, page_count + 1):
        url = f"{base_url}?page={page}" if page_count > 1 else base_url
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            response.raise_for_status()  # treat HTTP errors as failures
            soup = BeautifulSoup(response.text, 'html.parser')

            for item in soup.select('.news-item'):
                title_elem = item.select_one('.title')
                content_elem = item.select_one('.content')

                # Skip items missing a title or body
                if not title_elem or not content_elem:
                    continue

                title = clean_text(title_elem.get_text())
                content = clean_text(content_elem.get_text())
                summary = generate_summary(content)

                all_articles.append({
                    'title': title,
                    'content': content,
                    'summary': summary
                })

            time.sleep(1)  # polite delay between pages
        except requests.RequestException as e:
            print(f"Error scraping {url}: {e}")

    return all_articles

if __name__ == '__main__':
    # Example news site (replace with a real URL before use)
    NEWS_SOURCE = "https://example-news-site.com/latest"

    # Run the crawl and collect the results
    articles = scrape_news(NEWS_SOURCE, page_count=3)

    # JSON output
    with open('news_summaries.json', 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

    # Database storage
    save_to_db(articles)

    print(f"Scraped and processed {len(articles)} articles")

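Project Structure Notes

Dependencies: install the following libraries first:

pip install requests beautifulsoup4 nltk scikit-learn
python -m nltk.downloader punkt

If sent_tokenize raises a LookupError that mentions punkt_tab (as newer NLTK releases do), download that resource the same way: python -m nltk.downloader punkt_tab

Output files:
- news_summaries.json: JSON file containing the title, body, and summary of each article
- news.db: SQLite database file

Customization:
- Change the CSS selectors passed to select() to match the target site's structure
- Adjust SUMMARY_SENTENCES to control the summary length
- Add proxy settings to cope with IP blocking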
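
For the proxy suggestion, requests accepts a `proxies` mapping on each call. The sketch below only builds and rotates such mappings, so it runs without network access; the addresses in `PROXY_POOL` are made up, and `proxy_rotation` is a hypothetical helper.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace with proxies you are allowed to use
PROXY_POOL = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]

def proxy_rotation(pool):
    """Yield a requests-style proxies mapping for each proxy in turn, forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}

proxies_iter = proxy_rotation(PROXY_POOL)
first = next(proxies_iter)
second = next(proxies_iter)
third = next(proxies_iter)  # wraps around to the first proxy again

# With requests, each mapping would be passed as:
#   requests.get(url, headers=HEADERS, proxies=first, timeout=10)
```

Rotating proxies does not replace rate limiting; the polite delay in scrape_news still applies.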