Parsing Web Pages in Python: BeautifulSoup in Detail - CSDN Blog

Article link: https://blog.csdn.net/m0_56967679/article/details/139086094

I. Introduction to BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. It parses the input into a tree structure, from which you can easily pull out HTML/XML tags, text, and other data. Its main uses are:

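1. Extracting data from an HTML/XML file
2. Parsing web pages fetched from the internet (BeautifulSoup does not download pages itself; it is usually paired with an HTTP client such as requests or urllib)
3. Navigating a document in a Pythonic style
4. Handling character encodings intelligently
5. Extracting specific HTML/XML tags and their contents with little effort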

II. Installing BeautifulSoup

BeautifulSoup can be installed into your Python environment with pip:

pip install beautifulsoup4
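
To confirm that the installation worked, a quick check along the following lines may help. It is only a sketch and assumes nothing beyond the package being importable as `bs4` (note that the import name differs from the pip package name):

```python
# Sanity check after installation: import the package and report its version.
import bs4

installed_version = bs4.__version__  # version string of the installed release
print("beautifulsoup4 version:", installed_version)
```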

III. Basic Usage

After importing the library, create a BeautifulSoup object from the HTML/XML text you want to parse. For example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

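The code above creates a BeautifulSoup object, `soup`, that parses the HTML text. The second argument, `'html.parser'`, selects the HTML parser built into Python's standard library.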
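
To make the idea of a parsed tree more concrete, here is a minimal sketch. The names `sample_html` and `sample_soup` are illustrative stand-ins, not part of the example above, and the document is deliberately tiny:

```python
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, much shorter than html_doc).
sample_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every tag knows its parent and its children, which is what makes the result a tree.
title_tag = sample_soup.title
parent_name = title_tag.parent.name
body_children = [child.name for child in sample_soup.body.children]

print(parent_name)    # head
print(body_children)  # ['p']
```

Third-party parsers such as `lxml` or `html5lib` can be passed as the second argument instead, provided they are installed separately.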

IV. Extracting Data

The BeautifulSoup object offers a range of methods and attributes for pulling data out of an HTML/XML document.

4.1 Extracting Tags

`soup.tagName` returns the first tag with the given name (not every matching tag), for example:

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.p)  
# <p class="title"><b>The Dormouse's story</b></p>
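
The attribute shorthand only ever returns the first match, and it returns `None` when no such tag exists. The sketch below illustrates both points on a small stand-in document (`sample_soup` is a hypothetical name, not the `soup` object above):

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup("<div><p>first</p><p>second</p></div>", "html.parser")

first_p = sample_soup.p           # shorthand for sample_soup.find("p")
same_tag = sample_soup.find("p")
missing = sample_soup.table       # there is no <table> in this document

print(first_p)              # <p>first</p>
print(first_p == same_tag)  # True
print(missing)              # None
```

Because of the `None` case, chained access such as `soup.table.tr` raises an `AttributeError` on documents that lack the tag, so a check before chaining is usually worthwhile.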

4.2 Extracting Tag Content

`tag.string` returns the text inside a tag, provided the tag contains a single string:

print(soup.title.string)
# The Dormouse's story
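
A short sketch of where `.string` stops being useful, on a hypothetical one-line document. `get_text()` is shown as one common alternative when a tag mixes text and child tags:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

bold_text = sample_soup.b.string      # <b> has exactly one child string
para_string = sample_soup.p.string    # <p> has three children, so this is None
para_text = sample_soup.p.get_text()  # concatenates all descendant strings

print(bold_text)    # world
print(para_string)  # None
print(para_text)    # Hello, world!
```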

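If a tag contains more than one string, `tag.string` returns `None`. In that case, `tag.strings` gives a generator over every string in the tag:

for string in soup.strings:
    print(repr(string))

# Output (abridged; whitespace-only strings such as '\n' may appear between these):
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'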
# ...
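
When the whitespace-only strings are just noise, `stripped_strings` may be the more convenient generator. The following is a minimal sketch on a hypothetical snippet, not the document parsed above:

```python
from bs4 import BeautifulSoup

sample_html = """
<div>
  <h1> Title </h1>
  <p>First line</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_strings = list(sample_soup.strings)             # includes whitespace-only strings
clean_strings = list(sample_soup.stripped_strings)  # whitespace trimmed, blank strings skipped

print(clean_strings)  # ['Title', 'First line']
```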

4.3 Extracting Tag Attributes

`tag['attribute']` returns the value of an attribute. Multi-valued attributes such as `class` come back as a list:

print(soup.p['class'])
# ['title']
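
The sketch below shows a few related ways to read attributes. The `<a>` tag and its attribute values are made up for illustration:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    '<a href="/about" class="nav external" id="about-link">About</a>', "html.parser"
)
link = sample_soup.a

href = link["href"]        # single-valued attribute: a plain string
classes = link["class"]    # multi-valued attribute: always a list
title = link.get("title")  # .get() returns None instead of raising KeyError
all_attrs = link.attrs     # dict holding every attribute on the tag

print(href)     # /about
print(classes)  # ['nav', 'external']
print(title)    # None
```

Using `.get()` is the safer choice when an attribute may be absent, since `link["title"]` would raise a `KeyError` here.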

4.4 Finding Tags

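The `find()` and `find_all()` methods search for tags: `find()` returns the first match (or `None`), and `find_all()` returns a list of all matches: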

# Find all <a> tags
print(soup.find_all('a'))

# Find the first <a> tag
print(soup.find('a'))

# Find all <a> tags whose class is 'sister'
print(soup.find_all('a', attrs={'class': 'sister'}))
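
A few other common search forms, sketched on a hypothetical list of links. The markup and the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

sample_html = """
<ul>
  <li><a class="sister" href="/elsie" id="link1">Elsie</a></li>
  <li><a class="sister" href="/lacie" id="link2">Lacie</a></li>
  <li><a class="other" href="/tom" id="link3">Tom</a></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ (trailing underscore) is the keyword form, since `class` is reserved in Python.
sisters = sample_soup.find_all("a", class_="sister")
sister_links = [(a.get_text(), a["href"]) for a in sisters]

first_two = sample_soup.find_all("a", limit=2)  # stop after two matches
by_id = sample_soup.find(id="link3")            # search on an attribute by keyword
no_match = sample_soup.find("img")              # None when nothing matches

print(sister_links)  # [('Elsie', '/elsie'), ('Lacie', '/lacie')]
```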

V. Advanced Usage

5.1 Searching with CSS Selectors

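BeautifulSoup also supports CSS selectors through `select()`, which can express in one line a query that would otherwise take several chained `find_all()` calls: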

# Find the tag with id="link3"
soup.select("#link3")

# Find all <span> tags and every element nested inside a <span>
soup.select("span, span *")

# Find <p> tags that are direct children of <body>
soup.select("body > p")
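
The difference between the child combinator (`>`) and the descendant combinator (a space) is easy to miss, so here is a small sketch on made-up markup. It also shows `select_one()`, which returns the first match or `None`:

```python
from bs4 import BeautifulSoup

sample_html = """
<body>
  <p class="story">Intro <a class="sister" id="link1" href="/elsie">Elsie</a></p>
  <div><p class="story">Nested <a class="sister" id="link2" href="/lacie">Lacie</a></p></div>
</body>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

direct_paragraphs = sample_soup.select("body > p")  # only <p> directly under <body>
all_paragraphs = sample_soup.select("body p")       # any <p> nested anywhere in <body>
sister_names = [a.get_text() for a in sample_soup.select("a.sister")]
link2 = sample_soup.select_one("#link2")            # first match, or None

print(len(direct_paragraphs), len(all_paragraphs))  # 1 2
print(sister_names)                                 # ['Elsie', 'Lacie']
print(link2["href"])                                # /lacie
```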

5.2 Modifying the Document Tree

Besides parsing and searching, BeautifulSoup can also modify the HTML/XML document tree:

# Change an attribute of a tag
soup.p['class'] = 'newClass'

# Create a new tag and insert it into <body>
new_tag = soup.new_tag('p')
new_tag.string = 'New Paragraph'
soup.body.insert(1, new_tag)
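
To see the effect of such edits end to end, the following sketch applies a few modifications to a tiny hypothetical document and prints the result. `append()` and `decompose()` are used here in addition to the calls shown above:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<body><p class='title'>Old text</p><span>ad</span></body>", "html.parser"
)

paragraph = sample_soup.p
paragraph["class"] = "newClass"     # overwrite an attribute
paragraph.string = "New text"       # replace the tag's contents with one string

new_tag = sample_soup.new_tag("p")  # create a detached <p>, then attach it
new_tag.string = "New Paragraph"
sample_soup.body.append(new_tag)    # append() adds it as the last child

sample_soup.span.decompose()        # remove a tag and its contents from the tree

print(sample_soup)
# <body><p class="newClass">New text</p><p>New Paragraph</p></body>
```

Changes only affect the in-memory tree; converting the object back to text with `str(sample_soup)` or `sample_soup.prettify()` is what produces the modified markup.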

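That concludes this detailed introduction to BeautifulSoup, Python's web page parsing library. We covered its basic concepts, installation, basic usage, data extraction methods, and some advanced usage. I hope it helps; if you have any questions, feel free to continue the discussion.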