Site to crawl: http://beijing.8684.cn/
(1) Environment setup. Straight to the code:
# -*- coding: utf-8 -*-
import requests  # HTTP client used to download the pages
from bs4 import BeautifulSoup  # BeautifulSoup HTML parser from the bs4 package
import os
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
all_url = 'http://beijing.8684.cn'  # starting URL
start_html = requests.get(all_url, headers=headers)
#print (start_html.text)
Soup = BeautifulSoup(start_html.text, 'lxml')  # parse the HTML document with the lxml parser
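The request above has no timeout and does not check the HTTP status, so a hung connection or an error page would pass silently into the parser. Here is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the short default `User-Agent`, and the 10-second timeout are assumptions for illustration, not part of the original script.

```python
import requests

# Assumed placeholder header; the full User-Agent string from the script works too.
DEFAULT_HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch_html(url, session=None, headers=None, timeout=10):
    """Download a page and return its text, raising on HTTP errors.

    `session` can be any object with a requests-style get() method,
    which makes the helper easy to exercise without network access.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the first download would read `start_html_text = fetch_html(all_url)`, and the resulting text can be passed to `BeautifulSoup` as before.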
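(2) Analyzing the site to crawl
1. Beijing bus routes are categorized in 3 ways:
This article crawls via the category of routes that start with a digit. Press "F12" to open the developer tools, click "Elements", then click "1"; you can see that the links are stored in the div with class bus_kt_r1.
Code: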
all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
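To see what this `find(...).find_all('a')` chain returns without hitting the site, here is a small offline sketch. The sample markup and its `href` values are hypothetical, modeled only on the `bus_kt_r1` class name used above; the live page may look different. The helper name `extract_links` is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from the real page.
SAMPLE_HTML = """
<div class="bus_kt_r1">
  <a href="/list1">1</a>
  <a href="/list2">2</a>
</div>
<div class="other"><a href="/ignored">x</a></div>
"""


def extract_links(html, container_class):
    """Return (text, href) pairs for each <a> inside the first matching <div>."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml required
    container = soup.find('div', class_=container_class)
    if container is None:  # find() returns None when nothing matches
        return []
    return [(a.get_text(strip=True), a['href'])
            for a in container.find_all('a', href=True)]


links = extract_links(SAMPLE_HTML, 'bus_kt_r1')
```

The `None` check matters in practice: if the site changes its layout, `find()` returns `None` and chaining `.find_all('a')` directly onto it raises an `AttributeError`.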
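2. Going further down, the links for each route turn out to be in the a tags of the last div inside the div with class cc_content.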
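for a in all_a:  # visit each category link collected in all_a
    href = a['href']  # take the href attribute of the a tag
    html = all_url + href
    second_html = requests.get(html, headers=headers)
    #print (second_html.text)
    Soup2 = BeautifulSoup(second_html.text, 'lxml')
    # the route links sit in the last div inside the cc_content div
    all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')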
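Building the page URL with `all_url + href` works only while every `href` starts with a slash and `all_url` does not end with one. A minimal sketch of a safer alternative uses `urljoin` from the standard library. The helper name `absolute_links` and the `example.com` URLs are placeholders for illustration, not values from the site.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base_url, href)  # handles '/path', 'path', and full URLs
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Placeholder base URL and hrefs, chosen only to show the joining behavior.
resolved = absolute_links('http://example.com/', ['/x_1', 'x_2', '/x_1'])
```

In the crawl, the same helper could take `all_url` and `[a['href'] for a in all_a2]` to produce the list of route pages to visit next.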