Notes
Dear readers, please give this post a like~
Last time I only scraped the movie titles, which is far from enough. I want to do some data analysis on the Top 250, so more fields need to be scraped. The code is below, and it should be about the shortest and simplest you'll find on CSDN~
For how to save the scraped data, see my earlier blog post; it is very simple. Also, to avoid getting banned, call sleep() briefly after each page is scraped; I set it to 1 second.
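For reference, here is a minimal sketch of that save step (the file name and the assumption that `target` holds one dict per movie are just placeholders, not from the original post): once the scraping loop below has filled `target`, pandas can dump it to CSV in one call.

# Minimal sketch of saving the results, assuming `target` is a list of dicts
# (one per movie); the output file name is only an example.
import pandas as pd

def save_results(target, path='douban_top250.csv'):
    # Build a DataFrame from the scraped records and write it to disk
    df = pd.DataFrame(target)
    # utf-8-sig so Excel displays the Chinese titles correctly
    df.to_csv(path, index=False, encoding='utf-8-sig')
    return df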
import requests
from bs4 import BeautifulSoup
import time
import re
import pandas as pd
# Request headers (Cookie and User-Agent taken from the browser) so the requests look like a normal visit and are less likely to be blocked
headers = {
    'Cookie': 'douban-fav-remind=1; __yadk_uid=CKhB1u3n6xEqzyUBAyvDcIG0qqutnMNR; bid=3tIcGsf0QDM; __gads=ID=8dc5077d330e5d35-22f3e8e3e6c6005d:T=1617088787:RT=1617088787:S=ALNI_MYFQCk-CtmQowxEUgWKzk-nkWX55w; __utmc=30149280; __utmz=30149280.1618455783.8.5.utmcsr=sogou.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; __utmc=223695111; __utmz=223695111.1618455783.6.3.utmcsr=sogou.com|utmccn=(referral)|utmcmd=referral|utmcct=/link; ll="118318"; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1618623032%2C%22https%3A%2F%2Fsiteproxy.ruqli.workers.dev%3A443%2Fhttps%2Fwww.sogou.com%2Flink%3Furl%3DhedJjaC291PRk6U3MmR8l1gSvBeWzmiQ6zoJDnqY_nQO5VUb2AU5IA..%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.193803470.1583831845.1618586080.1618623033.10; __utmb=30149280.0.10.1618623033; __utma=223695111.744523459.1601995022.1618586080.1618623033.8; __utmb=223695111.0.10.1618623033; _vwo_uuid_v2=D7B1C2A974658E5E33A73E04665581A99|349fac20bbf0f95e5ab5e381b8995d91; _pk_id.100001.4cf6=3ded1b2e1c85efc8.1601995022.8.1618623046.1618586085.',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'
}
target = []  # collected results for all scraped movies
for i in range(10):  # the Top 250 spans 10 pages, 25 movies per page
    res = requests.get(