Scraping movie information from 电影天堂 with requests + XPath


License: CC 4.0 BY-SA


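电影天堂 URL: http://www.ygdy8.net/html/gndy/china/list_4_1.html
Goals:
1. Collect the URL of every movie listed in the domestic movies section of 电影天堂.
2. Visit each movie's URL and extract that movie's details.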

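Libraries used:
Fetching resources over HTTP: the requests library
Parsing HTML so it can be queried with XPath: from lxml import etree
Locating data: XPath
File-system I/O: the os library
Creating threads: threading
My notes on threads and processes: https://blog.csdn.net/weixin_43919632/article/details/89209638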

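Things to watch out for:

Important issues are marked with a star ☆.
1. ☆ Encoding:
When I fetched the page with requests.get, the page source declared its encoding as GB2312 (a character set that GBK extends), so I decoded with gbk, but the page would not print. I switched to utf-8 and it still would not print. After some searching I learned that pages like this are non-standard: some characters are not valid in the declared encoding, i.e. one page mixes characters from several encodings, so the result is garbled whether you use gbk or utf-8.
A general fix for encoding problems like this:
Take the raw bytes, decode them yourself with an explicitly chosen encoding, and ignore illegal characters (this last part is very important).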

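import requests

url = "http://www.ygdy8.net/html/gndy/china/list_4_1.html"
res = requests.get(url)
# Decode the raw bytes ourselves (gbk for this site) and skip any byte that cannot be decoded
text = res.content.decode("gbk", "ignore")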

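A note on the two attributes: res.content is the raw response body, i.e. the page as bytes.
res.text is the same body decoded by requests with whatever encoding it infers for the response, and that guess can easily be wrong for a page like this one.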
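To see what ignoring illegal characters does without touching the network, here is a minimal sketch. The byte string is made up for illustration: it stands in for a GBK page that contains one stray byte which is not valid GBK.

```python
# Simulated response body: GBK-encoded text with one stray, invalid byte in the middle
raw = "电影天堂".encode("gbk") + b"\xff" + "最新电影".encode("gbk")

try:
    raw.decode("gbk")  # strict decoding is the default and fails on the stray byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" drops the undecodable byte and keeps everything else
text = raw.decode("gbk", errors="ignore")

print(strict_ok)
print(text)
```

The trade-off is that ignored bytes are silently lost, so this suits pages where a few damaged characters do not matter.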

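Techniques used:

1. The map function
2. lambda functions
3. The enumerate function
4. Mapping one dictionary key to multiple values
5. Creating threads
6. startswith: checking how a string begins
7. The replace and strip methods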

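1. Using map:
Here map takes two arguments, a function and a list. The function is applied to every element of the list. In Python 3 the result is a lazy iterator rather than a list, so wrap it in list() to get the new list.

def func(x):
    return x*x
print(list(map(func, [1, 2, 3, 4, 5, 6, 7, 8, 9])))
Output:
[1, 4, 9, 16, 25, 36, 49, 64, 81]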

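2. Using lambda, which is often combined with map:

nums=[1,2,3,4,5]
# The "x" before ":" stands for each element of nums; the expression after ":" is what gets computed for that element
result=list(map(lambda x:x**2,nums))
print(result)
Output:
[1, 4, 9, 16, 25]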
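In a scraper, the usual job for map plus lambda is turning relative links into absolute ones. The sketch below uses a placeholder domain and made-up paths (neither is from the real site). It also shows a Python 3 detail that is easy to trip over: a map object can only be consumed once.

```python
base = "http://example.com"               # placeholder domain, for illustration only
hrefs = ["/html/a.html", "/html/b.html"]  # made-up relative links

urls = map(lambda href: base + href, hrefs)
first_pass = list(urls)    # consumes the iterator
second_pass = list(urls)   # the iterator is already exhausted, so this is empty

# A list comprehension builds the same list and can be reused freely
same_with_comprehension = [base + href for href in hrefs]

print(first_pass)
print(second_pass)
```

If the result needs to be looped over more than once, converting it to a list up front avoids the empty second pass.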

3. Using enumerate:

words=["hello","world","Alice","Ben"]
for index,i in enumerate(words):
    print(index,i)
Output:
0 hello
1 world
2 Alice
3 Ben
As the output shows, enumerate yields each element of the list together with its index.

enumerate reference on runoob (菜鸟教程): http://www.runoob.com/python/python-func-enumerate.html
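The index matters most when the value you want sits on the line after a label, which is how the synopsis is laid out on the detail pages. A minimal sketch with invented sample lines (the label strings use full-width spaces, written here as \u3000):

```python
# Invented sample standing in for the text nodes of a detail page
lines = [
    "◎片\u3000\u3000长\u300090分钟",
    "◎简\u3000\u3000介",
    "  A short synopsis.  ",
    "◎获奖情况",
]

synopsis = None
for index, line in enumerate(lines):
    # The label is on its own line; the content is the next element
    if line.startswith("◎简\u3000\u3000介") and index + 1 < len(lines):
        synopsis = lines[index + 1].strip()
        break

print(synopsis)
```

The `index + 1 < len(lines)` check is my own addition so the look-ahead cannot run past the end of the list.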
4. One key, multiple values. The idea: collect all the values that belong to the key in a list, then assign that list to the key.


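# Sample data standing in for the text nodes scraped from a detail page
other_infos=["◎主　　演", "Actor A", "Actor B", "◎简　　介"]
index=0  # position of the "◎主　　演" line
film={}
actors=[]
for i in range(index+1,len(other_infos)):
    actor=other_infos[i].strip()
    if actor.startswith("◎"):  # the next field begins, so the cast list is over
        break
    actors.append(actor)
film["主演"]=actors  # one key, a list of values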
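When the values arrive one at a time instead of as a ready-made list, the standard library offers two common ways to get the same one-key-many-values shape. This is a sketch with invented pairs, not part of the scraper itself:

```python
from collections import defaultdict

# Invented (key, value) pairs for illustration
pairs = [("主演", "Actor A"), ("主演", "Actor B"), ("导演", "Director C")]

# Option 1: defaultdict creates an empty list the first time a key is seen
film = defaultdict(list)
for key, value in pairs:
    film[key].append(value)

# Option 2: a plain dict with setdefault does the same without an import
plain = {}
for key, value in pairs:
    plain.setdefault(key, []).append(value)

print(dict(film))
```

Both end up with a list under each key; which one reads better is mostly a matter of taste.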

5. Creating threads: https://blog.csdn.net/weixin_43919632/article/details/89209638
6. Using startswith:

text = "this is string example....wow!!!"
if text.startswith('this'):  # does the string start with "this"?
    print("yes")
else:
    print("no")
if text.startswith('string', 8):  # does the substring starting at index 8 start with "string"?
    print("yes")
else:
    print("no")
if text.startswith('this', 2, 4):  # does the slice from index 2 up to (not including) index 4 start with "this"?
    print("yes")
else:
    print("no")
Output:
yes
yes
no

7. strip reference on runoob (菜鸟教程): http://www.runoob.com/python3/python3-string-strip.html
replace reference on runoob (菜鸟教程): http://www.runoob.com/python3/python3-string-replace.html
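Here is how the two methods combine to pull a value out of a labelled line. The sample line is invented; the label uses full-width spaces (U+3000, written as \u3000 so they are visible), and strip() with no arguments removes those along with ordinary whitespace.

```python
label = "◎年\u3000\u3000代"
line = "◎年\u3000\u3000代\u30002019 \n"   # invented sample line

# replace removes the label; strip trims the whitespace left on both ends
value = line.replace(label, "").strip()

# replace affects every occurrence, and strip can take a set of characters to trim
no_dashes = "a-b-c".replace("-", "")
trimmed = "xxhixx".strip("x")

print(value, no_dashes, trimmed)
```

One thing to keep in mind: the label passed to replace has to match the text exactly, including the kind of space it contains.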

Full code for scraping movie information from 电影天堂:

# encoding: utf-8
import requests
from lxml import etree
import threading
import os

"""
Warning: when a page comes out garbled with both gbk and utf-8, decode the raw bytes
ourselves and ignore the illegal characters.
"""

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"}
BaseDomain="http://www.ygdy8.net"


# Collect the detail-page URL of every movie on one list page
def get_detail_urls(page_url):
    res=requests.get(page_url,headers=HEADERS)
    html=etree.HTML(res.content.decode("gbk","ignore"))
    flm_urls=html.xpath("//table[@class='tbspan']//a[2]/@href")
    # 1. map takes a function and a list and applies the function to every element.
    # 2. In the lambda, the name before ":" is the parameter and the expression
    #    after ":" is the value that gets returned.
    # 3. The parameter "url" is one element of flm_urls at a time.
    # list() turns the lazy map object into a real list.
    flm_urls=list(map(lambda url:BaseDomain+url,flm_urls))
    return flm_urls

def get_flm_infos(flm_url):
    film={}
    res = requests.get(flm_url, headers=HEADERS)
    html = etree.HTML(res.content.decode("gbk", "ignore"))
    try:
        flm_name = html.xpath("//div[@class='title_all']//font/text()")[0]
        zoom = html.xpath("//div[@id='Zoom']")[0]
        imgs = zoom.xpath(".//img/@src")
        cover = imgs[0]
        screenshot = imgs[1]
        download_url = html.xpath("//td/a/@href")[0]
        other_infos=zoom.xpath(".//text()")
        film["电影名"]=flm_name
        film["电影下载地址"]=download_url
        film["海报"]=cover
        film["电影截图"]=screenshot
        # enumerate yields (index, info): index is the position and info is
        # the text node at that position in other_infos
        for index, info in enumerate(other_infos):

            if info.startswith("◎年　　代"):
                info=info.replace("◎年　　代","").strip()
                film["年代"]=info
            elif info.startswith("◎产　　地"):
                info=info.replace("◎产　　地","").strip()
                film["产地"]=info
            elif info.startswith("◎类　　别"):
                info=info.replace("◎类　　别","").strip()
                film["类别"]=info
            elif info.startswith("◎语　　言"):
                info=info.replace("◎语　　言","").strip()
                film["语言"]=info
            elif info.startswith("◎上映日期"):
                info=info.replace("◎上映日期","").strip()
                film["上映日期"]=info
            elif info.startswith("◎豆瓣评分"):
                info=info.replace("◎豆瓣评分","").strip()
                film["豆瓣评分"]=info
            elif info.startswith("◎片　　长"):
                info=info.replace("◎片　　长","").strip()
                film["片长"]=info
            elif info.startswith("◎导　　演"):
                info=info.replace("◎导　　演","").strip()
                film["导演"]=info
            elif info.startswith("◎编　　剧"):
                info=info.replace("◎编　　剧","").strip()
                film["编剧"]=info
            elif info.startswith("◎主　　演"):
                info=info.replace("◎主　　演","").strip()
                # One key, multiple values: the first actor is on the label line,
                # the rest follow one per line until the next "◎" field
                actors=[info]
                for i in range(index+1,len(other_infos)):
                    actor=other_infos[i].strip()
                    if actor.startswith("◎"):
                        break
                    if actor:
                        actors.append(actor)
                # Assign after the loop so the key is set even with a single actor
                film["主演"]=actors
            elif info.startswith("◎简　　介"):
                # In the page source the "简介" label is on a line of its own and
                # the synopsis is on the next line, so use the index from
                # enumerate to read the following element
                info=other_infos[index+1].strip()
                film["电影简介"]=info

    except Exception as exc:
        # The page did not have the expected layout; keep whatever was parsed so far
        print("failed to parse", flm_url, exc)

    return film


def film_spider():
    # Walk the list pages and build each page_url (range(1, 2) covers page 1 only)
    for i in range(1, 2):
        page_url = "http://www.ygdy8.net/html/gndy/china/list_4_{}.html".format(i)
        flm_urls = get_detail_urls(page_url)
        for flm_url in flm_urls:
            #print(flm_url)
            film=get_flm_infos(flm_url)
            #print(film)
            save_films(film)
            # Read the name from the dict instead of a global; fall back to the URL
            print(film.get("电影名",flm_url),"saved")

# Append one movie's fields to the output file
def save_films(film):
    path="C:\\Users\\Zhangchuan\\Pictures\\电影天堂电影信息"
    # Create the output directory if it does not exist yet
    os.makedirs(path,exist_ok=True)
    file_name=os.path.join(path,"电影天堂电影信息.txt")
    with open(file_name,"a",encoding="utf-8") as f:
        for (key,value) in film.items():
            # one key/value pair per line
            f.write("{}:{} \n".format(key,value))
        # leave two blank lines after each movie
        f.write("\n\n")



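if __name__=="__main__":
    # Start the worker threads; range(1) starts a single thread
    for i in range(1):
        # Pass the function itself: film_spider() with parentheses would run
        # the spider right here in the main thread and hand Thread its return value
        tl=threading.Thread(target=film_spider)
        tl.start()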
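Starting several threads only helps if each one gets different work. One way to do that, sketched here under my own assumptions, is to hand each thread its own page number through args. scrape_page is a hypothetical stand-in that records a file name instead of making a request, so the sketch runs offline.

```python
import threading

results = {}
lock = threading.Lock()

def scrape_page(page_number):
    # Hypothetical stand-in for real scraping work: build the list-page file name
    page = "list_4_{}.html".format(page_number)
    # Guard the shared dict so concurrent writes stay orderly
    with lock:
        results[page_number] = page

# target is the function object (no parentheses); args carries its arguments
threads = [threading.Thread(target=scrape_page, args=(n,)) for n in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # wait for every thread to finish before reading results

print(sorted(results))
```

In a real run, threads that append to the same output file would also need a lock around the write, for the same reason the dict is guarded here.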