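1. I ran into this problem while scraping data today: the data I needed to read was stored in a single file as multiple lists, and I wanted to read them out and merge them into one list that is easy to iterate over later.
First, a look at the format of the data I need to read:
The file consists of N lists. I started by reading it directly with read() to see what the content looks like and what type the returned data has: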
with open("./data/category_info/category_info.txt", "r", encoding="utf-8") as f:
    list1 = f.read()
    print(list1)
    print(type(list1))
The printed output is as follows:
..........
[
"https://shop111676896.taobao.com",
"https://shop208960730.taobao.com",
"https://shop58053041.taobao.com",
"https://shop325469838.taobao.com",
"https://shop100874380.taobao.com",
"https://shop114260282.taobao.com",
"https://shop237827344.taobao.com",
"https://shop356054501.taobao.com",
"https://shop103565720.taobao.com",
"https://shop544563404.taobao.com",
"https://shop33279777.taobao.com",
"https://shop61924147.taobao.com",
"https://shop114426830.taobao.com",
"https://shop73216230.taobao.com",
"https://shop73117760.taobao.com",
"https://shop103944295.taobao.com",
"https://shop106643107.taobao.com",
"https://shop159725090.taobao.com",
"https://shop540286265.taobao.com",
"https://shop548965527.taobao.com"
][
"https://shop329927844.taobao.com",
"https://shop130237536.taobao.com",
"https://shop109321639.taobao.com",
"https://shop112297494.taobao.com",
"https://shop60995523.taobao.com",
"https://shop100979410.taobao.com",
"https://shop155515714.taobao.com",
"https://shop111458770.taobao.com",
"https://shop62806179.taobao.com",
"https://shop126622874.taobao.com",
"https://shop149039686.taobao.com",
"https://shop35637701.taobao.com",
"https://shop163580339.taobao.com",
"https://shop35981166.taobao.com",
"https://shop124412834.taobao.com",
"https://shop114735097.taobao.com",
"https://shop108812765.taobao.com",
"https://shop130240325.taobao.com",
"https://shop73264454.taobao.com",
"https://shop67453584.taobao.com"
]
<class 'str'>
Since the data type is str, this looked easy: just use replace to turn every "][" into "," and be done. Here is my first attempt:
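with open("./data/shop_info/shop_url.txt", "r", encoding="utf-8") as f:
    list1 = f.read()
    # Note: str.replace returns a new string and leaves list1 unchanged;
    # the return value is not assigned here, so it is discarded.
    list1.replace("][", ",")
    print(list1)
The printed output is as follows: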
.........
"https://shop67453584.taobao.com"
][
"https://shop111676896.taobao.com",
"https://shop208960730.taobao.com",
"https://shop58053041.taobao.com",
"https://shop325469838.taobao.com",
"https://shop100874380.taobao.com",
"https://shop114260282.taobao.com",
"https://shop237827344.taobao.com",
"https://shop356054501.taobao.com",
"https://shop103565720.taobao.com",
"https://shop544563404.taobao.com",
"https://shop33279777.taobao.com",
"https://shop61924147.taobao.com",
"https://shop114426830.taobao.com",
"https://shop73216230.taobao.com",
"https://shop73117760.taobao.com",
"https://shop103944295.taobao.com",
"https://shop106643107.taobao.com",
"https://shop159725090.taobao.com",
"https://shop540286265.taobao.com",
"https://shop548965527.taobao.com"
][
"https://shop329927844.taobao.com",
"https://shop130237536.taobao.com",
"https://shop109321639.taobao.com",
"https://shop112297494.taobao.com",
"https://shop60995523.taobao.com",
"https://shop100979410.taobao.com",
"https://shop155515714.taobao.com",
"https://shop111458770.taobao.com",
"https://shop62806179.taobao.com",
"https://shop126622874.taobao.com",
"https://shop149039686.taobao.com",
"https://shop35637701.taobao.com",
"https://shop163580339.taobao.com",
"https://shop35981166.taobao.com",
"https://shop124412834.taobao.com",
"https://shop114735097.taobao.com",
"https://shop108812765.taobao.com",
"https://shop130240325.taobao.com",
"https://shop73264454.taobao.com",
"https://shop67453584.taobao.com"
]
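A side note on why the output above is unchanged: Python strings are immutable, so str.replace returns a new string instead of modifying list1, and the attempt above throws that return value away. The sketch below is only an illustration under assumptions: it uses a short made-up sample string (the example.com URLs are hypothetical) in place of the real file, and it assumes the lists are valid JSON with "][" written without any whitespace between the two brackets.

```python
import json

# Hypothetical sample standing in for the file contents:
# two lists written back to back, joined by "][".
raw = '[\n"https://shop1.example.com",\n"https://shop2.example.com"\n][\n"https://shop3.example.com"\n]'

# Calling replace without keeping the result changes nothing in raw.
raw.replace("][", ",")
still_has_brackets = "][" in raw  # True: raw itself was never modified

# Keep the return value, then parse the text into a real Python list.
merged_text = raw.replace("][", ",")
merged = json.loads(merged_text)

print(still_has_brackets)
print(type(merged), merged)
```

Even with the assignment in place, the replaced text is still a str, so a parsing step such as json.loads is needed before it can be iterated as a list.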
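In my output the brackets are still there, and trying split to cut the string ran into the same problem. So I changed my approach: all I really need is the URL inside each pair of double quotes, so why not match those directly with a regular expression? The re implementation is as follows: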
import re

with open("./data/shop_info/shop_url.txt", "r", encoding="utf-8") as f:
    list1 = f.read()
    # Non-greedy match: capture the text between each pair of double quotes
    shop_list = re.findall(r'"(.*?)"', list1)
    print(shop_list)
The printed output is as follows:
..........
'https://shop115626267.taobao.com', 'https://shop101808441.taobao.com', 'https://shop100674644.taobao.com', 'https://shop71949547.taobao.com', 'https://shop63540798.taobao.com', 'https://shop211037737.taobao.com', 'https://shop66171793.taobao.com', 'https://shop35999051.taobao.com', 'https://shop156395003.taobao.com', 'https://shop324472272.taobao.com', 'https://shop583778897.taobao.com', 'https://shop317880714.taobao.com', 'https://shop109103413.taobao.com', 'https://shop113207130.taobao.com', 'https://shop114201820.taobao.com', 'https://shop162134567.taobao.com', 'https://shop109819808.taobao.com', 'https://shop110423139.taobao.com', 'https://shop109672786.taobao.com', 'https://shop111373248.taobao.com', 'https://shop115626267.taobao.com', 'https://shop101808441.taobao.com', 'https://shop100674644.taobao.com', 'https://shop71949547.taobao.com', 'https://shop63540798.taobao.com', 'https://shop211037737.taobao.com', 'https://shop66171793.taobao.com', 
'https://shop35999051.taobao.com', 'https://shop156395003.taobao.com', 'https://shop324472272.taobao.com', 'https://shop329927844.taobao.com', 'https://shop130237536.taobao.com', 'https://shop109321639.taobao.com', 'https://shop112297494.taobao.com', 'https://shop60995523.taobao.com', 'https://shop100979410.taobao.com', 'https://shop155515714.taobao.com', 'https://shop111458770.taobao.com', 'https://shop62806179.taobao.com', 'https://shop126622874.taobao.com', 'https://shop149039686.taobao.com', 'https://shop35637701.taobao.com', 'https://shop163580339.taobao.com', 'https://shop35981166.taobao.com', 'https://shop124412834.taobao.com', 'https://shop114735097.taobao.com', 'https://shop108812765.taobao.com', 'https://shop130240325.taobao.com', 'https://shop73264454.taobao.com', 'https://shop67453584.taobao.com', 'https://shop111676896.taobao.com', 'https://shop208960730.taobao.com', 'https://shop58053041.taobao.com', 'https://shop325469838.taobao.com', 
'https://shop100874380.taobao.com', 'https://shop114260282.taobao.com', 'https://shop237827344.taobao.com', 'https://shop356054501.taobao.com', 'https://shop103565720.taobao.com', 'https://shop544563404.taobao.com', 'https://shop33279777.taobao.com', 'https://shop61924147.taobao.com', 'https://shop114426830.taobao.com', 'https://shop73216230.taobao.com', 'https://shop73117760.taobao.com', 'https://shop103944295.taobao.com', 'https://shop106643107.taobao.com', 'https://shop159725090.taobao.com', 'https://shop540286265.taobao.com', 'https://shop548965527.taobao.com', 'https://shop329927844.taobao.com', 'https://shop130237536.taobao.com', 'https://shop109321639.taobao.com', 'https://shop112297494.taobao.com', 'https://shop60995523.taobao.com', 'https://shop100979410.taobao.com', 'https://shop155515714.taobao.com', 'https://shop111458770.taobao.com', 'https://shop62806179.taobao.com', 'https://shop126622874.taobao.com', 'https://shop149039686.taobao.com', 
'https://shop35637701.taobao.com', 'https://shop163580339.taobao.com', 'https://shop35981166.taobao.com', 'https://shop124412834.taobao.com', 'https://shop114735097.taobao.com', 'https://shop108812765.taobao.com', 'https://shop130240325.taobao.com', 'https://shop73264454.taobao.com', 'https://shop67453584.taobao.com']
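The regex works here because every quoted string in the file is a URL. As a possible cross-check, and only as a sketch under assumptions, the standard json module can also parse lists that are written back to back: json.JSONDecoder.raw_decode parses one value and returns the index where it stopped, so it can be called in a loop. The function name merge_json_arrays and the example.com sample data are hypothetical, and the sketch assumes each list in the file is valid JSON.

```python
import json


def merge_json_arrays(text):
    """Parse JSON arrays written back to back and return one flat list."""
    decoder = json.JSONDecoder()
    merged = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it manually.
        if text[pos].isspace():
            pos += 1
            continue
        # Parse one array starting at pos; the second value is where it ended.
        items, pos = decoder.raw_decode(text, pos)
        merged.extend(items)
    return merged


# Hypothetical sample in the same shape as the file: two lists joined by "][".
sample = '[\n"https://shop1.example.com",\n"https://shop2.example.com"\n][\n"https://shop3.example.com"\n]\n'
shop_list = merge_json_arrays(sample)
print(shop_list)
```

Unlike the regex, this raises json.JSONDecodeError if the file is not well-formed, which may or may not be what you want for scraped data.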
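With my limited experience, the regular expression is the simplest approach I could come up with. If you know a better way, please leave a comment.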