A Practical Guide to Scraping Taobao Product Reviews with a PHP Crawler - CSDN Blog

Article link: https://blog.csdn.net/wanbangAPI01/article/details/150344782

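In e-commerce, Taobao product reviews are an important source of information about a product's reputation and market feedback. A PHP crawler can collect this review data efficiently, which supports market analysis, product optimization, and user-experience improvements. The following is a step-by-step practical guide with code examples.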

I. Environment Setup

(1) PHP development environment

Make sure PHP is installed in your development environment and that the cURL extension is enabled, since it is used to send HTTP requests.
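A quick way to confirm this prerequisite is a check like the one below. It is a minimal sketch: the helper name `missingExtensions` is hypothetical, and `dom` and `json` are included on the assumption that the later parsing steps rely on them.

```php
<?php
// Hypothetical helper: returns the names of required extensions that are not loaded.
function missingExtensions(array $required): array {
    return array_values(array_filter($required, function ($name) {
        return !extension_loaded($name);
    }));
}

// curl is needed for HTTP requests; dom and json are assumed to be needed for parsing.
$missing = missingExtensions(['curl', 'dom', 'json']);

if ($missing === []) {
    echo "All required extensions are loaded.\n";
} else {
    echo "Missing extensions: " . implode(', ', $missing) . "\n";
}
```

Running `php -m` on the command line lists the loaded extensions and gives the same information.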

(2) Install the required libraries

Install the GuzzleHttp library, which is used to send HTTP requests. It can be installed with Composer:

bash

composer require guzzlehttp/guzzle

II. Writing the Crawler Code

(1) Send the HTTP request

Use the GuzzleHttp library to send a GET request and retrieve the HTML of the product review page.

php

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Fetch the raw HTML of $url with a browser-like User-Agent.
function getHtml($url) {
    $client = new Client();
    $response = $client->request('GET', $url, [
        'headers' => [
            'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        ],
        'timeout' => 10 // seconds; avoids hanging on an unresponsive server
    ]);
    return $response->getBody()->getContents();
}
?>

(2) Parse the HTML content

Use DOMDocument and DOMXPath to parse the HTML and extract the review data.

php

<?php
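function parseHtml($html) {
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $comments = [];

    $items = $xpath->query('//div[@class="comment-list"]/div[@class="comment-inner"]');
    foreach ($items as $item) {
        // Skip items without a content paragraph instead of dereferencing null.
        $node = $xpath->query('.//div[@class="comment-content"]/p', $item)->item(0);
        if ($node !== null) {
            $comments[] = trim($node->nodeValue);
        }
    }

    return $comments;
}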
?>
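An XPath test such as `@class="comment-inner"` only matches when the attribute is exactly that string, so an element with an additional class would be skipped. The sketch below shows token-based class matching as one possible alternative. The sample markup is an assumption that mirrors the class names used above, not Taobao's actual page structure, and `hasClass` is a hypothetical helper.

```php
<?php
// Sample markup: an assumed structure, used only for illustration.
$html = <<<HTML
<div class="comment-list">
  <div class="comment-inner highlighted">
    <div class="comment-content"><p> Good quality, fast delivery. </p></div>
  </div>
  <div class="comment-inner">
    <div class="comment-content"><p>Fits as described.</p></div>
  </div>
</div>
HTML;

// Hypothetical helper: builds an XPath predicate that matches a single class token.
function hasClass(string $class): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = '//div[' . hasClass('comment-list') . ']/div[' . hasClass('comment-inner') . ']'
       . '//div[' . hasClass('comment-content') . ']/p';

$texts = [];
foreach ($xpath->query($query) as $p) {
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

Both paragraphs are returned here, including the one whose container carries the extra `highlighted` class.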

(3) Fetch product reviews by item ID

Build the review request URL from the item ID and fetch the review data.

php

<?php
function getTaobaoComments($itemId, $page = 1) {
    $url = "https://rate.taobao.com/feedRateList.htm";
    $params = [
        'auctionNumId' => $itemId,
        'currentPageNum' => $page,
        'pageSize' => 20,
        'rateType' => 1, // 1 = all reviews, 2 = positive, 3 = neutral, 4 = negative
        'orderType' => 'sort_weight' // sort order
    ];
    $headers = [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Referer: https://item.taobao.com/item.htm?id='.$itemId,
        'Accept: application/json'
    ];

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url.'?'.http_build_query($params));
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'taobao_cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEJAR, 'taobao_cookies.txt');
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true); // keep certificate verification enabled
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $response = curl_exec($ch);
    curl_close($ch);
    if ($response === false) {
        return null; // the request failed
    }

    // Strip the JSONP wrapper, e.g. jsonp123({...}), before decoding
    $response = preg_replace('/^\s*\w+\(|\)\s*;?\s*$/', '', $response);
    return json_decode($response, true);
}
?>
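The JSONP handling can be tried in isolation without any network access. The following is a minimal sketch, assuming the response has the form `callbackName({...})`. The helper name `decodeJsonp` and the sample payload are hypothetical and are not taken from a real Taobao response.

```php
<?php
// Hypothetical helper: strips a JSONP wrapper such as "jsonp123({...})" and decodes the JSON.
// Returns null when the payload is not valid JSON.
function decodeJsonp(string $body): ?array {
    $body = trim($body);
    // Capture everything between "callbackName(" and the final ")" (optional trailing ";").
    if (preg_match('/^[\w$.]+\s*\((.*)\)\s*;?$/s', $body, $m)) {
        $body = $m[1];
    }
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

// Made-up payload; note the parentheses inside the review text.
$sample = 'jsonp128({"total":1,"comments":[{"id":1,"content":"ok (really)"}]})';
$data = decodeJsonp($sample);
echo $data['comments'][0]['content'], "\n";
```

Capturing the text between the first `(` after the callback name and the last `)` leaves parentheses inside the JSON strings untouched, and a plain JSON body without a wrapper is decoded as-is.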

(4) Parse the review data

Parse the returned JSON data and extract the review content.

php

<?php
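function parseComments($rawData) {
    $comments = [];
    // Nothing to parse if the request failed or the payload has no comment list.
    if (!isset($rawData['comments']) || !is_array($rawData['comments'])) {
        return $comments;
    }
    foreach ($rawData['comments'] as $item) {
        // Use null for any field the response does not include.
        $comment = [
            'id' => $item['id'] ?? null,
            'author' => $item['user']['nick'] ?? null,
            'content' => $item['content'] ?? null,
            'date' => $item['date'] ?? null,
            'rate' => $item['rate'] ?? null,
            'photos' => [],
            'append' => null
        ];
        if (isset($item['photos']) && is_array($item['photos'])) {
            foreach ($item['photos'] as $photo) {
                if (isset($photo['url'])) {
                    // Photo URLs are assumed to be protocol-relative (//host/path).
                    $comment['photos'][] = 'https:'.$photo['url'];
                }
            }
        }
        // Follow-up (appended) review, if present.
        if (isset($item['appendComment'])) {
            $comment['append'] = [
                'content' => $item['appendComment']['content'] ?? null,
                'date' => $item['appendComment']['date'] ?? null
            ];
        }
        $comments[] = $comment;
    }
    return $comments;
}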
?>
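The parser prepends `https:` to each photo URL, which is only correct for protocol-relative URLs of the form `//host/path`. Whether the endpoint ever returns absolute URLs is not stated here, so the sketch below is a defensive variant under that assumption. The helper name `normalizePhotoUrl` and the example hosts are hypothetical.

```php
<?php
// Hypothetical helper: turns a photo URL into an absolute https URL.
function normalizePhotoUrl(string $url): string {
    if (strpos($url, '//') === 0) {
        return 'https:' . $url;            // protocol-relative: add the scheme
    }
    if (preg_match('#^https?://#i', $url)) {
        return $url;                       // already absolute: leave unchanged
    }
    return 'https://' . ltrim($url, '/');  // bare host/path: assume https
}

echo normalizePhotoUrl('//img.example.com/a.jpg'), "\n";
echo normalizePhotoUrl('https://img.example.com/b.jpg'), "\n";
```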

(5) Put the code together

Combine the functions above in a main program to form the complete crawler.

php

<?php
require 'vendor/autoload.php';

// getTaobaoComments() and parseComments() from the previous sections
// must be defined in this file or included before this point.
$goodsId = '1005006'; // item ID
$pageNum = 1; // review page number

$rawData = getTaobaoComments($goodsId, $pageNum);
$comments = parseComments($rawData);

foreach ($comments as $comment) {
    echo "Nickname: " . $comment['author'] . "\n";
    echo "Review: " . $comment['content'] . "\n";
    echo "Date: " . $comment['date'] . "\n";
    echo "----------------------\n";
}
?>
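The main program reads a single page. One way to walk through several pages is sketched below, assuming that an empty page means there are no more reviews. The helper name `collectAllPages`, the page limit, and the delay are hypothetical choices, and a stand-in fetcher with canned data replaces the real request so the sketch runs without network access.

```php
<?php
// Hypothetical helper: calls $fetchPage(page) until a page comes back empty
// or $maxPages is reached, pausing $delaySeconds between requests.
function collectAllPages(callable $fetchPage, int $maxPages = 5, float $delaySeconds = 1.0): array {
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $comments = $fetchPage($page);
        if (empty($comments)) {
            break; // an empty page is treated as the end of the reviews
        }
        $all = array_merge($all, $comments);
        if ($page < $maxPages && $delaySeconds > 0) {
            usleep((int) ($delaySeconds * 1000000));
        }
    }
    return $all;
}

// Stand-in fetcher with canned data, used instead of a real HTTP request.
$fakePages = [1 => [['content' => 'first']], 2 => [['content' => 'second']], 3 => []];
$fetch = function (int $page) use ($fakePages): array {
    return $fakePages[$page] ?? [];
};

$all = collectAllPages($fetch, 5, 0.0);
echo count($all), " comments collected\n";
```

In the real script, the fetcher could be a closure such as `function ($page) use ($goodsId) { return parseComments(getTaobaoComments($goodsId, $page)); }`, with a non-zero delay to keep the request rate low.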