A Web Crawler Program on Linux

Latest recommended article published on 2022-10-23 19:44:47

License: CC 4.0 BY-SA

Tags: #Linux #WebCrawler

A web crawler is one of the main components of a search engine. It fetches pages from the internet and stores them locally, then analyzes the downloaded pages to extract the URLs they contain so that more resources can be downloaded. This article uses curl and the C++ STL to write a simple crawler. The program takes an index page as input (for example www.baidu.com), downloads and parses it, and saves the URLs it finds ordered by priority.

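1. Downloading pages

Downloading is done with the existing curl tool; see its man page for detailed usage. Here we use only two options: -o, which saves the downloaded page to a file, and -m, which limits the total transfer time. The download itself is performed by running a shell script.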

#!/bin/bash
# Program:
# 	Download a URL from the web with curl.
# Parameters:
#	$1: the URL to download	$2: the local file name to save it as
# History:
#	2014.11.06	By Gao Fan, SEU		first release
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:~/bin
export PATH

curl -m 30 "$1" -o "$2" && exit 0

exit 1

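The script takes two arguments: $1 is the URL of the page to download and $2 is the file name to save it under. For convenience, we replace every '/' in the URL with '_' and use the result as the local file name. The -m 30 option caps the whole transfer at 30 s, so the crawler does not keep trying to download a page that exists but cannot be fetched.
This approach downloads with curl, so a page can only be parsed after it has been downloaded completely. Alternatively, you can write your own TCP client to request the page from the server, which makes it possible to parse while downloading.

Next we define a URL class named URLclass. It parses a URL and separates the host name from the requested resource. Because this program downloads pages with curl, the download step does not need the class (it is only used later, during parsing, to extract host names). If you want to fetch pages from the host directly over your own TCP connection, however, the class is very convenient.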

#ifndef URL_H__
#define URL_H__

#include "webspider.h"
#include<string>

using namespace std;

class URLclass
{
    private:
        string url;            // the URL supplied by the caller
        string hostURL;        // host name
        string file;           // requested file (including any parameters)
        int paraseURL();
    public:
        URLclass()
        {}
        int setURL(string murl);
        string getHostURL();
        string getFile();
};

#endif // URL_H__

string URLclass::getHostURL()
{
    return hostURL;
}

string URLclass::getFile()
{
    return file;
}

int URLclass::paraseURL()
{
    string s1;
    if(url.find("//") == string::npos)
    {
        return 0;
    }
    s1 = url.substr(url.find("//") + 2, url.size());
    if(s1.find('/', 0) == string::npos)
    {
        hostURL = s1;
        file = "/";
    }
    else
    {
        hostURL = s1.substr(0, s1.find("/"));
        file = s1.substr(s1.find("/"), s1.size());
    }
    return 1;
}

int URLclass::setURL(string murl)
{
   url = murl;
   int flag = paraseURL();
   return flag;
}

We can declare a URLclass object, assign it a URL with setURL, and then obtain the host name and the requested resource with getHostURL and getFile.
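To make the host/resource split concrete, here is a minimal standalone sketch of the same parsing rule, together with the kind of HTTP request a hand-written TCP client would send. The names `UrlParts`, `splitUrl`, and `buildGetRequest` are hypothetical helpers introduced only for this illustration; they are not part of the article's program.

```cpp
#include <string>

// Hypothetical helper type: the two parts that URLclass extracts.
struct UrlParts {
    std::string host;   // e.g. "blog.csdn.net"
    std::string file;   // e.g. "/a/b.html", or "/" when the URL has no path
};

// Split "scheme://host/path" using the same rule as paraseURL.
// Returns false when the URL contains no "//".
bool splitUrl(const std::string &url, UrlParts &out)
{
    std::string::size_type pos = url.find("//");
    if (pos == std::string::npos) return false;
    std::string rest = url.substr(pos + 2);
    std::string::size_type slash = rest.find('/');
    if (slash == std::string::npos) {
        out.host = rest;
        out.file = "/";
    } else {
        out.host = rest.substr(0, slash);
        out.file = rest.substr(slash);
    }
    return true;
}

// Build a minimal HTTP/1.1 GET request for the parsed URL.
std::string buildGetRequest(const UrlParts &p)
{
    return "GET " + p.file + " HTTP/1.1\r\n"
           "Host: " + p.host + "\r\n"
           "Connection: close\r\n\r\n";
}
```

A TCP client would connect to `host` on port 80, write the string returned by `buildGetRequest`, and could then start matching URLs in the response as it arrives.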
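After a resource has been downloaded, its local file name is put on a message queue for the parsing code to pick up. The message can be defined with the following structure: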

struct mymesg
{
    long mtype;
    int len;
    char pathName[512];
};
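The size argument of msgsnd and msgrcv is easy to get wrong: it counts only the bytes that follow mtype. The sketch below sends one message through a private queue and reads it back. The function `roundTrip` and the use of IPC_PRIVATE are assumptions made for this illustration; the article's program obtains its key with ftok instead.

```cpp
#include <sys/ipc.h>
#include <sys/msg.h>
#include <cstring>
#include <string>

struct mymesg
{
    long mtype;           // message type, must be greater than 0
    int len;              // payload size handed to msgsnd
    char pathName[512];   // local file name of the downloaded page
};

// Send pathName through a private queue and read it back.
// Returns the received path, or an empty string on any failure.
std::string roundTrip(const char *pathName)
{
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (id < 0) return "";

    mymesg out;
    memset(&out, 0, sizeof(out));
    out.mtype = 1;
    strncpy(out.pathName, pathName, sizeof(out.pathName) - 1);
    out.len = sizeof(mymesg) - sizeof(long);   // everything after mtype

    std::string result;
    if (msgsnd(id, &out, out.len, IPC_NOWAIT) == 0) {
        mymesg in;
        memset(&in, 0, sizeof(in));
        // ask for type 1 and allow up to the full payload size
        if (msgrcv(id, &in, sizeof(mymesg) - sizeof(long), 1, IPC_NOWAIT) >= 0)
            result = in.pathName;
    }
    msgctl(id, IPC_RMID, NULL);   // remove the queue again
    return result;
}
```

This sketch sends the whole payload for simplicity; sending only `sizeof(int) + strlen(pathName)` bytes, as the article's code does, also works as long as the receiver zeroes its buffer first.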

The complete download function is as follows:

int Download(const char *url, const char *pathName)
{
    char cmd[1024];
    // downURL is the script shown above; both arguments are quoted so the
    // shell does not interpret characters such as '&' in the URL
    snprintf(cmd, sizeof(cmd), "/home/web/Spiderscript/downURL '%s' '%s'", url, pathName);
    int flag = system(cmd);
    if(flag != 0)
    {
        return 0;
    }

    mymesg MyMsg;
    memset(&MyMsg, '\0', sizeof(mymesg));
    MyMsg.mtype = 1;
    strncpy(MyMsg.pathName, pathName, sizeof(MyMsg.pathName) - 1);
    // the size passed to msgsnd excludes mtype: it covers len plus the path
    MyMsg.len = sizeof(int) + strlen(MyMsg.pathName);
    msgsnd(msgID, &MyMsg, MyMsg.len, IPC_NOWAIT);
    return 1;
}
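Because the URLs handed to system() come from crawled pages, any shell metacharacter they contain (for example '&', ';' or a quote) is interpreted by the shell unless it is escaped. The following is a minimal sketch of POSIX single-quote escaping; `shellQuote` and `buildCommand` are hypothetical helpers, not functions from the article's program.

```cpp
#include <string>

// Wrap s in single quotes for a POSIX shell.
// An embedded single quote is written as '\'' (close, escaped quote, reopen).
std::string shellQuote(const std::string &s)
{
    std::string out = "'";
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (s[i] == '\'') out += "'\\''";
        else out += s[i];
    }
    out += "'";
    return out;
}

// Build the command line for the download script with both arguments quoted.
std::string buildCommand(const std::string &script,
                         const std::string &url,
                         const std::string &pathName)
{
    return script + " " + shellQuote(url) + " " + shellQuote(pathName);
}
```

The resulting string can be passed to system(). Avoiding the shell altogether (fork plus execv) would be stricter still, at the cost of more code.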

To make the program run faster, we dedicate one thread to downloading pages.

int tranName(const char *str, char *name)
{
    memcpy(name, str, strlen(str));
    int i;
    for(i = 0; i < strlen(name); i++)
    {
        if(name[i] == '/')
        {
            name[i] = '_';
        }
    }
    return 1;
}

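void* Down(void* ptr)
{
     char name[200];
     while(1)
     {
        if(URLDeque.empty())
        {
            // URLMap is also written by the parsing thread, so hold MapMutex while refilling
            pthread_mutex_lock(&MapMutex);
            pthread_mutex_lock(&DequeMutex);
            mapSortByValue(URLMap, URLDeque);
            pthread_mutex_unlock(&DequeMutex);
            pthread_mutex_unlock(&MapMutex);
        }
        if(URLDeque.empty())
        {
            sleep(10);                        // nothing to download yet
            continue;
        }
        string url = URLDeque.front();        // take the first URL in the queue
        VisitedURL.insert(url);               // record it in the visited set
        URLDeque.pop_front();
        cout << "visit:" << url << endl;
        if(url.size() >= sizeof(name))        // skip URLs that do not fit the file-name buffer
        {
            continue;
        }
        memset(name, '\0', sizeof(name));
        tranName(url.c_str(), name);
        Download(url.c_str(), name);
        sleep(1);
     }
}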

Here tranName is the function that replaces every '/' in the URL with '_'.
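The same mapping can be written without fixed-size buffers. This is a small sketch using std::string and std::replace; `urlToFileName` is a hypothetical name chosen for the example.

```cpp
#include <algorithm>
#include <string>

// Return a copy of url in which every '/' is replaced by '_',
// so the result can be used as a flat local file name.
std::string urlToFileName(const std::string &url)
{
    std::string name = url;
    std::replace(name.begin(), name.end(), '/', '_');
    return name;
}
```

Note that two different URLs can map to the same file name (one containing '/' where the other contains '_'), which is acceptable for a simple crawler but worth keeping in mind.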

2. Parsing pages

Page parsing extracts the URLs contained in a page and then assigns them priorities. We use the regular expressions provided by the pcre library to parse pages and extract URLs. For installing pcre, see http://blog.csdn.net/lanxinglan/article/details/40798933. For using regular expressions, see http://blog.chinaunix.net/uid-26575352-id-3517146.html.

To extract URLs, we define the following regular expression: const char pattern[] = "<a href=\"([^\"]*)\""; This expression matches anchors such as

<a href="http://blog.csdn.net/microzone/article/details/6684436">

and the URL it captures is

http://blog.csdn.net/microzone/article/details/6684436

We can then download the corresponding page using this URL.
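To try the pattern without installing pcre, the sketch below applies the same expression with std::regex from the C++11 standard library. This is a substitute for illustration only; the article's program uses pcre, and `extractLinks` is a hypothetical function name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the contents of every href="..." that directly follows "<a ".
std::vector<std::string> extractLinks(const std::string &html)
{
    static const std::regex pattern("<a href=\"([^\"]*)\"");
    std::vector<std::string> links;
    std::sregex_iterator it(html.begin(), html.end(), pattern);
    std::sregex_iterator end;
    for (; it != end; ++it) {
        links.push_back((*it)[1].str());   // capture group 1 is the URL
    }
    return links;
}
```

The pattern only recognizes anchors written exactly as `<a href="...">`; anchors with other attributes before href, or with single quotes, are not matched.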

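The next question is how to prioritize URLs. Common strategies for ordering a crawl include breadth-first traversal, PageRank, and OPIC.

Breadth-first traversal is the simplest. The URLs found in each page are appended to the tail of a queue in the order they are parsed, and URLs are taken from the head of the queue for download and parsing. URLs closer to the root (index) are therefore parsed first.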
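The breadth-first order can be sketched with a queue and a visited set. To keep the example self-contained, pages are simulated by an in-memory map from a page to the links found on it; `LinkGraph` and `crawlOrderBFS` are hypothetical names for this illustration.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// page -> links found on that page (stands in for real downloads)
typedef std::map<std::string, std::vector<std::string> > LinkGraph;

// Return the order in which a breadth-first crawler would visit pages.
std::vector<std::string> crawlOrderBFS(const LinkGraph &graph, const std::string &root)
{
    std::vector<std::string> order;
    std::queue<std::string> frontier;
    std::set<std::string> seen;
    frontier.push(root);
    seen.insert(root);
    while (!frontier.empty()) {
        std::string url = frontier.front();   // take from the head
        frontier.pop();
        order.push_back(url);
        LinkGraph::const_iterator it = graph.find(url);
        if (it == graph.end()) continue;      // page has no known links
        for (const std::string &link : it->second) {
            if (seen.insert(link).second)     // true only the first time
                frontier.push(link);          // append to the tail
        }
    }
    return order;
}
```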

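PageRank, as used here, fetches all pages to local storage and then parses their URLs. It treats the URL that appears most often across all pages as the most important one and gives it the highest priority. In practice it is impossible to fetch every page before parsing, so we use a partial PageRank: after parsing a certain number of pages, we take the most frequent URLs and download those next.

OPIC is an algorithm that divides the weight of a page equally among all the URLs in that page.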
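The equal split that OPIC performs can be sketched as a single step: when a page is processed, its current weight ("cash") is shared equally among its out-links. This is a simplified illustration, not a full OPIC implementation; `Cash` and `distributeCash` are hypothetical names.

```cpp
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, double> Cash;   // URL -> accumulated weight

// Share the cash of `page` equally among its out-links and reset the page to zero.
// A link that appears several times receives one share per occurrence.
void distributeCash(Cash &cash, const std::string &page,
                    const std::vector<std::string> &outLinks)
{
    if (outLinks.empty()) return;
    double share = cash[page] / outLinks.size();
    cash[page] = 0.0;
    for (const std::string &link : outLinks)
        cash[link] += share;
}
```

A crawler built on this idea would then pick the URL with the most accumulated cash as the next page to download.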

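We use the partial PageRank approach. The program declares a map<string, int> URLMap, a deque<string> URLDeque, and a set<string> VisitedURL to implement it. URLMap stores each parsed URL together with the number of times it has appeared so far. URLDeque stores the DequeSize URLs with the largest values in URLMap. VisitedURL stores the URLs that have already been visited.

Because a map cannot be sorted by value, we need a function that retrieves the URLs with the largest values from the map. It is defined as follows: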

bool cmp(const PAIR &val1, const PAIR &val2)
{
    return val1.second > val2.second;
}

void mapSortByValue(map<string, int> &myMap, deque<string> &myDeque)
{
    vector<PAIR> tmpVec(myMap.begin(), myMap.end());       // copy into a vector, which can be sorted by value
    sort(tmpVec.begin(), tmpVec.end(), cmp);

    vector<PAIR> :: iterator iter = tmpVec.begin();
    int i = 0;

    while(i != DequeSize && iter != tmpVec.end())
    {
        myDeque.push_back(iter->first);
        cout << "\n" ;
        cout << iter->first << " " << iter->second << "\n";
        myMap.erase(iter->first);
        ++i;
        ++iter;
    }
}
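The listing above relies on a PAIR type and a DequeSize constant that are not shown. The following self-contained variant assumes PAIR is std::pair<std::string, int> and takes the limit as a parameter; it also uses std::partial_sort, since only the first few entries need to be ordered. `takeTop` and `byCountDesc` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> PAIR;   // assumed definition, not shown in the article

static bool byCountDesc(const PAIR &a, const PAIR &b)
{
    return a.second > b.second;
}

// Move the `limit` most frequent URLs from counts to the back of queue.
void takeTop(std::map<std::string, int> &counts,
             std::deque<std::string> &queue, std::size_t limit)
{
    std::vector<PAIR> tmp(counts.begin(), counts.end());
    std::size_t n = std::min(limit, tmp.size());
    // order only the first n entries; the rest stay unsorted
    std::partial_sort(tmp.begin(), tmp.begin() + n, tmp.end(), byCountDesc);
    for (std::size_t i = 0; i < n; ++i) {
        queue.push_back(tmp[i].first);
        counts.erase(tmp[i].first);
    }
}
```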

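We parse a page by reading the local file one buffer at a time and applying the regular expression to extract the URLs in that buffer. The function is as follows:

/*
 * Function: GetURL
 * Description: reads the first message from the message queue to obtain the name of
 * the file to parse, parses it, and records the URLs it finds in URLMap (from which
 * URLDeque is later filled). Only URLs that contain keyWord (for example .html) are kept.
 * Parameter: ptr (unused; a thread start routine must take a void* argument)
 */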

void* GetURL(void *ptr)
{
    mymesg *MyMsg;
    MyMsg = (mymesg*)malloc(sizeof(mymesg));
    memset(MyMsg, '\0', sizeof(mymesg));
    char buf[MAXLINE];
    int fd;
    rio_t rp;
    string str1, str2;

    pcre  *re;
    const char *error;
    int  erroffset;
    int  ovector[OVECCOUNT];
    int  rc, i;
    while(1)
    {
        memset(MyMsg, '\0', sizeof(mymesg));
        int flag = msgrcv(msgID, (char*)MyMsg, sizeof(mymesg) - sizeof(long), 1, 0);// wait for the next message on the queue
        if(flag == -1)
        {
            continue;
        }
        else
        {
            fd = open(MyMsg->pathName, O_RDONLY);
            if(fd < 0)
            {
                continue;
            }
            rio_readinitb(&rp, fd);

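            re = pcre_compile(pattern,       // pattern: the regular expression to compile, as a string
                      0,            // options: compile-time options
                      &error,       // errptr: output, receives the error message
                      &erroffset,   // erroffset: output, offset in pattern where the error was found
                      NULL);        // tableptr: NULL selects the default character tables
            if (re == NULL)
            {                 // compilation failed: report the error and stop this thread
                printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
                close(fd);
                free(MyMsg);
                return (void*)0;
            }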

            ssize_t n;
            while((n = rio_readnb(&rp, buf, MAXLINE - 1)) > 0)
            {
                buf[n] = '\0';             // rio_readnb does not null-terminate, but strlen below needs it
                char *ptr = buf;
                do{
                    rc = pcre_exec(re,            // code: the compiled pattern returned by pcre_compile
                    NULL,          // extra: optional extra data for pcre_exec (unused)
                    ptr,           // subject: the string to match against
                    strlen(ptr),   // length: length of the subject string
                    0,             // startoffset: offset in subject at which matching starts
                    0,             // options: match-time options
                    ovector,       // ovector: output array of match offsets
                    OVECCOUNT);    // ovecsize: number of elements in ovector
                    if(rc <= 0)
                    {
                        break;
                    }
                    char str[1024];
                    memset(str, '\0', sizeof(str));
                    char *substring_start = ptr + ovector[2];
                    int substring_length = ovector[3] - ovector[2];
                    ptr += ovector[1];             // the next pass continues after this match
                    if(substring_length >= (int)sizeof(str))
                    {
                        continue;                  // URL too long for the buffer, skip it
                    }
                    memcpy(str, substring_start, substring_length);
                    string cstr = str;
                    if(cstr.find("http") != string::npos)    // absolute URL: remember its host, because some URLs omit it
                    {
                        URLclass url;
                        url.setURL(cstr);
                        snprintf(host, 50, "http://%s", url.getHostURL().c_str());   // host holds 50 bytes
                    }
                    else                                     // relative URL: the omitted host is usually that of the current page
                    {
                        cstr = string(host) + cstr;
                    }
                    // keyWord restricts which URLs are kept
                    if(cstr.find(keyWord) != string::npos && VisitedURL.count(cstr) == 0)
                    {
                        pthread_mutex_lock(&MapMutex);      // URLMap is shared between threads
                        ++URLMap[cstr];                     // operator[] creates the entry with count 0 on first use
                        pthread_mutex_unlock(&MapMutex);
                    }
                  }while(rc > 0);        // a buffer may contain several URLs, so keep matching
            }
            pcre_free(re);
            close(fd);
        }
        sleep(1);
    }

    free(MyMsg);
    return (void*)0;
}

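The read function we use is rio_readnb, which is robust against short reads; for details, see the book Computer Systems: A Programmer's Perspective. GetURL returns void* and takes a void* parameter because it runs as the start routine of its own thread, and pthread_create requires that signature.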

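3. The main function

The main function is responsible for initialization and for creating the worker threads. The second argument, argv[1], is the initial page (index). The third argument, argv[2], is the string that a URL must contain to be crawled. The implementation is as follows: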

int main(int argc, char *argv[])
{


    if(argc < 3)                 // both the index URL and the keyword are required
    {
        cout << "please input the index html for start and the keyword" << endl;
        return 1;
    }

     key_t key = ftok(dirPath, 2);
     while((msgID = msgget(key,  IPC_CREAT | 0660)) < 0)
     {
        cout << "Create msgID failed..." << endl;
        sleep(1);
     }

    char indexName[] = "index.html";   // local file name for the index page
    Download(argv[1], indexName);
    strcpy(host, argv[1]);
    strcpy(keyWord, argv[2]);
    pthread_t downhtml, geturl;
    pthread_create(&geturl, NULL, GetURL, NULL);
    pthread_create(&downhtml, NULL, Down, NULL);


    while(1)
    {
        sleep(10);
    }
    return 0;
}