PHP crawlers (PHP crawlers and Python crawlers)

http://www.itjxue.com 2023-01-24 16:44 Source: unknown

PHP crawler basics: what is XAMPP for? What about PhpStorm? And Dreamweaver?

XAMPP is a bundle of Apache, MySQL, PHP, and Perl. It runs on multiple operating systems and supports multiple languages, including Chinese.

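PhpStorm is an integrated development environment (IDE) for writing PHP code. It is an editor with debugging and code-analysis tools, not a compiler.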

Dreamweaver, DW for short (the name literally means "dream weaver"), is a web page editor that combines page authoring with site management.

How to write a web crawler in PHP

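PHP is actually quite convenient for crawling. Its regular expression support makes it easy to collect the links on a page, and fopen, file_get_contents, and the libcurl functions make downloading page content straightforward.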
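As a minimal sketch of the regular-expression approach described above: the function name extract_hrefs and the example.com URL are illustrative assumptions, and the pattern only handles quoted href values.

```php
<?php
// Collect the href value of every <a> tag using a regular expression.
// Only quoted attribute values are matched; duplicates are dropped.
function extract_hrefs(string $html): array
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']*)["\']/i', $html, $m);
    return array_values(array_unique($m[1]));
}

// Usage (the URL is a placeholder):
// $html = file_get_contents('http://example.com/');
// if ($html !== false) {
//     print_r(extract_hrefs($html));
// }
```

A regular expression is quick to write but fragile on unusual markup, so a DOM parser is usually the safer choice for anything beyond a simple page.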

Implementing a web crawler in PHP

Use pcntl_fork or swoole_process for multi-process concurrency. If each page takes about 500 ms to fetch, 200 processes can fetch roughly 200 / 0.5 s = 400 pages per second.
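The throughput estimate and the fork-per-worker idea can be sketched roughly as follows. The helper names (pages_per_second, split_for_workers, run_in_children) are hypothetical, and the sketch falls back to sequential execution when the pcntl extension is not available (it is CLI-only and not available on Windows).

```php
<?php
// Estimated throughput when every worker fetches pages back to back.
function pages_per_second(int $workers, float $secondsPerPage): float
{
    return $workers / $secondsPerPage;
}

// Split a URL list into at most $workers chunks of near-equal size.
function split_for_workers(array $urls, int $workers): array
{
    if ($urls === []) {
        return [];
    }
    return array_chunk($urls, (int) ceil(count($urls) / $workers));
}

// Fork one child per chunk; each child runs $job on its chunk and exits.
function run_in_children(array $chunks, callable $job): void
{
    if (!function_exists('pcntl_fork')) {
        // No pcntl: process the chunks one after another instead.
        foreach ($chunks as $chunk) {
            $job($chunk);
        }
        return;
    }
    $pids = [];
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {      // child process
            $job($chunk);
            exit(0);
        }
        $pids[] = $pid;        // parent process
    }
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);   // wait for every child to finish
    }
}
```

The 400 pages per second figure is an upper bound: it assumes network bandwidth, the target server, and the crawler host can all sustain 200 simultaneous requests.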

Use curl to fetch pages; setting cookies lets you simulate a logged-in session.
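One way to simulate a login with cURL is to post the form once and keep the returned cookies in a cookie jar file. The sketch below assumes a form-based login; the function names, the URL, the field names, and the cookie file path are all placeholders.

```php
<?php
// Build cURL options for a form login that stores and reuses cookies.
function login_curl_options(string $url, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_COOKIEJAR      => $cookieFile,  // write cookies received from the server
        CURLOPT_COOKIEFILE     => $cookieFile,  // send cookies stored earlier
        CURLOPT_RETURNTRANSFER => true,         // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,         // follow the post-login redirect
    ];
}

// Execute a request; returns the body as a string, or false on failure.
function run_request(array $options)
{
    $ch = curl_init();
    curl_setopt_array($ch, $options);
    return curl_exec($ch);
}

// Usage (placeholder values):
// $opts = login_curl_options('http://example.com/login',
//     ['user' => 'alice', 'pass' => 'secret'], '/tmp/cookies.txt');
// $body = run_request($opts);
```

Later requests that set CURLOPT_COOKIEFILE to the same file send the session cookie along, so they are treated as logged in.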

Use simple_html_dom to parse pages and work with the DOM.
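simple_html_dom is a third-party library, so as a dependency-free illustration of DOM-based parsing, the sketch below swaps in PHP's built-in DOMDocument instead. The function name parse_links is hypothetical.

```php
<?php
// Return [href, text] pairs for every <a> element that has an href.
function parse_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = [$a->getAttribute('href'), trim($a->textContent)];
        }
    }
    return $links;
}
```

Unlike a regular expression, the parser copes with unquoted attributes, odd whitespace, and nested markup inside the link text.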

To emulate a browser, use CasperJS, and wrap it in a service interface built on the swoole extension for the PHP layer to call.

A crawler system built on the stack above fetches tens of millions of pages per day.

How a PHP cURL crawler gets all the links on a page

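This article follows on from the previous two, and the example calls functions from them to build a simple URL collector. PHP code that collects data from the network usually uses file_get_contents, file, or cURL. cURL is said to be faster and more capable than file_get_contents and file, and better suited to collection work. Here we use cURL to get all the links on a page. The example is as follows: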

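<?php
/**
 * Use cURL to collect all the links on hao123.com.
 */
include_once('function.php');

$ch = curl_init();
// Target URL (empty in the original listing; set it before running)
curl_setopt($ch, CURLOPT_URL, '');
// Include the HTTP response headers in the returned output
curl_setopt($ch, CURLOPT_HEADER, 1);
// Uncomment to skip the body; it stays disabled because links are extracted from the body
// curl_setopt($ch, CURLOPT_NOBODY, 1);
// Return the result instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$html = curl_exec($ch);
$info = curl_getinfo($ch);
if ($html === false) {
    echo "cURL Error: " . curl_error($ch);
}
curl_close($ch);

$linkarr = _striplinks($html);
// Host part, used to complete relative links (empty in the original listing)
$host = '';
$linkresult = array();
if (is_array($linkarr)) {
    foreach ($linkarr as $k => $v) {
        $linkresult[$k] = _expandlinks($v, $host);
    }
}
printf("<p>All links on this page:</p><pre>%s</pre>\n", var_export($linkresult, true));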

The contents of function.php are as follows (the two functions from the previous two articles combined):

<?php
function _striplinks($document) {
    // Match <a ... href=...>: capture a quoted value, or an unquoted value up to whitespace or ">"
    preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1) (.*?)\\1 | ([^\s\>]+))'isx", $document, $links);
    // Concatenate the non-empty matches from the conditional subpattern
    // (foreach replaces each(), which was removed in PHP 8)
    $match = array();
    foreach ($links[2] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    foreach ($links[3] as $val) {
        if (!empty($val)) {
            $match[] = $val;
        }
    }
    // Return the links
    return $match;
}

/*===================================================================*
Function: _expandlinks
Purpose:  expand each link into a fully qualified URL
Input:    $links the links to qualify
          $URI   the full URI to get the base from
Output:   $expandedLinks the expanded links
*===================================================================*/
function _expandlinks($links, $URI)
{
    $URI_PARTS = parse_url($URI);
    $host = $URI_PARTS["host"];

    // Drop the query string, then a trailing file name, then a trailing slash
    preg_match("/^[^?]+/", $URI, $match);
    $match = preg_replace("|/[^/.]+\.[^/.]+$|", "", $match[0]);
    $match = preg_replace("|/$|", "", $match);
    $match_part = parse_url($match);
    $match_root = $match_part["scheme"] . "://" . $match_part["host"];

    $search = array(
        "|^http://" . preg_quote($host) . "|i",
        "|^(/)|i",
        "|^(?!http://)(?!mailto:)|i",
        "|/\./|",
        "|/[^/]+/\.\./|"
    );
    $replace = array(
        "",
        $match_root . "/",
        $match . "/",
        "/",
        "/"
    );

    $expandedLinks = preg_replace($search, $replace, $links);
    return $expandedLinks;
}

How to tell in PHP whether a visitor is a crawler

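In theory it cannot be determined reliably. You can usually check the browser's user agent, but a crawler can imitate a browser completely. The code my site uses to block crawlers is as follows: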

// Block Office clients, Nimda, and spiders
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (stripos($ua, 'Microsoft ') === 0 ||
    stripos($ua, 'Microsoft-WebDAV-MiniRedir') === 0 ||
    stripos($ua, 'Baiduspider') === 0 ||
    stripos($ua, 'Sogou Orion spider') === 0 ||
    stripos($ua, 'Googlebot') !== false) {
    exit('EXPLORER ERROR (your browser has encountered a serious error), MAY BE INFFECT VIRUS (your computer may be infected with a virus)!');
}
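A slightly more general form of the same check can be sketched as a function that takes the User-Agent string and a list of markers. The name looks_like_bot and the choice to treat a missing User-Agent as suspicious are assumptions, not part of the original code, and the check remains easy for a crawler to evade by sending a browser's User-Agent.

```php
<?php
// Return true when the User-Agent contains any of the given markers.
function looks_like_bot(?string $userAgent, array $markers): bool
{
    if ($userAgent === null || $userAgent === '') {
        return true;   // assumption: treat a missing User-Agent as suspicious
    }
    foreach ($markers as $marker) {
        // Case-insensitive substring match anywhere in the string
        if (stripos($userAgent, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// Usage:
// $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : null;
// if (looks_like_bot($ua, ['Googlebot', 'Baiduspider', 'Sogou'])) {
//     http_response_code(403);
//     exit;
// }
```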

(Editor in charge: IT教学网)
