cURL scraping in PHP: garbled text and empty responses

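My PHP program is encoded in GB2312/GBK.
Scraping sina.com.cn works fine, but scraping 163.com returns an empty result, and scraping sohu.com returns garbled characters.
What is going on, and how can I fix it? Can anyone help? Thanks in advance! I don't have many points left, sorry.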
Replies (solutions)

NetEase (163.com) restricts automated access, so the scrape comes back empty. Sohu may restrict it as well.
fopen or file_get_contents will work, but file_get_contents tends to time out, and when it does the script stops executing.
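One way to keep file_get_contents from hanging is to pass it a stream context with an explicit timeout. The sketch below is only an illustration: the helper names `build_fetch_context` and `fetch_with_timeout`, the 10-second default, and the User-Agent string are assumptions, not something from this thread.

```php
<?php
// Hypothetical helper: builds a stream context with a timeout and a
// browser-like User-Agent, for use with file_get_contents() or fopen().
function build_fetch_context($timeoutSeconds = 10, $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)')
{
    return stream_context_create(array(
        'http' => array(
            'method'        => 'GET',
            'timeout'       => (float) $timeoutSeconds, // seconds, as a float
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // still return the body on 4xx/5xx responses
        ),
    ));
}

// Hypothetical wrapper: returns the body, or null when the request fails
// or times out, so the caller can decide what to do instead of aborting.
function fetch_with_timeout($url, $timeoutSeconds = 10)
{
    $body = @file_get_contents($url, false, build_fetch_context($timeoutSeconds));
    return $body === false ? null : $body;
}
```

A call such as `fetch_with_timeout('http://www.163.com', 5)` would then return null on failure rather than blocking for the default socket timeout.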
Never mind the rest, I'm just here for the points. OP, remember to award me the full amount.
// gzdecode() is built in since PHP 5.4; this fallback is only defined on older versions.
if (!function_exists('gzdecode')) {
    function gzdecode($data)
    {
        $len = strlen($data);
        if ($len < 18 || strcmp(substr($data, 0, 2), "\x1f\x8b")) {
            return null; // not gzip format (see RFC 1952)
        }
        $method = ord(substr($data, 2, 1)); // compression method
        $flags  = ord(substr($data, 3, 1)); // flags
        if (($flags & 31) != $flags) {
            // reserved bits are set -- not allowed by RFC 1952
            return null;
        }
        // note: $mtime may be negative (PHP integer limitations)
        $mtime = unpack('V', substr($data, 4, 4));
        $mtime = $mtime[1];
        $xfl = substr($data, 8, 1);
        $os  = substr($data, 9, 1);
        $headerlen = 10;
        $extralen = 0;
        $extra = '';
        if ($flags & 4) {
            // 2-byte length prefixed extra data in header
            if ($len - $headerlen - 2 < 8) {
                return false; // invalid format
            }
            $extralen = unpack('v', substr($data, $headerlen, 2));
            $extralen = $extralen[1];
            if ($len - $headerlen - 2 - $extralen < 8) {
                return false; // invalid format
            }
            $extra = substr($data, $headerlen + 2, $extralen);
            $headerlen += 2 + $extralen;
        }
        $filenamelen = 0;
        $filename = '';
        if ($flags & 8) {
            // C-style string file name data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // invalid format
            }
            $filenamelen = strpos(substr($data, $headerlen), chr(0));
            if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
                return false; // invalid format
            }
            $filename = substr($data, $headerlen, $filenamelen);
            $headerlen += $filenamelen + 1;
        }
        $commentlen = 0;
        $comment = '';
        if ($flags & 16) {
            // C-style string comment data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // invalid format
            }
            $commentlen = strpos(substr($data, $headerlen), chr(0));
            if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
                return false; // invalid header format
            }
            $comment = substr($data, $headerlen, $commentlen);
            $headerlen += $commentlen + 1;
        }
        $headercrc = '';
        if ($flags & 1) {
            // 2 bytes (lowest order) of crc32 on header present
            if ($len - $headerlen - 2 < 8) {
                return false; // invalid format
            }
            $calccrc = crc32(substr($data, 0, $headerlen)) & 0xffff;
            $headercrc = unpack('v', substr($data, $headerlen, 2));
            $headercrc = $headercrc[1];
            if ($headercrc != $calccrc) {
                return false; // bad header crc
            }
            $headerlen += 2;
        }
        // gzip footer - these may be negative due to PHP's integer limitations
        $datacrc = unpack('V', substr($data, -8, 4));
        $datacrc = $datacrc[1];
        $isize = unpack('V', substr($data, -4));
        $isize = $isize[1];
        // perform the decompression:
        $bodylen = $len - $headerlen - 8;
        if ($bodylen < 1) {
            return null; // no compressed body present
        }
        $body = substr($data, $headerlen, $bodylen);
        switch ($method) {
            case 8:
                // currently the only supported compression method:
                $data = gzinflate($body);
                break;
            default:
                // unknown compression method
                return false;
        }
        if ($data === false) {
            return false; // inflate failed
        }
        // verify decompressed size and crc32:
        // note: this may fail with large data sizes depending on how
        // PHP's integer limitations affect strlen(), since $isize
        // may be negative for large sizes.
        if ($isize != strlen($data) || crc32($data) != $datacrc) {
            // bad format! length or crc doesn't match!
            return false;
        }
        return $data;
    }
}

$ua = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)';

// 163.com: returns content once a browser-like User-Agent is sent
$curl = curl_init('http://www.163.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, $ua);
$html = curl_exec($curl);
var_dump($html);

// sohu.com: the body comes back gzip-compressed, so decode it before use
$curl = curl_init('http://www.sohu.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, $ua);
$html = curl_exec($curl);
//$html = strstr($html, '<');
$html = gzdecode($html);
var_dump($html);
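On PHP 5.4 and later, gzdecode() ships with the zlib extension, so a hand-written decoder is usually unnecessary. As a rough sketch, assuming zlib is loaded, the hypothetical helper `maybe_gunzip` below (the name is an illustration, not from this thread) only decodes when the body actually starts with the gzip magic bytes, so an uncompressed response passes through untouched.

```php
<?php
// Hypothetical helper: returns the decompressed body when $body looks like
// gzip data (magic bytes 0x1f 0x8b, RFC 1952); otherwise returns it unchanged.
function maybe_gunzip($body)
{
    // 18 bytes is the smallest possible gzip stream (10 header + 8 footer).
    if (strlen($body) >= 18 && substr($body, 0, 2) === "\x1f\x8b") {
        $decoded = @gzdecode($body); // built in since PHP 5.4 (zlib extension)
        if ($decoded !== false) {
            return $decoded;
        }
    }
    // Not gzip, or decoding failed: hand back the original bytes.
    return $body;
}
```

With this in place, `$html = maybe_gunzip(curl_exec($curl));` could be used for every site, whether or not it compresses its responses.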
Many thanks, young5335. You get the full points; sadly this is all I have, so I can't give more even though I'd like to.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)');
Out of that whole pile of code, this one line is the most useful, and it is what solved the problem.
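Putting the pieces of this thread together, a fetch helper might send a browser-like User-Agent, let cURL handle compressed responses, and convert the page to the script's own encoding. This is a minimal sketch under stated assumptions: the function names `detect_charset`, `convert_page`, and `fetch_page`, the 15-second timeout, and the GBK default are all illustrative, and the charset is guessed from a meta tag only (HTTP headers are not inspected).

```php
<?php
// Hypothetical helper: looks for a charset declaration in a <meta> tag and
// falls back to $default when none is found.
function detect_charset($html, $default = 'GBK')
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: converts $html to $target unless it already declares
// that charset. The meta tag itself is left as-is.
function convert_page($html, $target = 'GBK')
{
    $source = detect_charset($html, $target);
    if ($source === strtoupper($target)) {
        return $html;
    }
    $converted = @iconv($source, $target, $html);
    // On conversion failure, return the original bytes rather than nothing.
    return $converted === false ? $html : $converted;
}

// Hypothetical helper: fetches $url with cURL and returns the page in $target
// encoding, or null when the request fails.
function fetch_page($url, $target = 'GBK')
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)',
        CURLOPT_ENCODING       => '',   // accept any supported encoding; cURL decodes it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 15,   // seconds (assumed value)
    ));
    $html = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $html === false ? null : convert_page($html, $target);
}
```

For example, `fetch_page('http://www.sohu.com')` would cover both the missing User-Agent and the compressed body, and the result could be echoed directly from a GBK-encoded script.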